
Asked 1 month ago by CelestialCommander964

Why Do Row-Major/Column-Major and Iteration vs. Indexing Tests Yield Unexpected Performance in Python?


I'm reading the book Designing Machine Learning Systems by Chip Huyen. In Chapter 3, in the section Row-Major Versus Column-Major Format, the book explains that for row-major arrays, accessing data by rows should be faster than accessing it by columns, and vice versa for column-major arrays.

This led me to design a couple of experiments, but the results were surprising. I observed two issues:

  1. The array's memory order (row-major vs. column-major) does not seem to significantly affect access speed. Why might this be?
  2. Iterating over array slices is noticeably faster than using explicit index-based access, even though both methods share the same theoretical complexity. What could be causing this difference?

Below is the code I used for testing:

PYTHON
from time import perf_counter

import numpy as np


def test_column_access_numpy_index(array):
    n_row, n_col = array.shape
    start = perf_counter()
    for j in range(n_col):
        for i in range(n_row):
            _ = array[i, j]
    return perf_counter() - start


def test_row_access_numpy_index(array):
    n_row, n_col = array.shape
    start = perf_counter()
    for i in range(n_row):
        for j in range(n_col):
            _ = array[i, j]
    return perf_counter() - start


def test_column_access_numpy_iteration(array):
    n_row, n_col = array.shape
    start = perf_counter()
    for j in range(n_col):
        for item in array[:, j]:
            pass
    return perf_counter() - start


def test_row_access_numpy_iteration(array):
    n_row, n_col = array.shape
    start = perf_counter()
    for i in range(n_row):
        for item in array[i]:
            pass
    return perf_counter() - start


if __name__ == '__main__':
    size = 10_000
    row_major = np.ones((size, size), dtype=np.float32, order='C')
    col_major = np.ones((size, size), dtype=np.float32, order='F')

    print("Warm up")
    test_row_access_numpy_iteration(row_major)

    print("Input row major")
    time = test_row_access_numpy_index(row_major)
    print(f"Testing row access index in numpy: {time:.6f} seconds")
    time = test_column_access_numpy_index(row_major)
    print(f"Testing column access index in numpy: {time:.6f} seconds")
    time = test_row_access_numpy_iteration(row_major)
    print(f"Testing row access iteration in numpy: {time:.6f} seconds")
    time = test_column_access_numpy_iteration(row_major)
    print(f"Testing column access iteration in numpy: {time:.6f} seconds")

    print('----------------------------')

    print("Input col major")
    time = test_row_access_numpy_index(col_major)
    print(f"Testing row access index in numpy: {time:.6f} seconds")
    time = test_column_access_numpy_index(col_major)
    print(f"Testing column access index in numpy: {time:.6f} seconds")
    time = test_row_access_numpy_iteration(col_major)
    print(f"Testing row access iteration in numpy: {time:.6f} seconds")
    time = test_column_access_numpy_iteration(col_major)
    print(f"Testing column access iteration in numpy: {time:.6f} seconds")

The output produced by these tests is as follows:

BASH
Warm up
Input row major
Testing row access index in numpy: 7.732731 seconds
Testing column access index in numpy: 8.025850 seconds
Testing row access iteration in numpy: 3.111501 seconds
Testing column access iteration in numpy: 3.129321 seconds
----------------------------
Input col major
Testing row access index in numpy: 7.852834 seconds
Testing column access index in numpy: 7.978318 seconds
Testing row access iteration in numpy: 3.027528 seconds
Testing column access iteration in numpy: 3.075494 seconds

I’d appreciate insights into why the expected performance differences based on memory layout and access method are not showing up in these tests.


Comments 0

Answers 3

Views 76

3 Answers

Answer by StackOverflowTychoAI Agent 1 month ago


Below is an explanation addressing both points, based on how NumPy works under the hood and the interplay with Python’s own overhead:


1. Why does the array’s major order not seem to affect access speed?

Even though a row-major (C-ordered) and a column-major (Fortran-ordered) array differ in how their data is laid out contiguously in memory, the tests you ran are implemented in pure Python. When you write a double loop with element-by-element index access, each access goes through NumPy's `__getitem__` machinery (a Python-level call into the C API), and that overhead swamps any small gains from cache-friendly contiguous access. In other words, the cost of repeatedly invoking the indexing operation hides any benefit the underlying memory layout would normally provide if the loops ran entirely in compiled code.
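
A quick way to convince yourself of this (a sketch only; the exact numbers depend on your machine) is to time a single element access on a C-ordered array, a Fortran-ordered array, and a plain Python list of lists; the costs are all of the same order of magnitude because the interpreter-level call dominates:

PYTHON
import timeit
import numpy as np

c_arr = np.ones((1_000, 1_000), dtype=np.float32, order='C')
f_arr = np.asfortranarray(c_arr)
lst = c_arr.tolist()

# Each array[i, j] goes through __getitem__ and boxes the result into a
# new NumPy scalar object, so the per-access cost is dominated by the
# interpreter machinery, not by where the bytes live in memory.
print("C order :", timeit.timeit(lambda: c_arr[500, 500], number=1_000_000))
print("F order :", timeit.timeit(lambda: f_arr[500, 500], number=1_000_000))
print("list    :", timeit.timeit(lambda: lst[500][500], number=1_000_000))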


2. Why does iterating over slices (iteration method) outperform explicit index-based access?

When you use slicing (for example, using array[i] or array[:, j]), NumPy creates a view of the row or column. In the case of a row on a C‑ordered (row‑major) array, the row view is contiguous. Iterating over that contiguous row is relatively efficient because the Python for‑loop iterates over an already constructed, dense, one‑dimensional view.

By contrast, index-based access in the inner loop (using array[i, j]) repeatedly calls the array's `__getitem__` method with a tuple index. Each such call adds overhead, and over millions of elements that overhead accumulates, making explicit indexing slower than iterating over a pre-sliced view.

Even though both methods are “O(n)” in terms of complexity, the per‑element overhead in index-based access is higher. The iteration method benefits from slicing, which bundles many element accesses into one contiguous view (and that view is optimized for iteration), thereby reducing the function call overhead dramatically.
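
To make the difference concrete, here is a small sketch (the helper names by_index and by_slice are just for illustration) that compares the two access styles on a single row and also checks that the slice is a view, not a copy:

PYTHON
import timeit
import numpy as np

a = np.ones((2_000, 2_000), dtype=np.float32, order='C')

def by_index(row):
    # one __getitem__ call with a tuple index per element
    total = 0.0
    for j in range(a.shape[1]):
        total += a[row, j]
    return total

def by_slice(row):
    # one __getitem__ call to build the row view, then a plain iterator
    total = 0.0
    for x in a[row]:
        total += x
    return total

print("index:", timeit.timeit(lambda: by_index(1_000), number=100))
print("slice:", timeit.timeit(lambda: by_slice(1_000), number=100))
print("slice is a view:", a[1_000].base is a)  # True: no copy is made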


In summary:

  • The differences due to memory layout (row‑vs‑column major) become negligible when the loop overhead in Python is the dominating factor.
  • Slicing a row (or column) to obtain a contiguous memory view minimizes the number of Python‑level function calls, making iteration over that view faster than performing a separate index lookup for each element.

This behavior is consistent with the way NumPy is designed: efficient use of memory layout matters most when the core element-by-element operations run in compiled code (e.g. in vectorized operations or ufunc loops). When you drop down to pure Python-level loops, the overhead inherent to Python masks the speed benefits of memory contiguity.
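
One way to see the layout effect without leaving NumPy (a sketch assuming a C-ordered array; timings are machine-dependent) is to let a compiled reduction walk contiguous versus strided memory, for example by summing a single row versus a single column of a large row-major array:

PYTHON
import timeit
import numpy as np

# C-ordered: each row is contiguous, while consecutive elements of one
# column are 10_000 * 4 = 40_000 bytes apart in memory.
a = np.ones((10_000, 10_000), dtype=np.float32, order='C')

row_t = min(timeit.repeat(lambda: a[5_000, :].sum(), number=1_000, repeat=5))
col_t = min(timeit.repeat(lambda: a[:, 5_000].sum(), number=1_000, repeat=5))

# The compiled reduction loop does the element walking here, so the
# strided column is noticeably slower than the contiguous row.
print(f"contiguous row sum: {row_t:.4f} s")
print(f"strided column sum: {col_t:.4f} s")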


No comments yet.

Answer by AuroraObserver240 1 month ago


The reason people usually say that is the CPU cache.

When a computer accesses memory at address 1000000, it also fetches the memory at addresses 1000001, 1000002, 1000003, ..., 1000255 (a cache line, typically 64 bytes, plus neighbouring lines that the hardware prefetcher will often pull in), and stores it in the cache, which is much faster than regular RAM.

If, later, it needs what is at address 1000100, that access will be faster, since the data is already sitting in the cache.
Unless, in the meantime, that data was evicted: cache capacity is limited (typically a few MB), so every time a new line is brought in, an existing one has to go, usually one of the older, least recently used lines.

So, the more often you use memory that is already in the cache, the faster your code runs.

And that happens when you access your data in the same order as it is stored in memory.

We can test this with an example:

PYTHON
import numpy as np
from numba import njit
import timeit


@njit
def somRow(M):
    s = 0
    h, w = M.shape
    for i in range(h):
        for j in range(w):
            s += M[i, j]
    return s


@njit
def somCol(M):
    s = 0
    h, w = M.shape
    for j in range(w):
        for i in range(h):
            s += M[i, j]
    return s


M = np.random.uniform(size=(5000, 5000))
print("M")
print("som row =", somRow(M))
print("som col =", somCol(M))
print("som row timing", timeit.timeit(lambda: somRow(M), number=100))
print("som col timing", timeit.timeit(lambda: somCol(M), number=100))

P = M.T
print("P")
print("som row =", somRow(P))
print("som col =", somCol(P))
print("som row timing", timeit.timeit(lambda: somRow(P), number=100))
print("som col timing", timeit.timeit(lambda: somCol(P), number=100))

On my computer, matrix M takes 2.5 seconds for somRow and 8 seconds for somCol (for 100 iterations, not just one). On matrix P it is the other way around: 8 seconds for somRow and 2.5 seconds for somCol. Matrix P is stored in column-major order, since P[i+1, j] is what immediately follows P[i, j] in memory: P, a.k.a. M.T, uses exactly the same memory as M, just with different strides.
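
You can check that claim directly (a small sketch; the exact stride numbers assume float64):

PYTHON
import numpy as np

M = np.random.uniform(size=(5000, 5000))
P = M.T

print(P.base is M)               # True: P shares M's buffer, nothing is copied
print(M.strides, P.strides)      # (40000, 8) vs (8, 40000): same data, swapped strides
print(M.flags['C_CONTIGUOUS'], P.flags['F_CONTIGUOUS'])  # True True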

So the cache effect is not negligible: it is a factor of more than 3 in performance.

And that answers "why do people say that data stored in row-major order should be processed in row-major order".

Note that this is just an application of the more general rule "it is more efficient to process data in the order they are in memory".

Here is another example that does not involve matrices or "row/column" order:

PYTHON
import numpy as np
from numba import njit
import timeit  # needed for the timings below

T = np.random.uniform(size=(25000000,))


@njit
def som1(T):
    s = 0
    for i in range(25000000):
        j = (i * 25000001) % 25000000  # j == i, but the compiler still has to compute it
        s += T[j]
    return s


@njit
def som2(T):
    s = 0
    for i in range(25000000):
        j = (i * 25001547) % 25000000  # visits every index, but out of memory order
        s += T[j]
    return s


# The two sums are identical, because both loops visit every index exactly once
print("som =", som1(T), som2(T))
print("time1", timeit.timeit(lambda: som1(T), number=100))
print("time2", timeit.timeit(lambda: som2(T), number=100))

I use a deliberately odd way to compute the index j. In som1, j is just i, but written in a way that forces (I hope) the compiler to actually perform the computation, so that the comparison with som2 is not biased by the cost of computing j.
In som2, j is not i, but since 25001547 and 25000000 are coprime, the sum stays the same: every index is visited exactly once, just not in the canonical order (which is also the order in which the elements sit in memory).

Now, some remarks about this code that are relevant to your question:

  1. I need to compute the sums once before I start timing, for two reasons. The first is that I use numba, and the first call to a numba function includes compiling it; I don't want compilation time in my timings. But secondly, and more importantly for you: if I didn't, my results would be biased in favor of whichever computation runs second, because by then the memory has already been read by the first computation and is therefore more likely to already be in the cache, especially if the array is not that big. See my previous answer to another question, where the timing difference the asker experienced was simply that the second computation ran faster; just swapping the order of the experiments completely changed the conclusions of the timings!

Your experiment does not address that bias. You run two experiments on the same data (indexing vs. direct iteration), which biases things against the first one, since it has to bring the data into the cache. (I doubt that this is the actual reason for your result. But when your whole experiment rests on the premise that cache effects are not negligible, which is the postulate of the book you read, you should make sure those same cache effects do not distort the measurement; see the sketch after the list below for one way to do that.)

  2. Since I mentioned numba, you may wonder why I use it. There are two reasons. The first is that I need to be sure of the order in which the data are accessed. With numpy functions you cannot be sure: most of them do something more complex than a plain sum, and, more importantly, numpy is often smarter than I am; it is aware of these cache effects and often chooses the best traversal order itself.
  3. The other reason is that if I didn't use numba, what I would be timing is the pure-Python overhead of the computation. The sum that takes 2.5 seconds vs 8 seconds (for 100 runs) on my computer takes about 450 seconds either way without numba (so 4.5 seconds per run). The cache effect becomes negligible next to the elephant in the room: Python is a VERY slow language. The art of writing efficient Python code is to not really compute in Python, but to have most of the heavy work run inside fast libraries that are not written in Python, like numpy. If you iterate over a numpy array with a pure-Python for loop (with or without indexing), you have already lost the performance battle. The trick with numpy is to always find a way to avoid for loops and rely on numpy functions instead.
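
Here is the sketch mentioned above: one way to structure the index-vs-iteration timing so that neither variant is favoured simply by already-warm caches (array size and repeat counts are arbitrary):

PYTHON
import timeit
import numpy as np

a = np.ones((2_000, 2_000), dtype=np.float32)

def index_access(arr):
    n_row, n_col = arr.shape
    for i in range(n_row):
        for j in range(n_col):
            _ = arr[i, j]

def slice_access(arr):
    for row in arr:
        for item in row:
            pass

# Warm-up pass: neither timed variant should get the "data already in
# cache" advantage just because it happened to run second.
index_access(a)
slice_access(a)

# Repeat each measurement and keep the best; also try swapping the order
# of the two lines below to check that the conclusion does not change.
print("index:", min(timeit.repeat(lambda: index_access(a), number=1, repeat=3)))
print("slice:", min(timeit.repeat(lambda: slice_access(a), number=1, repeat=3)))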

Put differently: it is pointless to worry about row-major vs column-major order when you are coding in pure Python. That is optimizing milliseconds in code that takes dozens of seconds.

And that is also why the "without index" code is faster. What you are timing is pure-Python overhead; your code consists of nothing but pure-Python overhead. The cost of maintaining a pure-Python loop variable j far outweighs any cache consideration your book is referring to.
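
To give a feel for the size of that overhead, here is a minimal sketch (the function name python_sum is just for illustration) comparing a pure-Python double loop with the equivalent numpy reduction:

PYTHON
import timeit
import numpy as np

M = np.random.uniform(size=(5000, 5000))

def python_sum(M):
    # pure-Python double loop: every iteration pays interpreter overhead
    s = 0.0
    for i in range(M.shape[0]):
        for j in range(M.shape[1]):
            s += M[i, j]
    return s

# The same reduction expressed as a numpy call runs in compiled code.
print("numpy sum :", timeit.timeit(lambda: M.sum(), number=10))
# The pure-Python version takes seconds per call, so time it only once:
print("python sum:", timeit.timeit(lambda: python_sum(M), number=1))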

No comments yet.

Answer by MercurialSeeker646 1 month ago


You are using Python.

Most of the execution time of your program is spent by the Python interpreter.

Memory access costs (cache misses etc) are negligible compared to that.

In summary, the measurement did not and cannot measure what you wanted to measure.


With compiled languages, or using numba in Python, you have a chance of seeing how one access pattern causes more cache misses, which would make it slower than the other pattern.

No comments yet.

Discussion

No comments yet.