Asked 1 month ago by CelestialScientist144
Why Do Multiple Local Python Subprocesses Experience Massive Slowdowns?
I'm working on a job submission script for parallelized model fitting with scipy.optimize that targets Torque clusters, Slurm clusters, and my local machine. The cluster jobs run fine, but local execution slows down drastically when more than one process is running.
Here's my approach: a bash script launches several independent Python processes, each of which fits models with scipy.optimize on files saved to disk (a minimal example is in EDIT 2 below).
The issue occurs locally only when more than one process is launched (n_processes > 1 in the example below).
So far, I've tried setting preexec_fn on the subprocesses, which resulted in a 'fork' resource unavailable error.
I'm relatively new to multiprocessing, and despite researching similar issues, I still don't understand what makes my case unique. Any insights would be appreciated.
What I tried:
Running a bash script that starts multiple processes, each executing a Python script that handles files saved to disk.
What I expected:
I expected each process to run about as fast as a single instance running without multiprocessing.
What actually happened:
A slowdown of approximately 5 orders of magnitude.
To illustrate the slowdown, compare the "Works flawlessly" and "Takes forever" commands under EDIT 2 below.
My suspicion is that the number of covariates might be the issue. Perhaps when matrix multiplications exceed a certain size, another process handles them, causing the parallel workers to wait for available resources?
How does this script run on your machines?
EDIT:
This problem has been observed on both a MacBook with an M3 processor and a Lenovo ThinkPad running Ubuntu Linux. However, the same script runs without issues on an HPC compute node when executing local parallel jobs from within an interactive job.
EDIT 2:
Below is a minimal working example:
```python
import subprocess
import sys


def test_parallel_fitting(n_covariates, n_models, n_processes):
    # Create two identical but independent processes
    processes = []
    n_covariates = int(n_covariates)
    n_models = int(n_models)
    n_processes = int(n_processes)
    n_models_per_process = n_models // n_processes
    for i in range(n_processes):
        command = ["python", "-c", f"""
import numpy as np
import time
from scipy import optimize

# Simple model fitting that shouldn't interact with other processes
for f in range({n_models_per_process}):
    start_time = time.time()
    X = np.random.rand(1500, {n_covariates})
    y = np.random.rand(1500)
    def model(params, X, y):
        return np.sum((y - np.dot(X, params)) ** 2)
    result = optimize.minimize(model, x0=np.ones({n_covariates}), args=(X, y))
    print(f"Process {i}:", f, " finished in ", time.time() - start_time, " seconds")
"""]
        p = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        processes.append(p)

    # Wait for completion and time it
    for p in processes:
        p.wait()
        output, error = p.communicate()
        print(output.decode())
        print(error.decode())

    print("Parallel fitting completed")


if __name__ == "__main__":
    args = sys.argv[1:]
    test_parallel_fitting(*args)
```
Works flawlessly:
```bash
python script.py 6 6 3
python script.py 6 120 10
python script.py 7 120 1
```
Takes forever:
```bash
python script.py 7 6 3
python script.py 7 120 10
python script.py 7 120 2
```
Please share your experiences or suggestions on resolving this slowdown.
Answer by TychoAI Agent • 1 month ago
Your slowdown is most likely due to oversubscription of threads in the underlying linear algebra libraries (e.g. BLAS/LAPACK) that NumPy/SciPy use. Here's what happens:
Multi-threaded BLAS: When you run a single process (n == 1), the BLAS libraries can decide to use all available cores. However, when you run multiple processes, each one may try to spawn several threads to perform matrix operations (especially when the array sizes – dictated here by increasing "n_covariates" – pass a threshold where the library thinks it’s beneficial to multi-thread).
Oversubscription: With multiple processes each spawning multiple threads, you end up with far more threads than available cores. This causes severe contention among threads, which slows down the overall computation dramatically (or “thrashes” CPU time), even though each process technically has its own GIL.
Why only on local machines? HPC compute nodes often have their own controls or different library configurations (or use different threading defaults) so that even if processes are parallelized, the BLAS libraries may be set to use a single thread (or be limited in other ways), avoiding oversubscription.
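If you want to confirm which BLAS backend NumPy is linked against and how many threads it will use by default, the threadpoolctl package can report that. This is only a small diagnostic sketch (not part of the fix itself) and assumes threadpoolctl is installed:

```python
# Diagnostic sketch: inspect the BLAS/OpenMP thread pools NumPy is linked against.
# Requires the threadpoolctl package (pip install threadpoolctl).
import numpy as np
from threadpoolctl import threadpool_info

# Touch NumPy so its BLAS library is actually loaded before inspecting it.
np.dot(np.ones((2, 2)), np.ones((2, 2)))

for pool in threadpool_info():
    # Each entry describes one thread pool: which API it belongs to and its default size.
    print(pool["user_api"], pool.get("internal_api"), "->", pool["num_threads"], "threads")
```

If this prints a thread count close to your core count, every worker process will try to grab that many threads, which is exactly the oversubscription described above.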
The common fix is to limit each Python process to use only a single thread for BLAS operations. You can do this by setting the appropriate environment variables before starting your processes:
```bash
export OMP_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1
export MKL_NUM_THREADS=1
export VECLIB_MAXIMUM_THREADS=1
```
You can add these lines to your bash script before launching the jobs. This ensures that even if the underlying libraries are multi-threaded, each process will only use one thread for its matrix operations.
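If you prefer to keep the limits in the Python launcher rather than the bash script, you can also pass them per child process through subprocess.Popen's env argument. This is a minimal sketch, not your exact launcher; the worker command here is just a placeholder:

```python
import os
import subprocess
import sys

# Copy the current environment and force single-threaded BLAS in the children.
child_env = os.environ.copy()
child_env.update({
    "OMP_NUM_THREADS": "1",
    "OPENBLAS_NUM_THREADS": "1",
    "MKL_NUM_THREADS": "1",
    "VECLIB_MAXIMUM_THREADS": "1",
})

# Launch the worker as before, but with the restricted environment.
# (sys.executable stands in for the "python" command used in the question.)
p = subprocess.Popen(
    [sys.executable, "-c", "import numpy; print('worker started')"],
    env=child_env,
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
)
print(p.communicate()[0].decode())
```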
Threadpoolctl: If you want to control the thread usage dynamically from within Python (for example, when calling a function that spawns threads), the threadpoolctl package can be used to limit the number of threads:
```python
from threadpoolctl import threadpool_limits

with threadpool_limits(limits=1):
    # your fitting code that calls numpy/scipy operations
    ...
```
Cluster vs. Local Differences: Be aware that your HPC environment might already have these limits set (or use a differently configured BLAS library), which is why you don’t see the slowdown there.
By ensuring that each process sticks to one thread, you reduce the interference between processes and allow your jobs to run in parallel without a massive slowdown.
Answer by UranianWatcher942 • 1 month ago
After some digging I found out that it was indeed the BLAS backend that was causing problems. The amazing threadpoolctl package provides contexts that set limits on the backends, and a single line completely solved my problem.
The working example, but now running as expected:
```python
import subprocess
import sys


def test_parallel_fitting(n_covariates, n_models, n_processes):
    # Create two identical but independent processes
    processes = []
    n_covariates = int(n_covariates)
    n_models = int(n_models)
    n_processes = int(n_processes)
    n_models_per_process = n_models // n_processes
    for i in range(n_processes):
        command = ["python", "-c", f"""
import numpy as np
import time
from scipy import optimize
import multiprocessing
from threadpoolctl import threadpool_limits

n_cpu_cores = multiprocessing.cpu_count()
cores_per_process = n_cpu_cores // {n_processes}

with threadpool_limits(limits=cores_per_process, user_api='blas'):
    for f in range({n_models_per_process}):
        start_time = time.time()
        X = np.random.rand(1500, {n_covariates})
        y = np.random.rand(1500)
        def model(params, X, y):
            return np.sum((y - np.dot(X, params)) ** 2)
        result = optimize.minimize(model, x0=np.ones({n_covariates}), args=(X, y))
        print(f"Process {i}:", f, " finished in ", time.time() - start_time, " seconds")
"""]
        p = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        processes.append(p)

    # Wait for completion and time it
    for p in processes:
        p.wait()
        output, error = p.communicate()
        print(output.decode())
        print(error.decode())

    print("Parallel fitting completed")


if __name__ == "__main__":
    args = sys.argv[1:]
    test_parallel_fitting(*args)
```
Of course setting num_processes > n_cpu_cores will still cause problems, but that's to be expected anyway, and I consider my problem solved.
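For completeness, one way to guard against that is to cap the worker count at the number of cores before splitting the models. A small sketch; the names mirror the example above but are otherwise illustrative:

```python
import multiprocessing


def split_models(n_models, n_processes):
    """Cap the worker count at the core count and split the models evenly."""
    n_cores = multiprocessing.cpu_count()
    n_workers = max(1, min(n_processes, n_cores))
    n_models_per_process = n_models // n_workers
    return n_workers, n_models_per_process


# e.g. asking for 32 workers on an 8-core machine falls back to 8 workers
print(split_models(120, 32))
```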
Thanks Nick ODell for the pointer; I quickly zeroed in on the problem after that.