I'm learning Dask to make my Python projects more efficient and scalable. To compare performance, I wrote a script that calculates the mean of the "points" column in a large CSV file using both Pandas and Dask.
Here's my code:
import pandas as pd
import dask.dataframe as dd
import time
from memory_profiler import memory_usage  # imported for memory profiling; not used in this snippet

filename = "large_dataset_3.csv"

# Pandas reads the whole file into memory immediately
df_pd = pd.read_csv(filename)
# Dask builds a lazy dataframe split into ~75 MB partitions
df_dask = dd.read_csv(filename, blocksize=75e6)

start = time.time()
mean_pd = df_pd["points"].mean()
stop = time.time()
print(f"Pandas Mean Computation Time: {stop - start:.5f} seconds")

start = time.time()
mean_dask = df_dask["points"].mean().compute(num_workers=4)
stop = time.time()
print(f"Dask Mean Computation Time: {stop - start:.5f} seconds")
When I run this, Pandas computes the mean in about 0.02 seconds, while Dask takes over 4.5 seconds. This is surprising since I expected Dask’s parallel processing to be faster.
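To make the comparison more apples-to-apples, I'm also thinking of timing the full read-plus-mean pipeline for both libraries, so that any file-reading cost lands inside both measurements. A rough sketch of what I have in mind (reusing the same filename):

# Sketch: time the end-to-end pipeline (read + mean) for both libraries,
# so the cost of reading the CSV is included in both measurements.
import time
import pandas as pd
import dask.dataframe as dd

filename = "large_dataset_3.csv"

start = time.time()
mean_pd = pd.read_csv(filename)["points"].mean()
print(f"Pandas read + mean: {time.time() - start:.5f} seconds")

start = time.time()
mean_dask = dd.read_csv(filename, blocksize=75e6)["points"].mean().compute(num_workers=4)
print(f"Dask read + mean: {time.time() - start:.5f} seconds")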
For context:
- The CSV file contains 100 million rows (totaling 292.4 MB).
- System Specs:
  - Processor: Intel® Core™ i5-8365U × 8 (4 cores, 8 threads)
  - RAM: 16 GB
My Questions:
- Why is Dask slower than Pandas in this scenario?
- What optimizations or configurations can improve Dask's performance?
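In case it matters, here is a sketch of a variant I'm considering that persists the Dask dataframe in memory first (assuming the data fits comfortably in RAM), so the timed section only measures the mean itself:

# Sketch: materialize the partitions in memory first, then time only the mean.
# Assumes the dataset fits in RAM.
import time
import dask.dataframe as dd

filename = "large_dataset_3.csv"

df_dask = dd.read_csv(filename, blocksize=75e6).persist()  # compute and keep partitions in memory

start = time.time()
mean_dask = df_dask["points"].mean().compute(num_workers=4)
print(f"Dask mean on persisted data: {time.time() - start:.5f} seconds")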