July 3, 2025

About

This article demonstrates how to efficiently solve multiple large, sparse linear systems in Python, using scipy.sparse.linalg.spsolve for the individual solves and joblib.Parallel for parallel execution. It covers how to generate random symmetric positive definite (SPD) sparse matrices, solve the resulting systems in parallel, and benchmark performance across different numbers of worker processes. The article also discusses common pitfalls, such as process overhead when the individual problems are small, and gives practical recommendations for maximizing speedup with joblib. A sample implementation and a visual analysis of speedup scaling are included.

Code

import numpy as np
from scipy.sparse import random as sparse_random
from scipy.sparse import csr_matrix, eye as sparse_eye
from scipy.sparse.linalg import spsolve
from joblib import Parallel, delayed
import time


def generate_random_spd_matrix(n, density=0.01, seed=None):
    """Generate a random sparse symmetric positive definite matrix (approx)."""
    rng = np.random.default_rng(seed)
    A = sparse_random(n, n, density=density, format='csr', random_state=rng)
    A = A + A.T  # make symmetric
    A = A + n * sparse_eye(n, format='csr')  # large diagonal shift => diagonally dominant, hence SPD
    return A


def solve_system(K: csr_matrix, f: np.ndarray) -> np.ndarray:
    """Solve the single sparse system K x = f."""
    return spsolve(K, f)


def benchmark_solvers(Ks, fs, n_jobs=4):
    """Time the serial and joblib-parallel solution of all systems and report the speedup."""
    n_problems = len(Ks)
    n = Ks[0].shape[0]

    print(f"\n🔵 Solving {n_problems} systems of size {n} × {n}...")

    # --- Serial execution ---
    t0 = time.perf_counter()
    serial_solutions = [solve_system(K, f) for K, f in zip(Ks, fs)]
    t_serial = time.perf_counter() - t0
    print(f"⏱ Serial time: {t_serial:.3f} sec")

    # --- Parallel execution ---
    t0 = time.perf_counter()
    parallel_solutions = Parallel(n_jobs=n_jobs)(
        delayed(solve_system)(K, f) for K, f in zip(Ks, fs)
    )
    t_parallel = time.perf_counter() - t0
    print(f"⚡ Parallel time (n_jobs={n_jobs}): {t_parallel:.3f} sec")

    speedup = t_serial / t_parallel if t_parallel > 0 else float('inf')
    print(f"🚀 Speedup: {speedup:.2f}x\n")
    
    return t_parallel, speedup


if __name__ == "__main__":
    
    n = 1000
    n_problems = 20
    Ks = [generate_random_spd_matrix(n, density=0.01, seed=i) for i in range(n_problems)]
    fs = [np.random.rand(n) for _ in range(n_problems)]
    t_parallel_list, speedup_list = list(), list()
    n_jobs_range = range(1, 7)
    for n_jobs in n_jobs_range:
        print(f"!! - njobs: {n_jobs} !!")
        t_parallel, speedup = benchmark_solvers(Ks, fs, n_jobs=n_jobs)
        t_parallel_list.append(t_parallel)
        speedup_list.append(speedup)

    import matplotlib.pyplot as plt
    plt.figure(figsize=(10, 5))
    plt.subplot(1, 2, 1)
    plt.plot(n_jobs_range, t_parallel_list, marker='o')
    plt.title("Parallel Time vs n_jobs")
    plt.xlabel("n_jobs")
    plt.ylabel("Parallel Time (s)")
    plt.grid(True)

    plt.subplot(1, 2, 2)
    plt.plot(n_jobs_range, speedup_list, marker='s')
    plt.title("Speedup vs n_jobs")
    plt.xlabel("n_jobs")
    plt.ylabel("Speedup")
    plt.grid(True)

    plt.tight_layout()
    output_path = "./parallel_speedup_plot.png"
    plt.savefig(output_path)

Result

In this experiment, we solved 20 sparse linear systems of size 1000 × 1000 using spsolve in both serial and parallel modes, varying n_jobs. The results show a clear pattern: parallel execution improves performance, but the speedup is far from linear and depends on both the number of worker processes and the problem size.

These results highlight an important point: parallel processing comes with overhead, in particular the cost of spawning and managing worker processes. When each task is relatively light (here, a single 1000×1000 solve takes under 100 ms), this overhead can significantly dilute the benefits of parallelism. For example, while n_jobs=2 offers only a modest speedup (1.39×), using n_jobs=4 yields a much better result (2.81×).
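To gauge how light each task really is, it helps to time a single solve in isolation; when one solve takes only tens of milliseconds, the fixed cost of starting joblib's worker processes can swallow much of the gain. A minimal sketch, rebuilding one test matrix inline with the same construction as generate_random_spd_matrix above (timings will of course differ on other machines):

import time
import numpy as np
from scipy.sparse import random as sparse_random, eye as sparse_eye
from scipy.sparse.linalg import spsolve

n = 1000
# Same SPD construction as in the listing above.
A = sparse_random(n, n, density=0.01, format='csr', random_state=0)
K = A + A.T + n * sparse_eye(n, format='csr')
f = np.random.rand(n)

t0 = time.perf_counter()
x = spsolve(K, f)
print(f"Single solve: {(time.perf_counter() - t0) * 1e3:.1f} ms")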

Interestingly, n_jobs=5 performs worse than n_jobs=4 (2.37× vs 2.81×), likely due to scheduling contention or increased memory bandwidth pressure, while n_jobs=6 recovers and reaches the peak speedup of 3.65×.
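If thread oversubscription contributes to this dip (an assumption; the benchmark does not isolate the cause, and the SuperLU factorization behind spsolve is largely single-threaded), a common mitigation is to cap the BLAS/OpenMP thread pools inside each worker, for example with the third-party threadpoolctl package. A sketch of a single-thread-per-worker variant of solve_system:

from threadpoolctl import threadpool_limits  # third-party: pip install threadpoolctl
from scipy.sparse.linalg import spsolve

def solve_system_single_threaded(K, f):
    # Cap any BLAS/OpenMP pools this worker might open to one thread, so that
    # n_jobs processes do not oversubscribe the physical cores.
    with threadpool_limits(limits=1):
        return spsolve(K, f)

Setting OMP_NUM_THREADS=1 (and the OpenBLAS/MKL equivalents) before NumPy is imported in the workers has the same effect without the extra dependency.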

The takeaway is this: joblib scales well for CPU-bound tasks, but only when the tasks are sufficiently heavy to amortize the parallelization overhead.
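One way to make each task heavier is to hand every worker a whole chunk of systems instead of a single one, so that process startup and data transfer are amortized over several solves; joblib's batch_size argument does something similar automatically. The helpers below (solve_chunk, solve_all_chunked) are not part of the listing above, just a sketch of the idea:

import numpy as np
from joblib import Parallel, delayed
from scipy.sparse.linalg import spsolve

def solve_chunk(Ks_chunk, fs_chunk):
    # One worker solves a whole block of systems in a single task.
    return [spsolve(K, f) for K, f in zip(Ks_chunk, fs_chunk)]

def solve_all_chunked(Ks, fs, n_jobs=4):
    # Split the problem indices into n_jobs roughly equal chunks.
    chunks = np.array_split(np.arange(len(Ks)), n_jobs)
    results = Parallel(n_jobs=n_jobs)(
        delayed(solve_chunk)([Ks[i] for i in idx], [fs[i] for i in idx])
        for idx in chunks
    )
    # Flatten the per-chunk result lists back into one list of solution vectors.
    return [x for chunk in results for x in chunk]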

!! - njobs: 1 !!

🔵 Solving 20 systems of size 1000 × 1000...
⏱ Serial time: 1.872 sec
⚡ Parallel time (n_jobs=1): 1.870 sec
🚀 Speedup: 1.00x

!! - njobs: 2 !!

🔵 Solving 20 systems of size 1000 × 1000...
⏱ Serial time: 1.869 sec
⚡ Parallel time (n_jobs=2): 1.348 sec
🚀 Speedup: 1.39x

!! - njobs: 3 !!

🔵 Solving 20 systems of size 1000 × 1000...
⏱ Serial time: 1.837 sec
⚡ Parallel time (n_jobs=3): 1.026 sec
🚀 Speedup: 1.79x

!! - njobs: 4 !!

🔵 Solving 20 systems of size 1000 × 1000...
⏱ Serial time: 1.826 sec
⚡ Parallel time (n_jobs=4): 0.651 sec
🚀 Speedup: 2.81x

!! - njobs: 5 !!

🔵 Solving 20 systems of size 1000 × 1000...
⏱ Serial time: 1.875 sec
⚡ Parallel time (n_jobs=5): 0.791 sec
🚀 Speedup: 2.37x

!! - njobs: 6 !!

🔵 Solving 20 systems of size 1000 × 1000...
⏱ Serial time: 1.870 sec
⚡ Parallel time (n_jobs=6): 0.512 sec
🚀 Speedup: 3.65x