HPC: Run Spark Clusters on SLURM – Reproducible Setup with Pixi and sparkhpc
7 min read
Running distributed Spark workloads on HPC clusters is a common task in bioinformatics and data science. However, integrating Spark with SLURM—the dominant HPC job scheduler—requires careful orchestration: you need to allocate compute resources via SLURM, start a Spark master, coordinate worker processes, and ensure all dependencies (Java, PySpark, Python) are available. This post shows how to set up reproducible Spark clusters on SLURM using Pixi for environment management and sparkhpc for cluster orchestration, based on the gkit Spark-on-SLURM implementation.
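To make the moving parts concrete before walking through the setup, here is a minimal sketch of the orchestration flow, following the pattern from sparkhpc's documented example. The exact function and parameter names (`sparkjob.sparkjob`, `ncores`, `wait_to_start`, `start_spark`) are assumptions based on that documentation and should be checked against the version installed in your environment.

```python
# Minimal sketch of the sparkhpc orchestration flow (names follow sparkhpc's
# documented example; verify against your installed version).
import findspark
findspark.init()        # locate the Spark installation before importing pyspark

from sparkhpc import sparkjob

# Submit a SLURM batch job that launches the Spark master and workers
# on the allocated compute nodes.
sj = sparkjob.sparkjob(ncores=8)

sj.wait_to_start()      # block until SLURM schedules the job and the master is up
sc = sj.start_spark()   # SparkContext connected to the freshly started cluster

print(sc.parallelize(range(1000)).sum())
```

The rest of this post covers the surrounding pieces: how Pixi pins Java, PySpark, and Python so that a script like this runs identically on every node, and how the cluster lifecycle is managed within a SLURM allocation.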