HPC: Run Spark Clusters on SLURM – Reproducible Setup with Pixi and sparkhpc

7 min read
Thanh-Giang Tan Nguyen
Founder at G Labs

Running distributed Spark workloads on HPC clusters is a common task in bioinformatics and data science. However, integrating Spark with SLURM—the dominant HPC job scheduler—requires careful orchestration: you need to allocate compute resources via SLURM, start a Spark master, coordinate worker processes, and ensure all dependencies (Java, PySpark, Python) are available. This post shows how to set up reproducible Spark clusters on SLURM using Pixi for environment management and sparkhpc for cluster orchestration, based on the gkit Spark-on-SLURM implementation.

1. The Challenge: Spark + SLURM Integration

1.1. Why Spark on SLURM?

On most HPC clusters, SLURM controls resource allocation and job execution. If you want to run Spark:

  • You can't just launch Spark master and worker processes directly—SLURM must allocate the resources first
  • Dependencies (Java, PySpark, Python) must be available and consistent across all compute nodes
  • The Spark driver needs to discover and connect to the master and worker nodes
  • If jobs fail or time out, you need deterministic cleanup and logging

1.2. The Traditional Approach (Manual and Fragile)

Without a proper orchestration layer, developers often do this:

# Manually request resources
salloc -N 2 -c 4 --time=00:30:00

# SSH to the first node, start the Spark master manually
ssh node1
export SPARK_HOME=/opt/spark
$SPARK_HOME/bin/spark-class org.apache.spark.deploy.master.Master

# SSH to the second node, start a worker manually
# (Worker takes options first, then the master URL)
ssh node2
export SPARK_HOME=/opt/spark
$SPARK_HOME/bin/spark-class org.apache.spark.deploy.worker.Worker \
    --cores 4 spark://node1:7077

# Run your driver script
python my_analysis.py

# Manual cleanup
# Kill processes if they're still running...

Problems:

  • Error-prone (easy to forget ports, hostnames, cleanup)
  • Not reproducible (different developers may set it up differently)
  • Dependencies scattered across system paths or manual installations
  • Difficult to integrate into automated pipelines

1.3. The Better Approach: Pixi + sparkhpc

# Single command sets everything up with reproducible dependencies
make setup

# Single command runs the entire workflow
make example

Behind the scenes:

  • Pixi creates an isolated environment with Python, Java, and PySpark pinned to specific versions
  • sparkhpc submits a SLURM batch job that orchestrates Spark startup
  • A Python driver connects, runs computations, and cleans up automatically

2. Understanding the Architecture

2.1. The Component Stack

┌─────────────────────────────────────────────┐
│ User Machine (submit dir)                   │
│ ┌─────────────────────────────────────────┐ │
│ │ Python Driver (run_example.py)          │ │
│ └─────────────────────────────────────────┘ │
└─────────────────────────────────────────────┘
                 ↓ (sparkhpc.submit())
┌─────────────────────────────────────────────┐
│ SLURM Controller                            │
│ (sbatch sparkjob.slurm.template)            │
└─────────────────────────────────────────────┘
                 ↓ (srun)
┌─────────────────────────────────────────────┐
│ Spark Master (Node 1)                       │
│ Spark Workers (Nodes 1-N via srun)          │
│                                             │
│ Environment: Python, Java, PySpark          │
│ (from Pixi)                                 │
└─────────────────────────────────────────────┘

2.2. Key Files in the gkit Spark-on-SLURM Setup

spark-on-slurm/
├── pixi.toml                    # Environment definition
├── Makefile                     # User entry points
├── sparkhpc/
│   ├── run_example.py           # End-to-end workflow
│   ├── sparkhpc/
│   │   ├── sparkjob.py          # Base Spark job class
│   │   ├── slurmsparkjob.py     # SLURM-specific implementation
│   │   └── templates/
│   │       └── sparkjob.slurm.template   # SLURM batch script template
│   └── scripts/
└── sparkhpc.log                 # Cluster logs

3. Reproducibility with Pixi

3.1. What is Pixi?

Pixi is a cross-platform package manager that creates reproducible, project-local environments. Unlike a plain venv or an unlocked conda environment, Pixi records every transitive dependency in a lockfile (pixi.lock), so the exact same environment can be recreated on any machine.

3.2. The pixi.toml Configuration

[workspace]
authors = ["nttg8100 <nttg8100@gmail.com>"]
channels = ["conda-forge", "bioconda"]
name = "spark-on-slurm"
platforms = ["linux-64"]
version = "0.1.0"

[tasks]
sparkhpc-example = "python sparkhpc/run_example.py"

[dependencies]
python = "3.11.*"
openjdk = "==17.0.18"
pyspark = ">=4.1.1,<5"

Key aspects:

  • Channels: conda-forge and bioconda provide pre-built packages (including Java and Spark)
  • Pinned versions: openjdk = "==17.0.18" ensures exact Java version across all runs
  • Task definition: the sparkhpc-example task runs the Python driver inside the Pixi environment
  • Platform: linux-64 ensures reproducibility on HPC clusters (typically Linux)

3.3. Environment Setup

# Install Pixi (first time)
curl -sSL https://pixi.sh/install.sh | sh

# Resolve and install dependencies (writes pixi.lock)
pixi install

# Run a command in the Pixi environment
pixi run sparkhpc-example
# or via Makefile
make example

When pixi run sparkhpc-example executes:

  1. Pixi activates the locked environment (Python 3.11, OpenJDK 17, PySpark 4.1)
  2. JAVA_HOME is automatically set
  3. SPARK_HOME is resolved from the PySpark installation
  4. The Python driver runs in this isolated context
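
To confirm the wiring, you can run a short sanity check inside the environment. This is a minimal sketch (check_env.py is a hypothetical file name, and the printed versions depend on what pixi.lock resolved):

# check_env.py (run with: pixi run python check_env.py)
import os
import subprocess

import pyspark

print("PySpark version:", pyspark.__version__)
print("PySpark install dir:", os.path.dirname(pyspark.__file__))  # what SPARK_HOME resolves from
print("JAVA_HOME:", os.environ.get("JAVA_HOME"))

# The pinned OpenJDK should be on PATH inside the Pixi environment
subprocess.run(["java", "-version"], check=True)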

4. Spark Cluster Orchestration with sparkhpc

4.1. The sparkjob Class

The sparkjob class wraps Spark on SLURM by:

  1. Generating a SLURM batch script from a template
  2. Submitting the batch job via sbatch
  3. Polling for the Spark master URL (read from a metadata file)
  4. Starting a PySpark context that connects to the master
  5. Stopping the cluster and cleaning up when done

Example usage (from run_example.py):

import time

from sparkhpc import sparkjob

# Create a Spark job: 2 cores total, 2 cores per executor, 10 min walltime
sj = sparkjob.sparkjob(ncores=2, cores_per_executor=2, walltime="00:10")

# Submit to SLURM
cluster_id = sj.submit()
print(f"submitted cluster_id={cluster_id} jobid={sj.jobid}")

# Poll until the master starts (max 3 minutes)
master = None
deadline = time.time() + 180
while time.time() < deadline:
    master = sj.master_url()
    if master:
        print(f"master={master}")
        break
    time.sleep(1)

if not master:
    sj.stop()
    raise RuntimeError("Spark master did not start in time")

# Start a PySpark context connected to the master, and always clean up
sc = sj.start_spark(graphframes_package=None)
try:
    # Run Spark actions
    count_result = sc.parallelize(range(100)).count()
    sum_result = sc.parallelize(range(1, 11)).sum()
    print(f"count={count_result}")
    print(f"sum={sum_result}")
finally:
    sc.stop()
    sj.stop()
    print("cluster stopped")

4.2. The SLURM Batch Script Template

Behind the scenes, sparkhpc generates a SLURM script that looks like:

#!/bin/bash
#SBATCH --job-name=sparkjob
#SBATCH --nodes=1
#SBATCH --cpus-per-task=2
#SBATCH --time=00:10:00
#SBATCH --output=sparkcluster-%j.log

# Start Spark master on this node
export SPARK_HOME=/path/to/pyspark
$SPARK_HOME/bin/spark-class org.apache.spark.deploy.master.Master \
    --host $(hostname) \
    --port 7077 \
    --webui-port 8080 \
    > master.log 2>&1 &
MASTER_PID=$!

# Extract master URL and write to metadata file
sleep 2
MASTER_URL="spark://$(hostname):7077"
echo $MASTER_URL > $HOME/.sparkhpc_${SLURM_JOBID}_master

# Start Spark workers via srun (parallel on allocated nodes)
srun $SPARK_HOME/bin/spark-class org.apache.spark.deploy.worker.Worker \
    --cores 2 \
    $MASTER_URL

# Block until the master exits; SLURM reaps everything else when the job ends
wait $MASTER_PID

Key points:

  • Master binds to the node's hostname and port 7077
  • Master URL is written to a metadata file for the driver to discover
  • srun parallelizes worker startup across allocated nodes
  • All processes run within the SLURM allocation and clean up when the job ends
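
The submission step itself reduces to calling sbatch and parsing the job id from its output. Here is a minimal sketch of that step (not sparkhpc's actual code; on success, sbatch prints a line like "Submitted batch job 12345"):

import re
import subprocess

# Submit the generated batch script and capture the SLURM job id
result = subprocess.run(
    ["sbatch", "sparkjob.slurm"],
    capture_output=True, text=True, check=True,
)
jobid = re.search(r"\d+", result.stdout).group(0)
print(f"submitted jobid={jobid}")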

4.3. How the Driver Discovers the Master

The master_url() method polls for the metadata file:

def master_url(self):
    """Check if the master URL is available in the metadata file."""
    metadata_path = f"{os.path.expanduser('~')}/.sparkhpc_{self.jobid}_master"
    if os.path.exists(metadata_path):
        with open(metadata_path, 'r') as f:
            return f.read().strip()
    return None

This avoids hardcoding hostnames (which vary across clusters) and allows the driver and master to run asynchronously.
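
If you want the polling pattern from run_example.py as a reusable helper, a small wrapper along these lines works (wait_for_master is not part of sparkhpc's API; it only relies on the master_url() and stop() methods shown in this post):

import time

def wait_for_master(sj, timeout=180, poll_interval=1.0):
    """Poll sj.master_url() until the metadata file appears, or give up."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        master = sj.master_url()
        if master:
            return master
        time.sleep(poll_interval)
    sj.stop()  # tear down the SLURM job so it does not linger in the queue
    raise TimeoutError(f"Spark master did not start within {timeout}s")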


5. Running the Example Locally (or on a SLURM Cluster)

5.1. Local Setup

If you have a local SLURM cluster (e.g., via Docker):

cd spark-on-slurm

# Install dependencies
make setup

# Run the example
make example

Expected output:

submitted cluster_id=abc123 jobid=12345
master=spark://node1:7077
count=100
sum=55
cluster stopped

5.2. On a Real HPC Cluster

The same commands work on any SLURM-managed HPC cluster:

# Log in to the cluster
ssh user@hpc.example.com
cd spark-on-slurm

# First time: install Pixi and dependencies
make setup

# Run Spark via SLURM
make example

# View Spark master logs if needed
tail -f sparkhpc/sparkcluster-*.log

5.3. Cleaning Up

# Remove generated artifacts
make clean

# Manually check for lingering jobs
squeue -u $USER
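
If a job does linger, you can cancel it by name. A small best-effort sketch (the job name sparkjob comes from the #SBATCH --job-name line in the template above):

import getpass
import subprocess

# Cancel any of this user's SLURM jobs named "sparkjob"
subprocess.run(
    ["scancel", "--user", getpass.getuser(), "--name", "sparkjob"],
    check=False,  # best-effort cleanup; nothing to cancel is fine
)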

6. Key Takeaways

  1. Pixi provides reproducible environments — all dependencies (Python, Java, PySpark) locked in pixi.lock
  2. sparkhpc orchestrates Spark on SLURM — handles SLURM batch submission, master startup, worker coordination
  3. Metadata files enable driver-master discovery — no hardcoding of hostnames or ports
  4. One command each to set up and run — make setup, then make example (or pixi run sparkhpc-example)
  5. Suitable for HPC pipelines — extends beyond local testing to real clusters with thousands of cores

By combining Pixi's reproducibility with sparkhpc's SLURM orchestration, you can build reliable, auditable Spark workflows that scale from laptops to production HPC clusters.

