2 posts tagged with "spark"

HPC: Run Spark Clusters on SLURM – Reproducible Setup with Pixi and sparkhpc

7 min read
Thanh-Giang Tan Nguyen
Founder at G Labs

Running distributed Spark workloads on HPC clusters is a common task in bioinformatics and data science. However, integrating Spark with SLURM, the dominant HPC job scheduler, requires careful orchestration: you need to allocate compute resources via SLURM, start a Spark master, coordinate worker processes, and ensure all dependencies (Java, PySpark, Python) are available. This post shows how to set up reproducible Spark clusters on SLURM using Pixi for environment management and sparkhpc for cluster orchestration, based on the gkit Spark-on-SLURM implementation.
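
The post walks through the full setup; as a taste, here is a minimal sketch of the launch pattern. The calls follow sparkhpc's documented interface, while the core count, the script name `submit_spark.py`, and the Pixi invocation are illustrative assumptions.

```python
# Minimal sketch: launch a Spark cluster on SLURM via sparkhpc, then connect
# with PySpark. Assumes Java, PySpark, findspark, and sparkhpc are provided
# by the project's Pixi environment, e.g. `pixi run python submit_spark.py`.
import findspark
findspark.init()  # locate the Spark installation before importing pyspark

import pyspark
from sparkhpc import sparkjob

# sparkhpc detects the SLURM scheduler and submits the master/worker jobs.
sj = sparkjob.sparkjob(ncores=8)  # core count is illustrative
sj.submit()                        # queue the cluster jobs via SLURM
sj.wait_to_start()                 # block until the Spark master is up

# Attach a SparkContext to the freshly started cluster.
sc = pyspark.SparkContext(master=sj.master_url())
print(sc.parallelize(range(1_000)).sum())

sc.stop()
sj.stop()                          # tear the SLURM-backed cluster down
```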

Variant Calling (Part 3): Production Scale HPC Deployment and Performance Optimization

20 min read
Thanh-Giang Tan Nguyen
Founder at G Labs

In Part 1, we built a solid bash baseline. In Part 2, we migrated to Nextflow with MD5 validation. Now it's time to deploy on HPC clusters with SLURM and optimize for production scale: configure executors for small clusters, tune resources per tool, replace bottleneck steps with faster alternatives (fastp + Spark-GATK), and demonstrate scaling from 1 to 100 samples. This practical guide will help you run your variant calling pipeline efficiently on real HPC infrastructure.
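
The configuration side of that tuning fits in a few lines of `nextflow.config`. The sketch below is illustrative only: the process names (`FASTP`, `HAPLOTYPECALLER`) and resource figures are placeholders rather than the post's measured values.

```groovy
// nextflow.config -- illustrative SLURM tuning for a small cluster
process {
    executor = 'slurm'

    // Hypothetical per-tool resources; real values depend on your data.
    withName: 'FASTP' {
        cpus   = 4
        memory = '8 GB'
        time   = '2h'
    }
    withName: 'HAPLOTYPECALLER' {
        cpus   = 8
        memory = '16 GB'
        time   = '8h'
    }
}

executor {
    queueSize       = 20          // cap concurrent SLURM jobs
    submitRateLimit = '10/1min'   // be gentle with the scheduler
}
```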