2 posts tagged with "data-processing"

HPC: Run Spark Clusters on SLURM – Reproducible Setup with Pixi and sparkhpc

· 7 min read
Thanh-Giang Tan Nguyen
Founder at G Labs

Running distributed Spark workloads on HPC clusters is a common task in bioinformatics and data science. However, integrating Spark with SLURM—the dominant HPC job scheduler—requires careful orchestration: you need to allocate compute resources via SLURM, start a Spark master, coordinate worker processes, and ensure all dependencies (Java, PySpark, Python) are available. This post shows how to set up reproducible Spark clusters on SLURM using Pixi for environment management and sparkhpc for cluster orchestration, based on the gkit Spark-on-SLURM implementation.
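The orchestration steps named above (allocate via SLURM, start a master, attach workers, submit the job) can be sketched as a single batch script. This is a minimal illustration, not gkit's actual implementation: the resource sizes, `SPARK_HOME` layout, port, and `my_job.py` are assumptions, and Pixi is used only to run each command inside the managed environment.

```shell
#!/bin/bash
#SBATCH --job-name=spark-cluster
#SBATCH --nodes=3
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=16
#SBATCH --time=02:00:00

# Hypothetical sketch: paths and sizes are assumptions, not gkit defaults.
MASTER_HOST=$(hostname -f)
MASTER_URL="spark://${MASTER_HOST}:7077"

# 1. Start the Spark master on the first allocated node,
#    inside the Pixi-managed environment (Java + PySpark).
pixi run -- "$SPARK_HOME/sbin/start-master.sh"

# 2. Launch one worker per allocated node, pointed at the master.
srun --ntasks-per-node=1 \
  pixi run -- "$SPARK_HOME/sbin/start-worker.sh" "$MASTER_URL" &

# 3. Submit the application against the freshly started cluster.
pixi run -- spark-submit --master "$MASTER_URL" my_job.py
```

Because every command runs through `pixi run`, the same locked environment is reproduced on every node, which is the point of pairing Pixi with sparkhpc-style orchestration.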

Unix Pipes in Bioinformatics: How Streaming Data Reduces Memory and Storage

· 22 min read
Thanh-Giang Tan Nguyen
Founder at G Labs

Unix pipes (|) are one of the most powerful yet underutilized features in bioinformatics. They allow you to chain multiple commands together, processing data in a streaming fashion that dramatically reduces memory usage and disk I/O. This post explores why pipes are essential for bioinformatics work and shows how they work under the hood.