
How can a country build its own 1000 Genomes Project?

· 45 min read
Thanh-Giang Tan Nguyen
Founder at G Labs

Recent national genome initiatives highlight the growing need for population-specific genomic resources. Reading VN1K: a genome graph-based and function-driven multi-omics and phenomics resource for the Vietnamese population and EGP1K: Whole-Genome Sequencing of 1,024 Egyptians Characterizes Population Structure and Genetic Diversity makes it clear that building a national-scale 1000-genomes project is increasingly important for understanding genetic diversity, improving disease research, and enabling precision medicine.

Variant Calling (Part 11): Population-Scale Genotyping Using gVCF and Joint Variant Calling

· 21 min read
Thanh-Giang Tan Nguyen
Founder at G Labs

Population-scale variant calling is a critical step in building genomic population projects. While single-sample variant calling is well established, scaling joint genotyping to thousands of WGS samples introduces challenges in performance, storage, and incremental updates. In this blog, I explore gVCF-based joint variant calling approaches and evaluate scalable solutions using modern open-source tools. I also discuss practical architecture considerations to efficiently construct population-scale genomics projects.
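To make the gVCF idea concrete, here is a minimal, hypothetical Python sketch of why reference blocks enable incremental joint genotyping: each sample's gVCF records variant sites and reference blocks, so a site discovered in a newly added sample can be genotyped across existing samples without re-running per-sample calling. The data model below is an illustration, not GATK's actual representation.

```python
# Minimal sketch: gVCF-style records per sample. A record is either a variant
# site or a reference "block" covering a range of positions. All names and
# data here are hypothetical illustrations, not GATK's actual data model.

def genotype_at(sample_records, pos):
    """Return this sample's genotype at pos using its gVCF records."""
    for rec in sample_records:
        if rec["start"] <= pos <= rec["end"]:
            return rec["gt"]
    return "./."  # no covering record: missing genotype

# Two samples: sample A carries a variant at position 1,000; sample B only
# has a reference block spanning that position.
sample_a = [{"start": 1,    "end": 999,  "gt": "0/0"},
            {"start": 1000, "end": 1000, "gt": "0/1"},
            {"start": 1001, "end": 2000, "gt": "0/0"}]
sample_b = [{"start": 1,    "end": 2000, "gt": "0/0"}]

# Joint genotyping at the new site needs no re-calling of sample B:
joint = [genotype_at(s, 1000) for s in (sample_a, sample_b)]
print(joint)  # ['0/1', '0/0']
```

This is exactly what makes the gVCF workflow incremental: per-sample calling runs once, and joint genotyping only re-reads the stored records.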

Customizing nf-core Modules: Building a Domain-Specific Variant Calling Library

· 12 min read
Thanh-Giang Tan Nguyen
Founder at G Labs

While nf-core provides a comprehensive, well-maintained library of Nextflow modules for bioinformatics, organizations often need domain-specific variants tailored to their analysis pipelines and computational infrastructure. The gianglabs/nf-modules repository demonstrates how to extend the nf-core framework by integrating both shared nf-core modules and custom organization-specific modules optimized for genomic data processing, particularly variant discovery and annotation workflows. This post explores the core differences between standard nf-core modules and the customized gianglabs implementation, including specialized module variants, domain-focused subworkflows, and adapted CI/CD practices.

Testing in Bioinformatics: Why Running Code with Input Data Isn't Enough

· 9 min read
Thanh-Giang Tan Nguyen
Founder at G Labs

In Part 2 of our CI/CD series, we showed how to set up automated testing with make test-e2e—running your workflow with test data and checking if it produces results. If the script runs without crashing, you might think "everything works fine." But here's the uncomfortable truth: a pipeline that runs successfully doesn't mean it produces correct results.

This post explains why testing goes far beyond "running code and checking it doesn't crash." We'll explore the different types of tests bioinformaticians should care about and show practical examples of how to catch real bugs that simple end-to-end tests would miss.
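As a taste of what "beyond not crashing" means, here is a small, hypothetical content-level check: instead of only verifying that a VCF file was produced, assert properties of its content. The file contents and thresholds below are illustrative, not taken from the post's actual test suite.

```python
# Hypothetical content-level checks for a pipeline's VCF output.
# A run that "succeeds" can still emit an empty or malformed VCF;
# asserting on content catches that.

def check_vcf(lines, min_variants=1):
    header = [l for l in lines if l.startswith("#")]
    records = [l for l in lines if l and not l.startswith("#")]
    assert any(l.startswith("##fileformat=VCF") for l in header), "missing VCF header"
    assert len(records) >= min_variants, f"expected >= {min_variants} variants, got {len(records)}"
    for rec in records:
        fields = rec.split("\t")
        assert len(fields) >= 8, f"malformed record: {rec!r}"
    return len(records)

# Toy VCF standing in for real pipeline output:
toy_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t1000\t.\tA\tG\t50\tPASS\tDP=30",
]
print(check_vcf(toy_vcf))  # 1
```

An end-to-end run that produced zero variant records would pass a "did it crash?" check but fail this one immediately.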

Scalable Nextflow Modules: Building a Template with Copier, CI/CD, and nf-test

· 19 min read
Thanh-Giang Tan Nguyen
Founder at G Labs

Creating and maintaining a library of reusable Nextflow modules is a significant challenge for bioinformatics teams. Without a consistent structure, code quality standards, and automated testing, modules quickly become difficult to share, validate, and integrate into pipelines. The nf-modules-template solves this by providing a production-ready template that uses Copier to scaffold new module repositories, GitHub Actions for automated CI/CD workflows, pre-commit hooks for code quality, and nf-test with intelligent sharding for scalable module testing. This post explores how these technologies work together to enable reproducible, maintainable Nextflow module libraries.

HPC: Run Spark Clusters on SLURM – Reproducible Setup with Pixi and sparkhpc

· 7 min read
Thanh-Giang Tan Nguyen
Founder at G Labs

Running distributed Spark workloads on HPC clusters is a common task in bioinformatics and data science. However, integrating Spark with SLURM—the dominant HPC job scheduler—requires careful orchestration: you need to allocate compute resources via SLURM, start a Spark master, coordinate worker processes, and ensure all dependencies (Java, PySpark, Python) are available. This post shows how to set up reproducible Spark clusters on SLURM using Pixi for environment management and sparkhpc for cluster orchestration, based on the gkit Spark-on-SLURM implementation.
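As a rough sketch of the environment side, a hypothetical pixi.toml might pin the JVM and PySpark from conda-forge while pulling sparkhpc from PyPI. Package names and version pins below are assumptions for illustration, not the exact gkit configuration:

```toml
# Hypothetical pixi.toml sketch — versions and pins are illustrative.
[project]
name = "spark-on-slurm"
channels = ["conda-forge"]
platforms = ["linux-64"]

[dependencies]
python = "3.11.*"
openjdk = "11.*"      # Spark needs a JVM on every node
pyspark = "3.5.*"

[pypi-dependencies]
sparkhpc = "*"        # launches Spark standalone clusters under SLURM
```

Because Pixi resolves everything into a lockfile, every SLURM node that activates the environment sees the same Java, Python, and Spark versions.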

HPC: Test Ansible Playbook With Molecule – From Manual Vagrant to Automated CI/CD

· 10 min read
Thanh-Giang Tan Nguyen
Founder at G Labs

Testing Ansible playbooks for HPC clusters is challenging. You can manually spin up VMs with Vagrant, but debugging issues across controller and worker nodes takes time. Instead, Molecule provides a repeatable, automated testing framework that validates your playbook configuration before deployment. This post shows how to transition from manual Vagrant testing to Molecule, and then integrate it into GitHub Actions CI/CD.

GIAB Pilot Study: Establishing High-Confidence Genomic Benchmarks

· 12 min read
Thanh-Giang Tan Nguyen
Founder at G Labs

The Genome in a Bottle (GIAB) Consortium has become the gold standard for variant calling benchmarking in clinical and research genomics. While many bioinformaticians routinely use GIAB reference materials to validate pipelines and assess variant caller performance, fewer understand the rigorous methodological framework underlying these benchmarks. This review examines the foundational 2014 pilot study by Zook et al., which established the technical infrastructure and integration strategies that continue to guide reference material development today.

Variant Calling (Part 10): The Challenges of Structural Variant Calling with Short Reads

· 18 min read
Thanh-Giang Tan Nguyen
Founder at G Labs

Structural variants (SVs) are large-scale genomic alterations (≥50 bp) including deletions, duplications, inversions, and translocations. While short-read sequencing has revolutionized SNP and small indel detection, structural variant calling remains a significant challenge. This blog explores SV calling tools for germline samples with 30X short reads, benchmarks five popular callers (Manta, TIDDIT, Delly, Smoove, CNVnator), and reveals fundamental limitations in both detection tools and evaluation methods. The results show that even the best-performing tool (Manta) achieves only 36.8% precision when evaluated with Truvari, while alternative evaluation methods from 2019 report >80% precision, highlighting the complexity and ambiguity inherent in SV analysis.

Variant Calling (Part 9): Storage Cost Optimization

· 11 min read
Thanh-Giang Tan Nguyen
Founder at G Labs

Whole Genome Sequencing (WGS) projects generate massive amounts of data. While analysis costs are significant, storage costs often become the dominant expense over time. The key challenge: you need to preserve raw data and alignments for potential re-analysis with new tools, but you can't afford unlimited storage. This blog post explores how the CRAM format provides a solution, achieving 45% storage savings compared to BAM while maintaining full lossless compression and re-alignment capability. Accordingly, the new version of nf-germline-short-read-variant-calling supports CRAM files for lower storage cost and easier re-analysis.
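The storage arithmetic is easy to sketch. Assuming (hypothetically) ~80 GB per 30X WGS BAM and a typical object-storage list price, a 45% per-file saving compounds quickly at cohort scale:

```python
# Back-of-the-envelope CRAM savings. The per-sample BAM size and the
# price are illustrative assumptions, not measured figures from the post.
bam_gb_per_sample = 80          # assumed 30X WGS BAM size
cram_saving = 0.45              # ~45% smaller than BAM (from the benchmark)
usd_per_gb_month = 0.023        # e.g. S3 Standard list price, subject to change

def monthly_cost(n_samples, gb_per_sample, price):
    return n_samples * gb_per_sample * price

bam_cost = monthly_cost(1000, bam_gb_per_sample, usd_per_gb_month)
cram_cost = monthly_cost(1000, bam_gb_per_sample * (1 - cram_saving), usd_per_gb_month)
print(f"BAM:  ${bam_cost:,.0f}/month")   # $1,840/month for 1,000 samples
print(f"CRAM: ${cram_cost:,.0f}/month")  # $1,012/month for 1,000 samples
```

For a 1,000-sample cohort under these assumptions, CRAM saves roughly $800 per month on alignments alone, and the gap only widens as the cohort grows.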

Variant Calling (Part 8): Structural Variant Calling Short Read Benchmark

· 8 min read
Thanh-Giang Tan Nguyen
Founder at G Labs

Structural variants (SVs) — deletions, insertions, duplications, inversions, and translocations — are large genomic alterations (typically ≥50 bp) that play a major role in disease but are much harder to detect than SNPs or small indels. In this post, we benchmark Manta, the SV caller integrated in nf-core/sarek, against the GIAB HG002 truth set using Truvari, and explore why short-read SV calling remains a fundamentally difficult problem.

Variant Calling (Part 7): Variant Annotation with VEP and SnpSift: Integrating Functional Prediction and Variant Databases

· 16 min read
Thanh-Giang Tan Nguyen
Founder at G Labs

After calling variants with high accuracy from the previous benchmark, the next step is variant annotation—understanding what each variant tells us about the sample. Variant annotation can be broadly categorized into two approaches: (1) using computational tools to predict the functional effect of variants, and (2) cross-referencing against databases of known variant effects.

Variant Calling (Part 6): Do we really need complex pipelines to achieve high-quality variant calling?

· 11 min read
Thanh-Giang Tan Nguyen
Founder at G Labs

In many bioinformatics workflows, pipelines keep getting more complex — more preprocessing steps, more tools, more layers of abstraction. But sometimes a simple question is worth asking: do we actually need all of that? To find out, I benchmarked a minimal germline variant calling workflow on the HG002 sample from the Genome In A Bottle (GIAB) truth set. Surprisingly, the results were very similar to those produced by nf-core/sarek, achieving >99% accuracy for SNPs and INDELs on HG002.

Variant Calling (Part 5): Benchmarking Germline Variant Calling with nf-core/sarek

· 13 min read
Thanh-Giang Tan Nguyen
Founder at G Labs

This series has walked through building variant calling workflows, but what matters most is not how advanced or modern the framework is; it is whether the workflow scores highly against gold-standard criteria. In this post, I therefore benchmark against the Genome In A Bottle (GIAB) curated truth set. nf-core/sarek proves consistent with the results published in its paper, exceeding 99% accuracy for SNP/INDEL variants.

Variant Calling (Part 3): Production Scale HPC Deployment and Performance Optimization

· 20 min read
Thanh-Giang Tan Nguyen
Founder at G Labs

In Part 1, we built a solid bash baseline. In Part 2, we migrated to Nextflow with MD5 validation. Now it's time to deploy on HPC clusters with SLURM and optimize for production scale: configure executors for small clusters, tune resources per tool, replace bottleneck steps with faster alternatives (fastp + Spark-GATK), and demonstrate scaling from 1 to 100 samples. This practical guide will help you run your variant calling pipeline efficiently on real HPC infrastructure.
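The executor side of that setup can be sketched in a few lines of nextflow.config. The queue name, process name, and resource numbers below are hypothetical placeholders, not the exact settings from the post:

```groovy
// Hypothetical nextflow.config sketch for a small SLURM cluster.
process {
    executor = 'slurm'
    queue    = 'compute'            // assumed partition name

    // Per-tool tuning: give the aligner more than the defaults.
    withName: 'BWA_MEM' {           // hypothetical process name
        cpus   = 8
        memory = '16 GB'
        time   = '8h'
    }
}

executor {
    queueSize       = 50            // cap concurrent SLURM jobs
    submitRateLimit = '10/1min'     // be gentle with the scheduler
}
```

Keeping these settings in a config profile, rather than in the workflow code, is what lets the same pipeline scale from 1 to 100 samples without edits.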

Variant Calling (Part 2): From Bash to Nextflow: GATK Best Practice With Nextflow

· 27 min read
Thanh-Giang Tan Nguyen
Founder at G Labs

In Part 1, we built a complete 10-step GATK variant calling pipeline in bash—perfect for academic research and 1-10 samples. But what happens when you need to scale to 100+ samples? This is where Nextflow becomes essential.

📁 Repository: All code from this tutorial is organized in the nf-germline-short-read-variant-calling repository. The structure follows best practices with separate directories for bash (bash-gatk) and Nextflow (nextflow-gatk) implementations.

Variant Calling (Part 1): Building a Reproducible GATK Variant Calling Bash Workflow with Pixi

· 19 min read
Thanh-Giang Tan Nguyen
Founder at G Labs

This blog is designed as a practical starting point for building bioinformatics workflows focused on germline variant calling. You'll begin with a straightforward, standard approach using bash and reproducible environments. In future posts, we'll explore how to transition to best-practice workflow management with Nextflow, allowing for further optimization, customization, and integration of additional tools to enhance workflow quality.

Working with Remote Files using bcftools and samtools (HTSlib)

· 18 min read
Thanh-Giang Tan Nguyen
Founder at G Labs

HTSlib-based tools like bcftools and samtools provide powerful capabilities for working with genomic data stored on remote servers. Whether your data is in AWS S3, accessible via FTP, or hosted on HTTPS endpoints, these tools allow you to efficiently query and subset remote files without downloading entire datasets. This guide covers authentication, remote file access patterns, and practical workflows.

Setting Up a Local Nextflow Training Environment with Code-Server and HPC

· 9 min read
Thanh-Giang Tan Nguyen
Founder at G Labs

Setting up a robust development environment for Nextflow training across local and HPC systems requires a unified solution. Code-server provides a browser-based VS Code interface accessible from any machine, making it perfect for teams collaborating on Nextflow workflows. This guide walks you through configuring a complete Nextflow training environment with code-server, Singularity containers, and Pixi-managed tools.

For a comprehensive introduction to Pixi and package management, see our Pixi new-conda era.

Docker Out of Docker: Running Interactive Web Applications for Data Analysis

· 10 min read
Thanh-Giang Tan Nguyen
Founder at G Labs

Running interactive web applications like RStudio, JupyterLab, and Code Server in containers is a powerful way to provide reproducible analysis environments. However, users often need to spawn additional containerized tools from within these applications. Docker out of Docker (DooD) elegantly solves this by allowing containers to access the host's Docker daemon. This post explains how to set up DooD for interactive web applications and why it's the right approach for bioinformatics workflows.

Containers on HPC: From Docker to Singularity and Apptainer

· 9 min read
Thanh-Giang Tan Nguyen
Founder at G Labs

Container technologies have revolutionized software deployment and reproducibility in scientific computing. However, traditional Docker faces significant limitations in High-Performance Computing (HPC) environments. This post explores why Docker struggles on HPC systems and introduces modern alternatives like Docker rootless, Singularity, and Apptainer.

How to Migrate from In-House Pipelines to Enterprise-Level Workflows: A Proven 3-Step Validation Framework

· 18 min read
Thanh-Giang Tan Nguyen
Founder at G Labs

Whether your lab uses bash scripts, Python workflows, Snakemake pipelines, or custom solutions—your in-house pipeline works fine locally. It's been running for years. But as your research scales, you face a hard truth: in-house pipelines don't scale, aren't reproducible across teams, and require constant manual fixes.

Unix Pipes in Bioinformatics: How Streaming Data Reduces Memory and Storage

· 22 min read
Thanh-Giang Tan Nguyen
Founder at G Labs

Unix pipes (|) are one of the most powerful yet underutilized features in bioinformatics. They allow you to chain multiple commands together, processing data in a streaming fashion that dramatically reduces memory usage and disk I/O. This post explores why pipes are essential for bioinformatics work and shows how they work under the hood.
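The same streaming behavior is visible from Python. The snippet below (a self-contained illustration, not from the post; it assumes a `gzip` binary on PATH) chains two processes with a pipe, so compressed data flows from one to the other without an intermediate file ever touching disk:

```python
# Stream data through a Unix pipe from Python: gzip's stdout feeds a
# decompressor directly, so no intermediate file hits the disk and
# memory stays bounded regardless of data size.
import subprocess

# 1,000 short lines standing in for sequencing reads.
data = b"".join(b"read_%d\n" % i for i in range(1_000))

compress = subprocess.Popen(["gzip", "-c"], stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE)
# zcat-equivalent: decompress on the other side of the pipe
decompress = subprocess.Popen(["gzip", "-dc"], stdin=compress.stdout,
                              stdout=subprocess.PIPE)
compress.stdout.close()        # let gzip receive SIGPIPE if the reader exits

compress.stdin.write(data)
compress.stdin.close()
line_count = sum(1 for _ in decompress.stdout)  # stream lines, never load all
decompress.wait()
compress.wait()
print(line_count)  # 1000
```

This is the shell pipeline `... | gzip -c | gzip -dc | wc -l` expressed explicitly: each process starts as soon as bytes arrive, which is why pipelines over large FASTQ/BAM files need neither temporary files nor memory proportional to input size.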

Containers in Bioinformatics: Community Tooling and Efficient Docker Building

· 21 min read
Thanh-Giang Tan Nguyen
Founder at G Labs

Docker containers are revolutionizing bioinformatics by automating reproducibility and portability across platforms. But what problems can they actually solve? This post shows real-world applications of containers in bioinformatics workflows, then guides you through the simplest possible ways to use, build and debug them.

Bioinformatics Workflow Template: Standardizing Python Pipelines with Modular Design

· 13 min read
Thanh-Giang Tan Nguyen
Founder at G Labs

Building reproducible bioinformatics pipelines is hard. Every project starts from scratch with its own testing, CI/CD, and deployment strategy. What if you could clone a template, add your analysis tools, and be ready to go?

This post introduces a standardized bioinformatics workflow template featuring consistent testing, CI/CD, and project structure. Developed from real production experience with bioinfor-wf-template, this template reduces setup time from days to minutes, ensures research reproducibility, and promotes modular, reusable code. It is Python-based and ideal for proof-of-concept projects. Support for more advanced and widely adopted bioinformatics frameworks (such as Snakemake and Nextflow) is planned, applying the same core principles while leveraging their native testing systems.

Running GitHub Actions Locally with act: 5x Faster Development

· 12 min read
Thanh-Giang Tan Nguyen
Founder at G Labs

GitHub Actions are powerful for automating bioinformatics pipelines, but waiting 5-10 minutes for each cloud run is painful during development. act lets you run GitHub Actions workflows locally on your machine in seconds, slashing feedback time by 5x.

In this post, we'll explore act, a command-line tool that runs GitHub Actions locally using Docker. Perfect for testing ML pipelines, gene expression analysis, and CI/CD workflows before pushing to GitHub.

Machine Learning in Bioinformatics Part 1: Building KNN from Scratch

· 12 min read
Thanh-Giang Tan Nguyen
Founder at G Labs

Machine learning is transforming bioinformatics, enabling us to discover patterns in biological data. In this first part, we'll build a K-Nearest Neighbors (KNN) classifier from scratch using only Python, then apply it to simulated gene expression data. This post is designed for anyone who knows basic Python and biology—no advanced ML experience required!
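The shape of what the post builds can be previewed in a few lines. This is a minimal pure-Python KNN sketch; the distance metric, k, and the toy two-gene "expression" data are illustrative, not the post's exact dataset:

```python
# A minimal KNN classifier in pure Python: label a query point by
# majority vote among its k nearest training points.
from collections import Counter
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_X, train_y, query, k=3):
    """Majority vote among the k training points closest to the query."""
    dists = sorted(zip((euclidean(x, query) for x in train_X), train_y))
    k_labels = [label for _, label in dists[:k]]
    return Counter(k_labels).most_common(1)[0][0]

# Toy 2-gene "expression" profiles: tumor samples are high in gene 1.
train_X = [(8.0, 1.0), (7.5, 1.2), (1.0, 6.0), (0.8, 7.1)]
train_y = ["tumor", "tumor", "normal", "normal"]

print(knn_predict(train_X, train_y, (7.0, 1.5)))  # tumor
print(knn_predict(train_X, train_y, (1.2, 6.5)))  # normal
```

Everything here is standard-library Python, which is the point of the post: the algorithm fits in a dozen lines once the distance function and voting rule are clear.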

Introduction to AI/ML in Bioinformatics: Classification Models & Evaluation

· 12 min read
Thanh-Giang Tan Nguyen
Founder at G Labs

Machine learning is transforming bioinformatics by automating pattern discovery from biological data. But what problems can it actually solve? This post shows real-world applications of classification models, then builds the simplest possible classifiers to understand how they work and how to evaluate them. This is Part 0—the practical foundation before diving into complex algorithms like KNN.
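The evaluation half of the post boils down to four counts and three ratios. Here is a small sketch computing them by hand; the labels and predictions are illustrative:

```python
# Core classification metrics from scratch: count TP/FP/FN/TN for a
# chosen positive class, then derive precision, recall, F1, accuracy.
def evaluate(y_true, y_pred, positive="disease"):
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(y_true)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

y_true = ["disease", "disease", "healthy", "healthy", "disease"]
y_pred = ["disease", "healthy", "healthy", "disease", "disease"]
m = evaluate(y_true, y_pred)
print(m)  # precision 2/3, recall 2/3, accuracy 3/5
```

Seeing the counts spelled out makes the later trade-offs obvious: precision penalizes false positives, recall penalizes false negatives, and accuracy can hide both when classes are imbalanced.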

Bioinformatics Cost Optimization For Input Using Nextflow (Part 2)

· 18 min read
Thanh-Giang Tan Nguyen
Founder at G Labs

Amazon S3 (Simple Storage Service) is built around the concept of storing files as objects, where each file is identified by a unique key rather than a traditional file system path. While this architecture offers scalability and flexibility for storage, it can present challenges when used as a standard file system, especially in bioinformatics workflows. When running Nextflow with S3 as the input/output backend, there are trade-offs to consider—particularly when dealing with large numbers of small files. In such cases, Nextflow may spend significant time handling downloads and uploads via the AWS CLI v2, which can impact overall workflow performance. In this blog post, we start with optimizing the input download step. Let's explore this in more detail.


Bioinformatics Cost Optimization for Computing Resources Using Nextflow (Part 1)

· 13 min read
Thanh-Giang Tan Nguyen
Founder at G Labs

Many bioinformatics tools provide options to adjust the number of threads or CPU cores, which can reduce execution time with a modest increase in resource cost. But does doubling computational resources always result in processes running twice as fast? In practice, the speed-up is often less than linear, and each tool behaves differently.
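One simple way to reason about sub-linear speed-up is Amdahl's law: if only a fraction p of a tool's runtime parallelizes and the rest stays serial (I/O, single-threaded steps), extra cores help less and less. The p value below is an assumed example, not a measurement from the post:

```python
# Amdahl's law: speedup with n cores when only a fraction p of the
# work is parallelizable. The p = 0.9 figure is an illustrative assumption.
def amdahl_speedup(p, n):
    return 1 / ((1 - p) + p / n)

p = 0.9  # e.g. 90% of an aligner's runtime parallelizes
for n in (1, 2, 4, 8, 16):
    print(f"{n:2d} cores -> {amdahl_speedup(p, n):.2f}x")
# 2 cores give ~1.82x, not 2x; 16 cores give only ~6.4x
```

This is why doubling the CPU allocation rarely halves the bill: past a point, you pay linearly for cores while the serial fraction caps the speed-up.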

The Evolution of Version Control - CI/CD in bioinformatics (Part 2)

· 14 min read
Thanh-Giang Tan Nguyen
Founder at G Labs

Welcome to Part 2 of our series on version control in bioinformatics. In Part 1, we introduced Git fundamentals, branching strategies, and collaborative workflows. In this post, we'll dive into how Continuous Integration and Continuous Deployment (CI/CD) can transform your bioinformatics projects. If these concepts are new to you, don't worry—this guide will walk you through managing your bioinformatics repository to ensure your work is easily reproducible on any machine. Whether your server is wiped or you need to spin up a new virtual machine, you'll be able to quickly rerun your pipeline. With CI/CD, every code update can automatically trigger tests on a small dataset to verify everything works before scaling up, ensuring that new changes don't break your results or workflows.

The Evolution of Version Control - Git's Role in Reproducible Bioinformatics (Part 1)

· 13 min read
Thanh-Giang Tan Nguyen
Founder at G Labs

In Part 1 (this post), we explore the history of Git, its integration with GitHub, and basic hands-on tutorials. Part 2 (coming soon) will cover real-world bioinformatics examples and advanced workflows with best practices.

The hands-on tutorials use practical NGS examples, including quality control with fastqc and multiqc.

Building a Slurm HPC Cluster (Part 3) - Administration and Best Practices

· 13 min read
Thanh-Giang Tan Nguyen
Founder at G Labs

In Part 1 and Part 2, we built a complete Slurm HPC cluster from a single node to a production-ready multi-node system. Now let's learn how to manage, maintain, and secure it effectively.

This final post covers daily administration tasks, troubleshooting, security hardening, and integration with data processing frameworks.

Building a Slurm HPC Cluster (Part 1) - Single Node Setup and Fundamentals

· 8 min read
Thanh-Giang Tan Nguyen
Founder at G Labs

Building a High-Performance Computing (HPC) cluster can seem daunting, but with the right approach, you can create a robust system for managing computational workloads. This is Part 1 of a 3-part series where we'll build a complete Slurm cluster from scratch.

In this first post, we'll cover the fundamentals by setting up a single-node Slurm cluster and understanding the core concepts.