12 posts tagged with "variant-calling"

How can a country build its own 1000 Genomes Project?

· 45 min read
Thanh-Giang Tan Nguyen
Founder at G Labs

Recent national genome initiatives highlight the growing need for population-specific genomic resources. Two recent papers, "VN1K: a genome graph-based and function-driven multi-omics and phenomics resource for the Vietnamese population" and "EGP1K: Whole-Genome Sequencing of 1,024 Egyptians Characterizes Population Structure and Genetic Diversity", make it clear that building a national-scale 1000-genome project is increasingly important for understanding genetic diversity, improving disease research, and enabling precision medicine.

Variant Calling (Part 11): Population-Scale Genotyping Using gVCF and Joint Variant Calling

· 21 min read
Thanh-Giang Tan Nguyen
Founder at G Labs

Population-scale variant calling is a critical step in building genomic population projects. While single-sample variant calling is well established, scaling joint genotyping to thousands of WGS samples introduces challenges in performance, storage, and incremental updates. In this blog, I explore gVCF-based joint variant calling approaches and evaluate scalable solutions using modern open-source tools. I also discuss practical architecture considerations to efficiently construct population-scale genomics projects.

Customizing nf-core Modules: Building a Domain-Specific Variant Calling Library

· 12 min read
Thanh-Giang Tan Nguyen
Founder at G Labs

While nf-core provides a comprehensive, well-maintained library of Nextflow modules for bioinformatics, organizations often need domain-specific variants tailored to their analysis pipelines and computational infrastructure. The gianglabs/nf-modules repository demonstrates how to extend the nf-core framework by integrating both shared nf-core modules and custom organization-specific modules optimized for genomic data processing, particularly variant discovery and annotation workflows. This post explores the core differences between standard nf-core modules and the customized gianglabs implementation, including specialized module variants, domain-focused subworkflows, and adapted CI/CD practices.

GIAB Pilot Study: Establishing High-Confidence Genomic Benchmarks

· 12 min read
Thanh-Giang Tan Nguyen
Founder at G Labs

The Genome in a Bottle (GIAB) Consortium has become the gold standard for variant calling benchmarking in clinical and research genomics. While many bioinformaticians routinely use GIAB reference materials to validate pipelines and assess variant caller performance, fewer understand the rigorous methodological framework underlying these benchmarks. This review examines the foundational 2014 pilot study by Zook et al., which established the technical infrastructure and integration strategies that continue to guide reference material development today.

Variant Calling (Part 10): The Challenges of Structural Variant Calling with Short Reads

· 18 min read
Thanh-Giang Tan Nguyen
Founder at G Labs

Structural variants (SVs) are large-scale genomic alterations (≥50 bp) including deletions, duplications, inversions, and translocations. While short-read sequencing has revolutionized SNP and small indel detection, structural variant calling remains a significant challenge. This blog explores SV calling tools for germline samples with 30X short reads, benchmarks five popular callers (Manta, TIDDIT, Delly, Smoove, CNVnator), and reveals fundamental limitations in both detection tools and evaluation methods. The results show that even the best-performing tool (Manta) achieves only 36.8% precision when evaluated with Truvari, while alternative evaluation methods from 2019 report >80% precision, a gap that highlights the complexity and ambiguity inherent in SV analysis.

Variant Calling (Part 9): Storage Cost Optimization

· 11 min read
Thanh-Giang Tan Nguyen
Founder at G Labs

Whole Genome Sequencing (WGS) projects generate massive amounts of data. While analysis costs are significant, storage costs often become the dominant expense over time. The key challenge: you need to preserve raw data and alignments for potential re-analysis with new tools, but you can't afford unlimited storage. This blog post explores how the CRAM format provides a solution, achieving 45% storage savings compared to BAM while remaining lossless and preserving re-alignment capability. The new version of nf-germline-short-read-variant-calling therefore supports CRAM files, lowering storage cost while keeping re-analysis possible.

Variant Calling (Part 8): Structural Variant Calling Short Read Benchmark

· 8 min read
Thanh-Giang Tan Nguyen
Founder at G Labs

Structural variants (SVs) — deletions, insertions, duplications, inversions, and translocations — are large genomic alterations (typically ≥50 bp) that play a major role in disease but are much harder to detect than SNPs or small indels. In this post, we benchmark Manta, the SV caller integrated in nf-core/sarek, against the GIAB HG002 truth set using Truvari, and explore why short-read SV calling remains a fundamentally difficult problem.

Variant Calling (Part 7): Variant Annotation with VEP and SnpSift: Integrating Functional Prediction and Variant Databases

· 16 min read
Thanh-Giang Tan Nguyen
Founder at G Labs

After calling variants with high accuracy from the previous benchmark, the next step is variant annotation—understanding what each variant tells us about the sample. Variant annotation can be broadly categorized into two approaches: (1) using computational tools to predict the functional effect of variants, and (2) cross-referencing against databases of known variant effects.

Variant Calling (Part 6): Do we really need complex pipelines to achieve high-quality variant calling?

· 11 min read
Thanh-Giang Tan Nguyen
Founder at G Labs

In many bioinformatics workflows, pipelines keep getting more complex: more preprocessing steps, more tools, more layers of abstraction. But sometimes a simple question is worth asking: do we actually need all of that? I asked that question while benchmarking germline variant calling on the HG002 sample against the Genome in a Bottle (GIAB) truth set. Surprisingly, a simpler pipeline produced results very similar to those of nf-core/sarek, achieving >99% accuracy for SNPs and INDELs on HG002.

Variant Calling (Part 5): Benchmarking Germline Variant Calling with nf-core/sarek

· 13 min read
Thanh-Giang Tan Nguyen
Founder at G Labs

This series is about variant calling, but the most important thing is not how advanced your workflow is or how modern its framework is. What matters is a scoring system that shows the workflow achieves high-quality results against gold-standard criteria. In this post, I therefore use the Genome in a Bottle (GIAB) samples, whose truth-set variants are curated. With nf-core/sarek, the results are consistent with those published in its paper, scoring more than 99% for SNP/INDEL variants.

Variant Calling (Part 3): Production Scale HPC Deployment and Performance Optimization

· 20 min read
Thanh-Giang Tan Nguyen
Founder at G Labs

In Part 1, we built a solid bash baseline. In Part 2, we migrated to Nextflow with MD5 validation. Now it's time to deploy on HPC clusters with SLURM and optimize for production scale: configure executors for small clusters, tune resources per tool, replace bottleneck steps with faster alternatives (fastp + Spark-GATK), and demonstrate scaling from 1 to 100 samples. This practical guide will help you run your variant calling pipeline efficiently on real HPC infrastructure.

Variant Calling (Part 1): Building a Reproducible GATK Variant Calling Bash Workflow with Pixi

· 19 min read
Thanh-Giang Tan Nguyen
Founder at G Labs

This blog is designed as a practical starting point for building bioinformatics workflows focused on germline variant calling. You'll begin with a straightforward, standard approach using bash and reproducible environments. In future posts, we'll explore how to transition to best-practice workflow management with Nextflow, allowing for further optimization, customization, and integration of additional tools to enhance workflow quality.