Skip to main content

6 posts tagged with "variant-calling"

View All Tags

Variant Calling (Part 8): Structural Variant Calling Short Read Benchmark

· 8 min read
Thanh-Giang Tan Nguyen
Founder at RIVER

Structural variants (SVs) — deletions, insertions, duplications, inversions, and translocations — are large genomic alterations (typically ≥50 bp) that play a major role in disease but are much harder to detect than SNPs or small indels. In this post, we benchmark Manta, the SV caller integrated in nf-core/sarek, against the GIAB HG002 truth set using Truvari, and explore why short-read SV calling remains a fundamentally difficult problem.

Variant Calling (Part 7): Variant Annotation with VEP and SnpSift: Integrating Functional Prediction and Variant Databases

· 16 min read
Thanh-Giang Tan Nguyen
Founder at RIVER

After calling variants with high accuracy from the previous benchmark, the next step is variant annotation—understanding what each variant tells us about the sample. Variant annotation can be broadly categorized into two approaches: (1) using computational tools to predict the functional effect of variants, and (2) cross-referencing against databases of known variant effects.

Variant Calling (Part 6): Do we really need complex pipelines to achieve high-quality variant calling?

· 11 min read
Thanh-Giang Tan Nguyen
Founder at RIVER

In many bioinformatics workflows, pipelines keep getting more complex — more preprocessing steps, more tools, more layers of abstraction. But sometimes a simple question is worth asking: Do we actually need all of that? While benchmarking germline variant calling on the HG002 sample from the Genome In A Bottle (GIAB) truth set. Surprisingly, the results were very similar to the ones produced by nf-core/sarek, achieving >99% accuracy for SNPs and INDELs on HG002.

Variant Calling (Part 5): Benchmarking Germline Variant Calling with nf-core/sarek

· 13 min read
Thanh-Giang Tan Nguyen
Founder at RIVER

We have a series for variant calling, however, one of the most important thing is not about how your workflow is advanced with modern framework, it is about the scoring system that show your workflow achieve high quality score with gold standard criteria. Therefore, in this series, I used the Genome In A Bottle (GIAB) with the truth set variants are curated. With nf-core/sarek-it shows the consistency with the results has been published in its paper with more than 99% in SNP/INDEL variants.

Variant Calling (Part 3): Production Scale HPC Deployment and Performance Optimization

· 20 min read
Thanh-Giang Tan Nguyen
Founder at RIVER

In Part 1, we built a solid bash baseline. In Part 2, we migrated to Nextflow with MD5 validation. Now it's time to deploy on HPC clusters with SLURM and optimize for production scale: configure executors for small clusters, tune resources per tool, replace bottleneck steps with faster alternatives (fastp + Spark-GATK), and demonstrate scaling from 1 to 100 samples. This practical guide will help you run your variant calling pipeline efficiently on real HPC infrastructure.

Variant Calling (Part 1): Building a Reproducible GATK Variant Calling Bash Workflow with Pixi

· 19 min read
Thanh-Giang Tan Nguyen
Founder at RIVER

This blog is designed as a practical starting point for building bioinformatics workflows focused on germline variant calling. You'll begin with a straightforward, standard approach using bash and reproducible environments. In future posts, we'll explore how to transition to best-practice workflow management with Nextflow, allowing for further optimization, customization, and integration of additional tools to enhance workflow quality.