12 posts tagged with "nextflow"

Variant Calling (Part 8): Structural Variant Calling Short Read Benchmark

· 8 min read
Thanh-Giang Tan Nguyen
Founder at RIVER

Structural variants (SVs) — deletions, insertions, duplications, inversions, and translocations — are large genomic alterations (typically ≥50 bp) that play a major role in disease but are much harder to detect than SNPs or small indels. In this post, we benchmark Manta, the SV caller integrated in nf-core/sarek, against the GIAB HG002 truth set using Truvari, and explore why short-read SV calling remains a fundamentally difficult problem.

Variant Calling (Part 7): Variant Annotation with VEP and SnpSift: Integrating Functional Prediction and Variant Databases

· 16 min read
Thanh-Giang Tan Nguyen
Founder at RIVER

After calling variants with high accuracy from the previous benchmark, the next step is variant annotation—understanding what each variant tells us about the sample. Variant annotation can be broadly categorized into two approaches: (1) using computational tools to predict the functional effect of variants, and (2) cross-referencing against databases of known variant effects.

Variant Calling (Part 6): Do we really need complex pipelines to achieve high-quality variant calling?

· 11 min read
Thanh-Giang Tan Nguyen
Founder at RIVER

In many bioinformatics workflows, pipelines keep getting more complex: more preprocessing steps, more tools, more layers of abstraction. But sometimes a simple question is worth asking: do we actually need all of that? When benchmarking germline variant calling on the HG002 sample from the Genome In A Bottle (GIAB) truth set, a simpler pipeline gave results that were, surprisingly, very similar to those produced by nf-core/sarek, achieving >99% accuracy for SNPs and INDELs on HG002.

Variant Calling (Part 5): Benchmarking Germline Variant Calling with nf-core/sarek

· 13 min read
Thanh-Giang Tan Nguyen
Founder at RIVER

This series covers variant calling, but the most important thing is not how advanced your workflow is or how modern its framework. It is the scoring system that shows whether your workflow achieves high-quality results against gold-standard criteria. Therefore, in this post, I use the Genome In A Bottle (GIAB) benchmark, whose truth-set variants are curated. nf-core/sarek produces results consistent with those published in its paper, scoring more than 99% on SNP and INDEL variants.
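
As a rough illustration of what such a scoring system computes, here is a minimal sketch of precision, recall, and F1 from true-positive, false-positive, and false-negative counts. The function name `benchmark_metrics` and the counts in the usage line are hypothetical, chosen only for illustration; they are not results from the benchmark.

```python
from typing import Tuple


def benchmark_metrics(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return (precision, recall, f1) for a call set compared to a truth set.

    tp: calls that match a truth-set variant
    fp: calls with no matching truth-set variant
    fn: truth-set variants that were not called
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical counts, not taken from any real benchmark run.
    p, r, f1 = benchmark_metrics(tp=990, fp=5, fn=10)
    print(f"precision={p:.4f} recall={r:.4f} f1={f1:.4f}")
```

A single "accuracy" figure usually hides the trade-off between missed variants (recall) and spurious calls (precision), which is why benchmarks tend to report both alongside F1.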

Variant Calling (Part 3): Production Scale HPC Deployment and Performance Optimization

· 20 min read
Thanh-Giang Tan Nguyen
Founder at RIVER

In Part 1, we built a solid bash baseline. In Part 2, we migrated to Nextflow with MD5 validation. Now it's time to deploy on HPC clusters with SLURM and optimize for production scale: configure executors for small clusters, tune resources per tool, replace bottleneck steps with faster alternatives (fastp + Spark-GATK), and demonstrate scaling from 1 to 100 samples. This practical guide will help you run your variant calling pipeline efficiently on real HPC infrastructure.

Variant Calling (Part 2): From Bash to Nextflow: GATK Best Practice With Nextflow

· 27 min read
Thanh-Giang Tan Nguyen
Founder at RIVER

In Part 1, we built a complete 10-step GATK variant calling pipeline in bash—perfect for academic research and 1-10 samples. But what happens when you need to scale to 100+ samples? This is where Nextflow becomes essential.

📁 Repository: All code from this tutorial is organized in the nf-germline-short-read-variant-calling repository. The structure follows best practices with separate directories for bash (bash-gatk) and Nextflow (nextflow-gatk) implementations.

Setting Up a Local Nextflow Training Environment with Code-Server and HPC

· 9 min read
Thanh-Giang Tan Nguyen
Founder at RIVER

Setting up a robust development environment for Nextflow training across local and HPC systems requires a unified solution. Code-server provides a browser-based VS Code interface accessible from any machine, making it perfect for teams collaborating on Nextflow workflows. This guide walks you through configuring a complete Nextflow training environment with code-server, Singularity containers, and Pixi-managed tools.

For a comprehensive introduction to Pixi and package management, see our Pixi new-conda era.

How to Migrate from In-House Pipelines to Enterprise-Level Workflows: A Proven 3-Step Validation Framework

· 18 min read
Thanh-Giang Tan Nguyen
Founder at RIVER

Whether your lab uses bash scripts, Python workflows, Snakemake pipelines, or custom solutions—your in-house pipeline works fine locally. It's been running for years. But as your research scales, you face a hard truth: in-house pipelines don't scale, aren't reproducible across teams, and require constant manual fixes.

Containers in Bioinformatics: Community Tooling and Efficient Docker Building

· 21 min read
Thanh-Giang Tan Nguyen
Founder at RIVER

Docker containers are revolutionizing bioinformatics by automating reproducibility and portability across platforms. But what problems can they actually solve? This post shows real-world applications of containers in bioinformatics workflows, then guides you through the simplest possible ways to use, build and debug them.

Bioinformatics Cost Optimization For Input Using Nextflow (Part 2)

· 18 min read
Thanh-Giang Tan Nguyen
Founder at RIVER

Amazon S3 (Simple Storage Service) is built around the concept of storing files as objects, where each file is identified by a unique key rather than a traditional file system path. While this architecture offers scalability and flexibility for storage, it can present challenges when used as a standard file system, especially in bioinformatics workflows. When running Nextflow with S3 as the input/output backend, there are trade-offs to consider, particularly when dealing with large numbers of small files. In such cases, Nextflow may spend significant time handling downloads and uploads via the AWS CLI v2, which can impact overall workflow performance. In this blog post, we start with downloading inputs. Let's explore this in more detail.

Bioinformatics Cost Optimization for Computing Resources Using Nextflow (Part 1)

· 13 min read
Thanh-Giang Tan Nguyen
Founder at RIVER

Many bioinformatics tools provide options to adjust the number of threads or CPU cores, which can reduce execution time with a modest increase in resource cost. But does doubling computational resources always result in processes running twice as fast? In practice, the speed-up is often less than linear, and each tool behaves differently.
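
One common way to reason about this less-than-linear speed-up is Amdahl's law, which is brought in here as an illustration rather than as the method used in the post. The sketch below assumes a tool whose runtime splits into a serial part and a perfectly parallel part; the `parallel_fraction` value of 0.8 is a hypothetical example, and real tools have to be measured.

```python
def amdahl_speedup(parallel_fraction: float, n_cores: int) -> float:
    """Theoretical speed-up on n_cores under Amdahl's law.

    parallel_fraction: share of the single-core runtime that can be
    parallelized (between 0.0 and 1.0).
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if n_cores < 1:
        raise ValueError("n_cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_cores)


if __name__ == "__main__":
    # Hypothetical tool where 80% of the runtime is parallelizable.
    for cores in (1, 2, 4, 8, 16):
        speedup = amdahl_speedup(0.8, cores)
        # Efficiency: fraction of the ideal linear speed-up actually achieved.
        efficiency = speedup / cores
        print(f"{cores:>2} cores: speed-up {speedup:.2f}x, efficiency {efficiency:.0%}")
```

Under this model, doubling the cores never doubles the speed unless the whole job is parallel, and the efficiency per core drops as cores are added, which is where the cost side of the trade-off comes from.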