Variant Calling (Part 6): Do we really need complex pipelines to achieve high-quality variant calling?
In many bioinformatics workflows, pipelines keep getting more complex: more preprocessing steps, more tools, more layers of abstraction. But sometimes a simple question is worth asking: do we actually need all of that? To find out, I benchmarked a deliberately simple germline variant calling workflow on the HG002 sample from the Genome In A Bottle (GIAB) truth set. Surprisingly, the results were very close to those produced by nf-core/sarek, achieving >99% accuracy for SNPs and INDELs on HG002.
- The environment setup and scripts are uploaded to nf-germline-short-read-variant-calling
- Clone the repository at tag 0.4.0 to reproduce the contents of this post
Clone the repository
git clone git@github.com:nttg8100/nf-germline-short-read-variant-calling.git -b 0.4.0
cd nf-germline-short-read-variant-calling
1. About The Workflow
This workflow has been rebuilt step by step, from a from-scratch bash workflow to a modern Nextflow workflow. What's new?
- It was refactored and reorganized into subworkflows and workflows
- The processing subworkflow was split into alignment and preprocess_alignment
├── LICENSE
├── Makefile
├── Readme.md
├── assets
│ ├── samplesheet.csv
├── conf
│ ├── base.config
│ ├── igenomes.config
│ └── modules.config
├── main.nf
├── modules
│ └── local
│ ├── bcftools
│ │ ├── query
│ │ │ └── main.nf
│ │ └── stats
│ │ └── main.nf
│ ├── bedtools
│ │ └── genomecov
│ │ └── main.nf
│ ├── bwa
│ │ ├── index
│ │ │ └── main.nf
│ │ └── mem2
│ │ └── main.nf
│ ├── deepvariant
│ │ └── main.nf
│ ├── fastp
│ │ └── trim
│ │ └── main.nf
│ ├── freebayes
│ │ └── main.nf
│ ├── gatk
│ │ ├── collectmetrics
│ │ │ └── main.nf
│ │ ├── genotypegvcfs
│ │ │ └── main.nf
│ │ ├── haplotypecaller
│ │ │ └── main.nf
│ │ ├── mergevcfs
│ │ │ └── main.nf
│ │ ├── selectvariants_indel
│ │ │ └── main.nf
│ │ ├── selectvariants_snp
│ │ │ └── main.nf
│ │ ├── variantfiltration_indel
│ │ │ └── main.nf
│ │ └── variantfiltration_snp
│ │ └── main.nf
│ ├── gatkspark
│ │ ├── applybqsr
│ │ │ └── main.nf
│ │ ├── baserecalibrator
│ │ │ └── main.nf
│ │ └── markduplicates
│ │ └── main.nf
│ ├── sambamba
│ │ └── markdup
│ │ └── main.nf
│ ├── samtools
│ │ ├── merge
│ │ │ └── main.nf
│ │ └── sort
│ │ └── main.nf
│ └── snpeff
│ └── annotate
│ └── main.nf
├── nextflow.config
├── nf-test.config
├── pixi.lock
├── pixi.toml
├── subworkflows
│ └── local
│ ├── alignment
│ │ └── main.nf
│ ├── annotation
│ │ └── main.nf
│ ├── preprocessing_alignment
│ │ └── main.nf
│ └── variant_calling
│ └── main.nf
├── tests
│ ├── default.nf.test
│ ├── default.nf.test.snap
│ └── nextflow.config
└── workflows
└── main.nf
1.1 Config
To make it easy to fetch the correct reference files for a given genome, I adapted the public reference data configuration from nf-core/sarek:
params {
genomes = [
'sarek_test': [
fasta: 'https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/homo_sapiens/genome/genome.fasta',
fasta_fai: 'https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/homo_sapiens/genome/genome.fasta.fai',
dict: 'https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/homo_sapiens/genome/genome.dict',
index_bwa2_reference: true,
bwa2_index: 'null',
dbsnp: 'https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/homo_sapiens/genome/vcf/dbsnp_146.hg38.vcf.gz',
dbsnp_tbi: 'https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/homo_sapiens/genome/vcf/dbsnp_146.hg38.vcf.gz.tbi',
known_indels: 'https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/homo_sapiens/genome/vcf/mills_and_1000G.indels.vcf.gz',
known_indels_tbi: 'https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/homo_sapiens/genome/vcf/mills_and_1000G.indels.vcf.gz.tbi'
],
'GRCh38': [
fasta: "s3://ngi-igenomes/igenomes/Homo_sapiens/GATK/GRCh38/Sequence/WholeGenomeFasta/Homo_sapiens_assembly38.fasta",
fasta_fai: "s3://ngi-igenomes/igenomes/Homo_sapiens/GATK/GRCh38/Sequence/WholeGenomeFasta/Homo_sapiens_assembly38.fasta.fai",
dict: "s3://ngi-igenomes/igenomes/Homo_sapiens/GATK/GRCh38/Sequence/WholeGenomeFasta/Homo_sapiens_assembly38.dict",
index_bwa2_reference: false,
bwa2_index: "s3://ngi-igenomes/igenomes/Homo_sapiens/GATK/GRCh38/Sequence/BWAmem2Index/",
dbsnp: "s3://ngi-igenomes/igenomes/Homo_sapiens/GATK/GRCh38/Annotation/GATKBundle/dbsnp_146.hg38.vcf.gz",
dbsnp_tbi: "s3://ngi-igenomes/igenomes/Homo_sapiens/GATK/GRCh38/Annotation/GATKBundle/dbsnp_146.hg38.vcf.gz.tbi",
known_indels: "s3://ngi-igenomes/igenomes/Homo_sapiens/GATK/GRCh38/Annotation/GATKBundle/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz",
known_indels_tbi: "s3://ngi-igenomes/igenomes/Homo_sapiens/GATK/GRCh38/Annotation/GATKBundle/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi",
known_indels_beta: "s3://ngi-igenomes/igenomes/Homo_sapiens/GATK/GRCh38/Annotation/GATKBundle/beta/Homo_sapiens_assembly38.known_indels.vcf.gz",
known_indels_beta_tbi: "s3://ngi-igenomes/igenomes/Homo_sapiens/GATK/GRCh38/Annotation/GATKBundle/beta/Homo_sapiens_assembly38.known_indels.vcf.gz.tbi"
]
]
}
To run the workflow with a given genome, specify it via the --genome parameter; the pipeline will then fetch the corresponding reference data defined in igenomes.config inside the conf folder:
nextflow run main.nf \
-profile docker\
--genome GRCh38 \
--input input.csv \
--output variant_calling
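The --input samplesheet is a plain CSV. As a hypothetical minimal example (the column names sample, lane, fastq_1, and fastq_2 mirror the fields read by the pipeline's samplesheet parsing in main.nf; the file names are placeholders):

```shell
# Hypothetical minimal samplesheet: two lanes of paired-end reads for one
# sample. lane and fastq_2 are optional (lane defaults to L001).
cat > input.csv <<'EOF'
sample,lane,fastq_1,fastq_2
HG002,L001,HG002_L001_R1.fastq.gz,HG002_L001_R2.fastq.gz
HG002,L002,HG002_L002_R1.fastq.gz,HG002_L002_R2.fastq.gz
EOF
head -n 1 input.csv
```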
1.2 Subworkflow
Similar to modules, a subworkflow can be imported into the main workflow. It is a group of modules that share a related function; grouping them makes the logic reusable across pipelines and easier to monitor in the terminal.
1.2.1 Example
The subworkflow below is the alignment subworkflow:
/*
========================================================================================
ALIGNMENT SUBWORKFLOW
========================================================================================
QC, Trimming, Alignment
========================================================================================
*/
include { FASTP } from '../../../modules/local/fastp/trim/main'
include { BWAMEM2_INDEX } from '../../../modules/local/bwa/index/main'
include { BWA_MEM2 } from '../../../modules/local/bwa/mem2/main'
include { SAMTOOLS_SORT } from '../../../modules/local/samtools/sort/main'
include { SAMTOOLS_MERGE } from '../../../modules/local/samtools/merge/main'
workflow ALIGNMENT {
take:
reads_ch // channel: [ val(meta), [ path(read1), path(read2) ] ]
ref_fasta // path: reference FASTA
ref_fai // path: reference FAI
ref_dict // path: reference dict
bwa2_index // channel: Optional BWA index files
index_bwa2_reference // channel: Optional BWA index files
main:
ch_versions = channel.empty()
// Check if BWA index needs to be generated
// If empty, generate it; otherwise use provided index
if (index_bwa2_reference) {
BWAMEM2_INDEX(ref_fasta)
ch_versions = ch_versions.mix(BWAMEM2_INDEX.out.versions)
bwa2_index_ch = BWAMEM2_INDEX.out.index
}
else {
bwa2_index_ch = channel.fromPath(bwa2_index)
}
//
// STEP 1: Adapter Trimming, Quality Filtering, and QC with fastp
//
FASTP(
reads_ch
)
ch_versions = ch_versions.mix(FASTP.out.versions)
//
// STEP 2: Read Alignment with BWA-MEM2
//
BWA_MEM2(
FASTP.out.reads,
ref_fasta,
ref_fai,
ref_dict,
bwa2_index_ch,
)
ch_versions = ch_versions.mix(BWA_MEM2.out.versions)
//
// STEP 3: Sort BAM file
//
SAMTOOLS_SORT(
BWA_MEM2.out.bam
)
ch_versions = ch_versions.mix(SAMTOOLS_SORT.out.versions)
//
// STEP 4: Merge lanes per sample (if multiple lanes exist)
//
SAMTOOLS_SORT.out.bam
.map { meta, bam ->
def new_meta = [:]
new_meta.id = meta.sample
new_meta.sample = meta.sample
return [new_meta.id, new_meta, bam]
}
.groupTuple()
.map { _sample_id, metas, bams ->
return [metas[0], bams]
}
.set { ch_bams_to_merge }
SAMTOOLS_MERGE(
ch_bams_to_merge
)
ch_versions = ch_versions.mix(SAMTOOLS_MERGE.out.versions)
emit:
bam = SAMTOOLS_MERGE.out.bam // channel: [ val(meta), path(bam) ]
bai = SAMTOOLS_MERGE.out.bai // channel: [ val(meta), path(bai) ]
versions = ch_versions // channel: path(versions.yml)
}
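The STEP 4 channel logic re-keys each sorted BAM by its sample id and collects all lane-level BAMs per sample before merging. As a rough, Nextflow-free sketch of the same grouping (sample and file names are made up):

```shell
# Toy illustration of the STEP 4 grouping: collect all lane-level BAMs
# under their sample id (mirrors groupTuple keyed on meta.sample).
merged=$(printf '%s\n' \
    'HG002 HG002_L001.sorted.bam' \
    'HG002 HG002_L002.sorted.bam' \
    'HG003 HG003_L001.sorted.bam' \
  | awk '{bams[$1] = bams[$1] " " $2} END {for (s in bams) print s ":" bams[s]}' \
  | sort)
echo "$merged"
```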
Subworkflow vs module, what's new?
- Inputs are declared after take:
- Outputs are declared after emit:
1.2.2 How To Import Workflow
Importing a subworkflow uses the same include statement as modules:
include { ALIGNMENT } from '../subworkflows/local/alignment/main'
include { PREPROCESSING } from '../subworkflows/local/preprocessing_alignment/main'
include { VARIANT_CALLING } from '../subworkflows/local/variant_calling/main'
include { ANNOTATION } from '../subworkflows/local/annotation/main'
/*
========================================================================================
NAMED WORKFLOW FOR PIPELINE
========================================================================================
*/
workflow GERMLINE_VARIANT_CALLING {
...
}
In addition, a subworkflow can be included inside another workflow. Here, the 4 subworkflows above are included in GERMLINE_VARIANT_CALLING, which is in turn included in the top-level main.nf:
#!/usr/bin/env nextflow
nextflow.enable.dsl = 2
include { GERMLINE_VARIANT_CALLING } from './workflows/main.nf'
/*
========================================================================================
RUN MAIN WORKFLOW
========================================================================================
*/
workflow {
//
// Create input channel from samplesheet or input parameters
//
// Read from samplesheet CSV
channel.fromPath(params.input)
.splitCsv(header: true)
.map { row ->
def meta = [:]
meta.id = row.sample
meta.sample = row.sample
// Support both new format (with lane) and old format (without lane)
meta.lane = row.lane ?: "L001"
meta.read_group = "${row.sample}_${meta.lane}"
def reads = []
reads.add(file(row.fastq_1))
if (row.fastq_2) {
reads.add(file(row.fastq_2))
}
return [meta, reads]
}
.set { ch_input }
GERMLINE_VARIANT_CALLING(
ch_input
)
}
1.3 Versus nf-core/sarek
This variant calling blog series has built the pipeline up from the old ways to the current modern technology. According to the documentation of the variant calling tools below, the GATK best-practice preprocessing of aligned reads, which requires extensive computing resources, may not be necessary:
- Freebayes: https://github.com/freebayes/freebayes
- Deepvariant: https://github.com/google/deepvariant
As a result, I refactored the pipeline into the 2 workflows recommended by these variant calling tools:
- Freebayes: fastp > bwa-mem2 > sambamba > freebayes
- Deepvariant: fastp > bwa-mem2 > deepvariant
The preprocess_alignment subworkflow was updated with a preprocessor input: a string value channel, set to gatk or sambamba, that selects how the data is preprocessed
#!/usr/bin/env nextflow
/*
========================================================================================
PREPROCESSING SUBWORKFLOW
========================================================================================
Deduplication, BQSR, Quality Metrics
========================================================================================
*/
include { GATKSPARK_MARKDUPLICATES } from '../../../modules/local/gatkspark/markduplicates/main'
include { GATKSPARK_BASERECALIBRATOR } from '../../../modules/local/gatkspark/baserecalibrator/main'
include { GATKSPARK_APPLYBQSR } from '../../../modules/local/gatkspark/applybqsr/main'
include { GATK_COLLECTMETRICS } from '../../../modules/local/gatk/collectmetrics/main'
include { SAMBAMBA_MARKDUP } from '../../../modules/local/sambamba/markdup/main'
workflow PREPROCESSING {
take:
preprocessor // value: gatk or sambamba
aligned_bams // channel: [ val(meta), path(bam) ]
ref_fasta // path: reference FASTA
ref_fai // path: reference FAI
ref_dict // path: reference dict
dbsnp_vcf // path: dbSNP VCF
dbsnp_tbi // path: dbSNP TBI
known_indels_vcf // path: known indels VCF
known_indels_tbi // path: known indels TBI
main:
ch_versions = channel.empty()
if (preprocessor=="gatk"){
<<GATK preprocessing steps>>
} else if (preprocessor=="sambamba"){
<<sambamba>>
}
else {
error "Unsupported preprocessor: ${preprocessor}"
}
emit:
bam = ch_processed_bam // channel: [ val(meta), path(bam) ]
bai = ch_processed_bai // channel: [ val(meta), path(bai) ]
alignment_summary = ch_alignment_summary // channel: [ val(meta), path(metrics) ]
insert_metrics = ch_alignment_insert_metrics // channel: [ val(meta), path(metrics) ]
versions = ch_versions // channel: path(versions.yml)
}
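A sketch of the dispatch above as a shell case statement; the chains are inferred from the modules this subworkflow includes (GATK: mark duplicates, then BQSR; sambamba: mark duplicates only):

```shell
# Hedged sketch of the preprocessor dispatch: which module chain runs for
# each value, based on the modules included in this subworkflow.
preprocess_chain() {
  case "$1" in
    gatk)     echo "markduplicates -> baserecalibrator -> applybqsr" ;;
    sambamba) echo "markdup" ;;
    *)        echo "Unsupported preprocessor: $1" >&2; return 1 ;;
  esac
}
preprocess_chain gatk
preprocess_chain sambamba
```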
I allocated more computing resources so the pipeline runs faster, using the config below (customized_labels.config):
// HPC internet is quite slow, so we increase the pull timeout for singularity containers
singularity.pullTimeout="100.h"
process {
cpus = { 2 * task.attempt }
memory = { 4.GB * task.attempt }
time = { 4.h * task.attempt }
// memory errors which should be retried. otherwise error out
errorStrategy = { task.exitStatus in ((130..145) + 104) ? 'retry' : 'finish' }
maxRetries = 1
maxErrors = '-1'
// Process-specific resource requirements
// See https://www.nextflow.io/docs/latest/config.html#config-process-selectors
withLabel:error_ignore {
errorStrategy = 'ignore'
}
withLabel:error_retry {
errorStrategy = 'retry'
maxRetries = 2
}
withLabel:process_single {
cpus = { 3 }
memory = { 6.GB * task.attempt }
time = { 4.h * task.attempt }
}
withLabel:process_low {
cpus = { 6 * task.attempt }
memory = { 12.GB * task.attempt }
time = { 4.h * task.attempt }
}
withLabel:process_medium {
cpus = { 64 * task.attempt }
memory = { 120.GB * task.attempt }
time = { 8.h * task.attempt }
}
withLabel:process_high {
cpus = { 64 * task.attempt }
memory = { 120.GB * task.attempt }
time = { 16.h * task.attempt }
}
withLabel:process_long {
time = { 20.h * task.attempt }
}
withLabel:process_high_memory {
memory = { 200.GB * task.attempt }
}
}
process.executor="slurm"
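The errorStrategy closure retries a task whose exit status suggests a resource problem (130 to 145, e.g. 137 for a killed, typically out-of-memory process, plus 104) and otherwise lets the run finish; combined with the task.attempt multipliers, a retried task automatically requests more cpus, memory, and time. The predicate can be sketched as:

```shell
# Sketch of the errorStrategy closure above: retry on exit codes
# 130-145 or 104, otherwise finish the run.
decide() {
  if { [ "$1" -ge 130 ] && [ "$1" -le 145 ]; } || [ "$1" -eq 104 ]; then
    echo retry
  else
    echo finish
  fi
}
decide 137   # killed task (often out-of-memory) -> retry
decide 1     # ordinary failure -> finish
```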
The GERMLINE_VARIANT_CALLING workflow was updated with a skip_preprocessing parameter; when set, the raw aligned reads are fed directly into the variant caller
if(!params.skip_preprocessing){
PREPROCESSING(
params.preprocessor,
ALIGNMENT.out.bam,
ref_fasta,
ref_fai,
ref_dict,
dbsnp_vcf,
dbsnp_tbi,
known_indels_vcf,
known_indels_tbi,
)
ch_versions = ch_versions.mix(PREPROCESSING.out.versions)
ch_final_bam = PREPROCESSING.out.bam
ch_final_bai = PREPROCESSING.out.bai
}
else {
ch_final_bam = ALIGNMENT.out.bam
ch_final_bai = ALIGNMENT.out.bai
}
2. Running The 2 Simplified Workflows
Clone the repo; the input files are available at tag 0.4.0:
# cloning repo
git clone https://github.com/nttg8100/nf-germline-short-read-variant-calling.git -b 0.4.0
cd benchmark/pipelines/nf-germline-short-read-variant-calling
2.1 Freebayes
The workflow procedure for Freebayes: fastp > bwa-mem2 > sambamba > freebayes
# Freebayes recommendation:
# 1. fastp: quality control
# 2. bwa-mem2: read alignment
# 3. sambamba: remove PCR duplicates
# 4. freebayes: variant calling
nextflow run nf-germline-short-read-variant-calling \
-profile docker -c customized_labels.config \
--genome GRCh38 \
--input input.csv \
--variant_caller "freebayes" \
--preprocessor "sambamba" \
--output variant_calling
2.2 Deepvariant
The workflow procedure for Deepvariant: fastp > bwa-mem2 > deepvariant
# Deepvariant recommendation:
# 1. fastp: quality control
# 2. bwa-mem2: read alignment
# 3. deepvariant: variant calling
nextflow run nf-germline-short-read-variant-calling \
-profile docker -c customized_labels.config \
--genome GRCh38 \
--input input.csv \
--variant_caller "deepvariant" \
--skip_preprocessing \
--output variant_calling -resume # Run in the same working directory so the cached alignment results are reused
3. Evaluation
For evaluation, I reuse the same HG002 setup from GIAB as in the previous post on benchmarking germline variant calling with nf-core/sarek.
Benchmarking against GIAB HG002 shows that a minimal workflow (FASTP → BWA-MEM2 → DeepVariant) achieves essentially identical SNP and INDEL accuracy compared to the full nf-core/sarek pipeline, suggesting that for single-sample germline variant calling, pipeline complexity does not necessarily translate into better performance.
| Run | Type | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | Recall | Precision | Frac_NA | F1_Score | TRUTH.TiTv | QUERY.TiTv | TRUTH.het/hom | QUERY.het/hom |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| deepvariant_HG002 | INDEL | ALL | 525469 | 519079 | 6390 | 892719 | 3369 | 348852 | 2473 | 421 | 0.987839 | 0.993805 | 0.390775 | 0.990813 | | | 1.528276 | 1.892501 |
| deepvariant_HG002 | INDEL | PASS | 525469 | 519079 | 6390 | 892719 | 3369 | 348852 | 2473 | 421 | 0.987839 | 0.993805 | 0.390775 | 0.990813 | | | 1.528276 | 1.892501 |
| deepvariant_HG002 | SNP | ALL | 3365127 | 3344672 | 20455 | 3840687 | 6131 | 488147 | 1391 | 617 | 0.993921 | 0.998171 | 0.127099 | 0.996042 | 2.100128 | 1.986419 | 1.581196 | 1.498072 |
| deepvariant_HG002 | SNP | PASS | 3365127 | 3344672 | 20455 | 3840687 | 6131 | 488147 | 1391 | 617 | 0.993921 | 0.998171 | 0.127099 | 0.996042 | 2.100128 | 1.986419 | 1.581196 | 1.498072 |
| freebayes_HG002 | INDEL | ALL | 525469 | 503744 | 21725 | 883066 | 25587 | 336867 | 14646 | 1947 | 0.958656 | 0.953154 | 0.381474 | 0.955897 | | | 1.528276 | 1.856978 |
| freebayes_HG002 | INDEL | PASS | 525469 | 503744 | 21725 | 883066 | 25587 | 336867 | 14646 | 1947 | 0.958656 | 0.953154 | 0.381474 | 0.955897 | | | 1.528276 | 1.856978 |
| freebayes_HG002 | SNP | ALL | 3365127 | 3344366 | 20761 | 4404601 | 129353 | 927169 | 3829 | 16162 | 0.993831 | 0.962802 | 0.210500 | 0.978070 | 2.100128 | 1.823321 | 1.581196 | 1.875394 |
| freebayes_HG002 | SNP | PASS | 3365127 | 3344366 | 20761 | 4404601 | 129353 | 927169 | 3829 | 16162 | 0.993831 | 0.962802 | 0.210500 | 0.978070 | 2.100128 | 1.823321 | 1.581196 | 1.875394 |
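As a sanity check, the headline metrics can be recomputed from the raw counts: hap.py takes Recall = TRUTH.TP / (TRUTH.TP + TRUTH.FN), counts query true positives as QUERY.TOTAL - QUERY.UNK - QUERY.FP, takes Precision = QUERY.TP / (QUERY.TP + QUERY.FP), and reports F1 as the harmonic mean of the two. Using the DeepVariant SNP row above:

```shell
# Recompute the hap.py summary metrics from the raw counts in the table
# (DeepVariant SNP row).
metrics=$(awk 'BEGIN {
  tp = 3344672; fn = 20455                  # TRUTH.TP, TRUTH.FN
  qtot = 3840687; fp = 6131; unk = 488147   # QUERY.TOTAL, QUERY.FP, QUERY.UNK
  recall = tp / (tp + fn)
  qtp = qtot - unk - fp                     # query true positives
  precision = qtp / (qtp + fp)
  f1 = 2 * precision * recall / (precision + recall)
  printf "recall=%.6f precision=%.6f f1=%.6f\n", recall, precision, f1
}')
echo "$metrics"
```

The printed values reproduce the Recall, Precision, and F1_Score columns of that row.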
3.1 DeepVariant
Only minor performance differences for the simple workflow with DeepVariant:
| Metric | Simple Workflow | nf-core/sarek | Difference |
|---|---|---|---|
| SNP Recall | 0.993921 | 0.993885 | ~0.00004 |
| SNP Precision | 0.998171 | 0.998421 | ~0.00025 |
| SNP F1 Score | 0.996042 | 0.996148 | ~0.00011 |
| INDEL Recall | 0.987839 | 0.989684 | ~0.00185 |
| INDEL Precision | 0.993805 | 0.994586 | ~0.00078 |
| INDEL F1 Score | 0.990813 | 0.992129 | ~0.00132 |
3.2 FreeBayes
Only minor performance differences for the simple workflow with FreeBayes:
| Metric | Simple Workflow | nf-core/sarek | Difference |
|---|---|---|---|
| SNP Recall | 0.993831 | 0.993862 | ~0.00003 |
| SNP Precision | 0.962802 | 0.962893 | ~0.00009 |
| SNP F1 Score | 0.978070 | 0.978133 | ~0.00006 |
| INDEL Recall | 0.958656 | 0.959965 | ~0.00131 |
| INDEL Precision | 0.953154 | 0.964047 | ~0.01089 |
| INDEL F1 Score | 0.955897 | 0.962002 | ~0.00610 |
Recap
- In practice, using fewer steps can help reduce the computational cost of running the analysis.
- A simple workflow, following the recommendations in the tool documentation, can still achieve strong and reliable results.
References
- hap.py tool for benchmarking germline short-read variants
  https://github.com/qbic-projects/QSARK/tree/main
- Genome in a Bottle (GIAB) program for gold standard datasets
  https://www.nist.gov/programs-projects/genome-bottle