
Variant Calling (Part 6): Do we really need complex pipelines to achieve high-quality variant calling?

· 11 min read
Thanh-Giang Tan Nguyen
Founder at RIVER

In many bioinformatics workflows, pipelines keep getting more complex — more preprocessing steps, more tools, more layers of abstraction. But sometimes a simple question is worth asking: do we actually need all of that? While benchmarking germline variant calling on the HG002 sample from the Genome In A Bottle (GIAB) truth set, I stripped the pipeline down to the essentials recommended by the variant callers themselves. Surprisingly, the results were very similar to the ones produced by nf-core/sarek, achieving >99% accuracy for SNPs and INDELs on HG002.

tip

Clone the repository

git clone git@github.com:nttg8100/nf-germline-short-read-variant-calling.git -b 0.4.0
cd nf-germline-short-read-variant-calling
cd benchmark/pipelines/nf-germline-short-read-variant-calling

1. About The Workflow

tip

This workflow has evolved from a from-scratch bash workflow into a modern Nextflow workflow. What's new?

  • It was refactored and reorganized into subworkflows and workflows
  • The processing subworkflow was split into alignment and preprocessing_alignment
├── LICENSE
├── Makefile
├── Readme.md
├── assets
│   └── samplesheet.csv
├── conf
│   ├── base.config
│   ├── igenomes.config
│   └── modules.config
├── main.nf
├── modules
│   └── local
│       ├── bcftools
│       │   ├── query
│       │   │   └── main.nf
│       │   └── stats
│       │       └── main.nf
│       ├── bedtools
│       │   └── genomecov
│       │       └── main.nf
│       ├── bwa
│       │   ├── index
│       │   │   └── main.nf
│       │   └── mem2
│       │       └── main.nf
│       ├── deepvariant
│       │   └── main.nf
│       ├── fastp
│       │   └── trim
│       │       └── main.nf
│       ├── freebayes
│       │   └── main.nf
│       ├── gatk
│       │   ├── collectmetrics
│       │   │   └── main.nf
│       │   ├── genotypegvcfs
│       │   │   └── main.nf
│       │   ├── haplotypecaller
│       │   │   └── main.nf
│       │   ├── mergevcfs
│       │   │   └── main.nf
│       │   ├── selectvariants_indel
│       │   │   └── main.nf
│       │   ├── selectvariants_snp
│       │   │   └── main.nf
│       │   ├── variantfiltration_indel
│       │   │   └── main.nf
│       │   └── variantfiltration_snp
│       │       └── main.nf
│       ├── gatkspark
│       │   ├── applybqsr
│       │   │   └── main.nf
│       │   ├── baserecalibrator
│       │   │   └── main.nf
│       │   └── markduplicates
│       │       └── main.nf
│       ├── sambamba
│       │   └── markdup
│       │       └── main.nf
│       ├── samtools
│       │   ├── merge
│       │   │   └── main.nf
│       │   └── sort
│       │       └── main.nf
│       └── snpeff
│           └── annotate
│               └── main.nf
├── nextflow.config
├── nf-test.config
├── pixi.lock
├── pixi.toml
├── subworkflows
│   └── local
│       ├── alignment
│       │   └── main.nf
│       ├── annotation
│       │   └── main.nf
│       ├── preprocessing_alignment
│       │   └── main.nf
│       └── variant_calling
│           └── main.nf
├── tests
│   ├── default.nf.test
│   ├── default.nf.test.snap
│   └── nextflow.config
└── workflows
    └── main.nf

1.1 Config

To make it easy to fetch the correct reference files for a given genome, I adapted the igenomes approach from nf-core/sarek, using public reference data:

params {
    genomes = [
        'sarek_test': [
            fasta: 'https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/homo_sapiens/genome/genome.fasta',
            fasta_fai: 'https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/homo_sapiens/genome/genome.fasta.fai',
            dict: 'https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/homo_sapiens/genome/genome.dict',
            index_bwa2_reference: true,
            bwa2_index: 'null',
            dbsnp: 'https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/homo_sapiens/genome/vcf/dbsnp_146.hg38.vcf.gz',
            dbsnp_tbi: 'https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/homo_sapiens/genome/vcf/dbsnp_146.hg38.vcf.gz.tbi',
            known_indels: 'https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/homo_sapiens/genome/vcf/mills_and_1000G.indels.vcf.gz',
            known_indels_tbi: 'https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/homo_sapiens/genome/vcf/mills_and_1000G.indels.vcf.gz.tbi'
        ],
        'GRCh38': [
            fasta: "s3://ngi-igenomes/igenomes/Homo_sapiens/GATK/GRCh38/Sequence/WholeGenomeFasta/Homo_sapiens_assembly38.fasta",
            fasta_fai: "s3://ngi-igenomes/igenomes/Homo_sapiens/GATK/GRCh38/Sequence/WholeGenomeFasta/Homo_sapiens_assembly38.fasta.fai",
            dict: "s3://ngi-igenomes/igenomes/Homo_sapiens/GATK/GRCh38/Sequence/WholeGenomeFasta/Homo_sapiens_assembly38.dict",
            index_bwa2_reference: false,
            bwa2_index: "s3://ngi-igenomes/igenomes/Homo_sapiens/GATK/GRCh38/Sequence/BWAmem2Index/",
            dbsnp: "s3://ngi-igenomes/igenomes/Homo_sapiens/GATK/GRCh38/Annotation/GATKBundle/dbsnp_146.hg38.vcf.gz",
            dbsnp_tbi: "s3://ngi-igenomes/igenomes/Homo_sapiens/GATK/GRCh38/Annotation/GATKBundle/dbsnp_146.hg38.vcf.gz.tbi",
            known_indels: "s3://ngi-igenomes/igenomes/Homo_sapiens/GATK/GRCh38/Annotation/GATKBundle/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz",
            known_indels_tbi: "s3://ngi-igenomes/igenomes/Homo_sapiens/GATK/GRCh38/Annotation/GATKBundle/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi",
            known_indels_beta: "s3://ngi-igenomes/igenomes/Homo_sapiens/GATK/GRCh38/Annotation/GATKBundle/beta/Homo_sapiens_assembly38.known_indels.vcf.gz",
            known_indels_beta_tbi: "s3://ngi-igenomes/igenomes/Homo_sapiens/GATK/GRCh38/Annotation/GATKBundle/beta/Homo_sapiens_assembly38.known_indels.vcf.gz.tbi"
        ]
    ]
}

To run the workflow on a given genome, specify it via the `--genome` parameter; the workflow then fetches the matching reference inputs defined in `igenomes.config` in the `conf` folder:

nextflow run main.nf \
    -profile docker \
    --genome GRCh38 \
    --input input.csv \
    --output variant_calling
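The lookup itself is just a keyed map: `--genome` selects an entry in `params.genomes` and everything downstream reads reference paths from it. A minimal Python sketch of the same pattern (illustrative only — the function name `resolve_references` and the trimmed-down map are made up, the real resolution happens inside Nextflow):

```python
# Hypothetical sketch of the --genome lookup; only two fields per genome
# are shown here, the real igenomes.config carries the full reference set.
GENOMES = {
    "GRCh38": {
        "fasta": "s3://ngi-igenomes/igenomes/Homo_sapiens/GATK/GRCh38/Sequence/WholeGenomeFasta/Homo_sapiens_assembly38.fasta",
        "index_bwa2_reference": False,  # a pre-built BWA-MEM2 index exists
    },
    "sarek_test": {
        "fasta": "https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/homo_sapiens/genome/genome.fasta",
        "index_bwa2_reference": True,   # index must be built at run time
    },
}

def resolve_references(genome: str) -> dict:
    """Return the reference map for a genome, failing loudly on typos."""
    if genome not in GENOMES:
        raise ValueError(f"--genome '{genome}' not found in igenomes.config")
    return GENOMES[genome]

refs = resolve_references("GRCh38")
print(refs["fasta"].rsplit("/", 1)[-1])  # Homo_sapiens_assembly38.fasta
```

Failing loudly on an unknown key is the point: a misspelled genome name should stop the run before any compute is spent.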

1.2 Subworkflow

Similar to modules, subworkflows can be imported into the main workflow. A subworkflow is a group of modules with a related function; grouping them this way makes the logic reusable across pipelines and easier to monitor in the terminal.

1.2.1 Example

Below is the alignment subworkflow:

/*
========================================================================================
ALIGNMENT SUBWORKFLOW
========================================================================================
QC, Trimming, Alignment
========================================================================================
*/


include { FASTP } from '../../../modules/local/fastp/trim/main'
include { BWAMEM2_INDEX } from '../../../modules/local/bwa/index/main'
include { BWA_MEM2 } from '../../../modules/local/bwa/mem2/main'
include { SAMTOOLS_SORT } from '../../../modules/local/samtools/sort/main'
include { SAMTOOLS_MERGE } from '../../../modules/local/samtools/merge/main'


workflow ALIGNMENT {
    take:
    reads_ch             // channel: [ val(meta), [ path(read1), path(read2) ] ]
    ref_fasta            // path: reference FASTA
    ref_fai              // path: reference FAI
    ref_dict             // path: reference dict
    bwa2_index           // path: optional pre-built BWA-MEM2 index
    index_bwa2_reference // value: whether to build the BWA-MEM2 index

    main:
    ch_versions = channel.empty()

    // Check whether the BWA-MEM2 index needs to be generated:
    // if requested, build it; otherwise use the provided index
    if (index_bwa2_reference) {
        BWAMEM2_INDEX(ref_fasta)
        ch_versions = ch_versions.mix(BWAMEM2_INDEX.out.versions)
        bwa2_index_ch = BWAMEM2_INDEX.out.index
    }
    else {
        bwa2_index_ch = channel.fromPath(bwa2_index)
    }

    //
    // STEP 1: Adapter Trimming, Quality Filtering, and QC with fastp
    //
    FASTP(
        reads_ch
    )
    ch_versions = ch_versions.mix(FASTP.out.versions)

    //
    // STEP 2: Read Alignment with BWA-MEM2
    //
    BWA_MEM2(
        FASTP.out.reads,
        ref_fasta,
        ref_fai,
        ref_dict,
        bwa2_index_ch,
    )
    ch_versions = ch_versions.mix(BWA_MEM2.out.versions)

    //
    // STEP 3: Sort BAM file
    //
    SAMTOOLS_SORT(
        BWA_MEM2.out.bam
    )
    ch_versions = ch_versions.mix(SAMTOOLS_SORT.out.versions)

    //
    // STEP 4: Merge lanes per sample (if multiple lanes exist)
    //
    SAMTOOLS_SORT.out.bam
        .map { meta, bam ->
            def new_meta = [:]
            new_meta.id = meta.sample
            new_meta.sample = meta.sample
            return [new_meta.id, new_meta, bam]
        }
        .groupTuple()
        .map { _sample_id, metas, bams ->
            return [metas[0], bams]
        }
        .set { ch_bams_to_merge }

    SAMTOOLS_MERGE(
        ch_bams_to_merge
    )
    ch_versions = ch_versions.mix(SAMTOOLS_MERGE.out.versions)

    emit:
    bam = SAMTOOLS_MERGE.out.bam // channel: [ val(meta), path(bam) ]
    bai = SAMTOOLS_MERGE.out.bai // channel: [ val(meta), path(bai) ]
    versions = ch_versions       // channel: path(versions.yml)
}
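The lane-merging step (STEP 4) is the most channel-heavy part: per-lane BAMs are re-keyed by sample, grouped with groupTuple, and handed to SAMTOOLS_MERGE as one `(meta, bams)` pair per sample. The same logic, sketched in Python purely for illustration (file names are made up):

```python
from collections import defaultdict

# Per-lane sorted BAMs as emitted by SAMTOOLS_SORT; paths are illustrative.
sorted_bams = [
    ({"sample": "HG002", "lane": "L001"}, "HG002_L001.sorted.bam"),
    ({"sample": "HG002", "lane": "L002"}, "HG002_L002.sorted.bam"),
    ({"sample": "HG003", "lane": "L001"}, "HG003_L001.sorted.bam"),
]

def group_bams_by_sample(items):
    """Mimic the .map{...}.groupTuple() chain: collapse lane-level meta
    down to sample-level and collect every BAM belonging to one sample."""
    grouped = defaultdict(list)
    for meta, bam in items:
        grouped[meta["sample"]].append(bam)
    # One (meta, bams) tuple per sample, like the channel fed to SAMTOOLS_MERGE
    return [({"id": s, "sample": s}, bams) for s, bams in grouped.items()]

for meta, bams in group_bams_by_sample(sorted_bams):
    print(meta["id"], len(bams))  # HG002 2 / HG003 1
```

A sample sequenced on a single lane still flows through the merge step; it simply arrives as a one-element BAM list.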
info

Subworkflow vs. module: what's new?

  • Inputs are declared after take:
  • Outputs are declared directly after emit:

1.2.2 How To Import Workflow

A subworkflow is imported with the same include syntax as a module:

include { ALIGNMENT } from '../subworkflows/local/alignment/main'
include { PREPROCESSING } from '../subworkflows/local/preprocessing_alignment/main'
include { VARIANT_CALLING } from '../subworkflows/local/variant_calling/main'
include { ANNOTATION } from '../subworkflows/local/annotation/main'
/*
========================================================================================
NAMED WORKFLOW FOR PIPELINE
========================================================================================
*/

workflow GERMLINE_VARIANT_CALLING {
    ...
}

In addition, a subworkflow can itself be included in another workflow. Here, the four subworkflows above are included in GERMLINE_VARIANT_CALLING, which is in turn included in main.nf:

#!/usr/bin/env nextflow
nextflow.enable.dsl = 2

include { GERMLINE_VARIANT_CALLING } from './workflows/main.nf'

/*
========================================================================================
RUN MAIN WORKFLOW
========================================================================================
*/

workflow {

    //
    // Create input channel from samplesheet or input parameters
    //
    // Read from samplesheet CSV
    channel.fromPath(params.input)
        .splitCsv(header: true)
        .map { row ->
            def meta = [:]
            meta.id = row.sample
            meta.sample = row.sample
            // Support both new format (with lane) and old format (without lane)
            meta.lane = row.lane ?: "L001"
            meta.read_group = "${row.sample}_${meta.lane}"
            def reads = []
            reads.add(file(row.fastq_1))
            if (row.fastq_2) {
                reads.add(file(row.fastq_2))
            }
            return [meta, reads]
        }
        .set { ch_input }

    GERMLINE_VARIANT_CALLING(
        ch_input
    )
}
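The samplesheet parsing above has two conveniences worth noting: a missing lane column defaults to "L001", and a missing fastq_2 makes the row single-end. A Python sketch of the same row handling (illustrative only — the sample names and file names below are invented):

```python
import csv
import io

# Illustrative samplesheet; the column names match the Nextflow snippet above.
SAMPLESHEET = """sample,lane,fastq_1,fastq_2
HG002,L001,HG002_L001_R1.fastq.gz,HG002_L001_R2.fastq.gz
HG002,,HG002_R1.fastq.gz,
"""

def parse_samplesheet(text):
    """Mirror the splitCsv/map logic: default the lane to L001, derive a
    read group from sample and lane, and support single- or paired-end rows."""
    entries = []
    for row in csv.DictReader(io.StringIO(text)):
        meta = {
            "id": row["sample"],
            "sample": row["sample"],
            "lane": row["lane"] or "L001",  # old format has no lane column
        }
        meta["read_group"] = f"{row['sample']}_{meta['lane']}"
        reads = [row["fastq_1"]]
        if row["fastq_2"]:  # empty field means single-end
            reads.append(row["fastq_2"])
        entries.append((meta, reads))
    return entries

for meta, reads in parse_samplesheet(SAMPLESHEET):
    print(meta["read_group"], len(reads))  # HG002_L001 2 / HG002_L001 1
```

Note that both rows end up with the same read group here because the lane defaults collide; in practice each lane row should carry a distinct lane value.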

1.3 Versus nf-core/sarek

This variant calling series has built the pipeline up step by step, from the old ways to the current modern stack. According to the documentation of the variant calling tools themselves, the GATK best-practice preprocessing of aligned reads, which requires extensive computing resources, may not be needed at all:

As a result, I refactored the pipeline around the two workflows recommended by the two variant callers above:

  • Freebayes: fastp > bwa-mem2 > sambamba > freebayes
  • Deepvariant: fastp > bwa-mem2 > deepvariant
tip

The preprocessing_alignment subworkflow was updated with a preprocessor input: a string value channel that selects gatk or sambamba as the method used to preprocess the data.

#!/usr/bin/env nextflow

/*
========================================================================================
PREPROCESSING SUBWORKFLOW
========================================================================================
Deduplication, BQSR, Quality Metrics
========================================================================================
*/

include { GATKSPARK_MARKDUPLICATES } from '../../../modules/local/gatkspark/markduplicates/main'
include { GATKSPARK_BASERECALIBRATOR } from '../../../modules/local/gatkspark/baserecalibrator/main'
include { GATKSPARK_APPLYBQSR } from '../../../modules/local/gatkspark/applybqsr/main'
include { GATK_COLLECTMETRICS } from '../../../modules/local/gatk/collectmetrics/main'
include { SAMBAMBA_MARKDUP } from '../../../modules/local/sambamba/markdup/main'

workflow PREPROCESSING {
    take:
    preprocessor     // value: gatk or sambamba
    aligned_bams     // channel: [ val(meta), path(bam) ]
    ref_fasta        // path: reference FASTA
    ref_fai          // path: reference FAI
    ref_dict         // path: reference dict
    dbsnp_vcf        // path: dbSNP VCF
    dbsnp_tbi        // path: dbSNP TBI
    known_indels_vcf // path: known indels VCF
    known_indels_tbi // path: known indels TBI

    main:
    ch_versions = channel.empty()

    if (preprocessor == "gatk") {
        <<GATK preprocessing steps>>
    }
    else if (preprocessor == "sambamba") {
        <<sambamba>>
    }
    else {
        error "Unsupported preprocessor: ${preprocessor}"
    }

    emit:
    bam = ch_processed_bam                       // channel: [ val(meta), path(bam) ]
    bai = ch_processed_bai                       // channel: [ val(meta), path(bai) ]
    alignment_summary = ch_alignment_summary     // channel: [ val(meta), path(metrics) ]
    insert_metrics = ch_alignment_insert_metrics // channel: [ val(meta), path(metrics) ]
    versions = ch_versions                       // channel: path(versions.yml)
}
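The dispatch is a plain string switch with a hard failure for unknown values, so a typo in --preprocessor cannot silently skip deduplication. A Python sketch of the same pattern (illustrative only; the tool chains are inferred from the modules this subworkflow includes):

```python
def select_preprocessing(preprocessor: str) -> list[str]:
    """Return the tool chain for a preprocessor choice, erroring out on
    anything else, like the if / else if / else in the subworkflow above."""
    if preprocessor == "gatk":
        # GATK best-practice chain, per the gatkspark modules included above
        return ["MarkDuplicatesSpark", "BaseRecalibratorSpark", "ApplyBQSRSpark"]
    elif preprocessor == "sambamba":
        # Lightweight alternative: duplicate marking only
        return ["sambamba markdup"]
    else:
        raise ValueError(f"Unsupported preprocessor: {preprocessor}")

print(select_preprocessing("sambamba"))  # ['sambamba markdup']
```

Routing on a value channel like this keeps both branches in one subworkflow, so the downstream variant caller consumes the same `bam`/`bai` outputs either way.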

I allocated more computing resources so the pipeline can run faster, using the config below (customized_labels.config):

// HPC internet is quite slow, so we increase the pull timeout for singularity containers
singularity.pullTimeout = "100.h"

process {
    cpus = { 2 * task.attempt }
    memory = { 4.GB * task.attempt }
    time = { 4.h * task.attempt }

    // memory errors which should be retried. otherwise error out
    errorStrategy = { task.exitStatus in ((130..145) + 104) ? 'retry' : 'finish' }
    maxRetries = 1
    maxErrors = '-1'

    // Process-specific resource requirements
    // See https://www.nextflow.io/docs/latest/config.html#config-process-selectors
    withLabel:error_ignore {
        errorStrategy = 'ignore'
    }
    withLabel:error_retry {
        errorStrategy = 'retry'
        maxRetries = 2
    }
    withLabel:process_single {
        cpus = { 3 }
        memory = { 6.GB * task.attempt }
        time = { 4.h * task.attempt }
    }
    withLabel:process_low {
        cpus = { 6 * task.attempt }
        memory = { 12.GB * task.attempt }
        time = { 4.h * task.attempt }
    }
    withLabel:process_medium {
        cpus = { 64 * task.attempt }
        memory = { 120.GB * task.attempt }
        time = { 8.h * task.attempt }
    }
    withLabel:process_high {
        cpus = { 64 * task.attempt }
        memory = { 120.GB * task.attempt }
        time = { 16.h * task.attempt }
    }
    withLabel:process_long {
        time = { 20.h * task.attempt }
    }
    withLabel:process_high_memory {
        memory = { 200.GB * task.attempt }
    }
}

process.executor = "slurm"
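The key trick in this config is that every resource closure scales with `task.attempt`: a task killed for exceeding memory is retried with double the allocation. A Python sketch of the retry-and-scale behaviour (illustrative only; Nextflow's actual retry accounting lives in the executor):

```python
# Exit codes treated as retryable, mirroring `task.exitStatus in ((130..145) + 104)`
RETRYABLE = set(range(130, 146)) | {104}

def next_action(exit_status: int, attempt: int, max_retries: int = 1) -> str:
    """Mimic the errorStrategy closure: retry likely-OOM exit codes
    until max_retries is exhausted, otherwise finish the run."""
    if exit_status in RETRYABLE and attempt <= max_retries:
        return "retry"
    return "finish"

def memory_gb(base_gb: int, attempt: int) -> int:
    """Mimic `memory = { 4.GB * task.attempt }`: resources scale
    linearly with the attempt number, so a retry gets more memory."""
    return base_gb * attempt

print(next_action(137, 1), memory_gb(4, 2))  # retry 8
```

Exit code 137 (SIGKILL, typically the OOM killer) on the first attempt triggers a retry with 8 GB instead of 4 GB; a second failure finishes the run rather than looping forever.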
tip

The GERMLINE_VARIANT_CALLING workflow was updated with a skip_preprocessing parameter: when set, the raw aligned reads are passed directly to the variant caller.

if (!params.skip_preprocessing) {
    PREPROCESSING(
        params.preprocessor,
        ALIGNMENT.out.bam,
        ref_fasta,
        ref_fai,
        ref_dict,
        dbsnp_vcf,
        dbsnp_tbi,
        known_indels_vcf,
        known_indels_tbi,
    )

    ch_versions = ch_versions.mix(PREPROCESSING.out.versions)
    ch_final_bam = PREPROCESSING.out.bam
    ch_final_bai = PREPROCESSING.out.bai
}
else {
    ch_final_bam = ALIGNMENT.out.bam
    ch_final_bai = ALIGNMENT.out.bai
}

2. Running The Two Simplified Workflows

Clone the repo; the input files are available at tag 0.4.0:

# cloning repo
git clone https://github.com/nttg8100/nf-germline-short-read-variant-calling.git -b 0.4.0
cd nf-germline-short-read-variant-calling
cd benchmark/pipelines/nf-germline-short-read-variant-calling

2.1 Freebayes

The workflow procedure for Freebayes: fastp > bwa-mem2 > sambamba > freebayes

# Freebayes recommendation:
# 1. fastp: quality control
# 2. bwa-mem2: read alignment
# 3. sambamba: PCR remove duplicate
# 4. freebayes: Variant calling
nextflow run nf-germline-short-read-variant-calling \
    -profile docker -c customized_labels.config \
    --genome GRCh38 \
    --input input.csv \
    --variant_caller "freebayes" \
    --preprocessor "sambamba" \
    --output variant_calling

2.2 Deepvariant

The workflow procedure for Deepvariant: fastp > bwa-mem2 > deepvariant

# Deepvariant recommendation: 
# 1. fastp: quality control
# 2. bwa-mem2: read alignment
# 3. deepvariant: variant calling
nextflow run nf-germline-short-read-variant-calling \
    -profile docker -c customized_labels.config \
    --genome GRCh38 \
    --input input.csv \
    --variant_caller "deepvariant" \
    --skip_preprocessing \
    --output variant_calling \
    -resume # rerun in the same working directory to reuse the cache for the alignment processes

3. Evaluation

For evaluation, I reused the same HG002 GIAB setup from the previous blog post on benchmarking germline variant calling with nf-core/sarek.

Benchmarking against GIAB HG002 shows that a minimal workflow (FASTP → BWA-MEM2 → DeepVariant) achieves essentially identical SNP and INDEL accuracy compared to the full nf-core/sarek pipeline, suggesting that for single-sample germline variant calling, pipeline complexity does not necessarily translate into better performance.

| Run | Type | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | Recall | Precision | Frac_NA | F1_Score | TRUTH.TiTv | QUERY.TiTv | TRUTH.het/hom | QUERY.het/hom |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| deepvariant_HG002 | INDEL | ALL | 525469 | 519079 | 6390 | 892719 | 3369 | 348852 | 2473 | 421 | 0.987839 | 0.993805 | 0.390775 | 0.990813 | | | 1.528276 | 1.892501 |
| deepvariant_HG002 | INDEL | PASS | 525469 | 519079 | 6390 | 892719 | 3369 | 348852 | 2473 | 421 | 0.987839 | 0.993805 | 0.390775 | 0.990813 | | | 1.528276 | 1.892501 |
| deepvariant_HG002 | SNP | ALL | 3365127 | 3344672 | 20455 | 3840687 | 6131 | 488147 | 1391 | 617 | 0.993921 | 0.998171 | 0.127099 | 0.996042 | 2.100128 | 1.986419 | 1.581196 | 1.498072 |
| deepvariant_HG002 | SNP | PASS | 3365127 | 3344672 | 20455 | 3840687 | 6131 | 488147 | 1391 | 617 | 0.993921 | 0.998171 | 0.127099 | 0.996042 | 2.100128 | 1.986419 | 1.581196 | 1.498072 |
| freebayes_HG002 | INDEL | ALL | 525469 | 503744 | 21725 | 883066 | 25587 | 336867 | 14646 | 1947 | 0.958656 | 0.953154 | 0.381474 | 0.955897 | | | 1.528276 | 1.856978 |
| freebayes_HG002 | INDEL | PASS | 525469 | 503744 | 21725 | 883066 | 25587 | 336867 | 14646 | 1947 | 0.958656 | 0.953154 | 0.381474 | 0.955897 | | | 1.528276 | 1.856978 |
| freebayes_HG002 | SNP | ALL | 3365127 | 3344366 | 20761 | 4404601 | 129353 | 927169 | 38291 | 6162 | 0.993831 | 0.962802 | 0.210500 | 0.978070 | 2.100128 | 1.823321 | 1.581196 | 1.875394 |
| freebayes_HG002 | SNP | PASS | 3365127 | 3344366 | 20761 | 4404601 | 129353 | 927169 | 38291 | 6162 | 0.993831 | 0.962802 | 0.210500 | 0.978070 | 2.100128 | 1.823321 | 1.581196 | 1.875394 |
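The summary metrics in the table follow directly from the raw counts. As a sanity check, here is how hap.py-style Recall, Precision, F1 and Frac_NA are recomputed from the count columns (this is standard benchmarking arithmetic, not code from the pipeline):

```python
def happy_metrics(truth_total, truth_tp, query_total, query_fp, query_unk):
    """Recompute hap.py-style summary metrics from raw counts.

    Recall is measured against the truth set; precision only counts
    query calls inside the confident regions (i.e. excluding UNK)."""
    recall = truth_tp / truth_total
    assessed = query_total - query_unk        # query calls that could be judged
    precision = (assessed - query_fp) / assessed
    f1 = 2 * precision * recall / (precision + recall)
    frac_na = query_unk / query_total         # fraction of unassessed calls
    return recall, precision, f1, frac_na

# DeepVariant SNPs on HG002, using the counts from the table row above
r, p, f1, na = happy_metrics(3365127, 3344672, 3840687, 6131, 488147)
print(round(r, 6), round(p, 6), round(f1, 6), round(na, 6))
# 0.993921 0.998171 0.996042 0.127099
```

Plugging in any other row reproduces its Recall/Precision/F1/Frac_NA columns the same way.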

3.1 DeepVariant

Only a minor difference in performance between the simple workflow and nf-core/sarek with DeepVariant:

| Metric | Simple Workflow | nf-core/sarek | Difference |
|---|---|---|---|
| SNP Recall | 0.993921 | 0.993885 | ~0.00004 |
| SNP Precision | 0.998171 | 0.998421 | ~0.00025 |
| SNP F1 Score | 0.996042 | 0.996148 | ~0.00011 |
| INDEL Recall | 0.987839 | 0.989684 | ~0.00185 |
| INDEL Precision | 0.993805 | 0.994586 | ~0.00078 |
| INDEL F1 Score | 0.990813 | 0.992129 | ~0.00132 |

3.2 FreeBayes

Likewise, only a minor difference for the simple workflow with FreeBayes:

| Metric | Simple Workflow | nf-core/sarek | Difference |
|---|---|---|---|
| SNP Recall | 0.993831 | 0.993862 | ~0.00003 |
| SNP Precision | 0.962802 | 0.962893 | ~0.00009 |
| SNP F1 Score | 0.978070 | 0.978133 | ~0.00006 |
| INDEL Recall | 0.958656 | 0.959965 | ~0.00131 |
| INDEL Precision | 0.953154 | 0.964047 | ~0.01089 |
| INDEL F1 Score | 0.955897 | 0.962002 | ~0.00610 |

Recap

info
  • In practice, using fewer steps can help reduce the computational cost of running the analysis.
  • A simple workflow, following the recommendations in the tool documentation, can still achieve strong and reliable results.

References

  1. hap.py tool for benchmarking germline short-read variants:
     https://github.com/qbic-projects/QSARK/tree/main

  2. Genome in a Bottle (GIAB) program for gold standard datasets:
     https://www.nist.gov/programs-projects/genome-bottle