Variant Calling (Part 6): Do we really need complex pipelines to achieve high-quality variant calling?
In many bioinformatics workflows, pipelines keep getting more complex: more preprocessing steps, more tools, more layers of abstraction. But sometimes a simple question is worth asking: do we actually need all of that? To find out, I benchmarked a deliberately simple germline variant calling workflow on the HG002 sample from the Genome In A Bottle (GIAB) truth set. Surprisingly, the results were very close to those produced by nf-core/sarek, achieving >99% accuracy for SNPs and INDELs on HG002.
- The environment setup and scripts are uploaded to nf-germline-short-read-variant-calling
- Clone the repository at tag 0.4.0 to reproduce the contents of this post
Clone the repository
git clone git@github.com:nttg8100/nf-germline-short-read-variant-calling.git -b 0.4.0
cd nf-germline-short-read-variant-calling
1. About The Workflow
This workflow has been rebuilt step by step, from a from-scratch bash workflow to a modern Nextflow workflow. What's new?
- It was refactored and reorganized into subworkflows and workflows
- The processing subworkflow was split into alignment and preprocess_alignment
├── LICENSE
├── Makefile
├── Readme.md
├── assets
│ ├── samplesheet.csv
├── conf
│ ├── base.config
│ ├── igenomes.config
│ └── modules.config
├── main.nf
├── modules
│ └── local
│ ├── bcftools
│ │ ├── query
│ │ │ └── main.nf
│ │ └── stats
│ │ └── main.nf
│ ├── bedtools
│ │ └── genomecov
│ │ └── main.nf
│ ├── bwa
│ │ ├── index
│ │ │ └── main.nf
│ │ └── mem2
│ │ └── main.nf
│ ├── deepvariant
│ │ └── main.nf
│ ├── fastp
│ │ └── trim
│ │ └── main.nf
│ ├── freebayes
│ │ └── main.nf
│ ├── gatk
│ │ ├── collectmetrics
│ │ │ └── main.nf
│ │ ├── genotypegvcfs
│ │ │ └── main.nf
│ │ ├── haplotypecaller
│ │ │ └── main.nf
│ │ ├── mergevcfs
│ │ │ └── main.nf
│ │ ├── selectvariants_indel
│ │ │ └── main.nf
│ │ ├── selectvariants_snp
│ │ │ └── main.nf
│ │ ├── variantfiltration_indel
│ │ │ └── main.nf
│ │ └── variantfiltration_snp
│ │ └── main.nf
│ ├── gatkspark
│ │ ├── applybqsr
│ │ │ └── main.nf
│ │ ├── baserecalibrator
│ │ │ └── main.nf
│ │ └── markduplicates
│ │ └── main.nf
│ ├── sambamba
│ │ └── markdup
│ │ └── main.nf
│ ├── samtools
│ │ ├── merge
│ │ │ └── main.nf
│ │ └── sort
│ │ └── main.nf
│ └── snpeff
│ └── annotate
│ └── main.nf
├── nextflow.config
├── nf-test.config
├── pixi.lock
├── pixi.toml
├── subworkflows
│ └── local
│ ├── alignment
│ │ └── main.nf
│ ├── annotation
│ │ └── main.nf
│ ├── preprocessing_alignment
│ │ └── main.nf
│ └── variant_calling
│ └── main.nf
├── tests
│ ├── default.nf.test
│ ├── default.nf.test.snap
│ └── nextflow.config
└── workflows
└── main.nf
1.1 Config
To make it easy to fetch the correct reference files for a given genome, I adapted the public reference data configuration from nf-core/sarek:
params {
genomes = [
'sarek_test': [
fasta: 'https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/homo_sapiens/genome/genome.fasta',
fasta_fai: 'https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/homo_sapiens/genome/genome.fasta.fai',
dict: 'https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/homo_sapiens/genome/genome.dict',
index_bwa2_reference: true,
bwa2_index: 'null',
dbsnp: 'https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/homo_sapiens/genome/vcf/dbsnp_146.hg38.vcf.gz',
dbsnp_tbi: 'https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/homo_sapiens/genome/vcf/dbsnp_146.hg38.vcf.gz.tbi',
known_indels: 'https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/homo_sapiens/genome/vcf/mills_and_1000G.indels.vcf.gz',
known_indels_tbi: 'https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/homo_sapiens/genome/vcf/mills_and_1000G.indels.vcf.gz.tbi'
],
'GRCh38': [
fasta: "s3://ngi-igenomes/igenomes/Homo_sapiens/GATK/GRCh38/Sequence/WholeGenomeFasta/Homo_sapiens_assembly38.fasta",
fasta_fai: "s3://ngi-igenomes/igenomes/Homo_sapiens/GATK/GRCh38/Sequence/WholeGenomeFasta/Homo_sapiens_assembly38.fasta.fai",
dict: "s3://ngi-igenomes/igenomes/Homo_sapiens/GATK/GRCh38/Sequence/WholeGenomeFasta/Homo_sapiens_assembly38.dict",
index_bwa2_reference: false,
bwa2_index: "s3://ngi-igenomes/igenomes/Homo_sapiens/GATK/GRCh38/Sequence/BWAmem2Index/",
dbsnp: "s3://ngi-igenomes/igenomes/Homo_sapiens/GATK/GRCh38/Annotation/GATKBundle/dbsnp_146.hg38.vcf.gz",
dbsnp_tbi: "s3://ngi-igenomes/igenomes/Homo_sapiens/GATK/GRCh38/Annotation/GATKBundle/dbsnp_146.hg38.vcf.gz.tbi",
known_indels: "s3://ngi-igenomes/igenomes/Homo_sapiens/GATK/GRCh38/Annotation/GATKBundle/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz",
known_indels_tbi: "s3://ngi-igenomes/igenomes/Homo_sapiens/GATK/GRCh38/Annotation/GATKBundle/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi",
known_indels_beta: "s3://ngi-igenomes/igenomes/Homo_sapiens/GATK/GRCh38/Annotation/GATKBundle/beta/Homo_sapiens_assembly38.known_indels.vcf.gz",
known_indels_beta_tbi: "s3://ngi-igenomes/igenomes/Homo_sapiens/GATK/GRCh38/Annotation/GATKBundle/beta/Homo_sapiens_assembly38.known_indels.vcf.gz.tbi"
]
]
}
To run the workflow with a given genome, specify it via the --genome parameter; the pipeline will then fetch the corresponding reference data defined in igenomes.config inside the conf folder:
nextflow run main.nf \
-profile docker\
--genome GRCh38 \
--input input.csv \
--output variant_calling
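The --input samplesheet is a plain CSV. As a hypothetical minimal example (the column names sample, lane, fastq_1, and fastq_2 mirror the fields read by the pipeline's samplesheet parsing in main.nf; the file names are placeholders):

```shell
# Hypothetical minimal samplesheet: two lanes of paired-end reads for one
# sample. lane and fastq_2 are optional (lane defaults to L001).
cat > input.csv <<'EOF'
sample,lane,fastq_1,fastq_2
HG002,L001,HG002_L001_R1.fastq.gz,HG002_L001_R2.fastq.gz
HG002,L002,HG002_L002_R1.fastq.gz,HG002_L002_R2.fastq.gz
EOF
head -n 1 input.csv
```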
1.2 Subworkflow
Similar to modules, a subworkflow can be imported into the main workflow. It is a group of modules that share a related function; grouping them makes the logic reusable across pipelines and easier to monitor in the terminal.
1.2.1 Example
The subworkflow below is the alignment subworkflow:
/*
========================================================================================
ALIGNMENT SUBWORKFLOW
========================================================================================
QC, Trimming, Alignment
========================================================================================
*/
include { FASTP } from '../../../modules/local/fastp/trim/main'
include { BWAMEM2_INDEX } from '../../../modules/local/bwa/index/main'
include { BWA_MEM2 } from '../../../modules/local/bwa/mem2/main'
include { SAMTOOLS_SORT } from '../../../modules/local/samtools/sort/main'
include { SAMTOOLS_MERGE } from '../../../modules/local/samtools/merge/main'
workflow ALIGNMENT {
take:
reads_ch // channel: [ val(meta), [ path(read1), path(read2) ] ]
ref_fasta // path: reference FASTA
ref_fai // path: reference FAI
ref_dict // path: reference dict
bwa2_index // channel: Optional BWA index files
index_bwa2_reference // channel: Optional BWA index files
main:
ch_versions = channel.empty()
// Check if BWA index needs to be generated
// If empty, generate it; otherwise use provided index
if (index_bwa2_reference) {
BWAMEM2_INDEX(ref_fasta)
ch_versions = ch_versions.mix(BWAMEM2_INDEX.out.versions)
bwa2_index_ch = BWAMEM2_INDEX.out.index
}
else {
bwa2_index_ch = channel.fromPath(bwa2_index)
}
//
// STEP 1: Adapter Trimming, Quality Filtering, and QC with fastp
//
FASTP(
reads_ch
)
ch_versions = ch_versions.mix(FASTP.out.versions)
//
// STEP 2: Read Alignment with BWA-MEM2
//
BWA_MEM2(
FASTP.out.reads,
ref_fasta,
ref_fai,
ref_dict,
bwa2_index_ch,
)
ch_versions = ch_versions.mix(BWA_MEM2.out.versions)
//
// STEP 3: Sort BAM file
//
SAMTOOLS_SORT(
BWA_MEM2.out.bam
)
ch_versions = ch_versions.mix(SAMTOOLS_SORT.out.versions)
//
// STEP 4: Merge lanes per sample (if multiple lanes exist)
//
SAMTOOLS_SORT.out.bam
.map { meta, bam ->
def new_meta = [:]
new_meta.id = meta.sample
new_meta.sample = meta.sample
return [new_meta.id, new_meta, bam]
}
.groupTuple()
.map { _sample_id, metas, bams ->
return [metas[0], bams]
}
.set { ch_bams_to_merge }
SAMTOOLS_MERGE(
ch_bams_to_merge
)
ch_versions = ch_versions.mix(SAMTOOLS_MERGE.out.versions)
emit:
bam = SAMTOOLS_MERGE.out.bam // channel: [ val(meta), path(bam) ]
bai = SAMTOOLS_MERGE.out.bai // channel: [ val(meta), path(bai) ]
versions = ch_versions // channel: path(versions.yml)
}
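The STEP 4 channel logic re-keys each sorted BAM by its sample id and collects all lane-level BAMs per sample before merging. As a rough, Nextflow-free sketch of the same grouping (sample and file names are made up):

```shell
# Toy illustration of the STEP 4 grouping: collect all lane-level BAMs
# under their sample id (mirrors groupTuple keyed on meta.sample).
merged=$(printf '%s\n' \
    'HG002 HG002_L001.sorted.bam' \
    'HG002 HG002_L002.sorted.bam' \
    'HG003 HG003_L001.sorted.bam' \
  | awk '{bams[$1] = bams[$1] " " $2} END {for (s in bams) print s ":" bams[s]}' \
  | sort)
echo "$merged"
```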
Subworkflow vs module, what's new?
- Inputs are declared after take:
- Outputs are declared after emit:
1.2.2 How To Import Workflow
Importing a subworkflow uses the same include statement as modules:
include { ALIGNMENT } from '../subworkflows/local/alignment/main'
include { PREPROCESSING } from '../subworkflows/local/preprocessing_alignment/main'
include { VARIANT_CALLING } from '../subworkflows/local/variant_calling/main'
include { ANNOTATION } from '../subworkflows/local/annotation/main'
/*
========================================================================================
NAMED WORKFLOW FOR PIPELINE
========================================================================================
*/
workflow GERMLINE_VARIANT_CALLING {
...
}
In addition, a subworkflow can be included inside another workflow. Here, the 4 subworkflows above are included in GERMLINE_VARIANT_CALLING, which is in turn included in the top-level main.nf:
#!/usr/bin/env nextflow
nextflow.enable.dsl = 2
include { GERMLINE_VARIANT_CALLING } from './workflows/main.nf'
/*
========================================================================================
RUN MAIN WORKFLOW
========================================================================================
*/
workflow {
//
// Create input channel from samplesheet or input parameters
//
// Read from samplesheet CSV
channel.fromPath(params.input)
.splitCsv(header: true)
.map { row ->
def meta = [:]
meta.id = row.sample
meta.sample = row.sample
// Support both new format (with lane) and old format (without lane)
meta.lane = row.lane ?: "L001"
meta.read_group = "${row.sample}_${meta.lane}"
def reads = []
reads.add(file(row.fastq_1))
if (row.fastq_2) {
reads.add(file(row.fastq_2))
}
return [meta, reads]
}
.set { ch_input }
GERMLINE_VARIANT_CALLING(
ch_input
)
}
1.3 Versus nf-core/sarek
This variant calling blog series has built the pipeline up from the old ways to the current modern technology. According to the documentation of the variant calling tools below, the GATK best-practice preprocessing of aligned reads, which requires extensive computing resources, may not be necessary:
- Freebayes: https://github.com/freebayes/freebayes
- Deepvariant: https://github.com/google/deepvariant
As a result, I refactored the pipeline into the 2 workflows recommended by these variant calling tools:
- Freebayes: fastp > bwa-mem2 > sambamba > freebayes
- Deepvariant: fastp > bwa-mem2 > deepvariant
The preprocess_alignment subworkflow was updated with a preprocessor input: a string value channel, set to gatk or sambamba, that selects how the data is preprocessed
#!/usr/bin/env nextflow
/*
========================================================================================
PREPROCESSING SUBWORKFLOW
========================================================================================
Deduplication, BQSR, Quality Metrics
========================================================================================
*/
include { GATKSPARK_MARKDUPLICATES } from '../../../modules/local/gatkspark/markduplicates/main'
include { GATKSPARK_BASERECALIBRATOR } from '../../../modules/local/gatkspark/baserecalibrator/main'
include { GATKSPARK_APPLYBQSR } from '../../../modules/local/gatkspark/applybqsr/main'
include { GATK_COLLECTMETRICS } from '../../../modules/local/gatk/collectmetrics/main'
include { SAMBAMBA_MARKDUP } from '../../../modules/local/sambamba/markdup/main'
workflow PREPROCESSING {
take:
preprocessor // value: gatk or sambamba
aligned_bams // channel: [ val(meta), path(bam) ]
ref_fasta // path: reference FASTA
ref_fai // path: reference FAI
ref_dict // path: reference dict
dbsnp_vcf // path: dbSNP VCF
dbsnp_tbi // path: dbSNP TBI
known_indels_vcf // path: known indels VCF
known_indels_tbi // path: known indels TBI
main:
ch_versions = channel.empty()
if (preprocessor=="gatk"){
<<GATK preprocessing steps>>
} else if (preprocessor=="sambamba"){
<<sambamba>>
}
else {
error "Unsupported preprocessor: ${preprocessor}"
}
emit:
bam = ch_processed_bam // channel: [ val(meta), path(bam) ]
bai = ch_processed_bai // channel: [ val(meta), path(bai) ]
alignment_summary = ch_alignment_summary // channel: [ val(meta), path(metrics) ]
insert_metrics = ch_alignment_insert_metrics // channel: [ val(meta), path(metrics) ]
versions = ch_versions // channel: path(versions.yml)
}
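A sketch of the dispatch above as a shell case statement; the chains are inferred from the modules this subworkflow includes (GATK: mark duplicates, then BQSR; sambamba: mark duplicates only):

```shell
# Hedged sketch of the preprocessor dispatch: which module chain runs for
# each value, based on the modules included in this subworkflow.
preprocess_chain() {
  case "$1" in
    gatk)     echo "markduplicates -> baserecalibrator -> applybqsr" ;;
    sambamba) echo "markdup" ;;
    *)        echo "Unsupported preprocessor: $1" >&2; return 1 ;;
  esac
}
preprocess_chain gatk
preprocess_chain sambamba
```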
I allocated more computing resources so the pipeline runs faster, using the config below (customized_labels.config):
// HPC internet is quite slow, so we increase the pull timeout for singularity containers
singularity.pullTimeout="100.h"
process {
cpus = { 2 * task.attempt }
memory = { 4.GB * task.attempt }
time = { 4.h * task.attempt }
// memory errors which should be retried. otherwise error out
errorStrategy = { task.exitStatus in ((130..145) + 104) ? 'retry' : 'finish' }
maxRetries = 1
maxErrors = '-1'
// Process-specific resource requirements
// See https://www.nextflow.io/docs/latest/config.html#config-process-selectors
withLabel:error_ignore {
errorStrategy = 'ignore'
}
withLabel:error_retry {
errorStrategy = 'retry'
maxRetries = 2
}
withLabel:process_single {
cpus = { 3 }
memory = { 6.GB * task.attempt }
time = { 4.h * task.attempt }
}
withLabel:process_low {
cpus = { 6 * task.attempt }
memory = { 12.GB * task.attempt }
time = { 4.h * task.attempt }
}
withLabel:process_medium {
cpus = { 64 * task.attempt }
memory = { 120.GB * task.attempt }
time = { 8.h * task.attempt }
}
withLabel:process_high {
cpus = { 64 * task.attempt }
memory = { 120.GB * task.attempt }
time = { 16.h * task.attempt }
}
withLabel:process_long {
time = { 20.h * task.attempt }
}
withLabel:process_high_memory {
memory = { 200.GB * task.attempt }
}
}
process.executor="slurm"
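The errorStrategy closure retries a task whose exit status suggests a resource problem (130 to 145, e.g. 137 for a killed, typically out-of-memory process, plus 104) and otherwise lets the run finish; combined with the task.attempt multipliers, a retried task automatically requests more cpus, memory, and time. The predicate can be sketched as:

```shell
# Sketch of the errorStrategy closure above: retry on exit codes
# 130-145 or 104, otherwise finish the run.
decide() {
  if { [ "$1" -ge 130 ] && [ "$1" -le 145 ]; } || [ "$1" -eq 104 ]; then
    echo retry
  else
    echo finish
  fi
}
decide 137   # killed task (often out-of-memory) -> retry
decide 1     # ordinary failure -> finish
```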
The GERMLINE_VARIANT_CALLING workflow was updated with a skip_preprocessing parameter; when set, the raw aligned reads are fed directly into the variant caller
if(!params.skip_preprocessing){
PREPROCESSING(
params.preprocessor,
ALIGNMENT.out.bam,
ref_fasta,
ref_fai,
ref_dict,
dbsnp_vcf,
dbsnp_tbi,
known_indels_vcf,
known_indels_tbi,
)
ch_versions = ch_versions.mix(PREPROCESSING.out.versions)
ch_final_bam = PREPROCESSING.out.bam
ch_final_bai = PREPROCESSING.out.bai
}
else {
ch_final_bam = ALIGNMENT.out.bam
ch_final_bai = ALIGNMENT.out.bai
}
2. Running The 2 Simplified Workflows
Clone the repo; the input files are available at tag 0.4.0:
# cloning repo
git clone https://github.com/nttg8100/nf-germline-short-read-variant-calling.git -b 0.4.0
cd benchmark/pipelines/nf-germline-short-read-variant-calling
2.1 Freebayes
The workflow procedure for Freebayes: fastp > bwa-mem2 > sambamba > freebayes
# Freebayes recommendation:
# 1. fastp: quality control
# 2. bwa-mem2: read alignment
# 3. sambamba: remove PCR duplicates
# 4. freebayes: variant calling
nextflow run nf-germline-short-read-variant-calling \
-profile docker -c customized_labels.config \
--genome GRCh38 \
--input input.csv \
--variant_caller "freebayes" \
--preprocessor "sambamba" \
--output variant_calling
2.2 Deepvariant
The workflow procedure for Deepvariant: fastp > bwa-mem2 > deepvariant
# Deepvariant recommendation:
# 1. fastp: quality control
# 2. bwa-mem2: read alignment
# 3. deepvariant: variant calling
nextflow run nf-germline-short-read-variant-calling \
-profile docker -c customized_labels.config \
--genome GRCh38 \
--input input.csv \
--variant_caller "deepvariant" \
--skip_preprocessing \
--output variant_calling -resume # Run in the same working directory so the cached alignment results are reused
3. Evaluation
For evaluation, I reuse the same HG002 setup from GIAB as in the previous post on benchmarking germline variant calling with nf-core/sarek.
Benchmarking against GIAB HG002 shows that a minimal workflow (FASTP → BWA-MEM2 → DeepVariant) achieves essentially identical SNP and INDEL accuracy compared to the full nf-core/sarek pipeline, suggesting that for single-sample germline variant calling, pipeline complexity does not necessarily translate into better performance.
| Run | Type | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | Recall | Precision | Frac_NA | F1_Score | TRUTH.TiTv | QUERY.TiTv | TRUTH.het/hom | QUERY.het/hom |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| deepvariant_HG002 | INDEL | ALL | 525469 | 519079 | 6390 | 892719 | 3369 | 348852 | 2473 | 421 | 0.987839 | 0.993805 | 0.390775 | 0.990813 | | | 1.528276 | 1.892501 |
| deepvariant_HG002 | INDEL | PASS | 525469 | 519079 | 6390 | 892719 | 3369 | 348852 | 2473 | 421 | 0.987839 | 0.993805 | 0.390775 | 0.990813 | | | 1.528276 | 1.892501 |
| deepvariant_HG002 | SNP | ALL | 3365127 | 3344672 | 20455 | 3840687 | 6131 | 488147 | 1391 | 617 | 0.993921 | 0.998171 | 0.127099 | 0.996042 | 2.100128 | 1.986419 | 1.581196 | 1.498072 |
| deepvariant_HG002 | SNP | PASS | 3365127 | 3344672 | 20455 | 3840687 | 6131 | 488147 | 1391 | 617 | 0.993921 | 0.998171 | 0.127099 | 0.996042 | 2.100128 | 1.986419 | 1.581196 | 1.498072 |
| freebayes_HG002 | INDEL | ALL | 525469 | 503744 | 21725 | 883066 | 25587 | 336867 | 14646 | 1947 | 0.958656 | 0.953154 | 0.381474 | 0.955897 | | | 1.528276 | 1.856978 |
| freebayes_HG002 | INDEL | PASS | 525469 | 503744 | 21725 | 883066 | 25587 | 336867 | 14646 | 1947 | 0.958656 | 0.953154 | 0.381474 | 0.955897 | | | 1.528276 | 1.856978 |
| freebayes_HG002 | SNP | ALL | 3365127 | 3344366 | 20761 | 4404601 | 129353 | 927169 | 3829 | 16162 | 0.993831 | 0.962802 | 0.210500 | 0.978070 | 2.100128 | 1.823321 | 1.581196 | 1.875394 |
| freebayes_HG002 | SNP | PASS | 3365127 | 3344366 | 20761 | 4404601 | 129353 | 927169 | 3829 | 16162 | 0.993831 | 0.962802 | 0.210500 | 0.978070 | 2.100128 | 1.823321 | 1.581196 | 1.875394 |
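As a sanity check, the headline metrics can be recomputed from the raw counts: hap.py takes Recall = TRUTH.TP / (TRUTH.TP + TRUTH.FN), counts query true positives as QUERY.TOTAL - QUERY.UNK - QUERY.FP, takes Precision = QUERY.TP / (QUERY.TP + QUERY.FP), and reports F1 as the harmonic mean of the two. Using the DeepVariant SNP row above:

```shell
# Recompute the hap.py summary metrics from the raw counts in the table
# (DeepVariant SNP row).
metrics=$(awk 'BEGIN {
  tp = 3344672; fn = 20455                  # TRUTH.TP, TRUTH.FN
  qtot = 3840687; fp = 6131; unk = 488147   # QUERY.TOTAL, QUERY.FP, QUERY.UNK
  recall = tp / (tp + fn)
  qtp = qtot - unk - fp                     # query true positives
  precision = qtp / (qtp + fp)
  f1 = 2 * precision * recall / (precision + recall)
  printf "recall=%.6f precision=%.6f f1=%.6f\n", recall, precision, f1
}')
echo "$metrics"
```

The printed values reproduce the Recall, Precision, and F1_Score columns of that row.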
3.1 DeepVariant
Only minor performance differences for the simple workflow with DeepVariant:
| Metric | Simple Workflow | nf-core/sarek | Difference |
|---|---|---|---|
| SNP Recall | 0.993921 | 0.993885 | ~0.00004 |
| SNP Precision | 0.998171 | 0.998421 | ~0.00025 |
| SNP F1 Score | 0.996042 | 0.996148 | ~0.00011 |
| INDEL Recall | 0.987839 | 0.989684 | ~0.00185 |
| INDEL Precision | 0.993805 | 0.994586 | ~0.00078 |
| INDEL F1 Score | 0.990813 | 0.992129 | ~0.00132 |
3.2 FreeBayes
Only minor performance differences for the simple workflow with FreeBayes:
| Metric | Simple Workflow | nf-core/sarek | Difference |
|---|---|---|---|
| SNP Recall | 0.993831 | 0.993862 | ~0.00003 |
| SNP Precision | 0.962802 | 0.962893 | ~0.00009 |
| SNP F1 Score | 0.978070 | 0.978133 | ~0.00006 |
| INDEL Recall | 0.958656 | 0.959965 | ~0.00131 |
| INDEL Precision | 0.953154 | 0.964047 | ~0.01089 |
| INDEL F1 Score | 0.955897 | 0.962002 | ~0.00610 |
Recap
- In practice, using fewer steps can help reduce the computational cost of running the analysis.
- A simple workflow, following the recommendations in the tool documentation, can still achieve strong and reliable results.
References
- hap.py tool for benchmarking germline short-read variants
  https://github.com/qbic-projects/QSARK/tree/main
- Genome in a Bottle (GIAB) program for gold standard datasets
  https://www.nist.gov/programs-projects/genome-bottle