Variant Calling (Part 5): Benchmarking Germline Variant Calling with nf-core/sarek

March 8, 2026 · 13 min read

Founder at G Labs

We have a series for variant calling, however, one of the most important thing is not about how your workflow is advanced with modern framework, it is about the scoring system that show your workflow achieve high quality score with gold standard criteria. Therefore, in this series, I used the Genome In A Bottle (GIAB) with the truth set variants are curated. With nf-core/sarek-it shows the consistency with the results has been published in its paper with more than 99% in SNP/INDEL variants.

tip

The environment setups and scripts are uploaded to nf-germline-short-read-variant-calling
Cloning repository with tag 0.3.0 to reproduce this blog contents

Clone the repository

git clone git@github.com:nttg8100/nf-germline-short-read-variant-calling.git -b 0.3.0
cd nf-germline-short-read-variant-calling
cd benchmark/pipelines/sarek

1. Prepare Dataset

Prepare the input.csv as below, nextflow can automatically download the dataset

patient,sex,status,sample,lane,fastq_1,fastq_2
HG002,XY,0,HG002,1,s3://human-pangenomics/NHGRI_UCSC_panel/HG002/hpp_HG002_NA24385_son_v1/ILMN/downsampled/HG002_HiSeq30x_subsampled_R1.fastq.gz,s3://human-pangenomics/NHGRI_UCSC_panel/HG002/hpp_HG002_NA24385_son_v1/ILMN/downsampled/HG002_HiSeq30x_subsampled_R2.fastq.gz

The benchmark uses the HG002 sample from the Genome in a Bottle (GIAB) consortium. HG002 is widely used as a gold standard dataset for benchmarking germline variant calling pipelines because it provides:

High-confidence truth variants
High-confidence confident regions
Well-curated reference datasets for SNPs and indels

Dataset characteristics:

Sample: HG002 (NA24385)
Sequencing: Illumina short reads
Coverage: ~30X
Reference genome: GRCh38
Truth set: GIAB high-confidence variant calls

The truth VCF and confident region BED files are used to compare predicted variants against the gold standard.

2. Run The nf-core/sarek

I allocated more computing resources so it can be ran faster with the below config customized_labels.config

// HPC internet is quite slow, so we increase the pull timeout for singularity containers
singularity.pullTimeout="100.h"
process {
    cpus   = { 2      * task.attempt }
    memory = { 4.GB   * task.attempt }
    time   = { 4.h    * task.attempt }

    // memory errors which should be retried. otherwise error out
    errorStrategy = { task.exitStatus in ((130..145) + 104) ? 'retry' : 'finish' }
    maxRetries    = 1
    maxErrors     = '-1'

    // Process-specific resource requirements
    // See https://www.nextflow.io/docs/latest/config.html#config-process-selectors
    withLabel:error_ignore {
        errorStrategy = 'ignore'
    }
    withLabel:error_retry {
        errorStrategy = 'retry'
        maxRetries    = 2
    }
    withLabel:process_single {
        cpus   = { 3                   }
        memory = { 6.GB * task.attempt }
        time   = { 4.h  * task.attempt }
    }
    withLabel:process_low {
        cpus   = { 6     * task.attempt }
        memory = { 12.GB * task.attempt }
        time   = { 4.h   * task.attempt }
    }
    withLabel:process_medium {
        cpus   = { 64     * task.attempt }
        memory = { 120.GB * task.attempt }
        time   = { 8.h   * task.attempt }
    }
    withLabel:process_high {
        cpus   = { 64    * task.attempt }
        memory = { 120.GB * task.attempt }
        time   = { 16.h  * task.attempt }
    }
    withLabel:process_long {
        time   = { 20.h  * task.attempt }
    }
    withLabel:process_high_memory {
        memory = { 200.GB * task.attempt }
    }
}
process.executor="slurm"

cd benchmark/pipelines/sarek
pixi run nextflow run nf-core/sarek -r 3.7.1 -profile docker,test_full_germline \
    -c customized_labels.config \
    --input ./input.csv \
    --outdir bwamem \
    --tools "deepvariant,freebayes,strelka,manta,haplotypecaller" \
    --aligner "bwa-mem" \
    -resume

The results then are copied to the corrected folder on benchmark

├── pipelines
│   └── sarek
│       ├── customized_labels.config
│       ├── input.csv
│       ├── pixi.lock
│       ├── pixi.toml
│       ├── run_pipeline.sh
│       └── variant_calling
│           ├── deepvariant
│           │   └── HG002
│           │       ├── HG002.deepvariant.g.vcf.gz
│           │       ├── HG002.deepvariant.g.vcf.gz.tbi
│           │       ├── HG002.deepvariant.vcf.gz
│           │       └── HG002.deepvariant.vcf.gz.tbi
│           ├── filtered
│           │   └── HG002
│           │       ├── HG002.deepvariant.bcftools_filtered.vcf.gz
│           │       ├── HG002.deepvariant.bcftools_filtered.vcf.gz.tbi
│           │       ├── HG002.freebayes.filtered.bcftools_filtered.vcf.gz
│           │       ├── HG002.freebayes.filtered.bcftools_filtered.vcf.gz.tbi
│           │       ├── HG002.haplotypecaller.filtered.bcftools_filtered.vcf.gz
│           │       ├── HG002.haplotypecaller.filtered.bcftools_filtered.vcf.gz.tbi
│           │       ├── HG002.strelka.variants.bcftools_filtered.vcf.gz
│           │       └── HG002.strelka.variants.bcftools_filtered.vcf.gz.tbi
│           ├── freebayes
│           │   └── HG002
│           │       ├── HG002.freebayes.filtered.vcf.gz
│           │       ├── HG002.freebayes.filtered.vcf.gz.tbi
│           │       ├── HG002.freebayes.vcf.gz
│           │       └── HG002.freebayes.vcf.gz.tbi
│           ├── haplotypecaller
│           │   └── HG002
│           │       ├── HG002.haplotypecaller.filtered.vcf.gz
│           │       ├── HG002.haplotypecaller.filtered.vcf.gz.tbi
│           │       ├── HG002.haplotypecaller.vcf.gz
│           │       └── HG002.haplotypecaller.vcf.gz.tbi
│           ├── manta
│           │   └── HG002
│           │       ├── HG002.manta.diploid_sv.vcf.gz
│           │       └── HG002.manta.diploid_sv.vcf.gz.tbi
│           └── strelka
│               └── HG002
│                   ├── HG002.strelka.genome.vcf.gz
│                   ├── HG002.strelka.genome.vcf.gz.tbi
│                   ├── HG002.strelka.variants.vcf.gz
│                   └── HG002.strelka.variants.vcf.gz.tbi

3. Benchmark With HG002 Dataset

3.1 Download Truth Set And Reference Data

It is for proof of concept on how to benchmark the data, we ran with only with a simple, there are more many samples from GIAB or another public truth datasets can be used by the downloading script download.sh in reference folder

# download GATK reference genome
aws s3 cp --recursive --no-sign-request s3://ngi-igenomes/igenomes/Homo_sapiens/GATK/GRCh38/Sequence/WholeGenomeFasta/  .

# download reference files for benchmarking
wget https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/AshkenazimTrio/HG002_NA24385_son/NISTv4.2.1/GRCh38/HG002_GRCh38_1_22_v4.2.1_benchmark.vcf.gz
wget https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/AshkenazimTrio/HG002_NA24385_son/NISTv4.2.1/GRCh38/HG002_GRCh38_1_22_v4.2.1_benchmark.vcf.gz.tbi
# download bed files for confident regions
wget https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/AshkenazimTrio/HG002_NA24385_son/NISTv4.2.1/GRCh38/HG002_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed

Download the data by running this command

cd benchmark/references
pixi run download.sh

The downloaded files are showed as below

benchmark/
├── Readme.md
├── benchmark.sh
├── benchmark_rewrite.sh
├── pipelines
│   └── sarek
│       ├── customized_labels.config
│       ├── input.csv
│       ├── pixi.lock
│       ├── pixi.toml
│       ├── run_pipeline.sh
│       └── variant_calling
│           ├── deepvariant
│           │   └── HG002
│           │       ├── HG002.deepvariant.g.vcf.gz
│           │       ├── HG002.deepvariant.g.vcf.gz.tbi
│           │       ├── HG002.deepvariant.vcf.gz
│           │       └── HG002.deepvariant.vcf.gz.tbi
│           ├── filtered
│           │   └── HG002
│           │       ├── HG002.deepvariant.bcftools_filtered.vcf.gz
│           │       ├── HG002.deepvariant.bcftools_filtered.vcf.gz.tbi
│           │       ├── HG002.freebayes.filtered.bcftools_filtered.vcf.gz
│           │       ├── HG002.freebayes.filtered.bcftools_filtered.vcf.gz.tbi
│           │       ├── HG002.haplotypecaller.filtered.bcftools_filtered.vcf.gz
│           │       ├── HG002.haplotypecaller.filtered.bcftools_filtered.vcf.gz.tbi
│           │       ├── HG002.strelka.variants.bcftools_filtered.vcf.gz
│           │       └── HG002.strelka.variants.bcftools_filtered.vcf.gz.tbi
│           ├── freebayes
│           │   └── HG002
│           │       ├── HG002.freebayes.filtered.vcf.gz
│           │       ├── HG002.freebayes.filtered.vcf.gz.tbi
│           │       ├── HG002.freebayes.vcf.gz
│           │       └── HG002.freebayes.vcf.gz.tbi
│           ├── haplotypecaller
│           │   └── HG002
│           │       ├── HG002.haplotypecaller.filtered.vcf.gz
│           │       ├── HG002.haplotypecaller.filtered.vcf.gz.tbi
│           │       ├── HG002.haplotypecaller.vcf.gz
│           │       └── HG002.haplotypecaller.vcf.gz.tbi
│           ├── manta
│           │   └── HG002
│           │       ├── HG002.manta.diploid_sv.vcf.gz
│           │       └── HG002.manta.diploid_sv.vcf.gz.tbi
│           └── strelka
│               └── HG002
│                   ├── HG002.strelka.genome.vcf.gz
│                   ├── HG002.strelka.genome.vcf.gz.tbi
│                   ├── HG002.strelka.variants.vcf.gz
│                   └── HG002.strelka.variants.vcf.gz.tbi
├── pixi.lock
├── pixi.toml
├── references
│   ├── HG002_GRCh38_1_22_v4.2.1_benchmark.vcf.gz
│   ├── HG002_GRCh38_1_22_v4.2.1_benchmark.vcf.gz.tbi
│   ├── HG002_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed
│   ├── Homo_sapiens_assembly38.dict
│   ├── Homo_sapiens_assembly38.fasta
│   ├── Homo_sapiens_assembly38.fasta.fai
│   ├── download.sh
│   ├── pixi.lock
│   └── pixi.toml

3.2 Tool For Benchmarking

Here, we use hap.py is the tool that help to benchmark for germline variant calling with SNP and INDEL variants. For more detail, take a look at the original paper here https://www.nature.com/articles/s41587-019-0054-x.

What it does:

Reformat to find the match between the query and the truth set variant record on the vcf file
You can see the below example where these 2 variants are exactly matched with each others, however, it was represented by different position, alternative nucleotides
Reformat the variant records, then, hap.py calculates the metrics that can be represented to how good each tools for variant calling are with

Example output table:

Type	Filter	TRUTH.TOTAL	TRUTH.TP	TRUTH.FN	QUERY.TOTAL	QUERY.FP	QUERY.UNK	FP.gt	FP.al	METRIC.Recall	METRIC.Precision	METRIC.Frac_NA	METRIC.F1_Score	TRUTH.TOTAL.TiTv_ratio	QUERY.TOTAL.TiTv_ratio	TRUTH.TOTAL.het_hom_ratio	QUERY.TOTAL.het_hom_ratio
INDEL	ALL	525469	520048	5421	908475	2949	363798	2010	455	0.989684	0.9945860000000001	0.400449	0.992129			1.528275734138537	1.9313187724266925
INDEL	PASS	525469	520048	5421	908475	2949	363798	2010	455	0.989684	0.9945860000000001	0.400449	0.992129			1.528275734138537	1.9313187724266925
SNP	ALL	3365127	3344549	20578	3841759	5292	490176	1401	562	0.993885	0.9984209999999999	0.127592	0.9961479999999999	2.1001284867575745	1.9839843850083037	1.5811958532541182	1.4858796715961773
SNP	PASS	3365127	3344549	20578	3841759	5292	490176	1401	562	0.993885	0.9984209999999999	0.127592	0.9961479999999999	2.1001284867575745	1.9839843850083037	1.5811958532541182	1.4858796715961773

info

Happy outputs columns are explained by its documentation as below

Happy outputs a set of stratification columns, followed by metrics columns. Stratification columns may contain the placeholder "*" value, which indicates that counts in this row are aggregated over all possible values of the column. In case of a subtype, this means that the counts have been summed across all subtypes. In case of QQ, it means that the counts correspond to all variants rather than only ones below a QQ threshold.

Stratification Column	Description
Type	Variant type (SNP / INDEL)
Subtype	Variant Subtype (ti/tv/indel length, see above
Subset	Subset of the genome/stratification region
Filter	Variant filters: PASS, SEL, ALL, or a particular filter from the query VCF
Genotype	Genotype of benchmarked variants (het / homalt / hetalt)
QQ.Field	Which field from the original VCF was used to produce QQ values in truth and query
QQ	QQ threshold for ROC values

Metric Column	Description
METRIC.Recall	Recall for truth variant representation = TRUTH.TP / (TRUTH.TP + TRUTH.FN)
METRIC.Precision	Precision of query variants = QUERY.TP / (QUERY.TP + QUERY.FP)
METRIC.Frac_NA	Fraction of non-assessed query calls = QUERY.UNK / QUERY.TOTAL
METRIC.F1_Score	Harmonic mean of precision and recall = 2METRIC.RecallMetric.Precision/(METRIC.Recall + METRIC.Precision)
TRUTH.TOTAL	Total number of truth variants
TRUTH.TP	Number of true-positive calls in truth representation (counted via the truth sample column)
TRUTH.FN	Number of false-negative calls = calls in truth without matching query call
QUERY.TOTAL	Total number of query calls
QUERY.TP	Number of true positive calls in query representation (counted via the query sample column)
QUERY.FP	Number of false-positive calls in the query file (mismatched query calls within the confident regions)
QUERY.UNK	Number of query calls outside the confident regions
FP.gt	Number of genotype mismatches (alleles match, but different zygosity)
FP.al	Number of allele mismatches (variants matched by position and not by haplotype)
TRUTH.TOTAL.TiTv_ratio	Transition / Transversion ratio for all truth variants
TRUTH.TOTAL.het_hom_ratio	Het/Hom ratio for all truth variants
TRUTH.FN.TiTv_ratio	Transition / Transversion ratio for false-negative variants
TRUTH.FN.het_hom_ratio	Het/Hom ratio for false-negative variants
TRUTH.TP.TiTv_ratio	Transition / Transversion ratio for true positive variants
TRUTH.TP.het_hom_ratio	Het/Hom ratio for true positive variants
QUERY.FP.TiTv_ratio	Transition / Transversion ratio for false positive variants
QUERY.FP.het_hom_ratio	Het/Hom ratio for false-positive variants
QUERY.TOTAL.TiTv_ratio	Transition / Transversion ratio for all query variants
QUERY.TOTAL.het_hom_ratio	Het/Hom ratio for all query variants
QUERY.TP.TiTv_ratio	Transition / Transversion ratio for true positive variants (query representation)
QUERY.TP.het_hom_ratio	Het/Hom ratio for true positive variants (query representation)
QUERY.UNK.TiTv_ratio	Transition / Transversion ratio for unknown variants
QUERY.UNK.het_hom_ratio	Het/Hom ratio for unknown variants
Subset.Size	When using stratification regions, this gives the number of nucleotides contained in the current subset
Subset.IS_CONF.Size	This gives the number of confident bases (`-f` regions) in the current subset

3.3 Script And Environment Setup For Benchmarking

3.3.1 Install

#!/bin/bash
# Benchmark script for HG002 with DeepVariant, FreeBayes, HaplotypeCaller, Manta, Strelka (relative paths)
set -euo pipefail

HGREF="references/Homo_sapiens_assembly38.fasta"
REF="HG002"
BED="references/${REF}_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed"
TRUTH="references/${REF}_GRCh38_1_22_v4.2.1_benchmark.vcf.gz"
SDF="references/grch38.sdf"
THREADS=16

RESULTSDIR="results"
mkdir -p "$RESULTSDIR"

TOOLS=(deepvariant freebayes haplotypecaller manta strelka)

VCF_DEEPVARIANT="pipelines/sarek/variant_calling/deepvariant/HG002/HG002.deepvariant.vcf.gz"
VCF_FREEBAYES="pipelines/sarek/variant_calling/freebayes/HG002/HG002.freebayes.vcf.gz"
VCF_HAPLOTYPECALLER="pipelines/sarek/variant_calling/haplotypecaller/HG002/HG002.haplotypecaller.vcf.gz"
VCF_MANTA="pipelines/sarek/variant_calling/manta/HG002/HG002.manta.diploid_sv.vcf.gz"
VCF_STRELKA="pipelines/sarek/variant_calling/strelka/HG002/HG002.strelka.variants.vcf.gz"

# Path map
declare -A VCF_PATHS=(
  [deepvariant]="$VCF_DEEPVARIANT"
  [freebayes]="$VCF_FREEBAYES"
  [haplotypecaller]="$VCF_HAPLOTYPECALLER"
  [manta]="$VCF_MANTA"
  [strelka]="$VCF_STRELKA"
)

for TOOL in "${TOOLS[@]}"; do
  TOOLDIR="$RESULTSDIR/$TOOL"
  mkdir -p "$TOOLDIR"
  VCF="${VCF_PATHS[$TOOL]}"
  LOG="$TOOLDIR/${REF}_${TOOL}.log"
  SUMMARY="$TOOLDIR/${REF}_${TOOL}.summary.csv"
  FORMATTED="$TOOLDIR/${REF}_${TOOL}.formatted.summary.csv"
  OUTPUT="$TOOLDIR/${REF}_${TOOL}"

  # If the tool output folder already exists with summary file, continue from AWK formatting (line 55)
  if [[ -f "${OUTPUT}.summary.csv" ]]; then
    echo "Output for $TOOL already exists, formatting summary."
    awk -v d="${TOOL}_${REF}" -F',' 'FNR==1{a="run"} FNR>1{a=d} {print $0",\t"a}' "${OUTPUT}.summary.csv" > "$FORMATTED"
    echo "Completed $TOOL (using existing output)"
    continue
  fi

  if [[ ! -f "$VCF" ]]; then
    echo "VCF file not found for $TOOL: $VCF" >&2
    continue
  fi

  echo "Benchmarking $TOOL for $REF ..."
  hap.py "$TRUTH" "$VCF" \
    -o "$OUTPUT" \
    -V --engine=vcfeval --threads "$THREADS" --engine-vcfeval-template "$SDF" \
    -f "$BED" \
    --logfile "$LOG" \
    --scratch-prefix .

  if [[ -f "${OUTPUT}.summary.csv" ]]; then
    awk -v d="${TOOL}_${REF}" -F',' 'FNR==1{a="run"} FNR>1{a=d} {print $0",\t"a}' "${OUTPUT}.summary.csv" > "$FORMATTED"
  else
    echo "Warning: ${OUTPUT}.summary.csv not found for $TOOL" >&2
  fi

  echo "Completed $TOOL"
done

MERGED="$RESULTSDIR/merged_${REF}_benchmark.csv"
echo "Merging summaries to $MERGED ..."
awk '(NR == 1) || (FNR > 1)' "$RESULTSDIR"/*/*.formatted.summary.csv > "$MERGED"
echo "All done. Results for $REF in $MERGED"

3.3.2 Running Benchmarking Script

Then you can obtain the result as below

├── results
│   └── sarek
│       ├── deepvariant
│       │   ├── HG002_deepvariant.extended.csv
│       │   ├── HG002_deepvariant.formatted.summary.csv
│       │   ├── HG002_deepvariant.log
│       │   ├── HG002_deepvariant.metrics.json.gz
│       │   ├── HG002_deepvariant.roc.Locations.INDEL.PASS.csv.gz
│       │   ├── HG002_deepvariant.roc.Locations.INDEL.csv.gz
│       │   ├── HG002_deepvariant.roc.Locations.SNP.PASS.csv.gz
│       │   ├── HG002_deepvariant.roc.Locations.SNP.csv.gz
│       │   ├── HG002_deepvariant.roc.all.csv.gz
│       │   ├── HG002_deepvariant.runinfo.json
│       │   ├── HG002_deepvariant.summary.csv
│       │   ├── HG002_deepvariant.vcf.gz
│       │   └── HG002_deepvariant.vcf.gz.tbi
│       ├── freebayes
│       │   ├── HG002_freebayes.extended.csv
│       │   ├── HG002_freebayes.formatted.summary.csv
│       │   ├── HG002_freebayes.log
│       │   ├── HG002_freebayes.metrics.json.gz
│       │   ├── HG002_freebayes.roc.Locations.INDEL.PASS.csv.gz
│       │   ├── HG002_freebayes.roc.Locations.INDEL.csv.gz
│       │   ├── HG002_freebayes.roc.Locations.SNP.PASS.csv.gz
│       │   ├── HG002_freebayes.roc.Locations.SNP.csv.gz
│       │   ├── HG002_freebayes.roc.all.csv.gz
│       │   ├── HG002_freebayes.runinfo.json
│       │   ├── HG002_freebayes.summary.csv
│       │   ├── HG002_freebayes.vcf.gz
│       │   └── HG002_freebayes.vcf.gz.tbi
│       ├── haplotypecaller
│       │   ├── HG002_haplotypecaller.extended.csv
│       │   ├── HG002_haplotypecaller.formatted.summary.csv
│       │   ├── HG002_haplotypecaller.log
│       │   ├── HG002_haplotypecaller.metrics.json.gz
│       │   ├── HG002_haplotypecaller.roc.Locations.INDEL.PASS.csv.gz
│       │   ├── HG002_haplotypecaller.roc.Locations.INDEL.csv.gz
│       │   ├── HG002_haplotypecaller.roc.Locations.SNP.PASS.csv.gz
│       │   ├── HG002_haplotypecaller.roc.Locations.SNP.csv.gz
│       │   ├── HG002_haplotypecaller.roc.all.csv.gz
│       │   ├── HG002_haplotypecaller.runinfo.json
│       │   ├── HG002_haplotypecaller.summary.csv
│       │   ├── HG002_haplotypecaller.vcf.gz
│       │   └── HG002_haplotypecaller.vcf.gz.tbi
│       ├── manta
│       │   ├── HG002_manta.extended.csv
│       │   ├── HG002_manta.formatted.summary.csv
│       │   ├── HG002_manta.log
│       │   ├── HG002_manta.metrics.json.gz
│       │   ├── HG002_manta.roc.Locations.INDEL.PASS.csv.gz
│       │   ├── HG002_manta.roc.Locations.INDEL.csv.gz
│       │   ├── HG002_manta.roc.Locations.SNP.PASS.csv.gz
│       │   ├── HG002_manta.roc.Locations.SNP.csv.gz
│       │   ├── HG002_manta.roc.all.csv.gz
│       │   ├── HG002_manta.runinfo.json
│       │   ├── HG002_manta.summary.csv
│       │   ├── HG002_manta.vcf.gz
│       │   └── HG002_manta.vcf.gz.tbi
│       ├── merged_HG002_benchmark.csv
│       └── strelka
│           ├── HG002_strelka.extended.csv
│           ├── HG002_strelka.formatted.summary.csv
│           ├── HG002_strelka.log
│           ├── HG002_strelka.metrics.json.gz
│           ├── HG002_strelka.roc.Locations.INDEL.PASS.csv.gz
│           ├── HG002_strelka.roc.Locations.INDEL.csv.gz
│           ├── HG002_strelka.roc.Locations.SNP.PASS.csv.gz
│           ├── HG002_strelka.roc.Locations.SNP.csv.gz
│           ├── HG002_strelka.roc.all.csv.gz
│           ├── HG002_strelka.runinfo.json
│           ├── HG002_strelka.summary.csv
│           ├── HG002_strelka.vcf.gz
│           └── HG002_strelka.vcf.gz.tbi

3.3.3 Overview From Output

I benchmarked several popular variant callers—DeepVariant, FreeBayes, HaplotypeCaller, Strelka, and Manta—on the HG002 dataset, evaluating performance separately for SNPs and INDELs using Recall, Precision, and F1-score.

Overall, DeepVariant achieved the best performance, showing the highest balance between recall and precision for both SNPs (F1 ≈ 0.996) and INDELs (F1 ≈ 0.992). HaplotypeCaller and Strelka also performed strongly, with slightly lower recall or precision depending on the variant type. FreeBayes showed comparable recall but noticeably lower precision, indicating more false positives.

SNP detection was consistently strong across most tools (recall ≈ 0.99), while INDEL calling showed more variability, highlighting the greater difficulty of detecting small insertions and deletions accurately.

As expected, Manta performed poorly for SNPs and INDELs, since it is designed primarily for structural variant detection rather than small variants.

Overall, these results confirm that modern machine-learning–based callers like DeepVariant provide the most balanced performance, while traditional callers remain competitive but may require stricter filtering to control false positives.

tip

modern machine-learning–based callers like DeepVariant provide the most balanced performance, while traditional callers remain competitive but may require stricter filtering to control false positives.
It shows the consistent with old sarek version from its publication in June 2024: https://academic.oup.com/nargab/article/doi/10.1093/nargab/lqae031/7658070

See the merged_HG002_benchmark.csv is the aggregated results are showed as below

Type	Filter	TRUTH.TOTAL	TRUTH.TP	TRUTH.FN	QUERY.TOTAL	QUERY.FP	QUERY.UNK	FP.gt	FP.al	METRIC.Recall	METRIC.Precision	METRIC.Frac_NA	METRIC.F1_Score	TRUTH.TOTAL.TiTv_ratio	QUERY.TOTAL.TiTv_ratio	TRUTH.TOTAL.het_hom_ratio	QUERY.TOTAL.het_hom_ratio	run
INDEL	ALL	525469	520048	5421	908475	2949	363798	2010	455	0.989684	0.994586	0.400449	0.992129			1.528275734138537	1.9313187724266925	deepvariant_HG002
INDEL	PASS	525469	520048	5421	908475	2949	363798	2010	455	0.989684	0.994586	0.400449	0.992129			1.528275734138537	1.9313187724266925	deepvariant_HG002
SNP	ALL	3365127	3344549	20578	3841759	5292	490176	1401	562	0.993885	0.998421	0.127592	0.996148	2.1001284867575745	1.9839843850083037	1.5811958532541182	1.4858796715961773	deepvariant_HG002
SNP	PASS	3365127	3344549	20578	3841759	5292	490176	1401	562	0.993885	0.998421	0.127592	0.996148	2.1001284867575745	1.9839843850083037	1.5811958532541182	1.4858796715961773	deepvariant_HG002
INDEL	ALL	525469	504432	21037	881738	19441	341005	13873	1528	0.959965	0.964047	0.386742	0.962002			1.528275734138537	1.8211051374734122	freebayes_HG002
INDEL	PASS	525469	504432	21037	881738	19441	341005	13873	1528	0.959965	0.964047	0.386742	0.962002			1.528275734138537	1.8211051374734122	freebayes_HG002
SNP	ALL	3365127	3344473	20654	4423242	129028	946030	3762	16431	0.993862	0.962893	0.213877	0.978133	2.1001284867575745	1.8132303933942775	1.5811958532541182	1.8836484571228804	freebayes_HG002
SNP	PASS	3365127	3344473	20654	4423242	129028	946030	3762	16431	0.993862	0.962893	0.213877	0.978133	2.1001284867575745	1.8132303933942775	1.5811958532541182	1.8836484571228804	freebayes_HG002
INDEL	ALL	525469	516926	8543	895219	6343	351037	3797	436	0.983742	0.988344	0.392124	0.986038			1.528275734138537	1.8797119780755498	haplotypecaller_HG002
INDEL	PASS	525469	516926	8543	895219	6343	351037	3797	436	0.983742	0.988344	0.392124	0.986038			1.528275734138537	1.8797119780755498	haplotypecaller_HG002
SNP	ALL	3365127	3340110	25017	3957190	28452	587714	2758	2288	0.992566	0.991556	0.148518	0.992061	2.1001284867575745	1.9560158340309197	1.5811958532541182	1.7246457081290798	haplotypecaller_HG002
SNP	PASS	3365127	3340110	25017	3957190	28452	587714	2758	2288	0.992566	0.991556	0.148518	0.992061	2.1001284867575745	1.9560158340309197	1.5811958532541182	1.7246457081290798	haplotypecaller_HG002
INDEL	ALL	525469	8	525461	7821	114	7699	0	42	1.5e-05	0.065574	0.984401	3e-05			1.528275734138537	1.0282749675745784	manta_HG002
INDEL	PASS	525469	8	525461	6877	106	6763	0	36	1.5e-05	0.070175	0.983423	3e-05			1.528275734138537	1.060851318944844	manta_HG002
SNP	ALL	3365127	3	3365124	1416	4	1408	0	0	1e-06	0.5	0.99435	2e-06	2.1001284867575745	0.9005524861878453	1.5811958532541182	1.002828854314003	manta_HG002
SNP	PASS	3365127	3	3365124	1230	4	1222	0	0	1e-06	0.5	0.993496	2e-06	2.1001284867575745	0.8307692307692308	1.5811958532541182	0.9967532467532467	manta_HG002
INDEL	ALL	525469	515560	9909	875380	6569	334974	4149	572	0.981143	0.987844	0.382661	0.984482			1.528275734138537	1.8953993716115898	strelka_HG002
INDEL	PASS	525469	513675	11794	834212	3403	298722	2401	324	0.977555	0.993645	0.358089	0.985535			1.528275734138537	1.8205751890885689	strelka_HG002
SNP	ALL	3365127	3341546	23581	4233605	67357	821358	4302	7749	0.992993	0.98026	0.194009	0.986585	2.1001284867575745	1.8583435698762165	1.5811958532541182	1.8781977072413574	strelka_HG002
SNP	PASS	3365127	3323449	41678	3676907	4389	345864	909	758	0.987615	0.998682	0.094064	0.993118	2.1001284867575745	2.038753419553027	1.5811958532541182	1.57305061157284	strelka_HG002

Recap

info

Using modern framwork is good but the more important factor is about how much your pipeline show its accuracy, reliability
For developing the pipeline, it should be carefully reviewed and benchmarked with its own gold standard
However, using additional steps may not improve the results while consuming more computing resources (higher running cost), next blog will coming soon

References

hap.py tool for benchmarking germline short-read variants
https://github.com/qbic-projects/QSARK/tree/main
Genome in a Bottle (GIAB) program for gold standard datasets

oaicite:4

https://www.nist.gov/programs-projects/genome-bottle
nf-germline-short-read-variant-calling https://github.com/nttg8100/nf-germline-short-read-variant-calling

1. Prepare Dataset​

2. Run The nf-core/sarek​

3. Benchmark With HG002 Dataset​

3.1 Download Truth Set And Reference Data​

3.2 Tool For Benchmarking​

3.3 Script And Environment Setup For Benchmarking​

3.3.1 Install​

3.3.2 Running Benchmarking Script​

3.3.3 Overview From Output​

Recap​

References​