Skip to main content

Variant Calling (Part 8): Structural Variant Calling Short Read Benchmark

· 8 min read
Thanh-Giang Tan Nguyen
Founder at RIVER

Structural variants (SVs) — deletions, insertions, duplications, inversions, and translocations — are large genomic alterations (typically ≥50 bp) that play a major role in disease but are much harder to detect than SNPs or small indels. In this post, we benchmark Manta, the SV caller integrated in nf-core/sarek, against the GIAB HG002 truth set using Truvari, and explore why short-read SV calling remains a fundamentally difficult problem.

tip

Clone the repository:

git clone git@github.com:nttg8100/nf-germline-short-read-variant-calling.git -b 0.6.0
cd nf-germline-short-read-variant-calling
cd benchmark

1. Reference and Truth Set

Before running the benchmark, download all required reference files using the script at benchmark/references/download.sh:

# Download GATK reference genomes (GRCh38 and GRCh37/hg19)
aws s3 cp --recursive --no-sign-request \
s3://ngi-igenomes/igenomes/Homo_sapiens/GATK/GRCh38/Sequence/WholeGenomeFasta/ .
aws s3 cp --recursive --no-sign-request \
s3://ngi-igenomes/igenomes/Homo_sapiens/GATK/GRCh37/Sequence/WholeGenomeFasta/ .

# Download GIAB HG002 SNP/INDEL truth set (GRCh38, NISTv4.2.1)
wget https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/AshkenazimTrio/HG002_NA24385_son/NISTv4.2.1/GRCh38/HG002_GRCh38_1_22_v4.2.1_benchmark.vcf.gz
wget https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/AshkenazimTrio/HG002_NA24385_son/NISTv4.2.1/GRCh38/HG002_GRCh38_1_22_v4.2.1_benchmark.vcf.gz.tbi
wget https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/AshkenazimTrio/HG002_NA24385_son/NISTv4.2.1/GRCh38/HG002_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed

# Download GIAB HG002 SV truth set (hg19, NIST_SV_v0.6)
wget https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/AshkenazimTrio/HG002_NA24385_son/NIST_SV_v0.6/HG002_SVs_Tier1_v0.6.vcf.gz
wget https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/AshkenazimTrio/HG002_NA24385_son/NIST_SV_v0.6/HG002_SVs_Tier1_v0.6.bed

# Download liftover chain file (hg38 → hg19) for coordinate conversion
wget https://hgdownload.cse.ucsc.edu/goldenpath/hg38/liftOver/hg38ToHg19.over.chain.gz
info

Why two reference genomes?

  • The GRCh38 reference is used by nf-core/sarek for alignment and SV calling
  • The GRCh37/hg19 reference (human_g1k_v37_decoy.fasta) is required because the GIAB SV truth set (v0.6) is defined in hg19 coordinates — liftover is needed to compare the two
  • The SNP/INDEL truth set is in GRCh38, so no liftover is needed for small variant benchmarking

2. Background: Why SV Calling Is Hard for Short Reads

Short-read sequencing (Illumina, ~150 bp reads) excels at SNP and small indel detection, but it has fundamental limitations for SVs:

ChallengeDescription
Read length vs. SV sizeSVs range from 50 bp to megabases — far larger than a single read
Repetitive regionsMany SVs occur in segmental duplications or low-complexity regions, causing mapping ambiguity
Breakpoint imprecisionShort reads cannot span large SVs, so breakpoints are inferred rather than directly observed
Split-read signal is weakOnly a fraction of reads at a breakpoint are split, limiting evidence depth
Liftover artifactsBenchmarking requires coordinate conversion between assemblies, introducing noise

Manta addresses these challenges by combining split-read and paired-end discordant read signals to detect SVs. However, even with a 30X WGS dataset like HG002, short-read SV calling remains fundamentally limited compared to long-read approaches.

3. Running Manta via nf-core/sarek

Manta was run as part of the nf-core/sarek pipeline (v3.7.1) on the HG002 30X WGS dataset aligned to GRCh38.

nextflow run nf-core/sarek -r 3.7.1 -profile docker,test_full_germline \
-c customized_labels.config \
--input ./input.csv \
--outdir bwamem \
--tools "deepvariant,freebayes,strelka,manta,haplotypecaller" \
--aligner "bwa-mem" \
-resume

The Manta output VCF (HG002.manta.diploid_sv.vcf.gz) contains diploid SV calls including DEL, INS, INV, DUP, and BND (breakend) types.

4. Benchmarking Methodology

4.1 Truth Set

The benchmark uses the GIAB HG002 Structural Variant Tier 1 truth set (v0.6):

  • Truth VCF: HG002_SVs_Tier1_v0.6.vcf.gz
  • Confident regions BED: HG002_SVs_Tier1_v0.6.bed
  • Reference: human_g1k_v37_decoy.fasta (hg19/GRCh37)

The truth set is defined in hg19 coordinates, while sarek outputs VCFs in GRCh38. A liftover step is required before benchmarking.

4.2 Benchmarking Pipeline

The full benchmarking workflow is implemented in benchmark/benchmark_sv_sarek.sh and consists of four steps:

cd benchmark
pixi run --environment svbench bash benchmark_sv_sarek.sh

Step 1 — Normalize the query VCF (split multi-allelic, left-align indels):

bcftools norm -m -any \
-o HG002_manta_hg38_normalized.vcf \
pipelines/sarek/variant_calling/manta/HG002/HG002.manta.diploid_sv.vcf.gz

Step 2 — Liftover from hg38 to hg19 using CrossMap:

CrossMap vcf \
references/hg38ToHg19.over.chain.gz \
HG002_manta_hg38_normalized.vcf \
references/human_g1k_v37_decoy.fasta \
HG002_manta_hg19_raw.vcf
info

During liftover, 3,945 records could not be mapped to hg19 and were written to the .unmap file. These are excluded from downstream benchmarking.

Step 3 — Remove chr prefix, sort and index (to match truth set naming):

sed -e 's/chr//g' HG002_manta_hg19_raw.vcf > HG002_manta_hg19_nochr.vcf
bcftools sort -o HG002_manta_hg19.vcf HG002_manta_hg19_nochr.vcf
bgzip -f HG002_manta_hg19.vcf
tabix -p vcf HG002_manta_hg19.vcf.gz

Step 4 — Run Truvari with default SV benchmarking parameters:

truvari bench \
-b references/HG002_SVs_Tier1_v0.6.vcf.gz \
-c results/sarek/sv/manta/HG002_manta_hg19.vcf.gz \
-o results/sarek/sv/manta/HG002_manta.truvari \
-f references/human_g1k_v37_decoy.fasta \
--includebed references/HG002_SVs_Tier1_v0.6.bed

Key Truvari parameters used (from params.json):

ParameterValueDescription
refdist500Max reference location distance for matching
pctseq0.7Min sequence similarity for match (70%)
pctsize0.7Min size similarity for match (70%)
sizemin50Min SV size to benchmark
sizemax50000Max SV size to benchmark
passonlyfalseInclude all variants (not just PASS-filtered)
picksinglePick single best match per variant

5. Benchmark Results

5.1 Overall Performance

The Truvari benchmark results from HG002_manta.truvari/summary.json:

MetricValue
TP (True Positives)1,082
FP (False Positives)1,858
FN (False Negatives)12,650
Total truth SVs (base cnt)13,732
Total called SVs (comp cnt)2,940
Precision0.3680 (36.8%)
Recall0.0788 (7.88%)
F1 Score0.1298 (12.98%)
Genotype Concordance0.9039 (90.39%)

5.2 Genotype Breakdown

Among the 1,082 matched true positives, genotype concordance was 90.4%. The genotype confusion matrix:

Truth GTCalled GTCount
0/1 (het)0/1 (het)344
0/1 (het)1/1 (hom)63
1/1 (hom)1/1 (hom)634
1/1 (hom)0/1 (het)29
None/None1/1 (hom)6
None/None0/1 (het)4
0/00/1 (het)1
0/01/1 (hom)1

The dominant source of genotype error is het/hom confusion (92 cases), which is expected for short-read SV calling where homozygous and heterozygous SVs in repetitive regions are easily confused.

5.3 Coverage Gap: Called vs. Truth

The most striking result is the coverage gap:

Truth SVs:   13,732
Called SVs: 2,940 (only 21.4% of truth SVs are even called)
True Positives: 1,082 (only 7.88% of truth SVs are recovered)

Manta called only ~2,940 SVs total, while the GIAB truth set contains 13,732 high-confidence SVs in Tier 1 regions. This is not primarily a precision problem (36.8% of calls are correct) — it is a sensitivity problem: the majority of real SVs are simply not detected by short reads.

6. Why Is Recall So Low?

The low recall (7.88%) reflects the inherent limitations of short-read SV detection:

  1. Insertion detection failure: Short reads cannot span insertions larger than the read length. Most of the GIAB truth insertions (especially mobile element insertions) are invisible to Illumina reads.

  2. Liftover losses: 3,945 records (~57% of raw Manta calls) failed liftover from hg38 to hg19, further reducing the callable variant set.

  3. Repetitive region blackout: Many GIAB Tier 1 SVs are in segmental duplications where short reads cannot be uniquely mapped.

  4. SV size distribution mismatch: Manta is most sensitive for deletions and larger rearrangements, but the truth set contains many small insertions (50–200 bp) where short-read signal is weakest.

Recap

info
  • Short-read SV calling with Manta achieves ~7.9% recall against the GIAB HG002 truth set — a fundamentally different performance profile than SNP/INDEL calling (>99% F1).
  • Precision (36.8%) is reasonable given the difficulty, but most calls cannot be validated against the truth set.
  • Genotype concordance (90.4%) is high for the variants that are correctly called, suggesting the signal quality is good when SVs are detected at all.
  • The main bottleneck is sensitivity, not precision — the majority of real SVs are not detectable from short reads alone.
  • For clinical-grade SV detection, long-read sequencing (PacBio HiFi, Oxford Nanopore) is required, or combining short-read callers with optical mapping technologies.

References

  1. Manta SV caller Chen X et al. (2016). Manta: rapid detection of structural variants and indels for germline and cancer sequencing applications. Bioinformatics. https://github.com/Illumina/manta

  2. Truvari: SV benchmarking tool https://github.com/ACEnglish/truvari

  3. GIAB HG002 Structural Variant Truth Set (v0.6) https://www.nist.gov/programs-projects/genome-bottle

  4. CrossMap: Genome coordinate conversion http://crossmap.sourceforge.net/

  5. nf-core/sarek pipeline https://nf-co.re/sarek

  6. bcftools: VCF normalization and manipulation http://samtools.github.io/bcftools/