Variant Calling (Part 8): Structural Variant Calling Short Read Benchmark
Structural variants (SVs) — deletions, insertions, duplications, inversions, and translocations — are large genomic alterations (typically ≥50 bp) that play a major role in disease but are much harder to detect than SNPs or small indels. In this post, we benchmark Manta, the SV caller integrated in nf-core/sarek, against the GIAB HG002 truth set using Truvari, and explore why short-read SV calling remains a fundamentally difficult problem.
- The environment setups and scripts are uploaded to nf-germline-short-read-variant-calling
- Clone the repository with tag 0.6.0 to reproduce this blog's contents
Clone the repository:
git clone git@github.com:nttg8100/nf-germline-short-read-variant-calling.git -b 0.6.0
cd nf-germline-short-read-variant-calling
cd benchmark
1. Reference and Truth Set
Before running the benchmark, download all required reference files using the script at benchmark/references/download.sh:
# Download GATK reference genomes (GRCh38 and GRCh37/hg19)
aws s3 cp --recursive --no-sign-request \
s3://ngi-igenomes/igenomes/Homo_sapiens/GATK/GRCh38/Sequence/WholeGenomeFasta/ .
aws s3 cp --recursive --no-sign-request \
s3://ngi-igenomes/igenomes/Homo_sapiens/GATK/GRCh37/Sequence/WholeGenomeFasta/ .
# Download GIAB HG002 SNP/INDEL truth set (GRCh38, NISTv4.2.1)
wget https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/AshkenazimTrio/HG002_NA24385_son/NISTv4.2.1/GRCh38/HG002_GRCh38_1_22_v4.2.1_benchmark.vcf.gz
wget https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/AshkenazimTrio/HG002_NA24385_son/NISTv4.2.1/GRCh38/HG002_GRCh38_1_22_v4.2.1_benchmark.vcf.gz.tbi
wget https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/AshkenazimTrio/HG002_NA24385_son/NISTv4.2.1/GRCh38/HG002_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed
# Download GIAB HG002 SV truth set (hg19, NIST_SV_v0.6)
wget https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/AshkenazimTrio/HG002_NA24385_son/NIST_SV_v0.6/HG002_SVs_Tier1_v0.6.vcf.gz
wget https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/AshkenazimTrio/HG002_NA24385_son/NIST_SV_v0.6/HG002_SVs_Tier1_v0.6.bed
# Download liftover chain file (hg38 → hg19) for coordinate conversion
wget https://hgdownload.cse.ucsc.edu/goldenpath/hg38/liftOver/hg38ToHg19.over.chain.gz
Why two reference genomes?
- The GRCh38 reference is used by nf-core/sarek for alignment and SV calling
- The GRCh37/hg19 reference (
human_g1k_v37_decoy.fasta) is required because the GIAB SV truth set (v0.6) is defined in hg19 coordinates — liftover is needed to compare the two - The SNP/INDEL truth set is in GRCh38, so no liftover is needed for small variant benchmarking
2. Background: Why SV Calling Is Hard for Short Reads
Short-read sequencing (Illumina, ~150 bp reads) excels at SNP and small indel detection, but it has fundamental limitations for SVs:
| Challenge | Description |
|---|---|
| Read length vs. SV size | SVs range from 50 bp to megabases — far larger than a single read |
| Repetitive regions | Many SVs occur in segmental duplications or low-complexity regions, causing mapping ambiguity |
| Breakpoint imprecision | Short reads cannot span large SVs, so breakpoints are inferred rather than directly observed |
| Split-read signal is weak | Only a fraction of reads at a breakpoint are split, limiting evidence depth |
| Liftover artifacts | Benchmarking requires coordinate conversion between assemblies, introducing noise |
Manta addresses these challenges by combining split-read and paired-end discordant read signals to detect SVs. However, even with a 30X WGS dataset like HG002, short-read SV calling remains fundamentally limited compared to long-read approaches.
3. Running Manta via nf-core/sarek
Manta was run as part of the nf-core/sarek pipeline (v3.7.1) on the HG002 30X WGS dataset aligned to GRCh38.
nextflow run nf-core/sarek -r 3.7.1 -profile docker,test_full_germline \
-c customized_labels.config \
--input ./input.csv \
--outdir bwamem \
--tools "deepvariant,freebayes,strelka,manta,haplotypecaller" \
--aligner "bwa-mem" \
-resume
The Manta output VCF (HG002.manta.diploid_sv.vcf.gz) contains diploid SV calls including DEL, INS, INV, DUP, and BND (breakend) types.
4. Benchmarking Methodology
4.1 Truth Set
The benchmark uses the GIAB HG002 Structural Variant Tier 1 truth set (v0.6):
- Truth VCF:
HG002_SVs_Tier1_v0.6.vcf.gz - Confident regions BED:
HG002_SVs_Tier1_v0.6.bed - Reference:
human_g1k_v37_decoy.fasta(hg19/GRCh37)
The truth set is defined in hg19 coordinates, while sarek outputs VCFs in GRCh38. A liftover step is required before benchmarking.
4.2 Benchmarking Pipeline
The full benchmarking workflow is implemented in benchmark/benchmark_sv_sarek.sh and consists of four steps:
cd benchmark
pixi run --environment svbench bash benchmark_sv_sarek.sh
Step 1 — Normalize the query VCF (split multi-allelic, left-align indels):
bcftools norm -m -any \
-o HG002_manta_hg38_normalized.vcf \
pipelines/sarek/variant_calling/manta/HG002/HG002.manta.diploid_sv.vcf.gz
Step 2 — Liftover from hg38 to hg19 using CrossMap:
CrossMap vcf \
references/hg38ToHg19.over.chain.gz \
HG002_manta_hg38_normalized.vcf \
references/human_g1k_v37_decoy.fasta \
HG002_manta_hg19_raw.vcf
During liftover, 3,945 records could not be mapped to hg19 and were written to the .unmap file. These are excluded from downstream benchmarking.
Step 3 — Remove chr prefix, sort and index (to match truth set naming):
sed -e 's/chr//g' HG002_manta_hg19_raw.vcf > HG002_manta_hg19_nochr.vcf
bcftools sort -o HG002_manta_hg19.vcf HG002_manta_hg19_nochr.vcf
bgzip -f HG002_manta_hg19.vcf
tabix -p vcf HG002_manta_hg19.vcf.gz
Step 4 — Run Truvari with default SV benchmarking parameters:
truvari bench \
-b references/HG002_SVs_Tier1_v0.6.vcf.gz \
-c results/sarek/sv/manta/HG002_manta_hg19.vcf.gz \
-o results/sarek/sv/manta/HG002_manta.truvari \
-f references/human_g1k_v37_decoy.fasta \
--includebed references/HG002_SVs_Tier1_v0.6.bed
Key Truvari parameters used (from params.json):
| Parameter | Value | Description |
|---|---|---|
refdist | 500 | Max reference location distance for matching |
pctseq | 0.7 | Min sequence similarity for match (70%) |
pctsize | 0.7 | Min size similarity for match (70%) |
sizemin | 50 | Min SV size to benchmark |
sizemax | 50000 | Max SV size to benchmark |
passonly | false | Include all variants (not just PASS-filtered) |
pick | single | Pick single best match per variant |
5. Benchmark Results
5.1 Overall Performance
The Truvari benchmark results from HG002_manta.truvari/summary.json:
| Metric | Value |
|---|---|
| TP (True Positives) | 1,082 |
| FP (False Positives) | 1,858 |
| FN (False Negatives) | 12,650 |
| Total truth SVs (base cnt) | 13,732 |
| Total called SVs (comp cnt) | 2,940 |
| Precision | 0.3680 (36.8%) |
| Recall | 0.0788 (7.88%) |
| F1 Score | 0.1298 (12.98%) |
| Genotype Concordance | 0.9039 (90.39%) |
5.2 Genotype Breakdown
Among the 1,082 matched true positives, genotype concordance was 90.4%. The genotype confusion matrix:
| Truth GT | Called GT | Count |
|---|---|---|
| 0/1 (het) | 0/1 (het) | 344 |
| 0/1 (het) | 1/1 (hom) | 63 |
| 1/1 (hom) | 1/1 (hom) | 634 |
| 1/1 (hom) | 0/1 (het) | 29 |
| None/None | 1/1 (hom) | 6 |
| None/None | 0/1 (het) | 4 |
| 0/0 | 0/1 (het) | 1 |
| 0/0 | 1/1 (hom) | 1 |
The dominant source of genotype error is het/hom confusion (92 cases), which is expected for short-read SV calling where homozygous and heterozygous SVs in repetitive regions are easily confused.
5.3 Coverage Gap: Called vs. Truth
The most striking result is the coverage gap:
Truth SVs: 13,732
Called SVs: 2,940 (only 21.4% of truth SVs are even called)
True Positives: 1,082 (only 7.88% of truth SVs are recovered)
Manta called only ~2,940 SVs total, while the GIAB truth set contains 13,732 high-confidence SVs in Tier 1 regions. This is not primarily a precision problem (36.8% of calls are correct) — it is a sensitivity problem: the majority of real SVs are simply not detected by short reads.
6. Why Is Recall So Low?
The low recall (7.88%) reflects the inherent limitations of short-read SV detection:
-
Insertion detection failure: Short reads cannot span insertions larger than the read length. Most of the GIAB truth insertions (especially mobile element insertions) are invisible to Illumina reads.
-
Liftover losses: 3,945 records (~57% of raw Manta calls) failed liftover from hg38 to hg19, further reducing the callable variant set.
-
Repetitive region blackout: Many GIAB Tier 1 SVs are in segmental duplications where short reads cannot be uniquely mapped.
-
SV size distribution mismatch: Manta is most sensitive for deletions and larger rearrangements, but the truth set contains many small insertions (50–200 bp) where short-read signal is weakest.
Recap
- Short-read SV calling with Manta achieves ~7.9% recall against the GIAB HG002 truth set — a fundamentally different performance profile than SNP/INDEL calling (>99% F1).
- Precision (36.8%) is reasonable given the difficulty, but most calls cannot be validated against the truth set.
- Genotype concordance (90.4%) is high for the variants that are correctly called, suggesting the signal quality is good when SVs are detected at all.
- The main bottleneck is sensitivity, not precision — the majority of real SVs are not detectable from short reads alone.
- For clinical-grade SV detection, long-read sequencing (PacBio HiFi, Oxford Nanopore) is required, or combining short-read callers with optical mapping technologies.
References
-
Manta SV caller Chen X et al. (2016). Manta: rapid detection of structural variants and indels for germline and cancer sequencing applications. Bioinformatics. https://github.com/Illumina/manta
-
Truvari: SV benchmarking tool https://github.com/ACEnglish/truvari
-
GIAB HG002 Structural Variant Truth Set (v0.6) https://www.nist.gov/programs-projects/genome-bottle
-
CrossMap: Genome coordinate conversion http://crossmap.sourceforge.net/
-
nf-core/sarek pipeline https://nf-co.re/sarek
-
bcftools: VCF normalization and manipulation http://samtools.github.io/bcftools/