Variant Calling (Part 11): Population-Scale Genotyping Using gVCF and Joint Variant Calling
Population-scale variant calling is a critical step in building population genomics projects. While single-sample variant calling is well established, scaling joint genotyping to thousands of WGS samples introduces challenges in performance, storage, and incremental updates. In this blog, I explore gVCF-based joint variant calling approaches and evaluate scalable solutions built on modern open-source tools. I also discuss practical architecture considerations for efficiently constructing population-scale genomics projects.
In Parts 1-8 of the variant calling series, I built a complete short-read germline variant calling pipeline—from raw reads through variant annotation using nf-germline-short-read-variant-calling. In Part 6, I demonstrated that modern variant callers like DeepVariant can produce gVCF files directly—output files that capture both variant AND reference confidence information.
But there's a critical next step that most tutorials skip: How do you combine hundreds or thousands of these gVCF files to explore and compare variants across multiple samples? And more importantly, how do you use the power of a population to improve variant accuracy across your cohort?
This article explores two complementary concepts:
- gVCF Combination → Aggregating gVCF files to systematically explore variants across multiple samples, enabling population-level analysis and discovery
- Joint Variant Calling → Using cohort-wide population information to improve genotype accuracy, correct individual sample errors, and refine variant quality scores
I demonstrate the technical journey from single-sample gVCF files (created by DeepVariant or GATK) to a population-scale multi-sample VCF using DRAGEN gVCF Genotyper, showing how joint genotyping leverages population signals to enhance accuracy without re-calling variants. I use real data from the 1000 Genomes Project to show practical examples.
This is Part 11 of the Variant Calling series:
- Parts 1-3: Building production-grade variant calling pipeline (bash → Nextflow → HPC)
- Parts 4-8: Workflow quality, benchmarking, annotation, and structural variants
- Part 11 (This Post): Population-scale joint genotyping and cohort aggregation
Prerequisites: Familiarity with gVCF format from Part 6 and basic understanding of variant calling pipelines.
1. Two Concepts: gVCF Combination and Joint Variant Calling
1.1. Understanding the Distinction
Before diving into population-scale workflows, I need to clarify two distinct (but complementary) processes that often get confused:
gVCF Combination:
- Purpose: Aggregating reference confidence information from multiple samples to systematically explore which variants exist across your cohort
- Process: Merging gVCF files while preserving per-sample genotype likelihoods and reference confidence
- Output: A multi-sample VCF showing what variants each sample carries (genotype discovery)
- Accuracy Improvement: Minimal—primarily enables cross-sample exploration and discovery
Joint Variant Calling:
- Purpose: Using population-level signals to improve genotype accuracy and refine quality scores at each site
- Process: Probabilistically re-evaluating each sample's genotype using cohort-wide allele frequency priors and genotype patterns
- Output: Refined genotypes with improved quality scores leveraging population consensus
- Accuracy Improvement: Significant—especially for low-confidence sites, rare variants, and low-coverage regions
1.2. Where I Left Off: gVCF from DeepVariant
In Part 6, our nf-germline-short-read-variant-calling pipeline produces:
Per-sample outputs:
- single-sample VCF: Variants only (positions with alternate alleles)
- single-sample gVCF (created by DeepVariant): Variants + reference blocks with quality scores
Example file structure:
sample-1/
├── sample1.vcf.gz # Hard-filtered variants only
├── sample1.vcf.gz.tbi # Index for VCF
├── sample1.gvcf.gz # gVCF with reference blocks ← KEY FILE
└── sample1.gvcf.gz.tbi # Index for gVCF
sample-2/
├── sample2.vcf.gz
├── sample2.gvcf.gz # ← I combine these to explore variants
└── ...
Why gVCF enables exploration: Unlike standard VCF files, gVCF captures:
- ✅ Variant sites (same as VCF)
- ✅ Reference confidence blocks (<NON_REF> entries) showing where samples are likely non-variant
- ✅ Quality scores for non-variant positions
- ✅ Complete genotype likelihoods per sample
This rich information allows us to systematically explore variants across samples—comparing not just which variants each sample has, but also the confidence level at every genomic position.
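To make that concrete, here is a small sketch (assuming your bcftools/htslib build supports remote access to the S3-hosted DRAGEN 1KGP files used later in this post; the region is chosen arbitrarily): pull the gVCF records overlapping a locus and read the GQ of the covering reference block directly.
# Inspect reference confidence at a specific locus in one sample's gVCF
# (requires bcftools/htslib built with remote-access support)
bcftools view -H \
  -r chr1:10000-10050 \
  s3://1000genomes-dragen-v4-2-7/data/individuals/hg38_alt_masked_graph_v3/NA21144/NA21144.final.hard-filtered.gvcf.gz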
1.3. Two-Stage Pipeline: Combination then Refinement
Figure 1: Two-Stage Population Variant Calling Pipeline
Many workflows combine these stages into one tool (like GATK Joint Genotyping), but conceptually they are two distinct operations.
2. The Three-Stage Genomic Data Journey: VCF → gVCF → Joint Cohort VCF
- Various statistical and machine learning models can be applied to improve individual variant calling using information from multiple samples.
- The variant re-calling model below is presented for proof-of-concept purposes only.
This section traces the data transformation pipeline from individual variant calls to population-scale cohorts. Each stage adds value: variant-only VCF enables basic discovery, gVCF enables cross-sample exploration, and joint cohort VCF applies population knowledge for accuracy refinement.
What makes this workflow powerful:
- gVCF preserves reference confidence → You can explore not just variants, but confidence at every genomic position
- Multi-sample VCF enables discovery → See which variants are present in your cohort
- Joint genotyping refines accuracy → Population consensus improves genotype quality scores across all samples
2.1. Stage 1: Single-Sample VCF (from DRAGEN)
What it contains:
- Only variant sites (positions with alternate alleles)
- Genotype information for ONE sample
- No information about non-variant positions
- File size: ~100-500 MB per sample
Example command (using bcftools on S3):
- Illumina, together with the 1000 Genomes Project, has released the individual gVCFs and a cohort gVCF that I use for exploration throughout this blog.
- The files are stored in an S3 bucket, and the latest bcftools (built on htslib) supports random access to remote files. Read more in the htslib remote file blog.
# View DRAGEN hard-filtered VCF for single sample
bcftools view s3://1000genomes-dragen-v4-2-7/data/individuals/hg38_alt_masked_graph_v3/NA21144/NA21144.final.hard-filtered.vcf.gz \
| grep -v "contig" \
| head -n80
Output showing VCF header and variant records:
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for ref and alt alleles in the order listed">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes">
...
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA21144
chr1 10616 . CCGCCGTTGCAAAGGCGCGCCG C 8.83 PASS AC=2;AF=1;AN=2;DP=15;... GT:AD:GQ:PL 1/1:2,13:4:38,6,0
chr1 10927 . A G 10.2 PASS AC=2;AF=1;AN=2;DP=2;... GT:AD:GQ:PL 1/1:0,2:3:45,5,0
chr1 11022 . G A 10.54 PASS AC=2;AF=1;AN=2;DP=4;... GT:AD:GQ:PL 1/1:1,3:5:47,7,0
Key observation: Position chr1:15000 is NOT in this file—there's no way to know if the sample is confidently homozygous reference or just not sequenced.
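You can verify this absence directly. A hedged sketch (same remote-access assumption as above): querying the variant-only VCF at that position returns nothing, and nothing is exactly the problem—you cannot tell hom-ref apart from no-data.
# Query the variant-only VCF at a position with no record
bcftools view -H \
  -r chr1:15000 \
  s3://1000genomes-dragen-v4-2-7/data/individuals/hg38_alt_masked_graph_v3/NA21144/NA21144.final.hard-filtered.vcf.gz
# (no output: hom-ref? no coverage? the VCF alone cannot say)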
2.2. Stage 2: Single-Sample gVCF (output from DeepVariant in Part 6 is similar)
What it contains:
- Both variant sites AND reference blocks (homozygous reference regions)
- Genotype likelihoods for each position
- Quality scores for non-variant positions
- File size: ~200-800 MB per sample (larger than VCF due to reference blocks)
Example command:
# View gVCF with reference blocks and quality scores from the DRAGEN pipeline
bcftools view s3://1000genomes-dragen-v4-2-7/data/individuals/hg38_alt_masked_graph_v3/NA21144/NA21144.final.hard-filtered.gvcf.gz \
| grep -v "contig" \
| head -n100
Key difference from VCF: Notice the <NON_REF> entries and END field that mark reference blocks:
##fileformat=VCFv4.2
##ALT=<ID=NON_REF,Description="Represents any possible alternative allele at this location">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
##GVCFBlock=minGQ=0(inclusive),maxGQ=1(exclusive)
##GVCFBlock=minGQ=1(inclusive),maxGQ=10(exclusive)
##GVCFBlock=minGQ=10(inclusive),maxGQ=20(exclusive)
...
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA21144
chr1 9998 . N <NON_REF> . LowGQ END=10003 GT:GQ:MIN_DP ./.:0:1
chr1 10004 . C <NON_REF> . PASS END=10025 GT:GQ:MIN_DP 0/0:3:12
chr1 10026 . A <NON_REF> . PASS END=10026 GT:GQ:MIN_DP 0/0:15:32
chr1 10027 . A <NON_REF> . PASS END=10044 GT:GQ:MIN_DP 0/0:21:35
chr1 10045 . A <NON_REF> . PASS END=10048 GT:GQ:MIN_DP 0/0:21:44
Key insights:
- <NON_REF>: Symbolic ALT allele meaning "reference confidence"
- END=10025: This reference block spans chr1:10004-10025
- GT:GQ:MIN_DP: Genotype (0/0 = homozygous reference), genotype quality (GQ=3 = low confidence), and minimum depth within the block
Now you know: Sample NA21144 is called homozygous reference at chr1:10004-10025, but only with GQ=3 (low confidence), and at chr1:10026 with GQ=15 (higher confidence).
2.2.1. Understanding Genotype Quality (GQ) Calculation
What is GQ?
Genotype Quality (GQ) is a Phred-scaled probability that measures confidence in the genotype call. It's defined as:
GQ = -10 × log10 P(called genotype is wrong)
Interpretation:
- GQ = 3: Probability of error = 10^(-3/10) ≈ 50% (essentially coin flip—not confident)
- GQ = 10: Probability of error = 10^(-10/10) = 0.1 = 10% (one in ten chance of error)
- GQ = 20: Probability of error = 10^(-20/10) = 0.01 = 1% (one in hundred chance)
- GQ = 30: Probability of error = 10^(-30/10) = 0.001 = 0.1% (one in thousand—high confidence)
- GQ ≥ 60: Near-perfect confidence in the genotype
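If you want to sanity-check these conversions yourself, a one-liner (awk here; any calculator works) applies the Phred formula from above:
# Convert a GQ value to its implied genotype-error probability
awk -v gq=30 'BEGIN { printf "P(error) = %.4f\n", 10^(-gq/10) }'
# P(error) = 0.0010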
Real-world example from the gVCF above:
chr1 10004 . C <NON_REF> . PASS END=10025 GT:GQ:MIN_DP 0/0:3:12
↑ GQ=3 (LOW confidence)
chr1 10026 . A <NON_REF> . PASS END=10026 GT:GQ:MIN_DP 0/0:15:32
↑ GQ=15 (HIGHER confidence)
Why the difference?
- Position chr1:10004-10025 has MIN_DP=12 (moderate depth but probably low-complexity region with ambiguous reads)
- Position chr1:10026 has MIN_DP=32 (higher depth, clearer reference call)
The problem with low GQ calls:
When GQ is low (≤3), you cannot distinguish between:
- "This position really is homozygous reference" (GQ low due to sequencing noise)
- "This position might have a rare variant I can't see" (GQ low due to insufficient coverage)
This ambiguity cascades when building cohort call sets—if 1,000 samples have GQ=3 at the same position, do you trust their consensus that it's a non-variant? Or should you flag this region as suspicious?
2.2.2. How Joint Genotyping Improves Variant Calls Using Population Power
The Single-Sample Problem: Decisions in Isolation
When calling variants on a single sample, low-confidence calls must be decided with only that sample's data:
Sample 1, evaluated in isolation, at position chr22:10514143:
- Reads support: 95% for genotype 0/0 (ref/ref), 5% for genotype 0/1 (ref/alt)
- Confidence: GQ=10 (1 in 10 chance this call is wrong)
- Decision: Call 0/0, but confidence is LOW
Without population context:
- Is this truly a homozygous reference call?
- Or is this a real but very rare variant being missed?
- I cannot know without additional information
The Joint Genotyping Solution: Borrowing Strength from Population
When you have a cohort, population patterns dramatically improve accuracy. Joint genotyping uses cohort-wide signals to reassess individual sample genotypes:
Cohort of 1,000 samples at chr22:10514143:
┌─────────────────────────────────────────────────────┐
│ Sample 1: 0/0 (GQ=10) ← questionable call │
│ Sample 2: 0/0 (GQ=50) ← very confident │
│ Sample 3: 0/0 (GQ=48) ← very confident │
│ Sample 4: 0/0 (GQ=52) ← very confident │
│ Samples 5-1000: All 0/0 (GQ>40) ← strong consensus │
└─────────────────────────────────────────────────────┘
Population signal:
- 999 samples independently call 0/0 with high confidence
- Only 1 sample has uncertain confidence
- Statistical analysis: Probability this position is truly 0/0 = 99%+ (allele frequency ≈ 0)
- Conclusion: Sample 1's uncertain call is REINFORCED by population consensus
Result: Sample 1's GQ improved from 10 → 35 (borrowing confidence from the cohort)
Real-world impact - Concrete examples of accuracy improvement:
Example 1: Low-coverage region
Position chr1:45000000 (low sequencing depth in Sample 1):
Before joint genotyping:
Sample_1 0/0:8 (uncertain due to only 3 reads, noisy region)
Sample_2 0/0:60 (30 reads, clear reference)
...
Sample_1000 0/0:58
After joint genotyping:
Sample_1 0/0:38 ← GQ improved 8→38 using 999 samples' consensus
(Population evidence: 1,000/1,000 samples call 0/0, so position is likely not variant)
Example 2: Rare variant validation
Position chr5:12000000:
Before joint genotyping:
Sample_1 0/1:20 (one read supports alt, one ref—uncertain heterozygous call)
Sample_2 0/0:55 (confidently reference)
...
Sample_1000 0/0:52
After joint genotyping:
Sample_1 0/1:32 ← GQ improved 20→32 (population consensus confirms this is a real rare variant)
(Calculation: "This heterozygous call in one sample + homozygous refs in 999 others = likely a real rare variant worth believing")
Example 3: Correcting false positives
Position chr19:88000000:
Before joint genotyping:
Sample_1 0/1:9 (weak evidence for alt, GQ very low)
Sample_2 0/0:51 (reference)
...
Sample_1000 0/0:49
After joint genotyping:
Sample_1 0/0:28 ← Corrected from 0/1→0/0 (population consensus suggests this was artifact)
(Rationale: "Only 1 sample heterozygous out of 1,000 with weak signal + 999 homozygous refs = likely sequencing artifact")
How the Math Works - The Probabilistic Framework
Joint genotyping combines three pieces of information:
- Single-sample likelihood: P(reads | genotype) ← What the reads in this one sample support
- Population prior: P(genotype | cohort allele frequency) ← What I expect based on 999 other samples
- Posterior probability: P(genotype | reads + cohort) ← Final refined estimate combining both
Simplified equation:
Refined GQ = single_sample_confidence
+ population_evidence
- conflict_penalty
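To see how those three pieces combine, here is a back-of-envelope Bayesian calculation with made-up numbers mirroring the Sample 1 scenario above (the likelihoods and allele frequency are illustrative assumptions, not output from any tool):
Cohort: 999 confident 0/0 calls → alt allele frequency f ≈ 0.001
Hardy-Weinberg priors: P(0/0) ≈ (1-f)^2 ≈ 0.998, P(0/1) ≈ 2f(1-f) ≈ 0.002
Sample 1 read likelihoods (a GQ=10 call): P(reads|0/0) = 0.90, P(reads|0/1) = 0.10

Posterior for the uncertain genotype:
P(0/1 | reads, cohort) = (0.10 × 0.002) / (0.10 × 0.002 + 0.90 × 0.998)
                       ≈ 0.0002 / 0.8984 ≈ 2.2e-4
Refined GQ ≈ -10 × log10(2.2e-4) ≈ 37   (up from 10, matching the intuition above)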
When population power is most valuable:
| Scenario | Individual GQ | Population GQ | Improvement | Why |
|---|---|---|---|---|
| Low-coverage region | 6 | 35 | +29 | Cohort consensus compensates for low depth |
| Rare variant (1/1000 samples) | 15 | 28 | +13 | Population signals confirm it's real |
| Artifact in one sample | 8 | 2 (filtered) | -6 | 999 samples say it's wrong |
| Common variant | 45 | 48 | +3 | Minimal improvement (already confident) |
Critical insight: Why this matters for your cohort
The key realization is that population power specifically helps uncertain calls:
- High-confidence individual calls stay high-confidence (no change needed)
- Low-confidence individual calls gain confidence from cohort agreement
- Erroneous singleton calls are identified and corrected by population patterns
This is why joint genotyping can simultaneously:
- Improve accuracy across the cohort
- Reduce false positives (artifact calls corrected)
- Increase true positive detection (weak real calls boosted)
2.3. Stage 3: Multi-Sample Cohort VCF (after DRAGEN gVCF Genotyper)
What it contains:
- All variants discovered ACROSS the entire cohort (3,202 samples in 1KGP)
- Genotypes for all samples at each variant site
- Population-level metrics (allele counts, frequencies, Hardy-Weinberg statistics)
- Local alleles representation (memory-efficient FORMAT/LPL and FORMAT/LAA fields)
Example command:
# View the cohort VCF from 3,202 samples on chromosome 22
# Note: showing only the first 12 columns (CHROM to first few samples)
bcftools view s3://1000genomes-dragen-v4-2-7/data/cohorts/gvcf-genotyper-dragen-4.2.7/hg38/3202-samples-cohort/3202_samples_cohort_gg_chr22.vcf.gz \
| grep -v "contig" \
| head -n60 \
| cut -f1-12
Key metrics added by DRAGEN gVCF Genotyper:
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=NS,Number=1,Type=Integer,Description="Total number of samples">
##INFO=<ID=NS_GT,Number=1,Type=Integer,Description="Total number of samples with called genotypes">
##INFO=<ID=NS_NOGT,Number=1,Type=Integer,Description="Samples with unknown genotypes ./.">
##INFO=<ID=NS_NODATA,Number=1,Type=Integer,Description="Samples with no coverage">
##INFO=<ID=IC,Number=1,Type=Float,Description="The inbreeding coefficient">
##INFO=<ID=HWE,Number=A,Type=Float,Description="Hardy-Weinberg Equilibrium P-value">
##FORMAT=<ID=LPL,Number=.,Type=Integer,Description="Local Phred-scaled likelihoods (memory efficient)">
##FORMAT=<ID=LAA,Number=.,Type=String,Description="Local alternate allele indices">
Example variant in the cohort:
chr22 10514143 . G A 17.92 PASS AC=16;AN=1452;NS=3202;NS_GT=726;NS_NOGT=33;NS_NODATA=2443;IC=0.75;HWE=2.1e-12
GT:GQ:AD:LPL:LAA .:.:.:.:. .:.:.:.:. ...
What this tells you:
- AC=16, AN=1452: 16 copies of A allele out of 1,452 total alleles genotyped
- NS=3202: All 3,202 samples in cohort
- NS_GT=726: 726 samples have a genotype call at this site
- NS_NODATA=2443: 2,443 samples had no coverage at this position
- IC=0.75: High inbreeding coefficient (excess homozygotes—potential population structure)
- HWE=2.1e-12: Highly significant deviation from Hardy-Weinberg equilibrium
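These INFO fields are easy to extract programmatically. A minimal sketch (same remote-access assumption as earlier; field names taken from the header above) using bcftools query:
# Tabulate per-site allele counts, genotyped-sample counts, and HWE p-values
bcftools query \
  -f '%CHROM\t%POS\t%REF\t%ALT\t%INFO/AC\t%INFO/AN\t%INFO/NS_GT\t%INFO/HWE\n' \
  s3://1000genomes-dragen-v4-2-7/data/cohorts/gvcf-genotyper-dragen-4.2.7/hg38/3202-samples-cohort/3202_samples_cohort_gg_chr22.vcf.gz \
  | head -n5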
3. Joint Genotyping Solutions
For implementing joint genotyping with practical results, the ideal tool should have:
- Open-source: Accessible to deploy on any platform without licensing barriers
- Accuracy: Effectively leverages population signals to improve genotype quality
- Scalability: Handles growing cohort sizes efficiently with distributed computing
Once you have combined gVCF files into a multi-sample VCF, the next step is applying joint genotyping to improve accuracy using population signals. Different tools offer different trade-offs between accuracy improvement, computational cost, scalability, and operational flexibility.
3.1. Comprehensive Tool Comparison
| Tool | Performance | Scalability | Open-Source | Key Pros | Key Cons |
|---|---|---|---|---|---|
| GATK Joint Genotyping | Slow (single-threaded) | Poor (must reprocess the entire cohort when new samples arrive) | Yes | ✅ Mature, well-documented; robust VQSR recalibration | ❌ Full-pipeline re-runs; complex parameter tuning; tightly coupled to GATK |
| GLnexus | Fast (specialized C++ backend) | Medium (split by regions) | Yes | ✅ Faster than GATK; incremental updates possible | ❌ Tightly coupled to DeepVariant; scales only by region splitting |
| Genomic Variant Store (Terra) | Very fast (BigQuery-based) | Excellent (cloud-native) | No ❌ | ✅ Scalable to millions; managed service | ❌ Terra platform lock-in; licensing; no local control |
| DRAGEN IGG | Very fast (specialized hardware) | Excellent (split by regions + sample batches) | No ❌ | ✅ Incremental processing; high performance; enterprise support | ❌ Licensing; hardware dependency; vendor lock-in |
| DPGT | Very fast (Apache Spark distributed) | Excellent (split by regions + sample batches) | Yes | ✅ Open-source; incremental processing; high performance | ⚠️ Less mature; limited documentation; more involved setup |
3.2. Decision Matrix: Which Tool Should You Choose?
| Your Situation | Recommended Tool | Why |
|---|---|---|
| Academic/research, small cohort (<1K samples), rare variant discovery, variants called by GATK HaplotypeCaller | GATK Joint Genotyping | Mature, well-documented, best accuracy for non-DRAGEN callers |
| Growing biobank, need incremental updates, variants called by DRAGEN | DRAGEN IGG (if already using DRAGEN) or DPGT (open-source) | Incremental processing critical for operational efficiency |
| Large consortium (100K+ samples) | Genomic Variant Store or DRAGEN IGG | Massive scalability; cost acceptable at this scale |
| Open-source, scaling up to a million samples | DPGT | Best open-source option; excellent scaling; no vendor lock-in |
| Academic/research, small cohort (<1K samples), variants called by DeepVariant | GLnexus | Good speed/accuracy trade-off; cloud-friendly |
3.3. Practical Implementation: Open-Source Joint Variant Calling with GKit
This section demonstrates how to implement joint variant calling using open-source tools integrated in the GKit repository (github.com/gianglabs/gkit). I show both GLnexus and DPGT implementations with real workflows.
3.3.1. GATK Joint Genotyping (Already Integrated in nf-germline)
Note: GATK GenotypeGVCFs is already implemented in the production pipeline at nf-germline-short-read-variant-calling. The pipeline includes the GATK_GENOTYPEGVCFS process that:
- Takes individual gVCF files from DeepVariant or GATK
- Performs population-aware genotype refinement
- Outputs joint VCF with population-level metrics
For academic research projects with small cohorts calling variants with GATK HaplotypeCaller, this integrated approach is recommended.
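For orientation, the standard GATK4 two-step pattern that the pipeline wraps looks roughly like this (a minimal sketch with hypothetical file names; consult the pipeline itself for the production parameters):
# Step 1: consolidate per-sample gVCFs into a GenomicsDB workspace
gatk GenomicsDBImport \
  -V sample1.gvcf.gz \
  -V sample2.gvcf.gz \
  --genomicsdb-workspace-path cohort_db \
  -L chr22
# Step 2: joint-genotype the consolidated cohort
gatk GenotypeGVCFs \
  -R Homo_sapiens_assembly38.fasta \
  -V gendb://cohort_db \
  -O cohort.vcf.gz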
3.3.2. GLnexus Implementation
What GLnexus does:
- High-performance joint variant calling optimized for DeepVariant gVCF output
- Efficient region-based processing (splittable by genomic regions)
- Produces cohort VCF with improved genotype quality scores
Quick Start with GKit:
cd /path/to/gkit/glnexus
# Install dependencies with Pixi
pixi install
# Run joint genotyping on test data (GIAB + 1KGP samples)
make test
Output:
cohort_vcf/GIAB/data/GIAB_ALDH2_5kb.vcf.gz
cohort_vcf/1KGP/data/1KGP_ALDH2_5kb.vcf.gz
Configuration Details:
The pipeline downloads gVCF files from public cloud storage and runs GLnexus:
# --config DeepVariant: settings optimized for DeepVariant gVCF output
# --threads 4: CPU cores for parallel processing
# --bed ALDH2_5kb.bed: limit analysis to the region of interest
# positional arguments: input gVCF files; BCF is written to stdout
glnexus_cli \
  --config DeepVariant \
  --threads 4 \
  --bed ALDH2_5kb.bed \
  *.g.vcf.gz > cohort.bcf
Key parameters:
- --config DeepVariant: Uses DeepVariant-optimized settings
- --threads: CPU cores for parallel processing (adjust based on available resources)
- --bed: Region file to limit analysis (enables incremental processing across genome)
Customization:
- Change analysis region by editing the BED file
- Add more samples by including their gVCF files
- Reduce memory usage by lowering thread count
Pipeline features:
- Automatically downloads GIAB and 1000 Genomes Project samples from cloud storage
- Converts BCF output to gzip-compressed VCF format (a minimal standalone equivalent is sketched below)
- Reproducible environment via Pixi
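That last conversion step is plain bcftools; a minimal standalone equivalent, assuming the cohort.bcf produced by the glnexus_cli command above:
# Convert GLnexus BCF output to bgzipped VCF and index it
bcftools view cohort.bcf -Oz -o cohort.vcf.gz
bcftools index -t cohort.vcf.gz   # writes cohort.vcf.gz.tbi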
3.3.3. DPGT Implementation (Distributed Population Genetics Tool)
What DPGT does:
- Open-source distributed joint variant calling using Apache Spark
- Handles both sample-level and region-level parallelization
- Comparable accuracy to GATK with better scalability for large cohorts (10K+ samples)
Quick Start with GKit:
cd /path/to/gkit/dpgt
# Install and build DPGT locally
pixi install
bash scripts/clone_DPGT.sh
bash scripts/build_cpp.sh
# Run joint genotyping
bash run_dpgt.sh
Default output:
results_verify_1kgp_run_dpgt/result.chr12_111760000_111765000.0.vcf.gz
Configuration with Environment Variables:
INPUT_LIST=inputs/1kgp_3samples.list \
REFERENCE_FASTA=reference/Homo_sapiens_assembly38.fasta \
OUTPUT_DIR=results/my_run \
TARGET_REGION=chr12:111760000-111765000 \
JOBS=4 \
bash run_dpgt.sh
Key parameters:
- INPUT_LIST: File listing paths to input gVCF files
- REFERENCE_FASTA: Reference genome in FASTA format
- OUTPUT_DIR: Output directory for results
- TARGET_REGION: Genomic region in the format chr:start-end
- JOBS: Number of Spark executors for parallel processing
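For reference, INPUT_LIST is just a text file with one gVCF path per line. A hypothetical example (these local paths and sample IDs are placeholders, not the actual GKit test inputs):
# inputs/1kgp_3samples.list — one gVCF path per line
/data/gvcf/HG00096.hard-filtered.gvcf.gz
/data/gvcf/HG00097.hard-filtered.gvcf.gz
/data/gvcf/HG00099.hard-filtered.gvcf.gz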
Pipeline workflow:
- Checks DPGT JAR and native library compilation
- Validates input files and reference
- Runs joint genotyping using local Spark cluster
- Outputs cohort VCF with population statistics
Scalability Notes:
- Local mode: Suitable for test/validation (as shown here with 3 samples)
- Cluster mode: DPGT integrates with Spark on SLURM (see spark-on-slurm/ in GKit)
- Can scale to 10K+ samples by distributing jobs across Spark executors
Known limitations:
- GIAB inputs may fail due to <NON_REF> compatibility issues; 1KGP samples are validated
- Requires Apache Spark infrastructure setup
- Spark cluster management needed for large-scale processing
3.3.4. Spark on SLURM for Large-Scale DPGT Processing
For biobank-scale analysis (1K-100K+ samples), DPGT can run on a Spark cluster deployed within SLURM jobs using the spark-on-slurm configuration in GKit:
cd /path/to/gkit/spark-on-slurm
# Setup environment
make setup
# Test Spark cluster on SLURM
make example
What this enables:
- Standalone Spark cluster launches inside a SLURM allocation (no Docker required)
- Reproducible environment via Pixi
- PySpark driver can submit DPGT jobs to distributed workers
- Automatic cleanup after job completion
Architecture:
- SLURM batch job allocation
- Spark master starts on first node
- Spark workers start via srun across allocated nodes
- PySpark driver connects to master and runs genotyping jobs
- Results written to shared filesystem
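To make the architecture tangible, here is a minimal sketch of such a batch script (resource sizes, Spark paths, and the DPGT submit arguments are assumptions—the actual GKit scripts and DPGT CLI flags may differ):
#!/bin/bash
#SBATCH --job-name=dpgt-joint
#SBATCH --nodes=4
#SBATCH --cpus-per-task=16
#SBATCH --time=12:00:00

# Start a standalone Spark master on the first (batch) node
"$SPARK_HOME/sbin/start-master.sh"
MASTER_URL="spark://$(hostname):7077"

# Launch one foreground Spark worker per allocated node via srun
srun --ntasks-per-node=1 \
  "$SPARK_HOME/bin/spark-class" org.apache.spark.deploy.worker.Worker "$MASTER_URL" &
sleep 15  # give workers time to register with the master

# Submit the DPGT job to the cluster (JAR path and flags are illustrative)
"$SPARK_HOME/bin/spark-submit" --master "$MASTER_URL" dpgt.jar \
  --input inputs/1kgp_3samples.list \
  --output results/cohort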
This approach enables true horizontal scalability for population-scale joint genotyping without vendor lock-in.
- Note: because licensing restricts head-to-head benchmarking of the commercial tools, their accuracy is expected to be broadly similar; scalability should therefore be the primary selection criterion.
4. Final Thoughts
Based on the above investigations and the previous blog, building a population genomics project with fewer than 10,000 WGS samples is achievable using existing technologies.
GATK (Genome Analysis Toolkit) appears to have slower performance for both single-sample variant calling and joint genotyping at large cohort scale. In contrast, DPGT is promising as an open-source solution that supports scaling through sample batching and genomic region splitting using BED files, while maintaining accuracy comparable to GLnexus.
GLnexus performs well for joint genotyping, but one limitation is that when new samples are added, the cohort may need to be recomputed. This could potentially be improved by introducing features or plugins that split samples into batches, preserve intermediate cohort metrics, and incrementally aggregate results as new data arrives. Such an approach would significantly improve scalability and reduce recomputation costs. GLnexus can be integrated into common workflow systems such as Nextflow, WDL, or Snakemake. Additionally, individual gVCF files can be accessed using htslib to enable random access to specific genomic regions, avoiding the need to download entire files and improving efficiency when working with BED-defined regions.
Overall, with approximately two full-time engineers, adequate computing infrastructure, and high-quality raw sequencing data, a population-scale WGS variant calling pipeline for fewer than 10,000 samples could realistically be implemented in less than one month.
References
- Caetano-Anolles, D. (2022). "Calling variants on cohorts of samples using the HaplotypeCaller in GVCF mode." GATK Documentation.
- Lin, M. F. (2018). "GLnexus: joint variant calling for large cohort sequencing." bioRxiv.
- Caetano-Anolles, D. (2022). "gVCF – Genomic Variant Call Format." GATK Documentation.
- Kousathanas et al. (2022). "Whole-genome sequencing reveals host factors underlying critical COVID-19." Nature, 607, 97–103.
- Byrska-Bishop, M. et al. (2022). "High-coverage whole-genome sequencing of the expanded 1000 Genomes Project cohort including 602 trios." Cell, 185.
- Illumina. "DRAGEN Bio-IT Platform: gVCF Genotyper User Guide." Technical documentation.
- Terra.bio. "Scaling variant discovery to a million genomes with the Genomic Variant Store." https://terra.bio/scaling-variant-discovery-to-a-million-genomes-with-the-genomic-variant-store/
- DPGT: https://www.biorxiv.org/content/10.64898/2026.03.02.709184v2.full