Variant Calling (Part 11): Population-Scale Genotyping Using gVCF and Joint Variant Calling
Population-scale variant calling is a critical step in building population genomics projects. While single-sample variant calling is well established, scaling joint genotyping to thousands of WGS samples introduces challenges in performance, storage, and incremental updates. In this blog, I explore gVCF-based joint variant calling approaches and evaluate scalable solutions built on modern open-source tools. I also discuss practical architecture considerations for efficiently constructing population-scale genomics projects.
In Parts 1-8 of the variant calling series, I built a complete short-read germline variant calling pipeline—from raw reads through variant annotation using nf-germline-short-read-variant-calling. In Part 6, I demonstrated that modern variant callers like DeepVariant can produce gVCF files directly—output files that capture both variant AND reference confidence information.
But there's a critical next step that most tutorials skip: How do you combine hundreds or thousands of these gVCF files to explore and compare variants across multiple samples? And more importantly, how do you use the power of a population to improve variant accuracy across your cohort?
This article explores two complementary concepts:
- gVCF Combination → Aggregating gVCF files to systematically explore variants across multiple samples, enabling population-level analysis and discovery
- Joint Variant Calling → Using cohort-wide population information to improve genotype accuracy, correct individual sample errors, and refine variant quality scores
I demonstrate the technical journey from single-sample gVCF files (created by DeepVariant or GATK) to a population-scale multi-sample VCF using DRAGEN gVCF Genotyper, showing how joint genotyping leverages population signals to enhance accuracy without re-calling variants. I use real data from the 1000 Genomes Project to show practical examples.
This is Part 11 of the Variant Calling series:
- Parts 1-3: Building production-grade variant calling pipeline (bash → Nextflow → HPC)
- Parts 4-8: Workflow quality, benchmarking, annotation, and structural variants
- Part 11 (This Post): Population-scale joint genotyping and cohort aggregation
Prerequisites: Familiarity with gVCF format from Part 6 and basic understanding of variant calling pipelines.
1. Two Concepts: gVCF Combination and Joint Variant Calling
1.1. Understanding the Distinction
Before diving into population-scale workflows, I need to clarify two distinct (but complementary) processes that often get confused:
gVCF Combination:
- Purpose: Aggregating reference confidence information from multiple samples to systematically explore which variants exist across your cohort
- Process: Merging gVCF files while preserving per-sample genotype likelihoods and reference confidence
- Output: A multi-sample VCF showing what variants each sample carries (genotype discovery)
- Accuracy Improvement: Minimal—primarily enables cross-sample exploration and discovery
Joint Variant Calling:
- Purpose: Using population-level signals to improve genotype accuracy and refine quality scores at each site
- Process: Probabilistically re-evaluating each sample's genotype using cohort-wide allele frequency priors and genotype patterns
- Output: Refined genotypes with improved quality scores leveraging population consensus
- Accuracy Improvement: Significant—especially for low-confidence sites, rare variants, and low-coverage regions
1.2. Where I Left Off: gVCF from DeepVariant
In Part 6, our nf-germline-short-read-variant-calling pipeline produces:
Per-sample outputs:
- single-sample VCF: Variants only (positions with alternate alleles)
- single-sample gVCF (created by DeepVariant): Variants + reference blocks with quality scores
Example file structure:
sample-1/
├── sample1.vcf.gz # Hard-filtered variants only
├── sample1.vcf.gz.tbi # Index for VCF
├── sample1.gvcf.gz # gVCF with reference blocks ← KEY FILE
└── sample1.gvcf.gz.tbi # Index for gVCF
sample-2/
├── sample2.vcf.gz
├── sample2.gvcf.gz # ← I combine these to explore variants
└── ...
Why gVCF enables exploration: Unlike standard VCF files, gVCF captures:
- ✅ Variant sites (same as VCF)
- ✅ Reference confidence blocks (<NON_REF> entries) showing where samples are likely non-variant
- ✅ Quality scores for non-variant positions
- ✅ Complete genotype likelihoods per sample
This rich information allows us to systematically explore variants across samples—comparing not just which variants each sample has, but also the confidence level at every genomic position.
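To make that concrete, here is a small sketch (assuming your bcftools/htslib build supports remote access to the S3-hosted DRAGEN 1KGP files used later in this post; the region is chosen arbitrarily): pull the gVCF records overlapping a locus and read the GQ of the covering reference block directly.
# Inspect reference confidence at a specific locus in one sample's gVCF
# (requires bcftools/htslib built with remote-access support)
bcftools view -H \
  -r chr1:10000-10050 \
  s3://1000genomes-dragen-v4-2-7/data/individuals/hg38_alt_masked_graph_v3/NA21144/NA21144.final.hard-filtered.gvcf.gz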
1.3. Two-Stage Pipeline: Combination then Refinement
Figure 1: Two-Stage Population Variant Calling Pipeline
Many workflows combine these stages into one tool (like GATK Joint Genotyping), but conceptually they are two distinct operations.
2. The Three-Stage Genomic Data Journey: VCF → gVCF → Joint Cohort VCF
- Various statistical and machine learning models can be applied to improve individual variant calling using information from multiple samples.
- The variant re-calling model below is presented for proof-of-concept purposes only.
This section traces the data transformation pipeline from individual variant calls to population-scale cohorts. Each stage adds value: variant-only VCF enables basic discovery, gVCF enables cross-sample exploration, and joint cohort VCF applies population knowledge for accuracy refinement.
What makes this workflow powerful:
- gVCF preserves reference confidence → You can explore not just variants, but confidence at every genomic position
- Multi-sample VCF enables discovery → See which variants are present in your cohort
- Joint genotyping refines accuracy → Population consensus improves genotype quality scores across all samples
2.1. Stage 1: Single-Sample VCF (from DRAGEN)
What it contains:
- Only variant sites (positions with alternate alleles)
- Genotype information for ONE sample
- No information about non-variant positions
- File size: ~100-500 MB per sample
Example command (using bcftools on S3):
- Illumina, together with the 1000 Genomes Project, has released the individual gVCFs and a cohort gVCF that I use for exploration throughout this blog.
- The files are stored in an S3 bucket, and the latest bcftools (built on htslib) supports random access to remote files. Read more in the htslib remote file blog.
# View DRAGEN hard-filtered VCF for single sample
bcftools view s3://1000genomes-dragen-v4-2-7/data/individuals/hg38_alt_masked_graph_v3/NA21144/NA21144.final.hard-filtered.vcf.gz \
| grep -v "contig" \
| head -n80
Output showing VCF header and variant records:
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for ref and alt alleles in the order listed">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes">
...
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA21144
chr1 10616 . CCGCCGTTGCAAAGGCGCGCCG C 8.83 PASS AC=2;AF=1;AN=2;DP=15;... GT:AD:GQ:PL 1/1:2,13:4:38,6,0
chr1 10927 . A G 10.2 PASS AC=2;AF=1;AN=2;DP=2;... GT:AD:GQ:PL 1/1:0,2:3:45,5,0
chr1 11022 . G A 10.54 PASS AC=2;AF=1;AN=2;DP=4;... GT:AD:GQ:PL 1/1:1,3:5:47,7,0
Key observation: Position chr1:15000 is NOT in this file—there's no way to know if the sample is confidently homozygous reference or just not sequenced.
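You can verify this absence directly. A hedged sketch (same remote-access assumption as above): querying the variant-only VCF at that position returns nothing, and nothing is exactly the problem—you cannot tell hom-ref apart from no-data.
# Query the variant-only VCF at a position with no record
bcftools view -H \
  -r chr1:15000 \
  s3://1000genomes-dragen-v4-2-7/data/individuals/hg38_alt_masked_graph_v3/NA21144/NA21144.final.hard-filtered.vcf.gz
# (no output: hom-ref? no coverage? the VCF alone cannot say)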
2.2. Stage 2: Single-Sample gVCF (output from DeepVariant in Part 6 is similar)
What it contains:
- Both variant sites AND reference blocks (homozygous reference regions)
- Genotype likelihoods for each position
- Quality scores for non-variant positions
- File size: ~200-800 MB per sample (larger than VCF due to reference blocks)
Example command:
# View gVCF with reference blocks and quality scores from the DRAGEN pipeline
bcftools view s3://1000genomes-dragen-v4-2-7/data/individuals/hg38_alt_masked_graph_v3/NA21144/NA21144.final.hard-filtered.gvcf.gz \
| grep -v "contig" \
| head -n100
Key difference from VCF: Notice the <NON_REF> entries and END field that mark reference blocks:
##fileformat=VCFv4.2
##ALT=<ID=NON_REF,Description="Represents any possible alternative allele at this location">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
##GVCFBlock=minGQ=0(inclusive),maxGQ=1(exclusive)
##GVCFBlock=minGQ=1(inclusive),maxGQ=10(exclusive)
##GVCFBlock=minGQ=10(inclusive),maxGQ=20(exclusive)
...
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA21144
chr1 9998 . N <NON_REF> . LowGQ END=10003 GT:GQ:MIN_DP ./.:0:1
chr1 10004 . C <NON_REF> . PASS END=10025 GT:GQ:MIN_DP 0/0:3:12
chr1 10026 . A <NON_REF> . PASS END=10026 GT:GQ:MIN_DP 0/0:15:32
chr1 10027 . A <NON_REF> . PASS END=10044 GT:GQ:MIN_DP 0/0:21:35
chr1 10045 . A <NON_REF> . PASS END=10048 GT:GQ:MIN_DP 0/0:21:44
Key insights:
- <NON_REF>: Symbolic ALT allele meaning "reference confidence"
- END=10025: This reference block spans chr1:10004-10025
- GT:GQ:MIN_DP: Genotype (0/0 = homozygous reference), genotype quality (GQ=3 = low confidence), and minimum depth within the block
Now you know: Sample NA21144 is called homozygous reference at chr1:10004-10025, but only with GQ=3 (low confidence), and at chr1:10026 with GQ=15 (higher confidence).
2.2.1. Understanding Genotype Quality (GQ) Calculation
What is GQ?
Genotype Quality (GQ) is a Phred-scaled probability that measures confidence in the genotype call. It's defined as:
GQ = -10 × log10 P(called genotype is wrong)
Interpretation:
- GQ = 3: Probability of error = 10^(-3/10) ≈ 50% (essentially coin flip—not confident)
- GQ = 10: Probability of error = 10^(-10/10) = 0.1 = 10% (one in ten chance of error)
- GQ = 20: Probability of error = 10^(-20/10) = 0.01 = 1% (one in hundred chance)
- GQ = 30: Probability of error = 10^(-30/10) = 0.001 = 0.1% (one in thousand—high confidence)
- GQ ≥ 60: Near-perfect confidence in the genotype
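If you want to sanity-check these conversions yourself, a one-liner (awk here; any calculator works) applies the Phred formula from above:
# Convert a GQ value to its implied genotype-error probability
awk -v gq=30 'BEGIN { printf "P(error) = %.4f\n", 10^(-gq/10) }'
# P(error) = 0.0010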
Real-world example from the gVCF above:
chr1 10004 . C <NON_REF> . PASS END=10025 GT:GQ:MIN_DP 0/0:3:12
↑ GQ=3 (LOW confidence)
chr1 10026 . A <NON_REF> . PASS END=10026 GT:GQ:MIN_DP 0/0:15:32
↑ GQ=15 (HIGHER confidence)
Why the difference?
- Position chr1:10004-10025 has MIN_DP=12 (moderate depth but probably low-complexity region with ambiguous reads)
- Position chr1:10026 has MIN_DP=32 (higher depth, clearer reference call)
The problem with low GQ calls:
When GQ is low (≤3), you cannot distinguish between:
- "This position really is homozygous reference" (GQ low due to sequencing noise)
- "This position might have a rare variant I can't see" (GQ low due to insufficient coverage)
This ambiguity cascades when building cohort call sets—if 1,000 samples have GQ=3 at the same position, do you trust their consensus that it's a non-variant? Or should you flag this region as suspicious?
2.2.2. How Joint Genotyping Improves Variant Calls Using Population Power
The Single-Sample Problem: Decisions in Isolation
When calling variants on a single sample, low-confidence calls must be decided with only that sample's data:
Sample 1, evaluated in isolation, at position chr22:10514143:
- Reads support: 95% for genotype 0/0 (ref/ref), 5% for genotype 0/1 (ref/alt)
- Confidence: GQ=10 (1 in 10 chance this call is wrong)
- Decision: Call 0/0, but confidence is LOW
Without population context:
- Is this truly a homozygous reference call?
- Or is this a real but very rare variant being missed?
- I cannot know without additional information
The Joint Genotyping Solution: Borrowing Strength from Population
When you have a cohort, population patterns dramatically improve accuracy. Joint genotyping uses cohort-wide signals to reassess individual sample genotypes:
Cohort of 1,000 samples at chr22:10514143:
┌─────────────────────────────────────────────────────┐
│ Sample 1: 0/0 (GQ=10) ← questionable call │
│ Sample 2: 0/0 (GQ=50) ← very confident │
│ Sample 3: 0/0 (GQ=48) ← very confident │
│ Sample 4: 0/0 (GQ=52) ← very confident │
│ Samples 5-1000: All 0/0 (GQ>40) ← strong consensus │
└─────────────────────────────────────────────────────┘
Population signal:
- 999 samples independently call 0/0 with high confidence
- Only 1 sample has uncertain confidence
- Statistical analysis: Probability this position is truly 0/0 = 99%+ (allele frequency ≈ 0)
- Conclusion: Sample 1's uncertain call is REINFORCED by population consensus
Result: Sample 1's GQ improved from 10 → 35 (borrowing confidence from the cohort)
Real-world impact - Concrete examples of accuracy improvement:
Example 1: Low-coverage region
Position chr1:45000000 (low sequencing depth in Sample 1):
Before joint genotyping:
Sample_1 0/0:8 (uncertain due to only 3 reads, noisy region)
Sample_2 0/0:60 (30 reads, clear reference)
...
Sample_1000 0/0:58
After joint genotyping:
Sample_1 0/0:38 ← GQ improved 8→38 using 999 samples' consensus
(Population evidence: 1,000/1,000 samples call 0/0, so position is likely not variant)
Example 2: Rare variant validation
Position chr5:12000000:
Before joint genotyping:
Sample_1 0/1:20 (one read supports alt, one ref—uncertain heterozygous call)
Sample_2 0/0:55 (confidently reference)
...
Sample_1000 0/0:52
After joint genotyping:
Sample_1 0/1:32 ← GQ improved 20→32 (population consensus confirms this is a real rare variant)
(Calculation: "This heterozygous call in one sample + homozygous refs in 999 others = likely a real rare variant worth believing")
Example 3: Correcting false positives
Position chr19:88000000:
Before joint genotyping:
Sample_1 0/1:9 (weak evidence for alt, GQ very low)
Sample_2 0/0:51 (reference)
...
Sample_1000 0/0:49
After joint genotyping:
Sample_1 0/0:28 ← Corrected from 0/1→0/0 (population consensus suggests this was artifact)
(Rationale: "Only 1 sample heterozygous out of 1,000 with weak signal + 999 homozygous refs = likely sequencing artifact")
How the Math Works - The Probabilistic Framework
Joint genotyping combines three pieces of information:
- Single-sample likelihood: P(reads | genotype) ← What the reads in this one sample support
- Population prior: P(genotype | cohort allele frequency) ← What I expect based on 999 other samples
- Posterior probability: P(genotype | reads + cohort) ← Final refined estimate combining both
Simplified equation:
Refined GQ = single_sample_confidence
+ population_evidence
- conflict_penalty
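To see how those three pieces combine, here is a back-of-envelope Bayesian calculation with made-up numbers mirroring the Sample 1 scenario above (the likelihoods and allele frequency are illustrative assumptions, not output from any tool):
Cohort: 999 confident 0/0 calls → alt allele frequency f ≈ 0.001
Hardy-Weinberg priors: P(0/0) ≈ (1-f)^2 ≈ 0.998, P(0/1) ≈ 2f(1-f) ≈ 0.002
Sample 1 read likelihoods (a GQ=10 call): P(reads|0/0) = 0.90, P(reads|0/1) = 0.10

Posterior for the uncertain genotype:
P(0/1 | reads, cohort) = (0.10 × 0.002) / (0.10 × 0.002 + 0.90 × 0.998)
                       ≈ 0.0002 / 0.8984 ≈ 2.2e-4
Refined GQ ≈ -10 × log10(2.2e-4) ≈ 37   (up from 10, matching the intuition above)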
When population power is most valuable:
| Scenario | Individual GQ | Population GQ | Improvement | Why |
|---|---|---|---|---|
| Low-coverage region | 6 | 35 | +29 | Cohort consensus compensates for low depth |
| Rare variant (1/1000 samples) | 15 | 28 | +13 | Population signals confirm it's real |
| Artifact in one sample | 8 | 2 (filtered) | -6 | 999 samples say it's wrong |
| Common variant | 45 | 48 | +3 | Minimal improvement (already confident) |
Critical insight: Why this matters for your cohort
The key realization is that population power specifically helps uncertain calls:
- High-confidence individual calls stay high-confidence (no change needed)
- Low-confidence individual calls gain confidence from cohort agreement
- Erroneous singleton calls are identified and corrected by population patterns
This is why joint genotyping can simultaneously:
- Improve accuracy across the cohort
- Reduce false positives (artifact calls corrected)
- Increase true positive detection (weak real calls boosted)
2.3. Stage 3: Multi-Sample Cohort VCF (after DRAGEN gVCF Genotyper)
What it contains:
- All variants discovered ACROSS the entire cohort (3,202 samples in 1KGP)
- Genotypes for all samples at each variant site
- Population-level metrics (allele counts, frequencies, Hardy-Weinberg statistics)
- Local alleles representation (memory-efficient FORMAT/LPL and FORMAT/LAA fields)
Example command:
# View the cohort VCF from 3,202 samples on chromosome 22
# Note: showing only the first 12 columns (CHROM to first few samples)
bcftools view s3://1000genomes-dragen-v4-2-7/data/cohorts/gvcf-genotyper-dragen-4.2.7/hg38/3202-samples-cohort/3202_samples_cohort_gg_chr22.vcf.gz \
| grep -v "contig" \
| head -n60 \
| cut -f1-12
Key metrics added by DRAGEN gVCF Genotyper:
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=NS,Number=1,Type=Integer,Description="Total number of samples">
##INFO=<ID=NS_GT,Number=1,Type=Integer,Description="Total number of samples with called genotypes">
##INFO=<ID=NS_NOGT,Number=1,Type=Integer,Description="Samples with unknown genotypes ./.">
##INFO=<ID=NS_NODATA,Number=1,Type=Integer,Description="Samples with no coverage">
##INFO=<ID=IC,Number=1,Type=Float,Description="The inbreeding coefficient">
##INFO=<ID=HWE,Number=A,Type=Float,Description="Hardy-Weinberg Equilibrium P-value">
##FORMAT=<ID=LPL,Number=.,Type=Integer,Description="Local Phred-scaled likelihoods (memory efficient)">
##FORMAT=<ID=LAA,Number=.,Type=String,Description="Local alternate allele indices">
Example variant in the cohort:
chr22 10514143 . G A 17.92 PASS AC=16;AN=1452;NS=3202;NS_GT=726;NS_NOGT=33;NS_NODATA=2443;IC=0.75;HWE=2.1e-12
GT:GQ:AD:LPL:LAA .:.:.:.:. .:.:.:.:. ...
What this tells you:
- AC=16, AN=1452: 16 copies of A allele out of 1,452 total alleles genotyped
- NS=3202: All 3,202 samples in cohort
- NS_GT=726: 726 samples have a genotype call at this site
- NS_NODATA=2443: 2,443 samples had no coverage at this position
- IC=0.75: High inbreeding coefficient (excess homozygotes—potential population structure)
- HWE=2.1e-12: Highly significant deviation from Hardy-Weinberg equilibrium
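These INFO fields are easy to extract programmatically. A minimal sketch (same remote-access assumption as earlier; field names taken from the header above) using bcftools query:
# Tabulate per-site allele counts, genotyped-sample counts, and HWE p-values
bcftools query \
  -f '%CHROM\t%POS\t%REF\t%ALT\t%INFO/AC\t%INFO/AN\t%INFO/NS_GT\t%INFO/HWE\n' \
  s3://1000genomes-dragen-v4-2-7/data/cohorts/gvcf-genotyper-dragen-4.2.7/hg38/3202-samples-cohort/3202_samples_cohort_gg_chr22.vcf.gz \
  | head -n5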
3. Joint Genotyping Solutions
For implementing joint genotyping with practical results, the ideal tool should have:
- Open-source: Accessible to deploy on any platform without licensing barriers
- Accuracy: Effectively leverages population signals to improve genotype quality
- Scalability: Handles growing cohort sizes efficiently with distributed computing
Once you have combined gVCF files into a multi-sample VCF, the next step is applying joint genotyping to improve accuracy using population signals. Different tools offer different trade-offs between accuracy improvement, computational cost, scalability, and operational flexibility.
3.1. Comprehensive Tool Comparison
| Tool | Performance | Scalability | Open-Source | Key Pros | Key Cons |
|---|---|---|---|---|---|
| GATK Joint Genotyping | Slow (single-threaded) | Poor (must reprocess the entire cohort when new samples arrive) | Yes | ✅ Mature, well-documented; robust VQSR recalibration | ❌ Full-pipeline re-runs; complex parameter tuning; tightly coupled to GATK |
| GLnexus | Fast (specialized C++ backend) | Medium (split by regions) | Yes | ✅ Faster than GATK; incremental updates possible | ❌ Tightly coupled to DeepVariant; scales only by region splitting |
| Genomic Variant Store (Terra) | Very fast (BigQuery-based) | Excellent (cloud-native) | No ❌ | ✅ Scalable to millions; managed service | ❌ Terra platform lock-in; licensing; no local control |
| DRAGEN IGG | Very fast (specialized hardware) | Excellent (split by regions + sample batches) | No ❌ | ✅ Incremental processing; high performance; enterprise support | ❌ Licensing; hardware dependency; vendor lock-in |
| DPGT | Very fast (Apache Spark distributed) | Excellent (split by regions + sample batches) | Yes | ✅ Open-source; incremental processing; high performance | ⚠️ Less mature; limited documentation; more involved setup |
3.2. Decision Matrix: Which Tool Should You Choose?
| Your Situation | Recommended Tool | Why |
|---|---|---|
| Academic/research, small cohort (<1K samples), rare variant discovery, variants called by GATK HaplotypeCaller | GATK Joint Genotyping | Mature, well-documented, best accuracy for non-DRAGEN callers |
| Growing biobank, need incremental updates, variants called by DRAGEN | DRAGEN IGG (if already using DRAGEN) or DPGT (open-source) | Incremental processing critical for operational efficiency |
| Large consortium (100K+ samples) | Genomic Variant Store or DRAGEN IGG | Massive scalability; cost acceptable at this scale |
| Open-source, scaling up to a million samples | DPGT | Best open-source option; excellent scaling; no vendor lock-in |
| Academic/research, small cohort (<1K samples), variants called by DeepVariant | GLnexus | Good speed/accuracy trade-off; cloud-friendly |
3.3. Practical Implementation: Open-Source Joint Variant Calling with GKit
This section demonstrates how to implement joint variant calling using open-source tools integrated in the GKit repository (github.com/gianglabs/gkit). I show both GLnexus and DPGT implementations with real workflows.
3.3.1. GATK Joint Genotyping (Already Integrated in nf-germline)
Note: GATK GenotypeGVCFs is already implemented in the production pipeline at nf-germline-short-read-variant-calling. The pipeline includes the GATK_GENOTYPEGVCFS process that:
- Takes individual gVCF files from DeepVariant or GATK
- Performs population-aware genotype refinement
- Outputs joint VCF with population-level metrics
For academic research projects with small cohorts calling variants with GATK HaplotypeCaller, this integrated approach is recommended.
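For orientation, the standard GATK4 two-step pattern that the pipeline wraps looks roughly like this (a minimal sketch with hypothetical file names; consult the pipeline itself for the production parameters):
# Step 1: consolidate per-sample gVCFs into a GenomicsDB workspace
gatk GenomicsDBImport \
  -V sample1.gvcf.gz \
  -V sample2.gvcf.gz \
  --genomicsdb-workspace-path cohort_db \
  -L chr22
# Step 2: joint-genotype the consolidated cohort
gatk GenotypeGVCFs \
  -R Homo_sapiens_assembly38.fasta \
  -V gendb://cohort_db \
  -O cohort.vcf.gz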
3.3.2. GLnexus Implementation
What GLnexus does:
- High-performance joint variant calling optimized for DeepVariant gVCF output
- Efficient region-based processing (splittable by genomic regions)
- Produces cohort VCF with improved genotype quality scores
Quick Start with GKit:
cd /path/to/gkit/glnexus
# Install dependencies with Pixi
pixi install
# Run joint genotyping on test data (GIAB + 1KGP samples)
make test
Output:
cohort_vcf/GIAB/data/GIAB_ALDH2_5kb.vcf.gz
cohort_vcf/1KGP/data/1KGP_ALDH2_5kb.vcf.gz
Configuration Details:
The pipeline downloads gVCF files from public cloud storage and runs GLnexus:
# --config DeepVariant: settings optimized for DeepVariant gVCF output
# --threads 4: CPU cores for parallel processing
# --bed ALDH2_5kb.bed: limit analysis to the region of interest
# positional arguments: input gVCF files; BCF is written to stdout
glnexus_cli \
  --config DeepVariant \
  --threads 4 \
  --bed ALDH2_5kb.bed \
  *.g.vcf.gz > cohort.bcf
Key parameters:
- --config DeepVariant: Uses DeepVariant-optimized settings
- --threads: CPU cores for parallel processing (adjust based on available resources)
- --bed: Region file to limit analysis (enables incremental processing across genome)
Customization:
- Change analysis region by editing the BED file
- Add more samples by including their gVCF files
- Reduce memory usage by lowering thread count
Pipeline features:
- Automatically downloads GIAB and 1000 Genomes Project samples from cloud storage
- Converts BCF output to gzip-compressed VCF format (a minimal standalone equivalent is sketched below)
- Reproducible environment via Pixi
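That last conversion step is plain bcftools; a minimal standalone equivalent, assuming the cohort.bcf produced by the glnexus_cli command above:
# Convert GLnexus BCF output to bgzipped VCF and index it
bcftools view cohort.bcf -Oz -o cohort.vcf.gz
bcftools index -t cohort.vcf.gz   # writes cohort.vcf.gz.tbi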
3.3.3. DPGT Implementation (Distributed Population Genetics Tool)
What DPGT does:
- Open-source distributed joint variant calling using Apache Spark
- Handles both sample-level and region-level parallelization
- Comparable accuracy to GATK with better scalability for large cohorts (10K+ samples)
Quick Start with GKit:
cd /path/to/gkit/dpgt
# Install and build DPGT locally
pixi install
bash scripts/clone_DPGT.sh
bash scripts/build_cpp.sh
# Run joint genotyping
bash run_dpgt.sh
Default output:
results_verify_1kgp_run_dpgt/result.chr12_111760000_111765000.0.vcf.gz
Configuration with Environment Variables:
INPUT_LIST=inputs/1kgp_3samples.list \
REFERENCE_FASTA=reference/Homo_sapiens_assembly38.fasta \
OUTPUT_DIR=results/my_run \
TARGET_REGION=chr12:111760000-111765000 \
JOBS=4 \
bash run_dpgt.sh
Key parameters:
- INPUT_LIST: File listing paths to input gVCF files
- REFERENCE_FASTA: Reference genome in FASTA format
- OUTPUT_DIR: Output directory for results
- TARGET_REGION: Genomic region in the format chr:start-end
- JOBS: Number of Spark executors for parallel processing
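For reference, INPUT_LIST is just a text file with one gVCF path per line. A hypothetical example (these local paths and sample IDs are placeholders, not the actual GKit test inputs):
# inputs/1kgp_3samples.list — one gVCF path per line
/data/gvcf/HG00096.hard-filtered.gvcf.gz
/data/gvcf/HG00097.hard-filtered.gvcf.gz
/data/gvcf/HG00099.hard-filtered.gvcf.gz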
Pipeline workflow:
- Checks DPGT JAR and native library compilation
- Validates input files and reference
- Runs joint genotyping using local Spark cluster
- Outputs cohort VCF with population statistics
Scalability Notes:
- Local mode: Suitable for test/validation (as shown here with 3 samples)
- Cluster mode: DPGT integrates with Spark on SLURM (see spark-on-slurm/ in GKit)
- Can scale to 10K+ samples by distributing jobs across Spark executors
Known limitations:
- GIAB inputs may fail due to <NON_REF> compatibility issues; 1KGP samples are validated
- Requires Apache Spark infrastructure setup
- Spark cluster management needed for large-scale processing
3.3.4. Spark on SLURM for Large-Scale DPGT Processing
For biobank-scale analysis (1K-100K+ samples), DPGT can run on a Spark cluster deployed within SLURM jobs using the spark-on-slurm configuration in GKit:
cd /path/to/gkit/spark-on-slurm
# Setup environment
make setup
# Test Spark cluster on SLURM
make example
What this enables:
- Standalone Spark cluster launches inside a SLURM allocation (no Docker required)
- Reproducible environment via Pixi
- PySpark driver can submit DPGT jobs to distributed workers
- Automatic cleanup after job completion
Architecture:
- SLURM batch job allocation
- Spark master starts on first node
- Spark workers start via srun across allocated nodes
- PySpark driver connects to master and runs genotyping jobs
- Results written to shared filesystem
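To make the architecture tangible, here is a minimal sketch of such a batch script (resource sizes, Spark paths, and the DPGT submit arguments are assumptions—the actual GKit scripts and DPGT CLI flags may differ):
#!/bin/bash
#SBATCH --job-name=dpgt-joint
#SBATCH --nodes=4
#SBATCH --cpus-per-task=16
#SBATCH --time=12:00:00

# Start a standalone Spark master on the first (batch) node
"$SPARK_HOME/sbin/start-master.sh"
MASTER_URL="spark://$(hostname):7077"

# Launch one foreground Spark worker per allocated node via srun
srun --ntasks-per-node=1 \
  "$SPARK_HOME/bin/spark-class" org.apache.spark.deploy.worker.Worker "$MASTER_URL" &
sleep 15  # give workers time to register with the master

# Submit the DPGT job to the cluster (JAR path and flags are illustrative)
"$SPARK_HOME/bin/spark-submit" --master "$MASTER_URL" dpgt.jar \
  --input inputs/1kgp_3samples.list \
  --output results/cohort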
This approach enables true horizontal scalability for population-scale joint genotyping without vendor lock-in.
- Note: because licensing restricts head-to-head benchmarking of the commercial tools, their accuracy is expected to be broadly similar; scalability should therefore be the primary selection criterion.
4. Final Thoughts
Based on the above investigations and the previous blog, building a population genomics project with fewer than 10,000 WGS samples is achievable using existing technologies.
GATK (Genome Analysis Toolkit) appears to have slower performance for both single-sample variant calling and joint genotyping at large cohort scale. In contrast, DPGT is promising as an open-source solution that supports scaling through sample batching and genomic region splitting using BED files, while maintaining accuracy comparable to GLnexus.
GLnexus performs well for joint genotyping, but one limitation is that when new samples are added, the cohort may need to be recomputed. This could potentially be improved by introducing features or plugins that split samples into batches, preserve intermediate cohort metrics, and incrementally aggregate results as new data arrives. Such an approach would significantly improve scalability and reduce recomputation costs. GLnexus can be integrated into common workflow systems such as Nextflow, WDL, or Snakemake. Additionally, individual gVCF files can be accessed using htslib to enable random access to specific genomic regions, avoiding the need to download entire files and improving efficiency when working with BED-defined regions.
Overall, with approximately two full-time engineers, adequate computing infrastructure, and high-quality raw sequencing data, a population-scale WGS variant calling pipeline for fewer than 10,000 samples could realistically be implemented in less than one month.
References
- Caetano-Anolles, D. (2022). "Calling variants on cohorts of samples using the HaplotypeCaller in GVCF mode." GATK Documentation.
- Lin, M. F. (2018). "GLnexus: joint variant calling for large cohort sequencing." bioRxiv.
- Caetano-Anolles, D. (2022). "gVCF – Genomic Variant Call Format." GATK Documentation.
- Kousathanas et al. (2022). "Whole-genome sequencing reveals host factors underlying critical COVID-19." Nature, 607, 97–103.
- Byrska-Bishop, M. et al. (2022). "High-coverage whole-genome sequencing of the expanded 1000 Genomes Project cohort including 602 trios." Cell, 185.
- Illumina. "DRAGEN Bio-IT Platform: gVCF Genotyper User Guide." Technical documentation.
- Terra.bio. "Scaling variant discovery to a million genomes with the Genomic Variant Store." https://terra.bio/scaling-variant-discovery-to-a-million-genomes-with-the-genomic-variant-store/
- DPGT: https://www.biorxiv.org/content/10.64898/2026.03.02.709184v2.full