
How can a country build its own 1000 Genomes Project?

· 45 min read
Thanh-Giang Tan Nguyen
Founder at G Labs

Recent national genome initiatives highlight the growing need for population-specific genomic resources. After reading VN1K: a genome graph-based and function-driven multi-omics and phenomics resource for the Vietnamese population and EGP1K: Whole-Genome Sequencing of 1,024 Egyptians Characterizes Population Structure and Genetic Diversity, it becomes clear that building a national-scale 1000-genome project is increasingly important for understanding genetic diversity, improving disease research, and enabling precision medicine.

In this blog, I outline how a country can design and implement a project at a similar scale — from sample collection to cohort-level genomic analysis. This discussion focuses on the technical and architectural aspects of building the 1000-genome resource. It does not cover the downstream data platform or user portal for accessing processed data, which will be explored in a separate article.

The goal is to provide a practical, scalable roadmap that countries — especially those with emerging genomics infrastructure — can adapt to their own national genome initiatives.

1. The Vision — Why Countries Need Their Own 1000 Genomes Project

info

The original 1000 Genomes Project (2008-2015) was a watershed moment in human genomics—it created the first comprehensive catalog of global genetic variation, providing a reference map that democratized variant calling and population genetics. But here's the catch: that data, while invaluable globally, doesn't fully represent your country's population, healthcare challenges, or genetic landscape.

1.1. The Global Reference Isn't Enough

Modern genome analysis relies heavily on population-specific variant frequencies. When you're doing variant interpretation, rare disease diagnosis, or pharmacogenomics, you need to know: How common is this variant in the population I'm treating? The 1000 Genomes Project surveyed ~2,500 individuals across diverse populations, but the genetic variation in your local population—shaped by unique migration histories, founder effects, and evolutionary pressures—may differ significantly.

Clinical implications are real:

  • A variant marked as "rare" globally might be common in your population
  • Disease association studies require local allele frequency data
  • Precision medicine strategies must account for population-specific risk alleles
  • Rare disease diagnosis becomes more accurate with local context

1.2. Sovereignty and Data Ownership

Large-scale genomic projects generate not just data—they generate infrastructure, expertise, and economic value. Countries that build their own projects gain:

  • Data sovereignty: Complete control over sensitive health data rather than relying on international consortiums
  • Research leadership: Ability to conduct population-specific studies without external dependencies
  • Economic opportunity: Local biotech ecosystems can leverage the data for drug discovery and diagnostics
  • Healthcare innovation: Direct pathway from research insights to clinical practice

1.3. Learning from Existing National Programs Led By Government

The good news: many countries have already built or are building large-scale genome projects. Here's what they've taught us:

  • UK Biobank sequenced 500k+ genomes linked to longitudinal health records, demonstrating how genomic data at scale enables population-wide disease discovery and precision medicine
  • All of Us Research Program (US) demonstrates how to scale to 1 million+ genomes with robust data governance and participant engagement
  • Biobank Japan shows how to integrate genomic data with longitudinal health records for disease association studies
  • Singapore's National Precision Medicine Program illustrates how population-specific allele frequencies improve diagnosis and treatment in resource-constrained settings
  • China's BGI and national initiatives showcase infrastructure for processing massive cohorts at scale
  • Australia's Australian Genomics demonstrates federated governance across multiple institutions and states
  • Many other countries, including South Korea, Sweden, and Finland, are establishing population genomics biobanks to serve their own populations

These programs share common patterns: they all standardized variant calling, implemented joint genotyping strategies, built cohort-wide analytics layers, and invested heavily in data governance. The technical blueprint is proven—now it's about adapting it to your country's resources, population, and healthcare priorities.

Figure 1: Overview of genomic resources in the national biobanks. a Geographical distribution of the biobanks, with the sample numbers representing the total cohort size targeted or achieved by each biobank. Major biobanks possessing large-scale WGS datasets exceeding 10,000 individuals are highlighted. An asterisk (“*”) indicates the targeted cohort size. b Detailed information on WGS sample sizes, ancestry composition, and health conditions of the respective biobank datasets were recently disclosed

Reference: https://link.springer.com/article/10.1186/s44342-025-00040-9

1.4. Learning from Existing Programs Led By Private Companies

While national and government-funded programs dominate the landscape, private companies and organizations have also built impressive large-scale genomic initiatives. Understanding their approach offers valuable lessons on infrastructure, scaling, and data governance.

Direct-to-Consumer (DTC) Companies

  • 23andMe - 15M+ genotyped users with proprietary research database and health insights. Demonstrates consumer engagement at massive scale, though with privacy considerations
  • AncestryDNA - 20M+ genotyped users focused on ancestry and family connections. Shows how genealogical context drives adoption and participation

Pharmaceutical & Drug Development

  • Regeneron Genetics Center (DiscovEHR) - Exomes of ~50k patients linked to electronic health records. Gold standard for pharma-healthcare collaboration; proves that clinical data integration dramatically increases research value
  • Genentech/Roche - Built internal genomic databases from clinical trials and partner networks. Focus on precision oncology and rare diseases
  • GSK (GlaxoSmithKline) - Genomics partnerships acquiring genetic databases for target identification. Demonstrates how pharma funds large-scale infrastructure

Clinical Genomics & Diagnostic Companies

  • Invitae - ~5M genetic testing records (largest clinical genetics lab). Standardized variant calling across millions of diagnostic samples shows how clinical labs maintain quality at scale
  • Tempus - 10M+ de-identified genomic records with AI-powered analysis. Cancer-focused platform demonstrates the value of unified analytics layers
  • Foundation Medicine (Roche) - 1.5M+ patient profiles. Precision oncology at scale—proves that focused cohorts can drive clinical impact

Global Infrastructure & Sequencing

  • BGI (China) - Largest sequencing company globally with massive internal database and hospital partnerships. Demonstrates infrastructure and cost efficiency at unprecedented scale

Key Insight from Private Programs:

Private organizations succeed because they:

  1. Link genomics to actionable outcomes (diagnosis, treatment, drug discovery)—not just research
  2. Invest heavily in data standardization (consistent variant calling, annotation pipelines)
  3. Build cloud-native infrastructure early for scalability
  4. Integrate with clinical workflows directly, creating network effects

For national programs, the private sector model teaches us that data utility drives participation. People contribute samples when they see immediate clinical benefit (diagnosis, health insights) or participate when aligned with personal incentives (ancestry, precision medicine).

1.5. The Technical Challenge

Building a 1000 Genomes-scale project requires:

  • Thousands of high-quality genome sequences
  • Standardized variant calling across all samples
  • Unified genotyping across cohorts
  • Storage and query infrastructure for tens of terabytes of data
  • Population statistics and annotation pipelines
  • Governance and ethics frameworks

But here's the key insight from existing programs: the tools and workflows have matured dramatically. Modern tools like Nextflow, nf-core pipelines, Hail, and cloud infrastructure make this achievable for any well-resourced national program.

This post walks you through the architecture, design decisions, and implementation strategy for building a national genome project—from variant calling to cohort analytics.

2. The Big Picture Architecture

A national genome project is fundamentally a data transformation pipeline: raw sequencing reads → standardized variants → cohort-wide genotypes → research-ready analytics. The architecture must handle massive scale, enforce consistency across thousands of samples, and satisfy strict data governance requirements.

Here's the architecture we'll build around, anchored in real, proven components:

National Genome Project Architecture

Figure 2: End-to-end architecture showing data flow from raw sequencing through population analytics, with concrete tools and storage at each stage.

2.1. Stage 1: Core Infrastructure with SLURM HPC Cluster

info

The foundation is a high-performance computing cluster managed by SLURM.

Figure 3: SLURM HPC cluster and S3 object storage architecture

Key setup:

  • ansible automation: Use the proven river-slurm Ansible playbook to configure SLURM from scratch
    • Automatic node provisioning and job scheduling
    • Resource allocation policies tuned for bioinformatics workloads
    • Network and storage integration
  • Local compute: All processing happens on-site, no data egress to cloud
  • Direct S3 access: SLURM nodes can directly read/write to local/domestic S3 storage
tip

Why SLURM?

  • It's the industry standard for HPC. It scales from dozens to thousands of nodes, handles complex job dependencies, and integrates seamlessly with bioinformatics tools.
  • Distributed systems (e.g., Spark) can run on top of SLURM to scale bioinformatics analyses
  • Easy to integrate with domestic cloud providers (see the job sketch below)
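
A minimal sketch of what direct S3 access from a SLURM job can look like in practice. The endpoint URL, bucket names, and resource requests below are placeholders, not values from the original setup:

#!/bin/bash
#SBATCH --job-name=stage-sample-001
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
#SBATCH --time=02:00:00

# Placeholder endpoint and bucket names for a domestic S3-compatible provider
S3_ENDPOINT="https://s3.domestic-provider.example"

# Stage raw reads from object storage onto node-local scratch
aws s3 cp --endpoint-url "$S3_ENDPOINT" \
  s3://genome-project/raw/sample_001_R1.fastq.gz "$TMPDIR/"
aws s3 cp --endpoint-url "$S3_ENDPOINT" \
  s3://genome-project/raw/sample_001_R2.fastq.gz "$TMPDIR/"

# ... alignment / variant calling happens here ...

# Push the per-sample gVCF back to the archive bucket
aws s3 cp --endpoint-url "$S3_ENDPOINT" \
  "$TMPDIR/sample_001.g.vcf.gz" s3://genome-project/archive/gvcfs/

Compute stays on-premises; only the object storage layer lives with the domestic provider.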

2.2. Stage 2: Standardized Variant Calling

info

Every genome must go through identical variant calling logic. This is non-negotiable for cohort analyses.

The pipeline: nf-germline-short-read-variant-calling

  • Built on Nextflow for reproducibility and portability
  • Benchmarked for accuracy and efficiency to select the best-performing tools for large-scale variant calling
  • Optimized to run efficiently on SLURM clusters
  • Outputs individual gVCF files (genome VCF format—raw variants per sample)

Why gVCF? It includes not just called variants, but also "confident no-calls" at each genomic position. This is crucial for joint genotyping later—without it, you lose information about coverage depth and genotype quality.

Scale: Processing 100 samples/month at 30x coverage:

  • Individual variant calling: ~4-6 hours per genome on a standard CPU node, or 1-2 hours with GPU-accelerated DeepVariant
  • Parallelizable across the SLURM queue (see the launch sketch below)
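
As a sketch of how this looks on the queue: the pipeline is launched once against a samplesheet with a pinned revision and a SLURM profile, and Nextflow then submits one job per sample and per process. The revision tag, profile name, and parameter names below are illustrative assumptions, not the pipeline's documented interface:

# Launch the variant-calling pipeline on SLURM (illustrative parameters)
nextflow run gianglabs/nf-germline-short-read-variant-calling \
  -r v1.2.3 \
  -profile slurm \
  --input samplesheet.csv \
  --outdir results/ \
  -resume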

Figure 4: The Nextflow pipeline for short-read variant calling on the Illumina platform, optimized for 30X WGS. The pipeline integrates with nf-modules, which standardizes and shares common components across pipelines

2.3. Stage 3: Joint Genotyping — Cohort Integration

info

After variant calling, you have thousands of individual gVCF files, each sample called independently. But independent calling is not enough for population-scale genomics: you must combine them into a cohort VCF using joint genotyping.

Example:

sample A:

  • chr1:1000 A>G (variant detected)

sample B:

  • chr1:1000 no variant detected

Sample B's "no variant" call could reflect several issues:

  • Sample B may have had low coverage at that position
  • The variant signal may have been weak
  • The variant may actually exist but was missed

Joint genotyping re-examines the evidence across all samples at every site, so such positions are genotyped consistently rather than silently dropped.

Two proven options:

Option A: GLnexus (Efficient, Recommended for <100k samples)

warning
  • Joint calling can be parallelized by computing over regions defined in a BED file
  • Flanking regions are required mainly because variant representation and normalization can extend beyond the target window, especially for indels and complex variants.
# GLnexus merges gVCFs into a cohort VCF
glnexus_cli --config DeepVariant_h37 --bed <bed file> *.gvcf.gz > cohort.bcf
bcftools view cohort.bcf -O z > cohort.vcf.gz
  • Dramatically faster than GATK GenotypeGVCFs
  • Excellent quality for medium-scale cohorts
  • Memory efficient (can run on modest hardware)

Option B: DPGT with Spark (Distributed, Scales to Millions)

#!/bin/bash
# DPGT runner for joint genotyping cohort VCF

set -euo pipefail

PROJECT_ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
BUILD_LIB_PATH="${PROJECT_ROOT}/build/lib"
DPGT_JAR="${PROJECT_ROOT}/DPGT/target/dpgt-1.3.2.0.jar"

# Defaults (override with env vars if needed)
INPUT_LIST="${INPUT_LIST:-${PROJECT_ROOT}/cohort_vcf/1KGP/gvcf_input.list}"
REFERENCE_FASTA="${REFERENCE_FASTA:-${PROJECT_ROOT}/reference/Homo_sapiens_assembly38.fasta}"
OUTPUT_DIR="${OUTPUT_DIR:-${PROJECT_ROOT}/cohort_vcf/1KGP/results}"
TARGET_REGION="${TARGET_REGION:-chr12:111760000-111763759}"
JOBS="${JOBS:-4}"
ALLOW_OVERWRITE="${ALLOW_OVERWRITE:-0}"

echo "================================"
echo "DPGT Cohort VCF Runner"
echo "================================"
echo ""

# Check prerequisites
echo "Checking prerequisites..."
if [ -z "$DPGT_JAR" ] || [ ! -f "$DPGT_JAR" ]; then
echo "ERROR: DPGT JAR not found at $DPGT_JAR"
echo "Run 'make build' to compile DPGT first"
exit 1
fi

if [ ! -f "$BUILD_LIB_PATH/libcdpgt.so" ]; then
echo "ERROR: libcdpgt.so not found at $BUILD_LIB_PATH"
echo "Run 'make build-cpp' to compile C++ libraries"
exit 1
fi

if [ ! -f "$INPUT_LIST" ]; then
echo "ERROR: input list not found: $INPUT_LIST"
echo "Create it with one gVCF path per line (3-sample trio supported)."
echo "Example existing list: ${PROJECT_ROOT}/gvcf_input.list"
exit 1
fi

if [ ! -f "$REFERENCE_FASTA" ]; then
echo "ERROR: reference fasta not found: $REFERENCE_FASTA"
exit 1
fi

if [ -d "$OUTPUT_DIR" ] && [ "$(ls -A "$OUTPUT_DIR" 2>/dev/null || true)" != "" ]; then
if [ "$ALLOW_OVERWRITE" = "1" ]; then
echo "Output exists. Removing: $OUTPUT_DIR"
rm -rf "$OUTPUT_DIR"
else
echo "ERROR: output directory exists and is not empty: $OUTPUT_DIR"
echo "Set ALLOW_OVERWRITE=1 or choose another OUTPUT_DIR"
exit 1
fi
fi

echo "DPGT JAR: $DPGT_JAR"
echo "C++ Library: $BUILD_LIB_PATH/libcdpgt.so"
echo "Input List: $INPUT_LIST"
echo "Reference: $REFERENCE_FASTA"
echo "Output Dir: $OUTPUT_DIR"
echo "Region: $TARGET_REGION"
echo ""
echo "Note: default region is a small smoke-test interval."
echo ""

# Runtime environment
export LD_LIBRARY_PATH="$BUILD_LIB_PATH:${LD_LIBRARY_PATH:-}"

echo "Running DPGT joint genotyping..."
echo ""

# Some environments need explicit local filesystem implementations for Spark/Hadoop
java \
-Dspark.hadoop.fs.file.impl=org.apache.hadoop.fs.LocalFileSystem \
-Dspark.hadoop.fs.AbstractFileSystem.file.impl=org.apache.hadoop.fs.local.LocalFs \
-cp "$DPGT_JAR" \
org.bgi.flexlab.dpgt.jointcalling.JointCallingSpark \
-i "$INPUT_LIST" \
-r "$REFERENCE_FASTA" \
-o "$OUTPUT_DIR" \
-j "$JOBS" \
-l "$TARGET_REGION" \
--local

echo ""
echo "Run complete."
echo "Output files in: $OUTPUT_DIR"
find "$OUTPUT_DIR" -maxdepth 1 -type f -name "result*.vcf.gz" -print || true
  • Built on Apache Spark for distributed computing
  • Linear scaling with sample count
  • Better for national programs planning billion-scale expansion
  • Requires Spark cluster (can run on same SLURM infrastructure)

Output: a single cohort.vcf.gz with all samples and all variants. For a ~1,000-sample project this is typically 200-500 GB, depending on the diversity of variants and samples.

2.4. Stage 4: Cohort QC and Format Conversion (Hail)

tip

The raw cohort VCF isn't analysis-ready. It needs QC and must be converted to efficient formats. QC thresholds should be tuned to the quality of the ingested data.

Using Hail's MatrixTable:

import hail as hl

# Load cohort VCF into Hail (block-gzipped .vcf.gz needs force_bgz=True)
mt = hl.import_vcf('cohort.vcf.gz', force_bgz=True)

# Sample-level QC: drop samples with almost no non-reference calls
mt = mt.filter_cols(hl.agg.count_where(mt.GT.is_non_ref()) > 100)

# Variant-level QC: drop monomorphic sites
mt = mt.filter_rows(hl.agg.count_where(mt.GT.is_non_ref()) > 0)

# Export to multiple formats for different tools
hl.export_plink(mt, 'cohort')   # PLINK .bed/.bim/.fam for association studies
hl.export_bgen(mt, 'cohort')    # BGEN format (more efficient)
mt.write('cohort.mt')           # Hail-native MatrixTable (best for Hail)

Export formats for different use cases:

  • PLINK (.bed/.bim/.fam): Standard format for GWAS, linkage analysis
  • BGEN: Compact binary format, efficient for large cohorts
  • VDS (Hail format): Native Hail format, fastest for downstream Hail analyses
  • Hail MatrixTable: Hail's native distributed representation for interactive analysis

QC filtering removes:

  • Samples with excessive missing data
  • Samples with unexpected heterozygosity or sex mismatches
  • Variants with low call rates or Hardy-Weinberg violations
  • Variants with extreme allele frequency outliers

2.5. Stage 5: Long-term Data Storage & Access Policies

tip

Data Sovereignty Strategy: Rather than building on-premises storage infrastructure, partner with a domestic S3 provider (cloud or government-backed) for long-term data residency. This keeps data in-country while avoiding large capital investment in storage hardware.

info

Today, access should go through a controlled data platform where users must:

  • Complete and pass training courses; users may only analyse the data on the controlled platform and may not download raw or intermediate data that could re-identify participants
  • Beyond policy, the platform architecture itself should technically enforce these access restrictions

Architecture:

The system architecture consists of:

  • SLURM Cluster (on-premises) with direct S3 access via local network peering
  • S3 Bucket (domestic provider) containing:
    • Raw gVCFs (archived, immutable, 10 years)
    • Cohort VCFs (versioned releases)
    • Cohort MatrixTables (Hail-native)
    • Final analytics formats (PLINK/BGEN/VDS/MT)

Why this partnership model?

  • Data sovereignty: Genomic data never leaves the country. Domestic S3 provider ensures local data residency
  • No capital burden: Pay-as-you-go storage
  • Scalability: Expand from 10TB to 100TB+ without hardware refresh
  • Performance: Direct network access from SLURM cluster (local peering, not internet)
  • Compliance: Meets strict data residency policies and regulatory requirements
  • Integration: Nextflow, Hail, and SLURM all have native S3 support
  • Longevity: Partner contract ensures data availability beyond project timeline (5-10 years)

Long-term value: After the 1k-sample project completes, this S3 infrastructure remains available for future extensions (multi-omics storage, disease cohort data, long-term archival)

3. Design Principles

3.1. Standardization as the Foundation

warning

Without CI/CD integration, "standardization" is just a claim, not a guarantee.

  • No CI/CD = manual testing = inconsistent results across developers/environments
  • With CI/CD = automated testing = identical results every time, on HPC or Cloud

Every sample must go through identical processing—this is non-negotiable for population-scale analyses.

Why standardization matters:

  • Reproducibility: Same input data + same pipeline = same results every time
  • Quality assurance: Systematic issues caught early and applied uniformly across all samples
  • Cross-cohort comparisons: Different batches and timepoints can be compared without systematic bias
  • Joint analyses: Cohort-wide statistics are only valid when all samples processed identically

Implementation:

  • Use Nextflow for pipeline versioning and reproducibility (a reproducibility-check sketch follows this list)
  • gVCF format enforces standardized output across all samples
  • Version control (GitHub, GitLab, Gitea, etc.): track pipeline versions, reference genome versions, and annotation database versions
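
A sketch of the determinism check a CI job could run, assuming the pipeline ships a small test profile and containerized tools (the profile names and revision tag here are assumptions):

# Run the same pinned revision twice on the same test data
nextflow run gianglabs/nf-germline-short-read-variant-calling \
  -r v1.2.3 -profile test,docker --outdir ci_run_1
nextflow run gianglabs/nf-germline-short-read-variant-calling \
  -r v1.2.3 -profile test,docker --outdir ci_run_2

# Identical inputs + pinned revision + pinned containers should give identical variant records
diff <(zcat ci_run_1/*.vcf.gz | grep -v '^##') \
     <(zcat ci_run_2/*.vcf.gz | grep -v '^##')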

Figure 5: CI/CD on nf-modules, showing how common patterns (modules, subworkflows) are kept identical, reproducible, and reusable across multiple pipelines.

Figure 6: CI/CD on nf-germline-short-read-variant-calling, verifying that identical inputs produce the same output files and that the workflow can be deployed on both HPC and cloud.

3.2. Scalability Through Region-Based Parallelization

Joint variant calling works on the entire cohort simultaneously, not sample-by-sample. The key to scaling is splitting the genome into smaller, independent regions (windows) and processing all samples together within each region in parallel.

The Parallelization Strategy:

  1. All samples in each region: For region 1, call variants across ALL samples simultaneously; for region 2, call variants across ALL samples; etc.
  2. Region-based parallelization: Each genomic region runs on a separate SLURM job, enabling horizontal scaling
  3. Failure isolation: If one region fails, you only re-run that region—not the entire genome for all samples

Bed File Strategy (Chromosome-Based with Intervals):

For small-to-medium cohorts (<10k samples): Use simple chromosome-based BED files (one per chromosome):

  • chr1.bed covering chr1:1-248956422
  • chr2.bed covering chr2:1-242193529
  • ... and so on for all 22 autosomes
  • chrX.bed and chrY.bed for sex chromosomes

For large cohorts (>10k samples): Split each chromosome into 100kb intervals for finer-grained parallelization (a BED-generation sketch follows the examples below):

  • chr1_region_1.bed: chr1:1-100000
  • chr1_region_2.bed: chr1:100001-200000
  • chr1_region_3.bed: chr1:200001-300000
  • ... continuing across all chromosomes with thousands of regions
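
A minimal sketch of generating those interval BED files from the reference index, assuming bedtools is available and the .fai index exists (file names and the output naming scheme are illustrative):

FAI=reference/Homo_sapiens_assembly38.fasta.fai

# Genome file (chrom<TAB>length) restricted to the primary chromosomes
awk '$1 ~ /^chr([0-9]+|X|Y)$/ { print $1"\t"$2 }' "$FAI" > genome.txt

# Emit 100 kb windows and write one BED file per interval (chr1_region_1.bed, ...)
bedtools makewindows -g genome.txt -w 100000 \
  | awk '{ n[$1]++; f = $1 "_region_" n[$1] ".bed"; print > f; close(f) }'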

Why this interval-based approach scales:

  • Memory efficiency: Each region holds only ~100kb of genomic data per sample, not entire chromosomes
  • Job parallelization: ~3 Gb of genome split into 100 kb windows gives roughly 30,000 independent intervals, each submitted as its own SLURM job (realistic for 50k+ sample cohorts)
  • Failure recovery: Failed region only requires re-running 100kb interval, not entire chromosome
  • SLURM queue optimization: Small jobs (100kb regions) fit better into queue scheduling than large jobs

Example workflow with GLnexus (chromosome-based for <10k):

# For each chromosome in parallel:
glnexus_cli --config DeepVariant_h37 \
--bed chr1.bed \
sample1.gvcf.gz sample2.gvcf.gz ... sampleN.gvcf.gz \
> cohort_chr1.bcf

# For chr2, chr3, etc. (run in parallel across SLURM cluster)

# Merge all chromosomal BCF files
bcftools concat cohort_chr*.bcf -O z > cohort.vcf.gz

Example for large cohorts (interval-based for >10k):

# For each 100kb interval in parallel:
glnexus_cli --config DeepVariant_h37 \
--bed chr1_region_1.bed \
sample1.gvcf.gz sample2.gvcf.gz ... sample10000.gvcf.gz \
> cohort_chr1_region_1.bcf

# Across all regions simultaneously (SLURM schedules thousands of jobs)

# Final merge (can be done hierarchically: chr1 regions → chr1.bcf, then merge all chr*.bcf)
bcftools concat cohort_chr*_region_*.bcf -O z > cohort.vcf.gz

Why this is fundamentally different from:

  • ❌ Single-sample variant calling: Cannot be used for joint genotyping (loses population information)
  • ❌ Whole-genome per-cohort in one job: No parallelization, no failure recovery, memory explosion at scale
  • ✅ Region-based cohort calling: Scales linearly, fails gracefully, maximizes SLURM parallelization

3.3. Data Storage and Sharing

For data storage, standardize on keeping only the data that is actually needed:

  • Data should be stored and access granted with appropriate permissions
  • BAM/CRAM files should be kept because they can be restored to the original format (FASTQ) while saving storage, and they serve as inputs for many tools that extract additional information from the data (see the restore sketch below)
  • Hail MatrixTable, VDS, PLINK, and BGEN are pre-processed, high-quality, analysis-ready formats
  • Users should not need to re-ingest data or repeat the pre-processing and quality-control steps
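
As a sketch of the CRAM-to-FASTQ restore mentioned above (assuming samtools is installed and the same reference that was used to create the CRAM; file names are illustrative):

# Restore paired FASTQ from an archived CRAM
# (reads are collated first so R1/R2 pairing is preserved)
samtools collate -u -O --reference Homo_sapiens_assembly38.fasta sample_001.cram | \
  samtools fastq -@ 8 \
    -1 sample_001_R1.fastq.gz \
    -2 sample_001_R2.fastq.gz \
    -0 /dev/null -s /dev/null -n -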

Table 1: Pipelines and bioinformatics tools utilized in genomic resources in the national biobanks

Reference: https://link.springer.com/article/10.1186/s44342-025-00040-9

Don't just store raw data; design storage tiers for downstream analysis from the start.

Three-Tier Storage:

Tier | Format | Purpose | Size (10k samples)
Raw | gVCF | Source data, immutable, archival | 2-3 TB
Processed | Cohort VCF | QC baseline, joint-called, version-controlled | 50-100 GB
Analytics | PLINK, BGEN, VDS | Analysis-ready, indexed, compressed | 30-50 GB

Table 2: Three-tier storage architecture for genome project data

Analytics formats enable different use cases:

  • PLINK: GWAS, linkage analysis, PCA (de facto standard in genetics)
  • BGEN: Compact binary format, excellent for large-scale population studies
  • VDS (Hail): Columnar format, best for Hail analytics and machine learning pipelines

3.4. Federated Governance and Multi-Institutional Collaboration

Most national genomic projects involve multiple hospitals, universities, and research centers contributing samples.

Federated Model:

  • Each institution contributes samples independently
  • Central joint genotyping ensures consistency: all samples called together, not separately per institution
  • Shared infrastructure: multiple SLURM clusters and S3 storage sites, with a centralized data analytics platform

Privacy-Preserving Architecture:

  • Raw data: restricted access (only institution that contributed it)
  • QC'd cohort data: institutional researchers + project leadership
  • Aggregated statistics (allele frequencies, QC metrics): publicly shared (no individual-level data)
  • Compute-to-data: Analysis runs on the secure cluster; results exported, not raw data

Role-Based Access Control:

Raw gVCF data
├─ Accessible to: Original institution + data stewards
└─ Access: read-only, audit-logged

Cohort VCF (QC'd)
├─ Accessible to: Project researchers
└─ Access: for analysis with ethics approval

Public allele frequencies
├─ Accessible to: General public
└─ Download from website

This design principle enables nations to collaborate across institutions while maintaining strict data governance and participant privacy.

4. Infrastructure Layer & Implementation Roadmap

Building a national genome project requires careful sequencing of phases. This section outlines both the infrastructure requirements and a realistic implementation timeline.

4.1. Infrastructure Requirements Overview

A national genome project infrastructure stack consists of three core layers:

Layer | Component | Technology | Purpose
Compute | HPC Cluster | SLURM + compute nodes | Variant calling, joint genotyping, analytics
Storage | Object Storage | S3-compatible (on-premises) | gVCFs, VCFs, analytics formats
Analytics | Data Platform | Hail, Python, Spark | Population statistics, GWAS, QC

Table 3: Core infrastructure layers for national genome project

Figure 7: Realistic 6–12 month implementation roadmap showing all five phases. Key insight: Phase 2 (Sample Collection) is the bottleneck (3–6 months). Phase 3 (Sequencing) and Phase 4 (Variant Calling + QC) are rapid (1–2 months each). Infrastructure setup runs in parallel with sample collection.

4.2. Phase 1: Planning & Ethics (2–4 months)

Goals:

  • Secure institutional approvals (IRB/ethics)
  • Design sample recruitment strategy
  • Establish sequencing partnerships
  • Begin infrastructure procurement

Timeline:

  • IRB/ethics approval: 1–2 months
  • Sample recruitment design & partnership agreements: 1–2 months
  • Parallel: Order HPC hardware and establish S3 partnership contracts

Outcomes at End of Phase 1:

  • IRB/ethics approval obtained
  • Recruitment strategy finalized
  • Sequencing capacity confirmed (partner with sequencing center or in-house)
  • HPC hardware ordered (delivery 4–8 weeks)
  • S3 partnership contract signed

4.3. Phase 2: Sample Collection (3–6 months)

Goals:

  • Recruit and enroll 1,000 participants
  • Perform DNA extraction
  • Establish data governance protocols

Timeline:

  • Multi-hospital recruitment: 3–6 months (usually the slowest phase)
  • DNA extraction & quality control: Parallel to recruitment
  • Cohort metadata assembly: Concurrent

Key Challenge: Sample collection is typically the bottleneck. Plan for:

  • Hospital coordination across multiple sites
  • Participant compliance and scheduling
  • DNA quality control (ensure >95% pass-through)
  • Metadata standardization

Outcomes at End of Phase 2:

  • 1,000 samples collected, extracted, and QC'd
  • Metadata database established
  • Samples ready for sequencing

4.4. Phase 3: Sequencing (1–2 months)

Figure 8: Sequencing procedure: Blood samples are collected from participants and genomic DNA is extracted using standard laboratory protocols. The DNA is then prepared for sequencing and processed on high-throughput platforms, generating raw sequencing data in FASTQ format. The raw data is uploaded to object storage such as Amazon S3 or an S3-compatible partner. The HPC system downloads the data from S3, performs alignment and variant calling, and uploads the processed outputs back to S3 for long-term storage and downstream population analysis.

Goals:

  • Sequence all 1,000 samples at 30X WGS depth
  • Deliver FASTQ files to HPC cluster
  • Begin variant calling immediately as data arrives

HPC Cluster Setup (Parallel to Sequencing):

Reference: river-slurm Ansible playbook

Minimum viable cluster:

  • 1 head node (Slurm controller, login node)
  • 4-8 compute nodes (2-socket, 32-64 CPU each)
  • 100TB local SSD for scratch space
  • 1Gbps network connectivity

Optional enhancement (for faster variant calling):

  • 2 GPU nodes with NVIDIA GPU (for DeepVariant acceleration)
    • Reduces per-sample variant calling from 4-6 hours (CPU-only) to 1-2 hours per sample
    • Optional optimization (not required for 1k-sample target)

Using Ansible for Infrastructure Automation:

# Deploy entire SLURM cluster from scratch
ansible-playbook -i inventory.ini cluster_slurm.yml

# Validates:
#   1. All nodes can submit/run jobs
#   2. Job scheduling policies working
#   3. Network connectivity stable

Domestic S3 Storage Partnership (Initiated)

tip
  • Industry-standard storage: Object storage such as Amazon S3, Google Cloud Storage, Azure Blob Storage, or a local S3-compatible provider is widely used across industries, ensuring scalability, compatibility, and long-term sustainability beyond bioinformatics use cases.
  • High availability requires dedicated expertise: Population genomics projects generate hundreds of TB to PB-scale data. Maintaining high availability, throughput, backup, and security requires a dedicated infrastructure team, which increases operational complexity.
  • Reduced operational cost and complexity: Partnering with an S3 provider shifts infrastructure management to experienced teams, reducing maintenance overhead and allowing the project to focus on analysis and scientific outcomes.

Using an external S3 provider is therefore a feasible approach.

Options:

  1. Commercial domestic cloud provider (with local data residency)
  2. Government-backed infrastructure (if available in your country)
  3. Academic cloud partnership (university data center)

Key requirements:

  • Data never leaves the country (sovereignty requirement)
  • On-premise SLURM nodes have direct network access (low latency)
  • Contract negotiated for 5-10 year commitment
  • Scalable beyond this project, with the option to extend storage volume for future needs

Quality Control For Raw Data

After sequencing, the raw FASTQ data should undergo quality control to ensure it is suitable for downstream analysis. This step typically includes checking sequencing quality scores, read length distribution, GC content, adapter contamination, and duplication levels using tools such as FastQC and MultiQC.

Samples that do not meet quality thresholds may require re-sequencing or additional preprocessing such as adapter trimming and filtering. Performing quality control at this stage ensures that the data is reliable and ready for subsequent alignment and variant calling steps.
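
A minimal sketch of this raw-read QC step with FastQC and MultiQC (directory names are illustrative):

mkdir -p qc/fastqc qc/multiqc

# Per-sample read-level QC reports
fastqc -t 16 -o qc/fastqc raw_fastq/*.fastq.gz

# Aggregate all FastQC reports into a single cohort-level summary
multiqc -o qc/multiqc qc/fastqc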

Outcomes at End of Phase 3:

  • All 1,000 samples sequenced (FASTQ files delivered)
  • FASTQ files transferred to HPC cluster and archived in S3
  • Ready for variant calling in Phase 4

4.5. Phase 4: Variant Calling + QC (1–2 months)

Goals:

  • Call variants on all 1,000 samples using DeepVariant
  • Perform joint genotyping with GLnexus
  • Complete cohort QC and finalization

Variant Calling Pipeline (Parallel to Sequencing):

Deploy nf-germline-short-read-variant-calling https://github.com/gianglabs/nf-germline-short-read-variant-calling

Using DeepVariant for high-accuracy variant calling:

  • Per-sample variant calling: 4-6 hours (CPU-only) or 1-2 hours (with GPU acceleration)
  • Chromosome-based parallelization: 22 concurrent jobs

Timeline for 1,000 samples:

  • With 4-8 CPU nodes: ~1-2 weeks for all gVCF generation
  • With optional 2 GPU nodes: ~1 week for all gVCF generation

Joint Genotyping:

GLnexus performs joint genotyping by:

  • Combining all 1,000 gVCFs into a cohort VCF
  • Using chromosome-based processing with 22 parallel jobs
  • Completing in 2–5 days (depending on variant density)
  • Producing a single cohort VCF with all variants

QC + Cohort Assembly:

After joint genotyping, perform quality control checks (a PLINK-based sketch follows this list):

  • Sample-level QC: contamination, depth, relatedness
  • Variant-level QC: allele frequency distribution, Hardy-Weinberg equilibrium
  • Population structure analysis: PCA
  • Prepare final release cohort VCF
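
One way to sketch the relatedness, sample-level, and population-structure checks is with PLINK on the exported cohort; contamination checks typically use sequence-level tools (e.g., VerifyBamID) and are not shown. Thresholds below are illustrative, not project defaults:

# Sample-level QC: call rate, heterozygosity, reported-vs-genetic sex
plink --bfile cohort --missing --het --check-sex --out qc/cohort_sample_qc

# Relatedness: pairwise IBD estimates above ~0.185 suggest 2nd-degree relatives or closer
plink --bfile cohort --genome --min 0.185 --out qc/cohort_relatedness

# Population structure: LD-prune, then PCA
plink --bfile cohort --indep-pairwise 50 5 0.2 --out qc/pruned
plink --bfile cohort --extract qc/pruned.prune.in --pca 20 --out qc/cohort_pca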

Timeline: approximately 1 week for QC and finalization

Total Phase 4 Duration: 1–2 months

  • Variant calling: 1–2 weeks
  • Joint genotyping: 2–5 days
  • QC & finalization: 1 week
  • Combined: 1–2 months total

Outcomes at End of Phase 4:

  • All 1,000 samples processed through variant calling
  • 1,000 high-quality gVCF files generated
  • Cohort VCF finalized (joint genotyping complete)
  • Population-level QC passed
  • Ready for analytics and research

4.6. Phase 5: Operations & Research Enablement (Year 1+)

Goals:

  • Establish steady-state operations
  • Enable population-level research
  • Support clinical diagnostics and precision medicine

Infrastructure Maturity:

  • Maintain 4-8 compute nodes (core infrastructure)
  • Optional 2 GPU nodes for accelerated variant calling

Key Reuse Opportunity: This infrastructure was built for 1k-genome analysis, but can be repurposed:

  • Multi-omics processing (proteomics, metabolomics, RNA-seq)
  • Rare disease interpretation pipelines
  • Clinical genomics workflows
  • Advanced bioinformatics research platform

Capacity at 1k-sample scale:

  • GWAS analyses: minutes to hours
  • Variant annotation pipelines: hours
  • Population statistics: minutes
  • Spare capacity: available for other research

Analytics Platform: Hail MatrixTable (for population analytics):

  • All 1,000 samples + variants in single analysis object
  • Per-sample QC: compute ancestry, relatedness, sample-specific statistics
  • Per-variant QC: allele frequency distribution, population structure
  • Rapid queries: filter by allele frequency, population, phenotype

Research enablement:

  • Export to PLINK for GWAS
  • Export to BGEN for imputation studies
  • Direct analysis via Hail Batch for distributed computing

Data Access & Governance: Domestic S3 Storage (Long-term):

  • Established contract: 5-10 year commitment
  • Data residency: All data stays in-country
  • Pay-as-you-grow model: ~$0.02-0.05/GB/month
  • Direct SLURM access: low-latency queries

Data access policies:

  • Researchers access data via controlled analytics platform
  • Training required before data access
  • Audit trails for all data access
  • No raw data downloads (aggregate statistics only)

Outcomes at End of Phase 5:

  • Production-grade operations established
  • Research projects supported
  • Clinical integration pathways clear
  • Infrastructure reusable for future projects

4.7. Overall Timeline Summary

Total Project Duration: 6–12 months (realistic range accounting for sample collection bottleneck)

Phase | Activity | Duration | Bottleneck
Phase 1 | Planning & Ethics | 2–4 months | IRB approval
Phase 2 | Sample Collection | 3–6 months | Multi-hospital recruitment
Phase 3 | Sequencing | 1–2 months | Sample delivery speed
Phase 4 | Variant Calling + QC | 1–2 months | Computational throughput
Phase 5 | Operations & Research Enablement | Year 1+ | Ongoing

Table 4: Project timeline summary with phase durations and bottlenecks

Key Insight: The critical path is Phase 2 (Sample Collection), which typically takes 3-6 months. Phases 3 and 4 are rapid (1–2 months each). Infrastructure (Phase 1 planning and Phase 3 implementation) runs in parallel with sample collection.


4.8. Critical Success Factors

Technical:

  • Robust variant calling (validated benchmarking)
  • Reliable joint genotyping (tested thoroughly)
  • Automated QC (catches errors before publication)
  • CI/CD integration (ensures reproducibility)

Organizational:

  • Dedicated team (at least 2-3 full-time staff, stable)
  • Clear governance (who decides on data access, pipeline changes)
  • Institutional buy-in (hospital systems, university commitment)
  • Sustainable funding (multi-year commitment, not grant-dependent)

Strategic:

  • Population engagement (transparent about data use)
  • Research partnerships (early collaboration with universities)
  • Clinical integration (demonstrate direct patient benefit)
  • International standards (align with 1000 Genomes, GA4GH, etc.)

4.9. Risk Mitigation & Contingencies

Risk: Pipeline bugs causing systematic errors in 1k samples

  • Mitigation: Comprehensive CI/CD testing; region-based processing with checkpoints; ability to re-run any region

Risk: Data loss or corruption

  • Mitigation: Multi-site backup; immutable gVCFs archived; version control on all code/configs

Risk: Key staff turnover

  • Mitigation: Documentation, runbooks, training; knowledge transfer; open-source code (no vendor lock-in)

Risk: Regulatory changes affecting data sharing

  • Mitigation: Privacy-by-design (local storage, role-based access); clear consent forms; regular legal review

Risk: Funding interruption

  • Mitigation: Demonstrate research value early; publish results; engage stakeholders; seek multi-year commitments

5. Variant Calling & Joint Genotyping

note

Detailed implementation guides for variant calling and joint genotyping are covered in our dedicated blog series:

  • Variant Calling: See our comprehensive series on nf-germline-short-read-variant-calling pipeline
  • Joint Genotyping: Detailed technical walkthroughs for GLnexus and DPGT implementations
  • Benchmarking: Performance comparisons and optimization techniques

This blog focuses on the architecture-level decisions and ecosystem integration rather than pipeline internals. Refer to the dedicated blogs for implementation details, code walkthroughs, and troubleshooting.

6. Cohort Analytics Layer (Hail) — Enabling "Open Work" for Researchers

warning
  • The exported data can support multiple research purposes, including integration with EHR and external datasets.
  • The following sections are for illustration only; practical analyses typically require more extensive effort.

6.1 Why the Analytics Layer Matters

The true value of a national genome project emerges in the "open work" phase—when researchers can freely query your 1k-sample cohort to answer scientific questions. This requires more than just storing raw genotypes. You need:

  1. Multiple export formats (PLINK, BGEN, VDS) so different research tools can work efficiently
  2. Population statistics (allele frequencies, ancestry information, phenotype summaries) accessible to all researchers
  3. Quality-controlled, analysis-ready data that researchers can trust
  4. Fast query infrastructure (MatrixTable, HailQL) for exploratory analysis

The analytics layer is the difference between "we have genome data" and "researchers can use our genome data."

This section shows how to build that infrastructure.

6.2. From Raw VCF to Research-Ready Data

Once you have a cohort VCF with all samples jointly-called, the next critical step is transforming it into analysis-ready formats and populating metadata for QC and filtering.

Why Hail for this layer:

  • Scalable: Processes cohort VCFs with 100k+ samples efficiently
  • Flexible: Export to PLINK, BGEN, VDS, or keep as MatrixTable
  • Production-ready: Used by major biobanks (UK Biobank, gnomAD, etc.)
  • Analytics integration: Built-in for GWAS, burden tests, population statistics

6.3. Hail MatrixTable Workflow

Load cohort VCF:

import hail as hl

# Import cohort VCF into Hail MatrixTable
mt = hl.import_vcf('s3://genome-project/cohort.vcf.gz', force_bgz=True)

# MatrixTable structure:
# - Rows: Variants (SNPs, indels, etc.)
# - Columns: Samples
# - Entries: Genotypes (0/0, 0/1, 1/1, ./.)

Add annotations:

# Variant-level annotations
mt = hl.vep(mt)  # Functional consequence prediction (requires a VEP configuration)
mt = mt.annotate_rows(
    gnomad_af = gnomad[mt.row_key].AF,
    clinvar_sig = clinvar[mt.row_key].significance
)

# Sample-level annotations (sex, ancestry, phenotype)
mt = mt.annotate_cols(
sex = sample_meta[mt.col_key].sex,
ancestry = sample_meta[mt.col_key].ancestry,
case_status = sample_meta[mt.col_key].phenotype
)

QC filtering:

# Sample QC
mt = hl.sample_qc(mt)

# Filter based on QC metrics
mt = mt.filter_cols(
    (mt.sample_qc.n_het > 0) &            # Has heterozygous calls
    (mt.sample_qc.dp_stats.mean > 30) &   # Average depth >30x
    (mt.sample_qc.r_ti_tv > 2.0)          # Ti/Tv ratio in the normal range
)

# Variant QC: drop monomorphic sites
mt = mt.filter_rows(hl.agg.count_where(mt.GT.is_non_ref()) > 0)

Export to analysis-ready formats:

# PLINK format (for GWAS); writes cohort.bed/.bim/.fam
hl.export_plink(mt, 'cohort')

# BGEN format (compact, efficient)
hl.export_bgen(mt, 'cohort')

# Hail-native MatrixTable (best for future Hail analyses)
mt.write('s3://genome-project/cohort.mt', overwrite=True)

6.4. Population-Level Statistics

Calculate allele frequencies:

# Global allele counts and frequencies (call_stats returns AC, AF, AN, homozygote_count)
mt = mt.annotate_rows(stats = hl.agg.call_stats(mt.GT, mt.alleles))
mt = mt.annotate_rows(
    allele_freq = mt.stats.AF[1],   # frequency of the alternate allele
    ac = mt.stats.AC[1],
    an = mt.stats.AN
)

# Ancestry-stratified frequencies (if multi-population)
mt = mt.annotate_rows(
    freq_by_ancestry = hl.agg.group_by(
        mt.ancestry,
        hl.agg.call_stats(mt.GT, mt.alleles).AF[1]
    )
)

Population genetics calculations:

# Hardy-Weinberg equilibrium
mt = mt.annotate_rows(
    hwe = hl.hardy_weinberg_test(
        hl.int32(hl.agg.count_where(mt.GT == hl.call(0, 0))),
        hl.int32(hl.agg.count_where(mt.GT == hl.call(0, 1))),
        hl.int32(hl.agg.count_where(mt.GT == hl.call(1, 1)))
    )
)

# Principal Component Analysis (population structure)
pc_eigenvalues, pc_scores, pc_loadings = hl.pca(
mt.GT.n_alt_alleles(),
k=20,
compute_loadings=True
)

6.5. Variant Annotation Pipeline

Integrate external resources:

# ClinVar pathogenicity (assumes a ClinVar Hail Table keyed by locus/alleles)
clinvar = hl.read_table('gs://hail-datasets/clinvar.ht')
mt = mt.annotate_rows(
    clinvar_sig = clinvar[mt.row_key].significance
)

# gnomAD frequencies (population reference)
gnomad = hl.read_table('gs://hail-datasets/gnomad.ht')
mt = mt.annotate_rows(
    gnomad_ac = gnomad[mt.row_key].AC,
    gnomad_af = gnomad[mt.row_key].AF
)

# CADD scores (variant effect prediction)
cadd = hl.read_table('path/to/cadd.ht')
mt = mt.annotate_rows(
    cadd_score = cadd[mt.row_key].score
)

Create interpretable variant categories:

# Flag likely pathogenic variants
mt = mt.annotate_rows(
is_likely_pathogenic = (
(mt.clinvar_sig == 'pathogenic') |
((mt.cadd_score > 30) & (mt.gnomad_af < 0.01))
),
is_rare = mt.gnomad_af < 0.01,
is_common = mt.gnomad_af > 0.05
)

6.6. Rare Disease Interpretation & Diagnosis

Using cohort as background reference:

# For each rare disease sample, find matching variants in cohort
# Example: Exome analysis with population context

rare_disease_mt = hl.read_matrix_table('rare_disease_samples.vds')

# Annotate with cohort allele frequencies
rare_disease_mt = rare_disease_mt.annotate_rows(
cohort_af = mt.rows()[rare_disease_mt.row_key].allele_freq,
cohort_ac = mt.rows()[rare_disease_mt.row_key].ac
)

# Prioritize rare variants (absent or very rare in cohort)
rare_disease_mt = rare_disease_mt.filter_rows(
(rare_disease_mt.cohort_ac < 5) | # <5 people in cohort
hl.is_missing(rare_disease_mt.cohort_af) # Not in cohort at all
)

6.7. Example Research Projects ("Open Work")

This is where the value of your genome project becomes visible. Once the infrastructure is ready, researchers can immediately begin:

Example 1: Genome-Wide Association Study (GWAS)

# Researcher has 1k individuals, 500 with Type 2 Diabetes, 500 without
# Goal: Find variants associated with diabetes in your population

import hail as hl

# Load the analysis-ready cohort MatrixTable (written in Section 6.3)
mt = hl.read_matrix_table('cohort.mt')

# Regress phenotype on genotype dosage (intercept passed as a covariate)
gwas = hl.linear_regression_rows(
    y=mt.phenotype.diabetes,
    x=mt.GT.n_alt_alleles(),
    covariates=[1.0]
)

# Find genome-wide significant associations
gwas_hits = gwas.filter(gwas.p_value < 5e-8)

# Result: 50-200 genome-wide significant variants unique to your population

Example 2: Rare Variant Burden Test in Specific Genes

# Researcher: "I think inactivating mutations in Gene X cause Disease Y"
# Goal: Test if individuals with mutations in Gene X have higher disease rates

# Filter to rare variants in Gene X
gene_x_variants = mt.filter_rows(
    (mt.gene == "GENE_X") &
    (mt.gnomad_af < 0.01)   # Rare in world population
)

# Per-sample flag: carries at least one rare Gene X variant
carriers = gene_x_variants.annotate_cols(
    is_carrier = hl.agg.any(gene_x_variants.GT.is_non_ref())
)

# Carrier rate among cases (repeat with ~is_case for controls)
cases = carriers.filter_cols(carriers.phenotype.is_case)
case_carrier_rate = cases.aggregate_cols(hl.agg.fraction(cases.is_carrier))

# Perform burden test (logistic regression)
burden_test = hl.logistic_regression_rows(...)

# Result: "Individuals with rare variants in Gene X are 5x more likely to have Disease Y"

Example 3: Population Stratification & Ancestry-Specific Allele Frequencies

# Researcher: "How does allele frequency of SNP rs123 differ by ancestry?"

# PCA on your 1k cohort (calculated in 6.3)
mt = mt.annotate_cols(
ancestry = predicted_ancestry[mt.col_key] # PC-based ancestry labels
)

# Allele frequencies stratified by ancestry
mt = mt.annotate_rows(
    af_by_ancestry = hl.agg.group_by(
        mt.ancestry,
        hl.agg.call_stats(mt.GT, mt.alleles).AF[1]
    )
)

# Result: "SNP rs123 is 10% in European ancestry, 3% in African ancestry"
# Clinical implication: Frequency interpretation depends on patient ancestry

Example 4: Clinical Variant Interpretation

# Clinician: "I have a patient with a rare variant; how common is it?"
# Query your public allele frequency database (built from 6.4)

# Web interface: Enter variant rs123456
# System returns: "This variant found in 2 of 1000 samples (0.2%)"
# "gnomAD says 0.05%, so it's enriched in your population"
# "ClinVar classification: Likely pathogenic"
# "VEP prediction: This affects splicing"

# Result: Clinician has population-specific evidence for clinical interpretation

Example 5: Longitudinal Health Outcomes (if EHR linked)

# Researcher: "Do people with high polygenic risk scores develop heart disease earlier?"
# Requires: 5+ years of follow-up data

# Build PRS from GWAS hits: sum of effect size × allele dosage per sample
# (assumes the per-variant betas from the GWAS have been joined onto the rows as `beta`)
mt_hits = mt.semi_join_rows(gwas_hits)
mt_hits = mt_hits.annotate_cols(
    prs = hl.agg.sum(mt_hits.beta * mt_hits.GT.n_alt_alleles())
)

# Link PRS to time-to-disease from the EHR (Kaplan-Meier curves downstream)
survivors_by_prs = mt_hits.annotate_cols(
    years_to_disease = ehr_data[mt_hits.col_key].years_to_heart_disease
)

# Result: "High PRS group: 30% disease by age 60; Low PRS group: 5%"
# "This enables risk stratification in your healthcare system"

6.8. Monitoring Data Quality & Variant Drift

As years pass, you may want to ensure variant calls remain stable:

# Annual QC check: Re-call a subset of samples with the latest pipeline
mt_revalidation = hl.import_vcf('revalidated_samples.vcf.gz', force_bgz=True)
mt_revalidation = hl.variant_qc(mt_revalidation)
reval_rows = mt_revalidation.rows()

# Restrict the historical cohort to the re-validated variants
mt_historical = mt.filter_rows(hl.is_defined(reval_rows[mt.row_key]))
hist_rows = mt_historical.rows()

# Compare allele frequencies to historical records
# (should be nearly identical; small differences OK, large drift is a problem)
af_concordance = hist_rows.aggregate(
    hl.agg.corr(hist_rows.allele_freq, reval_rows[hist_rows.key].variant_qc.AF[1])
)

# Flag low concordance (suggests pipeline changes or batch effects)
if af_concordance < 0.99:
    print("Variant frequency drift detected; investigate pipeline changes")

7. Storage Architecture & Data Organization

7.1. Three-Tier Storage Strategy

Tier 1: Raw Archive (gVCFs)

s3://genome-project/archive/gvcfs/
├── 2026/01/sample_001.gvcf.gz
├── 2026/01/sample_002.gvcf.gz
...
├── 2026/12/sample_12000.gvcf.gz

Purpose: Immutable source data (5-10 year retention)
Access: Restricted to data stewards
Storage class: Cold storage (infrequent access)

Tier 2: Processed Cohort (VCFs)

s3://genome-project/processed/
├── cohort_2026_q1.vcf.gz (Jan-Mar samples)
├── cohort_2026_q1.vcf.gz.tbi
├── cohort_2026_q2.vcf.gz (Apr-Jun samples)
...

Purpose: QC baseline, joint-called source
Access: Project researchers with approval
Retention: 5+ years
Storage class: Standard (frequent access)

Tier 3: Analytics Ready

s3://genome-project/analytics/
├── cohort_combined.plink.* (PLINK bed/bim/fam)
├── cohort_combined.bgen (BGEN format)
├── cohort_combined.vds/ (Hail VDS)
├── population_stats/ (AF, AC, HWE tables)
└── public_release_v3.0/ (Anonymized for sharing)

Purpose: Analysis-ready, indexed, compressed
Access: Researchers, public (stratified by tier)
Retention: Indefinite for current
Storage class: Standard

7.2. Metadata Organization

Sample metadata (a quick validation sketch follows the column listing):

samples.tsv
col1: sample_id
col2: institution
col3: sex
col4: ancestry
col5: case_status
col6: sequencing_date
col7: pipeline_version
col8: qc_pass (true/false)
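
A small sketch of sanity-checking this manifest before ingestion. The 8-column layout matches the listing above; the file name samples.tsv is assumed:

# Every row should have exactly 8 tab-separated columns
awk -F'\t' 'NF != 8 { printf "line %d has %d columns\n", NR, NF }' samples.tsv

# sample_id must be unique (prints duplicates, if any)
cut -f1 samples.tsv | sort | uniq -d

# How many samples pass QC?
awk -F'\t' '$8 == "true" { pass++ } END { printf "%d samples pass QC\n", pass+0 }' samples.tsv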

Variant metadata (stored in VCF header & annotations):

- Pipeline version (nf-germline-short-read-variant-calling v1.2.3)
- Reference genome (GRCh38)
- Joint genotyping tool (GLnexus v1.5.2)
- Annotation version (VEP version 106)
- Date processed (2026-03-15)

7.3. Data Versioning & Release Strategy

Public data releases:

v1.0 (2026-06): 500 samples, 30M variants
v1.1 (2026-09): 750 samples, 35M variants
v2.0 (2026-12): 1,000 samples, 40M variants (complete 1k cohort)
v2.1 (2027-06): 1,000 samples, 40M variants + updated annotations
v3.0 (2028-06): 1,000 samples, 40M variants + multi-omic layers (if extended work pursued)

Each release:
- Frozen cohort VCF (immutable)
- Allele frequency tables (stratified by ancestry if multi-population)
- QC metrics summary (depth, call rates, Hardy-Weinberg stats)
- Known issues & caveats
- Citation guidance for researchers using the data

8. Example: Deploying Your First National Genome Project

8.1. Minimal Viable Setup (First 6 months)

Infrastructure:

  • 1 SLURM login node + 4 compute nodes
  • 50 TB S3
  • Single Nextflow instance
  • Single Hail cluster (4 nodes)

Team:

  • 1 Infrastructure engineer (SLURM/networking)
  • 1 Bioinformatician (variant calling)
  • 1 Data analyst (Hail/QC)
  • 0.5 Project manager

First cohort:

  • 100-500 samples (pilot)
  • Deep validation (every sample manually reviewed)
  • Establish QC baselines
  • Train team on operations

Deliverable:

  • 500 gVCF files archived
  • 1 cohort VCF validated
  • PLINK/BGEN exports tested
  • Standard operating procedures documented

8.2. Production Operations (Months 10-12 and Beyond)

Infrastructure:

  • 20-30 SLURM compute nodes (sufficient for 1k samples)
  • 50-100 TB S3 storage (with backup)
  • Hail cluster (8-12 nodes) for analytics
  • No Spark cluster needed (GLnexus handles 1k-scale joint genotyping)

Team (Sustainable 2-3 FTE):

  • 1 Infrastructure engineer (0.5 FTE) - maintains cluster
  • 1 Bioinformatician (0.5 FTE) - maintains pipelines
  • 1 Data analyst (1.0 FTE) - QC + research support
  • 0.5 Project lead - coordination

Operational Status at 1k Samples:

  • All 1,000 samples processed and QC'd
  • Automated QC and monitoring in place
  • Cohort VCF finalized and versioned
  • Public allele frequency database live

Feedback Loops & Research Enablement:

  • Researchers querying cohort for GWAS studies
  • Rare disease diagnoses using cohort as reference
  • Research publications emerging from cohort data
  • Clinical lab partnerships validated
  • Participant newsletters sharing discoveries

9. Extended Work & Future Roadmap

This section outlines research directions and infrastructure enhancements that go beyond the core 1k-sample program. These are optional expansions—your 1k-sample project is complete and sustainable without them—but they represent natural next steps if funding and institutional commitment allow.

9.1. Multi-Omics Integration: From DNA to Systems Biology

Rationale: A 1k-sample genome alone provides variant data. Adding complementary molecular layers (proteomics, metabolomics, transcriptomics) enables:

  • Disease-variant-protein associations (genotype → phenotype causality)
  • Biomarker discovery (multi-omic signatures predict clinical outcomes)
  • Drug target prioritization (variants affecting protein levels vs. function)

Implementation pathway:

Year 2-3: Proteomics layer (blood samples, mass spectrometry or arrays)

  • 1k samples × 5,000 proteins per sample
  • Link via SomaLogic, Olink, or mass spec platforms
  • Storage: 50-100 GB additional

Year 3-4: Transcriptomics (RNA-seq from accessible tissues or biobanks)

  • 1k samples × tissue-specific expression
  • Link to variants using eQTL analysis
  • Storage: 1-2 TB additional

Year 4+: Metabolomics, imaging, microbiology (optional)

What this enables:

  • Systems-level GWAS: Find variants controlling protein → disease
  • Biomarker panels: Multi-omic signatures for diagnosis/prognosis
  • Precision medicine: Tailor treatments based on individual multi-omic profile
  • International collaboration: Compare findings with other biobanks (UK Biobank, All of Us, etc.)

9.2. Disease Phenotype Linkage & EHR Integration

Rationale: Genomics alone is incomplete. Linking variants to structured clinical data enables:

  • Rare disease diagnosis (genotype → clinical phenotype matching)
  • GWAS with continuous traits (BMI, lipids, kidney function from lab data)
  • Outcomes research (which variants predict treatment response, adverse events)

Implementation pathway:

  • Year 2: Minimal EHR linkage

    • Structured data: diagnosis codes (ICD-10), lab values, medications
    • Link to 1k samples via secure ID mapping
    • Privacy: All analyses on secure cluster (no data export)
  • Year 3: Advanced phenotyping

    • Rare disease cohort identification using NLP on clinical notes
    • Phenotype-genotype matching for undiagnosed disease cases
    • Integration with rare disease registries
  • Year 4+: Outcomes tracking

    • Follow-up cohort expansion (if participants consent to long-term follow-up)
    • Link to pharmacy data (drug response, adverse events)
    • Link to hospitalization records (complications, readmissions)

What this enables:

  • Rare disease diagnosis: "I have this kid with seizures + developmental delay; what variants explain it?"
  • Clinical validation: Prove that your genome project helps diagnose real patients
  • Publication-ready studies: "Variants in Gene X predict response to Drug Y"
  • Feedback to participants: Share personalized health insights (with appropriate genetic counseling)

Infrastructure cost: +$100-200k capital; +$30-50k annually (mainly for secure data infrastructure + compliance)

9.3. Population-Specific Allele Frequency Database as Public Resource

Current state (Phase 5): Your 1k-sample project publishes summary allele frequencies on a website.

Future enhancement: Build an interactive query platform with:

  • Web portal: "Enter variant rs123456 → See frequency in your population"
  • Comparison: Show gnomAD vs. your population (highlight population-specific variants)
  • Research output: Enable researchers worldwide to use your allele frequencies
  • Impact metric: Track citations and research using your data

Example:

  • gnomAD shows a variant at 0.1% globally
  • Your 1k-sample database shows it at 5% in your population
  • Result: Clinicians using your allele frequency make better diagnosis decisions

Infrastructure cost: Minimal (~$5-20k) for web hosting + database setup; can be outsourced

9.4. Disease-Specific Sub-Cohorts

Rationale: A general population cohort of 1k samples may include ~50-100 individuals with a specific disease (e.g., diabetes, cancer). Enriching and deeply characterizing a disease-specific sub-cohort from your 1k enables focused studies:

Example: Establish a rare disease cohort within your 1k

  • Identify 50-100 individuals with undiagnosed genetic disorders
  • Deeper phenotyping (clinical exams, additional tests)
  • Whole exome/genome re-analysis (more aggressive variant calling)
  • Link to reference databases (OMIM, ClinVar) for diagnosis
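For the re-analysis and ClinVar linkage step, a minimal Hail sketch could look like the following; the MatrixTable path, the sub-cohort sample list, and the `clnsig` field on a pre-built ClinVar sites table are all assumptions for illustration.

```python
import hail as hl

hl.init()

# Cohort MatrixTable restricted to the rare-disease sub-cohort (paths/fields hypothetical)
mt = hl.read_matrix_table('cohort.mt')
subcohort = hl.import_table('rare_disease_samples.tsv', key='s')
mt = mt.filter_cols(hl.is_defined(subcohort[mt.s]))

# ClinVar significance, pre-exported as a sites-only Hail table keyed by locus/alleles
clinvar = hl.read_table('clinvar_sites.ht')
mt = mt.annotate_rows(clnsig=clinvar[mt.row_key].clnsig)

# Keep variants reported as pathogenic or likely pathogenic
mt = mt.filter_rows(
    hl.set({'Pathogenic', 'Likely_pathogenic'}).contains(mt.clnsig)
)

# Candidate carriers: samples with at least one alt allele at a retained variant
carriers = mt.filter_entries(mt.GT.is_non_ref()).entries()
carriers.select('clnsig').show()
```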

Example: Establish a cancer cohort

  • Identify 100-200 cancer patients in your 1k
  • Tumor genomics (if samples available)
  • Germline variant analysis for cancer predisposition
  • Link to cancer outcomes/treatments

What this enables:

  • Rare disease diagnosis: Higher-power variant discovery in affected families
  • Cancer genomics: Germline-somatic variant correlation
  • Publication: "Variants in Gene X cause familial Disease Y in our population"

Infrastructure cost: Minimal (additional phenotyping via clinical partnerships); mostly operational

9.5. Machine Learning & Polygenic Risk Scores (PRS)

Rationale: With 1k genomes, you can build population-specific polygenic risk scores for common diseases:

Year 2+: PRS development

  1. Use GWAS results from your 1k cohort
  2. Identify lead variants associated with traits
  3. Build PRS combining effect sizes (see the sketch after this list)
  4. Validate in external cohorts with linked phenotypes (e.g., All of Us) and compare allele frequencies against gnomAD
  5. Deploy in clinical settings (if predictive)
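The scoring step itself (step 3) is straightforward once per-variant effect sizes exist: sum beta × alt-allele dosage per sample. Below is a minimal Hail sketch; the weights file layout (chrom, pos, ref, alt, beta) and the file paths are hypothetical, and a real pipeline would also handle allele/strand matching, LD-based variant selection, and ancestry-specific calibration.

```python
import hail as hl

hl.init()

# Joint-called cohort MatrixTable (path hypothetical)
mt = hl.read_matrix_table('cohort.mt')

# GWAS lead variants with effect sizes (columns chrom, pos, ref, alt, beta are hypothetical)
weights = hl.import_table('prs_weights.tsv', impute=True)
weights = weights.annotate(
    locus=hl.locus(weights.chrom, weights.pos, reference_genome='GRCh38'),
    alleles=[weights.ref, weights.alt],
)
weights = weights.key_by('locus', 'alleles')

# Keep only variants present in the weights table and attach their effect sizes
mt = mt.annotate_rows(beta=weights[mt.row_key].beta)
mt = mt.filter_rows(hl.is_defined(mt.beta))

# PRS per sample = sum over variants of beta * alt-allele dosage
mt = mt.annotate_cols(prs=hl.agg.sum(mt.beta * mt.GT.n_alt_alleles()))

mt.cols().select('prs').export('prs_scores.tsv')
```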

Examples:

  • Heart disease risk (combine variants in PCSK9, LDL genes, etc.)
  • Type 2 diabetes risk
  • Alzheimer's disease risk
  • Mental health conditions (depression, schizophrenia)

What this enables:

  • Personalized risk prediction: Tell individuals their polygenic risk score
  • Clinical stratification: Identify high-risk individuals for prevention
  • Population health: Predict disease burden in your country

Infrastructure cost: Minimal (~$20-50k); mostly ML engineering


Summary: Extended Work is Optional

These are not requirements for a successful 1k-sample program. They are future opportunities if:

  1. Your initial 1k-sample project succeeds (good QC, research output, clinical impact)
  2. Stakeholders want to expand capabilities
  3. Additional funding becomes available
  4. International partnerships develop

The core program (Phases 1-3) is designed to be complete, sustainable, and impactful without any of these extensions. Choose extensions strategically based on your country's research priorities and healthcare needs.

10. Conclusion

Technical foundation:

  • Standardized variant calling (Nextflow + benchmarking)
  • Scalable joint genotyping (region-based parallelization)
  • Population analytics (Hail MatrixTable)
  • Local-first storage (data sovereignty)

Organizational requirements:

  • Dedicated team (not side projects)
  • Sustainable funding (not grant-dependent)
  • Clear governance (data access policies)
  • Clinical integration (demonstrate value)

Strategic advantages:

  • Population-specific allele frequencies
  • Rare disease diagnosis using cohort context
  • GWAS powered by local variants
  • Precision medicine tailored to your population

The next steps:

  1. Form core team: Recruit infrastructure engineer, bioinformatician, data analyst (2-3 FTE)
  2. Secure funding for infrastructure:
    • Capital (one-time): $150-300k for HPC infrastructure (Phase 1)
    • Annual infrastructure costs: $55-95k (electricity, maintenance, S3 storage)
    • Note: Team compensation should be determined based on local labor market in your country
  3. Partner with domestic S3 provider: Establish long-term data residency contract
  4. Start small: Deploy pilot with 100-500 samples (Months 1-3, Phase 1)
  5. Scale efficiently: Process 1k samples using same infrastructure (Months 4-12, Phase 2)
  6. Enable research: Support ongoing studies and reuse infrastructure for multi-omics (Phase 3+)

The technology is ready. The infrastructure is proven. The question now is: When will your country start building?

Final Thoughts

Vietnam is my home country, and the government has set an ambitious target of 10% GDP growth per year from 2026–2031, alongside a goal of 30% digital economy contribution to GDP by 2030. Achieving these objectives will require foundational national infrastructure, particularly in genomics, healthcare data, EHR, multi-omics, high-performance computing, and large-scale data centers. Early efforts have already emerged from the private sector, such as the VN1K and related initiatives, demonstrating both feasibility and growing momentum.

Building on this, the government could establish a national secure data and compute infrastructure, where hospitals and universities contribute datasets and submit research proposals to access shared resources. Researchers would analyze sensitive healthcare and genomics data within controlled environments, while moderate usage fees help sustain and expand the platform. This centralized approach reduces duplication, accelerates research, strengthens data sovereignty, and supports national precision medicine and digital economy goals.

To achieve this, the following action items offer possible starting points:

1. Build National Secure Data Centers — Government funds centralized, secure infrastructure to store healthcare, genomics, and research data at national scale.

2. Deploy National Compute Platform (HPC/GPU) — Provide shared compute resources for large-scale analysis such as genomics, AI, and medical imaging, reducing duplicated infrastructure.

3. Establish Data Governance & Access Committee — Create a national body to review research proposals, approve data access, and allocate compute resources securely.

4. Enable Data Submission from Hospitals & Universities — Allow institutions to contribute EHR, genomics, and research datasets into the national platform.

5. Implement Secure Analysis Environment — Researchers analyze sensitive data within controlled environments without downloading sensitive datasets.

6. Create Quota-Based Resource Allocation — Allocate compute and storage resources based on approved proposals and project impact.

7. Introduce Cost-Recovery Model — Apply moderate fees for compute, storage, and data access to sustain and expand infrastructure.

8. Launch National Programs on Top (Genomics, EHR, AI) — Use the platform to support population genomics, precision medicine, and healthcare AI initiatives.

tip

If you are interested in my blog, feel free to discuss with me via:

References

info

The repository references for this blog:

Foundational Projects & Research

Relevant Genomic Biobanks Review

Tools & Infrastructure

Workflow Management

Variant Calling

Joint Genotyping

Population Analytics

Storage & Infrastructure

Standards & Interoperability

Genomic Data Standards

  • GA4GH (Global Alliance for Genomics and Health): https://www.ga4gh.org/
    • Beacon API for variant discovery
    • Data Repository Service (DRS) for data access
    • VCF format specifications
  • VCF (Variant Call Format): https://samtools.github.io/hts-specs/
  • gVCF (genomic VCF): Genomic variant call format retaining no-call information

Quality Control Standards

Variant Annotation & Interpretation

Privacy & Data Governance

Clinical Genomics & Disease Interpretation

  • OMIM (Online Mendelian Inheritance in Man): https://omim.org/
  • ClinGen (Clinical Genome Resource): https://clinicalgenome.org/ - Standards for variant interpretation
  • Rare Disease & Undiagnosed Disease Networks: Examples from US, EU, Australia, etc.

Population Genetics & GWAS