Testing in Bioinformatics: Why Running Code with Input Data Isn't Enough
In Part 2 of our CI/CD series, we showed how to set up automated testing with make test-e2e—running your workflow with test data and checking if it produces results. If the script runs without crashing, you might think "everything works fine." But here's the uncomfortable truth: a pipeline that runs successfully doesn't mean it produces correct results.
This post explains why testing goes far beyond "running code and checking it doesn't crash." We'll explore the different types of tests bioinformaticians should care about and show practical examples of how to catch real bugs that simple end-to-end tests would miss.
The Problem: Why "It Runs" Isn't Enough
Let's say you have a quality control pipeline (like the one from Part 2) that:
- Downloads data ✓
- Runs FastQC ✓
- Runs MultiQC ✓
- Finishes without errors ✓
Does this mean everything is correct? Not necessarily. Consider these real-world scenarios:
Scenario 1: Silent Parameter Changes
# Your original script
trimmomatic SE sample.fastq.gz trimmed.fastq.gz SLIDINGWINDOW:4:30
# A colleague accidentally modifies it
trimmomatic SE sample.fastq.gz trimmed.fastq.gz SLIDINGWINDOW:4:10
# The pipeline still runs successfully!
# But now you're keeping lower-quality reads without realizing it
Scenario 2: Data Corruption
# Your QC output looks fine at a glance
# But the MultiQC report may be quietly broken:
# - Missing sample names
# - Incorrect summary statistics
# - Truncated or incomplete HTML output
# A simple "did it run?" test won't catch this
Scenario 3: Edge Case Failures
# Your pipeline works with 100 samples
# But when you try with 1,000 samples:
# - Memory usage explodes
# - Temporary file naming conflicts
# - Processes time out unexpectedly
These aren't hypothetical—they're bugs that happen in real bioinformatics projects. The lesson: we need different types of tests that go beyond just "execution."
1. Types of Tests Bioinformaticians Need
1.1. Four Layers of Testing
Instead of one monolithic end-to-end test, think of testing in layers:
┌────────────────────────────────────────────────┐
│ Layer 4: End-to-End Tests                      │
│ "Does the full pipeline work with real data?"  │
├────────────────────────────────────────────────┤
│ Layer 3: Integration Tests                     │
│ "Do components work together correctly?"       │
├────────────────────────────────────────────────┤
│ Layer 2: Contract Tests (Output Validation)    │
│ "Are the results in the expected format?"      │
├────────────────────────────────────────────────┤
│ Layer 1: Unit Tests (Individual Functions)     │
│ "Does this logic work correctly?"              │
└────────────────────────────────────────────────┘
Let's look at each layer with concrete examples.
1.2. Layer 1: Unit Tests (Individual Functions)
These test small, isolated pieces of code—usually Python functions that parse or validate data.
Real-world example: Validating QC metrics
Imagine your pipeline reads quality control metrics from a JSON file. A naive parser can crash with an unhelpful error on malformed input, or silently pass bad values through:
# parse_qc_metrics.py
import json

def parse_fastqc_json(json_file):
    """Parse a QC summary JSON and return the Q30 rate."""
    with open(json_file) as f:
        data = json.load(f)
    # BUG: if the 'summary' key is missing, this raises a bare KeyError
    # with no hint about which file or field caused it
    q30_rate = data['summary']['q30_rate']
    # BUG: no validation that q30_rate is a number in a sensible range
    return q30_rate
Unit tests catch these bugs early:
# test_parse_qc_metrics.py
import pytest
from parse_qc_metrics import parse_fastqc_json

def test_parse_fastqc_missing_summary(tmp_path):
    """Should raise a clear error when the summary section is missing."""
    json_file = tmp_path / "invalid.json"
    json_file.write_text('{"other": "data"}')
    with pytest.raises(KeyError):
        parse_fastqc_json(json_file)

def test_parse_fastqc_zero_rate(tmp_path):
    """Should handle a zero quality rate."""
    json_file = tmp_path / "zero.json"
    json_file.write_text('{"summary": {"q30_rate": 0}}')
    assert parse_fastqc_json(json_file) == 0  # explicitly 0, not False or None
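Once those tests exist, the fix practically writes itself. Below is one possible defensive version; it assumes the same hypothetical JSON layout as above and still satisfies both tests:
# parse_qc_metrics.py (a defensive version the tests above would drive)
import json

def parse_fastqc_json(json_file):
    """Parse a QC summary JSON and return the Q30 rate, failing loudly on bad input."""
    with open(json_file) as f:
        data = json.load(f)
    try:
        q30_rate = data['summary']['q30_rate']
    except KeyError as e:
        # Re-raise with context so the failure points at the offending file
        raise KeyError(f"{json_file}: missing expected field {e}") from e
    if not isinstance(q30_rate, (int, float)) or not 0 <= q30_rate <= 1:
        raise ValueError(f"{json_file}: q30_rate out of range: {q30_rate!r}")
    return q30_rate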
1.3. Layer 2: Contract Tests (Output Validation)
These verify that tools produce output in the expected format. Your pipeline might run without errors, but produce corrupted output.
Real-world example: Validating BAM file structure
# Layer 2 test: Is the BAM file valid?
test_validate_bam_output() {
    # Run samtools sort
    samtools sort -o sorted.bam input.bam
    # Check: Does the output file exist?
    test -f sorted.bam || exit 1
    # Check: Is it actually a valid BAM file?
    samtools view -H sorted.bam > /dev/null || exit 1
    # Check: Does it contain at least one read?
    read_count=$(samtools view -c sorted.bam)
    test "$read_count" -gt 0 || exit 1
}
Why this matters: a downstream step may still start on a truncated or corrupted BAM, and your analysis then fails silently or produces misleading results.
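If the rest of your tests live in Python, the same contract can be expressed there. Here is a minimal sketch using pysam (a third-party library you would need to install; sorted.bam simply mirrors the example above):
# tests/contract/validate_bam.py (sketch; assumes pysam is available)
import pysam

def validate_bam(bam_path):
    """Fail if the BAM cannot be parsed or contains no reads."""
    # Opening the file already checks that it is a parseable BAM with a header
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        read_count = sum(1 for _ in bam)  # sequential scan, no index required
    assert read_count > 0, f"{bam_path} contains no reads"

if __name__ == "__main__":
    validate_bam("sorted.bam")
    print("BAM contract checks passed")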
1.4. Layer 3: Integration Tests
These verify that multiple components work together—like testing an entire subworkflow (QC → Trimming → Alignment).
# Layer 3 test: Do all steps work together?
test_full_qc_trimming_workflow() {
    # Download test data
    make data
    # Run: FastQC → Trimming → FastQC again
    bash workflow.sh data/sample_1.fastq.gz data/sample_2.fastq.gz result/
    # Validate: all expected outputs exist
    test -f result/trimming_report.html || exit 1
    test -f result/trimmed_1.fastq.gz   || exit 1
    test -f result/trimmed_2.fastq.gz   || exit 1
    # Validate: output files are non-empty
    test -s result/trimmed_1.fastq.gz   || exit 1
    test -s result/trimmed_2.fastq.gz   || exit 1
}
1.5. Layer 4: End-to-End Tests
The "make test-e2e" test from Part 2 is a Layer 4 test. It's important but not sufficient on its own.
2. Real-World Bug Examples (Why Layers 1-3 Matter)
Let me show you bugs that Layer 4 (end-to-end) tests would miss, but other layers would catch:
Bug 1: Silent Parameter Error
# Original script
trimmomatic SE input.fastq.gz trimmed.fastq.gz SLIDINGWINDOW:4:30
# Developer accidentally changes it
trimmomatic SE input.fastq.gz trimmed.fastq.gz SLIDINGWINDOW:4:5
Layer 4 test (end-to-end):
- ✅ Passes: trimming runs without crashing
- ❌ Doesn't notice the weaker quality threshold
Layer 2 test (contract):
- ✅ Would catch this if you validate the quality of the trimmed output
# Contract test: the trimmed reads must still clear the quality bar
mkdir -p qc && fastqc trimmed.fastq.gz -o qc/
unzip -p qc/trimmed_fastqc.zip '*summary.txt' | grep -q $'^PASS\tPer base sequence quality' \
    || { echo "ERROR: Trimmed reads fail the per-base quality check"; exit 1; }
Bug 2: Data Corruption in Intermediate Files
# The trimming tool runs successfully, but writes the wrong number of reads
original_reads=$(zcat input.fastq.gz | awk 'END { print NR/4 }')
trimmed_reads=$(zcat trimmed.fastq.gz | awk 'END { print NR/4 }')
# Invariant: trimming can only keep or drop reads,
# so trimmed_reads must never be greater than original_reads
Layer 4 test (end-to-end):
- ✅ Passes—pipeline finishes without error
- ❌ Doesn't validate read counts
Layer 3 test (integration):
- ✅ Would catch this:
# Integration test: trimming must not produce more reads than it was given
test "$trimmed_reads" -le "$original_reads" || \
    { echo "ERROR: Trimmed file contains more reads than the input"; exit 1; }
Bug 3: Missing Input Data
# Your script doesn't check whether the input file exists
fastqc "$input_file"   # What happens if $input_file doesn't exist?
Layer 4 test (end-to-end):
- ✅ Passes (test data exists)
- ❌ Doesn't test what happens with missing input
Layer 1 test (unit):
- ✅ Catches this:
def test_fastqc_missing_input():
"""Should fail gracefully with missing input."""
with pytest.raises(FileNotFoundError):
run_fastqc("nonexistent.fastq.gz")
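For that test to mean anything, the wrapper has to validate its input up front. Here is a minimal sketch of such a wrapper (run_fastqc is a hypothetical helper in your own codebase, not part of FastQC itself):
# run_qc.py (hypothetical wrapper exercised by the unit test above)
import os
import subprocess

def run_fastqc(fastq_file, out_dir="result"):
    """Run FastQC on one FASTQ file, failing fast if the input is missing."""
    if not os.path.isfile(fastq_file):
        raise FileNotFoundError(f"Input FASTQ not found: {fastq_file}")
    os.makedirs(out_dir, exist_ok=True)
    subprocess.run(["fastqc", fastq_file, "-o", out_dir], check=True)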
3. Practical: Adding Layer Tests to Your Makefile
Remember from Part 2, you had this in your Makefile:
test-e2e: data/GSE110004
	${HOME}/.pixi/bin/pixi run bash main.sh ...
Now let's add validation layers:
# Layer 1: Unit tests (validate Python parsing functions)
test-unit:
	pytest tests/unit/ -v

# Layer 2: Contract tests (validate output formats)
test-contract: data/GSE110004
	bash tests/contract/validate_fastqc_output.sh result/

# Layer 3: Integration tests (validate multi-step workflows)
test-integration: data/GSE110004
	bash tests/integration/test_qc_pipeline.sh

# Layer 4: End-to-end test (run full pipeline)
test-e2e: data/GSE110004
	${HOME}/.pixi/bin/pixi run bash main.sh data/GSE110004/*_1.fastq.gz data/GSE110004/*_2.fastq.gz result

# Run all tests
test-all: test-unit test-contract test-integration test-e2e
	@echo "All tests passed!"
Example: Layer 2 Contract Test
#!/bin/bash
# tests/contract/validate_fastqc_output.sh
RESULT_DIR=$1

echo "Validating FastQC outputs..."

# Check 1: FastQC HTML reports exist
ls "$RESULT_DIR"/*_fastqc.html > /dev/null 2>&1 || {
    echo "ERROR: No FastQC HTML reports found"
    exit 1
}

# Check 2: Required sections exist in each FastQC HTML report
for html_file in "$RESULT_DIR"/*_fastqc.html; do
    # FastQC reports must contain these sections
    grep -q "Per base sequence quality" "$html_file" || {
        echo "ERROR: Missing quality metrics in $html_file"
        exit 1
    }
    grep -q "Per base sequence content" "$html_file" || {
        echo "ERROR: Missing content metrics in $html_file"
        exit 1
    }
done

# Check 3: MultiQC report exists and has content
test -s "$RESULT_DIR/multiqc_report.html" || {
    echo "ERROR: MultiQC report is empty or missing"
    exit 1
}

# Check 4: MultiQC HTML contains the summary table
grep -q "General Statistics" "$RESULT_DIR/multiqc_report.html" || {
    echo "ERROR: MultiQC missing summary statistics"
    exit 1
}

echo "✓ All contract tests passed!"
Example: Layer 3 Integration Test
#!/bin/bash
# tests/integration/test_qc_pipeline.sh
set -e

echo "Testing QC → Trimming → QC pipeline..."

# Step 1: Run FastQC on the raw reads
mkdir -p tmp_qc1 tmp_qc2
fastqc input.fastq.gz -o tmp_qc1/
fails_before=$(unzip -p tmp_qc1/input_fastqc.zip '*summary.txt' | grep -c '^FAIL' || true)

# Step 2: Run trimming
trimmomatic SE input.fastq.gz trimmed.fastq.gz LEADING:3 TRAILING:3

# Step 3: Run FastQC on the trimmed reads
fastqc trimmed.fastq.gz -o tmp_qc2/
fails_after=$(unzip -p tmp_qc2/trimmed_fastqc.zip '*summary.txt' | grep -c '^FAIL' || true)

# Integration check: the second QC should not look worse than the first
# (a behaviour test: does trimming actually improve, or at least not hurt, quality?)
if [ "$fails_after" -le "$fails_before" ]; then
    echo "✓ Integration test passed: QC did not get worse after trimming"
else
    echo "ERROR: More FastQC modules fail after trimming"
    exit 1
fi

rm -rf tmp_qc1 tmp_qc2
4. What About Nextflow Pipelines?
If you're using Nextflow, the idea is the same, but the tests are written with the nf-test framework:
// Layer 2: Contract test for a Nextflow process (nf-test syntax)
test("samtools sort produces valid BAM") {
    when {
        process {
            """
            input[0] = [ [id: 'test'], file('test.bam') ]
            """
        }
    }
    then {
        assert process.success
        // Layer 2: exactly one BAM is emitted and the output is stable across runs
        assert process.out.bam.size() == 1
        assert snapshot(process.out).match()
        // Deeper structural checks (e.g. header validation) can use the nft-bam plugin
    }
}
5. Summary: The Testing Pyramid
                    ▲
                   ╱ ╲
                  ╱   ╲       Layer 4: End-to-End (5%)
                 ╱ E2E ╲      "Did the pipeline finish?"
                ╱───────╲
               ╱         ╲
              ╱ Contract  ╲      Layers 2-3: Contract & Integration (15%)
             ╱ Integration ╲     "Do outputs have the correct format? Do steps fit together?"
            ╱───────────────╲
           ╱                 ╲
          ╱    Unit Tests     ╲      Layer 1: Unit (80%)
         ╱    (Validation)     ╲     "Does this logic work? Handle errors?"
        ╱───────────────────────╲
Each layer catches different types of bugs that the layer above might miss.
Key Insight: A pipeline that "runs successfully" (Layer 4 ✓) might still have:
- Silently corrupted data (caught by Layer 2)
- Parameter errors (caught by Layers 1 and 2)
- Broken intermediate steps (caught by Layer 3)
6. Getting Started
Start simple:
- Add Layer 1 (unit tests):
  pytest tests/unit/ --cov     # if you have Python code
- Add Layer 2 (contract tests):
  bash tests/contract/*.sh     # validate outputs
- Keep your Layer 4 (end-to-end):
  make test-e2e                # from Part 2
- Run them all before pushing:
  make test-all
That's it. You're now testing beyond just "running code."
References
- CI/CD in Bioinformatics (Part 2) — Previous post on automated testing
- pytest Documentation — Python testing framework
- nf-test Documentation — Nextflow testing framework
- GitHub Actions Testing — CI/CD practices
- Testing Best Practices — Software testing insights
Summary: Testing in bioinformatics isn't just about "running code and seeing if it finishes." A pipeline that executes successfully can still contain hidden bugs—corrupted data, wrong parameters, missing edge case handling. By adding Layer 1 (unit), Layer 2 (contract), and Layer 3 (integration) tests alongside your Layer 4 (end-to-end) tests from Part 2, you catch real problems that matter to your biology, not just your computational pipeline.