8 posts tagged with "reproducibility"

Testing in Bioinformatics: Why Running Code with Input Data Isn't Enough

· 9 min read
Thanh-Giang Tan Nguyen
Founder at G Labs

In Part 2 of our CI/CD series, we showed how to set up automated testing with make test-e2e, which runs your workflow on test data and checks whether it produces results. If the script runs without crashing, you might think "everything works fine." But here's the uncomfortable truth: a pipeline that runs successfully does not necessarily produce correct results.

This post explains why testing goes far beyond "running code and checking it doesn't crash." We'll explore the different types of tests bioinformaticians should care about and show practical examples of how to catch real bugs that simple end-to-end tests would miss.
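To make the gap between "it ran" and "it is correct" concrete, here is a minimal Python sketch. It is not taken from the post itself: the function names and the choice of reverse complement as the example are illustrative assumptions. A smoke test passes for both implementations, while a known-answer test catches the silent bug.

```python
# Hypothetical example: all names below are illustrative, not from the original post.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A/C/G/T/N only)."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def buggy_reverse_complement(seq: str) -> str:
    """Complements each base but forgets to reverse; it still runs without error."""
    return "".join(COMPLEMENT[base] for base in seq.upper())


def smoke_test(func) -> bool:
    """'Did it crash?' check: passes for any function that returns a string."""
    return isinstance(func("ATGC"), str)


def correctness_test(func) -> bool:
    """Known-answer check: the reverse complement of ATGC is GCAT."""
    return func("ATGC") == "GCAT"


if __name__ == "__main__":
    for func in (reverse_complement, buggy_reverse_complement):
        print(func.__name__, "smoke:", smoke_test(func), "correct:", correctness_test(func))
```

The buggy version sails through the smoke test, which is exactly what an end-to-end run that only checks the exit status would report. Only the assertion against a hand-verified expected value exposes the problem.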

Scalable Nextflow Modules: Building a Template with Copier, CI/CD, and nf-test

· 19 min read
Thanh-Giang Tan Nguyen
Founder at G Labs

Creating and maintaining a library of reusable Nextflow modules is a significant challenge for bioinformatics teams. Without a consistent structure, code quality standards, and automated testing, modules quickly become difficult to share, validate, and integrate into pipelines. The nf-modules-template solves this by providing a production-ready template that uses Copier to scaffold new module repositories, GitHub Actions for automated CI/CD workflows, pre-commit hooks for code quality, and nf-test with intelligent sharding for scalable module testing. This post explores how these technologies work together to enable reproducible, maintainable Nextflow module libraries.

Variant Calling (Part 2): From Bash to Nextflow: GATK Best Practice With Nextflow

· 27 min read
Thanh-Giang Tan Nguyen
Founder at G Labs

In Part 1, we built a complete 10-step GATK variant calling pipeline in bash—perfect for academic research and 1-10 samples. But what happens when you need to scale to 100+ samples? This is where Nextflow becomes essential.

📁 Repository: All code from this tutorial is organized in the nf-germline-short-read-variant-calling repository. The structure follows best practices with separate directories for bash (bash-gatk) and Nextflow (nextflow-gatk) implementations.

Docker Out of Docker: Running Interactive Web Applications for Data Analysis

· 10 min read
Thanh-Giang Tan Nguyen
Founder at G Labs

Running interactive web applications like RStudio, JupyterLab, and Code Server in containers is a powerful way to provide reproducible analysis environments. However, users often need to spawn additional containerized tools from within these applications. Docker out of Docker (DooD) elegantly solves this by allowing containers to access the host's Docker daemon. This post explains how to set up DooD for interactive web applications and why it's the right approach for bioinformatics workflows.
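As a rough sketch of the core idea, DooD usually comes down to bind-mounting the host's Docker socket into the application container. The Python helper below only assembles the docker run arguments. The helper name, the image, and the port are assumptions chosen for illustration, not the configuration used in the post.

```python
# Hypothetical helper: builds (but does not execute) a `docker run` command
# that exposes the host's Docker daemon to the container via its Unix socket.
from typing import List

DOCKER_SOCKET = "/var/run/docker.sock"  # default socket path on most Linux hosts


def dood_run_command(image: str, host_port: int, container_port: int,
                     socket_path: str = DOCKER_SOCKET) -> List[str]:
    """Return the argv for running `image` with the host Docker socket mounted."""
    return [
        "docker", "run", "--rm",
        "-p", f"{host_port}:{container_port}",  # publish the web UI port
        "-v", f"{socket_path}:{socket_path}",   # share the host daemon's socket
        image,
    ]


if __name__ == "__main__":
    # Assumed example image and port; substitute whatever your team actually uses.
    print(" ".join(dood_run_command("rocker/rstudio", 8787, 8787)))
```

Two caveats apply to this sketch. The image still needs a Docker CLI installed to talk to the mounted socket, and any process that can reach the socket effectively has root-level control over the host's Docker daemon, so this setup belongs in trusted environments only.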

How to Migrate from In-House Pipelines to Enterprise-Level Workflows: A Proven 3-Step Validation Framework

· 18 min read
Thanh-Giang Tan Nguyen
Founder at G Labs

Whether your lab uses bash scripts, Python workflows, Snakemake pipelines, or custom solutions, your in-house pipeline works fine locally. It has been running for years. But as your research scales, you face a hard truth: in-house pipelines don't scale, aren't reproducible across teams, and require constant manual fixes.

Bioinformatics Workflow Template: Standardizing Python Pipelines with Modular Design

· 13 min read
Thanh-Giang Tan Nguyen
Founder at G Labs

Building reproducible bioinformatics pipelines is hard. Every project starts from scratch with its own testing, CI/CD, and deployment strategy. What if you could clone a template, add your analysis tools, and be ready to go?

This post introduces a standardized bioinformatics workflow template featuring consistent testing, CI/CD, and project structure. Developed from real production experience with bioinfor-wf-template, this template reduces setup time from days to minutes, ensures research reproducibility, and promotes modular, reusable code. It is Python-based and ideal for proof-of-concept projects. Support for more advanced and widely adopted bioinformatics frameworks (such as Snakemake and Nextflow) is planned, applying the same core principles while leveraging their native testing systems.

The Evolution of Version Control - CI/CD in bioinformatics (Part 2)

· 14 min read
Thanh-Giang Tan Nguyen
Founder at G Labs

Welcome to Part 2 of our series on version control in bioinformatics. In Part 1, we introduced Git fundamentals, branching strategies, and collaborative workflows. In this post, we'll dive into how Continuous Integration and Continuous Deployment (CI/CD) can transform your bioinformatics projects. If these concepts are new to you, don't worry—this guide will walk you through managing your bioinformatics repository to ensure your work is easily reproducible on any machine. Whether your server is wiped or you need to spin up a new virtual machine, you'll be able to quickly rerun your pipeline. With CI/CD, every code update can automatically trigger tests on a small dataset to verify everything works before scaling up, ensuring that new changes don't break your results or workflows.

The Evolution of Version Control - Git's Role in Reproducible Bioinformatics (Part 1)

· 13 min read
Thanh-Giang Tan Nguyen
Founder at G Labs

In Part 1 (this post), we explore the history of Git, its integration with GitHub, and basic hands-on tutorials. Part 2 (coming soon) will cover real-world bioinformatics examples and advanced workflows with best practices.

This part focuses on practical applications, including NGS quality control with MultiQC and FastQC.