
HPC: Test Ansible Playbook With Molecule – From Manual Vagrant to Automated CI/CD

· 10 min read
Thanh-Giang Tan Nguyen
Founder at G Labs

Testing Ansible playbooks for HPC clusters is challenging. You can manually spin up VMs with Vagrant, but debugging issues across controller and worker nodes takes time. Instead, Molecule provides a repeatable, automated testing framework that validates your playbook configuration before deployment. This post shows how to transition from manual Vagrant testing to Molecule, and then integrate it into GitHub Actions CI/CD.

1. The Testing Challenge: Manual Vagrant vs. Automated Molecule

1.1. Manual Testing with Vagrant (The Old Way)

When you first develop an HPC Ansible playbook—say, for a SLURM cluster—you might test it manually:

# Create local VMs and deploy cluster
bash scripts/setup.sh 24.04 true
vagrant up

# SSH to controller
vagrant ssh controller-01

# Check if services are running
sinfo # SLURM info
scontrol show nodes

# SSH to worker
vagrant ssh worker-01

# Destroy when done
vagrant destroy -f

Problems with this approach:

  • Slow feedback loop (VMs take 5-10 minutes to boot, playbook takes 20+ minutes to run)
  • Manual verification of state (did the service start? is the config correct?)
  • Hard to catch regressions when you modify the playbook
  • Not reproducible across team members or CI/CD systems

1.2. Molecule Testing (The Better Way)

Molecule automates the entire test lifecycle using Docker containers instead of full VMs:

# Single command runs the full test
molecule test

# Automatically:
# 1. Spins up Docker containers (controller-01, worker-01)
# 2. Installs dependencies (ansible-galaxy)
# 3. Converges the playbook (runs Ansible)
# 4. Verifies state (checks config, services, job execution)
# 5. Cleans up containers

Benefits:

  • Fast feedback: roughly 10-15 minutes instead of 30+ minutes
  • Automated verification of critical system state
  • Catches regressions in pull requests before merging
  • Integrates seamlessly with GitHub Actions
  • Reproducible across environments (Docker runs the same everywhere)

2. Setting Up Molecule for HPC Ansible Playbooks

2.1. Molecule Configuration Structure

For the river-slurm project, here's the directory layout:

river-slurm/
├── molecule/
│   └── default/
│       ├── molecule.yml      # Test configuration
│       ├── converge.yml      # Ansible playbook to apply
│       └── verify.yml        # Post-convergence verification
├── roles/
│   ├── slurm-common/         # SLURM setup shared by all nodes
│   ├── slurm-master/         # SLURM controller specific
│   ├── slurm-worker/         # SLURM worker specific
│   └── account/              # User account management
├── Makefile
├── pixi.toml                 # Python/tool dependencies
└── .github/
    └── workflows/
        └── e2e.yaml          # GitHub Actions trigger

2.2. The Molecule Configuration: molecule.yml

This file defines the test infrastructure—which containers to spin up, how to configure them, and the test sequence:

---
driver:
  name: docker

platforms:
  - name: controller-01
    image: geerlingguy/docker-ubuntu2404-ansible:latest
    command: /lib/systemd/systemd
    privileged: true
    volumes:
      - /sys/fs/cgroup:/sys/fs/cgroup:rw
    networks:
      - name: slurm-net
    groups:
      - slurm_master
      - slurm

  - name: worker-01
    image: geerlingguy/docker-ubuntu2404-ansible:latest
    command: /lib/systemd/systemd
    privileged: true
    volumes:
      - /sys/fs/cgroup:/sys/fs/cgroup:rw
    networks:
      - name: slurm-net
    groups:
      - slurm_worker
      - slurm

provisioner:
  name: ansible
  playbooks:
    converge: converge.yml
    verify: verify.yml
  env:
    ANSIBLE_ROLES_PATH: "${MOLECULE_PROJECT_DIRECTORY}/roles"

scenario:
  test_sequence:
    - dependency   # Install Galaxy roles
    - create       # Spin up Docker containers
    - prepare      # Setup (skipped if no playbook)
    - converge     # Run Ansible playbook
    - verify       # Run verification checks
    - destroy      # Clean up containers

Key points:

  • Driver: Uses Docker instead of VMs (fast)
  • Platforms: Defines two nodes (controller and worker) with networking
  • Provisioner: Ansible applies the playbook and runs verification
  • Test sequence: Orchestrates the full lifecycle

2.3. The Convergence Playbook: converge.yml

This playbook runs Ansible roles on the test containers. Notice how it mirrors your production deployment but excludes external dependencies:

---
# Molecule converge — runs the Slurm-relevant plays only.
# Roles that require external infrastructure (Docker daemon, NVIDIA GPU,
# NFS mounts between real VMs) are excluded so the test stays self-contained.

- name: All nodes
  hosts: all
  become: true
  become_user: root
  roles:
    - slurm-common

- name: Controller
  hosts: slurm_master
  become: true
  become_user: root
  roles:
    - slurm-master

- name: Worker
  hosts: slurm_worker
  become: true
  become_user: root
  roles:
    - slurm-worker

- name: All nodes
  hosts: all
  become: true
  become_user: root
  roles:
    - account
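
A natural follow-up question is how to keep a single set of plays working both in production and under Molecule. One way (a sketch only, not how river-slurm is actually structured) is to gate the infrastructure-heavy roles behind a variable that Molecule sets, for example through group_vars in the provisioner's inventory. The `in_molecule` variable here is hypothetical; `nvidia.nvidia_driver` is one of the Galaxy roles the project installs:

```yaml
# Sketch: skip infra-heavy roles when running under Molecule.
# `in_molecule` is a hypothetical variable that Molecule's provisioner
# inventory would set to true; production inventories leave it unset.
- name: Workers (production extras)
  hosts: slurm_worker
  become: true
  roles:
    - role: nvidia.nvidia_driver
      when: not (in_molecule | default(false))
```

With a gate like this, converge.yml could simply reuse the production playbook instead of maintaining a parallel copy.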

2.4. The Verification Playbook: verify.yml

After the playbook converges, this playbook verifies critical system state. It checks:

On the controller:

  • slurm.conf and slurmdbd.conf exist
  • SLURM binaries (slurmctld, slurmdbd) are installed
  • MUNGE key permissions are correct
  • Test user can SSH without a live job (PAM setup)
  • sinfo returns cluster info
  • A test job can be submitted and completes successfully

On the worker:

  • slurm.conf exists
  • slurmd binary is installed
  • MUNGE key is synchronized
  • Test user account exists
  • The job result file was created (proof the job ran on this worker)

Example verification tasks:

# ── Controller verification ────────────────────────────────────────────────
- name: Verify slurm-master (controller-01)
  hosts: slurm_master
  become: true
  tasks:
    - name: Check slurm.conf exists
      stat:
        path: /etc/slurm/slurm.conf
      register: slurm_conf

    - name: Assert slurm.conf is present
      assert:
        that: slurm_conf.stat.exists
        fail_msg: "/etc/slurm/slurm.conf not found"

    - name: Check slurmctld binary exists
      stat:
        path: /usr/local/sbin/slurmctld
      register: slurmctld_bin

    - name: Assert slurmctld binary is installed
      assert:
        that: slurmctld_bin.stat.exists
        fail_msg: "slurmctld binary not found"

    # Submit a test job as testuser
    - name: Write job script
      copy:
        dest: /home/testuser/job.sh
        content: |
          #!/bin/bash
          #SBATCH --job-name=mol-test
          #SBATCH --output=/home/testuser/scratch/slurm-%j.out
          touch /home/testuser/scratch/job_result.txt

    - name: Submit job as testuser
      command: /usr/local/bin/sbatch /home/testuser/job.sh
      become_user: testuser
      register: sbatch_out

    - name: Extract job ID from sbatch output
      set_fact:
        slurm_job_id: "{{ sbatch_out.stdout | regex_search('[0-9]+') }}"

    # Poll until job completes (empty squeue output means the job left the queue)
    - name: Poll squeue until job completes (max 120 s)
      command: /usr/local/bin/squeue --job {{ slurm_job_id }} --noheader
      register: squeue_poll
      until: squeue_poll.stdout | trim == ""
      retries: 24
      delay: 5
      changed_when: false
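
The worker-side checks from the list above can be sketched the same way. This play is a sketch, not the project's actual verify.yml; the result-file path mirrors the job script in the controller example:

```yaml
# ── Worker verification (sketch) ───────────────────────────────────────────
- name: Verify slurm-worker (worker-01)
  hosts: slurm_worker
  become: true
  tasks:
    - name: Check that the test job created its result file
      stat:
        path: /home/testuser/scratch/job_result.txt
      register: job_result

    - name: Assert the job actually ran on this worker
      assert:
        that: job_result.stat.exists
        fail_msg: "job_result.txt not found: the test job did not run on this worker"
```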

3. Running Molecule Locally

3.1. Installation with Pixi

The river-slurm project uses Pixi to manage dependencies reproducibly:

# pixi.toml defines Python, Ansible, Molecule, Docker
[dependencies]
python = "==3.11"

[pypi-dependencies]
ansible = ">=6.7.0"
molecule = ">=6.0.0"
molecule-docker = ">=2.1.0"
docker = ">=7.0.0"

Install and verify:

# Install Python tools via Pixi
pixi install

# Verify Molecule is available
pixi run molecule --version

3.2. Running the Full Test Locally

# Install Ansible Galaxy roles (roles from community)
make install

# Run the full Molecule test
make test

Output excerpt (actual run):

INFO    default ➜ discovery: scenario test matrix: dependency, create, prepare, converge, verify, destroy
INFO default ➜ prerun: Performing prerun with role_name_check=0...
INFO default ➜ dependency: Executing
Starting galaxy role install process
- robertdebock.java (4.2.0) is already installed, skipping.
- gantsign.golang (2.10.2) is already installed, skipping.
- geerlingguy.docker (7.1.0) is already installed, skipping.
- nvidia.nvidia_driver (v2.3.0) is already installed, skipping.
INFO default ➜ dependency: Dependency completed successfully.
INFO default ➜ create: Executing
INFO default ➜ create: Executed: Skipped (Skipping, instances already created.)
INFO default ➜ prepare: Executing
INFO default ➜ prepare: Executed: Missing playbook (Remove from test_sequence to suppress)
INFO default ➜ converge: Executing
INFO default ➜ converge: Sanity checks: 'docker'

PLAY [All nodes] ***************************************************************

TASK [Gathering Facts] *********************************************************
ok: [controller-01]
ok: [worker-01]

TASK [slurm-common : Update apt cache] *****************************************
changed: [worker-01]
changed: [controller-01]

TASK [slurm-common : Install Slurm dependencies] *******************************
ok: [worker-01] => (item=build-essential)
ok: [controller-01] => (item=build-essential)
...
changed: [controller-01] => (item=libpam0g-dev)
changed: [worker-01] => (item=libdbus-1-dev)
changed: [controller-01] => (item=libev-dev)
changed: [worker-01] => (item=libevent-dev)
...

This shows the dependency phase installing Galaxy roles, then converge applying the playbook across both nodes. Each task reports its status (ok, changed, or failed); on a second converge run, tasks that report ok instead of changed confirm the playbook is idempotent.
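
Molecule can also enforce idempotency automatically: the standard `idempotence` step in the test sequence re-runs converge and fails the scenario if any task still reports changed. The project's test_sequence shown earlier doesn't include it, but adding it is a one-line change:

```yaml
scenario:
  test_sequence:
    - dependency
    - create
    - converge
    - idempotence   # Re-runs converge; fails if any task reports "changed"
    - verify
    - destroy
```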


4. Why Molecule Replaces Manual Testing

Molecule's automation and repeatability replace manual Vagrant workflows:

| Aspect          | Manual Vagrant                              | Molecule                                                 |
|-----------------|---------------------------------------------|----------------------------------------------------------|
| Speed           | 30+ min (VM boot + playbook + manual checks) | 10-15 min (Docker containers + automated verification)  |
| Reproducibility | Depends on manual steps                     | Automated, identical across runs                         |
| Verification    | Manual SSH checks                           | Automated assertions in verify.yml                       |
| CI/CD Ready     | Hard to integrate                           | Native GitHub Actions support                            |
| Debugging       | Manual inspection                           | Logs + Ansible output                                    |

5. GitHub Actions CI/CD Integration

5.1. The Workflow File: .github/workflows/e2e.yaml

When you push to main or open a PR, GitHub Actions automatically runs Molecule:

---
name: e2e

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  molecule:
    name: Molecule e2e
    runs-on: ubuntu-24.04

    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Install Ansible Galaxy roles
        run: make install

      - name: Cache Slurm build artifacts
        uses: actions/cache@v4
        with:
          path: /tmp/slurm-build-cache
          key: slurm-build-${{ hashFiles('roles/slurm-common/defaults/main.yml') }}

      - name: Patch molecule-docker boolean bug (ansible-core 2.19 compat)
        run: make fix-molecule-docker

      - name: Run Molecule
        run: make test
        env:
          MOLECULE_EPHEMERAL_DIRECTORY: /tmp/molecule
          DOCKER_HOST: unix:///var/run/docker.sock

5.2. How It Works

  1. Trigger: Push to main or open a PR
  2. Runner: Ubuntu 24.04 GitHub Actions runner (has Docker pre-installed)
  3. Steps:
    • Checkout the repository
    • Install Ansible Galaxy roles (make install)
    • Cache SLURM build artifacts (speeds up subsequent runs)
    • Patch molecule-docker for Ansible 2.19 compatibility
    • Run Molecule test (make test)
  4. Result: Pass/fail shown in the PR status
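
One common refinement, not present in the project's workflow above but worth considering when Molecule runs take 10+ minutes: a `concurrency` block that cancels a superseded run when new commits are pushed to the same branch or PR, so the runner isn't tied up testing stale code:

```yaml
# Optional addition to e2e.yaml: cancel an in-flight run when a newer
# push to the same ref supersedes it.
concurrency:
  group: e2e-${{ github.ref }}
  cancel-in-progress: true
```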

5.3. Why Caching Matters

SLURM is compiled from source, which is slow (~10 minutes). GitHub Actions' cache keeps build artifacts between runs:

- name: Cache Slurm build artifacts
  uses: actions/cache@v4
  with:
    path: /tmp/slurm-build-cache
    key: slurm-build-${{ hashFiles('roles/slurm-common/defaults/main.yml') }}

If the role defaults (SLURM version) haven't changed, the cache hits and saves ~10 minutes per run.
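
For the cache to actually save time, the build role has to look for a previous build before compiling. A hedged sketch of how such a role might consume the cached artifacts (the cache path matches the workflow step above, but the tarball name and `slurm_version` variable are hypothetical, not taken from the river-slurm roles):

```yaml
# Sketch: skip the slow compile when a cached build artifact exists.
- name: Check for cached Slurm build
  stat:
    path: "/tmp/slurm-build-cache/slurm-{{ slurm_version }}.tar.gz"
  register: slurm_cache

- name: Build Slurm from source
  command: make -j2
  args:
    chdir: "/usr/local/src/slurm-{{ slurm_version }}"
  when: not slurm_cache.stat.exists
```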


6. Example: Catching a Playbook Bug

Scenario: You modify slurm-master role to add a new config option but introduce a typo in the template.

Without Molecule:

# You might not catch it until deploying to the real cluster
ansible-playbook -i inventories/hosts.prod.yml playbooks/slurm-cluster/slurm.yml
# ... wait 30 minutes, SSH to controller, check logs ...
# Error: /etc/slurm/slurm.conf has invalid syntax
# Rollback, fix, redeploy again

Time lost: 1 hour

With Molecule:

# Before pushing to main, run locally
make test

# Molecule catches it immediately:
TASK [verify : Check slurm.conf syntax] *****
failed: [controller-01] => {
    "msg": "slurm.conf failed syntax check",
    "stderr": "Error on line 42: invalid token 'SLURM_TYPO'"
}

# Fix the typo, run again
make test
# Now it passes
git push

Time lost: 5 minutes, caught before production


7. From Vagrant to Molecule to CI/CD: The Full Story

Here's how the testing workflow evolved in the river-slurm project:

Stage 1: Manual Vagrant Testing (Slow Feedback)

bash scripts/setup.sh 24.04 true
vagrant up
vagrant ssh controller-01
sinfo
vagrant destroy -f
  • Developer responsibility
  • 30+ minutes per iteration
  • No reproducibility guarantee

Stage 2: Local Molecule Testing (Fast Feedback)

make test
  • One command, fully automated
  • 10-15 minutes
  • Identical to CI/CD

Stage 3: GitHub Actions CI/CD (Gated Merges)

Pull Request opened
        ↓
GitHub Actions triggered
        ↓
Molecule test runs in CI
        ↓
Pass → Approve and merge
Fail → Fix and push again

  • Catches regressions before production
  • Makes team collaboration easier
  • Leaves an audit trail of what was tested

8. Key Takeaways

  1. Molecule replaces manual Vagrant testing — faster, automated, repeatable
  2. Docker containers instead of VMs — 10-15 minutes instead of 30+ minutes
  3. Automated verification — checks config, services, and even job execution
  4. CI/CD integration — GitHub Actions runs Molecule on every PR
  5. Caching speeds up CI — SLURM build artifacts cached between runs
  6. Local = CI — test locally with make test, same thing runs in CI

By adopting Molecule, you catch configuration bugs early, reduce deployment risk, and iterate faster on your HPC infrastructure code.

