HPC deployment

AmorphGen is designed for deployment on GPU-enabled HPC clusters via SLURM.

SLURM job script

#!/bin/bash
#SBATCH --job-name=amorphgen
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-task=1
#SBATCH --time=24:00:00
#SBATCH --account=your-account

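module load CUDA/11.8.0
# Batch shells are non-interactive, so initialise conda before activating.
eval "$(conda shell.bash hook)"
conda activate /path/to/your/env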

amorphgen POSCAR --model mace-mpa-0 --device cuda

Resuming timed-out jobs

The --resume flag enables checkpoint detection in both pipeline and batch-quench modes. It scans the work directory for the outputs of completed stages and skips those stages automatically.

Pipeline mode

amorphgen POSCAR \
    --stages 1 4 5 6 7 \
    --config my_config.yaml \
    --work-dir my_run/ \
    --resume

If stages 1 and 4 are already complete, AmorphGen picks up from stage 5 using the stage4_eq.xyz checkpoint. If all stages are done, it exits immediately.
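The resume decision described above can be pictured with a short sketch. This is not AmorphGen's actual implementation: the function name first_pending_stage is hypothetical, and the filename pattern stage{n}_*.xyz is an assumption modelled on the stage4_eq.xyz checkpoint mentioned above.

```python
from pathlib import Path


def first_pending_stage(work_dir, stages, pattern="stage{n}_*.xyz"):
    """Return the first requested stage with no output in work_dir, or None.

    The pattern is an assumed naming scheme, not a documented one.
    """
    work = Path(work_dir)
    for n in stages:
        # Treat a stage as complete only if a matching, non-empty file exists.
        outputs = [p for p in work.glob(pattern.format(n=n)) if p.stat().st_size > 0]
        if not outputs:
            return n
    # Every requested stage already has an output: nothing left to run.
    return None
```

With stage 1 and stage 4 outputs present, calling first_pending_stage("my_run", [1, 4, 5, 6, 7]) returns 5, which mirrors the behaviour described above. A return value of None corresponds to the case where all stages are done.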

Batch quench mode

amorphgen --batch-quench \
    --snapshot-dir snapshots/ \
    --model mace-mpa-0 \
    --device cuda \
    --resume

This skips already-completed structures and continues from where the previous job left off.

Python API

from amorphgen import MeltQuenchPipeline

pipe = MeltQuenchPipeline(
    input_file="POSCAR",
    work_dir="my_run",
    cfg_override={"model": "mace-mpa-0", "device": "cuda"},
)
atoms = pipe.run(stages=[1, 4, 5, 6, 7], resume=True)

Array jobs for batch processing

To run many structures in parallel (e.g. 100 AIRSS structures), use a SLURM array job:

#!/bin/bash
#SBATCH --job-name=MQ_batch
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=4
#SBATCH --mem=32G
#SBATCH --time=12:00:00
#SBATCH --array=1-100

SAMPLE=${SLURM_ARRAY_TASK_ID}

amorphgen "inputs/sample-${SAMPLE}.xyz" \
    --stages 1 4 5 6 7 \
    --config config.yaml \
    --work-dir "results/sample_${SAMPLE}" \
    --resume

Each array task runs on its own GPU. The --resume flag makes resubmission safe: completed samples are skipped automatically.
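Resubmitting the full array works, but finished samples still occupy queue slots before they exit. The sketch below builds a narrower --array specification covering only unfinished samples. It assumes the results/sample_<N> layout from the script above; the helper names and the completion marker passed as done_glob are hypothetical, since the name of the final output file is not documented here.

```python
from pathlib import Path


def pending_indices(results_dir, indices, done_glob):
    """Return indices whose sample_<i> directory has no file matching done_glob."""
    root = Path(results_dir)
    return [i for i in indices
            if not any((root / f"sample_{i}").glob(done_glob))]


def to_array_spec(indices):
    """Compress integers into a SLURM --array spec, e.g. [1, 2, 3, 7] -> '1-3,7'."""
    parts = []
    run_start = prev = None
    for i in sorted(set(indices)):
        if prev is not None and i == prev + 1:
            prev = i  # extend the current consecutive run
            continue
        if prev is not None:
            # Close the previous run before starting a new one.
            parts.append(str(run_start) if run_start == prev else f"{run_start}-{prev}")
        run_start = prev = i
    if prev is not None:
        parts.append(str(run_start) if run_start == prev else f"{run_start}-{prev}")
    return ",".join(parts)


if __name__ == "__main__":
    # "stage7_*.xyz" is an assumed marker for a finished sample.
    print(to_array_spec(pending_indices("results", range(1, 101), "stage7_*.xyz")))
```

The printed string can be passed on the command line, for example sbatch --array="$(python pending.py)" job.sh, where pending.py and job.sh are placeholder file names. Options given on the sbatch command line take precedence over the #SBATCH --array line in the script.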