# HPC deployment

AmorphGen is designed for deployment on GPU-enabled HPC clusters via SLURM.

## SLURM job script

```bash
#!/bin/bash
#SBATCH --job-name=amorphgen
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-task=1
#SBATCH --time=24:00:00
#SBATCH --account=your-account

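# Load the CUDA toolkit and activate the environment that has AmorphGen installed
module load CUDA/11.8.0
# In a non-interactive batch shell, conda may need its shell hook loaded first, e.g.:
#   source "$(conda info --base)/etc/profile.d/conda.sh"
conda activate /path/to/your/env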

amorphgen POSCAR --model mace-mpa-0 --device cuda
```

## Resuming timed-out jobs

The `--resume` flag enables checkpoint detection in both pipeline and batch-quench modes. AmorphGen scans the work directory for completed stage outputs, skips those stages, and continues from the most recent checkpoint.

### Pipeline mode

```bash
amorphgen POSCAR \
    --stages 1 4 5 6 7 \
    --config my_config.yaml \
    --work-dir my_run/ \
    --resume
```

If stages 1 and 4 are already complete, AmorphGen picks up from stage 5 using the `stage4_eq.xyz` checkpoint. If all stages are done, it exits immediately.
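
The skip logic can be pictured with a small sketch. This is not AmorphGen's implementation. It assumes that each stage N leaves a non-empty file matching `stageN_*.xyz` in the work directory (only `stage4_eq.xyz` is named above, so the pattern is a guess), and `completed_stages` and `plan_resume` are hypothetical helper names.

```python
from pathlib import Path


def completed_stages(work_dir, stages):
    """Return the requested stages that already have an output file.

    Assumption: stage N writes a non-empty file matching stageN_*.xyz.
    """
    work = Path(work_dir)
    done = []
    for stage in stages:
        outputs = [
            p for p in work.glob(f"stage{stage}_*.xyz") if p.stat().st_size > 0
        ]
        if outputs:
            done.append(stage)
    return done


def plan_resume(work_dir, stages):
    """Split the requested stages into (skipped, remaining).

    Stages are skipped only while they form an unbroken completed prefix,
    so a missing stage forces every later stage to run again.
    """
    done = set(completed_stages(work_dir, stages))
    skipped = []
    for stage in stages:
        if stage not in done:
            break
        skipped.append(stage)
    remaining = list(stages[len(skipped):])
    return skipped, remaining
```

With outputs for stages 1 and 4 present, `plan_resume("my_run", [1, 4, 5, 6, 7])` returns `([1, 4], [5, 6, 7])`. An empty `remaining` list corresponds to the case where the run exits immediately.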

### Batch quench mode

```bash
amorphgen --batch-quench \
    --snapshot-dir snapshots/ \
    --model mace-mpa-0 \
    --device cuda \
    --resume
```

This skips already-completed structures and continues from where the previous job left off.

### Python API

```python
from amorphgen import MeltQuenchPipeline

pipe = MeltQuenchPipeline(
    input_file="POSCAR",
    work_dir="my_run",
    cfg_override={"model": "mace-mpa-0", "device": "cuda"},
)
atoms = pipe.run(stages=[1, 4, 5, 6, 7], resume=True)
```

## Array jobs for batch processing

To run many structures in parallel (e.g. 100 AIRSS structures), use a SLURM array job with one structure per array task:

```bash
#!/bin/bash
#SBATCH --job-name=MQ_batch
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=4
#SBATCH --mem=32G
#SBATCH --time=12:00:00
#SBATCH --array=1-100

# Each array task processes the input whose index matches its task ID
SAMPLE=${SLURM_ARRAY_TASK_ID}

amorphgen "inputs/sample-${SAMPLE}.xyz" \
    --stages 1 4 5 6 7 \
    --config config.yaml \
    --work-dir "results/sample_${SAMPLE}" \
    --resume
```

Each array task runs on its own GPU. The `--resume` flag makes resubmission safe: samples that have already completed are skipped automatically.
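
Resubmitting the full array is safe, but it can be tidier to request only the tasks that have not finished. The sketch below is a hypothetical helper, not part of AmorphGen. It assumes that a finished sample leaves a file matching `stage7_*.xyz` in `results/sample_<N>/` (the real output names may differ, so adjust `final_pattern`), and it builds a compact index list in the form accepted by `sbatch --array=`.

```python
from pathlib import Path


def compress_indices(indices):
    """Collapse integers into a SLURM-style array spec, e.g. [1, 2, 3, 7] -> '1-3,7'."""
    parts = []
    run_start = prev = None
    for i in sorted(set(indices)):
        if prev is not None and i == prev + 1:
            # Still inside a consecutive run; extend it
            prev = i
            continue
        if run_start is not None:
            # Close the previous run before starting a new one
            parts.append(str(run_start) if run_start == prev else f"{run_start}-{prev}")
        run_start = prev = i
    if run_start is not None:
        parts.append(str(run_start) if run_start == prev else f"{run_start}-{prev}")
    return ",".join(parts)


def unfinished_tasks(results_dir, task_ids, final_pattern="stage7_*.xyz"):
    """Return task IDs whose sample directory lacks the assumed final output."""
    root = Path(results_dir)
    return [
        t for t in task_ids
        if not any((root / f"sample_{t}").glob(final_pattern))
    ]
```

For example, `compress_indices(unfinished_tasks("results", range(1, 101)))` yields a string that can be passed to `sbatch --array=` when resubmitting the job script.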