HPC Submission scripts
Here we provide some example submission scripts for various HPC systems. TIES MD will attempt to automatically write sensible submission scripts for NAMD2 targeting ARCHER 2 and for OpenMM targeting Summit. In general, the user can write their own script for whichever HPC system or cluster they prefer. To aid with writing general scripts, TIES MD exposes three options in the API, called sub_header, pre_run_line and run_line. The strings passed with these options are injected into a general template for a NAMD2 or OpenMM submission. All generated submission scripts are written to the base TIES MD directory as sub.sh. An example of this is provided in the Running section.
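For reference, below is a minimal sketch of how these options could be set from the Python API. Only the three option names come from the text above; the import path, constructor arguments and the example strings (taken from the Summit script later on this page) are assumptions for illustration, so check the Running section and the API documentation for the exact interface.

# minimal sketch only: the import path and constructor arguments below are
# assumptions; consult the Running section / API docs for the exact interface
from TIES_MD.TIES import TIES  # assumed import path

# assumed constructor arguments; exp_name matches the examples on this page
md = TIES(cwd='./ties-l2-l1/com', exp_name='sys_solv')

# scheduler header written verbatim at the top of the generated sub.sh
md.sub_header = """#BSUB -P XXX
#BSUB -W 20
#BSUB -nnodes 1"""

# text placed in front of the run command (e.g. srun/jsrun resource flags)
md.pre_run_line = 'jsrun --smpiargs="off" -n 1 -a 1 -c 1 -g 1 -b packed:1 '

# the command used to launch each simulation (site-specific example string)
md.run_line = 'ties_md --config_file=TIES.cfg --exp_name=sys_solv'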
NAMD
Here is an example of a submission script for a large system (≈100k atoms) running on SuperMUC-NG:
#!/bin/bash
#SBATCH --job-name=LIGPAIR
#SBATCH -o ./%x.%j.out
#SBATCH -e ./%x.%j.err
#SBATCH -D ./
#SBATCH --nodes=130
#SBATCH --tasks-per-node=48
#SBATCH --no-requeue
#SBATCH --export=NONE
#SBATCH --get-user-env
#SBATCH --account=XXX
#SBATCH --partition=general
#SBATCH --time=10:00:00
module load slurm_setup
module load namd/2.14-gcc8-impi
nodes_per_namd=10
cpus_per_namd=480
echo $nodes_per_namd
echo $cpus_per_namd
#change this line to point to your project
ties_dir=/hppfs/work/pn98ve/di67rov/test_TIES/study/prot/ties-l2-l1/com
cd $ties_dir/replica-confs
for stage in {0..3}; do
  for lambda in 0.00 0.05 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.95 1.0; do
    srun -N $nodes_per_namd -n $cpus_per_namd namd2 +replicas 5 --tclmain run$stage-replicas.conf $lambda&
    sleep 1
  done
  wait
done
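The resource numbers in this script are coupled: each of the 13 lambda windows gets its own srun launch of nodes_per_namd nodes, and each launch runs its 5 replicas internally via +replicas 5, so the allocation needs 13 x nodes_per_namd nodes and each NAMD job sees nodes_per_namd x tasks-per-node cores. A small sketch of that bookkeeping (illustrative only, not part of TIES MD):

# bookkeeping for the SuperMUC-NG example above (illustrative only)
lambdas = [0.00, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 1.0]
nodes_per_namd = 10    # nodes handed to each srun/NAMD job
tasks_per_node = 48    # matches #SBATCH --tasks-per-node
cpus_per_namd = nodes_per_namd * tasks_per_node    # 480, as set in the script
total_nodes = len(lambdas) * nodes_per_namd        # 130, matches #SBATCH --nodes
print(cpus_per_namd, total_nodes)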
The opening lines of the SuperMUC-NG script above (the header and NAMD resource settings) could be adapted for a smaller system (≈10k atoms) as follows:
#!/bin/bash
#SBATCH --job-name=LIGPAIR
#SBATCH -o ./%x.%j.out
#SBATCH -e ./%x.%j.err
#SBATCH -D ./
#SBATCH --nodes=13
#SBATCH --tasks-per-node=45
#SBATCH --no-requeue
#SBATCH --export=NONE
#SBATCH --get-user-env
#SBATCH --account=XXX
#SBATCH --partition=micro
#SBATCH --time=10:00:00
module load slurm_setup
module load namd/2.14-gcc8-impi
#--nodes and nodes_per_namd can be scaled up for large simulations
nodes_per_namd=1
cpus_per_namd=45
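With nodes_per_namd=1 and cpus_per_namd=45, the 13 windows need only 13 nodes in total (hence --nodes=13), and the 45 tasks on each node are shared by the 5 replicas of that window's single NAMD job, i.e. 9 cores per replica.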
OpenMM
Here we provide an example of TIES MD running with OpenMM on Summit:
#!/bin/bash
#BSUB -P XXX
#BSUB -W 20
#BSUB -nnodes 1
#BSUB -alloc_flags "gpudefault smt1"
#BSUB -J test
#BSUB -o otest.%J
#BSUB -e etest.%J
cd $LS_SUBCWD
export PATH="/gpfs/alpine/scratch/adw62/chm155/TIES_test/miniconda/bin:$PATH"
export ties_dir="/gpfs/alpine/scratch/adw62/chm155/TIES_test/TIES_MD/TIES_MD/examples/ethane/zero_sum/leg1"
module load cuda/10.1.168
date
jsrun --smpiargs="off" -n 1 -a 1 -c 1 -g 1 -b packed:1 ties_md --config_file=$ties_dir/TIES.cfg --exp_name='sys_solv' --windows_mask=0,1 --rep_id=0 > $ties_dir/0.out&
jsrun --smpiargs="off" -n 1 -a 1 -c 1 -g 1 -b packed:1 ties_md --config_file=$ties_dir/TIES.cfg --exp_name='sys_solv' --windows_mask=1,2 --rep_id=0 > $ties_dir/1.out&
jsrun --smpiargs="off" -n 1 -a 1 -c 1 -g 1 -b packed:1 ties_md --config_file=$ties_dir/TIES.cfg --exp_name='sys_solv' --windows_mask=2,3 --rep_id=0 > $ties_dir/2.out&
jsrun --smpiargs="off" -n 1 -a 1 -c 1 -g 1 -b packed:1 ties_md --config_file=$ties_dir/TIES.cfg --exp_name='sys_solv' --windows_mask=3,4 --rep_id=0 > $ties_dir/3.out&
jsrun --smpiargs="off" -n 1 -a 1 -c 1 -g 1 -b packed:1 ties_md --config_file=$ties_dir/TIES.cfg --exp_name='sys_solv' --windows_mask=4,5 --rep_id=0 > $ties_dir/4.out&
jsrun --smpiargs="off" -n 1 -a 1 -c 1 -g 1 -b packed:1 ties_md --config_file=$ties_dir/TIES.cfg --exp_name='sys_solv' --windows_mask=5,6 --rep_id=0 > $ties_dir/5.out&
wait
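Each jsrun line above claims a single GPU, so the six launches fill the six GPUs of one Summit node; the masks --windows_mask=0,1 through --windows_mask=5,6 each select a single alchemical window, and --rep_id picks the replica. To scale this up, a short generator in the spirit of the ThetaGPU example below can write the jsrun lines. The sketch below reuses only the flags shown above; the per-run output names and the 6-GPUs-per-node assumption are choices made here, not part of TIES MD:

# sketch: generate jsrun lines for many windows/replicas (assumes 6 GPUs per Summit node)
windows = 6      # number of alchemical windows
replicas = 1     # replica simulations per window

# remember to raise #BSUB -nnodes so there are enough GPUs: ceil(windows*replicas/6)
template = ('jsrun --smpiargs="off" -n 1 -a 1 -c 1 -g 1 -b packed:1 '
            'ties_md --config_file=$ties_dir/TIES.cfg --exp_name=\'sys_solv\' '
            '--windows_mask={win},{win_end} --rep_id={rep} '
            '> $ties_dir/{win}_{rep}.out&')

for rep in range(replicas):
    for win in range(windows):
        print(template.format(win=win, win_end=win + 1, rep=rep))
print('wait')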
NAMD 3
Here we provide an example of TIES MD running with NAMD3 on ThetaGPU:
#!/bin/bash
#COBALT -A XXX
#COBALT -t 100
#COBALT -n 2
#COBALT -q full-node
export mpirun="/lus/theta-fs0/software/thetagpu/openmpi-4.0.5/bin/mpirun"
export namd3="/lus/theta-fs0/projects/CompBioAffin/awade/NAMD3/NAMD_3.0alpha9_Linux-x86_64-multicore-CUDA/namd3"
node1=$(sed "1q;d" $COBALT_NODEFILE)
node2=$(sed "2q;d" $COBALT_NODEFILE)
cd /lus/theta-fs0/projects/CompBioAffin/awade/many_reps/mcl1/l18-l39/com/replica-confs
for stage in {0..3}; do
  $mpirun -host $node1 --cpu-set 0 --bind-to core -np 1 $namd3 +devices 0 --tclmain run$stage.conf 0.00 0&
  $mpirun -host $node1 --cpu-set 1 --bind-to core -np 1 $namd3 +devices 1 --tclmain run$stage.conf 0.05 0&
  $mpirun -host $node1 --cpu-set 2 --bind-to core -np 1 $namd3 +devices 2 --tclmain run$stage.conf 0.10 0&
  $mpirun -host $node1 --cpu-set 3 --bind-to core -np 1 $namd3 +devices 3 --tclmain run$stage.conf 0.20 0&
  $mpirun -host $node1 --cpu-set 4 --bind-to core -np 1 $namd3 +devices 4 --tclmain run$stage.conf 0.30 0&
  $mpirun -host $node1 --cpu-set 5 --bind-to core -np 1 $namd3 +devices 5 --tclmain run$stage.conf 0.40 0&
  $mpirun -host $node1 --cpu-set 6 --bind-to core -np 1 $namd3 +devices 6 --tclmain run$stage.conf 0.50 0&
  $mpirun -host $node1 --cpu-set 7 --bind-to core -np 1 $namd3 +devices 7 --tclmain run$stage.conf 0.60 0&
  $mpirun -host $node2 --cpu-set 0 --bind-to core -np 1 $namd3 +devices 0 --tclmain run$stage.conf 0.70 0&
  $mpirun -host $node2 --cpu-set 1 --bind-to core -np 1 $namd3 +devices 1 --tclmain run$stage.conf 0.80 0&
  $mpirun -host $node2 --cpu-set 2 --bind-to core -np 1 $namd3 +devices 2 --tclmain run$stage.conf 0.90 0&
  $mpirun -host $node2 --cpu-set 3 --bind-to core -np 1 $namd3 +devices 3 --tclmain run$stage.conf 0.95 0&
  $mpirun -host $node2 --cpu-set 4 --bind-to core -np 1 $namd3 +devices 4 --tclmain run$stage.conf 1.00 0&
  wait
done
This script runs 13 alchemical windows with only 1 replica simulation per window, and 3 GPUs sit idle on node2; for real-world applications it needs to be scaled up. Currently TIES MD will not attempt to build NAMD3 HPC scripts automatically. For creating general scripts a Python script can be very helpful; the following script would allow us to scale up on ThetaGPU:
import os

if __name__ == "__main__":

    ###OPTIONS###
    #account name
    acc_name = 'XXX'
    #how many nodes we want
    nodes = 9
    #which thermodynamic leg to run (these may have different wall times)
    leg = 'com'
    #where the namd3 binary is installed
    namd3_exe = '/lus/theta-fs0/projects/CompBioAffin/awade/NAMD3/NAMD_3.0alpha9_Linux-x86_64-multicore-CUDA/namd3'
    #############

    cwd = os.getcwd()

    #give com and lig simulations different wall times if needed
    if leg == 'com':
        wall_time = 100
    else:
        wall_time = 60

    with open(os.path.join(cwd, 'thetagpu_{}.sub'.format(leg)), 'w') as f:
        #write the scheduler header
        f.write('#!/bin/bash\n')
        f.write('#COBALT -A {}\n'.format(acc_name))
        f.write('#COBALT -t {}\n'.format(wall_time))
        f.write('#COBALT -n {}\n'.format(nodes))
        f.write('#COBALT -q full-node\n')

        #export the mpirun and namd3 install locations
        f.write('export mpirun="/lus/theta-fs0/software/thetagpu/openmpi-4.0.5/bin/mpirun"\n')
        f.write('export namd3="{}"\n'.format(namd3_exe))

        #write lines that read the node names from the Cobalt node file
        for node in range(nodes):
            f.write('node{0}=$(sed "{1}q;d" $COBALT_NODEFILE)\n'.format(node+1, node+1))

        #move to the TIES directory
        f.write('cd {}\n'.format(os.path.join(cwd, 'replica-confs')))

        #iterate over minimization, NVT eq, NPT eq and production stages
        for stage in ['run0', 'run1', 'run2', 'run3']:
            count = 0
            node = 1
            #iterate over alchemical windows
            for lam in [0.00, 0.05, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 0.95, 1.00]:
                #iterate over replica simulations
                for rep in [0, 1, 2, 3, 4]:
                    #write the run line
                    f.write('$mpirun -host $node{} --cpu-set {} --bind-to core -np 1 $namd3 +devices {} --tclmain {}.conf {:.2f} {}&\n'.format(node, count%8, count%8, stage, lam, rep))
                    #count the GPUs used and move to the next node when all 8 are filled
                    count += 1
                    if count%8 == 0:
                        node += 1
            #make sure all simulations in one stage finish before the next stage starts
            f.write('wait\n')
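Running this generator writes thetagpu_com.sub (or thetagpu_lig.sub for the other leg) into the current working directory; the file can then be submitted to the ThetaGPU queue with Cobalt's qsub, with the exact invocation depending on your site's setup.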