Q2.12. I want to run a large number of tasks as a bulk job while keeping the number of nodes in use low.
A2.12.
If you want to execute a large number of tasks while keeping the number of nodes in use low, use a script like the following example to limit the number of srun commands running simultaneously.
In this sample, a single-core task is executed 3000 times in total on one node, with up to 128 tasks running concurrently.
#!/bin/bash
#SBATCH -p XXX             # Specify partition
#SBATCH --nodes=1
#SBATCH --ntasks=128
#SBATCH --cpus-per-task=1
TOTAL_TASKS=3000                # Total number of tasks (processes) to execute
EXE="myapp -a -b -c"            # Command that runs the application
SRUN_OPTS="-N 1 -n 1 -c 1 --mem-per-cpu=1840 --exclusive"
CONCURRENT_TASKS=$SLURM_NTASKS  # Number of concurrent executions
WAIT_TIME=10                    # Interval between checks for finished sruns (seconds)
count=0
task_id=()
while [ $count -lt $TOTAL_TASKS ]; do
    for ((i = 0; i < CONCURRENT_TASKS; i++)); do
        # "dummy" marks an unused slot; /proc/dummy never exists, so the slot is free
        pid=${task_id[i]:=dummy}
        if [ ! -d /proc/$pid ]; then
            # Slot i is free: launch srun in the background and record its PID
            srun $SRUN_OPTS $EXE & task_id[i]=$!
            count=$((count + 1))
            [ $count -eq $TOTAL_TASKS ] && break
        fi
    done
    # Sleep while any background srun is still running
    # (checking "jobs -rp" avoids the pitfall of "ps | grep srun" matching grep itself)
    if [ -n "$(jobs -rp)" ]; then
        sleep $WAIT_TIME
    fi
done
wait    # Wait for all remaining tasks to complete
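Depending on site policy and limits, a Slurm job array with a throttle limit (the `%` suffix on `--array`) can achieve a similar effect with far less scripting. A minimal sketch, with the partition name XXX and resource values taken as placeholders from the example above:

```shell
#!/bin/bash
#SBATCH -p XXX                # Specify partition
#SBATCH --array=1-3000%128    # 3000 array tasks, at most 128 running at once
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=1840

# Each array element runs one instance of the application;
# SLURM_ARRAY_TASK_ID (1..3000) distinguishes the elements.
srun -n 1 myapp -a -b -c
```

Note that each array element is scheduled as a separate job, so the elements are not guaranteed to be packed onto a single node; when confining all tasks to one node matters, the srun-based script above is the better fit.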