Scheduling Studies (a.k.a. the Batch Block)
The batch block is an optional component of the workflow specification that enables job submission and management on remote clusters. The block contains a handful of keys for specifying system-level information that applies to all scheduled steps:
Batch block keys
Key | Required? | Type | Description |
---|---|---|---|
type | Yes | str | Scheduler adapter to use. Currently supported: local, slurm, lsf, flux |
shell | No | str | Optional path to the shell to use for execution. Defaults to "/bin/bash" |
bank | Yes | str | Account which runs the job; used for computing job priority on the cluster ('--account' on slurm, '-G' on lsf, ...) |
host | Yes | str | The name of the cluster to execute this study on |
queue | Yes | str | Scheduler queue/machine partition to submit jobs (study steps) to |
nodes | No | int | Number of compute nodes to reserve for jobs; note this is also a per-step key |
reservation | No | str | Optional pre-reserved allocation/partition to submit jobs to |
qos | No | str | Quality of service specification, e.g. run in standby mode to use idle resources when user priority is low or job limits are already reached |
gpus | No | str | Optional reservation of GPU resources for jobs |
procs | No | int | Optional number of tasks in batch allocations; note this is also a per-step key |
flux_uri | Yes* | str | URI of the flux instance to schedule jobs to. *Only required when type is flux. NOTE: relying on environment variables instead is recommended, as URIs are ephemeral. |
version | No | str | Optional version of the flux scheduler, for accommodating API changes |
args | No | dict | Optional additional args to pass to the scheduler; keys are arg names, values are arg values |
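As a concrete illustration, a minimal batch block for a SLURM cluster might look like the following sketch; the bank and host values are hypothetical placeholders for your site's account and cluster names:

```yaml
batch:
    type: slurm
    bank: my_bank        # hypothetical account/bank name for your site
    host: my_cluster     # hypothetical cluster name
    queue: pbatch
```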
The information in this block is used to populate the step-specific batch scripts with the appropriate header comment blocks (e.g. '#SBATCH --partition' for slurm). Additional keys such as step-specific resource requirements (number of nodes, cpus/tasks, gpus, ...) get added here when processing individual steps; see the subsequent sections for scheduler-specific details. Note that job steps will run locally unless at least the nodes or procs key in the step is populated. The keys attached to the study steps are also used to construct the parallel launcher (e.g. srun for SLURM). The following subsections describe the options in the currently supported scheduler types.
LAUNCHER Token
The LAUNCHER token is a special token with two forms, used in your study steps in place of explicit scheduler-specific commands such as srun ... and flux mini run ... .
Legacy style
$(LAUNCHER)
The original style simply reads in the step keys such as nodes and procs (see the scheduler-specific sections for the full list of options). Maestro then combines the step and batch block configuration when writing the step scripts to generate the appropriate parallel launcher invocation for the system, e.g.
```bash
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --partition=pbatch
#SBATCH --account=baasic
#SBATCH --time=00:10:00
#SBATCH --job-name=run-two-apps
#SBATCH --output=run-two-apps.out
#SBATCH --error=run-two-apps.err
#SBATCH --comment "Run two parallel apps"
#SBATCH --exclusive

srun -N 2 -n 72 par_app_1
srun -N 2 -n 72 par_app_2
```
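A step spec along the following lines could produce a script like the one above. This is a sketch: the step keys (nodes, procs, walltime, exclusive) feed the #SBATCH header and srun flags, while the batch block supplies the partition and account; the application names are the placeholders from the example:

```yaml
study:
    - name: run-two-apps
      description: Run two parallel apps
      run:
          cmd: |
              $(LAUNCHER) par_app_1
              $(LAUNCHER) par_app_2
          nodes: 2
          procs: 72
          walltime: "00:10:00"
          exclusive: True
```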
New style
$(LAUNCHER)[<n>n, <p>p]
This updated variant allows more granular control of the launcher token, allocating resources differently on a per-executable/command basis inside a step.

<n>: command-specific number of nodes. Must be less than or equal to the step's nodes setting.
<p>: command-specific number of tasks/procs. Must be less than or equal to the step's procs setting.

Note

You do not need both 'n' and 'p' with this syntax. You can also allocate solely based on tasks (p) or nodes (n).
```bash
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --partition=pbatch
#SBATCH --account=baasic
#SBATCH --time=00:10:00
#SBATCH --job-name=run-two-apps
#SBATCH --output=run-two-apps.out
#SBATCH --error=run-two-apps.err
#SBATCH --comment "Run two parallel apps using different resource configs"
#SBATCH --exclusive

srun -N 1 -n 36 par_app_1
srun -N 2 -n 36 par_app_2
```
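A sketch of a step that could generate a script like the one above: each command carries its own bracketed resource request, while the step-level nodes and procs keys still set the upper bounds (and populate the #SBATCH header). Application names are hypothetical placeholders:

```yaml
study:
    - name: run-two-apps
      description: Run two parallel apps using different resource configs
      run:
          cmd: |
              $(LAUNCHER)[1n, 36p] par_app_1
              $(LAUNCHER)[2n, 36p] par_app_2
          nodes: 2
          procs: 72
          walltime: "00:10:00"
          exclusive: True
```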
LOCAL
The LOCAL scheduler is run by the conductor directly on the machine where it was launched. All tasks are currently run sequentially. There are no batch block arguments for it, and a study specification that doesn't contain a batch block will default to this scheduler.
SLURM
The SLURM scheduler uses the srun command to launch and allocate resources to tasks. Maestro currently supports the following subset of srun arguments:
SLURM (srun) | Maestro | Block | Description | Default |
---|---|---|---|---|
-n | procs | step, batch | Number of MPI tasks to allocate for the launched application | 1 |
-N | nodes | step, batch | Number of nodes to allocate for the launched application | 1 |
-c | cores per task | step | Number of physical CPU cores per task | 1 |
-t, --time | walltime | step | Limit on total run time of the job | N/A: machine/system dependent |
--exclusive | exclusive | step | Grant the job allocation exclusive use of resources. Useful for running on processor-scheduled machines. NOTE: this behavior depends on system config | False |
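A minimal sketch of a SLURM step using these keys (the application name is a hypothetical placeholder):

```yaml
study:
    - name: run-my-mpi-app
      description: Launch an MPI application with srun
      run:
          cmd: |
              $(LAUNCHER) my_mpi_application
          procs: 36
          nodes: 1
          cores per task: 1
          walltime: "00:30:00"
```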
Flux
The Flux scheduler uses the flux run command to launch and allocate resources to tasks. For adapter versions < 0.49.0 this is actually the flux mini run command, which was recently deprecated. Maestro provides keys for a subset of arguments to this command, along with hooks for passing a comma-separated list of additional arguments:
Flux | Maestro | Description | Default |
---|---|---|---|
-n | procs | Number of MPI tasks to allocate for the launched application | 1 |
-N | nodes | Number of nodes to allocate for the launched application | 1 |
-c | cores per task | Number of physical CPU cores per task | 1 |
-g | gpus | Number of GPUs to allocate per task | 0 |
-o | | Comma-separated list of additional args | None |
The Flux adapter also supports some keys that control batch job behavior instead of getting passed to the flux mini run or flux run commands:
Maestro | Description | Default |
---|---|---|
nested | Flag controlling whether to run the step inside a nested flux instance. This is usually the desired option. | True |
waitable | Whether to mark a job as 'waitable'. This is restricted to owners of an instance, and thus cannot be used when scheduling to a system instance (i.e. not to a broker with a specific URI). Note: this option is likely only of interest if using the script adapters directly to build a custom tool. New flag as of the 0.49.0 adapter. Let us know via GitHub issues if you find a need/use for this in the spec. | False |
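A sketch of a batch block targeting flux; bank and host are hypothetical placeholders, and per the note in the batch block keys it may be preferable to let flux discover the instance via environment variables rather than pinning flux_uri in the spec:

```yaml
batch:
    type: flux
    bank: my_bank       # hypothetical account/bank name for your site
    host: my_cluster    # hypothetical cluster name
    queue: pbatch
    version: "0.49.0"   # optional: accommodate adapter API differences
```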
See the flux framework for more information on flux. Additionally, check out the flux-how-to-guides for the options available for using flux with Maestro. Also check out a full example spec run with flux.
Danger
The Flux scheduler itself and Maestro's flux adapter are still in a state of flux and may go through breaking changes more frequently than the Slurm and LSF scheduler adapters.
LSF: a Tale of Two Launchers
The LSF scheduler has multiple options for parallel launcher commands. Maestro currently supports only the jsrun version, which differs from slurm via a more flexible specification of the resources available for each task. In addition to the procs, cores per task, and gpus keys, there are also tasks per rs and rs per node. jsrun describes things in terms of resource sets, with several keywords controlling these resource sets and mapping them to the actual machine/node allocations:
Mapping of LSF args to Maestro step keys
LSF (jsrun) | Maestro | Description | Default |
---|---|---|---|
-n, --nrs | procs | Number of resource sets | 1 |
-a, --tasks_per_rs | tasks per rs | Number of MPI tasks (ranks) in a resource set | 1 |
-c, --cpu_per_rs | cores per task | Number of physical CPU cores in a resource set | 1 |
-g, --gpu_per_rs | gpus | Number of GPUs per resource set | 0 |
-b, --bind | bind | Controls binding of tasks in a resource set | rs |
-B, --bind_gpus | bind gpus | Controls binding of tasks to GPUs in a resource set | none |
-r, --rs_per_host | rs per node | Number of resource sets per node | 1 |
Warning
bind_gpus is new in LSF 10.1 and may not be available on all systems
Examples
Now for a few examples of how to map these to Maestro's resource specifications. Note that the nodes key is not directly used for any of these jsrun arguments, but is still used for the reservation itself. The rest of the keys serve to control the per-task resources and the per-node packing of resource sets. Consider a few examples run on the LLNL Sierra architecture, which has 44 cores and 4 GPUs per node:
Multiple tasks with single cpu and gpu per task
1 resource set per GPU on a cluster with 4 GPUs per node, with an application requesting 8 GPUs. This will consume 2 full nodes of the cluster, with 1 MPI rank associated with each GPU and each rank having 1 CPU.
And the corresponding Maestro step that generates it:
```yaml
study:
    - name: run-my-app
      description: launch the best gpu application.
      run:
          cmd: |
              $(LAUNCHER) my_awesome_gpu_application
          procs: 8
          nodes: 2
          gpus: 1
          rs per node: 4
          tasks per rs: 1
          cores per task: 1
```
Note that procs here maps more to the tasks/resource-set concept in lsf/jsrun, and nodes is a multiplier on rs per node, which yields the nrs jsrun key.
Multiple tasks with single cpu and no gpus per task
1 resource set per cpu, with no gpus, and using all 44 cpus on the node
```yaml
study:
    - name: run-my-app
      description: launch a pure mpi-cpu application.
      run:
          cmd: |
              $(LAUNCHER) my_awesome_mpi_cpu_application
          procs: 44
          nodes: 1
          gpus: 0
          rs per node: 44
          tasks per rs: 1
          cores per task: 1
```
Again, note that procs is a multiple of rs per node.
Multiple multithreaded mpi ranks/tasks per node, with no gpus
```yaml
study:
    - name: run-my-app
      description: launch an application using mpi and omp
      run:
          cmd: |
              $(LAUNCHER) my_awesome_omp_mpi_cpu_application
          procs: 4
          nodes: 1
          gpus: 0
          rs per node: 4
          tasks per rs: 1
          cores per task: 11
```
Multiple multithreaded mpi ranks/tasks per node with one gpu per rank, spanning multiple nodes
```yaml
study:
    - name: run-my-app
      description: Use all the threads!
      run:
          cmd: |
              $(LAUNCHER) my_awesome_all_the_threads_application
          procs: 8
          nodes: 2
          gpus: 1
          rs per node: 4
          tasks per rs: 1
          cores per task: 11
```