
Using Maestro with Flux


Flux is unique among the scheduler adapters in that it can be used both in standalone batch job mode, just like Slurm and LSF, and in node/allocation packing mode. The packing mode works regardless of the scheduler used to launch Flux:

  • Schedule to Flux batch jobs, which have nested brokers
  • Startup Flux brokers/instances inside of LSF and Slurm allocations and submit to those

A little setup is needed before testing out these different modes.

Installation

Running with Flux requires a few additional installation steps because Maestro uses Flux's Python interface.

Pip installation of bindings (for Flux > 0.45.0)

Recommended Option

Installing the bindings via pip is the preferred option for the newest versions of Flux.

Assuming Flux is installed on your system and you have a virtualenv active to install into:

$ flux -V
commands:           0.50.0
libflux-core:       0.50.0
libflux-security:   0.9.0
build-options:      +ascii-only+systemd+hwloc==2.8.0+zmq==4.3.4

$ pip install "flux-python==0.50.0"

The full list of available versions can be found on PyPI, one per Flux version. Note that for some versions you may need to use one of the release candidates (rc).
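The version-matching step above can be scripted. A minimal sketch, assuming `flux` is on your PATH and that the binding version tracks the `commands` version of `flux -V` exactly (fall back to an rc tag if the exact version is missing on PyPI):

```shell
# Sketch: pin flux-python to the installed Flux version.
# Parses the "commands:" line of `flux -V` output.
flux_version=$(flux -V | awk '/^commands:/ {print $2}')
pip install "flux-python==${flux_version}"
```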

Spack environment

Recommended Option

This option may be of interest if you are running on a system where Flux is not the native scheduler and/or is not publicly installed. Check out the Spack tutorials for building an environment that installs Flux and Python, and then install Maestro into that environment.
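A minimal sketch of such an environment file, assuming a standard Spack install (package names and the Python version here are illustrative; check `spack info flux-core` for current versions and variants):

```yaml
# spack.yaml -- hypothetical environment installing Flux and Python
spack:
  specs:
    - flux-core
    - python
    - py-pip
  view: true
```

With this file in place, `spack env activate .` followed by `spack install` builds the environment, after which Maestro can be pip-installed into it.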

Manual linking to existing/system Flux install

Not Recommended

This option should be a last resort if the other two options don't work.

This option requires a few more steps. First, get the Python path from Flux and append it to yours (after activating your virtualenv):

$ flux env
export FLUX_PMI_LIBRARY_PATH="/usr/lib64/flux/libpmi.so"
export LUA_PATH="/usr/share/lua/5.3/?.lua;;;"
export FLUX_EXEC_PATH="/usr/libexec/flux/cmd"
export PYTHONPATH="/usr/lib64/flux/python3.6"
export LUA_CPATH="/usr/lib64/lua/5.3/?.so;;;"
export FLUX_CONNECTOR_PATH="/usr/lib64/flux/connectors"
export MANPATH="<some giant list of paths..>"
export FLUX_MODULE_PATH="/usr/lib64/flux/modules"

$ export PYTHONPATH=$PYTHONPATH:/usr/lib64/flux/python3.6

Alternatively, use awk to update $PYTHONPATH automatically (or wrap it in a bash/zsh/etc function so you can update or remove it later):

export PYTHONPATH=$PYTHONPATH:`flux env | awk -F "[= ]" '{if ($2 == "PYTHONPATH") {env=$3; split(env, p, "\""); print p[2]}}'`
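A sketch of such helper functions, assuming `flux env` emits an `export PYTHONPATH="…"` line in the form shown above (function names are illustrative):

```shell
# Sketch: pull Flux's PYTHONPATH entry out of `flux env` output.
flux_pythonpath() {
    flux env | awk -F '"' '/^export PYTHONPATH=/ {print $2; exit}'
}

# Append it to the current PYTHONPATH (handles an initially empty PYTHONPATH).
flux_pythonpath_add() {
    export PYTHONPATH="${PYTHONPATH:+$PYTHONPATH:}$(flux_pythonpath)"
}
```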

Then install the two required Python packages into your environment:

$ pip install cffi pyyaml

Then you can install Maestro and start scheduling jobs to a Flux instance.

Running with Flux


As mentioned above, the Flux adapter differs from the SLURM and LSF adapters in that it also enables allocation packing, where you may be running a Flux instance inside of a SLURM/LSF allocation. The adapter can take an optional uri to specify a particular Flux instance to schedule to; in its absence, Maestro assumes it is talking to a system-level broker and submits standalone batch jobs just as with SLURM and LSF.

Adapter version

The Flux adapter has an optional version-switching mechanism to accommodate the variety of installs and the more rapid behavior changes of this pre-1.0 scheduler. The default behavior is to use the latest adapter version. This can be overridden using the version key in the batch block, choosing one of the available options via the selection mechanism added in Maestro v1.1.9dev1:

Adapter Version    Flux Version
0.17.0             >= 0.17.0
0.18.0             >= 0.18.0
0.26.0             >= 0.26.0
0.49.0             >= 0.49.0

Note

Maestro's adapter versions are not pinned to exact Flux versions. The adapter version lags behind the Flux core version until breaking changes are introduced by Flux core.
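For example, pinning the adapter version in a study spec's batch block might look like the following sketch (the host, bank, and queue values are placeholders):

```yaml
batch:
    type    : flux
    host    : quartz    # placeholder
    bank    : guests    # placeholder
    queue   : pdebug    # placeholder
    version : 0.49.0    # pin the Flux adapter version; omit to use the latest
```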

Standalone batch jobs

If neither the flux_uri key in the batch block is populated nor the FLUX_URI environment variable is set, Maestro assumes you are scheduling to a system-level instance, i.e. a machine managed natively by Flux. This behaves the same as the SLURM and LSF adapters.

Allocation packing mode

When the FLUX_URI environment variable is set, Maestro submits jobs to that specific Flux broker, which can be either a nested instance inside a batch job on a Flux-managed machine (the uri maps to a Flux jobid) or a Flux broker that the user started inside of a SLURM or LSF allocation. There are two ways to set this up:

Launch Maestro inside the batch job/Flux broker

When you are inside a Flux batch job, or start a Flux broker inside of a SLURM or LSF allocation, Flux automatically exports FLUX_URI. In this case you can simply execute maestro run <specification> inside of that broker/allocation; Maestro reads the environment variable and submits all jobs to that broker. The primary concern is that you may need to account for Maestro's conductor process consuming resources on one of the cores, depending on how often conductor wakes and how resource-intensive your processes are.
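A sketch of this pattern on a Flux-managed machine (the script name, study spec, and node count are placeholders):

```shell
# Sketch: wrap `maestro run` in a script and submit it as a Flux batch job.
# Inside the job, Flux exports FLUX_URI, so Maestro talks to the nested broker.
cat > run_study.sh <<'EOF'
#!/bin/sh
maestro run study.yaml
EOF
chmod +x run_study.sh
flux batch -N 1 ./run_study.sh
```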

Launch Maestro external to the batch job/Flux broker

Recommended Option

On HPC clusters this often means running Maestro on the login node, but it can be any machine with ssh access to the node/allocation in which the Flux broker is running. This option has two benefits: Maestro's conductor process does not consume allocation resources, and an allocation terminating early does not interrupt conductor's management of the study and leave it in an error state. There are multiple recipes for this, which vary in complexity based on machine configuration and Flux version. See the Flux docs for more thorough discussions of this process on non-Flux-native machines such as Slurm and LSF.

  • Current versions of Flux (>= ~0.40)

    This process is much easier in the newest Flux versions, which natively support resolving nested uris on both SLURM and LSF; see the Flux documentation linked earlier for more discussion.

    On LSF, the uri can be resolved using the system-native job id of the batch job in which the Flux broker was launched:

    $ flux uri --remote lsf:<lsf jobid>
    ssh://<job hostname>/var/tmp/flux-<hash>/local-0
    

    The full recipe of updating the FLUX_URI environment variable and running a study in that broker:

    $ export FLUX_URI=`flux uri --remote lsf:<lsf_jobid>`
    
    $ maestro run <study_specification> [run opts]
    

    Similarly on Slurm, the uri can be resolved using the system-native job id of the batch job in which the Flux broker was launched:

    $ flux uri --remote slurm:<slurm jobid>
    ssh://<job hostname>/var/tmp/flux-<hash>/local-0
    

    The full recipe of updating the FLUX_URI environment variable and running a study in that broker:

    $ export FLUX_URI=`flux uri --remote slurm:<slurm_jobid>`
    
    $ maestro run <study_specification> [run opts]
    
  • Older versions of Flux (< ~0.40)

    For older versions of Flux, which lack the lsf/slurm jobid proxy helpers, there is a recipe you can bake into the batch job that launches Flux to expose the broker's uri to processes outside of the allocation via ssh.

    The address of the broker can be constructed and written to a file via the following recipe, using Flux's getattr command to query the broker/instance once it has started:

    flux_address.sh
    #!/bin/sh
    echo "ssh://$(hostname)$(flux getattr rundir)/local-0" | tee flux_address.txt
    sleep inf
    

    To drop this file, submit a batch job or get an interactive allocation that starts up the Flux broker and runs the above script:

    First, spin up a Flux instance.

    On LSF, either use bsub or get an interactive allocation via lalloc as shown here:

    $ lalloc 1 -W 60 -q pdebug -G guests
    
    $ jsrun -a 1 -c 40 -g 0 -n 1 --bind=none flux start ./flux_address.sh
    

    On Slurm, either use sbatch or get an interactive allocation via salloc as shown below:

    $ salloc -N 1 -p pdebug -A guests
    
    $ srun -n1 -c112 flux start ./flux_address.sh
    

    Then from the login node you can launch a Maestro study that schedules to this nested Flux instance:

    $ export FLUX_URI=`cat flux_address.txt`
    
    $ maestro run <study_specification> [run opts]
    
  • Extras

    You can still run other Flux commands from the login node against brokers living inside allocations, using the same uri resolution methods; for example, the flux top command for monitoring your study's Flux jobs in real time:

    $ flux proxy `flux uri --remote slurm:<slurm_jobid>` flux top
    

Example Specs


Check out a few example specifications to get started running with Flux, ranging from simple Flux-managed serial commands to MPI-enabled applications.

Simple serial applications managed by Flux, running parameter combinations in parallel

hello_bye_parameterized_flux.yaml
description:
    name: hello_bye_world
    description: A study that says hello and bye to multiple people.

batch:
    type        : flux
    host        : rzvernal
    bank        : guests
    queue       : pdebug

env:
    variables:
        OUTPUT_PATH: ./sample_output/hello_world_flux
    labels:
        OUT_FORMAT: $(GREETING)_$(NAME).txt

study:
    - name: hello_world
      description: Say hello to someone!
      run:
          cmd: |
            $(LAUNCHER) echo "$(GREETING), $(NAME)!" > $(OUT_FORMAT)
            $(LAUNCHER) sleep 10
          procs: 1
          nested: True
          walltime: "00:60"

    - name: bye_world
      description: Say bye to someone!
      run:
          cmd: |
            $(LAUNCHER) echo "Bye, World!" > bye.txt
            $(LAUNCHER) sleep 10
          procs: 1
          nested: True
          walltime: "00:60"
          depends: [hello_world]

global.parameters:
    NAME:
        values: [Pam, Jim, Michael, Dwight]
        label: NAME.%%
    GREETING:
        values: [Hello, Ciao, Hey, Hi]
        label: GREETING.%%

Workflow topology:

flowchart TD;
    A(study root) --> COMBO1;
    subgraph COMBO1 [Combo #1]
      subgraph say_hello1 [say-hello]
        B(Hello, Pam)
      end
      subgraph say_bye1 [say-bye]
        C(Bye, World!)
      end
      say_hello1 --> say_bye1
    end
    A --> COMBO2
    subgraph COMBO2 [Combo #2]
      direction TB
      subgraph say_hello2 [say-hello]
        D(Ciao, Jim)
      end
      subgraph say_bye2 [say-bye]
        E(Bye, World!)
      end
      say_hello2 --> say_bye2
    end
    A --> COMBO3
    subgraph COMBO3 [Combo #3]
      subgraph say_hello3 [say-hello]
        F(Hey, Michael)
      end
      subgraph say_bye3 [say-bye]
        G(Bye, World!)
      end
      say_hello3 --> say_bye3
    end
    A --> COMBO4
    subgraph COMBO4 [Combo #4]
      subgraph say_hello4 [say-hello]
        H(Hi, Dwight)
      end
      subgraph say_bye4 [say-bye]
        I(Bye, World!)
      end
      say_hello4 --> say_bye4;
    end

Compilation and running of an MPI-parallel application, which also runs the parameter combinations in parallel

lulesh_sample1_unix_flux.yaml
description:
    name: lulesh_sample1
    description: A sample LULESH study that downloads, builds, and runs a parameter study of varying problem sizes and iterations using Flux.

env:
    variables:
        OUTPUT_PATH: ./sample_output/lulesh

    labels:
        outfile: $(SIZE.label).$(ITERATIONS.label).log

    dependencies:
      git:
        - name: LULESH
          path: $(OUTPUT_PATH)
          url: https://github.com/LLNL/LULESH.git

batch:
    type        : flux
    host        : quartz
    bank        : baasic
    queue       : pbatch

study:
    - name: make-lulesh
      description: Build the MPI enabled version of LULESH.
      run:
          cmd: |
            cd $(LULESH)
            mkdir build
            cd build
            cmake -DWITH_MPI=On -DWITH_OPENMP=Off ..
            make
          depends: []

    - name: run-lulesh
      description: Run LULESH.
      run:
          cmd: |
            $(LAUNCHER) $(LULESH)/build/lulesh2.0 -s $(SIZE) -i $(ITERATIONS) -p > $(outfile)
          depends: [make-lulesh]
          nodes: 1
          procs: 1
          cores per task: 1
          nested: True
          priority: high
          walltime: "00:02:00"

global.parameters:
    SIZE:
        values  : [100, 100, 100, 200, 200, 200, 300, 300, 300]
        label   : SIZE.%%
    ITERATIONS:
        values  : [10, 20, 30, 10, 20, 30, 10, 20, 30]
        label   : ITER.%%

Workflow topology:

flowchart TD;
    A(study root) --> B(make-lulesh);
    B-->COMBO1;
    subgraph COMBO1 [Combo #1]
      subgraph run_lulesh1 [run-lulesh]
        C(SIZE=100\nITERATIONS=10)
      end
    end
    B --> COMBO2
    subgraph COMBO2 [Combo #2]
      subgraph run_lulesh2 [run-lulesh]
        D(SIZE=100\nITERATIONS=20)
      end
    end
    B --> COMBO3
    subgraph COMBO3 [Combo #3]
      subgraph run_lulesh3 [run-lulesh]
        E(SIZE=100\nITERATIONS=30)
      end
    end
    B --> COMBO4
    subgraph COMBO4 [Combo #4]
      subgraph run_lulesh4 [run-lulesh]
        F(SIZE=200\nITERATIONS=10)
      end
    end
    B --> COMBO5
    subgraph COMBO5 [Combo #5]
      subgraph run_lulesh5 [run-lulesh]
        G(SIZE=200\nITERATIONS=20)
      end
    end
    B --> COMBO6
    subgraph COMBO6 [Combo #6]
      subgraph run_lulesh6 [run-lulesh]
        H(SIZE=200\nITERATIONS=30)
      end
    end
    B --> COMBO7
    subgraph COMBO7 [Combo #7]
      subgraph run_lulesh7 [run-lulesh]
        I(SIZE=300\nITERATIONS=10)
      end
    end
    B --> COMBO8
    subgraph COMBO8 [Combo #8]
      subgraph run_lulesh8 [run-lulesh]
        J(SIZE=300\nITERATIONS=20)
      end
    end
    B --> COMBO9
    subgraph COMBO9 [Combo #9]
      subgraph run_lulesh9 [run-lulesh]
        K(SIZE=300\nITERATIONS=30)
      end
    end