Processing large numbers of parameters in batches
Scaling up the number of parameter combinations in a study can run into a few roadblocks:

- File system overloading: The current version of Maestro builds a single-level tree in the step workspaces, with parameter combinations expanded flat under the step root. This design can slow down other file system operations such as `rm` and `ls`, with the impact varying by file system. Additionally, there are inode and disk space limits (or per-user quotas on shared systems) that, when hit, can bring a study to a halt until things are cleaned up.
- HPC scheduler overloading: Some HPC schedulers run into scaling issues with large studies because most of the script/scheduler adapters treat each instance of a step (one parameter combination applied to a step) as a discrete batch job. Naively launching studies with thousands or more parameter combinations can quickly swamp the scheduler.
There are a few ways to address this depending on the specific study, including using the `throttle` argument to the run command to limit the number of jobs submitted to the scheduler queue at once, or, if jobs are quick running and/or small, using the flux adapter to pack many jobs into a single allocation. However, these options still leave open the issue of swamping the file system.
An alternative that addresses both concerns is to insert gaps in the execution by processing large numbers of parameter sets in batches across multiple studies. This batched execution allows each batch's outputs to be cleaned up before the next begins, freeing up precious file system space and avoiding deadlocks when that space/quota limit is reached. As a simple model problem we will use `pgen` to provide command line control over the number of parameters to read out of a csv file in each executed study.
Note
This can also use data sources other than csv, including networked sources such as a database, with minimal changes.
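For instance, the csv read in the parameter generator shown later could be swapped for a database query with only a few changed lines. A minimal sketch using Python's built-in sqlite3 module (the database file, table, and column names here are placeholders, not part of the sample):

```python
import sqlite3

# Hypothetical drop-in for the csv-reading loop in the pgen shown later: pull
# the parameter rows out of a SQLite database instead of params.csv.
with sqlite3.connect("params.db") as conn:
    conn.row_factory = sqlite3.Row  # rows indexable by column name, like csv.DictReader
    rows = conn.execute("SELECT param1, param2, param3 FROM params").fetchall()

params_csv = [{key: row[key] for key in row.keys()} for row in rows]
param_names = list(params_csv[0].keys()) if params_csv else []
```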
Generating the parameter set
This example works with the following set of parameters, found in the params.csv file in the batched_parameters samples folder. The parameter set was generated using the csv_generator.py script included below (and in the samples folder), which makes it easy to experiment with alternate parameter names and counts. In this example there is an index column, param_combo, which gets ignored in the `pgen` calls later on. The csv file is all lower case, but the Maestro tokens are upper cased to make them distinct in the study specification, as shown inside the brackets ([ ]) in the column names below.
param_combo | param1 [$(PARAM1)] | param2 [$(PARAM2)] | param3 [$(PARAM3)] |
---|---|---|---|
0 | 94 | 71 | 72 |
1 | 48 | 18 | 60 |
2 | 45 | 56 | 23 |
3 | 0 | 30 | 95 |
4 | 77 | 8 | 34 |
5 | 99 | 44 | 99 |
6 | 19 | 62 | 52 |
7 | 89 | 14 | 26 |
8 | 82 | 6 | 24 |
9 | 32 | 80 | 83 |
10 | 68 | 58 | 65 |
11 | 13 | 45 | 13 |
12 | 65 | 99 | 70 |
13 | 44 | 94 | 86 |
14 | 50 | 35 | 50 |
15 | 89 | 53 | 65 |
16 | 28 | 97 | 47 |
17 | 98 | 93 | 86 |
18 | 24 | 17 | 97 |
19 | 39 | 50 | 83 |
"""
Helper script for generating a csv filled with random samples for multiple parameters
"""
import csv
import argparse
from random import randint
def compute_random_integers(num_ints, start=0, stop=100):
"""Returns num_ints random samples inside interval [start, stop]"""
return [randint(start, stop) for idx in range(num_ints)]
def setup_argparse():
parser = argparse.ArgumentParser(
"csv_gen",
description="Generate csv file fille with random integer samples for"
" several different named parameters",
)
parser.add_argument(
"-o",
dest="csv_file",
default="params.csv",
help="Name of csv file to write out.",
)
parser.add_argument(
"-n",
"--num-values",
default=20,
type=int,
help="Number of values to generate for each parameter"
)
parser.add_argument(
"-p",
"--params",
nargs='+',
default=['param1', 'param2', 'param3'],
)
parser.add_argument(
"-d",
"--debug",
action='store_true',
help="Print out parameter combinations as they're written"
)
parser.add_argument(
"-i",
"--index",
default='',
help="Optionally add an index column of the given name. i.e. "
"param_combo 1, param_combo 2, ..."
)
return parser
if __name__ == "__main__":
parser = setup_argparse()
args = parser.parse_args()
print(args)
params = {}
for param in args.params:
params[param] = compute_random_integers(num_ints=args.num_values)
with open(args.csv_file, 'w') as csvfile:
fieldnames = args.params
if args.index:
fieldnames = [args.index] + fieldnames
print(f"Updated fieldnames: {fieldnames}")
writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
writer.writeheader()
# Invert into list of dicts and write out
if args.debug:
print("Writing params:")
for idx, row in enumerate([dict(zip(params, row_vals)) for row_vals in zip(*params.values())]):
# Add optional index column
if args.index:
row_to_write = {}
row_to_write[args.index] = idx
row_to_write.update(row)
else:
row_to_write = row
writer.writerow(row_to_write)
if args.debug:
print(row_to_write)
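A params.csv like the one above can be regenerated (with different random values) by running the script with its defaults, or by varying the flags, e.g. `python csv_generator.py -n 20 -p param1 param2 param3 -i param_combo`, which writes 20 rows for three parameters plus the param_combo index column.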
Study specification
The sample specification for this is a very simple single-step study that echoes the input parameters and their values.
```yaml
description:
    name: batched_params_demo
    description: A study for demonstrating use of pgen to process large parameter sets

env:
    variables:
        OUTPUT_PATH: ./samples/batched_params_demo

study:
    # NOTE: Want to make a jinja templated version of this along with the csv generator to make testing
    # different param counts less tedious?
    - name: echo-params
      description: Echo the parameter set
      run:
          cmd: |
            echo "PARAM1: $(PARAM1); PARAM2: $(PARAM2); PARAM3: $(PARAM3)"

# NO PARAMETER BLOCK: SEE corresponding pgen
# ADD CLEANUP STEPS LATER -> TAR UP PARENT DIRS/WHOLE WORKSPACE TREE?
#  - this starts going beyond maestro itself, but is likely good to include in the demo
#  - maybe use indexer to query the status and ensure completion before tarring/up, and add param status log tracking which sets are done? -> overlaps with use case involving db's
```
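Note that there is no global.parameters block in this specification; the parameter set is supplied entirely by the custom pgen described next, which is what lets us choose how many combinations each study instance sees.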
Running a subset of parameters
All of the control over which parameters get run in a given study goes through `pgen` via its `pargs`. The batched_demo_pgen.py custom generator has 5 available pargs that control the csv parsing and parameter outputs:
- `CSV`: name/path of the csv file to read parameters from
- `NROWS`: number of parameter sets/combinations to read in for this instance of the study
- `START`: optional row offset to start reading parameters from (i.e. for additional instances of the study)
- `INDEX`: optional name of the index column. This column is not treated as a parameter to be used in the study.
- `DEBUG`: optional flag to add debugging output during study initialization. If any string is added here then the pgen will print out all parameter names and values read from the csv file.
"""An example file that produces a custom parameters for the batched parameters demo"""
from maestrowf.datastructures.core import ParameterGenerator
import csv
import rich.pretty as rp
def get_custom_generator(env, **kwargs):
"""
Create a custom populated ParameterGenerator. Uses pargs to control
reading of parameters from a pre-built csv file for chunked execution.
pargs:
CSV: name of csv file to read parameters from
NROWS: number of rows (param sets) to read out of the csv file
START: optional row offest to start reading parameters from
INDEX: optional name of index column (these are not parameters, just
parameter set id's). Default: no index column.
DEBUG: optional debug flag for extra printouts of parameter reading.
Any string turns this on.
:params env: A StudyEnvironment object containing custom information.
:params kwargs: A dictionary of keyword arguments this function uses.
:returns: A ParameterGenerator populated with parameters.
"""
p_gen = ParameterGenerator()
# Check for the input keywords
params_csv_file_name = kwargs.get('CSV').strip()
num_params = int(kwargs.get('NROWS', '-1').strip())
offset = int(kwargs.get('START', '0').strip())
index_name = kwargs.get('INDEX', '').strip()
debug = kwargs.get('DEBUG', '').strip()
params_csv = []
param_names = []
with open(params_csv_file_name, 'r') as csvfile:
csv_data = csv.DictReader(csvfile)
if debug:
rp.pprint("Reading csv:")
for row in csv_data:
if debug:
rp.pprint(row)
params_csv.append(row)
param_names = csv_data.fieldnames
# excluding optional first column: update name if calling it something
# other than paramset, or leave out the pop and remove that column from the
# csv input
params = {}
for param_name in param_names:
# Skip the index in case an index column is specified in the csv
if index_name and param_name == index_name:
continue
if debug:
rp.pprint(f"Adding Param: {param_name}")
p_gen.add_parameter(
param_name.strip().upper(), # key: strip it in case there was whitespace in the csv file
[row[param_name].strip() for idx, row in enumerate(params_csv) if idx >= offset and (num_params > idx-offset or num_params < 0)], # values
f"{param_name.strip().upper()}.%%", # label
)
if debug:
rp.pprint(p_gen.parameters)
rp.pprint(p_gen.names)
return p_gen
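The list comprehension passed as the values argument above is just a windowed read of the csv rows; a simplified sketch of the same selection logic (the names here are illustrative, not part of the generator's API):

```python
def select_rows(rows, offset=0, num_params=-1):
    """Pick the csv rows this study instance should turn into parameters.

    Equivalent to the filter used in add_parameter above: skip the first
    `offset` rows, then keep `num_params` rows (a negative count keeps all
    remaining rows).
    """
    if num_params < 0:
        return rows[offset:]
    return rows[offset:offset + num_params]

# e.g. with START=3 and NROWS=3 this picks rows 3, 4, and 5 of params.csv
```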
Running an instance with the first 3 rows of parameter combos from the csv is as simple as:
```
maestro run --dry batched_parameters_demo.yaml --pgen batched_demo_pgen.py --pargs "INDEX:param_combo" --pargs "CSV:params.csv" --pargs "DEBUG:true" --pargs "NROWS:3"
```
We can then verify the workspace is as expected, with three directories under the echo-params step for the three parameter combinations selected from the csv input.
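One way to check that (a sketch that assumes Maestro's usual timestamped workspace directory under OUTPUT_PATH; adjust the glob if your output location differs):

```python
from pathlib import Path

# Find the most recent study workspace and list the parameter directories
# created under the echo-params step.
workspaces = sorted(Path("samples/batched_params_demo").glob("batched_params_demo_*"))
step_dir = workspaces[-1] / "echo-params"
param_dirs = sorted(p.name for p in step_dir.iterdir() if p.is_dir())
print(f"{len(param_dirs)} parameter combinations:")
for name in param_dirs:
    print(f"  {name}")
```

A follow-on batch picking up where this one left off can then be launched with the same command by shifting the offset, e.g. adding `--pargs "START:3"` so the next three rows of the csv are read.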
Next steps
The next step in this 'how-to' is left up to the reader. At this point we have a study and parameter generator that can be used to process large numbers of parameter combinations in batches. Managing the disk-space/inode concerns can now be addressed between parameter batches. Potential options could include:
- Extracting the necessary data and uploading it into an external database, either offline or as subsequent steps in the study executed by Maestro directly. The entire study workspace can then be deleted upon success (what determines success being highly workflow dependent).
- Tarring up the outputs (see the sketch after this list). This could be as simple as tarring up the whole Maestro study workspace for later processing, or more targeted tarring of each step's outputs, compressing each step/parameter workspace into a single file and deleting the originals to conserve inodes.
- Archiving the outputs to some other file system, either with or without tarring.
- ...
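As a minimal sketch of the tarring option above (the workspace path is a placeholder, and whether to delete the originals depends entirely on your success criteria):

```python
import shutil
import tarfile
from pathlib import Path

# Placeholder path to a completed batch's workspace; point this at a real one.
workspace = Path("samples/batched_params_demo/batched_params_demo_20240101-123456")

# Compress the echo-params step (and all of its parameter workspaces) into a
# single archive so thousands of small files collapse into a handful of inodes.
step_dir = workspace / "echo-params"
archive = workspace / "echo-params.tar.gz"
with tarfile.open(archive, "w:gz") as tar:
    tar.add(step_dir, arcname=step_dir.name)

# Only remove the originals once the archive (and the batch itself) has been
# verified against your workflow's definition of success.
shutil.rmtree(step_dir)
```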