Processing large numbers of parameters in batches
Scaling up the number of parameter combinations in a study can run into a few roadblocks:

- File system overloading: The current version of Maestro builds a single-level tree in the step workspaces, with parameter combinations expanded flat under the step root. This design can slow down other file system operations such as `rm` and `ls`, with the impact varying by file system. Additionally, there are inode and disk space limits (or per-user quotas on shared systems) that, when hit, can bring a study to a halt until things are cleaned up.
- HPC scheduler overloading: Some HPC schedulers run into scaling issues with large studies because most of the script/scheduler adapters treat each instance of a step (one parameter combination applied to a step) as a discrete batch job. Naively launching studies with thousands or more parameter combinations can quickly swamp the scheduler.
There are a few ways to address this depending on the specific study, including using the `throttle` argument to the run command to limit the number of jobs submitted to the scheduler queue at once, or, if jobs are quick running and/or small, using the flux adapter to pack many jobs into a single allocation. However, these options still leave open the issue of swamping the file system.
An alternative that addresses both concerns is to insert gaps in the execution by processing large numbers of parameter sets in batches across multiple studies. This batched execution allows each batch's outputs to be cleaned up before the next begins, freeing up precious file system space and avoiding deadlocks when that space/quota limit is reached. As a simple model problem we will use `pgen` to provide command line control over the number of parameters to read out of a csv file in each executed study.
Note
This can also use data sources other than csv, including networked sources such as a database, with minimal changes.
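For instance, the csv read in the parameter generator shown later could be swapped for a database query with only a few changed lines. A minimal sketch using Python's built-in sqlite3 module (the database file, table, and column names here are placeholders, not part of the sample):

```python
import sqlite3

# Hypothetical drop-in for the csv-reading loop in the pgen shown later: pull
# the parameter rows out of a SQLite database instead of params.csv.
with sqlite3.connect("params.db") as conn:
    conn.row_factory = sqlite3.Row  # rows indexable by column name, like csv.DictReader
    rows = conn.execute("SELECT param1, param2, param3 FROM params").fetchall()

params_csv = [{key: row[key] for key in row.keys()} for row in rows]
param_names = list(params_csv[0].keys()) if params_csv else []
```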
Generating the parameter set
This example works with the following set of parameters, found in the params.csv file in the batched_parameters samples folder. The parameter set was generated using the csv_generator.py script included below (and in the samples folder), which makes it easy to experiment with alternate parameter names and counts. In this example there is an index column, param_combo, which gets ignored in the `pgen` calls later on. The csv file is all lower case, but the Maestro tokens are upper cased to make them distinct in the study specification, as shown inside the brackets ([ ]) in the column names below.
param_combo | param1 [$(PARAM1)] | param2 [$(PARAM2)] | param3 [$(PARAM3)] |
---|---|---|---|
0 | 94 | 71 | 72 |
1 | 48 | 18 | 60 |
2 | 45 | 56 | 23 |
3 | 0 | 30 | 95 |
4 | 77 | 8 | 34 |
5 | 99 | 44 | 99 |
6 | 19 | 62 | 52 |
7 | 89 | 14 | 26 |
8 | 82 | 6 | 24 |
9 | 32 | 80 | 83 |
10 | 68 | 58 | 65 |
11 | 13 | 45 | 13 |
12 | 65 | 99 | 70 |
13 | 44 | 94 | 86 |
14 | 50 | 35 | 50 |
15 | 89 | 53 | 65 |
16 | 28 | 97 | 47 |
17 | 98 | 93 | 86 |
18 | 24 | 17 | 97 |
19 | 39 | 50 | 83 |
"""
Helper script for generating a csv filled with random samples for multiple parameters
"""
import csv
import argparse
from random import randint
def compute_random_integers(num_ints, start=0, stop=100):
"""Returns num_ints random samples inside interval [start, stop]"""
return [randint(start, stop) for idx in range(num_ints)]
def setup_argparse():
parser = argparse.ArgumentParser(
"csv_gen",
description="Generate csv file fille with random integer samples for"
" several different named parameters",
)
parser.add_argument(
"-o",
dest="csv_file",
default="params.csv",
help="Name of csv file to write out.",
)
parser.add_argument(
"-n",
"--num-values",
default=20,
type=int,
help="Number of values to generate for each parameter"
)
parser.add_argument(
"-p",
"--params",
nargs='+',
default=['param1', 'param2', 'param3'],
)
parser.add_argument(
"-d",
"--debug",
action='store_true',
help="Print out parameter combinations as they're written"
)
parser.add_argument(
"-i",
"--index",
default='',
help="Optionally add an index column of the given name. i.e. "
"param_combo 1, param_combo 2, ..."
)
return parser
if __name__ == "__main__":
parser = setup_argparse()
args = parser.parse_args()
print(args)
params = {}
for param in args.params:
params[param] = compute_random_integers(num_ints=args.num_values)
with open(args.csv_file, 'w') as csvfile:
fieldnames = args.params
if args.index:
fieldnames = [args.index] + fieldnames
print(f"Updated fieldnames: {fieldnames}")
writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
writer.writeheader()
# Invert into list of dicts and write out
if args.debug:
print("Writing params:")
for idx, row in enumerate([dict(zip(params, row_vals)) for row_vals in zip(*params.values())]):
# Add optional index column
if args.index:
row_to_write = {}
row_to_write[args.index] = idx
row_to_write.update(row)
else:
row_to_write = row
writer.writerow(row_to_write)
if args.debug:
print(row_to_write)
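A params.csv like the one above can be regenerated (with different random values) by running the script with its defaults, or by varying the flags, e.g. `python csv_generator.py -n 20 -p param1 param2 param3 -i param_combo`, which writes 20 rows for three parameters plus the param_combo index column.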
Study specification
The sample specification for this is a very simple single-step study that echoes the input parameters and their values.
```yaml
description:
    name: batched_params_demo
    description: A study for demonstrating use of pgen to process large parameter sets

env:
    variables:
        OUTPUT_PATH: ./samples/batched_params_demo

study:
    # NOTE: Want to make a jinja templated version of this along with the csv generator to make testing
    # different param counts less tedious?
    - name: echo-params
      description: Echo the parameter set
      run:
          cmd: |
            echo "PARAM1: $(PARAM1); PARAM2: $(PARAM2); PARAM3: $(PARAM3)"

# NO PARAMETER BLOCK: SEE corresponding pgen
# ADD CLEANUP STEPS LATER -> TAR UP PARENT DIRS/WHOLE WORKSPACE TREE?
#  - this starts going beyond maestro itself, but is likely good to include in the demo
#  - maybe use indexer to query the status and ensure completion before tarring/up, and add param status log tracking which sets are done? -> overlaps with use case involving db's
```
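Note that there is no global.parameters block in this specification; the parameter set is supplied entirely by the custom pgen described next, which is what lets us choose how many combinations each study instance sees.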
Running a subset of parameters
All of the control over which parameters get run in a given study goes through `pgen` via its `pargs`. The batched_demo_pgen.py custom generator has 5 available pargs that control the csv parsing and parameter outputs:
- `CSV`: name/path of the csv file to read parameters from
- `NROWS`: number of parameter sets/combinations to read in for this instance of the study
- `START`: optional row offset to start reading parameters from (i.e. for additional instances of the study)
- `INDEX`: optional name of the index column. This column is not treated as a parameter to be used in the study.
- `DEBUG`: optional flag to add debugging output during study initialization. If any string is added here then the pgen will print out all parameter names and values read from the csv file.
"""An example file that produces a custom parameters for the batched parameters demo"""
from maestrowf.datastructures.core import ParameterGenerator
import csv
import rich.pretty as rp
def get_custom_generator(env, **kwargs):
"""
Create a custom populated ParameterGenerator. Uses pargs to control
reading of parameters from a pre-built csv file for chunked execution.
pargs:
CSV: name of csv file to read parameters from
NROWS: number of rows (param sets) to read out of the csv file
START: optional row offest to start reading parameters from
INDEX: optional name of index column (these are not parameters, just
parameter set id's). Default: no index column.
DEBUG: optional debug flag for extra printouts of parameter reading.
Any string turns this on.
:params env: A StudyEnvironment object containing custom information.
:params kwargs: A dictionary of keyword arguments this function uses.
:returns: A ParameterGenerator populated with parameters.
"""
p_gen = ParameterGenerator()
# Check for the input keywords
params_csv_file_name = kwargs.get('CSV').strip()
num_params = int(kwargs.get('NROWS', '-1').strip())
offset = int(kwargs.get('START', '0').strip())
index_name = kwargs.get('INDEX', '').strip()
debug = kwargs.get('DEBUG', '').strip()
params_csv = []
param_names = []
with open(params_csv_file_name, 'r') as csvfile:
csv_data = csv.DictReader(csvfile)
if debug:
rp.pprint("Reading csv:")
for row in csv_data:
if debug:
rp.pprint(row)
params_csv.append(row)
param_names = csv_data.fieldnames
# excluding optional first column: update name if calling it something
# other than paramset, or leave out the pop and remove that column from the
# csv input
params = {}
for param_name in param_names:
# Skip the index in case an index column is specified in the csv
if index_name and param_name == index_name:
continue
if debug:
rp.pprint(f"Adding Param: {param_name}")
p_gen.add_parameter(
param_name.strip().upper(), # key: strip it in case there was whitespace in the csv file
[row[param_name].strip() for idx, row in enumerate(params_csv) if idx >= offset and (num_params > idx-offset or num_params < 0)], # values
f"{param_name.strip().upper()}.%%", # label
)
if debug:
rp.pprint(p_gen.parameters)
rp.pprint(p_gen.names)
return p_gen
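The list comprehension passed as the values argument above is just a windowed read of the csv rows; a simplified sketch of the same selection logic (the names here are illustrative, not part of the generator's API):

```python
def select_rows(rows, offset=0, num_params=-1):
    """Pick the csv rows this study instance should turn into parameters.

    Equivalent to the filter used in add_parameter above: skip the first
    `offset` rows, then keep `num_params` rows (a negative count keeps all
    remaining rows).
    """
    if num_params < 0:
        return rows[offset:]
    return rows[offset:offset + num_params]

# e.g. with START=3 and NROWS=3 this picks rows 3, 4, and 5 of params.csv
```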
Running an instance with the first 3 rows of parameter combos from the csv is as simple as:
```
maestro run --dry batched_parameters_demo.yaml --pgen batched_demo_pgen.py --pargs "INDEX:param_combo" --pargs "CSV:params.csv" --pargs "DEBUG:true" --pargs "NROWS:3"
```
We can then verify the workspace is as expected, with three directories under the echo-params step for the three parameter combinations selected from the csv input.
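One way to check that (a sketch that assumes Maestro's usual timestamped workspace directory under OUTPUT_PATH; adjust the glob if your output location differs):

```python
from pathlib import Path

# Find the most recent study workspace and list the parameter directories
# created under the echo-params step.
workspaces = sorted(Path("samples/batched_params_demo").glob("batched_params_demo_*"))
step_dir = workspaces[-1] / "echo-params"
param_dirs = sorted(p.name for p in step_dir.iterdir() if p.is_dir())
print(f"{len(param_dirs)} parameter combinations:")
for name in param_dirs:
    print(f"  {name}")
```

A follow-on batch picking up where this one left off can then be launched with the same command by shifting the offset, e.g. adding `--pargs "START:3"` so the next three rows of the csv are read.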
Next steps
The next step in this 'how-to' is left up to the reader. At this point we have a study and parameter generator that can be used to process large numbers of parameter combinations in batches. Managing the disk-space/inode concerns can now be addressed between parameter batches. Potential options could include:
- Extracting the necessary data and uploading it into an external database, either offline or as subsequent steps in the study executed by Maestro directly. The entire study workspace can then be deleted upon success (what determines success being highly workflow dependent).
- Tarring up the outputs (see the sketch after this list). This could be as simple as tarring up the whole Maestro study workspace for later processing, or more targeted tarring of each step's outputs, compressing each step/parameter workspace into a single file and deleting the originals to conserve inodes.
- Archiving the outputs to some other file system, either with or without tarring.
- ...
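As a minimal sketch of the tarring option above (the workspace path is a placeholder, and whether to delete the originals depends entirely on your success criteria):

```python
import shutil
import tarfile
from pathlib import Path

# Placeholder path to a completed batch's workspace; point this at a real one.
workspace = Path("samples/batched_params_demo/batched_params_demo_20240101-123456")

# Compress the echo-params step (and all of its parameter workspaces) into a
# single archive so thousands of small files collapse into a handful of inodes.
step_dir = workspace / "echo-params"
archive = workspace / "echo-params.tar.gz"
with tarfile.open(archive, "w:gz") as tar:
    tar.add(step_dir, arcname=step_dir.name)

# Only remove the originals once the archive (and the batch itself) has been
# verified against your workflow's definition of success.
shutil.rmtree(step_dir)
```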