Batch processing
pampro contains a module called batch_processing, which takes each line in a job file and passes it to an analysis function that you define. This saves you from writing cumbersome loops over your data. If you are working in a high-performance computing environment, you can use the same function to divide the job into smaller batches and process multiple files in parallel.
The default behaviour of the batch processing function is to process everything in a job file, 1 line at a time. To use it, you simply define a Python function that takes 1 argument (a dictionary) and does something with your data:
def my_clever_function(job_details):
    # Do some stuff
    # ...
    # Write some results to a file
    pass
You then pass that function as an argument to the batch processing module, along with the path to a job file:
batch_processing.batch_process(my_clever_function, "C:/path/to/job_file.csv")
The batch_process() function takes each line of the given CSV file, turns it into a dictionary of key-value pairs, and calls your function with that dictionary. For example, if your job file contains columns named id, filename, and wrist_location, your analysis function will have access to the following values:
job_details["id"]
job_details["filename"]
job_details["wrist_location"]
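To make the row-to-dictionary step concrete, here is a minimal sketch using Python's standard csv module. It is an illustration of the idea only, not pampro's own implementation; the job file contents and paths are hypothetical, and an in-memory string stands in for a real CSV file.

```python
import csv
import io

# Hypothetical job file contents; the column names match the example above
job_file = io.StringIO(
    "id,filename,wrist_location\n"
    "001,/data/001.bin,left\n"
    "002,/data/002.bin,right\n"
)

def my_clever_function(job_details):
    # Every value arrives as a string, keyed by its column header
    return "{} -> {} ({})".format(
        job_details["id"], job_details["filename"], job_details["wrist_location"]
    )

# Illustration only: call the analysis function once per row of the job file
results = [my_clever_function(dict(row)) for row in csv.DictReader(job_file)]
```

Note that values read this way are strings, so a numeric column would need converting inside the analysis function.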
Processing physical activity datasets can be done in an embarrassingly parallel fashion. In other words, each file can be processed in isolation, and if you are processing many at once, those processes don't need to talk to each other. This makes scaling up your analyses in pampro very easy, as long as you have the computing resources available.
In the basic usage example, we defined a function called my_clever_function(), and we gave that to batch_process() to do all the looping. batch_process() also has 2 optional arguments, job_num and num_jobs, both of which default to 1. num_jobs tells the function how many parallel jobs we intend to run, so it should split the job file into that many approximately equal batches. job_num tells it which of those batches this particular call should process. In other words, if we were to execute:
batch_processing.batch_process(my_clever_function, "C:/path/to/job_file.csv", job_num=1, num_jobs=3)
The batch_process() function will split the given job file into 3 roughly equal parts, and process only the 1st of the 3 parts. If we then initiate separate processes, we could execute:
batch_processing.batch_process(my_clever_function, "C:/path/to/job_file.csv", job_num=2, num_jobs=3)
and
batch_processing.batch_process(my_clever_function, "C:/path/to/job_file.csv", job_num=3, num_jobs=3)
...which would result in 3 processes running in parallel. The standard way to achieve this is to pass command-line arguments to the Python analysis script, and then pass those on to the batch_process() function. In other words, a more complete example script might follow this pattern:
import sys
# Assumes pampro is installed and exposes the batch_processing module
from pampro import batch_processing

# Accept job_num and num_jobs from the command-line, and convert them to integers
job_num = int(sys.argv[1])
num_jobs = int(sys.argv[2])

def my_clever_function(job_details):
    # Do some stuff
    # ...
    # Write some results to a file
    pass

batch_processing.batch_process(my_clever_function, "C:/path/to/job_file.csv", job_num, num_jobs)
If you saved such a script as my_clever_analysis.py, you could then execute it from the command-line like so:
ipython my_clever_analysis.py 1 100
ipython my_clever_analysis.py 2 100
...
ipython my_clever_analysis.py 100 100
...dividing the processing up however you wish (the above example divides the job file into 100 batches). Naturally, if you are in a high-performance computing environment, you most likely have a queue submission command to wrap around this. Before you attempt to run 100 processes at once, you should determine the resources that are required and confirm that you have them available. These are things that only your local systems administrator can advise you on.
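The way job_num and num_jobs select a batch can be sketched as follows. The exact splitting strategy pampro uses is not described here, so this is only one plausible scheme (contiguous batches whose sizes differ by at most one); split_into_batches and select_batch are hypothetical helper names, not part of pampro.

```python
def split_into_batches(rows, num_jobs):
    """Split rows into num_jobs contiguous batches whose sizes differ by at most 1."""
    base, extra = divmod(len(rows), num_jobs)
    batches = []
    start = 0
    for i in range(num_jobs):
        # The first `extra` batches each take one additional row
        size = base + (1 if i < extra else 0)
        batches.append(rows[start:start + size])
        start += size
    return batches

def select_batch(rows, job_num, num_jobs):
    # job_num is 1-based, matching the batch_process() calls shown earlier
    return split_into_batches(rows, num_jobs)[job_num - 1]

# Ten hypothetical job file rows, split across 3 jobs
rows = list(range(10))
batch_sizes = [len(select_batch(rows, n, 3)) for n in (1, 2, 3)]
```

Under this scheme every row lands in exactly one batch, so running all job_num values from 1 to num_jobs covers the whole job file without overlap.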