Splitting a job into subjobs
Learning Objectives
Learn how to process many files in parallel on the grid by splitting a job into many subjobs
In the previous lesson, you’ve submitted a job to the LHC grid. You will notice that the job will take a long time to finish. This is because it has to process many gigabytes of data.
ganga
provides several splitters that implement strategies for processing data in parallel.
The one we will use now is SplitByFiles
, which spawns several subjobs, each of which only processes a certain number of files.
Apart from processing data faster, this will also allow you to work with datasets that are spread across several sites of the LHC grid. This is because a job can only access datasets that are available on the site that it runs on.
To activate a splitter, assign it to the .splitter
attribute of your job:
j.splitter = SplitByFiles(filesPerJob=5)
Note that the specified number of files per job is only the allowed maximum. You will often get jobs with fewer files.
How do I choose the number of files per job?
Choose fewer files per job if possible, as this will allow you to finish sooner
and reduces the impact of jobs failing due to grid problems.
Setting filesPerJob=5
should work well for real data, while filesPerJob=1
should be good for signal MC.
Splitter arguments
The splitter has other useful arguments:
maxFiles
: the maximal total number of files. By default the splitter will run over all files in the dataset (which corresponds to the default value of -1)ignoremissing
: boolean indicating whether it is appropriate to run if there are data files which are not accessible at the moment. This is important if it is necessary to make sure that the resulting ntuples correspond to the whole data/MC sample.
Now, when you run j.submit()
, the job will automatically be split into several subjobs.
These can be displayed by entering
jobs(787).subjobs
in ganga
, where you have to replace 787 with the id
of your main job.
You can access individual subjobs as in jobs(787).subjobs(2)
.
For example, to resubmit a failed subjob, you would run
jobs(787).subjobs(2).resubmit()
To access several subjobs at once, you can use the .select
method:
jobs(787).subjobs.select(status='failed').resubmit()
This will resubmit all failed subjobs.
If you want to do something more complex with each subjob, a regular for
-loop can be used as well:
for j in jobs(787).subjobs:
print(j.id)
It’s possible that some of your subjobs will be stuck in a certain state (submitting/completing/…). If that is the case, try to reset the Dirac backend:
jobs(787).subjobs(42).backend.reset()
If you want to this for all stucked subjobs of your job in once, you can use
jobs(787).backend.reset(True)
If that doesn’t help, try failing the job and resubmitting:
jobs(787).subjobs(42).force_status('failed')
jobs(787).subjobs(42).resubmit()
It can take quite a while to submit all of your subjobs.
If you want to continue working in ganga while submitting jobs, you can use the queues
feature to do just that.
Simply call queues.add
with the submit
function of a job without adding parentheses, like this:
queues.add(j.submit)
Ganga will then submit this job (and its subjobs) in the background.
Make sure not to close ganga
before the submission is finished, or you will have to start submitting the rest of the jobs again later on.
Splitting your first job
Try splitting the ganga
job from our previous lesson with splitByFiles=1
(reference code
) and submit it with ganga
.