Scripting Ganga
We have already started using Ganga, for example when submitting jobs to the Grid and when using datasets from the bookkeeping to create jobs, but there’s a lot more you can do with it.
Part of Ganga’s power comes from it being written in Python. When you run ganga, you’re given an IPython prompt where you input Python code that’s executed when you hit <enter>.
The idea of running Python code extends outside of Ganga, where we can write
scripts that Ganga will execute when starting up. This lesson will focus on
writing job definition scripts, and exploring how we can define utility
functions that will be available across all of our Ganga sessions.
Defining jobs with scripts
The ganga executable is similar to the python and ipython executables in
a couple of ways.
If you just run ganga, you are dropped into a prompt, but you can also supply
the path to a Python script that will be executed.
Let’s start with a small script, saving it in a file called create_job.py:
greeting = 'Hello!'
print greeting
Run it:
$ ganga create_job.py
*** Welcome to Ganga ***
Version: X.Y.Z
…
Hello!
…
Sensible enough. Just like python and ipython, we can pass the -i flag
before the file path to tell Ganga to give us a prompt after it’s finished
executing the script:
$ ganga -i create_job.py
*** Welcome to Ganga ***
Version: X.Y.Z
…
Hello!
Ganga In [1]: greeting
Ganga Out [1]: 'Hello!'
Ganga In [2]:
Notice that the variable we defined in the script, greeting, is available in
the interactive session.
The idea of doing some work in a script and then manipulating the result
interactively can be quite powerful.
One workflow that you might find useful is to create a script that defines a
job, because this can often take a few lines to do, and typing them out every
time is boring.
Let’s modify create_job.py to do that.
# Note: Ganga makes objects like `Job` available in your script automagically
j = Job()
j.name = 'My job'
This example is quite boring, but it captures the idea. You’ll want to
extend this, for example by changing the application property to a GaudiExec
instance, as covered in a previous lesson.
Now we can run this and interact with the job as the j variable:
$ ganga -i create_job.py
*** Welcome to Ganga ***
Version: X.Y.Z
…
Ganga In [1]: j
Ganga Out [1]:
Job (
comment = ,
parallel_submit = False,
…
)
Ganga In [2]:
We often want to construct a set of very similar jobs that differ only by their
input data, for example running the same DaVinci application over 2015 and 2016
data and for magnet up and magnet down configurations.
We then need to parameterise our script, and one way of doing this is to pass
arguments to it on the command line.
You can inspect the arguments from within a Python script by using the argv
attribute of the sys module:
import sys
print sys.argv
Add that to your create_job.py script, and run ganga again, this time
passing some arguments:
$ ganga -i create_job.py -v 123 --hello=world
*** Welcome to Ganga ***
Version: X.Y.Z
…
['create_job.py', '-v', '123', '--hello=world']
Ganga In [1]: j
Our script sees sys.argv as the list of arguments that come after ganga -i,
starting with the name of the script itself.
To parameterise our script for year and magnet polarity, we could check this
list to find one of 2015 or 2016 and one of Up or Down, for example,
but instead we’ll opt to use the excellent argparse module, which
comes with Python, to parse the command-line arguments for us.
import argparse
parser = argparse.ArgumentParser(description="Make my DaVinci job.")
parser.add_argument('year', type=int, choices=[2015, 2016],
                    help='Year of data-taking to run over')
parser.add_argument('polarity', choices=['Up', 'Down'],
                    help='Polarity of data-taking to run over')
parser.add_argument('--test', action='store_true',
                    help='Run over one file locally')
args = parser.parse_args()
year = args.year
polarity = args.polarity
test = args.test
Nicely, argparse gives us a useful --help argument for free:
$ ganga -i create_job.py --help
*** Welcome to Ganga ***
Version: X.Y.Z
…
usage: create_job.py [-h] [--test] {2015,2016} {Up,Down}
Make my DaVinci job.
positional arguments:
  {2015,2016}  Year of data-taking to run over
  {Up,Down}    Polarity of data-taking to run over
optional arguments:
  -h, --help   show this help message and exit
  --test       Run over one file locally
Ganga In [1]:
This help will also be printed if we don’t supply all of the required arguments (the year and the magnet polarity), along with a message telling us what’s missing.
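To see this behaviour without starting Ganga, here is a minimal sketch that rebuilds the same parser and passes explicit argument lists to parse_args. The helper names build_parser and try_parse are hypothetical and exist only for this illustration. When parsing fails, argparse prints the usage message and raises SystemExit, which the sketch catches so that it can carry on.

```python
import argparse


def build_parser():
    """Return a parser equivalent to the one defined in create_job.py."""
    parser = argparse.ArgumentParser(description="Make my DaVinci job.")
    parser.add_argument('year', type=int, choices=[2015, 2016],
                        help='Year of data-taking to run over')
    parser.add_argument('polarity', choices=['Up', 'Down'],
                        help='Polarity of data-taking to run over')
    parser.add_argument('--test', action='store_true',
                        help='Run over one file locally')
    return parser


def try_parse(argv):
    """Parse argv, returning None instead of exiting when parsing fails."""
    try:
        return build_parser().parse_args(argv)
    except SystemExit:
        # argparse has already printed the usage and error message to stderr
        return None


# All required arguments supplied, plus the optional flag
good = try_parse(['2015', 'Down', '--test'])
print('{0} {1} {2}'.format(good.year, good.polarity, good.test))

# The polarity is missing here, so parsing fails
missing = try_parse(['2015'])
print(missing)
```

Passing a list to parse_args instead of relying on sys.argv is a convenient way of trying out a parser while you are still designing your arguments.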
Getting to grips with argparse
The argparse module can do a lot, being able to parse complex sets of
arguments without much difficulty. It’s a useful tool to know in general, so we
recommend that you check out the documentation to learn more.
When we do supply all the necessary arguments, the values are then available in
the year, polarity, and test variables:
$ ganga -i create_job.py 2015 Down
*** Welcome to Ganga ***
Version: X.Y.Z
…
Ganga In [1]: print year, polarity, test
2015 Down False
Once you’ve reached this level, a whole world of possibilities opens up! Here are a few things you might proceed to do with these parameters in your script:
Fetch the corresponding dataset using a BKQuery;
Give your Job object a specific name, e.g. j.name = 'Ntuples_{0}_{1}'.format(year, polarity);
Give data-specific options files to the application object, e.g. if you have one options file per year defining DaVinci().DataType.
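As a minimal sketch of the last two ideas, the function below derives a job name and a list of options files from the year and polarity. The function name job_settings and the file names options_2015.py, options_2016.py and ntuple_options.py are hypothetical, chosen only for illustration; use whatever naming your own analysis follows.

```python
def job_settings(year, polarity):
    """Derive per-dataset job settings from the command-line parameters."""
    name = 'Ntuples_{0}_{1}'.format(year, polarity)
    # Hypothetical convention: one options file per year that sets the
    # DataType, followed by a common options file shared by all years
    options = ['options_{0}.py'.format(year), 'ntuple_options.py']
    return {'name': name, 'options': options}


settings = job_settings(2015, 'Down')
print(settings['name'])
print(settings['options'])
```

Inside create_job.py you could then assign settings['name'] to j.name, and hand settings['options'] to the application object, assuming your application accepts a list of options files.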
Of course, you can add as many arguments as you think might be useful.
Above we added the --test flag as an example: if this is True, you could
run the application over only a single data file, and run the job locally
rather than on the Grid (setting j.backend appropriately).
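One way of organising the test-mode logic is sketched below. The function choose_run_mode and the labels 'local' and 'grid' are hypothetical; the function only decides which files to use and where to run, leaving the Ganga-specific assignments to your script.

```python
def choose_run_mode(input_files, test, n_test_files=1):
    """Return (files to run over, backend label) for a test or a full run."""
    files = list(input_files)
    if test:
        # A quick local check over a small number of files
        return files[:n_test_files], 'local'
    # The full list of files, intended for the Grid
    return files, 'grid'


# Placeholder file names standing in for a real dataset
files, where = choose_run_mode(['a.dst', 'b.dst', 'c.dst'], test=True)
print('{0} file(s), run on: {1}'.format(len(files), where))
```

In create_job.py you would then pick the value of j.backend based on the returned label, and restrict the job’s input data to the returned files.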
Adding helper functions
We’ve seen above how giving a script to ganga makes the variables defined in
that script available interactively.
But what if you have, or would like to have, some set of your own custom helper
functions defined in every session? It would be annoying to have to run
ganga my_helpers.py every time! Luckily, the .ganga.py file comes to the rescue.
When Ganga starts, it looks for a file in your home directory (echo $HOME)
called .ganga.py (note the starting period in the filename). If it finds such
a file, it executes it in the context of the Ganga session, meaning the code in
the file has access to Ganga objects like Job, jobs, and so on.
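To get a feel for what “executing a file in the context of a session” means, here is a small pure-Python sketch. It is not Ganga’s actual implementation: the function run_startup_file and the stand-in session dictionary are hypothetical. It shows that code executed in an existing namespace can use the names already defined there, and that anything the code defines ends up in that namespace.

```python
import os
import tempfile


def run_startup_file(path, namespace):
    """Execute the file at path using namespace as its global variables."""
    if not os.path.exists(path):
        # Nothing to do when there is no startup file
        return namespace
    with open(path) as f:
        code = compile(f.read(), path, 'exec')
    exec(code, namespace)
    return namespace


# Stand-in for a session that already provides a 'jobs' object
session = {'jobs': ['job 0', 'job 1']}

# Write a small stand-in startup file to a temporary location
with tempfile.NamedTemporaryFile('w', suffix='.py', delete=False) as tmp:
    tmp.write('def n_jobs():\n    return len(jobs)\n')

run_startup_file(tmp.name, session)
os.remove(tmp.name)

# The helper defined in the file can see 'jobs' from the session
print(session['n_jobs']())
```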
To demonstrate the behaviour, we can put a print statement in our
~/.ganga.py file:
print 'Yo!'
Then run ganga (no arguments needed):
$ ganga
*** Welcome to Ganga ***
Version: X.Y.Z
…
Yo!
…
Neat. The general idea for this file is two-fold:
Add commands that you always want executed when Ganga starts, e.g. print jobs.select(status='running'); and
Define functions for commonly-performed tasks.
The latter is particularly interesting. Do you often find yourself creating a file that contains all the output LFNs of your job? Write a helper!
def write_lfns(j, filename):
    """Write LFNs of all DiracFiles of all completed subjobs to filename."""
    # Treat a job with subjobs the same as a job with no subjobs
    sjobs = j.subjobs
    if len(sjobs) == 0:
        sjobs = [j]
    lfns = []
    for sj in sjobs:
        if sj.status != 'completed':
            print 'Skipping #{0}'.format(sj.id)
            continue
        for df in sj.outputfiles.get(DiracFile):
            lfns.append(df.lfn)
    with open(filename, 'w') as f:
        f.write('\n'.join(lfns))
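A natural companion to write_lfns is a helper that reads such a file back into a list, for example to feed the LFNs to another job. The sketch below is plain Python; the name read_lfns is hypothetical and the LFNs in the round trip are made up for illustration.

```python
import os
import shutil
import tempfile


def read_lfns(filename):
    """Return the LFNs stored one per line in filename as a list."""
    with open(filename) as f:
        # Ignore surrounding whitespace and any blank lines
        return [line.strip() for line in f if line.strip()]


# Round trip with made-up LFNs, written in the same format write_lfns uses
example_lfns = ['/lhcb/user/example/0/output.root',
                '/lhcb/user/example/1/output.root']
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'example.lfns')
with open(path, 'w') as f:
    f.write('\n'.join(example_lfns))

recovered = read_lfns(path)
shutil.rmtree(tmp_dir)
print(recovered == example_lfns)
```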
How about downloading and merging the ROOT output of a job’s subjobs? Write a helper!
def merge_root_output(j, input_tree_name, merged_filepath):
    """Merge the ROOT output of all completed subjobs into merged_filepath."""
    # Import here so that sessions without ROOT can still load this file
    import ROOT
    # Treat a job with subjobs the same as a job with no subjobs
    sjobs = j.subjobs
    if len(sjobs) == 0:
        sjobs = [j]
    access_urls = []
    for sj in sjobs:
        if sj.status != 'completed':
            print 'Skipping #{0}'.format(sj.id)
            continue
        for df in sj.outputfiles.get(DiracFile):
            access_urls.append(df.accessURL())
    tchain = ROOT.TChain(input_tree_name)
    for url in access_urls:
        tchain.Add(url)
    tchain.Merge(merged_filepath)
Because of the way a ROOT TChain works, the subjobs’ output won’t be
downloaded, so you only need enough disk space for the merged file.
Using ROOT in Ganga
By default, ROOT is not available in a Ganga session:
Ganga In [1]: import ROOT
ERROR No module named ROOT
To remedy this, you can start Ganga inside an environment where ROOT is available:
$ lb-run ROOT ganga
Once you have your helpers defined, use them in Ganga as you would any other Python function.
Ganga In [1]: j = jobs(123)
Ganga In [2]: write_lfns(j, '{0}.lfns'.format(j.name))
Here are some other common operations that you might want helpers for:
Deleting all LFNs created by a job;
Resetting the backend of all subjobs which are marked as failed;
Replicating all LFNs to a specific Grid site.
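As a starting point for the second item, here is a minimal sketch of a helper that resets the backend of every failed subjob. It assumes that each subjob’s backend object provides a reset() method; check this against the backend you use. The FakeBackend and FakeJob classes are hypothetical stand-ins so that the sketch can be tried outside of Ganga.

```python
def reset_failed_subjobs(j):
    """Reset the backend of every subjob of j whose status is 'failed'."""
    n_reset = 0
    for sj in j.subjobs:
        if sj.status != 'failed':
            continue
        # Assumption: the backend object provides a reset() method
        sj.backend.reset()
        n_reset += 1
    return n_reset


class FakeBackend(object):
    """Stand-in backend that only records whether reset() was called."""
    def __init__(self):
        self.was_reset = False

    def reset(self):
        self.was_reset = True


class FakeJob(object):
    """Stand-in job with just the attributes the helper relies on."""
    def __init__(self, status, subjobs=()):
        self.status = status
        self.subjobs = list(subjobs)
        self.backend = FakeBackend()


parent = FakeJob('failed', [FakeJob('completed'),
                            FakeJob('failed'),
                            FakeJob('failed')])
n = reset_failed_subjobs(parent)
print('Reset {0} subjob(s)'.format(n))
```

Returning the number of subjobs that were reset gives you a quick check, in the interactive session, that the helper did what you expected.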
What other tasks can you think of?