Managing files in Ganga
Learning Objectives
Choose whether job output is saved locally or on the Grid
Choose where to look for job input files
Move files from any grid site to CERN, for analysis using EOS
Ganga allows you to define a job and have it run anywhere: on your local machine, on the batch system, or on the Grid. This is very convenient as you don’t need to worry about the specifics of each platform.
Ganga treats files in a similar way to jobs, in that you only need to change the object you’re using to tell Ganga to use local files, files on the Grid, or files on EOS. In this lesson, we’ll see how you can efficiently manage input and output files using Ganga.
Ganga versions
It is generally advised to use the latest available version of Ganga. Functionality is not removed and there are no compatibility issues between versions (it is just Python!). If you encounter problems, you should first search the archives of the lhcb-distributed-analysis mailing list. If you don’t find an answer, you can talk to the Ganga developers directly on the GitHub issues page for Ganga, on the ~distributed-analysis Mattermost channel, or by sending an email to lhcb-distributed-analysis.
Making a fresh start
To make sure there are no pre-existing Ganga files to interfere, we will move them to a backup location.
$ cd ~
$ mkdir ganga-backup
# See what's in your home directory that's related to Ganga
$ ls -la | grep -i ganga
# Then move everything
$ mv gangadir .gangarc* .ganga.log .ganga.py .ipython-ganga ganga-backup
You can move this back after the lesson if you want to restore your old settings and data.
We’ll be doing everything in Ganga, so let’s start it up.
$ ganga
If it’s your first time starting Ganga, you’ll be asked if you want to create a default .gangarc file with the default settings.
Would you like to create default config file ~/.gangarc with standard settings ([y]/n) ?
Answer with y. The .gangarc file defines the configuration of Ganga, and the defaults are normally good enough.
You’ll then be dropped into an IPython shell. We will create a job that runs a Python script that accepts a path to an input text file as an argument, and saves a file that contains the text of the file reversed. For example, it would save a file containing ‘!dlrow olleH’ if it was given a file containing ‘Hello world!’ as input.
Download the script to lxplus and set it to be executable. You can execute these commands inside Ganga, if you like, by prefixing them with a !.
$ wget https://raw.githubusercontent.com/lhcb/starterkit-lessons/master/second-analysis-steps/code/01-managing-files-with-ganga/reverse.py
$ chmod +x reverse.py
$ ./reverse.py
Usage: reverse.py <file>
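The contents of reverse.py are not reproduced in this lesson. As a rough idea of what such a script could look like, here is a minimal sketch based only on the behaviour described in the text: it reads a text file and writes the reversed text to a file named like the input, with -reversed before the extension. The function names (`reversed_name`, `reverse_file`, `main`) are hypothetical, and reversing the whole text character by character is an assumption; the real script may differ.

```python
import os


def reversed_name(path):
    """Return the output file name: '-reversed' placed before the extension."""
    root, ext = os.path.splitext(path)
    return root + "-reversed" + ext


def reverse_file(path):
    """Write the reversed text of `path` next to it and return the new path."""
    with open(path) as f:
        text = f.read()
    out_path = reversed_name(path)
    with open(out_path, "w") as f:
        # Assumption: the whole text is reversed character by character
        f.write(text[::-1])
    return out_path


def main(argv):
    """Mimic the command line: exactly one argument, the input file."""
    if len(argv) != 2:
        print("Usage: reverse.py <file>")
        return 1
    reverse_file(argv[1])
    return 0
```

A standalone script built from this sketch would finish with `sys.exit(main(sys.argv))`, after importing `sys`.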
In Ganga, create a Job object with a descriptive name and take a look at it.
j = Job(name='Reverser')
print(j)
You’ll see that Ganga has created a Job which will execute the echo command, passing the list of arguments ['Hello World']. Each element of this list will be passed as a positional argument to the echo command.
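To illustrate what "each element is one positional argument" means, the following plain-Python sketch runs a small child program with the same argument list and prints what the child received. This is only an analogy for how an argument list maps onto a command line, not a description of how Ganga launches jobs internally; the variable names are made up for the example.

```python
import subprocess
import sys

# The same argument list as the default Job: one element, containing a space
args = ["Hello World"]

# A tiny child program that prints the positional arguments it received
child = "import sys; print(sys.argv[1:])"

result = subprocess.run(
    [sys.executable, "-c", child] + args,
    capture_output=True,
    text=True,
    check=True,
)
received = result.stdout.strip()
print(received)  # ['Hello World']
```

The child sees a single argument, 'Hello World', rather than two separate words, because list elements are not split on spaces.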
We’ll replace the command name and the arguments, so that our reverse.py script is run with a text file as input.
j.application.exe = File('reverse.py')
j.application.args = [File('input.txt')]
We haven’t made input.txt, so let’s make it by executing a couple of shell commands inside Ganga.
!echo -e "$(date)\nHello world!\nI am $USER!" > input.txt
!cat input.txt
Before submission, we just need to tell Ganga what to do with the output. The script saves the output to a file named like the input, but with -reversed inserted before the file extension (.txt in this case), so we tell Ganga explicitly to move this file to the local job output directory.
j.outputfiles = [LocalFile('input-reversed.txt')]
Now we can submit the job.
j.submit()
By default, jobs run on the machine you’re running Ganga on, as their backend property is set to an instance of the Local backend.
The job will finish very quickly, and we can inspect the output files.
j.peek()
j.peek('input-reversed.txt')
There are a couple of file-related things to take note of in what we just did:
The File object is used to define local files that should be available in the ‘working directory’ of the job (wherever it executes). We need both the script and the input text file to be in the working directory, so both of the paths to the files on our local machine are wrapped in File.
The LocalFile object is used to define which files in the working directory of the job should be saved in the local job output directory, in this case the file with -reversed in its name.
Note that there are several files in the job output directory, seen with j.peek(), that we didn’t explicitly ask for, most notably stdout and stderr. These two files are essentially the logs of the job, and Ganga always saves them in the local job output directory as they’re almost always useful.
For Gaudi jobs, Ganga will also automatically download the summary.xml file, which contains useful information about algorithm counters.
df = DiracFile('input.txt', localDir='.')
df.put(uploadSE='CERN-USER')
print(df.lfn)
Couldn’t upload file - This file GUID already exists
All files on the grid are required to have a unique identifier (GUID), which is normally generated from the file’s content and is independent of its filename. As a result, if you try to upload a file which already exists, you receive an error.
If this happens, and you have a reason not to use the pre-existing file, the simplest solution is to make the file unique in some way. In this case, we added the date and time to the top line of the text file.
Grid files that are replicated at CERN are directly accessible via EOS. We can see that our file is on EOS by looking at the LFN Ganga gave us. We just need to add the prefix /eos/lhcb/grid/user to the LFN.
eos ls /eos/lhcb/grid/user//lhcb/user/a/apearce/GangaFiles_22.24_Wednesday_18_May_2016
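The prefix rule above is simple enough to capture in a small helper. This is a minimal sketch; the names `EOS_USER_PREFIX` and `eos_path` are hypothetical and not part of Ganga, and the only rule it encodes is the one stated in the text (prefix plus LFN).

```python
# Prefix quoted in the text for grid user files that are replicated at CERN
EOS_USER_PREFIX = "/eos/lhcb/grid/user"


def eos_path(lfn):
    """Return the EOS path for an LFN by prepending the EOS prefix."""
    if not lfn.startswith("/"):
        raise ValueError("expected an absolute LFN starting with '/'")
    return EOS_USER_PREFIX + lfn


example = eos_path("/lhcb/user/a/apearce/GangaFiles_22.24_Wednesday_18_May_2016")
print(example)
# /eos/lhcb/grid/user/lhcb/user/a/apearce/GangaFiles_22.24_Wednesday_18_May_2016
```

Inside Ganga, the same helper could be called as `eos_path(df.lfn)` for the file uploaded earlier.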
Using MassStorageFile
The MassStorageFile object uploads job output directly to EOS. However, using MassStorageFile for this purpose is actively discouraged by the Ganga developers as it is highly inefficient: a file made on the Grid will first be downloaded to the machine running Ganga, and then uploaded to EOS.
Instead, always use DiracFile for large outputs, and then replicate them to CERN-USER if you want to be able to access them on EOS.
If you have any DiracFile, you can ask for it to be replicated to a grid site it’s not currently available at.
df.replicate('RAL-USER')
Automating replication to CERN
If you have a job with subjobs, you can automate this to replicate all output files to CERN, so that you can run your analysis directly on the files on EOS.
j = jobs(...)
for sj in j.subjobs:
    # Get all output files which are DiracFile objects
    for df in sj.outputfiles.get(DiracFile):
        # No need to replicate if it's already at CERN
        if 'CERN-USER' not in df.locations:
            df.replicate('CERN-USER')
After doing this, each file will be accessible on EOS at the path formed by adding the /eos/lhcb/grid/user prefix to its LFN.
You could make a function from this and put it in your .ganga.py file, whose contents are available in any Ganga session.
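One way to wrap the replication loop in a function is sketched below. The name `replicate_outputs` and its parameters are hypothetical. The file type is passed in explicitly so the sketch makes no assumption about which Ganga names are in scope where the function is defined; it relies only on the attributes used in the loop above (`subjobs`, `outputfiles.get`, `locations`, `replicate`).

```python
def replicate_outputs(job, file_type, site="CERN-USER"):
    """Replicate every output file of `file_type` in `job`'s subjobs to `site`.

    Returns the list of files for which a replication was requested.
    """
    replicated = []
    for sj in job.subjobs:
        # Get all output files of the requested type (e.g. DiracFile)
        for df in sj.outputfiles.get(file_type):
            # No need to replicate if the file is already at the target site
            if site not in df.locations:
                df.replicate(site)
                replicated.append(df)
    return replicated
```

In a Ganga session this would be called as `replicate_outputs(j, DiracFile)`, where `j` is a job with subjobs.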
You can download a DiracFile locally using the get method. If you already know an LFN, you can use this to quickly download it locally to play around with it. All you need to do is prefix the LFN with LFN:, and Ganga will assume that the file already exists on the Grid somewhere (whereas before it assumed the file was local).
df2 = DiracFile('LFN:' + df.lfn)
# The directory used for the download must exist first
!mkdir foo
df2.localDir = "foo"
df2.get()
!cat foo/input.txt
We can tell Ganga to upload the job output to the Grid automatically.
# Clone the job
j2 = Job(j)
j2.outputfiles = [DiracFile('*-reversed.txt')]
j2.submit()
Here we use a ‘pattern’ to tell Ganga that any file whose name ends in -reversed.txt should be uploaded to Grid storage. Both DiracFile and LocalFile support these patterns.
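To get a feel for which file names a wildcard pattern such as *-reversed.txt selects, the sketch below uses Python's standard `fnmatch` module on a made-up list of names. Ganga's own matching may differ in detail; this only illustrates the usual shell-style meaning of `*`.

```python
from fnmatch import fnmatch

pattern = "*-reversed.txt"

# Hypothetical contents of a job's working directory
candidates = ["input-reversed.txt", "notes-reversed.txt", "input.txt", "stdout"]

# Keep only the names that match the wildcard pattern
matches = [name for name in candidates if fnmatch(name, pattern)]
print(matches)  # ['input-reversed.txt', 'notes-reversed.txt']
```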
To download the output, we use .get as usual.
j2.outputfiles.get(DiracFile)[0].get()
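When a pattern matches more than one file, indexing with [0] fetches only the first. A small helper could download all of them into a chosen directory. This is a sketch under the assumption that each file object behaves as shown earlier in this lesson (a `localDir` attribute and a `get` method); the name `download_outputs` is hypothetical.

```python
import os


def download_outputs(job, file_type, local_dir):
    """Download every output file of `file_type` from `job` into `local_dir`."""
    # The directory used for the download must exist first
    os.makedirs(local_dir, exist_ok=True)
    downloaded = []
    for f in job.outputfiles.get(file_type):
        f.localDir = local_dir
        f.get()
        downloaded.append(f)
    return downloaded
```

In a Ganga session this might be used as `download_outputs(j2, DiracFile, 'foo')`.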
Being able to manipulate files with Ganga can be very useful, particularly for Gaudi-based jobs, where:
We often specify large sets of DiracFile objects as input, from the bookkeeping, but often want to download a file or two locally when testing options;
We want to replicate a large number of output LFNs to CERN-USER so that we can use them directly with EOS and XRootD commands;
We want job output to be downloaded locally automatically when the job completes.
Defining inputfiles
The inputfiles attribute of a Job object works in a similar way to outputfiles. In our example, the reverser script that the Executable application uses doesn’t know how to handle things specified as inputdata, so we had to use File when defining the arguments.
For LHCb applications, you will almost always define the inputdata list using either LocalFile or DiracFile objects. Which one you use just depends on where the input files are.