More Ganga

Learning Objectives

  • Set the input data with BKQuery

  • Use LHCbDatasets

  • Set the location of the output of our jobs

  • Set the location of your .gangadir

  • Access output stored on the grid

As you saw in the previous lesson, the input data for your job can be specified with the BKQuery tool. The path for the data can be found using the online Dirac portal and passed to BKQuery to get the dataset. For example, to run over the Stripping 21 MagUp Semileptonic stream:

Ganga In [3]: j.inputdata = BKQuery('/LHCb/Collision12/Beam4000GeV-VeloClosed-MagUp/Real Data/Reco14/Stripping21r0p1a/90000000/SEMILEPTONIC.DST').getDataset()
Ganga In [4]: j.inputdata
Ganga Out [4]: 
 LHCbCompressedDataset (
  files = [ LHCbCompressedFileSet (
   lfn_prefix = '/lhcb/LHCb/Collision12/SEMILEPTONIC.DST',
   suffixes = [3716 Entries of type 'str']
 )]  ,
   persistency = None,
   depth = 0,
   XMLCatalogueSlice =    LocalFile (
     namePattern = ,
     localDir = ,
     compressed = False
   ) ,
   credential_requirements =    DiracProxy (
     group = lhcb_user,
     encodeDefaultProxyFileName = False,
     dirac_env = None,
     validTime = None
   )
 )

This is a container acting as a list of DiracFiles, the Ganga objects for files stored on the grid. Each DiracFile can be accessed with square brackets, just like an element of a list. We can then use it to access one of the files from a local session via the accessURL method:

Ganga In [5]: j.inputdata[0].accessURL()
Ganga Out [5]: ['root://bw32-4.grid.sara.nl:1094/pnfs/grid.sara.nl/data/lhcb/LHCb/Collision12/SEMILEPTONIC.DST/00051179/0000/00051179_00006978_1.semileptonic.dst']

The returned path can be used by Bender to explore the contents of the DST, as in the Interactively exploring a DST lesson.
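To see what such an access URL is made of, it can be split with Python's standard library. This is only an illustrative sketch: the helper name describe_access_url is hypothetical, and the URL is the one returned in the example above.

```python
from urllib.parse import urlsplit


def describe_access_url(url):
    """Split an access URL into protocol, storage host, port and file path."""
    parts = urlsplit(url)
    return {
        "protocol": parts.scheme,   # 'root' means the XRootD protocol
        "host": parts.hostname,     # the storage server holding the replica
        "port": parts.port,
        "path": parts.path,         # the file path on that server
    }


# URL copied from the accessURL() output shown above
url = (
    "root://bw32-4.grid.sara.nl:1094/pnfs/grid.sara.nl/data/lhcb/LHCb/"
    "Collision12/SEMILEPTONIC.DST/00051179/0000/"
    "00051179_00006978_1.semileptonic.dst"
)

info = describe_access_url(url)
for key, value in info.items():
    print(key, "=", value)
```

The root:// prefix is what allows ROOT-based tools to read the file remotely instead of requiring a local copy.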

In the previous lesson we looked at the location of the output with jobs(782).outputdir. This location points us to the gangadir, where Ganga stores information about the jobs and their output. If we have lots of jobs with large files, the file system where the gangadir is located will quickly fill up.

Setting the gangadir location

The location of the gangadir can be changed in the configuration file ‘~/.gangarc’. Search for the gangadir attribute and set it to the directory you want (on CERN AFS, the work area is a popular choice).

It is not recommended to have your gangadir on the FUSE-mounted EOS area on lxplus. The connection may be slow and unreliable, which will cause problems when running Ganga.
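The edit to the configuration file can also be scripted. The following is a minimal sketch, not part of Ganga: the helper set_gangadir, the sample configuration text (including the [Configuration] section name) and both directory paths are assumptions made for illustration, so check them against your own ‘~/.gangarc’ before using anything like this.

```python
import re


def set_gangadir(config_text, new_location):
    """Return config_text with the first gangadir line set to new_location.

    The line is matched whether or not it is commented out with '#'.
    """
    pattern = re.compile(r"^[#\t ]*gangadir[\t ]*=.*$", flags=re.MULTILINE)
    replacement = "gangadir = " + new_location
    # A function is used as replacement so backslashes in paths are kept as-is
    updated, count = pattern.subn(lambda _match: replacement, config_text, count=1)
    if count == 0:
        raise ValueError("no gangadir option found in the configuration text")
    return updated


# Hypothetical excerpt of a configuration file (assumed layout)
sample = """[Configuration]
# Location of the local job repository and workspace
#gangadir = /afs/cern.ch/user/a/auser/gangadir
"""

# Hypothetical target directory in an AFS work area
result = set_gangadir(sample, "/afs/cern.ch/work/a/auser/gangadir")
print(result)
```

The function only transforms text. Reading the real file and writing it back is left out on purpose, so that nothing is overwritten by accident.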

To avoid filling up your disk space, it is wise to put the large files produced by your job somewhere with lots of storage: the grid. You can do so by setting the outputfiles attribute:

j.outputfiles = [DiracFile('*.root'), LocalFile('stdout')]

The DiracFile will be stored in your user area on the grid (with up to 2TB personal capacity). The wildcard means that any .root file produced by your job will stay on the grid. LocalFile downloads the named file, in this case stdout, to your gangadir. You can access your files stored on the grid with the accessURL() function as before. For example, to get the access URL of the output .root file of a specific subjob, one can use:

jobs(787).subjobs(15).outputfiles[0].accessURL()

This is a very useful feature: you do not have to download your files from the grid in order to merge them locally, which would take a lot of time and disk space. Instead, one can get a list of URLs from each subjob and pass them to a TChain or to ROOT's hadd tool[^1]. Ganga has a shortcut to access the list of all .root files from all the subjobs of a given job:

jobs(787).backend.getOutputDataAccessURLs()
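As a minimal sketch of the merging step, a list of access URLs can be turned into an hadd command line. The helper name build_hadd_command and the example URLs are hypothetical placeholders (in a Ganga session the list would come from the call above); the -fk option is the one described in the footnote.

```python
import shlex


def build_hadd_command(output_file, input_urls, keep_compression=True):
    """Return the argument list for an hadd call that merges input_urls."""
    if not input_urls:
        raise ValueError("need at least one input file to merge")
    command = ["hadd"]
    if keep_compression:
        # Reuse the compression of the input files in the output file
        command.append("-fk")
    command.append(output_file)
    command.extend(input_urls)
    return command


# Hypothetical URLs standing in for the list returned by Ganga
example_urls = [
    "root://storage.example.org//lhcb/user/a/auser/787/0/output.root",
    "root://storage.example.org//lhcb/user/a/auser/787/1/output.root",
]

command = build_hadd_command("merged.root", example_urls)
print(shlex.join(command))
```

The resulting list can be passed to subprocess.run in an environment where hadd is available, so the files are read from the grid and only the merged output is written locally.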

By default, small files (.root files, log files, etc.) are downloaded, while files that are expected to be large (such as those with the .dst extension) are kept on the grid as Dirac files. In general, you are encouraged to keep your large files on the grid to avoid moving large amounts of data through your work area.

Getting help with ganga

Speaking of large files: the Ganga FAQ explains how to remove unneeded branches from your nTuples so that you do not waste space. It is also the place to find other tips and tricks on using Ganga.

[^1]: Merging your files with hadd will be significantly faster if you tell ROOT to use the same compression level in the output file as is used in the input files. This can be done with the -fk option. If you are running on lxplus, you will need a newer ROOT version that supports this option, which you can get by using: lb-conda default hadd -fk output.root input/*.root