Analysis Productions

Learning Objectives

Understand the reasons to use Analysis Productions
Look at some pre-existing productions
Create your own production

Running DaVinci locally can be great for testing an options file, but is rarely appropriate for creating the full set of ROOT ntuples needed for an analysis. Historically, people would submit their jobs to the grid: one can send off a large number of DaVinci jobs to be handled by batch computing resources (in certain cases, this is still useful - see the next lesson for how to do this). However, this still has some drawbacks, in particular:

Large datasets (especially in run 3) would require a long period of monitoring your grid jobs.
Computing resources can be wasted if multiple analyses are independently producing similar ntuples.
Ntuples can be lost, or removed when analysts leave, which can be an issue for analysis preservation.

It is the goal of Analysis Productions to centralise and automate much of the process of making ntuples, and to keep a record of how datasets were produced. In Run 3, this is the preferred way to create ntuples for your analysis.

Monitoring productions

Before we get into the how-to, let’s first take a look at the end result of a real production. Open the Analysis Productions webpage, and navigate to Productions. You should see a new menu appear, showing a list of productions. Using the search bar on the right you can search for any production of interest, try b02dkpi for example. This should return one result and clicking on it should show you a table of jobs belonging to the production.

Each row in the table corresponds to a job belonging to this production, and displays:

Its status (e.g. READY, ACTIVE)
Its name (e.g. 2018_15164022_magup)
Some of its tags (e.g. MC/DATA, Run1/Run2)
When it needs a housekeeping check, when it was created and when it was last updated
The version of the Analysis Productions repository it corresponds to

To view more information about any one of these jobs, you can click on it to view a job summary page. The image below shows one such page, with a couple of particularly useful elements highlighted in magenta.

An AP job summary

The top section shows a summary of important information, such as the state of the job, its production version, the total storage required for the output files and the merge request used to submit the production. To view the input scripts used to set up the production click the link to the merge request. The next section lists the DIRAC production information for the job. Finally, is the Production Output section which lists the output files for the job. One can use either the PFNs or LFNs to access the output, the PFNs should be visible to all systems with access to the CVMFS.

Let’s try accessing one of these files right now by doing:

root -l root://eoslhcb.cern.ch//eos/lhcb/grid/prod/lhcb/LHCb/Collision12/B02DKPI.ROOT/00121782/0000/00121782_00000004_1.b02dkpi.root

You can now explore this file by doing TBrowser b inside of ROOT (or with another method of your choice).

Creating your own production

For practice, we will now go through the steps of creating a simple production. For details about more advanced usage, see the Analysis Productions documentation.

Start by creating a folder to work in, and into it clone the Analysis Productions repository.

git clone ssh://git@gitlab.cern.ch:7999/lhcb-datapkg/AnalysisProductions.git
cd AnalysisProductions

Before making any edits, you should create a branch for your changes, and switch to it:

git checkout -b ${USER}/starterkit-practice

Now we need to create a folder to store all the things we’re going to add for our new production. For this practice production, we’ll continue with the \(` B^+ \to (J/\psi \to \mu^+ \mu^-) K^+ `\) decays used in the previous few lessons, so we should name the folder appropriately:

mkdir starterkit

Let’s enter that new directory (cd starterkit), and start adding the files we’ll need. First is the DaVinci options file. If you have the options file you created during the previous lessons in a working condition, copy it here, and open it with your text editor of choice.

If you don’t have your options file from earlier available, or are having trouble, you can use this options file.

The next file needed is a .yaml file, which will be used to configure the jobs. Create a new file named info.yaml, and add the following to it:

defaults:
  application: DaVinci/v64r12
  output: DATA.ROOT
  options:
    entrypoint: starterkit.dv_basic:main
    extra_options:
      input_type: ROOT 
      input_raw_format: 0.5
      simulation: false
      input_process: "TurboPass"
      geometry_version: run3/2024.Q1.2-v00.00
      conditions_version: master
      lumi: true
      data_type: "Upgrade"
      input_stream: b2cc
  inform:
    - aidan.richard.wiederhold@cern.ch
  wg: B2CC

Bu2JpsiK_24c4_MagDown:
  input:
    bk_query: "/LHCb/Collision24/Beam6800GeV-VeloClosed-MagDown/Real Data/Sprucing24c4/94000000/B2CC.DST"

Here, the unindented lines are the names of jobs (although defaults has a special function), and the indented lines are the options we’re applying to those jobs. Using this file will create one job called Bu2JpsiK_24c4_MagDown, that will read in data from the provided bookkeeping path. All the options applied under defaults are automatically applied to all other jobs - very useful for avoiding repetition. The options we’re using here are copied from the Run 3 DaVinci lesson:

application: the version of DaVinci to use. Here we choose v64r12, see here to check what versions are available.
wg: the working group this production is a part of. Since this is a \(` B^+ \to (J/\psi \to \mu^+ \mu^-) K^+ `\) decay, we’ll set this to B2CC.
inform: optionally, you can enter your email address to receive updates on the status of your jobs.
options: the settings to use when running DaVinci. These are copied from the Run 3 DaVinci lesson.
output: the name of the output .root ntuples. These will get registered in bookkeeping as well.
input: the bookkeeping path of the data you’re running over. This is what you located during the bookkeeping lesson, and is unique to the 24c4 magnet down job, so it doesn’t belong under defaults.

For a full list of the available options, and information on their allowed values, see the documentation.

Add a magnet-up job

Currently, this will create ntuples for 24c4 magnet-down data only. See if you can add to your info.yaml file to create a job for 24c4 magnet-up data as well. Hint: the location of the correct .DST file can be found using bookkeeping.

Solutions

Since we’re making use of the defaults job name, we need to add very little to add this new job. One can simply copy the final 3 lines to create a new job, rename it to Bu2JpsiK_24c4_MagUp, and add the appropriate bookkeeping path. An example of this can be found here.

For good practice, the final thing we should add is a README. This is a markdown file that will be used to document the information about your production in a more human-readable format. Create a file README.md, and add some pertinent information. If you’re not sure what to add, try looking at some existing productions to get an idea.

Testing your production locally

Now we’ve got both of the files we need, we should test the production to make sure it works as expected. All of this will be done using the lb-ap command. Navigate up one level to the base directory of the repository (AnalysisProductions), and run lb-ap, which should display the following:

Usage: lb-ap [OPTIONS] COMMAND [ARGS]...

  Command line tool for the LHCb AnalysisProductions

Options:
  --version
  --help     Show this message and exit.

Commands:
  login        Login to the Analysis Productions API
  versions     List the available tags of the Analysis Productions...
  clone        Clone the AnalysisProductions repository and do lb-ap...
  checkout     Clean out the current copy of the specified production and...
  list         List the available production folders by running 'lb-ap list'
  list-checks  List the checks for a specific production by running lb-ap...
  render       Render the info.yaml for a given production
  validate     Validate the configuration for a given production
  test         Execute a job locally
  check        Run checks for a production
  ci-checks    Helper function for running checks in the CI.
  debug        Start an interactive session inside the job's environment
  reproduce    Reproduce an existing online test locally
  parse-log    Read a Gaudi log file and extract information

This command lb-ap will allow us to perform a number of different tests. Let’s start with lb-ap list, which will display all of the productions. Hopefully you should see your new production (starterkit) on this list! You can also use this to list all of the jobs within a given production, by running lb-ap list starterkit. If you added a second job for magnet-up earlier, the output of this command should look like this:

The available jobs for starterkit are: 
* Bu2JpsiK_24c4_MagDown
* Bu2JpsiK_24c4_MagUp

The most important test is if the production actually runs successfully, and creates the desired ntuples. The lb-ap command is used for this as well. To test the magnet-down job, run this command:

lb-ap test starterkit Bu2JpsiK_24c4_MagDown

The first time you run this in a session remember to activate your proxy using lhcb-proxy-init. You will also be prompted to sign into the CERN Single Sign-On as an extra security step.

DaVinci will now run using the data and options files you specified in info.yaml. You should see the output from DaVinci similar to what you saw when you ran it manually in an earlier lesson, followed by a completion message that tells you the location of the output files created by the test.

You should find that a local-tests directory has been created, and inside it are a record of any local tests you’ve run. Navigate to the output folder of your test, and check what files have been created. There are assorted log files, as well as a .ROOT file called something like 00012345_00006789_1.DATA.ROOT.

Let’s open this .root file and check if everything worked correctly. Similar to what we did earlier, run root -l 00012345_00006789_1.DATA.ROOT to open ROOT with that ntuple loaded, and view the contents by running TBrowser b (or otherwise). Take a little time to look around, and make sure everything’s in order.

Useful resources

If you encounter problems while creating a real production, here are a few good places to ask for help:

For issues with options files, try the DaVinci Mattermost channel or the lhcb-davinci mailing list

For issues with info.yaml files, or anything else to do with the Analysis Productions framework, try the Analysis Productions Mattermost channel, or contact the Analysis Productions coordinator(s) (see here to find out who they are).

Creating a merge request

Now that we’ve tested all of our changes and are sure that everything’s working as intended, we can prepare to submit them to the main repository by creating a merge request. Start by commiting the changes:

git add dv_basic.py
git add info.yaml
git add README.md
git commit -m "Add starterkit example production"

And then push the changes with

git push origin ${USER}/starterkit-practice

Once this has completed, it should give you a link to create a merge request for your new branch. Open it in a browser, and give it a suitable name & description - also please add the Starterkit (not for merge) label! For a real production please ensure you follow the instructions in the merge request description template. Then you can submit your merge request.

Since this is only for practice, your request won’t actually be merged, but some tests will still be run automatically. To view these, go to the Pipelines tab of your merge request, and open it by clicking the pipeline number (eg. “#1958388”). At the bottom, you will see a test job - click on this, and it will show you the output of the test jobs. These will take a little time to complete, so it may still be in progress. The first few lines should look something like:

Running with gitlab_runner_api 0.1.dev42+g457dbd5
INFO:Creating new pipeline for ID 1958388
ALWAYS:Results will be available at https://lhcb-analysis-productions.web.cern.ch/1958388/

You can open that link in your browser to view the status of the test jobs (example here). After a few minutes, these should have completed - all being well, you’ve now successfully prepared your first production!

Analysis Productions Data

We have created a standalone Python package, APD, which can be used to help you access your Analysis Productions output. At the bottom of a job output page you can find an example of how to do this. For this starterkit example we could do

from apd import AnalysisData
datasets = AnalysisData("apd", "starterkit")
24c4_magdown_data = datasets(name="bu2jpsik_24c4_magdown")

This will create 24c4_magdown_data as a list of all the PFNs for the output of this job. You could then access these using your preferred kind of Python based ROOT.

Next steps for real productions

As mentioned earlier, your test productions for this lesson won’t be merged since this is only for practice purposes. If this were a real production, there would be some extra steps to take before merging.

You should try to reach out to anyone else who may also want to use these ntuples, and see if they’d like to add or change anything. This is usually done through your working group (either by email, or for new productions, by presenting at a WG meeting). For this, you can start by getting in touch with your WG conveners (see here to find out who they are).

In addition, the Analysis Productions coordinator(s) may also have some feedback on your additions/changes.