Table of Contents
- 1. Quick Start
- 2. About this document
- 3. Overview of this pipeline
- 4. Preprocessing Data
- 5. Analysis
- 5.1. Headers
- 5.2. Building database
- 5.3. Input DIA files
- 5.4. Creating Swath window files
- 5.5. Library Generation
- 5.5.1. Building {Pseudo-,}Spectral library from (Pseudo)-Spectra
- 5.5.2. Building Spectral library from Machine learning
- 5.5.3. Supplying a custom spectral library
- 5.5.4. combining various spectral libraries into one.
- 5.6. OpenSwathDecoys
- 5.7. Building Dia Matrix
- 6. Configuration for backends
- 7. SDRF support
- 8. Putting it together
- 9. Troubleshooting
1. Quick Start
This section shows you how to quickly get started with gladiator-nf.
1.1. Tutorial
1.1.1. Requirements
This tutorial will require you to have on your system:
- make (tested with Gnu make 4.4.1)
- emacs 27 or greater
- coreutils (i.e. ls,cp, mkdir)
- podman or another container solution, such as docker, apptainer (formerly known as singularity), or guix.
- git
- java 17, 18 or 19.
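A quick way to check that these tools are available on your PATH (a sketch; the exact version output differs between systems):
make --version | head -n1
emacs --version | head -n1
git --version
java -version
podman --version   # or: docker --version / apptainer --version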
1.1.2. Cloning the directory
If you are using the development version, you will want to clone this repository and 'tangle' it (create the scripts).
git clone https://github.com/elolab/gladiator-nf
cd gladiator-nf
yes '//' | make tangle
1.1.3. Example Data
For this example we will use MSV000090837 from MassIVE, just 210820_Grad090_LFQ_A_01.raw and 210820_Grad090_LFQ_B_01.raw, so that you don't need much disk space to try this out. While the MassIVE repository provides mzML files, which you could use, we will start from the raw files for educational purposes. Download the data:
mkdir RAW
wget --no-directories --directory-prefix=RAW ftp://massive-ftp.ucsd.edu/v05/MSV000090837/raw/Exp03_repeat_90minGradient_from_Exp02_same_Mastermix/210820_Grad090_LFQ_{A,B}_01.raw
mkdir fasta
wget --directory-prefix=fasta 'ftp://massive-ftp.ucsd.edu:/v05/MSV000090837/sequence/fasta/*.fasta'
1.1.4. Convert RAW to mzml
We need to convert the raw files to mzML, so run the following command in the container:
COMMAND='find . -iname '"'"'*.raw'"'"' -print0 | xargs -P5 -0 -i wine msconvert {} --filter '"'"'titleMaker <RunId>.<ScanNumber>.<ScanNumber>.<ChargeState> File:"<SourcePath>", NativeID:"<Id>"'"'"' -o MZML/'
IMAGEID=$(podman pull docker://chambm/pwiz-skyline-i-agree-to-the-vendor-licenses:3.0.21354-9ee14c7)
echo $COMMAND > ./convert.sh
chmod +x convert.sh
podman run -it -v $PWD:$PWD -w $PWD $IMAGEID /bin/bash ./convert.sh
1.1.5. Setting up your experiments config file
Create a config file named myconfig.nf with the following contents, adjusting as needed.
params {
    // this is the fragment mass tolerance in Dalton, 0.02 is a sensible default
    fragment_mass_tolerance=0.02
    // This is in parts per million, ppm
    precursor_mass_tolerance=10
    // this is the cutoff rate used by mayu for finding the peptide probability
    protFDR=0.01
    // retention time information, this traml is a good default.
    irt_traml_file='ftp://PASS00289:XY4524h@ftp.peptideatlas.org:/SGS/assays/OpenSWATH_SM4_iRT_AssayLibrary.TraML'
    use_irt=false
    max_missed_cleavages=1
    libgen_method='diaumpire'
    pyprophet_use_legacy=false
    pyprophet_fixed_seed=false
    // this would make your subsample ratio 1 / your number of samples
    pyprophet_subsample_ratio=null
}
1.1.6. Calling gladiator
Note that you will have to quote wildcards, because nextflow's java expects to handle the wildcard expansion rather than your shell.
# if you are using podman
NXF_VER=22.10.1 nextflow -c myconfig.nf \
    -c config/podman.nf \
    run gladiator.nf \
    --fastafiles='fasta/*.fasta' \
    --diafiles='MZML/*.mzML'
If you are using singularity, use config/singularity.nf in lieu of config/podman.nf. If you have the guix package manager, you can use config/guix.nf.
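For example, the same run with the singularity backend would look like this (a sketch; parameters as in the podman example above):
NXF_VER=22.10.1 nextflow -c myconfig.nf \
    -c config/singularity.nf \
    run gladiator.nf \
    --fastafiles='fasta/*.fasta' \
    --diafiles='MZML/*.mzML'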
You will want to have java / openjdk 17, 18, or 19. Java 20 and later will give you a groovy error.
1.2. Results
In the results directory (which you can specify with --outdir=/path/to/results and which defaults to ./results), you will find two files:
dia/DIA-peptide-matrix.tsv
This contains the intensities on the peptide level; the first column is the peptide identifier, and the other columns are the intensities in that sample. For example:
"ProteinName_FullPeptideName" | "210820_Grad090_LFQ_A_01.mzML" | "210820_Grad090_LFQ_B_01.mzML" |
---|---|---|
"1/A2I7N3_LAVSHVIHK" | 624895 | 723064 |
"3/sp¦P08729¦K2C7_HUMAN/Q3KNV1/P08729_VDALNDEINFLR" | 2037142 | 1283280 |
dia/DIA-protein-matrix.tsv
Here the intensities are on the protein group level, with the first column containing the protein group name and size.
ProteinName | 210820_Grad090_LFQ_A_01.mzML | 210820_Grad090_LFQ_B_01.mzML |
---|---|---|
1/sp¦O14561¦ACPM_HUMAN | 1410100 | 1359820 |
2/sp¦P0CX49¦RL18A_YEAST/sp¦P0CX50¦RL18B_YEAST | 99249410 | 44970930 |
(read | instead of ¦)
2. About this document
This is a literate programming (org-mode) document that describes the nextflow implementation of gladiator (https://github.com/elolab/glaDIAtor) and all needed template files. This directory should already contain the tangled files. To learn more about org-mode, see https://orgmode.org/.
2.1. Writing Style
Sentences with the first person plural ("we") as subject, or with an implied third person (reading as "[The program] …"), are notes about the development process or an explanation of the program, whereas sentences with the second person as subject ("you", e.g. "You might try setting foo to 3") are instructions to the end user.
2.2. Trouble Shooting
When you encounter an error, go to the section of this document in which the error occurred; there we describe fixes for errors that occur in that step. You can also search for the text of the error message, as we usually include it in the relevant section.
2.3. Tangling
In order to turn this file into the needed files, you'll need to have emacs and gnu make installed, and then run
yes '//' | make tangle
alternatively, if you have gnu guix installed, you can run
make SHELL=guix tangle
The following is a list of all files this document tangles into
- gladiator.nf
- gwl-gladiator.scm
- diaumpireconfig.txt
- comet_template.txt
- xtandem-template.xml
- irt.txt
- config/singularity.nf
- config/singularity-local.nf
- config/singularity-guix-local.nf
- config/docker.nf
- config/docker-local.nf
- config/podman.nf
- config/podman-local.nf
- config/guix.nf
- nextflow.config
- ci/guix/manifests/nextflow-trace.scm
- nextflow.tags
2.4. Building the containers
Containers are available from the public registry, but you can also build them yourself. The containers are defined in terms of guix manifests, rather than Dockerfiles or the like, so you will need to install guix in order to build the containers yourself.
# if you want singularity images
make SHELL=guix singularity-containers
# if you want docker images
make SHELL=guix docker-containers
If you want to pass arguments to the step that's responsible for building the containers (e.g. guix style transformations), you can specify those with the make variable GUIX_PACK_FLAGS.
e.g.
make SHELL=guix 'GUIX_PACK_FLAGS=--with-patch=tpp=/path/to/my/tpp/patchfile.patch' singularity-containers
2.5. Making distribution tarball
You can make a distribution tarball with the tangled files and documentation with
make dist
or if you have guix on your system
make SHELL=guix dist
2.6. License
This program is free software, under GPL3 or later.
/*
 * Copyright (C) 2025
 *
 * This program is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program. If not, see <http://www.gnu.org/licenses/>.
 */
3. Overview of this pipeline
4. Preprocessing Data
We do not distribute the vendored msconvert, but if you have DDA files that you need to convert from a proprietary format to mzXML (following the picking peaks step below), you can use the docker image dockerhub:chambm/pwiz-skyline-i-agree-to-the-vendor-licenses.
You can convert your DIA-files with the same container following "Converting Dia Raw with Msconvert"
4.1. Picking Peaks
mkdir -p MZXML-pwiz
for f in RAW/*.wiff; do
    wine qtofpeakpicker --resolution=2000 --area=1 --threshold=1 --smoothwidth=1.1 --in $f --out MZXML-pwiz/$(basename --suffix=.wiff $f).mzXML
done
4.2. Converting Dia Raw with Msconvert
mkdir -p MZML-pwiz
find . -iname '*.wiff' -print0 | xargs -P5 -0 -i wine msconvert {} --filter 'titleMaker <RunId>.<ScanNumber>.<ScanNumber>.<ChargeState> File:"<SourcePath>", NativeID:"<Id>"' -o MZML-pwiz/
5. Analysis [4/4]
Gladiator paper: https://doi.org/10.1038/s43705-022-00137-0 diatools: https://doi.org/10.1021/acs.jproteome.9b00606
5.1. Headers
(setq org-babel-tangle-lang-exts
      (cl-remove-duplicates
       (append '(("scheme" . "scm"))
               org-babel-tangle-lang-exts)
       :test 'equal))
(define-module (workflow)
  #:use-module (gwl workflows)
  #:use-module (gwl processes)
  #:use-module (gwl utils)
  #:use-module (gwl sugar))

<<gwl-vars>>
<<nf-sdrf-handling>>
<<nf-function-definitions>>
<<nf-vars>>
5.2. Building database
Overview
5.2.1. Combining Fasta Files
from Bio import SeqIO

def join_fasta_files(input_files, output_file):
    IDs = set()
    seqRecords = []
    for filename in input_files:
        records = SeqIO.index(filename, "fasta")
        for ID in records:
            if ID not in IDs:
                seqRecords.append(records[ID])
                IDs.add(ID)
            else:
                print("Found duplicated sequence ID " + str(ID) + ", skipping this sequence from file " + filename)
    SeqIO.write(seqRecords, output_file, "fasta")
process JoinFastaFiles {
    input:
    file fasta_files from fasta_files_ch.toSortedList()
    output:
    file 'joined_database.fasta' into joined_fasta_database_ch
    """
    #!/usr/bin/env python3
    <<py-joinfastafiles>>
    join_fasta_files("$fasta_files".split(" "), 'joined_database.fasta')
    """
}
params.fastafiles='fasta/*.fasta'
Channel.fromPath(params.fastafiles).set{fasta_files_ch}
This is how we could set fasta_files_ch to be in the same order as the original Bruderer run:
Channel.from([
    "fasta/Q7M135.fasta",
    "fasta/irtfusion.fasta",
    "fasta/trypsin.fasta",
    "fasta/uniprot_human_2017_04_05.fasta",
    "fasta/Bruderer_QS-spike-in-proteins.fasta"])
    .map{file(it)}
    .set({fasta_files_ch})
("join-fasta-files" "python" "biopython")
(define (join-fasta-files fasta-files)
  (make-process
   (name "join-fasta-files")
   (synopsis "Join fasta files into one file")
   (packages (cdr (quote <<gwl-joinfastafiles-deps>>)))
   (inputs (files fasta-files))
   (outputs "joined-fasta.fasta")
   # python {
     <<py-joinfastafiles>>
     join_fasta_files({{inputs}}.split(" "),{{outputs}})
   }))
(define fasta-files '("Q7M135.fasta" "trypsin.fasta"))
(join-fasta-files fasta-files)
5.2.2. Adding Decoys
fasta_db_with_decoys = Channel.value()
process BuildFastaDatabase {
    input:
    file joined_fasta_db from joined_fasta_database_ch
    output:
    file "DB_with_decoys.fasta" into joined_fasta_with_decoys_ch
    """
    DecoyDatabase -in $joined_fasta_db -out DB_with_decoys.fasta
    """
}
The DecoyDatabase tool is from OpenMS/utils:
https://abibuilder.informatik.uni-tuebingen.de/archive/openms/Documentation/release/latest/html/UTILS_DecoyDatabase.html
https://github.com/OpenMS/OpenMS.git
License: BSD 3-clause
(Not in guix, but it uses cmake as its build system, so it should be relatively easy to package.)
(define create-database-with-decoys
  (make-process
   (name "create-database-with-database")
   (synopsis "Add decoys to fasta database")
   (inputs "joined-fasta.fasta")
   (outputs "DB-with-decoys.fasta")
   (packages )
   # sh {
     DecoyDatabase -in $inputs -out $outputs
   })))
create-database-with-decoys
5.3. Input DIA files
Here we fan the DIA files out to the channels that consume them (swath-window inference and OpenSwath):
Channel
    .fromPath(params.diafiles)
    .multiMap{ it -> swath_windows: osw: it}
    .set{dia_mzml_files_ch}
5.4. Creating Swath window files
This step outputs the files swath-windows.txt and truncated-swath-windows.txt.
If you are using FAIMS-split mzMLs, the mzML might not contain isolationWindow elements; in that case you can provide your own tab-separated file of swath windows.
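For illustration, a user-supplied file could be created like this (a sketch with invented values; two tab-separated numeric columns per window, in the same format as the generated swath-windows.txt described below):
printf '400.0\t425.0\n424.0\t450.0\n449.0\t475.0\n' > my-swath-windows.txt
# then pass --swath_windows_file=my-swath-windows.txt to the nextflow invocation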
5.4.1. Branching if user supplied windows
// optional swath windows file, a tab-separated file
// where the first column is the isolation window lower offset
// and the second column is the isolation window upper offset
// this file is normally automatically generated in the MakeSwathWindows steps,
// but if your mzML does not provide isolationWindow elements you can supply it here
params.swath_windows_file=''
if (params.swath_windows_file) {
    <<nf-regularize-user-swath-windows>>
} else {
    <<nf-infer-swath-windows>>
}
5.4.2. Making truncated-swath-windows and swath-windows from user-supplied swath-windows
Here we do some mangling so that the user-supplied swath windows file ends up in the same format as the one we would generate ourselves. We keep FS at its default so that awk happily accepts any whitespace as field separator (be it normal spaces or tabs), but we output with tab as the separator.
process RegularizeUserSwathWindow {
    input:
    path user_swath_windows, stageAs: 'userSwathWindow.txt' from Channel.fromPath(params.swath_windows_file).first()
    output:
    file swath_windows into swath_windows_ch
    script:
    swath_windows="swath-windows.txt"
    """
    sort -n $user_swath_windows | awk 'BEGIN {OFS="\t"} {print \$1,\$2}' > $swath_windows
    """
}
5.4.3. Inferring Windows from mzml files
If the user didn't supply a swath windows file, we infer it from the mzML file.
import xml.etree.ElementTree as ET import os def read_swath_windows(dia_mzML): print ("DEBUG: reading_swath_windows: ", dia_mzML) context = ET.iterparse(dia_mzML, events=("start", "end")) windows = {} for event, elem in context: if event == "end" and elem.tag == '{http://psi.hupo.org/ms/mzml}precursor': il_target = None il_lower = None il_upper = None isolationwindow = elem.find('{http://psi.hupo.org/ms/mzml}isolationWindow') if isolationwindow is None: raise RuntimeError("Could not find isolation window; please supply --swath_windows_file to Gladiator.") for cvParam in isolationwindow.findall('{http://psi.hupo.org/ms/mzml}cvParam'): name = cvParam.get('name') value = cvParam.get('value') if (name == 'isolation window target m/z'): il_target = value elif (name == 'isolation window lower offset'): il_lower = value elif (name == 'isolation window upper offset'): il_upper = value ionList = elem.find('{http://psi.hupo.org/ms/mzml}selectedIonList') selectedion = ionList.find('{http://psi.hupo.org/ms/mzml}selectedIon') if selectedion: for cvParam in selectedion.findall('{http://psi.hupo.org/ms/mzml}cvParam'): name = cvParam.get('name') value = cvParam.get('value') if (name == 'selected ion m/z'): if not il_target: il_target = value if not il_target in windows: windows[il_target] = (il_lower, il_upper) else: lower, upper = windows[il_target] assert (il_lower == lower) assert (il_upper == upper) return windows return windows def create_swath_window_files(cwd, dia_mzML): windows = read_swath_windows(dia_mzML) swaths = [] for x in windows: target_str = x lower_str, upper_str = windows[x] target = float(target_str) lower = float(lower_str) upper = float(upper_str) assert (lower > 0) assert (upper > 0) swaths.append((target - lower, target + upper)) swaths.sort(key=lambda tup: tup[0]) # here we use chr(10) (equivalent to slash n), and chr(9) (equivalent to slash t) because i dont wanna deal with nextflow headaches newline_character = chr(10) tab_character = chr(9) with open(os.path.join(cwd, "swath-windows.txt"), "w") as fh_swaths: for lower,upper in swaths: fh_swaths.write(str(lower) + tab_character + str(upper) + newline_character) return fh_swaths
process InferSwathWindows {
    input:
    file diafile from dia_mzml_files_ch.swath_windows.first()
    output:
    file "swath-windows.txt" into swath_windows_ch
    shell:
    '''
    #!/usr/bin/env python3
    <<py-makeswathwindows>>
    swaths = create_swath_window_files(".","!{diafile}")
    '''
}
We'll have to get minswath and maxswath by reading "swath-windows.txt".
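A minimal sketch of what that amounts to in shell (minswath and maxswath are illustrative names here, not pipeline parameters):
# swath-windows.txt is sorted by the lower window boundary and tab-separated
minswath=$(head -n1 swath-windows.txt | cut -f1)
maxswath=$(cut -f2 swath-windows.txt | sort -n | tail -n1)
echo "swath windows cover $minswath to $maxswath m/z"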
5.4.4. Making the non-overlapping swath-windows
OpenSwath requires non-overlapping windows, so we create them here.
BEGIN {OFS=" "}
function max(a,b){
    if(a > b)
        return a
    return b
}
NR==1 {
    # we start with the special case that the boundary for the first entry
    # should be unchanged
    prev_upper=$1
    # and we add the column names
    print "LowerOffset","HigherOffset"
}
{
    if (prev_upper > $2) {
        print "There is a window that is a subwindow of the previous window"
        exit 1
    }
    print(max($1,prev_upper),$2)
    prev_upper=$2
}
process InferNonOverlappingSwathWindows {
    input:
    file swath_windows from swath_windows_ch.first()
    output:
    file truncated_swath_windows into truncated_swath_windows_ch
    script:
    truncated_swath_windows="truncated_swath_windows.txt"
    '''
    awk '
    <<awk-infer-non-overlapping-swath-windows>>' ''' + "$swath_windows > $truncated_swath_windows"
}
5.5. Library Generation
There are various ways to generate spectral libraries from DIA data or DDA data. Here we make the distinction between deconvolution methods and other library generation methods.
The following is a list of the methods we support,
[ "dda","custom", "deepdia", "diaumpire","diams2pep"]
And you can adjust the following parameter
// one or more of [ "dda","custom", "deepdia", "diaumpire","diams2pep"] separated by commas
// will default to "dda" if ddafiles are supplied,
// otherwise to "diaumpire"
params.libgen_method = null
// TODO: raise an error if params.libgen_method is not a supported method
libgen_methods_validate_params(params)
// returns all libgen methods that we support
def libgen_methods_get_existing (){
    return [ "dda","custom", "deepdia", "diaumpire","diams2pep"]
}
def libgen_method_any_pseudospectra_method_is_enabled(params){
    def pseudospectra_methods = ["diams2pep","diaumpire"]
    return pseudospectra_methods.inject(false) { acc, val -> acc || libgen_method_is_enabled(val, params)}
}
def libgen_methods_validate_params(params){
    if(params.libgen_method != null){
        def invalid_methods = params.libgen_method.split(",").findAll({ !libgen_methods_get_existing().contains(it)})
        if(invalid_methods)
            throw new RuntimeException("Invalid libgen methods specified: " + invalid_methods.join(","))
    }
}
def libgen_method_is_enabled(method, params){
    // method to use if the user didn't specify anything
    def fallback_method = "diaumpire";
    if (params.libgen_method){
        return params.libgen_method.split(",").contains(method)
    }
    switch (method) {
    case "dda":
        return !!params.ddafiles;
    case "custom":
        return !!params.speclib;
    default:
        return (method == fallback_method) && !params.ddafiles && !params.speclib;
    }
}
def libgen_method_is_exclusively_enabled(method, params) {
    return libgen_methods_get_existing().inject(true) { acc, val -> acc && ( libgen_method_is_enabled(val, params) == (val == method)) }
}
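For example, to build the library from both DIA-Umpire pseudospectra and supplied DDA runs, you could pass the methods on the command line (a sketch; the remaining flags are as in the Quick Start):
nextflow -c myconfig.nf -c config/podman.nf run gladiator.nf \
    --libgen_method='diaumpire,dda' \
    --ddafiles='DDA/*.mzXML' \
    --diafiles='MZML/*.mzML' \
    --fastafiles='fasta/*.fasta'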
5.5.1. Building {Pseudo-,}Spectral library from (Pseudo)-Spectra [5/5]
This section covers several ways of deconvolution for making spectral libraries for later usage by OpenSwath.
The below block handles the logic of dealing with the various deconvolution methods, sending diafiles to all input channels, and getting output mgf from output channels.
deconvolution_methods = []
<<nf-deconvolution-handling>>
deconv_input_chs = deconvolution_methods*.input.findAll({it != null})
if(deconv_input_chs){
    Channel
        .fromPath(params.diafiles)
        .into( deconv_input_chs
               .inject() { acc, val -> acc << val })
}
deconv_output_chs = deconvolution_methods*.output.findAll({it != null})
for(ch:deconv_output_chs)
    Channel.create().set(ch.clone())
if(deconv_output_chs){
    Channel.empty()
        .mix(*(deconv_output_chs*.call()))
        .multiMap{ it -> spectrast: comet: xtandem: it }
        .set{maybespectra_ch}
}
if(libgen_method_any_pseudospectra_method_is_enabled(params) || libgen_method_is_enabled("dda",params)){
5.5.1.1. Using DDA Data
if(libgen_method_is_enabled("dda",params)){
    deconvolution_methods += [output: { dda_files_ch } ]
}
if(libgen_method_is_enabled("dda",params)){
    Channel.fromPath(params.ddafiles).tap(dda_files_ch)
}
5.5.1.2. Creating Pseudospectra with DIAumpire
https://github.com/Nesvilab/DIA-Umpire/tree/master
https://github.com/Nesvilab/DIA-Umpire/raw/gh-pages/DIA_Umpire_Manual_v2.0.pdf
5.5.1.2.1. Problems you might encounter during this step
DIA-Umpire, which we use here for pseudo-spectra creation, has pretty extreme memory requirements; in your config file you can set the process-specific memory (required to be in Gigabytes), e.g.
process {
    withName: 'GeneratePseudoSpectra' {
        time='96h'
        memory='400 GB'
    }
}
MzmlToMzxml processing error: if you get an error like
processing file: RD139_Narrow_UPS1_50fmol_inj3.mzML [SpectrumList_mzML::create()] Bad istream. Error processing file RD139_Narrow_UPS1_50fmol_inj3.mzML
in MzmlToMzxml, that can mean that something went wrong when you used msconvert to convert from the proprietary format to mzML.
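One quick sanity check (a sketch; assumes xmllint is available) is to verify that the mzML is at least well-formed XML, since truncated conversions usually are not:
xmllint --noout RD139_Narrow_UPS1_50fmol_inj3.mzML && echo "well-formed"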
5.5.1.2.2. Steps that are run
DIAumpire is Apache 2 licensed.
if(libgen_method_is_enabled("diaumpire",params)){
    deconvolution_methods += [output: {diaumpire_pseudospectra_ch}, input: {dia_mzml_files_for_diaumpire_ch}]
}
if(libgen_method_is_enabled("diaumpire",params)){
#Number of threads # set to the number of cores available # In the original gladiator, this was set by replicing this all caps Thread = 4 #Precursor-fragments grouping parameters RPmax = 25 RFmax = 300 CorrThreshold = 0.2 DeltaApex = 0.6 RTOverlap = 0.3 #Fragment intensity adjustments # change BoostComplementaryIon if later using database search results to build libraries for Skyline/OpenSWATH ## [2023-05-30 Tue] ## what did the original gladiator author mean by this? ## he forgot. ## in dia-umpire repo example BoostComplementaryIon is True. AdjustFragIntensity = true BoostComplementaryIon = true #Export detected MS1 features (output feature file can be loaded and mapped to RAW data in BatMass) ExportPrecursorPeak = false #Signal extraction: mass accuracy and resolution # resolution parameter matters only for data generated in profile mode SE.MS1PPM = 15 SE.MS2PPM = 25 SE.Resolution = 60000 #Signal extraction: signal to noise filter SE.SN = 1.1 SE.MS2SN = 1.1 #Signal extraction: minimum signal intensity filter # for Thermo data, filtering is usually not necessary. Set SE.EstimateBG to false and SE.MinMSIntensity and SE.MinMSMSIntensity to a low value, e.g. 1 # for older Q Exactive data, or when too many MS1 features are extracted, set SE.EstimateBG to yes (or apply SE.MinMSIntensity and SE.MinMSMSIntensity values based on BatMass visualization) SE.EstimateBG = false SE.MinMSIntensity = 1 SE.MinMSMSIntensity = 1 #Signal extraction: peak curve detection and isotope grouping # for older Q Exactive data, or when too many MS1 features are extracted, set SE.NoMissedScan to 1 SE.NoMissedScan = 2 SE.MaxCurveRTRange = 2 SE.RemoveGroupedPeaks = true SE.RemoveGroupedPeaksRTOverlap = 0.3 SE.RemoveGroupedPeaksCorr = 0.3 SE.MinNoPeakCluster = 2 SE.MaxNoPeakCluster = 4 #Signal extraction: filtering of MS1 features # if interested in modified peptides, increase MassDefectOffset parameter, or set SE.MassDefectFilter to false SE.IsoPattern = 0.3 SE.MassDefectFilter = true SE.MassDefectOffset = 0.1 #Signal extraction: other SE.StartCharge = 1 SE.EndCharge = 5 SE.MS2StartCharge = 2 SE.MS2EndCharge = 5 SE.MinFrag=10 SE.StartRT = 0 SE.EndRT = 9999 SE.MinMZ = 200 SE.MinPrecursorMass = 600 SE.MaxPrecursorMass = 5000 #Isolation window setting #The current version supports the following window type: SWATH (fixed window size), V_SWATH (variable SWATH window), MSX, MSE, pSMART WindowType=SWATH #Fix window size (For SWATH) # for Thermo data, this will be determined from raw data automatically #WindowSize=15 #Variable SWATH window setting (start m/z, end m/z, separated by Tab) # for Thermo data, this will be determined from raw data automatically #==window setting begin #==window setting end
MGF = Mascot Generic Format https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3518119/
// create mzxml process MzmlToMzxml { input: file diafile from dia_mzml_files_for_diaumpire_ch output: file "*.mzXML" into dia_mzxml_files_for_diaumpire_ch """ msconvert $diafile --32 --zlib --filter "peakPicking false 1-" --mzXML """ } process GeneratePseudoSpectra { memory '16 GB' input: file diafile from dia_mzxml_files_for_diaumpire_ch path diaumpireconfig from diaumpireconfig_ch.first() output: // we flatten here becuase a single mzxml might result in multiple mgf files file "*.mgf" into diaumpire_pseudospectra_mgf_ch mode flatten """ # we set \$1 to the number of gigs of memory set -- $task.memory if command -v diaumpire-se; then diaumpire-se -Xmx\$1g -Xms\$1g $diafile $diaumpireconfig else java -Xmx\$1g -Xms\$1g -jar /opt/dia-umpire/DIA_Umpire_SE.jar $diafile $diaumpireconfig fi """ } process DiaUmpireMgfToMzxml { input: file mgf from diaumpire_pseudospectra_mgf_ch output: file "*.mzXML" into diaumpire_pseudospectra_ch when: // excluding empty files mgf.size() > 0 """ msconvert $mgf --mzXML """ }
("generate-pseudo-spectra" "dia-umpire" "pwiz") ;; the free one
Though this might also be done with OpenMS's FileConverter, which is more conventionally built?
https://abibuilder.informatik.uni-tuebingen.de/archive/openms/Documentation/release/latest/html/TOPP_FileConverter.html
mstools
params.diaumpireconfig='diaumpireconfig.txt'
// glob to DIA mzML files, e.g. "DIA/*.mzML"
// MANDATORY to be set if not set by SDRF file
params.diafiles = null
// OPTIONAL glob to mzXML dda files
// e.g. "DDA/*.mzXML"
// if left unset, then pseudospectra will be used.
params.ddafiles = null
// so that this is a singleton channel
diaumpireconfig_ch = Channel.fromPath(params.diaumpireconfig)
} // end of diaumpire guard
5.5.1.3. Creating Pseudospectra with diams2pep
https://github.com/SS2proteome/DIA-MS2pep https://doi.org/10.52601/bpr.2022.220011
// fragment tolerance for diams2pep in ppm
// (other tools require it in dalton)
params.diams2pep_fragment_tolerance = null
if(libgen_method_is_enabled("diams2pep",params)){
    deconvolution_methods += [output: { diams2pep_pseudospectra}, input: { diams2pep_input_mzml}]
}
if(libgen_method_is_enabled("diams2pep",params)){
Do we need msconvert to convert to a friendly mzml file? According to DIA-MS2PEP's readme we need
mzML=true
zlib=true
mz64=true
inten64=true
simAsSpectra=true
filter="peakPicking vendor msLevel=1-2"
trying with filter "cwt" because we don't ship vendors.
process convert_for_DIAMS2PEP {
    input:
    file mzml from diams2pep_input_mzml
    output:
    // there is no good way in nextflow to make a UUID that persists across -resume
    // task.hash is forgotten in resume, as is task.id.
    tuple val("${mzml.baseName}"), path(ofile) into diams2pep_mgf_mzml, diams2pep_window_mzml, diams2pep_for_pseudo_mzml
    script:
    ofile="converted/${mzml.baseName}.mzML"
    """
    mkdir -p converted
    msconvert --mzML --mz64 --zlib --inten64 --simAsSpectra --filter "peakPicking cwt msLevel=1-2" --outdir converted $mzml
    """
}
process convert_mgf_for_DIAMS2PEP {
    input:
    tuple val(hash), path(mzml) from diams2pep_mgf_mzml
    output:
    tuple val(hash), path("${mzml.baseName}.mgf") into diams2pep_mgf
    """
    msconvert --mgf $mzml
    """
}
DIA_acquistion_window_generator.pl
is not a typo, this is how it is in the original repo.
process DIAMS2PEP_window {
    input:
    tuple val(hash), path(mzml) from diams2pep_window_mzml
    output:
    tuple val(hash), path("${mzml}.DIA_acquisition_window.txt") into diams2pep_window
    """
    DIA_acquistion_window_generator.pl $mzml
    """
}
Looking at the source, this creates an mgf file for every window that was detected, which would not be known before running DIA_acquisition_window_generator.pl.
DIA_pesudo_MS2.pl ("pesudo" is not a typo; this is how it is in the original repo.)
if(params.diams2pep_fragment_tolerance == null)
    throw new RuntimeException("DIAMS2PEP enabled but no diams2pep_fragment_tolerance specified.")
process DIAMS2PEP_generate_pseudo {
    input:
    tuple val(hash), path(mzml), path(mgf), path(acq_window) from diams2pep_for_pseudo_mzml.join(diams2pep_mgf).join(diams2pep_window)
    val tolerance from Channel.value(params.diams2pep_fragment_tolerance)
    output:
    file "mgf-output/*.mgf" into diams2pep_pseudospectra_mgf mode flatten
    """
    mkdir -p mgf-output
    DIA_pesudo_MS2_multiforks.pl ${mzml.baseName} mgf-output $tolerance ${task.cpus}
    """
}
// this will use the default container because we need msconvert
process MgfToMzml_DIAMS2PEP {
    input:
    file mgf from diams2pep_pseudospectra_mgf
    output:
    file "*.mzXML" into diams2pep_pseudospectra
    """
    msconvert --mzXML $mgf
    """
}
} // end of diams2pep guard
5.5.1.4. Choosing the MS Sequence database search engine: Comet/Xtandem
params.search_engines = ["comet","xtandem"]
Depending on your experimental machine, the precursor and fragment tolerances differ. These are parameters to all search engines used.
Some scientific papers use mmu, which is equal to 1 millidalton (0.001 Dalton); for example, a fragment tolerance of 20 mmu corresponds to 0.02 Dalton.
// Float or Int; in ppm; e.g. params.precursor_mass_tolerance=10
params.precursor_mass_tolerance=null
// Float or Int; in Dalton; e.g. params.fragment_mass_tolerance=0.2
params.fragment_mass_tolerance=null
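For instance, a 10 ppm precursor tolerance and a 0.02 Da (20 mmu) fragment tolerance could be passed on the command line instead of in the config file (a sketch; other flags as in the Quick Start):
nextflow -c myconfig.nf -c config/podman.nf run gladiator.nf \
    --precursor_mass_tolerance=10 \
    --fragment_mass_tolerance=0.02 \
    --fastafiles='fasta/*.fasta' \
    --diafiles='MZML/*.mzML'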
Notably, a more stringent (lower) tolerance increases memory usage by comet.
The maximum number of allowed missed cleavages is also passed to all search engines. It is mandatory if you want to use these search engines.
// Int; if you are using comet, this can be at maximum 5
params.max_missed_cleavages=null
max_missed_cleavages = Channel.value(params.max_missed_cleavages)
However, in this developer's experience, xtandem will crash when using a max-missed-cleavages other than 1, so you would put e.g. the following in your config file:
params.search_engines = ["comet"]
params.max_missed_cleavages= 2
5.5.1.5. Comet
The following is the template file for the parameters passed to comet. You can change fields for things that we don't expose parameters for (i.e. where the value is not "@…@") in order to change comet's behaviour for your use case. For more information, see comet's documentation: https://comet-ms.sourceforge.net/parameters/parameters_202101/
# comet_version 2022.01 rev. 0 # Comet MS/MS search engine parameters file. # Everything following the '#' symbol is treated as a comment. database_name = @DDA_DB_FILE@ decoy_search = 0 # 0=no (default), 1=concatenated search, 2=separate search peff_format = 0 # 0=no (normal fasta, default), 1=PEFF PSI-MOD, 2=PEFF Unimod peff_obo = # path to PSI Mod or Unimod OBO file num_threads = 0 # 0=poll CPU to set num threads; else specify num threads directly (max 128) # # masses # peptide_mass_tolerance = @PRECURSOR_MASS_TOLERANCE@ peptide_mass_units = 2 # 0=amu, 1=mmu, 2=ppm mass_type_parent = 1 # 0=average masses, 1=monoisotopic masses mass_type_fragment = 1 # 0=average masses, 1=monoisotopic masses precursor_tolerance_type = 1 # 0=MH+ (default), 1=precursor m/z; only valid for amu/mmu tolerances isotope_error = 3 # 0=off, 1=0/1 (C13 error), 2=0/1/2, 3=0/1/2/3, 4=-8/-4/0/4/8 (for +4/+8 labeling) # # search enzyme # search_enzyme_number = 1 # choose from list at end of this params file search_enzyme2_number = 0 # second enzyme; set to 0 if no second enzyme num_enzyme_termini = 2 # 1 (semi-digested), 2 (fully digested, default), 8 C-term unspecific , 9 N-term unspecific allowed_missed_cleavage = @MAX_MISSED_CLEAVAGES@ # maximum value is 5; for enzyme search # # Up to 9 variable modifications are supported # format: <mass> <residues> <0=variable/else binary> <max_mods_per_peptide> <term_distance> <n/c-term> <required> <neutral_loss> # e.g. 79.966331 STY 0 3 -1 0 0 97.976896 # variable_mod01 = 15.9949 M 0 3 -1 0 0 0.0 variable_mod02 = 0.0 X 0 3 -1 0 0 0.0 variable_mod03 = 0.0 X 0 3 -1 0 0 0.0 variable_mod04 = 0.0 X 0 3 -1 0 0 0.0 variable_mod05 = 0.0 X 0 3 -1 0 0 0.0 variable_mod06 = 0.0 X 0 3 -1 0 0 0.0 variable_mod07 = 0.0 X 0 3 -1 0 0 0.0 variable_mod08 = 0.0 X 0 3 -1 0 0 0.0 variable_mod09 = 0.0 X 0 3 -1 0 0 0.0 max_variable_mods_in_peptide = 5 require_variable_mod = 0 # # fragment ions # # ion trap ms/ms: 1.0005 tolerance, 0.4 offset (mono masses), theoretical_fragment_ions = 1 # high res ms/ms: 0.02 tolerance, 0.0 offset (mono masses), theoretical_fragment_ions = 0, spectrum_batch_size = 15000 # fragment_bin_tol = @FRAGMENT_MASS_TOLERANCE@ # binning to use on fragment ions fragment_bin_offset = 0.0 # offset position to start the binning (0.0 to 1.0) theoretical_fragment_ions = 1 # 0=use flanking peaks, 1=M peak only use_A_ions = 0 use_B_ions = 1 use_C_ions = 0 use_X_ions = 0 use_Y_ions = 1 use_Z_ions = 0 use_Z1_ions = 0 use_NL_ions = 0 # 0=no, 1=yes to consider NH3/H2O neutral loss peaks # # output # output_sqtfile = 0 # 0=no, 1=yes write sqt file output_txtfile = 0 # 0=no, 1=yes write tab-delimited txt file output_pepxmlfile = 1 # 0=no, 1=yes write pepXML file output_mzidentmlfile = 0 # 0=no, 1=yes write mzIdentML file output_percolatorfile = 1 # 0=no, 1=yes write Percolator pin file print_expect_score = 1 # 0=no, 1=yes to replace Sp with expect in out & sqt num_output_lines = 5 # num peptide results to show sample_enzyme_number = 1 # Sample enzyme which is possibly different than the one applied to the search. # Used to calculate NTT & NMC in pepXML output (default=1 for trypsin). 
# # mzXML parameters # scan_range = 0 0 # start and end scan range to search; either entry can be set independently precursor_charge = 0 0 # precursor charge range to analyze; does not override any existing charge; 0 as 1st entry ignores parameter override_charge = 0 # 0=no, 1=override precursor charge states, 2=ignore precursor charges outside precursor_charge range, 3=see online ms_level = 2 # MS level to analyze, valid are levels 2 (default) or 3 activation_method = HCD # activation method; used if activation method set; allowed ALL, CID, ECD, ETD, ETD+SA, PQD, HCD, IRMPD, SID # # misc parameters # digest_mass_range = 600.0 5000.0 # MH+ peptide mass range to analyze peptide_length_range = 5 63 # minimum and maximum peptide length to analyze (default 1 63; max length 63) num_results = 100 # number of search hits to store internally max_duplicate_proteins = 20 # maximum number of additional duplicate protein names to report for each peptide ID; -1 reports all duplicates max_fragment_charge = 3 # set maximum fragment charge state to analyze (allowed max 5) max_precursor_charge = 6 # set maximum precursor charge state to analyze (allowed max 9) nucleotide_reading_frame = 0 # 0=proteinDB, 1-6, 7=forward three, 8=reverse three, 9=all six clip_nterm_methionine = 0 # 0=leave protein sequences as-is; 1=also consider sequence w/o N-term methionine spectrum_batch_size = 15000 # max. # of spectra to search at a time; 0 to search the entire scan range in one loop decoy_prefix = DECOY_ # decoy entries are denoted by this string which is pre-pended to each protein accession equal_I_and_L = 1 # 0=treat I and L as different; 1=treat I and L as same output_suffix = # add a suffix to output base names i.e. suffix "-C" generates base-C.pep.xml from base.mzXML input mass_offsets = # one or more mass offsets to search (values substracted from deconvoluted precursor mass) precursor_NL_ions = # one or more precursor neutral loss masses, will be added to xcorr analysis # # spectral processing # minimum_peaks = 10 # required minimum number of peaks in spectrum to search (default 10) minimum_intensity = 0 # minimum intensity value to read in remove_precursor_peak = 0 # 0=no, 1=yes, 2=all charge reduced precursor peaks (for ETD), 3=phosphate neutral loss peaks remove_precursor_tolerance = 1.5 # +- Da tolerance for precursor removal clear_mz_range = 0.0 0.0 # for iTRAQ/TMT type data; will clear out all peaks in the specified m/z range # # additional modifications # add_Cterm_peptide = 0.0 add_Nterm_peptide = 0.0 add_Cterm_protein = 0.0 add_Nterm_protein = 0.0 add_G_glycine = 0.0000 # added to G - avg. 57.0513, mono. 57.02146 add_A_alanine = 0.0000 # added to A - avg. 71.0779, mono. 71.03711 add_S_serine = 0.0000 # added to S - avg. 87.0773, mono. 87.03203 add_P_proline = 0.0000 # added to P - avg. 97.1152, mono. 97.05276 add_V_valine = 0.0000 # added to V - avg. 99.1311, mono. 99.06841 add_T_threonine = 0.0000 # added to T - avg. 101.1038, mono. 101.04768 add_C_cysteine = 57.021464 # added to C - avg. 103.1429, mono. 103.00918 add_L_leucine = 0.0000 # added to L - avg. 113.1576, mono. 113.08406 add_I_isoleucine = 0.0000 # added to I - avg. 113.1576, mono. 113.08406 add_N_asparagine = 0.0000 # added to N - avg. 114.1026, mono. 114.04293 add_D_aspartic_acid = 0.0000 # added to D - avg. 115.0874, mono. 115.02694 add_Q_glutamine = 0.0000 # added to Q - avg. 128.1292, mono. 128.05858 add_K_lysine = 0.0000 # added to K - avg. 128.1723, mono. 128.09496 add_E_glutamic_acid = 0.0000 # added to E - avg. 129.1140, mono. 
129.04259 add_M_methionine = 0.0000 # added to M - avg. 131.1961, mono. 131.04048 add_H_histidine = 0.0000 # added to H - avg. 137.1393, mono. 137.05891 add_F_phenylalanine = 0.0000 # added to F - avg. 147.1739, mono. 147.06841 add_U_selenocysteine = 0.0000 # added to U - avg. 150.0379, mono. 150.95363 add_R_arginine = 0.0000 # added to R - avg. 156.1857, mono. 156.10111 add_Y_tyrosine = 0.0000 # added to Y - avg. 163.0633, mono. 163.06333 add_W_tryptophan = 0.0000 # added to W - avg. 186.0793, mono. 186.07931 add_O_pyrrolysine = 0.0000 # added to O - avg. 237.2982, mono 237.14773 add_B_user_amino_acid = 0.0000 # added to B - avg. 0.0000, mono. 0.00000 add_J_user_amino_acid = 0.0000 # added to J - avg. 0.0000, mono. 0.00000 add_X_user_amino_acid = 0.0000 # added to X - avg. 0.0000, mono. 0.00000 add_Z_user_amino_acid = 0.0000 # added to Z - avg. 0.0000, mono. 0.00000 # # COMET_ENZYME_INFO _must_ be at the end of this parameters file # [COMET_ENZYME_INFO] 0. Cut_everywhere 0 - - 1. Trypsin 1 KR P 2. Trypsin/P 1 KR - 3. Lys_C 1 K P 4. Lys_N 0 K - 5. Arg_C 1 R P 6. Asp_N 0 D - 7. CNBr 1 M - 8. Glu_C 1 DE P 9. PepsinA 1 FL P 10. Chymotrypsin 1 FWYL P 11. No_cut 1 @ @
Here we set the above as the default parameter template. If you have customized it or have your own comet config, you can point to it with this:
params.comet_template="comet_template.txt"
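For example (a sketch; the copied file name is hypothetical):
# copy the shipped template, edit the non-@...@ fields you care about,
# then point the pipeline at your copy
cp comet_template.txt my_comet_template.txt
nextflow -c myconfig.nf -c config/podman.nf run gladiator.nf \
    --comet_template=my_comet_template.txt \
    --fastafiles='fasta/*.fasta' --diafiles='MZML/*.mzML'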
process MakeCometConfig {
    // should we instead return a tuple here of fastadb and config
    // because the config.txt refers to it?
    input:
    val max_missed_cleavages
    file fastadb_with_decoy from joined_fasta_with_decoys_ch.first()
    path template from Channel.fromPath(params.comet_template)
    output:
    file "comet_config.txt" into comet_config_ch
    """
    sed 's/@DDA_DB_FILE@/$fastadb_with_decoy/g;s/@FRAGMENT_MASS_TOLERANCE@/$params.fragment_mass_tolerance/g;s/@PRECURSOR_MASS_TOLERANCE@/$params.precursor_mass_tolerance/g;s/@MAX_MISSED_CLEAVAGES@/$max_missed_cleavages/g' $template > comet_config.txt
    """
}
Setting memory & error strategy like this prevents caching even with process.cache='lenient', maybe because task.attempt = 1 is tried first.
process Comet { // we probably also want to publish thees memory { 5.GB * 2 * task.attempt } errorStrategy { task.exitStatus in 137..137 ? 'retry' : 'terminate' } maxRetries 2 input: file comet_config from comet_config_ch.first() // future dev: we can .mix with DDA here? // though we might need to tag for DDA / Pseudo // so that xinteract file mzxml from maybespectra_ch.comet file fastadb_with_decoy from joined_fasta_with_decoys_ch.first() output: file("${mzxml.baseName}.pep.xml") into comet_pepxml_ch file mzxml into xinteract_comet_mzxml_ch when: params.search_engines.contains("comet") """ if command -v command-ms; then comet-ms -P$comet_config $mzxml else comet -P$comet_config $mzxml fi """ }
process XinteractComet { memory '16 GB' time '5h' // memory usage scales with the number of input files // find the correct usage per input file or size // also for xinteractxtandem // usage there seems to be a lot smaller // as input files seems to be smaller input: file pepxmls from comet_pepxml_ch.toSortedList() // the filename of needed fastdadb was defined in cometcfg // and stored in pepxml in the comet-ms step // -a suppplies the absulute path to the data directory where the mzxmls // rather than reading wherer the mfrom the xmls // where the mzxml are, because its not very // nextflow to look outside the cwd. file fastadb_with_decoy from joined_fasta_with_decoys_ch.first() file mzxmls from xinteract_comet_mzxml_ch.toSortedList() output: file "interact_comet.pep.xml" into comet_search_results_ch when: pepxmls.size() > 0 """ xinteract -a\$PWD -OARPd -dDECOY_ -Ninteract_comet.pep.xml $pepxmls """ }
5.5.1.6. Xtandem
As with comet, you can adjust the file "xtandem-template.xml" to suit your needs; values with @...@ are automatically replaced.
See also the documentation of xtandem here: https://www.thegpm.org/TANDEM/api/
<?xml version="1.0"?> <bioml label="x! taxon-to-file matching list"> <taxon label="DB"> <file format="peptide" URL="%s" /> </taxon> </bioml>
<?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="tandem-input-style.xsl"?> <bioml> <note>list path parameters</note> <note>spectrum parameters</note> <note type="input" label="spectrum, fragment monoisotopic mass error">@FRAGMENT_MASS_TOLERANCE@</note> <note type="input" label="spectrum, parent monoisotopic mass error plus">@PRECURSOR_MASS_TOLERANCE@</note> <note type="input" label="spectrum, parent monoisotopic mass error minus">@PRECURSOR_MASS_TOLERANCE@</note> <note type="input" label="spectrum, parent monoisotopic mass isotope error">yes</note> <note type="input" label="spectrum, fragment monoisotopic mass error units">Daltons</note> <note>The value for this parameter may be 'Daltons' or 'ppm': all other values are ignored</note> <note type="input" label="spectrum, parent monoisotopic mass error units">ppm</note> <note>The value for this parameter may be 'Daltons' or 'ppm': all other values are ignored</note> <note type="input" label="spectrum, fragment mass type">monoisotopic</note> <note>values are monoisotopic|average </note> <note>spectrum conditioning parameters</note> <note type="input" label="spectrum, dynamic range">100.0</note> <note>The peaks read in are normalized so that the most intense peak is set to the dynamic range value. All peaks with values of less that 1, using this normalization, are not used. This normalization has the overall effect of setting a threshold value for peak intensities.</note> <note type="input" label="spectrum, total peaks">50</note> <note>If this value is 0, it is ignored. If it is greater than zero (lets say 50), then the number of peaks in the spectrum with be limited to the 50 most intense peaks in the spectrum. X! tandem does not do any peak finding: it only limits the peaks used by this parameter, and the dynamic range parameter.</note> <note type="input" label="spectrum, maximum parent charge">4</note> <note type="input" label="spectrum, use noise suppression">yes</note> <note type="input" label="spectrum, minimum parent m+h">500.0</note> <note type="input" label="spectrum, minimum fragment mz">150.0</note> <note type="input" label="spectrum, minimum peaks">15</note> <note type="input" label="spectrum, threads">40</note> <note type="input" label="spectrum, sequence batch size">1000</note> <note>residue modification parameters</note> <note type="input" label="residue, modification mass">57.022@C</note> <note>The format of this parameter is m@X, where m is the modfication mass in Daltons and X is the appropriate residue to modify. Lists of modifications are separated by commas. For example, to modify M and C with the addition of 16.0 Daltons, the parameter line would be +16.0@M,+16.0@C Positive and negative values are allowed. </note> <note type="input" label="residue, potential modification mass">16@M</note> <note>The format of this parameter is the same as the format for residue, modification mass (see above).</note> <note type="input" label="residue, potential modification motif"></note> <note>The format of this parameter is similar to residue, modification mass, with the addition of a modified PROSITE notation sequence motif specification. For example, a value of 80@[ST!]PX[KR] indicates a modification of either S or T when followed by P, and residue and the a K or an R. A value of 204@N!{P}[ST]{P} indicates a modification of N by 204, if it is NOT followed by a P, then either an S or a T, NOT followed by a P. Positive and negative values are allowed. 
</note> <note>protein parameters</note> <note type="input" label="protein, taxon">other mammals</note> <note>This value is interpreted using the information in taxonomy.xml.</note> <note type="input" label="protein, cleavage site">[RK]|{P}</note> <note>this setting corresponds to the enzyme trypsin. The first characters in brackets represent residues N-terminal to the bond - the '|' pipe - and the second set of characters represent residues C-terminal to the bond. The characters must be in square brackets (denoting that only these residues are allowed for a cleavage) or french brackets (denoting that these residues cannot be in that position). Use UPPERCASE characters. To denote cleavage at any residue, use [X]|[X] and reset the scoring, maximum missed cleavage site parameter (see below) to something like 50. </note> <note type="input" label="protein, modified residue mass file"></note> <note type="input" label="protein, cleavage C-terminal mass change">+17.002735</note> <note type="input" label="protein, cleavage N-terminal mass change">+1.007825</note> <note type="input" label="protein, N-terminal residue modification mass">0.0</note> <note type="input" label="protein, C-terminal residue modification mass">0.0</note> <note type="input" label="protein, homolog management">no</note> <note>if yes, an upper limit is set on the number of homologues kept for a particular spectrum</note> <note type="input" label="protein, quick acetyl">no</note> <note type="input" label="protein, quick pyrolidone">no</note> <note>model refinement parameters</note> <note type="input" label="refine">yes</note> <note type="input" label="refine, modification mass"></note> <note type="input" label="refine, sequence path"></note> <note type="input" label="refine, tic percent">20</note> <note type="input" label="refine, spectrum synthesis">yes</note> <note type="input" label="refine, maximum valid expectation value">0.1</note> <note type="input" label="refine, potential N-terminus modifications">+42.010565@[</note> <note type="input" label="refine, potential C-terminus modifications"></note> <note type="input" label="refine, unanticipated cleavage">yes</note> <note type="input" label="refine, potential modification mass"></note> <note type="input" label="refine, point mutations">no</note> <note type="input" label="refine, use potential modifications for full refinement">no</note> <note type="input" label="refine, point mutations">no</note> <note type="input" label="refine, potential modification motif"></note> <note>The format of this parameter is similar to residue, modification mass, with the addition of a modified PROSITE notation sequence motif specification. For example, a value of 80@[ST!]PX[KR] indicates a modification of either S or T when followed by P, and residue and the a K or an R. A value of 204@N!{P}[ST]{P} indicates a modification of N by 204, if it is NOT followed by a P, then either an S or a T, NOT followed by a P. Positive and negative values are allowed. 
</note> <note>scoring parameters</note> <note type="input" label="scoring, minimum ion count">4</note> <note type="input" label="scoring, maximum missed cleavage sites">@MAX_MISSED_CLEAVAGES@</note> <note type="input" label="scoring, x ions">no</note> <note type="input" label="scoring, y ions">yes</note> <note type="input" label="scoring, z ions">no</note> <note type="input" label="scoring, a ions">no</note> <note type="input" label="scoring, b ions">yes</note> <note type="input" label="scoring, c ions">no</note> <note type="input" label="scoring, cyclic permutation">no</note> <note>if yes, cyclic peptide sequence permutation is used to pad the scoring histograms</note> <note type="input" label="scoring, include reverse">no</note> <note>if yes, then reversed sequences are searched at the same time as forward sequences</note> <note type="input" label="scoring, cyclic permutation">no</note> <note type="input" label="scoring, include reverse">no</note> <note>output parameters</note> <note type="input" label="output, log path"></note> <note type="input" label="output, message">testing 1 2 3</note> <note type="input" label="output, one sequence copy">no</note> <note type="input" label="output, sequence path"></note> <note type="input" label="output, path">output.xml</note> <note type="input" label="output, sort results by">protein</note> <note>values = protein|spectrum (spectrum is the default)</note> <note type="input" label="output, path hashing">no</note> <note>values = yes|no</note> <note type="input" label="output, xsl path">tandem-style.xsl</note> <note type="input" label="output, parameters">yes</note> <note>values = yes|no</note> <note type="input" label="output, performance">yes</note> <note>values = yes|no</note> <note type="input" label="output, spectra">yes</note> <note>values = yes|no</note> <note type="input" label="output, histograms">yes</note> <note>values = yes|no</note> <note type="input" label="output, proteins">yes</note> <note>values = yes|no</note> <note type="input" label="output, sequences">yes</note> <note>values = yes|no</note> <note type="input" label="output, one sequence copy">no</note> <note>values = yes|no, set to yes to produce only one copy of each protein sequence in the output xml</note> <note type="input" label="output, results">valid</note> <note>values = all|valid|stochastic</note> <note type="input" label="output, maximum valid expectation value">0.1</note> <note>value is used in the valid|stochastic setting of output, results</note> <note type="input" label="output, histogram column width">30</note> <note>values any integer greater than 0. Setting this to '1' makes cutting and pasting histograms into spread sheet programs easier.</note> <note type="description">ADDITIONAL EXPLANATIONS</note> <note type="description">Each one of the parameters for X! tandem is entered as a labeled note node. In the current version of X!, keep those note nodes on a single line. </note> <note type="description">The presence of the type 'input' is necessary if a note is to be considered an input parameter. </note> <note type="description">Any of the parameters that are paths to files may require alteration for a particular installation. Full path names usually cause the least trouble, but there is no reason not to use relative path names, if that is the most convenient. </note> <note type="description">Any parameter values set in the 'list path, default parameters' file are reset by entries in the normal input file, if they are present. Otherwise, the default set is used. 
</note> <note type="description">The 'list path, taxonomy information' file must exist. </note> <note type="description">The directory containing the 'output, path' file must exist: it will not be created. </note> <note type="description">The 'output, xsl path' is optional: it is only of use if a good XSLT style sheet exists. </note> </bioml>
<?xml version="1.0"?> <bioml> <note> Each one of the parameters for x! tandem is entered as a labeled note node. Any of the entries in the default_input.xml file can be over-ridden by adding a corresponding entry to this file. This file represents a minimum input file, with only entries for the default settings, the output file and the input spectra file name. See the taxonomy.xml file for a description of how FASTA sequence list files are linked to a taxon name. </note> <note type="input" label="list path, default parameters">%s</note> <note type="input" label="list path, taxonomy information">%s</note> <note type="input" label="protein, taxon">DB</note> <note type="input" label="spectrum, path">%s</note> <note type="input" label="output, path">%s</note> </bioml>
We are making the xtandem_taxonomy xml in the same process because it's kind of a pseudo-dependency.
process MakeXtandemConfig { input: file template from Channel.fromPath(params.xtandem_template) file fastadb_with_decoy from joined_fasta_with_decoys_ch.first() val max_missed_cleavages output: file "xtandem_config.xml" into xtandem_config_ch """ sed 's/@DDA_DB_FILE@/$fastadb_with_decoy/g;s/@FRAGMENT_MASS_TOLERANCE@/$params.fragment_mass_tolerance/g;s/@PRECURSOR_MASS_TOLERANCE@/$params.precursor_mass_tolerance/g;s/@MAX_MISSED_CLEAVAGES@/$max_missed_cleavages/g' $template > xtandem_config.xml """ } process XTandem { when: params.search_engines.contains("xtandem") input: file mzxml from maybespectra_ch.xtandem file tandem_config from xtandem_config_ch.first() file fastadb_with_decoy from joined_fasta_with_decoys_ch.first() output: file("${mzxml.baseName}.tandem.pep.xml") into xtandem_pepxml_ch file mzxml into xinteract_xtandem_mzxml_ch """ printf ' <<taxonomy-template>>' $fastadb_with_decoy | tail -n+2 > xtandem_taxonomy.xml printf ' <<xtandem-input-template>>' $tandem_config xtandem_taxonomy.xml $mzxml ${mzxml.baseName}.TANDEM.OUTPUT.xml | tail -n+2 > input.xml tandem input.xml Tandem2XML ${mzxml.baseName}.TANDEM.OUTPUT.xml ${mzxml.baseName}.tandem.pep.xml """ } process XinteractXTandem { memory '16 GB' input: file pepxmls from xtandem_pepxml_ch.toSortedList() // the filename of needed fastdadb was defined in cometcfg // and stored in pepxml in the comet-ms step // -a suppplies the absulute path to the data directory where the mzxmls // rather than reading wherer the mfrom the xmls // where the mzxml are, because its not very // nextflow to look outside the cwd. file fastadb_with_decoy from joined_fasta_with_decoys_ch.first() file mzxmls from xinteract_xtandem_mzxml_ch.toSortedList() output: file "interact_xtandem.pep.xml" into xtandem_search_results_ch when: pepxmls.size() > 0 """ xinteract -a\$PWD -OARPd -dDECOY_ -Ninteract_xtandem.pep.xml $pepxmls """ }
If you have customized it or have your own xtandem template, you can point to it with this:
params.xtandem_template="xtandem-template.xml"
# pepxml | size pepxml (GiB) |
---|---|
69 | 0.7890625 |
Why is this much smaller than comet?
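For reference, numbers like those in the table can be gathered with something along these lines (a sketch, run where the xtandem pepXMLs ended up; the file pattern follows the XTandem process outputs):
ls *.tandem.pep.xml | wc -l
du -c --block-size=1G *.tandem.pep.xml | tail -n1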
5.5.1.7. Joining Comet & Xtandem
The tap seems to happen after nextflow has stopped; look more into this.
Possible causes:
- Queue remains open when data is staged from an external source · Issue #2502 · nextflow-io/nextflow · GitHub
- Parent nextflow process doesn't exit after all compute tasks are complete · Issue #1230 · nextflow-io/nextflow · GitHub
// we handle the one or two engines case // DSL2 incompat // would be in workflow body if (params.search_engines.size() > 1) { process CombineSearchResults { publishDir "${params.outdir}/speclib" when: input: file xtandem_search_results from xtandem_search_results_ch file comet_search_results from comet_search_results_ch output: file "lib_iprophet.peps.xml" into combined_search_results_ch """ InterProphetParser DECOY=DECOY_ THREADS=${task.cpus} $xtandem_search_results $comet_search_results lib_iprophet.peps.xml """ } } else if (params.search_engines.contains("comet")) { combined_search_results_ch = comet_search_results_ch } else if (params.search_engines.contains("xtandem")) { combined_search_results_ch =xtandem_search_results_ch } else { combined_search_results_ch = Channel.create() }
5.5.1.8. Building Spectral Library
5.5.1.8.1. Mayu
https://doi.org/10.5167/uzh-28712 https://doi.org/10.1074/mcp.M900317-MCP200
GENERAL: Mayu is a software package to determine protein identification false discovery rates (protFDR) and peptide identification false discovery rates (pepFDR) additionally to the peptide-spectrum match false discovery rate (mFDR).
Here is what happens in mayu:
For a pepxml file with peptide-spectrum matches (PSMs, tuples of (spectrum, peptide, probability), where the probability is based on the similarity to the theoretical spectrum), mayu determines the peptide-spectrum-match false discovery rate (mFDR) and the protein identification false discovery rate (protFDR).
We select a protFDR for which mayu finds a matching mFDR level (no higher than the -G flag), and it will filter out everything with a higher mFDR level.
In the output csv the score column is the probability in the PSM (called "discriminant" in the mayu documentation).
We find the lowest probability that still has an mFDR matching the above, and that is what we use as the filtering criterion in spectrast.
Hmpf, why can't mayu return deterministic filenames? (It incorporates the mayu version number in the filename, grumble.) It follows the pattern:
my $psm_file_base = $out_base . '_psm_';
my $id_csv_file = $psm_file_base . $fdr_type . $fdr_value . '_' . $target_decoy . '_' . $version . '.csv';
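With the values used in the process below (prefix filtered, the default protFDR of 0.01, and Mayu version 1.07) that pattern works out to the file the pipeline then tests for:
# expected Mayu PSM csv under those assumptions
ls filtered_psm_protFDR0.01_t_1.07.csv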
process FindMinimumPeptideProbability { input: file combined_search_results from combined_search_results_ch.first() file fastadb_with_decoy from joined_fasta_with_decoys_ch.first() val max_missed_cleavages output: env PEPTIDEPROBABILITY into minimum_peptide_probability /* explanation of paramaters -G $params.protFDR | maximum allowed mFDR of $params.protFDR -P protFDR=$params.protFDR:t | print out PSMs of targets who have a protFDR of $params.protFDR - -H | defines the resolution of error analysis (mFDR steps) -I number of missed cleavages used for database search -M | file name base */ script: prefix="filtered" // you can change this to a glob-pattern (e.g. "*") for future-proofing mayu_version="1.07" psm_csv="${prefix}_psm_protFDR${params.protFDR}_t_${mayu_version}.csv" """ Mayu.pl -verbose -A $combined_search_results -C $fastadb_with_decoy -E DECOY_ -G $params.protFDR -P protFDR=${params.protFDR}:t -H 51 -I $max_missed_cleavages -M $prefix # test if psm_csv was made test -e $psm_csv || exit 1 # test if the results arent empty test `wc -l $psm_csv | cut -d' ' -f1` -gt 1 || exit 1 PEPTIDEPROBABILITY=`cat $psm_csv | cut -f 5 -d ',' |tail -n+2 |sort -u | head -n1` """ }
Note that sort requires $TMPDIR to actually exist and be writable.
$TMPDIR (the envvar) is inherited from the parent environment when run in a container, but the directory itself is not mounted (at least not in Singularity), so if $TMPDIR does not exist inside the container, this will crash.
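A possible workaround (an untested sketch, not part of the pipeline) is to point $TMPDIR at a directory that is known to exist inside the container before launching nextflow, for example:
# hypothetical workaround: /tmp exists inside the containers, so sort's temp files land somewhere writable
export TMPDIR=/tmp
The singularity backend configuration later in this document takes the opposite approach and bind-mounts the host $TMPDIR to /tmp inside the container.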
// sensible values = floats between 0 and 1 // target FDR for mayu // this is equivalent to the "pvalue" parameter in the original (python) gladiator implementation // which is labeled as "Spectral library building FDR" in the UI params.protFDR=0.01
5.5.1.8.2. Spectrast
http://tools.proteomecenter.org/wiki/index.php?title=Software:SpectraST
SpectraST, in SpectraSTPepXMLLibImporter.cpp (readFromFile, processSearchHit), will read the mzXMLs referenced in the pepXML.
It defaults to looking for the mzXML in the CWD; otherwise it checks the path in the base_name property of the msms_run_summary / search_summary elements, so we again need to pass maybespectra_ch along.
From the above URL:
- Creating Consensus Libraries
- Importing the raw spectra into SpectraST
[ … ] Remember that the .mzXML files must be in the same directories as their corresponding .pepXML files.
SpectraST is from TPP (the Trans-Proteomic Pipeline).
Note that SpectraST flags are single-dash, multi-letter (sometimes with underscores), with the argument concatenated directly onto the flag. Thanks.
Its argument parser is very funky, so be careful here.
It also doesn't check whether illegal flags are given; they pass silently instead, grumble.
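To illustrate the flag anatomy (the probability value and input filename below are made up; the flags themselves are the same ones used by the SpectrastCreateSpecLib process further down): -cN<name> sets the output name, -cP<probability> the probability cutoff, and -cf<predicate> a filter, each with its value glued directly onto the flag:
spectrast -cNSpecLib -cIHCD -cP0.85 -cf"Protein! ~ DECOY_" interact.pep.xml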
BEGIN {FS=" "; OFS=" "} NR==1 { for (i=1; i<=NF; i++) { f[$i] = i } } NR>1 { print $(f["PeptideSequence"]), $(f["NormalizedRetentionTime"]) }
BEGIN {FS=" "; OFS=" "} # we set the column names so that we can look them up later NR==1 { for (i=1; i<=NF; i++) { f[$i] = i } } # use only the last entry in the table per peptide sequence NR>1 { irt_by_sequence[$(f["PeptideSequence"])] = $(f["NormalizedRetentionTime"]) peptide_sequences[$(f["PeptideSequence"])]=$(f["PeptideSequence"]) } END { for (sequence in peptide_sequences) print(sequence,irt_by_sequence[sequence]) }
TargetedFileConverter from OpenMS
process CreateSpectrastIrtFile { input: file irt_traml from Channel.fromPath(params.irt_traml_file) output: file ("irt.txt") into irt_txt_ch script: intermediate_tsv="intermediate_irt.tsv" """ TargetedFileConverter -in $irt_traml -out_type tsv -out $intermediate_tsv """ + ''' awk ' BEGIN {FS=" "; OFS=" "} NR==1 { for (i=1; i<=NF; i++) { f[$i] = i } } NR>1 { print $(f["PeptideSequence"]), $(f["NormalizedRetentionTime"]) }' ''' + "$intermediate_tsv > irt.txt" }
// spectrast will create *.splib, *.spidx, *.pepidx, // note that where-ever a splib goes, so must its spidx and pepidx ///and they must have the same part process SpectrastCreateSpecLib { input: file irtfile from irt_txt_ch file combined_search_results from combined_search_results_ch.first() file fastadb_with_decoy from joined_fasta_with_decoys_ch.first() file spectra from maybespectra_ch.spectrast.toSortedList() val cutoff from minimum_peptide_probability output: tuple file ("${prefix}_cons.splib"), file("${prefix}_cons.spidx") into spectrast_ch file("${prefix}_cons.sptxt") into consensus_lib_sptxt_ch script: prefix = "SpecLib" to_run = "spectrast -cN${prefix} -cIHCD -cf\"Protein! ~ DECOY_\" -cP$cutoff -c_IRR " if (params.use_irt) to_run += "-c_IRT$irtfile " to_run += "$combined_search_results" // spectrast really wants its input-files last. to_run += "\n spectrast -cN${prefix}_cons -cD$fastadb_with_decoy -cIHCD -cAC ${prefix}.splib" }
This file is from the original gladiator implementation; its exact source is unclear (the author forgot).
LGGNEQVTR -28.308 GAGSSEPVTGLDAK 0.227 VEATFGVDESNAK 13.1078 YILAGVENSK 22.3798 TPVISGGPYEYR 28.9999 TPVITGAPYEYR 33.6311 DGLDAASYYAPVR 43.2819 ADVTPADFSEWSK 54.969 GTFIIDPGGVIR 71.3819 GTFIIDPAAVIR 86.7152 LFLQFGAQGSPFLK 98.0897
// white-space-delimited file of peptide sequences and internal retention times // whether or not to use the retention-time information params.use_irt=true params.irt_traml_file = "iRTAssayLibrary.TraML"
Here we forward declare consensus_pseudospectra_openswath_library_tsv so that we can later redirect it.
consensus_pseudospectra_openswath_library_tsv = Channel.create()
process Spectrast2OpenSwathTsv { /* Choice parts of sprectrast2.tsv --help spectrast2tsv.py --------------- This script is used as filter from spectraST files to swath input files. python spectrast2tsv.py [options] spectrast_file(s) -d Remove duplicate masses from labeling -e Use theoretical mass -k output_key Select the output provided. Keys available: openswath, peakview. Default: peakview -l mass_limits Lower and upper mass limits of fragment ions. Example: -l 400,2000 -s ion_series List of ion series to be used. Example: -s y,b -w swaths_file File containing the swath ranges. This is used to remove transitions with Q3 falling in the swath mass range. (line breaks in windows/unix format) -n int Max number of reported ions per peptide/z. Default: 20 -o int Min number of reported ions per peptide/z. Default: 3 -a outfile Output file name (default: appends _peakview.txt) */ input: file swath_windows from swath_windows_ch.first() file sptxt from consensus_lib_sptxt_ch.first() output: file consensus_pseudospectra_openswath_library_tsv script: consensus_pseudospectra_openswath_library_tsv="SpecLib_cons_openswath.tsv" """ MINWINDOW=`head -n1 $swath_windows | cut -d' ' -f1` MAXWINDOW=`tail -n1 $swath_windows | cut -d' ' -f2` spectrast2tsv.py -l \$MINWINDOW,\$MAXWINDOW -s y,b -d -e -o 6 -n 6 -w $swath_windows -k openswath -a $consensus_pseudospectra_openswath_library_tsv $sptxt """ }
5.5.2. Building Spectral library from Machine learning
DeepDIA can predict the spectral library from peptide lists. See also its documentation: https://github.com/lmsac/DeepDIA/raw/master/README.md https://github.com/lmsac/DeepDIA/raw/master/docs/predict_detectability.md https://doi.org/10.1038/s41467-019-13866-z
if(libgen_method_is_enabled("deepdia",params)) {
params._deepdia_url = "https://github.com/lmsac/DeepDIA/raw/c5ad2aa50218fcdfd1d441714702e605fdb00bb3" // float or null // if null, do not use minimum detectability filtering // if a float, filter params.deepdia_min_detectability = null params.deepdia_detectability_model = "${params._deepdia_url}/data/models/detectability/epoch_004.hdf5"
// list tuples in the form of // [charge, model, peptidelist] params.deepdia_ms2_entries = [ ["2", "${params._deepdia_url}/data/models/charge2/epoch_035.hdf5", ], ["3", "${params._deepdia_url}/data/models/charge3/epoch_034.hdf5", ]]
Channel
.from(params.deepdia_ms2_entries)
.map( {
charge, model ->
tuple(charge, file(model))})
.set{deepdia_ms2_models}
process DeepDIADigestProtein { input: file joined_fasta from joined_fasta_database_ch output: file deepdia_peptide_list script: deepdia_peptide_list="deepdia_peptide_list.csv" """ digest_proteins.py --in $joined_fasta --out $deepdia_peptide_list --no-group_duplicated """ }
If we do detectability filtering, we combine the filtered peptide list with the models; otherwise, the unfiltered one.
deepdia_peptide_list = Channel.create() if (params.deepdia_min_detectability != null){ deepdia_peptide_list.set{deepdia_prefilt_peptide_list} deepdia_filtered_peptide_list = Channel.create() deepdia_filtered_peptide_list .tap{deepdia_peptides_for_retention_pred} .tap{deepdia_peptides_for_library} .combine(deepdia_ms2_models) .set{deepdia_ms2_inputs_ch} } else { deepdia_peptide_list .tap{deepdia_peptides_for_retention_pred} .tap{deepdia_peptides_for_library} .combine(deepdia_ms2_models) .set{deepdia_ms2_inputs_ch} }
So here we predict the detectability of the peptides and filter by it, if requested.
if (params.deepdia_min_detectability != null){ // we seperate these two so that --resume allows for easy tweaking of --minimum-detectability process DeepDIATrainDetectibility { memory '64 GB' input: file model from Channel.fromPath(params.deepdia_detectability_model) file deepdia_prefilt_peptide_list output: set file(deepdia_detectability_prediction), file(deepdia_prefilt_peptide_list) into deepdia_detectability script: deepdia_detectability_prediction="${deepdia_prefilt_peptide_list.baseName}.detectability.csv" "predict_detectability.py --in $deepdia_prefilt_peptide_list --model $model --out $deepdia_detectability_prediction" } process DeepDIAMinimumDetectabilityFiltering { input: set file(detectability_prediction), file(prefilt_peptide_list) from deepdia_detectability val min_detectability from Channel.value(params.deepdia_min_detectability) output: file deepdia_filtered_peptide_list script: deepdia_filtered_peptide_list="deepdia_filtered_peptide_list.csv" """ filter_peptide_by_detectability.py --peptide $prefilt_peptide_list --detect $detectability_prediction --min_detectability $min_detectability --out ${deepdia_filtered_peptide_list} """ } }
Then we predict the MS2 spectra.
process DeepDIAPredictCharge { memory '64 GB' input: set file(peptides),charge,file(model) from deepdia_ms2_inputs_ch output: file deepdia_ions script: deepdia_ions="predictions.charge.${charge}.ions.json" """ predict_ms2.py --charge $charge --in $peptides --model $model --out $deepdia_ions """ }
params.deepdia_irt_model = "${params._deepdia_url}/data/models/irt/epoch_082.hdf5" // params.deepdia_peptides = // "${params._deepdia_url}/data/peptides/Pan_human.peptide.csv"
deepdia_irt_model = Channel.fromPath(params.deepdia_irt_model) deepdia_speclib = Channel.create()
process DeepDIAPredictRetention { memory '32 GB' input: file deepdia_irt_model file deepdia_peptides_for_retention_pred output: file predicted_rt script: predicted_rt="prediction.irt.csv" """ predict_rt.py --in $deepdia_peptides_for_retention_pred --model $deepdia_irt_model --out $predicted_rt """ }
process DeepDIAPredictionsToLibrary { memory '32 GB' input: file predicted_rt file ions from deepdia_ions.toSortedList() file deepdia_peptides_for_library output: file deepdia_speclib script: deepdia_speclib="speclib.tsv" """ build_assays_from_prediction.py --peptide ${deepdia_peptides_for_library} --rt ${predicted_rt} --ions ${ions} --out prediction.assay.pickle convert_assays_to_openswath.py --in prediction.assay.pickle --out ${deepdia_speclib} """ }
deepdia_speclib.set{speclib_tsv_for_decoys}
} // end of deepdia guard
5.5.3. Supplying a custom spectral library
You can also supply a custom spectral library by setting --speclib to a glob/path pointing at your custom-generated library.
This should be in a format that OpenMS's TargetedFileConverter understands, so MaxQuant or OpenMS formats work.
To check, one can manually inspect (using the gladiator-guix container) whether the output of TargetedFileConverter -in yourlibrary -out speclib.tsv looks proper.
Pay special attention to whether the modified-peptides field is parsed correctly; one bug observed by this developer is that this field gets repeated when it is not parsed correctly.
See also https://abibuilder.cs.uni-tuebingen.de/archive/openms/Documentation/release/2.7.0/html/classOpenMS_1_1TransitionTSVFile.html
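For example (a sketch to be run inside the gladiator-guix container; yourlibrary.tsv is a placeholder for your own file):
TargetedFileConverter -in yourlibrary.tsv -out speclib.tsv
head speclib.tsv   # eyeball the columns, in particular the modified-peptide annotations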
// if the "custom" deconvolution/libgen method is enabled, // set this to spectral library tsvs in maxquant / openms / any input format that openms TargetedFileConverter understands params.speclib = null
if (libgen_method_is_enabled("custom", params)){ Channel.fromPath(params.speclib).set{speclib_tsv_for_decoys} }
5.5.4. combining various spectral libraries into one.
One would need to use OpenMS TargetedFileConverter to convert to an openswath-like tsv, then msproteomicstools tsv2spectrast.py to turn that into a SpectraST library, and then merge with spectrast using either -cJA or -cJU http://tools.proteomecenter.org/wiki/index.php?title=Software:SpectraST#Creating_Consensus_Libraries See also Schubert et al (https://doi.org/10.1038/nprot.2015.015)
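A rough sketch of those steps (not implemented in the pipeline; file names are placeholders, and the exact tsv2spectrast.py arguments are left to its --help):
TargetedFileConverter -in libraryA.TraML -out libraryA.tsv -out_type tsv
tsv2spectrast.py ...   # turn the openswath-style tsv into a .splib; see its --help for the flags
spectrast -cNMergedLib -cJU libraryA.splib libraryB.splib   # or -cJA, see the SpectraST wiki above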
5.6. OpenSwathDecoys
spectrast2tsv.py is from msproteomicstools
OpenSwathDecoyGenerator is from OpenMS TOPP
// optional params.openswath_transitions = "" // Minimum decoy fraction for open swath decoy generator // if left unset, gladiator might pick an appropriate one depending on your deconvolution method, // should be a fraction between 0.0 and 1.0 params.oswdg_min_decoy_fraction = null
if(params.oswdg_min_decoy_fraction != null) { Channel.value("-min_decoy_fraction ${params.oswdg_min_decoy_fraction}").set{oswdg_args} } else if (libgen_method_is_enabled("deepdia",params)){ Channel.value("-min_decoy_fraction 0.0").set{oswdg_args} } else { Channel.value("").set{oswdg_args} }
With the DeepDIA method there seem to be some problems with generating decoys, so in that case we set -min_decoy_fraction to 0.0.
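If you want to pick the fraction yourself, you can override it on the command line (illustrative value):
nextflow run gladiator.nf --oswdg_min_decoy_fraction=0.2   # plus your usual -c config and input flags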
process AddDecoysToOpenSwathTransitions { input: file speclib_tsv from speclib_tsv_for_decoys.first() val oswdg_args output: file outputfile into openswath_transitions_ch script: outputfile="SpecLib_cons_decoys.pqp" """ TargetedFileConverter -in $speclib_tsv -out SpecLib_cons.TraML OpenSwathDecoyGenerator -decoy_tag DECOY_ -in SpecLib_cons.TraML -out $outputfile -method reverse $oswdg_args """ }
TargetedFileConverter
from openms
/usr/bin/TargetedFileConverter: error while loading shared libraries: libQt5Core.so.5: cannot open shared object file: No such file or directory
Here we might not actually need TargetedFileConverter; we could give the tsv directly to OpenSwathDecoyGenerator and pass the resulting tsv to OpenSwathWorkflow as -tr.
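A sketch of that alternative (untested here; flags copied from the process above):
OpenSwathDecoyGenerator -decoy_tag DECOY_ -in SpecLib_cons.tsv -out SpecLib_cons_decoys.tsv -method reverse
# then hand SpecLib_cons_decoys.tsv to OpenSwathWorkflow via -tr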
5.7. Building Dia Matrix
https://openswath.org/_/downloads/en/latest/pdf/ https://openswath.org/_/downloads/en/latest/htmlzip/
5.7.1. OpenSwathWorkflow
Creates tsv with -out_tsv
Using the cache will decrease memory usage at the cost of extra writes & time.
// whether to use -readOptions cacheWorkingInMemory in OSW // this actually crashes so disabled params.osw_use_cache = false // extra flags to pass to OSW params.osw_extra_flags = ""
The transition library is larger if the deconvolution method is DeepDIA, which consumes more memory.
To use -out_osw, -tr needs to be a pqp file.
If we are using legacy pyprophet, we will need to create a tsv.
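For reference, converting an openswath-style tsv library into a pqp can presumably also be done with TargetedFileConverter (a sketch; in this pipeline the pqp is instead produced directly by the decoy-generation step above):
TargetedFileConverter -in SpecLib_cons.tsv -out SpecLib_cons.pqp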
process OpenSwathWorkflow_legacy { memory { 16.GB * (libgen_method_is_enabled("deepdia",params) ? 2 : 1 )} input: file dia_mzml_file from dia_mzml_files_ch.osw // file openswath_transitions from Channel.fromPath("data_from_original/bruderer-pwiz-no-dda/SpecLib_cons_decoy.TraML").first() file openswath_transitions from openswath_transitions_ch_for_legacy.first() file swath_truncated_windows from truncated_swath_windows_ch.first() file irt_traml from Channel.fromPath(params.irt_traml_file).first() output: file dia_tsv_file into openswath_tsv_ch script: dia_tsv_file = "${dia_mzml_file.baseName}-DIA.tsv" to_execute = "OpenSwathWorkflow " + "-force " + "-in $dia_mzml_file " + "-tr $openswath_transitions " + "-threads ${task.cpus} " + "-min_upper_edge_dist 1 " + "-sort_swath_maps " + "-out_tsv ${dia_tsv_file} " + "-swath_windows_file $swath_truncated_windows " + params.osw_extra_flags + " " if (params.use_irt) to_execute += "-tr_irt $irt_traml " to_execute }
If we are using nonlegacy pyprophet, we will need to create an osw.
openswath_osw_indirect_ch = Channel.create() openswath_osw_indirect_ch.multiMap{ it -> pyprophet_subsample: pyprophet_score : it}.set{openswath_osw_ch} process OpenSwathWorkflow { memory { 16.GB * (libgen_method_is_enabled("deepdia",params) ? 2 : 1 )} input: file dia_mzml_file from dia_mzml_files_ch.osw // file openswath_transitions from Channel.fromPath("data_from_original/bruderer-pwiz-no-dda/SpecLib_cons_decoy.TraML").first() file openswath_transitions from openswath_transitions_ch_for_nonlegacy.first() file swath_truncated_windows from truncated_swath_windows_ch.first() file irt_traml from Channel.fromPath(params.irt_traml_file).first() output: file dia_osw_file into openswath_osw_indirect_ch script: dia_osw_file = "${dia_mzml_file.baseName}-DIA.osw" to_execute = "OpenSwathWorkflow " + "-force " + "-in $dia_mzml_file " + "-tr $openswath_transitions " + "-threads ${task.cpus} " + "-min_upper_edge_dist 1 " + "-sort_swath_maps " + "-out_osw ${dia_osw_file} " + "-swath_windows_file $swath_truncated_windows " + params.osw_extra_flags + " " if (params.use_irt) to_execute += "-tr_irt $irt_traml " to_execute }
Then here we choose which one to use
// we will need the osw to go to various processes if (params.pyprophet_use_legacy){ openswath_transitions_ch.into{openswath_transitions_ch_for_legacy} <<nf-openswathworkflow-for-pyprophet-legacy>> } else { openswath_transitions_ch.into{openswath_transitions_ch_for_nonlegacy;openswath_transitions_ch_for_pyprophet} <<nf-openswathworkflow-for-pyprophet-nonlegacy>> }
(Apparently these two can't have the same name, even if they are conditionally declared.)
Extraction windows have a gap. Will abort (override with -force)
An OpenSwathWorkflow invocation can output either tsv or osw, not both; otherwise you will get exit status 8:
Error: Unexpected internal error (Either out_features, out_tsv or out_osw needs to be set (but not two or three at the same time))
5.7.1.1. What is the relation between irt.txt
and iRTAssayLibrary.TraML
The irt.txt
seems to contain pairs of
CompoundList/Peptide@sequence
and CompoundList/Peptide/RetentionTimeList/cvParam[@name="normalized retention time"]@value
of the traml file, except
LFLQFGAQGSPFLK 98.0897
is not present in the traml file.
The TraML was not obtained from here: https://db.systemsbiology.net/sbeams/cgi/PeptideAtlas/PASS_View?identifier=PASS00779 file://ftp:PASS00779@ftp.peptideatlas.org:/ (but it is comparable)
<OA> The files have been downloaded from net and for now they are intended to be used "as is". If there becomes a need to modify those, then there will be a need figure out how to do it. So, it is very much unexplored how to generate those files for now. [10:56] <NA> looks like the .txt contains pairs of "sequence - Normalized retention time" from the traml. The only one in the .txt that isnt in the traml seems to be LFLQFGAQGSPFLK [10:59] <NA> where did you download them froM? [11:00] <OA> That is a good question. I guess it was some example dataset for openswath, but I don't remember which. If you have time and interest, you could try to figure out how iRT assay library should be built. [11:07]
Retention time normalization
The retention time normalization peptides are provided using the optional parameter tr_irt in TraML format. We suggest to use the iRTassays.TraML file provided in the tutorial dataset, if the Biognosys iRT-kit was used during sample preparation.
If the iRT-kit was not used, it is highly recommended to use or generate a set of endogenous peptides for RT normalization. A recent publication [5] provides such a set of CiRT peptides suitable for many eukaryotic samples. The TraML file from the supplementary information can be used as input for tr_irt. Since not all CiRT peptides might be found, the flag RTNormalization:estimateBestPeptides should be set to improve initial filtering of poor signals. Further parameters for optimization can be found when invoking OpenSwathWorkflow --helphelp under the RTNormalization section. Those do not require adjustment for most common sample types and LC-MS/MS setups, but might be useful to tweak for specific scenarios.
5 Röst HL, Liu Y, D’Agostino G, Zanella M, Navarro P, Rosenberger G, Collins BC, Gillet L, Testa G, Malmström L, Aebersold R. TRIC: an automated alignment strategy for reproducible protein quantification in targeted proteomics. Nat Methods. 2016 Sep;13(9):777-83. doi: 10.1038/nmeth.3954. Epub 2016 Aug 1. PMID: 27479329
https://doi.org/10.1038/nmeth.3954 file://ftp:PASS00788@ftp.peptideatlas.org:/
also not from here.
It seems the TraML is based on file://ftp:PASS00289@ftp.peptideatlas.org:/SGS/assays/OpenSWATH_SM4_iRT_AssayLibrary.TraML
which also has the same iRT times as iRT.txt, except it is still missing LFLQFGAQGSPFLK.
The same irt.txt can be found in https://github.com/CaronLab/Allele-specific-library-scripts/blob/main/iRT.txt, published before gladiator, and the same retention times are also hardcoded in https://github.com/msproteomicstools/msproteomicstools/raw/master/analysis/spectral_libs/spectrast_updateiRTs.py
OpenMS/src/tests/class_tests/openms/data/MRMDecoyGenerator_input.TraML has the same iRTs and also contains LFLQFGAQGSPFLK, so that might also be a good target.
5.7.2. Pyprophet
https://github.com/PyProphet/pyprophet
pyprophet
License: 3-clause BSD
https://doi.org/10.1093/bioinformatics/btu686
https://doi.org/10.1038/s42003-023-04977-x
Pyprophet aggregates various quality scores into one score, the d_score (discriminant score); it is based on mProphet https://doi.org/10.1038/nmeth.1584
5.7.2.1. Legacy Pyprophet
if (params.pyprophet_use_legacy) process pyprophet_legacy { publishDir "${params.outdir}/pyprophet/", pattern: "*.csv" publishDir "${params.outdir}/reports/pyprophet/", pattern: "*.pdf" input: file dia_tsv from openswath_tsv_ch output: file dscore_csv into pyprophet_legacy_ch // just for publishing file "${dia_tsv.baseName}_report.pdf" script: seed="928418756933579397" dscore_csv="${dia_tsv.baseName}_with_dscore.csv" """ pyprophet --random_seed=${seed} --delim=tab --export.mayu ${dia_tsv} --ignore.invalid_score_columns """ }
WRITTEN: B_D140314_SGSDSsample1_R01_MHRM_T0.mzML-DIA_summary_stat.csv WRITTEN: B_D140314_SGSDSsample1_R01_MHRM_T0.mzML-DIA_full_stat.csv WRITTEN: B_D140314_SGSDSsample1_R01_MHRM_T0.mzML-DIA_with_dscore.csv WRITTEN: B_D140314_SGSDSsample1_R01_MHRM_T0.mzML-DIA_with_dscore_filtered.csv WRITTEN: B_D140314_SGSDSsample1_R01_MHRM_T0.mzML-DIA_report.pdf WRITTEN: B_D140314_SGSDSsample1_R01_MHRM_T0.mzML-DIA_cutoffs.txt WRITTEN: B_D140314_SGSDSsample1_R01_MHRM_T0.mzML-DIA_svalues.txt WRITTEN: B_D140314_SGSDSsample1_R01_MHRM_T0.mzML-DIA_qvalues.txt WRITTEN: B_D140314_SGSDSsample1_R01_MHRM_T0.mzML-DIA_dscores_top_target_peaks.txt WRITTEN: B_D140314_SGSDSsample1_R01_MHRM_T0.mzML-DIA_dscores_top_decoy_peaks.txt WRITTEN: B_D140314_SGSDSsample1_R01_MHRM_T0.mzML-DIA_mayu.cutoff WRITTEN: B_D140314_SGSDSsample1_R01_MHRM_T0.mzML-DIA_mayu.fasta WRITTEN: B_D140314_SGSDSsample1_R01_MHRM_T0.mzML-DIA_mayu.csv WRITTEN: B_D140314_SGSDSsample1_R01_MHRM_T0.mzML-DIA_scorer.bin WRITTEN: B_D140314_SGSDSsample1_R01_MHRM_T0.mzML-DIA_weights.txt
If you get an error here such as
raise Exception("got empty input file")
try running with params.use_irt=false
5.7.2.2. Nonlegacy Pyprophet
http://www.openswath.org/en/latest/docs/pyprophet.html
In order to make the pyprophet step less intensive, by default we subsample to 1 / nr_samples, as suggested in the pyprophet guide.
However, you can supply your own ratio here:
// The ratio (0,1] to subsample by in pyprophet. // leave to null to use 1 / nr_samples params.pyprophet_subsample_ratio = null
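For example, with four DIA runs the default works out to 1/4 = 0.25; pinning it explicitly would look like (illustrative invocation):
nextflow run gladiator.nf --pyprophet_subsample_ratio=0.25   # plus your usual -c config and input flags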
If the ratio is left unset, we calculate it from the number of runs:
if(params.pyprophet_subsample_ratio == null){ Channel.value(1 / Math.max(Channel.fromPath(params.diafiles).toSortedList().size().getVal(), 1)) .set{subsample_ratio} } else { Channel.value(params.pyprophet_subsample_ratio).set{subsample_ratio} }
In pyprophet subsample and pyprophet score (of nonlegacy pyprophet), we need to pass --test to avoid random behaviour; this makes the SQL-side behaviour deterministic.
Because SQL doesn't allow setting a seed, pyprophet will simply select the first subsample_ratio fraction of transitions.
This can result in no decoy peptides ending up in the subsampled osw's, which pyprophet indicates with the error:
Error: At least 10 decoy groups and 10 target groups are required.
This leaves you with one of three choices (see the command lines after this list):
- passing --pyprophet_subsample_ratio=1.0, sacrificing runtime
- passing --pyprophet_fixed_seed=false, sacrificing reproducibility
- passing --pyprophet_use_legacy=true, sacrificing "up-to-dateness"
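As command lines, the three workarounds look like this (illustrative; pick exactly one, and add your usual -c config and input flags):
nextflow run gladiator.nf --pyprophet_subsample_ratio=1.0
nextflow run gladiator.nf --pyprophet_fixed_seed=false
nextflow run gladiator.nf --pyprophet_use_legacy=true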
params.pyprophet_fixed_seed=true params.pyprophet_use_legacy=false
Below we are following the steps of http://www.openswath.org/en/latest/docs/pyprophet.html#scaling-up
Any pyprophet invocation without an --out flag will overwrite its --in file; here we only do that when the --in file was created inside the same process.
if (!params.pyprophet_use_legacy)
{
process pyprophet_subsample { input: file dia_osw_file from openswath_osw_ch.pyprophet_subsample val subsample_ratio output: file subsampled_osw script: subsampled_osw="${dia_osw_file.baseName}.osws" pyprophet_seed_flag=(params.pyprophet_fixed_seed ? "--test" : "--no-test") """ pyprophet subsample $pyprophet_seed_flag --in=$dia_osw_file --out=$subsampled_osw --subsample_ratio=$subsample_ratio """ }
pyprophet score requires significantly more memory than pyprophet merge; should we split this?
If pyprophet merge or pyprophet score complains about no decoys being found, you can try disabling the fixed seed (--pyprophet_fixed_seed=false).
process pyprophet_learn_classifier { input: file subsampled_osws from subsampled_osw.toSortedList() file openswath_transitions from openswath_transitions_ch_for_pyprophet.first() output: file osw_model script: pyprophet_seed_flag=(params.pyprophet_fixed_seed ? "--test" : "--no-test") osw_model="model.osw" """ pyprophet merge --template=$openswath_transitions --out=$osw_model $subsampled_osws pyprophet score $pyprophet_seed_flag --in=$osw_model --level=ms1ms2 """ }
scored_osw_indirect_ch =Channel.create() scored_osw_indirect_ch.multiMap{it -> reduce: backpropagate: it}.set{scored_osw_ch} process pyprophet_apply_classifier { input: file osw_model from osw_model.first() file osw from openswath_osw_ch.pyprophet_score output: file scored_osw into scored_osw_indirect_ch script: pyprophet_seed_flag=(params.pyprophet_fixed_seed ? "--test" : "--no-test") scored_osw="${osw.baseName}.scored.${osw.Extension}" """ pyprophet score $pyprophet_seed_flag --in=$osw --out=$scored_osw --apply_weights=$osw_model --level=ms1ms2 """ }
process pyprophet_reduce { input: file scored_osw from scored_osw_ch.reduce output: file reduced_scored_osw script: reduced_scored_osw="${file(scored_osw.baseName).baseName}.${scored_osw.Extension}r" """ pyprophet reduce --in=$scored_osw --out=$reduced_scored_osw """ }
process pyprophet_control_error { input: file reduced_scored_osws from reduced_scored_osw.toSortedList() file osw_model from osw_model.first() output: file osw_global_model script: osw_global_model="model_global.osw" """ pyprophet merge --template=$osw_model --out=$osw_global_model $reduced_scored_osws pyprophet peptide --context=global --in=$osw_global_model pyprophet protein --context=global --in=$osw_global_model """ }
process pyprophet_backpropagate { input: file osw_scored from scored_osw_ch.backpropagate file osw_global_model from osw_global_model.first() output: file dscore_tsv into pyprophet_nonlegacy_ch script: base="${file(osw_scored.baseName).baseName}" backproposw="${base}.backprop.osw" dscore_tsv="${base}.tsv" """ pyprophet backpropagate --in="$osw_scored" --apply_scores="$osw_global_model" --out=$backproposw # we supply --format=legacy_merged so that pyprophet export respect the --out parameter pyprophet export --in=$backproposw --format=legacy_merged --out=$dscore_tsv """ }
https://github.com/PyProphet/pyprophet/issues/49
}
5.7.2.2.1. KeyError: 'HOME' in pyprophet_subsample
If in this section you get the error
Command error: File "/gnu/store/lvip6h5pamjwmvnkwg60sjb63ph8698k-python-3.9.9/lib/python3.9/os.py", line 679, in getitem raise KeyError(key) from None KeyError: 'HOME'
and you are using podman or docker, you will want to add to your config file the following:
process { withName: 'pyprophet_.*' { containerOptions= '--env HOME="$PWD"' }}
This will set the home directory to the working directory in the pyprophet processes.
(Note the single quotes (') rather than double quotes (") so that $PWD is not expanded too early.)
5.7.2.3. Choosing between legacy and nonlegacy pyprophet
Here we choose between legacy or nonlegacy pyprophet
if (params.pyprophet_use_legacy) pyprophet_legacy_ch.set{pyprophet_ch} else pyprophet_nonlegacy_ch.set{pyprophet_ch}
5.7.3. feature-alignment
http://www.openswath.org/en/latest/docs/tric.html
of msproteomicstools/analysis/alignment/feature_alignment.py
Note: if things fail here because the right FDR cannot be reached, you can try changing precursor_mass_tolerance and fragment_mass_tolerance earlier upstream.
Additionally, you can try setting --use_irt to false if you are getting zero-division errors because the input tsvs are empty.
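For instance (illustrative values only, not recommendations):
nextflow run gladiator.nf --precursor_mass_tolerance=20 --fragment_mass_tolerance=0.05 --use_irt=false   # plus your usual -c config and input flags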
process feature_alignment { publishDir "${params.outdir}/dia/" input: file dscore_csvs from pyprophet_ch.toSortedList() output: file outfile into feature_alignment_ch script: outfile = "DIA-analysis-results.csv" if (params.use_irt) { realign_method = "diRT" } else { realign_method = "linear" } "feature_alignment.py " + "--method best_overall " + "--max_rt_diff 90 " + "--target_fdr $params.tric_target_fdr " + "--max_fdr_quality $params.tric_max_fdr " + "--in $dscore_csvs " + // will this break on filenames with spaces " --realign_method $realign_method " + "--out $outfile" }
// Target FDR used in TRIC alignment in diRT mode [default 0.01] // This was "trig_target_pvalue" in the original python gladiator implementation params.tric_target_fdr=0.01 // Maximum FDR for TRIC alignment in diRT mode [default 0.05] // This was "trig_max_pvalue" in the original python gladiator implementation params.tric_max_fdr=0.05
5.7.4. Swath2stats
suppressPackageStartupMessages(library(SWATH2stats)) suppressPackageStartupMessages(library(data.table)) remove_irt <- function(df) df[grep("iRT", df[["ProteinName"]], invert=TRUE, fixed=TRUE),, drop=FALSE] ## original gladiator decoy removing behaviour remove_decoy_strict <- function(df,decoyprefix) df[grep(decoyprefix, df[["ProteinName"]], invert=TRUE, fixed=TRUE),, drop=FALSE] remove_decoy_loose <- function(df) df[!df[["decoy"]],, drop = FALSE] basename_sans_ext <- function(f) unlist(strsplit(basename(f), ".",fixed=TRUE))[[1]] main <- function(diafile, strict_checking=FALSE, peptideoutputfile="", proteinoutputfile="", decoyprefix="DECOY_") { remove_decoy <- `if`(strict_checking, function(df) remove_decoy_strict(df,decoyprefix), remove_decoy_loose) filtered_data <- data.table::fread(diafile,header=TRUE) |> data.frame(stringsAsFactors = FALSE) |> within(run_id <- basename(filename)) |> SWATH2stats::reduce_OpenSWATH_output() |> remove_irt() |> remove_decoy() # Writing output filtered_data |> SWATH2stats::write_matrix_peptides(filename = basename_sans_ext(peptideoutputfile)) |> write.table(sep="\t",file=peptideoutputfile,row.names = FALSE) filtered_data |> SWATH2stats::write_matrix_proteins(filename = basename_sans_ext(proteinoutputfile)) |> write.table(sep="\t",file=proteinoutputfile,row.names = FALSE) }
Where the field "Decoy" is 1, thats a decoy generated by OpenSwathDecoyGenerator
,
rather than from the fastadataasee of DecoyDatabase
process swath2stats { publishDir "${params.outdir}/dia/" input: file dia_score from feature_alignment_ch output: file peptide_matrix file protein_matrix script: strict_checking=params.swath2stats_strict_checking peptide_matrix="DIA-peptide-matrix.tsv" protein_matrix="DIA-protein-matrix.tsv" """ #!/usr/bin/env Rscript """ + ''' <<r-swath2stats>> ''' + """ main("$dia_score", strict_checking = as.logical("$strict_checking"), peptideoutputfile="$peptide_matrix", proteinoutputfile="$protein_matrix", decoyprefix="DECOY_") """ }
// whether to exclude in final DIA matrices // proteins of which // any peptide can be a decoy. // this is the default behaviour of original gladiator implementation. // if set to false, just instead remove anything // that has the "decoy" column set to false params.swath2stats_strict_checking=true
5.7.4.1. Dependencies
("swath2stats" "r-minimal" "r-swath2stats")
The above dependencies, as an R install script:
#!/usr/bin/env Rscript .libPaths("/opt/Rlibs/") if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install(version = "3.15",ask=FALSE,update=FALSE) BiocManager::install("SWATH2stats", ask=FALSE,update=FALSE)
6. Configuration for backends
6.1. Defining the config template
Here we define the registry that hosts our images:
docker.io/elolabfi/
This is the template that we fill in with the container manager program to be used (%p), the gladiator container URI (%G), the DeepDIA container (%D), the nonlegacy pyprophet container (%N), the diams2pep container (%V), and the legacy pyprophet container (%P).
%p.enabled=true process { container='%G' withName: 'DeepDIA.*' { container='%D' } // for nonlegacy pyprophet processes withName: 'pyprophet_.*' { container='%N' } // because this is the more specific rule // we apply it last, so that it overrides the 'pyprophet_.*' rule if // this rule also applies withName: pyprophet_legacy { container='%P' } // for perl-diamspep withName: 'DIAMS2PEP_.*'{ // there is really no good single letter here container='%V' } }
And here we write the logic that fills the template
(defvar container-property-alist (let ((remote-registry registry)) `(("remote" ("singularity" ("access-protocol" . "docker://") ("registry" . ,remote-registry) ("container-suffix" . ":0.1.4-0-f7abd8")) ("docker" ("access-protocol" . "") ("registry" . ,remote-registry) ("container-suffix" . ":0.1.4-0-f7abd8")) ("podman" ("access-protocol" . "docker://") ("registry" . ,remote-registry) ("container-suffix" . ":0.1.4-0-f7abd8"))) ("local" ("singularity" ("access-protocol" . "file://") ("registry" . "containers/") ("container-suffix" . ".simg")) ("docker" ("access-protocol" . "") ("registry" . "localhost/") ("container-suffix" . "")) ("podman" ("access-protocol" . "") ("registry" . "localhost/") ("container-suffix" . "")))))) (defun construct-container-uri (basename program-name locality) (let ((properties (alist-get program-name (alist-get locality container-property-alist nil nil 'equal) nil nil 'equal))) (unless properties (error "Could not find properties for '%s' '%s'" program-name locality)) (format "%s%s%s%s" (alist-get "access-protocol" properties nil nil 'equal) (alist-get "registry" properties nil nil 'equal) basename (alist-get "container-suffix" properties nil nil 'equal)))) (cl-defun construct-container-config (template program-name locality &key (gladiator-image "gladiator")) (let ((program-name-key program-name)) (format-spec template `((?p . ,program-name) (?G . ,(construct-container-uri gladiator-image program-name-key locality)) (?D . ,(construct-container-uri "deepdia" program-name-key locality)) (?N . ,(construct-container-uri "pyprophet" program-name-key locality)) (?V . ,(construct-container-uri "diams2pep" program-name-key locality)) (?P . ,(construct-container-uri "pyprophet-legacy" program-name-key locality)))))) (construct-container-config template program-name locality :gladiator-image gladiator-image)
docker://registry.gitlab.utu.fi/elixirdianf/gladiator-notes/test
https://registry.gitlab.utu.fi/elixirdianf/gladiator-notes/gladiator
6.2. Per backend config files
So here we fill the templates that we defined in the previous section. If you are reading this as an org-mode file, this heading's default tangle argument makes each block's tangle output file be based on the name of the block.
This block is what's common to the singularity configurations:
singularity.runOptions = '-B $TMPDIR:/tmp' singularity.autoMounts=true
This block is what's common between docker and podman. On certain machines (reported on Ubuntu 22.04.5 LTS, kernel 6.2.0-26-generic, with podman 3.4.4),
matplotlib (used by pyprophet) tries to read the home directory, which does not exist in the container, resulting in the error described earlier.
process { withName: 'pyprophet_.*' { containerOptions= '--env HOME="$PWD"' } }
singularity.runOptions = '-B $TMPDIR:/tmp' singularity.autoMounts=true
<<construct-container-config("singularity","local")>> <<singularity-general-options>>
singularity.runOptions = '-B $TMPDIR:/tmp' singularity.autoMounts=true
<<construct-container-config("docker","remote")>> <<dockerlike-general-options>>
<<construct-container-config("docker","local")>> <<dockerlike-general-options>>
<<construct-container-config("podman","remote")>> <<dockerlike-general-options>>
<<construct-container-config("podman","local")>> <<dockerlike-general-options>>
process { beforeScript="source <(guix time-machine -C $projectDir/ci/guix/gladiator-guix-channels.scm -- shell --search-paths -m $projectDir/ci/guix/manifests/gladiator.scm -m $projectDir/ci/guix/manifests/nextflow-trace.scm)" withName: "DeepDIA.*" { beforeScript="source <(guix time-machine -C $projectDir/ci/guix/deepdia-channels.scm -- shell --search-paths -m $projectDir/ci/guix/manifests/deepdia.scm -m $projectDir/ci/guix/manifests/nextflow-trace.scm)" } // for nonlegacy pyprophet processes withName: "pyprophet_.*" { beforeScript="source <(guix time-machine -C $projectDir/ci/guix/gladiator-guix-channels.scm -- shell --search-paths -m $projectDir/ci/guix/manifests/pyprophet.scm -m $projectDir/ci/guix/manifests/nextflow-trace.scm)" } // because this is the more specific rule // we apply it last, so that it overrides the "pyprophet_.*" rule if // this rule also applies withName: pyprophet_legacy { beforeScript="source <(guix time-machine -C $projectDir/ci/guix/pyprophet-legacy-channels.scm -- shell --search-paths -m $projectDir/ci/guix/manifests/pyprophet-legacy.scm -m $projectDir/ci/guix/manifests/nextflow-trace.scm)" } // for perl-diamspep withName: "DIAMS2PEP_.*"{ // there is really no good single letter here beforeScript="source <(guix time-machine -C $projectDir/ci/guix/diams2pep-channels.scm -- shell --search-paths -m $projectDir/ci/guix/manifests/diams2pep.scm -m $projectDir/ci/guix/manifests/nextflow-trace.scm)" } withName: "DIAtracer" { beforeScript="source <(guix time-machine -C $projectDir/ci/guix/diatracer-channels.scm -- shell --search-paths -m $projectDir/ci/guix/manifests/diatracer.scm -m $projectDir/ci/guix/manifests/nextflow-trace.scm)" } }
6.3. Backend Specific Issues
6.3.1. Podman specific Issues
If you get the error in any process
Error: OCI runtime error: the requested cgroup controller `cpu` is not available
This is because nextflow sets the number of CPUs podman may use, but the cpu cgroup controller is not available to podman.
If, when you do
podman info --format={{".Host.CgroupControllers"}}
you do not see cpu
listed, but when you do
podman --cgroup-manager cgroupfs info --format={{".Host.CgroupControllers"}}
you do see cpu
listed,
you will want to add to your nextflow config file:
podman.engineOptions="--cgroup-manager cgroupfs"
7. SDRF support
You can supply a path to an sdrf file with --sdrf=filename
// path to a sdrf file params.sdrf = null
The standard is defined here: https://github.com/bigbio/proteomics-metadata-standard
See the annotated-projects directory there for examples; a good example is PXD003977.
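An illustrative invocation (the sdrf path here is just the example project above; point it at your own file):
nextflow run gladiator.nf --sdrf=annotated-projects/PXD003977/PXD003977.sdrf.tsv   # plus your usual -c config flags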
import nextflow.splitter.CsvSplitter def readSDRF(filename,params) { def sdrf_file=file(filename) def options=[:] def errors=[] def warnings=[] def handlers =[ this.&parseSDRFFileUri, {_sdrf_fields,_params -> parseSDRFTolerance(_sdrf_fields, _params, field_name = "comment[fragment mass tolerance]", flag_name="fragment_mass_tolerance",supported_units=["Da"])}, {_sdrf_fields,_params -> parseSDRFTolerance(_sdrf_fields, _params, field_name = "comment[precursor mass tolerance]", flag_name="precursor_mass_tolerance",supported_units=["ppm"])}, ] def sdrf_fields = new CsvSplitter().target(sdrf_file.text).options(header:true,sep:' ').list() sdrf_fields = sdrf_fields.collect( {entry -> entry.collectEntries( {key, row -> [key, (row =="not available" || row == "not applicable") ? null : row]})}) for(handler in handlers) { def val = handler(sdrf_fields, params); errors += val.errors warnings+= val.warnings options+=val.values } for(warning in warnings) print("WARNING: " + warnings) for(error in errors) print("ERROR: " + error) if(errors) throw new Exception("Errors in SDRF file, see above messages") return options }
Fields we handle:
7.1. comment[file uri]
We can handle remote and local files in comment[file uri], as long as they are in mzML format.
This can be used in lieu of params.diafiles.
(Other open formats would also be possible if we added an msconvert step first.)
this.&parseSDRFFileUri,
def parseSDRFFileUri(sdrf_fields, params){ def raise_error_on_non_mzml=true def retval= [values:[:],warnings:[],errors:[]] def field_name = 'comment[file uri]' def flag_name = '--diafiles' def files = sdrf_fields*.get(field_name) // here we intentionally don't break the switch statement so that we can accumulate retval.errors, // so that the user can know all at once. switch (true) { case (params.diafiles): retval.values+=[diafiles:params.diafiles] break; case(files == null): retval.errors += ["No column named '$field_name' in sdrf. Add one or supply files with $flag_name"]; break; case (isallnull(files) || !files): retval.errors += ["All retval.values are missing in supplied in sdrf '$field_name'. Enter these or supply CLI $flag_name];"] case (!isallnull(files) && issomenull(files)): retval.errors += ["Some entries in sdrf '$field_name' are not given, fill in or supply $flag_name"]; case (!isallnull(files) && raise_error_on_non_mzml && files.any({x -> x && file(x).getExtension() != "mzML"})): retval.errors += ["Some entries in sdrf '$field_name' are not mzML files"]; // if there were no retval.errors // then the files field in the sdrf is correct case(!retval.errors): retval.values+=[diafiles:files] break; case(retval.errors): retval.values+=[diafiles:null] break; } return retval }
7.2. comment[precursor mass tolerance]
, comment[fragment mass tolerance]
We require (for now) the precursor mass tolerance to be in ppm and the fragment mass tolerance to be in Da, and each to be the same for all the samples.
{_sdrf_fields,_params -> parseSDRFTolerance(_sdrf_fields, _params, field_name = "comment[fragment mass tolerance]", flag_name="fragment_mass_tolerance",supported_units=["Da"])}, {_sdrf_fields,_params -> parseSDRFTolerance(_sdrf_fields, _params, field_name = "comment[precursor mass tolerance]", flag_name="precursor_mass_tolerance",supported_units=["ppm"])},
def parseSDRFTolerance(sdrf_fields, params, field_name = "comment[precursor mass tolerance]", flag_name="precursor_mass_tolerance",supported_units=["ppm"]){ def retval = [values:[:], warnings:[], errors:[]] def tolerances = sdrf_fields*.get(field_name) switch(true){ case(params.get(flag_name) != null): retval.values[flag_name] = params.get(field_name); break case(tolerances == null): retval.errors += ["No column named '$field_name' in sdrf. Add one or supply --$flag_name"]; break; case (isallnull(tolerances) || !tolerances): retval.errors += ["All values are missing in supplied in sdrf '$field_name'. Enter these or supply CLI --$flag_name];"] case (!isallnull(tolerances) && !tolerances.every()): retval.errors += ["Some entries in sdrf '$field_name' are not given, fill in or supply $flag_name"]; case(tolerances.any({it && it.split().size() > 1 && !(supported_units.contains(it.split()[1]))})): retval.errors += ["Some entries in sdrf '$field_name' have an unsupported unit(supported units for $field_name: $supported_units)"]; case(tolerances.any({it && it.split().size() < 2})): retval.warnings += ["Some entries in sdrd '$field_name' do not have a unit, assuming " + supported_units[0]] case(tolerances.unique().size() != 1): retval.errors += ["Gladiator currently requires the $field_name to be the same for all the samples, change it or supply --$flag_name"]; case(!retval.errors): retval.values[flag_name] = tolerances.unique()[0].split()[0] } return retval }
7.3. Utility functions
def isallnull(value) { value == null || value.every({x-> x==null}) } def issomenull(value) { value == null || value.any({x -> x == null}) }
7.4. Testing
print(readSDRF("annotated-projects/PXD003977/PXD003977.sdrf.tsv",params))
7.5. Updating params
if (params.sdrf){
params = readSDRF(params.sdrf, params) + params
}
8. Putting it together
8.1. gwl
(make-workflow (name "my-workflow") (processes (auto-connect (list <<gwl-proc>> ))))
8.2. nf
According to nextflow/docs/process.rst, the cache looks at file timestamps for cache-correctness, which can be inconsistent on shared file systems; setting process.cache to 'lenient' may work around this.
See also:
``'lenient'`` Enable caching. Cache keys are created indexing input files path and size attributes (this policy provides a workaround for incorrect caching invalidation observed on shared file systems due to inconsistent files timestamps; requires version 0.32.x or later).
There is also 'deep' hashing, based on file content, and the undocumented 'sha256' hashing based on the shasum of the file. See also: https://www.nextflow.io/blog/2019/troubleshooting-nextflow-resume.html
process.cache='lenient'
Though this will break if you replace input files with files that have the same path but different content; 'sha256' might be the best one?
The result files will all be written to the following directory:
// directory where the results will be output to params.outdir = "./results"
<<nf-params>>