DIA-NN

DIA-NN - a universal software suite for data-independent acquisition (DIA) proteomics data processing. Conceived at the University of Cambridge, UK, in the laboratory of Kathryn Lilley (Cambridge Centre for Proteomics), DIA-NN opened a new chapter in proteomics, introducing a number of algorithms which enabled reliable, robust and quantitatively accurate large-scale experiments using high-throughput methods. DIA-NN is currently being further developed in the laboratory of Vadim Demichev at the Charité (University Medicine Berlin, Germany).

DIA-NN is built on the following principles:

Reliability achieved via stringent statistical control
Robustness achieved via flexible modelling of the data and automatic parameter selection
Reproducibility promoted by thorough recording of all analysis steps
Ease of use: high degree of automation, an analysis can be set up in several mouse clicks, no bioinformatics expertise required
Powerful tuning options to enable unconventional experiments
Scalability and speed: up to 1000 mass spec runs processed per hour

Download: https://github.com/vdemichev/DiaNN/releases/tag/1.9.2 (it's recommended to use the latest version - DIA-NN 1.9.2)

Please cite:
DIA-NN: neural networks and interference correction enable deep proteome coverage in high throughput Nature Methods, 2020

Using DIA-NN for the analysis of post-translational modifications (PTMs), such as phosphorylation or ubiquitination: Time-resolved in vivo ubiquitinome profiling by DIA-MS reveals USP7 targets on a proteome-wide scale Nature Communications, 2021

Using DIA-NN's ion mobility module for timsTOF data analysis or using DIA-NN in combination with FragPipe-generated spectral libraries: dia-PASEF data analysis using FragPipe and DIA-NN for deep proteomics of low sample amounts Nature Communications, 2022

Using DIA-NN for the analysis of multiplexed samples (SILAC, mTRAQ, etc): Increasing the throughput of sensitive proteomics by plexDIA Nature Biotechnology, 2022

Using DIA-NN as part of the CysQuant workflow: CysQuant: Simultaneous quantification of cysteine oxidation and protein abundance using data dependent or independent acquisition mass spectrometry Redox Biology, 2023

Using DIA-NN's QuantUMS module for quantification: QuantUMS: uncertainty minimisation enables confident quantification in proteomics bioRxiv

Using DIA-NN to process Slice-PASEF data: Slice-PASEF: fragmenting all ions for maximum sensitivity in proteomics bioRxiv

Other key papers

Using DIA-NN for large-scale plasma & serum proteomics:
Cell Systems, 2020 and Cell Systems, 2021
Ultra-fast proteomics with DIA-NN and Scanning SWATH:
Nature Biotechnology, 2021

R package with some useful functions for dealing with DIA-NN's output reports: https://github.com/vdemichev/diann-rpackage

Visualisation of peptide positions in the protein: https://github.com/MannLabs/alphamap (AlphaMap by Mann lab)

Notes and discussions on proteomics in general and the use of DIA-NN: https://github.com/vdemichev/DiaNN/discussions/categories/dia-proteomics-in-detail (this section will be further expanded).

Installation
Getting started
Raw data formats
Spectral library formats
Output
Library-free search
Creation of spectral libraries
Match-between-runs
Changing default settings
Command-line tool
Visualisation
Automated pipelines
PTMs and peptidoforms
Multiplexing using plexDIA
GUI settings reference
Command-line reference
Main output reference
Frequently asked questions (FAQ)
Support

Installation

On Windows, download the .exe installer and run it. Make sure not to run the installer from a network drive. It is recommended to install DIA-NN into the default folder suggested by the installer. Alternatively, just unpack the .binaries.zip archive to a location of your choice.

On Linux, download and unpack the .Linux.zip file. The Linux version of DIA-NN is generated on Linux Mint 21.2, and the target system must have standard libraries that are at least as recent. There is no such requirement, however, if you make a Docker or Apptainer/Singularity container image. To generate either container, we recommend starting with the latest debian docker image: in this case you only need to run sudo apt install libgomp1 before you can run DIA-NN in it. Please also see the excellent detailed guide by Roger Olivella. For the best performance, use mimalloc with dynamic override as described here https://github.com/microsoft/mimalloc.

It is also possible to run DIA-NN on Linux using Wine 6.8 or later.

Getting started

DIA mass spectrometry data can be analysed in two ways: by searching against a sequence database (library-free mode), or by using a "spectral library" - a set of known spectra and retention times for selected peptides. We discuss in detail when to use each of these approaches in the Library-free search section. For both kinds of analyses, using DIA-NN is very simple:

Click Raw (in the Input pane), select your raw mass spectrometry data files. See Raw data formats for information on supported formats.
Click Add FASTA, add one or more sequence databases in UniProt format.
If you want to use a spectral library, click Spectral library and select the library. Alternatively, for library-free analysis, select FASTA digest for library-free search/library generation (in the Precursor ion generation pane).
Specify Main output file name in the Output pane and click Run.
If you kept 'report.tsv' as the main output (located, by default, in the DIA-NN installation folder), it will contain the list of all precursor ions identified, along with different kinds of quantities, quality metrics and annotations. The output file report.pg_matrix.tsv will contain protein group quantities, report.gg_matrix.tsv - gene group quantities, report.pr_matrix.tsv - precursor ion quantities.

The above information is sufficient to start using DIA-NN: it really is this easy! The rest of this Documentation might be helpful, but is not essential for 99% of the projects.

The above is how to run DIA-NN with default settings, and these yield optimal or almost optimal performance for most experiments. In some cases, however, it is better to adjust the settings, see Changing default settings for more details.

DIA-NN also offers powerful tuning options for fancy experiments. DIA-NN is implemented as a user-friendly graphical interface that automatically invokes a command-line tool. But the user can also pass options/commands to the command-line tool directly, via the Additional options text box in the interface. All these options start with a double dash -- followed by the option name and, if applicable, some parameters to be set. So if you see some option/command with -- in its name mentioned in this Documentation, it means this command is meant to be typed in the Additional options text box.

Raw data formats

Formats supported: Sciex .wiff, Bruker .d, Thermo .raw, .mzML and .dia (format used by DIA-NN to store spectra). Conversion from any supported format to .dia is possible. When running on Linux (native builds, not Wine), only .d, .mzML, and .dia data are supported.

For .wiff support, download and install ProteoWizard (choose the 64-bit version that supports "vendor files"). Then copy all files with 'Clearcore' or 'Sciex' in their name (these will be .dll files) from the ProteoWizard folder to the DIA-NN installation folder (the one which contains diann.exe, DIA-NN.exe and a bunch of other files).

Reading Thermo .raw files requires Thermo MS File Reader to be installed. It is essential to use specifically version 3.0 SP3.

.mzML files should be centroided and contain data as spectra (e.g. SWATH/DIA) and not chromatograms.

Technology support

DIA and SWATH are supported
Acquisition schemes with overlapping windows are supported
Gas-phase fractionation is supported
Scanning SWATH is supported
dia-PASEF/py-diAID is supported
Slice-PASEF is supported (add --tims-scan to Additional options)
midia-PASEF and Synchro-PASEF are supported (add --tims-scan to Additional options), but DIA-NN currently does not benefit from Q1 dimension deconvolution
Orbitrap Astral is supported
FAIMS with constant CV is supported
FAIMS with multiple CVs is supported after splitting the runs, see here
BoxCar-DIA is supported, but DIA-NN has not been optimised for it
Bruker Impact II DIA data are supported after conversion to .mzML
multiplexing with non-isobaric tags and SILAC is supported
MSX-DIA is not supported

Conversion

Many mass spec formats, including those few that are not supported by DIA-NN directly, can be converted to .mzML using the MSConvertGUI application from ProteoWizard. This works for all supported formats except Bruker .d and SCIEX Scanning SWATH - these need to be accessed by DIA-NN directly. The following MSConvert settings must be used for conversion:

Spectral library formats

DIA-NN supports comma-separated (.csv), tab-separated (.tsv, .xls or .txt) or .parquet tables as spectral libraries, as well as .speclib (compact format used by DIA-NN), .sptxt (SpectraST, experimental) and .msp (NIST, experimental) library files. Important: the library must not contain non-fragmented precursor ions as 'fragments': each fragment ion must actually be produced by the peptide backbone fragmentation.

In detail

Libraries in the PeakView format as well as libraries produced by FragPipe, TargetedFileConverter (part of OpenMS), exported from Spectronaut (Biognosys) in the .xls format or generated by DIA-NN itself are supported “as is”.

For .tsv/.xls/.txt libraries generated by other means, DIA-NN might require the names of the columns it needs to be specified, separated by commas, using the --library-headers command. Use the * symbol instead of the name of a header to keep its recognition automatic. See below the descriptions of the respective columns (in the order the headers need to be specified).

Required columns:

Modified & labelled peptide sequence
Precursor charge
Precursor m/z
Reference retention time - arbitrary RT scale can be used
Fragment ion m/z
Relative intensity of the fragment ion

It is strongly recommended that columns containing the following are also present in the library:

Protein IDs - identifiers for the protein isoforms
Protein names
Gene names
Proteotypicity - a column containing 0/1 values, depending on whether the peptide in question is 'proteotypic', that is specific to a particular protein isoform, protein name or gene
Decoy - Indicates whether the peptide is a decoy. If there are decoy peptides in the library, DIA-NN uses these and does not generate its own decoys. It is strongly recommended not to include any decoy peptides in the library.
Fragment ion charge
Fragment ion type - either y or b; for x and z fragments also specify fragment type as y, and for a and c - as b
Fragment series number
Fragment neutral loss type
Q-value
Elution group identifier - if not specified, DIA-NN will infer elution groups automatically; not needed for most workflows
Exclude fragment indicator - a column containing 0/1 values, with 1 meaning that the fragment ion should not be used for quantification; not needed for most workflows
Ion Mobility - 1/K0 value for the precursor, arbitrary IM scale can be used

For example, a --library-headers command which specifies all column names except for the 'Decoy' column can look like this:

--library-headers ModifiedPeptide,PrecursorCharge,PrecursorMz,Tr_recalibrated,ProductMz,LibraryIntensity,UniprotID,ProteinName,Genes,Proteotypic,*,FragmentCharge,FragmentType,FragmentSeriesNumber,FragmentLossType,QValue,ExcludeFromAssay,IonMobility

Use --sptxt-acc to set the fragment filtering mass accuracy (in ppm) when reading .sptxt/.msp libraries.

MaxQuant msms.txt can also be used (experimental) as a spectral library in DIA-NN, although fixed modifications might not be read correctly.

DIA-NN can convert any library it supports into its own .parquet format. For this, click Spectral library (Input pane), select the library you want to convert, select the Output library file name (Output pane), click Run. If you use some exotic library format, it's a good idea to convert it to DIA-NN's .parquet and then examine the resulting library (using R 'arrow' or Python 'pyarrow' package) to see if the contents make sense.

All .tsv/.xls/.txt/.csv/.parquet libraries are just simple tables with human-readable data, and can be explored/edited, if necessary, using Excel or (ideally) R/Python.

Importantly, when any library is being converted to a different format, all numbers could be rounded using certain decimal precision, meaning that they might not be exactly the same as in the original library (there might be a tiny difference). Thus, although the performance when analysing using a converted library will be comparable, the results will not match exactly.

Output

The Output pane allows you to specify where the output should be saved as well as the file names for the main output report and (optionally) the output spectral library. DIA-NN uses these file names to derive the names of all of its output files. Below one can find information on different types of DIA-NN output. For most workflows one only needs the main report (for analysis in R or Python - recommended) or the matrices (simplified output for MS Excel). When the generation of output matrices is enabled, DIA-NN also produces a .manifest.txt file with a brief description of the output files generated.

Main report

A text table containing precursor and protein IDs, as well as plenty of associated information. Most column names are self-explanatory, and the full reference can be found in Main output reference. The following keywords are used when naming columns:

PG means protein group
GG means gene group
Quantity means non-normalised quantity
Normalised means normalised quantity
MaxLFQ means normalised protein quantity calculated using the MaxLFQ algorithm - it is strongly recommended to use these MaxLFQ quantities and not the regular quantities (also reported by DIA-NN)
Global refers to a global q-value, that is calculated for the entire experiment
Lib refers to the respective value saved in the spectral library, e.g. Lib.Q.Value means q-value for the respective library precursor

Note: since version 1.9, DIA-NN produces a report in the Apache .parquet format. This is a compressed columnar table format (~10x size reduction) that can be loaded in a single line of code using the R 'arrow' package or the Python 'pyarrow' package. Most of the new functionality (introduced in DIA-NN 1.9) is only reflected in the parquet report, so it is recommended to use it instead of the legacy .tsv report in all cases; the .tsv report is still generated only for compatibility with old analysis workflows. The generation of the legacy .tsv report can be turned off with --no-main-report. In addition to using R or Python, you can also view .parquet files with the TAD Viewer.
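As a minimal sketch of what loading and filtering the .parquet report can look like in Python: the helper names load_report and filter_rows are made up for this illustration, and the demo rows contain invented values. Only the Q.Value column name comes from this Documentation; pyarrow is imported lazily so the filtering part runs without it.

```python
def load_report(path):
    """Load a DIA-NN .parquet report as a list of row dictionaries.

    Requires the third-party 'pyarrow' package, imported lazily here.
    """
    import pyarrow.parquet as pq
    return pq.read_table(path).to_pylist()


def filter_rows(rows, max_q=0.01, column="Q.Value"):
    """Keep rows whose q-value in `column` is present and at most `max_q`."""
    return [r for r in rows if r[column] is not None and r[column] <= max_q]


# Tiny in-memory stand-in for a report; the values are invented.
demo = [
    {"Run": "run1", "Precursor.Id": "PEPTIDEK2", "Q.Value": 0.002},
    {"Run": "run1", "Precursor.Id": "ANOTHERR2", "Q.Value": 0.030},
]
print(filter_rows(demo))
```

With a real report one would call filter_rows(load_report("report.parquet")) instead of using the demo rows. For large experiments a data-frame library is likely a better fit than lists of dictionaries.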

Matrices

These contain normalised MaxLFQ quantities for protein groups ('pg_matrix'), gene groups ('gg_matrix'), unique genes ('unique_genes_matrix'; i.e. genes identified and quantified using only proteotypic, that is gene-specific, peptides) as well as normalised quantities for precursors ('pr_matrix'). They are filtered at 1% FDR, using global q-values for protein groups and both global and run-specific q-values for precursors. An additional 5% run-specific protein-level FDR filter is applied to the protein matrices; use --matrix-spec-q to adjust it. Sometimes DIA-NN will report a zero as the best estimate for a precursor or protein quantity. Such zero quantities are omitted from protein/gene matrices. Special phosphosite quantification matrices (phosphosites_90 and phosphosites_99 .tsv) are generated when phosphorylation (UniMod:21) is declared as a variable modification, see PTMs and peptidoforms.
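The precursor-level filtering described above can be sketched in Python as follows. This is an illustration of the logic only, not DIA-NN's own code, and it is not guaranteed to reproduce 'pr_matrix' exactly. The column names Run, Precursor.Id and Precursor.Normalised are assumptions that should be checked against the Main output reference; the function name and the demo values are invented.

```python
from collections import defaultdict


def precursor_matrix(rows, max_q=0.01):
    """Pivot report rows into {precursor: {run: normalised quantity}}.

    A row is kept only if both its run-specific and its global
    precursor q-values pass the threshold.
    """
    matrix = defaultdict(dict)
    for r in rows:
        if r["Q.Value"] <= max_q and r["Global.Q.Value"] <= max_q:
            matrix[r["Precursor.Id"]][r["Run"]] = r["Precursor.Normalised"]
    return {precursor: dict(runs) for precursor, runs in matrix.items()}


# Invented example rows standing in for the main report.
rows = [
    {"Run": "run1", "Precursor.Id": "PEPTIDEK2", "Q.Value": 0.001,
     "Global.Q.Value": 0.001, "Precursor.Normalised": 1500.0},
    {"Run": "run2", "Precursor.Id": "PEPTIDEK2", "Q.Value": 0.020,
     "Global.Q.Value": 0.001, "Precursor.Normalised": 1400.0},
    {"Run": "run1", "Precursor.Id": "ANOTHERR2", "Q.Value": 0.005,
     "Global.Q.Value": 0.030, "Precursor.Normalised": 900.0},
]
print(precursor_matrix(rows))
```

Runs in which a precursor fails the filter are simply absent from its inner dictionary, which corresponds to an empty cell in the matrix.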

Protein description

The .protein_description.tsv file is generated along with the Matrices and contains basic protein information known to DIA-NN (sequence IDs, names, gene names, description, sequence). Future versions of DIA-NN will include more information, e.g. protein molecular weight.

Stats report

Contains a number of QC metrics which can be used for data filtering, e.g. to exclude failed runs or as a readout for method optimisation. Note that the number of proteins reported here corresponds to the number of unique proteins (i.e. identified with proteotypic precursors) in a given run at 1% unique protein q-value. This number can be reproduced from the main report generated using precursor FDR threshold of 100% and filtered using Protein.Q.Value <= 0.01 & Proteotypic == 1. What's counted as 'protein' here depends on the 'Protein inference' setting.
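The filter quoted above (Protein.Q.Value <= 0.01 & Proteotypic == 1) can be applied in Python roughly as follows. This is a sketch under assumptions: the Run and Protein.Group column names are taken to be those of the main report and should be checked against the Main output reference, the function name is made up, and the demo values are invented.

```python
from collections import defaultdict


def unique_proteins_per_run(rows, max_q=0.01):
    """Count distinct proteins per run, using proteotypic precursors only."""
    proteins = defaultdict(set)
    for r in rows:
        if r["Protein.Q.Value"] <= max_q and r["Proteotypic"] == 1:
            proteins[r["Run"]].add(r["Protein.Group"])
    return {run: len(ids) for run, ids in proteins.items()}


# Invented example rows; two precursors of P1 in run1 count as one protein.
rows = [
    {"Run": "run1", "Protein.Group": "P1", "Protein.Q.Value": 0.001, "Proteotypic": 1},
    {"Run": "run1", "Protein.Group": "P1", "Protein.Q.Value": 0.001, "Proteotypic": 1},
    {"Run": "run1", "Protein.Group": "P2", "Protein.Q.Value": 0.001, "Proteotypic": 0},
    {"Run": "run1", "Protein.Group": "P3", "Protein.Q.Value": 0.020, "Proteotypic": 1},
    {"Run": "run2", "Protein.Group": "P1", "Protein.Q.Value": 0.005, "Proteotypic": 1},
    {"Run": "run2", "Protein.Group": "P4", "Protein.Q.Value": 0.010, "Proteotypic": 1},
]
print(unique_proteins_per_run(rows))
```

As noted above, the input report needs to have been generated with the precursor FDR threshold set to 100% for the counts to be comparable with the stats report.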

PDF report

A visualisation of a number of QC metrics, based on the main report as well as the stats report. The PDF report should be used only for quick preliminary assessment of the data and should not be used in publications.

Flexible reanalysis

The Output pane allows you to control how the '.quant files' are handled. To explain what these are, let us consider how DIA-NN processes raw data. It first performs the computationally-demanding part of the processing separately for each individual run in the experiment, and saves the identifications and quantitative information to a separate .quant file. Once all the runs are processed, it collects the information from all the .quant files and performs some cross-run steps, such as global q-value calculation, protein inference, calculation of final quantities and normalisation.

This allows DIA-NN to be used in a very flexible manner. For example, you can stop the processing at any moment, and then resume processing starting with the run you stopped at. Or you can remove some runs from the experiment, add some extra runs, and quickly re-run the analysis, without the need to redo the analysis of the runs already processed. All this is enabled by the Use existing .quant files when available option. The .quant files are saved to/read from the Temp/.dia dir (or the same location as the raw files, if there is no temp folder specified).

When using this option, the user must ensure that the .quant files had been generated with the exact same settings as applied in the current analysis, with the exception of Precursor FDR (provided it is <= 5%), Threads, Log level, MBR, Cross-run normalisation and Library generation - these settings can be different. It is even possible to transfer .quant files to another computer and reuse them there, without transferring the original raw files.

Important: it is strongly recommended to only reuse .quant files when both mass accuracies and the scan window are fixed to some (non-zero) values, otherwise DIA-NN will optimise these yet again using the first run for which a .quant file has not been found. Further, when using MBR or creating a spectral library from DIA data with Library generation set to smart or full profiling, .quant files should only be reused if they have been generated in exactly the same order as the current order of raw files, that is, with MBR DIA-NN currently cannot combine multiple separate analyses together.

Note: the main report in .parquet format provides the full output information for any kind of downstream processing. All other output types are there to simplify the analysis when using MS Excel or similar software. The numbers of precursors and proteins reported in different types of output files might appear different due to different filtering used to generate those, please see the descriptions above. All the 'matrices' can be reproduced from the main .parquet report, if generated with precursor FDR set to 5%, using R or Python.

Library-free search

DIA-NN has a very advanced library-free module, which is, for certain types of experiments, better than using a high quality project-specific spectral library. In general, the following makes library-free search perform better in comparison to spectral libraries (while the opposite favours spectral libraries):

high peptide numbers detectable per run;
heterogeneous data (e.g. cancer tissue samples are quite heterogeneous, while replicate injections of the same sample are not);
long chromatographic gradients as well as good separation of peptides in the ion mobility dimension;
large dataset (although processing a large dataset in library-free mode might take time).

Please note that in 99% of cases it is essential that MBR is enabled for a quantitative library-free analysis. It gets activated by default when using the DIA-NN GUI.

For most experiments it does indeed make sense to try library-free search. For medium and large-scale experiments it might make sense to first try library-free analysis of a subset of the data, to see whether the performance is OK (on the whole dataset it will typically be a lot better, so no need to be too stringent here). We ourselves also often perform a quick preliminary QC assessment of the experiment using some public library.

It is often convenient to perform library-free analysis in two steps: by first creating an in silico-predicted spectral library from the sequence database and then analysing with this library. This is the strategy that must be used in all cases except for quick preliminary analyses. Note that the pipeline functionality in DIA-NN makes it easy to schedule sequences of tasks, such as creation of a predicted library followed by multiple analyses using this library.

Comment

Note that the larger the search space (the total number of precursors considered), the more difficult it is for the analysis software to identify peptides, and the more time the search takes. DIA-NN is very good at handling very large search spaces, but even DIA-NN cannot do magic and produce as good results with a 100 million search space, as it would with a 2 million search space. So one needs to be careful about enabling all possible variable modifications at once. For example, allowing max 5 variable modifications, while having methionine oxidation, phospho and deamidation enabled simultaneously, is probably not a good idea.

Here lies an important distinction between DIA and DDA data analysis. In DDA allowing all possible variable modifications makes a lot of sense also because the search engine needs to match the spectrum to something - and if it is not matched to the correct modified peptide, it will be matched falsely. In DIA the approach is fundamentally different: the best-matching spectrum is found in the data for each precursor ion being considered (this is a very simplified view just to illustrate the concept). So not being able to identify a particular spectrum is never a problem in DIA (in fact most spectra are highly multiplexed in DIA - that is originate from multiple peptides - and only a fraction of these can be identified). And therefore it only makes sense to enable a particular variable modification if either you are specifically interested in it or if the modification is really ubiquitous.
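To put rough numbers on the search-space growth discussed in this Comment, the following Python sketch counts how many modified forms a single peptide sequence can take. It is a back-of-the-envelope illustration, not DIA-NN's own enumeration logic. It assumes that each residue carries at most one variable modification and that the target residue sets do not overlap (M for oxidation, S/T/Y for phospho, N/Q for deamidation). The function name and the example peptide are made up.

```python
from math import comb

# Residues targeted by each variable modification (disjoint sets).
MOD_SITES = {"oxidation": "M", "phospho": "STY", "deamidation": "NQ"}


def count_modified_forms(peptide, max_var_mods, mod_sites=MOD_SITES):
    """Number of forms of `peptide` with 0..max_var_mods modified sites."""
    targets = "".join(mod_sites.values())
    n_sites = sum(peptide.count(aa) for aa in targets)
    # Choose which k of the n modifiable sites are modified, for each k.
    return sum(comb(n_sites, k) for k in range(max_var_mods + 1))


peptide = "MSTNQYSPK"  # 7 modifiable residues: M, S, T, N, Q, Y, S
print(count_modified_forms(peptide, max_var_mods=1))  # 8
print(count_modified_forms(peptide, max_var_mods=5))  # 120
print(count_modified_forms(peptide, 5, {"phospho": "STY"}))  # 16
```

For this peptide, going from one to five allowed variable modifications with all three modification types enabled multiplies the number of candidate precursors per charge state by 15, which is the kind of growth the text above warns about.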

See PTMs and peptidoforms for information on distinguishing between peptidoforms bearing different sets of modifications.

Creation of spectral libraries

DIA-NN can create a spectral library from any DIA dataset. This can be done in both spectral library-based and library-free modes: just select the Generate spectral library option in the output pane.

DIA-NN can further create an in silico-predicted spectral library out of either a sequence database (make sure FASTA digest is enabled) or another spectral library (often useful for public libraries): just run DIA-NN without specifying any raw files and enable the Deep learning-based spectra, RTs and IMs prediction option in the Precursor ion generation pane. The modifications currently supported by the deep learning predictor are: C(cam), M(ox), N-term acetyl, N/Q(dea), S/T/Y(phos), K(-GG), nK(mTRAQ) and nK(TMT). Of note, if the predictor module in DIA-NN does not recognise some modification, it will still carry out prediction just ignoring it. To make DIA-NN instead discard any peptides with modifications unknown to the predictor, use --skip-unknown-mods.

Spectral libraries can also be created from DDA data, and in fact offline fractionation + DDA has been the 'gold standard' way of creating libraries since the introduction of SWATH/DIA proteomics. For this we recommend using FragPipe, which is based on the ultra-fast and highly robust MSFragger search engine. FragPipe can further be used to create DIA-NN-compatible libraries also from DIA data, similar to DIA-NN itself.

Match-between-runs

MBR is a powerful mode in DIA-NN, which is beneficial for most quantitative experiments, both with a spectral library and in library-free mode. MBR typically results in both higher average ID numbers and much better data completeness, that is, far fewer missing values.

While processing any dataset, DIA-NN gathers a lot of useful information which could have been used to process the data better. And that is what is enabled by MBR. With MBR, DIA-NN first creates a spectral library from DIA data, and then re-processes the same dataset with this spectral library. The algorithmic innovation implemented in DIA-NN ensures that the FDR is stringently controlled: MBR has been validated on datasets ranging from 2 runs to over 1000 runs.

MBR should be enabled for any quantitative experiment, unless you have a very high quality project-specific spectral library, which you think (i) is likely to provide almost complete coverage of detectable peptides, that is there is no point in trying library-free search + MBR, and (ii) consists mostly of peptides that are actually detectable in the DIA experiment. If only (i) is true, it might still be worth trying MBR along with Library generation set to IDs profiling.

MBR should not be used for non-quantitative experiments, that is when you only want to create a spectral library, which you would then use on some other dataset.

One can manually 'imitate' MBR using a two-step approach that will result in comparable performance. First, run DIA-NN to create a spectral library from the DIA runs (the whole experiment or just its subset, which can be a lot quicker for large-scale experiments or experiments including blanks/failed runs). Then use this library to analyse the whole experiment. In either case, run DIA-NN with MBR disabled.

When using MBR (or its imitation) and relying on the main .parquet report (recommended) instead of the quantitative matrices, use the following q-value filters:

Lib.Q.Value instead of Global.Q.Value
When applying a filter to Q.Value that is more stringent than the FDR threshold used to generate the DIA library (e.g. Q.Value < 0.001 filter), always apply the same filter to Lib.Q.Value
Lib.PG.Q.Value instead of Global.PG.Q.Value
Lib.Peptidoform.Q.Value instead of Global.Peptidoform.Q.Value, when using peptidoform scoring
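The filters listed above can be applied to the rows of the main report roughly as in this Python sketch. The column names are those given in the list; the function name mbr_filter and the demo values are invented. Applying one threshold to both Q.Value and Lib.Q.Value keeps the two filters equally stringent, as recommended above.

```python
def mbr_filter(rows, max_q=0.01, peptidoforms=False):
    """Filter report rows from an MBR analysis using library q-values."""
    columns = ["Q.Value", "Lib.Q.Value", "Lib.PG.Q.Value"]
    if peptidoforms:
        columns.append("Lib.Peptidoform.Q.Value")
    return [r for r in rows if all(r[c] <= max_q for c in columns)]


# Invented example rows standing in for the main report.
rows = [
    {"Precursor.Id": "A", "Q.Value": 0.001, "Lib.Q.Value": 0.001,
     "Lib.PG.Q.Value": 0.002, "Lib.Peptidoform.Q.Value": 0.030},
    {"Precursor.Id": "B", "Q.Value": 0.001, "Lib.Q.Value": 0.020,
     "Lib.PG.Q.Value": 0.002, "Lib.Peptidoform.Q.Value": 0.001},
    {"Precursor.Id": "C", "Q.Value": 0.001, "Lib.Q.Value": 0.001,
     "Lib.PG.Q.Value": 0.050, "Lib.Peptidoform.Q.Value": 0.001},
]
print([r["Precursor.Id"] for r in mbr_filter(rows)])
```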

Changing default settings

DIA-NN can be successfully used to process almost any experiment with default settings. In general, it is recommended to only change settings when specifically advised to do so in this Documentation (like below), for a specific experiment type, or if there is a very clear and compelling rationale for the change.

In many cases, one might want to change several parameters in the Algorithm pane.

MBR should be enabled in most cases, see Match-between-runs.
Mass accuracies: when set to 0.0, DIA-NN determines mass tolerances automatically, based either on the first run in the experiment (default), or, if Unrelated runs option is selected, for each run separately. However, the automatic algorithm can be affected by the noise in the data, so even for replicate injections, say, acquired on TripleTOF 6600, it can easily yield recommended MS2 mass accuracy tolerances in the range 15ppm - 25ppm - this is perfectly OK. So what we prefer to do in most cases is run DIA-NN on several acquisitions from the experiment, with any spectral library (can choose some small one which allows for quick analysis), see what mass accuracies DIA-NN sets automatically (it prints its recommendations), and set the values to approximate averages of these. Also, often it is known already what DIA-NN parameters are optimal for particular LC-MS settings.
Scan window: ideally should correspond to the approximate average number of data points per peak. Similarly to mass accuracies, can be determined by DIA-NN automatically, but we prefer to have it fixed to some average value.
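The averaging step described above is simple arithmetic, sketched here in Python. The per-run numbers are invented and would in practice be read manually from the recommendations DIA-NN prints in its log; the dictionary keys and the function name are made up for this illustration.

```python
from statistics import mean


def fixed_settings(recommendations):
    """Average per-run recommendations and round to whole numbers."""
    keys = ("ms2_ppm", "ms1_ppm", "scan_window")
    return {k: round(mean(r[k] for r in recommendations)) for k in keys}


# Invented recommendations from three test acquisitions.
recommendations = [
    {"ms2_ppm": 15, "ms1_ppm": 10, "scan_window": 7},
    {"ms2_ppm": 20, "ms1_ppm": 12, "scan_window": 8},
    {"ms2_ppm": 25, "ms1_ppm": 14, "scan_window": 9},
]
print(fixed_settings(recommendations))
```

The resulting values would then be entered as the fixed mass accuracies and scan window in the Algorithm pane.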

Please also see the guidance on Library-free search, PTMs and peptidoforms and Multiplexing using plexDIA, if these are relevant for your experiment.

Note that once you select a particular option in the DIA-NN GUI, some other settings might get activated automatically. For example, whenever you choose to perform an in silico FASTA database digest (for library-free search), or just generate a spectral library from DIA data, MBR will get automatically selected too - because in 99% of cases it is beneficial.

Command-line tool

DIA-NN is implemented as a graphical user interface (GUI), which invokes a command-line tool (diann.exe). The command-line tool can also be used separately, e.g. as part of custom automated processing pipelines. Further, even when using the GUI, one can pass options/commands to the command-line tool, in the Additional options text box. Some of such useful options are mentioned in this Documentation, and the full reference is provided in Command-line reference.

When the GUI launches the command-line tool, it prints in the log window the exact set of commands it used. So in order to reproduce the behaviour observed when using the GUI (e.g. if you want to do the analysis on a Linux cluster), one can just pass exactly the same commands to the command-line tool directly.

diann.exe [commands]

Commands are processed in the order they are supplied, and with most commands this order can be arbitrary.

On Linux, the semicolon ';' character is treated by the shell as a command separator, therefore any ';' that is part of a DIA-NN command (e.g. --channels) needs to be replaced with '\;' on Linux for correct behaviour.
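When command lines copied from the GUI log are turned into Linux shell scripts automatically, the semicolon escaping (a backslash before each ';') can be done with a small Python helper such as the one below. The function name is made up, and the --channels value is only an illustrative placeholder; see Multiplexing using plexDIA for the actual syntax.

```python
import re


def escape_semicolons(command):
    """Put a backslash before every ';' that is not already escaped."""
    return re.sub(r"(?<!\\);", r"\\;", command)


# Illustrative placeholder value, not a recommended channel definition.
channels = "--channels SILAC,L,KR,0:0;SILAC,H,KR,8.014199:10.008269"
print(escape_semicolons(channels))
```

Note that the escaping matters only when the command passes through a shell. If the tool is launched from Python with subprocess.run and a list of arguments (without shell=True), no shell is involved and the semicolons can be left as they are.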

For convenience, as well as for handling experiments consisting of thousands of files, some of the options/commands can be stored in a config file. For this, create a text file with any extension, say, diann_config.cfg, type in any commands supported by DIA-NN in there, and then reference this file with --cfg diann_config.cfg (in the Additional options text box or in the command used to invoke the diann.exe command-line tool).
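As a sketch, a config file can be generated from Python and then referenced with --cfg. The two commands written to the file (--xic 60 and --no-main-report) are taken from this Documentation. Placing one command per line is an assumption about the file layout, and the helper name write_config is made up.

```python
from pathlib import Path
import tempfile


def write_config(path, commands):
    """Write DIA-NN commands to a config file, one per line (assumed layout).

    Returns the arguments that reference the file on the command line.
    """
    Path(path).write_text("\n".join(commands) + "\n", encoding="utf-8")
    return ["--cfg", str(path)]


with tempfile.TemporaryDirectory() as tmp:
    cfg = Path(tmp) / "diann_config.cfg"
    extra_args = write_config(cfg, ["--xic 60", "--no-main-report"])
    cfg_text = cfg.read_text(encoding="utf-8")
    print(cfg_text)
    print(extra_args)
```

The returned arguments can be appended to the rest of the command line, or the equivalent text (--cfg followed by the file path) can be typed into the Additional options text box.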

Visualisation

DIA-NN provides two visualisation options.

Skyline. To visualise chromatograms/spectra in Skyline, analyse your experiment with MBR and a FASTA database specified and then click the 'Skyline' button. DIA-NN will automatically launch Skyline (make sure you have Skyline/Skyline daily version 23.1.1.459 or later installed as 'Administrator install'). Currently this workflow does not support multiplexing and will not work with modifications in any format other than UniMod.

DIA-NN Viewer. Analyse your experiment with the "XICs" checkbox checked and click the 'Viewer' button. By default "XICs" option will make DIA-NN extract chromatograms for the library fragment ions only and within 10s from the elution apex. Use --xic [N] to set the retention time window to N seconds (e.g. --xic 60 will extract chromatograms within a minute from the apex) and --xic-theoretical-fr to extract all charge 1 and 2 y/b-series fragments, including those with common neutral losses. Note that using --xic-theoretical-fr, especially in combination with large retention time window, might require a significant amount of disk space in the output folder. However the visualisation itself is effectively instantaneous, for any experiment size.

Note: The chromatograms extracted with "XICs" are saved in Apache .parquet format (file names end with '.xic.parquet') and can be readily accessed using R or Python. This is sometimes convenient for preparing publication-ready figures (although you can do that with Skyline or DIA-NN Viewer too), or even for setting up automatic custom quality control for LC-MS performance.

Peptide & modification positions within a protein can be visualised using AlphaMap by the Mann lab https://github.com/MannLabs/alphamap.

Automated pipelines

The pipeline window within the DIA-NN GUI allows multiple analysis steps to be combined into pipelines. Each pipeline step is a set of settings as displayed by the GUI. One can add such steps to the pipeline, update existing steps, remove steps, move steps up/down in the pipeline, disable/enable (by double mouse-click) certain steps within the pipeline and save/load pipelines. Further, individual pipeline steps can be copy-pasted between different GUI tabs/windows (use Copy and Paste buttons for this). We always assemble all DIA-NN runs for a particular publication in a pipeline. One can also use DIA-NN pipelines to store configuration templates.

PTMs and peptidoforms

DIA-NN GUI features built-in workflows (Precursor ion generation pane) for detecting methionine oxidation, N-terminal protein acetylation, phosphorylation and ubiquitination (via the detection of remnant -GG adducts on lysines). Other modifications can be declared using --var-mod or --fixed-mod in Additional options.

Distinguishing between peptidoforms bearing different sets of modifications is a non-trivial problem in DIA: without special peptidoform scoring the effective peptidoform FDR can be in the range 5-10% for library-free analyses. DIA-NN implements a statistical target-decoy approach for peptidoform scoring, which is enabled by the Peptidoforms option (Algorithm pane) and is also activated automatically whenever a variable modification is declared, via the GUI settings or the --var-mod command. The resulting peptidoform q-values reflect DIA-NN's confidence in the correctness of the set of modifications reported for the peptide as well as the correctness of the amino acid sequence identified. These q-values, however, do not guarantee the absence of low mass shifts due to some amino acid substitutions or modifications such as deamidation (note that DDA does not guarantee this either).

Further, DIA-NN features an algorithm which reports PTM localisation confidence estimates (as posterior probabilities for correct localisation of all variable PTM sites on the peptide as well as scores for individual sites), included in the .parquet output report. The phosphosites_90 and phosphosites_99 .tsv files contain phosphosite-specific quantities, calculated using the Top 1 method (experimental), that is the highest intensity among precursors with the site localised with the specified confidence (0.9 or 0.99, respectively) is used as the phosphosite quantity in the given run. The 'top 1' algorithm is used here as it is likely the most robust against outliers and mislocalisation errors. However, whether or not this is indeed the best option needs to be investigated, which is currently challenging due to the lack of benchmarks with known ground truth.
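The Top 1 rule described above can be sketched in a few lines of Python. This is an illustration of the stated rule only, not DIA-NN's code: the record fields (run, site, confidence, quantity) are hypothetical names, and "localised with the specified confidence" is interpreted here as confidence greater than or equal to the threshold.

```python
def top1_site_quantities(records, min_confidence=0.9):
    """For each (run, site), keep the highest precursor quantity among
    precursors whose site localisation confidence passes the threshold."""
    best = {}
    for rec in records:
        if rec["confidence"] < min_confidence:
            continue  # site not localised confidently enough in this precursor
        key = (rec["run"], rec["site"])
        if key not in best or rec["quantity"] > best[key]:
            best[key] = rec["quantity"]
    return best


# Hypothetical precursor-level records for one phosphosite in one run.
records = [
    {"run": "run1", "site": "P12345_S15", "confidence": 0.95, "quantity": 1200.0},
    {"run": "run1", "site": "P12345_S15", "confidence": 0.995, "quantity": 800.0},
    {"run": "run1", "site": "P12345_S15", "confidence": 0.60, "quantity": 5000.0},
]
print(top1_site_quantities(records, 0.9))
print(top1_site_quantities(records, 0.99))
```

With the 0.9 threshold the 1200.0 precursor is used. With 0.99 only the 800.0 precursor qualifies. The poorly localised 5000.0 precursor is ignored in both cases.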

In general, when looking for PTMs, we recommend the following:

Essential: the variable modifications you are looking for must be specified as variable (via the GUI checkboxes or the Additional options) both when generating an in silico predicted library and also when analysing the raw data using any predicted or empirical library
Settings for phosphorylation: max 3 variable modifications, max 1 missed cleavage, phosphorylation is the only variable modification specified, precursor charge range 2-3; to reduce RAM usage, make sure that the precursor mass range specified (when generating a predicted library) is not wider than the precursor mass range selected for MS/MS by the DIA method; to speed up processing when using a predicted library, first generate a DIA-based library from a subset of experiment runs (e.g. 10+ best runs) and then analyse the whole dataset using this DIA-based library with MBR disabled
When the above succeeds, also try max 2 missed cleavages
When looking for PTMs other than phosphorylation, in 95% of cases it is best to use max 1 to 3 variable modifications and max 1 missed cleavage
When not looking for PTMs, i.e. when the goal is relative protein quantification, enabling variable modifications typically does not yield higher proteomic depth. While it usually does not hurt either, it will make the processing slower.

To the best of our knowledge, there is no published validation of the identification confidence for the detection of deamidated peptides (which are easy to confuse with heavier isotopologues, unless the mass spec has a very high resolution and a tight mass accuracy/tolerance setting is used by the search engine), even for DDA. One way to gain confidence in the identification of deamidated peptides is to check if anything is identified if the mass delta for deamidation is declared to be 1.022694, instead of the correct value 0.984016. DIA-NN does pass this test successfully on several datasets (that is, no IDs are reported when specifying this 'decoy modification mass'), but we do recommend also trying such a 'decoy modification mass' search on several runs from the experiment to be analysed, if looking for deamidated peptides. In each case (correct or decoy mass), --ptm-qvalues should be used to enable PTM-specific scoring for deamidation, in addition to the peptidoform scoring, and either PTM.Q.Value or Global.Q.Value/Lib.Q.Value used for filtering.
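To make the isotopologue confusion concrete, the arithmetic below compares the deamidation mass delta with the 13C-12C mass difference (about 1.0033548 Da). The text above does not say how the decoy value was chosen. It is only our observation that 1.022694 lies as far above the isotopologue spacing as 0.984016 lies below it. The m/z and charge used are hypothetical.

```python
C13_C12_DELTA = 1.0033548   # mass difference between 13C and 12C, in Da
DEAMIDATION = 0.984016      # correct deamidation mass delta (from the text)
DECOY = 1.022694            # 'decoy modification mass' (from the text)


def gap_ppm(delta_a, delta_b, mz, charge):
    """Separation in ppm between two mass deltas, seen at a given m/z and charge."""
    return abs(delta_a - delta_b) / charge / mz * 1e6


# The decoy value mirrors the deamidation delta around the isotopologue spacing.
mirrored = 2 * C13_C12_DELTA - DEAMIDATION
print(f"mirrored delta: {mirrored:.6f} Da (decoy: {DECOY} Da)")

# Hypothetical doubly charged precursor at m/z 600.
print(f"deamidation vs +1 isotopologue: "
      f"{gap_ppm(DEAMIDATION, C13_C12_DELTA, 600.0, 2):.1f} ppm")
```

At this m/z and charge the deamidated peptide and the +1 isotopologue of the unmodified peptide are separated by roughly 16 ppm, which is why a tight mass tolerance matters for this modification.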

Of note, when the ultimate goal is the identification of proteins, it is largely irrelevant if a modified peptide is misidentified, by being matched to a spectrum originating from a different peptidoform. Therefore, if the purpose of the experiment is to identify/quantify specific PTMs, amino acid substitutions or distinguish proteins with high sequence identity, then the Peptidoforms scoring option is recommended. In all other cases peptidoform scoring is typically OK to use but not necessary, and will usually lead to a somewhat slower processing and a slight decrease in identification numbers when using MBR.

Does DIA-NN need to recognise modifications in the spectral library?

In general, yes. However, most workflows will work without the need to recognise modifications. That said, if unknown modifications are detected in the library, DIA-NN will print a warning listing them, and it is strongly recommended to declare them using --mod. Note that DIA-NN already recognises many common modifications and can also load the whole UniMod database, see the --full-unimod option.

Multiplexing using plexDIA

In collaboration with the Slavov laboratory, we have developed plexDIA based on DIA-NN, a technology that makes it possible to benefit from non-isobaric multiplexing (mTRAQ, dimethyl, SILAC) in combination with DIA. To analyse a plexDIA experiment, one needs an in silico predicted or empirical spectral library. DIA-NN then needs to be supplied with the following sets of commands, depending on the analysis scenario.

Scenario 1. The library is a regular label-free library (empirical or predicted), and multiplexing is achieved purely with isotopic labelling, i.e. without chemical labelling with tags such as mTRAQ or dimethyl. DIA-NN then needs the following options to be added to Additional options:

--fixed-mod, to declare the base name for the channel labels and the associated amino acids
--lib-fixed-mod, to in silico apply the modification declared with --fixed-mod to the library
--channels, to declare the mass shifts for all the channels considered
--original-mods, to prevent DIA-NN from converting the declared modifications to UniMod

Example for L/H SILAC labels on K and R:

--fixed-mod SILAC,0.0,KR,label
--lib-fixed-mod SILAC
--channels SILAC,L,KR,0:0; SILAC,H,KR,8.014199:10.008269
--original-mods

Note that in the above SILAC is declared as label, i.e. it is not supposed to change the retention time of the peptide. It is also a zero-mass label here, as it only serves to designate the amino acids that will be labelled. What the combination of --fixed-mod and --lib-fixed-mod does here is simply put (SILAC) after each K or R in the precursor id sequence, in the internal library representation used by DIA-NN. --channels then splits each library entry into two, one with masses 0 (K) and 0 (R) added upon each occurrence of K(SILAC) or R(SILAC) in the sequence, respectively, and another one with 8.014199 (K) and 10.008269 (R).
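The channel splitting described above can be sketched in Python as follows. This is an illustration and not DIA-NN's parser: it assumes every character in [sites] is one site, that 'n' denotes the peptide N-terminus (as in the mTRAQ and dimethyl declarations), and it uses a hypothetical peptide sequence.

```python
def parse_channels(spec):
    """Parse '[group],[name],[sites],[m1:m2:...]; ...' into a list of dicts."""
    channels = []
    for declaration in spec.split(";"):
        declaration = declaration.strip()
        if not declaration:
            continue
        group, name, sites, masses = (p.strip() for p in declaration.split(","))
        mass_values = [float(m) for m in masses.split(":")]
        if len(mass_values) != len(sites):
            raise ValueError(f"expected {len(sites)} masses in '{declaration}'")
        channels.append({"group": group, "name": name,
                         "shifts": dict(zip(sites, mass_values))})
    return channels


def channel_mass_shift(sequence, shifts):
    """Total mass added to a stripped peptide sequence in one channel."""
    total = shifts.get("n", 0.0)  # 'n' is taken to mean the peptide N-terminus
    for residue in sequence:
        total += shifts.get(residue, 0.0)
    return total


silac = parse_channels("SILAC,L,KR,0:0; SILAC,H,KR,8.014199:10.008269")
for channel in silac:
    # "AKELR" is a hypothetical peptide with one K and one R.
    print(channel["name"], channel_mass_shift("AKELR", channel["shifts"]))
```

Each library entry therefore yields one entry per channel, differing only in the mass added at the labelled sites.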

Scenario 2. The library is a regular label-free library (empirical or predicted), and multiplexing is achieved via chemical labelling with mTRAQ.

Scenario 2: Step 1. Label the library in silico with mTRAQ and run the deep learning predictor to adjust spectra/RTs/IMs. For this, run DIA-NN with the input library in the Spectral library field, an Output library specified, Deep learning-based spectra, RTs and IMs prediction enabled, list of raw data files empty and the following options in Additional options:

--fixed-mod mTRAQ,140.0949630177,nK
--lib-fixed-mod mTRAQ
--channels mTRAQ,0,nK,0:0; mTRAQ,4,nK,4.0070994:4.0070994;mTRAQ,8,nK,8.0141988132:8.0141988132
--original-mods

Use the .predicted.speclib file with the name corresponding to the Output library as the spectral library for the next step.

Scenario 2: Step 2. Run DIA-NN with the following options:

--fixed-mod mTRAQ,140.0949630177,nK
--channels mTRAQ,0,nK,0:0; mTRAQ,4,nK,4.0070994:4.0070994;mTRAQ,8,nK,8.0141988132:8.0141988132
--original-mods

Note that --lib-fixed-mod is no longer necessary as the library generated in Step 1 already contains (mTRAQ) at the N-terminus and lysines of each peptide.

Scenario 3. The library is a regular label-free library (empirical or predicted), and multiplexing is achieved via chemical labelling with a label other than mTRAQ. The reason this scenario is treated differently from Scenario 2 is that DIA-NN's in silico predictor has not been specifically trained for labels other than mTRAQ, and therefore an extra step to generate predictions is not necessary. Simply run DIA-NN as you would do in Scenario 1, except the --fixed-mod declaration will have a non-zero mass in this case and will not be a label. For example, for 5-channel dimethyl as described by Thielert et al:

--fixed-mod Dimethyl,28.0313,nK
--lib-fixed-mod Dimethyl
--channels Dimethyl,0,nK,0:0; Dimethyl,2,nK,2.0126:2.0126; Dimethyl,4,nK,4.0251:4.0251; Dimethyl,6,nK,6.0377:6.0377; Dimethyl,8,nK,8.0444:8.0444
--original-mods

Scenario 4. The library is an empirical DIA library generated by DIA-NN from a multiplexed DIA dataset. For example, this could be a library generated by DIA-NN in the first pass of MBR (and you'd like to reuse it to analyse the same or some other runs). The Additional options will then be the same as in Scenario 1, Scenario 2: Step 2 or Scenario 3, except (important!) --lib-fixed-mod must not be supplied.

In all the scenarios above, an extra option specifying the normalisation strategy must be included in Additional options. This can be either --channel-run-norm (pulsed SILAC, protein turnover) or --channel-spec-norm (multiplexing of independent samples).
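The difference between the two strategies lies in what is treated as a sample. The Python sketch below illustrates this under loud assumptions: simple median scaling stands in for DIA-NN's actual normalisation algorithm, which is not described here, and the intensities are invented. With run-specific normalisation, channels are summed per precursor within each run and one factor per run is derived, so channel ratios within a run are preserved. With channel-specific normalisation, every channel in every run gets its own factor.

```python
from statistics import median


def run_specific_factors(data):
    """One factor per run, derived from per-precursor sums over channels."""
    run_medians = {}
    for run, channels in data.items():
        summed = [sum(values) for values in zip(*channels.values())]
        run_medians[run] = median(summed)
    target = median(run_medians.values())
    return {run: target / m for run, m in run_medians.items()}


def channel_specific_factors(data):
    """One factor per (run, channel), each treated as a separate sample."""
    medians = {(run, ch): median(values)
               for run, channels in data.items()
               for ch, values in channels.items()}
    target = median(medians.values())
    return {key: target / m for key, m in medians.items()}


# Invented intensities: data[run][channel] lists the same precursors in order.
data = {
    "run1": {"L": [10.0, 20.0, 30.0], "H": [10.0, 20.0, 30.0]},
    "run2": {"L": [20.0, 40.0, 60.0], "H": [60.0, 120.0, 180.0]},
}
print(run_specific_factors(data))
print(channel_specific_factors(data))
```

In run2 the H/L ratio of 3 survives run-specific scaling, which is what a turnover experiment needs, while channel-specific scaling removes it by design.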

Output. We recommend using the main report in .parquet format for all downstream analyses. Note that PG.Q.Value and GG.Q.Value in the main report are channel-specific, when using multiplexing. The quantities PG.MaxLFQ, Genes.MaxLFQ and Genes.MaxLFQ.Unique are only channel-specific if (i) QuantUMS is used and (ii) either the report corresponds to the second pass of MBR or MBR is not used. Alternatively, one can use the matrices (not recommended), these are precursor-level only. When using matrices, it is essential to specify --matrix-ch-qvalue, with reasonable thresholds 0.01 to 0.5. This setting will not affect the extracted MS1 matrix, which simply reports MS1 signals corresponding to each channel, whenever a precursor is identified in any of the channels - using this matrix is normally not recommended. Protein matrices are not produced when analysing multiplexed data.

GUI settings reference

Description of selected options

Input pane

Convert to .dia Convert the selected raw files to DIA-NN's .dia format, for faster subsequent processing, and save them either to the same folder as the respective source raw files or to Temp/.dia dir (Output pane), if the latter is specified. Conversion is recommended for Sciex files, typically makes little difference for Thermo files and is not recommended for Bruker files.
Reannotate option allows to reannotate the spectral library with protein information from the FASTA database, using the specified digest specificity
Contaminants Adds common contaminants from the Cambridge Centre for Proteomics (CCP) database and automatically excludes them from quantification, see the description of the --cont-quant-exclude option. This option applies when generating a predicted spectral library from a FASTA database or analysing using such a library, if it was generated with Contaminants enabled.

Precursor ion generation pane

FASTA digest instructs DIA-NN to in silico digest the sequence database, for library-free search or to generate a spectral library in silico
Deep learning-based spectra, RTs and IMs prediction instructs DIA-NN to perform deep learning-based prediction of spectra, retention times and ion mobility values. This allows not only to make in silico spectral libraries from sequence databases, but also to replace spectra/RTs/IMs in existing libraries with predicted values
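As an illustration of what an in silico digest does, the Python sketch below applies the canonical tryptic rule (written as K*,R*,!*P in the --cut syntax of the command-line reference): cleave after K or R, but not before P. It is a simplified stand-in and not DIA-NN's implementation. The protein sequence is hypothetical, and length and mass filters are omitted.

```python
def tryptic_digest(protein, max_missed_cleavages=1):
    """Return peptides from cleaving after K/R (not before P),
    allowing up to max_missed_cleavages skipped cleavage sites."""
    # Positions at which the sequence is cut (index of the next peptide's start).
    cuts = [i + 1 for i in range(len(protein) - 1)
            if protein[i] in "KR" and protein[i + 1] != "P"]
    bounds = [0] + cuts + [len(protein)]
    peptides = []
    for start in range(len(bounds) - 1):
        last = min(start + 1 + max_missed_cleavages, len(bounds) - 1)
        for end in range(start + 1, last + 1):
            peptides.append(protein[bounds[start]:bounds[end]])
    return peptides


# Hypothetical sequence: the R at position 3 is followed by P and is not cleaved.
print(tryptic_digest("AKRPDEKGG", max_missed_cleavages=0))
print(tryptic_digest("AKRPDEKGG", max_missed_cleavages=1))
```

Raising the number of missed cleavages enlarges the search space, which is one reason the PTM recommendations above start from max 1 missed cleavage.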

Output pane

Use existing .quant files when available reuse IDs/quantification information from a previous analysis, see Output
Temp/.dia dir specify where .quant files or converted .dia files will be saved, see Output

Algorithm pane

Mass Accuracy set the MS2 mass tolerance (in ppm), see Changing default settings
Mass Accuracy MS1 set the MS1 mass tolerance (in ppm), see Changing default settings
Scan window sets the scan window radius to a specific value. Ideally, should be approximately equal to the average number of data points per peak, see Changing default settings
Unrelated runs determine mass accuracies and scan window, if automatic, independently for different runs, see Changing default settings
Peptidoforms activates peptidoform confidence scoring, see PTMs and peptidoforms
MBR enables MBR, should be enabled for most quantitative experiments, see MBR
No shared spectra whether to use a spectrum centric-like algorithm to remove interfering precursors. This algorithm is particularly important when considering variable modifications and should always be enabled
Neural network classifier here 'single-pass' mode is the default option and is recommended. The 'double-pass' mode might be better in some scenarios, but it is almost twice as slow and it might make reported FDR values slightly less conservative. Double-pass mode must be tested against single-pass on the specific dataset, before a decision is made to use it.
Protein inference this setting primarily affects proteotypicity definition, the default "Genes" is recommended for almost all applications, provided the gene-level information is actually present in the database (non-UniProt databases might lack it). When set to "Off", protein groups from the spectral library are used - this makes sense if protein inference has already been performed during library generation
Quantification strategy QuantUMS (high-precision) is recommended for most scenarios, use QuantUMS (high-accuracy) for experiments where elimination of any ratio compression bias is critical
Cross-run normalisation whether to use global, RT-dependent (recommended) or also signal-dependent (experimental, be very careful about it) cross-run normalisation. Normalisation can also be disabled completely using --no-norm
Library generation this setting determines if and how empirical RTs/IMs and spectra are added to the newly generated library, instead of the theoretical values. IDs, RT & IM profiling is strongly recommended for almost all workflows. When analysing with a high-quality project-specific library, can switch to IDs profiling. Full profiling means always using empirical information, and might only be beneficial (in very rare cases) when having less than ~1000 peptides identified per run, and only if downstream processing is not very sensitive to a bit higher FDR.
Speed and RAM usage this setting is primarily useful for library-free analyses. The first three modes will typically have little difference in terms of ID numbers, while the Ultra-fast mode is rather extreme: about 5x faster, but ID numbers are not as good, and the effective FDR might be somewhat higher. The setting affects only the first pass when using MBR

Command-line reference

Description of available options/commands

Note that some options below are strongly detrimental to performance and are only there for benchmarking purposes. So the recommendation is to only use the options which are expected to be beneficial for a particular experiment (e.g. those recommended in the present documentation) based on some clear rationale.

--cfg [file name] specifies a file to load options/commands from
--channel-run-norm normalisation of multiplexed samples will be performed in run-specific manner, i.e. to perform normalisation, for each precursor ion DIA-NN will sum the respective channels within each run and will normalise these sums across runs: use e.g. for protein turnover SILAC experiments
--channel-spec-norm normalisation of multiplexed samples will be performed in channel-specific manner, i.e. each channel in each run is treated as a separate sample to be normalised: use to analyse experiments wherein multiplexing of independent samples is used to boost throughput
--channels [channel 1]; [channel 2]; ... lists multiplexing channels, wherein each channel declaration has the form [channel] = [label group],[channel name],[sites],[mass1:mass2:...], wherein [sites] has the same syntax as for --var-mod and if N sites are listed, N masses are listed at the end of the channel declaration. The spectral library will be automatically split into multiple channels, for precursors bearing the [label group] modification. To add the latter to a label-free spectral library, can use --lib-fixed-mod, e.g. --fixed-mod SILAC,0.0,KR,label --lib-fixed-mod SILAC. See Multiplexing using plexDIA for usage examples
--clear-mods makes DIA-NN 'forget' all built-in modification (PTM) names
--compact-report instructs DIA-NN to provide less information in the main .tsv report
--cont-quant-exclude [tag] peptides corresponding to protein sequence ids tagged with the specified tag will be excluded from normalisation as well as quantification of protein groups that do not include proteins with the tag
--convert makes DIA-NN convert the mass spec files to the .dia format. The files are either saved to the same location as the input files, or in the Temp/.dia dir, if it is specified (in the GUI or using the --temp option)
--cut [specificity 1],[specificity 2],... specifies cleavage specificity for the in silico digest. Cleavage sites (pairs of amino acids) are listed separated by commas, '*' indicates any amino acid, and '!' indicates that the respective site will not be cleaved. Examples: "--cut K*,R*,!*P" - canonical tryptic specificity, "--cut " - digest disabled
--decoy-channel [channel] specifies the decoy channel masses, wherein [channel] has the same syntax as for --channels
--decoys-preserve-spectrum informs DIA-NN that decoy peptides in the library are already annotated with 'decoy' spectra
--dir [folder] specifies a folder containing raw files to be processed. All files in the folder must be in .raw, .mzML or .dia format
--direct-quant disable QuantUMS and use legacy DIA-NN quantification algorithms instead, also disables channel-specific protein quantification when analysing multiplexed samples
--dl-no-im when using the deep learning predictor, prediction of ion mobilities will not be performed
--dl-no-rt when using the deep learning predictor, prediction of retention times will not be performed
--duplicate-proteins instructs DIA-NN not to skip entries in the sequence database with duplicate IDs (while by default if several entries have the same protein ID, all but the first entry will be skipped)
--exact-fdr approximate FDR estimation for confident peptides based on parametric modelling will be disabled
--export-quant add fragment quantities and quality information to the .parquet output report
--ext [string] adds a string to the end of each file name (specified with --f)
--f [file name] specifies a run to be analysed, use multiple --f commands to specify multiple runs
--fasta [file name] specifies a sequence database in FASTA format (full support for UniProt proteomes), use multiple --fasta commands to specify multiple databases
--fasta-filter [file name] only consider peptides matching the stripped sequences specified in the text file provided (one sequence per line), when processing a sequence database
--fasta-search instructs DIA-NN to perform an in silico digest of the sequence database
--fixed-mod [name],[mass],[sites],[optional: 'label'] - adds the modification name to the list of recognised names and specifies the modification as fixed. Same syntax as for --var-mod.
--force-swissprot only consider SwissProt (i.e. marked with '>sp|') sequences when processing a sequence database
--foreign-decoys informs DIA-NN that any decoys included in the library have been generated by a tool other than this version of DIA-NN
--full-unimod loads the complete UniMod modification database and disables the automatic conversion of modification names to the UniMod format
--gen-spec-lib instructs DIA-NN to generate a spectral library
--gen-fr-restriction annotates the library with fragment exclusion information, based on the runs being analysed (fragments least affected by interferences are selected for quantification, while the rest are excluded)
--global-mass-cal disables RT-dependent mass calibration
--global-norm instructs DIA-NN to use simple global normalisation instead of RT-dependent normalisation
--high-acc QuantUMS settings will be optimised for maximum accuracy, i.e. to minimise any ratio compression quantitative bias
--ids-to-names protein sequence ids will also be used as protein names and genes, any information on actual protein names or genes will be ignored
--il-eq (experimental) when using the 'Reannotate' function, peptides will be matched to proteins while considering isoleucine and leucine equivalent
--im-window [x] fixes IM extraction window to the specific value
--im-window-factor [x] controls the minimum size of the IM extraction window, default is 2.0
--individual-mass-acc mass accuracies, if set to automatic, will be determined independently for different runs
--individual-reports a separate output report will be created for each run
--individual-windows scan window, if set to automatic, will be determined independently for different runs
--int-removal 0 disables the removal of interfering precursors
--lib [file name] specifies a spectral library. The use of multiple --lib commands (experimental) allows to load multiple libraries in .tsv format