NAguideR

Please note: This function is optional and is intended for users with specific experimental aims. For example, some users may want to check a particular peptide or protein (e.g., spiked-in standard peptides or proteins, or known housekeeping proteins such as beta-actin) before and after imputation.

Please type in a protein id or peptide sequence exactly as it appears in the original expression data:

Download

1.1 Abstract

Mass spectrometry (MS)-based quantitative proteomics experiments frequently generate data with missing values, which may profoundly affect downstream analyses. A wide variety of missing value imputation methods have been established to deal with the missing-value issue. To date, however, there is a scarcity of efficient, systematic, and easy-to-handle tools tailored for the proteomics community. Herein, we developed a user-friendly and powerful web tool, NAguideR, to enable the implementation and evaluation of different missing value methods offered by twenty popular missing-value imputation algorithms. Evaluation of data imputation results can be performed through classic computational criteria and, unprecedentedly, proteomic empirical criteria such as quantitative consistency between different charge states of the same peptide, different peptides belonging to the same protein, and individual proteins participating in functional protein complexes. We applied NAguideR to three label-free proteomic datasets featuring peptide-level, protein-level, and phosphoproteomic variables respectively, all generated by data-independent acquisition mass spectrometry (DIA-MS) with substantial biological replicates. The results indicate that NAguideR is able to distinguish the optimal imputation methods for DIA-MS experiments from sub-optimal and low-performance algorithms. The NAguideR web tool further provides downloadable tables and figures supporting flexible data analysis and interpretation. The flowchart below summarizes the process of data analysis in NAguideR.

1.2 What exactly does NAguideR do in each step?

As described above, data analysis in NAguideR has four main steps: (1) upload of proteomics data; (2) data quality control; (3) missing value imputation; (4) performance evaluation. Because many users want to know what happens in each step, the figure below shows the detailed operations. As an example, we take two groups of samples with five biological replicates each (labeled A1, A2, A3, A4, A5, B1, B2, B3, B4, B5 in the original intensity data). 'Feature' means an identified protein or peptide.

2.1 Input data preparation

NAguideR supports four basic file formats (.csv, .txt, .xlsx, .xls). Before analysis, users should prepare two required files: (1) proteomics expression data and (2) sample information data. Both can be readily generated from the results of popular tools such as MaxQuant, PEAKS, and Spectronaut. Users can then upload the two files into NAguideR in the right formats and start the subsequent analysis.

2.1.1 Proteomics expression data

NAguideR supports four types of proteomics expression data ('Peptides+Charges+Proteins', 'Peptides+Charges', 'Peptides+Proteins', 'Proteins'), which differ mainly in their first few columns. Users who want to upload other kinds of omics data (e.g., genomics, metabolomics) can choose the fifth type ('Others'). Please note that the fifth type cannot generate results based on the proteomic criteria.

2.1.1.1 Expression data with peptide sequences, peptide charge states, and protein ids

In this situation, peptide sequences, peptide charge states, and protein ids are provided, in that order, in the first three columns of the input file. Peptide sequences in the first column can be peptides with any post-translational modification (PTM, written in any routine format) or stripped peptides (without PTM). The second column contains the peptide charge states. The protein ids in the third column should be UniProt ids. From the fourth column onward, the expression intensity or signal abundance of each peptide in every sample should be listed. The data structure is shown below:
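As a rough illustration of the layout described above, the following Python sketch writes a small table in the 'Peptides+Charges+Proteins' format and checks its shape. The column headers, the made-up placeholder sequences, the intensities, and the use of 'NA' for missing values are assumptions for this example only; they are not taken from NAguideR itself.

```python
import csv
import io

# Hypothetical example table. Sequences and intensities are placeholders,
# and "NA" is assumed here as the missing-value marker.
HEADER = ["Peptides", "Charges", "Proteins", "A1", "A2", "B1", "B2"]
ROWS = [
    ["PEPTIDEK", 2, "P60709", 15230.5, "NA", 14870.2, 16012.9],
    ["PEPTIDEK", 3, "P60709", 8120.1, 7990.4, "NA", 8304.7],
    ["SAMPLER", 2, "P02768", "NA", 220450.0, 198300.3, 205110.8],
]


def write_expression_csv(header, rows):
    """Serialize the table as CSV text, one row per peptide/charge pair."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


def check_layout(csv_text, n_id_columns=3):
    """Return the sample names; raise ValueError on a malformed row."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    for line_no, row in enumerate(reader, start=2):
        if len(row) != len(header):
            raise ValueError(f"line {line_no}: expected {len(header)} fields")
        for value in row[n_id_columns:]:
            if value != "NA":
                float(value)  # raises ValueError if the cell is not numeric
    return header[n_id_columns:]


csv_text = write_expression_csv(HEADER, ROWS)
samples = check_layout(csv_text)
print(samples)
```

For the other input types, only `n_id_columns` and the leading columns would change (two id columns for 'Peptides+Charges' or 'Peptides+Proteins', one for 'Proteins').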

2.1.1.2 Expression data with peptide sequences and peptide charge states

Similar to the situation above, peptide sequences and peptide charge states are provided, in that order, in the first two columns of the input file. Peptide sequences in the first column can be peptides with post-translational modification (PTM) or stripped peptides (without PTM). The second column contains the peptide charge states. From the third column onward, the expression intensity or signal abundance of each peptide in every sample should be listed. The data structure is shown below:

2.1.1.3 Expression data with peptide sequences and protein ids

In this situation, peptide sequences and protein ids are provided, in that order, in the first two columns of the input file. Peptide sequences in the first column can be peptides with post-translational modification (PTM) or stripped peptides (without PTM). The protein ids in the second column should be UniProt ids. From the third column onward, the expression intensity or signal abundance of each peptide in every sample should be listed. The data structure is shown below:

2.1.1.4 Expression data with protein ids

In this situation, protein ids are provided in the first column of the input file. The protein ids should be UniProt ids. From the second column onward, the expression intensity or signal abundance of each protein in every sample should be listed. The data structure is shown below:

2.1.1.5 Other kinds of omics data

If users want to use NAguideR for other omics data (e.g., genomics, metabolomics), gene or metabolite ids/names should be provided in the first column of the input file. From the second column onward, the expression intensity or signal abundance of each gene or metabolite in every sample should be listed. The data structure is shown below:

2.1.2 Sample information data

Sample information means the group identity of each sample. This information enables, for example, applying the filtering strategy to each group separately in the quality control step. The sample names are in the first column, in the same order as in the expression data. Group information is in the second column. The data structure is shown below:
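The requirement that sample names appear in the same order as in the expression data can be checked before uploading. The sketch below is one possible check in Python; the header, sample names, and helper function names are hypothetical and serve only to illustrate the two-column structure described above.

```python
# Hypothetical expression-data header: one id column followed by samples.
EXPRESSION_HEADER = ["Proteins", "A1", "A2", "A3", "B1", "B2", "B3"]

# Hypothetical sample information: (sample name, group) per row.
SAMPLE_INFO = [
    ("A1", "A"), ("A2", "A"), ("A3", "A"),
    ("B1", "B"), ("B2", "B"), ("B3", "B"),
]


def check_sample_order(expression_header, sample_info, n_id_columns):
    """Return a sample -> group mapping after verifying names and order."""
    expression_samples = expression_header[n_id_columns:]
    info_samples = [name for name, _group in sample_info]
    if expression_samples != info_samples:
        raise ValueError(
            f"sample order mismatch: {expression_samples} vs {info_samples}"
        )
    return dict(sample_info)


def samples_by_group(sample_info):
    """Group sample names by their group label, preserving input order."""
    groups = {}
    for name, group in sample_info:
        groups.setdefault(group, []).append(name)
    return groups


sample_to_group = check_sample_order(EXPRESSION_HEADER, SAMPLE_INFO, 1)
groups = samples_by_group(SAMPLE_INFO)
print(groups)
```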

2.2 Operating Procedure of NAguideR (Four steps)

Step 1. Uploading proteomics expression data

After preparing the required data, users can click 'Import data' and upload their own data in the left panel:

Users who want to look at the example data first can choose 'Load example data' and download the example files by clicking the corresponding button:

Step 2. Data quality control

After uploading the data, users can click 'NA Overview'. In this part, users can check the distribution of NA in their data ('NA distribution' part), and proteins/peptides with an excessively high proportion of NA or a large coefficient of variation (CV) are removed ('Filter' part). After setting suitable parameters, click the 'Calculate' button.
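The filtering idea in this step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the thresholds, the per-group application of both filters, and the handling of an undefined CV are choices made for this example and may differ from what NAguideR does internally.

```python
import math
import statistics


def na_ratio(values):
    """Fraction of missing entries (None) in a list of intensities."""
    return sum(v is None for v in values) / len(values)


def cv(values):
    """Coefficient of variation (sd / mean) of the observed raw intensities."""
    observed = [v for v in values if v is not None]
    if len(observed) < 2:
        return math.nan  # CV is undefined with fewer than two values
    mean = statistics.fmean(observed)
    if mean == 0:
        return math.nan
    return statistics.stdev(observed) / mean


def keep_feature(intensities, groups, max_na_ratio=0.5, max_cv=0.3):
    """Keep a feature only if every group passes both filters.

    Thresholds are hypothetical defaults. A feature whose CV is undefined
    in some group is dropped (an assumption of this sketch).
    """
    for group in sorted(set(groups)):
        values = [v for v, g in zip(intensities, groups) if g == group]
        if na_ratio(values) > max_na_ratio:
            return False
        group_cv = cv(values)
        if math.isnan(group_cv) or group_cv > max_cv:
            return False
    return True


GROUPS = ["A"] * 5 + ["B"] * 5
stable = [100, 110, None, 95, 105, 200, 210, 190, None, 205]
sparse = [100, None, None, None, 105, 200, 210, 190, 195, 205]
print(keep_feature(stable, GROUPS), keep_feature(sparse, GROUPS))
```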

Step 3. Missing value imputation

After data quality control, users can click 'Methods'. In this step, users should first select the imputation methods. To limit the running time, the fast methods (left part, 15 methods) are selected by default. Selecting the slow methods (right part, 5 methods) will increase the running time.

After selecting suitable methods, users need to click the 'Calculate' button. A popup window will appear showing the selected methods; click 'OK' to continue:

Step 4. Performance evaluation

Click 'Results and Assessments'. In this step, the data with NA are imputed with the methods chosen above and shown in the 'Results' panel. The results are then evaluated under the four classic criteria and the four proteomic criteria, as shown below:

The tables and figures here are provided under the four classic criteria. 1. This table shows the comprehensive rank of every imputation method; 2-5. These tables show the scores of every imputation method based on 'Normalized root mean squared error (NRMSE)', 'NRMSE-based sum of ranks (SOR)', 'Procrustes sum of squared errors (PSS)', and 'Average correlation coefficient between original value and imputed value (ACC_OI)', respectively; 6. The figures show the normalized scores of every imputation method under the four classic criteria. 'Normalized Values' means that every score is divided by the corresponding maximum value.
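To make the first criterion and the 'Normalized Values' concrete, here is a minimal Python sketch. It uses a common NRMSE definition (root of the mean squared error divided by the variance of the true values) and assumes that true values are available for comparison, for instance from entries that were observed and then deliberately masked. Whether NAguideR uses exactly this variant, or the population rather than the sample variance, is an assumption here.

```python
import math
import statistics


def nrmse(true_values, imputed_values):
    """Normalized root mean squared error: sqrt(MSE / var(true values))."""
    squared_errors = [
        (t - i) ** 2 for t, i in zip(true_values, imputed_values)
    ]
    mse = statistics.fmean(squared_errors)
    return math.sqrt(mse / statistics.pvariance(true_values))


def normalize_scores(scores):
    """Divide every score by the maximum score, as in 'Normalized Values'."""
    max_score = max(scores.values())
    return {method: score / max_score for method, score in scores.items()}


# Hypothetical held-out values and two hypothetical imputation results.
true_values = [1.0, 2.0, 3.0, 4.0]
scores = {
    "method_a": nrmse(true_values, [1.1, 2.1, 2.9, 3.9]),
    "method_b": nrmse(true_values, [2.0, 3.0, 4.0, 5.0]),
}
print(normalize_scores(scores))
```

For NRMSE a lower score indicates a better imputation, so after normalization the worst method has the value 1.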

The tables and figures here are provided under the four proteomic criteria. 1. This table shows the comprehensive rank of every imputation method; 2-5. These tables show the scores of every imputation method based on 'Average correlation coefficient between peptides with different charges (ACC_Charge)', 'Average correlation coefficient between peptides in the same protein (ACC_PepProt)', 'Average correlation coefficient between proteins in the same protein complex (ACC_CORUM)', and 'Average correlation coefficient between proteins in protein-protein interactions (ACC_PPI)', respectively; 6. The figures show the distribution of correlation coefficients for the original values and for the imputed values from every imputation method under the four proteomic criteria.
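The idea behind ACC_Charge can be sketched as follows: profiles of the same peptide measured at different charge states should correlate across samples, so the average pairwise correlation serves as a score. The Python sketch below uses the Pearson coefficient and a simple (peptide, charge, intensities) row format; both are assumptions of this example, not a description of NAguideR's implementation.

```python
import math
from collections import defaultdict
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / (sd_x * sd_y)


def acc_charge(rows):
    """Average correlation between charge states of the same peptide.

    rows: iterable of (peptide, charge, intensities) with complete
    (already imputed) intensity profiles.
    """
    profiles_by_peptide = defaultdict(list)
    for peptide, _charge, intensities in rows:
        profiles_by_peptide[peptide].append(intensities)
    correlations = [
        pearson(a, b)
        for profiles in profiles_by_peptide.values()
        for a, b in combinations(profiles, 2)
    ]
    if not correlations:
        raise ValueError("no peptide has more than one charge state")
    return sum(correlations) / len(correlations)


# Hypothetical imputed data with placeholder peptide sequences.
rows = [
    ("PEPTIDEK", 2, [1.0, 2.0, 3.0, 4.0]),
    ("PEPTIDEK", 3, [2.0, 4.0, 6.0, 8.0]),
    ("SAMPLER", 2, [1.0, 2.0, 3.0, 4.0]),
    ("SAMPLER", 3, [4.0, 3.0, 2.0, 1.0]),
]
print(acc_charge(rows))
```

The same pattern would apply to the other proteomic criteria by changing the grouping key, for example grouping peptides by protein for ACC_PepProt.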

If you have any questions, comments, or suggestions about NAguideR, please feel free to contact: wsslearning@omicsolution.com. We appreciate your use of NAguideR, and your suggestions will be valuable for its future improvement.

Step 1: Upload Original Data

1. Expression data:

1.1 File format:

1.2 Import your data:

Separator:

2. Samples information data:

2.1 File format:

2.2 Import your data:

Separator:

1. Expression data:

2. Samples information data:

Step 2: NA Overview

1. Missing value type:

3. NA ratio:

6. CV threshold (raw scale):

Height for figure:

Step 3: Missing value imputation. All methods are classified by algorithm type. Please select the imputation methods you want (by default, the fast methods in each category are selected), then click the 'Calculate' button.

A. Single value approaches

Method 1: Zero

DOI: 10.1021/acs.jproteome.5b00981

Method 2: Minimum

DOI: 10.1038/s41586-019-0987-8

Method 3: Column median (colmedian)

Package: e1071

Method 4: Row median (rowmedian)

Package: e1071

Method 5: Deterministic minimal value (mindet)

Package: imputeLCMD

Method 6: Stochastic minimal value (minprob)

Package: imputeLCMD

Method 7: Perseus imputation (PI)

Width:

Down shift:

DOI: 10.1038/nmeth.3901
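The 'Width' and 'Down shift' parameters of the Perseus-style imputation (Method 7) can be illustrated with a short Python sketch. It assumes the usual description of this approach: missing values in one sample are drawn from a normal distribution whose mean is shifted down from the observed mean by `down_shift` standard deviations and whose standard deviation is `width` times the observed one, on log-transformed intensities. The defaults of 0.3 and 1.8, the function name, and the fixed seed are assumptions of this sketch.

```python
import random
import statistics


def perseus_style_impute(column, width=0.3, down_shift=1.8, seed=0):
    """Fill None entries of one sample's log-intensities.

    Draws come from N(mean - down_shift * sd, (width * sd) ** 2), where
    mean and sd are computed from the observed values of this sample.
    """
    observed = [v for v in column if v is not None]
    mean = statistics.fmean(observed)
    sd = statistics.stdev(observed)
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    return [
        v if v is not None else rng.gauss(mean - down_shift * sd, width * sd)
        for v in column
    ]


# Hypothetical log2 intensities for one sample, with two missing values.
column = [20.0, 21.0, None, 22.0, None, 23.0]
imputed = perseus_style_impute(column)
print(imputed)
```

Because the shifted distribution sits at the low end of the observed intensities, this approach treats missing values as signals that fell below the detection limit.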

B. Global structure approaches

Method 8: Singular value decomposition (svd)

DOI: 10.1093/bioinformatics/17.6.520

Method 9: Maximum likelihood estimation (mle)

Package: norm

Method 10: Sequential imputation (impseq)

DOI: 10.1016/j.compbiolchem.2007.07.001

Method 11: Robust sequential imputation (impseqrob)

DOI: 10.1016/j.compbiolchem.2008.07.019

Method 12: Bayesian principal component analysis (bpca)

DOI: 10.1093/bioinformatics/btg287

C. Local similarity approaches

Method 13: K-nearest neighbor (knn)

DOI: 10.1093/bioinformatics/17.6.520

Method 14: Sequential knn (seq-knn)

DOI: 10.1186/1471-2105-5-160

Method 15: Quantile regression (qr)

Package: imputeLCMD

Method 16: Local least squares (lls)

DOI: 10.1093/bioinformatics/bth499

Method 17: Glmnet Ridge Regression (GRR)

Package: DreamAI

Method 18: Multiple imputation bayesian linear regression (mice-norm)

DOI: 10.18637/jss.v045.i03

Method 19: Truncation knn (trknn)

DOI: 10.1186/s12859-017-1547-6

Method 20: Iterative robust model (irm)

DOI: 10.18637/jss.v074.i07

Method 21: Generalized Mass Spectrum (GMS)

DOI: 10.1093/bioinformatics/btz488

Method 22: Multiple imputation classification and regression trees (mice-cart)

DOI: 10.18637/jss.v045.i03

Method 23: Random forest model (rf)

Number of trees:

DOI: 10.1093/bioinformatics/btr597

Step 4: Results and Assessments

1. Parameters for 'Results'

1.1. Select one method:

2. Parameters for 'Criteria'

2.1.1. Please select the criterion/criteria you want:

2.1.2. Please set the weighting for each criterion you select:

2.2.1. Please select the criterion/criteria you want:

2.2.2. Please set the weighting for each criterion you select:

Figure height:

1. Comprehensive ranks under classic criteria: