1.1 Abstract
Mass-spectrometry (MS) based quantitative proteomics experiments frequently generate data with missing values, which may profoundly affect downstream analyses.
A wide variety of missing value imputation methods have been established to deal with the missing-value issue. To date,
however, there is a scarcity of efficient, systematic, and easy-to-handle tools that are tailored for the proteomics community.
Herein, we developed a user-friendly and powerful web tool, NAguideR, to enable implementation and evaluation of
twenty popular missing-value imputation algorithms. Evaluation of data imputation results
can be performed through classic computational criteria and, unprecedentedly, proteomic empirical criteria such as
quantitative consistency between different charge-states of the same peptide, different peptides belonging to the same
proteins, and individual proteins participating in the same functional protein complexes. We applied NAguideR to three label-free
proteomic datasets featuring peptide-level, protein-level, and phosphoproteomic variables respectively, all generated by
data-independent acquisition mass spectrometry (DIA-MS) with substantial biological replicates. The results indicate that NAguideR is
able to distinguish the optimal imputation methods for DIA-MS experiments from sub-optimal and
low-performance algorithms. The NAguideR web tool further provides downloadable tables and figures supporting flexible data
analysis and interpretation. The flowchart below summarizes the process of data analysis in NAguideR.
1.2 What exactly does NAguideR do in each step?
As described above, there are four main steps in the data analysis process of NAguideR: (1) Upload of proteomics data; (2) Data quality control;
(3) Missing value imputation; (4) Performance evaluation. Because many users want to know the detailed operations in each step,
the figure below shows the major steps of the data analysis process in NAguideR. As an example, we take two groups of samples
(five biological replicates in each group, labeled A1, A2, A3, A4, A5, B1, B2, B3, B4, B5 in the original intensity data).
'Feature' refers to an identified protein or peptide.
2.1 Input data preparation
NAguideR supports four basic file formats (.csv, .txt, .xlsx, .xls). Before analysis, users should prepare two required files: (1) proteomics expression data and (2) sample information data.
These files can be readily generated from the results of several popular tools such as MaxQuant, PEAKS, and Spectronaut.
Users can then upload the two files to NAguideR in the correct formats and start the subsequent analysis.
2.1.1 Proteomics expression data
There are four types of proteomics expression data supported in NAguideR ('Peptides+Charges+Proteins', 'Peptides+Charges', 'Peptides+Proteins', 'Proteins'), which differ mainly in their first few columns.
In addition, users may upload other kinds of omics data (e.g. genomics, metabolomics) by choosing the fifth type ('Others'). Please note that the fifth type cannot generate results based on the proteomic criteria.
2.1.1.1 Expression data with peptide sequences, peptide charge states, and protein ids
In this situation, peptide sequences, peptide charge states, and protein ids are provided, in that order, in the first three columns of the input file. Peptide sequences in
the first column can be peptides with any post-translational modification (PTM, written in any routine format) or stripped peptides (without PTM). The second column contains the peptide charge states. The protein ids
in the third column should be UniProt ids. From the fourth column onward, the peptide/protein expression intensity or signal abundance in every sample should be listed.
The data structure is shown below:
2.1.1.2 Expression data with peptide sequences and peptide charge states
Similar to the situation above, peptide sequences and peptide charge states are provided, in that order, in the first two columns of the input file. Peptide sequences in
the first column can be peptides with post-translational modification (PTM) or stripped peptides (without PTM). The second column contains the peptide charge states.
From the third column onward, the peptide/protein expression intensity or signal abundance in every sample should be listed. The data structure is shown below:
2.1.1.3 Expression data with peptide sequences and protein ids
In this situation, peptide sequences and protein ids are provided, in that order, in the first two columns of the input file. Peptide sequences in
the first column can be peptides with post-translational modification (PTM) or stripped peptides (without PTM). The protein ids
in the second column should be UniProt ids. From the third column onward, the peptide/protein expression intensity or signal abundance in every sample should be listed. The data structure is shown below:
2.1.1.4 Expression data with protein ids
In this situation, protein ids are provided in the first column of the input file. The protein ids here
should be UniProt ids. From the second column onward, the protein expression intensity or signal abundance in every sample should be listed. The data structure is shown below:
2.1.1.5 Other kinds of omics data
If users want to use NAguideR for other omics data (e.g. genomics, metabolomics), gene/metabolite ids or names should be provided in the first column of the input file.
From the second column onward, the gene/metabolite expression intensity or signal abundance in every sample should be listed. The data structure may look as shown below:
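To make the five layouts above concrete, the following minimal Python sketch separates the leading annotation columns from the per-sample intensity columns for a given input type. It is only an illustration of the file layout, not part of NAguideR: the header names, the peptide sequence `PEPTIDEK`, the accession `P12345`, the intensity values, and the convention that missing values are written as `NA` or left empty are all assumptions made for the example.

```python
import csv
import io

# Number of leading annotation columns for each input type described above.
ID_COLUMNS = {
    "Peptides+Charges+Proteins": 3,
    "Peptides+Charges": 2,
    "Peptides+Proteins": 2,
    "Proteins": 1,
    "Others": 1,
}

# Hypothetical example file; every name and value here is a placeholder.
EXAMPLE_CSV = """Peptide,Charge,Protein,A1,A2,B1,B2
PEPTIDEK,2,P12345,1520.4,NA,1833.0,1710.9
PEPTIDEK,3,P12345,NA,980.2,1011.5,NA
"""

def split_table(text, data_type):
    """Return sample names and, per row, (annotation fields, intensities)."""
    n_id = ID_COLUMNS[data_type]
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]
    samples = header[n_id:]
    parsed = []
    for row in body:
        ids = row[:n_id]
        # Assumption: missing values appear as "NA" or as an empty cell.
        values = [None if cell in ("", "NA") else float(cell) for cell in row[n_id:]]
        parsed.append((ids, values))
    return samples, parsed

samples, parsed = split_table(EXAMPLE_CSV, "Peptides+Charges+Proteins")
print(samples)
print(parsed[0])
```

Changing the second argument to, for example, 'Proteins' would treat only the first column as an annotation column, which mirrors how the input types differ in their first few columns.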
2.1.2 Sample information data
Sample information here means the group identity of each sample, which users should provide. This information enables, for example, a separate filtration strategy for each group in the quality control step. The sample names are in the first column, in the same order as in the expression data. Group information is in the second column. The data structure is shown below:
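The ordering requirement can be checked before upload. The sketch below is a hypothetical helper (the function name `group_columns` and the in-memory representation are assumptions, not part of NAguideR) that verifies the sample names match the expression columns in order and returns the column positions belonging to each group.

```python
def group_columns(expression_samples, sample_info):
    """Map each group to the positions of its samples in the expression data.

    expression_samples: sample column names from the expression file, in order.
    sample_info: list of (sample_name, group) pairs from the sample information file.
    """
    names = [name for name, _ in sample_info]
    if names != list(expression_samples):
        raise ValueError(
            "sample names must match the expression data columns, in the same order"
        )
    groups = {}
    for index, (_, group) in enumerate(sample_info):
        groups.setdefault(group, []).append(index)
    return groups

# The two-group example used in this manual: five replicates per group.
expression_samples = ["A1", "A2", "A3", "A4", "A5", "B1", "B2", "B3", "B4", "B5"]
sample_info = [(name, name[0]) for name in expression_samples]

groups = group_columns(expression_samples, sample_info)
print(groups)
```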
2.2 Operating Procedure of NAguideR (Four steps)
Step 1. Uploading proteomics expression data
After preparing the required data, users can click 'Import data' and upload their own data in the left panel:
If users want to check the example data first, they can choose 'Load example data' and download the example data by clicking the corresponding button:
Step 2. Data quality control
After uploading the data in the correct format, users can click 'NA Overview'. In this part, users can check the NA distribution in their data ('NA distribution' part), and
proteins/peptides with an excessively high proportion of NA or a large coefficient of variation (CV) will be removed ('Filter' part). After setting suitable
parameters, click the 'Calculate' button.
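The filtering idea can be sketched in a few lines of Python. This is a simplified illustration rather than NAguideR's implementation: the thresholds `max_na_ratio=0.5` and `max_cv=0.3` are assumed values chosen for the example, the CV is computed on the observed values only, and the statistics are computed across all samples of a feature instead of per group.

```python
import statistics

def na_ratio(values):
    """Fraction of missing entries (None) in one feature."""
    return sum(v is None for v in values) / len(values)

def cv(values):
    """Coefficient of variation (sample sd / mean) of the observed entries."""
    observed = [v for v in values if v is not None]
    if len(observed) < 2:
        return float("inf")  # too few values to estimate variability
    mean = statistics.mean(observed)
    if mean == 0:
        return float("inf")
    return statistics.stdev(observed) / mean

def keep_feature(values, max_na_ratio=0.5, max_cv=0.3):
    """Keep a feature only if both its NA ratio and its CV are acceptable."""
    return na_ratio(values) <= max_na_ratio and cv(values) <= max_cv

# Hypothetical intensities for three features across five samples.
features = {
    "f1": [100.0, 110.0, None, 105.0, 95.0],   # few NAs, low CV
    "f2": [100.0, None, None, None, 90.0],     # too many NAs
    "f3": [10.0, 200.0, 50.0, None, 400.0],    # very variable
}

kept = [name for name, values in features.items() if keep_feature(values)]
print(kept)
```

A per-group variant would apply the same two checks to the columns of each sample group separately, using the group assignments from the sample information file.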
Step 3. Missing value imputation
After data quality control, users can click 'Methods'. In this step, users should first select the imputation methods. To keep the running time short, the fast methods (left part, 15 methods) are selected by default. If users also choose the slow methods (right part, 5 methods), the running time will be longer.
After selecting suitable methods, users need to click the 'Calculate' button. A popup window will appear showing the selected methods; click the 'OK' button to continue:
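Conceptually, this step runs every selected method on the same filtered data and keeps one imputed table per method. The sketch below illustrates that pattern with two deliberately simple strategies, feature-wise minimum and feature-wise mean replacement. The method names `row_min` and `row_mean` and the function `run_methods` are hypothetical and are not claimed to correspond to any of NAguideR's twenty algorithms.

```python
def impute_row_min(values):
    """Replace missing entries with the smallest observed value of the feature."""
    observed = [v for v in values if v is not None]
    fill = min(observed)
    return [fill if v is None else v for v in values]

def impute_row_mean(values):
    """Replace missing entries with the mean of the observed values of the feature."""
    observed = [v for v in values if v is not None]
    fill = sum(observed) / len(observed)
    return [fill if v is None else v for v in values]

# Hypothetical registry of selectable methods.
METHODS = {"row_min": impute_row_min, "row_mean": impute_row_mean}

def run_methods(matrix, selected):
    """Apply each selected method to every feature; return one table per method.

    Assumes every feature has at least one observed value, which the
    quality control step is expected to guarantee.
    """
    return {name: [METHODS[name](row) for row in matrix] for name in selected}

matrix = [[1.0, None, 3.0], [4.0, 6.0, None]]
results = run_methods(matrix, ["row_min", "row_mean"])
print(results["row_min"])
print(results["row_mean"])
```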
Step 4. Performance evaluation
Click 'Results and Assessments'. In this step, the data with NA are imputed by the methods chosen above and shown in the 'Results' panel. The results are then evaluated
under the four classic criteria and the four proteomic criteria, as shown below:
The tables and figures here are provided under the four classic criteria. 1. This table shows the comprehensive ranks of every imputation method; 2-5. These tables show the scores of every imputation method based on 'Normalized root mean squared error (NRMSE)',
'NRMSE-based sum of ranks (SOR)', 'Procrustes sum of squared errors (PSS)', and 'Average correlation coefficient between original value and imputed value (ACC_OI)', respectively; 6. These figures show the normalized scores of every imputation method
under the four classic criteria. 'Normalized Values' here means that every score is divided by the corresponding maximum value.
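As a rough illustration of how one classic criterion and the 'Normalized Values' could be computed, the sketch below scores two hypothetical methods with NRMSE and then divides each score by the maximum. Several NRMSE definitions are in use and this section does not state which one NAguideR applies, so the normalization by the population standard deviation of the original values is an assumption, as are the method names and numbers.

```python
import math
import statistics

def nrmse(true_values, imputed_values):
    """RMSE divided by the standard deviation of the original values (assumed definition)."""
    squared_errors = [(t - i) ** 2 for t, i in zip(true_values, imputed_values)]
    mse = sum(squared_errors) / len(squared_errors)
    return math.sqrt(mse / statistics.pvariance(true_values))

def normalize_by_max(scores):
    """Divide every score by the largest score, as described for 'Normalized Values'."""
    top = max(scores.values())
    if top == 0:
        return {name: 0.0 for name in scores}
    return {name: value / top for name, value in scores.items()}

# Original values that were hidden on purpose, and what two hypothetical methods imputed.
true_values = [1.0, 2.0, 3.0, 4.0]
imputed = {
    "method_a": [1.0, 2.0, 3.0, 4.0],  # perfect recovery
    "method_b": [2.0, 3.0, 4.0, 5.0],  # constant offset of 1
}

scores = {name: nrmse(true_values, values) for name, values in imputed.items()}
normalized = normalize_by_max(scores)
print(scores)
print(normalized)
```

For error-type scores such as NRMSE, lower is better, so after normalization the worst method in the comparison has the value 1.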
The tables and figures here are provided under the four proteomic criteria. 1. This table shows the comprehensive ranks of every imputation method; 2-5. These tables show the scores of every imputation method based on
'Average correlation coefficient between peptides with different charges (ACC_Charge)', 'Average correlation coefficient between peptides in the same protein (ACC_PepProt)', 'Average correlation coefficient between proteins in the same protein complex (ACC_CORUM)', and
'Average correlation coefficient between proteins in protein-protein interactions (ACC_PPI)', respectively; 6. These figures show the correlation coefficient distribution of the original values and the imputed values from every imputation method
under the four proteomic criteria.
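The idea behind the first proteomic criterion can be sketched as follows: for every peptide quantified in more than one charge state, correlate the intensity profiles of the charge states across samples, then average the correlations. This is an assumption-laden illustration and not NAguideR's code; the use of Pearson correlation, the pairwise averaging, the function names, and the example data are all chosen for the sketch.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation of two equally long profiles with non-zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def acc_charge(rows):
    """Average correlation between charge states of the same peptide.

    rows: list of (peptide, charge, intensities) taken from an imputed table.
    Returns None if no peptide has more than one charge state.
    """
    by_peptide = {}
    for peptide, _charge, values in rows:
        by_peptide.setdefault(peptide, []).append(values)
    correlations = [
        pearson(a, b)
        for profiles in by_peptide.values()
        for a, b in combinations(profiles, 2)
    ]
    if not correlations:
        return None
    return sum(correlations) / len(correlations)

# Hypothetical imputed intensities across four samples.
rows = [
    ("PEPTIDEK", 2, [1.0, 2.0, 3.0, 4.0]),
    ("PEPTIDEK", 3, [2.0, 4.0, 6.0, 8.0]),  # perfectly consistent with charge 2
    ("SAMPLER", 2, [1.0, 2.0, 3.0, 4.0]),
    ("SAMPLER", 3, [4.0, 3.0, 2.0, 1.0]),   # opposite trend to charge 2
]

print(acc_charge(rows))
```

The same pattern would apply to the other three criteria by changing the grouping key, for example grouping peptides by protein id or proteins by complex membership. A higher average correlation suggests that the imputed values preserve the expected quantitative consistency.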
If you have any questions, comments, or suggestions about NAguideR, please feel free to contact: wsslearning@omicsolution.com. We really appreciate your use of NAguideR, and your suggestions will be valuable for its future improvement.