A concern of the proteomics community is the confidence that can be placed in peptide and protein assignments from tandem MS data. The number of single-peptide protein identifications in peer-reviewed publications is unaccountably high, which raises doubts over the validity of the results.
LC-MSE is a parallel, unbiased approach to data acquisition that increases both the number and the reproducibility of the peptides sampled during an LC-MS experiment. A novel database search algorithm is presented for the qualitative identification of peptides and proteins from LC-MSE data, in which multiple precursor ions are fragmented simultaneously. Properties used by the algorithm include retention time, precursor and product ion intensities, charge states, and, crucially, the accurate masses of both the precursor and product ions. This strategy has been shown to be highly effective for the identification of proteins in both simple and complex samples over a wide dynamic range.
The database search algorithm is an iterative process whereby each iteration incrementally increases the selectivity, specificity, and sensitivity of the overall strategy.
Tentative peptide and protein identifications are ranked and scored by their relative correlation to a number of well-established models of known and empirically derived physicochemical attributes of proteins and peptides. The algorithm utilizes reverse or random decoy databases for automatically determining the false positive identification rate.
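To make the decoy idea concrete, the following minimal Python sketch estimates a false positive rate from a combined target and decoy search. The text does not state how the software computes the rate; the decoy-to-target ratio used here is a common convention and is an assumption, as are the function names, the (score, is_decoy) data layout, and the example scores.

```python
def estimated_false_positive_rate(hits, threshold):
    """Estimate the false positive rate among hits scoring >= threshold.

    hits: list of (score, is_decoy) pairs from a search against a
    combined target + decoy database (hypothetical data layout).
    Assumption: rate = accepted decoy hits / accepted target hits.
    """
    accepted = [is_decoy for score, is_decoy in hits if score >= threshold]
    n_decoy = sum(accepted)
    n_target = len(accepted) - n_decoy
    if n_target == 0:
        # No target hits accepted: rate is undefined unless nothing matched.
        return float("inf") if n_decoy else 0.0
    return n_decoy / n_target


def lowest_threshold_within(hits, max_rate):
    """Return the lowest observed score whose estimated rate is <= max_rate."""
    passing = [
        score for score, _ in hits
        if estimated_false_positive_rate(hits, score) <= max_rate
    ]
    return min(passing) if passing else None


# Illustrative scores only, not data from this study.
example_hits = [
    (9.1, False), (8.7, False), (8.2, False),
    (7.9, True), (7.5, False), (6.0, True),
]
```

Calling lowest_threshold_within(example_hits, 0.04) would return the score cutoff that keeps the estimated rate at or below a 4% limit for this toy list.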
The data presented demonstrate the ability of the method to correctly identify peptides and proteins from data-independent acquisition strategies with high sensitivity and specificity.
After data acquisition and processing, which generate a file containing precursor and product ion masses for each peptide, the user defines several parameters for the databank search, e.g., the database, the database type, and whether to create a reverse or random decoy database (see Figures 2a to 2d).
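As an illustration of the two decoy options, the sketch below builds a decoy database from a dictionary of protein sequences. It is a sketch under stated assumptions: "reverse" is taken to mean reversing each sequence, and "random" is taken to mean shuffling the residues of each sequence, which may differ from what the software does. The function names, the DECOY_ prefix, and the example sequence are hypothetical.

```python
import random


def reverse_decoy(sequence):
    """Return the protein sequence read from C-terminus to N-terminus."""
    return sequence[::-1]


def random_decoy(sequence, rng):
    """Return a shuffled copy of the sequence (same residue composition)."""
    residues = list(sequence)
    rng.shuffle(residues)
    return "".join(residues)


def build_decoy_database(proteins, mode="reverse", seed=0):
    """Build a decoy entry for every protein in a {accession: sequence} dict.

    mode: "reverse" or "random" (assumed meanings, see lead-in).
    seed: fixes the shuffle so the decoy database is reproducible.
    """
    rng = random.Random(seed)
    decoys = {}
    for accession, sequence in proteins.items():
        if mode == "reverse":
            decoy = reverse_decoy(sequence)
        elif mode == "random":
            decoy = random_decoy(sequence, rng)
        else:
            raise ValueError("mode must be 'reverse' or 'random'")
        decoys["DECOY_" + accession] = decoy
    return decoys


# Hypothetical entry for illustration only.
example_proteins = {"PROT_A": "MKTAYIAK"}
```

Searching the target and decoy entries together gives every spectrum an equal chance of matching a sequence that cannot be present in the sample, which is what makes decoy hits usable as an error estimate.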
The sample used in this study was a mixture of four standard proteins, tryptically digested and spiked into a digested cytosolic lysate of E. coli. The total column load was 500 ng: 460 ng of the cytosolic E. coli lysate and 40 ng of the four standard proteins.
A nanoACQUITY UPLC System was used with a 75 μm × 10 cm bridged ethyl hybrid C18 column (1.7 μm); Gradient: 3 to 40% B over 90 min @ 250 nL/min; eluents A and B: 0.1% formic acid in water and in acetonitrile, respectively.
A SYNAPT MS System was operated in the LC-MSE alternate scanning mode. Low and elevated energy spectra were acquired every 1.5 s; Elevated energy collision energy ramp: 15 to 40 V over 1.5 s; Lock Mass: 100 fmol/μL [Glu1]-Fibrinopeptide B @ 500 nL/min sampled once every 30 s; TOF resolution: 10,000 (V mode of acquisition).
A flow diagram showing the hierarchical principle of the database search algorithm is given in Figure 3. In a so-called “first pass” search, a pre-assessment survey of this particular experimental dataset was followed by a database search encompassing the physicochemical properties of peptides and proteins in the liquid and gas phases. The identified peptides were then ranked and collapsed into proteins.
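The gradient conditions can be turned into numbers with a short calculation. The sketch below assumes the gradient is linear between its start and end compositions, which the text does not state explicitly; the function names are hypothetical.

```python
def percent_b(t_min, start=3.0, end=40.0, duration_min=90.0):
    """Mobile phase composition (% B) at time t_min, assuming a linear gradient."""
    if not 0 <= t_min <= duration_min:
        raise ValueError("time is outside the gradient window")
    return start + (end - start) * t_min / duration_min


def delivered_volume_ul(flow_nl_per_min=250.0, duration_min=90.0):
    """Total eluent volume delivered over the gradient, in microlitres."""
    return flow_nl_per_min * duration_min / 1000.0
```

Under the linearity assumption, the composition halfway through the run (45 min) is 21.5% B, and the 90 min gradient at 250 nL/min delivers 22.5 μL of eluent.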
Following this, a subset database was generated and a second pass search conducted, which was subsequently used to identify user-defined variable peptide modifications and peptide fragments.
The results of the various search iterations were mapped together and a protein ranking process initiated.
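The two steps that link the passes, collapsing peptides into proteins and restricting the second pass to a subset database, can be sketched as follows. This is an illustration of the data flow only, not the scoring or ranking used by the algorithm; the function names, the tuple layout, and the example entries are hypothetical.

```python
from collections import defaultdict


def collapse_peptides(peptide_hits):
    """Group first-pass peptide hits by the protein they were assigned to.

    peptide_hits: list of (peptide, accession, score) tuples (assumed layout).
    Returns {accession: [(peptide, score), ...]} with the best score first.
    """
    proteins = defaultdict(list)
    for peptide, accession, score in peptide_hits:
        proteins[accession].append((peptide, score))
    return {
        accession: sorted(peptides, key=lambda item: item[1], reverse=True)
        for accession, peptides in proteins.items()
    }


def subset_database(full_database, first_pass_proteins):
    """Keep only the database entries identified in the first pass."""
    return {
        accession: sequence
        for accession, sequence in full_database.items()
        if accession in first_pass_proteins
    }


# Hypothetical entries for illustration only.
example_database = {"PROT_A": "MKTAYIAK", "PROT_B": "MSLLNVPAGK", "PROT_C": "MAAGK"}
example_hits = [
    ("TAYIAK", "PROT_A", 6.4),
    ("MK", "PROT_A", 7.1),
    ("MSLLNVPAGK", "PROT_B", 5.0),
]
```

Restricting the second pass to proteins already supported by first-pass peptides keeps the search space small when variable modifications and peptide fragments are added.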
The results from this iterative search program show a high degree of replication at both the peptide and protein levels, as presented in Figures 4 and 5, respectively. The latter addresses one of the major concerns in dealing with tandem MS data in proteomics experiments.
To illustrate the selectivity, specificity, and sensitivity of the scoring and validation process, an E. coli LC-MSE dataset was queried against species-specific databases of six different bacterial proteomes, using a 4% acceptable false positive rate. Searching of the E. coli data against the different bacterial proteomes should only result in the identification of homologous proteins between the organisms, and not a large number of low scoring, spurious proteins. This is a common failing with traditional tandem MS data and existing search engines.
The results from the IdentityE search are displayed in Figure 6, with peptides displayed in magenta identified to a species-specific protein. A high degree of specificity is afforded by the search strategy, providing confidence in the results obtained. The blue, red, and green ions represent matched ions corresponding to tryptic peptides, missed cleavage products, and variable modifications from YEAST_ADH, one of the exogenous spiked-in proteins.
An added unique benefit of the LC-MSE data is the ability to generate absolute quantification values for each identified protein that contains more than two peptides [1]. For the E. coli dataset this is illustrated in Figure 8, where the absolute amounts for each of the proteins are shown. Close to three orders of magnitude of identification dynamic range can be obtained from the 438 proteins replicating in at least two of the three injections. With the absolute quantification functionality of the software, the 438 proteins were found to account for 96% (480 ng) of the theoretical loading of 500 ng.
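The replication criterion and the summary figures quoted above can be checked with a few lines of Python. The sketch assumes each injection is represented as a collection of protein accessions; the function names and the toy accessions are hypothetical, and only the 480 ng and 500 ng values come from the text.

```python
import math
from collections import Counter


def replicating_proteins(injections, min_injections=2):
    """Return proteins identified in at least min_injections injections.

    injections: list of collections of accessions, one per injection.
    """
    counts = Counter(acc for injection in injections for acc in set(injection))
    return {acc for acc, n in counts.items() if n >= min_injections}


def fraction_of_load(quantified_ng, column_load_ng):
    """Fraction of the column load accounted for by the quantified proteins."""
    return quantified_ng / column_load_ng


def dynamic_range_orders(amounts_ng):
    """Orders of magnitude spanned by the quantified amounts."""
    return math.log10(max(amounts_ng) / min(amounts_ng))


# Toy accessions for illustration only.
example_injections = [
    {"PROT_A", "PROT_B", "PROT_C"},
    {"PROT_A", "PROT_B"},
    {"PROT_A", "PROT_D"},
]
```

With the values from the text, fraction_of_load(480, 500) gives 0.96, matching the reported 96%. A dynamic range of three orders of magnitude corresponds to a 1000-fold ratio between the most and least abundant quantified proteins.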
720002631, May 2008