Analytical Guidelines for co-fractionation Mass Spectrometry Obtained through Global Profiling of Gold Standard Saccharomyces cerevisiae Protein Complexes

      Saccharomyces cerevisiae CF-MS data sets using very high proteome coverage libraries of yeast gold standard complexes. A new method for identifying gold standard complexes in CF-MS data, Reference Complex Profiling, and the Extending ‘Guilt-by-Association’ by Degree (EGAD) R package are used for these evaluations, which are verified with concurrent analyses of published human data. By evaluating data collection designs, which involve fractionation of cell lysates, it is found that near-maximum recall of complexes can be achieved with fewer samples than were collected in published studies. Distributing sample collection across orthogonal fractionation methods, rather than relying on a single high-resolution data set, leads to particularly efficient recall. By evaluating 17 different similarity scoring metrics, which are central to CF-MS data analysis, it is found that two metrics rarely used in past CF-MS studies (Spearman and Kendall correlations) and the recently introduced Co-apex metric frequently maximize recall, whereas a popular metric, Euclidean distance, delivers poor recall. The common practice of integrating external genomic data into CF-MS data analysis is also evaluated, revealing that this practice may improve the precision and recall of known complexes but is generally unsuitable for predicting novel complexes in model organisms. If studying nonmodel organisms using orthologous genomic data, it is found that particular subsets of fractionation profiles (e.g. the lowest abundance quartile) should be excluded to minimize false discovery. These assessments are summarized in a series of universally applicable guidelines for precise, sensitive and efficient CF-MS studies of known complexes, and effective predictions of novel complexes for orthogonal experimental validation.

      The physical interactions associated with proteins give rise to multiprotein complexes, other macromolecular assemblies and signaling pathways that are essential for cellular processes. These interactions can be graphed to produce complex networks, and potent insights into biological function can be gained by studying the topologies (Vidal et al., Interactome networks and human disease) and dynamics (Ideker and Krogan, Differential network biology) of these networks. For this reason, one enduring goal of contemporary biology has been to comprehensively map the protein-protein interaction (PPI) networks that occur within organisms (Vidal et al., Interactome networks and human disease; Bonetta, Protein–protein interactions: interactome under construction). High-throughput methods that have been developed to meet this goal include yeast two-hybrid (Y2H), affinity purification-MS (AP-MS) and co-fractionation MS (CF-MS).
      Of these high-throughput methods, CF-MS is unique in that it does not rely on heterologous expression or the genetic manipulation of cells or organisms. CF-MS has thus been able to predict endogenous and unmanipulated protein complexes on an unprecedented scale (Havugimana et al., A census of human soluble protein complexes; Wan et al., Panorama of ancient metazoan macromolecular complexes) and to infer their PPIs (Drew et al., Identifying direct contacts between protein complex subunits from their conditional dependence in proteomics datasets). This has enabled notable insights into, for example, the evolution of eukaryotic protein complexes (Wan et al., Panorama of ancient metazoan macromolecular complexes) and protein complex disassembly during apoptosis (Scott et al., Interactome disassembly during apoptosis occurs independent of caspase cleavage).
      CF-MS involves extensive fractionation of cellular lysates – and the protein complexes therein – using one or more nondenaturing separation techniques. The resulting fractionation profiles of individual protein complex subunits are measured using quantitative proteomics. As subunits of intact complexes will co-fractionate, complexes can be bioinformatically predicted from these data using the correlations between fractionation profiles as a feature of central importance. Previous studies have typically made these predictions either within a machine learning framework (Havugimana et al., A census of human soluble protein complexes; Wan et al., Panorama of ancient metazoan macromolecular complexes; Larance et al., Global membrane protein interactome analysis using in vivo crosslinking and MS-based protein correlation profiling; Shatsky et al., Quantitative tagless co-purification: a method to validate and identify protein-protein interactions; Stacey et al., A rapid and accurate approach for prediction of interactomes from co-elution data (PrInCE); Crozier et al., Prediction of protein complexes in Trypanosoma brucei by protein correlation profiling mass spectrometry and machine learning; Carlson et al., Profiling the Escherichia coli membrane protein interactome captured in Peptidisc libraries; McWhite et al., A pan-plant protein complex map reveals deep conservation and novel assemblies), or by identifying interactions contained in existing protein interaction maps using techniques such as complex-centric profiling (Heusel et al., Complex-centric proteome profiling by SEC-SWATH-MS; Heusel et al., A global screen for assembly state changes of the mitotic proteome by SEC-SWATH-MS).
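      To make the idea of scoring similarity between fractionation profiles concrete, the following minimal Python sketch compares toy profiles using Pearson correlation, Spearman correlation and Euclidean distance. The protein profiles are invented for illustration, and `co_apex` is a deliberately simplified stand-in that only checks whether two raw profiles peak in the same fraction; published Co-apex implementations may be defined differently (for example, on fitted peaks).

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a flat profile carries no correlation information
    return cov / (sd_x * sd_y)


def ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def euclidean(x, y):
    """Euclidean distance (smaller means more similar)."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def co_apex(x, y):
    """Simplified assumption: 1.0 if both profiles peak in the same fraction."""
    return 1.0 if x.index(max(x)) == y.index(max(y)) else 0.0


# Hypothetical intensities across 8 fractions (illustrative values only).
subunit_a = [0, 1, 5, 20, 8, 2, 0, 0]
subunit_b = [0, 2, 9, 40, 15, 3, 1, 0]   # co-elutes with subunit_a, higher abundance
unrelated = [10, 4, 1, 0, 0, 2, 9, 3]    # elutes in different fractions

for name, other in [("subunit_b", subunit_b), ("unrelated", unrelated)]:
    print(name,
          round(pearson(subunit_a, other), 3),
          round(spearman(subunit_a, other), 3),
          round(euclidean(subunit_a, other), 3),
          co_apex(subunit_a, other))
```

      Note that the co-eluting pair differs two-fold in abundance, so its Euclidean distance is large even though the profile shapes agree. This is one way in which distance-based and correlation-based scores can disagree when profiles are not normalized.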
      Despite the successes of CF-MS, the method is yet to reach maturity. This is reflected in the diverse range of methodologies employed for CF-MS data collection and analysis (Table I). Regarding data collection, little is known about how different types and combinations of fractionation, and extents of fraction collection, impact upon the precision and recall of protein complex identification. This has resulted in CF-MS studies of dramatically different experimental cost. For example, CF-MS data sets comprising 40 size exclusion chromatography (SEC) fractions have previously been used for broad-scale identification of complexes in human U2OS cells (Kirkwood et al., Characterisation of native protein complexes and protein isoform variation using size-fractionation based quantitative proteomics), whereas in another study, performed on the lower complexity organism Desulfovibrio vulgaris, 5,273 fractions collected using multiple levels of chromatography were used for a similar purpose (Shatsky et al., Quantitative tagless co-purification: a method to validate and identify protein-protein interactions). Regarding data analysis, a variety of correlation metrics, genomic data or other features have been employed in the machine learning classifiers (Havugimana et al., A census of human soluble protein complexes; Wan et al., Panorama of ancient metazoan macromolecular complexes; Larance et al., Global membrane protein interactome analysis using in vivo crosslinking and MS-based protein correlation profiling; Shatsky et al., Quantitative tagless co-purification: a method to validate and identify protein-protein interactions; Stacey et al., A rapid and accurate approach for prediction of interactomes from co-elution data (PrInCE); Crozier et al., Prediction of protein complexes in Trypanosoma brucei by protein correlation profiling mass spectrometry and machine learning; Carlson et al., Profiling the Escherichia coli membrane protein interactome captured in Peptidisc libraries; McWhite et al., A pan-plant protein complex map reveals deep conservation and novel assemblies) and complex-centric profiling workflows (Heusel et al., Complex-centric proteome profiling by SEC-SWATH-MS; Heusel et al., A global screen for assembly state changes of the mitotic proteome by SEC-SWATH-MS) used to identify protein complexes. There is as yet no consensus on how effective these different features are. For example, a recent re-analysis of a large-scale CF-MS data set indicated that, among the novel PPIs identified by the machine learning classifier employed in the original study, >85% were likely false positives and <7% overlapped across independent screens (Shatsky et al., Quantitative tagless co-purification: a method to validate and identify protein-protein interactions). Together, these examples speak to the urgent need for a better fundamental understanding of the recall and precision of protein complex identification using CF-MS.
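      As a reminder of what precision and recall mean in this setting, the following minimal Python sketch scores a set of predicted PPIs against a gold standard of known complexes. The complex names, protein identifiers and predictions are hypothetical. Scoring only those predictions in which both proteins belong to the gold standard is an assumed convention here, not necessarily the procedure used in any of the cited studies.

```python
from itertools import combinations


def complex_pairs(complexes):
    """Expand each complex into its set of unordered within-complex protein pairs."""
    pairs = set()
    for members in complexes.values():
        for a, b in combinations(sorted(set(members)), 2):
            pairs.add((a, b))
    return pairs


def precision_recall(predicted_pairs, gold_complexes):
    """Precision and recall of predicted PPIs against gold standard complexes.

    Assumption: a prediction is scored only if both proteins occur in the
    gold standard, so potentially novel interactions are neither rewarded
    nor penalized.
    """
    gold = complex_pairs(gold_complexes)
    gold_proteins = {protein for pair in gold for protein in pair}
    scored = {
        tuple(sorted(pair))
        for pair in predicted_pairs
        if set(pair) <= gold_proteins
    }
    true_positives = len(scored & gold)
    precision = true_positives / len(scored) if scored else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical gold standard: two complexes, giving four within-complex pairs.
gold_complexes = {
    "complex_1": ["A", "B", "C"],
    "complex_2": ["D", "E"],
}

# Hypothetical predictions; ("X", "Y") lies outside the gold standard.
predicted = [("B", "A"), ("A", "C"), ("A", "D"), ("C", "E"), ("X", "Y")]

precision, recall = precision_recall(predicted, gold_complexes)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

      In this toy case two of the four scored predictions are within-complex pairs, and two of the four gold standard pairs are recovered, so both precision and recall are 0.5.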
      Table I. Recent co-fractionation mass spectrometry (CF-MS) studies

      | Organism(s) Studied | Fractionation Method(s) Used | Number of Fractions Collected per Experiment | Similarity Scoring Metric(s) Used (a) | Data Interpretation Methods | Ref. (Year) |
      |---|---|---|---|---|---|
      | Human (HeLa and HEK293 cells) | Primarily ion exchange chromatography (IEX) | 43, 120 or 269/375 † | Pearson correlation, normalized cross-correlation (NCC), Apex Location | Supervised machine learning to infer PPIs (CORUM complexes for training and testing) and clustering to infer complexes | Havugimana et al., A census of human soluble protein complexes (2012) |
      | Human (HeLa cells) | Size exclusion chromatography (SEC) | 50 | Euclidean distance | Fractionation profiles deconvolved into Gaussian curves, and features of curves compared to infer PPIs | Kristensen et al., A high-throughput approach for measuring temporal changes in the interactome (2012) |
      | Human plasma | SEC, IEX, isoelectric focusing (IEF) | 17 (SEC), 23 (IEX), 20 (IEF) | Pearson correlation | Correlations from each fractionation method summed and used to infer PPIs | Gordon et al., Multi-dimensional co-separation analysis reveals protein–protein interactions defining plasma lipoprotein subspecies (2013) |
      | Human (U2OS cells) | SEC | 40 | Euclidean distance | Clustering to infer complexes; separate cross-comparisons with CORUM to identify high confidence complexes | Kirkwood et al., Characterisation of native protein complexes and protein isoform variation using size-fractionation based quantitative proteomics (2013) |
      | Arabidopsis thaliana | SEC | 34 | Euclidean distance | Clustering to infer complexes | Aryal et al., A proteomic strategy for global analysis of plant protein complexes (2014) |
      | Caenorhabditis elegans, Drosophila melanogaster, Mus musculus, Strongylocentrotus purpuratus, human (CB660 and G166 cells), Xenopus laevis, Nematostella vectensis, Dictyostelium discoideum, Saccharomyces cerevisiae | IEX | 120 | Pearson correlation, NCC, Apex Location, 1 minus Euclidean distance | Supervised machine learning to infer PPIs (CORUM complexes for training and testing) and clustering to infer complexes | Wan et al., Panorama of ancient metazoan macromolecular complexes (2015) |
      | Human (U2OS cells) | SEC | 48 | Pearson correlation, NCC, Apex Location, Euclidean distance | Supervised machine learning to infer PPIs (CORUM complexes for training and testing) and clustering to infer complexes | Larance et al., Global membrane protein interactome analysis using in vivo crosslinking and MS-based protein correlation profiling (2016) |
      | Desulfovibrio vulgaris | Successive fractionation by IEX, HIC and SEC | 5,273 | Pearson correlation | Supervised machine learning to infer PPIs (curated Escherichia coli PPIs for training and testing) | Shatsky et al., Quantitative tagless co-purification: a method to validate and identify protein-protein interactions (2016) |
      | Arabidopsis thaliana | SEC, sucrose velocity | 24/48 (SEC), 50 (sucrose velocity) | Not specified | Fractionation profiles deconvolved into Gaussian curves; clustering of curves to infer putative complexes; Arabidopsis complexes predicted using orthologous CORUM complexes | McBride et al., Global analysis of membrane-associated protein oligomerization using protein correlation profiling (2017) |
      | Human (Jurkat T cells) | SEC | 80 | Euclidean distance | Fractionation profiles deconvolved into Gaussian curves; features of curves compared to infer PPIs | Scott et al., Interactome disassembly during apoptosis occurs independent of caspase cleavage (2017) |
      | Trypanosoma brucei | SEC, IEX | 48 (SEC), 96 (IEX) | Pearson correlation, NCC, Apex Location, Euclidean distance | Supervised machine learning to infer PPIs (CORUM and manually curated complexes for training and testing) and clustering to infer complexes | Crozier et al., Prediction of protein complexes in Trypanosoma brucei by protein correlation profiling mass spectrometry and machine learning (2017) |
      | Mus musculus | SEC | 80 | Pearson correlation, Euclidean distance, Peak Location, Co-apex | PrInCE workflow: fractionation profiles deconvolved into Gaussian curves; supervised machine learning to infer PPIs (CORUM for training and testing) | Skinnider et al., An atlas of protein-protein interactions across mammalian tissues (2018) |
      | Arabidopsis thaliana | SEC, IEX | 38 (SEC), 65 (IEX) | Squared Euclidean distance | Clustering to infer complexes, performed separately for IEX, SEC and IEX+SEC (reproducible profiles only) data sets | McBride et al., A label-free mass spectrometry method to predict endogenous protein complex composition (2019) |
      | Human (HEK293 cells) | SEC | 81 | Pearson correlation | CCprofiler workflow: detection of co-eluting subunits of known complexes using a sliding fraction window; FDR estimation using decoy complexes | Heusel et al., Complex-centric proteome profiling by SEC-SWATH-MS (2019) |
      | Human (HeLa CCL2 cells) | SEC | 90 | Pearson correlation | CCprofiler workflow | Heusel et al., A global screen for assembly state changes of the mitotic proteome by SEC-SWATH-MS (2019) |
      | Escherichia coli | SEC | 54 | Pearson correlation, Euclidean distance, Peak Location, Co-apex | PrInCE workflow using manually curated complexes for training and testing of the machine learning classifier; clustering to infer complexes | Carlson et al., Profiling the Escherichia coli membrane protein interactome captured in Peptidisc libraries (2019) |
      | Arabidopsis thaliana, Brassica oleracea, Glycine max, Cannabis sativa, Solanum lycopersicum, Chenopodium quinoa, Zea mays, Oryza sativa ssp. japonica, Triticum aestivum, Cocos nucifera, Ceratopteris richardii, Selaginella moellendorffii, Chlamydomonas reinhardtii | SEC, IEX, IEF | 70 (SEC), 60 (IEX), 24 (IEF) | Pearson correlation, Spearman correlation, Euclidean distance, Bray-Curtis similarity, stationary cross-correlation | Assignment of proteins from each species into orthogroups; supervised machine learning to infer orthogroup PPIs (human CORUM and manually curated plant complexes for training and testing) and clustering to infer complexes | McWhite et al., A pan-plant protein complex map reveals deep conservation and novel assemblies (2020) |

      † Multiple IEX columns were used, and different numbers of fractions were collected using each column.
      (a) The term similarity scoring metric refers to the scoring (e.g. correlation) metric used to match fractionation profiles. Details for specific metrics are cited in Materials and Methods: Experimental design and statistical rationale.
      Previous attempts to gain such understanding have relied on the use of gold standard protein complexes or PPIs to assess CF-MS data. Specifically, these gold standards have been used to provide an indication of co-purification accuracy during fractionation (Shatsky et al., Quantitative tagless co-purification: a method to validate and identify protein-protein interactions), to profile CF-MS data sets (Heusel et al., Complex-centric proteome profiling by SEC-SWATH-MS; Heusel et al., A global screen for assembly state changes of the mitotic proteome by SEC-SWATH-MS; Kirkwood et al., Characterisation of native protein complexes and protein isoform variation using size-fractionation based quantitative proteomics), to identify thresholds for predicting high confidence interactions from distance-based similarities between protein fractionation profiles (Scott et al., Interactome disassembly during apoptosis occurs independent of caspase cleavage; Kristensen et al., A high-throughput approach for measuring temporal changes in the interactome; Scott et al., Development of a computational framework for the analysis of protein correlation profiling and spatial proteomics experiments), or to train and test classifiers developed using supervised machine learning (Havugimana et al., A census of human soluble protein complexes; Wan et al., Panorama of ancient metazoan macromolecular complexes; Larance et al., Global membrane protein interactome analysis using in vivo crosslinking and MS-based protein correlation profiling; Shatsky et al., Quantitative tagless co-purification: a method to validate and identify protein-protein interactions; Stacey et al., A rapid and accurate approach for prediction of interactomes from co-elution data (PrInCE); Crozier et al., Prediction of protein complexes in Trypanosoma brucei by protein correlation profiling mass spectrometry and machine learning; Carlson et al., Profiling the Escherichia coli membrane protein interactome captured in Peptidisc libraries; McWhite et al., A pan-plant protein complex map reveals deep conservation and novel assemblies). The gold standards used for these purposes represented only limited portions of the human or bacterial complexomes under investigation, and often exist only under a restricted range of experimental conditions (Stacey et al., Context-specific interactions in literature-curated protein interaction databases). Moreover, they have often been employed under the assumption that they reflect the characteristics of protein complexes that are uniquely uncovered using CF-MS (
      Havugimana et al., A census of human soluble protein complexes; Wan et al., Panorama of ancient metazoan macromolecular complexes; Scott et al., Interactome disassembly during apoptosis occurs independent of caspase cleavage; Larance et al., Global membrane protein interactome analysis using in vivo crosslinking and MS-based protein correlation profiling; Shatsky et al., Quantitative tagless co-purification: a method to validate and identify protein-protein interactions; Stacey et al., A rapid and accurate approach for prediction of interactomes from co-elution data (PrInCE); Crozier et al., Prediction of protein complexes in Trypanosoma brucei by protein correlation profiling mass spectrometry and machine learning; Carlson et al., Profiling the Escherichia coli membrane protein interactome captured in Peptidisc libraries; McWhite et al., A pan-plant protein complex map reveals deep conservation and novel assemblies; Kristensen et al., A high-throughput approach for measuring temporal changes in the interactome
      ,
      • Scott N.E.
      • Brown L.M.
      • Kristensen A.R.
      • Foster L.J.
      Development of a computational framework for the analysis of protein correlation profiling and spatial proteomics experiments.
      ); an assumption that is difficult to test without very high proteome-coverage reference libraries of gold standards. As such, these gold standards have only provided limited insight into the best-practice collection and analysis of CF-MS data. Because extensive reference libraries of gold standard complexes and PPIs only exist for a few model organisms, very high proteome-coverage profiling of CF-MS data sets using gold standards has not yet been possible.
      To overcome these limitations in the use of gold standards, and the resultant dearth of analytical guidelines for CF-MS, here we assess novel and published (
      • Wan C.
      • Borgeson B.
      • Phanse S.
      • Tu F.
      • Drew K.
      • Clark G.
      • Xiong X.
      • Kagan O.
      • Kwan J.
      • Bezginov A.
      • Chessman K.
      • Pal S.
      • Cromar G.
      • Papoulas O.
      • Ni Z.
      • Boutz D.R.
      • Stoilova S.
      • Havugimana P.C.
      • Guo X.
      • Malty R.H.
      • Sarov M.
      • Greenblatt J.
      • Babu M.
      • Derry W.B.
      • Tillier E.R.
      • Wallingford J.B.
      • Parkinson J.
      • Marcotte E.M.
      • Emili A.
      Panorama of ancient metazoan macromolecular complexes.
      ) CF-MS data sets of the model organism Saccharomyces cerevisiae, obtained using SEC and ion exchange chromatography (IEX) respectively. Saccharomyces cerevisiae has high proteome-coverage reference libraries of gold standard protein complexes and PPIs (
      • Cusick M.E.
      • Klitgord N.
      • Vidal M.
      • Hill D.E.
      Interactome: gateway into systems biology.
      ,
      • Benschop J.J.
      • Brabers N.
      • van Leenen D.
      • Bakker L.V.
      • van Deutekom H.W.
      • van Berkum N.L.
      • Apweiler E.
      • Lijnzaad P.
      • Holstege F.C.
      • Kemmeren P.
      A consensus of core protein complex compositions for Saccharomyces cerevisiae.
      ,
      • Babu M.
      • Vlasblom J.
      • Pu S.
      • Guo X.
      • Graham C.
      • Bean B.D.M.
      • Burston H.E.
      • Vizeacoumar F.J.
      • Snider J.
      • Phanse S.
      • Fong V.
      • Tam Y.Y.C.
      • Davey M.
      • Hnatshak O.
      • Bajaj N.
      • Chandran S.
      • Punna T.
      • Christopolous C.
      • Wong V.
      • Yu A.
      • Zhong G.
      • Li J.
      • Stagljar I.
      • Conibear E.
      • Wodak S.J.
      • Emili A.
      • Greenblatt J.F.
      Interaction landscape of membrane-protein complexes in Saccharomyces cerevisiae.
      ) derived from exhaustive AP-MS (
      • Babu M.
      • Vlasblom J.
      • Pu S.
      • Guo X.
      • Graham C.
      • Bean B.D.M.
      • Burston H.E.
      • Vizeacoumar F.J.
      • Snider J.
      • Phanse S.
      • Fong V.
      • Tam Y.Y.C.
      • Davey M.
      • Hnatshak O.
      • Bajaj N.
      • Chandran S.
      • Punna T.
      • Christopolous C.
      • Wong V.
      • Yu A.
      • Zhong G.
      • Li J.
      • Stagljar I.
      • Conibear E.
      • Wodak S.J.
      • Emili A.
      • Greenblatt J.F.
      Interaction landscape of membrane-protein complexes in Saccharomyces cerevisiae.
      ,
      • Gavin A.-C.
      • Aloy P.
      • Grandi P.
      • Krause R.
      • Boesche M.
      • Marzioch M.
      • Rau C.
      • Jensen L.J.
      • Bastuck S.
      • Dümpelfeld B.
      • Edelmann A.
      • Heurtier M.-A.
      • Hoffman V.
      • Hoefert C.
      • Klein K.
      • Hudak M.
      • Michon A.-M.
      • Schelder M.
      • Schirle M.
      • Remor M.
      • Rudi T.
      • Hooper S.
      • Bauer A.
      • Bouwmeester T.
      • Casari G.
      • Drewes G.
      • Neubauer G.
      • Rick J.M.
      • Kuster B.
      • Bork P.
      • Russell R.B.
      • Superti-Furga G.
      Proteome survey reveals modularity of the yeast cell machinery.
      ,
      • Krogan N.J.
      • Cagney G.
      • Yu H.
      • Zhong G.
      • Guo X.
      • Ignatchenko A.
      • Li J.
      • Pu S.
      • Datta N.
      • Tikuisis A.P.
      • Punna T.
      • Peregrín-Alvarez J.M.
      • Shales M.
      • Zhang X.
      • Davey M.
      • Robinson M.D.
      • Paccanaro A.
      • Bray J.E.
      • Sheung A.
      • Beattie B.
      • Richards D.P.
      • Canadien V.
      • Lalev A.
      • Mena F.
      • Wong P.
      • Starostine A.
      • Canete M.M.
      • Vlasblom J.
      • Wu S.
      • Orsi C.
      • Collins S.R.
      • Chandran S.
      • Haw R.
      • Rilstone J.J.
      • Gandi K.
      • Thompson N.J.
      • Musso G.
      • St Onge P.
      • Ghanny S.
      • Lam M.H.Y.
      • Butland G.
      • Altaf-Ul A.M.
      • Kanaya S.
      • Shilatifard A.
      • O'Shea E.
      • Weissman J.S.
      • Ingles C.J.
      • Hughes T.R.
      • Parkinson J.
      • Gerstein M.
      • Wodak S.J.
      • Emili A.
      • Greenblatt J.F.
      Global landscape of protein complexes in the yeast Saccharomyces cerevisiae.
      ) and Y2H (
      • Yu H.
      • Braun P.
      • Yildirim M.A.
      • Lemmens I.
      • Venkatesan K.
      • Sahalie J.
      • Hirozane-Kishikawa T.
      • Gebreab F.
      • Li N.
      • Simonis N.
      • Hao T.
      • Rual J.-F.
      • Dricot A.
      • Vazquez A.
      • Murray R.R.
      • Simon C.
      • Tardivo L.
      • Tam S.
      • Svrzikapa N.
      • Fan C.
      • de Smet A.-S.
      • Motyl A.
      • Hudson M.E.
      • Park J.
      • Xin X.
      • Cusick M.E.
      • Moore T.
      • Boone C.
      • Snyder M.
      • Roth F.P.
      • Barabasi A.-L.
      • Tavernier J.
      • Hill D.E.
      • Vidal M.
      High-quality binary protein interaction map of the yeast interactome network.
      ) surveys, which we exploit to thoroughly profile and evaluate these CF-MS data sets. By probing the impacts of the experimental design of the cell lysate fractionation and assessing different methods of fractionation profile similarity scoring, we uncover universally relevant guidelines for several critical aspects of CF-MS. These include guidelines for cost-effective data collection and for maximizing the precision and recall of protein complexes during data analysis. Differences in fractionation profiles, gene co-expression and Gene Ontology (GO) annotations between gold standard and novel complexes are also probed, informing best-practice predictions of novel complexes. We complement these analyses by assessing published human SEC (
      • Kirkwood K.J.
      • Ahmad Y.
      • Larance M.
      • Lamond A.I.
      Characterisation of native protein complexes and protein isoform variation using size-fractionation based quantitative proteomics.
      ) and IEX (
      • Havugimana P.C.
      • Hart G.T.
      • Nepusz T.
      • Yang H.
      • Turinsky A.L.
      • Li Z.
      • Wang P.I.
      • Boutz D.R.
      • Fong V.
      • Phanse S.
      • Babu M.
      • Craig S.A.
      • Hu P.
      • Wan C.
      • Vlasblom J.
      • Dar V-U-N.
      • Bezginov A.
      • Clark G.W.
      • Wu G.C.
      • Wodak S.J.
      • Tillier E.R.M.
      • Paccanaro A.
      • Marcotte E.M.
      • Emili A.
      A census of human soluble protein complexes.
      ,
      • Wan C.
      • Borgeson B.
      • Phanse S.
      • Tu F.
      • Drew K.
      • Clark G.
      • Xiong X.
      • Kagan O.
      • Kwan J.
      • Bezginov A.
      • Chessman K.
      • Pal S.
      • Cromar G.
      • Papoulas O.
      • Ni Z.
      • Boutz D.R.
      • Stoilova S.
      • Havugimana P.C.
      • Guo X.
      • Malty R.H.
      • Sarov M.
      • Greenblatt J.
      • Babu M.
      • Derry W.B.
      • Tillier E.R.
      • Wallingford J.B.
      • Parkinson J.
      • Marcotte E.M.
      • Emili A.
      Panorama of ancient metazoan macromolecular complexes.
      ) CF-MS data sets using reference human complexes from the CORUM database (
      • Ruepp A.
      • Waegele B.
      • Lechner M.
      • Brauner B.
      • Dunger-Kaltenbach I.
      • Fobo G.
      • Frishman G.
      • Montrone C.
      • Mewes H.-W.
      CORUM: the comprehensive resource of mammalian protein complexes—2009.
      ), which further inform and reinforce our findings obtained from yeast. Together these analyses produce firm recommendations for the best-practice collection and analysis of CF-MS data, and a broad-scale understanding of the protein complexes that can be uniquely uncovered using CF-MS.

      MATERIALS AND METHODS

       Yeast Strain, Culture Conditions and Lysis

      SEC CF-MS data were collected from Saccharomyces cerevisiae strain BY4741 (#YSC1052, Open Biosystems), grown at 30 °C in 300 ml YEPD media containing 2% (w/v) glucose, 2% (w/v) bactopeptone and 1% (w/v) yeast extract. Cells were harvested during mid-log phase growth (OD600 of 1.2). Harvested cells were washed with water and resuspended in 10 ml SEC mobile phase (50 mM NaCH3COO, 50 mM KCl, pH 7.2) complemented with cOmplete, EDTA-free Protease Inhibitor mixture (Roche) and PhosStop (Roche). To digest yeast cell walls, 600 lytic units of zymolyase (Zymo Research) in 120 µL vendor-supplied resuspension buffer were added and the sample incubated at 30 °C until a stable OD600 of 3.5 was reached. The resulting spheroplasts were subjected to four 30 s rounds of sonication (amplitude 40%, 0.5 s pulse off/on) and centrifuged for 45 min at 14000 rpm at 4 °C. To enrich for protein complexes, lysates were then concentrated to 500 µL using 100 kDa MWCO filters (Sartorius Stedim) by centrifugation for 30 min at 3000 rpm.

       Size Exclusion Chromatography

      Lysates (2 replicate injections of 120 μl each) were loaded onto a 1290 Infinity UHPLC system (Agilent Technologies) and separated by SEC using a 600 × 7.8 mm BioSep4000 column (Phenomenex). The mobile phase (described above) was run at a flow rate of 0.5 ml/min. For each injection 70 fractions were collected at a rate of 2/min from 20 to 55 min, and equivalent fractions from replicate injections were pooled to produce 70 samples of fractionated lysate.

       Tryptic Digestion

      To each sample of fractionated lysate a solution of 0.467 M DTT, 2.27 M chloroacetamide and 0.964 mM trypsin was added such that each sample had DTT and chloroacetamide concentrations of 47 mM and 236 mM respectively, and 3 µg trypsin. Samples were then incubated at 37 °C for 18 h. The resultant proteolytic peptide solutions were evaporated to dryness in a SpeedVac (Savant SPD1010, Thermo Scientific, Bremen, Germany), after which peptides were reconstituted in 1 ml 0.1% (v/v) heptafluorobutyric acid (pH 2.5) and C18 cleanups performed using Sep-pak cartridges (WAT054960) following manufacturer's instructions. Eluted peptides from each cleanup were evaporated to dryness in a SpeedVac and reconstituted in 20 µL 0.1% (v/v) formic acid.

       Mass Spectrometry

      Proteolytic peptide samples were subjected to LC–MS/MS analysis on a Q Exactive Plus mass spectrometer (Thermo Scientific) interfaced with an UltiMate 3000 HPLC and autosampler system (Dionex, Amsterdam, The Netherlands). Peptides were separated by nano-LC and eluting peptides ionized using positive ion mode nano-ESI as described previously (
      • Hart-Smith G.
      • Raftery M.J.
      Detection and characterization of low abundance glycopeptides via higher-energy C-trap dissociation and orbitrap mass analysis.
      ). Survey scans m/z 350–1750 (MS AGC = 1 × 10^6) were recorded in the Orbitrap (resolution = 70,000 at m/z 200). The instrument was set to operate in DDA mode, and up to the 12 most abundant ions with charge states of >+2 were sequentially isolated and fragmented via HCD using the following parameters: normalized collision energy 30, resolution = 17,500, maximum injection time = 125 ms, and MSn AGC = 1 × 10^5. Dynamic exclusion was enabled (exclusion duration = 30 s).

       MaxLFQ of Novel and Published co-Fractionation Mass Spectrometry Raw Files

      To generate SEC fractionation profiles for individual yeast protein complex subunits, LC–MS/MS raw files were analyzed using MaxQuant (version 1.6.2.10) (
      • Cox J.
      • Mann M.
      MaxQuant enables high peptide identification rates, individualized ppb-range mass accuracies and proteome-wide protein quantification.
      ). Sequence database searches were performed using Andromeda (
      • Cox J.
      • Neuhauser N.
      • Michalski A.
      • Scheltema R.A.
      • Olsen J.V.
      • Mann M.
      Andromeda: a peptide search engine integrated into the MaxQuant environment.
      ) and the MaxLFQ algorithm (
      • Cox J.
      • Hein M.Y.
      • Luber C.A.
      • Paron I.
      • Nagaraj N.
      • Mann M.
      Accurate proteome-wide label-free quantification by delayed normalization and maximal peptide ratio extraction, termed MaxLFQ.
      ) was used to quantify proteins across fractions. The following parameters were employed: precursor ion and peptide fragment mass tolerances were ±4.5 ppm and ±20 ppm respectively; carbamidomethyl (C) was included as a fixed modification; oxidation (M) and N-terminal protein acetylation were included as variable modifications; enzyme specificity was trypsin with up to two missed cleavages; only S. cerevisiae sequences in the Swiss-Prot database (February 2017 release, 553,655 sequence entries) were searched; the minimum peptide length was set as 7; the “match between runs” feature was enabled; and MaxLFQ analyses were performed using default parameters with “fast LFQ” enabled. Protein and peptide false discovery rate thresholds were set at 1%. Only fractionation profiles obtained from proteins identified from ≥2 peptides and with nonzero MaxLFQ values in ≥2 fractions were subjected to downstream analysis. A pseudo-count of one was added to all MaxLFQ values, which were then log-transformed (base 10).
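The profile filtering and transformation described above can be illustrated with a minimal Python sketch. The function name, argument names and input layout are hypothetical and are not taken from MaxQuant or from the authors' scripts; only the thresholds (≥2 peptides, ≥2 nonzero fractions) and the log10(x + 1) transform come from the text.

```python
import math


def prepare_profile(lfq_values, n_peptides, min_peptides=2, min_nonzero=2):
    """Filter and transform one protein's fractionation profile.

    lfq_values : MaxLFQ intensities, one per fraction (hypothetical input layout).
    n_peptides : number of peptides supporting the protein identification.
    Returns the log10(x + 1) profile, or None if the protein fails either filter.
    """
    nonzero_fractions = sum(1 for value in lfq_values if value > 0)
    if n_peptides < min_peptides or nonzero_fractions < min_nonzero:
        return None
    # Pseudo-count of one keeps zero intensities finite after the log transform.
    return [math.log10(value + 1.0) for value in lfq_values]
```

A profile such as `[0, 9, 99]` from a protein with three peptides is retained and becomes approximately `[0, 1, 2]`, whereas a profile with a single nonzero fraction is discarded.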
      To generate fractionation profiles from publicly available CF-MS data that are amenable to MaxLFQ analysis following the above procedures, additional MaxQuant analyses were performed on raw files associated with the following three CF-MS experiments: S. cerevisiae lysate separated across 108 IEX fractions, as performed by Wan et al. (
      • Wan C.
      • Borgeson B.
      • Phanse S.
      • Tu F.
      • Drew K.
      • Clark G.
      • Xiong X.
      • Kagan O.
      • Kwan J.
      • Bezginov A.
      • Chessman K.
      • Pal S.
      • Cromar G.
      • Papoulas O.
      • Ni Z.
      • Boutz D.R.
      • Stoilova S.
      • Havugimana P.C.
      • Guo X.
      • Malty R.H.
      • Sarov M.
      • Greenblatt J.
      • Babu M.
      • Derry W.B.
      • Tillier E.R.
      • Wallingford J.B.
      • Parkinson J.
      • Marcotte E.M.
      • Emili A.
      Panorama of ancient metazoan macromolecular complexes.
      ) (PRIDE data identifier PXD002327); human G166 glioma stem cell lysate separated across 108 IEX fractions, as performed by Havugimana et al. (
      • Havugimana P.C.
      • Hart G.T.
      • Nepusz T.
      • Yang H.
      • Turinsky A.L.
      • Li Z.
      • Wang P.I.
      • Boutz D.R.
      • Fong V.
      • Phanse S.
      • Babu M.
      • Craig S.A.
      • Hu P.
      • Wan C.
      • Vlasblom J.
      • Dar V-U-N.
      • Bezginov A.
      • Clark G.W.
      • Wu G.C.
      • Wodak S.J.
      • Tillier E.R.M.
      • Paccanaro A.
      • Marcotte E.M.
      • Emili A.
      A census of human soluble protein complexes.
      ) (PRIDE data identifier PXD002322); and human U2OS lysate separated across 40 SEC fractions, as performed by Kirkwood et al. (biological replicate 3) (
      • Kirkwood K.J.
      • Ahmad Y.
      • Larance M.
      • Lamond A.I.
      Characterisation of native protein complexes and protein isoform variation using size-fractionation based quantitative proteomics.
      ) (PRIDE data identifier PXD001220). For each set of raw files, MaxQuant analyses were performed as above with the following alterations: precursor ion and peptide fragment mass tolerances were ±4.5 ppm and ±0.5 Da respectively, and only human sequences in the Swiss-Prot database were searched when analyzing raw files associated with human cell lines.
      To test the effects of collecting fewer fractions during CF-MS data collection, simulated fractionation profiles with reduced resolution were generated from the above results. For individual proteins in each CF-MS data set, this involved redistributing the MaxLFQ values from the original number of fractions into a smaller number of fractions using the proportional scaling method detailed in supplemental Fig. S1.
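One way to picture a reduced-resolution simulation of this kind is sketched below in Python. The proportional scaling method itself is defined in supplemental Fig. S1, which is not reproduced here, so the overlap-weighted redistribution shown is an assumption rather than the authors' procedure; the function name `rebin_profile` is hypothetical.

```python
def rebin_profile(values, n_new):
    """Redistribute a profile over fewer, equally wide fractions.

    Assumption: each original fraction contributes to a new fraction in
    proportion to how much of its width falls inside that new fraction,
    so the summed intensity of the profile is conserved.
    """
    n_old = len(values)
    if not 0 < n_new <= n_old:
        raise ValueError("n_new must be between 1 and the original fraction count")
    width = n_old / n_new  # width of a new fraction in units of old fractions
    rebinned = []
    for j in range(n_new):
        lower, upper = j * width, (j + 1) * width
        total = 0.0
        for i, value in enumerate(values):
            # Old fraction i spans the interval [i, i + 1].
            overlap = min(upper, i + 1) - max(lower, i)
            if overlap > 0:
                total += value * overlap
        rebinned.append(total)
    return rebinned
```

Under this assumption a four-fraction profile `[1, 2, 3, 4]` collapses to `[3, 7]` when two fractions are requested, and non-integer ratios split boundary fractions between their two neighbours.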
      To identify and characterize Gaussian peaks in fractionation profiles the PrInCE software (
      • Stacey R.G.
      • Skinnider M.A.
      • Scott N.E.
      • Foster L.J.
      A rapid and accurate approach for prediction of interactomes from co-elution data (PrInCE).
      ) was used, with up to five peaks identified for each protein.

       Experimental Design and Statistical Rationale

       Gold Standard Protein Complex Data sets

      To interpret fractionation profiles produced in S. cerevisiae CF-MS experiments, the reference library of gold standard S. cerevisiae protein complexes published by Benschop et al. (
      • Benschop J.J.
      • Brabers N.
      • van Leenen D.
      • Bakker L.V.
      • van Deutekom H.W.
      • van Berkum N.L.
      • Apweiler E.
      • Lijnzaad P.
      • Holstege F.C.
      • Kemmeren P.
      A consensus of core protein complex compositions for Saccharomyces cerevisiae.
      ) was used. This library contains 1,881 proteins from 518 complexes, and is referred to herein as the Benschop reference set. To interpret fractionation profiles produced in human cell-line CF-MS experiments, the reference library of gold standard human protein complexes in CORUM version 02.07.2017 (
      • Giurgiu M.
      • Reinhard J.
      • Brauner B.
      • Dunger-Kaltenbach I.
      • Fobo G.
      • Frishman G.
      • Montrone C.
      • Ruepp A.
      CORUM: the comprehensive resource of mammalian protein complexes.
      ) was used. This library contains 3241 proteins from 2389 complexes.

       Identifying Gold Standard Protein Complexes through Reference Complex Profiling

      To find evidence for gold standard protein complexes in CF-MS data sets, significantly related fractionation profiles were identified using the randomization procedure illustrated in Fig. 1, hereafter referred to as Reference Complex Profiling. Reference Complex Profiling first calculates mean fractionation profile similarity scores for reference library complexes in which two or more subunits are identified in the CF-MS data set. The fractionation profile similarity scores for all unique pairs of proteins in each query complex are calculated using a similarity scoring metric (e.g. Pearson correlation), and the mean of these scores determined. Secondly, a bootstrapping method tests whether the observed mean fractionation profile similarity score for each query protein complex is statistically significant as compared with the background distribution, generated from 1000 random protein complexes. Each random complex is generated by randomly sampling the same number of proteins observed in the query complex from proteins identified in the CF-MS data set. The bias corrected p-value of each query complex is calculated as a ratio, with a numerator of one plus the number of times the observed mean fractionation profile similarity score of the query complex is larger than the mean fractionation profile similarity scores of 1000 random protein complexes, and a denominator of 1001 (
      • Davison A.C.
      • Hinkley D.V.
      ). Minor alterations to this calculation are applied when using the following similarity scoring metrics: Euclidean distance, maximum cumulative distance between two scaled vectors (DMax), Co-apex and Peak Location (each detailed below). For these metrics, the observed score for a co-fractionating group of protein complex subunits should generally be lower, rather than higher, than the score of a random complex.
      Fig. 1. Identifying gold standard protein complexes in CF-MS data sets through Reference Complex Profiling. For each gold standard (query) complex within a reference library, mean fractionation profile similarity scores are calculated from all unique pairs of proteins within the complex (upper flow diagram; colored circles represent proteins with observed CF-MS fractionation profiles). This same procedure is performed 1000 times using random protein complexes, each generated by randomly sampling the same number of proteins observed in the query complex from proteins identified in the CF-MS data set (lower flow diagram). Bias corrected p-values for each query complex are calculated using these observed and random mean fractionation profile similarity scores (right box).
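The randomization procedure of Reference Complex Profiling can be sketched in Python as follows. This is an illustrative reading of the description, not the authors' R implementation, and all function and argument names are hypothetical. The counting direction is an assumption: the sketch counts random complexes that score at least as well as the query complex, the conventional form of a bias corrected randomization p-value, so that strongly co-fractionating complexes receive small p-values. Pearson correlation stands in for any of the similarity metrics, and `higher_is_better=False` covers metrics for which lower scores indicate co-fractionation.

```python
import itertools
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def mean_pairwise_score(profiles, members, score=pearson):
    """Mean similarity score over all unique pairs of the given proteins."""
    pairs = list(itertools.combinations(members, 2))
    return sum(score(profiles[a], profiles[b]) for a, b in pairs) / len(pairs)


def reference_complex_p_value(profiles, complex_members, n_random=1000,
                              score=pearson, higher_is_better=True, seed=0):
    """Bias corrected randomization p-value for one query complex.

    profiles        : dict mapping protein ID -> fractionation profile.
    complex_members : protein IDs listed for the reference complex.
    Returns None when fewer than two subunits were observed in the data set.
    """
    observed_members = [p for p in complex_members if p in profiles]
    if len(observed_members) < 2:
        return None
    rng = random.Random(seed)
    observed = mean_pairwise_score(profiles, observed_members, score)
    universe = sorted(profiles)
    extreme = 0
    for _ in range(n_random):
        # Random complex with the same number of observed proteins.
        sampled = rng.sample(universe, len(observed_members))
        rand = mean_pairwise_score(profiles, sampled, score)
        if (rand >= observed) if higher_is_better else (rand <= observed):
            extreme += 1
    return (1 + extreme) / (n_random + 1)
```

With 1000 random complexes the smallest attainable p-value is 1/1001, so the p-values are never zero and can be log-transformed safely in later steps.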
      When analyzing stand-alone CF-MS data sets, p-values obtained from this randomization procedure were adjusted using the Benjamini-Hochberg procedure to control the false discovery rate (
      • Benjamini Y.
      • Hochberg Y.
      Controlling the false discovery rate: a practical and powerful approach to multiple testing.
      ), and reference complexes under query with bootstrapped p-values less than 0.05 were deemed significant. When co-analyzing SEC and IEX CF-MS data sets, Fisher's method (
      • Mosteller F.
      • Fisher R.A.
      Combining independent tests of significance.
      ) was used to combine p-values. The natural log of SEC and IEX p-values, prior to false discovery rate adjustment, were summed and multiplied by negative two, giving a Chi-square statistic with two degrees of freedom as follows:
      $$\chi^2_{2} = -2\left(\ln P_{\mathrm{SEC}} + \ln P_{\mathrm{IEX}}\right) \tag{1}$$


      Chi-square statistics were converted to p-scores using the Chi-square distribution function in R. When a reference complex was only queried in one of the SEC or IEX data sets, the p-score was the p-value from the data set in which the complex was queried. Resulting lists of p-scores were adjusted using the Benjamini-Hochberg procedure (
      • Benjamini Y.
      • Hochberg Y.
      Controlling the false discovery rate: a practical and powerful approach to multiple testing.
      ), and reference complexes with bootstrapped p-scores less than 0.05 were deemed significant.
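The combination and adjustment of p-values can be sketched in Python using only the standard library. The function names are hypothetical and the sketch is not the authors' R code. The degrees of freedom are left as an explicit argument: the text specifies two, whereas the textbook form of Fisher's method refers the statistic for k independent p-values to 2k degrees of freedom (four for two p-values). The closed-form tail probability used here is valid only for even degrees of freedom.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable (closed form, even df only)."""
    if df <= 0 or df % 2:
        raise ValueError("closed form requires a positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


def fisher_combine(p_sec, p_iex, df):
    """Combine two p-values via -2 * (ln p_sec + ln p_iex) and a chi-square tail."""
    statistic = -2.0 * (math.log(p_sec) + math.log(p_iex))
    return chi2_sf_even_df(statistic, df)


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted
```

For example, `benjamini_hochberg([0.01, 0.04, 0.03, 0.005])` yields adjusted values of approximately `[0.02, 0.04, 0.04, 0.02]`, all of which would pass a 0.05 threshold.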

       Similarity Scoring Metrics Used for Reference Complex Profiling

      Seventeen different similarity scoring metrics were used to profile CF-MS data sets using the above procedures. Scoring metrics were calculated using functions from publicly available R libraries, or custom R scripts available from: //github.com/IgnatiusPang/ReferenceComplexProfiling. The metrics covered four broad categories: correlation-based (Pearson correlation, Spearman correlation, Kendall correlation (
      • Kendall M.G.
      A new measure of rank correlation.
      ) and normalized cross-correlation (NCC) (
      • Larance M.
      • Kirkwood K.J.
      • Tinti M.
      • Murillo A.B.
      • Ferguson M.A.
      • Lamond A.I.
      Global membrane protein interactome analysis using in vivo crosslinking and MS-based protein correlation profiling.
      )); distance-based (Euclidean distance, Distance correlation (
      • Székely G.J.
      • Rizzo M.L.
      • Bakirov N.K.
      Measuring and testing dependence by correlation of distances.
      ), Jaccard distance and DMax); mutual information-based (Mutual Information (MI), Bias Corrected Mutual Information (BCMI) (
      • Pardy C.
      • Wilson S.
      A bioinformatic implementation of mutual information as a distance measure for identification of clusters of variables.
      ), Maximal Information Coefficient (MIC) (
      • Reshef D.N.
      • Reshef Y.A.
      • Finucane H.K.
      • Grossman S.R.
      • McVean G.
      • Turnbaugh P.J.
      • Lander E.S.
      • Mitzenmacher M.
      • Sabeti P.C.
      Detecting novel associations in large data sets.
      ), Total Information Coefficient (TIC) (
      • Reshef D.N.
      • Reshef Y.A.
      • Sabeti P.C.
      • Mitzenmacher M.
      An empirical study of the maximal and total information coefficients and leading measures of dependence.
      ), Generalized Mean Information Coefficient (GMIC) (

      Luedtke, A., and Tran, L. (2013) The generalized mean information coefficient. arXiv preprint arXiv:1308.5712.

      ) and Randomized Information Coefficient (RIC) (
      • Romano S.
      • Vinh N.X.
      • Verspoor K.
      • Bailey J.
      The randomized information coefficient: assessing dependencies in noisy data.
      )); and peak-based (Apex Location (
      • Havugimana P.C.
      • Hart G.T.
      • Nepusz T.
      • Yang H.
      • Turinsky A.L.
      • Li Z.
      • Wang P.I.
      • Boutz D.R.
      • Fong V.
      • Phanse S.
      • Babu M.
      • Craig S.A.
      • Hu P.
      • Wan C.
      • Vlasblom J.
      • Dar V-U-N.
      • Bezginov A.
      • Clark G.W.
      • Wu G.C.
      • Wodak S.J.
      • Tillier E.R.M.
      • Paccanaro A.
      • Marcotte E.M.
      • Emili A.
      A census of human soluble protein complexes.
      ), Peak Location (
      • Stacey R.G.
      • Skinnider M.A.
      • Scott N.E.
      • Foster L.J.
      A rapid and accurate approach for prediction of interactomes from co-elution data (PrInCE).
      ) and Co-apex (
      • Stacey R.G.
      • Skinnider M.A.
      • Scott N.E.
      • Foster L.J.
      A rapid and accurate approach for prediction of interactomes from co-elution data (PrInCE).
      )). Details for how each metric was applied are provided in the Supporting Information.

       Gene-Gene Networks and Reference Gene Annotations

      To test the effects of integrating CF-MS data sets with external genomic data, CF-MS networks were evaluated against reference gene annotation sets (e.g. GO). Weighted CF-MS networks were generated using the Spearman correlations between fractionation profiles in individual CF-MS data sets. The nodes of these networks were the individual genes/proteins associated with each fractionation profile. Edge scores from each network were converted into ranks, with ties averaged. The resulting ranks were divided by the maximum rank in the network to scale values between 0 and 1. Reference gene annotation sets were obtained from the GO Consortium (version September 2017) (
      • Ashburner M.
      • Ball C.A.
      • Blake J.A.
      • Botstein D.
      • Butler H.
      • Cherry J.M.
      • Davis A.P.
      • Dolinski K.
      • Dwight S.S.
      • Eppig J.T.
      • Harris M.A.
      • Hill D.P.
      • Issel-Tarver L.
      • Kasarskis A.
      • Lewis S.
      • Matese J.C.
      • Richardson J.E.
      • Ringwald M.
      • Rubin G.M.
      • Sherlock G.
      Gene ontology: tool for the unification of biology.
      ,
      • Carbon S.
      • Chan J.
      • Kishore R.
      • Lee R.
      • Muller H.-M.
      • Raciti D.
      • Van Auken K.
      • Sternberg P.
      Expansion of the Gene Ontology knowledgebase and resources.
      ) for yeast (6,392 genes across 8,672 GO terms) and human (19,030 genes across 21,946 GO terms). Additional reference gene annotation sets, comprised of high confidence protein-protein interactions, were obtained from Pang et al. for yeast (
      • Ignatius Pang C.N.
      • Goel A.
      • Wilkins M.R.
      Investigating the network basis of negative genetic interactions in Saccharomyces cerevisiae with integrated biological networks and triplet motif analysis.
      ) and Huttlin et al. for human (
      • Huttlin E.L.
      • Bruckner R.J.
      • Paulo J.A.
      • Cannon J.R.
      • Ting L.
      • Baltier K.
      • Colby G.
      • Gebreab F.
      • Gygi M.P.
      • Parzen H.
      • Szpyt J.
      • Tam S.
      • Zarraga G.
      • Pontano-Vaites L.
      • Swarup S.
      • White A.E.
      • Schweppe D.K.
      • Rad R.
      • Erickson B.K.
      • Obar R.A.
      • Guruharsha K.G.
      • Li K.
      • Artavanis-Tsakonas S.
      • Gygi S.P.
      • Harper J.W.
      Architecture of the human interactome defines protein communities and disease networks.
      ).
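The rank conversion and scaling of edge scores described earlier in this section can be sketched in Python as follows. The function name and the dictionary-based network representation are hypothetical; only the steps (ranking, averaging ties, dividing by the maximum rank) come from the text.

```python
def rank_scale(edge_scores):
    """Convert edge scores to ranks (ties averaged) and scale by the maximum rank.

    edge_scores : dict mapping an edge key (e.g. a protein pair) to its score.
    Returns a dict with the same keys and values in the interval (0, 1].
    """
    if not edge_scores:
        return {}
    edges = sorted(edge_scores, key=edge_scores.get)
    ranks = {}
    i = 0
    while i < len(edges):
        # Find the run of edges that share the same score.
        j = i
        while j + 1 < len(edges) and edge_scores[edges[j + 1]] == edge_scores[edges[i]]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[edges[k]] = average_rank
        i = j + 1
    max_rank = max(ranks.values())
    return {edge: rank / max_rank for edge, rank in ranks.items()}
```

Three edges with Spearman correlations of 0.9, 0.1 and 0.1 receive ranks 3, 1.5 and 1.5, which scale to 1.0, 0.5 and 0.5.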

       Measuring Network Information Content Using ‘Guilt-by-Association’

      CF-MS networks were evaluated against reference gene annotation sets using the Extending ‘Guilt-by-Association’ by Degree (EGAD) R package (
      • Ballouz S.
      • Weber M.
      • Pavlidis P.
      • Gillis J.
      EGAD: ultra-fast functional analysis of gene networks.
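      ). EGAD reports the area under the receiver operating characteristic curve (AUROC) as its performance metric. AUROCs range between 0 and 1, where 0.5 indicates random performance and 1 indicates perfect performance.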
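To make the AUROC scale concrete, the following Python sketch computes an AUROC from a list of scores and binary labels using the rank-sum formulation. It illustrates only the statistic; it does not reproduce EGAD's own evaluation procedure, and the function name is hypothetical.

```python
def auroc(scores, labels):
    """AUROC via the rank-sum identity, with tied scores given averaged ranks.

    scores : predicted scores, higher meaning more likely positive.
    labels : 1 for genes in the annotation set, 0 otherwise.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both positive and negative labels are required")
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    positive_rank_sum = sum(r for r, label in zip(ranks, labels) if label == 1)
    return (positive_rank_sum - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)
```

Scores that rank every annotated gene above every unannotated gene give 1.0, and scores that carry no information (for example, all tied) give 0.5.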
      EGAD evaluations were performed on both unaltered and merged networks. For merged networks, gene co-expression networks for yeast (
      • Gillis J.
      • Ballouz S.
      • Pavlidis P.
      Bias tradeoffs in the creation and analysis of protein–protein interaction networks.
      ) or human (
      • Ballouz S.
      • Verleyen W.
      • Gillis J.
      Guidance for RNA-seq co-expression network construction and analysis: safety in numbers.
      ), or high confidence PPI networks for yeast (
      • Ignatius Pang C.N.
      • Goel A.
      • Wilkins M.R.
      Investigating the network basis of negative genetic interactions in Saccharomyces cerevisiae with integrated biological networks and triplet motif analysis.
      ) or human (
      • Huttlin E.L.
      • Bruckner R.J.
      • Paulo J.A.
      • Cannon J.R.
      • Ting L.
      • Baltier K.
      • Colby G.
      • Gebreab F.
      • Gygi M.P.
      • Parzen H.
      • Szpyt J.
      • Tam S.
      • Zarraga G.
      • Pontano-Vaites L.
      • Swarup S.
      • White A.E.
      • Schweppe D.K.
      • Rad R.
      • Erickson B.K.
      • Obar R.A.
      • Guruharsha K.G.
      • Li K.
      • Artavanis-Tsakonas S.
      • Gygi S.P.
      • Harper J.W.
      Architecture of the human interactome defines protein communities and disease networks.
      ), were merged with CF-MS networks. Because PPI networks consist of binary edge scores, indirect connections were modeled by adding edges between node pairs with minimum path length less than seven, weighted by the inverse of the minimum path length. Networks were merged by taking the union of networks and summing the edge score of overlapping edges, before applying the rank conversion and scaling described above. Only genes or gene products common in all yeast or all human networks (CF-MS, gene co-expression and PPI) were used in merged networks, and all network comparisons were performed on networks containing the same subset of genes or proteins. Networks comprised of protein interactions were not evaluated against reference gene annotation sets comprised of protein interactions.
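The network extension and merging steps described above can be illustrated with a minimal sketch. It assumes that networks are stored as dictionaries keyed by unordered node pairs, and that the rank conversion assigns average ranks to ties and divides by the number of edges; the function names and this rank convention are illustrative assumptions, not the code used in the study. The sketch also omits the restriction to genes shared by all networks.

```python
from collections import deque


def extend_binary_network(edges, max_path_len=6):
    """Weight node pairs by 1 / shortest path length, for paths shorter than seven edges."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    weights = {}
    for source in adjacency:
        # Breadth-first search gives the minimum path length from source to every node.
        distance = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            if distance[node] == max_path_len:
                continue  # do not expand beyond the maximum path length
            for neighbor in adjacency[node]:
                if neighbor not in distance:
                    distance[neighbor] = distance[node] + 1
                    queue.append(neighbor)
        for target, d in distance.items():
            if d > 0:
                weights[frozenset((source, target))] = 1.0 / d
    return weights


def merge_networks(net_a, net_b):
    """Take the union of two weighted networks, summing scores of overlapping edges."""
    merged = dict(net_a)
    for pair, weight in net_b.items():
        merged[pair] = merged.get(pair, 0.0) + weight
    return merged


def rank_scale(network):
    """Replace edge scores by average ranks divided by the number of edges (assumed convention)."""
    pairs = sorted(network, key=network.get)
    n = len(pairs)
    scaled = {}
    i = 0
    while i < n:
        j = i
        while j + 1 < n and network[pairs[j + 1]] == network[pairs[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            scaled[pairs[k]] = average_rank / n
        i = j + 1
    return scaled


# Toy example: a binary PPI network (A-B, B-C) merged with a hypothetical CF-MS network.
ppi = extend_binary_network([("A", "B"), ("B", "C")])
cfms = {frozenset(("A", "B")): 0.9, frozenset(("A", "D")): 0.4}
merged = rank_scale(merge_networks(cfms, ppi))
```

In this toy example the indirect pair A-C receives a weight of 0.5 (path length two), and the A-B edge, present in both networks, ranks highest after merging.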
      Additional sub-networks, comprised of well-correlated fractionation profiles (Spearman correlation > 0.8) in CF-MS data sets sorted into networks of either known or putative novel interactions, were analyzed alone and merged with co-expression networks. Interactions with either direct or indirect evidence in the BioGrid (
      • Oughtred R.
      • Stark C.
      • Breitkreutz B.-J.
      • Rust J.
      • Boucher L.
      • Chang C.
      • Kolas N.
      • O'Donnell L.
      • Leung G.
      • McAdam R.
      • Zhang F.
      • Dolma S.
      • Willems A.
      • Coulombe-Huntington J.
      • Chatr-Aryamontri A.
      • Dolinski K.
      • Tyers M.
      The BioGRID interaction database: 2019 update.
      ,
      • Stark C.
      • Breitkreutz B.-J.
      • Reguly T.
      • Boucher L.
      • Breitkreutz A.
      • Tyers M.
      BioGRID: a general repository for interaction datasets.
      ) or Interactome3D (
      • Mosca R.
      • Céol A.
      • Aloy P.
      Interactome3D: adding structural details to protein networks.
      ) databases were considered known, with all other interactions considered putative novel. For example, if interactions between both proteins A and B and proteins B and C are known based on direct evidence, interactions between proteins A and C are considered known based on indirect evidence. Sub-network comparisons were performed on networks containing the same subset of genes or proteins.
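The distinction between known and putative novel interactions can be sketched as follows. This is a minimal illustration that assumes indirect evidence means sharing at least one directly interacting partner, as in the A, B and C example above; whether longer chains of evidence also qualify is not stated here, and the function name and labels are hypothetical.

```python
def classify_pairs(correlated_pairs, direct_known):
    """Label each well-correlated pair as known (direct or indirect) or putative novel."""
    partners = {}
    for a, b in direct_known:
        partners.setdefault(a, set()).add(b)
        partners.setdefault(b, set()).add(a)
    labels = {}
    for a, b in correlated_pairs:
        partners_a = partners.get(a, set())
        partners_b = partners.get(b, set())
        if b in partners_a:
            labels[(a, b)] = "known (direct)"
        elif partners_a & partners_b:
            # A and B share a directly interacting partner (assumed definition of indirect).
            labels[(a, b)] = "known (indirect)"
        else:
            labels[(a, b)] = "putative novel"
    return labels


# Toy example: A-B and B-C have direct evidence in a reference database.
direct = [("A", "B"), ("B", "C")]
pair_labels = classify_pairs([("A", "B"), ("A", "C"), ("A", "D")], direct)
```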

      RESULTS

      The results below begin with a general analysis of the breadth and reliability of the two yeast and two human CF-MS data sets under investigation. Systematic assessments of these data sets using reference protein complexes and PPIs are then presented. Experiments performed using Reference Complex Profiling are shown first. These experiments provide insight into two of the most fundamental questions regarding CF-MS data analysis and collection: how should the similarity of fractionation profiles be measured (e.g. correlated) during data analysis, and how should the fractionation of cell lysates be designed during data collection? Assessments performed using other methods, for example EGAD (
      • Ballouz S.
      • Weber M.
      • Pavlidis P.
      • Gillis J.
      EGAD: ultra-fast functional analysis of gene networks.
      ), are then presented. These latter experiments provide insight into the usefulness of features other than fractionation profile similarity scoring for CF-MS data analysis, particularly in the context of identifying known versus novel complexes.

       Subunits of Gold Standard Protein Complexes co-Fractionate in CF-MS Data Sets

      The two yeast CF-MS data sets under investigation were obtained from lysates of S. cerevisiae strain BY4741 fractionated using SEC, and S. cerevisiae strain W303 fractionated using mixed-bed IEX (
      • Wan C.
      • Borgeson B.
      • Phanse S.
      • Tu F.
      • Drew K.
      • Clark G.
      • Xiong X.
      • Kagan O.
      • Kwan J.
      • Bezginov A.
      • Chessman K.
      • Pal S.
      • Cromar G.
      • Papoulas O.
      • Ni Z.
      • Boutz D.R.
      • Stoilova S.
      • Havugimana P.C.
      • Guo X.
      • Malty R.H.
      • Sarov M.
      • Greenblatt J.
      • Babu M.
      • Derry W.B.
      • Tillier E.R.
      • Wallingford J.B.
      • Parkinson J.
      • Marcotte E.M.
      • Emili A.
      Panorama of ancient metazoan macromolecular complexes.
      ). Two additional CF-MS data sets, obtained from human cell lysates, were also investigated: one from U2OS cells fractionated using SEC (
      • Kirkwood K.J.
      • Ahmad Y.
      • Larance M.
      • Lamond A.I.
      Characterisation of native protein complexes and protein isoform variation using size-fractionation based quantitative proteomics.
      ), and another from human G166 glioma stem cells fractionated using mixed-bed IEX (
      • Havugimana P.C.
      • Hart G.T.
      • Nepusz T.
      • Yang H.
      • Turinsky A.L.
      • Li Z.
      • Wang P.I.
      • Boutz D.R.
      • Fong V.
      • Phanse S.
      • Babu M.
      • Craig S.A.
      • Hu P.
      • Wan C.
      • Vlasblom J.
      • Dar V-U-N.
      • Bezginov A.
      • Clark G.W.
      • Wu G.C.
      • Wodak S.J.
      • Tillier E.R.M.
      • Paccanaro A.
      • Marcotte E.M.
      • Emili A.
      A census of human soluble protein complexes.
      ). These are hereafter respectively referred to as the yeast SEC, yeast IEX, human SEC and human IEX CF-MS data sets.
      Fig. 2 shows Kernel Density Plots of the Pearson correlations between fractionation profiles observed in these data sets. For each data set, these are separated into two distributions: one showing the correlations between shared subunits of reference complexes, and another showing the correlations between all other pairs of fractionation profiles. The fractionation profiles of shared subunits of reference complexes correlate to an overall greater extent than other pairs of fractionation profiles. This suggests that for each CF-MS data set, reference complexes were present in the samples under analysis and successfully fractionated.
      Fig. 2. Subunits of gold standard protein complexes co-fractionate in CF-MS data sets. For each of the four CF-MS data sets under investigation, Kernel Density Plots of Pearson correlations between fractionation profiles of shared subunits of reference complexes (blue distributions) and all other pairs of fractionation profiles (red distributions) are shown. Yeast SEC CF-MS data were generated in-house; yeast IEX CF-MS data were obtained from Wan et al. (
      • Wan C.
      • Borgeson B.
      • Phanse S.
      • Tu F.
      • Drew K.
      • Clark G.
      • Xiong X.
      • Kagan O.
      • Kwan J.
      • Bezginov A.
      • Chessman K.
      • Pal S.
      • Cromar G.
      • Papoulas O.
      • Ni Z.
      • Boutz D.R.
      • Stoilova S.
      • Havugimana P.C.
      • Guo X.
      • Malty R.H.
      • Sarov M.
      • Greenblatt J.
      • Babu M.
      • Derry W.B.
      • Tillier E.R.
      • Wallingford J.B.
      • Parkinson J.
      • Marcotte E.M.
      • Emili A.
      Panorama of ancient metazoan macromolecular complexes.
      ) (PRIDE data identifier PXD002327); human SEC CF-MS data were obtained from Kirkwood et al. (biological replicate 3) (
      • Kirkwood K.J.
      • Ahmad Y.
      • Larance M.
      • Lamond A.I.
      Characterisation of native protein complexes and protein isoform variation using size-fractionation based quantitative proteomics.
      ) (PRIDE data identifier PXD001220); and human IEX CF-MS data were obtained from Havugimana et al. (
      • Havugimana P.C.
      • Hart G.T.
      • Nepusz T.
      • Yang H.
      • Turinsky A.L.
      • Li Z.
      • Wang P.I.
      • Boutz D.R.
      • Fong V.
      • Phanse S.
      • Babu M.
      • Craig S.A.
      • Hu P.
      • Wan C.
      • Vlasblom J.
      • Dar V-U-N.
      • Bezginov A.
      • Clark G.W.
      • Wu G.C.
      • Wodak S.J.
      • Tillier E.R.M.
      • Paccanaro A.
      • Marcotte E.M.
      • Emili A.
      A census of human soluble protein complexes.
      ) (PRIDE data identifier PXD002322). For yeast data sets, distributions were created using all complexes in the Benschop reference set (
      • Benschop J.J.
      • Brabers N.
      • van Leenen D.
      • Bakker L.V.
      • van Deutekom H.W.
      • van Berkum N.L.
      • Apweiler E.
      • Lijnzaad P.
      • Holstege F.C.
      • Kemmeren P.
      A consensus of core protein complex compositions for Saccharomyces cerevisiae.
      ). For human data sets, distributions were created using all human complexes in CORUM version 02.07.2017 (
      • Giurgiu M.
      • Reinhard J.
      • Brauner B.
      • Dunger-Kaltenbach I.
      • Fobo G.
      • Frishman G.
      • Montrone C.
      • Ruepp A.
      CORUM: the comprehensive resource of mammalian protein complexes.
      ).
      The fractionation profiles within each CF-MS data set were generated using equivalent sequence database searching and label-free quantification methods, and only fractionation profiles with nonzero MaxLFQ values in ≥2 fractions were kept (as detailed in Materials and Methods). This resulted in the following numbers of fractionation profiles per CF-MS data set: 1,280 for the yeast SEC data set, 1,894 for the yeast IEX data set, 3,935 for the human SEC data set, and 2,922 for the human IEX data set.
      Together these results indicate that the data sets under analysis are suitable for assessing different methods of broad-scale protein complex identification using CF-MS, and that the profiling of reference protein complexes and PPIs can play a role in these assessments.

       Performance Characteristics of Different Fractionation Profile Similarity Scoring Metrics

      The results described in Fig. 3 provide insight into the most fundamental aspect of CF-MS data analysis: measuring the similarity of fractionation profiles. Fig. 3A shows the numbers of significant gold standard complexes identified from each CF-MS data set from two or more identified subunits using Reference Complex Profiling. Each data set was systematically profiled 17 times using Reference Complex Profiling; once for each similarity scoring metric listed in Materials and Methods. This enabled multiple methods for performing fractionation profile similarity scoring to be directly compared. Because these comparisons are not reliant on depth of complexome coverage, results from the profiling of yeast and human CF-MS data sets using the Benschop and CORUM reference sets of gold standards, which cover relatively high and low portions of the yeast and human proteomes respectively, are presented alongside one another.
      Fig. 3. Different similarity scoring metrics have different performance characteristics when they are applied to CF-MS data analysis. A, Numbers of significant gold standard complexes identified per scoring metric under investigation for each CF-MS data set using Reference Complex Profiling. Bar colors identify different categories of scoring metrics. B, Numbers of Gaussian peaks per fractionation profile in gold standard complexes, identified using mutual information-based and nonmutual information-based scoring metrics (left). Gaussian peak widths in gold standard complexes identified using the Peak Location metric (
      • Stacey R.G.
      • Skinnider M.A.
      • Scott N.E.
      • Foster L.J.
      A rapid and accurate approach for prediction of interactomes from co-elution data (PrInCE).
      ) and all other scoring metrics (right). Peak numbers and characteristics were obtained using PrInCE (
      • Stacey R.G.
      • Skinnider M.A.
      • Scott N.E.
      • Foster L.J.
      A rapid and accurate approach for prediction of interactomes from co-elution data (PrInCE).
      ). Brackets indicate statistical comparisons from two-tailed Welch's t-tests with **, * and n.s. denoting p-values < 0.01, < 0.05 and > 0.05 respectively. C, An example of a gold standard yeast complex, the Asn1-Asn2 complex, which was uniquely identified using mutual information-based scoring metrics after IEX fractionation (left). An example of a gold standard yeast complex, the Ski complex, which was uniquely identified using the Peak Location metric after IEX fractionation (
      • Stacey R.G.
      • Skinnider M.A.
      • Scott N.E.
      • Foster L.J.
      A rapid and accurate approach for prediction of interactomes from co-elution data (PrInCE).
      ) (right).
      Fig. 3A reveals that some scoring metrics are more widely suitable for CF-MS data analysis than others. It is notable that relative scoring metric performances remain broadly consistent across the four diverse CF-MS data sets studied here. Three of the four correlation-based metrics under analysis, Pearson, Spearman and Kendall, and the peak-based Co-apex metric are consistently among the top performing metrics when performance is judged on the number of statistically significant gold standard complexes identified. Three of the four distance-based metrics – Distance correlation, Jaccard distance and DMax – and the peak-based Apex Location metric also generally perform well, whereas mutual information-based metrics identify slightly fewer complexes on average. In contrast to the aforementioned metrics, NCC, Euclidean distance and Peak Location consistently identify a low number of complexes. Strikingly, a commonly used scoring metric in CF-MS data analysis – Euclidean distance (see Table I) – returns no significant hits from the present data sets. Together these results indicate that some scoring metrics are more sensitive than others when applied to the analysis of entire CF-MS data sets.
      To provide additional insight into these differences in scoring metric sensitivity, mean score distributions for complexes subjected to Reference Complex Profiling were compared across scoring metrics. Supplemental Fig. S2 shows these distributions for both significant and nonsignificant complexes. These distributions reveal that some scoring metrics, such as Peak Location and Euclidean distance, identify few significant protein complexes because they generally struggle to match fractionation profiles belonging to protein complex subunits. In contrast, other scoring metrics such as NCC, and MIC when applied to the yeast data sets, identify few significant protein complexes because they match fractionation profiles relatively indiscriminately.
      To gain further insight into why these differences in scoring metric performance are observed, the fractionation profiles associated with significant protein complexes were analyzed. Specifically the peak heights, peak widths and numbers of peaks per fractionation profile identified using PrInCE (
      • Stacey R.G.
      • Skinnider M.A.
      • Scott N.E.
      • Foster L.J.
      A rapid and accurate approach for prediction of interactomes from co-elution data (PrInCE).
      ), together with fractionation profile lengths, were compared for the subunits of significant complexes returned from each scoring metric and category of scoring metric. In addition, the numbers of subunits per significant complex were also compared. Fig. 3B shows instances in which these comparisons reveal significant differences. The box plots on the left show that mutual information-based metrics have a greater preference for matching CF-MS fractionation profiles with multiple peaks than nonmutual information-based metrics. These differences are significant for multiple individual mutual information-based metrics, as shown in supplemental Fig. S3. The box plots on the right show that, relative to other scoring metrics, the Peak Location metric has a greater preference for matching CF-MS fractionation profiles with narrow peaks. Supplemental Fig. S4 shows these comparisons for each individual scoring metric under investigation. Aside from these observations, none of the other features of CF-MS fractionation profiles analyzed here, nor the numbers of subunits per significant complex, consistently differentiate scoring metrics or categories of scoring metrics across CF-MS data sets (see supplemental Figs. S5–S8). Together these results reveal that certain scoring metrics can preferentially match CF-MS fractionation profiles with certain characteristics, and that scoring metrics with low overall sensitivities may prefer to match fractionation profiles with atypical characteristics.
      In addition to the total numbers of complexes identified using each scoring metric, complexes uniquely identified using each scoring metric were also investigated (see supplemental Figs. S9–S12). This reveals that even though some scoring metrics are less sensitive overall than others, most scoring metrics or categories of scoring metrics are capable of identifying complexes that others cannot. Fig. 3C provides two examples. The example on the left, taken from the yeast IEX CF-MS data set, shows fractionation profiles for subunits of the yeast Asn1-Asn2 isozyme complex. This complex was only identified using mutual information-based metrics (MI, BCMI and RIC). These fractionation profiles contain multiple peaks according to PrInCE, several of which are observed in later IEX fractions at low peak heights and are thus illustrative of the trends observed for mutual information-based metrics presented in Fig. 3B. The example on the right is taken from the same data set and shows fractionation profiles for subunits of the yeast Ski complex, which was uniquely identified using the Peak Location metric. These fractionation profiles are illustrative of the trends observed for the Peak Location metric presented in Fig. 3B. That is, the co-eluting peaks associated with this complex are particularly narrow. These examples illustrate that scoring metrics with relatively low overall sensitivities can have unique advantages for identifying particular types of complexes, such as complexes that elute across narrow time windows during chromatographic separation.
      Taken together, these results suggest that Pearson, Spearman and Kendall correlations and the Co-apex metric have particularly broad-scale utility for CF-MS data analysis. However, they also reveal that other scoring metrics can provide complementary results. The implications of these findings are elaborated on in the Discussion.
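To make the behavior of these metric families concrete, the sketch below scores toy fractionation profiles with Pearson and Spearman correlations, Euclidean distance, and a simplified stand-in for the Co-apex metric. The Co-apex function here is an assumption for illustration: it only checks whether two profiles reach their maximum in the same fraction, whereas the published metric relies on fitted peaks. The profiles are invented and unnormalized, so the example says nothing about the preprocessing used in this study.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranked profiles."""
    return pearson(average_ranks(x), average_ranks(y))


def euclidean(x, y):
    """Euclidean distance between two profiles (smaller means more similar)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def co_apex_simplified(x, y):
    """1.0 if both profiles peak in the same fraction, else 0.0 (simplified assumption)."""
    return 1.0 if x.index(max(x)) == y.index(max(y)) else 0.0


# Toy profiles: b co-elutes with a at ten-fold higher intensity; c elutes elsewhere.
a = [0, 1, 4, 9, 4, 1, 0]
b = [0, 10, 40, 90, 40, 10, 0]
c = [9, 4, 1, 0, 0, 1, 4]
```

In this toy case the correlation and apex-based scores treat a and b as a perfect match, whereas the Euclidean distance between a and b is larger than that between a and the unrelated profile c, because it is sensitive to the difference in intensity.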

       Performance Characteristics of Stand-Alone and Orthogonal Fractionation

      Having studied the most fundamental aspect of CF-MS data analysis in the previous section, this section turns attention to the most fundamental aspect of CF-MS data collection: the design of the fractionation. A chief consideration in this is to minimize the co-fractionation of unrelated proteins. This is influenced by the number of fractions collected, sample complexity, and the resolving power of the employed fractionation method. Another consideration is experimental cost. Because each collected fraction requires LC–MS/MS analysis, the fewer the number of fractions collected, the lower the cost. Experimental design of CF-MS data collection involves balancing these two considerations.
      Fig. 3A indicates that for the present fractionation methods, which are identical in the yeast and human IEX data sets and broadly similar in the yeast and human SEC data sets, the increase in complexity from yeast to human samples does not dramatically increase the extent to which unrelated proteins produce matching fractionation profiles. This is apparent when considering the numbers of significant complexes identified. There is an approximately 10-fold increase in the number of significant complexes identified in human compared with yeast when equivalent scoring metrics and fractionation techniques are compared. This is even though the library of gold standard human complexes used to profile these data sets is only 5-fold larger than the yeast library (2,389 versus 518 complexes respectively). This suggests that complexes remain well resolved in the human CF-MS data sets relative to the equivalent yeast data sets. The increased complexity of the human samples relative to yeast samples did not necessitate an increase in the numbers of fractions collected.
      Fig. 4A extends these observations. It shows the numbers of significant complexes identified following Reference Complex Profiling of the present CF-MS data sets, after the effects of reduced fraction collection had been simulated following the procedures described in Materials and Methods. (Results for the human SEC data set are not shown as this data set was collected from only 40 fractions.) In the yeast IEX data set, on average fewer significant complexes are identified per scoring metric when fraction collection is reduced from 108 to 70 samples, and then again to 40 samples. However, these decreases are not statistically significant. In the yeast SEC data set, overall, the simulated decrease in fraction collection results in no pronounced effects on the numbers of significant complexes identified. Strikingly this is also the case for the human IEX data set, which was collected from a relatively complex sample and in which the simulated reduction in fraction collection was dramatic. Together these results indicate that even when the co-elution of unrelated proteins is pronounced, protein complexes generally remain well resolved (i.e. produce distinct fractionation profiles) using the present fractionation methods over the ranges of fraction collection investigated here. They reinforce the above suggestion that increasing the numbers of fractions collected from the present samples is unlikely to result in the identification of substantially more complexes.
      Fig. 4. Co-analyses of CF-MS data sets obtained using orthogonal fractionation methods are more efficient than stand-alone analyses of CF-MS data sets obtained using single fractionation methods. A, Number of significant gold standard complexes versus the number of fractions collected for stand-alone SEC and IEX CF-MS data sets. Gold standard complexes were identified using Reference Complex Profiling. CF-MS data sets with fewer than 70 SEC or 108 IEX fractions were simulated by decreasing the resolution of fractionation profiles using proportional scaling, as detailed in Materials and Methods. Note: results for Co-apex scoring are not shown as this scoring metric requires fractionation profile peak picking, which was not performed on simulated fractionation profiles. B, Number of significant gold standard complexes versus the number of fractions collected for stand-alone and co-analyzed SEC and IEX yeast CF-MS data sets. Gold standard complexes were identified using Reference Complex Profiling using Spearman correlations. Co-analyses were performed either using Fisher's Exact test to combine p-values from SEC and IEX data sets (red), or by pooling significant complexes identified from stand-alone SEC and IEX data sets (orange). CF-MS data sets with fewer than 70 SEC or 108 IEX fractions were simulated as per the methods employed in A. C, An example of a gold standard yeast complex, the CURI complex, which was uniquely identified from the co-analysis of SEC and IEX CF-MS data sets using Fisher's Exact test.
      The above results consider the use of stand-alone fractionation methods for CF-MS data collection. Results pertaining to the combined use of orthogonal fractionation methods are presented in Figs. 4B and 4C. Fig. 4B shows the numbers of significant complexes identified from stand-alone and combined analyses of the yeast SEC and IEX CF-MS data sets across different simulated amounts of fraction collection. Two methods of combined analysis were undertaken. The first used Fisher's Exact test to combine p-values following the procedures in Materials and Methods, and the second pooled the significant complexes identified from stand-alone analyses of the SEC and IEX CF-MS data sets. Both methods assume that although the samples used to create the SEC and IEX data sets were prepared from different yeast strains and subject to nonidentical lysis conditions, they nonetheless have a substantial number of gold standard protein complexes in common. Fig. 4B reveals that 30–50% more significant protein complexes are identified in combined analyses of SEC and IEX CF-MS data sets than in stand-alone analyses of these data sets, when analyses associated with equivalent numbers of fractions are compared. Overall the co-analysis of SEC and IEX CF-MS data sets using Fisher's Exact test identifies the most significant complexes per fraction collected. Fig. 4C provides an example of a complex uniquely identified using Fisher's Exact test: the yeast CURI complex. In stand-alone analyses of SEC and IEX CF-MS data sets using Spearman correlations, the fractionation profiles of the subunits of the CURI complex cannot be significantly differentiated from those of randomly generated complexes. However, Fisher's Exact test reveals that the combined orthogonal evidence for this complex in the SEC and IEX CF-MS data sets is statistically significant.
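The way in which two individually nonsignificant p-values can yield significant combined evidence can be illustrated with a short sketch. It assumes that the combination follows Fisher's combined probability approach, in which minus twice the sum of the log p-values is compared with a chi-squared distribution with 2k degrees of freedom for k independent p-values; the exact procedure used in this study is defined in Materials and Methods and may differ. The function name and the example p-values are hypothetical.

```python
import math


def combine_p_values(p_values):
    """Combine independent p-values (each > 0) via the chi-squared tail with 2k degrees of freedom."""
    k = len(p_values)
    # Half of the test statistic X = -2 * sum(ln p).
    half_statistic = -sum(math.log(p) for p in p_values)
    # For even degrees of freedom the chi-squared survival function has a closed form:
    # sf(X; 2k) = exp(-X/2) * sum_{i=0}^{k-1} (X/2)^i / i!
    tail_sum = sum(half_statistic ** i / math.factorial(i) for i in range(k))
    return math.exp(-half_statistic) * tail_sum


# Hypothetical p-values for one complex in the SEC and IEX data sets.
combined = combine_p_values([0.08, 0.09])
```

With these hypothetical inputs, neither p-value falls below 0.05 on its own, but the combined p-value is approximately 0.043.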
      Together the above results indicate that for a set threshold of precision, recall of gold standard protein complexes is similar across broad ranges of fraction collection for most scoring metrics. This is true across the range of samples and fractionation methods studied here. They also indicate that CF-MS experiments are most cost effective if increases in fraction collection are distributed across orthogonal separation methods, rather than increased for any one stand-alone method. The implications and limitations of these findings are elaborated on in the Discussion.

       Fractionation Profile Characteristics for Subunits of Known versus Putative Novel Complexes

      The above analyses were limited to the high confidence profiling of gold standard protein complexes, in which stand-alone scoring metrics alone were used for characterization. Additional complexes are likely to be present in the CF-MS data sets under investigation. The results in this section provide insight into how CF-MS fractionation profiles alone may assist in identifying these additional complexes, both known and novel.
      Evidence indicating that the present CF-MS data sets likely include both known and novel complexes is contained in Fig. 5A. This figure identifies all fractionation profiles capable of being well correlated (Spearman correlations > 0.8) with other fractionation profiles; that is, fractionation profiles with likelihoods of being associated with protein complexes. For both yeast CF-MS data sets, Fig. 5A shows bar charts grouping these fractionation profiles into the following categories: those associated with known interactions only, those associated with both known and putative novel interactions, and those associated with putative novel interactions only. The proportions of fractionation profiles which cannot be well correlated are also shown. Comparative results from the two human CF-MS data sets are shown alongside those of yeast. It can be seen that, relative to human, high proportions of the fractionation profiles in the yeast CF-MS data sets are only associated with known interactions. This is expected because the reference libraries of PPIs in yeast have high proteome coverage relative to those in human. However, despite S. cerevisiae being a particularly highly benchmarked model organism, many of the yeast CF-MS fractionation profiles are associated with putative novel interactions and ∼40% cannot be interpreted using existing reference libraries of PPIs.
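The grouping of fractionation profiles into these categories can be sketched as below. The example assumes that pairwise Spearman correlations have already been computed and that the set of known interactions is supplied as unordered pairs; the function name, category labels, and toy values are illustrative only.

```python
def categorize_proteins(proteins, correlations, known_pairs, threshold=0.8):
    """Assign each protein to a category based on its well-correlated partners."""
    has_known, has_novel = set(), set()
    for pair, rho in correlations.items():
        if rho > threshold:
            # A well-correlated pair counts toward both of its member proteins.
            (has_known if pair in known_pairs else has_novel).update(pair)
    categories = {}
    for protein in proteins:
        if protein in has_known and protein in has_novel:
            categories[protein] = "known and putative novel"
        elif protein in has_known:
            categories[protein] = "known only"
        elif protein in has_novel:
            categories[protein] = "putative novel only"
        else:
            categories[protein] = "not well correlated"
    return categories


# Toy input: hypothetical Spearman correlations between four proteins.
toy_correlations = {
    frozenset(("A", "B")): 0.95,
    frozenset(("B", "C")): 0.85,
    frozenset(("A", "C")): 0.40,
    frozenset(("C", "D")): 0.10,
}
toy_known = {frozenset(("A", "B"))}
toy_categories = categorize_proteins(["A", "B", "C", "D"], toy_correlations, toy_known)
```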
      Fig. 5. Both yeast and human CF-MS data sets contain evidence for proteins associated with novel interactions, the fractionation profiles of which are significantly different from those of proteins associated with known interactions only. A, Proportions of proteins showing evidence for involvement in known interactions only, both known and putative novel interactions, putative novel interactions only, or with no evidence for interactions (i.e. all Spearman correlations with other proteins < 0.8) in each of the four CF-MS data sets under analysis. Known interactions were obtained from BioGrid (
      • Oughtred R.
      • Stark C.
      • Breitkreutz B.-J.
      • Rust J.
      • Boucher L.
      • Chang C.
      • Kolas N.
      • O'Donnell L.
      • Leung G.
      • McAdam R.
      • Zhang F.
      • Dolma S.
      • Willems A.
      • Coulombe-Huntington J.
      • Chatr-Aryamontri A.
      • Dolinski K.
      • Tyers M.
      The BioGRID interaction database: 2019 update.
      ,
      • Stark C.
      • Breitkreutz B.-J.
      • Reguly T.
      • Boucher L.
      • Breitkreutz A.
      • Tyers M.
      BioGRID: a general repository for interaction datasets.
      ) or Interactome3D (
      • Mosca R.
      • Céol A.
      • Aloy P.
      Interactome3D: adding structural details to protein networks.
      ) following the procedures described in Materials and Methods: Measuring network information content using ‘guilt-by-association’. B, Numbers and characteristics of peaks identified in fractionation profiles of proteins associated with known interactions only versus those associated with putative novel interactions only. Results are shown for each of the four CF-MS data sets under analysis. Peak numbers and characteristics were obtained using PrInCE (
      • Stacey R.G.
      • Skinnider M.A.
      • Scott N.E.
      • Foster L.J.
      A rapid and accurate approach for prediction of interactomes from co-elution data (PrInCE).
      ). Brackets indicate statistical comparisons from two-tailed Welch's t-tests with **, * and n.s. denoting p-values < 0.01, < 0.05 and > 0.05 respectively.
      Fig. 5B indicates that fractionation profiles associated with putative novel interactions have characteristics distinct from those derived from known interactions. In yeast, it can be seen that the putative novel complexes associated with these interactions are significantly smaller (i.e. elute in later SEC fractions) and elute significantly earlier during mixed-bed IEX than known complexes when comparing the centers of peaks observed in fractionation profiles. Proteins only associated with putative novel interactions also appear to be of low abundance relative to those only associated with known interactions. That is, their relative peak heights and widths are lower on average, suggesting that they elute at low abundance across relatively few fractions. These observations are statistically significant in the yeast SEC and yeast IEX data sets, respectively. It is also observed that the subunits of putative novel complexes produce fractionation profiles with significantly fewer peaks than those of known complexes; an observation which is consistent with these subunits being of low abundance. All these observations are strongly reinforced by the human data sets, which show identical trends to those in yeast.
      Together these findings indicate that CF-MS data sets of model organisms contain bodies of evidence for both known and novel PPIs, and that the CF-MS fractionation profiles of these two classes of PPI have distinct characteristics. They identify possible avenues toward improved characterization of known and novel protein complexes from the distinct features of their CF-MS fractionation profiles alone, as elaborated on in the Discussion.

       Assessment of Genomic Data Integration in CF-MS Data Analysis

      The results in the above sections relate to the identification of protein complexes solely using CF-MS fractionation profiles. However, most CF-MS studies seek to predict protein complexes using features beyond those related to fractionation profiles; a process which has been termed genomic data integration (
      • Skinnider M.A.
      • Stacey R.G.
      • Foster L.J.
      Genomic data integration systematically biases interactome mapping.
      ). The results in this section assess the utility of genomic data integration via EGAD evaluations (
      • Ballouz S.
      • Weber M.
      • Pavlidis P.
      • Gillis J.
      EGAD: ultra-fast functional analysis of gene networks.
      ) of the present CF-MS data sets. They assess two types of genomic data—gene co-expression and high confidence PPI—which are used in many of the supervised machine learning methods of Table I, and the GO, which is frequently used to define cluster cutoff distances in the clustering methods of Table I.
      The EGAD results for the yeast CF-MS data sets are shown in Fig. 6, with comparative results from the human CF-MS data sets shown alongside. These experiments provide an indication of the coherence of the information contained in the networks under investigation. That is, they show the relative capacities of network neighbors (e.g. well correlated fractionation profiles in CF-MS data sets) to predict shared GO, with AUROC scores > 0.5 indicating better than random performance. Similar experiments showing the relative capacities of network neighbors to predict high confidence PPIs are shown in supplemental Fig. S13.
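      The neighbor-voting idea behind such AUROC scores can be sketched in a few lines of Python. This is a minimal illustration only, not the EGAD implementation (EGAD is an R package, and its cross-validation is omitted here). The function names `neighbor_vote_scores` and `auroc`, the toy network and the toy GO labels are all hypothetical.

```python
def neighbor_vote_scores(adj, labels):
    """Score each protein by the weighted fraction of its neighbors carrying the annotation."""
    n = len(adj)
    scores = []
    for i in range(n):
        # Exclude the node itself so that it cannot vote for its own label.
        degree = sum(adj[i][j] for j in range(n) if j != i)
        votes = sum(adj[i][j] * labels[j] for j in range(n) if j != i)
        scores.append(votes / degree if degree > 0 else 0.0)
    return scores


def auroc(scores, labels):
    """Probability that a random annotated protein outscores a random unannotated one."""
    pos = [s for s, lab in zip(scores, labels) if lab == 1]
    neg = [s for s, lab in zip(scores, labels) if lab == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    # Ties between a positive and a negative count as half a win.
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Toy network (assumed values): proteins 0-2 are tightly connected and share a
# GO term, proteins 3-5 are tightly connected and lack it; cross edges are weak.
adj = [
    [0.0, 0.9, 0.8, 0.1, 0.0, 0.1],
    [0.9, 0.0, 0.9, 0.0, 0.1, 0.0],
    [0.8, 0.9, 0.0, 0.1, 0.0, 0.0],
    [0.1, 0.0, 0.1, 0.0, 0.7, 0.8],
    [0.0, 0.1, 0.0, 0.7, 0.0, 0.9],
    [0.1, 0.0, 0.0, 0.8, 0.9, 0.0],
]
go_labels = [1, 1, 1, 0, 0, 0]

scores = neighbor_vote_scores(adj, go_labels)
print(auroc(scores, go_labels))
```

      In this toy case every annotated protein outscores every unannotated one, so the AUROC is at its maximum. A network whose edges carry no information about the annotation would be expected to give a value near 0.5.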
      Fig. 6. Proteins that correlate in yeast and human CF-MS data sets can predict shared Gene Ontology, but only when they are already known to interact. The capacities for gene co-expression or protein-protein interaction networks to predict shared Gene Ontology do not increase when they are merged with CF-MS networks. A, Relative capacities of network neighbors to predict shared Gene Ontology in networks of correlated proteins in CF-MS data sets, gene co-expression networks (labelled “Coexp”), high confidence protein-protein interaction networks (labelled “PPIN”) and merged networks, as determined using the EGAD R package (
      • Ballouz S.
      • Weber M.
      • Pavlidis P.
      • Gillis J.
      EGAD: ultra-fast functional analysis of gene networks.
      ). Merged networks contain edge scores that are summed from parent networks and scaled, as per the procedures described in Materials and Methods. AUROC scores > 0.5 indicate better than random performance. Only AUROC scores within individual sets of EGAD experiments (i.e. yeast GO evaluations or human GO evaluations) can be compared; AUROC scores across sets cannot be compared. B, Relative capacities of network neighbors to predict shared Gene Ontology in CF-MS networks comprised of highly correlated proteins (Spearman correlations > 0.8) associated with known protein-protein interactions, highly correlated proteins (Spearman correlations > 0.8) associated with putative novel protein-protein interactions, gene co-expression networks (labelled “Coexp”) and merged networks, as per the methods employed in A. Known and putative novel interactions were determined using BioGRID (
      • Oughtred R.
      • Stark C.
      • Breitkreutz B.-J.
      • Rust J.
      • Boucher L.
      • Chang C.
      • Kolas N.
      • O'Donnell L.
      • Leung G.
      • McAdam R.
      • Zhang F.
      • Dolma S.
      • Willems A.
      • Coulombe-Huntington J.
      • Chatr-Aryamontri A.
      • Dolinski K.
      • Tyers M.
      The BioGRID interaction database: 2019 update.
      ,
      • Stark C.
      • Breitkreutz B.-J.
      • Reguly T.
      • Boucher L.
      • Breitkreutz A.
      • Tyers M.
      BioGRID: a general repository for interaction datasets.
      ) or Interactome3D (
      • Mosca R.
      • Céol A.
      • Aloy P.
      Interactome3D: adding structural details to protein networks.
      ) following the procedures described in Materials and Methods.
      The experiments of Fig. 6A reveal that networks of well correlated fractionation profiles in yeast CF-MS data sets can predict shared GO (blue bars). However, these predictive capacities are lower than those of yeast gene co-expression or high confidence PPI networks (purple bars). If the yeast SEC and IEX CF-MS data sets are merged following the procedures described in Materials and Methods, the information contained in the resultant network is no more coherent than in the stand-alone networks (pink bar). Similarly, if correlated fractionation profiles in yeast CF-MS data sets are merged with yeast gene co-expression (
      • Gillis J.
      • Ballouz S.
      • Pavlidis P.
      Bias tradeoffs in the creation and analysis of protein–protein interaction networks.
      ) or high confidence PPI networks (
      • Ignatius Pang C.N.
      • Goel A.
      • Wilkins M.R.
      Investigating the network basis of negative genetic interactions in Saccharomyces cerevisiae with integrated biological networks and triplet motif analysis.
      ), the resultant AUROC scores reflect those of the stand-alone gene co-expression or high confidence PPI networks (light brown bars). This contrasts with the results observed when yeast gene co-expression and high confidence PPI networks are merged. The network neighbors in this merged network have a higher capacity to predict shared GO (dark brown bar) than in the stand-alone networks. All the above trends are also observed when using high confidence yeast PPIs as the reference set instead of GO (supplemental Fig. S13).
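      As a rough illustration of what merging networks by summing and scaling edge scores can look like, consider the Python sketch below. The exact scaling procedure is described in Materials and Methods and is not reproduced here; dividing by the largest summed score is an assumption made purely for illustration. The function name `merge_networks`, the protein identifiers and the edge scores are all hypothetical.

```python
def merge_networks(*networks):
    """Sum edge scores across parent networks, then rescale so the top edge scores 1.

    Each network is a dict mapping a (protein_a, protein_b) tuple to a score.
    Rescaling by the maximum is an assumed choice, not the published procedure.
    """
    summed = {}
    for net in networks:
        for (a, b), score in net.items():
            key = tuple(sorted((a, b)))  # treat edges as undirected
            summed[key] = summed.get(key, 0.0) + score
    top = max(summed.values(), default=0.0)
    if top == 0:
        return summed
    return {edge: score / top for edge, score in summed.items()}


# Toy parent networks with made-up scores.
cfms = {("P1", "P2"): 0.9, ("P1", "P3"): 0.2}
coexp = {("P2", "P1"): 0.7, ("P3", "P4"): 0.4}

merged = merge_networks(cfms, coexp)
for edge, score in sorted(merged.items()):
    print(edge, round(score, 3))
```

      An edge supported by both parents, such as P1-P2 here, rises to the top of the merged network, whereas edges present in only one parent keep a proportionally lower score.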
      Inspection of Figs. 6A and S4 shows that the above results from yeast are broadly similar to those observed in human, with some minor differences. Specifically, neighbors in human gene co-expression (
      • Ballouz S.
      • Verleyen W.
      • Gillis J.
      Guidance for RNA-seq co-expression network construction and analysis: safety in numbers.
      ) or high confidence PPI (
      • Huttlin E.L.
      • Bruckner R.J.
      • Paulo J.A.
      • Cannon J.R.
      • Ting L.
      • Baltier K.
      • Colby G.
      • Gebreab F.
      • Gygi M.P.
      • Parzen H.
      • Szpyt J.
      • Tam S.
      • Zarraga G.
      • Pontano-Vaites L.
      • Swarup S.
      • White A.E.
      • Schweppe D.K.
      • Rad R.
      • Erickson B.K.
      • Obar R.A.
      • Guruharsha K.G.
      • Li K.
      • Artavanis-Tsakonas S.
      • Gygi S.P.
      • Harper J.W.
      Architecture of the human interactome defines protein communities and disease networks.
      ) networks only predict shared GO to a marginally higher extent than those in human CF-MS networks (purple and blue bars in Fig. 6A respectively), whereas the human gene co-expression network predicts high confidence human PPIs (purple bars in supplemental Fig. S13) to a lesser extent than the human CF-MS networks (blue bars in supplemental Fig. S13). Another difference is that when gene co-expression or high confidence PPI networks are merged with CF-MS networks in human, marginal improvements in AUROC scores are sometimes observed relative to the stand-alone networks (light brown bars). Altogether, however, these differences between yeast and human are proportionally minor.
      Taken together, the results of Fig. 6A and supplemental Fig. S13 indicate that both GO and high confidence PPIs can assist in the interpretation of CF-MS data sets. However, they also suggest that, for the purposes of predicting protein complexes, there is little additive value in incorporating gene co-expression or high confidence PPI information in CF-MS data analysis.
      Fig. 6B provides further insight into these findings by studying known and putative novel PPIs separately. In the yeast CF-MS data sets, only the well correlated fractionation profiles associated with known PPIs can predict shared GO (light blue bars); those associated with putative novel PPIs have almost no predictive capacity (dark blue bars). In addition, merged networks associated with both known and putative novel PPIs produce AUROC scores that reflect those of the stand-alone yeast gene co-expression network (light brown bars and purple bar respectively), which is consistent with Fig. 6A. The results observed in yeast are similar to but more pronounced than those observed in human.
      Taken together, these experiments demonstrate that GO can play a role in identifying reference protein complexes in CF-MS data sets, whereas gene co-expression and high confidence PPI data can be used to increase the coherence of information generated from CF-MS data analysis. However, these experiments also demonstrate that GO, gene co-expression data and high confidence PPI data can only play a limited role in the prediction of protein complexes and PPIs, as elaborated on in the Discussion.

      DISCUSSION

      The above results provide new insights into how different CF-MS approaches impact the precision and recall of protein complex identification. The sections below first place these insights into the context of the existing CF-MS literature. A series of recommendations for best practice CF-MS, and adjustments to past CF-MS practices, are then presented.

       Different Fractionation Profile Similarity Scoring Metrics Have Different Performance Characteristics

      Past CF-MS studies have employed a variety of scoring metrics to assess the similarity of CF-MS fractionation profiles. Pearson correlation and Euclidean distance are particularly commonly used, and it is also common practice for machine learning classifiers to employ a panel of scoring metrics as features (Table I). There is, however, little consensus on how well these scoring metrics perform relative to one another. Salas et al. recently provided some insight into this question by comparing the precision of interactions identified using 11 individual scoring metrics, as estimated by PrInCE (
      • Salas D.
      • Stacey R.G.
      • Akinlaja M.
      • Foster L.J.
      Next-generation interactomics: considerations for the use of co-elution to measure protein interaction networks.
      ). This identified that some scoring metrics are often more precise than others, but that results differ across CF-MS data sets.
      The results described in Fig. 3 substantially extend our knowledge of best-practice fractionation profile similarity scoring. They provide a measure of relative scoring metric sensitivity, while assessing performance from the perspective of characterizing entire protein complexes rather than individual interactions only. Moreover, in addition to the scoring metrics listed in Table I, they assess a range of scoring metrics that have not previously been applied to CF-MS data analysis.
      In considering the implications of these results, it is useful to emphasize that protein complexes obtained from cellular lysates have diverse biophysicochemical properties and may be involved in diverse nonprotein interactions (e.g. RNA interactions). In any CF-MS experiment, different protein complexes will therefore be resolved to different extents during fractionation, producing data sets with ranges of fractionation profile characteristics. This corroborates the present finding that even though some scoring metrics have broader scale utility than others, no single scoring metric is optimal for every protein complex contained in a CF-MS data set.
      The results of Fig. 3 support the current widespread use of Pearson correlation as a broadly suitable scoring metric for CF-MS data analysis. In addition, Spearman and Kendall correlations – which are not commonly used in CF-MS data analysis – and the recently introduced Co-apex metric (
      • Stacey R.G.
      • Skinnider M.A.
      • Scott N.E.
      • Foster L.J.
      A rapid and accurate approach for prediction of interactomes from co-elution data (PrInCE).
      ) are found to have similar broad-scale utility. The results do not, however, support the practice of using Euclidean distance as the sole scoring metric in CF-MS studies designed to maximize recall of protein complexes.
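      To make the comparison between these metrics concrete, the Python sketch below scores pairs of toy fractionation profiles with Pearson, Spearman and Kendall (tau-b) correlations, plus a deliberately simplified Co-apex distance. The simplification is an assumption: here Co-apex is taken as the difference between the fraction indices of the highest points of two profiles, whereas the published metric may be defined differently (for example, on fitted peaks). All profile values and function names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for a flat profile."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed profiles."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_b(x, y):
    """Kendall tau-b, which corrects for ties (common in sparse profiles)."""
    concordant = discordant = only_x_tied = only_y_tied = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both profiles: contributes to neither term
            elif dx == 0:
                only_x_tied += 1
            elif dy == 0:
                only_y_tied += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied = concordant + discordant
    return (concordant - discordant) / sqrt((untied + only_x_tied) * (untied + only_y_tied))


def co_apex_distance(x, y):
    """Simplified Co-apex (assumption): gap between the fractions holding each maximum."""
    return abs(x.index(max(x)) - y.index(max(y)))


# Toy profiles across eight fractions (made-up intensities).
profile_a = [0, 1, 5, 9, 4, 1, 0, 0]
profile_b = [0, 2, 6, 8, 3, 1, 0, 0]  # co-elutes with profile_a
profile_c = [7, 3, 1, 0, 0, 0, 2, 1]  # elutes elsewhere

for name, other in (("a vs b", profile_b), ("a vs c", profile_c)):
    print(name,
          round(pearson(profile_a, other), 3),
          round(spearman(profile_a, other), 3),
          round(kendall_tau_b(profile_a, other), 3),
          co_apex_distance(profile_a, other))
```

      Note that higher values indicate greater similarity for the three correlations, whereas lower values indicate greater similarity for the distance-style Co-apex score used in this sketch.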
      One of the findings relating to Fig. 3—namely, that many gold standard protein complexes are identified with statistical significance from diverse ranges of scoring metrics—supports a commonly held practice in CF-MS data analysis: the use of panels of scoring metrics in machine learning classifiers to improve the broad-scale precision of protein complex identification (see Table I). However, Fig. 3 also shows that several commonly employed scoring metrics—Euclidean distance, NCC and Peak Location—do not generally appear to be ideal candidates for use in such panels, as they lack relative broad-scale sensitivity.
      The additional findings described in relation to Fig. 3—that scoring metrics of low overall sensitivity may nonetheless have unique advantages for particular subsets of protein complexes—are an underappreciated aspect of CF-MS data analysis. If a CF-MS study is being performed with the goal of maximizing the recall of protein complexes, it may therefore be beneficial to make use of a panel of scoring metrics employed across separate stand-alone analyses (as opposed to a panel within a single machine learning classifier). When seeking to identify protein complexes whose subunits are, for example, frequently involved in multiple interactions, and thus produce fractionation profiles with multiple peaks, mutual information-based metrics may have advantages. Similarly, if seeking to identify protein complexes that only exist at specific masses, such as those without diverse sets of post-translational modifications, the Peak Location metric may have particular advantages when working with CF-MS data obtained via SEC.

       Orthogonal Fractionations Are More Efficient than Stand-Alone Fractionation

      Past CF-MS studies have employed a variety of fractionation methods, as outlined in Table I. These differences in experimental design have, in part, been driven by the types of protein complexes under investigation. SEC and IEX have generally been used to study soluble complexes, whereas BN-PAGE and mild detergent-based separations have been used for membrane-associated complexes (
      • Salas D.
      • Stacey R.G.
      • Akinlaja M.
      • Foster L.J.
      Next-generation interactomics: considerations for the use of co-elution to measure protein interaction networks.
      ). The varying degrees of fraction collection employed in these studies have been influenced by the resolution capable of being achieved using these different forms of separation.
      Although these factors have influenced how past studies have been performed, experimental design for CF-MS data collection has nonetheless remained imprecise. One factor contributing to this imprecision has been the aforementioned lack of consensus on how many fractions should be collected to produce fractionation profiles that are optimal for CF-MS data analysis, if presented with a given chromatographic resolving power. Moreover, the relative recall versus experimental cost of single versus orthogonal separations has not yet been described.
      In considering how much fraction collection should be performed, a finding of Fig. 4A – that the benefits of extensive fractionation of cell lysates may often not outweigh the costs – indicates that an assumption in the CF-MS literature may require re-interpretation. This assumption states that in order to improve the precision of CF-MS data analysis, the elution of unrelated proteins in the same fraction (i.e. chance co-fractionation) should be minimized, for example via very high resolution chromatography combined with extensive fraction collection (
      • Salas D.
      • Stacey R.G.
      • Akinlaja M.
      • Foster L.J.
      Next-generation interactomics: considerations for the use of co-elution to measure protein interaction networks.
      ). However, the present results indicate that chance co-fractionation is not inherently problematic. This is most strongly evidenced by the human SEC versus IEX CF-MS data sets. The average peak widths in the human SEC data set are higher than those in the human IEX data set (according to PrInCE; data not shown), indicating that higher resolution chromatography was achieved via IEX. Moreover, higher resolution fractionation profiles were collected for the IEX data set relative to the SEC data set (108 versus 40 fractions collected respectively). Despite this, Fig. 3A shows that after correcting for the 26% fewer fractionation profiles observed in the human IEX data set relative to the human SEC data set, recall of gold standard protein complexes is higher in the latter relative to the former for almost every scoring metric at identical thresholds of precision. This indicates that for these two data sets, even though the fractionation profiles produced via SEC are subject to more chance co-fractionation than those of IEX, they can be more precisely analyzed as their shapes are generally more distinctive.
      One way to increase the likelihood of producing distinctive fractionation profiles is to use multiple separation techniques. This explains the high efficiency of orthogonal relative to stand-alone fractionation described in relation to Fig. 4B. That is, fractions collected across orthogonal separations are likely to contribute toward an efficient means of improving recall: creating distinctive fractionation profiles. Fractions collected in stand-alone separations contribute toward an inefficient means of improving recall: increasing the resolution of fractionation profiles.
      The above insights are limited by the fact that the results of Fig. 4 were obtained using simulated decreases in fractionation profile resolution. Specifically, the potential for missing values increases when quantifying proteins across fewer, more complex CF-MS fractions, particularly when considering low abundance proteins, and the present simulation methods do not account for this. It has, however, recently been demonstrated that when optimized quantitative LC–MS/MS workflows are applied to the analysis of CF-MS fractions, near-saturation levels of protein quantification can be achieved without difficulty (e.g. with extremely short LC gradients) (
      • Bludau I.
      • Heusel M.
      • Rosenberger G.
      • Hafen R.
      • Frank M.
      • Banaei-Esfahani A.
      • Martelli C.
      • Nicod C.
      • Xue P.
      • Cai Y.
      • Liu Y.
      • Venkitaraman A.
      • Wickramasinghe V.
      • Roest H.
      • Collins B.
      • Gstaiger M.
      • Aebersold R.
      Mini Symposium: Complex-centric proteome profiling by SEC-SWATH-MS.
      ). This indicates that despite the above limitation, the findings of Fig. 4 should broadly hold true if best-practice quantitative proteomics workflows are employed.

       Identifying Known versus Novel Protein Complexes in CF-MS Data Sets

      Most past CF-MS studies have sought to identify PPIs from both known and novel protein complexes, as detailed in Table I. Only recently have some studies focused solely on the broad-scale identification of known protein complexes from model organisms (
      • Heusel M.
      • Bludau I.
      • Rosenberger G.
      • Hafen R.
      • Frank M.
      • Banaei-Esfahani A.
      • Drogen A.
      • Collins B.C.
      • Gstaiger M.
      • Aebersold R.
      Complex-centric proteome profiling by SEC-SWATH-MS.
      ,
      • Heusel M.
      • Frank M.
      • Köhler M.
      • Amon S.
      • Frommelt F.
      • Rosenberger G.
      • Bludau I.
      • Aulakh S.K.
      • Linder M.I.
      • Liu Y.
      A global screen for assembly state changes of the mitotic proteome by SEC-SWATH-MS.
      ). The results described in relation to Figs. 5 and 6 provide new insight into best-practice CF-MS data analysis when following each of these approaches.
      Approaches which place a sole focus on known protein complexes have one chief advantage: precision can be robustly measured, for example via Reference Complex Profiling or the CCprofiler workflow (
      • Heusel M.
      • Bludau I.
      • Rosenberger G.
      • Hafen R.
      • Frank M.
      • Banaei-Esfahani A.
      • Drogen A.
      • Collins B.C.
      • Gstaiger M.
      • Aebersold R.
      Complex-centric proteome profiling by SEC-SWATH-MS.
      ). However, the results described in relation to Fig. 5A suggest that present implementations of these approaches do not maximize recall. Both Reference Complex Profiling and CCprofiler identify known protein complexes using fractionation profiles only. Of the fractionation profiles likely to be associated with known protein complexes shown in Fig. 5A, only limited percentages are verified as statistically significant using Reference Complex Profiling: 27% in the yeast SEC data set, 16% in the yeast IEX data set, 25% in the human SEC data set and 31% in the human IEX data set. Although it is difficult to estimate recall using these numbers, these low percentages leave open the possibility that some known protein complexes may not have reached statistical significance because their relevant fractionation profiles are not sufficiently distinctive. To remedy this without collecting additional CF-MS data, gene co-expression data, high confidence PPI data and GO may be of assistance as discussed in relation to Fig. 6.
      The above approaches are inherently limited by the fact that novel protein complexes are not considered. Fig. 5A suggests that this may produce notable limitations in recall even in heavily benchmarked model organisms. Moreover, these approaches are limited to the study of organisms with existing reference libraries of protein complexes. To reach the full potential of CF-MS, identification of novel protein complexes must therefore also be considered.
      To uncover both known and novel protein complexes in model organisms, it can be envisaged that the above approaches can be extended. If known protein complexes are firstly identified, novel protein complexes may then in theory be uncovered by subjecting unassigned fractionation profiles to similarity scoring. Unfortunately, genomic data integration is unlikely to add much value in such analyses, as discussed in relation to Fig. 6B. That is, although Fig. 6B does not preclude the use of gene co-expression data or GO to shortlist putative novel PPIs as candidates for orthogonal validation, it indicates that few, if any, novel PPIs will be uncovered in this manner.
      An alternative is to characterize both known and novel protein complexes or PPIs concurrently. The supervised machine learning classifiers and clustering methods of Table I fall under this category. Unfortunately the current findings indicate that present implementations of these methods are likely to be imprecise. This is because the machine learning classifiers in Table I frequently employ features associated with external genomic data, which can have different characteristics for known and novel PPIs as discussed in relation to Fig. 6B. The common practice of using known PPIs to train and test classifiers is therefore problematic, which helps explain why novel PPIs uncovered using these classifiers may have very high false discovery rates (
      • Shatsky M.
      • Dong M.
      • Liu H.
      • Yang L.L.
      • Choi M.
      • Singer M.E.
      • Geller J.T.
      • Fisher S.J.
      • Hall S.C.
      • Hazen T.C.
      Quantitative tagless co-purification: a method to validate and identify protein-protein interactions.
      ). Moreover these differences preclude the use of GO to define cluster cutoff distances when employing the clustering methods of Table I. Together these insights are strongly consistent with and shed new light on the recent finding that genomic data integration decreases the power to uncover novel PPIs (
      • Skinnider M.A.
      • Stacey R.G.
      • Foster L.J.
      Genomic data integration systematically biases interactome mapping.
      ).
      One possible avenue toward precise characterizations of novel protein complexes is to identify conserved protein complexes across organisms. This has, for example, been attempted using CF-MS data sets collected from multiple metazoan (
      • Wan C.
      • Borgeson B.
      • Phanse S.
      • Tu F.
      • Drew K.
      • Clark G.
      • Xiong X.
      • Kagan O.
      • Kwan J.
      • Bezginov A.
      • Chessman K.
      • Pal S.
      • Cromar G.
      • Papoulas O.
      • Ni Z.
      • Boutz D.R.
      • Stoilova S.
      • Havugimana P.C.
      • Guo X.
      • Malty R.H.
      • Sarov M.
      • Greenblatt J.
      • Babu M.
      • Derry W.B.
      • Tillier E.R.
      • Wallingford J.B.
      • Parkinson J.
      • Marcotte E.M.
      • Emili A.
      Panorama of ancient metazoan macromolecular complexes.
      ) or plant (
      • McWhite C.D.
      • Papoulas O.
      • Drew K.
      • Cox R.M.
      • June V.
      • Dong O.X.
      • Kwon T.
      • Wan C.
      • Salmi M.L.
      • Roux S.J.
      • Browning K.S.
      • Chen Z.J.
      • Ronald P.C.
      • Marcotte E.M.
      A pan-plant protein complex map reveals deep conservation and novel assemblies.
      ) species. The present comparisons between yeast and human identify a series of putative novel conserved PPIs, listed in supplemental Table S1, supporting this possibility. Consistent with the findings described in Fig. 6, these putative novel PPIs could not have been uncovered solely by shortlisting proteins with shared GO or gene co-expression. This highlights the unique means by which they were uncovered, but also their need to be validated using orthogonal experimental data.

       Extending Characterizations of Protein Complexes to Nonmodel Organisms

      The above discussions are centered on the analysis of CF-MS data from model organisms. Together they also provide guidance for the analysis of nonmodel organisms. It is common practice to use orthologous genomic data from model organisms – including orthologous PPIs – to interpret CF-MS data obtained from organisms without extensive reference libraries of protein complexes (
      • Shatsky M.
      • Dong M.
      • Liu H.
      • Yang L.L.
      • Choi M.
      • Singer M.E.
      • Geller J.T.
      • Fisher S.J.
      • Hall S.C.
      • Hazen T.C.
      Quantitative tagless co-purification: a method to validate and identify protein-protein interactions.
      ,
      • Crozier T.W.
      • Tinti M.
      • Larance M.
      • Lamond A.I.
      • Ferguson M.A.
      Prediction of protein complexes in Trypanosoma brucei by protein correlation profiling mass spectrometry and machine learning.
      ,
      • McWhite C.D.
      • Papoulas O.
      • Drew K.
      • Cox R.M.
      • June V.
      • Dong O.X.
      • Kwon T.
      • Wan C.
      • Salmi M.L.
      • Roux S.J.
      • Browning K.S.
      • Chen Z.J.
      • Ronald P.C.
      • Marcotte E.M.
      A pan-plant protein complex map reveals deep conservation and novel assemblies.
      ,
      • McBride Z.
      • Chen D.
      • Reick C.
      • Xie J.
      • Szymanski D.B.
      Global analysis of membrane-associated protein oligomerization using protein correlation profiling.
      ). The findings of Fig. 5B indicate this will only be effective for certain subsets of fractionation profiles. Specifically, there is a high possibility that fractionation profiles from nonmodel organisms with characteristics that match those of putative novel interactions in model organisms cannot be assessed using orthologous genomic data. They should therefore be omitted from CF-MS data analysis as detailed in the section below.
      If analyzing nonmodel organisms using fractionation profiles alone, and not orthologous genomic data, the tools available for increasing precision are further reduced. Following Reference Complex Profiling, large proportions of well correlated pairs of fractionation profiles (i.e. those making up the colored bars in Fig. 5A) cannot immediately be attributed to statistically significant protein complexes: 80 ± 7% and 76 ± 1% in the yeast and human CF-MS data sets respectively. It is likely that some of these nonsignificant pairs represent genuine interactions, as discussed earlier. Nonetheless these high percentages indicate that precise PPI identifications are difficult when using stand-alone fractionation profile similarity scoring. The practice of shortlisting putative conserved PPIs (
      • Wan C.
      • Borgeson B.
      • Phanse S.
      • Tu F.
      • Drew K.
      • Clark G.
      • Xiong X.
      • Kagan O.
      • Kwan J.
      • Bezginov A.
      • Chessman K.
      • Pal S.
      • Cromar G.
      • Papoulas O.
      • Ni Z.
      • Boutz D.R.
      • Stoilova S.
      • Havugimana P.C.
      • Guo X.
      • Malty R.H.
      • Sarov M.
      • Greenblatt J.
      • Babu M.
      • Derry W.B.
      • Tillier E.R.
      • Wallingford J.B.
      • Parkinson J.
      • Marcotte E.M.
      • Emili A.
      Panorama of ancient metazoan macromolecular complexes.
      ,
      • McWhite C.D.
      • Papoulas O.
      • Drew K.
      • Cox R.M.
      • June V.
      • Dong O.X.
      • Kwon T.
      • Wan C.
      • Salmi M.L.
      • Roux S.J.
      • Browning K.S.
      • Chen Z.J.
      • Ronald P.C.
      • Marcotte E.M.
      A pan-plant protein complex map reveals deep conservation and novel assemblies.
      ) may therefore be particularly advantageous for nonmodel organisms. For example, if the well-matched fractionation profiles in Fig. 5A are limited to orthologous yeast and human pairs, 97 and 95% can be attributed to known yeast and human PPIs respectively. This suggests that PPIs uncovered by shortlisting conserved fractionation profile pairs may be relatively precise.
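      The shortlisting of conserved pairs described above can be sketched as a simple set operation in Python. This is an illustration under stated assumptions: orthology is treated as a one-to-one mapping, the input pairs are taken to be those already passing a similarity threshold in each organism, and the function name `conserved_pairs` and all identifiers are hypothetical.

```python
def conserved_pairs(pairs_a, pairs_b, orthologs_a_to_b):
    """Keep pairs from organism A whose orthologs also form a pair in organism B.

    pairs_a, pairs_b: iterables of (protein, protein) tuples that passed a
    similarity threshold in each organism.
    orthologs_a_to_b: dict mapping organism A proteins to organism B proteins
    (assumed one-to-one for simplicity).
    """
    b_set = {frozenset(pair) for pair in pairs_b}  # ignore pair orientation
    kept = []
    for x, y in pairs_a:
        ox, oy = orthologs_a_to_b.get(x), orthologs_a_to_b.get(y)
        if ox is None or oy is None or ox == oy:
            continue  # no usable ortholog for at least one partner
        if frozenset((ox, oy)) in b_set:
            kept.append((x, y))
    return kept


# Toy inputs with made-up identifiers.
yeast_pairs = [("y1", "y2"), ("y3", "y4"), ("y5", "y6")]
human_pairs = [("h2", "h1"), ("h7", "h8")]
yeast_to_human = {"y1": "h1", "y2": "h2", "y3": "h3", "y4": "h4", "y5": "h7"}

print(conserved_pairs(yeast_pairs, human_pairs, yeast_to_human))
```

      Only the first yeast pair survives in this example: its partners both have orthologs, and those orthologs are themselves a well-matched pair in the second organism.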

       Analytical Recommendations for co-Fractionation Mass Spectrometry

      A series of analytical guidelines and recommendations for the collection and analysis of CF-MS data, based on the insights discussed above, are presented below.
      In relation to CF-MS data collection, it is possible to prioritize cost efficiency while maintaining data quality. If employing stand-alone SEC or IEX of complex cell lysates using methods like those studied here, collecting as few as 40 fractions will produce fractionation profiles with sufficient resolution for near-maximum recall of protein complexes. If collecting additional fractions to maximize recall, the benefits of spreading fractions across complementary data sets created using orthogonal separations, rather than a single higher resolution data set, should be the primary driving force in experimental design. Higher resolution stand-alone data sets may, however, be appropriate if low abundance proteins, such as those potentially associated with novel protein complexes in model organisms, are a specific focus of investigation.
      In relation to data analysis, precise analyses are possible if profiling known protein complexes via, for example, Reference Complex Profiling or CCprofiler (
      • Heusel M.
      • Bludau I.
      • Rosenberger G.
      • Hafen R.
      • Frank M.
      • Banaei-Esfahani A.
      • Drogen A.
      • Collins B.C.
      • Gstaiger M.
      • Aebersold R.
      Complex-centric proteome profiling by SEC-SWATH-MS.
      ). If seeking to maximize the recall of such methods using a stand-alone scoring metric, any one of the following correlation metrics – Pearson, Spearman or Kendall – or the Co-apex metric is recommended. However, if additional computation time is an option, the improvements in recall that can be expected from the union of a panel of scoring metrics should be taken advantage of. Large panels will maximize recall (see supplemental Table S2); however small panels of well-chosen correlation-based, distance-based, mutual information-based or peak-based metrics will also be effective while reducing computation time. The following panel, which identifies 25.0% to 82.3% more significant gold standard protein complexes from the present CF-MS data sets compared with Spearman correlations alone (supplemental Table S2), is recommended as one such option: Spearman correlation, DMax, BCMI and Co-apex. If collecting data using orthogonal fractionation methods, it is recommended that any p-values obtained from individual fractionation methods using the above methods are combined using Fisher's method. Finally, genomic data integration should be considered if seeking further improvements in the precision and recall of known protein complexes.
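      For readers unfamiliar with Fisher's method, the Python sketch below shows how p-values for one complex from separate fractionation methods could be combined. Fisher's method assumes the p-values come from independent tests; the statistic -2 Σ ln(p) then follows a chi-square distribution with 2k degrees of freedom, whose upper tail has a closed form for even degrees of freedom. The function name `fisher_combine` and the example p-values are hypothetical.

```python
from math import exp, factorial, log


def fisher_combine(p_values):
    """Combine independent p-values with Fisher's method.

    The test statistic X = -2 * sum(ln p) is chi-square distributed with 2k
    degrees of freedom under the null. For even degrees of freedom the upper
    tail is exp(-X/2) * sum_{i<k} (X/2)^i / i!, so no statistics library is needed.
    """
    k = len(p_values)
    if k == 0:
        raise ValueError("at least one p-value is required")
    if any(p <= 0 or p > 1 for p in p_values):
        raise ValueError("p-values must lie in the interval (0, 1]")
    half_stat = -sum(log(p) for p in p_values)  # this is X / 2
    return exp(-half_stat) * sum(half_stat ** i / factorial(i) for i in range(k))


# Made-up p-values for one complex in two orthogonal separations.
p_sec, p_iex = 0.04, 0.03
print(fisher_combine([p_sec, p_iex]))
```

      Two moderately small p-values combine into a smaller one, which is how evidence that is individually borderline in each separation can reach significance when the separations are considered jointly.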
      Precise CF-MS analyses of novel protein complexes are not yet possible when studying individual organisms, even when genomic data integration and the collection of high-resolution CF-MS data are considered. This holds true for both model and nonmodel organisms. CF-MS studies of novel protein complexes and PPIs in individual organisms should therefore be limited to generating shortlists for targeted orthogonal experimental validation. To maximize the quality of these shortlists, false discovery rates can be reduced via the following recommendations.
      A) If training and testing a machine learning classifier using gold standard protein complexes from a model organism, the classifier should not be used to identify novel protein complexes in this organism if it employs features derived from genomic data.
      B) If studying a nonmodel organism, machine learning classifiers employing features derived from orthologous genomic data can be used. However, if training the classifier using orthologous gold standard protein complexes, it is recommended that the following fractionation profiles are not analyzed using the classifier: the lowest abundance quartile, as defined using maximum peak heights; those with peaks in the last quartile of retention times if employing SEC; and those with peaks in the first quartile of retention times if employing mixed-bed IEX.
      C) If predicting novel protein complexes using stand-alone fractionation profile similarity scoring, Pearson, Spearman or Kendall correlations or the Co-apex metric are recommended, whereas Euclidean distance is not. Moreover, in this context, panels of scoring metrics are generally not recommended outside of a machine learning framework. Because precision cannot be accurately measured in this setting, the union of results derived from such panels may result in high false discovery, whereas the intersection of results will often lead to very low recall (see supplemental Figs. S9–S12).
      D) If employing clustering methods to define the subunit compositions of novel protein complexes, the use of external genomic data to define cluster cutoff distances is not recommended.
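      The profile exclusions described in recommendation B can be sketched as a simple pre-filter. The following is an illustrative sketch only, under several assumptions that are not taken from this study: each profile is a list of intensities ordered by retention time, the peak of a profile is taken as its single highest fraction, fraction index is used as a proxy for retention time, and the lowest abundance quartile is defined by the inclusive first quartile of maximum peak heights. The function name `filter_profiles` is hypothetical.

```python
from statistics import quantiles


def filter_profiles(profiles, method):
    """Return IDs of profiles that pass the recommendation B exclusions.

    profiles: dict mapping protein ID -> list of intensities, ordered by
              retention time (assumed layout, one value per fraction).
    method:   "SEC" or "IEX" (mixed-bed ion exchange).
    """
    if method not in ("SEC", "IEX"):
        raise ValueError("method must be 'SEC' or 'IEX'")

    # Maximum peak height of every profile, used to define abundance.
    heights = {pid: max(values) for pid, values in profiles.items()}
    # First quartile of peak heights (assumed definition of the cutoff).
    cutoff = quantiles(list(heights.values()), n=4, method="inclusive")[0]

    kept = []
    for pid, values in profiles.items():
        apex = values.index(max(values))         # fraction holding the peak
        rt_quartile = apex * 4 // len(values)    # 0 = first, 3 = last quartile

        if heights[pid] < cutoff:
            continue  # lowest abundance quartile
        if method == "SEC" and rt_quartile == 3:
            continue  # peak in last quartile of retention times
        if method == "IEX" and rt_quartile == 0:
            continue  # peak in first quartile of retention times
        kept.append(pid)
    return kept


# Hypothetical eight-fraction profiles.
example = {
    "A": [0, 0, 10, 0, 0, 0, 0, 0],
    "B": [0, 0, 0, 20, 0, 0, 0, 0],
    "C": [0, 0, 0, 0, 0, 0, 30, 0],
    "D": [40, 0, 0, 0, 0, 0, 0, 0],
    "E": [0, 0, 0, 0, 50, 0, 0, 0],
}
print(filter_profiles(example, "SEC"))
print(filter_profiles(example, "IEX"))
```

      In practice, peak positions would more likely come from fitted peaks and measured retention times than from the raw apex fraction, so the thresholds here should be read as placeholders for those quantities.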
      Finally, the study of CF-MS data sets from multiple organisms presents additional options for identifying novel protein complexes. The present findings lend support to the practice of identifying putative conserved PPIs across organisms to improve precision.

      CONCLUSIONS

      The substantial promise of CF-MS lies not only in its potential for very broad-scale characterizations of endogenous and unmanipulated protein complexes, but also in its inherent capacity to study how these protein complexes change (Kristensen et al., "A high-throughput approach for measuring temporal changes in the interactome"). Recent advances in CF-MS have brought the method closer to fulfilling this potential (Scott et al., "Interactome disassembly during apoptosis occurs independent of caspase cleavage"; Heusel et al., "A global screen for assembly state changes of the mitotic proteome by SEC-SWATH-MS"). However, a primary objective of such work, namely to enable routine time course or cohort analyses of cellular assemblies (Heusel et al., "A global screen for assembly state changes of the mitotic proteome by SEC-SWATH-MS"), is still experimentally prohibitive, and requires increasingly effective CF-MS data analysis workflows to be developed.
      The present recommendations for more cost-effective CF-MS data collection will bring this objective closer. Moreover, the present recommendations for precise and sensitive CF-MS data analysis, when integrated with workflows such as CCprofiler (Heusel et al., "Complex-centric proteome profiling by SEC-SWATH-MS"; Heusel et al., "A global screen for assembly state changes of the mitotic proteome by SEC-SWATH-MS"), should further fulfill the data analysis requirements of this objective for known protein complexes in model organisms. Together these advances can be expected to bring routine and impactful studies of the dynamics of known PPIs well within the reach of CF-MS.
      The present findings also highlight the pronounced differences in studying known versus novel protein complexes using CF-MS. It can therefore be envisaged that CF-MS will benefit immensely from increasingly sophisticated reference libraries of protein complexes and PPIs. Targeted orthogonal experimental validation of putative novel complexes shortlisted via CF-MS may be an effective means of expanding these reference libraries. If following the above recommendations, high quality shortlists in nonmodel organisms will consist mainly of large and high abundance protein complexes, whereas they will contain more small and low abundance protein complexes in highly studied model organisms. These differences suggest that validation experiments can be precisely targeted for different organisms. For example, high depth of coverage chemical cross-linking MS experiments performed on SEC fractions corresponding to selected molecular weight ranges can be envisaged. When coupled with proteomics and CF-MS data sharing initiatives (Wan et al., "Panorama of ancient metazoan macromolecular complexes"; Kirkwood et al., "Characterisation of native protein complexes and protein isoform variation using size-fractionation based quantitative proteomics"; Vizcaíno et al., "ProteomeXchange provides globally coordinated proteomics data submission and dissemination"), expanding reference libraries of protein complexes in this manner will pave a way forward for CF-MS studies of continually increasing depth, precision, and impact.

      DATA AVAILABILITY

      The MS proteomics data have been deposited to the ProteomeXchange Consortium via the PRIDE (Perez-Riverol et al., "The PRIDE database and related tools and resources in 2019: improving support for quantification data") partner repository with the data set identifier PXD019513.

      Acknowledgments

      We thank Dr. Ling Zhong, Ms. Sydney Liu Lau and A/Prof. Mark Raftery for their maintenance of the mass spectrometers housed at the UNSW Bioanalytical Mass Spectrometry Facility. This research includes computations using the computational cluster Katana supported by Research Technology Services at UNSW Sydney. We thank Dr. Michele Tinti for his generous advice, and Prof. Paul Haynes for his critical evaluations of manuscript drafts.

      REFERENCES

        • Vidal M.
        • Cusick M.E.
        • Barabási A.-L.
        Interactome networks and human disease.
        Cell. 2011; 144: 986-998
        • Ideker T.
        • Krogan N.J.
        Differential network biology.
        Mol. Syst. Biol. 2012; 8: 565
        • Bonetta L.
        Protein–protein interactions: interactome under construction.
        Nature. 2010; 468: 851-854
        • Havugimana P.C.
        • Hart G.T.
        • Nepusz T.
        • Yang H.
        • Turinsky A.L.
        • Li Z.
        • Wang P.I.
        • Boutz D.R.
        • Fong V.
        • Phanse S.
        • Babu M.
        • Craig S.A.
        • Hu P.
        • Wan C.
        • Vlasblom J.
        • Dar V-U-N.
        • Bezginov A.
        • Clark G.W.
        • Wu G.C.
        • Wodak S.J.
        • Tillier E.R.M.
        • Paccanaro A.
        • Marcotte E.M.
        • Emili A.
        A census of human soluble protein complexes.
        Cell. 2012; 150: 1068-1081
        • Wan C.
        • Borgeson B.
        • Phanse S.
        • Tu F.
        • Drew K.
        • Clark G.
        • Xiong X.
        • Kagan O.
        • Kwan J.
        • Bezginov A.
        • Chessman K.
        • Pal S.
        • Cromar G.
        • Papoulas O.
        • Ni Z.
        • Boutz D.R.
        • Stoilova S.
        • Havugimana P.C.
        • Guo X.
        • Malty R.H.
        • Sarov M.
        • Greenblatt J.
        • Babu M.
        • Derry W.B.
        • Tillier E.R.
        • Wallingford J.B.
        • Parkinson J.
        • Marcotte E.M.
        • Emili A.
        Panorama of ancient metazoan macromolecular complexes.
        Nature. 2015; 525: 339-344
        • Drew K.
        • Müller C.L.
        • Bonneau R.
        • Marcotte E.M.
        Identifying direct contacts between protein complex subunits from their conditional dependence in proteomics datasets.
        PLoS Comput. Biol. 2017; 13: e1005625
        • Scott N.E.
        • Rogers L.D.
        • Prudova A.
        • Brown N.F.
        • Fortelny N.
        • Overall C.M.
        • Foster L.J.
        Interactome disassembly during apoptosis occurs independent of caspase cleavage.
        Mol. Syst. Biol. 2017; 13: 906
        • Larance M.
        • Kirkwood K.J.
        • Tinti M.
        • Murillo A.B.
        • Ferguson M.A.
        • Lamond A.I.
        Global membrane protein interactome analysis using in vivo crosslinking and MS-based protein correlation profiling.
        Mol. Cell. Proteomics. 2016; (O115. 055467)
        • Shatsky M.
        • Dong M.
        • Liu H.
        • Yang L.L.
        • Choi M.
        • Singer M.E.
        • Geller J.T.
        • Fisher S.J.
        • Hall S.C.
        • Hazen T.C.
        Quantitative tagless co-purification: a method to validate and identify protein-protein interactions.
        Mol. Cell. Proteomics. 2016; (M115. 057117)
        • Stacey R.G.
        • Skinnider M.A.
        • Scott N.E.
        • Foster L.J.
        A rapid and accurate approach for prediction of interactomes from co-elution data (PrInCE).
        BMC Bioinformatics. 2017; 18: 457
        • Crozier T.W.
        • Tinti M.
        • Larance M.
        • Lamond A.I.
        • Ferguson M.A.
        Prediction of protein complexes in Trypanosoma brucei by protein correlation profiling mass spectrometry and machine learning.
        Mol. Cell. Proteomics. 2017; (O117. 068122)
        • Carlson M.L.
        • Stacey R.G.
        • Young J.W.
        • Wason I.S.
        • Zhao Z.
        • Rattray D.G.
        • Scott N.
        • Kerr C.H.
        • Babu M.
        • Foster L.J.
        • Duong Van Hoa F.
        Profiling the Escherichia coli membrane protein interactome captured in Peptidisc libraries.
        Elife. 2019; 8
        • McWhite C.D.
        • Papoulas O.
        • Drew K.
        • Cox R.M.
        • June V.
        • Dong O.X.
        • Kwon T.
        • Wan C.
        • Salmi M.L.
        • Roux S.J.
        • Browning K.S.
        • Chen Z.J.
        • Ronald P.C.
        • Marcotte E.M.
        A pan-plant protein complex map reveals deep conservation and novel assemblies.
        Cell. 2020; 181: 460-474.e14
        • Heusel M.
        • Bludau I.
        • Rosenberger G.
        • Hafen R.
        • Frank M.
        • Banaei-Esfahani A.
        • Drogen A.
        • Collins B.C.
        • Gstaiger M.
        • Aebersold R.
        Complex-centric proteome profiling by SEC-SWATH-MS.
        Mol. Syst. Biol. 2019; 15: e8438
        • Heusel M.
        • Frank M.
        • Köhler M.
        • Amon S.
        • Frommelt F.
        • Rosenberger G.
        • Bludau I.
        • Aulakh S.K.
        • Linder M.I.
        • Liu Y.
        A global screen for assembly state changes of the mitotic proteome by SEC-SWATH-MS.
        CELL-SYSTEMS-D-19-00261. 2019;
        • Kirkwood K.J.
        • Ahmad Y.
        • Larance M.
        • Lamond A.I.
        Characterisation of native protein complexes and protein isoform variation using size-fractionation based quantitative proteomics.
        Mol. Cell. Proteomics. 2013; (M113. 032367)
        • Kristensen A.R.
        • Gsponer J.
        • Foster L.J.
        A high-throughput approach for measuring temporal changes in the interactome.
        Nat. Methods. 2012; 9: 907-909
        • Scott N.E.
        • Brown L.M.
        • Kristensen A.R.
        • Foster L.J.
        Development of a computational framework for the analysis of protein correlation profiling and spatial proteomics experiments.
        J. Proteomics. 2015; 118: 112-129
        • Stacey R.G.
        • Skinnider M.A.
        • Chik J.H.
        • Foster L.J.
        Context-specific interactions in literature-curated protein interaction databases.
        BMC Genomics. 2018; 19: 758
        • Cusick M.E.
        • Klitgord N.
        • Vidal M.
        • Hill D.E.
        Interactome: gateway into systems biology.
        Human Mol. Gen. 2005; 14: R171-R181
        • Benschop J.J.
        • Brabers N.
        • van Leenen D.
        • Bakker L.V.
        • van Deutekom H.W.
        • van Berkum N.L.
        • Apweiler E.
        • Lijnzaad P.
        • Holstege F.C.
        • Kemmeren P.
        A consensus of core protein complex compositions for Saccharomyces cerevisiae.
        Mol. Cell. 2010; 38: 916-928
        • Babu M.
        • Vlasblom J.
        • Pu S.
        • Guo X.
        • Graham C.
        • Bean B.D.M.
        • Burston H.E.
        • Vizeacoumar F.J.
        • Snider J.
        • Phanse S.
        • Fong V.
        • Tam Y.Y.C.
        • Davey M.
        • Hnatshak O.
        • Bajaj N.
        • Chandran S.
        • Punna T.
        • Christopolous C.
        • Wong V.
        • Yu A.
        • Zhong G.
        • Li J.
        • Stagljar I.
        • Conibear E.
        • Wodak S.J.
        • Emili A.
        • Greenblatt J.F.
        Interaction landscape of membrane-protein complexes in Saccharomyces cerevisiae.
        Nature. 2012; 489: 585-589
        • Gavin A.-C.
        • Aloy P.
        • Grandi P.
        • Krause R.
        • Boesche M.
        • Marzioch M.
        • Rau C.
        • Jensen L.J.
        • Bastuck S.
        • Dümpelfeld B.
        • Edelmann A.
        • Heurtier M.-A.
        • Hoffman V.
        • Hoefert C.
        • Klein K.
        • Hudak M.
        • Michon A.-M.
        • Schelder M.
        • Schirle M.
        • Remor M.
        • Rudi T.
        • Hooper S.
        • Bauer A.
        • Bouwmeester T.
        • Casari G.
        • Drewes G.
        • Neubauer G.
        • Rick J.M.
        • Kuster B.
        • Bork P.
        • Russell R.B.
        • Superti-Furga G.
        Proteome survey reveals modularity of the yeast cell machinery.
        Nature. 2006; 440: 631-636
        • Krogan N.J.
        • Cagney G.
        • Yu H.
        • Zhong G.
        • Guo X.
        • Ignatchenko A.
        • Li J.
        • Pu S.
        • Datta N.
        • Tikuisis A.P.
        • Punna T.
        • Peregrín-Alvarez J.M.
        • Shales M.
        • Zhang X.
        • Davey M.
        • Robinson M.D.
        • Paccanaro A.
        • Bray J.E.
        • Sheung A.
        • Beattie B.
        • Richards D.P.
        • Canadien V.
        • Lalev A.
        • Mena F.
        • Wong P.
        • Starostine A.
        • Canete M.M.
        • Vlasblom J.
        • Wu S.
        • Orsi C.
        • Collins S.R.
        • Chandran S.
        • Haw R.
        • Rilstone J.J.
        • Gandi K.
        • Thompson N.J.
        • Musso G.
        • St Onge P.
        • Ghanny S.
        • Lam M.H.Y.
        • Butland G.
        • Altaf-Ul A.M.
        • Kanaya S.
        • Shilatifard A.
        • O'Shea E.
        • Weissman J.S.
        • Ingles C.J.
        • Hughes T.R.
        • Parkinson J.
        • Gerstein M.
        • Wodak S.J.
        • Emili A.
        • Greenblatt J.F.
        Global landscape of protein complexes in the yeast Saccharomyces cerevisiae.
        Nature. 2006; 440: 637-643
        • Yu H.
        • Braun P.
        • Yildirim M.A.
        • Lemmens I.
        • Venkatesan K.
        • Sahalie J.
        • Hirozane-Kishikawa T.
        • Gebreab F.
        • Li N.
        • Simonis N.
        • Hao T.
        • Rual J.-F.
        • Dricot A.
        • Vazquez A.
        • Murray R.R.
        • Simon C.
        • Tardivo L.
        • Tam S.
        • Svrzikapa N.
        • Fan C.
        • de Smet A.-S.
        • Motyl A.
        • Hudson M.E.
        • Park J.
        • Xin X.
        • Cusick M.E.
        • Moore T.
        • Boone C.
        • Snyder M.
        • Roth F.P.
        • Barabasi A.-L.
        • Tavernier J.
        • Hill D.E.
        • Vidal M.
        High-quality binary protein interaction map of the yeast interactome network.
        Science. 2008; 322: 104-110
        • Ruepp A.
        • Waegele B.
        • Lechner M.
        • Brauner B.
        • Dunger-Kaltenbach I.
        • Fobo G.
        • Frishman G.
        • Montrone C.
        • Mewes H.-W.
        CORUM: the comprehensive resource of mammalian protein complexes—2009.
        Nucleic Acids Res. 2010; 38: D497-D501
        • Hart-Smith G.
        • Raftery M.J.
        Detection and characterization of low abundance glycopeptides via higher-energy C-trap dissociation and orbitrap mass analysis.
        J. Am. Soc. Mass Spectrom. 2012; 23: 124-140
        • Cox J.
        • Mann M.
        MaxQuant enables high peptide identification rates, individualized ppb-range mass accuracies and proteome-wide protein quantification.
        Nat. Biotechnol. 2008; 26: 1367-1372
        • Cox J.
        • Neuhauser N.
        • Michalski A.
        • Scheltema R.A.
        • Olsen J.V.
        • Mann M.
        Andromeda: a peptide search engine integrated into the MaxQuant environment.
        J. Proteome Res. 2011; 10: 1794-1805
        • Cox J.
        • Hein M.Y.
        • Luber C.A.
        • Paron I.
        • Nagaraj N.
        • Mann M.
        Accurate proteome-wide label-free quantification by delayed normalization and maximal peptide ratio extraction, termed MaxLFQ.
        Mol. Cell. Proteomics. 2014; 13: 2513-2526
        • Giurgiu M.
        • Reinhard J.
        • Brauner B.
        • Dunger-Kaltenbach I.
        • Fobo G.
        • Frishman G.
        • Montrone C.
        • Ruepp A.
        CORUM: the comprehensive resource of mammalian protein complexes.
        Nucleic Acids Res. 2019; 47: D559-D563
        • Davison A.C.
        • Hinkley D.V.
        Bootstrap methods and their application. Cambridge University Press, Cambridge, 1997
        • Benjamini Y.
        • Hochberg Y.
        Controlling the false discovery rate: a practical and powerful approach to multiple testing.
        J. Roy. Statistical Soc. 1995; 57: 289-300
        • Mosteller F.
        • Fisher R.A.
        Combining independent tests of significance.
        Am. Statistician. 1948; 2: 30
        • Kendall M.G.
        A new measure of rank correlation.
        Biometrika. 1938; 30: 81-93
        • Székely G.J.
        • Rizzo M.L.
        • Bakirov N.K.
        Measuring and testing dependence by correlation of distances.
        Ann. Statist. 2007; 35: 2769-2794
        • Pardy C.
        • Wilson S.
        A bioinformatic implementation of mutual information as a distance measure for identification of clusters of variables.
        ANZIAM J. 2011; 52: 710-726
        • Reshef D.N.
        • Reshef Y.A.
        • Finucane H.K.
        • Grossman S.R.
        • McVean G.
        • Turnbaugh P.J.
        • Lander E.S.
        • Mitzenmacher M.
        • Sabeti P.C.
        Detecting novel associations in large data sets.
        Science. 2011; 334: 1518-1524
        • Reshef D.N.
        • Reshef Y.A.
        • Sabeti P.C.
        • Mitzenmacher M.
        An empirical study of the maximal and total information coefficients and leading measures of dependence.
        Ann. Appl. Stat. 2018; 12: 123-155
        • Luedtke A.
        • Tran L.
        The generalized mean information coefficient.
        arXiv preprint arXiv:1308.5712. 2013;
        • Romano S.
        • Vinh N.X.
        • Verspoor K.
        • Bailey J.
        The randomized information coefficient: assessing dependencies in noisy data.
        Mach. Learn. 2018; 107: 509-549
        • Ashburner M.
        • Ball C.A.
        • Blake J.A.
        • Botstein D.
        • Butler H.
        • Cherry J.M.
        • Davis A.P.
        • Dolinski K.
        • Dwight S.S.
        • Eppig J.T.
        • Harris M.A.
        • Hill D.P.
        • Issel-Tarver L.
        • Kasarskis A.
        • Lewis S.
        • Matese J.C.
        • Richardson J.E.
        • Ringwald M.
        • Rubin G.M.
        • Sherlock G.
        Gene ontology: tool for the unification of biology.
        Nat. Genet. 2000; 25: 25-29
        • Carbon S.
        • Chan J.
        • Kishore R.
        • Lee R.
        • Muller H.-M.
        • Raciti D.
        • Van Auken K.
        • Sternberg P.
        Expansion of the Gene Ontology knowledgebase and resources.
        Nucleic Acids Res. 2017; 45: D331-D338
        • Ignatius Pang C.N.
        • Goel A.
        • Wilkins M.R.
        Investigating the network basis of negative genetic interactions in Saccharomyces cerevisiae with integrated biological networks and triplet motif analysis.
        J. Proteome Res. 2018; 17: 1014-1030
        • Huttlin E.L.
        • Bruckner R.J.
        • Paulo J.A.
        • Cannon J.R.
        • Ting L.
        • Baltier K.
        • Colby G.
        • Gebreab F.
        • Gygi M.P.
        • Parzen H.
        • Szpyt J.
        • Tam S.
        • Zarraga G.
        • Pontano-Vaites L.
        • Swarup S.
        • White A.E.
        • Schweppe D.K.
        • Rad R.
        • Erickson B.K.
        • Obar R.A.
        • Guruharsha K.G.
        • Li K.
        • Artavanis-Tsakonas S.
        • Gygi S.P.
        • Harper J.W.
        Architecture of the human interactome defines protein communities and disease networks.
        Nature. 2017; 545: 505-509
        • Ballouz S.
        • Weber M.
        • Pavlidis P.
        • Gillis J.
        EGAD: ultra-fast functional analysis of gene networks.
        Bioinformatics. 2016; 33: 612-614
        • Gillis J.
        • Ballouz S.
        • Pavlidis P.
        Bias tradeoffs in the creation and analysis of protein–protein interaction networks.
        J. Proteomics. 2014; 100: 44-54
        • Ballouz S.
        • Verleyen W.
        • Gillis J.
        Guidance for RNA-seq co-expression network construction and analysis: safety in numbers.
        Bioinformatics. 2015; 31: 2123-2130
        • Oughtred R.
        • Stark C.
        • Breitkreutz B.-J.
        • Rust J.
        • Boucher L.
        • Chang C.
        • Kolas N.
        • O'Donnell L.
        • Leung G.
        • McAdam R.
        • Zhang F.
        • Dolma S.
        • Willems A.
        • Coulombe-Huntington J.
        • Chatr-Aryamontri A.
        • Dolinski K.
        • Tyers M.
        The BioGRID interaction database: 2019 update.
        Nucleic Acids Res. 2019; 47: D529-D541
        • Stark C.
        • Breitkreutz B.-J.
        • Reguly T.
        • Boucher L.
        • Breitkreutz A.
        • Tyers M.
        BioGRID: a general repository for interaction datasets.
        Nucleic Acids Res. 2006; 34: D535-D539
        • Mosca R.
        • Céol A.
        • Aloy P.
        Interactome3D: adding structural details to protein networks.
        Nat. Methods. 2013; 10: 47-53
        • Skinnider M.A.
        • Stacey R.G.
        • Foster L.J.
        Genomic data integration systematically biases interactome mapping.
        PLoS Comput. Biol. 2018; 14: e1006474
        • Salas D.
        • Stacey R.G.
        • Akinlaja M.
        • Foster L.J.
        Next-generation interactomics: considerations for the use of co-elution to measure protein interaction networks.
        Mol. Cell. Proteomics. 2020; 19: 1-10
        • Bludau I.
        • Heusel M.
        • Rosenberger G.
        • Hafen R.
        • Frank M.
        • Banaei-Esfahani A.
        • Martelli C.
        • Nicod C.
        • Xue P.
        • Cai Y.
        • Liu Y.
        • Venkitaraman A.
        • Wickramasinghe V.
        • Roest H.
        • Collins B.
        • Gstaiger M.
        • Aebersold R.
        Mini Symposium: Complex-centric proteome profiling by SEC-SWATH-MS.
        Mol. Cell. Proteomics. 2019; 18: S15-S18
        • McBride Z.
        • Chen D.
        • Reick C.
        • Xie J.
        • Szymanski D.B.
        Global analysis of membrane-associated protein oligomerization using protein correlation profiling.
        Mol. Cell. Proteomics. 2017; 16: 1972-1989
        • Vizcaíno J.A.
        • Deutsch E.W.
        • Wang R.
        • Csordas A.
        • Reisinger F.
        • Ríos D.
        • Dianes J.A.
        • Sun Z.
        • Farrah T.
        • Bandeira N.
        • Binz P.-A.
        • Xenarios I.
        • Eisenacher M.
        • Mayer G.
        • Gatto L.
        • Campos A.
        • Chalkley R.J.
        • Kraus H.-J.
        • Albar J.P.
        • Martinez-Bartolomé S.
        • Apweiler R.
        • Omenn G.S.
        • Martens L.
        • Jones A.R.
        • Hermjakob H.
        ProteomeXchange provides globally coordinated proteomics data submission and dissemination.
        Nat. Biotechnol. 2014; 32: 223-226
        • Perez-Riverol Y.
        • Csordas A.
        • Bai J.
        • Bernal-Llinares M.
        • Hewapathirana S.
        • Kundu D.J.
        • Inuganti A.
        • Griss J.
        • Mayer G.
        • Eisenacher M.
        • Pérez E.
        • Uszkoreit J.
        • Pfeuffer J.
        • Sachsenberg T.
        • Yilmaz S.
        • Tiwary S.
        • Cox J.
        • Audain E.
        • Walzer M.
        • Jarnuczak A.F.
        • Ternent T.
        • Brazma A.
        • Vizcaíno J.A.
        The PRIDE database and related tools and resources in 2019: improving support for quantification data.
        Nucleic Acids Res. 2019; 47: D442-D450
        • Gordon S.M.
        • Deng J.
        • Tomann A.B.
        • Shah A.S.
        • Lu L.J.
        • Davidson W.S.
        Multi-dimensional co-separation analysis reveals protein–protein interactions defining plasma lipoprotein subspecies.
        Mol. Cell. Proteomics. 2013; 12: 3123-3134
        • Aryal U.K.
        • Xiong Y.
        • McBride Z.
        • Kihara D.
        • Xie J.
        • Hall M.C.
        • Szymanski D.B.
        A proteomic strategy for global analysis of plant protein complexes.
        Plant Cell. 2014; 26: 3867-3882
        • Skinnider M.A.
        • Scott N.E.
        • Prudova A.
        • Stoynov N.
        • Stacey R.G.
        • Gsponer J.
        • Foster L.
        An atlas of protein-protein interactions across mammalian tissues.
        Available at SSRN 3219264. 2018;
        • McBride Z.
        • Chen D.
        • Lee Y.
        • Aryal U.K.
        • Xie J.
        • Szymanski D.B.
        A label-free mass spectrometry method to predict endogenous protein complex composition.
        Mol. Cell. Proteomics. 2019; 18: 1588-1606