[go: up one dir, main page]

Academia.eduAcademia.edu
UCSF UC San Francisco Previously Published Works Title Connecting the dots across time: reconstruction of single-cell signalling trajectories using time-stamped data. Permalink https://escholarship.org/uc/item/33q0s2gb Journal Royal Society open science, 4(8) ISSN 2054-5703 Authors Mukherjee, Sayak Stewart, David Stewart, William et al. Publication Date 2017-08-23 DOI 10.1098/rsos.170811 Peer reviewed eScholarship.org Powered by the California Digital Library University of California rsos.royalsocietypublishing.org Research Cite this article: Mukherjee S, Stewart D, Stewart W, Lanier LL, Das J. 2017 Connecting the dots across time: reconstruction of single-cell signalling trajectories using time-stamped data. R. Soc. open sci. 4: 170811. http://dx.doi.org/10.1098/rsos.170811 Received: 3 July 2017 Accepted: 20 July 2017 Subject Category: Biochemistry and biophysics Connecting the dots across time: reconstruction of single-cell signalling trajectories using time-stamped data Sayak Mukherjee1,8 , David Stewart6 , William Stewart1,4 , Lewis L. Lanier7 and Jayajit Das1,2,3,5 1 Battelle Center for Mathematical Medicine, Research Institute at the Nationwide Children’s Hospital, 700 Children’s Drive, OH 43205, USA 2 Department of Pediatrics, 3 Department of Physics, 4 Department of Statistics, and 5 Department of Biophysics Program, The Ohio State University, Columbus, OH 43210, USA 6 Department of Mathematics, University of Iowa, Iowa City, IA 52242, USA 7 Department of Microbiology and Immunology, University of California, San Francisco, San Francisco, CA 94143, USA 8 Institute of Bioinformatics and Applied Biotechnology, Electronic City Phase I, Bangalore, 560100 India JD, 0000-0001-9649-4698 Subject Areas: computational biology/computational biology/biophysics Keywords: single-cell signalling kinetics, flow cytometry, mass cytometry, trajectory reconstruction, invariants, pair-matching Author for correspondence: Jayajit Das e-mail: jayajit@gmail.com Electronic supplementary material is available online at https://doi.org/10.6084/m9.figshare. c.3844777. Single-cell responses are shaped by the geometry of signalling kinetic trajectories carved in a multidimensional space spanned by signalling protein abundances. It is, however, challenging to assay a large number (more than 3) of signalling species in live-cell imaging, which makes it difficult to probe single-cell signalling kinetic trajectories in large dimensions. Flow and mass cytometry techniques can measure a large number (4 to more than 40) of signalling species but are unable to track single cells. Thus, cytometry experiments provide detailed time-stamped snapshots of single-cell signalling kinetics. Is it possible to use the time-stamped cytometry data to reconstruct singlecell signalling trajectories? Borrowing concepts of conserved and slow variables from non-equilibrium statistical physics we develop an approach to reconstruct signalling trajectories using snapshot data by creating new variables that remain invariant or vary slowly during the signalling kinetics. We apply this approach to reconstruct trajectories using snapshot data obtained from in silico simulations, live-cell imaging measurements, and, synthetic flow cytometry datasets. The application of invariants and slow variables to reconstruct trajectories provides a radically different way to track 2017 The Authors. Published by the Royal Society under the terms of the Creative Commons Attribution License http://creativecommons.org/licenses/by/4.0/, which permits unrestricted use, provided the original author and source are credited. objects using snapshot data. The approach is likely to have implications for solving matching problems in a wide range of disciplines. 0 (I) the parameter ξ (I) (=l(I) /l0 ), which can even fall in the range, ξ (I) ≪ 1, where tracking objects is straightforward. In addition, when ξ ≫ ξ (I) > 1, tracking objects in I becomes more amenable to the ................................................ Tracking signalling events in single cells is a key step towards understanding single-cell response mechanisms. Signalling events are orchestrated by a large number of intercellular molecular species that transmit information pertaining to changes in the extracellular environment to the cell nucleus [1]. Single-cell responses are often influenced by the geometry of multidimensional temporal trajectories describing time evolution of single-cell protein abundances. For example, in human cancer cell lines, fold change in the abundance of the protein phosphorylated Erk (or pErk) as opposed to the absolute value of pErk abundance regulates single-cell growth responses [2]. In general, the signalling kinetics of many molecular species could affect cell decision processes. However, our understanding regarding the link between the geometry of signalling kinetic trajectories and single-cell responses are often limited to few [1–3] molecular species. This is because spectral overlap between fluorescent dyes and photo-toxicity induced by excited fluorophores [3] make it challenging to track a large number (more than 3) of molecular species in live-cell imaging experiments. Flow cytometry [4,5] and recently developed mass cytometry experiments [4,5] can assay 4 to more than 40 proteins simultaneously in single cells at multiple times, but it is not possible to track single cells in these experiments. Is it possible to reconstruct single-cell trajectories, even approximately, from such time-stamped snapshot data? An affirmative answer to this question will be valuable for analysing signalling mechanisms or calculation of autocorrelation functions [6] for extracting relevant time scales and inference of signalling networks [7]. Tracking multiple objects across time using time-stamped data is a common problem encountered in diverse research areas ranging from tracking single molecules in live cells [8] to fluid particles in turbulent flows [9] to birds flying in a flock [10] and to tracking individuals in surveillance videos [11]. The difficulty in tracking individual objects in these problems is characterized by a dimensionless parameter ξ = l/l0 , where l is the average distance an object moves between two successive time recordings and l0 (=ρ −1/d ) is the average object-to-object distance for the objects distributed in d dimensions with a density ρ [9,12] (figure 1). When ξ ≪ 1 connecting objects across time is straightforward, whereas when ξ ≫ 1 matching objects across time becomes ambiguous. In the later scenario, two types of methods are used to generate solutions—either by optimizing cost functions that are often constructed in an empirical manner [9], or by evaluating probabilities for different matching configurations by estimating parameters in an underlying model [12]. The success of the first method depends on stumbling upon a good cost function and the second method requires intensive computation for the evaluation of the likelihood function and estimation of the model parameters. Signalling processes usually involve a large number of molecular species (tens to hundreds). Depending on the available antibodies, cytometry techniques assay 4 to more than 40 molecular species in 103 –106 cells at chosen time points where the abundances of the marker species have changed appreciably [13]. Thus, the time-stamped data collected in cytometry experiments are almost always in the ξ ≫ 1 range. In addition, involvement of a large number of molecular species in generating a signalling response demands analysis of the kinetic trajectories in high dimensions (d > 3). This can create computational road blocks regarding parameter estimation of an underlying model or particle filters using Markov chain Monte Carlo-based algorithms [12,14], which have been extensively used for tracking fluid particles (spatial dimension, d = 2) or flocking birds (spatial dimension, d = 3). The reconstruction using cytometry data is further complicated as, unlike the above examples, the same objects (i.e. single cells) are not assayed at successive time points. A fundamentally different way of approaching this problem would be to use the measured variables to construct new variables (I in figure 1) that do not change (l(I) = 0) or change more slowly (l(I) → 0) with time, while still varying appreciably between the objects at a fixed time point. Thus, when the density of the objects (ρ (I) ) in the d(I) dimensional space of the new variables I is small (i.e. large (I) l = (1/ρ (I) )1/dI ), the matching problem posed in terms of I will result in a substantial reduction in rsos.royalsocietypublishing.org R. Soc. open sci. 4: 170811 1. Introduction 2 (a) 3 ................................................ l0 rsos.royalsocietypublishing.org R. Soc. open sci. 4: 170811 t1 t2 t3 x2 Dl x = Dl/l0 >> 1 x1 (b) (c) I Dl(I) l(I) 0 l(I) 0 Dl(I) x(I) = l(I) 0 Dl(I) = 0 x > x(I) > 1 =0 t1 t2 t3 t1 t2 t3 Figure 1. Matching single cells across time using invariants and slow variables. (a) Schematic depiction of time-stamped cytometry data showing single cells against copy numbers (denoted by x 1 and x 2 ) of two signalling species at three different time points t 1 < t 2 < t 3 . When the average distance (l0 ) between the cells in the x 1 –x 2 plane is smaller than the average distance travelled (l) in a time interval (e.g., t 2 − t 1 ) by the cells in the same plane or x = l/l0 ≫ 1, matching the cells across time is non-trivial due to multiple possibilities. Whereas when ξ < 1, connecting the cells could be as straightforward as finding the nearest neighbour. The arrows show a correct trajectory. (b) We cast the matching problem in (a) in the manifold for a new variable (I) constructed from the measured variables (e.g. x 1 , x 2 ). I satisfies two conditions: (i) I does not change (invariant) or changes substantially slowly (slow variable) in individual cells compared with the original variables or l(I) ≪ l and (ii) I varies between single cells at any time point. In this situation, l(I) /l0 (I) = ξ (I) ≪ 1 and connecting the cells across time becomes straightforward. When I is an invariant, l(I) = 0 and the pairing of cells across time is exact. (c) In many cases, I will be a slow variable resulting in ξ (I) > 1. However, even in such cases we find, ξ (I) < ξ , and a cost function is employed to approximately pair single cells across time. standard techniques [12,14] due to the lower dimensionality of the manifold and the smaller value of ξ(I) . However, it is often difficult to construct such conserved or slow variables for the problem in hand. In physical systems, conservation laws (e.g. conservation of energy) [15,16], breakdown of continuous symmetry (e.g. Goldstone modes) [15,16] or the presence of a critical point [15,16] can bring conserved and slow variables into existence. However, direct application of such principles in cell signalling processes is not obvious. Here, we developed a framework for matching untagged single cells across time by constructing conserved and slow variables using abundances of molecular species that follow biochemical signalling kinetics. The calculations of the new variables do not require any parameter estimation, and thus avoid computational difficulties usually encountered in matching problems. We constructed two invariants for an ideal kinetics where the signalling kinetics is described by a closed system of first-order reactions. One of the invariant variables is based on the conservation of total number of molecular species, and the other one involves scaling the measured abundances by a particular function of the covariance matrix. These invariants turn into slow variables or remain as invariants for more general biochemical reactions that include external fluxes, involve second-order or higher order reactions, and contain stochastic fluctuations [17]. The slow variables and invariants, constructed from the measured species abundances, allowed us to connect a single cell with a ‘sister cell’ at a later time whose species abundances are reasonably close to the correct cell partner. We validate the above formalism by reconstructing trajectories using published live-cell imaging data [18]. Lastly, we apply our framework for reconstructing signalling trajectories between successive time points in a synthetic flow cytometry dataset. 2. Results 4 2.1.1. Determination of invariants in an ideal system We constructed two invariants, IT and IM , for an ideal system of biochemical reactions. The invariants do not change with time but vary between single cells at a particular time. The ideal system satisfies the following conditions: (i) the reaction propensities are linear functions of the copy numbers (or abundances) of the molecular species, (ii) the reaction rates are time independent and there is no external in- (or out-)flux of molecules and (iii) the kinetics does not include any intrinsic noise fluctuations. The mass action kinetics of the biochemical reactions are described as an autonomous system of linear firstorder ordinary differential equations (ODEs). Consider an ensemble of N number of single cells where (α) a single cell (indexed by α) contains n different molecular species (indexed by i) with abundances {xi }. Any pair of molecular species, i and j, in an individual cell (say, the αth cell) react following the above j (α) (α) conditions, and the propensity for the reaction j → i (or i → j) is given by Mi xj (or Mij xi ). The reaction j rates Mi and Mij are always positive and constant as long as, i = j. In our notation scheme, the (i,j) element j of the M matrix, Mi , is associated with a reaction, j (superscript index) → i (subscript index). Vanishing j Mi values for both and Mij would imply the absence of any reaction between the species i and j. In addition, the elements of the M matrix do not depend on the cell index implying that the signalling reactions occur with the same rates in individual cells. The species abundances in individual cells follow a mass-action linear kinetics described by a set of coupled first-order linear ODEs, (α) n  dxi j (α) = Mi xj . (2.1) dt j=1 As the elements of the M matrix do not depend on time explicitly, the above equation represents an autonomous system [19]. The source of variations in species abundances following the kinetics in equation (2.1) are the cell– (α) cell variations in the pre-stimulus condition ({xi (t = 0)}) arising due to cell–cell differences in total content of the molecular species and tonic (basal) signalling [20]. These variations are known as extrinsic noise fluctuations [17,21]. In the ideal case, we assume that cytometry experiments can measure all the signalling species abundances in equation (2.1) at any time. Below we describe the invariants, IT and IM . 2.1.1.1. Total abundance (IT )  j If the rates in equation (2.1) are further constrained to satisfy, ni=1 Mi = 0, then the total copy number of the molecular species 1 to n remains unchanged over time in a single cell #α, i.e. (α) IT = n  i=1 (α) xi (t) = n  (α) xi (0). (2.2) i=1 An example of the above case could be first-order reactions describing phosphorylation and dephosphorylation of a single protein species preserving the total number of protein molecules. The number conservation is an elementary conservation principle followed by biochemical reactions of firstorder or higher order reactions in the absence of any particle exchange with the environment. We will analyse implications of this conservation principle for non-ideal cases in later sections. 2.1.1.2. Magnitude of the scaled abundance vector (IM ) (α) (α) (α) x̃i (α) is defined as The magnitude of the scaled abundance vector x̃(α) ≡ {x̃1 , x̃2 , . . . , x̃n } in a single cell (#α) remains unchanged throughout the kinetics from its value at t = 0. (α) x̃i (t1 ) = n  j=1 [(J(t1 ))−1/2 ]ij xj (α) (t1 ). (2.3) ................................................ In this section, we describe the development of the framework and then evaluate the framework on a range of signalling models based on deterministic linear, deterministic nonlinear and stochastic reactions. rsos.royalsocietypublishing.org R. Soc. open sci. 4: 170811 2.1. Development of the framework The derivation is shown in the Material and methods. The elements of the J matrix ({Jij }) in equation (2.3) denote the covariances of the species abundances at any time t (2.4) α=1 and {µi } are the mean species abundances µi (t) = N 1  (α) xi (t). N (2.5) α=1 Both {µi } and {Jij } can be calculated from the cytometry snapshot data. The magnitude of x̃(α) defined as   n  (α) 2 m(α) (t) = |x̃(α) (t)| =  (x̃ (t)) i i=1 does not change with time, i.e. (α) IM = m(α) (t1 ) = m(α) (t2 ) = m(α) (t = 0), (2.6) is an invariant of the kinetics. One can physically interpret the transformation in equation (2.3) in the following way. The first-order chemical reactions rotate (or reflect) as well as stretch (or compress) the abundance vector x(α) (t) with time. The transformation in equation (2.3) rescales the vector to offset the stretching (or compressing) component. Subsequently, the time evolution of the scaled vector can be described as a pure rotation (or reflection) (see web supplement). As rotation or reflection is an orthogonal transformation, the magnitude of the scaled variable is preserved. 2.1.1.3. Exact matching using invariants IT and IM are functions of copy numbers of molecular species, and therefore vary considerably from cell to cell at a given time point. Thus, these invariants create unique ‘tags’ for single cells, and pairing single cells across time then is reduced to matching the same values of invariants in cell populations at two time points. A possible degeneracy can arise when an invariant takes the same value in multiple (α) single cells. For example, single cells (e.g. #α and #β) with abundances lying in the plane defined IT =  (β)  (α) (α) (β) i xi (0) = IT cannot be resolved by IT. In such cases the invariant IT will be unable to i xi (0) = (α) match these cells across time. This same difficulty holds for the other invariant IM ; however, it is less likely to encounter such degeneracy in this case. 2.1.2. Construction of slow variables for non-ideal situations Cell signalling networks often deviate from the ideal kinetics considered above. This occurs due to multiple reasons: (i) it is usually infeasible to assay all the molecular species involved in a signalling network. In that case, the measured species abundances evolve as a non-autonomous system, because the unmeasured species abundances implicitly give rise to time-dependent reaction rates or external fluxes in the kinetics. (ii) The propensities of biochemical signalling reactions are often nonlinear functions of the species abundances. (iii) The copy numbers of molecular species contain stochastic fluctuations arising from intrinsic noise in biochemical reactions. In the presence of the above effects, except few special cases, both IT and IM cease to behave as invariants of the kinetics. Our investigations (details in §2.1.3.) using simulations of a variety of in silico signalling kinetics showed that for specific subsets of species abundances, the variables IM and IT , evolve at a much slower rate compared with the measured species abundances in a time interval. Borrowing from the lexicon of non-equilibrium statistical physics [16], we designate these variables as ‘slow variables’. Next, we discuss our scheme to identify appropriate combination of measured species abundances where IT and IM behave as slow variables or invariants (also see figure 2). We used a metric known as the Jensen–Shannon divergence JSD (0 ≤ JSD ≤ ln2) [22,23] to determine slow variables and invariants in our scheme. JSD characterizes the difference between a pair of ................................................ N 1  (α) (α) (xi (t)−µi (t))(xj (t) − µj (t)), N rsos.royalsocietypublishing.org R. Soc. open sci. 4: 170811 Jij (t) = 5 (a) (c) class #3, subset #3 {x4, x6, x1} JSD(x4) t1 t2 x3 x4 x5 P(x4) x4 x6 JSD(x1) P(x1) (b) class #4 ................................................ x2 rsos.royalsocietypublishing.org R. Soc. open sci. 4: 170811 x1 6 x1 {x1, x2, x4, x6} #1 JSD(I) class #3 {x1, x2, x4} {x2, x4, x6}{x4, x6, x1} {x6, x1, x2} #1 #2 #3 #4 P(I) I class #2 {x1, x2}{x2, x4}{x4, x6}{x6, x1}{x6, x2}{x1, x4} #1 #2 #3 #4 #5 #6 JSD(I) < JSD(x1) < JSD(x4) I is a slow variable Figure 2. Determination of slow variables or invariants: (a) signalling networks are composed of a large number of species; however, only few of them can be assayed. In the schematic network of six different species, only four species (shown with yellow boxes) x 1 , x 2 , x 4 and x 6 can be measured in experiments. Cytometry experiments measure single-cell abundances of these species (also denoted as {xi } here). (b) 24 –4–1 = 11 different subsets of measured species abundances can be constructed. Each of these subsets represents projection of the original data into a manifold spanned by the member species abundances in the subset. The subsets are further divided into classes for our analyses. Each class contains subsets with the same cardinality. For example, class #3 contains all the three species subsets. (c) For each subset (e.g. class #3, subset #3), we evaluate the change in the distribution of the species abundances in the cell population using JSD (equation (2.7)). A large change in the distribution of species abundances indicating faster kinetics will result in a larger value of JSD. Whereas if I behaves as a slow variable or an invariant, the distribution of I in the cell population will go through a small change or no change, respectively. The change in I is quantified by JSD(I) . When JSD(I) is smaller than the slowest species in the subset, we denote I as a slow variable for that subset. In the example shown in (c), P(x 4 ) and P(x 1 ) give rise to the maximum and the minimum values of JSDs in the species subset, i.e. JSD(x1 ) < JSD(x6 ) < JSD(x4 ) . distributions [23]. JSD vanishes when distributions are identical and increases monotonically as the overlap between the distributions decreases. JSD is defined as JSD(y) = 21 [dKL (P1 (y)  M) + dKL (P2 (y)  M)], (2.7) where M(y) = 1/2(P1 (y) + P2 (y)). The y superscript in JSD(y) denotes the variable in the distributions. Here dKL is the Kullback–Leibler divergence [22] between two distributions. When y is a discrete variable, dKL is given by    P1 (y) . dKL (P1 (y)  M) = P1 (y) ln M(y) y When I ∈ (IT , IM ) is an invariant of the kinetics, the distribution of I in a cell population does not change with time, i.e. P(I, t1 ) = P(I, t2 ). (2.8) An exception to this can arise where I varies between individual cells yet the distribution of I remains unchanged across time. This can occur when the stochastic kinetics of the chemical reactions is in the steady state where species abundances (and I) change across time in individual cells but the distributions of these variables remain time independent. In this situation, the presence of the equality as in equation (2.8) will not imply an existence of a slow variable, I. However, such situations can be easily detected from the data by checking if distributions of original variables do not vary across time. Therefore, the JSD between P(I, t1 ) and P(I, t2 ) vanishes when I is an invariant, i.e. JSD(I) = 0. When I does not behave as an invariant, then, JSD(I) > 0 (figure 2). However, I ∈ {IT , IM } can still evolve at a slower rate than any of the measured species in the time interval t1 to t2 . We evaluated this possibility by comparing changes in the JSD values corresponding to the distributions of species abundances and I in a cell population in the time interval t1 to t2 . When IT or IM , evolves slower than any of the individual species (say x) abundances in set of species, specifically, when we denote I (IT or IM ) as the slow variable for that set of species abundances. In general, one or more of the species abundances in a set of all the measured species will not satisfy the condition in equation (2.9), therefore, IT or IM cannot be considered as a slow variable for that set. However, for a subset of the measured abundances, IT or IM , could still behave as a slow variable or an invariant. To explore this possibility, we compared the JSD values corresponding to species abundances and I ∈ (IM , IT ) in a time interval t1 −t2 for all possible subsets that can be constructed using the measured abundances (figure 2). We used a classification scheme for the subsets (figure 2b) to describe our results. When n number of species abundances are measured, we considered all possible combinations (2n −n−1) of the abundances excluding the singletons. These subsets were organized into classes where each class (indexed by k) contained subsets with the same cardinality (k) (figure 2b). The cardinality is defined as the number of species abundances in a subset, e.g. for n = 14 the class with cardinality k = 4 (or class #4) contains 14 C4 = 1001 different subsets with each subset composed of four different molecular species abundances. In addition, the subsets within class #k were indexed by the integers ({m}) 1 to n Ck . Thus, a particular subset is denoted by a class number k and a subset index m (figure 2b). 2.1.2.1. Matching using slow variables and invariants We created a cost function E, that measures the total Euclidean distance between slow variables in pairs of single cells ({α} at t1 , {β} at t2 ) across time. E is defined as E({β}) = N  α=1 2 (I(α) (t1 ) − I(β) (t2 )) . (2.10) The ‘sister cells’ constitute the set {β M } that minimizes E. The minimization amounts to a bipartite matching between the set {α} and {β}; we used an algorithm (see Material and methods) based on sorting with an O(nln(n)) computation time. A sister cell can be thought of as a partnering cell whose species abundances are not substantially different than that in the correct cell partner. 2.1.2.2. Quantification of quality of matching We calculated the error in the reconstructed trajectory when a cell α at time t1 was paired with a cell β (or ′ ′ the ‘sister cell’) at a later time point t2 instead of the correct partner cell β at t2 . β is uniquely determined by α for deterministic signalling kinetics in the non-steady state. We defined an error χ αβ for pairing cell #α at time t1 with cell #β at time t2 :   2   N (β) (β ′ )  χαβ = (xi − xi ) , (2.11) i=1 ′ where χ αβ = 0, when β = β . Note, this metric identifies the cells by the measured species abundances in individual cells, e.g. if the sister cell possesses the same values of the measured abundances as the correct cell partner, the sister cell is identified as the correct match. It is possible that the sister cell contains different values for unmeasured species abundances compared to the correct cell; however, equation (2.11) remains blind to such differences. A small χ αβ would imply a small difference between the correct partner (β ′ ) and the ‘sister cell’ (β) for the measured species in the subset. We calculated a distribution of χ αβ (P(χ )) for the cell pairs matched using our scheme and compared that with the case when cells were paired randomly across time. We also calculated the autocorrelation function, Aij (t1 , t2 ), between species i at time t1 and species j at time t2 for the pairings with the correct partner cells and the sister cells. Aij (t1 , t2 ) between the correct cell pairs ({#α paired with #β ′ }) is defined as (t1 , t2 ) = Acorrect ij N 1  (α) (β ′ ) (xi (t1 ) − µi (t1 ))(xj (t2 ) − µj (t2 )). N α=1 (2.12a) ................................................ (2.9) rsos.royalsocietypublishing.org R. Soc. open sci. 4: 170811 JSD(IT or IM ) < JSD(x) , 7 This is compared with the autocorrelation function corresponding to the pairings with the sister cells (2.12b) α=1 where the sister cell for α is indexed by β(α). The difference between the autocorrelations between the pairings with the correct (equation (2.12a)) and the sister cells (equation (2.12b)) is calculated by distance A(Acorrect , Asister ) between the matrices: correct A(A sister ,A   n  n  2 )= |Acorrect − Asister | . ij ij (2.13) i=1 j=1 2.1.3. Evaluation of slow modes and quality of matching for model signalling kinetics Here we investigated the occurrence of slow modes (IM and IT ) in subsets of molecular species involved in model biochemical signalling networks as a proof-of-concept for the framework we developed in §2.1.1 and 2.1.2. We studied deterministic and stochastic kinetics in a system of first-order reactions, and the Ras activation signalling network composed of nonlinear biochemical reactions [24]. We assumed that all the signalling species were measured species for the in silico networks. 2.1.3.1. Deterministic first-order kinetics We studied the matching problem for a signalling kinetics described by first-order reaction kinetics composed of 14 different species (electronic supplementary material, figure S1). The ODEs describing the mass action kinetics of all the 14 species is autonomous; however, when subsets of the 14 species are considered, the kinetics is no longer autonomous because the corresponding ODEs contain time-dependent external fluxes arising from the implicit kinetics of the unmeasured species. Identification of slow variables and invariants. We analysed 214 − 14 − 1 = 16 369 different subsets in a time interval where the kinetics is not in the steady state. We acknowledge that the number of possible subsets can become prohibitively large at very large dimensions. We found that for every class (k ≥ 2), the subsets produced a wide range of (0 ≤ JSD(I) < ln(2)) JSD(I) values (figure 3a) corresponding to I = IT (electronic supplementary material, figure S2) and I = IM (figure 3a). Next we analysed if IT or IM behaved as slow variables or invariants in the subsets that are associated with smaller values of JSD(I) . The comparison of the minimum values of JSD(IM ) with JSD values for the fastest and slowest species abundances (figure 3b) in the subsets associated with minimum JSD(IM ) showed that IM evolved as a slow variable in most of those subsets. IM turned into an invariant when all the 14 species were included in the set (k = 14). Similar behaviour was found for IT as well (electronic supplementary material, figure S3). The composition of the subsets associated with the minimum values of JSD(IM or IT ) depends on the variable type (IT or IM ) and the time interval, as well as on the rate constants of the reaction network. Matching using the slow variables. The cost function in equation (2.10) was minimized using the subsets associated with minimum JSD(I) to find the sister cells. The quality of matching was evaluated by calculating the error in matching χ (equation (2.11)) for each single-cell–sister-cell pair. We show the results for the pairing carried out using IM here. IM is an invariant for the subset corresponding to k = 14, and minimizing E produced an exact matching in that case (figure 3c). IM turned out to be a slow variable for the other subsets and the quality of matching using IM was reasonable (figure 3c). The mean χ -values (χ ) for these pairings were 0.6–0.8 times lower than that for random pairing (figure 3c). Next, we calculated the autocorrelation function for the matched pairs. Similar to χ, errors in the autocorrelation function ranged from very small to moderate values (0.2–0.8 times less) compared with that when the single cells were paired randomly (figure 3d). The subsets with smaller number of species show lower error values (figure 3d), this could arise due to the sensitivity of the autocorrelation function to small errors in matching in higher dimensions. The quality of matching using IT was qualitatively similar to that of IM (electronic supplementary material, figure S4). Interestingly, pairing using IT produced better agreement with the correct trajectories even when IT evolved faster than IM (electronic supplementary material, figure S5). This behaviour emphasized the role of the cost function in pair-matching (also see the Discussion). ................................................ N 1  (α) (β(α)) (xi (t1 )− µi (t1 ))(xj (t2 ) − µj (t2 )), N rsos.royalsocietypublishing.org R. Soc. open sci. 4: 170811 Asister (t1 , t2 ) = ij 8 (a) (b) class index 2500 1500 subset index 8 0.4 6 0.3 4 0.2 2 0.1 max JSD {x} 0.3 0.2 0.1 2 4 JSD(IM) 6 8 10 class index 12 14 (d) (c) 0.8 0.8 DA/DArandom c–/c– random min JSD {x} 0.4 0 500 JSD IM 0.5 0.6 0.4 0.2 4 6 8 10 class index 12 14 0.6 0.4 0.2 IM 4 6 8 10 class index 12 14 Figure 3. Pairing single cells for the ideal kinetics. (a) Heat map for JSD(IM ) . JSD was calculated for 3000 single cells (see Material and methods for details) at time t 1 = 0 and t 2 = 7 min. Other parameters related to the signalling kinetics are shown in the web supplement. (b) The minimum values of JSD(IM ) for each class (grey points). JSD values associated with the fastest (shown in orange) and the slowest species (shown in blue) in the subsets corresponding to the minimum JSD(IM ) are compared with minimum JSD(IM ) . (c) The quality of matching when IM was used for matching the cells. The ratio in the average error χ̄ in the matching using IM with that for random matching is shown for each of the subsets. χ̄ and χ̄ random are the averages of χ̄αβ over the single cells pairs matched using our method and random pairing, respectively. χ̄ < 1 indicates smaller error compared with the random pairing. (d) Error in the autocorrelation function (A) for the subsets corresponding to the minimum values of JSD(IM ) . Arandom denotes the error in the autocorrelation function for random matching. 2.1.3.2. Nonlinear deterministic kinetics We used an experimentally validated signalling network for Ras activation in T lymphocytes (electronic supplementary material, figure S6) [24]. The network describes enzymatic activation of Ras by two enzymes SOS and Rasgrp1, where SOS-mediated Ras activation contains a positive feedback, i.e. an activated form of Ras or RasGTP induces a higher catalytic rate involving SOS. The deterministic kinetics displays bistability for a range of SOS and Rasgrp1 concentrations. We analysed the non-steady-state kinetics in the model for the parameter values that generate monostable or bistable steady states. The in silico model contained 14 different molecular species. Identification of slow variables and invariants. Similar to the first-order kinetics in the previous section, the JSD(I) values corresponding to I = IT showed large variations across subsets (electronic supplementary material, figure S7). However, a noted difference from the previous example was that IT behaved as an invariant for multiple subsets. This is because, even with the nonlinear rates, the total number of certain species is an invariant of the kinetics. For example, the total amount of Rasgrp1 contained in the complexes (free Rasgrp1, Rasgrp1-DAG, Rasgrp1-DAG-RasGDP) remained fixed throughout the kinetics, and the subset {Rasgrp1, Rasgrp1-DAG, Rasgrp1-DAG-RasGDP} produced the minimum JSD(IT ) (=0) in the class k = 3 (figure 4a). IM , in contrast, evolved only as a slow variable in specific subsets (electronic supplementary material, figure S8). Matching using the slow variables. For the subsets where IT behaved as an invariant, the matching produced exact alignment. For other subsets corresponding to the minimum values of JSD(IT ) , the pairing was substantially better than that of random matching (figure 4b). IM -guided pairing showed small errors in abundances in the sister cells (electronic supplementary material, figure S9); however, overall pairing using IT produced smaller errors. ................................................ 3432 0.6 0.5 rsos.royalsocietypublishing.org R. Soc. open sci. 4: 170811 10 9 0.7 0.6 JSD 12 (b) (a) c–/c– random 0.15 min JSD {x} max JSD {x} 0.10 IT 0.6 0.4 0.2 0.05 4 6 8 10 class index 12 14 4 6 8 10 class index 12 14 (c) DA/DArandom 1.0 0.8 0.6 IT 0.4 0.2 4 6 8 10 class index 12 14 Figure 4. Pairing single cells for non-ideal kinetics. (a) The minimum values of JSD(IT ) for each class (black points) for the deterministic Ras activation kinetics model. We used 3000 single cells across time points t 1 = 100 s and t 2 = 400 s, where the Ras activation displays bistability. Min(JSD(IT ) ) values were compared with the JSD values associated with the fastest (shown in orange) and the slowest species (shown in blue) in the subsets corresponding to min(JSD(IT ) ). IT behaves as an invariant or a slow variable for the subsets. (b) The quality of matching given by χ̄ /χ̄ random when IT was used for matching the cells for the subsets associated with min(JSD(IT ) ) in (a). (c) Error in the autocorrelation function (A) when the cells were matched using IT for the stochastic Ras activation kinetics. The subsets used in the matching yielded minimum values of JSD(IT ) for each class. Note that for exact matching (A = 0) IT behaved as an invariant. 2.1.3.3. Stochastic signalling kinetics We simulated stochastic biochemical reactions in the Ras activation signalling kinetics by including intrinsic noise fluctuations. The simulations contained the same variations in the initial abundances used in the investigations for the deterministic kinetics. IT evolved as an invariant in subsets that were associated with conservation of the total number of molecular species (electronic supplementary material, figure S10). IT also behaved as a slow variable in specific subsets (electronic supplementary material, figure S10a). By contrast, IM behaved as a slow variable for select subsets (electronic supplementary material, figure S10b). Pairing the single cells using IT for the subsets associated with minimum values JSD(IT ) showed reasonably smaller errors in the autocorrelation function compared with that for random pairings for most of the subsets (figure 4c). The matching using IM showed smaller errors (or χ ) compared with random pairing in select subsets (electronic supplementary material, figure S11). 2.2. Application of the framework for matching in live-cell imaging data We reconstructed single-cell gene expression kinetic trajectories by applying our framework using livecell imaging data [18]. Data were collected in single yeast cells where the transcription factor Msn2 translocated to the nucleus upon inhibition of protein kinase A by a small molecule, 1-NM-PP1 and activated target fluorescent reporter genes CFP and YFP residing on homologous chromosomes in the diploid yeast cells (figure 5a). For our reconstruction, we chose the kinetics of the dual reporter of the gene DCS2 (one of the seven genes activated in the study) induced by a single 40 min pulse of 1-NMPP1 at a concentration of 690 nM. The activation kinetics of the reporters depended nonlinearly on the Msn2 abundance and also contained intrinsic and extrinsic noise fluctuations [18]. We carried out our reconstruction method by treating the live-cell imaging data as snapshot data, because it allowed us ................................................ JSD IT rsos.royalsocietypublishing.org R. Soc. open sci. 4: 170811 0.20 JSD 10 0.8 0.25 (a) reconstruction using IT (b) 11 actual p-Msn2-m nucleus Nucl-Msn2-m YFP 3000 2000 1000 0 0 3000 2000 20 tim CFP e (d) (c) 60 60 0 DA/DArandom 0.8 IT P(c) 50 40 1.0 0.015 0.010 1000 YF P CFP cytoplasm random 0.005 IT 0.6 0.4 0.2 100 200 c 300 400 0 10 20 30 40 time point Figure 5. Trajectory reconstruction using live-cell imaging data. (a) Schematic diagram for gene activation. (b) Reconstruction of a typical kinetics trajectory for CFP and YFP tagged dual reporter for the gene DCS2 in yeast diploid cells. The reconstructed trajectory using IT (shown in orange) is compared with the true trajectory (shown in blue). (c) Distribution of the quality of alignment χ using IT is compared to the quality of alignment using random pairing for 159 single-cell trajectories. The reconstructions were carried out between two successive time measurements for 63 time intervals (e.g. 0–2.5 min, 2.5–5 min and so on). (d) The ratio of the errors (equation (2.13)) in the autocorrelation function for trajectories reconstructed using IT with that for random pairing for the same time intervals in (c). A/Arandom < 1 for most of the time points indicating better matching using IT compared to the random pairing. The data not shown on the graph for the time points between 0 to 10 min produced A/Arandom > 1. to compare the reconstructed trajectories with the measured single-cell trajectories. We analysed the simultaneous CFP and YFP gene expression kinetics data in 159 single-cell trajectories [18]. At each time point, we removed the single-cell tag from the CFP and YFP expressions data and treated the 159 data points as snapshot data to perform the reconstruction. We reconstructed two-dimensional (CFP and YFP) single-cell trajectories using either IT (figure 5b) or IM (electronic supplementary material, figure S12), and both showed a similar level of agreement between the measured and the reconstructed trajectories. We further quantified the quality of alignment using P(χ ) (figure 5c) and autocorrelation functions (figure 5d and electronic supplementary material, S12) for each reconstruction. Both indicators revealed lower errors in the reconstruction using I compared with random pairing. 2.3. Application of the framework for matching single cells using synthetic flow cytometry data We applied our framework on synthetic flow cytometry data for matching single cells assayed at two successive time points. We used in silico or synthetic data instead of measured data from flow cytometry experiments for two main reasons: (i) The synthetic data allowed us to assess the quality of the reconstruction because the correct trajectories are known in the simulations. (ii) We were able to use species abundances in the synthetic data to calculate the slow variables. Single-cell protein abundances are not directly accessible from the fluorescent intensities measured in flow cytometry. However, for many proteins it is possible to quantify single-cell species abundances by calibrating the intensities against a standard curve [25]. The synthetic flow cytometry data were generated at multiple time points by stochastic simulation (details in Material and methods) of the Ras activation network. The species abundances at the unstimulated state (or t = 0) in individual cells were chosen from a multivariate normal ................................................ p Msn2-m rsos.royalsocietypublishing.org R. Soc. open sci. 4: 170811 1-NM-PP1 PKAmut (a) (b) 12 0.55 –JSDmin – Rand c Rel cRel 0.6 0.4 0.2 0.50 0.45 0.40 0.35 0.30 2 3 4 class 5 6 2 3 4 class 5 6 JSDmax JSDmin Figure 6. Reconstruction using synthetic flow cytometry data. (a) The variation of the ratio of the average relative error χ̄Rel /χ̄Rel JSDmax JSDmin with the class k. The data were measured at t = 0 and t = 100 s. χ̄Rel and χ̄Rel for a class k were calculated for the species (IT ) subsets that generated the smallest and the largest values of JSD , respectively. The smaller than 1 values for the ratio indicate that the errors in the subset corresponding to JSDmin were smaller on average than that for the subset with the largest JSD in the same class. The subsets the corresponded to JSDmin are the following: {RasGDP, RasGAP} for k = 2, {RasGDP, RasGTP, RasGAP} for k = 3, {RasGDP, RasGTP, RasGAP, DAG} for k = 4, {RasGDP, RasGTP, RasGAP, DAG, RasGRP1} for k = 5 and {RasGDP, RasGTP, RasGAP, DAG, RasGRP1, SOS} αβ for k = 6. The relative error χRel between a cell #α at t 1 and its matching partner cell# β (correct partner cell# β ′ ) at t 2 is defined as k (β) αβ (β ′ ) 2 (β ′ ) χ = i=1 (xi − xi ) /|xi |. We used this definition instead of equation (2.11) to compare between the subsets containing Rel αβ abundances of very different values. The average value of χ̄Rel was calculated by taking the average of χRel values for all the single-cell JSDmin Rand Rand /χ̄Rel for the pairing for subsets associated with minimum JSD in (a). χ̄Rel is the average relative error when pairs. (b) The ratio χ̄Rel the matching was generated using random pairing of the cells. The ratio is less than 1 for all the cases indicating that the reconstruction is better than random reconstructions. distribution (Material and methods) to represent cell–cell differences in total protein abundances and tonic signalling. Six molecular species (RasGTP, RasGDP, SOS, RasGRP1, DAG and RasGAP) out of the 14 signalling species involved in the Ras signalling network (electronic supplementary material, figure S6) were recorded in the synthetic data. We used different sets of cell populations to record single-cell species abundances at any two time points in the synthetic data as the same cell populations are not assayed more than once in flow and mass cytometry experiments. We used IT instead of IM for all our reconstructions here because our previous investigations showed IT generated less error in reconstruction compared with IM for the Ras activation network. In addition to carrying out reconstruction of trajectories between a pair of successive time points, we addressed the following relevant issues. (i) Is there a specific subset of measured abundances that produces a more accurate reconstruction compared with the other subsets? (ii) How does the error in the reconstruction depend on the size of the time interval between two successive measurements? (iii) How does the reconstruction using IT compare against other available methods for matching? In our investigation, we divided the six measured species into five classes (figure 2 and §2.1.2 for more details) composed of 2, 3, 4, 5 and 6 species. In each class, we determined the subsets that produced the largest and the smallest JSD(IT ) values. The subset corresponding to the smallest JSD(IT ) gives rise to the slowest IT in a class. The subsets corresponding to the lowest JSD(IT ) values in a time interval also produced the lowest errors in the reconstruction in each class (figure 6a). All the reconstructions corresponding to the lowest JSD(IT ) values faired better than the random pairings (figure 6b). These results demonstrate that it is possible to identify a group of species in the cytometry dataset that will generate better reconstruction in a time interval. Moreover, when the interest is in reconstructing trajectories for a particular subset of molecular species, the JSD(IT ) for the subset can be used to assess the quality of the reconstruction relative to the other combinations of the measured species. This knowledge can help to refine the measurements, e.g. use smaller time intervals (see below) when the JSD(IT ) value for the subset of interest is not the minimum in the class. Thus, the JSD(IT ) value provides a valuable metric to quantitatively assess the quality of a trajectory reconstruction in experimental flow cytometry datasets where the correct trajectory is unknown. Using our in silico data, we found that the errors become larger as the time interval is increased (electronic supplementary material, figure S13a). This is expected as the slow variables start changing appreciably with the increasing magnitude of the time interval resulting in large values of the parameter ξI (≫1). However, the precise answer to the question, ................................................ 0.8 rsos.royalsocietypublishing.org R. Soc. open sci. 4: 170811 –JSDmin –JSDmax cRel cRel 1.0 We have developed a framework for matching single cells across time using time-stamped data from flow and mass cytometry experiments. Our approach, based on posing the problem in terms of new variables that remain unchanged or vary slowly with time, is radically different from the existing methods [9,12] employed for solving matching problems. Specifically, unlike other pair-matching algorithms [12], our approach does not require any assumption regarding the underlying reaction kinetics and estimation of model parameters. The use of slow variables and invariants reduces the value of parameter ξ making the matching straightforward (when ξ < 1) or more amenable to existing computational methods. The application of the framework for in silico signalling networks, live-cell imaging data and synthetic flow cytometry data showed excellent to reasonable agreement of the reconstructed trajectories with the correct kinetics. We constructed two new variables from the measured species abundances that served as slow variables or invariants in a wide variety of signalling kinetics involving cell-to-cell variations of species abundances, nonlinear reaction propensities and intrinsic noise fluctuations. One of the new variables (IT ) is the total abundance of a molecular species in a single cell. Early time signalling events usually involve chemical modifications of signalling proteins (e.g. phosphorylation) that do not lead to any change in the total content of the proteins in a single cell. However, the total abundance of a protein at any time can vary from cell to cell over a wide range usually described by a lognormal distribution [26,27]. Thus, the total protein content provides a unique tag that can be used to pair a single cell at any time with a ‘sister’ cell at a later time. In flow and mass cytometry measurements, it is possible to assay the total content of certain signalling proteins (e.g. Erk), which can be used to reconstruct single-cell signalling trajectories. The other new variable (IM ) was constructed by scaling the measured single-cell abundances by the inverse of the square root of the covariance matrix for species abundances. The reconstruction using IM worked better when the signalling kinetics in a time interval was effectively described by a closed system of first-order reactions. As IM behaves as a slow variable for a kinetics described by first-order reactions with small stochastic fluctuations and weak external fluxes, it can be used for addressing matching problems in other contexts such as tagged particles in fluid flows [9,12]. We recently developed a method to estimate reaction rates in a system of first-order reactions designed to describe mass cytometry snapshot data in a time interval [7]. The method also determines how well the system of first-order reactions can describe the snapshot data. In the case of a good agreement, we expect IM to behave as a slow variable and generate good to reasonable reconstructions. However, the precise relationship between IM being a slow variable and the underlying reaction network will require further investigation. Our investigations of signalling kinetics in non-ideal cases showed that in several cases (electronic supplementary material, figure S5) the reconstructed trajectories using IT produced better agreement to the correct kinetics as opposed to IM . In these cases, the use of IT or IM reduced the value of the parameter ξ ; however, ξ (I) still remained greater than 1, i.e. ξ (I) > 1. Thus, the cost function E (equation (2.10)) played an important role in matching the cells in these situations. The differences in quality of matching between the two slow variables point to the fact that minimizing E was better suited for IT compared with IM in these examples. The application of our framework to synthetic flow cytometry data showed that JSD(IT ) values, evaluated for different subsets of measured species abundances, can be used to determine groups of species where the reconstruction will produce less errors. The quality of the reconstruction improved as the size of the time intervals between successive measurements was decreased, raising an important question about the existence of an optimal time interval. Similar questions in the context of selecting time points for measuring gene expressions have been dealt with methods based on machine learning where specific cost functions were optimized using finely spaced time measurements in a smaller gene subset [28]. Such approaches might turn out to be helpful for generating trajectory reconstructions ................................................ 3. Discussion 13 rsos.royalsocietypublishing.org R. Soc. open sci. 4: 170811 if there exists an optimal time interval for cytometry measurements for good trajectory reconstructions needs further investigation (see Discussion). Lastly, we compared our method against a well-known scheme [9] that minimizes the total Euclidean distance between the measured variables at time t1 and t2 (electronic supplementary material, figure S13b,c). Our investigations produced mixed results. For a twospecies sub-module our method faired better (electronic supplementary material, figure S13b), whereas for a three-species sub-module the quality of reconstruction between the two methods was comparable (electronic supplementary material, figure S13c). In a few cases, we found that the Euclidean method produced slightly better reconstructions. 4.1. Derivation of IM for the ideal kinetics Equation (2.1) can be solved analytically to calculate single-cell abundances at any time t: (α) xi (t) = n  (α) [eMt ]ij xj (0). (4.1) j=1 When the abundances follow the above equation, the average quantities in equations (2.4) and (2.5) at the two time points (t1 and t2 , t2 > t1 ) are related by µi (t2 ) = n  [eM(t2 −t1 ) ]ij µj (t1 ) (4.2a) j=1 and T J(t2 ) = eM(t2 −t1 ) J(t1 ) eM (t2 −t1 ) . (4.2b) ................................................ 4. Material and methods 14 rsos.royalsocietypublishing.org R. Soc. open sci. 4: 170811 using cytometry data where live-cell imaging measurement of a smaller subset of proteins can be used to select time points in cytometry experiments measuring a large number of proteins simultaneously. Another question related to the above issue is if all the cells need to the coherently stimulated in cytometry experiments at time t = 0 for good reconstructions. When the time difference in triggering for different batches of cell populations is long enough to produce large values in the parameter ξ (I) , the reconstructions are likely to contain large errors. We also noticed deterioration (electronic supplementary material figure S13a) in the reconstruction quality as the kinetics approached the bistable behaviour where the Ras activation changed by a large amount in a very short time interval. The framework is likely to be error prone when such abrupt changes occur in the kinetics. In recent years, a host of methods have been developed to construct single-cell development trajectories using snapshot data (e.g. WANDERLUST [29], SCUBA [30]) where, unlike the cases dealt with here, the single cells are not ordered temporally. These methods assign a ‘pseudo time’ to the data and then optimize ad hoc cost functions to create single-cell trajectories. It is unclear whether those cost functions will render any benefit to the matching problems considered here. For example, one of the cost functions that minimized the cosine distance between single cells [29] will not be able to correctly reconstruct signalling trajectories in a simple example where the kinetics are described by first-order reactions. The main difficulty in applying the framework developed here is to identify appropriate invariants or slow variables in a general situation. Singer et al. [31] used nonlinear independent component analysis for constructing slow variables by analysing stochastic kinetics of dynamical systems in a short time window. This approach determined slow variables that were functions of linear combinations of the observables. When the underlying signalling reactions are known, this approach can help find slow modes in the system by simulating the in silico network in short time durations, and these slow variables can then be used to reconstruct single-cell trajectories for cytometry data using our framework. However, the applicability of the approach when the slow modes are complicated nonlinear functions of the measured variables or when only a subset of involved dynamical variables is measured is unclear. In statistical physics [15,16], conservation laws (e.g. conservation of energy, momentum) or symmetry principles help identify such slow modes. Projection operator methods by Zwanzig & Mori [32] provide a formal way to construct variables with slower time scales for a known microscopic dynamics. However, this method requires knowledge of model parameter values (e.g. reaction rates), and the calculations for constructing slow modes could become intractable for a complex system composed of nonlinear interactions such as cell signalling kinetics. A computation intensive step in our framework is to determine specific combinations of species abundances that are associated with low JSD(I) values. Masscytometry experiments can measure over 40 different proteins and the number of possible subsets in such large dimensions can be prohibitively large. When the signalling reactions are well characterized, selecting a group of species that are connected by mass balance in chemical modifications (e.g. enzymatic conversions or binding–unbinding reactions) could provide a way to identify a core species set with a slow mode. Adding new groups of species using smart sampling techniques [33] to expand this core set would be an exciting future endeavour. (4.3) where Q is any orthogonal matrix, i.e. QQT = I, I is the identity matrix. This equation contains the clue that if the abundances are scaled appropriately, the time evolution given by equation (2.1) can be described by a rotation or a reflection. We found that if the species abundances at any time are scaled by the inverse of the square root of the covariance matrix (equation (2.4)), then the scaled abundances are related across time points by orthogonal transformations. Using equation (4.4), we can write equation (4.1) as (α) xi (t2 ) = n  (α) [[J(t2 )]1/2 Q[J(t1 )]−1/2 ]ij xj (t1 ), (4.4) j=1 which implies that the scaled variables in equation (2.3) at t1 and t2 are related by (α) x̃i (t2 ) =  (α) Qij x̃j (t1 ). (4.5) j As rotation and reflection preserves the magnitude of a vector, the magnitude (I(α) M ) of the vector x̃(α) ≡ (α) (α) (α) {x̃1 , x̃2 , . . . , x̃n } remains unchanged throughout the kinetics from its value at t = 0. 4.2. Simulation of the in silico networks The mass action deterministic kinetics and the stochastic kinetics for the reactions for the system of firstorder reactions and the Ras activation network were simulated using the software package BIONETGEN [34]. The initial species abundances were drawn from a multivariate normal distribution by specifying the average abundances and the covariances. The initial conditions and the parameter values for the reaction networks are provided in the electronic supplementary material, tables S1 and S2, and figures S1 and S6. 4.3. Generation of the synthetic flow cytometry data The kinetics of Ras activation network (electronic supplementary material, figure S6) was simulated using BIONETGEN [34]. Single-cell abundances of six different species (RasGTP, RasGDP, SOS, RasGRP1, DAG and RasGAP) were measured at different times (t = 0, 100, 200, 300 and 500 s). Different batches of 2000 single cells were used for measurements at two successive time points. 4.4. Minimization of the cost function E The minimization of E involves finding a bipartite graph where a pair of vertices in two subsets (single cells {α} at t1 and single cells {β} at t2 ) connected with a cost (I(α) − I(β) )2 minimizes the total cost E. The √ graph matching algorithms computes the optimization in time O(|E| V) ≈ O(n2 ) [35]. However, in our case we can use sorting to compute this in O(nln(n)) time. This is achieved by changing the cost function for a pairing #α, #β to (I(α) − I(β) )2 + ǫ(t2 −t1 )2 , where, ǫ → 0. Note, this is the Euclidean distance between the cells in the I−t plane, thus minimizing the cost function E amounts to joining these points (or single cells) in the I−t plane by non-intersecting lines. As the cells in {α} (or {β}) have the same t coordinate t = t1 (or t2 ), the configuration with the non-intersecting lines keeps the same ascending (or descending) order in I for the cells in {α} and {β}. Therefore, first, we sorted the {α} and the {β} cells in arrays where cells were arranged in ascending order of I, and then in the sorted arrays, we connected cells with the same array index. A pseudo-code is provided in the electronic supplementary material. The Mathematica codes are available at http://planetx.nationwidechildrens.org/~jayajit/pair-matching. ................................................ eM(t2 −t1 ) = [J(t2 )]1/2 Q[J(t1 )]−1/2 , 15 rsos.royalsocietypublishing.org R. Soc. open sci. 4: 170811 Note that the elements of the M matrix cannot be uniquely determined from the above relations because there are n2 unknown elements in the M matrix and equation (4.2a,b) provide n + n(n + 1)/2 < n2 relations between the unknown variables. Therefore, equation (4.1) cannot be used to evolve the abundances in a single cell at time point (at t1 ) to a later time point (t2 ), and then identify the correct cell from the measurements at t2 . Now, equation (4.2b) can be recast as 4.5. Calculation of JSD values 16 (IT )  Imax T ITmin max  P(IT ) P(IT ) ln M(IT )  dIT ≈ IT IT  P((IT )p ) ln p=ITmin P((IT )p ) M((IT )p ) , where IT = k and M(IT ) = [P(IT (t1 )) + P(IT (t2 ))]/2. 4.5.2. Calculation of JSD(IM ) Done in a similar way as IT . The bin width (IM ) was chosen as follows. We calculated I(α) M following,   (α) −1 (α) (α) −1 (α) (α) −1 (α) i,j (x i J ij x j j ) i.j (x i J ij x j + x i J ij x j ) (α) = . I M = 2|x̃(α) | |x̃(α) |  (α) −1 (α) (α) Jij xj , and x(α) = 1 for all species and all the cells the above Using IM = |x̃(α) | = i,j xi expression simplifies to (α) IM = −1 (α) i,j Jij xj  |x̃(α) | . (α) α for all the cells at time t and t and set I = min(I ). A pseudo-code is We evaluated IM 1 2 M M provided in the electronic supplementary material. The Mathematica codes are available at http:// planetx.nationwidechildrens.org/~jayajit/pair-matching. Data accessibility. The computer codes used for the studies shown in the article are available at the website: http:// planetx.nationwidechildrens.org/~jayajit/pair-matching. The codes are written in Mathematica and in BIONETGEN (bionetgen.org). The live-cell imaging data reported in fig. 4 in [18] are available on the web at the link: http://msb. embopress.org/content/9/1/704.short. Authors’ contributions. S.M. and J.D. planned and designed research, and performed analytical and numerical calculations. S.M., W.S., L.L.L. and J.D. analysed data. S.M., W.S. D.S. and J.D. contributed computational tools. S.M., L.L.L. and J.D. wrote the paper. All authors made significant contributions to the manuscript. All authors gave their final approval for publication. Competing interests. We have no competing interests. Funding. This work was supported by the grant R56AI108880–01 to J.D. from NIAID. S.M. was supported in part by a grant from the Department of Biotechnology (BTPR12422/MED/31/287/2014, valid November 2014–2017). We also acknowledge the computation time provided by the Ohio Supercomputer Center. Acknowledgements. J.D. and S.M. thank Helle Jensen, Suzanne Gaudet, Alper Yilmaz, Victoria Best and Anton Zilman for helpful discussions. S.M. and J.D. acknowledge the help from Anders Hansen in Erin O’Shea’s lab for accessing the live imaging data. J.D. also acknowledges the support from the Quantitative Immunology Workshop at KITP where a part of the work was carried out. References 1. Janeway C. 2005 Immunobiology : the immune system in health and disease, pp. xxiii, 823 p, 6th edn. New York, NY: Garland Science. 2. Cohen-Saidon C, Cohen AA, Sigal A, Liron Y, Alon U. 2009 Dynamics and variability of ERK2 response to EGF in individual living cells. Mol. Cell 36, 885–893. (doi:10.1016/ j.molcel.2009.11.025) 3. Hoebe RA, Van Oven CH, Gadella TWJ, Dhonukshe PB, Van Noorden CJF, Manders EMM. 2007 Controlled light-exposure microscopy reduces photobleaching and phototoxicity in fluorescence live-cell imaging. Nat. Biotechnol. 25, 249–253. (doi:10.1038/nbt1278) 4. Bendall SC, Nolan GP, Roederer M, Chattopadhyay PK. 2012 A deep profiler’s guide to cytometry. Trends Immunol. 33, 323–332. (doi:10.1016/j.it.2012. 02.010) 5. Chattopadhyay PK, Gierahn TM, Roederer M, Love JC. 2014 Single-cell technologies for monitoring immune systems. Nat. Immunol. 15, 128–135. (doi:10.1038/ni.2796) 6. Barnett L, Barrett AB, Seth AK. 2009 Granger causality and transfer entropy are equivalent for Gaussian variables. Phys. Rev. Lett. 103, 238701. (doi:10.1103/PhysRevLett.103. 238701) 7. Mukherjee S, Jensen H, Stewart W, Stewart D, Ray WC, Chen S-Y, Nolan GP, Lanier LL, Das J. 2017 In silico modeling identifies CD45 as a regulator of IL-2 synergy in the NKG2D-mediated activation of immature human NK cells. Science Signaling 10, aai9062. ................................................ For a subset of m molecular species (m ≤ n), we calculated the sets {IT (α) (t1 )}and {IT (β) (t2 )} for N = 3000 cells. We then constructed the probability distributions P(IT , t1 ) and P(IT , t2 ) by binning the above sets and normalizing the histograms. The bin width (IT ) was chosen to be the cardinality k of the subset. We then calculated the Kullback–Leibler divergence (dKL ) using the distributions. Here dKL is given by an integral as IT is a continuous variable. We approximated the integral by a sum over the histograms calculated above, i.e. rsos.royalsocietypublishing.org R. Soc. open sci. 4: 170811 4.5.1. Calculation of JSD 19. 21. 22. 23. 24. 25. 26. 27. 28. Kleyman M, Sefer E, Nicola T, Espinoza C, Chhabra D, Hagood JS, Kaminski N, Ambalavanan N, Bar-Joseph Z. 2017 Selecting the most appropriate time points to profile in high-throughput studies. Elife 6 , e18541. (doi:10.7554/eLife.18541) 29. Bendall SC, Davis KL, Amir el-AD, Tadmor MD, Simonds EF, Chen TJ, Shenfeld DK, Nolan GP, Péer D. 2014 Single-cell trajectory detection uncovers progression and regulatory coordination in human B cell development. Cell 157, 714–725. (doi:10.1016/ j.cell.2014.04.005) 30. Marco E, Karpb RL, Guoc G, Robsond P, Harte AH, Trippaa L, Yuana G-C. 2014 Bifurcation analysis of single-cell gene expression data reveals epigenetic landscape. Proc. Natl Acad. Sci. USA 111, E5643–E5650. (doi:10.1073/pnas. 1408993111) 31. Singer A, Erban R, Kevrekidis IG, Coifman RR. 2009 Detecting intrinsic slow variables in stochastic dynamical systems by anisotropic diffusion maps. Proc. Natl Acad. Sci. USA 106, 16 090–16 095. (doi:10.1073/pnas.090554 7106) 32. Zwanzig R. 2001 Nonequilibrium statistical mechanics. Oxford, NY: Oxford University Press. 33. Hafner M, Koeppl H, Hasler M, Wagner A. 2009 ‘Glocal’ robustness analysis and model discrimination for circadian oscillators. PLoS Comput. Biol. 5, e1000534. (doi:10.1371/journal. pcbi.1000534) 34. Colvin J, Monine MI, Faeder JR, Hlavacek WS, Von Hoff DD, Posner RG. 2009 Simulation of large-scale rule-based models. Bioinformatics 25, 910–917. (doi:10.1093/bioinformatics/btp066) 35. Micali S, Vazirani VV. 1980 An O(\sqrt|V|.E) algorithm for finding maximum matching in general graphs. In Proc. 21st IEEE Symp. on Foundations of Computer Science Syracuse, NY, 13–15 October, pp. 17–27. IEEE. (doi:10.1109/SFCS.1980.12) 17 ................................................ 20. between noise and control of gene expression. Mol. Syst. Biol. 9, 704. (doi:10.1038/msb.2013.56) Strogatz SH. 2000 Nonlinear dynamics and chaos: with applications to physics, biology, chemistry, and engineering. Cambridge, MA: Westview Press. Zikherman J, Jenne C, Watson S, Doan K, Raschke W, Goodnow CC, Weiss A. 2010 CD45-Csk phosphatase-kinase titration uncouples basal and inducible T cell receptor signaling during thymic development. Immunity 32, 342–354. (doi:10.1016/ j.immuni.2010.03.006) Feinerman O, Veiga J, Dorfman JR, Germain RN, Altan-Bonnet G. 2008 Variability and robustness in T cell activation from regulated heterogeneity in protein levels. Science 321, 1081–1084. (doi:10.1126/ science.1158013) Cover TM, Thomas JA. 2006 Elements of information theory, 2nd edn. Hoboken, NJ: Wiley-Interscience. Lin JH. 1991 Divergence measures based on the Shannon entropy. IEEE Trans. Inform. Theory 37, 145–151. (doi:10.1109/18.61115) Das J, Ho M, Zikherman J, Govern C, Yang M, Weiss A, Chakraborty AK, Roose JP. 2009 Digital signaling and hysteresis characterize ras activation in lymphoid cells. Cell 136, 337–351. (doi:10.1016/ j.cell.2008.11.051) Coffman VC, Wu JQ. 2012 Counting protein molecules using quantitative fluorescence microscopy. Trends Biochem. Sci. 37, 499–506. (doi:10.1016/j.tibs.2012.08.002) Cotari JW, Voisinne G, Dar OE, Karabacak V, Altan-Bonnet G. 2013 Cell-to-cell variability analysis dissects the plasticity of signaling of common gamma chain cytokines in T cells. Sci. Signal. 6, ra17. (doi:10.1126/scisignal.2003240) Salman H, Brenner N, Tung C-k, Elyahu N, Stolovicki E, Moore L, Libchaber A, Braun E. 2012 Universal protein fluctuations in populations of microorganisms. Phys. Rev. Lett. 108, 238105. (doi:10.1103/PhysRevLett.108.238105) rsos.royalsocietypublishing.org R. Soc. open sci. 4: 170811 8. Saxton MJ. 2008 Single-particle tracking: connecting the dots. Nat. Methods 5, 671–672. (doi:10.1038/nmeth0808-671) 9. Ouellette NT, Xu HT, Bodenschatz E. 2006 A quantitative study of three-dimensional Lagrangian particle tracking algorithms. Exp. Fluids 40, 301–313. (doi:10.1007/s00348-005-0068-7) 10. Ballerini M et al. 2008 Interaction ruling animal collective behavior depends on topological rather than metric distance: evidence from a field study. Proc. Natl Acad. Sci. USA 105, 1232–1237. (doi:10.1073/pnas.0711437105) 11. Lipton AJ, Fujiyoshi H, Patil RS. 1998 Moving target classification and tracking from real-time video. In Proc. Fourth IEEE Workshop on Applications of Computer Vision (Wacv’98), Princeton, NJ, 19–21 October, pp. 8–14. IEEE. (doi:10.1109/ACV.1998.732851) 12. Chertkov M, Kroc L, Krzakala F, Vergassola M, Zdeborova L. 2010 Inference in particle tracking experiments by passing messages between images. Proc. Natl Acad. Sci. USA 107, 7663–7668. (doi:10.1073/pnas.0910994107) 13. Bendall SC, Nolan GP. 2012 From single cells to deep phenotypes in cancer. Nat. Biotechnol. 30, 639–647. (doi:10.1038/nbt.2283) 14. Gustafsson F, Gunnarsson F, Bergman N, Forssell U, Jansson J, Karlsson R, Nordlund P. 2002 Particle filters for positioning, navigation, and tracking. IEEE Trans. Signal Process. 50, 425–437. (doi:10.1109/78. 978396) 15. Chaikin PM, Lubensky TC. 2000 Principles of condensed matter physics. Cambridge, UK: Cambridge University Press. 16. Mazenko G. 2006 Nonequilibrium statistical mechanics. Weinheim, Germany: Wiley. 17. Bialek WS. 2012 Biophysics : searching for principles. Princeton, NJ: Princeton University Press. 18. Hansen AS, O’Shea EK. 2013 Promoter decoding of transcription factor dynamics involves a trade-off