UCSF
UC San Francisco Previously Published Works
Title
Connecting the dots across time: reconstruction of single-cell signalling
trajectories using time-stamped data.
Permalink
https://escholarship.org/uc/item/33q0s2gb
Journal
Royal Society open science, 4(8)
ISSN
2054-5703
Authors
Mukherjee, Sayak
Stewart, David
Stewart, William
et al.
Publication Date
2017-08-23
DOI
10.1098/rsos.170811
Peer reviewed
eScholarship.org
Powered by the California Digital Library
University of California
rsos.royalsocietypublishing.org
Research
Cite this article: Mukherjee S, Stewart D,
Stewart W, Lanier LL, Das J. 2017 Connecting
the dots across time: reconstruction of
single-cell signalling trajectories using
time-stamped data. R. Soc. open sci. 4: 170811.
http://dx.doi.org/10.1098/rsos.170811
Received: 3 July 2017
Accepted: 20 July 2017
Subject Category:
Biochemistry and biophysics
Connecting the dots across
time: reconstruction of
single-cell signalling
trajectories using
time-stamped data
Sayak Mukherjee1,8 , David Stewart6 , William
Stewart1,4 , Lewis L. Lanier7 and Jayajit Das1,2,3,5
1 Battelle Center for Mathematical Medicine, Research Institute at the Nationwide
Children’s Hospital, 700 Children’s Drive, OH 43205, USA
2 Department of Pediatrics, 3 Department of Physics, 4 Department of Statistics, and
5 Department of Biophysics Program, The Ohio State University, Columbus, OH 43210,
USA
6 Department of Mathematics, University of Iowa, Iowa City, IA 52242, USA
7 Department of Microbiology and Immunology, University of California, San Francisco,
San Francisco, CA 94143, USA
8 Institute of Bioinformatics and Applied Biotechnology, Electronic City Phase I,
Bangalore, 560100 India
JD, 0000-0001-9649-4698
Subject Areas:
computational biology/computational
biology/biophysics
Keywords:
single-cell signalling kinetics, flow cytometry,
mass cytometry, trajectory reconstruction,
invariants, pair-matching
Author for correspondence:
Jayajit Das
e-mail: jayajit@gmail.com
Electronic supplementary material is available
online at https://doi.org/10.6084/m9.figshare.
c.3844777.
Single-cell responses are shaped by the geometry of signalling
kinetic trajectories carved in a multidimensional space
spanned by signalling protein abundances. It is, however,
challenging to assay a large number (more than 3) of
signalling species in live-cell imaging, which makes it
difficult to probe single-cell signalling kinetic trajectories
in large dimensions. Flow and mass cytometry techniques
can measure a large number (4 to more than 40) of
signalling species but are unable to track single cells.
Thus, cytometry experiments provide detailed time-stamped
snapshots of single-cell signalling kinetics. Is it possible to
use the time-stamped cytometry data to reconstruct singlecell signalling trajectories? Borrowing concepts of conserved
and slow variables from non-equilibrium statistical physics
we develop an approach to reconstruct signalling trajectories
using snapshot data by creating new variables that remain
invariant or vary slowly during the signalling kinetics. We
apply this approach to reconstruct trajectories using snapshot
data obtained from in silico simulations, live-cell imaging
measurements, and, synthetic flow cytometry datasets. The
application of invariants and slow variables to reconstruct
trajectories provides a radically different way to track
2017 The Authors. Published by the Royal Society under the terms of the Creative Commons
Attribution License http://creativecommons.org/licenses/by/4.0/, which permits unrestricted
use, provided the original author and source are credited.
objects using snapshot data. The approach is likely to have implications for solving matching
problems in a wide range of disciplines.
0
(I)
the parameter ξ (I) (=l(I) /l0 ), which can even fall in the range, ξ (I) ≪ 1, where tracking objects is
straightforward. In addition, when ξ ≫ ξ (I) > 1, tracking objects in I becomes more amenable to the
................................................
Tracking signalling events in single cells is a key step towards understanding single-cell response
mechanisms. Signalling events are orchestrated by a large number of intercellular molecular species
that transmit information pertaining to changes in the extracellular environment to the cell nucleus [1].
Single-cell responses are often influenced by the geometry of multidimensional temporal trajectories
describing time evolution of single-cell protein abundances. For example, in human cancer cell lines,
fold change in the abundance of the protein phosphorylated Erk (or pErk) as opposed to the absolute
value of pErk abundance regulates single-cell growth responses [2]. In general, the signalling kinetics
of many molecular species could affect cell decision processes. However, our understanding regarding
the link between the geometry of signalling kinetic trajectories and single-cell responses are often
limited to few [1–3] molecular species. This is because spectral overlap between fluorescent dyes
and photo-toxicity induced by excited fluorophores [3] make it challenging to track a large number
(more than 3) of molecular species in live-cell imaging experiments. Flow cytometry [4,5] and recently
developed mass cytometry experiments [4,5] can assay 4 to more than 40 proteins simultaneously in
single cells at multiple times, but it is not possible to track single cells in these experiments. Is it
possible to reconstruct single-cell trajectories, even approximately, from such time-stamped snapshot
data? An affirmative answer to this question will be valuable for analysing signalling mechanisms or
calculation of autocorrelation functions [6] for extracting relevant time scales and inference of signalling
networks [7].
Tracking multiple objects across time using time-stamped data is a common problem encountered
in diverse research areas ranging from tracking single molecules in live cells [8] to fluid particles in
turbulent flows [9] to birds flying in a flock [10] and to tracking individuals in surveillance videos
[11]. The difficulty in tracking individual objects in these problems is characterized by a dimensionless
parameter ξ = l/l0 , where l is the average distance an object moves between two successive
time recordings and l0 (=ρ −1/d ) is the average object-to-object distance for the objects distributed
in d dimensions with a density ρ [9,12] (figure 1). When ξ ≪ 1 connecting objects across time is
straightforward, whereas when ξ ≫ 1 matching objects across time becomes ambiguous. In the later
scenario, two types of methods are used to generate solutions—either by optimizing cost functions that
are often constructed in an empirical manner [9], or by evaluating probabilities for different matching
configurations by estimating parameters in an underlying model [12]. The success of the first method
depends on stumbling upon a good cost function and the second method requires intensive computation
for the evaluation of the likelihood function and estimation of the model parameters.
Signalling processes usually involve a large number of molecular species (tens to hundreds).
Depending on the available antibodies, cytometry techniques assay 4 to more than 40 molecular species
in 103 –106 cells at chosen time points where the abundances of the marker species have changed
appreciably [13]. Thus, the time-stamped data collected in cytometry experiments are almost always
in the ξ ≫ 1 range. In addition, involvement of a large number of molecular species in generating a
signalling response demands analysis of the kinetic trajectories in high dimensions (d > 3). This can
create computational road blocks regarding parameter estimation of an underlying model or particle
filters using Markov chain Monte Carlo-based algorithms [12,14], which have been extensively used
for tracking fluid particles (spatial dimension, d = 2) or flocking birds (spatial dimension, d = 3). The
reconstruction using cytometry data is further complicated as, unlike the above examples, the same
objects (i.e. single cells) are not assayed at successive time points.
A fundamentally different way of approaching this problem would be to use the measured variables
to construct new variables (I in figure 1) that do not change (l(I) = 0) or change more slowly (l(I) → 0)
with time, while still varying appreciably between the objects at a fixed time point. Thus, when the
density of the objects (ρ (I) ) in the d(I) dimensional space of the new variables I is small (i.e. large
(I)
l = (1/ρ (I) )1/dI ), the matching problem posed in terms of I will result in a substantial reduction in
rsos.royalsocietypublishing.org R. Soc. open sci. 4: 170811
1. Introduction
2
(a)
3
................................................
l0
rsos.royalsocietypublishing.org R. Soc. open sci. 4: 170811
t1
t2
t3
x2
Dl
x = Dl/l0 >> 1
x1
(b)
(c)
I
Dl(I)
l(I)
0
l(I)
0
Dl(I)
x(I) =
l(I)
0
Dl(I) = 0
x > x(I) > 1
=0
t1
t2
t3
t1
t2
t3
Figure 1. Matching single cells across time using invariants and slow variables. (a) Schematic depiction of time-stamped cytometry data
showing single cells against copy numbers (denoted by x 1 and x 2 ) of two signalling species at three different time points t 1 < t 2 < t 3 .
When the average distance (l0 ) between the cells in the x 1 –x 2 plane is smaller than the average distance travelled (l) in a time interval
(e.g., t 2 − t 1 ) by the cells in the same plane or x = l/l0 ≫ 1, matching the cells across time is non-trivial due to multiple possibilities.
Whereas when ξ < 1, connecting the cells could be as straightforward as finding the nearest neighbour. The arrows show a correct
trajectory. (b) We cast the matching problem in (a) in the manifold for a new variable (I) constructed from the measured variables (e.g. x 1 ,
x 2 ). I satisfies two conditions: (i) I does not change (invariant) or changes substantially slowly (slow variable) in individual cells compared
with the original variables or l(I) ≪ l and (ii) I varies between single cells at any time point. In this situation, l(I) /l0 (I) = ξ (I) ≪ 1
and connecting the cells across time becomes straightforward. When I is an invariant, l(I) = 0 and the pairing of cells across time is
exact. (c) In many cases, I will be a slow variable resulting in ξ (I) > 1. However, even in such cases we find, ξ (I) < ξ , and a cost function
is employed to approximately pair single cells across time.
standard techniques [12,14] due to the lower dimensionality of the manifold and the smaller value of
ξ(I) . However, it is often difficult to construct such conserved or slow variables for the problem in hand.
In physical systems, conservation laws (e.g. conservation of energy) [15,16], breakdown of continuous
symmetry (e.g. Goldstone modes) [15,16] or the presence of a critical point [15,16] can bring conserved
and slow variables into existence. However, direct application of such principles in cell signalling
processes is not obvious.
Here, we developed a framework for matching untagged single cells across time by constructing
conserved and slow variables using abundances of molecular species that follow biochemical signalling
kinetics. The calculations of the new variables do not require any parameter estimation, and thus avoid
computational difficulties usually encountered in matching problems. We constructed two invariants for
an ideal kinetics where the signalling kinetics is described by a closed system of first-order reactions.
One of the invariant variables is based on the conservation of total number of molecular species, and the
other one involves scaling the measured abundances by a particular function of the covariance matrix.
These invariants turn into slow variables or remain as invariants for more general biochemical reactions
that include external fluxes, involve second-order or higher order reactions, and contain stochastic
fluctuations [17]. The slow variables and invariants, constructed from the measured species abundances,
allowed us to connect a single cell with a ‘sister cell’ at a later time whose species abundances are
reasonably close to the correct cell partner. We validate the above formalism by reconstructing trajectories
using published live-cell imaging data [18]. Lastly, we apply our framework for reconstructing signalling
trajectories between successive time points in a synthetic flow cytometry dataset.
2. Results
4
2.1.1. Determination of invariants in an ideal system
We constructed two invariants, IT and IM , for an ideal system of biochemical reactions. The invariants
do not change with time but vary between single cells at a particular time. The ideal system satisfies
the following conditions: (i) the reaction propensities are linear functions of the copy numbers (or
abundances) of the molecular species, (ii) the reaction rates are time independent and there is no external
in- (or out-)flux of molecules and (iii) the kinetics does not include any intrinsic noise fluctuations. The
mass action kinetics of the biochemical reactions are described as an autonomous system of linear firstorder ordinary differential equations (ODEs). Consider an ensemble of N number of single cells where
(α)
a single cell (indexed by α) contains n different molecular species (indexed by i) with abundances {xi }.
Any pair of molecular species, i and j, in an individual cell (say, the αth cell) react following the above
j (α)
(α)
conditions, and the propensity for the reaction j → i (or i → j) is given by Mi xj (or Mij xi ). The reaction
j
rates Mi and Mij are always positive and constant as long as, i = j. In our notation scheme, the (i,j) element
j
of the M matrix, Mi , is associated with a reaction, j (superscript index) → i (subscript index). Vanishing
j
Mi
values for both
and Mij would imply the absence of any reaction between the species i and j. In
addition, the elements of the M matrix do not depend on the cell index implying that the signalling
reactions occur with the same rates in individual cells.
The species abundances in individual cells follow a mass-action linear kinetics described by a set
of coupled first-order linear ODEs,
(α)
n
dxi
j (α)
=
Mi xj .
(2.1)
dt
j=1
As the elements of the M matrix do not depend on time explicitly, the above equation represents
an autonomous system [19].
The source of variations in species abundances following the kinetics in equation (2.1) are the cell–
(α)
cell variations in the pre-stimulus condition ({xi (t = 0)}) arising due to cell–cell differences in total
content of the molecular species and tonic (basal) signalling [20]. These variations are known as extrinsic
noise fluctuations [17,21]. In the ideal case, we assume that cytometry experiments can measure all the
signalling species abundances in equation (2.1) at any time. Below we describe the invariants, IT and IM .
2.1.1.1. Total abundance (IT )
j
If the rates in equation (2.1) are further constrained to satisfy, ni=1 Mi = 0, then the total copy number
of the molecular species 1 to n remains unchanged over time in a single cell #α, i.e.
(α)
IT =
n
i=1
(α)
xi (t) =
n
(α)
xi (0).
(2.2)
i=1
An example of the above case could be first-order reactions describing phosphorylation and
dephosphorylation of a single protein species preserving the total number of protein molecules. The
number conservation is an elementary conservation principle followed by biochemical reactions of firstorder or higher order reactions in the absence of any particle exchange with the environment. We will
analyse implications of this conservation principle for non-ideal cases in later sections.
2.1.1.2. Magnitude of the scaled abundance vector (IM )
(α)
(α)
(α)
x̃i
(α)
is defined as
The magnitude of the scaled abundance vector x̃(α) ≡ {x̃1 , x̃2 , . . . , x̃n } in a single cell (#α) remains
unchanged throughout the kinetics from its value at t = 0.
(α)
x̃i (t1 ) =
n
j=1
[(J(t1 ))−1/2 ]ij xj (α) (t1 ).
(2.3)
................................................
In this section, we describe the development of the framework and then evaluate the framework on a
range of signalling models based on deterministic linear, deterministic nonlinear and stochastic reactions.
rsos.royalsocietypublishing.org R. Soc. open sci. 4: 170811
2.1. Development of the framework
The derivation is shown in the Material and methods. The elements of the J matrix ({Jij }) in equation (2.3)
denote the covariances of the species abundances at any time t
(2.4)
α=1
and {µi } are the mean species abundances
µi (t) =
N
1 (α)
xi (t).
N
(2.5)
α=1
Both {µi } and {Jij } can be calculated from the cytometry snapshot data.
The magnitude of x̃(α) defined as
n
(α) 2
m(α) (t) = |x̃(α) (t)| =
(x̃ (t))
i
i=1
does not change with time, i.e.
(α)
IM = m(α) (t1 ) = m(α) (t2 ) = m(α) (t = 0),
(2.6)
is an invariant of the kinetics. One can physically interpret the transformation in equation (2.3) in the
following way. The first-order chemical reactions rotate (or reflect) as well as stretch (or compress) the
abundance vector x(α) (t) with time. The transformation in equation (2.3) rescales the vector to offset
the stretching (or compressing) component. Subsequently, the time evolution of the scaled vector can
be described as a pure rotation (or reflection) (see web supplement). As rotation or reflection is an
orthogonal transformation, the magnitude of the scaled variable is preserved.
2.1.1.3. Exact matching using invariants
IT and IM are functions of copy numbers of molecular species, and therefore vary considerably from
cell to cell at a given time point. Thus, these invariants create unique ‘tags’ for single cells, and pairing
single cells across time then is reduced to matching the same values of invariants in cell populations at
two time points. A possible degeneracy can arise when an invariant takes the same value in multiple
(α)
single cells. For example, single cells (e.g. #α and #β) with abundances lying in the plane defined IT =
(β)
(α)
(α)
(β)
i xi (0) = IT cannot be resolved by IT. In such cases the invariant IT will be unable to
i xi (0) =
(α)
match these cells across time. This same difficulty holds for the other invariant IM ; however, it is less
likely to encounter such degeneracy in this case.
2.1.2. Construction of slow variables for non-ideal situations
Cell signalling networks often deviate from the ideal kinetics considered above. This occurs due to
multiple reasons: (i) it is usually infeasible to assay all the molecular species involved in a signalling
network. In that case, the measured species abundances evolve as a non-autonomous system, because the
unmeasured species abundances implicitly give rise to time-dependent reaction rates or external fluxes in
the kinetics. (ii) The propensities of biochemical signalling reactions are often nonlinear functions of the
species abundances. (iii) The copy numbers of molecular species contain stochastic fluctuations arising
from intrinsic noise in biochemical reactions. In the presence of the above effects, except few special
cases, both IT and IM cease to behave as invariants of the kinetics. Our investigations (details in §2.1.3.)
using simulations of a variety of in silico signalling kinetics showed that for specific subsets of species
abundances, the variables IM and IT , evolve at a much slower rate compared with the measured species
abundances in a time interval. Borrowing from the lexicon of non-equilibrium statistical physics [16],
we designate these variables as ‘slow variables’. Next, we discuss our scheme to identify appropriate
combination of measured species abundances where IT and IM behave as slow variables or invariants
(also see figure 2).
We used a metric known as the Jensen–Shannon divergence JSD (0 ≤ JSD ≤ ln2) [22,23] to determine
slow variables and invariants in our scheme. JSD characterizes the difference between a pair of
................................................
N
1 (α)
(α)
(xi (t)−µi (t))(xj (t) − µj (t)),
N
rsos.royalsocietypublishing.org R. Soc. open sci. 4: 170811
Jij (t) =
5
(a)
(c) class #3, subset #3
{x4, x6, x1}
JSD(x4)
t1 t2
x3
x4
x5
P(x4)
x4
x6
JSD(x1)
P(x1)
(b)
class #4
................................................
x2
rsos.royalsocietypublishing.org R. Soc. open sci. 4: 170811
x1
6
x1
{x1, x2, x4, x6}
#1
JSD(I)
class #3
{x1, x2, x4} {x2, x4, x6}{x4, x6, x1} {x6, x1, x2}
#1
#2
#3
#4
P(I)
I
class #2 {x1, x2}{x2, x4}{x4, x6}{x6, x1}{x6, x2}{x1, x4}
#1
#2
#3
#4
#5
#6
JSD(I) < JSD(x1) < JSD(x4)
I is a slow variable
Figure 2. Determination of slow variables or invariants: (a) signalling networks are composed of a large number of species; however,
only few of them can be assayed. In the schematic network of six different species, only four species (shown with yellow boxes) x 1 , x 2 , x 4
and x 6 can be measured in experiments. Cytometry experiments measure single-cell abundances of these species (also denoted as {xi }
here). (b) 24 –4–1 = 11 different subsets of measured species abundances can be constructed. Each of these subsets represents projection
of the original data into a manifold spanned by the member species abundances in the subset. The subsets are further divided into classes
for our analyses. Each class contains subsets with the same cardinality. For example, class #3 contains all the three species subsets. (c) For
each subset (e.g. class #3, subset #3), we evaluate the change in the distribution of the species abundances in the cell population using
JSD (equation (2.7)). A large change in the distribution of species abundances indicating faster kinetics will result in a larger value of JSD.
Whereas if I behaves as a slow variable or an invariant, the distribution of I in the cell population will go through a small change or no
change, respectively. The change in I is quantified by JSD(I) . When JSD(I) is smaller than the slowest species in the subset, we denote I as
a slow variable for that subset. In the example shown in (c), P(x 4 ) and P(x 1 ) give rise to the maximum and the minimum values of JSDs
in the species subset, i.e. JSD(x1 ) < JSD(x6 ) < JSD(x4 ) .
distributions [23]. JSD vanishes when distributions are identical and increases monotonically as the
overlap between the distributions decreases. JSD is defined as
JSD(y) = 21 [dKL (P1 (y) M) + dKL (P2 (y) M)],
(2.7)
where M(y) = 1/2(P1 (y) + P2 (y)). The y superscript in JSD(y) denotes the variable in the distributions.
Here dKL is the Kullback–Leibler divergence [22] between two distributions. When y is a discrete variable,
dKL is given by
P1 (y)
.
dKL (P1 (y) M) =
P1 (y) ln
M(y)
y
When I ∈ (IT , IM ) is an invariant of the kinetics, the distribution of I in a cell population does not change
with time, i.e.
P(I, t1 ) = P(I, t2 ).
(2.8)
An exception to this can arise where I varies between individual cells yet the distribution of I remains
unchanged across time. This can occur when the stochastic kinetics of the chemical reactions is in
the steady state where species abundances (and I) change across time in individual cells but the
distributions of these variables remain time independent. In this situation, the presence of the equality
as in equation (2.8) will not imply an existence of a slow variable, I. However, such situations can be
easily detected from the data by checking if distributions of original variables do not vary across time.
Therefore, the JSD between P(I, t1 ) and P(I, t2 ) vanishes when I is an invariant, i.e. JSD(I) = 0. When I
does not behave as an invariant, then, JSD(I) > 0 (figure 2). However, I ∈ {IT , IM } can still evolve at a
slower rate than any of the measured species in the time interval t1 to t2 . We evaluated this possibility
by comparing changes in the JSD values corresponding to the distributions of species abundances and I
in a cell population in the time interval t1 to t2 . When IT or IM , evolves slower than any of the individual
species (say x) abundances in set of species, specifically, when
we denote I (IT or IM ) as the slow variable for that set of species abundances. In general, one or more of
the species abundances in a set of all the measured species will not satisfy the condition in equation (2.9),
therefore, IT or IM cannot be considered as a slow variable for that set. However, for a subset of the
measured abundances, IT or IM , could still behave as a slow variable or an invariant. To explore this
possibility, we compared the JSD values corresponding to species abundances and I ∈ (IM , IT ) in a time
interval t1 −t2 for all possible subsets that can be constructed using the measured abundances (figure 2).
We used a classification scheme for the subsets (figure 2b) to describe our results. When n number of
species abundances are measured, we considered all possible combinations (2n −n−1) of the abundances
excluding the singletons.
These subsets were organized into classes where each class (indexed by k) contained subsets with
the same cardinality (k) (figure 2b). The cardinality is defined as the number of species abundances in a
subset, e.g. for n = 14 the class with cardinality k = 4 (or class #4) contains 14 C4 = 1001 different subsets
with each subset composed of four different molecular species abundances. In addition, the subsets
within class #k were indexed by the integers ({m}) 1 to n Ck . Thus, a particular subset is denoted by
a class number k and a subset index m (figure 2b).
2.1.2.1. Matching using slow variables and invariants
We created a cost function E, that measures the total Euclidean distance between slow variables in pairs
of single cells ({α} at t1 , {β} at t2 ) across time. E is defined as
E({β}) =
N
α=1
2
(I(α) (t1 ) − I(β) (t2 )) .
(2.10)
The ‘sister cells’ constitute the set {β M } that minimizes E. The minimization amounts to a bipartite
matching between the set {α} and {β}; we used an algorithm (see Material and methods) based on sorting
with an O(nln(n)) computation time. A sister cell can be thought of as a partnering cell whose species
abundances are not substantially different than that in the correct cell partner.
2.1.2.2. Quantification of quality of matching
We calculated the error in the reconstructed trajectory when a cell α at time t1 was paired with a cell β (or
′
′
the ‘sister cell’) at a later time point t2 instead of the correct partner cell β at t2 . β is uniquely determined
by α for deterministic signalling kinetics in the non-steady state. We defined an error χ αβ for pairing cell
#α at time t1 with cell #β at time t2 :
2
N (β)
(β ′ )
χαβ =
(xi − xi ) ,
(2.11)
i=1
′
where χ αβ = 0, when β = β . Note, this metric identifies the cells by the measured species abundances
in individual cells, e.g. if the sister cell possesses the same values of the measured abundances as the
correct cell partner, the sister cell is identified as the correct match. It is possible that the sister cell
contains different values for unmeasured species abundances compared to the correct cell; however,
equation (2.11) remains blind to such differences. A small χ αβ would imply a small difference between
the correct partner (β ′ ) and the ‘sister cell’ (β) for the measured species in the subset. We calculated a
distribution of χ αβ (P(χ )) for the cell pairs matched using our scheme and compared that with the case
when cells were paired randomly across time.
We also calculated the autocorrelation function, Aij (t1 , t2 ), between species i at time t1 and species j
at time t2 for the pairings with the correct partner cells and the sister cells. Aij (t1 , t2 ) between the correct
cell pairs ({#α paired with #β ′ }) is defined as
(t1 , t2 ) =
Acorrect
ij
N
1 (α)
(β ′ )
(xi (t1 ) − µi (t1 ))(xj (t2 ) − µj (t2 )).
N
α=1
(2.12a)
................................................
(2.9)
rsos.royalsocietypublishing.org R. Soc. open sci. 4: 170811
JSD(IT or IM ) < JSD(x) ,
7
This is compared with the autocorrelation function corresponding to the pairings with the sister cells
(2.12b)
α=1
where the sister cell for α is indexed by β(α). The difference between the autocorrelations between the
pairings with the correct (equation (2.12a)) and the sister cells (equation (2.12b)) is calculated by distance
A(Acorrect , Asister ) between the matrices:
correct
A(A
sister
,A
n
n
2
)=
|Acorrect − Asister | .
ij
ij
(2.13)
i=1 j=1
2.1.3. Evaluation of slow modes and quality of matching for model signalling kinetics
Here we investigated the occurrence of slow modes (IM and IT ) in subsets of molecular species involved
in model biochemical signalling networks as a proof-of-concept for the framework we developed in
§2.1.1 and 2.1.2. We studied deterministic and stochastic kinetics in a system of first-order reactions, and
the Ras activation signalling network composed of nonlinear biochemical reactions [24]. We assumed
that all the signalling species were measured species for the in silico networks.
2.1.3.1. Deterministic first-order kinetics
We studied the matching problem for a signalling kinetics described by first-order reaction kinetics
composed of 14 different species (electronic supplementary material, figure S1). The ODEs describing
the mass action kinetics of all the 14 species is autonomous; however, when subsets of the 14
species are considered, the kinetics is no longer autonomous because the corresponding ODEs contain
time-dependent external fluxes arising from the implicit kinetics of the unmeasured species.
Identification of slow variables and invariants. We analysed 214 − 14 − 1 = 16 369 different subsets in a
time interval where the kinetics is not in the steady state. We acknowledge that the number of possible
subsets can become prohibitively large at very large dimensions. We found that for every class (k ≥ 2),
the subsets produced a wide range of (0 ≤ JSD(I) < ln(2)) JSD(I) values (figure 3a) corresponding to I = IT
(electronic supplementary material, figure S2) and I = IM (figure 3a). Next we analysed if IT or IM
behaved as slow variables or invariants in the subsets that are associated with smaller values of JSD(I) .
The comparison of the minimum values of JSD(IM ) with JSD values for the fastest and slowest species
abundances (figure 3b) in the subsets associated with minimum JSD(IM ) showed that IM evolved as a slow
variable in most of those subsets. IM turned into an invariant when all the 14 species were included in
the set (k = 14). Similar behaviour was found for IT as well (electronic supplementary material, figure
S3). The composition of the subsets associated with the minimum values of JSD(IM or IT ) depends on
the variable type (IT or IM ) and the time interval, as well as on the rate constants of the reaction
network.
Matching using the slow variables. The cost function in equation (2.10) was minimized using the subsets
associated with minimum JSD(I) to find the sister cells. The quality of matching was evaluated by
calculating the error in matching χ (equation (2.11)) for each single-cell–sister-cell pair. We show the
results for the pairing carried out using IM here. IM is an invariant for the subset corresponding to
k = 14, and minimizing E produced an exact matching in that case (figure 3c). IM turned out to be a slow
variable for the other subsets and the quality of matching using IM was reasonable (figure 3c). The mean
χ -values (χ ) for these pairings were 0.6–0.8 times lower than that for random pairing (figure 3c). Next, we
calculated the autocorrelation function for the matched pairs. Similar to χ, errors in the autocorrelation
function ranged from very small to moderate values (0.2–0.8 times less) compared with that when the
single cells were paired randomly (figure 3d). The subsets with smaller number of species show lower
error values (figure 3d), this could arise due to the sensitivity of the autocorrelation function to small
errors in matching in higher dimensions. The quality of matching using IT was qualitatively similar to
that of IM (electronic supplementary material, figure S4). Interestingly, pairing using IT produced better
agreement with the correct trajectories even when IT evolved faster than IM (electronic supplementary
material, figure S5). This behaviour emphasized the role of the cost function in pair-matching (also see
the Discussion).
................................................
N
1 (α)
(β(α))
(xi (t1 )− µi (t1 ))(xj
(t2 ) − µj (t2 )),
N
rsos.royalsocietypublishing.org R. Soc. open sci. 4: 170811
Asister
(t1 , t2 ) =
ij
8
(a)
(b)
class index
2500
1500
subset index
8
0.4
6
0.3
4
0.2
2
0.1
max JSD {x}
0.3
0.2
0.1
2
4
JSD(IM)
6
8
10
class index
12
14
(d)
(c)
0.8
0.8
DA/DArandom
c–/c– random
min JSD {x}
0.4
0
500
JSD IM
0.5
0.6
0.4
0.2
4
6
8
10
class index
12
14
0.6
0.4
0.2
IM
4
6
8
10
class index
12
14
Figure 3. Pairing single cells for the ideal kinetics. (a) Heat map for JSD(IM ) . JSD was calculated for 3000 single cells (see Material and
methods for details) at time t 1 = 0 and t 2 = 7 min. Other parameters related to the signalling kinetics are shown in the web supplement.
(b) The minimum values of JSD(IM ) for each class (grey points). JSD values associated with the fastest (shown in orange) and the slowest
species (shown in blue) in the subsets corresponding to the minimum JSD(IM ) are compared with minimum JSD(IM ) . (c) The quality of
matching when IM was used for matching the cells. The ratio in the average error χ̄ in the matching using IM with that for random
matching is shown for each of the subsets. χ̄ and χ̄ random are the averages of χ̄αβ over the single cells pairs matched using our method
and random pairing, respectively. χ̄ < 1 indicates smaller error compared with the random pairing. (d) Error in the autocorrelation
function (A) for the subsets corresponding to the minimum values of JSD(IM ) . Arandom denotes the error in the autocorrelation function
for random matching.
2.1.3.2. Nonlinear deterministic kinetics
We used an experimentally validated signalling network for Ras activation in T lymphocytes (electronic
supplementary material, figure S6) [24]. The network describes enzymatic activation of Ras by two
enzymes SOS and Rasgrp1, where SOS-mediated Ras activation contains a positive feedback, i.e. an
activated form of Ras or RasGTP induces a higher catalytic rate involving SOS. The deterministic kinetics
displays bistability for a range of SOS and Rasgrp1 concentrations. We analysed the non-steady-state
kinetics in the model for the parameter values that generate monostable or bistable steady states. The in
silico model contained 14 different molecular species.
Identification of slow variables and invariants. Similar to the first-order kinetics in the previous section, the
JSD(I) values corresponding to I = IT showed large variations across subsets (electronic supplementary
material, figure S7). However, a noted difference from the previous example was that IT behaved as
an invariant for multiple subsets. This is because, even with the nonlinear rates, the total number of
certain species is an invariant of the kinetics. For example, the total amount of Rasgrp1 contained
in the complexes (free Rasgrp1, Rasgrp1-DAG, Rasgrp1-DAG-RasGDP) remained fixed throughout
the kinetics, and the subset {Rasgrp1, Rasgrp1-DAG, Rasgrp1-DAG-RasGDP} produced the minimum
JSD(IT ) (=0) in the class k = 3 (figure 4a). IM , in contrast, evolved only as a slow variable in specific subsets
(electronic supplementary material, figure S8).
Matching using the slow variables. For the subsets where IT behaved as an invariant, the matching
produced exact alignment. For other subsets corresponding to the minimum values of JSD(IT ) , the pairing
was substantially better than that of random matching (figure 4b). IM -guided pairing showed small errors
in abundances in the sister cells (electronic supplementary material, figure S9); however, overall pairing
using IT produced smaller errors.
................................................
3432
0.6
0.5
rsos.royalsocietypublishing.org R. Soc. open sci. 4: 170811
10
9
0.7
0.6
JSD
12
(b)
(a)
c–/c– random
0.15
min JSD {x}
max JSD {x}
0.10
IT
0.6
0.4
0.2
0.05
4
6
8
10
class index
12
14
4
6
8
10
class index
12
14
(c)
DA/DArandom
1.0
0.8
0.6
IT
0.4
0.2
4
6
8
10
class index
12
14
Figure 4. Pairing single cells for non-ideal kinetics. (a) The minimum values of JSD(IT ) for each class (black points) for the deterministic
Ras activation kinetics model. We used 3000 single cells across time points t 1 = 100 s and t 2 = 400 s, where the Ras activation displays
bistability. Min(JSD(IT ) ) values were compared with the JSD values associated with the fastest (shown in orange) and the slowest species
(shown in blue) in the subsets corresponding to min(JSD(IT ) ). IT behaves as an invariant or a slow variable for the subsets. (b) The quality
of matching given by χ̄ /χ̄ random when IT was used for matching the cells for the subsets associated with min(JSD(IT ) ) in (a). (c) Error in
the autocorrelation function (A) when the cells were matched using IT for the stochastic Ras activation kinetics. The subsets used in
the matching yielded minimum values of JSD(IT ) for each class. Note that for exact matching (A = 0) IT behaved as an invariant.
2.1.3.3. Stochastic signalling kinetics
We simulated stochastic biochemical reactions in the Ras activation signalling kinetics by including
intrinsic noise fluctuations. The simulations contained the same variations in the initial abundances
used in the investigations for the deterministic kinetics. IT evolved as an invariant in subsets that
were associated with conservation of the total number of molecular species (electronic supplementary
material, figure S10). IT also behaved as a slow variable in specific subsets (electronic supplementary
material, figure S10a). By contrast, IM behaved as a slow variable for select subsets (electronic
supplementary material, figure S10b). Pairing the single cells using IT for the subsets associated with
minimum values JSD(IT ) showed reasonably smaller errors in the autocorrelation function compared
with that for random pairings for most of the subsets (figure 4c). The matching using IM showed smaller
errors (or χ ) compared with random pairing in select subsets (electronic supplementary material, figure
S11).
2.2. Application of the framework for matching in live-cell imaging data
We reconstructed single-cell gene expression kinetic trajectories by applying our framework using livecell imaging data [18]. Data were collected in single yeast cells where the transcription factor Msn2
translocated to the nucleus upon inhibition of protein kinase A by a small molecule, 1-NM-PP1 and
activated target fluorescent reporter genes CFP and YFP residing on homologous chromosomes in the
diploid yeast cells (figure 5a). For our reconstruction, we chose the kinetics of the dual reporter of the
gene DCS2 (one of the seven genes activated in the study) induced by a single 40 min pulse of 1-NMPP1 at a concentration of 690 nM. The activation kinetics of the reporters depended nonlinearly on the
Msn2 abundance and also contained intrinsic and extrinsic noise fluctuations [18]. We carried out our
reconstruction method by treating the live-cell imaging data as snapshot data, because it allowed us
................................................
JSD IT
rsos.royalsocietypublishing.org R. Soc. open sci. 4: 170811
0.20
JSD
10
0.8
0.25
(a)
reconstruction using IT
(b)
11
actual
p-Msn2-m
nucleus
Nucl-Msn2-m
YFP
3000
2000
1000
0
0
3000
2000
20
tim
CFP
e
(d)
(c)
60
60
0
DA/DArandom
0.8
IT
P(c)
50
40
1.0
0.015
0.010
1000
YF
P
CFP
cytoplasm
random
0.005
IT
0.6
0.4
0.2
100
200
c
300
400
0
10
20
30
40
time point
Figure 5. Trajectory reconstruction using live-cell imaging data. (a) Schematic diagram for gene activation. (b) Reconstruction of a
typical kinetics trajectory for CFP and YFP tagged dual reporter for the gene DCS2 in yeast diploid cells. The reconstructed trajectory
using IT (shown in orange) is compared with the true trajectory (shown in blue). (c) Distribution of the quality of alignment χ using IT is
compared to the quality of alignment using random pairing for 159 single-cell trajectories. The reconstructions were carried out between
two successive time measurements for 63 time intervals (e.g. 0–2.5 min, 2.5–5 min and so on). (d) The ratio of the errors (equation (2.13))
in the autocorrelation function for trajectories reconstructed using IT with that for random pairing for the same time intervals in (c).
A/Arandom < 1 for most of the time points indicating better matching using IT compared to the random pairing. The data not shown
on the graph for the time points between 0 to 10 min produced A/Arandom > 1.
to compare the reconstructed trajectories with the measured single-cell trajectories. We analysed the
simultaneous CFP and YFP gene expression kinetics data in 159 single-cell trajectories [18]. At each
time point, we removed the single-cell tag from the CFP and YFP expressions data and treated the 159
data points as snapshot data to perform the reconstruction. We reconstructed two-dimensional (CFP
and YFP) single-cell trajectories using either IT (figure 5b) or IM (electronic supplementary material,
figure S12), and both showed a similar level of agreement between the measured and the reconstructed
trajectories. We further quantified the quality of alignment using P(χ ) (figure 5c) and autocorrelation
functions (figure 5d and electronic supplementary material, S12) for each reconstruction. Both indicators
revealed lower errors in the reconstruction using I compared with random pairing.
2.3. Application of the framework for matching single cells using synthetic flow cytometry data
We applied our framework on synthetic flow cytometry data for matching single cells assayed at
two successive time points. We used in silico or synthetic data instead of measured data from flow
cytometry experiments for two main reasons: (i) The synthetic data allowed us to assess the quality of
the reconstruction because the correct trajectories are known in the simulations. (ii) We were able to use
species abundances in the synthetic data to calculate the slow variables. Single-cell protein abundances
are not directly accessible from the fluorescent intensities measured in flow cytometry. However, for
many proteins it is possible to quantify single-cell species abundances by calibrating the intensities
against a standard curve [25]. The synthetic flow cytometry data were generated at multiple time points
by stochastic simulation (details in Material and methods) of the Ras activation network. The species
abundances at the unstimulated state (or t = 0) in individual cells were chosen from a multivariate normal
................................................
p
Msn2-m
rsos.royalsocietypublishing.org R. Soc. open sci. 4: 170811
1-NM-PP1
PKAmut
(a)
(b)
12
0.55
–JSDmin – Rand
c Rel
cRel
0.6
0.4
0.2
0.50
0.45
0.40
0.35
0.30
2
3
4
class
5
6
2
3
4
class
5
6
JSDmax
JSDmin
Figure 6. Reconstruction using synthetic flow cytometry data. (a) The variation of the ratio of the average relative error χ̄Rel
/χ̄Rel
JSDmax
JSDmin
with the class k. The data were measured at t = 0 and t = 100 s. χ̄Rel
and χ̄Rel
for a class k were calculated for the species
(IT )
subsets that generated the smallest and the largest values of JSD , respectively. The smaller than 1 values for the ratio indicate that
the errors in the subset corresponding to JSDmin were smaller on average than that for the subset with the largest JSD in the same class.
The subsets the corresponded to JSDmin are the following: {RasGDP, RasGAP} for k = 2, {RasGDP, RasGTP, RasGAP} for k = 3, {RasGDP,
RasGTP, RasGAP, DAG} for k = 4, {RasGDP, RasGTP, RasGAP, DAG, RasGRP1} for k = 5 and {RasGDP, RasGTP, RasGAP, DAG, RasGRP1, SOS}
αβ
for k = 6. The relative error χRel between a cell #α at t 1 and its matching partner cell# β (correct partner cell# β ′ ) at t 2 is defined as
k (β)
αβ
(β ′ ) 2 (β ′ )
χ =
i=1 (xi − xi ) /|xi |. We used this definition instead of equation (2.11) to compare between the subsets containing
Rel
αβ
abundances of very different values. The average value of χ̄Rel was calculated by taking the average of χRel values for all the single-cell
JSDmin
Rand
Rand
/χ̄Rel
for the pairing for subsets associated with minimum JSD in (a). χ̄Rel
is the average relative error when
pairs. (b) The ratio χ̄Rel
the matching was generated using random pairing of the cells. The ratio is less than 1 for all the cases indicating that the reconstruction
is better than random reconstructions.
distribution (Material and methods) to represent cell–cell differences in total protein abundances and
tonic signalling. Six molecular species (RasGTP, RasGDP, SOS, RasGRP1, DAG and RasGAP) out of the 14
signalling species involved in the Ras signalling network (electronic supplementary material, figure S6)
were recorded in the synthetic data. We used different sets of cell populations to record single-cell species
abundances at any two time points in the synthetic data as the same cell populations are not assayed more
than once in flow and mass cytometry experiments. We used IT instead of IM for all our reconstructions
here because our previous investigations showed IT generated less error in reconstruction compared
with IM for the Ras activation network. In addition to carrying out reconstruction of trajectories between
a pair of successive time points, we addressed the following relevant issues. (i) Is there a specific subset
of measured abundances that produces a more accurate reconstruction compared with the other subsets?
(ii) How does the error in the reconstruction depend on the size of the time interval between two
successive measurements? (iii) How does the reconstruction using IT compare against other available
methods for matching?
In our investigation, we divided the six measured species into five classes (figure 2 and §2.1.2 for
more details) composed of 2, 3, 4, 5 and 6 species. In each class, we determined the subsets that produced
the largest and the smallest JSD(IT ) values. The subset corresponding to the smallest JSD(IT ) gives rise
to the slowest IT in a class. The subsets corresponding to the lowest JSD(IT ) values in a time interval
also produced the lowest errors in the reconstruction in each class (figure 6a). All the reconstructions
corresponding to the lowest JSD(IT ) values faired better than the random pairings (figure 6b). These
results demonstrate that it is possible to identify a group of species in the cytometry dataset that
will generate better reconstruction in a time interval. Moreover, when the interest is in reconstructing
trajectories for a particular subset of molecular species, the JSD(IT ) for the subset can be used to assess the
quality of the reconstruction relative to the other combinations of the measured species. This knowledge
can help to refine the measurements, e.g. use smaller time intervals (see below) when the JSD(IT ) value
for the subset of interest is not the minimum in the class. Thus, the JSD(IT ) value provides a valuable
metric to quantitatively assess the quality of a trajectory reconstruction in experimental flow cytometry
datasets where the correct trajectory is unknown. Using our in silico data, we found that the errors
become larger as the time interval is increased (electronic supplementary material, figure S13a). This
is expected as the slow variables start changing appreciably with the increasing magnitude of the time
interval resulting in large values of the parameter ξI (≫1). However, the precise answer to the question,
................................................
0.8
rsos.royalsocietypublishing.org R. Soc. open sci. 4: 170811
–JSDmin –JSDmax
cRel
cRel
1.0
We have developed a framework for matching single cells across time using time-stamped data from flow
and mass cytometry experiments. Our approach, based on posing the problem in terms of new variables
that remain unchanged or vary slowly with time, is radically different from the existing methods [9,12]
employed for solving matching problems. Specifically, unlike other pair-matching algorithms [12], our
approach does not require any assumption regarding the underlying reaction kinetics and estimation of
model parameters. The use of slow variables and invariants reduces the value of parameter ξ making
the matching straightforward (when ξ < 1) or more amenable to existing computational methods. The
application of the framework for in silico signalling networks, live-cell imaging data and synthetic flow
cytometry data showed excellent to reasonable agreement of the reconstructed trajectories with the
correct kinetics. We constructed two new variables from the measured species abundances that served
as slow variables or invariants in a wide variety of signalling kinetics involving cell-to-cell variations
of species abundances, nonlinear reaction propensities and intrinsic noise fluctuations. One of the new
variables (IT ) is the total abundance of a molecular species in a single cell. Early time signalling events
usually involve chemical modifications of signalling proteins (e.g. phosphorylation) that do not lead to
any change in the total content of the proteins in a single cell. However, the total abundance of a protein
at any time can vary from cell to cell over a wide range usually described by a lognormal distribution
[26,27]. Thus, the total protein content provides a unique tag that can be used to pair a single cell at any
time with a ‘sister’ cell at a later time. In flow and mass cytometry measurements, it is possible to assay
the total content of certain signalling proteins (e.g. Erk), which can be used to reconstruct single-cell
signalling trajectories.
The other new variable (IM ) was constructed by scaling the measured single-cell abundances by the
inverse of the square root of the covariance matrix for species abundances. The reconstruction using IM
worked better when the signalling kinetics in a time interval was effectively described by a closed system
of first-order reactions. As IM behaves as a slow variable for a kinetics described by first-order reactions
with small stochastic fluctuations and weak external fluxes, it can be used for addressing matching
problems in other contexts such as tagged particles in fluid flows [9,12]. We recently developed a method
to estimate reaction rates in a system of first-order reactions designed to describe mass cytometry
snapshot data in a time interval [7]. The method also determines how well the system of first-order
reactions can describe the snapshot data. In the case of a good agreement, we expect IM to behave
as a slow variable and generate good to reasonable reconstructions. However, the precise relationship
between IM being a slow variable and the underlying reaction network will require further investigation.
Our investigations of signalling kinetics in non-ideal cases showed that in several cases (electronic
supplementary material, figure S5) the reconstructed trajectories using IT produced better agreement to
the correct kinetics as opposed to IM . In these cases, the use of IT or IM reduced the value of the parameter
ξ ; however, ξ (I) still remained greater than 1, i.e. ξ (I) > 1. Thus, the cost function E (equation (2.10)) played
an important role in matching the cells in these situations. The differences in quality of matching between
the two slow variables point to the fact that minimizing E was better suited for IT compared with IM in
these examples.
The application of our framework to synthetic flow cytometry data showed that JSD(IT ) values,
evaluated for different subsets of measured species abundances, can be used to determine groups of
species where the reconstruction will produce less errors. The quality of the reconstruction improved
as the size of the time intervals between successive measurements was decreased, raising an important
question about the existence of an optimal time interval. Similar questions in the context of selecting
time points for measuring gene expressions have been dealt with methods based on machine learning
where specific cost functions were optimized using finely spaced time measurements in a smaller gene
subset [28]. Such approaches might turn out to be helpful for generating trajectory reconstructions
................................................
3. Discussion
13
rsos.royalsocietypublishing.org R. Soc. open sci. 4: 170811
if there exists an optimal time interval for cytometry measurements for good trajectory reconstructions
needs further investigation (see Discussion). Lastly, we compared our method against a well-known
scheme [9] that minimizes the total Euclidean distance between the measured variables at time t1 and t2
(electronic supplementary material, figure S13b,c). Our investigations produced mixed results. For a twospecies sub-module our method faired better (electronic supplementary material, figure S13b), whereas
for a three-species sub-module the quality of reconstruction between the two methods was comparable
(electronic supplementary material, figure S13c). In a few cases, we found that the Euclidean method
produced slightly better reconstructions.
4.1. Derivation of IM for the ideal kinetics
Equation (2.1) can be solved analytically to calculate single-cell abundances at any time t:
(α)
xi (t) =
n
(α)
[eMt ]ij xj (0).
(4.1)
j=1
When the abundances follow the above equation, the average quantities in equations (2.4) and (2.5)
at the two time points (t1 and t2 , t2 > t1 ) are related by
µi (t2 ) =
n
[eM(t2 −t1 ) ]ij µj (t1 )
(4.2a)
j=1
and
T
J(t2 ) = eM(t2 −t1 ) J(t1 ) eM
(t2 −t1 )
.
(4.2b)
................................................
4. Material and methods
14
rsos.royalsocietypublishing.org R. Soc. open sci. 4: 170811
using cytometry data where live-cell imaging measurement of a smaller subset of proteins can be used
to select time points in cytometry experiments measuring a large number of proteins simultaneously.
Another question related to the above issue is if all the cells need to the coherently stimulated in
cytometry experiments at time t = 0 for good reconstructions. When the time difference in triggering
for different batches of cell populations is long enough to produce large values in the parameter ξ (I) , the
reconstructions are likely to contain large errors. We also noticed deterioration (electronic supplementary
material figure S13a) in the reconstruction quality as the kinetics approached the bistable behaviour
where the Ras activation changed by a large amount in a very short time interval. The framework is
likely to be error prone when such abrupt changes occur in the kinetics.
In recent years, a host of methods have been developed to construct single-cell development
trajectories using snapshot data (e.g. WANDERLUST [29], SCUBA [30]) where, unlike the cases dealt
with here, the single cells are not ordered temporally. These methods assign a ‘pseudo time’ to the data
and then optimize ad hoc cost functions to create single-cell trajectories. It is unclear whether those cost
functions will render any benefit to the matching problems considered here. For example, one of the
cost functions that minimized the cosine distance between single cells [29] will not be able to correctly
reconstruct signalling trajectories in a simple example where the kinetics are described by first-order
reactions.
The main difficulty in applying the framework developed here is to identify appropriate invariants or
slow variables in a general situation. Singer et al. [31] used nonlinear independent component analysis
for constructing slow variables by analysing stochastic kinetics of dynamical systems in a short time
window. This approach determined slow variables that were functions of linear combinations of the
observables. When the underlying signalling reactions are known, this approach can help find slow
modes in the system by simulating the in silico network in short time durations, and these slow variables
can then be used to reconstruct single-cell trajectories for cytometry data using our framework. However,
the applicability of the approach when the slow modes are complicated nonlinear functions of the
measured variables or when only a subset of involved dynamical variables is measured is unclear.
In statistical physics [15,16], conservation laws (e.g. conservation of energy, momentum) or symmetry
principles help identify such slow modes. Projection operator methods by Zwanzig & Mori [32] provide
a formal way to construct variables with slower time scales for a known microscopic dynamics. However,
this method requires knowledge of model parameter values (e.g. reaction rates), and the calculations
for constructing slow modes could become intractable for a complex system composed of nonlinear
interactions such as cell signalling kinetics. A computation intensive step in our framework is to
determine specific combinations of species abundances that are associated with low JSD(I) values. Masscytometry experiments can measure over 40 different proteins and the number of possible subsets in
such large dimensions can be prohibitively large. When the signalling reactions are well characterized,
selecting a group of species that are connected by mass balance in chemical modifications (e.g. enzymatic
conversions or binding–unbinding reactions) could provide a way to identify a core species set with a
slow mode. Adding new groups of species using smart sampling techniques [33] to expand this core set
would be an exciting future endeavour.
(4.3)
where Q is any orthogonal matrix, i.e. QQT = I, I is the identity matrix. This equation contains the
clue that if the abundances are scaled appropriately, the time evolution given by equation (2.1) can be
described by a rotation or a reflection. We found that if the species abundances at any time are scaled by
the inverse of the square root of the covariance matrix (equation (2.4)), then the scaled abundances are
related across time points by orthogonal transformations.
Using equation (4.4), we can write equation (4.1) as
(α)
xi (t2 ) =
n
(α)
[[J(t2 )]1/2 Q[J(t1 )]−1/2 ]ij xj (t1 ),
(4.4)
j=1
which implies that the scaled variables in equation (2.3) at t1 and t2 are related by
(α)
x̃i (t2 ) =
(α)
Qij x̃j (t1 ).
(4.5)
j
As rotation and reflection preserves the magnitude of a vector, the magnitude (I(α) M ) of the vector x̃(α) ≡
(α) (α)
(α)
{x̃1 , x̃2 , . . . , x̃n } remains unchanged throughout the kinetics from its value at t = 0.
4.2. Simulation of the in silico networks
The mass action deterministic kinetics and the stochastic kinetics for the reactions for the system of firstorder reactions and the Ras activation network were simulated using the software package BIONETGEN
[34]. The initial species abundances were drawn from a multivariate normal distribution by specifying
the average abundances and the covariances. The initial conditions and the parameter values for the
reaction networks are provided in the electronic supplementary material, tables S1 and S2, and figures
S1 and S6.
4.3. Generation of the synthetic flow cytometry data
The kinetics of Ras activation network (electronic supplementary material, figure S6) was simulated
using BIONETGEN [34]. Single-cell abundances of six different species (RasGTP, RasGDP, SOS,
RasGRP1, DAG and RasGAP) were measured at different times (t = 0, 100, 200, 300 and 500 s).
Different batches of 2000 single cells were used for measurements at two successive time points.
4.4. Minimization of the cost function E
The minimization of E involves finding a bipartite graph where a pair of vertices in two subsets (single
cells {α} at t1 and single cells {β} at t2 ) connected with a cost (I(α) − I(β) )2 minimizes the total cost E. The
√
graph matching algorithms computes the optimization in time O(|E| V) ≈ O(n2 ) [35]. However, in our
case we can use sorting to compute this in O(nln(n)) time. This is achieved by changing the cost function
for a pairing #α, #β to (I(α) − I(β) )2 + ǫ(t2 −t1 )2 , where, ǫ → 0. Note, this is the Euclidean distance between
the cells in the I−t plane, thus minimizing the cost function E amounts to joining these points (or single
cells) in the I−t plane by non-intersecting lines. As the cells in {α} (or {β}) have the same t coordinate
t = t1 (or t2 ), the configuration with the non-intersecting lines keeps the same ascending (or descending)
order in I for the cells in {α} and {β}. Therefore, first, we sorted the {α} and the {β} cells in arrays where
cells were arranged in ascending order of I, and then in the sorted arrays, we connected cells with the
same array index. A pseudo-code is provided in the electronic supplementary material. The Mathematica
codes are available at http://planetx.nationwidechildrens.org/~jayajit/pair-matching.
................................................
eM(t2 −t1 ) = [J(t2 )]1/2 Q[J(t1 )]−1/2 ,
15
rsos.royalsocietypublishing.org R. Soc. open sci. 4: 170811
Note that the elements of the M matrix cannot be uniquely determined from the above relations
because there are n2 unknown elements in the M matrix and equation (4.2a,b) provide n + n(n + 1)/2 < n2
relations between the unknown variables. Therefore, equation (4.1) cannot be used to evolve the
abundances in a single cell at time point (at t1 ) to a later time point (t2 ), and then identify the correct
cell from the measurements at t2 . Now, equation (4.2b) can be recast as
4.5. Calculation of JSD values
16
(IT )
Imax
T
ITmin
max
P(IT )
P(IT ) ln
M(IT )
dIT ≈ IT
IT
P((IT )p ) ln
p=ITmin
P((IT )p )
M((IT )p )
,
where IT = k and M(IT ) = [P(IT (t1 )) + P(IT (t2 ))]/2.
4.5.2. Calculation of JSD(IM )
Done in a similar way as IT . The bin width (IM ) was chosen as follows. We calculated I(α) M following,
(α) −1
(α)
(α) −1 (α)
(α) −1
(α)
i,j (x i J ij x j j )
i.j (x i J ij x j + x i J ij x j )
(α)
=
.
I M =
2|x̃(α) |
|x̃(α) |
(α) −1 (α)
(α)
Jij xj , and x(α) = 1 for all species and all the cells the above
Using IM = |x̃(α) | =
i,j xi
expression simplifies to
(α)
IM
=
−1 (α)
i,j Jij xj
|x̃(α) |
.
(α)
α for all the cells at time t and t and set I = min(I ). A pseudo-code is
We evaluated IM
1
2
M
M
provided in the electronic supplementary material. The Mathematica codes are available at http://
planetx.nationwidechildrens.org/~jayajit/pair-matching.
Data accessibility. The computer codes used for the studies shown in the article are available at the website: http://
planetx.nationwidechildrens.org/~jayajit/pair-matching. The codes are written in Mathematica and in BIONETGEN
(bionetgen.org). The live-cell imaging data reported in fig. 4 in [18] are available on the web at the link: http://msb.
embopress.org/content/9/1/704.short.
Authors’ contributions. S.M. and J.D. planned and designed research, and performed analytical and numerical calculations.
S.M., W.S., L.L.L. and J.D. analysed data. S.M., W.S. D.S. and J.D. contributed computational tools. S.M., L.L.L. and J.D.
wrote the paper. All authors made significant contributions to the manuscript. All authors gave their final approval
for publication.
Competing interests. We have no competing interests.
Funding. This work was supported by the grant R56AI108880–01 to J.D. from NIAID. S.M. was supported in part by a
grant from the Department of Biotechnology (BTPR12422/MED/31/287/2014, valid November 2014–2017). We also
acknowledge the computation time provided by the Ohio Supercomputer Center.
Acknowledgements. J.D. and S.M. thank Helle Jensen, Suzanne Gaudet, Alper Yilmaz, Victoria Best and Anton Zilman for
helpful discussions. S.M. and J.D. acknowledge the help from Anders Hansen in Erin O’Shea’s lab for accessing the
live imaging data. J.D. also acknowledges the support from the Quantitative Immunology Workshop at KITP where
a part of the work was carried out.
References
1. Janeway C. 2005 Immunobiology : the immune
system in health and disease, pp. xxiii, 823 p,
6th edn. New York, NY: Garland Science.
2. Cohen-Saidon C, Cohen AA, Sigal A,
Liron Y, Alon U. 2009 Dynamics and variability
of ERK2 response to EGF in individual living cells.
Mol. Cell 36, 885–893. (doi:10.1016/
j.molcel.2009.11.025)
3. Hoebe RA, Van Oven CH, Gadella TWJ, Dhonukshe
PB, Van Noorden CJF, Manders EMM. 2007
Controlled light-exposure microscopy reduces
photobleaching and phototoxicity in fluorescence
live-cell imaging. Nat. Biotechnol. 25, 249–253.
(doi:10.1038/nbt1278)
4. Bendall SC, Nolan GP, Roederer M, Chattopadhyay
PK. 2012 A deep profiler’s guide to cytometry. Trends
Immunol. 33, 323–332. (doi:10.1016/j.it.2012.
02.010)
5. Chattopadhyay PK, Gierahn TM, Roederer M, Love
JC. 2014 Single-cell technologies for monitoring
immune systems. Nat. Immunol. 15, 128–135.
(doi:10.1038/ni.2796)
6. Barnett L, Barrett AB, Seth AK. 2009 Granger
causality and transfer entropy are equivalent
for Gaussian variables. Phys. Rev. Lett. 103,
238701. (doi:10.1103/PhysRevLett.103.
238701)
7. Mukherjee S, Jensen H, Stewart W, Stewart D, Ray
WC, Chen S-Y, Nolan GP, Lanier LL, Das J. 2017 In
silico modeling identifies CD45 as a regulator of IL-2
synergy in the NKG2D-mediated activation of
immature human NK cells. Science Signaling 10,
aai9062.
................................................
For a subset of m molecular species (m ≤ n), we calculated the sets {IT (α) (t1 )}and {IT (β) (t2 )} for N = 3000
cells. We then constructed the probability distributions P(IT , t1 ) and P(IT , t2 ) by binning the above sets
and normalizing the histograms. The bin width (IT ) was chosen to be the cardinality k of the subset.
We then calculated the Kullback–Leibler divergence (dKL ) using the distributions. Here dKL is given by
an integral as IT is a continuous variable. We approximated the integral by a sum over the histograms
calculated above, i.e.
rsos.royalsocietypublishing.org R. Soc. open sci. 4: 170811
4.5.1. Calculation of JSD
19.
21.
22.
23.
24.
25.
26.
27.
28. Kleyman M, Sefer E, Nicola T, Espinoza C, Chhabra D,
Hagood JS, Kaminski N, Ambalavanan N,
Bar-Joseph Z. 2017 Selecting the most appropriate
time points to profile in high-throughput studies.
Elife 6 , e18541. (doi:10.7554/eLife.18541)
29. Bendall SC, Davis KL, Amir el-AD, Tadmor MD,
Simonds EF, Chen TJ, Shenfeld DK, Nolan GP, Péer D.
2014 Single-cell trajectory detection uncovers
progression and regulatory coordination in human
B cell development. Cell 157, 714–725. (doi:10.1016/
j.cell.2014.04.005)
30. Marco E, Karpb RL, Guoc G, Robsond P,
Harte AH, Trippaa L, Yuana G-C. 2014 Bifurcation
analysis of single-cell gene expression data
reveals epigenetic landscape. Proc. Natl Acad.
Sci. USA 111, E5643–E5650. (doi:10.1073/pnas.
1408993111)
31. Singer A, Erban R, Kevrekidis IG, Coifman RR.
2009 Detecting intrinsic slow variables in
stochastic dynamical systems by anisotropic
diffusion maps. Proc. Natl Acad. Sci. USA 106,
16 090–16 095. (doi:10.1073/pnas.090554
7106)
32. Zwanzig R. 2001 Nonequilibrium statistical
mechanics. Oxford, NY: Oxford University Press.
33. Hafner M, Koeppl H, Hasler M, Wagner A. 2009
‘Glocal’ robustness analysis and model
discrimination for circadian oscillators. PLoS
Comput. Biol. 5, e1000534. (doi:10.1371/journal.
pcbi.1000534)
34. Colvin J, Monine MI, Faeder JR, Hlavacek WS, Von
Hoff DD, Posner RG. 2009 Simulation of large-scale
rule-based models. Bioinformatics 25, 910–917.
(doi:10.1093/bioinformatics/btp066)
35. Micali S, Vazirani VV. 1980 An O(\sqrt|V|.E)
algorithm for finding maximum matching
in general graphs. In Proc. 21st IEEE Symp. on
Foundations of Computer Science Syracuse, NY,
13–15 October, pp. 17–27. IEEE.
(doi:10.1109/SFCS.1980.12)
17
................................................
20.
between noise and control of gene expression. Mol.
Syst. Biol. 9, 704. (doi:10.1038/msb.2013.56)
Strogatz SH. 2000 Nonlinear dynamics and chaos:
with applications to physics, biology, chemistry,
and engineering. Cambridge, MA: Westview Press.
Zikherman J, Jenne C, Watson S, Doan K, Raschke
W, Goodnow CC, Weiss A. 2010 CD45-Csk
phosphatase-kinase titration uncouples basal and
inducible T cell receptor signaling during thymic
development. Immunity 32, 342–354. (doi:10.1016/
j.immuni.2010.03.006)
Feinerman O, Veiga J, Dorfman JR, Germain RN,
Altan-Bonnet G. 2008 Variability and robustness in
T cell activation from regulated heterogeneity in
protein levels. Science 321, 1081–1084. (doi:10.1126/
science.1158013)
Cover TM, Thomas JA. 2006 Elements of information
theory, 2nd edn. Hoboken, NJ: Wiley-Interscience.
Lin JH. 1991 Divergence measures based on the
Shannon entropy. IEEE Trans. Inform. Theory 37,
145–151. (doi:10.1109/18.61115)
Das J, Ho M, Zikherman J, Govern C, Yang M, Weiss
A, Chakraborty AK, Roose JP. 2009 Digital signaling
and hysteresis characterize ras activation in
lymphoid cells. Cell 136, 337–351. (doi:10.1016/
j.cell.2008.11.051)
Coffman VC, Wu JQ. 2012 Counting protein
molecules using quantitative fluorescence
microscopy. Trends Biochem. Sci. 37, 499–506.
(doi:10.1016/j.tibs.2012.08.002)
Cotari JW, Voisinne G, Dar OE, Karabacak V,
Altan-Bonnet G. 2013 Cell-to-cell variability analysis
dissects the plasticity of signaling of common
gamma chain cytokines in T cells. Sci. Signal. 6, ra17.
(doi:10.1126/scisignal.2003240)
Salman H, Brenner N, Tung C-k, Elyahu N, Stolovicki
E, Moore L, Libchaber A, Braun E. 2012 Universal
protein fluctuations in populations of
microorganisms. Phys. Rev. Lett. 108, 238105.
(doi:10.1103/PhysRevLett.108.238105)
rsos.royalsocietypublishing.org R. Soc. open sci. 4: 170811
8. Saxton MJ. 2008 Single-particle tracking:
connecting the dots. Nat. Methods 5, 671–672.
(doi:10.1038/nmeth0808-671)
9. Ouellette NT, Xu HT, Bodenschatz E. 2006 A
quantitative study of three-dimensional
Lagrangian particle tracking algorithms. Exp. Fluids
40, 301–313. (doi:10.1007/s00348-005-0068-7)
10. Ballerini M et al. 2008 Interaction ruling animal
collective behavior depends on topological rather
than metric distance: evidence from a field study.
Proc. Natl Acad. Sci. USA 105, 1232–1237.
(doi:10.1073/pnas.0711437105)
11. Lipton AJ, Fujiyoshi H, Patil RS. 1998 Moving target
classification and tracking from real-time video.
In Proc. Fourth IEEE Workshop on Applications of
Computer Vision (Wacv’98), Princeton, NJ, 19–21
October, pp. 8–14. IEEE.
(doi:10.1109/ACV.1998.732851)
12. Chertkov M, Kroc L, Krzakala F, Vergassola M,
Zdeborova L. 2010 Inference in particle tracking
experiments by passing messages between images.
Proc. Natl Acad. Sci. USA 107, 7663–7668.
(doi:10.1073/pnas.0910994107)
13. Bendall SC, Nolan GP. 2012 From single cells to deep
phenotypes in cancer. Nat. Biotechnol. 30, 639–647.
(doi:10.1038/nbt.2283)
14. Gustafsson F, Gunnarsson F, Bergman N, Forssell U,
Jansson J, Karlsson R, Nordlund P. 2002 Particle
filters for positioning, navigation, and tracking. IEEE
Trans. Signal Process. 50, 425–437. (doi:10.1109/78.
978396)
15. Chaikin PM, Lubensky TC. 2000 Principles of
condensed matter physics. Cambridge, UK:
Cambridge University Press.
16. Mazenko G. 2006 Nonequilibrium statistical
mechanics. Weinheim, Germany: Wiley.
17. Bialek WS. 2012 Biophysics : searching for principles.
Princeton, NJ: Princeton University Press.
18. Hansen AS, O’Shea EK. 2013 Promoter decoding of
transcription factor dynamics involves a trade-off