Abstract
A fundamental problem in data management and analysis is to generate descriptions of the distribution of data. It is most common to give such descriptions in terms of the cumulative distribution, which is characterized by the quantiles of the data. The design and engineering of efficient methods to find these quantiles has attracted much study, especially in the case where the data are given incrementally, and we must compute the quantiles in an online, streaming fashion. While such algorithms have proved to be extremely useful in practice, there has been limited formal comparison of the competing methods, and no comprehensive study of their performance. In this paper, we remedy this deficit by providing a taxonomy of different methods and describing efficient implementations. In doing so, we propose new variants that have not been studied before, yet which outperform existing methods. To illustrate this, we provide detailed experimental comparisons demonstrating the trade-offs between space, time, and accuracy for quantile computation.
Notes
Note that floating-point numbers in standard representations (e.g., IEEE 754) can be mapped to integers in a fixed universe in an order-preserving fashion.
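The order-preserving mapping mentioned in this note can be sketched as follows; this is a minimal illustration of the standard sign-bit trick for IEEE 754 doubles, not an implementation from the paper:

```python
import struct

def float_to_ordered_int(x: float) -> int:
    """Map an IEEE 754 double to a 64-bit unsigned integer such that
    x < y implies float_to_ordered_int(x) < float_to_ordered_int(y)."""
    # Reinterpret the double's bit pattern as a signed 64-bit integer.
    (bits,) = struct.unpack("<q", struct.pack("<d", x))
    if bits >= 0:
        # Non-negative floats: set the sign bit so they sort above all negatives.
        return bits | (1 << 63)
    # Negative floats: flip all bits to reverse their (descending) bit order.
    return ~bits & ((1 << 64) - 1)
```

The mapping works because non-negative doubles already sort like their bit patterns, while negative doubles sort in reverse of theirs; the two branches fix both cases and place negatives below positives.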
Right ascension is an astronomical term used to locate a point (a minor planet in this case) in the equatorial coordinate system.
It is possible for the error to be affected due to more duplicates in smaller universes, but we found this effect negligible in practice.
References
Agarwal, P.K., Cormode, G., Huang, Z., Phillips, J.M., Wei, Z., Yi, K.: Mergeable summaries. ACM Trans. Database Syst. 38, 26 (2013)
Alon, N., Matias, Y., Szegedy, M.: The space complexity of approximating the frequency moments. J. Comput. Syst. Sci. 58(1), 137–147 (1999). doi:10.1006/jcss.1997.1545. http://www.sciencedirect.com/science/article/B6WJ0-45JCBTJ-D/2/2a71f12f1f0112bc83447b9d48eba529
Arasu, A., Manku, G.: Approximate counts and quantiles over sliding windows. In: Proceedings of the ACM Symposium on Principles of Database Systems (2004)
Blum, M., Floyd, R.W., Pratt, V., Rivest, R.L., Tarjan, R.E.: Time bounds for selection. J. Comput. Syst. Sci. 7, 448–461 (1973)
Charikar, M., Chen, K., Farach-Colton, M.: Finding frequent items in data streams. In: Proceedings of the International Colloquium on Automata, Languages, and Programming (2002)
Cormode, G., Hadjieleftheriou, M.: Finding frequent items in data streams. In: Proceedings of the International Conference on Very Large Data Bases (2008)
Cormode, G., Muthukrishnan, S.: An improved data stream summary: the count-min sketch and its applications. J. Algorithms 55(1), 58–75 (2005)
Cormode, G., Korn, F., Muthukrishnan, S., Johnson, T., Spatscheck, O., Srivastava, D.: Holistic UDAFs at streaming speeds. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 35–46 (2004)
Cormode, G., Garofalakis, M., Muthukrishnan, S., Rastogi, R.: Holistic aggregates in a networked world: distributed tracking of approximate quantiles. In: Proceedings of the ACM SIGMOD International Conference on Management of Data (2005)
Cormode, G., Korn, F., Muthukrishnan, S., Srivastava, D.: Space- and time-efficient deterministic algorithms for biased quantiles over data streams. In: Proceedings of the ACM Symposium on Principles of Database Systems (2006)
Felber, D., Ostrovsky, R.: A randomized online quantile summary in \(O(\frac{1}{\varepsilon }\log \frac{1}{\varepsilon })\) words (2015). CoRR abs/1503.01156. http://arxiv.org/abs/1503.01156
Ganguly, S., Majumder, A.: CR-precis: A deterministic summary structure for update data streams. In: ESCAPE (2007)
Gilbert, A.C., Kotidis, Y., Muthukrishnan, S., Strauss, M.J.: How to summarize the universe: dynamic maintenance of quantiles. In: Proceedings of the International Conference on Very Large Data Bases (2002)
Govindaraju, N.K., Raghuvanshi, N., Manocha, D.: Fast and approximate stream mining of quantiles and frequencies using graphics processors. In: Proceedings of the ACM SIGMOD International Conference on Management of Data (2005)
Greenwald, M., Khanna, S.: Space-efficient online computation of quantile summaries. In: Proceedings of the ACM SIGMOD International Conference on Management of Data (2001)
Greenwald, M., Khanna, S.: Power conserving computation of order-statistics over sensor networks. In: Proceedings of the ACM Symposium on Principles of Database Systems (2004)
Huang, Z., Wang, L., Yi, K., Liu, Y.: Sampling based algorithms for quantile computation in sensor networks. In: Proceedings of the ACM SIGMOD International Conference on Management of Data (2011)
Hung, R.Y.S., Ting, H.F.: An \(\varOmega (\frac{1}{\varepsilon }\log \frac{1}{\varepsilon })\) space lower bound for finding \(\varepsilon \)-approximate quantiles in a data stream. In: FAW (2010)
Lagrange, J.L.: Mécanique analytique, vol. 1. Mallet-Bachelier, Paris (1853)
Li, C., Hay, M., Rastogi, V., Miklau, G., McGregor, A.: Optimizing histogram queries under differential privacy (2009). CoRR abs/0912.4742, http://arxiv.org/abs/0912.4742
Manku, G.S., Rajagopalan, S., Lindsay, B.G.: Approximate medians and other quantiles in one pass and with limited memory. In: Proceedings of the ACM SIGMOD International Conference on Management of Data (1998)
Manku, G.S., Rajagopalan, S., Lindsay, B.G.: Random sampling techniques for space efficient online computation of order statistics of large datasets. In: Proceedings of the ACM SIGMOD International Conference on Management of Data (1999)
Munro, J.I., Paterson, M.S.: Selection and sorting with limited storage. Theor. Comput. Sci. 12, 315–323 (1980)
Pike, R., Dorward, S., Griesemer, R., Quinlan, S.: Interpreting the data: parallel analysis with Sawzall. Dyn. Grids Worldw. Comput. 13(4), 277–298 (2005)
Rao, C.R.: Linear Statistical Inference and its Applications, vol. 22. Wiley, New Jersey (2009)
Shrivastava, N., Buragohain, C., Agrawal, D., Suri, S.: Medians and beyond: new aggregation techniques for sensor networks. In: Proceedings of the ACM SenSys (2004)
Suri, S., Toth, C., Zhou, Y.: Range counting over multidimensional data streams. Discrete Comput. Geom. 36, 633–655 (2006)
Vapnik, V.N., Chervonenkis, A.Y.: On the uniform convergence of relative frequencies of events to their probabilities. Theory Probab. Appl. 16, 264–280 (1971)
Wang, L., Luo, G., Yi, K., Cormode, G.: Quantiles over data streams: an experimental study. In: Proceedings of the ACM SIGMOD International Conference on Management of Data (2013)
Yi, K., Zhang, Q.: Optimal tracking of distributed heavy hitters and quantiles. Algorithmica 65(1), 206–223 (2013)
Acknowledgments
The first three authors are supported by HKRGC Grants GRF-621413, GRF-16211614, and GRF-16200415, and by a Microsoft Grant MRA14EG05. The work of GC is supported in part by European Research Council Grant ERC-2014-CoG 647557, the Yahoo Faculty Research and Engagement Program and a Royal Society Wolfson Research Merit Award.
Additional lemmas and proofs
1.1 Size of the truncated tree \(\hat{T}\)
Lemma 1
The truncated tree \(\hat{T}\) has size \(O({1 \over \varepsilon } \log u)\) in expectation.
Proof
Recall that only nodes whose estimated frequency is above \(\eta \varepsilon n\) are added to the truncated tree \(\hat{T}\), where \(\eta \) is a constant. We classify these nodes into heavy nodes and non-heavy nodes. A node is heavy if its true frequency is above \({1\over 2}\eta \varepsilon n\), and non-heavy otherwise.
We first observe that, on any level of the dyadic structure, there are at most \({n \over {1\over 2} \eta \varepsilon n} = O(1/\varepsilon )\) heavy nodes, treating \(\eta \) as a (small) constant. So even if they are all added to \(\hat{T}\), there are only \(O({1\over \varepsilon } \log u)\) of them in total.
For a non-heavy node v, the Count-Sketch may overestimate its frequency to above \(\eta \varepsilon n\) with constant probability, say 1/4. Note that \(\varTheta (u)\) non-heavy nodes may be overestimated in expectation, but we argue below that only \(O({1\over \varepsilon }\log u)\) of them are added during the top-down construction of \(\hat{T}\).
Note that for a non-heavy node to be added, its parent must be a heavy node or another non-heavy node that has been overestimated. Thus, all the non-heavy nodes in \(\hat{T}\) make up a number of subtrees, where the root of each subtree must be a child of a heavy node. Let \({\mathbf {t}}\) be any such non-heavy subtree; we will bound \({\mathrm {E}}[|{\mathbf {t}}|]\). Let r be the root of \({\mathbf {t}}\). For any node v below r, let \(I_v\) be an indicator variable where \(I_v=1\) if \(v\in {\mathbf {t}}\) and \(I_v=0\) otherwise. Let d(v) be the depth of v in \({\mathbf {t}}\), with \(d(r) = 0\). For v to be added to \({\mathbf {t}}\), all of its d(v) ancestors must have been added, so \({\mathrm {E}}[I_v] \le (1/4)^{d(v)+1}\) since each level uses an independent Count-Sketch. Since there are at most \(2^d\) nodes at depth d of \({\mathbf {t}}\), we then have
\({\mathrm {E}}[|{\mathbf {t}}|] = \sum _{v}{\mathrm {E}}[I_v] \le \sum _{d=0}^{\infty } 2^{d}(1/4)^{d+1} = {1\over 4}\sum _{d=0}^{\infty }(1/2)^{d} = {1\over 2} = O(1).\)
Finally, we observe that at most two such \({\mathbf {t}}\)’s can be attached to a heavy node, and there are only \(O({1\over \varepsilon }\log u)\) heavy nodes, so we conclude that there are \(O({1\over \varepsilon }\log u)\) non-heavy nodes in \(\hat{T}\) in expectation. \(\square \)
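For concreteness, the top-down construction argued about in this proof can be sketched as follows. Here `est_freq` is a stand-in oracle for the per-level Count-Sketch query (a hypothetical interface, not the paper's implementation); only nodes whose estimated frequency exceeds \(\eta \varepsilon n\) are expanded, so light subtrees are pruned without being visited.

```python
def build_truncated_tree(est_freq, n, u_bits, eta, eps):
    """Top-down construction of the truncated dyadic tree T-hat.

    est_freq(level, node) returns the estimated frequency of a dyadic
    node; a node is kept (and its children explored) only if that
    estimate exceeds eta * eps * n."""
    threshold = eta * eps * n
    tree = []                 # (level, node) pairs kept in T-hat
    frontier = [(0, 0)]       # root covers the whole universe [0, 2^u_bits)
    while frontier:
        level, node = frontier.pop()
        if est_freq(level, node) <= threshold:
            continue          # prune: this subtree is never examined further
        tree.append((level, node))
        if level < u_bits:    # expand the two dyadic children
            frontier.append((level + 1, 2 * node))
            frontier.append((level + 1, 2 * node + 1))
    return tree
```

Using exact frequencies as the "estimates", a stream heavily concentrated on one item yields a truncated tree that is a single root-to-leaf path plus pruned siblings, illustrating why its size stays small.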
1.2 Constraints on the \(x^*_i\)’s
Lemma 2
Let \(x^*\) be the BLUE of (4). Let \(\lambda _v\) be any solution to (5). For any leaf w, let \(Z_w= \lambda _w \sum _{z\in \text {anc}(w)\setminus r}y_z/\sigma ^2_z\), and for any internal node v, \(Z_v= \sum _{w\prec v}\lambda _w Z_w\). For any node v except the root r, let \(F_v = \sum _{w\in \text {anc}(v)\setminus \{r\}}x^*_w/\sigma ^2_w\). Let \(\varDelta = (Z_r-y_r\pi _{r\text {'s left child}})/\lambda _r\). Then, we have
Proof
We will follow the method of Lagrange multipliers [19] to find the BLUE of (4). Since only the subtree root is known exactly, we introduce a single Lagrange multiplier \(\eta \). We set \(\sigma ^2_r=1/\eta \) instead of 0, and will later take the limit as \(\eta \) goes to \(\infty \). Denote by \({\text {diag}}(1/\sigma _v)\) the diagonal matrix with \(1/\sigma _v\) at entry (v, v). We further define \(Z={\text {diag}}(1/\sigma _v){y}\) and \(U={\text {diag}}(1/\sigma _v)A\). Then, the Lagrange function can be rewritten as \((Z-Ux^*)^T(Z-Ux^*)\). By differentiation, we can derive (i) \(y_r=x_r^*\); and (ii)
This is sufficient to define a solution; we can solve for \(x^*\) by premultiplying by \((U^TU)^{-1}\), at the cost of computing a matrix inverse. In the following, we derive the equations stated in the lemma from (7), which leads to a much more efficient algorithm for computing \(x^*\).
Let \(\text {anc}(u,v)=\text {anc}(u)\cap \text {anc}(v)\), and let u be any node of \(T_r\). We also use \([\tau ]\) to denote the \(\tau \) leaves of \(T_r\). Then, by a simple calculation, we can see that \((U^TU)_{u,w}=\sum _{v\in \text {anc}(u,w)}\sigma _v^{-2}\), and \((U^TZ)_u=\sum _{v\in \text {anc}(u)}y_v/\sigma ^2_v\).
First, we take the weighted sum of corresponding rows on the LHS of (7) to obtain \(\sum _{u\prec v}\lambda _u(U^TU)_ux^*=\)
Note that in the last line of (8), the second component can be written as
We can also derive that the first component is
To see this, let us assume that this holds for any descendant of v. Then, we can derive \(\sum _{u\prec v}\sum _{w\in \text {anc}(u)\setminus \text {anc}(v)}\frac{\lambda _ux^*_w}{\sigma ^2_w}\)
Combining the above two results, we have
Secondly, we take the weighted sum of corresponding rows on the RHS of (7) to obtain
Finally, by combining (9) and (10), we derive that \(\forall v\),
Substituting v by r in (11), we can derive
We already have \(x^*_r=y_r\). Then, we can conclude that either \(\eta =+\infty \) or \(y_r\pi _r=Z_r\). Since \(y_r\) is given and is independent of \(\pi _r\) and \(Z_r\), this implies \(\eta =+\infty \). To handle this infinity, we first express
where s is a child of the root r. Now we take the limit of \(\varDelta (\eta )\) as \(\eta \rightarrow \infty \) and derive that
Finally by taking limit on (11) for any \(v\ne r\), we derive \(x^*_v=(Z_v-\lambda _v F_{\text {parent}(v)}-\lambda _v\varDelta )/\pi _v \), and the lemma is proved. \(\square \)
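As a toy illustration of the Lagrange-multiplier step above (not the general tree recurrence of the lemma), consider the smallest instance: a root whose count \(y_r\) is known exactly and two leaf estimates that must sum to it. Minimizing the variance-weighted squared error under that constraint gives a closed form in which the slack is spread proportionally to each leaf's variance:

```python
def blue_two_leaves(y_r, y_a, y_b, var_a, var_b):
    """Constrained weighted least squares for a root with two leaves:
    minimize (x_a - y_a)^2 / var_a + (x_b - y_b)^2 / var_b
    subject to x_a + x_b = y_r (the root count is known exactly).
    The Lagrange condition assigns the slack in proportion to variance."""
    slack = y_r - y_a - y_b
    total = var_a + var_b
    x_a = y_a + var_a * slack / total
    x_b = y_b + var_b * slack / total
    return x_a, x_b
```

The noisier leaf absorbs more of the correction, which matches the intuition behind the general solution the lemma derives.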
1.3 Covariance analysis
Lemma 3
Suppose we build a Count-Sketch with \(d=1\) row and w columns on a vector x. For any two elements \(u\ne v\), let \(y_u\) and \(y_v\) be the estimators for \(x_u\) and \(x_v\). Then, \({\mathrm {Cov}}(y_u,y_v) = x_u x_v/w\).
Proof
Recall that in the Count-Sketch, for any element v, the estimator is computed as \(y_v = g(v)\sum _{z}g(z)x_zI_v(z)\), where \(I_v(z)=1\) if \(h(v)=h(z)\) and 0 otherwise. So we have
Let \(f(i,j)={\mathrm {E}}[g(u)g(v)g(i)g(j)]x_ix_j{\mathrm {E}}[I_u(i)I_v(j)]\). We find that \(f(u,v)=x_ux_v\) and \(f(v,u)= x_u x_v/w\) (note that the last factor in f(i, j) is not symmetric in i and j). If \(\{i,j\}\ne \{u,v\}\), then \(f(i,j)=0\), since \(g(\cdot )\) is a 4-wise independent hash function. Therefore, we derive that \({\mathrm {Cov}}(y_u,y_v)=x_u x_v/w\). \(\square \)
On the other hand, prior analysis of the Count-Sketch shows that \({\mathrm {Var}}(y_v) = {1\over w} \sum _{i} x_i^2\). Thus, the covariance is usually orders of magnitude smaller than the variance.
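The lemma is easy to check empirically. Below is a minimal Monte Carlo sketch of a one-row Count-Sketch (it uses fully random hash functions, which are in particular 4-wise independent, rather than the paper's hash families):

```python
import random

def one_row_count_sketch(x, w, rng):
    """One trial of a d=1 Count-Sketch over the vector x with w columns,
    using fresh random h (buckets) and g (signs); returns all estimators y."""
    n = len(x)
    h = [rng.randrange(w) for _ in range(n)]
    g = [rng.choice((-1, 1)) for _ in range(n)]
    buckets = [0.0] * w
    for i in range(n):
        buckets[h[i]] += g[i] * x[i]
    return [g[i] * buckets[h[i]] for i in range(n)]

def empirical_cov(x, w, u, v, trials, seed=0):
    """Sample covariance of the estimators y_u, y_v over independent trials."""
    rng = random.Random(seed)
    ys = [one_row_count_sketch(x, w, rng) for _ in range(trials)]
    mu_u = sum(y[u] for y in ys) / trials
    mu_v = sum(y[v] for y in ys) / trials
    return sum((y[u] - mu_u) * (y[v] - mu_v) for y in ys) / trials
```

With enough trials, the sample covariance of two estimators concentrates around \(x_u x_v / w\), in agreement with Lemma 3.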
Cite this article
Luo, G., Wang, L., Yi, K. et al. Quantiles over data streams: experimental comparisons, new analyses, and further improvements. The VLDB Journal 25, 449–472 (2016). https://doi.org/10.1007/s00778-016-0424-7