Abstract
A fundamental problem in data management and analysis is to generate descriptions of the distribution of data. It is most common to give such descriptions in terms of the cumulative distribution, which is characterized by the quantiles of the data. The design and engineering of efficient methods to find these quantiles has attracted much study, especially in the case where the data are given incrementally, and we must compute the quantiles in an online, streaming fashion. While such algorithms have proved to be extremely useful in practice, there has been limited formal comparison of the competing methods, and no comprehensive study of their performance. In this paper, we remedy this deficit by providing a taxonomy of different methods and describing efficient implementations. In doing so, we propose new variants that have not been studied before, yet which outperform existing methods. To illustrate this, we provide detailed experimental comparisons demonstrating the trade-offs between space, time, and accuracy for quantile computation.
Notes
Note that floating-point numbers in standard representations (e.g., IEEE 754) can be mapped to integers in a fixed universe in an order-preserving fashion.
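The order-preserving mapping mentioned in this note can be sketched as follows; this is a minimal illustration of the standard sign-bit trick for IEEE 754 doubles, not an implementation from the paper:

```python
import struct

def float_to_ordered_int(x: float) -> int:
    """Map an IEEE 754 double to a 64-bit unsigned integer such that
    x < y implies float_to_ordered_int(x) < float_to_ordered_int(y)."""
    # Reinterpret the double's bit pattern as a signed 64-bit integer.
    (bits,) = struct.unpack("<q", struct.pack("<d", x))
    if bits >= 0:
        # Non-negative floats: set the sign bit so they sort above all negatives.
        return bits | (1 << 63)
    # Negative floats: flip all bits to reverse their (descending) bit order.
    return ~bits & ((1 << 64) - 1)
```

The mapping works because non-negative doubles already sort like their bit patterns, while negative doubles sort in reverse of theirs; the two branches fix both cases and place negatives below positives.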
Right ascension is an astronomical term used to locate a point (a minor planet in this case) in the equatorial coordinate system.
It is possible for the error to be affected due to more duplicates in smaller universes, but we found this effect negligible in practice.
References
Agarwal, P.K., Cormode, G., Huang, Z., Phillips, J.M., Wei, Z., Yi, K.: Mergeable summaries. ACM Trans. Database Syst. 38, 26 (2013)
Alon, N., Matias, Y., Szegedy, M.: The space complexity of approximating the frequency moments. J. Comput. Syst. Sci. 58(1), 137–147 (1999). doi:10.1006/jcss.1997.1545. http://www.sciencedirect.com/science/article/B6WJ0-45JCBTJ-D/2/2a71f12f1f0112bc83447b9d48eba529
Arasu, A., Manku, G.: Approximate counts and quantiles over sliding windows. In: Proceedings of the ACM Symposium on Principles of Database Systems (2004)
Blum, M., Floyd, R.W., Pratt, V., Rivest, R.L., Tarjan, R.E.: Time bounds for selection. J. Comput. Syst. Sci. 7, 448–461 (1973)
Charikar, M., Chen, K., Farach-Colton, M.: Finding frequent items in data streams. In: Proceedings of the International Colloquium on Automata, Languages, and Programming (2002)
Cormode, G., Hadjieleftheriou, M.: Finding frequent items in data streams. In: Proceedings of the International Conference on Very Large Data Bases (2008)
Cormode, G., Muthukrishnan, S.: An improved data stream summary: the count-min sketch and its applications. J. Algorithms 55(1), 58–75 (2005)
Cormode, G., Korn, F., Muthukrishnan, S., Johnson, T., Spatscheck, O., Srivastava, D.: Holistic UDAFs at streaming speeds. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 35–46 (2004)
Cormode, G., Garofalakis, M., Muthukrishnan, S., Rastogi, R.: Holistic aggregates in a networked world: distributed tracking of approximate quantiles. In: Proceedings of the ACM SIGMOD International Conference on Management of Data (2005)
Cormode, G., Korn, F., Muthukrishnan, S., Srivastava, D.: Space- and time-efficient deterministic algorithms for biased quantiles over data streams. In: Proceedings of the ACM Symposium on Principles of Database Systems (2006)
Felber, D., Ostrovsky, R.: A randomized online quantile summary in \(O(\frac{1}{\varepsilon }\log \frac{1}{\varepsilon })\) words (2015). CoRR abs/1503.01156. http://arxiv.org/abs/1503.01156
Ganguly, S., Majumder, A.: CR-precis: A deterministic summary structure for update data streams. In: ESCAPE (2007)
Gilbert, A.C., Kotidis, Y., Muthukrishnan, S., Strauss, M.J.: How to summarize the universe: dynamic maintenance of quantiles. In: Proceedings of the International Conference on Very Large Data Bases (2002)
Govindaraju, N.K., Raghuvanshi, N., Manocha, D.: Fast and approximate stream mining of quantiles and frequencies using graphics processors. In: Proceedings of the ACM SIGMOD International Conference on Management of Data (2005)
Greenwald, M., Khanna, S.: Space-efficient online computation of quantile summaries. In: Proceedings of the ACM SIGMOD International Conference on Management of Data (2001)
Greenwald, M., Khanna, S.: Power conserving computation of order-statistics over sensor networks. In: Proceedings of the ACM Symposium on Principles of Database Systems (2004)
Huang, Z., Wang, L., Yi, K., Liu, Y.: Sampling based algorithms for quantile computation in sensor networks. In: Proceedings of the ACM SIGMOD International Conference on Management of Data (2011)
Hung, R.Y.S., Ting, H.F.: An \(\varOmega (\frac{1}{\varepsilon }\log \frac{1}{\varepsilon })\) space lower bound for finding \(\varepsilon \)-approximate quantiles in a data stream. In: FAW (2010)
Lagrange, J.L.: Mécanique analytique, vol. 1. Mallet-Bachelier, Paris (1853)
Li, C., Hay, M., Rastogi, V., Miklau, G., McGregor, A.: Optimizing histogram queries under differential privacy (2009). CoRR abs/0912.4742, http://arxiv.org/abs/0912.4742
Manku, G.S., Rajagopalan, S., Lindsay, B.G.: Approximate medians and other quantiles in one pass and with limited memory. In: Proceedings of the ACM SIGMOD International Conference on Management of Data (1998)
Manku, G.S., Rajagopalan, S., Lindsay, B.G.: Random sampling techniques for space efficient online computation of order statistics of large datasets. In: Proceedings of the ACM SIGMOD International Conference on Management of Data (1999)
Munro, J.I., Paterson, M.S.: Selection and sorting with limited storage. Theor. Comput. Sci. 12, 315–323 (1980)
Pike, R., Dorward, S., Griesemer, R., Quinlan, S.: Interpreting the data: parallel analysis with Sawzall. Dyn. Grids Worldw. Comput. 13(4), 277–298 (2005)
Rao, C.R.: Linear Statistical Inference and its Applications, vol. 22. Wiley, New Jersey (2009)
Shrivastava, N., Buragohain, C., Agrawal, D., Suri, S.: Medians and beyond: new aggregation techniques for sensor networks. In: Proceedings of the ACM SenSys (2004)
Suri, S., Toth, C., Zhou, Y.: Range counting over multidimensional data streams. Discrete Comput. Geom. 36, 633–655 (2006)
Vapnik, V.N., Chervonenkis, A.Y.: On the uniform convergence of relative frequencies of events to their probabilities. Theory Probab. Appl. 16, 264–280 (1971)
Wang, L., Luo, G., Yi, K., Cormode, G.: Quantiles over data streams: an experimental study. In: Proceedings of the ACM SIGMOD International Conference on Management of Data (2013)
Yi, K., Zhang, Q.: Optimal tracking of distributed heavy hitters and quantiles. Algorithmica 65(1), 206–223 (2013)
Acknowledgments
The first three authors are supported by HKRGC Grants GRF-621413, GRF-16211614, and GRF-16200415, and by a Microsoft Grant MRA14EG05. The work of GC is supported in part by European Research Council Grant ERC-2014-CoG 647557, the Yahoo Faculty Research and Engagement Program and a Royal Society Wolfson Research Merit Award.
Additional lemmas and proofs
1.1 Size of the truncated tree \(\hat{T}\)
Lemma 1
The truncated tree \(\hat{T}\) has size \(O({1 \over \varepsilon } \log u)\) in expectation.
Proof
Recall that only nodes whose estimated frequency is above \(\eta \varepsilon n\) are added to the truncated tree \(\hat{T}\), where \(\eta \) is a constant. We classify these nodes into heavy nodes and non-heavy nodes. A node is heavy if its true frequency is above \({1\over 2}\eta \varepsilon n\), and non-heavy otherwise.
We first observe that, on any level of the dyadic structure, there are at most \({n \over {1\over 2} \eta \varepsilon n} = O(1/\varepsilon )\) heavy nodes, treating \(\eta \) as a (small) constant. So even if they are all added to \(\hat{T}\), there are only \(O({1\over \varepsilon } \log u)\) of them in total.
For a non-heavy node v, the Count-Sketch may overestimate its frequency to above \(\eta \varepsilon n\) with constant probability, say 1/4. Note that \(\varTheta (u)\) non-heavy nodes may be overestimated in expectation, but we argue below that only \(O({1\over \varepsilon }\log u)\) of them are added during the top-down construction of \(\hat{T}\).
Note that for a non-heavy node to be added, its parent must be a heavy node or another non-heavy node that has been overestimated. Thus, all the non-heavy nodes in \(\hat{T}\) make up a number of subtrees, where the root of each subtree must be a child of a heavy node. Let \({\mathbf {t}}\) be any such non-heavy subtree; we will bound \({\mathrm {E}}[|{\mathbf {t}}|]\). Let r be the root of \({\mathbf {t}}\). For any node v below r, let \(I_v\) be an indicator variable where \(I_v=1\) if \(v\in {\mathbf {t}}\) and \(I_v=0\) otherwise. Let d(v) be the depth of v in \({\mathbf {t}}\), with \(d(r) = 0\). For v to be added to \({\mathbf {t}}\), all of its d(v) ancestors must have been added, so \({\mathrm {E}}[I_v] \le (1/4)^{d(v)+1}\) since each level uses an independent Count-Sketch. Since there are at most \(2^d\) nodes at depth d of \({\mathbf {t}}\), we then have
\({\mathrm {E}}[|{\mathbf {t}}|] = \sum _{v}{\mathrm {E}}[I_v] \le \sum _{d=0}^{\infty } 2^{d}(1/4)^{d+1} = {1\over 4}\sum _{d=0}^{\infty }(1/2)^{d} = {1\over 2} = O(1).\)
Finally, we observe that at most two such \({\mathbf {t}}\)’s can be attached to a heavy node, and there are only \(O({1\over \varepsilon }\log u)\) heavy nodes, so we conclude that there are \(O({1\over \varepsilon }\log u)\) non-heavy nodes in \(\hat{T}\) in expectation. \(\square \)
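For concreteness, the top-down construction argued about in this proof can be sketched as follows. Here `est_freq` is a stand-in oracle for the per-level Count-Sketch query (a hypothetical interface, not the paper's implementation); only nodes whose estimated frequency exceeds \(\eta \varepsilon n\) are expanded, so light subtrees are pruned without being visited.

```python
def build_truncated_tree(est_freq, n, u_bits, eta, eps):
    """Top-down construction of the truncated dyadic tree T-hat.

    est_freq(level, node) returns the estimated frequency of a dyadic
    node; a node is kept (and its children explored) only if that
    estimate exceeds eta * eps * n."""
    threshold = eta * eps * n
    tree = []                 # (level, node) pairs kept in T-hat
    frontier = [(0, 0)]       # root covers the whole universe [0, 2^u_bits)
    while frontier:
        level, node = frontier.pop()
        if est_freq(level, node) <= threshold:
            continue          # prune: this subtree is never examined further
        tree.append((level, node))
        if level < u_bits:    # expand the two dyadic children
            frontier.append((level + 1, 2 * node))
            frontier.append((level + 1, 2 * node + 1))
    return tree
```

Using exact frequencies as the "estimates", a stream heavily concentrated on one item yields a truncated tree that is a single root-to-leaf path plus pruned siblings, illustrating why its size stays small.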
1.2 Constraints on the \(x^*_i\)’s
Lemma 2
Let \(x^*\) be the BLUE of (4). Let \(\lambda _v\) be any solution to (5). For any leaf w, let \(Z_w= \lambda _w \sum _{z\in \text {anc}(w)\setminus r}y_z/\sigma ^2_z\), and for any internal node v, \(Z_v= \sum _{w\prec v}\lambda _w Z_w\). For any node v except the root r, let \(F_v = \sum _{w\in \text {anc}(v)\setminus \{r\}}x^*_w/\sigma ^2_w\). Let \(\varDelta = (Z_r-y_r\pi _{r\text {'s left child}})/\lambda _r\). Then, we have
Proof
We will follow the method of Lagrange multipliers [19] to find the BLUE of (4). Since only the subtree root is known exactly, we introduce a single Lagrange multiplier \(\eta \). We set \(\sigma ^2_r=1/\eta \) instead of 0, and will later take the limit as \(\eta \) goes to \(\infty \). Denote by \({\text {diag}}(1/\sigma _v)\) the diagonal matrix with \(1/\sigma _v\) at entry (v, v). We further define \(Z={\text {diag}}(1/\sigma _v){y}\) and \(U={\text {diag}}(1/\sigma _v)A\). Then, the Lagrange function can be rewritten as \((Z-Ux^*)^T(Z-Ux^*)\). By differentiation, we can derive (i) \(y_r=x_r^*\); and (ii)
This is sufficient to define a solution; we can solve for \(x^*\) by premultiplying by \((U^TU)^{-1}\), at the cost of computing a matrix inverse. In the following, we derive the equations stated in the lemma from (7), which leads to a much more efficient algorithm for computing \(x^*\).
Let \(\text {anc}(u,v)=\text {anc}(u)\cap \text {anc}(v)\), and let u be any node of \(T_r\). We also use \([\tau ]\) to denote the \(\tau \) leaves of \(T_r\). Then, by a simple calculation, we can see that \((U^TU)_{u,w}=\sum _{v\in \text {anc}(u,w)}\sigma _v^{-2}\), and \((U^TZ)_u=\sum _{v\in \text {anc}(u)}y_v/\sigma ^2_v\).
First, we take the weighted sum of corresponding rows on the LHS of (7) to obtain \(\sum _{u\prec v}\lambda _u(U^TU)_ux^*=\)
Note that in the last line of (8), the second component can be written as
We can also derive that the first component is
To see this, let us assume that this holds for any descendant of v. Then, we can derive \(\sum _{u\prec v}\sum _{w\in \text {anc}(u)\setminus \text {anc}(v)}\frac{\lambda _ux^*_w}{\sigma ^2_w}\)
Combining the above two results, we have
Secondly, we take the weighted sum of corresponding rows on the RHS of (7) to obtain
Finally, by combining (9) and (10), we derive that \(\forall v\),
Substituting v by r in (11), we can derive
We already have \(x^*_r=y_r\). Then, we can conclude that either \(\eta =+\infty \) or \(y_r\pi _r=Z_r\). Since \(y_r\) is given and is independent of \(\pi _r\) and \(Z_r\), this implies \(\eta =+\infty \). To handle this infinity, we first express
where s is a child of the root r. Now we take the limit of \(\varDelta (\eta )\) as \(\eta \rightarrow \infty \) and derive that
Finally by taking limit on (11) for any \(v\ne r\), we derive \(x^*_v=(Z_v-\lambda _v F_{\text {parent}(v)}-\lambda _v\varDelta )/\pi _v \), and the lemma is proved. \(\square \)
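As a toy illustration of the Lagrange-multiplier step above (not the general tree recurrence of the lemma), consider the smallest instance: a root whose count \(y_r\) is known exactly and two leaf estimates that must sum to it. Minimizing the variance-weighted squared error under that constraint gives a closed form in which the slack is spread proportionally to each leaf's variance:

```python
def blue_two_leaves(y_r, y_a, y_b, var_a, var_b):
    """Constrained weighted least squares for a root with two leaves:
    minimize (x_a - y_a)^2 / var_a + (x_b - y_b)^2 / var_b
    subject to x_a + x_b = y_r (the root count is known exactly).
    The Lagrange condition assigns the slack in proportion to variance."""
    slack = y_r - y_a - y_b
    total = var_a + var_b
    x_a = y_a + var_a * slack / total
    x_b = y_b + var_b * slack / total
    return x_a, x_b
```

The noisier leaf absorbs more of the correction, which matches the intuition behind the general solution the lemma derives.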
1.3 Covariance analysis
Lemma 3
Suppose we build a Count-Sketch with \(d=1\) row and w columns on a vector x. For any two elements \(u\ne v\), let \(y_u\) and \(y_v\) be the estimators for \(x_u\) and \(x_v\). Then, \({\mathrm {Cov}}(y_u,y_v) = x_u x_v/w\).
Proof
Recall that in the Count-Sketch, for any element v, the estimator is computed as \(y_v = g(v)\sum _{z}g(z)x_zI_v(z)\), where \(I_v(z)=1\) if \(h(v)=h(z)\) and 0 otherwise. So we have
Let \(f(i,j)={\mathrm {E}}[g(u)g(v)g(i)g(j)]x_ix_j{\mathrm {E}}[I_u(i)I_v(j)]\). We find that \(f(u,v)=x_ux_v\) and \(f(v,u)= x_u x_v/w\) (note that the last factor in f(i, j) is not symmetric in i and j). If \(\{i,j\}\ne \{u,v\}\), then \(f(i,j)=0\), since \(g(\cdot )\) is a 4-wise independent hash function. Therefore, we derive that \({\mathrm {Cov}}(y_u,y_v)=x_u x_v/w\). \(\square \)
On the other hand, prior analysis of the Count-Sketch shows that \({\mathrm {Var}}(y_v) = {1\over w} \sum _{i} x_i^2\). Thus, the covariance is usually orders of magnitude smaller than the variance.
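The lemma is easy to check empirically. Below is a minimal Monte Carlo sketch of a one-row Count-Sketch (it uses fully random hash functions, which are in particular 4-wise independent, rather than the paper's hash families):

```python
import random

def one_row_count_sketch(x, w, rng):
    """One trial of a d=1 Count-Sketch over the vector x with w columns,
    using fresh random h (buckets) and g (signs); returns all estimators y."""
    n = len(x)
    h = [rng.randrange(w) for _ in range(n)]
    g = [rng.choice((-1, 1)) for _ in range(n)]
    buckets = [0.0] * w
    for i in range(n):
        buckets[h[i]] += g[i] * x[i]
    return [g[i] * buckets[h[i]] for i in range(n)]

def empirical_cov(x, w, u, v, trials, seed=0):
    """Sample covariance of the estimators y_u, y_v over independent trials."""
    rng = random.Random(seed)
    ys = [one_row_count_sketch(x, w, rng) for _ in range(trials)]
    mu_u = sum(y[u] for y in ys) / trials
    mu_v = sum(y[v] for y in ys) / trials
    return sum((y[u] - mu_u) * (y[v] - mu_v) for y in ys) / trials
```

With enough trials, the sample covariance of two estimators concentrates around \(x_u x_v / w\), in agreement with Lemma 3.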
Cite this article
Luo, G., Wang, L., Yi, K. et al. Quantiles over data streams: experimental comparisons, new analyses, and further improvements. The VLDB Journal 25, 449–472 (2016). https://doi.org/10.1007/s00778-016-0424-7