[go: up one dir, main page]

skip to main content
research-article

Characterizing Directed and Undirected Networks via Multidimensional Walks with Jumps

Published: 23 January 2019 Publication History

Abstract

Estimating distributions of node characteristics (labels) such as number of connections or citizenship of users in a social network via edge and node sampling is a vital part of the study of complex networks. Due to its low cost, sampling via a random walk (RW) has been proposed as an attractive solution to this task. Most RW methods assume either that the network is undirected or that walkers can traverse edges regardless of their direction. Some RW methods have been designed for directed networks where edges coming into a node are not directly observable. In this work, we propose Directed Unbiased Frontier Sampling (DUFS), a sampling method based on a large number of coordinated walkers, each starting from a node chosen uniformly at random. It applies to directed networks with invisible incoming edges because it constructs, in real time, an undirected graph consistent with the walkers trajectories, and its use of random jumps to prevent walkers from being trapped. DUFS generalizes previous RW methods and is suited for undirected networks and to directed networks regardless of in-edge visibility. We also propose an improved estimator of node label distribution that combines information from initial walker locations with subsequent RW observations. We evaluate DUFS, compare it to other RW methods, investigate the impact of its parameters on estimation accuracy and provide practical guidelines for choosing them. In estimating out-degree distributions, DUFS yields significantly better estimates of the head of the distribution than other methods, while matching or exceeding estimation accuracy of the tail. Last, we show that DUFS outperforms uniform sampling when estimating distributions of node labels of the top 10% largest degree nodes, even when sampling a node uniformly has the same cost as RW steps.

References

[1]
Dimitris Achlioptas, Aaron Clauset, David Kempe, and Cristopher Moore. 2009. On the bias of trace route sampling: Or, power-law degree distributions in regular graphs. Journal of the ACM 56, 4 (July 2009), Article 21, 28 pages.
[2]
Nesreen K. Ahmed, Jennifer Neville, and Ramana Kompella. 2014. Network sampling: From static to streaming graphs. ACM Transactions on Knowledge Discovery from Data 8, 2 (2014), 7.
[3]
Nesreen K. Ahmed, Jennifer Neville, and Ramana Rao Kompella. 2012. Network sampling designs for relational classification. In ICWSM.
[4]
Konstantin Avrachenkov, Bruno Ribeiro, and Don Towsley. 2010. Improving Random Walk Estimation Accuracy with Uniform Restarts. Springer, Berlin, 98--109.
[5]
Ziv Bar-Yossef and Maxim Gurevich. 2008. Random sampling from a search engine’s index. Journal of the ACM 55, 5 (2008), 1--74.
[6]
S. Boccaletti, V. Latora, Y. Moreno, M. Chavez, and D.-U. Hwang. 2006. Complex networks: Structure and dynamics. Physics Reports 424, 4--5 (2006), 175--308.
[7]
Flavio Chiericetti, Anirban Dasgupta, Ravi Kumar, Silvio Lattanzi, and Tamás Sarlós. 2016. On sampling nodes in a network. In Proceedings of the 25th International Conference on World Wide Web (WWW’16). International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Switzerland, 471--481.
[8]
Anirban Dasgupta, Ravi Kumar, and D. Sivakumar. 2012. Social sampling. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’12). ACM, New York, NY, 235--243.
[9]
Nathan Eagle, Alex (Sandy) Pentland, and David Lazer. 2009. Inferring friendship network structure by using mobile phone data. Proceedings of the National Academy of Sciences 106, 36 (2009), 15274--15278. arXiv:http://www.pnas.org/content/106/36/15274.full.pdf
[10]
Talya Eden, Shweta Jain, Ali Pinar, Dana Ron, and C. Seshadhri. 2018. Provable and practical approximations for the degree distribution using sublinear graph samples. In Proceedings of the 27th International Conference on World Wide Web (WWW'18). International World Wide Web Conferences Steering Committee. Republic and Canton of Geneva, 449--458.
[11]
Minas Gjoka, Carter T. Butts, Maciej Kurant, and Athina Markopoulou. 2010. Walking in Facebook: A case study of unbiased sampling of OSNs. In Proceedings of IEEE INFOCOM. 1--9.
[12]
Douglas D. Heckathorn. 1997. Respondent-driven sampling: A new approach to the study of hidden populations. Social Problems 44, 2 (1997), 174--199. arXiv:http://socpro.oxfordjournals.org/content/44/2/174.full.pdf
[13]
Douglas D. Heckathorn. 2002. Respondent-driven sampling II: Deriving valid population estimates from chain-referral samples of hidden populations. Social Problems 49, 1 (2002), 11--34. arXiv:http://socpro.oxfordjournals.org/content/49/1/11.full.pdf
[14]
Monika R. Henzinger, Allan Heydon, Michael Mitzenmacher, and Marc Najork. 2000. On near-uniform URL sampling. Computer Networks 33, 1--6 (2000), 295--308.
[15]
Christian Hubler, H.-P. Kriegel, Karsten Borgwardt, and Zoubin Ghahramani. 2008. Metropolis algorithms for representative subgraph sampling. In Proceedings of the 2008 8th IEEE International Conference on Data Mining. 283--292.
[16]
Maciej Kurant, Minas Gjoka, Carter T. Butts, and Athina Markopoulou. 2011. Walking on a graph with a magnifying glass: Stratified sampling via weighted random walks. In ACM SIGMETRICS. ACM, New York, NY, 281--292.
[17]
Maciej Kurant, Athina Markopoulou, and Patrick Thiran. 2011. Towards unbiased BFS sampling. IEEE Journal on Selected Areas in Communications 29, 9 (September 2011), 1799--1809.
[18]
Erich Leo Lehmann and George Casella. 1991. Theory of Point Estimation. Wadsworth 8 Brooks/Cole Advanced Books 8 Software.
[19]
Jure Leskovec and Christos Faloutsos. 2006. Sampling from large graphs. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’06). ACM, New York, NY, 631--636.
[20]
Jure Leskovec and Andrej Krevl. 2014. SNAP Datasets: Stanford Large Network Dataset Collection.
[21]
Jure Leskovec, Kevin J. Lang, Anirban Dasgupta, and Michael W. Mahoney. 2008. Statistical properties of community structure in large social and information networks. In Proceedings of the 17th International Conference on World Wide Web (WWW’08). ACM, New York, NY, 695--704.
[22]
Yifei Ma, Tzu-Kuo Huang, and Jeff G. Schneider. 2015. Active search and bandits on graphs using sigma-optimality. In Proceedings of the Conference on Uncertainty in Artificial Intelligence. 542--551.
[23]
Arun S. Maiya and Tanya Y. Berger-Wolf. 2011. Benefits of bias: Towards better characterization of network sampling. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’11). ACM, New York, NY, 105--113.
[24]
Andrew McGregor. 2014. Graph stream algorithms: A survey. ACM SIGMOD Record 43, 1 (2014), 9--20.
[25]
Alan Mislove, Massimiliano Marcon, Krishna P. Gummadi, Peter Druschel, and Bobby Bhattacharjee. 2007. Measurement and analysis of online social networks. In Proceedings of the 7th ACM SIGCOMM Conference on Internet Measurement (IMC’07). ACM, New York, NY, 29--42.
[26]
Fabricio Murai, Bruno Ribeiro, Don Towsley, and Pinghui Wang. 2013. On set size distribution estimation and the characterization of large networks via sampling. IEEE Journal on Selected Areas in Communications 31, 6 (June 2013), 1017--1025.
[27]
Fabricio Murai, Bruno Ribeiro, Don Towsley, and Pinghui Wang. 2018. Characterizing Directed and Undirected Networks via Multidimensional Walks with Jumps. Technical Report. arXiv:1703.08252.
[28]
Amir H. Rasti, Mojtaba Torkjazi, Reza Rejaie, Nick Duffield, Walter Willinger, and Daniel Stutzbach. 2009. Respondent-driven sampling for characterizing unstructured overlays. In Proceedings of the IEEE INFOCOM. 2701--2705.
[29]
Bruno Ribeiro, William Gauvin, Benyuan Liu, and Don Towsley. 2010. On MySpace account spans and double pareto-like distribution of friends. In Proceedings of INFOCOM IEEE Conference on Computer Communications Workshops. 1--6.
[30]
Bruno Ribeiro and Don Towsley. 2010. Estimating and sampling graphs with multidimensional random walks. In Proceedings of the 10th ACM SIGCOMM Conference on Internet Measurement (IMC’10). ACM, New York, NY, 390--403.
[31]
Bruno Ribeiro and Don Towsley. 2012. On the estimation accuracy of degree distributions from graph sampling. In Proceedings of the 51st IEEE Conference on Decision and Control (CDC’12). 5240--5247.
[32]
Bruno Ribeiro, Pinghui Wang, Fabricio Murai, and Don Towsley. 2012. Sampling directed graphs with random walks. In Proceedings of the IEEE INFOCOM. 1692--1700.
[33]
M. Salehi and H. R. Rabiee. 2013. A measurement framework for directed networks. IEEE Journal on Selected Areas in Communications 31, 6 (June 2013), 1007--1016.
[34]
Matthew J. Salganik and Douglas D. Heckathorn. 2004. Sampling and estimation in hidden populations using respondent-driven sampling. Sociological Methodology 34 (2004), 193--239.
[35]
Daniel Stutzbach, Rea Rejaie, Nick Duffield, Subhabrata Sen, and Walter Willinger. 2009. On unbiased sampling for unstructured peer-to-peer networks. IEEE/ACM Transactions on Networking 17, 2 (April 2009), 377--390.
[36]
Erik Volz and Douglas D. Heckathorn. 2008. Probability based estimation theory for respondent driven sampling. Journal of Official Statistics 24, 1 (03 2008), 79.
[37]
Pinghui Wang, Junzhou Zhao, Bruno Ribeiro, John C. S. Lui, Don Towsley, and Xiaohong Guan. 2018. Fast crawling methods of exploring content distributed over large graphs. Knowledge and Information Systems (2018), 26 pages.
[38]
Pinghui Wang, Junzhou Zhao, Bruno Ribeiro, John C. S. Lui, Don Towsley, and Xiaohong Guan. 2018. Practical characterization of large networks using neighborhood information. Knowledge and Information Systems (2018), 28 pages.
[39]
Zhuojie Zhou, Nan Zhang, Zhiguo Gong, and Gautam Das. 2016. Faster random walks by rewiring online social networks on-the-fly. ACM Transactions on Database Systems 40, 4 (January 2016), Article 26, 36 pages.

Cited By

View all
  • (2021)Sampling Based Estimation of In-Degree Distribution for Directed Complex NetworksJournal of Computational and Graphical Statistics10.1080/10618600.2021.187314330:4(863-876)Online publication date: 12-Mar-2021

Index Terms

  1. Characterizing Directed and Undirected Networks via Multidimensional Walks with Jumps

        Recommendations

        Comments

        Information & Contributors

        Information

        Published In

        cover image ACM Transactions on Knowledge Discovery from Data
        ACM Transactions on Knowledge Discovery from Data  Volume 13, Issue 1
        February 2019
        340 pages
        ISSN:1556-4681
        EISSN:1556-472X
        DOI:10.1145/3301280
        Issue’s Table of Contents
        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        Published: 23 January 2019
        Accepted: 01 November 2018
        Revised: 01 November 2018
        Received: 01 January 2017
        Published in TKDD Volume 13, Issue 1

        Permissions

        Request permissions for this article.

        Check for updates

        Author Tags

        1. Complex networks
        2. directed networks
        3. graph sampling
        4. random walks

        Qualifiers

        • Research-article
        • Research
        • Refereed

        Funding Sources

        • National Council for Scientific and Technological Development - Brazil
        • Army Research Laboratory under Cooperative Agreement
        • Universidade Federal de Minas Gerais under the Programa Institucional de Auxílio à Pesquisa de Docentes Recém-Contratados
        • MURI ARO

        Contributors

        Other Metrics

        Bibliometrics & Citations

        Bibliometrics

        Article Metrics

        • Downloads (Last 12 months)8
        • Downloads (Last 6 weeks)1
        Reflects downloads up to 06 Oct 2024

        Other Metrics

        Citations

        Cited By

        View all
        • (2021)Sampling Based Estimation of In-Degree Distribution for Directed Complex NetworksJournal of Computational and Graphical Statistics10.1080/10618600.2021.187314330:4(863-876)Online publication date: 12-Mar-2021

        View Options

        Get Access

        Login options

        Full Access

        View options

        PDF

        View or Download as a PDF file.

        PDF

        eReader

        View online with eReader.

        eReader

        HTML Format

        View this article in HTML Format.

        HTML Format

        Media

        Figures

        Other

        Tables

        Share

        Share

        Share this Publication link

        Share on social media