Assessing and Improving an Evaluation Dataset for Detecting Semantic Code Clones via Deep Learning

Published: 28 July 2022

Abstract

In recent years, applying deep learning to detect semantic code clones has received substantial attention from the research community. Accordingly, various benchmark datasets, the most popular being BigCloneBench, have been constructed and used to assess and compare deep learning models for detecting semantic clones. However, no study has investigated whether a benchmark dataset such as BigCloneBench is properly used to evaluate such models. In this article, we present an experimental study showing that semantic clone pairs in BigCloneBench typically share identifier names, whereas non-semantic-clone pairs do not. We then propose an undesirable-by-design Linear-Model that considers only which identifiers appear in a code fragment; when evaluated on BigCloneBench, this model achieves high effectiveness for detecting semantic clones, comparable even to state-of-the-art deep learning models recently proposed for the task. To alleviate these issues, we abstract a subset of the identifier names (including type, variable, and method names) in BigCloneBench to produce AbsBigCloneBench, and we use AbsBigCloneBench to better assess the effectiveness of deep learning models at detecting semantic clones.
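
As a rough illustration of the two ideas in the abstract, the Python sketch below measures how many identifier names two code fragments share and shows one way names could be abstracted away. It is not the authors' Linear-Model or their abstraction procedure: the regex tokenizer, the abbreviated keyword list, the Jaccard measure, the function names, and the `idN` placeholder scheme are all assumptions made for illustration only.

```python
import re

# Assumption: a short, incomplete keyword list. A real tool would use a
# Java parser to separate keywords, literals, and identifiers.
JAVA_KEYWORDS = frozenset({
    "int", "long", "double", "boolean", "void", "return", "if", "else",
    "for", "while", "new", "public", "private", "static", "class",
})

# Crude identifier pattern; it also matches words inside string literals
# and comments, which a parser-based approach would avoid.
IDENT_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def identifiers(code):
    """Return the set of non-keyword identifier names in a fragment."""
    return {tok for tok in IDENT_RE.findall(code) if tok not in JAVA_KEYWORDS}


def shared_identifier_ratio(a, b):
    """Jaccard overlap of the identifier sets of two fragments."""
    ia, ib = identifiers(a), identifiers(b)
    union = ia | ib
    return len(ia & ib) / len(union) if union else 0.0


def abstract_identifiers(code):
    """Replace each distinct identifier with id0, id1, ... in order of
    first appearance, leaving keywords and punctuation untouched."""
    mapping = {}

    def replace(match):
        tok = match.group(0)
        if tok in JAVA_KEYWORDS:
            return tok
        if tok not in mapping:
            mapping[tok] = f"id{len(mapping)}"
        return mapping[tok]

    return IDENT_RE.sub(replace, code)


if __name__ == "__main__":
    a = "int total = copyFile(src, dst);"
    b = "copyFile(src, out);"
    # Shared names: copyFile, src; all names: total, copyFile, src, dst, out
    print(shared_identifier_ratio(a, b))  # 0.4
    print(abstract_identifiers("int sum = a + b; return sum;"))
    # int id0 = id1 + id2; return id0;
```

A classifier that sees only the output of `identifiers` has access to exactly the name-overlap signal the abstract describes. After a step like `abstract_identifiers`, the original names are gone, so a model has to rely on structure rather than vocabulary.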

    Published In

    ACM Transactions on Software Engineering and Methodology, Volume 31, Issue 4
    October 2022
    867 pages
    ISSN: 1049-331X
    EISSN: 1557-7392
    DOI: 10.1145/3543992
    Editor: Mauro Pezzè

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 28 July 2022
    Online AM: 25 June 2022
    Accepted: 01 November 2021
    Revised: 01 September 2021
    Received: 01 January 2021
    Published in TOSEM Volume 31, Issue 4

    Author Tags

    1. Code clone detection
    2. deep learning
    3. dataset collection

    Qualifiers

    • Research-article
    • Refereed

    Funding Sources

    • Key-Area Research and Development Program of Guangdong Province
    • National Natural Science Foundation of China
    • Tencent Foundation or XPLORER PRIZE
