research-article

Assessing and Improving an Evaluation Dataset for Detecting Semantic Code Clones via Deep Learning

Authors:

Qianxiang Wang,

Tao Xie

ACM Transactions on Software Engineering and Methodology (TOSEM), Volume 31, Issue 4

Article No.: 62, Pages 1–25

https://doi.org/10.1145/3502852

Published: 28 July 2022

Abstract

In recent years, applying deep learning to detect semantic code clones has received substantial attention from the research community. Accordingly, various evaluation benchmark datasets, the most popular being BigCloneBench, have been constructed and adopted to assess and compare deep learning models for detecting semantic clones. However, no study has investigated whether an evaluation benchmark dataset such as BigCloneBench is properly used to evaluate models for detecting semantic code clones. In this article, we present an experimental study showing that semantic clone pairs in BigCloneBench typically use the same identifier names, whereas non-semantic-clone pairs do not. Subsequently, we propose an undesirable-by-design Linear-Model that considers only which identifiers appear in a code fragment; when evaluated on BigCloneBench, this model achieves high effectiveness for detecting semantic clones, comparable even to state-of-the-art deep learning models recently proposed for the task. To alleviate these issues, we abstract a subset of the identifier names (including type, variable, and method names) in BigCloneBench to produce AbsBigCloneBench, and we use AbsBigCloneBench to better assess the effectiveness of deep learning models on the task of detecting semantic clones.
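The two ideas in the abstract, judging clones only by which identifiers appear and abstracting identifier names away, can be illustrated with a small Python sketch. This is not the authors' Linear-Model, nor the procedure used to build AbsBigCloneBench. The helper names (`identifiers`, `abstract_identifiers`, `jaccard`), the regex tokenizer, the keyword subset, and the `ID<n>` placeholder scheme are all assumptions made for illustration. The article abstracts only a subset of identifier names, whereas this sketch abstracts every non-keyword identifier.

```python
import re

# Assumption: a small, incomplete subset of Java keywords, enough for the demo.
JAVA_KEYWORDS = {
    "int", "long", "double", "boolean", "char", "void", "return", "for",
    "while", "if", "else", "new", "class", "public", "private", "static",
    "final", "try", "catch", "throw", "throws", "null", "true", "false",
}

# Assumption: a plain regex stands in for a real Java lexer.
IDENT_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def identifiers(code):
    """Return the set of non-keyword identifier names in a code fragment."""
    return {tok for tok in IDENT_RE.findall(code) if tok not in JAVA_KEYWORDS}


def abstract_identifiers(code):
    """Rename each distinct identifier to ID0, ID1, ... in order of first use."""
    mapping = {}

    def rename(match):
        tok = match.group(0)
        if tok in JAVA_KEYWORDS:
            return tok  # keywords are left untouched
        if tok not in mapping:
            mapping[tok] = f"ID{len(mapping)}"
        return mapping[tok]

    return IDENT_RE.sub(rename, code)


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


if __name__ == "__main__":
    # Hypothetical fragments: the first two share names, the third does not.
    frag_a = "int sum(int[] xs) { int total = 0; for (int x : xs) total += x; return total; }"
    frag_b = "int sum(int[] xs) { int total = 0; int i = 0; while (i < xs.length) total += xs[i++]; return total; }"
    frag_c = "boolean isEmpty(String text) { return text == null || text.length() == 0; }"

    # Name overlap alone separates the pairs before abstraction.
    print(jaccard(identifiers(frag_a), identifiers(frag_b)))
    print(jaccard(identifiers(frag_a), identifiers(frag_c)))

    # After abstraction the original names are gone from each fragment.
    print(abstract_identifiers(frag_a))
    print(abstract_identifiers(frag_c))
```

Placeholders are assigned per fragment, so after abstraction a shared name such as `total` no longer signals that two fragments are related. A detector then has to rely on structure rather than vocabulary.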

Index Terms

1. Software and its engineering
  1. Software creation and management
    1. Software post-development issues
      1. Maintaining software

Published In

ACM Transactions on Software Engineering and Methodology Volume 31, Issue 4

October 2022

867 pages

ISSN:1049-331X

EISSN:1557-7392

DOI:10.1145/3543992

Editor:
Mauro Pezzè
USI Università della Svizzera italiana and SIT Schaffhausen Institute of Technology, Switzerland

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 28 July 2022

Online AM: 25 June 2022

Accepted: 01 November 2021

Revised: 01 September 2021

Received: 01 January 2021

Published in TOSEM Volume 31, Issue 4

Qualifiers

Research-article
Refereed

Funding Sources

Key-Area Research and Development Program of Guangdong Province
National Natural Science Foundation of China
Tencent Foundation or XPLORER PRIZE
