Call for Papers
The International Conference on Software Engineering (ICSE) is the premier forum for presenting and discussing the most recent and significant technical research contributions in the field of Software Engineering. In the research track, we invite high-quality submissions of technical research papers describing original and unpublished results of software engineering research.
ICSE 2025 will follow a dual deadline structure introduced in 2024. In other words, submissions will occur in two cycles. Please refer to the section on Dual Submission Cycles in the following for the information.
NEW THIS YEAR #1: Due to the rapid growth of the area of “AI and Software Engineering”, it is now split into two: “AI for Software Engineering” and “Software Engineering for AI”. A new area “Architecture and Design” is introduced. The topics listed under each area have also been revised. Please see the “Research Areas” section below.
NEW THIS YEAR #2: We add back opportunities for “Author Response” in addition to “Revision” so that some potential misunderstandings can be clarified in the review process for papers that otherwise would be rejected. Also, for a paper receiving a “Revision” outcome, authors will be given an additional page of text in the revised paper to accommodate the required changes specified in the reviews.
NEW THIS YEAR #3: Submissions must follow the latest “IEEE Submission and Peer Review Policy” and “ACM Policy on Authorship” (with associated FAQ, which includes a policy regarding the use of generative AI tools and technologies, such as ChatGPT. After checking with the ICSE Steering Committee, we are piloting a human-in-the-loop automated process to identify AI-generated papers. A Review Process Co-Chair has volunteered to design and run this pilot process on submitted papers. To preserve confidentiality, when scanning submitted papers, the scripts will not make use of any third-party services.
NEW THIS YEAR #4: IEEE Transactions on Software Engineering, ACM Transactions on Software Engineering and Methodology and ICSE 2025, have received approval from the ICSE Steering Committee to launch the Sustainable Community Review Effort (SCRE) program, aimed at reducing community effort in reviewing journal extensions of conference papers and allowing authors to get faster and more consistent feedback. More information is available at: http://tinyurl.com/icse25-scre
NEW THIS YEAR #5: ICSE Steering Committee has recently approved a proposal for streamlining and enhancing the paper bidding and assignment process, aimed at reducing the workload of PC members and resulting in better assignments of papers. Two Review Process Co-Chairs have volunteered to help manage the updated bidding and assignment process. More information is available at: http://tinyurl.com/icse25-streamlining
NEW THIS YEAR #6: ICSE Steering Committee has recently approved a proposal for Shadow PC. Shadow PC is a mentoring program to train early-career researchers (PhD students, postdocs, new faculty members, and industry practitioners) in the review process of the technical track. For Cycle 2, for the first time, authors of ICSE submissions can opt-in for their papers to be considered for review in the Shadow PC track. Shadow reviews for papers that are reviewed by the Shadow PC will be sent out to authors after the end of the actual review process; shadow reviews will not affect the official decision made by the regular PC. More detailed information about the program is available at: http://tinyurl.com/icse25-shadowpc
Research Areas
ICSE welcomes submissions addressing topics across the full spectrum of Software Engineering, being inclusive of quantitative, qualitative, and mixed-methods research. Topics of interest include the following and are grouped into the following nine research areas. Please note that these topics are by no means exhaustive.
Each submission will need to indicate one of these nine areas as the chosen area. Optionally, the authors can consider adding an additional area. A paper may be moved from the chosen area(s) to another focus area at the discretion of the program chairs. Program chairs will ultimately assign a paper to an area chair, considering the authors’ selection, the paper’s content, and other factors such as (if applicable) possible conflicts of interest.
AI for Software Engineering
- AI-enabled recommender systems for automated SE (e.g., code generation, program repair, AIOps, software composition analysis, etc.)
- Human-centered AI for SE (e.g., how software engineers can synergistically work with AI agents)
- Trustworthy AI for SE (e.g., how to provide guarantees, characterize limits, and prevent misuse of AI for SE)
- Sustainable AI for SE (e.g., how to reduce energy footprint for greener AI for SE)
- Collaborative AI for SE (e.g., how AI agents collaborate for automating SE)
- Automating SE tasks with LLM and other foundation models (e.g., large vision model)
- Efficacy measurement beyond traditional metrics (e.g., accuracy, BLEU, etc.)
- Prompt engineering for SE (e.g., novel prompt design)
- AI-assisted software design and model driven engineering (e.g., specification mining, program synthesis, software architectural design)
Analytics
- Mining software repositories, including version control systems, issue tracking systems, software ecosystems, configurations, app stores, communication platforms, and novel software engineering data sources, to generate insights through various research methods
- Software visualization
- Data-driven user experience understanding and improvement
- Data driven decision making in software engineering
- Software metrics (and measurements)
Architecture and Design
- Architecture and design measurement and assessment
- Software design methodologies, principles, and strategies
- Theory building for/of software design
- Architecture quality attributes, such as security, privacy, performance, reliability
- Modularity and reusability
- Design and architecture modeling and analysis
- Architecture recovery
- Dependency and complexity analysis
- Distributed architectures, such as microservice, SOA, cloud computing
- Patterns and anti-patterns
- Technical debt in design and architecture
- Architecture refactoring
- Adaptive architectures
- Architecture knowledge management
Dependability and Security
- Formal methods and model checking (excluding solutions focusing solely on hardware)
- Reliability, availability, and safety
- Resilience and antifragility
- Confidentiality, integrity, privacy, and fairness
- Performance
- Design for dependability and security
- Vulnerability detection to enhance software security
- Dependability and security for embedded and cyber-physical systems
Evolution
- Evolution and maintenance
- API design and evolution
- Release engineering and DevOps
- Software reuse
- Refactoring and program differencing
- Program comprehension
- Reverse engineering
- Environments and software development tools
- Traceability to understand evolution
Human and Social Aspects
- Focusing on individuals (from program comprehension, workplace stress to job satisfaction and career progression)
- Focusing on teams (e.g., collocated, distributed, global, virtual; communication and collaboration within a team), communities (e.g., open source, communities of practice) and companies (organization, economics)
- Focusing on society (e.g., sustainability; diversity and inclusion)
- Focusing on programming languages, environments, and tools supporting individuals, teams, communities, and companies.
- Focusing on software development processes
Requirements and Modeling
- Requirements engineering (incl. non-functional requirements)
- Theoretical requirement foundations
- Requirements and architecture
- Feedback, user and requirements management
- Requirements traceability and dependencies
- Modeling and model-driven engineering
- Variability and product lines
- Systems and software traceability
- Modeling languages, techniques, and tools
- Empirical studies on the application of model-based engineering
- Model-based monitoring and analysis
Software Engineering for AI
- SE for AI models
- SE for systems with AI components
- SE for AI code, libraries, and datasets
- Engineering autonomic systems and self-healing systems
- Automated repair of AI models
- Testing and verification of AI-based systems
- Validation and user-based evaluation of AI-based systems
- Requirements engineering for AI-based systems
Testing and Analysis
- Software testing
- Automated test generation techniques such as fuzzing, search-based approaches, and symbolic execution
- Testing and analysis of non-functional properties
- GUI testing
- Mobile application testing
- Program analysis
- Program synthesis (e.g., constraint based techniques)
- Program repair
- Debugging and fault localization
- Runtime analysis and/or error recovery
Scope
Since the authors will choose an area for their submission, the scope of each area becomes important. Some submissions may relate to multiple areas. In such cases, the authors should choose the area for which their paper brings the maximum new insights. Moreover, authors also have the choice of indicating an alternate area for each paper.
Similarly, for certain papers. authors may have a question whether it belongs to any area, or is simply out of scope. For such cases, we recommend the authors to judge whether their paper brings new insights for software engineering. As an example, a formal methods paper with a focus on hardware verification may be deemed out of scope for ICSE. In general, papers that only peripherally concern software engineering and do not give new insights from the software engineering perspective would be less relevant to ICSE. Our goal is, however, to be descriptive, rather than prescriptive, to enable authors to make their own decisions about relevance.
Dual Submission Cycles
Similar to ICSE 2024, we will have two submission cycles as follows:
First submission cycle
- (Mandatory) Abstract: Mar 15, 2024
- Submission: Mar 22, 2024
- Author response period (3 days): Jun 10-13, 2024
- Notification: Jul 5, 2024
- Revision due: Aug 2, 2024
- Camera-ready (of directly accepted papers): Aug 16, 2024
- Final decision (of revised papers): Nov 1, 2024
- Camera-ready (of accepted revised papers): Dec 13, 2024
Second submission cycle
- (Mandatory) Abstract: Jul 26, 2024
- Submission: Aug 2, 2024
- Author response period (3 days): Oct 7-10, 2024
- Notification: Nov 1, 2024
- Revision due: Nov 29, 2024
- Camera-ready (of directly accepted papers): Dec 13, 2024
- Final decision (of revised papers): Jan 22, 2025
- Camera-ready (of accepted revised papers): Feb 12, 2025
All dates are 23:59:59 AoE (UTC-12h).
Review Criteria
Each paper submitted to the Research Track will be evaluated based on the following criteria:
i) Novelty: The novelty and innovativeness of contributed solutions, problem formulations, methodologies, theories, and/or evaluations, i.e., the extent to which the paper is sufficiently original with respect to the state-of-the-art.
ii) Rigor: The soundness, clarity, and depth of a technical or theoretical contribution, and the level of thoroughness and completeness of an evaluation.
iii) Relevance: The significance and/or potential impact of the research on the field of software engineering.
iv) Verifiability and Transparency: The extent to which the paper includes sufficient information to understand how an innovation works; to understand how data was obtained, analyzed, and interpreted; and how the paper supports independent verification or replication of the paper’s claimed contributions. Any artifacts attached to or linked from the paper will be checked by one reviewer.
v) Presentation: The clarity of the exposition in the paper.
Reviewers will carefully consider all of the above criteria during the review process, and authors should take great care in clearly addressing them all. The paper should clearly explain and justify the claimed contributions. Each paper will be handled by an area chair who will ensure reviewing consistency among papers submitted within that area.
The outcome of each paper will be one of the following Accept, Revision, Reject. We now elaborate on the Revision outcome in the following.
Revisions
Papers submitted can go through revisions in response to specific revision requests made by the reviewers. Authors of papers receiving a Revision decision are expected to submit the revised papers, as well as the revised papers with changes marked in a different color, such as using LaTeXdiff. The authors also need to submit an “Author Response” document capturing the authors’ response to each reviewer’s comment and how those comments were addressed in the revision. This is similar to the “Summary of Changes and Response” document that is typically submitted by authors for a journal paper’s major revision. Authors may use the revision opportunity to revise and improve the paper, but should not use this to submit a substantially different paper. The reviewers will check the revised paper against the original paper and the suggested changes. Revised papers will be examined by the same set of reviewers. An unsatisfactory revised paper will be rejected. Authors are given 4 weeks to submit the revised papers. This is 1 week less than prior year as we reallocate the time to (i) add authors’ rebuttal, and (ii) provide more time for PC members to complete their reviews (to reduce reviewing fatigue considering the high workload to review increasing number of submissions). Authors are given an additional page of text in a revised paper to accommodate the required changes specified in the reviews.
Re-submissions of Rejected Papers
Authors of papers which receive a REJECT decision in the first submission cycle are strongly discouraged from re-submitting it to the second submission cycle. However, in exceptional cases where the reviewers evidently misunderstood their paper, upon approval of PC Chairs, authors can re-submit their paper to the second submission cycle with a “Clarifications and Summary of Improvements” document stating how they have changed the paper. They should also include the past reviews as part of this document, for completeness. These papers will be treated as new submissions, which may or may not get the same set of reviewers at the discretion of the PC chairs. Authors who try to bypass this guideline (e.g., by changing the paper title without significantly changing paper content, or by making small changes to the paper content) will have their papers desk-rejected by the PC chairs without further consideration.
Submission Process
Submissions must conform to the IEEE conference proceedings template, specified in the IEEE Conference Proceedings Formatting Guidelines (title in 24pt font and full text in 10pt type, LaTeX users must use \documentclass[10pt,conference]{IEEEtran} without including the compsoc or compsocconf options). Note that IEEE format is being used this year, whereas last year it was ACM format, hence the appearance will differ from year to year.
- All submissions must not exceed 10 pages for the main text, inclusive of all figures, tables, appendices, etc. Two more pages containing only references are permitted. All submissions must be in PDF. Accepted papers will be allowed one extra page for the main text of the camera-ready version.
- Submissions must strictly conform to the IEEE conference proceedings formatting instructions specified above. Alterations of spacing, font size, and other changes that deviate from the instructions may result in desk rejection without further review.
- By submitting to the ICSE Technical Track, authors acknowledge that they are aware of and agree to be bound by the ACM Policy and Procedures on Plagiarism and the IEEE Plagiarism FAQ. In particular, papers submitted to ICSE 2025 must not have been published elsewhere and must not be under review or submitted for review elsewhere whilst under consideration for ICSE 2025. Contravention of this concurrent submission policy will be deemed a serious breach of scientific ethics, and appropriate action will be taken in all such cases. To check for double submission and plagiarism issues, the chairs reserve the right to (1) share the list of submissions with the PC Chairs of other conferences with overlapping review periods and (2) use external plagiarism detection software, under contract to the ACM or IEEE, to detect violations of these policies.
- If the research involves human participants/subjects, the authors must adhere to the ACM Publications Policy on Research Involving Human Participants and Subjects. Upon submitting, authors will declare their compliance with such a policy. Alleged violations of this policy or any ACM Publications Policy will be investigated by ACM and may result in a full retraction of your paper, in addition to other potential penalties, as per ACM Publications Policy.
- Please ensure that you and your co-authors obtain an ORCID ID, so you can complete the publishing process for your accepted paper. ACM and IEEE have been involved in ORCID and may collect ORCID IDs from all published authors. We are committed to improve author discoverability, ensure proper attribution and contribute to ongoing community efforts around name normalization; your ORCID ID will help in these efforts.
- The ICSE 2025 Research Track will employ a double-anonymous review process. Thus, no submission may reveal its authors’ identities. The authors must make every effort to honor the double-anonymous review process. In particular:
- Authors’ names must be omitted from the submission.
- All references to the author’s prior work should be in the third person.
- While authors have the right to upload preprints on ArXiV or similar sites, they must avoid specifying that the manuscript was submitted to ICSE 2025.
- All communication with the program committee must go through the program committee chairs. Do not contact individual program committee members regarding your submission.
-
Further advice, guidance, and explanation about the double-anonymous review process can be found on the Q&A page.
- By submitting to the ICSE Research Track, authors acknowledge that they conform to the authorship policy of the IEEE, submission policy of the IEEE, and the authorship policy of the ACM (and associated FAQ. This includes following these points related to the use of Generative AI:
- “Generative AI tools and technologies, such as ChatGPT, may not be listed as authors of an ACM published Work. The use of generative AI tools and technologies to create content is permitted but must be fully disclosed in the Work. For example, the authors could include the following statement in the Acknowledgements section of the Work: ChatGPT was utilized to generate sections of this Work, including text, tables, graphs, code, data, citations, etc.). If you are uncertain about the need to disclose the use of a particular tool, err on the side of caution, and include a disclosure in the acknowledgements section of the Work.” - ACM
- “The use of artificial intelligence (AI)–generated text in an article shall be disclosed in the acknowledgements section of any paper submitted to an IEEE Conference or Periodical. The sections of the paper that use AI-generated text shall have a citation to the AI system used to generate the text.” - IEEE
- “If you are using generative AI software tools to edit and improve the quality of your existing text in much the same way you would use a typing assistant like Grammarly to improve spelling, grammar, punctuation, clarity, engagement or to use a basic word processing system to correct spelling or grammar, it is not necessary to disclose such usage of these tools in your Work.” - ACM
Submissions to the Technical Track that meet the above requirements can be made via the Research Track submission site by the submission deadline. Any submission that does not comply with these requirements may be desk rejected without further review.
Submission site: https://icse2025.hotcrp.com/
We encourage the authors to upload their paper info early (and can submit the PDF later) to properly enter conflicts for double-anonymous reviewing. It is the sole responsibility of the authors to ensure that the formatting guidelines, double anonymous guidelines, and any other submission guidelines are met at the time of paper submission.
Open Science Policy
The research track of ICSE 2025 is governed by the ICSE 2025 Open Science policies. The guiding principle is that all research results should be accessible to the public and, if possible, empirical studies should be reproducible. In particular, we actively support the adoption of open artifacts and open source principles. We encourage all contributing authors to disclose (anonymized and curated) data/artifacts to increase reproducibility and replicability. Note that sharing research artifacts is not mandatory for submission or acceptance. However, sharing is expected to be the default, and non-sharing needs to be justified. We recognize that reproducibility or replicability is not a goal in qualitative research and that, similar to industrial studies, qualitative studies often face challenges in sharing research data. For guidelines on how to report qualitative research to ensure the assessment of the reliability and credibility of research results, see this curated Q&A page.
Upon submission to the research track, authors are asked
- to make their artifact available to the program committee (via upload of supplemental material or a link to an anonymous repository) – and provide instructions on how to access this data in the paper; or
- to include in the submission an explanation as to why this is not possible or desirable; and
- to indicate in the submission why they do not intend to make their data or study materials publicly available upon acceptance, if that is the case. The default understanding is that the data and/or other artifacts will be publicly available upon acceptance of a paper.
Withdrawing a Paper
Authors can withdraw their paper at any moment until the final decision has been made, through the paper submission system. Resubmitting the paper to another venue before the final decision has been made without withdrawing from ICSE 2025 first is considered a violation of the concurrent submission policy, and will lead to automatic rejection from ICSE 2025 as well as any other venue adhering to this policy. Such violations may also be reported to appropriate organizations e.g. ACM and IEEE.
Conference Attendance Expectation
If a submission is accepted, at least one author of the paper is required to register for ICSE 2025 and present the paper. We are assuming that the conference will be in-person, and if it is virtual or hybrid, virtual presentations may be possible. These matters will be discussed with the authors closer to the date of the conference.
Accepted papers
The proceedings are published by the IEEE. As of February, papers from the first cycle are available at the IEEE Computer Society Digital Library. Later on the remaining papers will appear there; they will also appear in the IEEE
The following papers have been accepted so far in the ICSE 2025 Research Track. The papers are will be published by the IEEE and appear in the IEEE and ACM digital libraries, subject to an author submitting their camera-ready and copyright forms, and registering to attend the conference. (Authors are required to present the papers at the conference, otherwise they will be withdrawn).
Zhiyong Wu, Jie Liang, Jingzhou Fu, Mingzhe Wang, Yu Jiang, "Puppy: Finding Performance Degradation Bugs in DBMSs via Limited-Optimization Plan Construction"
Abstract: Database management systems (DBMSs) consistently strive for enhanced performance. For a given query, the optimizer of a DBMS aims to construct an optimal execution plan that incorporates multiple optimization operations. However, the resulting plan may sometimes perform worse than even if no optimizations were applied. This occurs because the interactions between optimizations are complex and some situations might be overlooked in the implementation. We refer to these issues as Performance Degradation Bugs (PDBs). PDBs can result in significant consequences from decreased system efficiency and prolonged query processing times to potential disruptions in critical business operations. In this paper, we present Puppy, an automated approach for detecting PDBs in DBMSs using limited-optimization plan construction. The key idea is to compare the performance with the plan generated with all optimization operations enabled, against the plan generated with only a subset of optimization operations in the same DBMS. If the response time of the plan with the limited optimization set is shorter than that of the fully optimized plan, it indicates a potential PDB. Specifically, Puppy first generates queries that incorporate multiple optimization sequences, guided by optimization operation sequence coverage. Secondly, Puppy analyzes the query plan and selectively disables specific optimizations to construct the limited optimization plan. We evaluate Puppy on five widely-used DBMSs, namely MySQL, Percona, TiDB, PolarDB, and PostgreSQL against the state-of-the-art DBMS performance testing tools APOLLO and AMOEBA. Puppy detected 26 and 25 more performance anomalies, covered 151,201 and 173,798 more branches than APOLLO and AMOEBA in 48 hours, respectively. More importantly, Puppy reports 62 PDBs, with 54 anomalies confirmed as previously unknown bugs.
Tags: "Databases", "Prog Comprehension/Reeng/Maint"Chun Li, Hui Li, Zhong Li, Minxue Pan, Xuandong Li, "Enhancing Fault Localization in Industrial Software Systems via Contrastive Learning"
Abstract: Engineers utilize logs as a primary resource for fault localization in large-scale software and system testing, a process that is notoriously time-consuming, costly, and labor-intensive. Despite considerable progress in automated fault localization approaches, their applicability remains limited in such settings, due to the unavailability of fine-grained features in logs essential for most existing fault localization methods. In response, we introduce FALCON, a novel log-based fault localization framework. FALCON organizes complex semantic log information into graphical representations and employs contrastive learning to capture the differences between passed and failed logs, enabling the identification of crucial fault-related features. It also incorporates a specifically designed transitive analysis-based adaptive graph augmentation to minimize the influence of fault-unrelated log information on contrastive learning. Through extensive evaluations against 34 spectrum-based and 4 learning-based fault localization methods, FALCON demonstrates superior performance by outperforming all the methods in comparison. In addition, FALCON demonstrated its practical value by successfully identifying 71 out of 90 faults with a file-level Top-1 accuracy rate during a one-month deployment within a global company’s testing system.
Tags: "Prog Comprehension/Reeng/Maint", "Analysis", "Other: Visualization"Wenqian Deng, Zhiyong Wu, Jie Liang, Jingzhou Fu, Mingzhe Wang, Yu Jiang, "Coni: Detecting Database Connector Bugs via State-Aware Test Case Generation"
Abstract: Database connectors are widely used in many applications to facilitate flexible and convenient database interactions. Potential vulnerabilities in database connectors can lead to various abnormal behaviors within applications, such as returning incorrect results or experiencing unexpected connection interruption. However, existing fuzzing works cannot be directly applied to testing database connectors as they mainly focus on SQL generation and use a small subset of database connector interfaces to execute SQLs. Due to a lack of domain knowledge, automated test case generation also struggles to generate complex test cases that explore connectors' deep logic. The main challenge in testing database connectors is to generate semantically correct test cases that can trigger a wide range of connector state transitions. To address that, we propose CONI, a framework designed for detecting logic bugs of database connectors with state-aware test case generation. First, we define the database connector state model by analyzing the corresponding specification. Building upon this model, CONI generates interface call sequences within test cases to encompass more connector state transitions. After that, CONI generates suitable parameter values based on the parameter information and contextual information collected during runtime. Then the test cases are executed on a target and a reference database connector. Inconsistent results indicate potential logic bugs. We evaluate CONI on 5 widely-used JDBC database connectors, namely MySQL Connector/J, MariaDB Connector/J, AWS JDBC Driver for MySQL, PGJDBC NG, and PostgreSQL JDBC Driver. In total, CONI successfully detected 44 previously unknown bugs, of which 34 have been confirmed.
Tags: "Testing and Quality", "Databases"Gong Chen, Xiaoyuan Xie, Daniel Tang, Qi Xin, Wenjie Liu, "HedgeCode: A Multi-Task Hedging Contrastive Learning Framework for Code Search"
Abstract: Code search is a vital activity in software engineering, focused on identifying and retrieving the correct code snippets based on a query provided in natural language. Approaches based on deep learning techniques have been increasingly adopted for this task, enhancing the initial representations of both code and its natural language descriptions. Despite this progress, there remains an unexplored gap in ensuring consistency between the representation spaces of code and its descriptions. Furthermore, existing methods have not fully leveraged the potential relevance between code snippets and their descriptions, presenting a challenge in discerning fine-grained semantic distinctions among similar code snippets. To address these challenges, we introduce a multi-task hedging contrastive Learning framework for Code Search, referred to as HedgeCode. HedgeCode is structured around two primary training phases. The first phase, known as the representation alignment stage, proposes a hedging contrastive learning approach. This method aims to detect subtle differences between code and natural language text, thereby aligning their representation spaces by identifying relevance. The subsequent phase involves multi-task joint learning, wherein the previously trained model serves as the encoder. This stage optimizes the model through a combination of supervised and self-supervised contrastive learning tasks. Our framework’s effectiveness is demonstrated through its performance on the CodeSearchNet benchmark, showcasing HedgeCode’s ability to address the mentioned limitations in code search tasks.
Tags: "AI for SE", "Prog Comprehension/Reeng/Maint"Jiashuo Zhang, Yiming Shen, Jiachi Chen, Jianzhong Su, Yanlin Wang, Ting Chen, Jianbo Gao, Zhong Chen, "Demystifying and Detecting Cryptographic Defects in Ethereum Smart Contracts"
Abstract: To enhance smart contracts with cryptographic capabilities, Ethereum has officially provided a set of system-level cryptographic APIs, such as ecrecover. These APIs have been utilized in over 10% of Ethereum transactions, motivating developers to implement various on-chain cryptographic tasks, such as digital signatures. However, since developers may not always be cryptographic experts, their ad-hoc and potentially defective implementations could compromise the theoretical guarantees of cryptography, leading to real-world security issues. To mitigate this threat, we conducted the first study aimed at demystifying and detecting cryptographic defects in smart contracts. Through the analysis of 2,406 real-world security reports, we defined nine types of cryptographic defects in smart contracts with detailed descriptions and practical detection patterns. Based on this categorization, we proposed CrySol, a fuzzing-based tool to automate the detection of cryptographic defects in smart contracts. It combines transaction replaying and dynamic taint analysis to extract fine-grained crypto-related semantics and employs crypto-specific strategies to guide the test case generation process. urthermore, we collected a large-scale dataset containing 25,745 real-world crypto-related smart contracts and evaluated CrySol's effectiveness on it. The result demonstrated that CrySol achieves an overall precision of 95.4% and a recall of 91.2%. Notably, CrySol revealed that 5,847 (22.7%) out of 25,745 contracts contain at least one cryptographic defect, highlighting the prevalence of these defects.
Tags: "Security", "Blockchain", "Other: Cryptography"Chijin Zhou, Quan Zhang, Bingzhou Qian, Yu Jiang, "Janus: Detecting Rendering Bugs in Web Browsers via Visual Delta Consistency"
Abstract: Rendering lies at the heart of our modern web experience. However, the correctness of browser rendering is not always guaranteed, often leading to rendering bugs. Traditional differential testing, while successful in various domains, falls short when applied to rendering bug detection because an HTML file is likely yield different rendered outcomes across different browsers. This paper introduces Visual Delta Consistency, a test oracle to detect rendering bugs in web browsers, aiming to make rendered pages across browsers comparable. Our key insight is that any modifications made to an HTML file should uniformly influence rendering outcomes across browsers. Specifically, when presented with two HTML files that differ only by minor modifications, the reaction of all browsers should be consistent, i.e., either all browsers render them identically or all render them differently. Based on this insight, We implemented it as a practical fuzzer named Janus. It constructs pairs of slightly modified HTML files and observes the change statuses of the corresponding rendered pages across browsers for bug detection. We evaluated it on three widely-used browsers, i.e., Chrome, Safari, and Firefox. In total, Janus detected 34 rendering bugs, out of which 26 confirmed with 8 fixed by the developers.
Tags: "Testing and Quality", "Other: Web"Seongmin Lee, Shreyas Minocha, Marcel Böhme, "Accounting for Missing Events in Statistical Information Leakage Analysis"
Abstract: The leakage of secret information via a public channel is a critical privacy flaw in software systems. The more information is leaked per observation, the less time an attacker needs to learn the secret. Due to the size and complexity of the modern software, and because some empirical facts are not available to a formal analysis of the source code, researchers started investigating statistical methods using program executions as samples. However, current statistical methods require a high sample coverage. Ideally, the sample is large enough to contain every possible combination of secret $\times$ observable value to accurately reflect the joint distribution of $\langle$secret, observable$\rangle$. Otherwise, the information leakage is severely underestimated, which is problematic as it can lead to overconfidence in the security of an otherwise vulnerable program. In this paper, we introduce an improved estimator for information leakage and propose to use methods from applied statistics to improve our estimate of the joint distribution when sample coverage is low. The key idea is to reconstruct the joint distribution by casting our problem as a multinomial estimation problem in the absence of samples for all classes. We suggest two approaches and demonstrate the effectiveness of each approach on a set of benchmark subjects. We also propose novel refinement heuristics, which help to adjust the joint distribution and gain better estimation accuracy. Compared to existing statistical methods for information leakage estimation, our method can safely overestimate the mutual information and provide a more accurate estimate from a limited number of program executions.
Tags: "Security", "Analysis"Claudio Spiess, David Gros, Kunal Suresh Pai, Michael Pradel, Md Rafiqul Islam Rabin, Amin Alipour, Susmit Jha, Premkumar Devanbu, Toufique Ahmed, "Calibration and Correctness of Language Models for Code"
Abstract: Machine learning models are widely used but can also often be wrong. Users would benefit from a reliable indication of whether a given output from a given model should be trusted, so a rational decision can be made whether to use the output or not. For example, outputs can be associated with a \emph{confidence measure}; if this confidence measure is strongly associated with \emph{likelihood of correctness}, then the model is said to be \emph{well-calibrated}. In this case, the confidence measure can serve as a basis for rational graduated decision making on how much review and care is needed. \emph{Calibration} has so far been studied in mostly non-generative (\emph{e.g.}, classification) settings, especially in Software Engineering. However, generated code can quite often be wrong: Given generated code developers must decide whether to directly use, use after varying intensity of careful review, or discard model-generated code; thus calibration is vital in generative settings. In this paper we make several contributions. We develop a framework for evaluating the calibration of code-generating models. We consider several tasks, correctness criteria, datasets, and approaches, and find that by and large generative code models are \textbf{\textit{\underline{not}}} well-calibrated out of the box. We then show how calibration can be improved, using standard methods such as Platt scaling. Since Platt scaling relies on the prior availability of correctness data, we evaluate the applicability and generalizability of Platt scaling in Software Engineering, discuss settings where it has good potential for practical use, and settings where it does not. Our contributions will lead to better-calibrated decision-making in the current use of code generated by language models, and offers a framework for future research to further improve calibration methods for generative models in Software Engineering.
Tags: "AI for SE"Pantazis Deligiannis, Akash Lal, Nikita Mehrotra, Rishi Poddar, Aseem Rastogi, "RustAssistant: Using LLMs to Fix Compilation Errors in Rust Code"
Abstract: The Rust programming language, with its safety guarantees, has established itself as a viable choice for low-level systems programming language over the traditional, unsafe alternatives like C/C++. These guarantees come from a strong ownership-based type system, as well as primitive support for features like closures, pattern matching, etc., that make the code more concise and amenable to reasoning. These unique Rust features also pose a steep learning curve for programmers. This paper presents a tool called RustAssistant that leverages the emergent capabilities of Large Language Models (LLMs) to automatically suggest fixes for Rust compilation errors. RustAssistant uses a careful combination of prompting techniques as well as iteration between an LLM and the Rust compiler to deliver high accuracy of fixes. RustAssistant is able to achieve an impressive peak accuracy of roughly 74% on real-world compilation errors in popular open-source Rust repositories. We also contribute a dataset of Rust compilation errors to enable further research.
Tags: "AI for SE", "Prog Comprehension/Reeng/Maint"guoping rong, Yongda Yu, Song Liu, Xin Tan, Tianyi Zhang, Haifeng Shen, Jidong Hu, "Code Comment Inconsistency Detection and Rectification Using a Large Language Model"
Abstract: Comments are widely used in source code. If a comment is consistent with the code snippet it intends to annotate, it would aid code comprehension. Otherwise, Code Comment Inconsistency (CCI) is not only detrimental to the understanding of code, but more importantly, it would negatively impact the development, testing, and maintenance of software. To tackle this issue, existing research has been primarily focused on detecting inconsistencies with varied performance. It is evident that detection alone does not solve the problem; it merely paves the way for solving it. A complete solution requires detecting inconsistencies and, more importantly, rectifying them by amending comments. However, this type of work is scarce. In this paper, we contribute C4RLLaMA, a fine-tuned large language model based on the open-source CodeLLaMA. It not only has the ability to rectify inconsistencies by correcting relevant comment content but also outperforms state-of-the-art approaches in detecting inconsistencies. Experiments with various datasets confirm that C4RLLaMA consistently surpasses both Post Hoc and Just-in-time CCI detection approaches. More importantly, C4RLLaMA outperforms substantially the only known CCI rectification approach in terms of multiple performance metrics. To further examine C4RLLaMA's efficacy in rectifying inconsistencies, we conducted a manual evaluation, and the results showed that the percentage of correct comment updates by C4RLLaMA was 65.0\% and 55.9\% in Just-in-time and Post Hoc, respectively, implying C4RLLaMA's real potential in practical use.
Tags: "AI for SE", "Prog Comprehension/Reeng/Maint"Kai Huang, Jian Zhang, Xiangxin Meng, Yang Liu, "Template-Guided Program Repair in the Era of Large Language Models"
Abstract: Recent advancements in automated program repair (APR) have been significantly driven by the application of Large Language Models (LLMs). In particular, the integration of LLMs with traditional template-based repair methods has demonstrated effective outcomes. Despite this, the synergy between the strengths of traditional methods and LLMs remains underexploited. This oversight originates from the indiscriminate use of templates and their insufficient coverage. Also, using small-scale LLMs within the zero-shot learning context proves to be suboptimal. To alleviate the limitations, we propose NTR (Neural Template Repair), a two-stage repair framework including template selection and patch generation, both of which are under the fine-tuning paradigm. In the template selection phase, we formulate it as a multiclass classification problem and fine-tune million-level LLMs for better selecting possible templates. During the patch generation phase, we leverage the chosen templates as probable directions (e.g., `Mutate Conditional Expression') to guide the fine-tuning process of LLMs at the billion-level scale for precise patch creation. Moreover, we incorporate a unique template to signify the absence of a suitable template and employ a probability-based prioritization of templates, thereby optimizing patch generation. This framework not only effectively addresses template mismatch issues, but also enables the billion-level LLMs to explore the patch space more efficiently, despite the GPU memory constraints. We evaluate NTR with different foundational models on Defects4J V1.2 and HumanEval-Java, the framework consistently demonstrates significant effectiveness. When utilizing StarCoder as the foundational model for patch generation, NTR fixes 128 and 129 bugs in Defects4J and HumanEval, outperforming the best baseline APR tool by 14 and 59 bugs. With the larger CodeLlama model, the fixed bugs rise to 139 and 136, respectively, exceeding the baseline by 25 and 66 bugs. Notably, the performance stems not only from the foundational models but also benefits greatly from our NTR framework. Specifically, NTR's implementation with StarCoder and CodeLlama leads to 22 and 23 additional fixes, which is beyond what the models achieve on their own. This emphasizes the success of our new perspective on utilizing templates to unlock the bug-fixing potential of LLMs.
Tags: "AI for SE", "Prog Comprehension/Reeng/Maint"Syed Fatiul Huq, Mahan Tafreshipour, Kate Kalcevich, Sam Malek, "Automated Generation of Accessibility Test Reports from Recorded User Transcripts"
Abstract: Testing for accessibility is a significant step when developing software, as it ensures that all users, including those with disabilities, can effectively engage with web and mobile applications. While automated tools exist to detect accessibility issues in software, none are as comprehensive and effective as the process of user testing, where testers with various disabilities evaluate the application for accessibility and usability issues. However, user testing is not popular with software developers as it requires conducting lengthy interviews with users and later parsing through large recordings to derive the issues to fix. In this paper, we explore how large language models (LLMs) like GPT 4.0, which have shown promising results in context comprehension and semantic text generation, can mitigate this issue and streamline the user testing process. Our solution, called Reca11, takes in informal transcripts of test recordings and extracts the accessibility and usability issues mentioned by the tester. Our systematic prompt engineering determines the optimal configuration of input, instruction, context and demonstrations for best results. We evaluate Reca11's effectiveness on 36 user testing sessions across three applications. Based on the findings, we investigate the strengths and weaknesses of using LLMs in this space.
Tags: "AI for SE", "User experience"Xinyu Lian, Yinfang Chen, Runxiang Cheng, Jie Huang, Parth Thakkar, Minjia Zhang, Tianyin Xu, "Large Language Models as Configuration Validators"
Abstract: Misconfigurations are major causes of software failures. Existing practices rely on developer-written rules or test cases to validate configurations, which are expensive. Machine learning (ML) for configuration validation is considered a promising direction, but has been facing challenges such as the need of large-scale field data and system-specific models. Recent advances in Large Language Models (LLMs) show promise in addressing some of the long-lasting limitations of ML-based configuration validation. We present a first analysis on the feasibility and effectiveness of using LLMs for configuration validation. We empirically evaluate LLMs as configuration validators by developing a generic LLM-based configuration validation framework, named Ciri. Ciri employs effective prompt engineering with few-shot learning based on both valid configuration and misconfiguration data. Ciri checks outputs from LLMs when producing results, addressing hallucination and nondeterminism of LLMs. We evaluate Ciri’s validation effectiveness on eight popular LLMs using configuration data of ten widely deployed open-source systems. Our analysis (1) confirms the potential of using LLMs for configuration validation, (2) explores design space of LLMbased validators like Ciri, and (3) reveals open challenges such as ineffectiveness in detecting certain types of misconfigurations and biases towards popular configuration parameters.
Tags: "AI for SE", "Security"Wen Zhang, Botang Xiao, Qingchen Kong, Le Guan, Wenwen Wang, "BSan: A Powerful Identifier-Based Hardware-Independent Memory Error Detector for COTS Binaries"
Abstract: This paper presents BSan, a practical software-only memory error detector for binary code. Different from state-of-the-art binary-level detectors, which rely on either the shadow memory-based approach or the hardware-specific feature and thus suffer from several fundamental limitations, BSan adopts an identifier-based approach, enabling it to detect deep memory errors missed by existing detectors. Also, BSan does not depend on any specific hardware features. To reduce the high performance overhead caused by identifier propagation, BSan creates a novel hybrid approach, static analysis+dynamic instrumentation, to improve the performance without inheriting the poor reliability of static binary rewriting, distinguishing it from existing detectors that simply refer to static binary rewriting for better performance. The comprehensive evaluation demonstrates that BSan can detect more memory errors than state-of-the-art binary-level detectors. Meanwhile, the performance and memory overheads of BSan are comparable to those of existing detectors.
Tags: "Security", "Analysis"Ying Fu, Zhiyong Wu, Yuanliang Zhang, Jie Liang, Jingzhou Fu, Yu Jiang, Shanshan Li, Xiangke Liao, "Thanos: DBMS Bug Detection via Storage Engine Rotation Based Differential Testing"
Abstract: Differential testing is a prevalent strategy for establishing test oracles in automated DBMS testing. However, meticulously selecting equivalent DBMSs with diverse implementations and compatible input syntax requires huge manual efforts. In this paper, we propose Thanos, a framework that finds DBMS bugs via storage engine rotation based differential testing. Our key insight is that a DBMS with different storage engines must provide consistent basic storage functionalities. Therefore, it’s feasible to construct equivalent DBMSs based on storage engine rotation, ensuring that the same SQL test cases to these equivalent DBMSs yield consistent results. The framework involves four main steps: 1) select the appropriate storage engines; 2) extract equivalence information among the selected storage engines; 3) synthesize feature-orient test cases that ensure the DBMS equivalence; and 4) send test cases to the DBMSs with selected storage engines and compare the results. We evaluate Thanos on three widely used and extensively tested DBMSs, namely MySQL, MariaDB, and Percona against state-of-the-art fuzzers SQLancer, SQLsmith, and Squirrel. Thanos outperforms them on branch coverage by 24%–116%, and also finds many bugs missed by other fuzzers. More importantly, the vendors have confirmed 32 previously unknown bugs found by Thanos, with 29 verified as Critical.
Tags: "Testing and Quality", "Databases"Courtney Miller, Mahmoud Jahanshahi, Audris Mockus, Bogdan Vasilescu, Christian Kästner, "Understanding the Response to Open-Source Dependency Abandonment in the npm Ecosystem"
Abstract: Many developers relying on open-source digital infrastructure expect continuous maintenance, but even the most critical packages can become unmaintained. Despite this, there is little understanding of the prevalence of abandonment of widely-used packages, of subsequent exposure, and of reactions to abandonment in practice, or the factors that influence them. We perform a large-scale quantitative analysis of all widely-used npm packages and find that abandonment is common among them, that abandonment exposes many projects which often do not respond, that responses correlate with other dependency management practices, and that removal is significantly faster when a projects end-of-life status is explicitly stated. We end with recommendations to both researchers and practitioners who are facing dependency abandonment or are sunsetting projects, such as opportunities for low-effort transparency mechanisms to help exposed projects make better, more informed decisions.
Tags: "Prog Comprehension/Reeng/Maint", "Human/Social"Yangruibo Ding, Yanjun Fu, Omniyyah Ibrahim, Chawin Sitawarin, Xinyun Chen, Basel Alomair, David Wagner, Baishakhi Ray, Yizheng Chen, "Vulnerability Detection with Code Language Models: How Far Are We?"
Abstract: In the context of the rising interest in code language models (code LMs) and vulnerability detection, we study the effectiveness of code LMs for detecting vulnerabilities. Our analysis reveals significant shortcomings in existing vulnerability datasets, including poor data quality, low label accuracy, and high duplication rates, leading to unreliable model performance in realistic vulnerability detection scenarios. Additionally, the evaluation methods used with these datasets are not representative of real-world vulnerability detection. To address these challenges, we introduce PrimeVul, a new dataset for training and evaluating code LMs for vulnerability detection. PrimeVul incorporates a novel set of data labeling techniques that achieve comparable label accuracy to human-verified benchmarks while significantly expanding the dataset. It also implements a rigorous data de-duplication and chronological data splitting strategy to mitigate data leakage issues, alongside introducing more realistic evaluation metrics and settings. This comprehensive approach aims to provide a more accurate assessment of code LMs' performance in real-world conditions. Evaluating code LMs on PrimeVul reveals that existing benchmarks significantly overestimate the performance of these models. For instance, a state-of-the-art 7B model scored 68.26\% F1 on BigVul but only 3.09\% F1 on PrimeVul. Attempts to improve performance through advanced training techniques and larger models like GPT-3.5 and GPT-4 were unsuccessful, with results akin to random guessing in the most stringent settings. These findings underscore the considerable gap between current capabilities and the practical requirements for deploying code LMs in security roles, highlighting the need for more innovative research in this domain.
Tags: "AI for SE", "Security"Deepak-George Thomas, Matteo Biagiola, Nargiz Humbatova, Mohammad Wardat, Gunel Jahangirova, Hridesh Rajan, Paolo Tonella, "µPRL: A Mutation Testing Pipeline for Deep Reinforcement Learning based on Real Faults"
Abstract: Reinforcement Learning (RL) is increasingly adopted to train agents that can deal with complex sequential tasks, such as driving an autonomous vehicle or controlling a complex environment. Correspondingly, novel approaches are needed to ensure that RL agents have been tested adequately before going to production. Among them, mutation testing is quite promising, especially under the assumption that the injected faults (mutations) mimic the real ones. In this paper, we first describe a taxonomy of real RL faults obtained by repository mining. Then, we present the mutation operators derived from such real faults and implemented in the tool µPRL. Finally, we discuss the experimental results, which show that µPRL is extremely effective at discriminating strong from weak test generators, hence providing useful feedback to developers about the adequacy of the test scenarios generated and executed so far.
Tags: "SE for AI", "Testing and Quality"Nadia Nahar, Haoran Zhang, Grace Lewis, Shurui Zhou, Christian Kästner, "The Product Beyond the Model -- An Empirical Study of Repositories of Open-Source ML Products"
Abstract: Machine learning (ML) components are increasingly incorporated into software products for end-users, but developers face challenges in transitioning from ML prototypes to products. Academics have limited access to the source of commercial ML products, challenging research progress. In this study, first, we contribute a novel process to identify 262 open-source ML products among more than half a million ML-related projects on GitHub. Then, we qualitatively and quantitatively analyze 30 open-source ML products to answer six broad research questions about development practices and system architecture. We find that the majority of the ML products in our sample represent startup-style development reported in past interview studies. We report 21 findings, including limited involvement of data scientists in many ML products, unusually low modularity between ML and non-ML code, diverse architectural choices on incorporating models into products, and limited prevalence of industry best practices such as model testing, pipeline automation, and monitoring. Additionally, we discuss 7 implications of this study on research, development, and education, including the need for tools to assist teams without data scientists, education opportunities, and open-source-specific research for privacy-preserving telemetry.
Tags: "SE for AI", "MSR"Sanan Hasanov, Stefan Nagy, Paul Gazzillo, "A Little Goes a Long Way: Tuning Configuration Selection for Continuous Kernel Fuzzing"
Abstract: The Linux kernel is actively-developed and widely-used. It supports billions of devices of all classes, from high-performance computing to the Internet-of-Things, in part because of its sophisticated configuration system, which automatically tailors the source code according to thousands of user-provided configuration options. Fuzzing has been highly successful at finding kernel bugs, being among the top bug reporters. Since the kernel receives 100s of patches per day, fuzzers run continuously, stopping regularly to rebuild the kernel with the latest changes before restarting fuzzing. But kernel fuzzers currently use predefined configuration settings that, as we show, exclude the majority of new patches from the kernel binary, nullifying the benefits of continuous fuzzing. Unfortunately, state-of-the-art configuration testing techniques are generally ill-suited to the needs of continuous fuzzing, excluding necessary options or requiring too many configuration files to be tractible. We distill down the needs of continuous testing into six properties with the most impact, systematically analyze the space of configuration selection strategies, and provide actionable recommendations. Through our analysis, we discover that continuous fuzzers can improve configuration variety without sacrificing performance. We empirically evaluate our discovery by modifying the configuration selection strategy for syzkaller, the most popular Linux kernel fuzzer, which subsequently found more than twice as many new bugs (35 vs. 13) than with the original configuration file and 12x more (24 vs. 2) when considering only unique bugs---with one security vulnerability being assigned a CVE.
Tags: "Testing and Quality", "Requirements"Alex Sanchez-Stern, Abhishek Varghese, Zhanna Kaufman, Dylan Zhang, Talia Ringer, Yuriy Brun, "QEDCartographer: Automating Formal Verification Using Reward-Free Reinforcement Learning"
Abstract: Formal verification is a promising method for producing highly reliable software, but the difficulty of manually writing verification proofs severely limits its utility in practice. Recent methods have automated some proof synthesis by guiding a search through the proof space using machine learning and a theorem prover. Unfortunately, the theorem prover provides only the crudest estimate of progress, resulting in effectively undirected search. This makes proofs hard to find, and, when they are found, longer than necessary. Reinforcement learning could help estimate progress, but sparse rewards make this method ineffective. To address this problem, we create QEDCartographer, an novel automated proof-synthesis tool that combines supervised and reinforcement learning. QEDCartographer's key insight is that incorporating the branching structure of proofs into its learning enables reward-free search, mitigating the sparse reward challenge. We evaluate QEDCartographer on the CoqGym benchmark of 68,501 theorems from 124 open-source Coq projects. QEDCartographer proves 186 more theorems than Proverbot9001, a state-of-the-art proof synthesis tool, an increase of 8%. Further, the tools are complementary, together proving 12% more theorems than Proverbot9001 alone. For theorems both can prove, QEDCartographer produces 26% shorter proofs 27% faster.
Tags: "AI for SE", "Analysis"Ben Limpanukorn, Jiyuan Wang, Hong Jin Kang, Eric Zitong Zhou, Miryung Kim, "Fuzzing MLIR Compilers with Custom Mutation Synthesis"
Abstract: A growing trend in compiler design is to enable modular extensions to intermediate representations (IRs). Multi- Level Intermediate Representation (MLIR) is a new effort to enable faster compiler development by providing an extensible framework for downstream developers to define custom IRs with MLIR dialects. Sets of MLIR dialects define new IRs that are tailored for specific domains. The diversity and rapid evolution of these IRs make it impractical to pre-define custom test generator logic for every available dialect. We design a new approach called SYNTHFUZZ that automatically infers and applies custom mutations from existing tests. The key essence of SYNTHFUZZ is that inferred custom mutations are parameterized and context-dependent such that they can be concretized differently depending on the target context. By doing this, we obviate the need to manually write custom mutations for newly introduced MLIR dialects. Further, SYNTHFUZZ increases the chance of finding effective edit locations and reduces the chance of inserting invalid edit content by performing k-ancestor- prefix and l-sibling-postfix matching. We compare SYNTHFUZZ to three baselines: Grammarinator—a grammar-based fuzzer without custom mutators, MLIRSmith—a custom test generator for MLIR, and NeuRI—a custom test generator with support for parameterized generation. We conduct this comprehensive comparison on 4 different MLIR projects where each project defines a new set of MLIR dialects that would take months of effort to manually write custom input generation and mutation logic. Our evaluation shows that SYNTHFUZZ on average improves input diversity by 1.51×, which increases branch coverage by 1.16×. Further, we show that our context dependent custom mutation increases the proportion of valid tests by up to 1.11×, indicating that SYNTHFUZZ correctly concretizes its parameterized mutations with respect to the target context. Parameterization of the mutations reduces the fraction of tests violating general MLIR constraints by 0.57×, increasing the time spent fuzzing dialect-specific code.
Tags: "Testing and Quality", "Security"Forough Mehralian, Ziyao He, Sam Malek, "Automated Accessibility Analysis of Dynamic Content Changes on Mobile Apps"
Abstract: With mobile apps playing an increasingly vital role in our daily lives, the importance of ensuring their accessibility for users with disabilities is also growing. Despite this, app developers often overlook the accessibility challenges encountered by users of assistive technologies, such as screen readers. Screen reader users typically navigate content sequentially, focusing on one element at a time, unaware of changes occurring elsewhere in the app. While dynamic changes to content displayed on an app’s user interface may be apparent to sighted users, they pose significant accessibility obstacles for screen reader users. Existing accessibility testing tools are unable to identify challenges faced by blind users resulting from dynamic content changes. In this work, we first conduct a formative user study on dynamic changes in Android apps and their accessibility barriers for screen reader users. We then present TIMESTUMP, an automated framework that leverages our findings in the formative study to detect accessibility issues regarding dynamic changes. Finally, we empirically evaluate TIMESTUMP on real-world apps to assess its effectiveness and efficiency in detecting such accessibility issues.
Tags: "Testing and Quality", "Human/Social"Yanlin Wang, Yanli Wang, Daya Guo, Jiachi Chen, Ruikai Zhang, Yuchi Ma, Zibin Zheng, "RLCoder: Reinforcement Learning for Repository-Level Code Completion"
Abstract: Repository-level code completion aims to generate code for unfinished code snippets within the context of a specified repository. Existing approaches mainly rely on retrieval-augmented generation strategies due to limitations in input sequence length. However, traditional lexical-based retrieval methods like BM25 struggle to capture code semantics, while model-based retrieval methods face challenges due to the lack of labeled data for training. Therefore, we propose RLCoder, a novel reinforcement learning framework, which can enable the retriever to learn to retrieve useful content for code completion without the need for labeled data. Specifically, we iteratively evaluate the usefulness of retrieved content based on the perplexity of the target code when provided with the retrieved content as additional context, and provide feedback to update the retriever parameters. This iterative process enables the retriever to learn from its successes and failures, gradually improving its ability to retrieve relevant and high-quality content. Considering that not all situations require information beyond code files and not all retrieved context is helpful for generation, we also introduce a stop signal mechanism, allowing the retriever to decide when to retrieve and which candidates to retain autonomously. Extensive experimental results demonstrate that RLCoder consistently outperforms state-of-the-art methods on CrossCodeEval and RepoEval, achieving 12.2\% EM improvement over previous methods. Moreover, experiments show that our framework can generalize across different programming languages and further improve previous methods like RepoCoder.
Tags: "AI for SE"Nausheen Mohammed, Akash lal, Aseem Rastogi, Subhajit Roy, Rahul Sharma, "LLM Assistance for Memory Safety"
Abstract: Memory safety violations in low-level code, written in languages like C, continues to remain one of the major sources of software vulnerabilities. One method of removing such violations by construction is to port C code to a safe C dialect. Such dialects rely on programmer-supplied annotations to guarantee safety with minimal runtime overhead. This porting, however, is a manual process that imposes significant burden on the programmer and, hence, there has been limited adoption of this technique. The task of porting not only requires inferring annotations, but may also need refactoring/rewriting of the code to make it amenable to such annotations. In this paper, we use Large Language Models (LLMs) towards addressing both these concerns. We show how to harness LLM capabilities to do complex code reasoning as well as rewriting of large codebases. We also present a novel framework for whole-program transformations that leverages lightweight static analysis to break the transformation into smaller steps that can be carried out effectively by an LLM. We implement our ideas in a tool called MSA that targets the CheckedC dialect. We evaluate MSA on several micro-benchmarks, as well as real-world code ranging up to 20K lines of code. We showcase superior performance compared to a vanilla LLM baseline, as well as demonstrate improvement over a state-of-the-art symbolic (non-LLM) technique.
Tags: "AI for SE", "Security"Chenkai Guo, Qianlu Wang, Naipeng Dong, Lingling Fan, Tianhong Wang, Weijie Zhang, EnBao Chen, Zheli Liu, Lu Yu, "EP-Detector: Automatic Detection of Error-prone Operation Anomalies in Android Applications"
Abstract: Android applications are pervasively adopted and heavily relied on in our daily life, leading to the growing demand for enhanced user experiences, such as ease for operation and robustness. Nevertheless, developers continue to prioritize traditional functionality and performance, overlooking the pivotal role of user experience in real-world scenarios. For example, poorly designed page elements can lead to user confusion, resulting in unexpected outcomes, termed as the error-prone operation anomalies (EPAs). In this work, we undertake the first effort to uncover the underlying essence of the EPA problem. To achieve this objective, we investigated the root causes of EPAs from three dimensions, i.e., subject, object and environment. These causes were identified by multi-stage attribute capturing and precise similarity computation. In this process, the causes are categorized into fine-grained classes, namely confusing behaviours, unsuitable layout, and resource overload. Building upon these insights, we propose a dynamic GUI-based testing tool EP-Detector to facilitate detecting the EPAs in real-world apps. The EP-Detector is equipped with widget-exploration based target navigation and automatic test oracle, enabling it to detect error-prone page elements and simulate events with both comprehensiveness and precision. To systematically study the prevalence and severity of real-world EPAs, we conducted experiments on 53 popular Android apps with EP-Detector. The confirmed results not only validate the high precision and completeness of EP-Detector but also highlight that EPAs are prevalent in current apps, with at least one EPA existing in every two page widgets on average, and 28.3% of them may lead to security and functionality issues or risks. The EP-Detector is available at https://github.com/WordDealer/EP-Detector.
Tags: "Testing and Quality", "Mobile SW"Shuzheng Gao, Cuiyun Gao, Wenchao Gu, Michael Lyu, "Search-Based LLMs for Code Optimization"
Abstract: The code written by developers usually suffers from efficiency problems and contain various performance bugs. These inefficiencies necessitate the research of automated refactoring methods for code optimization. Early research in code optimization employs rule-based methods and focuses on specific inefficiency issues, which are labor-intensive and suffer from the low coverage issue. Recent work regards the task as a sequence generation problem, and resorts to deep learning (DL) techniques such as large language models (LLMs). These methods typically prompt LLMs to directly generate optimized code. Although these methods show state-of-the-art performance, such one-step generation paradigm is hard to achieve an optimal solution. First, complex optimization methods such as combinatorial ones are hard to be captured by LLMs. Second, the one-step generation paradigm poses challenge in precisely infusing the knowledge required for effective code optimization within LLMs, resulting in under-optimized code. To address these problems, we propose to model this task from the search perspective, and propose a search-based LLMs framework named SBLLM that enables iterative refinement and discovery of improved optimization methods. SBLLM synergistically integrate LLMs with evolutionary search and consists of three key components: 1) an execution-based representative sample selection part that evaluates the fitness of each existing optimized code and prioritizes promising ones to pilot the generation of improved code; 2) an adaptive optimization pattern retrieval part that infuses targeted optimization patterns into the model for guiding LLMs towards rectifying and progressively enhancing their optimization methods; and 3) a genetic operator-inspired chain-of-thought prompting part that aids LLMs in combining different optimization methods and generating improved optimization methods. Our evaluation of SBLLM on a dataset of Python and C++ code demonstrates its effectiveness in improving code efficiency. Specifically, the results indicate that SBLLM can improve program execution efficiency by up to 109.59% and consistently outperform all baseline methods by 8.72% ∼ 28.06% and 1.15% ∼ 9.56% with different LLMs in terms of top-5 speedup rate on Python and C++, respectively.
Tags: "AI for SE"Qunhong Zeng, Yuxia Zhang, Zhiqing Qiu, Hui Liu, "A First Look at Conventional Commits Classification"
Abstract: Modern distributed software development relies on commits to control system versions. Commit classification plays a vital role in both industry and academia. The widely-used commit classification framework was proposed in 1976 by Swanson and includes three base classes: perfective, corrective, and adaptive. With the increasing complexity of software development, the industry has shifted towards a more fine-grained commit category, i.e., adopting Conventional Commits Specification (CCS) for delicacy management. The new commit framework requires developers to classify commits into ten distinct categories, such as ``feat'', ``fix'', and ``docs''. However, existing studies mainly focus on the three-category classification, leaving the definition and application of the fine-grained commit categories as knowledge gaps. This paper reports a preliminary study on this mechanism from its application status and problems. We also explore ways to address these identified problems. We find that a growing number of projects on GitHub are adopting CCS. By analyzing 194 issues from GitHub and 100 questions from Stack Overflow about the CCS application, we qualitatively categorized 52 challenges developers encountered. The most common one is CCS-type confusion. To address these challenges, we propose a clear definition of CCS types based on existing variants. Further, we designed an approach to automatically classify commits into CCS types, and the evaluation results demonstrate a promising performance. Our work facilitates a deeper comprehension of the present fine-grained commit categorization and holds the potential to alleviate application challenges significantly.
Tags: "Prog Comprehension/Reeng/Maint", "AI for SE"Chong Wang, Jian Zhang, Yiling Lou, Mingwei Liu, Weisong Sun, Yang Liu, Xin Peng, "TIGER: A Generating-Then-Ranking Framework for Practical Python Type Inference"
Abstract: Python’s dynamic typing system offers flexibility and expressiveness but can lead to type-related errors, prompting the need for automated type inference despite efforts like Python Enhancement Proposals (PEPs) to enhance type hinting. While existing learning-based approaches show promising inference accuracy, they struggle with practical challenges in comprehensively handling various types, including complex generics and (unseen) user/library-defined types. To address these challenges, we introduce TIGER, employing a two-stage generating-then-ranking (GTR) framework. By fine-tuning pre-trained code models, TIGER trains a generation model with a generative span masking objective and a similarity model with a contrastive training objective. This enables TIGER to execute the GTR inference, generating diverse candidates and then ranking them alongside user/library-defined types. Evaluation on the ManyTypes4Py dataset demonstrates TIGER’s effectiveness across different type categories, particularly excelling in (unseen) user-defined types (with improvements of 11.2% and 20.1% in Top-5 Exact Match). The evaluation results also confirm the robustness and efficiency of TIGER, highlighting the contributions of the employed two stages.
Tags: "AI for SE", "Analysis"Qingchao Shen, Yongqiang Tian, Haoyang Ma, Junjie Chen, Lili Huang, Ruifeng Fu, Shing-Chi Cheung, Zan Wang, "A Tale of Two DL Cities: When Library Tests Meet Compiler"
Abstract: Deep Learning (DL) compilers typically load a DL model and optimize it with intermediate representation. Existing DL compiler testing techniques mainly focus on model optimization stages, but rarely explore bug detection at the model loading stage. Effectively testing the model loading stage requires covering diverse usages of each DL operator from various DL libraries, which shares a common objective with DL library testing, indicating that the embedded knowledge in DL library tests could potentially be beneficial for testing the model loading stage of DL compilers. Thus, we conducted the first empirical study to investigate the effectiveness and efficiency of migrating the knowledge embedded in DL library tests to test the model loading stage. To support the conduct of this study, we develop a technique, called OPERA, consisting of test migration (regarding effectiveness investigation) and test prioritization (regarding efficiency investigation). We considered three sources of tests in DL libraries for migration and used eight frontends from three DL compilers (e.g., TVM, TensorRT, and OpenVINO) for evaluation. The migrated tests with the aid of OPERA detected 170 previously unknown bugs in total, 90 of which have been confirmed/fixed by developers, demonstrating the effectiveness of such the migration-based idea. The test prioritization strategy in OPERA improves testing efficiency with migrated tests by 11.9%~47.4% on average compared to general test prioritization strategies. Finally, we obtained 7 major findings and provided a set of guidelines for future work from this study.
Tags: "SE for AI", "Testing and Quality"Rodrigo Pedro, Miguel E. Coimbra, Daniel Castro, Paulo Carreira, Nuno Santos, "Prompt-to-SQL Injections in LLM-Integrated Web Applications: Risks and Defenses"
Abstract: Large Language Models (LLMs) have found widespread applications in various domains, including web applications with chatbot interfaces. Aided by an LLM-integration middleware such as LangChain, user prompts are translated into SQL queries used by the LLM to provide meaningful responses to users. However, unsanitized user prompts can lead to SQL injection attacks, potentially compromising the security of the database. In this paper, we present a comprehensive examination of prompt-to-SQL (P2SQL) injections targeting web applications based on frameworks such as LangChain and LlamaIndex. We characterize P2SQL injections, exploring their variants and impact on application security through multiple concrete examples. We evaluate seven state-of-the-art LLMs, demonstrating the risks of P2SQL attacks across language models. By employing both manual and automated methods, we discovered P2SQL vulnerabilities in five real-world applications. Our findings indicate that LLM-integrated applications are highly susceptible to P2SQL injection attacks, warranting the adoption of robust defenses. To counter these attacks, we propose four effective defense techniques that can be integrated as extensions to the LangChain framework.
Tags: "AI for SE", "Security"Junkai Chen, Zhiyuan Pan, Xing Hu, Zhenhao Li, Ge Li, Xin Xia, "Reasoning Runtime Behavior of a Program with LLM: How Far Are We?"
Abstract: Large language models for code (i.e., code LLMs) have shown strong code understanding and generation capabilities. To evaluate the capabilities of code LLMs in various aspects, many benchmarks have been proposed (e.g., HumanEval and ClassEval). Code reasoning is one of the most essential abilities of code LLMs, but existing benchmarks for code reasoning are not sufficient. Typically, they focus on predicting the input and output of a program, ignoring the evaluation of the intermediate behavior during program execution, as well as the logical consistency (e.g., the model should not give the correct output if the prediction of execution path is wrong) when performing the reasoning. To address these problems, in this paper, we propose a framework, namely REval, for evaluating code reasoning abilities and consistency of code LLMs with program execution. We utilize existing code benchmarks and adapt them to new benchmarks within our framework. A large-scale empirical study is conducted and most LLMs show unsatisfactory performance on both Runtime Behavior Reasoning (i.e., an average accuracy of 44.4\%) and Incremental Consistency Evaluation (i.e., an average IC score of 10.3). Evaluation results of current code LLMs reflect the urgent need for the community to strengthen the code reasoning capability of code LLMs.
Tags: "AI for SE", "Prog Comprehension/Reeng/Maint"Wei Ma, Daoyuan Wu, Yuqiang Sun, Tianwen Wang, Shangqing Liu, Jian Zhang, Yue Xue, Yang Liu, "Combining Fine-Tuning and LLM-based Agents for Intuitive Smart Contract Auditing with Justifications"
Abstract: Smart contracts are decentralized applications built atop blockchains like Ethereum. Recent research has shown that large language models (LLMs) have potential in auditing smart contracts, but the state-of-the-art indicates that even GPT-4 can achieve only 30% precision (when both decision and justification are correct). This is likely because off-the-shelf LLMs were primarily pre-trained on a general text/code corpus and not fine- tuned on the specific domain of Solidity smart contract auditing. In this paper, we propose iAudit, a general framework that combines fine-tuning and LLM-based agents for intuitive smart contract auditing with justifications. Specifically, iAudit is inspired by the observation that expert human auditors first perceive what could be wrong and then perform a detailed analysis of the code to identify the cause. As such, iAudit employs a two-stage fine-tuning approach: it first tunes a Detector model to make decisions and then tunes a Reasoner model to generate causes of vulnerabilities. However, fine-tuning alone faces challenges in accurately identifying the optimal cause of a vulnerability. Therefore, we introduce two LLM-based agents, the Ranker and Critic, to iteratively select and debate the most suitable cause of vulnerability based on the output of the fine-tuned Reasoner model. To evaluate iAudit, we collected a balanced dataset with 1,734 positive and 1,810 negative samples to fine-tune iAudit. We then compared it with traditional fine-tuned models (CodeBERT, GraphCodeBERT, CodeT5, and UnixCoder) as well as prompt learning-based LLMs (GPT4, GPT-3.5, and CodeLlama-13b/34b). On a dataset of 263 real smart contract vulnerabilities, iAudit achieves an F1 score of 91.21% and an accuracy of 91.11%. The causes generated by iAudit achieved a consistency of about 38% compared to the ground truth causes.
Tags: "AI for SE", "Security"Shuo Yang, Xingwei Lin, Jiachi Chen, Qingyuan Zhong, Lei Xiao, Renke Huang, Yanlin Wang, Zibin Zheng, "Hyperion: Unveiling DApp Inconsistencies using LLM and Dataflow-Guided Symbolic Execution"
Abstract: The rapid advancement of blockchain platforms has significantly accelerated the growth of decentralized applications (DApps). Similar to traditional applications, DApps integrate front-end descriptions that showcase their features to attract users, and back-end smart contracts for executing their business logic. However, inconsistencies between the features promoted in front-end descriptions and those actually implemented in the contract can confuse users and undermine DApps's trustworthiness. In this paper, we first conducted an empirical study to identify seven types of inconsistencies, each exemplified by a real-world DApp. Furthermore, we introduce Hyperion, an approach designed to automatically identify inconsistencies between front-end descriptions and back-end code implementation in DApps. This method leverages a fine-tuned large language model LLaMA2 to analyze DApp descriptions and employs dataflow-guided symbolic execution for contract bytecode analysis. Finally, Hyperion reports the inconsistency based on predefined detection patterns. The experiment on our ground truth dataset consisting of 54 DApps shows that Hyperion reaches 84.06\% overall recall and 92.06\% overall precision in reporting DApp inconsistencies. We also implement Hyperion to analyze 835 real-world DApps. The experimental results show that Hyperion discovers 459 real-world DApps containing at least one inconsistency.
Tags: "Analysis", "Security"Zhiqing Zhong, Shilin He, Haoxuan Wang, Boxi Yu, Haowen Yang, Pinjia He, "An Empirical Study on Package-Level Deprecation in Python Ecosystem"
Abstract: Open-source software (OSS) plays a crucial role in modern software development. Utilizing OSS code can greatly accelerate software development, reduce redundancy, and enhance reliability. Python, a widely adopted programming language, is particularly renowned for its extensive and diverse third-party package ecosystem. However, a significant number of OSS packages within the Python ecosystem are in poor maintenance, leading to potential risks in terms of functionality and security. Consequently, it is essential to establish a deprecation mechanism that assists package developers and users in effectively managing these packages. To facilitate the establishment of the package-level deprecation mechanism, this paper presents a mixed-method empirical study, including data analysis and surveys. We investigate the current practices of announcing, receiving, and handling package-level deprecation in the Python ecosystem. We also assess the benefits of having deprecation announcements for inactively maintained packages. Furthermore, we investigate the challenges faced by package developers and users and their expectations for future deprecation practices. Our findings reveal valuable insights. For instance, 75.4\% of inactive package developers have no intention of releasing deprecation declarations for various reasons, while 89.5\% of users express a desire to be notified about the deprecation, highlighting a gap between developers and users; In many cases, no alternative solutions are available when deprecation occurs, emphasizing the need to explore practical approaches that enable seamless package handover and require less maintenance effort. We anticipate that our work will enhance the understanding of existing package-level deprecation patterns within the Python OSS realm and facilitate the development of deprecation practices for the Python community in the future.
Tags: "MSR", "Prog Comprehension/Reeng/Maint", "Open Source"Lizhi Liao, Simon Eismann, Heng Li, Cor-Paul Bezemer, Diego Elias Costa, André van Hoorn, Weiyi Shang, "Early Detection of Performance Regressions by Bridging Local Performance Data and Architectural Models"
Abstract: During software development, developers often make numerous modifications to the software to address existing issues or implement new features. However, certain changes may inadvertently have a detrimental impact on the overall system performance. To ensure that the performance of new software releases does not degrade (i.e., absence of performance regressions), existing practices rely on system-level performance testing, such as load testing, or component-level performance testing, such as microbenchmarking, to detect performance regressions. However, performance testing for the entire system is often expensive and time-consuming, posing challenges to adapting to the rapid release cycles common in modern DevOps practices. In addition, system-level performance testing cannot be conducted until the system is fully built and deployed. On the other hand, component-level testing focuses on isolated components, neglecting overall system performance and the impact of system workloads. In this paper, we propose a novel approach to early detection of performance regressions by bridging the local performance data generated by component-level testing and the system-level architectural models. Our approach uses local performance data to identify deviations at the component level, and then propagate these deviations to the architectural model. We then use the architectural model to predict regressions in the performance of the overall system. In an evaluation of our approach on two representative open-source benchmark systems, we show that it can effectively detect end-to-end system performance regressions from local performance deviations with different intensities and under various system workloads. More importantly, our approach can detect regressions as early as in the development phase, in contrast to existing approaches that require the system to be fully built and deployed. Our approach is lightweight and can complement traditional system performance testing when testing resources are scarce.
Tags: "Security", "Testing and Quality"Amirhossein Deljouyi, Roham Koohestani, Maliheh Izadi, Andy Zaidman, "Leveraging Large Language Models for Enhancing the Understandability of Generated Unit Tests"
Abstract: Automated unit test generators, particularly search-based software testing tools like EvoSuite, are capable of generating tests with high coverage. Although these generators alleviate the burden of writing unit tests, they often pose challenges for software engineers in terms of understanding the generated tests. To address this, we introduce UTGen, which combines search-based software testing and large language models to enhance the understandability of automatically generated test cases. We achieve this enhancement through contextualizing test data, improving identifier naming, and adding descriptive comments. Through a controlled experiment with 32 participants, we investigate how the understandability of unit tests affects a software engineer's ability to perform bug-fixing tasks. We selected bug-fixing to simulate a real-world scenario that emphasizes the importance of understandable test cases. We observe that participants working on assignments with test cases fix up to 33% more bugs and use up to 20\% less time when compared to baseline test cases. From the post-test questionnaire, we gathered that participants found that enhanced test names, test data, and variable names improved their bug-fixing process.
Tags: "Testing and Quality", "AI for SE"Sebastian Uchitel, Francisco Cirelli, Dalal Alrajeh, "Unavoidable Boundary Conditions: A Control Perspective on Goal Conflicts"
Abstract: Boundary Conditions (BCs) express situations under which requirements specifications conflict. They are used within a broader conflict management process to produce less idealized specifications. Several approaches have been proposed to identify BCs automatically. Some introduce a prioritization criteria to reduce the number of BCs presented to an engineer. However, identifying the few, relevant boundary conditions remains an open challenge. In this paper, we argue that one of the problems of the state of the art is with the definition of BC itself -- it is too weak. We propose a stronger definition for the few, relevant BCs, which we refer to as Unavoidable Boundary Conditions (UBCs), which utilizes the notion of realizability in reactive synthesis. We show experimentally that UBCs non-trivially reduce the number of conditions produced by existing BC identification techniques. We also relate UBCs to existing concepts in reactive synthesis used to provide feedback for unrealizable specifications (including counter-strategies and unrealizable cores). We then show that UBCs provide a targeted form of feedback for repairing unrealizable specifications.
Tags: "Requirements", "Analysis"Brian Hyeongseok Kim, Jingbo Wang, Chao Wang, "FairQuant: Certifying and Quantifying Fairness of Deep Neural Networks"
Abstract: We propose a method for formally certifying and quantifying individual fairness of a deep neural network (DNN). Individual fairness guarantees that any two individuals who are identical except for some protected input attribute (e.g., gender or race) receive the same treatment. While there are existing techniques that provide such a guarantee, they suffer from lack of scalability or accuracy as the size and input dimension of the DNN increase. Our method overcomes this limitation by applying abstraction to a symbolic interval based analysis of the DNN followed by iterative refinement guided by the fairness property. Furthermore, our method lifts the interval based analysis from the conventional qualitative certification to quantitative certification, by computing the percentage of individuals whose classification outputs are provably fair, instead of merely deciding if the DNN is fair. We have implemented our method and evaluated it on deep neural networks trained on five popular fairness research datasets. The experimental results show that our method is not only more accurate than state-of-the-art techniques but also several orders-of-magnitude faster.
Tags: "SE for AI", "Analysis"Mingyuan Wu, Jiahong Xiang, Kunqiu Chen, Peng DI, Shin Hwei Tan, Heming Cui, Yuqun Zhang, "Tumbling Down the Rabbit Hole: How do Assisting Exploration Strategies Facilitate Grey-box Fuzzing?"
Abstract: Many assisting exploration strategies have been proposed to assist grey-box fuzzers in exploring program states guarded by tight and complex branch conditions such as equality constraints. Although they have shown promising results in their original papers, their evaluations seldom follow equivalent protocols, e.g., they are rarely evaluated on identical benchmarks. Moreover, there is a lack of sufficient investigations on the specifics of the program states explored by these strategies which can obfuscate the future application and development of such strategies. Consequently, there is a pressing need for a comprehensive study of assisting exploration strategies on their effectiveness, versatility, and limitations to enlighten their future development. To this end, we perform the first comprehensive study about the assisting exploration strategies for grey-box fuzzers. Specifically, we first collect nine recent fuzzers representing the mainstream assisting exploration strategies as our studied subjects and 21 real-world projects to form our benchmark suite. After evaluating the subjects on the benchmark suite, we then surprisingly find that the dictionary strategy is the most promising since it not only achieves similar or even slightly better performance over the other studied assisting exploration strategies in terms of exploring program states but also is more practical to be enhanced. Accordingly, we propose CDFUZZ, which generates a customized dictionary for each seed upon the baseline fuzzer AFL to improve over the original dictionary strategy. The evaluation results demonstrate that CDFUZZ increases the edge coverage by 16.1% on average for all benchmark projects over the best performer in our study (i.e., AFL++ with the dictionary strategy). CDFUZZ also successfully exposed 37 previously unknown bugs, with nine confirmed and seven fixed by the corresponding developers.
Tags: "Testing and Quality", "Analysis"Saikat Chakraborty, Gabriel Ebner, Siddharth Bhat, Sarah Fakhoury, Sakina Fatima, Shuvendu Lahiri, Nikhil Swamy, "Towards Neural Synthesis for SMT-assisted Proof-Oriented Programming"
Abstract: Proof-oriented programs mix computational content with proofs of program correctness. However, the human effort involved in programming and proving is still substantial, despite the use of Satisfiability Modulo Theories (SMT) solvers to automate proofs in languages such as F*. Seeking to spur research on using AI to automate the construction of proof-oriented programs, we curate a dataset of 600K lines of open-source F* programs and proofs, including software used in production systems ranging from Windows and Linux, to Python and Firefox. Our dataset includes around 32K top-level F* definitions, each representing a type-directed program and proof synthesis problem---producing a definition given a formal specification expressed as an F* type. We provide a program-fragment checker that queries F* to check the correctness of candidate solutions. We believe this is the largest corpus of SMT-assisted program proofs coupled with a reproducible program-fragment checker. Grounded in this dataset, we investigate the use of AI to synthesize programs and their proofs in F*, with promising results. Our main finding in that the performance of fine-tuned smaller language models (such as Phi-2 or StarCoder) compare favorably with large language models (such as GPT-4), at a much lower computational cost. We also identify various type-based retrieval augmentation techniques and find that they boost performance significantly. With detailed error analysis and case studies, we identify potential strengths and weaknesses of models and techniques and suggest directions for future improvements.
Tags: "AI for SE", "Security"Kunpeng Zhang, Shuai Wang, Jitao Han, Xiaogang Zhu, Xian Li, Shaohua Wang, Sheng Wen, "Your Fix Is My Exploit: Enabling Comprehensive DL Library API Fuzzing with Large Language Models"
Abstract: Deep learning (DL) libraries are widely used to form the basis of various AI applications in computer vision, natural language processing, and software engineering domains. Despite their popularity, DL libraries are known to have vulnerabilities, such as buffer overflows, use-after-free, and integer overflows, that can be exploited to compromise the security or effectiveness of the underlying libraries. While traditional fuzzing techniques have been used to find bugs in software, they are not well-suited for DL libraries. In general, the complexity of DL libraries and the diversity of their APIs make it challenging to test them thoroughly. To date, mainstream DL libraries like TensorFlow and PyTorch have featured over 1,000 APIs, and the number of APIs is still growing. Fuzzing all these APIs is a daunting task, especially when considering the complexity of the input data and the diversity of the API usage patterns. Recent advances in large language models (LLMs) have illustrated the high potential of LLMs in understanding and synthesizing human-like code. Despite their high potential, we find that emerging LLM-based fuzzers are less optimal for DL library API fuzzing, given their lack of in-depth knowledge on API input edge cases and inefficiency in generating test inputs. In this paper, we propose DFUZZ, a LLM-driven DL library fuzzing approach. We have two key insights: (1) With high reasoning ability, LLMs can replace human experts to reason edge cases (likely error-triggering inputs) from checks in an API's code, and transfer the extracted knowledge to test other (new or rarely-tested) APIs. (2) With high generation ability, LLMs can synthesize initial test programs with high accuracy that automates API testing. DFUZZ provides LLMs with a novel ''white-box view'' of DL library APIs, and therefore, can leverage LLMs' reasoning and generation abilities to achieve comprehensive fuzzing. Our experimental results on popular DL libraries demonstrate that DFUZZ is able to cover more APIs than SOTA (LLM-based) fuzzers on TensorFlow and PyTorch, respectively. Moreover, DFUZZ successfully detected 37 bugs, with 17 already confirmed as previously unknown bugs.
Tags: "AI for SE", "Testing and Quality"Yubo Mai, Zhipeng Gao, Haoye Wang, Tingting Bi, Xing Hu, Xin Xia, jianling Sun, "Towards Better Answers: Automated Stack Overflow Post Updating"
Abstract: Utilizing code snippets on Stack Overflow (SO) is a common practice among developers for problem-solving. Although SO code snippets serve as valuable resources, it is important to acknowledge their imperfections, reusing problematic code snippets can lead to the introduction of suboptimal or buggy code into software projects. \textit{SO comments} often point out weaknesses of a post and provide valuable insights to improve the quality of answers, while SO comments are usually missed and/or ignored, leaving these problematic code snippets untouched. In this work, we first investigate the task of automatic SO posts updating based on their associated comments. We introduce a novel framework, named \textbf{Soup} (\textbf{\underline{S}}tack \textbf{\underline{O}}verflow \textbf{\underline{U}}pdator for \textbf{\underline{P}}ost) for this task. \textbf{Soup} addresses two key tasks: Valid Comment-Edit Prediction (VCP) and Automatic Post Updating (APU). We fine-tuned a large language model, CodeLlama, using low-rank adaptation techniques to complete the VCP task, and constructed a dataset containing 78k valid comment-edit pairs for the APU task. Subsequently, we tested the performance of multiple large language models on the APU task. Extensive experimental results show the promising performance of our model over a set of benchmarks. Moreover, we also perform an in-the-wild evaluation on Stack Overflow, we submitted 50 edits generated by our approach to Stack Overflow posts and 21 of them have been verified and accepted by SO maintainers, further proving the practical value of \textbf{Soup}.
Tags: "AI for SE"Tianchang Gao, Junjie Chen, Dong Wang, Yile Guo, Yingquan Zhao, Zan Wang, "Selecting Initial Seeds for Better JVM Fuzzing"
Abstract: JVM fuzzing techniques serve as a cornerstone for guaranteeing the quality of implementations. In typical fuzzing workflows, initial seeds are crucial as they form the basis of the process. Literature in traditional program fuzzing has confirmed that effectiveness is largely impacted by redundancy among initial seeds, thereby proposing a series of seed selection methods. JVM fuzzing, compared to traditional ones, presents unique characteristics, including large-scale and intricate code, and programs with both syntactic and semantic features. However, it remains unclear whether the existing initial seed selection methods are suitable for JVM fuzzing and whether utilizing program features can enhance effectiveness. To address this, we devised a total of 10 initial seed selection methods, comprising coverage-based, prefuzz-based, and program-feature-based methods. We then conducted an empirical study on three JVM implementations to extensively evaluate the performance of the initial seed selection methods within two state-of-the-art fuzzing techniques (JavaTailor and VECT). Specifically, we examine performance from three aspects: (i) effectiveness and efficiency using widely studied initial seeds, (ii) effectiveness using the programs in the wild, and (iii) the ability to detect new bugs. Evaluation results first show that the program-feature-based method that utilizes the control flow graph not only has a significantly lower time overhead (i.e., 30s), but also outperforms other methods, achieving 142% to 269% improvement compared to the full set of initial seeds. Second, results reveal that the initial seed selection greatly improves the quality of wild programs and exhibits complementary effectiveness by detecting new behaviors. Third, results demonstrate that given the same testing period, initial seed selection improves the JVM fuzzing techniques by detecting more unknown bugs. Particularly, 16 out of the 25 detected bugs have been confirmed or fixed by developers. This work takes the first look at initial seed selection in JVM fuzzing, confirming its importance in fuzzing effectiveness and efficiency.
Tags: "Testing and Quality", "Analysis"Weisong Sun, Yun Miao, Yuekang Li, Hongyu Zhang, Chunrong Fang, Yi Liu, Gelei Deng, Yang Liu, Zhenyu Chen, "Source Code Summarization in the Era of Large Language Models"
Abstract: To support software developers in understanding and maintaining programs, various automatic (source) code summarization techniques have been proposed to generate a concise natural language summary (i.e., comment) for a given code snippet. Recently, the emergence of large language models (LLMs) has led to a great boost in the performance of code-related tasks. In this paper, we undertake a systematic and comprehensive study on code summarization in the era of LLMs, which covers multiple aspects involved in the workflow of LLM-based code summarization. Specifically, we begin by examining prevalent automated evaluation methods for assessing the quality of summaries generated by LLMs and find that the results of the GPT-4 evaluation method are most closely aligned with human evaluation. Then, we explore the effectiveness of five prompting techniques (zero-shot, few-shot, chain-of-thought, critique, and expert) in adapting LLMs to code summarization tasks. Contrary to expectations, advanced prompting techniques may not outperform simple zero-shot prompting. Next, we investigate the impact of LLMs' model settings (including top\_p and temperature parameters) on the quality of generated summaries. We find the impact of the two parameters on summary quality varies by the base LLM and programming language, but their impacts are similar. Moreover, we canvass LLMs' abilities to summarize code snippets in distinct types of programming languages. The results reveal that LLMs perform suboptimally when summarizing code written in logic programming languages compared to other language types (e.g., procedural and object-oriented programming languages). Finally, we unexpectedly find that \codellama{} with 7B parameters can outperform advanced GPT-4 in generating summaries describing code implementation details and asserting code properties. We hope that our findings can provide a comprehensive understanding of code summarization in the era of LLMs.
Tags: "AI for SE", "Prog Comprehension/Reeng/Maint"Yuling Shi, Hongyu Zhang, Chengcheng Wan, Xiaodong Gu, "Between Lines of Code: Unraveling the Distinct Patterns of Machine and Human Programmers"
Abstract: Large language models have catalyzed an unprecedented wave in code generation. While achieving significant advances, they blur the distinctions between machine- and human-authored source code, causing integrity and authenticity issues of software artifacts. Previous methods such as DetectGPT have proven effective in discerning machine-generated texts, but they do not identify and harness the unique patterns of machine-generated code. Thus, its applicability falters when applied to code. In this paper, we carefully study the specific patterns that characterize machine and human-authored code. Through a rigorous analysis of code attributes such as lexical diversity, conciseness, and naturalness, we expose unique patterns inherent to each source. We particularly notice that the syntactic segmentation of code is a critical factor in identifying its provenance. Based on our findings, we propose a novel machine-generated code detection method called DetectCodeGPT, which improves DetectGPT by capturing the distinct stylized patterns of code. Diverging from conventional techniques that depend on external LLMs for perturbations, DetectCodeGPT perturbs the code corpus by strategically inserting spaces and newlines, ensuring both efficacy and efficiency. Experiment results show that our approach significantly outperforms state-of-the-art techniques in detecting machine-generated code.
Tags: "AI for SE", "Human/Social"Yue Wang, Chao Yang, Xiaodong Zhang, Yuwanqi Deng, JianFeng Ma, "DPFuzzer: Discovering Safety Critical Vulnerabilities for Drone Path Planners"
Abstract: State-of-the-art drone path planners enable drones to autonomously navigate around obstacles in GPS-denied, uncharted and cluttered environments. However, our investigation shows that path planners fail to maneuver drones correctly in specific scenarios, leading to incidents such as collisions. To minimize such risks, drone path planners should be tested thoroughly against diverse scenarios before deployment. Existing research for testing drones to uncover safety-critical vulnerabilities is only focused on the flight control programs and is limited in the capability to generate diverse obstacle scenarios for testing drone path planners. In this work, we propose \textit{DPFuzzer}, an automated framework for testing drone path planners. \textit{DPFuzzer} is an evolutionary algorithm (EA) based testing framework. It aims to uncover vulnerabilities in drone path planners by generating diverse critical scenarios that can trigger vulnerabilities. To better guide the critical scenario generation, we introduce \textit{Environmental Risk Factor (ERF)}, a metric we propose, to abstract potential safety threats of scenarios. We evaluate \textit{DPFuzzer} on state-of-the-art drone path planners and the experimental result shows that \textit{DPFuzzer} can effectively find diverse vulnerabilities. Additionally, we demonstrate that these vulnerabilities are exploitable in the real world on commercial drones.
Tags: "Testing and Quality", "Analysis"Shiyu Zhang, Haoyang Song, Qixin Wang, Henghua Shen, Yu Pei, "A Test Oracle for Reinforcement Learning Software based on Lyapunov Stability Control Theory"
Abstract: Reinforcement Learning (RL) has gained significant attention in recent years. As RL software becomes more complex and infiltrates critical application domains, ensuring its quality and correctness becomes increasingly important. An indispensable aspect of software quality/correctness insurance is testing. However, testing RL software faces unique challenges compared to testing traditional software, due to the difficulty on defining the outputs’ correctness. This leads to the RL test oracle problem. Current approaches to testing RL software often rely on human oracles, i.e. convening human experts to judge the correctness of RL software outputs. This heavily depends on the availability and quality (including the experiences, subjective states, etc.) of the human experts, and cannot be fully automated. In this paper, we propose a novel approach to design test oracles for RL software by leveraging the Lyapunov stability control theory. By incorporating Lyapunov stability concepts to guide RL training, we hypothesize that a correctly implemented RL software shall output an agent that respects Lyapunov stability control theories. Based on this heuristics, we propose a Lyapunov stability control theory based oracle, LPEA(ϑ, θ), for testing RL software. We conduct extensive experiments over representative RL algorithms and RL software bugs to evaluate our proposed oracle. The results show that our proposed oracle can outperform the human oracle in most metrics. Particularly, LPEA(ϑ = 100%, θ = 75%) outperforms the human oracle by 53.6%, 50%, 18.4%, 34.8%, 18.4%, 127.8%, 60.5%, 38.9%, and 31.7% respectively on accuracy, precision, recall, F1 score, true positive rate, true negative rate, false positive rate, false negative rate, and ROC curve’s AUC; and LPEA(ϑ = 100%, θ = 50%) outperforms the human oracle by 48.2%, 47.4%, 10.5%, 29.1%, 10.5%, 127.8%, 60.5%, 22.2%, and 26.0% respectively on these metrics.
Tags: "SE for AI", "Analysis"Yisong Xiao, Aishan Liu, Xinwei Zhang, Tianyuan Zhang, Tianlin Li, Siyuan Liang, Xianglong Liu, Yang Liu, Dacheng Tao, "BDefects4NN: A Backdoor Defect Database for Controlled Localization Studies in Neural Networks"
Abstract: Pre-trained large deep learning models are now serving as the dominant component for downstream middleware users and have revolutionized the learning paradigm, replacing the traditional approach of training from scratch locally. To reduce development costs, developers often integrate third-party pre-trained deep neural networks (DNNs) into their intelligent software systems. However, utilizing untrusted DNNs presents significant security risks, as these models may contain intentional backdoor defects resulting from the black-box training process. These backdoor defects can be activated by hidden triggers, allowing attackers to maliciously control the model and compromise the overall reliability of the intelligent software. To ensure the safe adoption of DNNs in critical software systems, it is crucial to establish a backdoor defect database for localization studies. This paper addresses this research gap by introducing \emph{BDefects4NN}, the first backdoor defect database, which provides labeled backdoor-defected DNNs at the neuron granularity and enables controlled localization studies of defect root causes. In \emph{BDefects4NN}, we define three defect injection rules and employ four representative backdoor attacks across four popular network architectures and three widely adopted datasets, yielding a comprehensive database of 1,654 backdoor-defected DNNs with four defect quantities and varying infected neurons. Based on \emph{BDefects4NN}, we conduct extensive experiments on evaluating six fault localization criteria and two defect repair techniques, which show limited effectiveness for backdoor defects. Additionally, we investigate backdoor-defected models in practical scenarios, specifically in lane detection for autonomous driving and large language models (LLMs), revealing potential threats and highlighting current limitations in precise defect localization. This paper aims to raise awareness of the threats brought by backdoor defects in our community and inspire future advancements in fault localization methods.
Tags: "SE for AI", "Analysis"Ian McCormack, Joshua Sunshine, Jonathan Aldrich, "A Study of Undefined Behavior Across Foreign Function Boundaries in Rust Libraries"
Abstract: Developers rely on the Rust programming language's static safety guarantees to write secure and performant applications. However, Rust is frequently used to interoperate with other languages which allow design patterns that conflict with Rust's aliasing models. Miri is the only dynamic analysis tool capable of validating applications against these models, but it does not support foreign functions, indicating that there may be a critical correctness gap at the heart of the Rust ecosystem. We conducted a large-scale evaluation of multi-language Rust libraries to determine whether Miri's dynamic analyses remain useful in this context. We used Miri and an LLVM interpreter to jointly execute multi-language applications, where we found 48 instances of undefined or undesired behavior. These include three bugs from libraries that had over 10,000 daily downloads on average during our observation period, and one from a library maintained by the Rust Project. Many of the errors we found involved incompatible aliasing patterns, but Rust's latest Tree Borrows aliasing model was significantly more permissive than the earlier Stacked Borrows model. The Rust community must invest in new, production-ready tooling for multi-language applications to ensure that developers can detect these errors.
Tags: "Analysis", "Security"Qikang Liu, Yang He, Yanwen Cai, Byeongguk Kwak, Yuepeng Wang, "Synthesizing Document Database Queries using Collection Abstractions"
Abstract: Document databases are increasingly popular in various applications, but their queries are challenging to write due to the flexible and complex data model underlying document databases. This paper presents a synthesis technique that aims to generate document database queries from input-output examples automatically. A new domain-specific language is designed to express a representative set of document database queries in an algebraic style. Furthermore, the synthesis technique leverages a novel abstraction of collections for deduction to efficiently prune the search space and quickly generate the target query. An evaluation of 110 benchmarks from various sources shows that the proposed technique can synthesize 108 benchmarks successfully. On average, the synthesizer can generate document database queries from a small number of input-output examples within tens of seconds.
Tags: "AI for SE", "Testing and Quality"Yesugen Baatartogtokh, Kaitlyn Cook, Alicia M. Grubb, "Exploring the Robustness of the Effect of EVO on Intention Valuation through Replication"
Abstract: The development of high-quality software depends on precise and comprehensive requirements that meet the objectives of stakeholders. Goal modeling techniques have been developed to fill this gap by capturing and analyzing stakeholders' needs and allowing them to make trade-off decisions; yet, goal modeling analysis is often difficult for stakeholders to interpret. Recent work found that when subjects are given minimal training on goal modeling and access to a color visualization, called EVO, they are able to use EVO to make goal modeling decisions faster without compromising quality. In this paper, we evaluate the robustness of the empirical evidence for EVO and question the underlying color choices made by the initial designers of EVO. We conduct a pseudo-exact replication ($n = 60$) of the original EVO study, varying the experimental site and the study population. Even in our heterogeneous sample with less a priori familiarity with requirements and goal modeling, we find that individuals using EVO answered the goal-modeling questions significantly faster than those using the control, expanding the external validity of the original results. However, we find some evidence that the chosen color scheme is not intuitive and make recommendations for the goal modeling community.
Tags: "Requirements", "Human/Social"Yanick Fratantonio, Luca Invernizzi, Loua Farah, Kurt Thomas, Marina Zhang, Ange Albertini, Francois Galilee, Giancarlo Metitieri, Julien Cretin, Alex Petit-Bianco, David Tao, Elie Bursztein, "Magika: AI-Powered Content-Type Detection"
Abstract: The task of content-type detection---which entails identifying the data encoded in an arbitrary byte sequence---is critical for operating systems, development, reverse engineering environments, and a variety of security applications. In this paper, we introduce Magika, a novel AI-powered content-type detection tool. Under the hood, Magika employs a deep learning model that can execute on a single CPU with just 1MB of memory to store the model's weights. We show that Magika achieves an average F1 score of 99% across over a hundred content types and a test set of more than 1M files, outperforming all existing content-type detection tools today. In order to foster adoption and improvements, we open source Magika under an Apache 2 license on GitHub and make our model and training pipeline publicly available. Our tool has already seen adoption by a major email provider for attachment scanning, and it has been integrated with VirusTotal to aid malware analysis.
Tags: "AI for SE", "Security"Xiafa Wu, Brian Demsky, "GenC2Rust: Towards Generating Generic Rust Code from C"
Abstract: Rust provides an exciting combination of strong safety guarantees and high performance. Many new systems are being implemented in Rust. Nevertheless, there is a large body of existing C code that could greatly benefit from Rust's safety guarantees. Unfortunately, the manual effort required to rewrite C code into Rust is often prohibitively expensive. Researchers have explored tools to assist developers in translating legacy C code into Rust code. However, the mismatch between C abstractions and idiomatic Rust abstractions makes it challenging to automatically utilize Rust's language features, resulting in non-idiomatic Rust code that requires extensive manual effort to further refactor. For example, existing tools often fail to map polymorphic uses of void pointers in C to Rust's more idiomatic generic pointers. In this paper, we present a translation tool, GenC2Rust, that translates non-generic C code into generic Rust code. GenC2Rust statically analyzes the use of void pointers in the C program to compute the typing constraints and then retypes the void pointers into generic pointers. We conducted an evaluation of GenC2Rust across 42 C programs that vary in size and span multiple domains to demonstrate its scalability as well as correctness. We also present a detailed analysis of the limiting factors encountered in the translation process.
Tags: "Analysis", "Prog Comprehension/Reeng/Maint"Hongyan Gao, Yibiao Yang, Maolin Sun, Jiangchang Wu, Yuming Zhou, Baowen Xu, "ClozeMaster: Fuzzing Rust Compiler by Harnessing LLMs for Infilling Masked Real Programs"
Abstract: Ensuring the reliability of the Rust compiler is of paramount importance, as the Rust language is increasingly being adopted for developing critical systems due to its emphasis on memory and thread safety. However, due to Rust’s complex syntax and strict requirements, generating valid test programs for the Rust compiler poses significant challenges. Currently, with the growing popularity of large language models (LLMs), much research in software testing has explored the use of LLMs to generate test cases. Despite this, directly using LLMs to generate Rust programs often results in a large number of invalid test cases. Existing studies have indicated that test cases triggering historical compiler bugs can assist in software testing. Our investigation into Rust compiler bug issues further supports this observation. Inspired by existing work and our empirical research, we introduce a bracket-based masking and filling strategy called clozeMask. The clozeMask strategy involves extracting test code from historical issue reports, identifying and masking code snippets with specific structures, and utilizing an LLM to fill in the masked portions for synthesizing new test programs. This approach harnesses the generative capabilities of LLMs while retaining the ability to trigger Rust compiler bugs. Ultimately, it enables comprehensive testing of the compiler’s behavior, particularly in exploring corner cases. We implemented our approach as a prototype CLOZEMASTER. CLOZEMASTER has identified 27 confirmed bugs for rustc and mrustc, of which 10 have been fixed by developers. Furthermore, our experimental results indicate that CLOZEMASTER outperforms existing generative fuzzers in terms of code coverage and effectiveness.
Tags: "Testing and Quality", "AI for SE"Xingyu Wang, MingSen Wang, Wenbo Shen, Rui Chang, "Understanding and Detecting Peer Dependency Resolving Loop in npm Ecosystem"
Abstract: As the default package manager for Node.js, npm has become one of the largest package management systems in the world. To facilitate dependency management for developers, npm supports a special type of dependency, Peer Dependency, whose installation and usage differ from regular dependencies. However, conflicts between peer dependencies can trap the npm client into infinite loops, leading to resource exhaustion and system crashes. We name this problem PeerSpin. Although PeerSpin poses a severe risk to ecosystems, it was overlooked by previous studies, and its impacts have not been explored. To bridge this gap, this paper conducts the first in-depth study to understand and detect PeerSpin in the npm ecosystem. First, by systematically analyzing the npm dependency resolution, we identify the root cause of PeerSpin and characterize two peer dependency patterns to guide detection. Second, we propose a novel technique called Node-Replacement-Conflict based PeerSpin Detection, which leverages the state of the directory tree during dependency resolution to achieve accurate and efficient PeerSpin detection. Based on this technique, we developed a tool called PeerChecker to detect PeerSpin. Finally, we apply PeerChecker to the entire NPM ecosystem and find that 5,662 packages, totaling 72,968 versions, suffer from PeerSpin. Up until now, we confirmed 28 real PeerSpin problems by reporting them to the package maintainer. We also open source all PeerSpin analysis implementations, tools, and data sets to the public to help the community detect PeerSpin issues and enhance the reliability of the npm ecosystem.
Tags: "MSR", "Prog Comprehension/Reeng/Maint"Jiashuo Zhang, Jiachi Chen, John Grundy, Jianbo Gao, Yanlin Wang, Ting Chen, Zhi Guan, Zhong Chen, "Automated Test Generation For Smart Contracts via On-Chain Test Case Augmentation and Migration"
Abstract: Pre-deployment testing has become essential to ensure the functional correctness of smart contracts. However, since smart contracts are stateful programs integrating many different functionalities, manually writing test cases to cover all potential usages requires significant effort from developers, leading to insufficient testing and increasing risks in practice. Although several testing techniques for smart contracts have been proposed, they primarily focus on detecting common low-level vulnerabilities such as re-entrancy, rather than generating expressive and function-relevant test cases that can reduce manual testing efforts. To bridge the gap, we propose SolMigrator, an automated technique designed to generate expressive and representative test cases for smart contracts. To our knowledge, SolMigrator is the first migration-based test generation technique for smart contracts, which extracts test cases from real-world usages of on-chain contracts and migrates them to test newly developed smart contracts with similar functionalities. Given a target smart contract to be tested and an on-chain similar source smart contract, SolMigrator first transforms the on-chain usage of the source contract into off-chain executable test cases based on on-chain transaction replay and dependency analysis. It then employs fine-grained static analysis to migrate the augmented test cases from the source to the target smart contract. We built a prototype of SolMigrator and have evaluated it on real-world smart contracts within the two most popular categories, ERC20 and ERC721. Our evaluation results demonstrate that SolMigrator effectively extracts test cases from existing on-chain smart contracts and accurately migrates them across different smart contracts, achieving an average precision of 96.3% and accuracy of 93.6%. Furthermore, the results indicate that these migrated test cases effectively cover common key functionalities of the target smart contracts. This provides promising evidence that real-world usages of existing smart contracts can be transformed into effective test cases for other newly developed smart contracts.
Tags: "Testing and Quality", "Prog Comprehension/Reeng/Maint"Mengxiao Zhang, Zhenyang Xu, Yongqiang Tian, Xinru Cheng, Chengnian Sun, "Toward a Better Understanding of Probabilistic Delta Debugging"
Abstract: Given a list L of elements and a property ψ that L exhibits, ddmin is a classic test input minimization algorithm that aims to automatically remove ψ-irrelevant elements from L. This algorithm has been widely adopted in domains such as test input minimization and software debloating. Recently, ProbDD, a variant of ddmin, has been proposed and achieved stateof- the-art performance. By employing Bayesian optimization, ProbDD estimates the probability of each element in L being relevant to ψ, and statistically decides which and how many elements should be deleted together each time. However, the theoretical probabilistic model of ProbDD is rather intricate, and the underlying details for the superior performance of ProbDD have not been adequately explored. In this paper, we conduct the first in-depth theoretical analysis of ProbDD, clarifying the trends in probability and subset size changes and simplifying the probability model. We complement this analysis with empirical experiments, including success rate analysis, ablation studies, and examinations of trade-offs and limitations, to further comprehend and demystify this state-of- the-art algorithm. Our success rate analysis reveals how ProbDD effectively addresses bottlenecks that slow down ddmin by skipping inefficient queries that attempt to delete complements of subsets and previously tried subsets. The ablation study illustrates that randomness in ProbDD has no significant impact on efficiency. These findings provide valuable insights for future research and applications of test input minimization algorithms. Based on the findings above, we propose CDD, a simplified version of ProbDD, reducing the complexity in both theory and implementation. CDD assists in 1 validating the correctness of our key findings, e.g., that probabilities in ProbDD essentially serve as monotonically increasing counters for each element, and 2 identifying the main factors that truly contribute to ProbDD’s superior performance. Our comprehensive evaluations across 76 benchmarks in test input minimization and software debloating demonstrate that CDD can achieve the same performance as ProbDD, despite being much simplified.
Tags: "Testing and Quality", "Analysis"Benjamin Steenhoek, Siva Sivaraman, Renata Saldivar Gonzalez, Yevhen Mohylevskyy, Roshanak Zilouchian Moghaddam, Wei Le, "Closing the Gap: A User Study on the Real-world Usefulness of AI-powered Vulnerability Detection & Repair in the IDE"
Abstract: Security vulnerabilities impose significant costs on users and organizations. Detecting and addressing these vulnera bilities early is crucial to avoid exploits and reduce development costs. Recent studies have shown that deep learning models can effectively detect security vulnerabilities. Yet, little research explores how to adapt these models from benchmark tests to practical applications, and whether they can be useful in practice. This paper presents the first empirical study of a vulnerability detection and fix tool with professional software developers on real projects that they own. We implemented DEEPVULGUARD, an IDE-integrated tool based on state-of-the-art detection and fix models, and show that it has promising performance on bench marks of historic vulnerability data. DEEPVULGUARD scans code for vulnerabilities (including identifying the vulnerability type and vulnerable region of code), suggests fixes, provides natural- language explanations for alerts and fixes, leveraging chat inter faces. We recruited 17 professional software developers, observed their usage of the tool on their code, and conducted interviews to assess the tool’s usefulness, speed, trust, relevance, and workflow integration. We also gathered detailed qualitative feedback on users’ perceptions and their desired features. Study participants scanned a total of 24 projects, 6.9k files, and over 1.7 million lines of source code, and generated 170 alerts and 50 fix suggestions. We find that although state-of-the-art AI-powered detection and fix tools show promise, they are not yet practical for real-world use due to a high rate of false positives and non-applicable fixes. User feedback reveals several actionable pain points, ranging from incomplete context to lack of customization for the user’s codebase. Additionally, we explore how AI features, including confidence scores, explanations, and chat interaction, can apply to vulnerability detection and fixing. Based on these insights, we offer practical recommendations for evaluating and deploying AI detection and fix models. Our code and data are available at this link: https://figshare.com/s/77992badb1e37c09e4eb. We plan to release our tool as open-source to support further user studies for other AI based tools.
Tags: "AI for SE", "Security"Soneya Binta Hossain, Matthew Dwyer, "TOGLL: Correct and Strong Test Oracle Generation with LLMs"
Abstract: Test oracles play a crucial role in software testing, enabling effective bug detection. Despite initial promise, neural methods for automated test oracle generation often result in a large number of false positives and weaker test oracles. While LLMs have shown impressive effectiveness in various software engineering tasks, including code generation, test case creation, and bug fixing, there remains a notable absence of large-scale studies exploring their effectiveness in test oracle generation. The question of whether LLMs can address the challenges in effective oracle generation is both compelling and requires thorough investigation. In this research, we present the first comprehensive study to investigate the capabilities of LLMs in generating correct, diverse, and strong test oracles capable of effectively identifying a large number of unique bugs. To this end, we fine-tuned seven code LLMs using six distinct prompts on a large dataset consisting of 110 Java projects. Utilizing the most effective fine- tuned LLM and prompt pair, we introduce TOGLL, a novel LLM-based method for test oracle generation. To investigate the generalizability of TOGLL, we conduct studies on 25 unseen large-scale Java projects. Besides assessing the correctness, we also assess the diversity and strength of the generated oracles. We compare the results against EvoSuite and the state-of-the-art neural method, TOGA. Our findings reveal that TOGLL can produce 3.8 times more correct assertion oracles and 4.9 times more exception oracles. Regarding bug detection effectiveness, TOGLL can detect 1,023 unique mutants that EvoSuite cannot, which is ten times more than what the previous SOTA neural-based method, TOGA, can detect. Additionally, TOGLL significantly outperforms TOGA in detecting real bugs from the Defects4J dataset.
Tags: "Testing and Quality", "AI for SE"Parsa Alian, Noor Nashid, Mobina Shahbandeh, Taha Shabani, Ali Mesbah, "Feature-Driven End-To-End Test Generation"
Abstract: End-to-end (E2E) testing is essential for ensuring web application quality. However, manual test creation is time-consuming and current test generation techniques produce random tests. In this paper, we present AUTOE2E, a novel approach that leverages Large Language Models (LLMs) to automate the generation of semantically meaningful feature-driven E2E test cases for web applications. AUTOE2E intelligently infers potential features within a web application and translates them into executable test scenarios. Furthermore, we address a critical gap in the research community by introducing E2EBENCH, a new benchmark for automatically assessing the feature coverage of E2E test suites. Our evaluation on E2EBENCH demonstrates that AUTOE2E achieves an average feature coverage of 79%, outperforming the best baseline by 558%, highlighting its effectiveness in generating high-quality, comprehensive test cases.
Tags: "AI for SE", "Testing and Quality"Aaron Imani, Iftekhar Ahmed, Mohammad Moshirpour, "Context Conquers Parameters: Outperforming Proprietary LLM in Commit Message Generation"
Abstract: Commit messages provide descriptions of the modifications made in a commit using natural language, making them crucial for software maintenance and evolution. Recent developments in Large Language Models (LLMs) have led to their use in generating high-quality commit messages, such as the Omniscient Message Generator (OMG). This method employs GPT-4 to produce state-of-the-art commit messages. However, the use of proprietary LLMs like GPT-4 in coding tasks raises privacy and sustainability concerns, which may hinder their industrial adoption. Considering that open-source LLMs have achieved competitive performance in developer tasks such as compiler validation, this study investigates whether they can be used to generate commit messages that are comparable with OMG. Our experiments show that an open-source LLM can generate commit messages that are comparable to those produced by OMG. In addition, through a series of contextual refinements, we propose lOcal MessagE GenerAtor (OMEGA) , a CMG approach that uses a 4-bit quantized 8B open-source LLM. OMEGA produces state-of-the-art commit messages, surpassing the performance of GPT-4 in practitioners' preference.
Tags: "AI for SE", "Prog Comprehension/Reeng/Maint"Yuanliang Zhang, Yifan Xie, Shanshan Li, Ke Liu, Chong Wang, Zhouyang Jia, Xiangbing Huang, Jie Song, Chaopeng Luo, Zhizheng Zheng, Rulin Xu, Yitong Liu, Si Zheng, Xiangke Liao, "Unseen Horizons: Unveiling the Real Capability of LLM Code Generation Beyond the Familiar"
Abstract: Recently, large language models (LLMs) have shown strong potential in code generation tasks. However, there are still gaps before they can be fully applied in actual software development processes. Accurately assessing the code generation capabilities of large language models has become an important basis for evaluating and improving the models. Some existing works have constructed datasets to evaluate the capabilities of these models. However, there are three main gaps to objectively evaluate the real capability of LLMs: the exposure of target code, case timeliness, and dependency availability. The fundamental reason for these gaps is that the code in current datasets may have been exposed during the training phase of LLM, and due to the continuous training and development of LLM, their timeliness has been severely compromised. The key to solve the problem is to, as much as possible, evaluate the LLMs using code that they have not encountered before. Thus, the fundamental idea using in this paper is to draw on the concept of code obfuscation, changing code at different levels while ensuring the functionality and output. To this end, we build a code-obfuscation based benchmark OBFUSEVAL. We first collect 1,354 raw cases from five real-world projects, including function description and code. Then we use three-level strategy (symbol, structure and semantic) to obfuscate descriptions, code and context dependencies. We evaluate four LLMs on OBFUSEVAL and compared the effectiveness of different obfuscation strategy. We use official test suites of these projects to evaluate the generated code. The results show that after obfuscation, the average decrease ratio of test pass rate can up to 62.5\%.
Tags: "AI for SE"Baoquan Cui, rong qu, Zhen Tang, Jian Zhang, "Static Analysis of Remote Procedure Call in Java Programs"
Abstract: The Remote Procedure Call (RPC) is commonly used for inter-process communications over network, allowing a program to invoke a procedure in another address space even another machine as if it were a local call within the same address space. Its convenience comes from encapsulating network communication. However, for the same reason, it cannot be penetrated by current static analyzers. Since the RPC programs/frameworks play a more important role in various domains, the static analysis of RPC is significant and cannot be ignored. We have observed that many of the existing RPC frameworks/programs written in Java are based on explicit protocols, which makes them possible to be modelled for static analysis. The challenges are how to identify RPC operations in different frameworks/programs and how to automatically establish relationships between clients and servers. In this paper, we propose a novel approach, RPCBridge, which uses an adapter to unify the most basic operations during the RPC process. It models the RPC with logic rules in a straightforward and precise way based on its semantics, performs points-to analysis and constructs RPC edges in the call graph, making it more complete. The evaluation on real-world large-scale Java programs based on 5 common RPC frameworks shows that our approach can effectively capture the operations of the RPC (263 matched protocols and 1,098 RPCs), and construct critical links (2,578 edges in the call graph) between clients and servers, in which 60.1% are the true caller-callee pairs after execution. Our approach is expected to bring significant benefits (+24.3% leakage paths for the taint analyzer) for previously incompletely modelled code with a very little memory and time overhead, and connect the modules in a system, so that it can be statically analyzed more holistically.
Tags: "Analysis"Shihao Zhu, Yuqi Guo, Yan Cai, Bin Liang, Long Zhang, Rui Chen, Tingting Yu, "Reduce Dependence for Sound Concurrency Bug Prediction"
Abstract: Recently, dynamic concurrency bug predictions have kept making notable progress in improving concurrency coverage while ensuring soundness. Most of them rely solely on dynamic information in traces and overlook the static semantics of the program when predicting bugs. To ensure soundness, they assume that \textbf{any (memory) read can fully affect subsequent program execution via control-flow and data-flow}. However, the assumption over-approximates constraints among (memory) writes and reads and hence limits reordering space over thread interleaving, ultimately leading to false negatives. From program semantics, only a subset of reads actually affect their subsequent executions. Therefore, by refining dependencies between reads and subsequent executions based on static program semantics, one can refine the assumption and eliminate unnecessary constraints while still guaranteeing soundness. This can bring a chance to explore more thread interleaving space and uncover more concurrency bugs. However, refining dependencies can compromise soundness and bring heavy overhead. To tackle these challenges, this paper introduces the concept of Necessary Consistent Read Event (NRE) and a hybrid analysis algorithm. NRE refines dependencies between reads and their subsequent events and is used to identify necessary constraints where a read probably affects the execution of its subsequent events. Next, we design an efficient and accurate hybrid analysis algorithm to calculate NREs for each event in the trace. The hybrid analysis algorithm maps events to program SSA instructions and simulates executions based on the original trace. We focused on data race and developed NRE and the algorithm as a prototype tool ReconP on top of a recent work M2. We conducted a set of comparative experiments on MySQL with M2 and SeqCheck. The results show that ReconP can detect 46.9\% and 22.4\% more data races than M2 and SeqCheck, respectively. And the hybrid algorithm only accounts for 34\% of the total time cost.
Tags: "Testing and Quality", "Security"Huaijin Wang, Zhibo Liu, Yanbo Dai, Shuai Wang, Qiyi Tang, Sen Nie, Shi Wu, "Preserving Privacy in Software Composition Analysis: A Study of Technical Solutions and Enhancements"
Abstract: Software composition analysis (SCA) denotes the process of identifying open-source software components in an input software application. SCA has been extensively developed and adopted by academia and industry. However, we notice that the modern SCA techniques in industry scenarios still need to be improved due to privacy concerns. Overall, SCA requires the users to upload their applications’ source code to a remote SCA server, which then deeply inspects the applications and reports the component usage to users. This process is privacy-sensitive since the applications may contain sensitive information, such as proprietary algorithms, trade secrets, and user data. Moreover, applications' source code is generally deemed proprietary, and users do not want to share it with the SCA vendor. To protect customers' privacy, contemporary SCA vendors often propose to deploy a "lite" version of SCA service on the customer side. To avoid the leakage of SCA vendors' valuable assets (e.g., code, model, and data), the "lite" SCA usually only performs a shallow analysis with limited accuracy. Privacy concerns have prevented the SCA technology from being used in real-world scenarios. Therefore, academia and the industry demand privacy-preserving SCA solutions. For the first time, we analyze the privacy requirements of SCA and provide a landscape depicting possible technical solutions with varying privacy gains and overheads. In particular, given that de facto SCA frameworks are primarily driven by code similarity-based techniques, we explore combining several privacy-preserving protocols to encapsulate the similarity-based SCA framework. Among all viable solutions, we find that multi-party computation (MPC) offers the strongest privacy guarantee and plausible accuracy; it, however, incurs high overhead ($184\times$). We optimize the MPC-based SCA framework by reducing the amount of crypto protocol transactions using program analysis techniques. The evaluation results show that our proposed optimizations can reduce the MPC-based SCA overhead to only 8.5% without sacrificing SCA’s privacy guarantee or accuracy.
Tags: "Prog Comprehension/Reeng/Maint", "Analysis", "Open Source"Jintao Huang, Kai Yang, Gaosheng Wang, Zhiqiang Shi, Zhiwen Pan, Shichao Lv, Limin Sun, "Moye: A Wallbreaker for Monolithic Firmware"
Abstract: As embedded devices become increasingly popular, monolithic firmware, known for its execution efficiency and simplicity, is widely used in resource-constrained devices. Different from ordinary firmware, the monolithic firmware image is packed without the file that indicates its format, which challenges the reverse engineering of monolithic firmware. Function identification is the prerequisite of monolithic firmware's analysis. Prior works on function identification are less effectiveness when applied to monolithic firmware due to their heavy reliance on file formats. In this paper, we propose Moye, a novel method to identify functions in monolithic firmware. We leverage the important insight that the use of registers must conform to some constraints. In particular, our approach segments the firmware, locate code sections and output the instructions. We uses a masked language model to learn hiding relationships among the instructions to identify the function boundaries. We evaluate Moye using 1,318 monolithic firmware images, including 48 samples collected from widely used devices. The evaluation demonstrates that our approach significantly outperforms current works, achieving a precision greater than 98% and a recall rate greater than 97% across most datasets, showing robustness to complicated compilation options.
Tags: "Analysis", "Prog Comprehension/Reeng/Maint"Haifeng Ruan, Yuntong Zhang, Abhik Roychoudhury, "SpecRover: Code Intent Extraction via LLMs"
Abstract: Autonomous program improvement typically involves automatically producing bug fixes and feature additions. Such program improvement can be accomplished by a combination of large language model (LLM) and program analysis capabilities, in the form of an LLM agent. Since program repair or program improvement typically requires a specification of intended behavior - specification inference can be useful for producing high quality program patches. In this work, we examine efficient and low-cost workflows for iterative specification inference within an LLM agent. Given a GitHub issue to be resolved in a software project, our goal is to conduct iterative code search accompanied by specification inference - thereby inferring intent from both the project structure and behavior. The intent thus captured is examined by a reviewer agent with the goal of vetting the patches as well as providing a measure of confidence in the vetted patches. Our approach SpecRover is built on the open-source LLM agent AutoCodeRover. In an evaluation on the full SWE-Bench consisting of 2294 GitHub issues, it shows more than 50% improvement in efficacy over AutoCodeRover. Compared to the open-source agents available, our work shows modest cost ($0.65 per issue) in resolving an average GitHub issue in SWE-Bench lite. The production of explanation by SpecRover allows for a better "signal" to be given to the developer, on when the suggested patches can be accepted with confidence. SpecRover also seeks to demonstrate the continued importance of specification inference in automated program repair, even as program repair technologies enter the LLM era.
Tags: "AI for SE", "Prog Comprehension/Reeng/Maint"Zhiyuan Li, Zhiyuan Li, Jingzheng Wu, Jingzheng Wu, Jingzheng Wu, Xiang Ling, Xiang Ling, Xiang Ling, Tianyue Luo, Zhiqing Rui, Zhiqing Rui, Yanjun Wu, Yanjun Wu, Yanjun Wu, "The Seeds of the FUTURE Sprout from History: Fuzzing for Unveiling Vulnerabilities in Prospective Deep-Learning Libraries"
Abstract: The widespread application of Large Language Models (LLMs) underscores the importance of Deep Learning (DL) technologies that rely on foundational DL libraries such as PyTorch and TensorFlow. Despite their robust features, these libraries face challenges with scalability and adaptation to rapid advancements in the LLM community. In response, tech giants like Apple and Huawei are developing their own DL libraries to enhance performance, increase scalability, and safeguard intellectual property. Ensuring the security of these libraries is crucial, with fuzzing being a vital solution. However, existing fuzzing frameworks struggle with target flexibility, effectively testing bug-prone API sequences, and leveraging the limited available information in new libraries. To address these limitations, we propose FUTURE, the first universal DL library fuzzing framework tailored for newly introduced and prospective DL libraries. FUTURE leverages historical bug information from existing libraries and fine-tunes LLMs for specialized code generation. This strategy helps identify vulnerabilities in new libraries and uses insights from these libraries to enhance security in existing ones, creating a cycle from history to future and back. To evaluate FUTURE's effectiveness, we conduct comprehensive evaluations on three newly introduced DL libraries. Results demonstrate that FUTURE significantly outperforms existing fuzzers in bug detection, success rate of bug reproduction, validity rate of code generation, and API coverage. Notably, FUTURE has detected 148 bugs across 452 targeted APIs, including 142 previously unknown bugs. Among these, 10 have been assigned CVE IDs. Additionally, FUTURE detects 7 bugs in PyTorch, demonstrating its ability to enhance security in existing libraries in reverse.
Tags: "Testing and Quality", "AI for SE", "Security"Rudrajit Choudhuri, Bianca Trinkenreich, Rahul Pandita, Eirini Kalliamvakou, Igor Steinmacher, Marco Gerosa, Christopher Sanchez, Anita Sarma, "What Guides Our Choices? Modeling Developers' Trust and Behavioral Intentions Towards GenAI"
Abstract: Generative AI (genAI) tools, such as ChatGPT or Copilot, are advertised to improve developer productivity and are being integrated into software development. However, misaligned trust, skepticism, and usability concerns can impede the adoption of such tools. Research also indicates that AI can be exclusionary, failing to support diverse users adequately. One such aspect of diversity is cognitive diversity—variations in users’ cognitive styles—that leads to divergence in perspectives and interaction styles. When an individual’s cognitive style is unsupported, it creates barriers to technology adoption. Therefore, to understand how to effectively integrate genAI tools into software development, it is first important to model what factors affect developers’ trust and intentions to adopt genAI tools in practice? We developed a theoretical model to (1) identify factors that influence developers’ trust in genAI tools and (2) examine the relationship between developers’ trust, cognitive styles, and their intentions to use these tools. We surveyed software developers (N=238) at two major global tech organizations and employed Partial Least Squares-Structural Equation Modeling (PLS-SEM) to evaluate our model. Our findings reveal that genAI’s system/output quality, functional value, and goal maintenance significantly influence developers’ trust in these tools. Furthermore, developers’ trust and cognitive styles influence their intentions to use these tools. We offer practical suggestions for designing genAI tools for effective use and inclusive user experience.
Tags: "Human/Social", "AI for SE"Rudrajit Choudhuri, Bianca Trinkenreich, Rahul Pandita, Eirini Kalliamvakou, Igor Steinmacher, Marco Gerosa, Christopher Sanchez, Anita Sarma, "What Guides Our Choices? Modeling Developers' Trust and Behavioral Intentions Towards GenAI"
Abstract: Generative AI (genAI) tools, such as ChatGPT or Copilot, are advertised to improve developer productivity and are being integrated into software development. However, misaligned trust, skepticism, and usability concerns can impede the adoption of such tools. Research also indicates that AI can be exclusionary, failing to support diverse users adequately. One such aspect of diversity is cognitive diversity—variations in users’ cognitive styles—that leads to divergence in perspectives and interaction styles. When an individual’s cognitive style is unsupported, it creates barriers to technology adoption. Therefore, to understand how to effectively integrate genAI tools into software development, it is first important to model what factors affect developers’ trust and intentions to adopt genAI tools in practice? We developed a theoretical model to (1) identify factors that influence developers’ trust in genAI tools and (2) examine the relationship between developers’ trust, cognitive styles, and their intentions to use these tools. We surveyed software developers (N=238) at two major global tech organizations and employed Partial Least Squares-Structural Equation Modeling (PLS-SEM) to evaluate our model. Our findings reveal that genAI’s system/output quality, functional value, and goal maintenance significantly influence developers’ trust in these tools. Furthermore, developers’ trust and cognitive styles influence their intentions to use these tools. We offer practical suggestions for designing genAI tools for effective use and inclusive user experience.
Tags: "Human/Social", "AI for SE"Hao Zhong, "Understanding Compiler Bugs in Real Development"
Abstract: Compilers are critical in development, but compiler bugs can cause hidden and serious bugs in their compiled code. To deepen the understanding of compiler bugs, in the prior empirical studies, researchers read the bug reports and patches of compilers, and analyze their causes, locations, and patterns. Although they derive many interesting findings, their studies are limited. First, as bug reports seldom explain which projects encounter compiler bugs, it is infeasible to understand the outreaching impact. Second, before compiler bugs are fixed, programmers can bypass such bugs. The bug reports of compilers do not introduce such workarounds. Finally, the distribution of compiler bugs can be distorted, since researchers and compiler developers also file bug reports. In this paper, we propose a novel angle to analyze compiler bugs. Instead of compiler bug reports, we collect compiler bugs that are mentioned in real development. When programmers encounter compiler bugs in real development, they can leave traces in their commit messages. By searching such messages, we collected 644 unique commits whose messages explicitly mention the urls of compiler bugs. From this angle, in this paper, we conduct the first empirical study to analyze compiler bugs in the wild. We summarize our results into seven useful findings for users, compiler developers, and researchers. For example, for researchers, we find that some large workarounds of compiler bugs involve repetitive and systematic changes, which indicates a new research opportunity for code migration tools. Furthermore, we attempt to apply our findings in real development, and we obtain positive feedback.
Tags: "Prog Comprehension/Reeng/Maint", "MSR"Rosalia Tufano, Alberto Martin-Lopez, Ahmad Tayeb, Ozren Dabic, Sonia Haiduc, Gabriele Bavota, "Deep Learning-based Code Reviews: A Paradigm Shift or a Double-Edged Sword?"
Abstract: Several techniques have been proposed to (partially) automate code review. Early support consisted in recommending the most suited reviewer for a given change or in prioritizing the review tasks. With the advent of deep learning in software engineering, the level of automation has been pushed to new heights, with approaches able to provide feedback on source code in natural language as a human reviewer would do. Also, recent work documented open source projects adopting Large Language Models (LLMs) as co-reviewers. Although the research in this field is very active, little is known about the actual impact of including automatically generated code reviews in the code review process. While there are many aspects worth investigating (e.g., is knowledge transfer between developers affected?), in this work we focus on three of them: (i) review quality, i.e., the reviewer's ability to identify issues in the code; (ii) review cost, i.e., the time spent reviewing the code; and (iii) reviewer’s confidence, i.e., how confident is the reviewer about the provided feedback. We run a controlled experiment with 29 professional developers who reviewed different programs with/without the support of an automatically generated code review. During the experiment we monitored the reviewers’ activities, for over 50 hours of recorded code reviews. We show that reviewers consider valid most of the issues automatically identified by the LLM and that the availability of an automated review as a starting point strongly influences their behavior: Reviewers tend to focus on the code locations indicated by the LLM rather than searching for additional issues in other parts of the code. The reviewers who started from an automated review identified a higher number of low-severity issues while, however, not identifying more high- severity issues as compared to a completely manual process. Finally, the automated support did not result in saved time and did not increase the reviewers’ confidence.
Tags: "AI for SE", "Human/Social"Weiwei Xu, Kai Gao, Hao He, Minghui Zhou, "LiCoEval: Evaluating LLMs on License Compliance in Code Generation"
Abstract: Recent advances in Large Language Models (LLMs) have revolutionized code generation, leading to widespread adoption of AI coding tools by developers. However, LLMs can generate license-protected code without providing the necessary license information, leading to potential intellectual property violations during software production. This paper addresses the critical, yet underexplored, issue of license compliance in LLM-generated code by establishing a benchmark to evaluate the ability of LLMs to provide accurate license information for their generated code. To establish this benchmark, we conduct an empirical study to identify a reasonable standard for "striking similarity" that excludes the possibility of independent creation, indicating a copy relationship between the LLM output and certain open-source code. Based on this standard, we propose an evaluation benchmark LiCoEval, to evaluate the license compliance capabilities of LLMs. Using LiCoEval, we evaluate 14 popular LLMs, finding that even top-performing LLMs produce a non-negligible proportion (0.88% to 2.01%) of code strikingly similar to existing open-source implementations. Notably, most LLMs fail to provide accurate license information, particularly for code under copyleft licenses. These findings underscore the urgent need to enhance LLM compliance capabilities in code generation tasks. Our study provides a foundation for future research and development to improve license compliance in AI-assisted software development, contributing to both the protection of open-source software copyrights and the mitigation of legal risks for LLM users.
Tags: "Human/Social", "AI for SE", "Open Source"Dehai Zhao, Zhenchang Xing, Qinghua Lu, Sherry Xiwei Xu, Liming Zhu, "SeeAction: Towards Reverse Engineering How-What-Where of HCI Actions from Screencasts for UI Automation"
Abstract: UI automation is a useful technique for UI testing, bug reproduction and robotic process automation. Recording the user actions with an application assists rapid development of UI automation scripts, but existing recording techniques are intrusive, rely on OS or GUI framework accessibility support or assume specific app implementations. Reversing-engineering user actions from screencasts is non-intrusive, but a key reverse-engineering step is currently missing - recognize human-understandable structured user actions ([command] [widget][location]) from action screencasts. To fill the gap, we propose a deep learning-based computer vision model which can recognize 11 commands and 11 widgets, and generate location phrases from action screencasts, through joint learning and multi-task learning. We label a large dataset with 7260 video-action pairs, which record the user interactions with Word, Zoom, Firefox, Photoshop, and Window 10 Settings. Through extensive experiments, we confirm the effectiveness and generality of our model, and demonstrate the usefulness of a screencast-to-action-script tool built upon our model for bug reproduction.
Tags: "AI for SE", "Testing and Quality"Zeyang Ma, Dong Jae Kim, Tse-Hsun (Peter) Chen, "LibreLog: Accurate and Efficient Unsupervised Log Parsing Using Open-Source Large Language Models"
Abstract: Log parsing is a critical step that transforms unstructured log data into structured formats, facilitating subsequent log-based analysis. Traditional syntax-based log parsers are efficient and effective, but they often experience decreased accuracy when processing logs that deviate from the predefined rules. Recently, large language models (LLM) based log parsers have shown superior parsing accuracy. However, existing LLM-based parsers face three main challenges: 1) time-consuming and labor-intensive manual labeling for fine-tuning or in-context learning, 2) increased parsing costs due to the vast volume of log data and limited context size of LLMs, and 3) privacy risks from using commercial models like ChatGPT with sensitive log information. To overcome these limitations, this paper introduces LibreLog, an unsupervised log parsing approach that leverages open-source LLMs (i.e., Llama3-8B) to enhance privacy and reduce operational costs while achieving state-of-the-art parsing accuracy. LibreLog first groups logs with similar static text but varying dynamic variables using a fixed-depth grouping tree. It then parses logs within these groups using three components: i) similarity scoring-based retrieval augmented generation: selects diverse logs within each group based on Jaccard similarity, helping the LLM distinguish between static text and dynamic variables; ii) self-reflection: iteratively query LLMs to refine log templates to improve parsing accuracy; and iii) log template memory: stores parsed templates to reduce LLM queries for improved parsing efficiency. Our evaluation on LogHub-2.0 shows that LibreLog achieves 25% higher parsing accuracy and processes logs 2.7 times faster compared to state-of-the-art LLM-based parsers. In short, LibreLog addresses privacy and cost concerns of using commercial LLMs while achieving state-of- the-arts parsing efficiency and accuracy.
Tags: "AI for SE", "Prog Comprehension/Reeng/Maint"Yi Sun, Zhuo Zhang, Xiangyu Zhang, "FairChecker: Detecting Fund-stealing Bugs in DeFi Protocols via Fairness Validation"
Abstract: Decentralized Finance (DeFi) is an emerging paradigm within the blockchain space that aims to revolutionize conventional financial systems through the application of blockchain technology. The substantial value of digital assets managed by DeFi protocols makes it a lucrative target for attacks. Despite the human resources and the application of automated tools, frequent attacks still cause significant fund losses to DeFi participants. Existing tools primarily rely on oracles similar to those used in traditional software analysis, making it challenging for them to detect functional bugs specific to the DeFi domain. Since blockchain functions as a distributed ledger system, the foundation of any DeFi protocol is the accurate maintenance of key state variables representing user funds. If these variables are not properly updated or designed to reflect the intended flow of funds, attackers can exploit these flaws to steal assets. From the study of popular DeFi protocols, we observe that, in DeFi systems, to ensure a transaction does not misappropriate someone's fund, the direction of changes (increase or decrease) of values associated with the amount of asset or debt of a user has to adhere to some fairness properties. We propose a concept called fairness bug which allows attackers to gain profit without cost. We propose an inter-procedural and inter-contract static analysis technique that utilizes symbolic execution and an SMT solver to automatically detect fairness bugs in DeFi smart contracts. We have implemented our fairness-checking approach in our tool, named FairChecker. We evaluate our tool on a benchmark of 113 real-world DeFi protocols with 34 fairness bugs. The results show that our tool can detect 32 bugs with a recall of 94.1\% and a precision of 46.4\%, demonstrating its effectiveness.
Tags: "Analysis", "Security", "Blockchain"Li Huang, Weifeng Sun, Meng Yan, "Iterative Generation of Adversarial Example for Deep Code Models"
Abstract: Deep code models are vulnerable to adversarial attacks, making it possible for semantically identical inputs to trigger different responses. Current black-box attack methods typically prioritize the impact of identifiers on the model based on custom importance scores or program context and incrementally replace identifiers to generate adversarial examples. However, these methods often fail to fully leverage feedback from failed attacks to guide subsequent attacks, resulting in problems such as local optima bias and efficiency dilemmas. In this paper, we introduce ITGen, a novel black-box adversarial example generation method that iteratively utilizes feedback from failed attacks to refine the generation process. It employs a bitvector-based representation of code variants to mitigate local optima bias. By integrating these bit vectors with feedback from failed attacks, ITGen uses an enhanced Bayesian optimization framework to efficiently predict the most promising code variants, significantly reducing the search space and thus addressing the efficiency dilemma. We conducted experiments on a total of nine deep code models for both understanding and generation tasks, demonstrating ITGen's effectiveness and efficiency, as well as its ability to enhance model robustness through adversarial fine-tuning. For example, on average, ITGen improves the attack success rate by 47.98% and 69.70% over the state-of-the-art techniques (i.e., ALERT and BeamAttack), respectively.
Tags: "SE for AI", "Testing and Quality"Chenxi Zhang, Yufei Liang, Tian Tan, Chang Xu, Shuangxiang Kan, Yulei Sui, Yue Li, "Interactive Cross-Language Pointer Analysis for Resolving Native Code in Java Programs"
Abstract: Java offers the Java Native Interface (JNI), which allows programs running in the Java Virtual Machine to invoke and be manipulated by native applications and libraries written in other languages, typically C. While JNI mechanism significantly enhances the Java platform's capabilities, it also presents challenges for static analysis of Java programs due to the complex behaviors introduced by native code. Therefore, effectively resolving the interactions between Java and native code is crucial for static analysis. In this paper, we introduce JNIFER, the first interactive cross-language pointer analysis for resolving native code in Java programs. JNIFER integrates both Java and C pointer analyses, equipped with advanced native call and JNI function analyses, enabling the simultaneous analysis of both Java and native code. During the analysis of cross-language interactions, the two analyzers interact with each other, constructing cross-language points-to relations and call graphs, thereby approximating the runtime behavior at the interaction sites. Our evaluation shows that JNIFER outperforms state-of-the-art approaches in terms of soundness while maintaining high precision and comparable efficiency, as evidenced by extensive experiments on OpenJDK and real-world Java applications.
Tags: "Analysis"Chiming Duan, Yong Yang, Tong Jia, Guiyang Liu, Jinbu Liu, Huxing Zhang, Qi Zhou, Ying Li, Gang Huang, "FAMOS: Fault diagnosis for Microservice Systems through Effective Multi-modal Data Fusion"
Abstract: Accurately diagnosing the fault that causes the failure is crucial for maintaining the reliability of a microservice system after a failure occurs. Mainstream fault diagnosis approaches are data-driven and mainly rely on three modalities of runtime data: traces, logs, and metrics. Diagnosing faults with multiple modalities of data in microservice systems has been an clear trend in recent years because different types of faults and corresponding failures tend to manifest in data of various modalities. Accurately diagnosing faults by fully leveraging multiple modalities of data is confronted with two challenges: 1)how to minimize information loss when extracting features for data of each modality; 2)how to correctly capture andutilize the relationships among data of different modalities. To address these challenges, we propose FAMOS, a Fault diagnosis Approach for MicrOservice Systems through effective multi-modal data fusion. On the one hand, FAMOS employs independent feature extractors to preserve the intrinc features for each modality. On the other hand, FAMOS introduces a new Gaussian-attention mechanism to accurately correlate data of different modalities and then captures the inter-modality relationship with a cross-attention mechanism. We evaluated FAMOS on two datasets constructed by injecting comprehensive and abundant faults into an open-source microservice system and a real-world industrial microservice system. Experimental results demonstrate the FAMOS’s effectiveness in fault diagnosis, achieving significant improvements in F1 scores compared to state-of-the-art (SOTA) methods, with an increase of 20.33%.
Tags: "AI for SE", "Security"Van-Hoang Le, yi xiao, Hongyu Zhang, "Unleashing the True Potential of Semantic-based Log Parsing with Pre-trained Language Models"
Abstract: Software-intensive systems often produce console logs for troubleshooting purpose. Log parsing, which aims at parsing a log message into a specific log template, typically serves as the first step toward automated log analytics. To better comprehend semantic information of log messages, many semantic-based log parsers have been proposed. These log parsers fine-tune a small pretrained language model (PLM) such as RoBERTa on a few labelled log samples. With the increasing popularity of large language models (LLMs), some recent studies also propose to leverage LLMs such as ChatGPT through in-context learning for automated log parsing, and obtain better results than previous semantic-based log parsers with small PLMs. In this paper, we show that semantic-based log parsers with small PLMs can actually achieve better or comparable performance to state-of-the-art LLM-based log parsing models while being more efficient and cost-effective. We propose UNLEASH, a novel semantic-based log parsing approach, which incorporates three enhancement methods to boost the performance of PLMs for log parsing, including (1) an entropy-based ranking method to select the most informative log samples; (2) a contrastive learning method to enhance the fine-tuning process; and (3) an inference optimization method to improve the log parsing performance. We evaluate UNLEASH on a set of large log datasets and the experimental results show that UNLEASH is effective and efficient, when compared to state-of-the-art log parsers.
Tags: "AI for SE", "Prog Comprehension/Reeng/Maint"Yiteng Peng, Daoyuan Wu, Zhibo Liu, Dongwei Xiao, Zhenlan Ji, Juergen Rahmel, Shuai Wang, "Testing and Understanding Deviation Behaviors in FHE-hardened Machine Learning Models"
Abstract: Fully homomorphic encryption (FHE) is a promising cryptographic primitive that enables secure computation over encrypted data. A primary use of FHE is to support privacy-preserving machine learning (ML) on public cloud infrastructures. Despite the rapid development of FHE-based ML (or HE-ML) in recent years, the community still lacks a systematic understanding of their robustness. In this paper, we aim to systematically test and understand the deviation behaviors of HE-ML models, where the same input causes deviant outputs between FHE-hardened models and their plaintext versions, leading to completely incorrect model predictions. To effectively uncover deviation-triggering inputs under the constraints of expensive FHE computation, we design a novel differential testing tool called HEDiff, which leverages the margin metric on the plaintext model as guidance to drive targeted testing on FHE models. For the identified deviation inputs, we further analyze them to determine whether they exhibit general noise patterns that are transferable. We evaluate HEDiff using three popular HE-ML frameworks, covering 12 different combinations of models and datasets. HEDiff successfully detected hundreds of deviation inputs across almost every tested FHE framework and model. We also quantitatively show that the identified deviation inputs are (visually) meaningful in comparison to regular inputs. Further schematic analysis reveals the root cause of these deviant inputs and allows us to generalize their noise patterns for more directed testing.
Tags: "SE for AI", "Testing and Quality"Chuan Luo, Shuangyu Lyu, Wei Wu, Hongyu Zhang, Dianhui Chu, Chunming Hu, "Towards High-strength Combinatorial Interaction Testing for Highly Configurable Software Systems"
Abstract: Highly configurable software systems are crucial in practice to satisfy the rising demand for software customization, and combinatorial interaction testing (CIT) is an important methodology for testing such systems. Constrained covering array generation (CCAG), as the core problem in CIT, is to construct a t-wise covering array (CA) of minimum size, where t represents the testing strength. Extensive studies have demonstrated that high-strength CIT (e.g., 4-wise and 5-wise CIT) has stronger fault detection capability than low-strength CIT (i.e., 2-wise and 3-wise CIT), and there exist certain critical faults that can be disclosed through high-strength CIT. Although existing CCAG algorithm has exhibited effectiveness in solving the low-strength CCAG problem, they suffer the severe high-strength challenge when solving 4-wise and 5-wise CCAG, which urgently calls for effective solutions to solving 4-wise and 5-wise CCAG problems. To alleviate the high-strength challenge, we propose a novel and effective local search algorithm dubbed HSCA. Particularly, HSCA incorporates three new and powerful techniques, i.e., multi-round CA generation mechanism, dynamic priority assigning technique, and variable grouping strategy, to improve its performance. Extensive experiments on 35 real-world and synthetic instances demonstrate that HSCA can generate significantly smaller 4-wise and 5-wise CAs than existing state-of-the-art CCAG algorithms. More encouragingly, out of all 35 instances, HSCA successfully constructs 4-wise and 5-wise CAs for 11 and 15 instances, respectively, where existing CCAG algorithms fail. Our results indicate that HSCA can effectively mitigate the high-strength challenge.
Tags: "Testing and Quality"Mengying Wu, Geng Hong, Wuyuao Mai, Xinyi Wu, Lei Zhang, Yingyuan Pu, Huajun Chai, Lingyun Ying, Haixin Duan, Min Yang, "Exposing the Hidden Layer: Software Repositories in the Service of SEO Manipulation"
Abstract: Distinct from traditional malicious packages, this paper uncovers a novel attack vector named “blackhat Search Engine Optimization through REPositories (RepSEO)”. In this approach, attackers carefully craft packages to manipulate search engine results, exploiting the credibility of software repositories to promote illicit websites. Our research presents a systematic analysis of the underground ecosystem of RepSEO, identifying key players such as account providers, advertisers, and publishers. We developed an effective detection tool, applied to a ten-year large-scale dataset of npm, Docker Hub, and NuGet software repositories. This investigation led to the startling discovery of 3,801,682 abusive packages, highlighting the widespread nature of this attack. Our study also delves into the supply chain tactics of these attacks, revealing strategies like the use of self-hosted email services for account registration, redirection methods to obscure landing pages, and rapid deployment techniques by aggressive attackers. Additionally, we explore the profit motives behind these attacks, identifying two primary types of advertisers: survey-based advertisers and malware distribution advertisers. We reported npm, NuGet, and Docker Hub about the RepSEO packages and the related supply chain vulnerabilities of Google, and received their acknowledgments. Software repositories have started removing the abusive packages as of this paper’s submission. We also open-source our code and data to facilitate future research.
Tags: "MSR", "Security"Hanmo You, Zan Wang, Bin Lin, Junjie Chen, "Navigating the Testing of Evolving Deep Learning Systems: An Exploratory Interview Study"
Abstract: Deep Learning (DL) systems have been widely adopted across various industrial domains such as autonomous driving and intelligent healthcare. As with traditional software, DL systems also need to constantly evolve to meet ever-changing user requirements. However, ensuring the quality of these continuously evolving systems presents significant challenges, especially in the context of testing. Understanding how industry developers address these challenges and what extra obstacles they are facing could provide valuable insights for further safeguarding the quality of DL systems. To reach this goal, we conducted semi-structured interviews with 22 DL developers from diverse domains and backgrounds. More specifically, our study focuses on exploring the challenges developers encounter in testing evolving DL systems, the practical solutions they employ, and their expectations for extra support. Our results highlight the difficulties in testing evolving DL systems and identify the best practices for DL developers to address them. Additionally, we pinpoint potential future research directions to enhance testing effectiveness in evolving DL systems.
Tags: "SE for AI", "Human/Social"Chenxing Zhong, Daniel Feitosa, Paris Avgeriou, Huang Huang, Yue Li, He Zhang, "PairSmell: A Novel Perspective Inspecting Software Modular Structure"
Abstract: Enhancing the modular structure of existing systems has attracted substantial research interest, focusing on two main methods: (1) software modularization and (2) identifying design issues (e.g., smells) as refactoring opportunities. However, re-modularization solutions often require extensive modifications to the original modules, and the design issues identified are generally too coarse to guide refactoring strategies. Combining the above two methods, this paper introduces a novel concept, \emph{PairSmell}, which exploits modularization to pinpoint design issues necessitating refactoring. We concentrate on a granular but fundamental aspect of modularity principles---\emph{modular relation (MR)}, i.e., \emph{whether a pair of entities are separated or collocated}. The main assumption is that, if the actual MR of a pair violates its `apt MR', i.e., an MR agreed on by multiple modularization tools (as raters), it can be deemed likely a flawed architectural decision that necessitates further examination. To quantify and evaluate \emph{PairSmell}, we conduct an empirical study on 20 C/C++ and Java projects, using 4 established modularization tools to identify two forms of \emph{PairSmell}: inapt separated pairs $\mathit{InSep}$ and inapt collocated pairs $\mathit{InCol}$. Our study on 260,003 instances reveals that their architectural impacts are substantial: (1) on average, 14.60\% and 20.44\% of software entities are involved in $\mathit{InSep}$ and $\mathit{InCol}$ MRs respectively; (2) $\mathit{InSep}$ pairs are associated with 190\% more co-changes than properly separated pairs, while $\mathit{InCol}$ pairs are associated with 35\% fewer co-changes than properly collocated pairs, both indicating a successful identification of modular structures detrimental to software quality; and (3) both forms of \emph{PairSmell} persist across software evolution. This evidence strongly suggests that \emph{PairSmell} can provide meaningful insights for inspecting modular structure, with the identified issues being both granular and fundamental, making the enhancement of modular design more efficient.
Tags: "Design/Architecture", "Prog Comprehension/Reeng/Maint"Fraol Batole, David OBrien, Tien Nguyen, Robert Dyer, Hridesh Rajan, "An LLM-Based Agent-Oriented Approach for Automated Code Design Issue Localization"
Abstract: Maintaining software design quality is crucial for the long-term maintainability and evolution of systems. However, design issues such as poor modularity and excessive complexity often emerge as codebases grow. Developers rely on external tools, such as program analysis techniques, to identify such issues. This work investigates an automated approach for analyzing and localizing design issues using Large Language Models (LLMs). Large language models have demonstrated significant performance on coding tasks, but directly leveraging them for design issue localization is challenging. Large codebases exceed typical LLM context windows, and program analysis tool outputs in non-textual modalities (e.g., graphs or interactive visualizations) are incompatible with LLMs’ natural language inputs. To address these challenges, we propose LOCALIZEAGENT, a novel multi-agent framework for effective design issue localization. LOCALIZEAGENT integrates specialized agents that (1) analyze code to identify potential code design issues, (2) transform program analysis outputs into abstraction-aware LLM-friendly natural language summaries, (3) generate context-aware prompts tailored to specific refactoring types, and (4) leverage LLMs to locate and rank the localized issues based on their relevance. Our evaluation using diverse real-world codebases demonstrates significant improvements over baseline approaches, with LOCALIZEAGENT achieving 138%, 166%, and 206% relative improvements in exact match accuracy for localizing information hiding, complexity, and modularity issues, respectively.
Tags: "AI for SE", "Design/Architecture"Qingkai Shi, Xiaoheng Xie, Xianjin Fu, Peng Di, Huawei Li, Ang Zhou, Gang Fan, "Datalog-Based Language-Agnostic Change Impact Analysis for Microservices"
Abstract: The shift-left principle in the industry requires us to test a software application as early as possible. Particularly, when code changes in a microservice application are committed to the code repository, we have to efficiently identify all public microservice interfaces impacted by the changes, such that the impacted interfaces can be tested as soon as possible. However, developing an efficient change impact analysis is extremely challenging in microservices because of the multilingual problem: microservice applications are often implemented using varying programming languages and involve diverse frameworks and configuration files. To address this issue, this paper presents Microscope, a language-agnostic change impact analysis that uniformly represents the code, configuration files, frameworks, and code changes by relational Datalog rules. Microscope then benefits from an efficient Datalog solver to identify impacted interfaces. Experiments based on the use of Microscope in a leading software company demonstrate that Microscope is both effective and fast as it successfully identifies interfaces impacted by 112 code commits, with moderate time overhead, and could reduce 97% of interfaces to test and save 73% of testing time after code changes.
Tags: "Analysis", "Prog Comprehension/Reeng/Maint"Robert Thompson, Nuno Saavedra, Pedro Carrott, Kevin Fisher, Alex Sanchez-Stern, Yuriy Brun, João F. Ferreira, Sorin Lerner, Emily First, "Rango: Adaptive Retrieval-Augmented Proving for Automated Software Verification"
Abstract: Formal verification using proof assistants, such as Coq, allows for high-quality software. However, the verification process is expensive, requiring significant expertise and manual effort to write proofs. Recent work has explored automating proof synthesis using machine learning, and even more recently, large language models (LLMs), showing that retrieving relevant premises (such as lemmas and definitions) is helpful for these models. We present Rango, a fully automated proof synthesis tool for Coq that uses, not only relevant premises but also similar proofs from the current project. Rango uses retrieval augmentation at every step of the proof to automatically determine which proofs and premises to include in the context of its fine-tuned LLM. In this way, Rango adapts to the project _and_ to the evolving state of the proof. We create a new dataset, CoqStoq, of 2,205 open-source Coq projects from GitHub, which includes both training data and a curated evaluation benchmark of well-maintained projects. On this benchmark, Rango synthesizes 27.7% of the proofs, which is 10% more proofs than prior state-of-the-art tool Tactician. Our evaluation also shows that adding relevant proofs to the context in Rango leads to a 45% increase in the number of theorems proven.
Tags: "AI for SE", "Analysis"Miao Miao, Austin Mordahl, Dakota Soles, Alice Beideck, Shiyi Wei, "An Extensive Empirical Study of Nondeterministic Behavior in Static Analysis Tools"
Abstract: Recent research has studied the importance and identified causes of nondeterminism in software. Static analysis tools exhibit many risk factors for nondeterministic behavior, but no work has analyzed the occurrence of such behavior in these tools. To bridge this gap, we perform an extensive empirical study aiming to understand past and ongoing nondeterminism in 12 popular, open-source static analysis tools that target 5 types of projects. We first conduct a qualitative study to understand the extent to which nondeterministic behavior has been found and addressed within the tools under study, and find results in 7 tool repositories. After classifying the issues and commits by root cause, we find that the majority of nondeterminisms are caused by concurrency issues, incorrect analysis logic, or assumed orderings of unordered data structures, which have shared patterns. We also perform a quantitative analysis, where we use two strategies and diverse input programs and configurations to detect yet-unknown nondeterministic behaviors. We discover such behavior in 8 out of the 12 tools, including 3 which had no results from the qualitative analysis. We find that nondeterminism often appears in multiple configurations on a variety of input programs. We communicated all identified nondeterminism to the developers, and received confirmation of five tools. Finally, we detail a case study of fixing FlowDroid's nondeterministic behavior.
Tags: "Analysis"Xintong Zhou, Zhenyang Xu, Mengxiao Zhang, Yongqiang Tian, Chengnian Sun, "WDD: Weighted Delta Debugging"
Abstract: Delta Debugging is a widely used family of algorithms (e.g., ddmin and ProbDD) to automatically minimize bug-triggering test inputs, thus to facilitate debugging. It takes a list of elements with each element representing a fragment of the test input, systematically partitions the list at different granularities, identifies and deletes bug-irrelevant partitions. Prior delta debugging algorithms assume there are no differences among the elements in the list, and thus treat them uniformly during partitioning. However, in practice, this assumption usually does not hold, because the size (referred to as weight) of the fragment represented by each element can vary significantly. For example, a single element representing 50% of the test input is much more likely to be bug-relevant than elements representing only 1%. This assumption inevitably impairs the efficiency or even effectiveness of these delta debugging algorithms. This paper proposes Weighted Delta Debugging (WDD), a novel concept to help prior delta debugging algorithms overcome the limitation mentioned above. The key insight of WDD is to assign each element in the list a weight according to its size, and distinguish different elements based on their weights during partitioning. We designed two new minimization algorithms, Wddmin and WProbDD, by applying WDD to ddmin and ProbDD respectively. We extensively evaluated Wddmin and WProbDD in two representative applications, HDD and Perses, on 62 benchmarks across two languages. On average, with Wddmin, HDD and Perses took 51.31% and 7.47% less time to generate 9.12% and 0.96% smaller results than with ddmin, respectively. With WProbDD, HDD and Perses used 11.98% and 9.72% less time to generate 13.40% and 2.20% smaller results than with ProbDD, respectively. The results strongly demonstrate the value of WDD. We firmly believe that WDD opens up a new dimension to improve test input minimization techniques.
Tags: "Testing and Quality"Faridah Akinotcho, Lili Wei, Julia Rubin, "Mobile Application Coverage: The 30% Curse and Ways Forward"
Abstract: Testing, security analysis, and other dynamic quality assurance approaches rely on mechanisms that invoke software under test, aiming to achieve high code coverage. A large number of invocation mechanisms proposed in the literature, in particular for Android mobile applications, employ GUI-driven application exploration. However, studies show that even the most advanced GUI exploration techniques can cover only around 30% of a real- world application. This paper aims to investigate “the remaining 70%”. By conducting a large-scale experiment involving two human experts, who thoroughly explored 61 benchmark and 42 popular apps from Google Play, we show that achieving a substantially larger coverage for real-world applications is impractical even if we factor out known GUI-based exploration issues, such as the inability to provide semantic inputs and the right order of events. The main reasons preventing humans from covering the entire application include application dependencies on device configurations and external resources. Thus, future investment into GUI-based exploration strategies is unlikely to lead to substantial improvements in coverage. To chart possible ways forward and explore approaches to satisfy/bypass these dependencies, we thoroughly analyze code-level properties guarding them. Our analysis shows that a large fraction of the dependencies could actually be successfully bypassed with relatively simple beyond- GUI exploration techniques. We hope our study can inspire future work in this area and also provide a realistic benchmark for evaluating this work.
Tags: "Testing and Quality", "Mobile SW"Hao Song, Teng Li, Jiachi Chen, Ting Chen, Beibei Li, Zhangyan Lin, Yi Lu, Pan Li, Xihan Zhou, "Enhancing The Open Network: Definition and Automated Detection of Smart Contract Defects"
Abstract: The Open Network (TON), designed to support Telegram's extensive user base of hundreds of millions, has garnered considerable attention since its launch in 2022. \textit{FunC} is the most popular programming language for writing smart contracts on TON. It is distinguished by a unique syntax compared to other smart contract languages. Despite growing interest, research on the practical defects of TON smart contracts is still in its early stages. In this paper, we summarize eight smart contract defects identified from TON's official blogs and audit reports, each with detailed definitions and code examples. Furthermore, we propose a static analysis framework called TONScanner to facilitate the detection of these defects. Specifically, TONScanner reuses \textit{FunC} compiler's frontend code to transform the \textit{FunC} contract code into \textit{FunC} intermediate representation (IR) in the form of a directed acyclic graph (DAG). Based on this IR, TONScanner constructs a control flow graph (CFG), then transforms it into a static single assignment (SSA) form to simplify further analysis. TONScanner also integrates Data Dependency, Call Graph, Taint Analysis, and Cell Construct, which are specifically tailored for TON blockchain's unique data structures. These components finally facilitate the identification of the eight defects. We evaluate the effectiveness of TONScanner by applying it to 1,640 smart contracts and find a total of 14,995 defects. Through random sampling and manual labeling, we find that TONScanner achieves an overall precision of 97.49%. The results reveal that current TON contracts contain numerous defects, indicating that developers are prone to making errors. TONScanner has proven its ability to accurately identify these defects, thereby aiding in their correction.
Tags: "Analysis", "Security", "Blockchain"Kevin Guan, Owolabi Legunsen, "Instrumentation-Driven Evolution-Aware Runtime Verification"
Abstract: Runtime verification (RV) found hundreds of bugs by monitoring passing tests against formal specifications (specs). RV first instruments a program to obtain relevant events, e.g., method calls, to monitor. A hindrance to RV adoption, especially in continuous integration, is its high overhead. So, prior work proposed spec-driven evolution-aware techniques to speed up RV. They use complex analysis to re-monitor a subset of specs related to code impacted by changes. But, these techniques assume that RV overhead is dominated by monitoring time, and their designs often sacrifice safety (ability to find all new violations) for speed. We present iMOP, the first instrumentation-driven evolution-aware RV framework. iMOP leverages a recent observation that RV overhead during testing is often dominated by instrumentation, not monitoring. iMOP embodies a family of 14 techniques that aim to safely speed up RV by simply re-instrumenting only changed code. Instrumentation from the old revision is re-used for unchanged code, and all specs are re-monitored in the new revision. We implement iMOP as a Maven plugin and evaluate it on 1,627 revisions of 48 projects, using 160 specs of correct JDK API usage. iMOP is safe by design. It is up to 29.6x faster than re-running RV from scratch after each change, and 17.8x and 6.7x faster than safe and unsafe spec-driven techniques, respectively. iMOP is faster than just applying regression test selection to RV.
Tags: "Analysis", "Prog Comprehension/Reeng/Maint"Xiaopeng Li, Shangwen Wang, Shasha Li, Jun Ma, Jie Yu, Xiaodong Liu, Jing Wang, Bin Ji, Weimin Zhang, "Model Editing for LLMs4Code: How Far are We?"
Abstract: Large Language Models for Code (LLMs4Code) have been found to exhibit outstanding performance in the software engineering domain, especially the remarkable performance in coding tasks. However, even the most advanced LLMs4Code can inevitably contain incorrect or outdated code knowledge. Due to the high cost of training LLMs4Code, it is impractical to re-train the models for fixing these problematic code knowledge. Model editing is a new technical field for effectively and efficiently correcting erroneous knowledge in LLMs, where various model editing techniques and benchmarks have been proposed recently. Despite that, a comprehensive study that thoroughly compares and analyzes the effectiveness of all state-of-the-art model editing techniques for adapting the knowledge within LLMs4Code models across various code-related tasks is notably absent. To bridge this gap, we perform the first systematic study on applying state-of-the-art model editing approaches to repair the inaccuracy of LLMs4Code. To that end, we introduce a benchmark named CLMEEval, which consists of two datasets, i.e., CoNaLa-Edit (CNLE) with 21K+ code generation samples and CodeSearchNet-Edit (CSNE) with 16K+ code summarization samples. With the help of CLMEEval, we evaluate six advanced model editing techniques on three LLMs4Code: CodeLlama (7B), CodeQwen1.5 (7B), and Stable-Code (3B). Our findings include that the external memorization-based GRACE approach achieves the best knowledge editing effectiveness and specificity (the editing does not influence untargeted knowledge), while generalization (whether the editing can generalize to other semantically-identical inputs) is a universal challenge for existing techniques. Furthermore, building on in-depth case analysis, we introduce an enhanced version of GRACE called A-GRACE, which incorporates contrastive learning to better capture the semantics of the inputs. Results demonstrate that A-GRACE notably enhances generalization while maintaining similar levels of effectiveness and specificity compared to the vanilla GRACE.
Tags: "AI for SE", "Prog Comprehension/Reeng/Maint"TianChen Yu, Li Yuan, Liannan Lin, Hongkui He, "A Multiple Representation Transformer with Optimized Abstract Syntax Tree for Efficient Code Clone Detection"
Abstract: Over the past decade, the application of deep learning in code clone detection has produced remarkable results. However, the current approaches have two limitations: (a) code representation approaches with low information utilization, such as vanilla Abstract Syntax Tree (AST), leading to information redundancy which results in performance degradation; (b) low efficiency of clone detection on evaluation, resulting in excessive time costs during practical use. In this paper, we propose a Multiple Representation Transformer with Optimized Abstract Syntax Tree (MRT-OAST) to introduce an efficient code representation method while achieving competitive performance. Specifically, MRT-OAST strategically prunes and enhances the AST, utilizing both pre-order and post-order traversals to represent two different representations. To speed up the evaluation process, MRT-OAST utilizes a pure Siamese network and employs cosine similarity to compare the similarity between codes. Our approach effectively reduces AST sequences to 40% and 39% of their original length in Java and C/C++ while preserving structural information. In code clone detection tasks, our model surpasses state-of-the-art approaches on OJClone and Google Code Jam. During the evaluation of BigCloneBench, our model has a 5x speed improvement compared to the state-of-the-art lightweight model and a 563x speed improvement compared to the BERT-based model, with only a 0.3% and 0.9% decrease in $F_1$-score.
Tags: "AI for SE", "Analysis"Tanghaoran Zhang, Yue Yu, Xinjun Mao, Shangwen Wang, Kang Yang, Yao Lu, Zhang Zhang, Yuxin Zhao, "Instruct or Interact? Exploring and Eliciting LLMs’ Capability in Code Snippet Adaptation Through Prompt Engineering"
Abstract: Code snippet adaptation is a fundamental activity in the software development process. Unlike code generation, code snippet adaptation is not a ``free creation'', which requires developers to tailor a given code snippet in order to fit specific requirements and the code context. Recently, large language models (LLMs) have confirmed their effectiveness in the code generation task with promising results. However, their performance on code snippet adaptation, a reuse-oriented and context-dependent code change prediction task, is still unclear. To bridge this gap, we conduct an empirical study to investigate the performance and issues of LLMs on the adaptation task. We first evaluate the adaptation performances of three popular LLMs and compare them to the code generation task. Our result indicates that their adaptation ability is weaker than generation, with a nearly 15\% decrease on pass@1 and more context-related errors. By manually inspecting 200 cases, we further investigate the causes of LLMs’ sub-optimal performance, which can be classified into three categories, \emph{i.e.,} \textit{Unclear Requirement}, \textit{Requirement Misalignment} and \textit{Context Misapplication}. Based on the above empirical research, we propose an interactive prompting approach to eliciting LLMs' ability on the adaptation task. Specifically, we enhance the prompt by enriching the context and decomposing the task, which alleviates context misapplication and improves requirement understanding. Besides, we enable LLMs' reflection by requiring them to interact with a human or a LLM counselor, compensating for unclear requirement. Our experimental result reveals that our approach greatly improve LLMs' adaptation performance. The best-performing Human-LLM interaction successfully solves 159 out of the 202 identified defects and improves the pass@1 and pass@5 by over 40\% compared to the initial instruction-based prompt. Considering human efforts, we suggest multi-agent interaction as a trade-off, which can achieve comparable performance with excellent generalization ability. We deem that our approach could provide methodological assistance for autonomous code snippet reuse and adaptation with LLMs.
Tags: "AI for SE"Enrique Barba Roque, Luís Cruz, Thomas Durieux, "Unveiling the Energy Vampires: A Methodology for Debugging Software Energy Consumption"
Abstract: Energy consumption in software systems is becoming increasingly important, especially in large-scale deployments. However, debugging energy-related issues remains challenging due to the lack of specialized tools. This paper presents an energy debugging methodology for identifying and isolating energy consumption hotspots in software systems. We demonstrate the methodology's effectiveness through a case study of Redis, a popular in-memory database. Our analysis reveals significant energy consumption differences between Alpine and Ubuntu distributions, with Alpine consuming up to 20.2% more power in certain operations. We trace this difference to the implementation of the `memcpy` function in different C standard libraries (musl vs. glibc). By isolating and benchmarking `memcpy`, we confirm it as the primary cause of the energy discrepancy. Our findings highlight the importance of considering energy efficiency in software dependencies and demonstrate the capability to assist developers in identifying and addressing energy-related issues. This work contributes to the growing field of sustainable software engineering by providing a systematic approach to energy debugging.
Tags: "User experience", "Human/Social"Cedric Richter, Marek Chalupa, Marie-Christine Jakobs, Heike Wehrheim, "Cooperative Software Verification via Dynamic Program Splitting"
Abstract: Cooperative software verification divides the task of software verification among several verification tools in order to increase efficiency and effectiveness. The basic approach is to let verifiers work on different parts of a program and at the end join verification results. While this idea is intuitively appealing, cooperative verification is usually hindered by the facts that program decomposition (1) is often static, disregarding strengths and weaknesses of employed verifiers, and (2) often represents the decomposed program parts in a specific proprietary format, thereby making the use of off-the-shelf verifiers in cooperative verification difficult. In this paper, we propose a novel cooperative verification scheme that we call dynamic program splitting (DPS). Splitting decomposes programs into (smaller) programs, and thus directly enables the use of off-the-shelf tools. In DPS, splitting is dynamically applied on demand: Verification starts by giving a verification task (a program plus a correctness specification) to a verifier V1. Whenever V1 finds the current task to be hard to verify, it splits the task (i.e., the program) and restarts verification on subtasks. DPS continues until (1) a violation is found, (2) all subtasks are completed or (3) some user-defined stopping criterion is met. In the latter case, the remaining uncompleted subtasks are merged into a single one and given to a next verifier V2, repeating the same procedure on the still unverified program parts. This way, the decomposition is steered by what is hard to verify for particular verifiers, leveraging their complementary strengths. We have implemented dynamic program splitting and evaluated it on benchmarks of the annual software verification competition SV-COMP. The evaluation shows that cooperative verification with DPS is able to solve verification tasks that none of the constituent verifiers can solve, without any significant overhead.
Tags: "Analysis", "Security"Yiwei Li, Liangze Yin, Wei Dong, Jiaxin Liu, Yanfeng Hu, Shanshan Li, "Hetrify: Efficient Verification of Heterogeneous Programs on RISC-V"
Abstract: The heterogeneous nature of contemporary software, comprising components like closed-source libraries, embedded assembly snippets, and modules written in multiple programming languages, leads to significant verification challenges. Currently, There are no mature and available methods to effectively address such problems. To bridge this gap, we propose a verification approach capable of effectively verifying heterogeneous programs. This approach is universally applicable. It theoretically supports the verification of any heterogeneous program that can be compiled into binary code, without being constrained by any specific programming language. The approach begins by compiling the entire program or its unverifiable segments into binary format. Under guarantees of semantic equivalence, these binaries are converted into verifiable C code, which can then be verified using existing C verification tools. Based on the RISC-V architecture, we developed the Hetrify tool to implement this verification approach. The tool is supported by rigorous mathematical proofs to ensure operational semantic equivalence between the converted C programs and their original counterparts. To validate our approach, we conducted verification experiments on 130 programs, including 100 assembly programs and 30 large heterogeneous programs with missing critical function source code, demonstrating the effectiveness of our approach.
Tags: "Analysis", "Security"Xue Jiang, Yihong Dong, Yongding Tao, Huanyu Liu, Zhi Jin, Ge Li, "ROCODE: Integrating Backtracking Mechanism and Program Analysis in Large Language Models for Code Generation"
Abstract: Large language models (LLMs) have achieved impressive performance in code generation recently, offering programmers revolutionary assistance in software development. However, due to the auto-regressive nature of LLMs, they are susceptible to error accumulation during code generation. Once an error is produced, LLMs can merely continue to generate the subsequent code conditioned on it, given their inability to adjust previous outputs. This generation process differs from the common practice in human coding, which involves review and adjustment during the coding process according to quality and requirements. Existing LLM-based approaches that typically consider post-revising after code generation fail to resolve errors in time, leading to the challenging resolution of accumulated errors and the significant wastage of resources. Ideally, LLMs should rollback and resolve the occurred error immediately during code generation, rather than proceed on the basis of the error and wait for post-revising after generation. In this paper, we propose \ourapproachbf, which integrates the backtracking mechanism and program analysis into LLMs for code generation. Specifically, we employ program analysis to perform incremental error detection during the generation process. When an error is detected, the backtracking mechanism is triggered to priming rollback strategies and constraint regeneration, thereby avoiding the recurrence of the same error. Experiments on multiple code generation benchmarks show that \ourapproachbf can significantly reduce the errors generated by LLMs, with a compilation pass rate of over 98.9\%. The test pass rate is improved by up to 23.8\% compared to the best baseline approach. Compared to the post-revising baseline, the cost is reduced by 19.3\%. Moreover, our approach is model-agnostic and achieves consistent improvements across six LLMs.
Tags: "AI for SE", "Analysis"Christof Tinnes, Alisa Welter, Sven Apel, "Software Model Evolution with Large Language Models: Experiments on Simulated, Public, and Industrial Datasets"
Abstract: Modeling structure and behavior of software systems plays a crucial role in the industrial practice of software engineering. As with other software engineering artifacts, software models are subject to evolution. Supporting modelers in evolving software models with recommendations for model completions is still an open problem, though. In this paper, we explore the potential of large language models for this task. In particular, we propose an approach, \textsc{RaMc}, leveraging large language models, model histories of software systems, and retrieval-augmented generation for model completion. Through experiments on three datasets, including an industrial application, one public open-source community dataset, and one controlled collection of simulated model repositories, we evaluate the potential of large language models for model completion. We found that large language models are indeed a promising technology for supporting software model evolution (62.30% semantically correct completions on real-world industrial data and up to 86.19% type-correct completions). Furthermroe, we found that the general inference capabilities of large language models are useful, for example, when dealing with concepts for which there are few, noisy, or no examples at all.
Tags: "AI for SE", "Prog Comprehension/Reeng/Maint"Ran Mo, Haopeng Song, Wei Ding, Chaochao Wu, "Code Cloning in Solidity Smart Contracts: Prevalence, Evolution, and Impact on Development"
Abstract: In recent years, the development of Solidity smart contracts has been increasing rapidly in popularity. Code cloning is a common coding practice, and many prior studies have revealed that code clones could negatively impact software maintenance and quality. However, there is little work systematically analyzing the nature and impacts of code clones in solidity smart contracts. To bridge this gap, we investigate the prevalence, evolution, and bug-proneness of code clones in solidity smart contracts, and further identify the possible reasons for these clones’ occurrences. With our evaluation of 26,294 smart contracts with 97,877 functions, we have found that code clones are highly prevalent in smart contracts. Additionally, on average, 32.01% of clones co-evolve, indicating the need for careful management to avoid consistency issues. Surprisingly, unlike in traditional software development, code clones in smart contracts are rarely involved in bug fixes. Finally, we identify three main factors that affect the occurrences of clones. We believe our study can provide valuable insights for developers to understand and manage code clones in solidity smart contracts.
Tags: "Prog Comprehension/Reeng/Maint", "Blockchain"Trey Woodlief, Carl Hildebrandt, Sebastian Elbaum, "A Differential Testing Framework to Identify Critical AV Failures Leveraging Arbitrary Inputs"
Abstract: The proliferation of autonomous vehicles (AVs) has made their failures increasingly evident. Testing efforts aimed at identifying the inputs leading to those failures are challenged by the input's long-tail distribution, whose area under the curve is dominated by rare scenarios. We hypothesize that leveraging emerging open-access datasets can accelerate the exploration of long-tail inputs. Having access to diverse inputs, however, is not sufficient to expose failures; an effective test also requires an oracle to distinguish between correct and incorrect behaviors. Current datasets lack such oracles and developing them is notoriously difficult. In response, we propose DiffTest4AV, a differential testing framework designed to address the unique challenges of testing AV systems: 1) for any given input, many outputs may be considered acceptable, 2) the long-tail contains an insurmountable number of inputs to explore, and 3) the AV's continuous execution loop requires for failures to persist in order to affect the system. DiffTest4AV integrates statistical analysis to identify meaningful behavioral variations, judges their importance in terms of the severity of these differences, and incorporates sequential analysis to detect persistent errors indicative of potential system-level failures. Our study on 5 versions of the commercially-available, road-deployed comma.ai OpenPilot system, using 3 available image datasets, demonstrates the capabilities of the framework to detect high-severity, high-confidence, long-running test failures.
Tags: "Testing and Quality", "Autonomy"Giacomo Benedetti, Oreofe Solarin, Courtney Miller, Greg Tystahl, William Enck, Christian Kästner, Alexandros Kapravelos, Alessio Merlo, Luca Verderame, "An Empirical Study on Reproducible Packaging in Open-Source Ecosystems"
Abstract: The integrity of software builds is fundamental to the security of the software supply chain. While Thompson first raised the potential for attacks on build infrastructure in 1984, limited attention has been given to build integrity in the past 40 years, enabling recent attacks on SolarWinds, event-stream, and xz. The best-known defense against build system attacks is creating \emph{reproducible builds}; however, achieving them can be complex for both technical and social reasons and thus is often viewed as impractical to obtain. In this paper, we analyze reproducibility of builds in a novel context: reusable \emph{components} distributed as \emph{packages} in six popular software ecosystems (npm, Maven, PyPI, Go, RubyGems, and Cargo). Our quantitative study on a representative sample of 4000 packages in each ecosystem raises concerns: Rates of reproducible builds vary widely between ecosystems, with some ecosystems having all packages reproducible whereas others have \issues in nearly every package. However, upon deeper investigation, we identified that with relatively straightforward infrastructure configuration and patching of build tools, we can achieve very high rates of reproducible builds in all studied ecosystems. We conclude that if the ecosystems adopt our suggestions, the build process of published packages can be independently confirmed for nearly all packages without individual developer actions, and doing so will prevent significant future software supply chain attacks.
Tags: "MSR"Nusrat Zahan, Philipp Burckhardt, Mikola Lysenko, Feross Aboukhadijeh, Laurie Williams, "Leveraging Large Language Models to Detect npm Malicious Packages"
Abstract: Existing malicious code detection techniques can aid the manual review process by predicting which packages are likely to be malicious. However, these techniques often suffer from high misclassification rates. Therefore, malicious code detection techniques could be enhanced by adopting advanced, more automated approaches to achieve high accuracy and a low misclassification rate. The goal of this study is to assist security analysts in detecting malicious packages through the empirical study of using Large Language Models (LLMs) to detect malicious code in the npm ecosystem. We present SecurityAI, a malicious code review workflow to detect malicious code using ChatGPT. We leverage a benchmark dataset of 5,115 npm packages, of which 2,180 packages have malicious code. We conducted a baseline comparison of GPT-3 and GPT-4 models with the state-of-the-art CodeQL static analysis tool, using 39 custom CodeQL rules developed in prior research to detect malicious Javascript code. We compare the effectiveness of static analysis as a pre-screener with SecurityAI workflow, measuring the number of files that need to be analyzed and the associated costs. Additionally, we performed a qualitative study to understand the types of malicious packages detected or missed by our workflow. Our baseline comparison demonstrates a 16% and 9% improvement over static analysis in precision and F1 scores, respectively. We attained precision and F1 scores of 91% and 94% for GPT-3, and 99% & 97% for GPT-4, respectively, with GPT-3 offering a cost-effective balance. Pre-screening files with a static analyzer reduces the number of files requiring LLM analysis by 77.9% and decreases costs by 60.9% for GPT-3 and 76.1% for GPT-4. Our qualitative analysis identified data theft, hidden backdoors, and suspicious domain connection categories as the top detected malicious packages. The lack of diversity in model-generated responses led to hallucinations, resulting in misclassification cases, with GPT-3 hallucinating more frequently.
Tags: "AI for SE", "Security"Xin-Cheng Wen, Zirui Lin, Cuiyun Gao, Hongyu Zhang, Yong Wang, Qing Liao, "Repository-Level Graph Representation Learning for Enhanced Security Patch Detection"
Abstract: Software vendors often silently release security patches without providing sufficient advisories (e.g., Common Vulnerabilities and Exposures) or delayed updates via resources (e.g., National Vulnerability Database). Therefore, it has become crucial to detect these security patches to ensure secure software maintenance. However, existing methods face the following challenges: (1) They primarily focus on the information within the patches themselves, overlooking the complex dependencies in the repository. (2) Security patches typically involve multiple functions and files, increasing the difficulty in well learning the representations. To alleviate the above challenges, this paper proposes a \textit{Repo}sitory-level Security Patch Detection framework named \textit{RepoSPD}, which comprises three key components: 1) a repository-level graph construction, RepoCPG, which represents software patches by merging pre-patch and post-patch source code at the repository level; 2) a structure-aware patch representation, which fuses the graph and sequence branch and aims at comprehending the relationship among multiple code changes; 3) progressive learning, which facilitates the model in balancing semantic and structural information. To evaluate RepoSPD, we employ two widely-used datasets in security patch detection: SPI-DB and PatchDB. We further extend these datasets to the repository level, incorporating a total of 20,238 and 28,781 versions of repository in C/C++ programming languages, respectively, denoted as SPI-DB* and PatchDB*. We compare RepoSPD with six existing security patch detection methods and five static tools. Our experimental results demonstrate that RepoSPD outperforms the state-of-the-art baseline, with improvements of 11.90\%, and 3.10\% in terms of accuracy on the two datasets, respectively. These results underscore the effectiveness of RepoSPD in detecting security patches. Furthermore, RepoSPD can detect 151 security patches, which outperforms the best-performing baseline by 21.36\% with respect to accuracy.
Tags: "AI for SE", "Security"Tajkia Rahman Toma, Balreet Grewal, Cor-Paul Bezemer, "Answering User Questions about Machine Learning Models through Standardized Model Cards"
Abstract: Reusing pre-trained machine learning models is becoming very popular due to model hubs such as Hugging Face (HF). However, similar to when reusing software, many issues may arise when reusing an ML model. In many cases, users resort to asking questions on discussion forums such as the HF community forum. In this paper, we study how we can reduce the community's workload in answering these questions and increase the likelihood that questions receive a quick answer. We analyze 11,278 discussions from the HF model community that contain user questions about ML models. We focus on the effort spent handling questions, the high-level topics of discussions, and the potential for standardizing responses in model cards based on a model card template. Our findings indicate that there is not much effort involved in responding to user questions, however, 40.1% of the questions remain open without any response. A topic analysis shows that discussions are more centered around technical details on model development and troubleshooting, indicating that more input from model providers is required. We show that 42.5% of the questions could have been answered if the model provider followed a standard model card template for the model card. Based on our analysis, we recommend that model providers add more development-related details on the model's architecture, algorithm, data preprocessing and training code in existing documentation (sub)sections and add new (sub)sections to the template to address common questions about model usage and hardware requirements.
Tags: "MSR", "SE for AI"Cho-Ting Lee, Andrew Neeser, Shengzhe Xu, Jay Katyan, Patrick Cross, Sharanya Pathakota, Marigold Norman, John C. Simeone, Jaganmohan Chandrasekaran, Naren Ramakrishnan, "Can an LLM find its way around a Spreadsheet?"
Abstract: Spreadsheets are routinely used in business and scientific contexts, and one of the most vexing challenges is performing data cleaning prior to analysis and evaluation. The ad-hoc and arbitrary nature of data cleaning problems, such as typos, inconsistent formatting, missing values, and a lack of standardization, often creates the need for highly specialized pipelines. We ask whether an LLM can find its way around a spreadsheet and how to support end-users in taking their free-form data processing requests to fruition. Just like RAG retrieves context to answer users’ queries, we demonstrate how we can retrieve elements from a code library to compose data preprocessing pipelines. Through comprehensive experiments, we demonstrate the quality of our system and how it is able to continuously augment its vocabulary by saving new codes and pipelines back to the code library for future retrieval.
Tags: "AI for SE", "Analysis"Madeline Janecek, Naser Ezzati-Jivan, Abdelwahab Hamou-Lhadj, "Execution Trace Reconstruction Using Diffusion-Based Generative Models"
Abstract: Execution tracing is essential for understanding system and software behaviour, yet lost trace events can significantly compromise data integrity and analysis. Existing solutions for trace reconstruction often fail to fully leverage available data, particularly in complex and high-dimensional contexts. Recent advancements in generative artificial intelligence, particularly diffusion models, have set new benchmarks in image, audio, and natural language generation. This study conducts the first comprehensive evaluation of diffusion models for reconstructing incomplete trace event sequences. Using nine distinct datasets generated from the Phoronix Test Suite, we rigorously test these models on sequences of varying lengths and missing data ratios. Our results indicate that the SSSD$^{S4}$ model, in particular, achieves superior performance, in terms of accuracy, perfect rate, and ROUGE-L score across diverse imputation scenarios. These findings underscore the potential of diffusion-based models to accurately reconstruct missing events, thereby maintaining data integrity and enhancing system monitoring and analysis.
Tags: "Testing and Quality", "Analysis", "AI for SE"Joseph Romeo, Marco Raglianti, Csaba Nagy, Michele Lanza, "UML is Back. Or is it? Investigating the Past, Present, and Future of UML in Open Source Software"
Abstract: Since its inception, UML, the Unified Modeling Language, has been touted as the way to go when it comes to designing and documenting software systems. While being an integral part of many university software engineering programs, UML has found little consideration among developers, especially in open source software. Reasons for this include that UML shares some shortcomings with other forms of documentation (e.g., limited availability, outdatedness, inadequate level of detail). We present a study to investigate the evolution and the current situation regarding the use of UML in open source projects. We mined and analyzed ~13k GitHub projects, developing strategies and heuristics to identify UML files through their extensions and contents, for a quantitative analysis of two decades of evolution of the usage of UML. We explored the popularity of UML, derived characteristics of projects leveraging UML, and analyzed the authors, creators and maintainers, of UML artifacts. Our study confirms that UML is indeed still under-utilized. At the same time we found evidence of a resurgence coinciding with the popularity of human-readable text-based formats, defined and used by tools like PlantUML and Mermaid. We discuss how identifying and addressing the new challenges implied by this resurgence could impact the future of UML.
Tags: "Prog Comprehension/Reeng/Maint", "Human/Social"Abdul Haddi Amjad, Muhammad Danish, Bless Jah, Muhammad Ali Gulzar, "Accessibility Issues in Ad-Driven Web Applications"
Abstract: Website accessibility is essential for inclusiveness and regulatory compliance. Although third-party advertisements (ads) are a vital revenue source for free web services, they introduce significant accessibility challenges. Leasing a website’s space to ad-serving technologies like DoubleClick results in developers losing control over ad content accessibility. Even on highly accessible websites, third-party ads can undermine adherence to Web Content Accessibility Guidelines (WCAG). We conduct the first-of-its-kind large-scale investigation of 430K website elements, including nearly 100K ad elements, to understand the accessibility of ads on websites. We seek to understand the prevalence of inaccessible ads and their overall impact on the accessibility of websites. Our findings show that 67% of websites experience increased accessibility violations due to ads, with common violations including Focus Visible (WCAG 2.4.7) and On Input (WCAG 3.2.2). Popular ad-serving technologies like Taboola, DoubleClick, and RevContent often serve ads that fail to comply with WCAG standards. Even when ads are WCAG compliant, 27% of them have alternative text in ad images that misrepresents information, potentially deceiving users. Manual inspection of a sample of these misleading ads revealed that user-identifiable data is collected on 94% of websites through interactions, such as hovering or pressing enter. Since users with disabilities often rely on tools like screen readers that require hover events to access website content, they have no choice but to compromise their privacy in order to navigate website ads. Based on our findings, we further dissect the root cause of these violations and provide design guidelines to both website developers and ad-serving technologies to achieve WCAG-compliant ad integration.
Tags: "Human/Social", "Testing and Quality"Myeongsoo Kim, Tyler Stennett, Saurabh Sinha, Alessandro Orso, "A Multi-Agent Approach for REST API Testing with Semantic Graphs and LLM-Driven Inputs"
Abstract: As modern web services increasingly rely on REST APIs, their thorough testing has become crucial. Furthermore, the advent of REST API specifications such as OpenAPI ones has led to the emergence of many black-box REST API testing tools. However, these tools often focus on individual test elements in isolation (e.g., APIs, parameters, values), resulting in lower coverage and less effectiveness in detecting faults (i.e., 500 response codes). To address these limitations, we present AutoRestTest, the first black-box framework to adopt a dependency-embedded multi-agent approach for REST API testing, integrating Multi-Agent Reinforcement Learning (MARL) with a Semantic Property Dependency Graph (SPDG) and Large Language Models (LLMs). Our approach treats REST API testing as a separable problem, where four agents---API, dependency, parameter, and value---collaborate to optimize API exploration. LLMs handle domain-specific value restrictions, the SPDG model simplifies the search space for dependencies using a similarity score between API operations, and MARL dynamically optimizes the agents' behavior. Evaluated on 12 real-world REST services, AutoRestTest outperforms the four leading black-box REST API testing tools, including those assisted by RESTGPT (which augments realistic test inputs using LLMs), in terms of code coverage, operation coverage, and fault detection. Notably, AutoRestTest is the only tool able to identify an internal server error in Spotify. Our ablation study underscores the significant contributions of the agent learning, SPDG, and LLM components.
Tags: "Testing and Quality", "AI for SE"Geraldine Galindo-Gutierrez, Juan Pablo Sandoval Alcocer, Nicolas Jimenez-Fuentes, Alexandre Bergel, Gordon Fraser, "Increasing the Effectiveness of Automatically Generated Tests by Improving Class Observability"
Abstract: Automated unit test generation consists of two complementary challenges: Finding sequences of API calls that exercise the code of a class under test, and finding assertion statements that validate the behaviour of the class during execution. The former challenge is often addressed using meta-heuristic search algorithms optimising tests for code coverage, which are then annotated with regression assertions to address the latter challenge, i.e., assertions that capture the states observed during test generation. While the resulting tests tend to achieve high coverage, their fault finding potential is often inhibited by poor or difficult observability of the codebase. That is, relevant attributes and properties may either not be exposed adequately at all, or only in ways that the test generator is unable to handle. In this paper, we investigate the influence of observability in the context of the EvoSuite search-based Java test generator, which we extend in two complementary ways to study and improve observability: First, we apply a transformation to code under test to expose encapsulated attributes to the test generator; second, we address EvoSuite's limited capability of asserting the state of complex objects. Our evaluation demonstrates that together these observability improvements lead to significantly increased mutation scores, underscoring the importance of considering the class observability in the test generation process.
Tags: "Testing and Quality"Nabson Silva, Eriky Rodrigues, Tayana Conte, "A Catalog of Micro Frontends Anti-patterns"
Abstract: Micro frontend (MFE) architectures have gained significant popularity for promoting independence and modularity in development. Despite their widespread adoption, the field remains relatively unexplored, especially concerning identifying problems and documenting best practices. Drawing on both established microservice (MS) anti-patterns and the analysis of real problems faced by software development teams that adopt MFE, this paper presents a catalog of 12 MFE anti-patterns. We composed an initial version of the catalog by recognizing parallels between MS anti-patterns and recurring issues in MFE projects to map and adapt MS anti-patterns to the context of MFE. To validate the identified problems and proposed solutions, we conducted a survey with industry practitioners, collecting valuable feedback to refine the anti-patterns. Additionally, we asked participants if they had encountered these problems in practice and to rate their harmfulness on a 10-point Likert scale. The survey results revealed that participants had encountered all the proposed anti-patterns in real-world MFE architectures, with only one reported by less than 50% of participants. They stated that the catalog can serve as a valuable guide for both new and experienced developers, with the potential to enhance MFE development quality. The collected feedback led to the development of a improved version of the anti-patterns catalog. Furthermore, we developed a web application designed to not only showcase the anti-patterns but also to actively foster collaboration and engagement within the MFE community. The proposed catalog is a valuable resource for identifying and mitigating potential pitfalls in MFE development. It empowers developers of all experience levels to create more robust, maintainable, and well-designed MFE applications.
Tags: "Design/Architecture"Georgios Sakkas, Pratyush Sahu, Kyeling Ong, Ranjit Jhala, "Neurosymbolic Modular Refinement Type Inference"
Abstract: Refinement types -- a type-based generalization of Floyd-Hoare logics -- are an expressive and modular means of statically ensuring a wide variety of correctness, safety, and security properties of software. However, their expressiveness and modularity means that to use them, a developer must laboriously \emph{annotate} all the functions in their code with potentially complex type specifications that specify the contract for that function. We present XO, a neurosymbolic agent that uses LLMs to automatically generate refinement type annotations for all the functions in an entire package or module, using the refinement type checker LiquidHaskell as an oracle to verify the correctness of the generated specifications. We curate a dataset of three Haskell packages where refinement types are used to enforce a variety of correctness properties from data structure invariants to low-level memory safety and use this dataset to evaluate XO. Previously these packages required expert users several days to weeks to annotate with refinement types. Our evaluation shows that when even using models with relatively smaller models like the 3 billion parameter StarCoder LLM, by using fine-tuning, carefully chosen contexts, our neurosymbolic agent generates refinement types for up to 94\% of the functions across entire libraries automatically in just a few hours, thereby showing that LLMs can drastically shrink the human effort needed to use formal verification.
Tags: "AI for SE", "Analysis"Luís F. Gomes, Vincent Hellendoorn, Jonathan Aldrich, Rui Abreu, "An Exploratory Study of ML Sketches and Visual Code Assistants"
Abstract: This paper explores the integration of Visual Code Assistants in Integrated Development Environments (IDEs). In Software Engineering, whiteboard sketching is often the initial step before coding, serving as a crucial collaboration tool for developers. Previous studies have investigated patterns in SE sketches and how they are used in practice, yet methods for directly using these sketches for code generation remain limited. The emergence of visually-equipped large language models presents an opportunity to bridge this gap, which is the focus of our research. In this paper, we built a first prototype of a Visual Code Assistant to get user feedback regarding in-IDE sketch-to-code tools. We conduct an experiment with 19 data scientists, most of whom regularly sketch as part of their job. We investigate developers' mental models by analyzing patterns commonly observed in their sketches when developing an ML workflow. Analysis indicates that diagrams were the preferred organizational component (52.6\%), often accompanied by lists (42.1\%) and numbered points (36.8\%). Our tool converts their sketches into a Python notebook by querying an LLM. We use an LLM-as-judge setup to score the quality of the generated code, finding that even brief sketching can effectively generate useful code outlines. We also find a significant, positive correlation between sketch time and the quality of the generated code. We conclude the study by conducting extensive interviews to assess the tool's usefulness, explore potential use cases, and understand developers' needs. As noted by participants, promising applications for these assistants include education, prototyping, and collaborative settings. Our findings signal promise for the next generation of Code Assistants to integrate visual information, both to improve code generation and to better leverage developers' existing sketching practices.
Tags: "AI for SE", "Human/Social"Waris Gill, Ali Anwar, Muhammad Ali Gulzar, "TraceFL: Interpretability-Driven Debugging in Federated Learning via Neuron Provenance"
Abstract: In Federated Learning, clients train models on local data and send updates to a central server, which aggregates them into a global model using a fusion algorithm. This collaborative yet privacy-preserving training comes at a cost—FL developers face significant challenges in attributing global model predictions to specific clients. Localizing responsible clients is a crucial step towards (a) excluding clients primarily responsible for incorrect predictions and (b) encouraging clients who contributed high quality models to continue participating in the future. Existing ML explainability approaches are inherently inapplicable as they are designed for single-model, centralized training. We introduce TraceFL, a fine-grained neuron provenance capturing mechanism that identifies clients responsible for the global model’s prediction by tracking the flow of information from individual clients to the global model. Since inference on different inputs activates a different set of neurons of the global model, TraceFL dynamically quantifies the significance of the global model’s neurons in a given prediction. It then selectively picks a slice of the most crucial neurons in the global model and maps them to the corresponding neurons in every participating client to determine each client’s contribution, ultimately localizing the responsible client. We evaluate TraceFL on six datasets, including two real-world medical imaging datasets and four neural networks, including advanced models such as GPT. TraceFL achieves 99% accuracy in localizing the responsible client in FL tasks spanning both image and text classification tasks. At a time when state-of-the-art ML debugging approaches are mostly domain-specific (e.g., image classification only), TraceFL is the first technique to enable highly accurate automated reasoning across a wide range of FL applications.
Tags: "SE for AI", "Testing and Quality"Yuanfang Cai, Lanting He, Yony Kochinski, Jun Qian, Ciera Jaspan, Nan Zhang, Antonio Bianco, "Understanding Architectural Complexity, Maintenance Burden, and Developer Sentiment---a Large-Scale Study"
Abstract: Intuitively, the more complex a software system is, the harder it is to maintain. Statistically, it is not clear which complexity metrics correlate with maintenance effort; in fact, it is not even clear how to objectively measure maintenance burden, so that developers' sentiment and intuition can be supported by numbers. Without effective complexity and maintenance metrics, it remains difficult to objectively monitor maintenance, control complexity, or justify refactoring. In this paper, we report a large-scale study of 1252 projects written in C++ and Java from Company_X. We collected three categories of metrics: (1) architectural complexity, measured using propagation cost (PC), decoupling level (DL), and structural anti-patterns; (2) maintenance activity, measured using the number of changes, lines of code (LOC) written, and active coding time (ACT) spent on feature-addition vs. bug-fixing, and (3) developer sentiment on complexity and productivity, collected from 7200 survey responses. We statistically analyzed the correlations among these metrics and obtained significant evidence of the following findings: 1) the more complex the architecture is (higher propagation cost, more instances of anti-patterns), the more LOC is spent on bug-fixing, rather than adding new features; 2) developers who commit more changes for features, spend more lines of code on features, or spend more time on features also feel that they are less hindered by technical debt and complexity. To the best of our knowledge, this is the first large-scale empirical study establishing the statistical correlation among architectural complexity, maintenance activity, and developer sentiment. The implication is that, instead of solely relying upon developer sentiment and intuition to detect degraded structure or increased burden to evolve, it is possible to objectively and continuously measure and monitor architectural complexity and maintenance difficulty, increasing feature delivery efficiency by reducing architectural complexity and anti-patterns.
Tags: "Design/Architecture", "MSR", "Human/Social", "Prog Comprehension/Reeng/Maint"Aleks Chakarov, Jaco Geldenhuys, Matthew Heck, Michael Hicks, Samuel Huang, Georges-Axel Jaloyan, Anjali Joshi, K. Rustan M. Leino, Mikael Mayer, Sean McLaughlin, Akhilesh Mritunjai, Clement Pit-Claudel, Sorawee Porncharoenwase, Florian Rabe, Marianna Rapoport, Giles Reger, Cody Roux, Neha Rungta, Robin Salkeld, Matthias Schlaipfer, Daniel Schoepe, Johanna Schwartzentruber, Serdar Tasiran, Aaron Tomb, Emina Torlak, Jean-Baptiste Tristan, Lucas Wagner, Michael W. Whalen, Remy Willems, Tongtong Xiang, Tae Joon Byun, Joshua Cohen, Ruijie Fang, Junyoung Jang, Jakob Rath, Hira Taqdees Syeda, Dominik Wagner, Yongwei Yuan, "Formally Verified Cloud-Scale Authorization"
Abstract: All critical systems must evolve to meet the needs of a growing and diversifying user base. But supporting that evolution is challenging at increasing scale: Maintainers must find a way to ensure that each change does only what is intended, and will not inadvertently change behavior for existing users. This paper presents how we addressed this challenge for a cloud-scale authorization engine, invoked 1 billion times per second, by using formal verification. Over a period of four years, we built a new authorization engine, one that behaves functionally the same as its predecessor, using the verification-aware programming language Dafny. We can now confidently deploy enhancements and optimizations while maintaining the highest assurance of both correctness and backward compatibility. We deployed the new engine in 2024 without incident and customers immediately enjoyed a threefold performance improvement. The methodology we followed to build this new engine was not an off-the-shelf application of an existing verification tool, and this paper presents several key insights: 1) Rather than prove correct the existing engine, written in Java, we found it more effective to \emph{write a new engine} in Dafny, a language built for \emph{verification from the ground up}, and then compile the result to Java. 2) To ensure performance, debuggability, and to gain trust from stakeholders, we needed to generate readable, \emph{idiomatic} Java code, essentially a transliteration of the source Dafny. 3) To ensure that the specification matches the system's actual behavior, we performed \emph{extensive differential and shadow testing} throughout the development process, ultimately comparing against $10^{15}$ production samples prior to deployment. Our approach demonstrates how formal verification can be effectively applied to evolve critical legacy software at scale.
Tags: "Design/Architecture", "Security", "Prog Comprehension/Reeng/Maint", "Human/Social"Shazibul Islam Shamim, Hanyang Hu, Akond Rahman, "On Prescription or Off Prescription? An Empirical Study of Community-prescribed Security Configurations for Kubernetes"
Abstract: Despite being beneficial for rapid delivery of software, Kubernetes deployments can be susceptible to security attacks, which can cause serious consequences. A systematic characterization of how community-prescribed security configurations, i.e., security configurations that are recommended by security experts, can aid practitioners to secure their Kubernetes deployments. To that end, we conduct an empirical study with 53 security configurations recommended by the Center for Internet Security (CIS), 20 survey respondents, and 356 configuration files obtained from open source software (OSS) repositories and 188 configuration files used by Company-A. From our empirical study, we observe: (i) practitioners can be unaware of prescribed security configurations as 5%~40% of the survey respondents are unfamiliar with 16 prescribed configurations; and (ii) for Company-A and OSS respectively, 18.0% and 17.9% of the configuration files include at least one violation of prescribed configurations. From our evaluation with 5 static application security testing (SAST) tools we find (i) only Kubescape to support all of the prescribed security configurations; (ii) the highest observed precision to be 0.48 and 0.43 respectively, for the Company-A and OSS datasets; and (iii) the highest observed recall to be respectively, 0.53 and 0.65 for the Company-A and OSS datasets. We conclude the paper by providing recommendations for practitioners on how they can use existing SAST tools to secure their Kubernetes deployments.
Tags: "Prog Comprehension/Reeng/Maint", "Security"Jaehyeok Lee, Sooyoung Cha, "TopSeed: Learning Seed Selection Strategies for Symbolic Execution from Scratch"
Abstract: We present TopSeed, a new approach that automatically selects optimal seeds to enhance symbolic execution. Recently, the performance of symbolic execution has significantly improved through various state-of-the-art techniques, including search strategies and state-pruning heuristics. However, these techniques have typically demonstrated their effectiveness without considering “seeding”, which efficiently initializes program states for exploration. This paper aims to select valuable seeds from candidate inputs generated during interactions with any symbolic execution technique, without the need for a predefined seed corpus, thereby maximizing the technique’s effectiveness. One major challenge is the vast number of candidates, making it difficult to identify promising seeds. To address this, we introduce a customized online learning algorithm that iteratively groups candidate inputs, ranks each group, and selects a seed from the top-ranked group based on data accumulated during symbolic execution. Experimental results on 17 open-source C programs show that TopSeed significantly enhances four distinct cutting-edge techniques, implemented on top of two symbolic executors, in terms of branch coverage and bug-finding abilities.
Tags: "Testing and Quality"Ali Ebrahimi Pourasad, Walid Maalej, "Does GenAI Make Usability Testing Obsolete?"
Abstract: Ensuring usability is crucial for the success of mobile apps. Usability issues can compromise user experience and negatively impact the perceived app quality. This paper presents UX-LLM, a novel tool powered by a Large Vision-Language Model that predicts usability issues in iOS apps. To evaluate the performance of UX-LLM we predicted usability issues in two open-source apps of a medium complexity and asked usability experts to assess the predictions. We also performed traditional usability testing and expert review for both apps and compared the results to those of UX-LLM. UX-LLM demonstrated precision ranging from 0.61 and 0.66 and recall between 0.35 and 0.38, indicating its ability to identify valid usability issues, yet failing to capture the majority of issues. Finally, we conducted a focus group with an app development team of a capstone project developing a transit app for visually impaired persons. The focus group expressed positive perceptions of UX-LLM as it identified unknown usability issues in their app. However, they also raised concerns about its integration into the development workflow, suggesting potential improvements. Our results show that UX-LLM cannot fully replace traditional usability evaluation methods but serves as a valuable supplement particularly for small teams with limited resources, to identify issues in less common user paths, due to its ability to inspect the source code.
Tags: "AI for SE", "Testing and Quality"Zhao Tian, Junjie Chen, Xiangyu Zhang, "Fixing Large Language Models' Specification Misunderstanding for Better Code Generation"
Abstract: Code generation is to automatically generate source code conforming to a given programming specification, which has received extensive attention especially with the development of large language models (LLMs). Due to the inherent difficulty of code generation, the code generated by LLMs may not be aligned with the specification. Although thought-eliciting prompting techniques have been proposed to enhance the code generation performance of LLMs, producing correct understanding for complicated programming problems remains challenging, resulting in unsatisfactory performance. Also, some feedback-based prompting techniques have been proposed to fix incorrect code using error messages produced by test execution. However, when the generated code deviates significantly from the ground truth, they encounter difficulties in improving performance based on such coarse-grained information. In this work, we propose a novel prompting technique, called μFiX, to improve the code generation performance of LLMs by devising both sophisticated thought-eliciting prompting and feedback-based prompting and making the first exploration on their synergy. It first exploits test case analysis to obtain specification understanding and enables a self-improvement process to identify and refine the misunderstanding in the thought-eliciting prompting phase. μFiX further fixes the specification understanding towards the direction reducing the gap between the provided understanding (from the first phase) and the actual understanding implicitly utilized by LLMs for code generation in the feedback-based prompting phase. By improving the understanding with μFiX, the code generation performance of LLMs can be largely improved. Our evaluation on two advanced LLMs (ChatGPT and DeepSeek-Coder) with six widely-used benchmarks by comparing with 15 baselines, demonstrates the effectiveness of μFiX. For example, μFiX outperforms the most effective baseline with an average improvement of 35.62% in terms of Pass@1 across all subjects.
Tags: "AI for SE", "SE for AI", "Analysis"Jingwen Zhang, Zibin Zheng, Yuhong Nan, Mingxi Ye, Kaiwen Ning, Yu Zhang, Weizhe Zhang, "SmartReco: Detecting Read-Only Reentrancy via Fine-Grained Cross-DApp Analysis"
Abstract: Despite the increasing popularity of Decentralized Applications (DApps), they are suffering from various vulnerabilities that can be exploited by adversaries for profits. Among such vulnerabilities, Read-Only Reentrancy (called ROR in this paper), is an emerging type of vulnerability that arises from the complex interactions between DApps. In recent three years, attack incidents of ROR have already caused around 30M USD losses to the DApp ecosystem. Existing techniques for vulnerability detection in smart contracts can hardly detect Read-Only Reentrancy attacks, due to the lack of tracking and analyzing the complex interactions between multiple DApps. In this paper, we propose SmartReco, a new framework for detecting Read-Only Reentrancy vulnerability in DApps through a novel combination of static and dynamic analysis (i.e., fuzzing) over smart contracts. The key design behind SmartReco is threefold: (1) SmartReco identifies the boundary between different DApps from the heavy-coupled cross-contract interactions. (2) SmartReco performs fine-grained static analysis to locate points of interest (i.e., entry functions) that may lead to ROR. (3) SmartReco utilizes the on-chain transaction data and performs multi-function fuzzing (i.e., the entry function and victim function) across different DApps to verify the existence of ROR. Our evaluation of a manual-labeled dataset with 45 RORs shows that SmartReco achieves an accuracy of 88.63% and a recall of 86.36%. In addition, SmartReco successfully detects 43 new RORs from 123 popular DApps. The total assets affected by such RORs reach around 520,000 USD.
Tags: "Security", "Analysis"Mengya Zhang, Preksha Shukla, Wuqi Zhang, Zhuo Zhang, Pranav Agrawal, Zhiqiang Lin, Xiangyu Zhang, Xiaokuan Zhang, "An Empirical Study of Proxy Smart Contracts at the Ethereum Ecosystem Scale"
Abstract: Proxy has been introduced as a design pattern to separate data and code in an application to two different types of smart contracts, namely proxy and logic contracts, respectively. Data is stored in the proxy contracts, while the code to be executed is fetched from the logic contracts. Proxy patterns facilitate the flexibility of smart contract development by enabling upgradeability, extensibility, code reuse, etc. Despite its popularity and importance, there is currently no systematic study to understand the prevalence, use scenarios, and development pitfalls of proxies. In this work, we conduct the first comprehensive study on Ethereum proxies. To collect a comprehensive dataset of proxies, we propose ProxyEx, the first framework designed to detect proxy directly from bytecode. Our evaluation shows that ProxyEx achieves over 99% accuracy. With ProxyEx, we collect a large-scale dataset of 2,031,422 proxies from all contracts in Ethereum and conduct the first systematic empirical study. We first measure the total number of proxies and their transaction traffic, to obtain an overall understanding of the status quo of proxies on Ethereum. Then, we categorize the design pattern and use scenarios of proxies into four types: upgradeability, extensibility, code-sharing, and code-hiding. We further identify three types of common pitfalls in proxies: proxy-logic storage collision, logic-logic storage collision, and uninitialized contracts. We also design three checkers for these common pitfalls in proxies by replaying historical transactions. Our study leads to many interesting findings. For instance, we find that upgradeability is not the only reason that developers adopt the proxy pattern in developing Decentralized Applications (DApps). We also find that many proxies suffer from bugs such as storage collision and uninitialized contracts. Our study sheds light on the proxies landscape, and provides valuable insights to future smart contract research on the development, usage, quality assurance, and bug detection of proxies.
Tags: "Blockchain", "MSR"Yifan Wu, Yunpeng Wang, Ying Li, Wei Tao, Siyu Yu, Haowen Yang, Wei Jiang, Jianguo Li, "An Empirical Study on Commit Message Generation using LLMs via In-Context Learning"
Abstract: Commit messages concisely describe code changes in natural language and are important for software maintenance. Several approaches have been proposed to automatically generate commit messages, but they still suffer from critical limitations, such as time-consuming training and poor generalization ability. To tackle these limitations, we propose to borrow the weapon of large language models (LLMs) and in-context learning (ICL). Our intuition is based on the fact that the training corpora of LLMs contain extensive code changes and their pairwise commit messages, which makes LLMs capture the knowledge about commits, while ICL can exploit the knowledge hidden in the LLMs and enable them to perform downstream tasks without model tuning. However, it remains unclear how well LLMs perform on commit message generation via ICL. Therefore, in this paper, we conduct a comprehensive empirical study to investigate the capability of LLMs to generate commit messages via ICL. Specifically, we first explore the impact of different settings on the performance of ICL-based commit message generation. We then compare ICL-based commit message generation with state-of-the-art approaches on a popular multilingual dataset and a new dataset we created to mitigate potential data leakage. The results show that ICL-based commit message generation significantly outperforms state-of-the-art approaches on subjective evaluation and achieves better generalization ability. We further analyze the root causes for LLM’s underperformance and propose several implications, which shed light on future research directions for using LLMs to generate commit messages.
Tags: "AI for SE"Octavio Galland, Marcel Böhme, "Invivo Fuzzing by Amplifying Actual Executions"
Abstract: A major bottleneck that remains when fuzzing software libraries is the need for _fuzz drivers_, i.e., the glue code between the fuzzer and the library. Despite years of fuzzing, critical security flaws are still found, e.g., by manual auditing, because the fuzz drivers do not cover the complex interactions between the library and the host programs using it. In this work we propose an alternative approach to library fuzzing, which leverages a valid execution context that is set up by a given program using the library (the _host_), and _amplify_ its execution. More specifically, we execute the host until a designated function from a list of _target_ functions has been reached, and then perform coverage-guided function-level fuzzing on it. Once the fuzzing quota is exhausted, we move on to fuzzing the next target from the list. In this way we not only reduce the amount of manual work needed by a developer to incorporate fuzzing into their workflow, but we also allow the fuzzer to explore parts of the library as they are used in real-world programs that may otherwise not have been tested due to the simplicity of most fuzz drivers.
Tags: "Testing and Quality"Wenwei Gu, Jiazhen Gu, Jinyang Liu, Zhuangbin Chen, Jianping Zhang, Jinxi Kuang, Cong Feng, Yongqiang Yang, Michael Lyu, "ADAMAS: Adaptive Domain-Aware Performance Anomaly Detection in Cloud Service Systems"
Abstract: A common practice in the reliability engineering of cloud services involves the collection of monitoring metrics, followed by comprehensive analysis to identify performance issues. However, existing methods often fall short of detecting diverse and evolving anomalies across different services. Moreover, there exists a significant gap between the technical and business interpretation of anomalies, i.e., a detected anomaly may not have an actual impact on system performance or user experience. To address these challenges, we propose ADAMAS, an adaptive AutoML-based anomaly detection framework aiming to achieve practical anomaly detection in production cloud systems. To improve the ability of detecting cross-service anomalies, we design a novel unsupervised evaluation function to facilitate the automatic searching of the optimal model structure and parameters. ADAMAS also contains a lightweight human-in-the-loop design, which can efficiently incorporate expert knowledge to adapt to the evolving anomaly patterns and bridge the gap between predicted anomalies and actual business exceptions. Furthermore, through monitoring the rate of mispredicted anomalies, ADAMAS proactively re-configures the optimal model, forming a continuous loop of system improvement. Extensive evaluation on one public and two industrial datasets shows that ADAMAS outperforms all baseline models with a 0.891 F1-score. The ablation study also proves the effectiveness of the evaluation function design and the incorporation of expert knowledge.
Tags: "AI for SE", "Prog Comprehension/Reeng/Maint"Zewei Lin, Jiachi Chen, Jiajing Wu, Weizhe Zhang, Zibin Zheng, "Definition and Detection of Centralization Defects in Smart Contracts"
Abstract: In recent years, security incidents stemming from centralization defects in smart contracts have led to substantial financial losses. A centralization defect refers to any error, flaw, or fault in a smart contract’s design or development stage that introduces a single point of failure. Such defects allow a specific account or user to disrupt the normal operations of smart contracts, potentially causing malfunctions or even complete project shutdowns. Despite the significance of this issue, most current smart contract analyses overlook centralization defects, focusing primarily on other types of defects. To address this gap, our paper introduces six types of centralization defects in smart contracts by manually analyzing 597 Stack Exchange posts and 117 audit reports. For each defect, we provide a detailed description and code examples to illustrate its characteristics and potential impacts. Additionally, we introduce a tool named CDRipper (Centralization Defects Ripper) designed to identify the defined centralization defects. Specifically, CDRipper constructs a permission dependency graph (PDG) and extracts the permission dependencies of functions from the source code of smart contracts. It then detects the sensitive operations in functions and identifies centralization defects based on predefined patterns. We conduct a large-scale experiment using CDRipper on 244,424 real-world smart contracts and evaluate the results based on a manually labeled dataset. Our findings reveal that 82,446 contracts contain at least one of the six centralization defects, with our tool achieving an overall precision of 93.7%.
Tags: "Blockchain", "Security", "Testing and Quality"Changjian Zhang, Parv Kapoor, Ian Dardik, Leyi Cui, Romulo Meira-Goes, David Garlan, Eunsuk Kang, "Constrained LTL Specification Learning from Examples"
Abstract: Temporal logic specifications play an important role in a wide range of software analysis tasks, such as model checking, automated synthesis, program comprehension, and runtime monitoring. Given a set of positive and negative examples, specified as traces, \emph{LTL learning} is the problem of synthesizing a specification, in \emph{linear temporal logic (LTL)}, that evaluates to true over the positive traces and false over the negative ones. In this paper, we propose a new type of LTL learning problem called \emph{constrained LTL learning}, where the user, in addition to positive and negative examples, is given an option to specify one or more \emph{constraints} over the properties of the LTL formula to be learned. We demonstrate that the ability to specify these additional constraints significantly increases the range of applications for LTL learning, and also allows efficient generation of LTL formulas that satisfy certain desirable properties (such as minimality). We propose an approach for solving the constrained LTL learning problem through an encoding in a first-order relational logic and reduction to an instance of the \emph{maximal satisfiability (MaxSAT)} problem. An experimental evaluation demonstrates that ATLAS, an implementation of our proposed approach, is able to solve new types of learning problems while performing better than or competitively with the state-of-the-art tools in LTL learning.
Tags: "Formal methods", "Requirements"Nikhil Parasaram, Huijie Yan, Boyu Yang, Zineb Flahy, Abriele Qudsi, Damian Ziaber, Earl T. Barr, Sergey Mechtaev, "The Fact Selection Problem in LLM-Based Program Repair"
Abstract: Recent research has shown that incorporating bug- related facts, such as stack traces and GitHub issues, into prompts enhances the bug-fixing capabilities of large language models (LLMs). Considering the ever-increasing context window of these models, a critical question arises: what and how many facts should be included in prompts to maximise the chance of correctly fixing bugs? To answer this question, we conducted a large-scale study, employing over 19K prompts featuring various combinations of seven diverse facts to rectify 314 bugs from open-source Python projects within the BugsInPy benchmark. Our findings revealed that each fact, ranging from simple syntactic details like code context to semantic information previously unexplored in the context of LLMs such as angelic values, is beneficial. Specifically, each fact aids in fixing some bugs that would remain unresolved or only be fixed with a low success rate without it. Importantly, we discovered that the effectiveness of program repair prompts is non-monotonic over the number of used facts; using too many facts leads to subpar outcomes. These insights led us to define the fact selection problem: determining the optimal set of facts for inclusion in a prompt to maximise LLM’s performance on a given task instance. We found that there is no one-size- fits-all set of facts for bug repair. Therefore, we developed a basic statistical model, named MANIPLE, which selects facts specific to a given bug to include in the prompt. This model significantly surpasses the performance of the best generic fact set. To underscore the significance of the fact selection problem, we benchmarked MANIPLE against the state-of-the-art zero-shot, non- conversational LLM-based bug repair methods. On our testing dataset of 157 bugs, MANIPLE repairs 88 bugs, 17% above the best configuration.
Tags: "AI for SE", "Analysis/Repair"Anna Mazhar, Saad Sher Alam, William Zheng, Yinfang Chen, Suman Nath, Tianyin Xu, "Fidelity of Cloud Emulators: The Imitation Game of Testing Cloud-based Software"
Abstract: Modern software projects have been increasingly using cloud services as important components. The cloud-based programming practice greatly simplifies software development by harvesting cloud benefits (e.g., high availability and elasticity). However, it imposes new challenges for software testing and analysis, due to opaqueness of cloud backends and monetary cost of invoking cloud services for continuous integration and deployment. As a result, cloud emulators are developed for offline development and testing, before online testing and deployment. This paper presents a systematic analysis of cloud emulators from the perspective of cloud-based software testing. Our goal is to (1) understand the discrepancies introduced by cloud emula- tion with regard to software quality assurance and deployment safety and (2) address inevitable gaps between emulated and real cloud services. The analysis results are concerning. Among 255 APIs of five cloud services from Azure and Amazon Web Services (AWS), we detected discrepant behavior between the emulated and real services in 94 (37%) of the APIs. These discrepancies lead to inconsistent testing results, threatening deployment safety, introducing false alarms, and creating debuggability issues. The root causes are diverse, including accidental implementation defects and essential emulation challenges. We discuss potential solutions and develop a practical mitigation technique to address discrepancies of cloud emulators for software testing.
Tags: "Testing and Quality", "Design/Architecture"Yuxin Zhang, Sen Chen, Xiaofei Xie, Zibo Liu, Lingling Fan, "Scenario-Driven and Context-Aware Automated Accessibility Testing for Android Apps"
Abstract: Mobile accessibility is increasingly important nowadays as it enables people with disabilities to use mobile applications to perform daily tasks. Ensuring mobile accessibility not only benefits those with disabilities but also enhances the user experience for all users, making applications more intuitive and user-friendly. Although numerous tools are available for testing and detecting accessibility issues in Android applications, a large number of false negatives and false positives persist due to limitations in the existing approaches, i.e., low coverage of UI scenarios and lack of consideration of runtime context. To address these problems, in this paper, we propose a scenario-driven exploration method for improving the coverage of UI scenarios, thereby detecting accessibility issues within the application, and ultimately reducing false negatives. Furthermore, to reduce false positives caused by not considering the runtime context, we propose a context-aware detection method that provides a more fine-grained detection capability. Our experimental results reveal that A11yScan can detect 1.7X more issues surpassing current state-of-the-art approaches like Xbot (3,991 vs. 2,321), thereby reducing the false negative rate by 41.84\%. Additionally, it outperforms established UI exploration techniques such as SceneDroid (952 vs. 661 UI scenarios), while achieving comparable activity coverage to recent leading GUI testing tools like GPTDroid on the available dataset (73\% vs. 71\%). Meanwhile, with the context-aware detection method, A11yScan effectively reduces the false positive rate by 21\%, validated with a 90.56\% accuracy rate through a user study.
Tags: "Testing and Quality", "Mobile SW"Hyunjae Suh, Mahan Tafreshipour, Jiawei Li, Adithya Bhattiprolu, Iftekhar Ahmed, "An Empirical Study on Automatically Detecting AI-Generated Source Code: How Far Are We?"
Abstract: Artificial Intelligence (AI) techniques, especially Large Language Models (LLMs), have started gaining popularity among researchers and software developers for generating source code. However, LLMs have been shown to generate code with quality issues and also incurred copyright/licensing infringements. Therefore, detecting whether a piece of source code is written by humans or AI has become necessary. This study first presents an empirical analysis to investigate the effectiveness of the existing AI detection tools in detecting AI-generated code. The results show that they all perform poorly and lack sufficient generalizability to be practically deployed. Then, to improve the performance of AI-generated code detection, we propose a range of approaches, including fine-tuning the LLMs and machine learning-based classification with static code metrics or code embedding generated from Abstract Syntax Tree (AST). Our best model outperforms state-of-the-art AI-generated code detector (GPTSniffer) and achieves an F1 score of 82.55. We also conduct an ablation study on our best-performing model to investigate the impact of different source code features on its performance.
Tags: "AI for SE", "Analysis"Yuyang Rong, Zhanghan Yu, Zhenkai Weng, Stephen Neuendorffer, Hao Chen, "IRFuzzer: Specialized Fuzzing for LLVM Backend Code Generation"
Abstract: Modern compilers, such as LLVM, are complex. Due to their complexity, manual testing is unlikely to suffice, yet formal verification is difficult to scale. End-to-end fuzzing can be used, but it has difficulties in discovering LLVM backend problems for two reasons. First, frontend preprocessing and middle optimization shield the backend from seeing diverse inputs. Besides, edge coverages cannot provide an effective feedback as LLVM backend contains much reusable code. In this paper, we implement IRFuzzer to investigate the need of specialized fuzzing of the LLVM compiler backend. We focus on two approaches to improve the fuzzer: guaranteed input validity using constrained mutations to improve input diversity and new metrics to improve feedback quality. The mutator in IRFuzzer is capable of generating a wide range of LLVM IR inputs, including structured control flow, vector types, and function definitions. The system instruments coding patterns in the compiler to monitor the execution status of instruction selection. The instrumentation not only provides a new coverage feedback called matcher table coverage, but also provides an architecture specific guidance to the mutator. We show that IRFuzzer is more effective than existing fuzzers by fuzzing on 29 mature LLVM backend targets. In the process, we reported 78 confirmed new bugs in LLVM upstream, out of which 57 have been fixed, five have been back ported to LLVM 15, showing that specialized fuzzing provides useful and actionable insights to LLVM developers.
Tags: "Testing and Quality", "Analysis"Wuxia Jin, Jiaowei Shang, Jianguo Zheng, Mengjie Sun, Zhenyu Huang, Ming Fan, Ting Liu, "The Design Smells Breaking the Boundary between Android Variants and AOSP"
Abstract: Phone vendors customize their Android variants to enhance system functionalities based on the Android Open Source Project (AOSP). While independent development, Android variants have to periodically evolve with the upstream AOSP and merge code changes from AOSP. Vendors have invested great effort to maintain their variants and resolve merging conflicts. In this paper, we characterize the design smells with recurring patterns that break the design boundary between Android variants and AOSP. These smells are manifested as problematic dependencies across the boundary, hindering Android variants' maintainability and co-evolution with AOSP. We propose the DroidDS for automatically detecting design smells. We collect 22 Android variant versions and 22 corresponding AOSP versions, involving 4 open-source projects and 1 industrial project. Our results demonstrate that: files involved in design smells consume higher maintenance costs than other files; these infected files are not merely the files with large code size, increased complexity, and object-oriented smells; the infected files have been involved in more than half of code conflicts induced by re-applying AOSP's changes to Android variants; a substantial portion of design issues could be mitigable. Practitioners can utilize our DroidDS to pinpoint and prioritize design problems for Android variants. Refactoring these problems will help keep a healthy coupling between diverse variants and AOSP, potentially improving maintainability and reducing conflict risks.
Tags: "Mobile SW", "Design/Architecture", "Prog Comprehension/Reeng/Maint"Mingfei Cheng, Xiaofei Xie, Yuan Zhou, Junjie Wang, Guozhu Meng, Kairui Yang, "Decictor: Towards Evaluating the Robustness of Decision-Making in Autonomous Driving Systems"
Abstract: Autonomous Driving System (ADS) testing is crucial in ADS development, with the current primary focus being on safety. However, the evaluation of non-safety-critical performance, particularly the ADS’s ability to make optimal decisions and produce optimal paths for autonomous vehicles (AVs), is also vital to ensure the intelligence and reduce risks of AVs. Currently, there is little work dedicated to assessing the robustness of ADSs’ path-planning decisions (PPDs), i.e., whether an ADS can maintain the optimal PPD after an insignificant change in the environment. The key challenges include the lack of clear oracles for assessing PPD optimality and the difficulty in searching for scenarios that lead to non-optimal PPDs. To fill this gap, in this paper, we focus on evaluating the robustness of ADSs’ PPDs and propose the first method, Decictor, for generating non-optimal decision scenarios (NoDSs), where the ADS does not plan optimal paths for AVs. Decictor comprises three main components: Non-invasive Mutation, Consistency Check, and Feedback. To overcome the oracle challenge, Non-invasive Mutation is devised to implement conservative modifications, ensuring the preservation of the original optimal path in the mutated scenarios. Subsequently, the Consistency Check is applied to determine the presence of non-optimal PPDs by comparing the driving paths in the original and mutated scenarios. To deal with the challenge of large environment space, we design Feedback metrics that integrate spatial and temporal dimensions of the AV’s movement. These metrics are crucial for effectively steering the generation of NoDSs. Therefore, Decictor can generate NoDSs by generating new scenarios and then identifying NoDSs in the new scenarios. We evaluate Decictor on Baidu Apollo, an open-source and production-grade ADS. The experimental results validate the effectiveness of Decictor in detecting non-optimal PPDs of ADSs. It generates 63.9 NoDSs in total, while the best-performing baseline only detects 35.4 NoDSs.
Tags: "Autonomy", "SE for AI"Mingyue Yuan, Jieshan Chen, Zhenchang Xing, Aaron Quigley, Yuyu Luo, Tianqi Luo, Gelareh Mohammadi, Qinghua Lu, Liming Zhu, "DesignRepair: Dual-Stream Design Guideline-Aware Frontend Repair with Large Language Models"
Abstract: The rise of Large Language Models (LLMs) has streamlined frontend interface creation through tools like Vercel's V0, yet surfaced challenges in design quality (e.g., accessibility, and usability). Current solutions, often limited by their focus, generalisability, or data dependency, fall short in addressing these complexities comprehensively. Moreover, none of them examine the quality of LLM-generated UI design. In this work, we introduce DesignRepair, a novel dual-stream design guideline-aware system to examine and repair the UI design quality issues from both code aspect and rendered page aspect. We utilised the mature and popular Material Design as our knowledge base to guide this process. Specifically, we first constructed a comprehensive knowledge base encoding Google's Material Design principles into low-level component knowledge base and high-level system design knowledge base. After that, DesignRepair employs a LLM for the extraction of key components and utilizes the Playwright tool for precise page analysis, aligning these with the established knowledge bases. Finally, we integrate Retrieval-Augmented Generation with state-of-the-art LLMs like GPT-4 to holistically refine and repair frontend code through a strategic divide and conquer approach. Our extensive evaluations validated the efficacy and utility of our approach, demonstrating significant enhancements in adherence to design guidelines, accessibility, and user experience metrics.
Tags: "Analysis/Repair", "Testing and Quality", "Design/Architecture"Houda Naji, Marco Gutfleisch, Alena Naiakshina, "Relationship Status: “It’s complicated” Developer-Security Expert Dynamics in Scrum"
Abstract: The high number of cyber threats poses significant challenges, with impactful software exploits ranging from data theft to ransomware deployment. Unfortunately, past research highlighted limited security expertise within development teams. Collaboration between developers and security experts, therefore, emerges as one of the few workable means to address this gap. In this paper, we explore the complex interplay between developers and security experts within Scrum, one of the most widely adopted frameworks which actively promotes collaboration, to shed light on their working relationship, challenges, and potential avenues for improvement. To this end, we conducted a qualitative interview study with 14 developers and 13 security experts. Our qualitative results reveal three communication patterns and five shared challenges between the groups affecting the develop-security expert collaboration. Top challenges include consistent interaction difficulties and the lack of workable means to balance business and security needs. As a result, we found that three core Scrum values (openness, respect, courage) are missing from this relationship. Based on our results, we propose recommendations for fostering a healthy collaboration between developers and security experts, both within and beyond Scrum.
Tags: "Human/Social"Bianca Trinkenreich, Zixuan Feng, Rudrajit Choudhuri, Marco Gerosa, Anita Sarma, Igor Steinmacher, "Investigating the Impact of Interpersonal Challenges on Feeling Welcome in OSS"
Abstract: The sustainability of open source software (OSS) projects hinges on contributor retention. Interpersonal challenges can inhibit a feeling of welcomeness among contributors, particularly from underrepresented groups, which impacts their decision to continue with the project. How much this impact is, varies among individuals, underlining the importance of a thorough understanding of their effects. Here, we investigate the effects of interpersonal challenges on the sense of welcomeness among diverse populations within OSS, through the diversity lenses of gender, race, and (dis)ability. We analyzed the large-scale Linux Foundation Diversity and Inclusion survey (n = 706) to model a theoretical framework linking interpersonal challenges with the sense of welcomeness through Structural Equation Models Partial Least Squares (PLS-SEM). We then examine the model to identify the impact of these challenges on different demographics through Multi-Group Analysis (MGA). Finally, we conducted a regression analysis to investigate how differently people from different demographics experience different types of interpersonal challenges. Our findings confirm the negative association between interpersonal challenges and the feeling of welcomeness in OSS, with this relationship being more pronounced among gender minorities and people with disabilities. We found that different challenges have unique impacts on how people feel welcomed, with variations across gender, race, and disability groups. We also provide evidence that people from gender minorities and with disabilities are more likely to experience interpersonal challenges than their counterparts, especially when we analyze stalking, sexual harassment, and gender. Our insights benefit OSS communities, informing potential strategies to improve the landscape of interpersonal relationships, ultimately fostering more inclusive and welcoming communities.
Tags: "Human/Social", "Open Source"Feng Lin, Dong Jae Kim, Tse-Hsun (Peter) Chen, "SOEN-101: Code Generation by Emulating Software Process Models Using Large Language Model Agents"
Abstract: Software process models are essential to facilitate collaboration and communication among software teams to solve complex development tasks. Inspired by these software engineering practices, we present FlowGen – a code generation framework that emulates software process models based on multiple Large Language Model (LLM) agents. We emulate three process models, FlowGen$_{Waterfall}$, FlowGen$_{TDD}$, and FlowGen$_{Scrum}$, by assigning LLM agents to embody roles (i.e., requirement engineer, architect, developer, tester, and scrum master) that correspond to everyday development activities and organize their communication patterns. The agents work collaboratively using chain-of-thought and prompt composition with continuous self-refinement to improve the code quality. We use GPT-3.5 as our underlying LLM and several baselines (RawGPT, CodeT, Reflexion) to evaluate code generation on four benchmarks: HumanEval, HumanEval-ET, MBPP, and MBPP-ET. Our findings show that FlowGen$_{Scrum}$ excels compared to other process models, achieving a Pass@1 of 75.2, 65.5, 82.5, and 56.7 in HumanEval, HumanEval-ET, MBPP, and MBPP-ET, respectively (an average of 15% improvement over RawGPT). Compared with other state-of-the-art techniques, FlowGen$_{Scrum}$ achieves a higher Pass@1 in MBPP compared to CodeT, with both outperforming Reflexion. Notably, integrating CodeT into FlowGen$_{Scrum}$ resulted in statistically significant improvements, achieving the highest Pass@1 scores. Our analysis also reveals that the development activities impacted code smell and exception handling differently, with design and code review adding more exception handling and reducing code smells. Finally, FlowGen models maintain stable Pass@1 scores across GPT-3.5 versions and temperature values, highlighting the effectiveness of software process models in enhancing the quality and stability of LLM-generated code.
Tags: "AI for SE", "SE for AI", "MSR"Linfeng Liang, Yao Deng, Kye Morton, Valtteri Kallinen, Alice James, Avishkar Seth, Endrowednes Kuantama, Subhas Mukhopadhyay, Richard Han, Xi Zheng, "GARL: Genetic Algorithm-Augmented Reinforcement Learning to Detect Violations in Marker-Based Autonomous Landing Systems"
Abstract: Automated Uncrewed Aerial Vehicle (UAV) landing is crucial for autonomous UAV services such as monitoring, surveying, and package delivery. It involves detecting landing targets, perceiving obstacles, planning collision-free paths, and controlling UAV movements for safe landing. Failures can lead to significant losses, necessitating rigorous simulation-based testing for safety. Traditional offline testing methods, limited to static environments and predefined trajectories, may miss violation cases caused by dynamic objects like people and animals. Conversely, online testing methods require extensive training time, which is impractical with limited budgets. To address these issues, we introduce GARL, a framework combining a genetic algorithm (GA) and reinforcement learning (RL) for efficient generation of diverse and real landing system failures within a practical budget. GARL employs GA for exploring various environment setups offline, reducing the complexity of RL's online testing in simulating challenging landing scenarios. Our approach outperforms existing methods by up to 18.35% in violation rate and 58% in diversity metric. We validate most discovered violation types with real-world UAV tests, pioneering the integration of offline and online testing strategies for autonomous systems. This method opens new research directions for online testing, with our code available at https://anonymous.4open.science/r/drone_testing-5CF0/.
Tags: "Testing and Quality", "Autonomy", "AI for SE"Siyuan Li, Yuekang Li, Zuxin Chen, Chaopeng Dong, Yongpan Wang, Hong Li, Yongle Chen, Hongsong Zhu, "TransferFuzz: Fuzzing with Historical Trace for Verifying Propagated Vulnerability Code"
Abstract: Code reuse in software development frequently facilitates the spread of vulnerabilities, making the scope of affected software in CVE reports imprecise. Traditional methods primarily focus on identifying reused vulnerability code within target software, yet they cannot verify if these vulnerabilities can be triggered in new software contexts. This limitation often results in false positives. In this paper, we introduce TransferFuzz, a novel vulnerability verification framework, to verify whether vulnerabilities propagated through code reuse can be triggered in new software. Innovatively, we collected runtime information during the execution or fuzzing of the basic binary (the vulnerable binary detailed in CVE reports). This process allowed us to extract historical traces, which proved instrumental in guiding the fuzzing process for the target binary (the new binary that reused the vulnerable function). TransferFuzz introduces a unique Key Bytes Guided Mutation strategy and a Nested Simulated Annealing algorithm, which transfers these historical traces to implement trace-guided fuzzing on the target binary, facilitating the accurate and efficient verification of the propagated vulnerability. Our evaluation, conducted on widely recognized datasets, shows that TransferFuzz can quickly validate vulnerabilities previously unverifiable with existing techniques. Its verification speed is 2.5 to 26.2 times faster than existing methods. Moreover, TransferFuzz has proven its effectiveness by expanding the impacted software scope for 15 vulnerabilities listed in CVE reports, increasing the number of affected binaries from 15 to 53. The datasets and source code used in this article are available at https://anonymous.4open.science/r/TransferFuzz-E9B3.
Tags: "Testing and Quality", "Security"Chong Wang, Jianan Liu, Xin Peng, Yang Liu, Yiling Lou, "Boosting Static Resource Leak Detection via LLM-based Resource-Oriented Intention Inference"
Abstract: Resource leaks, caused by resources not being released after acquisition, often lead to performance issues and system crashes. Existing static detection techniques rely on mechanical matching of predefined resource acquisition/release APIs and null-checking conditions to find unreleased resources, suffering from both (1) false negatives caused by the incompleteness of predefined resource acquisition/release APIs and (2) false positives caused by the incompleteness of resource reachability validation identification. To overcome these challenges, we propose InferROI, a novel approach that leverages the exceptional code comprehension capability of large language models (LLMs) to directly infer resource-oriented intentions (acquisition, release, and reachability validation) in code. InferROI first prompts the LLM to infer involved intentions for a given code snippet, and then incorporates a two-stage static analysis approach to check control-flow paths for resource leak detection based on the inferred intentions. We evaluate the effectiveness of InferROI in both resource-oriented intention inference and resource leak detection. Experimental results on the DroidLeaks and JLeaks datasets demonstrate InferROI achieves promising bug detection rate (59.3% and 62.5%) and false alarm rate (18.6% and 19.5%). Compared to three industrial static detectors, InferROI detects 14~45 and 149~485 more bugs in DroidLeaks and JLeaks, respectively. When applied to real-world open-source projects, InferROI identifies 29 unknown resource leak bugs (verified by authors), with 7 of them being confirmed by developers. In addition, the results of an ablation study underscores the importance of combining LLM-based inference with static analysis. Finally, manual annotation indicated that InferROI achieved a precision of 74.6% and a recall of 81.8% in intention inference, covering more than 60% resource types involved in the datasets.
Tags: "Security", "AI for SE"Wonhoi Kim, Hocheol Nam, Muoi Tran, Amin Jalilov, Zhenkai Liang, Sang Kil Cha, Min Suk Kang, "Fork State-Aware Differential Fuzzing for Blockchain Consensus Implementations"
Abstract: Blockchain networks allow multiple client implementations of the same consensus algorithm by different developers to coexist in the same system. Ensuring correct implementations among these heterogeneous clients is crucial, as even slight semantic discrepancies in their implementations can lead to safety failures. While existing fuzzing frameworks have discovered implementation flaws in blockchain, they suffer from several challenges in testing them with sequences of conflicting blocks, called forks. Existing tools fail to adequately assess the fork-handling processes in blockchain implementations when relying on traditional code coverage feedback, which lacks the granularity needed to navigate the diverse and complex fork-handling scenarios. This paper introduces Forky, a fork state-aware differential fuzzing framework designed to detect implementation discrepancies within the critical fork-handling process with its novel fork-aware mutation and fork-diversifying feedback mechanisms. We test Forky on the two most influential blockchain projects: Bitcoin and Ethereum, which are the representatives of the two major blockchain consensus algorithm families, Proofof-Work (PoW) and Proof-of-Stake (PoS) consensus algorithms.
Tags: "Testing and Quality", "Blockchain"Yueke Zhang, Anda Liang, Xiaohan Wang, Pamela J. Wisniewski, Fengwei Zhang, Kevin Leach, Yu Huang, "Who’s Pushing the Code: An Exploration of GitHub Impersonation"
Abstract: GitHub is one of the largest open-source software (OSS) communities for software development and collaboration. Impersonation in the OSS communities refers to the malicious act of assuming another user's identity, often aiming to gain unauthorized access to code, manipulate project outcomes, or spread misinformation. With several recent real-world attacks resulting from impersonation, this issue is becoming and increasingly problematic concern within the OSS community. We present the first exploration of the impact of impersonation in GitHub. Specifically, we conduct structured interviews with 17 real-world OSS contributors about their perception of impersonation and corresponding mitigations. Our study reveals that, in general, GitHub users lack awareness of impersonation and underestimate the severity of its implications. After witnessing the impersonation, they show significant concern for the OSS community. Meanwhile, we also demonstrate that the current best practices (i.e., commit signing) that might mitigate impersonation must be improved to increase widespread acceptance and adoption. We also present and discuss participant perceptions of potential ways to mitigate GitHub impersonation. We collect a dataset comprising 12.5 million commits to investigate the current status of impersonation. Interestingly, we also find out that impersonation is not currently detected. We observe that existing commit histories treat impersonation behavior identically to pull request events, resulting in a lack of detection methods for impersonation.
Tags: "Human/Social", "Open Source"Zhenpeng Chen, Xinyue Li, Jie M. Zhang, Federica Sarro, Yang Liu, "Diversity Drives Fairness: Ensemble of Higher Order Mutants for Intersectional Fairness of Machine Learning Software"
Abstract: Intersectional fairness is a critical requirement for Machine Learning (ML) software, demanding fairness across subgroups defined by multiple protected attributes. This paper introduces FairHOME, a novel ensemble approach using higher order mutation of inputs to enhance intersectional fairness of ML software during the inference phase. Inspired by social science theories highlighting the benefits of diversity, FairHOME generates mutants representing diverse subgroups for each input instance, thus broadening the array of perspectives to foster a fairer decision-making process. Unlike conventional ensemble methods that combine predictions made by different models, FairHOME combines predictions for the original input and its mutants, all generated by the same ML model, to reach a final decision. Notably, FairHOME is even applicable to deployed ML software as it bypasses the need for training new models. We extensively evaluate FairHOME against six state-of-the-art fairness improvement methods across 24 decision-making tasks using widely adopted metrics. FairHOME consistently outperforms existing methods across all metrics considered. On average, it enhances intersectional fairness by 47.3%, surpassing the currently best-performing method by 10.1 percentage points.
Tags: "SE for AI", "Security"Junjielong Xu, Ying Fu, Shin Hwei Tan, Pinjia He, "Aligning the Objective of LLM-based Program Repair"
Abstract: Large language models (LLMs) have achieved decent results on automated program repair (APR). However, the next token prediction training objective of decoder-only LLMs (e.g., GPT-4) is misaligned with the masked span prediction objective of current infilling-style methods, which impedes LLMs from fully leveraging pre-trained knowledge for program repair. In addition, while some LLMs can locate and repair bugs in certain functions using the related artifacts (e.g., test cases), existing methods still depend on statement-level fault localization methods to provide a list of buggy hunks for repair. This restriction hinders LLMs from exploring potential patches beyond the given locations. In this paper, we investigate a new approach to adapt LLMs to program repair. Our core insight is that LLM’s APR capability can be greatly improved by simply aligning the output to their training objective and allowing them to refine the whole program without first identifying faulty statements. Based on this insight, we designed D4C, a straightforward prompting framework for APR. D4C can repair 180 bugs correctly in Defects4J, with each patch being sampled only 10 times. This surpasses the SOTA APR methods with perfect fault localization by 10% and reduces the patch sampling number by 90%. Our findings reveal that (1) objective alignment is crucial for fully exploiting LLM’s pre-trained capability, and (2) replacing the traditional localize-buggy-hunks-then-repair workflow with direct debugging is more effective for LLM-based APR methods. Thus, we believe this paper introduces a new mindset for harnessing LLMs in APR.
Tags: "AI for SE", "Analysis/Repair"Aidan Z.H. Yang, Sophia Kolak, Vincent Hellendoorn, Ruben Martins, Claire Le Goues, "Revisiting Unnaturalness for Automated Program Repair in the Era of Large Language Models"
Abstract: Language models have improved by orders of magnitude with the recent emergence of Transformer-based Large Language Models (LLMs). LLMs have demonstrated their ability to generate "natural" code that is highly similar to code written by professional developers. One intermediate value an LLM can emit is entropy, which measures the naturalness of a token of code. We hypothesize that entropy can be used to improve the performance of Automated Program Repair (APR) tasks. While much progress has been made in Automated Program Repair (APR), fault localization techniques suffer from a lack of diversity in ranking scores, patch generation tools tend to be inefficient as all tests need to run before determining if a patch is likely to be correct, and patch ranking often suffers from the test-suite over-fitting problem. However, using an LLM directly for APR introduces concerns for training data leakage. In this work, we introduce a novel way of using the entropy of LLMs in combination with prior APR tools to improve all stages of APR. By using only the prefix and suffix context of a line or block of code to describe naturalness, we can use LLMs to localize faults and rank patches all while eliminating the dependency for test-suites. We show that entropy is highly complementary with prior fault localization tools. Our proposed method achieves a 108% top-1 score improvement over SBFL. When using entropy for patch ranking and classification, our proposed method can rank correct patches more effectively than state-of-the-art machine learning tools with an 49% improvement in top-1. Our work suggests that LLMs can be an effective addition to compliment prior APR tasks while minimizing both the test-suite over-fitting problem and the LLM data leakage problem.
Tags: "Analysis/Repair", "AI for SE"Boqi Chen, José Antonio Hernández López, Gunter Mussbacher, Dániel Varró, "The Power of Types: Exploring the Impact of Type Checking on Neural Bug Detection in Dynamically Typed Languages"
Abstract: [Motivation] Automated bug detection in dynamically typed languages such as Python is essential for maintaining code quality. The lack of mandatory type annotations in such languages can lead to errors that are challenging to identify early with traditional static analysis tools. Recent progress in deep neural networks has led to increased use of neural bug detectors. In statically typed languages, a type checker is integrated into the compiler and thus taken into consideration when the neural bug detector is designed for these languages. [Problem] However, prior studies overlook this aspect during the training and testing of neural bug detectors for dynamically typed languages. When an optional type checker is used, assessing existing neural bug detectors on bugs easily detectable by type checkers may impact their performance estimation. Moreover, including these bugs in the training set of neural bug detectors can shift their detection focus toward the wrong type of bugs. [Contribution] We explore the impact of type checking on various neural bug detectors for variable misuse bugs, a common type targeted by neural bug detectors. Existing synthetic and real-world datasets are type-checked to evaluate the prevalence of type-related bugs. Then, we investigate how type-related bugs influence the training and testing of the neural bug detectors. [Findings] Our findings indicate that existing bug detection datasets contain a significant proportion of type-related bugs. Building on this insight, we discover integrating the neural bug detector with a type checker can be beneficial, especially when the code is annotated with types. Further investigation reveals neural bug detectors perform better on type-related bugs than other bugs. Moreover, removing type-related bugs from the training data helps improve neural bug detectors’ ability to identify bugs beyond the scope of type checkers.
Tags: "AI for SE", "Testing and Quality"Verya Monjezi, Ashutosh Trivedi, Vladik Kreinovich, Saeid Tizpaz-Niari, "Fairness Testing through Extreme Value Theory"
Abstract: Data-driven software is increasingly being used as a critical component of automated decision-support systems. Since this class of software learns its logic from historical data, it can encode or amplify discriminatory practices. Previous research on algorithmic fairness has focused on improving “average-case” fairness. On the other hand, fairness at the extreme ends of the spectrum, which often signifies lasting and impactful shifts in societal attitudes, has received significantly less emphasis. Leveraging the statistics of extreme value theory (EVT), we propose a novel fairness criterion called extreme counterfactual discrimination (ECD). This criterion estimates the worst-case amounts of disadvantage in outcomes for individuals solely based on their memberships in a protected group. Utilizing tools from search-based software engineering and generative AI, we present a randomized algorithm that samples a statistically significant set of points from the tail of ML outcome distributions even if the input dataset lacks a sufficient number of relevant samples. We conducted several experiments on four ML models (deep neural networks, logistic regression, and random forests) over 10 socially relevant tasks from the literature on algorithmic fairness. First, we evaluate the generative AI methods and find that they generate sufficient samples to infer valid EVT distribution in 95% of cases. Remarkably, we found that the prevalent bias mitigators reduce the average-case discrimination but increase the worst-case discrimination significantly in 35% of cases. We also observed that even the tail-aware mitigation algorithm MiniMax-Fairness—increased the worst-case discrimination in 30% of cases. We propose a novel ECD-based mitigator that improves fairness in the tail in 90% of cases with no degradation of the average-case discrimination. We hope that the EVT framework serves as a robust tool for evaluating fairness in both average-case and worst-case discrimination.
Tags: "Security", "SE for AI"Jiageng Li, Zhen Dong, Chong Wang, Haozhen You, Cen Zhang, Yang Liu, Xin Peng, "LLM Based Input Space Partitioning Testing for Library APIs"
Abstract: Automated library APIs testing is difficult as it requires exploring a vast space of parameter inputs that may involve objects with complex data types. Existing search based approaches, with limited knowledge of relations between object states and program branches, often suffer from the low efficiency issue, i.e., tending to generate invalid inputs. Symbolic execution based approaches can effectively identify such relations, but fail to scale to large programs. In this work, we present an LLM-based input space partitioning testing approach, LISP, for library APIs. The approach leverages LLMs to understand the code of a library API under test and perform input space partitioning based on its understanding and rich common knowledge. Specifically, we provide the signature and code of the API under test to LLMs, with the expectation of obtaining a text description of each input space partition of the API under test. Then, the generated text description is employed to guide the input generation process for each partition, ultimately resulting in test suites that systematically explore the program behavior of the API. We evaluate LISP on 10 popular open-source Java libraries (e.g., apache/commons-lang with 2.6k stars, guava with 48.8k stars on GitHub). Our experiment results show that LISP is effective in library API testing. It significantly outperforms state-of-the-art tool EvoSuite in terms of branch coverage. On average, LISP achieves 67.82% branch coverage, surpassing EvoSuite by 1.21 times. In total, LISP triggers 404 exceptions or errors in the experiments, and discovers 13 previously unknown vulnerabilities during evaluation, which have been assigned CVE IDs.
Tags: "Testing and Quality", "AI for SE"Qiaolin Qin, Heng Li, Ettore Merlo, Maxime Lamothe, "Automated, Unsupervised, and Auto-parameterized Inference of Data Patterns and Anomaly Detection"
Abstract: With the advent of data-centric and machine learning (ML) systems, data quality is playing an increasingly critical role for ensuring the overall quality of software systems. Alas, data preparation, an essential step towards high data quality, is known to be a highly effort-intensive process. Although prior studies have dealt with one of the most impacting issues, data pattern violations, we observe that these studies usually require data-specific configurations (i.e., parameterized) or a certain set of fully curated data as learning examples (i.e., supervised). Both approaches require domain knowledge and depend on users' deep understanding of their data, and are often effort-intensive. In this paper, we introduce RIOLU: Regex Inferencer autO-parameterized Learning with Uncleaned data. RIOLU is fully automated, is automatically parameterized, and does not need labeled samples. We observe that RIOLU can generate precise patterns from datasets in various domains, with a high F1 score of 97.2%, exceeding the state-of-the-art baseline. In addition, according to our experiment on five datasets with anomalies, RIOLU can automatically estimate a data column's error rate, draw normal patterns, and predict anomalies from unlabeled data with higher performance (up to 800.4% improvement in terms of F1) than the state-of-the-art baseline. Furthermore, RIOLU can even outperform ChatGPT in terms of both accuracy (12.3% higher F1) and efficiency (10% less inference time). With user involvement, a variation (a guided version) of RIOLU can further boost its precision (up to 37.4% improvement in terms of F1). Our evaluation in an industrial setting further demonstrates the practical benefits of RIOLU.
Tags: "MSR", "Security", "SE for AI"Bimpe Ayoola, Miikka Kuutilla, Rina Wehbe, Paul Ralph, "User Personas Improve Social Sustainability by Encouraging Software Developers to Deprioritize Antisocial Features"
Abstract: \textit{Background}: Sustainable software development involves creating software in a manner that meets present goals without undermining our ability to meet future goals. In a software engineering context, sustainability has at least four dimensions: ecological, economic, social, and technical. No interventions for improving social sustainability in software engineering have been tested in rigorous lab-based experiments, and little evidence-based guidance is available. \textit{Objective}: The purpose of this study is to evaluate the effectiveness of two interventions---stakeholder maps and persona models---for improving social sustainability by improving software feature prioritization. \textit{Method}: We conducted a randomized controlled factorial experiment with 79 undergraduate computer science students. Participants were randomly assigned to one of four groups and asked to prioritize a backlog of prosocial, neutral, and antisocial user stories for a shopping mall's digital screen display and facial recognition software. Participants received either persona models, a stakeholder map, both, or neither. We compared the differences in prioritization levels assigned to prosocial and antisocial user stories using Cumulative Link Mixed Model regression. \textit{Results}: Participants who received persona models gave significantly lower priorities to anti-social user stories but no significant difference was evident for pro-social user stories. The effects of the stakeholder map were not significant. The interaction effects were not significant. \textit{Conclusion}: Providing aspiring software professionals with well-crafted persona models causes them to de-prioritize anti-social software features. The impact of persona modelling on sustainable software development therefore warrants further study with more experience professionals. Moreover, the novel methodological strategy of assessing social sustainability behavior through backlog prioritization appears feasible in lab-based settings.
Tags: "Human/Social", "Requirements"Daniel Erhabor, Sreeharsha Udayashankar, Meiyappan Nagappan, Samer Al-Kiswany, "Measuring the Runtime Performance of C++ Code Written by Humans using GitHub Copilot"
Abstract: GitHub Copilot is an artificially intelligent program- ming assistant used by many developers. While a few studies have evaluated the security risks of using Copilot, there has not been any study to show if it aids developers in producing code with better runtime performance. We evaluate the runtime performance of C++ code produced when developers use GitHub Copilot versus when they do not. To this end, we conducted a user study with 32 participants where each participant solved two C++ programming problems, one with Copilot and the other without it and measured the runtime performance of the participants’ solutions on our test data. Our results suggest that using Copilot may produce C++ code with a statistically significantly slower runtime performance.
Tags: "AI for SE", "Human/Social"Yulong Ye, Tao Chen, Miqing Li, "Distilled Lifelong Self-Adaptation for Configurable Systems"
Abstract: Modern configurable systems provide tremendous opportunities for engineering future intelligent software systems. A key difficulty thereof is how to effectively self-adapt the configuration of a running system such that its performance (e.g., runtime and throughput) can be optimized under time-varying workloads. This unfortunately remains unaddressed in existing approaches as they either overlook the available past knowledge or rely on static exploitation of past knowledge without reasoning the usefulness of information when planning for self-adaptation. In this paper, we tackle this challenging problem by proposing DLiSA, a framework that self-adapts configurable systems. DLiSA comes with two properties: firstly, it supports lifelong planning, and thereby the planning process runs continuously throughout the lifetime of the system, allowing dynamic exploitation of the accumulated knowledge for rapid adaptation. Secondly, the planning for a newly emerged workload is boosted via distilled knowledge seeding, in which the knowledge is dynamically purified such that only useful past configurations are seeded when necessary, mitigating misleading information. Extensive experiments suggest that the proposed DLiSA significantly outperforms state-of-the-art approaches, demonstrating a performance improvement of up to 255\% and a resource acceleration of up to 2.22$\times$ on generating promising adaptation configurations. All data and sources can be found at our anonymous site: https://github.com/Anonymous-DLiSA/DLiSA.
Tags: "Design/Architecture", "AI for SE"Zachary Karas, Benjamin Gold, Violet Zhou, Noah Reardon, Thad Polk, Catie Chang, Yu Huang, "Studying Programmers Without Programming: Investigating Expertise Using Resting State fMRI"
Abstract: Expert programmers are more effective at coding activities, but the reasons for this remain elusive. Accordingly, recent research has used neuroimaging such as fMRI to analyze how expert programmers might think as they perform coding activities. Those experiments have all involved specific programming tasks (i.e., comprehension), but have been unable to detect systematic differences based on coding experience. By using tasks, however, those studies may limit the number and type of brain networks involved. In Cognitive Neuroscience, researchers commonly analyze resting-state data, in which participants’ brain activity is recorded as they lay idle in the scanner. The brain’s functional organization is plastic, and can change with experience. These changes can be measured at rest, making this a suitable data type for studying how programming activities affect neural organization over time. In this paper, we analyzed the resting state scans from 150 participants, 96 of whom were programmers. We found increased connectivity in programmers between brain regions involved in language, math, and the temporal attention. Non-programmers demonstrated more connectivity with regions involved in social and emotional cognition. We found that as years of programming experience increases, connectivity decreases between two regions associated with visual processing during reading and articulation, respectively.
Tags: "Prog Comprehension/Reeng/Maint", "Human/Social"Setu Kumar Basak, K. Virgil English, Ken Ogura, Vitesh Kambara, Bradley Reaves, Laurie Williams, "AssetHarvester: A Static Analysis Tool for Detecting Secret-Asset Pairs in Software Artifacts"
Abstract: GitGuardian monitored secrets exposure in public GitHub repositories and reported that developers leaked over 12 million secrets (database and other credentials) in 2023, indicating a 113\% surge from 2021. Despite the availability of secret detection tools, developers ignore the tools' reported warnings because of false positives (25\%-99\%). However, each secret protects assets of different values accessible through asset identifiers (a DNS name and a public or private IP address). The asset information for a secret can aid developers in filtering false positives and prioritizing secret removal from the source code. However, existing secret detection tools do not provide the asset information, thus presenting difficulty to developers in filtering secrets only by looking at the secret value or finding the assets manually for each reported secret. \textit{The goal of our study is to aid software practitioners in prioritizing secrets removal by providing the assets information protected by the secrets through our novel static analysis tool.} We present AssetHarvester, a static analysis tool to detect secret-asset pairs in a repository. Since the location of the asset can be distant from where the secret is defined, we investigated secret-asset co-location patterns and found four patterns. To identify the secret-asset pairs of the four patterns, we utilized three approaches (pattern matching, data flow analysis, and fast-approximation heuristics). We curated a benchmark of 1,791 secret-asset pairs of four database types extracted from 188 public GitHub repositories to evaluate the performance of AssetHarvester. AssetHarvester demonstrates precision of (97\%), recall (90\%), and F1-score (94\%) in detecting secret-asset pairs. Our findings indicate that data flow analysis employed in AssetHarvester detects secret-asset pairs with 0\% false positives and aids in improving the recall of secret detection tools. Additionally, AssetHarvester shows 43\% increase in precision for database secret detection compared to existing detection tools through the detection of assets, thus reducing developer's alert fatigue.
Tags: "Analysis", "Security"Yining She, Sumon Biswas, Christian Kästner, Eunsuk Kang, "FairSense: Long-Term Fairness Analysis of ML-Enabled Systems"
Abstract: Algorithmic fairness of machine learning (ML) models has raised significant concern in the recent years. Many testing, verification, and bias mitigation techniques have been proposed to identify and reduce fairness issues in ML models. The existing methods are *model-centric* and designed to detect fairness issues under *static settings*. However, many ML-enabled systems operate in a dynamic environment where the predictive decisions made by the system *impact* the environment, which in turn affects future decision-making. Such a self-reinforcing *feedback loop* can cause fairness violations in the long term, even if the immediate outcomes are fair. In this paper, we propose a simulation-based framework called FairSense to detect and analyze long-term unfairness in ML-enabled systems. In particular, the framework targets systems with an ML model that is trained over tabular data using supervised learning. Given a fairness requirement, FairSense performs *Monte-Carlo simulation* to enumerate evolution traces for each system configuration. Then, FairSense performs *sensitivity analysis* on the space of system parameters to understand the impact of configuration decisions on long-term fairness of the system. We demonstrate FairSense's potential utility through three real-world case studies: Loan lending, opioids risk scoring, and predictive policing.
Tags: "SE for AI", "Security", "Requirements"Islem BOUZENIA, Premkumar Devanbu, Michael Pradel, "RepairAgent: An Autonomous, LLM-Based Agent for Program Repair"
Abstract: Automated program repair has emerged as a powerful technique to mitigate the impact of software bugs on system reliability and user experience. This paper introduces RepairAgent, the first work to address the program repair challenge through an autonomous agent based on a large language model (LLM). Unlike existing deep learning-based approaches, which prompt a model with a fixed prompt or in a fixed feedback loop, our work treats the LLM as an agent capable of autonomously planning and executing actions to fix bugs by invoking suitable tools. RepairAgent freely interleaves gathering information about the bug, gathering repair ingredients, and validating fixes, while deciding which tools to invoke based on the gathered information and feedback from previous fix attempts. Key contributions that enable RepairAgent include a set of tools that are useful for program repair, a dynamically updated prompt format that allows the LLM to interact with these tools, and a finite state machine that guides the agent in invoking the tools. Our evaluation on the popular Defects4J dataset demonstrates RepairAgent’s effectiveness in autonomously repairing 164 bugs, including 39 bugs not fixed by prior techniques. Interacting with the LLM imposes an average cost of 270,000 tokens per bug, which, under the current pricing of OpenAI’s GPT-3.5 model, translates to 14 cents per bug. To the best of our knowledge, this work is the first to present an autonomous, LLM-based agent for program repair, paving the way for future agent-based techniques in software engineering.
Tags: "Analysis/Repair", "AI for SE"Sarah Fakhoury, Markus Kuppe, Shuvendu Lahiri, Tahina Ramananandro, Nikhil Swamy, "3DGen: AI-Assisted Generation of Provably Correct Binary Format Parsers"
Abstract: Improper parsing of attacker-controlled input is a leading source of software security vulnerabilities, especially when programmers transcribe informal format descriptions into efficient parsing logic in low-level, memory unsafe languages. Several researchers have proposed formal specification languages for data formats from which efficient code can be extracted. However, distilling informal requirements into formal specifications is challenging and, despite their benefits, new, formal languages are hard for people to learn and use. In this work, we present 3DGen, a framework that makes use of AI agents to transform mixed informal input, including natural language documents and example inputs into format specifications in a language called 3D. To support humans in understanding and trusting the generated specifications, 3DGen uses symbolic methods to also synthesize test inputs that can be validated against an external oracle. Symbolic test generation also helps in distinguishing multiple plausible solutions. Through a process of repeated refinement, 3DGen produces a 3D specification that conforms to a test suite, and which yields safe, efficient, provably correct, parsing code in C. We have evaluated 3DGen on 20 Internet standard formats, demonstrating the potential for AI-agents to produce formally verified C code at a non-trivial scale. A key enabler is the use of a domain-specific language to limit AI outputs to a class for which automated, symbolic analysis is tractable.
Tags: "AI for SE", "Analysis/Repair"Paschal Amusuo, Kyle A. Robinson, Tanmay Singla, Huiyun Peng, Aravind Machiry, Santiago Torres-Arias, Laurent Simon, James C Davis, "$ZTD_{JAVA}$: Mitigating Software Supply Chain Vulnerabilities via Zero-Trust Dependencies"
Abstract: Third-party libraries like Log4j accelerate software application development but introduce substantial risk. Vulnerabilities in these libraries have led to Software Supply Chain (SSC) attacks that compromised resources within the host system. These attacks benefit from current application permissions approaches: third-party libraries are implicitly trusted in the application runtime. An application runtime designed with Zero-Trust Architecture (ZTA) principles — secure access to resources, continuous monitoring, and least-privilege enforcement — could mitigate SSC attacks, as it would give zero implicit trust to these libraries. However, no individual security defense incorporates these principles at a low runtime cost. This paper proposes Zero-Trust Dependencies (ZTD) to mitigate SSC vulnerabilities: we apply the NIST ZTA to software applications. First, we assess the expected effectiveness and configuration cost of Zero-Trust Dependencies using a study of third-party software libraries and their vulnerabilities. Then, we present a system design, $ZTD_{sys}$, that enables the application of Zero-Trust Dependencies to software applications and a prototype, $ZTD_{JAVA}$, for Java applications. Finally, with evaluations on recreated vulnerabilities and realistic applications, we show that $ZTD_{JAVA}$ can defend against prevalent vulnerability classes, introduces negligible cost, and is easy to configure and use.
Tags: "Security", "Analysis"Salma Begum Tamanna, Gias Uddin, Song Wang, Lan Xia, Longyu Zhang, "ChatGPT Inaccuracy Mitigation during Technical Report Understanding: Are We There Yet?"
Abstract: Hallucinations, the tendency to produce irrelevant/incorrect responses, are prevalent concerns in generative AI-based tools like ChatGPT. Although hallucinations in ChatGPT are studied for textual responses, it is unknown how ChatGPT hallucinates for technical texts that contain both textual and technical terms. We surveyed 47 software engineers and produced a benchmark of 412 Q&A pairs from the bug reports of two OSS projects. We find that a RAG-based ChatGPT (i.e., ChatGPT tuned with the benchmark issue reports) is 36.4% correct when producing answers to the questions, due to two reasons 1) limitations to understand complex technical contents in code snippets like stack traces, and 2) limitations to integrate contexts denoted in the technical terms and texts. We present CHIME (ChatGPT Inaccuracy Mitigation Engine) whose underlying principle is that if we can preprocess the technical reports better and guide the query validation process in ChatGPT, we can address the observed limitations. CHIME uses context-free grammar (CFG) to parse stack traces in technical reports. CHIME then verifies and fixes ChatGPT responses by applying metamorphic testing and query transformation. In our benchmark, CHIME shows 30.3% more correction over ChatGPT responses. In a user study, we find that the improved responses with CHIME are considered more useful than those generated from ChatGPT without CHIME
Tags: "SE for AI", "Human/Social"Xin Yin, Chao Ni, Xiaodan Xu, Xiaohu Yang, "What You See Is What You Get: Attention-based Self-guided Automatic Unit Test Generation"
Abstract: Software defects heavily affect software's functionalities and may cause huge losses. Recently, many AI-based approaches have been proposed to detect defects, which can be divided into two categories: software defect prediction and automatic unit test generation. While these approaches have made great progress in software defect detection, they still have several limitations in practical application, including the low confidence of prediction models and the inefficiency of unit testing models. To address these limitations, we propose a WYSIWYG (i.e., What You See Is What You Get) approach: \textbf{A}ttention-based Self-guided Automatic \textbf{U}nit Test \textbf{G}en\textbf{ER}ation (AUGER), which contains two stages: defect detection and error triggering. In the former stage, \toolname first detects the proneness of defects. Then, in the latter stage, it guides to generate unit tests for triggering such an error with the help of critical information obtained by the former stage. To evaluate the effectiveness of \toolname, we conduct a large-scale experiment by comparing with the state-of-the-art (SOTA) approaches on the widely used datasets (i.e., Bears, Bugs.jar, and Defects4J). AUGER makes great improvements by 4.7\% to 35.3\% and 17.7\% to 40.4\% in terms of F1-score and Precision in defect detection, and can trigger 23 to 84 more errors than SOTAs in unit test generation. Besides, we also conduct a further study to verify the generalization in practical usage by collecting a new dataset from real-world projects.
Tags: "AI for SE", "Testing and Quality"Yizhou Chen, Zeyu Sun, Guoqing Wang, Dan Hao, "Gpass: a Goal-adaptive Neural Theorem Prover based on Coq for Automated Formal Verification"
Abstract: Formal verification is a crucial means to assure software quality. Regrettably, the manual composition of verification scripts proves to be both laborious and time-consuming. In response, researchers have put forth automated theorem prover approaches; however, these approaches still grapple with several limitations. These limitations encompass insufficient handling of lengthy proof steps, difficulty in aligning the various components of a Coq program with the requirements and constraints of the proof goal, and inefficiencies. To surmount these limitations, we present Gpass, a goal-adaptive neural theorem prover based on deep learning technology. Firstly, we design a unique sequence encoder for Gpass that completely scans previous proof tactics through multiple sliding windows and provides information related to the current proof step. Secondly, Gpass incorporates a goal-adaptive feature integration module to align the reasoning process with the requirements of the proof goal. Finally, we devise a parameter selection method based on loss values and loss slopes to procure parameter sets with diverse distributions, thereby facilitating the exploration of various proof tactics. Experimental results demonstrate that Gpass attains better performance on the extensive CoqGym benchmark and proves 11.03\%-96.37\% more theorems than the prior work most closely related to ours. We find that the orthogonality between Gpass and CoqHammer proves their complementary capabilities, and together they prove a total of 3,774 theorems, which is state-of-the-art performance. In addition, we propose an efficiency optimisation approach that allows Gpass to achieve performance beyond Diva at one-sixth of the parameter sets.
Tags: "Formal methods", "AI for SE"Yanchen Lu, Hongyu Lin, Zehua He, Haitao Xu, Zhao Li, Shuai Hao, Liu Wang, Haoyu Wang, Kui Ren, "TacDroid: Detection of Illicit Apps through Hybrid Analysis of UI-based Transition Graphs"
Abstract: Illicit apps have emerged as a thriving underground industry, driven by their substantial profitability. These apps either offer users restricted services (e.g., porn and gambling) or engage in fraudulent activities like scams. Despite the widespread presence of illicit apps, scant attention has been directed towards this issue, with several existing detection methods predominantly relying on static analysis alone. However, given the burgeoning trend wherein an increasing number of mobile apps achieve their core functionality through dynamic resource loading, depending solely on static analysis proves inadequate. To address this challenge, in this paper, we introduce TacDroid, a novel approach that integrates dynamic analysis for dynamic content retrieval with static analysis to mitigate the limitations inherent in both methods, i.e., the low coverage of dynamic analysis and the low accuracy of static analysis. Specifically, TacDroid conducts both dynamic and static analyses on an Android app to construct dynamic and static User Interface Transition Graphs (UTGs), respectively. These two UTGs are then correlated to create an intermediate UTG. Subsequently, TacDroid embeds graph structure and utilizes an enhanced Graph Autoencoder (GAE) model to predict transitions between nodes. Through link prediction, TacDroid effectively eliminates false positive transition edges stemming from misjudgments in static analysis and supplements false negative transition edges overlooked in the intermediate UTG, thereby generating a comprehensive and accurate UTG. Finally, TacDroid determines the legitimacy of an app and identifies its category based on the app's UTG. Our evaluation results highlight the outstanding accuracy of TacDroid in detecting illicit apps. It significantly surpasses the state-of-the-art work, achieving an F1-score of 96.73%. This work represents a notable advancement in the identification and categorization of illicit apps.
Tags: "Mobile SW", "Analysis"Ravishka Rathnasuriya, Zijie Zhao, Wei Yang, "CodeImprove: Program Adaptation for Deep Code Models"
Abstract: Leveraging deep learning (DL)-based code analysis tools to solve software engineering tasks is becoming increasingly popular. Code models often suffer performance degradation due to various reasons (e.g., code data shifts). Retraining is often required to address these issues, but frequent model updates are costly in labeling and deployment. In this paper, we explore an alternative solution: Adapting the program inputs to the code models. This can be achieved by two steps: 1) input validation that focuses on identifying whether an input is an out-of-scope input program that are beyond a model’s handling capability, and 2) input adaptation that adapts out-of-scope inputs to become in-scope inputs. Validating program input is challenging, as current techniques focus on continuous inputs such as image data and fail with discrete inputs like code data, which have unique characteristics and are processed differently by deep learning models. Adapting out-of-scope programs is also challenging due to their vast search spaces. Therefore, in this paper, we propose CodeImprove, which distinguishes out-of-scope from normal inputs and converts such out-of-scope inputs back to in-scope inputs through program transformation. In particular, we propose a validity score metric to identify out-of-scope inputs and leverage genetics algorithms to apply semantic preserving program transformation to convert out-of-scope inputs to in-scope inputs. Our experimental results show CodeImprove can enhance upto 8.78% of accuracy, and 51.28% of relative improvements in three code models on two SE tasks. Additionally, our input validation is promising in detecting outof-scope inputs (AUC score of 0.924).
Tags: "SE for AI"Yanfu Yan, Viet Duong, Huajie Shao, Denys Poshyvanyk, "Towards More Trustworthy Deep Code Models by Enabling Out-of-Distribution Detection"
Abstract: Numerous machine learning (ML) models have been developed, including those for software engineering (SE) tasks, under the assumption that training and testing data come from the same distribution. However, train and test distributions often differ, as training datasets rarely encompass the entire distribution, while test distribution tends to shift over time. Hence, when confronted with out-of-distribution (OOD) instances that differ from the training data, a reliable and trustworthy SE ML model must be capable of detecting them to either abstain from making predictions, or potentially forward these OODs to appropriate models handling other categories or tasks. In this paper, we develop two types of SE-specific OOD detection models, unsupervised and weakly-supervised OOD detection for code. The unsupervised OOD detection approach is trained solely on in-distribution samples while the weakly-supervised approach utilizes a tiny number of OOD samples to further enhance the detection performance in various OOD scenarios. Extensive experimental results demonstrate that our proposed methods significantly outperform the baselines in detecting OOD samples from four different scenarios simultaneously and also positively impact a main code understanding task.
Tags: "SE for AI", "Security"Jiyang Zhang, Yu Liu, Pengyu Nie, Junyi Jessy Li, Milos Gligoric, "exLong: Generating Exceptional Behavior Tests with Large Language Models"
Abstract: Many popular programming languages, including C#, Java, and Python, support exceptions. Exceptions are thrown during program execution if an unwanted event happens, e.g., a method is invoked with an illegal argument value. Software developers write exceptional behavior tests (EBTs) to check that their code detects unwanted events and throws appropriate exceptions. Prior research studies have shown the importance of EBTs, but those studies also highlighted that developers put most of their efforts on “happy paths”, e.g., paths without unwanted events. To help developers fill the gap, we present the first framework, dubbed EXLÓNG, that automatically generates EBTs. EXLÓNG is a large language model instruction-tuned from CodeLlama and embeds reasoning about traces that lead to throw statements, conditional expressions that guard throw statements, and non-exceptional behavior tests that execute similar traces. We compare EX LÓNG with the state-of-the-art models for test generation (CAT-LM) and one of the strongest foundation models (GPT3.5), as well as with analysis-based tools for test generation (Randoop and EvoSuite). Our results show that EXLÓNG outperforms existing models and tools. Furthermore, we contributed several pull requests to open-source projects and 23 EBTs generated by EXLÓNG were already accepted.
Tags: "Testing and Quality", "AI for SE"Jake Zappin, Trevor Stalnaker, Oscar Chaparro, Denys Poshyvanyk, "When Quantum Meets Classical: Characterizing Hybrid Quantum-Classical Issues Discussed in Developer Forums"
Abstract: Recent advances in quantum computing have sparked excitement that this new computing paradigm could solve previously intractable problems. However, due to the faulty nature of current quantum hardware and quantum-intrinsic noise, the full potential of quantum computing is still years away. Hybrid quantum-classical computing has emerged as a possible compromise that achieves the best of both worlds. In this paper, we look at hybrid quantum-classical computing from a software engineering perspective and present the first empirical study focused on characterizing and evaluating recurrent issues faced by developers of hybrid quantum-classical applications. The study comprised a thorough analysis of 531 real-world issues faced by developers -- including software faults, hardware failures, quantum library errors, and developer mistakes -- documented in discussion threads from forums dedicated to quantum computing. By qualitatively analyzing such forum threads, we derive a comprehensive taxonomy of recurring issues in hybrid quantum-classical applications that can be used by both application and platform developers to improve the reliability of hybrid applications. The study considered how these recurring issues manifest and their causes, determining that hybrid applications are crash-dominant (74% of studied issues) and that errors were predominantly introduced by application developers (70% of issues). We conclude by identifying recurring obstacles for developers of hybrid applications and actionable recommendations to overcome them.
Tags: "Quantum", "MSR", "Human/Social"Xiang Cheng, Fan Sang, Yizhuo Zhai, Xiaokuan Zhang, Taesoo Kim, "RUG: Turbo LLM for Rust Unit Test Generation"
Abstract: Unit testing improves software quality by evaluating isolated sections of the program. This approach alleviates the need for comprehensive program-wide testing and confines the potential error scope within the software. However, unit test development is time-consuming, requiring developers to create appropriate test contexts and determine input values to cover different code regions. This problem is particularly pronounced in Rust due to its intricate type system, making traditional unit test generation tools ineffective in Rust projects. Recently, LLM have demonstrated their proficiency in understanding programming language and completing software engineering tasks. However, merely prompting LLM with a basic prompt like "generate unit test for the following source code" often results in code with compilation errors. In addition, LLM-generated unit tests often have limited test coverage. To bridge this gap and harness the capabilities of LLM, we design and implement RUG, an end-to-end solution to automatically generate the unit test for Rust projects. To help LLM's generated test pass Rust strict compilation checks, RUG designs a semantic-aware bottom-up approach to divide the context construction problem into dependent sub-problems. It solves these sub-problems sequentially using an LLM and merges them to a complete context. To increase test coverage, RUG integrates coverage-guided fuzzing with LLM to prepare fuzzing harnesses. Applying RUG on 17 real-world Rust programs (average 24,937 LoC), we show that RUG can achieve a high code coverage, up to 71.37%, closely comparable to human effort (73.18%). We submitted 113 unit tests generated by RUG covering the new code: 53 of them have been accepted, 17 were rejected, and 43 are pending for review.
Tags: "Testing and Quality", "SE for AI"Smit Soneshbhai Patel, Aashish Yadavally, Hridya Dhulipala, Tien N. Nguyen, "Planning a Large Language Model for Static Detection of Runtime Errors in Code Snippets"
Abstract: Large Language Models (LLMs) have been excellent in generating and reasoning about source code and the textual descriptions. They can recognize patterns, syntax, and semantics in code, making them effective in several software engineering tasks. However, they exhibit weaknesses in reasoning about the program execution. They primarily operate on static code representations, failing to capture the dynamic behavior and state changes that occur during program execution. In this paper, we advance the capabilities of LLMs in reasoning about program execution. We propose ORCA, a novel approach that instructs an LLM to autonomously formulate a plan to navigate through a control flow graph (CFG) for predictive execution of (in)complete code snippets. It acts as a predictive interpreter to ``execute'' the code. As a downstream task, we use ORCA to statically identify any runtime errors for online code snippets. Early detection of runtime errors and defects in these snippets is crucial to prevent costly fixes later in the development cycle after they were adapted into a codebase. In our novel technique, we guide the LLM to pause at the branching point, focusing on the state of the symbol tables for variables' values, thus minimizing error propagation in the LLM's computation. We also instruct the LLM not to stop at each step in its execution plan, resulting the use of only one prompt to the LLM, thus much cost-saving. Our empirical evaluation showed that ORCA is effective and improves over the state-of-the-art approaches in predicting the execution traces and in runtime error detection.
Tags: "AI for SE", "Analysis"Lezhi Ma, Shangqing Liu, Yi Li, Xiaofei Xie, Lei Bu, "SpecGen: Automated Generation of Formal Program Specifications via Large Language Models"
Abstract: In the software development process, formal program specifications play a crucial role in various stages, including requirement analysis, software testing, and verification. However, manually crafting formal program specifications is rather difficult, making the job time-consuming and labor-intensive. Moreover, it is even more challenging to write specifications that correctly and comprehensively describe the semantics of complex programs. To reduce the burden on software developers, automated specification generation methods have emerged. However, existing methods usually rely on predefined templates or grammar, making them struggle to accurately describe the behavior and functionality of complex real-world programs. To tackle this challenge, we introduce SpecGen, a novel technique for formal program specification generation based on Large Language Models (LLMs). Our key insight is to overcome the limitations of existing methods by leveraging the code comprehension capability of LLMs. The process of SpecGen consists of two phases. The first phase employs a conversational approach that guides the LLM to generate appropriate specifications for a given program, aiming to utilize the ability of LLM to generate high-quality specifications. The second phase, designed for where the LLM fails to generate correct specifications, applies four mutation operators to the model-generated specifications and selects verifiable specifications from the mutated ones through a novel heuristic selection strategy by assigning different weights of variants in an efficient manner. We evaluate SpecGen on two datasets, including the SV-COMP Java category benchmark and a manually constructed dataset containing 120 programs. Experimental results demonstrate that SpecGen succeeds in generating verifiable specifications for 279 out of 385 programs, outperforming the existing LLM-based approaches and conventional specification generation tools like Houdini and Daikon. Further investigations on the quality of generated specifications indicate that SpecGen can comprehensively articulate the behaviors of the input program.
Tags: "Formal methods", "AI for SE", "Requirements"Yifeng Di, Tianyi Zhang, "Enhancing Code Generation via Bidirectional Comment-Level Mutual Grounding"
Abstract: Large Language Models (LLMs) have demonstrated unprecedented capability in code generation. However, LLM-generated code is still plagued with a wide range of functional errors, especially for complex programming tasks that LLMs have not seen before. Recent studies have shown that developers often struggle with inspecting and fixing incorrect code generated by LLMs, diminishing their productivity and trust in LLM-based code generation. Inspired by the mutual grounding theory in communication, we propose an interactive approach that leverages code comments as a medium for developers and LLMs to establish a shared understanding. Our approach facilitates iterative grounding by interleaving code generation, inline comment generation, and contextualized user feedback through editable comments to align generated code with developer intent. We evaluated our approach on two popular benchmarks and demonstrated that our approach significantly improved multiple state-of-the-art LLMs, e.g., 16.9\% Pass@1 improvement for code-davinci-002 on HumanEval. Furthermore, we conducted a user study with 12 participants in comparison to two baselines: (1) interacting with GitHub Copilot, and (2) interacting with a multi-step code generation paradigm called Multi-Turn Program Synthesis. Participants completed the given programming tasks 16.7\% faster and with 10.5\% improvement in task success rate when using our approach. Both results show that interactively refining code comments enables the collaborative establishment of mutual grounding, leading to more accurate code generation and higher developer confidence.
Tags: "AI for SE", "Analysis", "Human/Social"Zixuan Tan, Jiayuan Zhou, Xing Hu, Shengyi Pan, Kui Liu, Xin Xia, "Similar but Patched Code Considered Harmful -- The Impact of Similar but Patched Code on Recurring Vulnerability Detection and How to Remove Them"
Abstract: Identifying recurring vulnerabilities is crucial for ensuring software security. Clone-based techniques, while widely used, often generate many false alarms due to the existence of similar but patched (SBP) code, which is similar to vulnerable code but is not vulnerable due to having been patched. Although the SBP code poses a great challenge to the effectiveness of existing approaches, it has not yet been well explored. In this paper, we propose a programming language agnostic framework, Fixed Vulnerability Filter (FVF), to identify and filter such SBP instances in vulnerability detection. Different from existing studies that leverage function signatures, our approach analyzes code change histories to precisely pinpoint SBPs and consequently reduce false alarms. Evaluation under practical scenarios confirms the effectiveness and precision of our approach. Remarkably, FVF identifies and filters 65.1% of false alarms from four vulnerability detection tools (i.e., ReDeBug, VUDDY, MVP, and an elementary hash-based approach) without yielding false positives. We further apply FVF to 1,081 real-world software projects and construct a real-world SBP dataset containing 6,827 SBP functions. Due to the SBP nature, the dataset can act as a strict benchmark to test the sensitivity of the vulnerability detection approach in distinguishing real vulnerabilities and SBPs. Using this dataset, we demonstrate the ineffectiveness of four state-of-the-art deep learning-based vulnerability detection approaches. Our dataset can help developers make a more realistic evaluation of vulnerability detection approaches and also paves the way for further exploration of real-world SBP scenarios.
Tags: "Prog Comprehension/Reeng/Maint", "Security"Wenchao Gu, Ensheng Shi, Yanlin Wang, Lun Du, Shi Han, Hongyu Zhang, Dongmei Zhang, Michael Lyu, "SECRET: Towards Scalable and Efficient Code Retrieval via Segmented Deep Hashing"
Abstract: Code retrieval, which retrieves code snippets based on users’ natural language descriptions, is widely used by developers and plays a pivotal role in real-world software development. The advent of deep learning has shifted the retrieval paradigm from lexical-based matching towards leveraging deep learning models to encode source code and queries into vector representations, facilitating code retrieval according to vector similarity. Despite the effectiveness of these models, managing large-scale code bases presents significant challenges. Previous research propose deep hashing-based methods, which generate hash codes for queries and code snippets and use Hamming distance for rapid recall of code candidates. However, this approach’s reliance on linear scanning of the entire code base limits its scalability. To further improve the efficiency of large scale code retrieval, we propose a novel approach SECRET (Scalable and Efficient Code Retrieval via SegmEnTed deep hashing). SECRET converts long hash codes calculated by existing deep hashing approaches into several short hash code segments through an iterative training strategy. After training, SECRET recalls code candidates by looking up the hash tables for each segment, the time complexity of recall can thus be greatly reduced. Extensive experimental results demonstrate that SECRET can drastically reduce the retrieval time by at least 95% while achieving comparable or even higher performance of existing deep hashing approaches. Besides, SECRET also exhibits superior performance and efficiency compared to the classical hash table-based approach known as LSH under the same number of hash tables.
Tags: "AI for SE"Shanto Rahman, Bala Naren Chanumolu, Suzzana Rafi, August Shi, Wing Lam, "Ranking Relevant Tests for Order-Dependent Flaky Tests"
Abstract: One major challenge of regression testing is the presence of flaky tests, i.e., tests that may pass in one run but fail in another run for the same version of code. One prominent category of flaky tests are order-dependent (OD) flaky tests, which are tests that can pass or fail depending on the test-order in which the tests are run. To help developers debug and fix OD tests, prior work has attempted to automatically find OD-relevant tests, i.e., tests that will determine whether an OD test passes or fails depending on whether the OD-relevant tests are run before or after the OD test in the test-order. Prior work finds OD-relevant tests by running tests before the OD test, without regards to the tests’ likelihood of being OD-relevant tests. We propose RankF to rank tests in order of likelihood of being OD-relevant tests, so a developer can find the first OD-relevant test more quickly, without running tests as often. We propose two ranking approaches, each requiring different information. Our first approach, RankFL, relies on training a large-language model that analyzes test code. Our second approach, RankFO, relies on the analysis of prior test-order execution information. We evaluate our approaches on 155 OD tests from 34 modules across 24 open-source projects. We compare RankF against prior work baselines in terms of the time for finding the first OD-relevant test for an OD test. RankF on average finds the first OD-relevant test faster than the best of the baselines, providing speedups of 1.9X, 1.7X, and 2.6X for the three different types of OD-relevant tests we evaluate.
Tags: "Testing and Quality", "Analysis"Hang Du, Vijay Krishna Palepu, James A. Jones, "Leveraging Propagated Infection to Crossfire Mutants"
Abstract: Mutation testing was proposed to identify weaknesses in test suites by repeatedly generating artificially faulty versions of the software (i.e., *mutants*) and determining if the test suite is sufficient to detect them (i.e., *kill* them). When the tests are insufficient, each surviving mutant provides an opportunity to improve the test suite. We conducted a study and found that many such surviving mutants (up to 84% for the subjects of our study) are detectable by simply augmenting existing tests with additional assertions, or *assertion amplification*. Moreover, we find that many of these mutants are detectable by multiple existing tests, giving developers options for how to detect them. To help with these challenges, we created a technique that performs memory-state analysis to identify candidate assertions that developers can use to detect the surviving mutants. Additionally, we build upon prior research that identifies "crossfiring" opportunities -- tests that coincidentally kill multiple mutants. To this end, we developed a theoretical model that describes the varying granularities that crossfiring can occur in the existing test suite, which provide opportunities and options for how to kill surviving mutants. We operationalize this model to an accompanying technique that optimizes the assertion amplification of the existing tests to crossfire multiple mutants with fewer added assertions, optionally concentrated within fewer tests. Our experiments show that we can kill *all* surviving mutants that are detectable with existing test data with only 1.1% of the identified assertion candidates, and increasing by a factor of 6x, on average, the number of killed mutants from amplified tests, over tests that do not crossfire.
Tags: "Testing and Quality", "Analysis"Yang Sun, Christopher M. Poskitt, Kun Wang, Jun Sun, "FixDrive: Automatically Repairing Autonomous Vehicle Driving Behaviour for $0.08 per Violation"
Abstract: Autonomous Vehicles (AVs) are advancing rapidly, with Level-4 AVs already operating in real-world conditions. Current AVs, however, still lag behind human drivers in adaptability and performance, often exhibiting overly conservative behaviours and occasionally violating traffic laws. Existing solutions, such as runtime enforcement, mitigate this by automatically repairing the AV's planned trajectory at runtime, but such approaches lack transparency and should be a measure of last resort. It would be preferable for AV repairs to generalise beyond specific incidents and to be interpretable for users. In this work, we propose FixDrive, a framework that analyses driving records from near-misses or law violations to generate AV driving strategy repairs that reduce the chance of such incidents occurring again. These repairs are captured in $\mu$Drive, a high-level domain-specific language for specifying driving behaviours according to event-based triggers. Implemented for the state-of-the-art autonomous driving system Apollo, FixDrive identifies and visualises critical moments from driving records, then uses a Multimodal Large Language Model (MLLM) with zero-shot learning to generate $\mu$Drive programs. We tested FixDrive on various benchmark scenarios, and found that the generated repairs improved the AV's performance with respect to following traffic laws, avoiding collisions, and successfully reaching destinations. Furthermore, the direct costs of repairing an AV---15 minutes of offline analysis and \$0.08 per violation---are reasonable in practice.
Tags: "SE for AI", "Autonomy"ziji wu, yu huang, peishan huang, shanghua wen, minglong li, ji wang, "EffBT: An Efficient Behavior Tree Reactive Synthesis and Execution Framework"
Abstract: Behavior Trees (BTs), originated from the control of Non-Player-Characters (NPCs), have been widely embraced in robotics and software engineering communities due to their modularity, reactivity, and other beneficial characteristics. It is highly desirable to synthesize BTs automatically. The consequent challenges are to ensure the generated BTs semantically correct, well-structured, and efficiently executable. To address these challenges, in this paper, we present a novel reactive synthesis method for BTs, namely EffBT, to generate correct and efficient controllers from formal specifications in GR(1) automatically. The idea is to construct BTs soundly from the intermediate strategies derived during the algorithm of GR(1) realizability check. Additionally, we introduce pruning strategies and use of \textit{Parallel} nodes to improve BT execution, while none of the priors explored before. We prove the soundness of the EffBT method, and experimental results across various scenarios and datasets demonstrate its effectiveness.
Tags: "Formal methods", "Analysis/Synthesis"Haofeng Li, Chenghang Shi, Jie Lu, Lian Li, Zixuan Zhao, "Module-Aware Context Sensitive Pointer Analysis"
Abstract: The Java Platform Module System (JPMS) has found widespread applications since introduced in Java 9. However, existing pointer analyses fail to leverage the semantics of JPMS. This paper presents a novel module-aware approach to improving the soundness and performance of pointer analysis. For soundness, we model the semantics of keywords provides and uses in JPMS to recover missing points-to relations. For performance, we design a module-aware context-sensitive analysis, which can propagate and apply critical contexts (by exploiting modularity) to balance precision and efficiency better. We have implemented our module-aware pointer analysis named MPA in Tai-e and conducted extensive experiments to compare it with standard object-sensitive approaches. The evaluation results demonstrate that MPA improves soundness and enhances existing context-sensitivity, striking a good balance between efficiency and precision. In terms of soundness, MPA can increase the number of reachable methods up to 90.9 × 90.9× ( lombok lombok) under the same analysis. Performance-wise, MPA is nearly as fast as context-insensitivity for most benchmarks, while its precision is superior to that of 1-object-sensitivity on average.
Tags: "Analysis"Xu Chen, Ningning Cui, Zhe Pan, Liwei Chen, Gang Shi, Dan Meng, "Critical Variable State-Aware Directed Greybox Fuzzing"
Abstract: Directed fuzzing is an effective software testing method that guides the fuzzing campaign towards user-defined target sites of interest, enabling the discovery of vulnerabilities relevant to those sites. However, even though the generated test cases cover the code near the target sites, complex vulnerabilities remain untriggered. By focusing only on test cases that cover new edges, the program states related to the targets are overlooked, resulting in insufficient testing of the targets and failure to capture complex vulnerabilities. In this paper, we propose a novel directed fuzzing solution named CSFuzz, which considers program states associated with the targets. First, CSFuzz extracts critical variables related to the target sites from the program using static analysis. Then, CSFuzz monitors the runtime values of these critical variables and infers the program states associated with the targets by adaptively partitioning the range of variable values. This allows CSFuzz to store interesting seeds in the state corpus that trigger new states near the target sites. Lastly, CSFuzz employs dynamic scheduling techniques to guide the fuzzing campaign in selecting different corpora and prioritizing seeds. This ensures more adequate testing of the target sites. We have implemented a prototype of CSFuzz and evaluated it on 2 benchmarks and widely fuzzed real-world software. Evaluation results show that CSFuzz outperforms state-of-the-art fuzzers in terms of vulnerability detection capability, achieving a maximum speedup of 219%. Moreover, CSFuzz has discovered 4 new bugs, including 2 CVE IDs assigned.
Tags: "Testing and Quality"Stefan Schwedt, Thomas Ströder, "From Bugs to Benefits: Improving User Stories by Leveraging Crowd Knowledge with CrUISE-AC"
Abstract: Costs for resolving software defects increase exponentially in late stages. Incomplete or ambiguous requirements are one of the biggest sources for defects, since stakeholders might not be able to communicate their needs or fail to share their domain specific knowledge. Combined with insufficient developer experience, teams are prone to constructing incorrect or incomplete features. To prevent this, requirements engineering has to explore knowledge sources beyond stakeholder interviews. Publicly accessible issue trackers for systems within the same application domain hold essential information on identified weaknesses, edge cases, and potential error sources, all documented by actual users. Our research aims at (1) identifying, and (2) leveraging such issues to improve an agile requirements artifact known as a “user story”. We present CrUISE-AC (Crowd and User Informed Suggestion Engine for Acceptance Criteria) as a fully automated method that investigates issues and generates non-trivial additional acceptance criteria for a given user story by employing NLP techniques and an ensemble of LLMs. CrUISE-AC was evaluated by five independent experts in two distinct business domains. Our findings suggest that issue trackers hold valuable information pertinent to requirements engineering. Our evaluation shows that 80–82% of the generated acceptance criteria add relevant requirements to the user stories. Limitations are the dependence on accessible input issues and the fact that we do not check generated criteria for being conflict-free or non-overlapping with criteria from other user stories.
Tags: "Requirements", "AI for SE"Lingfeng Zhang, Zhaohui Wang, Yueling Zhang, Min Zhang, Jiangtao Wang, "HIFI: Explaining and Mitigating Algorithmic Bias through the Lens of Game-Theoretic Interactions"
Abstract: Machine Learning (ML) algorithms are increasingly used in decision-making process across various social-critical domains, but they often somewhat inherit and amplify bias from their training data, leading to unfair and unethical outcomes. This issue highlights the urgent need for effective methods to detect, explain, and mitigate bias to ensure the fairness of ML systems. Previous studies are prone to analyze the root causes of algorithmic bias from a statistical perspective. However, to the best of our knowledge, none of them has discussed how sensitive information inducing the final discriminatory decision is encoded by ML models. In this work, we attempt to explain and mitigate algorithmic bias from a game-theoretic view. We mathematically decode an essential and common component of sensitive information implicitly defined by various fairness metrics with Harsanyi interactions, and on this basis, we propose an in-processing method HIFI for bias mitigation. We conduct an extensive evaluation of HIFI with 11 state-of-the-art methods, 5 real-world datasets, 4 fairness criteria, and 5 ML performance metrics, while also considering intersectional fairness for multiple protected attributes. The results show that HIFI surpasses state-of-the-art in-processing methods in terms of fairness improvement and fairness-performance trade-off, and also achieves notable effectiveness in reducing violations of individual fairness simultaneously.
Tags: "SE for AI", "Security"Xiaolei wang, Ruilin Li, Bin Zhang, Chao Feng, Chaojing Tang, "Practical Object-Level Sanitizer With Aggregated Memory Access and Custom Allocator"
Abstract: To mitigate potential memory safety vulnerabilities, recently there have been significant advances in sanitizers for pre-production bug detection. However, the limited inability to balance performance and detection accuracy still holds. The main reason is due to excessive reliance on shadow memory and a large number of memory access checks at runtime, incurring a significant performance overhead (if fine-grained memory safety detection is performed, the overhead will be even greater). In this paper, we propose a novel Object-Level Address Sanitizer OLASan to reduce performance overhead further while implementing accurate memory violations (including intra-object overflow) detection. Unlike previous sanitizers ignoring the correlation between memory access and objects, OLASan aggregates multiple memory accesses of same object at function level to perform on-demand targeted sanitization, thus avoiding examining most memory accesses at runtime. Specifically, OLASan characterizes various memory access patterns to identify those which can be aggregated, and implements memory safety checks with customized memory tagging. We implement OLASan atop the LLVM framework and evaluate it on SPEC CPU benchmarks. Evaluations show that OLASan outperforms the state-of-the-art methods with 51.18%, 25.20% and 6.52% less runtime overhead than ASan, ASan−− and GiantSan respectively. Moreover, aided by customized memory tagging, OLASan achieves zero false negatives for the first time when testing Juliet suites. Finally, we confirm that OLASan also offers comparable detection capabilities on real bugs.
Tags: "Security", "Testing and Quality"Dimitri Kokkonis, Michaël Marcozzi, Emilien Decoux, Stefano Zacchiroli, "ROSA: Finding Backdoors with Fuzzing"
Abstract: A code-level backdoor is a hidden access, programmed and concealed within the code of a program. For instance, hard-coded credentials planted in the code of an FTP server would enable maliciously logging into all the deployed instances of this server. Confirmed software supply-chain attacks have led to the injection of backdoors into popular open-source projects, and backdoors have been discovered in various router firmware. Manual code auditing for backdoors is challenging and existing semi-automated approaches can handle only a limited amount of programs and backdoors, while requiring manual reverse-engineering of the audited (binary) program. Graybox fuzzing (automated semi-randomized testing) has grown in popularity due to its success in discovering vulnerabilities and hence stands as a strong candidate for improved backdoor detection. However, current fuzzing knowledge does not offer any means to detect the triggering of a backdoor at runtime. In this work we introduce ROSA, a novel approach (and tool) which combines a state-of-the-art fuzzer (AFL++) with a new metamorphic test oracle, capable of detecting runtime backdoor triggers. To facilitate the evaluation of ROSA, we have created ROSARUM, the first openly available benchmark for assessing the detection of various backdoors in diverse programs. Experimental evaluation shows that ROSA has a level of robustness, speed and automation similar to classical fuzzing. Compared to existing detection tools, it can handle a diversity of backdoors and programs and it does not rely on manually reverse-engineering the fuzzed binary code.
Tags: "Security", "Testing and Quality"Uwe Gropengießer, Elias Dietz, Florian Brandherm, Achref Doula, Osama Abboud, Xun Xiao, Max Mühlhäuser, "MARQ: Engineering Mission-Critical AI-based Software with Automated Result Quality Adaptation"
Abstract: AI-based mission-critical software exposes a blessing and a curse: its inherent statistical nature allows for flexibility in result quality, yet the mission-critical importance demands adherence to stringent constraints such as execution deadlines. This creates a space for trade-offs between the Quality of Result (QoR)—a metric that quantifies the quality of a computational outcome—and other application attributes like execution time and energy, particularly in real-time scenarios. Fluctuating resource constraints, such as data transfer to a remote server over unstable network connections, are prevalent in mobile and edge computing environments—encompassing use cases like Vehicle-to-Everything, drone swarms, or social-VR scenarios. We introduce a novel approach that enables software engineers to easily specify alternative AI service chains—sequences of AI services encapsulated in microservices aiming to achieve a predefined goal—with varying QoR and resource requirements. Our methodology facilitates dynamic optimization at runtime, which is automatically driven by the MARQ framework. Our evaluations show that MARQ can be used effectively for the dynamic selection of AI service chains in real-time while maintaining the required application constraints of mission-critical AI software. Notably, our approach achieves a 100x acceleration in service chain selection and an average 10% improvement in QoR compared to existing methods.
Tags: "SE for AI", "Autonomy"Yuchen Shao, Yuheng Huang, Jiawei Shen, Lei Ma, Ting Su, Chengcheng Wan, "Are LLMs Correctly Integrated into Software Systems?"
Abstract: Large language models (LLMs) provide effective solutions in various application scenarios, with the support of retrieval-augmented generation (RAG). However, developers face challenges in integrating LLM and RAG into software systems, due to lacking interface specifications, various requirements from software context, and complicated system management. In this paper, we have conducted a comprehensive study of 100 open-source applications that incorporate LLMs with RAG support, and identified 18 defect patterns. Our study reveals that 77% of these applications contain more than three types of integration defects that degrade software functionality, efficiency, and security. Guided by our study, we propose systematic guidelines for resolving these defects in software life cycle. We also construct an open-source defect library Hydrangea.
Tags: "SE for AI", "Design/Architecture"Jinan Jiang, Xinghao Peng, Jinzhao Chu, Xiapu Luo, "ConsCS: Effective and Efficient Verification of Circom Circuits"
Abstract: Circom is a popular programming language for writing arithmetic circuits that can be used to generate zero-knowledge proofs (ZKPs) like zk-SNARKS. ZKPs have received tremendous attention in protocols like zkRollups. The Circom circuits are compiled to Rank-1 Constraint Systems (R1CS) circuits, based on which zk-SNARK proofs are generated. However, one major challenge associated with R1CS circuits is the problem of under-constrained circuits, which are susceptible to allowing incorrect computations to pass verification due to insufficient constraints, potentially leading to security vulnerabilities. In this paper, we propose a novel framework ConsCS to automatically verify Circom circuits. Our contributions are threefold: 1) we propose novel circuit inference rules to help reduce the size of circuits and to extract more comprehensive information than existing works; 2) we introduce the novel Binary Property Graph (BPG) as a highly efficient reasoning engine, outperforming all existing tools in effectiveness and efficiency; 3) we leverage fine-grained domain-specific information to guide the SMT solving to address non-linear constraints, increasing the success rate of SMT queries of existing works from 2.68% to 48.84%. We conduct experiments to show that ConsCS enhances the solved rate of existing works from around 50-60% to above 80%.
Tags: "Formal methods", "Security"Youngjae Choi, Seunghoon Woo, "TIVER: Identifying Adaptive Versions of C/C++ Third-Party Open-Source Components Using a Code Clustering Technique"
Abstract: Reusing third-party open-source software (OSS) provides many benefits but can expose the entire system to risks owing to propagated vulnerabilities. While tracking the versions of OSS components can help prevent threats, existing approaches typically map a single version to a reused OSS codebase. This coarse-grained method fails to address multiple versions of code that coexist within the codebase, resulting in ineffective OSS management. Additionally, effectively identifying component versions is challenging owing to noise codes, such as algorithmic codes that coexist across different OSS, as well as duplicate components arising from the redundant reuse of OSS. In this paper, we introduce the concept of the adaptive version, a one-stop solution to represent the version diversity of reused OSS. We present TIVER, an effective approach for identifying adaptive versions of OSS components. TIVER employs two key techniques: (1) fine-grained function-level versioning to uncover detailed versions, and (2) OSS code clustering to identify duplicate components and remove noise. This enables precise identification of OSS reuse locations and adaptive versions, effectively mitigating threats related to OSS reuse. Evaluation of popular C/C++ software on GitHub revealed that OSS components with a single version accounted for only 33%, while the remaining 67% of the components contained more than three versions on average. Nonetheless, TIVER effectively identified adaptive versions of OSS components with 88.46% precision and 91.63% recall in duplicate component distinction, and 86% precision and 86.84% recall in eliminating noise, while existing approaches barely achieved 42% recall in distinguishing duplicates and did not address noise. Further experiments showed that TIVER could enhance vulnerability management and be applied to Software Bills of Materials (SBOM) to improve supply chain security.
Tags: "Prog Comprehension/Reeng/Maint", "Security"Yuanhang Zhou, Zhen Yan, Yuanliang Chen, Fuchen Ma, Ting Chen, Yu Jiang, "Chord: Towards a Unified Detection of Blockchain Transaction Parallelism Bugs"
Abstract: Blockchain systems have implemented various transaction parallelism mechanisms to improve the system throughput and reduce the latency. However, they inevitably introduce bugs. Such bugs can result in severe consequences such as asset loss, double spending, consensus failure, and DDoS. Unfortunately, they have been little analyzed about their symptoms and root causes, leading to a lack of effective detection methods. In this work, we conduct a thorough analysis of historical transaction parallelism bugs in four commercial blockchains. Results show that most of them arise from mishandling conflicting transactions and manifest without obvious phenomena. However, given the heterogeneity of blockchains, it is challenging to trigger conflict handling in a unified way. Effectively identifying these bugs is also hard. Inspired by the findings, we propose Chord, aiming at detecting blockchain transaction parallelism bugs. Chord proposes a unified conflict transaction model to generate various conflict transactions. Chord also dynamically adjust the transaction submission and inserts proactive reverts during transaction execution to conduct thorough testing. Besides, Chord incorporates a local-remote differential oracle and a TPS oracle to capture the bugs. Our evaluation shows that Chord successfully detects 54 transaction parallelism bugs. Besides, Chord outperforms the existing methods by decreasing the TPS by 49.7% and increasing the latency by 388.0%, showing its effectiveness in triggering various conflict scenarios and exposing the bugs.
Tags: "Testing and Quality", "Blockchain"Dominik Fuchß, Tobias Hey, Jan Keim, Haoyu Liu, Niklas Ewald, Tobias Thirolf, Anne Koziolek, "LiSSA: Toward Generic Traceability Link Recovery through Retrieval-Augmented Generation"
Abstract: There are a multitude of software artifacts which need to be handled during the development and maintenance of a software system. These artifacts interrelate in multiple, complex ways. Therefore, many software engineering tasks are enabled — and even empowered — by a clear understanding of artifact interrelationships and also by the continued advancement of techniques for automated artifact linking. However, current approaches in automatic Traceability Link Recovery (TLR) target mostly the links between specific sets of artifacts, such as those between requirements and code. Fortunately, recent advancements in Large Language Models (LLMs) can enable TLR approaches to achieve broad applicability. Still, it is a nontrivial problem how to provide the LLMs with the specific information needed to perform TLR. In this paper, we present LiSSA, a framework that harnesses LLM performance and enhances them through Retrieval-Augmented Generation (RAG). We empirically evaluate LiSSA on three different TLR tasks, requirements to code, documentation to code, and architecture documentation to architecture models, and we compare our approach to state-of-the-art approaches. Our results show that the RAG-based approach can significantly outperform the state-of-the-art on the code-related tasks. However, further research is required to improve the performance of RAG-based approaches to be applicable in practice.
Tags: "Requirements", "AI for SE"Weisong Sun, Yuchen Chen, Mengzhe Yuan, Chunrong Fang, Zhenpeng Chen, Chong Wang, Yang Liu, Baowen Xu, Zhenyu Chen, "Show Me Your Code! Kill Code Poisoning: A Lightweight Method Based on Code Naturalness"
Abstract: Neural code models (NCMs) have demonstrated extraordinary capabilities in code intelligence tasks. Meanwhile, the security of NCMs and NCMs-based systems has garnered increasing attention. In particular, NCMs are often trained on large-scale data from potentially untrustworthy sources, providing attackers with the opportunity to manipulate them by inserting crafted samples into the data. This type of attack is called a code poisoning attack (also known as a backdoor attack). It allows attackers to implant backdoors in NCMs and thus control model behavior, which poses a significant security threat. However, there is still a lack of effective techniques for detecting various complex code poisoning attacks. In this paper, we propose an innovative and lightweight technique for code poisoning detection named KillBadCode. KillBadCode is designed based on our insight that code poisoning disrupts the naturalness of code. Specifically, KillBadCode first builds a code language model (CodeLM) on a lightweight n-gram language model and trains it on a few clean code snippets. Then, given poisoned data, KillBadCode utilizes CodeLM to identify those tokens in (poisoned) code snippets that will make the code snippets more natural after being deleted as trigger tokens. Considering that the removal of some normal tokens in a single sample might also enhance code naturalness, leading to a high false positive rate (FPR), we aggregate the cumulative improvement of each token across all samples. Finally, KillBadCode purifies the poisoned data by removing all poisoned samples containing the identified trigger tokens. We conduct extensive experiments to evaluate the effectiveness and efficiency of KillBadCode, involving two types of advanced code poisoning attacks (a total of five poisoning strategies) and datasets from four representative code intelligence tasks. The experimental results demonstrate that across 20 code poisoning detection scenarios, KillBadCode achieves an average FPR of 8.30% and an average Recall of 100%, significantly outperforming four baselines. More importantly, KillBadCode is very efficient, with a minimum time consumption of only 5 minutes, and is 25 times faster than the best baseline on average. These highlight the great potential of KillBadCode in efficiently killing various code poisoning attacks.
Tags: "AI for SE", "Security"Freek Verbeek, Ali Shokri, Daniel Engel, Binoy Ravindran, "Formally Verified Binary-level Pointer Analysis"
Abstract: Binary-level pointer analysis can be of use in symbolic execution, testing, verification, and decompilation of software binaries. In various such contexts, it is crucial that the result is trustworthy, i.e., it can be formally established that the pointer designations are overapproximative. This paper presents an approach to formally proven correct binary-level pointer analysis. A salient property of our approach is that it first generically considers what proof obligations a generic abstract domain for pointer analysis must satisfy. This allows easy instantiation of different domains, varying in precision, while preserving the correctness of the analysis. In the trade-off between scalability and precision, such customization allows ``meaningful'' precision (sufficiently precise to ensure basic sanity properties, such as that relevant parts of the stack frame are not overwritten during function execution) while also allowing coarse analysis when pointer computations have become too obfuscated during compilation for sound and accurate bounds analysis. We experiment with three different abstract domains with high, medium, and low precision. Evaluation shows that our approach is able to derive designations for memory writes soundly in COTS binaries, in a context-sensitive interprocedural fashion.
Tags: "Formal methods", "Analysis"Yuqing Nie, Chong Wang, Kailong Wang, Guoai Xu, Guosheng Xu, Haoyu Wang, "Decoding Secret Memorization in Code LLMs Through Token-Level Characterization"
Abstract: Code Large Language Models (LLMs) have demonstrated remarkable capabilities in generating, understanding, and manipulating programming code. However, their training process inadvertently leads to the memorization of sensitive information, posing severe privacy risks. Existing studies on memorization in LLMs primarily rely on prompt engineering techniques, which suffer from limitations such as widespread hallucination and inefficient extraction of the target sensitive information. In this paper, we present a novel approach to characterize real and fake secrets generated by Code LLMs based on token probabilities. We identify four key characteristics that differentiate genuine secrets from hallucinated ones, providing insights into distinguishing real and fake secrets. To overcome the limitations of existing works, we propose DESEC, a two-stage method that leverages token-level features derived from the identified characteristics to guide the token decoding process. DESEC consists of constructing an offline token scoring model using a proxy Code LLM and employing the scoring model to guide the decoding process by reassigning token likelihoods. Through extensive experiments on four state-of-the-art Code LLMs using a diverse dataset, we demonstrate the superior performance of DESEC in achieving a higher plausible rate and extracting more real secrets compared to existing baselines. Our findings highlight the effectiveness of our token-level approach in enabling an extensive assessment of the privacy leakage risks associated with Code LLMs.
Tags: "AI for SE", "Security"Rajdeep Singh Hundal, Yan Xiao, Xiaochun Cao, Jin Song Dong, Manuel Rigger, "On the Mistaken Assumption of Interchangeable Deep Reinforcement Learning Implementations"
Abstract: \emph{Deep Reinforcement Learning} (DRL) is a paradigm of artificial intelligence where an \emph{agent} uses a neural network to learn which actions to take in a given \emph{environment}. DRL has recently gained traction from being able to solve complex environments like driving simulators, 3D robotic control, and multiplayer-online-battle-arena video games. Numerous \emph{implementations} of the state-of-the-art algorithms responsible for training these agents, like the \emph{Deep Q-Network} (DQN) and \emph{Proximal Policy Optimization} (PPO) algorithms, currently exist. However, studies make the mistake of assuming implementations of the same algorithm to be consistent and thus, \emph{interchangeable}. In this paper, through a \emph{differential testing} lens, we present the results of studying the extent of implementation inconsistencies, their effect on the implementations' performance, as well as their impact on the conclusions of prior studies under the assumption of interchangeable implementations. The outcome of our differential tests showed significant discrepancies between the tested algorithm implementations, indicating that they are \textit{not} interchangeable. In particular, out of the five PPO implementations tested on 56 games, three implementations achieved superhuman performance for 50\% of their total trials while the other two implementations only achieved superhuman performance for less than 15\% of their total trials. Furthermore, the performance among the high-performing PPO implementations was found to differ significantly in nine games. As part of a meticulous manual analysis of the implementations' source code, we analyzed implementation discrepancies and determined that code-level inconsistencies primarily caused these discrepancies. Lastly, we replicated a study and showed that this assumption of implementation interchangeability was sufficient to \emph{flip} experiment outcomes. Therefore, this calls for a shift in how implementations are being used. In addition, we recommend for (1) replicability studies for studies mistakenly assuming implementation interchangeability, (2) DRL researchers and practitioners to adopt the differential testing methodology proposed in this paper to combat implementation inconsistencies, and (3) the use of large environment suites.
Tags: "SE for AI", "Testing and Quality"shide zhou, Tianlin Li, WANG KAILONG, Yihao Huang, Ling Shi, Yang Liu, Haoyu Wang, "Understanding the Effectiveness of Coverage Criteria for Large Language Models: A Special Angle from Jailbreak Attacks"
Abstract: Large language models (LLMs) have revolutionized artificial intelligence, but their increasing deployment across critical domains has raised concerns about their abnormal behaviors when faced with malicious attacks. Such vulnerability alerts the widespread inadequacy of pre-release testing. In this paper, we conduct a comprehensive empirical study to evaluate the effectiveness of traditional coverage criteria in identifying such inadequacies, exemplified by the significant security concern of jailbreak attacks. Our study begins with a clustering analysis of the hidden states of LLMs, revealing that the embedded characteristics effectively distinguish between different query types. We then systematically evaluate the performance of these criteria across three key dimensions: criterion level, layer level, and token level. Our research uncovers significant differences in the sets of neurons covered when LLMs process normal versus jailbreak queries, aligning with our clustering experiments. Leveraging these findings, we propose three practical applications of coverage criteria in the context of LLM security testing. Specifically, we develop a real-time jailbreak detection mechanism that achieves high accuracy (93.61% on average) in classifying queries as normal or jailbreak. Furthermore, we explore the use of coverage levels to prioritize test cases, improving testing efficiency by focusing on high-risk interactions and removing redundant tests. Lastly, we introduce a coverage-guided approach for generating jailbreak attack examples, enabling systematic refinement of prompts to uncover new vulnerabilities. This study improves our understanding of LLM security testing, enhances their safety, and provides a foundation for developing more robust AI applications.
Tags: "SE for AI", "Security"Chong Wang, Kaifeng Huang, Jian Zhang, Yebo Feng, Lyuye Zhang, Yang Liu, Xin Peng, "LLMs Meet Library Evolution: Evaluating Deprecated API Usage in LLM-based Code Completion"
Abstract: Large language models (LLMs), pre-trained or fine-tuned on large code corpora, have shown effectiveness in generating code completions. However, in LLM-based code completion, LLMs may struggle to use correct and up-to-date Application Programming Interfaces (APIs) due to the rapid and continuous evolution of libraries. While existing studies have highlighted issues with predicting incorrect APIs, the specific problem of deprecated API usage in LLM-based code completion has not been thoroughly investigated. To address this gap, we conducted the first evaluation study on deprecated API usage in LLM-based code completion. This study involved seven advanced LLMs, 145 API mappings from eight popular Python libraries, and 28,125 completion prompts. The study results reveal the status quo (i.e., API usage plausibility and deprecated usage rate) of deprecated API and replacing API usage in LLM-based code completion from the perspectives of model, prompt, and library, and indicate the root causes behind. Based on these findings, we propose two lightweight fixing approaches, ReplaceAPI and InsertPrompt, which can serve as baseline approaches for future research on mitigating deprecated API usage in LLM-based completion. Additionally, we provide implications for future research on integrating library evolution with LLM-driven software development.
Tags: "AI for SE", "Prog Comprehension/Reeng/Maint"Matteo Camilli, Raffaela Mirandola, "Parametric Falsification of Many Probabilistic Requirements under Flakiness"
Abstract: Falsification is a popular simulation-based testing method for Cyber-Physical Systems to find inputs that violate a formal requirement. It employs optimization algorithms to minimize a robustness metric that defines the satisfaction of a given property over an execution trace. Despite falsification representing an established approach, detecting violations considering many, possibly independent, requirements simultaneously, under flaky simulations is an open problem. We address this problem by proposing a novel approach that combines parametric model checking and many-objective optimization. We use parametric model checking to shift part of the complexity of the problem offline. We pre-compute numeric constraints for the satisfaction of all requirements on a parametric specification of the testing scenario. Flaky violations are then detected using many-objective optimization to explore the space of changing factors in the scenario and push the parameters out of all precomputed constraints. The results of our empirical evaluation using four open-source evaluation subjects with increasing complexity (number of requirements) show that our approach can falsify many requirements simultaneously, without hiding their individual contribution. The effectiveness, in terms of quantity and severity of violations, is significantly higher than random search as well as two selected state-of-the-art baseline approaches. Furthermore, the extra offline computation yields a negligible cost.
Tags: "Testing and Quality"Kaia Newman, Sarah Snay, Madeline Endres, Manasvi Parikh, Andrew Begel, ""Get Me In The Groove": A Mixed Methods Study on Supporting ADHD Professional Programmers"
Abstract: Understanding the work styles of diverse programmers can help build inclusive workplaces, enabling all software engineers to excel. An estimated 10.6\% of programmers have \textit{Attention Deficit Hyperactivity Disorder} (ADHD), a condition characterized by differences in attention and working memory. Prior work has just begun to explore the impact of ADHD on software development, finding that inadequate support may negatively impact team productivity and employment. This prevents software development organizations from benefiting from ADHD-related strengths. To investigate these impacts, we conducted a two-phase mixed methods study. First, we qualitatively analyzed 99 threads (1,658 posts and comments) from \texttt{r/ADHD\_Programmers}, the largest public forum dedicated to the ADHD programmer community. We constructed a mapping that reveals how ADHD programmers apply personal strategies and organizational accommodations to address software task-specific challenges. Second, we conducted a large-scale survey of 239 ADHD and 254 non-ADHD professional programmers to validate how our qualitative data generalize to the worldwide developer population. Our results show that ADHD programmers are 1.8 to 4.4 times more likely to struggle more frequently than neurotypical developers with all challenges we consider, but especially with time management and design. Our findings have implications for inclusive and effective tool- and policy-building in software workplaces and motivate further research into the experiences of ADHD programmers.
Tags: "Human/Social"Ruanqianqian (Lisa) Huang, Savitha Ravi, Michael He, Boyu Tian, Sorin Lerner, Michael Coblenz, "How Scientists Use Jupyter Notebooks: Goals, Quality Attributes, and Opportunities"
Abstract: Computational notebooks are intended to prioritize the needs of scientists, but little is known about how scientists interact with notebooks, what requirements drive scientists' software development processes, or what tactics scientists use to meet their requirements. We conducted an observational study of 20 scientists using Jupyter notebooks for their day-to-day tasks, finding that scientists prioritize different quality attributes depending on their goals. A qualitative analysis of their usage shows (1) a collection of goals scientists pursue with Jupyter notebooks, (2) a set of quality attributes that scientists value when they write software, and (3) tactics that scientists leverage to promote quality. In addition, we identify ways scientists incorporated AI tools into their notebook work. From our observations, we derive design recommendations for improving computational notebooks and future programming systems for scientists. Key opportunities pertain to helping scientists create and manage state, dependencies, and abstractions in their software, enabling more effective reuse of clearly-defined components.
Tags: "Human/Social"Gabriel Sherman, Stefan Nagy, "No Harness, No Problem: Oracle-guided Harnessing for Auto-generating C API Fuzzing Harnesses"
Abstract: Library APIs are used by virtually every modern application and system, making them among today’s most security-critical software. In recent years, library bug-finding efforts have overwhelmingly adopted the powerful testing strategy of coverage-guided fuzzing. At its core, API fuzzing operates on harnesses: wrapper programs that initialize an API before feeding random inputs to its functions. Successful fuzzing demands correct and thorough harnesses, making manual harnessing challenging without sufficient domain expertise. To overcome this, recent strategies propose “learning” libraries’ intended usage to automatically generate their fuzzing harnesses. Yet, despite their high code coverage, resulting harnesses frequently miss key API semantics—bringing with them invalid, unrealistic, or otherwise-impossible data and call sequences—derailing fuzzing with false-positive crashes. Thus, without a precise, semantically-correct harnessing, many critical APIs will remain beyond fuzzing’s reach—leaving their hidden vulnerabilities ripe for attackers. This paper introduces Oracle-guided Harnessing: a technique for fully-automatic, semantics-aware API fuzzing harness synthesis. At a high level, Oracle-guided Harnessing mimics the trial-and-error process of manual harness creation—yet automates it via fuzzing. Specifically, we leverage information from API headers to mutationally stitch-together candidate harnesses; and evaluate their validity via a set of Correctness Oracles: compilation, execution, and changes in coverage. By keeping— and further mutating—only correct candidates, our approach produces a diverse set of semantically-correct harnesses for complex, real-world libraries in as little as one hour. We integrate Oracle-guided Harnessing as a prototype, OGHARN; and evaluate it alongside today’s leading fully-automatic harnessing approach, Hopper, and a plethora of developer-written harnesses from OSS-Fuzz. Across 20 real-world APIs, OGHARN outperforms developer-written harnesses by a median 14% code coverage, while uncovering 31 and 30 more vulnerabilities than both Hopper and developer-written harnesses, respectively—with zero false-positive crashes. Of the 41 new vulnerabilities found by OGHARN, all 41 are confirmed by developers—40 of which are since fixed—with many found in APIs that, until now, lacked harnesses whatsoever.
Tags: "Testing and Quality"Ruchit Rawal, Victor-Alexandru Pădurean, Sven Apel, Adish Singla, Mariya Toneva, "Hints Help Finding and Fixing Bugs Differently in Python and Text-based Program Representations"
Abstract: With the recent advances in AI programming assistants such as GitHub Copilot, programming is not limited to classical programming languages anymore--programming tasks can also be expressed and solved by end-users in natural text. Despite the availability of this new programming modality, users still face difficulties with algorithmic understanding and program debugging. One promising approach to support end-users is to provide hints to help them find and fix bugs while forming and improving their programming capabilities. While it is plausible that hints can help, it is unclear which type of hint is helpful and how this depends on program representations (classic source code or a textual representation) and the user's capability of understanding the algorithmic task. To understand the role of hints in this space, we conduct a large-scale crowd-sourced study involving 753 participants investigating the effect of three types of hints (test cases, conceptual, and detailed), across two program representations (Python and text-based), and two groups of users (with clear understanding or confusion about the algorithmic task). We find that the program representation (Python vs. text) has a significant influence on the users' accuracy at finding and fixing bugs. Surprisingly, users are more accurate at finding and fixing bugs when they see the program in natural text. Hints are generally helpful in improving accuracy, but different hints help differently depending on the program representation and the user's understanding of the algorithmic task. These findings have implications for designing next-generation programming tools that provide personalized support to users, for example, by adapting the programming modality and providing hints with respect to the user's skill level and understanding.
Tags: "Human/Social"Satyaki Das, Syeda Tasnim Fabiha, Saad Shafiq, Nenad Medvidovic, "Are We Learning the Right Features? A Framework for Evaluating DL-Based Software Vulnerability Detection Solutions"
Abstract: Recent research has revealed that the reported results of an emerging body of deep learning-based techniques for detecting software vulnerabilities are not reproducible, either across different datasets or on unseen samples. This paper aims to provide the foundation for properly evaluating the research in this domain. We do so by analyzing prior work and existing vulnerability datasets for the syntactic and semantic features of code that contribute to vulnerability, as well as features that falsely correlate with vulnerability. We provide a novel, uniform representation to capture both sets of features, and use this representation to detect the presence of both vulnerability and spurious features in code. To this end, we design two types of code perturbations: feature preserving perturbations (FPP) ensure that the vulnerability feature remains in a given code sample, while feature eliminating perturbations (FEP) eliminate the feature from the code sample. These perturbations aim to measure the influence of spurious and vulnerability features on the predictions of a given vulnerability detection solution. To evaluate how the two classes of perturbations influence predictions, we conducted a large-scale empirical study on five state-of-the-art DL-based vulnerability detectors. Our study shows that, for vulnerability features, only ~2% of FPPs yield the undesirable effect of a prediction changing among the five detectors on average. However, on average, ~84% of FEPs yield the undesirable effect of retaining the vulnerability predictions. For spurious features, we observed that FPPs yielded a drop in recall up to 29% for graph-based detectors. We present the reasons underlying these results and suggest strategies for improving DNN-based vulnerability detectors. We provide our perturbation-based evaluation framework as a public resource to enable independent future evaluation of vulnerability detectors.
Tags: "Security", "AI for SE"Aashish Yadavally, Xiaokai Rong, Phat Nguyen, Tien Nguyen, "Large Language Models for Safe Minimization"
Abstract: Several tasks in program analysis, verification, and testing are modeled as constraint solving problems, utilizing SMT solvers as the reasoning engine. In this work, we aim to investigate the reasoning capabilities of large language models (LLMs) toward reducing the size of an infeasible string constraint system by exploiting inter-constraint interactions such that the remaining ones are still unsatisfiable. We term this safe minimization. Motivated by preliminary observations of hallucination and error propagation in LLMs, we design SafeMin, a framework leveraging an LLM and SMT solver in tandem to ensure a safe and correct minimization. We test the applicability of our approach on string benchmarks from LeetCode in the computation of minimal unsatisfiable subsets (MUSes). We observed that SafeMin helps safely minimize 94.3% of these constraints, with an average minimization ratio of 98% relative to the MUSes. In addition, we assess SafeMin's capabilities in partially enumerating non-unique MUSes, which is baked into our approach via a "sample-and-enumerate'" decoding strategy. Overall, we captured 42.1% more non-unique MUSes than without such LLM-based macro-reasoning. Finally, we demonstrate SafeMin's usefulness in detecting infeasible paths in programs.
Tags: "AI for SE"Annibale Panichella, "Metamorphic-Based Many-Objective Distillation of LLMs for Code-related Tasks"
Abstract: Knowledge distillation compresses large language models (LLMs) into more compact and efficient versions that achieve similar accuracy on code-related tasks. However, as we demonstrate in this study, compressed models are four times less robust than the original LLMs when evaluated with metamorphic code. They have a 440% higher probability of misclassifying code clones due to minor changes in the code fragment under analysis, such as replacing parameter names with synonyms. To address this issue, we propose MORPH, a method that combines metamorphic testing with many-objective optimization for a robust distillation of LLMs for code. MORPH efficiently explores the models’ configuration space and generates Paretooptimal models that effectively balance accuracy, efficiency, and robustness to metamorphic code. Metamorphic testing measures robustness as the number of code fragments for which a model incorrectly makes different predictions between the original and their equivalent metamorphic variants (prediction flips). We evaluate MORPH on two tasks—code clone and vulnerability detection—targeting CodeBERT and GraphCodeBERT for distillation. Our comparison includes MORPH, the state-of-theart distillation method AVATAR, and the fine-tuned non-distilled LLMs. Compared to AVATAR, MORPH produces compressed models that are (i) 47% more robust, (ii) 25% more efficient (fewer FLOPs), while maintaining (iii) equal or higher accuracy (up to +6%), and (iv) similar model size.
Tags: "AI for SE", "Testing and Quality"Mengxia Ren, Anhao Xiang, Chuan Yue, "Analyzing the Feasibility of Adopting Google's Nonce-Based CSP Solutions on Websites"
Abstract: Content Security Policy (CSP) is a leading security mechanism for mitigating content injection attacks such as Cross-Site Scripting (XSS). Nevertheless, despite efforts from academia and industry, CSP policies (in short, CSPs) are not widely deployed on websites, and deployed CSPs often have security issues or errors. Such low and insecure CSP deployment problems are mainly due to the complexity of the CSP mechanism. Google recently proposed four nonce-based CSP solutions which are simpler and more secure compared to traditional whitelisting-based CSP solutions. Google successfully deployed their nonce- based CSP solutions on over 160 services, covering 62% of all outgoing Google traffic. These nonce-based CSP solutions use simple CSPs but provide fine-grained control of web resources; therefore, if widely adopted on many other websites, they can be very helpful on addressing the low and insecure CSP deployment problems. In this paper, we evaluate the feasibility of adopting Google's nonce-based CSP solutions on the Tranco top 10K websites. We construct a crawling tool to automatically visit websites, simulate user interactions, and insert four CSPs to collect the CSP violations triggered under them. We investigate the adoptability of the nonce-based CSP solutions, adoption issues, and the stability of adopting them on websites by analyzing the CSP violations triggered under the inserted CSPs. We found that most websites can adopt the nonce-based CSP solutions on all their webpages visited in our study. For websites that cannot, usually the adoption is hard on around 40% of their webpages. Overall, our results are very encouraging and can be helpful in promoting the proper deployment of CSPs on many websites.
Tags: "Security", "Testing and Quality"Zhijie Wang, Zijie Zhou, Da Song, Yuheng Huang, Shengmai Chen, Lei Ma, Tianyi Zhang, "Towards Understanding the Characteristics of Code Generation Errors Made by Large Language Models"
Abstract: Large Language Models (LLMs) have demonstrated unprecedented capabilities in code generation. However, there remains a limited understanding of code generation errors that LLMs can produce. To bridge the gap, we conducted an in-depth analysis of code generation errors across six representative LLMs on the HumanEval dataset. Specifically, we first employed open coding and thematic analysis to distill a comprehensive taxonomy of code generation errors. We analyzed two dimensions of error characteristics---semantic characteristics and syntactic characteristics. Our analysis revealed that LLMs often made non-trivial, multi-line code generation errors in various locations and with various root causes. We further analyzed the correlation between these errors and task complexity as well as test pass rate. Our findings highlight several challenges in locating and fixing code generation errors made by LLMs. In the end, we discussed several future directions to address these challenges.
Tags: "AI for SE", "MSR"Zexiang Zhang, Gaoning Pan, Ruipeng Wang, Yiming Tao, Zulie Pan, Unknown, Min Zhang, Yang Li, Yi Shen, Chunming Wu, "InSVDF: Interface-State-Aware Virtual Device Fuzzing"
Abstract: Hypervisor is the core technology of virtualization for emulating independent hardware resources for each virtual machine. Virtual devices serve as the main interface of the hypervisor, making the security of virtual devices crucial, as any vulnerabilities can impact the entire virtualization environment and pose a threat to the host machine's security. Direct Memory Access (DMA) is the interface of virtual devices, enabling communication with the host machine. Recently, many efforts have focused on fuzzing against DMA to discover the hypervisor's vulnerabilities. However, the lack of sensitivity to the DMA state causes these efforts to be hindered in efficiency during fuzzing. Specifically, there are two main issues: the uncertain interaction moment and the unclear interaction depth. In this paper, we introduce InSVDF, a DMA interface state-aware fuzzing engine. InSVDF first models the intra-interface state of the DMA interface and incorporates an asynchrony-aware state snapshot mechanism along with a depth-aware seed preservation mechanism. To validate our approach, we compare InSVDF with a state-of-the-art fuzzer. The results demonstrate that InSVDF significantly enhances vulnerability discovery speed, with improvements of up to 24.2x in the best case. Furthermore, InSVDF has identified 2 new vulnerabilities, one of which has been assigned a CVE ID.
Tags: "Testing and Quality", "Security"Heng Yong, Zhong Li, Minxue Pan, Tian Zhang, Jianhua Zhao, Xuandong Li, "GVI: Guided Vulnerability Imagination for Boosting Deep Vulnerability Detectors"
Abstract: The use of deep learning to achieve automated software vulnerability detection has been a longstanding interest within the software security community. These deep vulnerability detectors are mostly trained in a supervised manner, which heavily relies on large-scale, high-quality vulnerability datasets. However, the vulnerability datasets used to train deep vulnerability detectors frequently exhibit class imbalance due to the inherent nature of vulnerability data, where vulnerable cases are significantly rarer than non-vulnerable cases. This imbalance adversely affects the effectiveness of these detectors. A promising solution to address the class imbalance problem is to artificially generate vulnerable samples to enhance vulnerability datasets, yet existing vulnerability generation techniques are not satisfactory due to their inadequate representation of real-world vulnerabilities or their reliance on large-scale vulnerable samples for training the generation model. This paper proposes GVI, a novel approach aimed at generating vulnerable samples to boost deep vulnerability detectors. GVI takes inspiration from human learning with imagination and proposes exploring LLMs to imagine and create new, informative vulnerable samples from given seed vulnerabilities. Specifically, we design a Chain-of-Thought inspired prompt in GVI that instructs the LLMs to first analyze the seed to retrieve attributes related to vulnerabilities and then generate a set of vulnerabilities based on the seed’s attributes. Our extensive experiments on three vulnerability datasets (i.e., Devign, ReVeal, and BigVul) and across three deep vulnerability detectors (i.e., Devign, ReVeal, and LineVul) demonstrate that the vulnerable samples generated by GVI are not only more accurate but also more effective in enhancing the performance of deep vulnerability detectors.
Tags: "AI for SE", "Security"Menglong Chen, Tian Tan, Minxue Pan, Yue Li, "PacDroid: A Pointer-Analysis-Centric Framework for Security Vulnerabilities in Android Apps"
Abstract: General frameworks such as FlowDroid, IccTA, P/Taint, Amandroid, and DroidSafe have significantly advanced the development of static analysis tools for Android security by providing fundamental facilities for them. However, while these frameworks have been instrumental in fostering progress, they often operate with inherent inefficiencies, such as redundant computations, reliance on separate tools, and unnecessary complexity, which are rarely scrutinized by the analysis tools that depend on them. This paper introduces PacDroid, a new static analysis framework for detecting security vulnerabilities in Android apps. PacDroid employs a simple yet effective pointer-analysis-centric approach that naturally manages alias information, interprocedural value propagation, and all Android features it supports (including ICC, lifecycles, and miscs), in a unified manner. Our extensive evaluation reveals that PacDroid not only outperforms state-of-the-art frameworks in achieving a superior trade-off between soundness and precision (F-measure) but also surpasses them in both analysis speed and robustness; moreover, PacDroid successfully identifies 77 real security vulnerability flows across 23 real-world Android apps that were missed by all other frameworks. With its ease of extension and provision of essential facilities, PacDroid is expected to serve as a foundational framework for various future analysis applications for Android.
Tags: "Security", "Mobile SW"Chunhao Dong, Yanjie Jiang, Yuxia Zhang, Yang Zhang, Hui Liu, "ChatGPT-Based Test Generation for Refactoring Engines Enhanced by Feature Analysis on Examples"
Abstract: Software refactoring is widely employed to improve software quality. However, conducting refactorings manually is tedious, time-consuming, and error-prone. Consequently, automated and semi-automated tool support is highly desirable for software refactoring in the industry, and most of the main-stream IDEs provide powerful tool support for refactoring. However, complex refactoring engines are prone to errors, which in turn may result in imperfect and incorrect refactorings. To this end, in this paper, we propose a ChatGPT-based approach to testing refactoring engines. We first manually analyze bug reports and test cases associated with refactoring engines, and construct a feature library containing fine-grained features that may trigger defects in refactoring engines. The approach automatically generates prompts according to both predefined prompt templates and features randomly selected from the feature library, requesting ChatGPT to generate test programs with the requested features. Test programs generated by ChatGPT are then forwarded to multiple refactoring engines for differential testing. To the best of our knowledge, it is the first approach in testing refactoring engines that guides test program generation with features derived from existing bugs. It is also the first approach in this line that exploits LLMs in the generation of test programs. Our initial evaluation of four main-stream refactoring engines suggests that the proposed approach is effective. It identified a total of 115 previously unknown bugs besides 28 inconsistent refactoring behaviors among different engines. Among the 115 bugs, 78 have been manually confirmed by the original developers of the tested engines, i.e., IntelliJ IDEA, Eclipse, VScode-Java, and NetBeans.
Tags: "Prog Comprehension/Reeng/Maint", "AI for SE"Minghua He, Tong Jia, Chiming Duan, Huaqian Cai, Ying Li, Gang Huang, "Weakly-supervised Log-based Anomaly Detection with Inexact Labels via Multi-instance Learning"
Abstract: Log-based anomaly detection is essential for maintaining software availability. However, existing log-based anomaly detection approaches heavily rely on fine-grained exact labels of log entries which are very hard to obtain in real-world systems. This brings a key problem that anomaly detection models require supervision signals while labeled log entries are unavailable. Facing this problem, we propose a new labeling strategy called inexact labeling that instead of labeling an log entry, system experts can label a bag of log entries in a time span. Furthermore, we propose MIDLog, a weakly supervised log-based anomaly detection approach with inexact labels. We leverage the multi-instance learning paradigm to achieve explicit separation of anomalous log entries from the inexact labeled anomalous log set so as to deduce exact anomalous log labels from inexact labeled log sets. Extensive evaluation on three public datasets shows that our approach achieves an F1 score of over 85\% with inexact labels.
Tags: "AI for SE", "Security"Shuai Wang, Yinan Yu, Robert Feldt, Dhasarathy Parthasarathy, "Automating a Complete Software Test Process Using LLMs: An Automotive Case Study"
Abstract: Vehicle API testing verifies whether the interactions between a vehicle's internal systems and external applications meet expectations, ensuring that users can access and control various vehicle functions and data. However, this task is inherently complex, requiring the alignment and coordination of API systems, communication protocols, and even vehicle simulation systems to develop valid test cases. In practical industrial scenarios, inconsistencies, ambiguities, and interdependencies across various documents and system specifications pose significant challenges. This paper presents a system designed for the automated testing of in-vehicle APIs. By clearly defining and segmenting the testing process, we enable Large Language Models (LLMs) to focus on specific tasks, ensuring a stable and controlled testing workflow. Experiments conducted on over 100 APIs demonstrate that our system effectively automates vehicle API testing. The results also confirm that LLMs can efficiently handle mundane tasks requiring human judgment, making them suitable for complete automation in similar industrial contexts.
Tags: "Testing and Quality", "Autonomy"Yicheng Wang, Wensheng Dou, Yu Liang, Yi Wang, Wei Wang, Jun Wei, Tao Huang, "Evaluating Garbage Collection Performance Across Managed Language Runtimes"
Abstract: Modern managed language runtimes (e.g., Java, Go and C#) rely on garbage collection (GC) mechanisms to automatically allocate and reclaim in-memory objects. The efficiency of GC implementations can greatly impact the overall performance of runtime-based applications. To improve GC performance, the academic and industrial communities have proposed several approaches to evaluate the GC implementations in an individual runtime. However, these approaches target a specific managed language (e.g., Java), and cannot be used to compare the GC implementations in different runtimes. In this paper, we propose GEAR, an automated approach to construct consistent GC workloads for different managed language runtimes, which can further be used to evaluate GC implementations across different runtimes. Specifically, we design a group of runtime-agnostic Memory Operation Primitives (MOP), which can portray the memory usage information that influences GC. GEAR can further automatically convert a MOP program into runtime-specific programs for the target runtimes, which serve as a consistent GC workload for different runtimes. To build MOP programs with real-world GC workloads, we instrument the commonly-used runtime Java Virtual Machine (JVM) to collect the memory operation trace during a Java application’s execution, and then transform the memory operation trace into a MOP program. The experimental result on three widely-used runtimes (i.e., Java, Go and C#) shows that GEAR can generate consistent GC workloads for different runtimes. We further conduct a comprehensive study on these three runtimes, and reveal some interesting findings about their GC performance, providing useful guidance for improving their GC implementations.
Tags: "Analysis"Lili Quan, Tianlin Li, Xiaofei Xie, Zhenpeng Chen, Sen Chen, Lingxiao Jiang, Xiaohong Li, "Dissecting Global Search: A Simple yet Effective Method to Boost Individual Discrimination Testing and Repair"
Abstract: Deep Learning (DL) has achieved significant success in socially critical decision-making applications but often exhibits unfair behaviors, raising social concerns. Among these unfair behaviors, individual discrimination—examining inequalities between instance pairs with identical profiles differing only in sensitive attributes such as gender, race, and age—is extremely socially impactful. Existing methods have made significant and commendable efforts in testing individual discrimination before deployment. However, their efficiency and effectiveness remain limited, particularly when evaluating relatively fairer models. It remains unclear which phase of the existing testing framework (global or local) is the primary bottleneck limiting performance. Facing the above issues, we first identify that enhancing the global phase consistently improves overall testing effectiveness compared to enhancing the local phase. This motivates us to propose Genetic-Random Fairness Testing (GRFT), an effective and efficient method. In the global phase, we use a genetic algorithm to guide the search for more global discriminatory instances. In the local phase, we apply a light random search to explore the neighbors of these instances, avoiding time-consuming computations. Additionally, based on the fitness score, we also propose a straightforward yet effective repair approach. For a thorough evaluation, we conduct extensive experiments involving 6 testing methods, 5 datasets, 261 models (including 5 naively trained, 64 repaired, and 192 quantized for on-device deployment), and sixteen combinations of sensitive attributes, showing the superior performance of GRFT and our repair method.
Tags: "SE for AI"Youpeng Ma, Tao Chen, Ke Li, "Faster Configuration Performance Bug Testing with Neural Dual-level Prioritization"
Abstract: As software systems become more complex and configurable, more performance problems tend to arise from the configuration designs. This has caused some configuration options to unexpectedly degrade performance which deviates from their original expectations designed by the developers. Such discrepancies, namely configuration performance bugs (CPBugs), are devastating and can be deeply hidden in the source code. Yet, efficiently testing CPBugs is difficult, not only due to the test oracle is hard to set, but also because the configuration measurement is expensive and there are simply too many possible configurations to test. As such, existing testing tools suffer from lengthy runtime or have been ineffective in detecting CPBugs when the budget is limited, compounded by inaccurate test oracle. In this paper, we seek to achieve significantly faster CPBug testing by neurally prioritizing the testing at both the configuration option and value range levels with automated oracle estimation. Our proposed tool, dubbed NDP, is a general framework that works with different heuristic generators. The idea is to leverage two neural language models: one to estimate the CPBug types that serve as the oracle while, more vitally, the other to infer the probabilities of an option being CPBug-related, based on which the options and the value ranges to be searched can be prioritized. Experiments on several widely-used systems of different versions reveal that NDP can, in general, better predict CPBug type in 87% cases and find more CPBugs with up to 88.88$\times$ testing efficiency speedup over the state-of-the-art tools.
Tags: "AI for SE", "Testing and Quality"Kevin Hermann, Sven Peldszus, Jan-Philipp Steghöfer, Thorsten Berger, "An Exploratory Study on the Engineering of Security Features"
Abstract: Software security is of utmost importance for most software systems. Developers must systematically select, plan, design, implement, and especially maintain and evolve security features - functionalities to mitigate attacks or protect personal data such as cryptography or access control, to ensure the security of their software. While security features are usually available in libraries, additional code needs to be written and maintained to integrate security features and not all desired features can be reused this way. While there have been studies on the use of such libraries, surprisingly little is known about how developers engineer security features, how they select what security features to implement, and the implications on maintenance. We therefore currently rely on assumptions that are largely based on common sense or individual examples. However, researchers require hard empirical data to understand what practitioners need and how they view security, which we currently lack to provide them with effective solutions. We contribute an exploratory study with 26 knowledgeable industrial participants. We study how security features of software systems are selected and engineered in practice, what their code-level characteristics are, and the challenges practitioners face. Based on the empirical data gathered, we validate four common assumptions and gain insights into engineering practices.
Tags: "Security", "Design/Architecture"Martin Beisel, Johanna Barzen, Frank Leymann, Lavinia Stiliadou, Daniel Vietz, Benjamin Weder, "Pattern-based Generation and Adaptation of Quantum Workflows"
Abstract: Building quantum applications requires deep knowledge of quantum computing and software engineering. Hence, an abstraction layer reducing the complexity for non-experts is needed. Patterns are an established concept for the abstract description of proven solutions to recurring problems. Therefore, the quantum computing patterns, a pattern language for the quantum computing domain, can be used to define the building blocks and the structure of hybrid quantum applications. Furthermore, concrete software artifacts can be associated with patterns to solve the corresponding problem. However, these software artifacts are usually heterogeneous, e.g., using different data formats. Quantum workflows enable a robust and scalable orchestration of these heterogeneous software artifacts. However, manually modeling and configuring such quantum workflows is a complex, error-prone, and time-consuming task. To overcome this issue, we present an approach that automates the generation and adaptation of quantum workflows using the quantum computing patterns. We provide an architecture realizing our approach, a corresponding prototype, as well as an evaluation comprising different use cases, a runtime comparison, and a user study.
Tags: "Design/Architecture", "Prog Comprehension/Reeng/Maint"Beatriz Souza, Michael Pradel, "Treefix: Enabling Execution with a Tree of Prefixes"
Abstract: The ability to execute code is a prerequisite for various dynamic program analyses. Learning-guided execution has been proposed as an approach to enable the execution of arbitrary code snippets by letting a neural model predict likely values for any missing variables. Although state-of-the-art learning-guided execution approaches, such as LExecutor, can enable the execution of a relative high amount of code, they are limited to predicting a restricted set of possible values and do not use any feedback from previous executions to execute even more code. This paper presents Treefix, a novel learning-guided execution approach that leverages LLMs to iteratively create code prefixes that enable the execution of a given code snippet. The approach addresses the problem in a multi-step fashion, where each step uses feedback about the code snippet and its execution to instruct an LLM to improve a previously generated prefix. This process iteratively creates a tree of prefixes, a subset of which is returned to the user as prefixes that maximize the number of executed lines in the code snippet. In our experiments with two datasets of Python code snippets, Treefix achieves 25% and 7% more coverage relative to the current state of the art in learning- guided execution, covering a total of 84% and 82% of all lines in the code snippets.
Tags: "Testing and Quality", "AI for SE"Venkata Sai Aswath Duvvuru, Bohan Zhang, Michael Vierhauser, Ankit Agrawal, "LLM-Agents Driven Automated Simulation Testing and Analysis of small Uncrewed Aerial Systems"
Abstract: Thorough simulation testing is crucial for validating the correct behavior of small Uncrewed Aerial Systems (sUAS) across multiple scenarios, including adverse weather conditions (such as wind, and fog), diverse settings (hilly terrain, or urban areas), and varying mission profiles (surveillance, tracking). While various sUAS simulation tools exist to support developers, the entire process of creating, executing, and analyzing simulation tests remains a largely manual and cumbersome task. Developers must identify test scenarios, set up the simulation environment, integrate the System under Test (SuT) with simulation tools, formulate mission plans, and collect and analyze results. These labor-intensive tasks limit the ability of developers to conduct exhaustive testing across a wide range of scenarios. To alleviate this problem, in this paper, we propose AUTOSIMTEST, a Large Language Model (LLM)-driven framework, where multiple LLM agents collaborate to support the sUAS simulation testing process. This includes: (1) creating test scenarios that subject the SuT to unique environmental contexts; (2) preparing the simulation environment as per the test scenario; (3) generating diverse sUAS missions for the SuT to execute; and (4) automatically analyzing simulation results and providing an interactive analytics interface. Further, the design of the framework is flexible for creating and testing scenarios for a variety of sUAS use cases, simulation tools, and SuT input requirements. We evaluated our approach by (a) conducting simulation testing of PX4 and ArduPilot flight-controller-based SuTs, (b) analyzing the performance of each agent, and (c) gathering feedback from sUAS developers. Our findings indicate that AUTOSIMTEST significantly improves the efficiency and scope of the sUAS testing process, allowing for more comprehensive and varied scenario evaluations while reducing the manual effort.
Tags: "AI for SE", "Autonomy"Hongyuan Liang, Yue Huang, Tao Chen, "The Same Only Different: On Information Modality for Configuration Performance Analysis"
Abstract: Configuration in software systems helps to ensure efficient operation and meet diverse user needs. Yet, some, if not all, configuration options have profound implications for the system's performance. Configuration performance analysis, wherein the key is to understand (or infer) the configuration options' relations and their impacts on performance, is crucial. Two major modalities exist that serve as the source information in the analysis: either the manual or source code. However, it remains unclear what roles they play in configuration performance analysis. Much work that relies on manuals claims their benefits of information richness and naturalness; while work that trusts the source code more prefers the structural information provided therein and criticizes the timeliness of manuals. To fill such a gap, in this paper, we conduct an extensive empirical study over 10 systems, covering 1,694 options, 106,798 words in the manual, and 22,859,552 lines-of-code for investigating the usefulness of manual and code in two important tasks of configuration performance analysis, namely performance-sensitive options identification and the associated dependencies extraction. We reveal several new findings and insights, such as it is beneficial to fuse the manual and code modalities for both tasks; the current automated tools that rely on a single modality are far from being practically useful and generally remain incomparable to human analysis. All those pave the way for further advancing configuration performance analysis.
Tags: "Design/Architecture", "Testing and Quality"Yichen Li, Yulun Wu, Jinyang Liu, Zhihan Jiang, Zhuangbin Chen, Guangba Yu, Michael Lyu, "COCA: Generative Root Cause Analysis for Distributed Systems with Code Knowledge"
Abstract: Runtime failures are commonplace in modern distributed systems. When such issues arise, users often turn to platforms such as Github or JIRA to report them and request assistance. Automatically identifying the root cause of these failures is critical for ensuring high reliability and availability. However, prevailing automatic root cause analysis (RCA) approaches rely significantly on comprehensive runtime monitoring data, which is often not fully available in issue platforms. Recent methods leverage large language models (LLMs) to analyze issue reports, but their effectiveness is limited by incomplete or ambiguous user-provided information. To obtain more accurate and comprehensive RCA results, the core idea of this work is to extract additional diagnostic clues from code to supplement data-limited issue reports. Specifically, we propose COCA, a code knowledge enhanced root cause analysis approach for issue reports. Based on the data within issue reports, COCA intelligently extracts relevant code snippets and reconstructs execution paths, providing a comprehensive execution context for further RCA. Subsequently, COCA construct a prompt combining historical issue reports along with profiled code knowledge, enabling the LLMs to generate detailed root cause summaries and localize responsible components. Our evaluation on datasets from five real-world distributed systems demonstrates that COCA significantly outperforms existing methods, achieving a 28.3% improvement in root cause localization and a 22.0% improvement in root cause summarization. Furthermore, COCA's performance consistency across various LLMs underscores its robust generalizability.
Tags: "AI for SE", "Prog Comprehension/Reeng/Maint"Luciano Baresi, Davide Yi Xian Hu, Andrea Stocco, Paolo Tonella, "Efficient Domain Augmentation for Autonomous Driving Testing Using Diffusion Models"
Abstract: Simulation-based testing is widely used to assess the reliability of Autonomous Driving Systems (ADS), but its effectiveness is limited by the operational design domain (ODD) conditions available in such simulators. To address this limitation, in this work, we explore the integration of generative artificial intelligence techniques with physics-based simulators to enhance ADS system-level testing. Our study evaluates the effectiveness and computational overhead of three generative strategies based on diffusion models, namely instruction-editing, inpainting, and inpainting with refinement. Specifically, we assess these techniques' capabilities to produce augmented simulator-generated images of driving scenarios representing new ODDs. We employ a novel automated detector for invalid inputs based on semantic segmentation to ensure semantic preservation and realism of the neural generated images. We then perform system-level testing to evaluate the ADS's generalization ability to newly synthesized ODDs. Our findings show that diffusion models help increase the ODD coverage for system-level testing of ADS. Our automated semantic validator achieved a percentage of false positives as low as 3\%, retaining the correctness and quality of the generated images for testing. Our approach successfully identified new ADS system failures before real-world testing.
Tags: "Testing and Quality", "Autonomy"Shiyao Zhou, Jincheng Wang, He Ye, Hao Zhou, Claire Le Goues, Xiapu Luo, "LWDIFF: An LLM-Assisted Differential Testing Framework for WebAssembly Runtimes"
Abstract: WebAssembly (Wasm) runtimes execute Wasm programs, a popular low-level language for efficiently executing high-level languages in browsers, with broad applications across diverse domains. The correctness of those runtimes is critical for both functionality and security of Wasm execution, motivating testing approaches that target Wasm runtimes specifically. However, existing Wasm testing frameworks fail to generate test cases that effectively test all three phases of runtime, i.e., decoding, validation, and execution. To address this research gap, we propose a new differential testing framework for Wasm runtimes, which leverages knowledge from the Wasm language specification that prior techniques overlooked, enhancing comprehensive testing of runtime functionality. Specifically, we first use a large language model to extract that knowledge from the specification. We use that knowledge in the context of multiple novel mutation operators that generate test cases with diverse features to test all three runtime phases. We evaluate LWDIFF by applying it to eight Wasm runtimes. Compared with the state-of-the-art Wasm testers, LWDIFF achieves the highest branch coverage and identifies the largest number of bugs. In total, LWDIFF discovers 31 bugs across eight runtimes, all of which are confirmed, with 25 of them previously undiscovered.
Tags: "Testing and Quality"Taha Shabani, Noor Nashid, Parsa Alian, Ali Mesbah, "Dockerfile Flakiness: Characterization and Repair"
Abstract: Dockerfile flakiness—unpredictable temporal build failures caused by external dependencies and evolving environments—undermines deployment reliability and increases debugging overhead. Unlike traditional Dockerfile issues, flakiness occurs without modifications to the Dockerfile itself, complicating its resolution. In this work, we present the first comprehensive study of Dockerfile flakiness, featuring a nine-month analysis of 8,132 Dockerized projects, revealing that around 10% exhibit flaky behavior. We propose a taxonomy categorizing common flakiness causes, including dependency errors and server connectivity issues. Existing tools fail to effectively address these challenges due to their reliance on pre-defined rules and limited generalizability. To overcome these limitations, we introduce FLAKIDOCK, a novel repair framework combining static and dynamic analysis, similarity retrieval, and an iterative feedback loop powered by Large Language Models (LLMs). Our evaluation demonstrates that FLAKIDOCK achieves a repair accuracy of 73.55%, significantly surpassing state-of-the-art tools and baselines.
Tags: "Analysis"Ziqiao Kong, Shaohua Li, Heqing Huang, Zhendong Su, "SAND: Decoupling Sanitization from Fuzzing for Low Overhead"
Abstract: Sanitizers provide robust test oracles for various vulnerabilities. Fuzzing on sanitizer-enabled programs has been the best practice to find software bugs. Since sanitizers require heavy program instrumentation to insert run-time checks, sanitizer-enabled programs have much higher overhead compared to normally built programs. In this paper, we present SAND, a new fuzzing framework that decouples sanitization from the fuzzing loop. SAND performs fuzzing on a normally built program and only invokes the sanitizer-enabled program when input is shown to be interesting. Since most of the generated inputs are not interesting, i.e., not bug-triggering, SAND allows most of the fuzzing time to be spent on the normally built program. We further introduce execution pattern to practically and effectively identify interesting inputs. We implement SAND on top of AFL++ and evaluate it on 20 real-world programs. Our extensive evaluation highlights its effectiveness: in 24 hours, compared to all the baseline fuzzers, SAND significantly discovers more bugs while not missing any.
Tags: "Testing and Quality", "Security"Shuyin Ouyang, Jie Zhang, Zeyu Sun, Albert Merono Penuela, "Knowledge-Enhanced Program Repair for Data Science Code"
Abstract: This paper introduces DSrepair, a knowledge-enhanced program repair method designed to repair the buggy code generated by LLMs in the data science domain. DSrepair uses knowledge graph based RAG for API knowledge retrieval as well as bug knowledge enrichment to construct repair prompts for LLMs. Specifically, to enable knowledge graph based API retrieval, we construct DS-KG (Data Science Knowledge Graph) for widely used data science libraries. For bug knowledge enrichment, we employ an abstract syntax tree (AST) to localize errors at the AST node level. DSrepair's effectiveness is evaluated against five state-of-the-art LLM-based repair baselines using four advanced LLMs on the DS-1000 dataset. The results show that DSrepair surpasses all five baselines. Specifically, when compared to the second-best baseline, DSrepair demonstrates significant improvements, fixing 44.4%, 14.2%, 20.6%, and 32.1% more buggy code snippets for each of the four evaluated LLMs, respectively. Additionally, it achieves greater efficiency, reducing the number of tokens required per code task by 17.49%, 34.24%, 24.71%, and 17.59%, respectively.
Tags: "AI for SE", "Analysis"Dewu Zheng, Yanlin Wang, Ensheng Shi, Ruikai Zhang, Yuchi Ma, Hongyu Zhang, Zibin Zheng, "HumanEvo: An Evolution-aware Benchmark for More Realistic Evaluation of Repository-level Code Generation"
Abstract: To evaluate the repository-level code generation capabilities of Large Language Models (LLMs) in complex real-world software development scenarios, many evaluation methods have been developed. These methods typically leverage contextual code from the latest version of a project to assist LLMs in accurately generating the desired function. However, such evaluation methods fail to consider the dynamic evolution of software projects over time, which we refer to as evolution-ignored settings. This in turn results in inaccurate evaluation of LLMs' performance. In this paper, we conduct an empirical study to deeply understand LLMs' code generation performance within settings that reflect the evolution nature of software development. To achieve this, we first construct an evolution-aware repository-level code generation dataset, namely HumanEvo, equipped with an automated execution-based evaluation tool. Second, we manually categorize HumanEvo according to dependency levels to more comprehensively analyze the model's performance in generating functions with different dependency levels. Third, we conduct extensive experiments on HumanEvo with seven representative and diverse LLMs to verify the effectiveness of the proposed benchmark. We obtain several important findings through our experimental study. For example, we find that previous evolution-ignored evaluation methods result in inflated performance of LLMs, with performance overestimations ranging from 10.0% to 61.1% under different context acquisition methods, compared to the evolution-aware evaluation approach. Based on the findings, we give actionable suggestions for more realistic evaluation of LLMs on code generation. We also build a shared evolution-aware code generation toolbox to facilitate future research. The replication package including source code and datasets is anonymously available at https://anonymous.4open.science/r/HumanEvo/.
Tags: "AI for SE", "Prog Comprehension/Reeng/Maint"Ziyu Mao, Jingyi Wang, Jun Sun, Shengchao Qin, Jiawen Xiong, "LLM-aided Automatic Modeling for Security Protocol Verification"
Abstract: Symbolic protocol analysis serves as a pivotal technique for protocol design, security analysis, and the safeguarding of information assets. Several modern tools such as Tamarin and ProVerif have been proven successful in modeling and verifying real-world protocols, including complex protocols like TLS 1.3 and 5G AKA. However, developing formal models for protocol verification is a non-trivial task, which hinders the wide adoption of these powerful tools in practical protocol analysis. In this work, we aim to bridge the gap by developing an automatic method for generating symbolic protocol models using Large Language Models (LLMs) from protocol descriptions in natural language document. Although LLMs are powerful in various code generation tasks, it is shown to be ineffective in generating symbolic models (according to our empirical study). Therefore, rather than applying LLMs naively, we carefully decompose the symbolic protocol modelling task into several stages so that a series of formal models are incrementally developed towards generating the final correct symbolic model. Specifically, we apply LLMs for semantic parsing, enable lightweight manual interaction for disambiguation, and develop algorithms to transform the intermediate models for final symbolic model generation. To ensure the correctness of the generated symbolic model, each stage is designed based on a formal execution model and the model transformations are proven sound. To the best of our knowledge, this is the first work aiming to generate symbolic models for protocol verification from natural language documents. We also introduce a benchmark for symbolic protocol model generation, with 18 real-world security protocol's text description and their corresponding symbolic models. We then demonstrate the potential of our tool, which successfully generated correct models of moderate scale in 10 out of 18 cases.
Tags: "Security", "Formal methods"Sadra Sabouri, Philipp Eibl, Xinyi Zhou, Morteza Ziyadi, Nenad Medvidovic, Lars Lindemann, Souti Chattopadhyay, "Trust Dynamics in AI-Assisted Development: Definitions, Factors, and Implications"
Abstract: Software developers increasingly rely on AI code generation utilities. To ensure that "good" code is accepted into the code base and "bad" code is rejected, developers must know when to trust an AI suggestion. Understanding how developers build this intuition is crucial to enhancing developer-AI collaborative programming. In this paper, we seek to understand how developers (1) define and (2) evaluate the trustworthiness of a code suggestion and (3) how trust evolves when using AI code assistants. To answer these questions, we conducted a mixed-method study consisting of an in-depth exploratory survey with (n=29) developers followed by an observation study (n=10). We found that comprehensibility and perceived correctness were the most frequently used factors to evaluate code suggestion trustworthiness. However, the gap in developers' definition and evaluation of trust points to a lack of support for evaluating trustworthy code in real-time. We also found that developers often alter their trust decisions, keeping only 52% of original suggestions. Based on these findings, we extracted four guidelines to enhance developer-AI interactions. We validated the guidelines through a survey with (n=7) domain experts and survey members (n=8). We discuss the validated guidelines, how to apply them, and tools to help adopt them.
Tags: "Human/Social", "AI for SE"Sigma Jahan, Mehil B Shah, Parvez Mahbub, Masud Rahman, "Improved Detection and Diagnosis of Faults in Deep Neural Networks Using Hierarchical and Explainable Classification"
Abstract: Deep Neural Networks (DNN) have found numerous applications in various domains, including fraud detection, medical diagnosis, facial recognition, and autonomous driving. However, DNN-based systems often suffer from reliability issues due to their inherent complexity and the stochastic nature of their underlying models. Unfortunately, existing techniques to detect faults in DNN programs are either limited by the types of faults (e.g., hyperparameter or layer) they support or the kind of information (e.g., dynamic or static) they use. As a result, they might fall short of comprehensively detecting and diagnosing the faults. In this paper, we present DEFault (Detect and Explain Fault) -- a novel technique to detect and diagnose faults in DNN programs. It first captures dynamic (i.e., runtime) features during model training and leverages a hierarchical classification approach to detect all major fault categories from the literature. Then, it captures static features (e.g., layer types) from DNN programs and leverages explainable AI methods (e.g., SHAP) to narrow down the root cause of the fault. We train and evaluate DEFault on a large, diverse dataset of ~14.5K DNN programs and further validate our technique using a benchmark dataset of 52 real-life faulty DNN programs. Our approach achieves ~94% recall in detecting real-world faulty DNN programs and ~63% recall in diagnosing the root causes of the faults, demonstrating 3.92%--11.54% higher performance than that of state-of-the-art techniques. Thus, DEFault has the potential to significantly improve the reliability of DNN programs by effectively detecting and diagnosing the faults.
Tags: "SE for AI", "Testing and Quality"Yang Feng, Zheyuan Lin, Dongchen Zhao, Mengbo Zhou, Jia Liu, James A. Jones, "REDII: Test Infrastructure to Enable Deterministic Reproduction of Failures for Distributed Systems"
Abstract: Despite the fact that distributed systems have become a crucial aspect of modern technology and support many of the software systems that enable modern life, developers experience challenges in performing regression testing of these systems. Existing solutions for testing distributed systems are often either: (1) specialized testing environments that are created specifically for each system by its development team, which requires substantial effort for each team, with little-to-no sharing of this effort across teams; or (2) randomized injection tools that are often computationally expensive and offer no guarantees of preventing regressions, due to their randomness. The challenge of providing a generalized and practical solution to trigger bugs for reproducing and demonstrating failures, as well as to guard against regressions, is largely unaddressed. In this work, we present REDII, an infrastructure for supporting regression testing of distributed systems. REDII contains a dataset of real bugs on common distributed systems, along with a generalizable testing framework REDIT that enables developers to write tests that can reproduce failures by providing ways to deterministically control distributed execution. In addition to the real failures in REDII from multiple distributed systems, REDIT provides a reusable, programmable, platform-agnostic, deterministic regression-testing framework for developers of distributed systems. It can help automate the running of such tests, for both practitioners and researchers. We demonstrate REDIT with 63 bugs that we selected in JIRA on 7 large and widely used distributed systems. Our case studies show that REDII can be used to allow developers to write tests that effectively reproduce bugs on distributed systems and generate specific scenarios for regression testing, as well as providing deterministic failure injection that can help developers and researchers to better understand deterministic failures that may occur in distributed systems in the future. Additionally, our studies show that REDII is efficient for real-world system regression testing, providing a powerful tool for all participants in this area.
Tags: "Testing and Quality"Marcos Macedo, Yuan Tian, Pengyu Nie, Filipe Cogo, Bram Adams, "InterTrans: Leveraging Transitive Intermediate Translations to Enhance LLM-based Code Translation"
Abstract: Code translation aims to convert a program from one programming language (PL) to another. This long-standing software engineering task is crucial for modernizing legacy systems, ensuring cross-platform compatibility, enhancing performance, and more. However, automating this process remains challenging due to many syntactic and semantic differences between PLs. Recent studies show that even advanced techniques such as large language models (LLMs), especially open-source LLMs, still struggle with the task. Currently, code LLMs are trained with source code from multiple programming languages, thus presenting multilingual capabilities. In this paper, we investigate whether such capabilities can be harnessed to enhance code translation. To achieve this goal, we introduce InterTrans, an LLM-based automated code translation approach that, in contrast to existing approaches, leverages intermediate translations to bridge the syntactic and semantic gaps between source and target PLs. InterTrans contains two stages. It first utilizes a novel Tree of Code Translation (ToCT) algorithm to plan transitive intermediate translation sequences between a given source and target PL, then validates them in a specific order. We evaluate InterTrans with three open LLMs on three benchmarks (i.e., CodeNet, HumanEval-X, and TransCoder) involving six PLs. Results show an absolute improvement of 18.3% to 43.3% in Computation Accuracy (CA) for InterTrans over Direct Translation with 10 attempts. The best-performing variant of InterTrans(with the Magicoder LLM) achieved an average CA of 87.3%-95.4% on three benchmarks.
Tags: "AI for SE"Shiyu Sun, Yanhui Li, Lin Chen, Yuming Zhou, Jianhua Zhao, "Boosting Code-line-level Defect Prediction with Spectrum Information and Causality Analysis"
Abstract: Code-line-level defect prediction (CLDP) is an effective technique to incorporate comprehensive measures for buggy line identification to optimize efforts in Software Quality Assurance activities. Most CLDP methods either consider the textual information of the code or rely merely on file-level label information, which have not fully leveraged the essential information in the CLDP context, with historical \textit{code-line-level labels} being incredibly overlooked in their application. Due to the vast number of code lines and the sparsity of the tokens they contain, leveraging historical code-line-level label information remains a significant challenge. To address this issue, we propose a novel CLDP method, \textbf{S}pectrum inf\textbf{O}rmation and ca\textbf{U}sality a\textbf{N}alysis based co\textbf{D}e-line-level defect prediction ($\mathsf{SOUND}$). $\mathsf{SOUND}$ incorporates two key ideas: (a) it introduces a spectrum information perspective, utilizing labels from historical defective lines to quantify the contribution of tokens to line-level defects, and (b) it applies causal analysis to obtain a more systematic and comprehensive understanding of the causal relationships between tokens and defects. After conducting a comprehensive study involving 142 releases across 19 software projects, the experimental results show that our method significantly outperforms existing state-of-the-art (SOTA) CLDP baseline methods in terms of its ability to rank defective lines under three indicators, IFA, Recall@Top20\%LOC, and Effort@Top20\%Recall. Notably, in terms of IFA, our method achieves a score of 0 in most cases, indicating that the first line in the ranking list generated by our method is actually defective, significantly enhancing its practicality.
Tags: "Prog Comprehension/Reeng/Maint", "Testing and Quality"Ruchira Manke, Mohammad Wardat, Foutse Khomh, Hridesh Rajan, "Mock Deep Testing: Toward Separate Development of Data and Models for Deep Learning"
Abstract: While deep learning (DL) has permeated, and become an integral component of many critical software systems, today software engineering research hasn’t explored how to separately test data and models that are integral for DL approaches to work effectively. The main challenge in independently testing these components arises from the tight dependency between data and models. This research explores this gap, introducing our methodology of mock deep testing for unit testing of DL applications. To enable unit testing, we introduce a design paradigm that decomposes the workflow into distinct, manageable components, minimizes sequential dependencies, and modularizes key stages of the DL, including data preparation and model design. For unit testing these components, we propose modeling their dependencies using mocks. In the context of DL, mocks refer to mock data and mock model that mimic the behavior of the original data and model, respectively. This modular approach facilitates independent development and testing of the components, ensuring comprehensive quality assurance throughout the development process. We have developed KUnit, a framework for enabling mock deep testing for the Keras library, a popular library for developing DL applications. We empirically evaluated KUnit to determine the effectiveness of mocks in independently testing data and models. Our assessment of 50 DL programs obtained from Stack Overflow and GitHub shows that mocks effectively identified 10 issues in the data preparation stage and 53 issues in the model design stage. We also conducted a user study with 36 participants using KUnit to perceive the effectiveness of our approach. Participants using KUnit successfully resolved 25 issues in the data preparation stage and 38 issues in the model design stage. We also found that mock objects provide a lightweight emulation of the dependencies for unit testing, facilitating early bug detection. Lastly, to evaluate the usability of KUnit, we conducted a post-study survey. The results reveal that KUnit is helpful to DL application developers, enabling them to independently test each component (data and model) and resolve issues effectively in different stages.
Tags: "SE for AI", "Testing and Quality"Yorick Sens, Henriette Knopp, Sven Peldszus, Thorsten Berger, "A Large-Scale Study of Model Integration in ML-Enabled Software Systems"
Abstract: The rise of machine learning (ML) and its embedding in software-intensive systems has drastically changed the engineering of such systems. Traditionally, software engineering focuses on manually created artifacts, such as source code, and the process of creating them, as well as best practices for integrating them, i.e., software architectures. In contrast, the development of ML artifacts, i.e., ML models, comes from data science and focuses on the ML models and their training data. However, to deliver value to end users, these ML models must be integrated with traditional software components, often forming complex topologies. In fact, ML-enabled software can easily incorporate many different ML models. While the challenges and practices of building ML-enabled systems have been studied, little is known about the characteristics of real-world ML-enabled systems, beyond isolated examples. Properly embedding ML models in systems so that they can be easily maintained or reused is far from trivial. To improve development processes and architectures for ML-enabled systems, we need to improve our empirical understanding of these systems. We present the first large-scale study of real-world open-source ML-enabled software systems, covering over 2,928 systems on GitHub. We classified and analyzed them to determine their characteristics, as well as their practices for reusing ML models and related code, and the architecture of these systems. Practitioners and researchers benefit from insights into practices for embedding and integrating ML models, bringing data science and software engineering closer together.
Tags: "SE for AI", "Design/Architecture"Qi Guo, Xiaofei Xie, Shangqing Liu, Ming Hu, Xiaohong Li, Lei Bu, "Intention is All You Need: Refining Your Code from Your Intention"
Abstract: This paper proposes an intention-based code refinement technique, transforming the conventional code refinement process from comment to code to intention to code. The process is decomposed into two phases: Intention Extraction and Intention Guided Code Modification Generation. Intention Extraction categorizes comments using predefined templates, while the latter employs large language models (LLMs) to generate revised code based on these defined intentions. Three categories with eight subcategories are designed for comment transformation, followed by a hybrid approach that combines rule-based and LLM-based classifiers for accurate classification. Extensive experiments with five LLMs (GPT4o, GPT3.5, DeepSeekV2, DeepSeek7B, CodeQwen7B) under different prompting settings demonstrate that our approach achieves 79% accuracy in intention extraction and up to 66% in code refinement generation. Our results underscore the potential of this approach in enhancing data quality and improving code refinement processes.
Tags: "AI for SE"Yongchao Wang, Yuandao Ryan Cai, Charles Zhang, "Boosting Path-Sensitive Value Flow Analysis via Removal of Redundant Summaries"
Abstract: Value flow analysis that tracks the flow of values via data dependence is a widely used technique for detecting a broad spectrum of software bugs. However, the scalability issue often deteriorates when high precision (i.e., path-sensitivity) is required, as the instantiation of function summaries becomes excessively time- and memory-intensive. The primary culprit, as we observe, is the existence of redundant computations resulting from blindly computing summaries for a function, irrespective of whether they are related to bugs being checked. To address this problem, we present the first approach that can effectively identify and eliminate redundant summaries, thereby reducing the size of collected summaries from callee functions without compromising soundness or efficiency. Our evaluation on large programs demonstrates that our identification algorithm can significantly reduce the time and memory overhead of the state-of-the-art value flow analysis by 45\% and 27\%, respectively. Furthermore, the identification algorithm demonstrates remarkable efficiency by identifying nearly 80\% of redundant summaries while incurring a minimal additional overhead. In the largest \textit{mysqld} project, the identification algorithm reduces the time by 8107 seconds (2.25 hours) with a mere 17.31 seconds of additional overhead, leading to a ratio of time savings to paid overhead (i.e., performance gain) of 468.48 $\times$. In total, our method attains an average performance gain of 632.1 $\times$.
Tags: "Analysis"Zeinadsadat Saghi, Thomas Zimmermann, Souti Chattopadhyay, "Code Today, Deadline Tomorrow: Procrastination Among Software Developers"
Abstract: Procrastination, the action of delaying or postponing something, is a well-known phenomenon that is relatable to all. While it has been studied in academic settings, little is known about why software developers procrastinate. How does it affect their work? How can developers manage procrastination? This paper presents the first investigation of procrastination among developers. We conduct an interview study with (n=15) developers across different industries to understand the process of procrastination. Using qualitative coding, we report the positive and negative effects of procrastination and factors that triggered procrastination, as perceived by participants. We validate our findings using member checking. Our results reveal 14 negative effects of procrastination on developer productivity. However, participants also reported eight positive effects, four impacting their satisfaction. We also found that participants reported three categories of factors that trigger procrastination: task-related, personal, and external. Finally, we present 19 techniques reported by our participants and studies in other domains that can help developers mitigate the impacts of procrastination. These techniques focus on raising awareness and task focus, help with task planning, and provide pathways to generate team support as a mitigation means. Based on these findings, we discuss interventions for developers and recommendations for tool building to reduce procrastination. Our paper shows that procrastination has unique effects and factors among developers compared to other populations.
Tags: "Human/Social"Kaiyao Ke, "NIODebugger: A Novel Approach to Repair Non-Idempotent-Outcome Tests with LLM-Based Agent"
Abstract: Flaky tests, characterized by inconsistent results across repeated executions, present significant challenges in software testing, especially during regression testing. Recently, there has been emerging research interest in non-idempotent-outcome (NIO) flaky tests—tests that pass on the initial run but fail on subsequent executions within the same environment. Despite progress in utilizing Large Language Models (LLMs) to address flaky tests, existing methods have not tackled NIO flaky tests. The limited context window of LLMs restricts their ability to incorporate relevant source code beyond the test method itself, often overlooking crucial information needed to address state pollution, which is the root cause of NIO flakiness. This paper introduces NIODebugger, the first framework to utilize an LLM-based agent for fixing flaky tests. NIODebugger features a three-phase design: detection, exploration, and fixing. In the detection phase, dynamic analysis provides critical information (such as stack traces and custom test execution logs) from multiple test runs, which helps in understanding accumulative state pollution. During the exploration phase, the LLM-based agent identifies and provides instructions for extracting relevant source code associated with test flakiness. In the fixing phase, NIODebugger repairs the tests using the information gathered from the previous phases. NIODebugger can be integrated with multiple LLMs, achieving patching success rates ranging from 11.63% to 58.72%. Its best-performing variant, NIODebugger-GPT-4, successfully generated correct patches for 101 out of 172 previously unknown NIO tests across 20 large-scale open-source projects. We submitted pull requests for all generated patches; 58 have been merged, only 1 was rejected, and the remaining 42 are pending. The implementation of NIODebugger is provided as a Maven plugin accessible at https://github.com/NIOTester/NIODebugger.
Tags: "AI for SE", "Testing and Quality"Zifan Nan, Zhaoqiang Guo, Kui Liu, Xin Xia, "Test Intention Guided LLM-based Unit Test Generation"
Abstract: The emergence of Large Language Models (LLMs) has accelerated the progress of intelligent software engineering technologies, which brings promising possibility for unit test generation. However, existing approaches on unit tests directly generated from Large Language Models (LLMs) often prove impractical due to their low coverage and insufficient mocking capabilities. This paper proposes IntUT, a novel approach that utilizes explicit test intentions (e.g. test inputs, mock behaviors, and expected results) to effectively guide the LLM to generate high-quality test cases. Our experimental results on three industry Java projects and live study demonstrate that prompting LLM with test intention can generate high-quality test cases for developers. Specifically, it achieves the improvements on branch coverage by 94% and line coverage by 49%. Eventually, we obtain developers' feedback on using IntUT to generate cases for 3 newly Java projects with over 80% line coverage and 30% efficiency improvement on writing unit test cases.
Tags: "AI for SE", "Testing and Quality"Zhiming Chi, Jianan Ma, Pengfei Yang, Cheng-Chao Huang, Renjue Li, Jingyi Wang, Xiaowei Huang, Lijung Zhang, "Patch Synthesis for Property Repair of Deep Neural Networks"
Abstract: Deep neural networks (DNNs) are prone to various dependability issues, such as adversarial attacks, which hinder their adoption in safety-critical domains. Recently, NN repair techniques have been proposed to address these issues while preserving original performance by locating and modifying guilty neurons and their parameters. However, existing repair approaches are often limited to specific data sets and do not provide theoretical guarantees for the effectiveness of the repairs. To address these limitations, we introduce PatchPro, a novel patch-based approach for property-level repair of DNNs, focusing on local robustness. The key idea behind PatchPro is to construct patch modules that, when integrated with the original network, provide specialized repairs for all samples within the robustness neighborhood while maintaining the network's original performance. Our method incorporates formal verification and a heuristic mechanism for allocating patch modules, enabling it to defend against adversarial attacks and generalize to other inputs. PatchPro demonstrates superior efficiency, scalability, and repair success rates compared to existing DNN repair methods, i.e., realizing provable property-level repair for 100% cases across multiple high-dimensional datasets.
Tags: "SE for AI", "Analysis/Synthesis"Antu Saha, Oscar Chaparro, "Decoding the Issue Resolution Process In Practice via Issue Report Analysis: A Case Study of Firefox"
Abstract: Effectively managing and resolving software issues is critical for maintaining and evolving software systems. Development teams often rely on issue trackers and issue reports to track and manage the work needed during issue resolution, ranging from issue reproduction and analysis to solution design, implementation, verification, and deployment. Despite the issue resolution process being generally known in the software engineering community as a sequential list of activities, it is unknown how developers implement this process in practice and how they discuss it in issue reports. This paper aims to enhance our understanding of the issue resolution process implemented in practice by analyzing the issue reports of Mozilla Firefox. We qualitatively and quantitatively analyzed the discussions found in 356 Firefox issue reports, to identify the sequences of stages that developers go through to address various software problems. We analyzed the sequences to identify the overall resolution process at Firefox and derived a catalog of 47 patterns that represent instances of the process. We analyzed the process and patterns across multiple dimensions, including pattern complexity, issue report types, problem categories, and issue resolution times, resulting in various insights about Mozilla's issue resolution process. We discuss these findings and their implications for different stakeholders on how to better assess and improve the issue resolution process.
Tags: "Prog Comprehension/Reeng/Maint", "Human/Social"Sehoon Kim, Yonghyeon Kim, Dahyeon Park, Yuseok Jeon, Jooyong Yi, Mijung Kim, "Lightweight Concolic Testing via Path-Condition Synthesis for Deep Learning Libraries"
Abstract: Many techniques have been recently developed for testing deep learning (DL) libraries, recently. Although these techniques have effectively improved API and code coverage and detected unknown bugs, they rely on black-box fuzzing for input generation. Concolic testing (also known as dynamic symbolic execution) can be more effective in exploring diverse execution paths, but applying it to DL libraries is extremely challenging due to their inherent complexity. In this paper, we introduce the first concolic testing technique for DL libraries. Our technique offers a lightweight approach that significantly reduces the heavy overhead associated with traditional concolic testing. While symbolic execution maintains symbolic expressions for every variable with non-concrete values to build a path condition, our technique computes approximate path conditions by inferring branch conditions via inductive program synthesis. Despite potential imprecision from approximation, our method's light overhead allows for effective exploration of diverse execution paths within the complex implementations of DL libraries. We have implemented our tool, PathFinder, and evaluated it on PyTorch and TensorFlow. Our results show that PathFinder outperforms existing API-level DL library fuzzers by achieving 57\% more branch coverage on average; up to 58\% higher than TitanFuzz and 125\% higher than FreeFuzz. PathFinder is also effective in bug detection, uncovering 61 crash bugs, 59 of which were confirmed by developers as previously unknown, with 32 already fixed.
Tags: "SE for AI", "Testing and Quality"