
MSR 2025
Mon 28 - Tue 29 April 2025 Ottawa, Ontario, Canada
co-located with ICSE 2025

Accepted Papers

A Dataset of Contributor Activities in the NumFocus Open-Source Community
Data and Tool Showcase Track
Pre-print
A Dataset of Software Bill of Materials for Evaluating SBOM Consumption Tools
Data and Tool Showcase Track
CARDS: A collection of package, revision, and miscellaneous dependency graphs
Data and Tool Showcase Track
Pre-print
CodeFix-Bench: A Large-scale Benchmark for Learning to Localize Code Changes from Issue Reports
Data and Tool Showcase Track
CoDocBench: A Dataset for Code-Documentation Alignment in Software Maintenance
Data and Tool Showcase Track
CoMRAT: Commit Message Rationale Analysis Tool
Data and Tool Showcase Track
Media Attached
CoPhi - Mining C/C++ Packages for Conan Ecosystem Analysis
Data and Tool Showcase Track
Pre-print
CoUpJava: A Dataset of Code Upgrade Histories in Open-Source Java Repositories
Data and Tool Showcase Track
DataTD: A Dataset of Java Projects Including Test Doubles
Data and Tool Showcase Track
DPy: Code Smells Detection Tool for Python
Data and Tool Showcase Track
Pre-print
Drawing Pandas: A Benchmark for LLMs in Generating Plotting Code
Data and Tool Showcase Track
Pre-print
E2EGit: A Dataset of End-to-End Web Tests in Open Source Projects
Data and Tool Showcase Track
EvoChain: A Framework for Tracking and Visualizing Smart Contract Evolution
Data and Tool Showcase Track
FormalSpecCpp: A Dataset of C++ Formal Specifications Created Using LLMs
Data and Tool Showcase Track
GHALogs: Large-scale dataset of GitHub Actions runs
Data and Tool Showcase Track
GitProjectHealth: an Extensible Framework for Git Social Platform Mining
Data and Tool Showcase Track
HaPy-Bug - Human Annotated Python Bug Resolution Dataset
Data and Tool Showcase Track
HyperAST: Incrementally Mining Large Source Code Repositories
Data and Tool Showcase Track
Pre-print
ICVul: A Well-labeled C/C++ Vulnerability Dataset with Comprehensive Metadata and VCCs
Data and Tool Showcase Track
JPerfEvo: A Tool for Tracking Method-Level Performance Changes in Java Projects
Data and Tool Showcase Track
Jupyter Notebook Activity Dataset
Data and Tool Showcase Track
MaLAware: Automating the Comprehension of Malicious Software Behaviours using Large Language Models (LLMs)
Data and Tool Showcase Track
MARIN: A Research-Centric Interface for Querying Software Artifacts on Maven Repositories
Data and Tool Showcase Track
Pre-print
Mining Bug Repositories for Multi-Fault Programs
Data and Tool Showcase Track
Myriad People. Open Source Software for New Media Arts
Data and Tool Showcase Track
OpenMent: A Dataset of Mentor-Mentee Interactions in Google Summer of Code
Data and Tool Showcase Track
OSPtrack: A Labeled Dataset Targeting Simulated Execution of Open-Source Software
Data and Tool Showcase Track
OSS License Identification at Scale: A Comprehensive Dataset Using World of Code
Data and Tool Showcase Track
pyMethods2Test: A Dataset of Python Tests Mapped to Focal Methods
Data and Tool Showcase Track
Pre-print
RefExpo: Unveiling Software Project Structures through Advanced Dependency Graph Extraction
Data and Tool Showcase Track
RepoChat: An LLM-Powered Chatbot for GitHub Repository Question-Answering
Data and Tool Showcase Track
SCRUBD: Smart Contracts Reentrancy and Unhandled Exceptions Vulnerability Dataset
Data and Tool Showcase Track
SnipGen: A Mining Repository Framework for Evaluating LLMs for Code
Data and Tool Showcase Track
SPRINT: An Assistant for Issue Report Management
Data and Tool Showcase Track
TerraDS: A Dataset for Terraform HCL Programs
Data and Tool Showcase Track
TestMigrationsInPy: A Dataset of Test Migrations from Unittest to Pytest
Data and Tool Showcase Track
Pre-print
Under the Blueprints: Parsing Unreal Engine’s Visual Scripting at Scale
Data and Tool Showcase Track
Wild SBOMs: a Large-scale Dataset of Software Bills of Materials from Public Code
Data and Tool Showcase Track

Call for Papers

The MSR Data and Tool Showcase Track aims to actively promote and recognize the creation of reusable datasets and tools that are designed and built not only for a specific research project but for the MSR community as a whole. These datasets and tools should enable other practitioners and researchers to jumpstart their research efforts and allow earlier work to be reproduced. Data and Tool Showcase papers can describe datasets or tools built by the authors for use by other practitioners or researchers, and/or describe how tools built by others were used to obtain specific research results.

MSR’25 Data and Tool Showcase Track will accept two types of submissions:

  1. Data showcase submissions are expected to include:

    • a description of the data source,
    • a description of the methodology used to gather the data (including provenance and the tool used to create/generate/gather the data, if any),
    • a description of the storage mechanism, including a schema if applicable,
    • if the data has been used by the authors or others, a description of how this was done, including references to previously published papers,
    • a description of the originality of the dataset (that is, even if the dataset has been used in a published paper, its complete description must be unpublished) and similar existing datasets (if any),
    • ideas for future research questions that could be answered using the dataset,
    • ideas for further improvements that could be made to the dataset, and
    • any limitations and/or challenges in creating or using the dataset.
  2. Reusable Tool showcase submissions are expected to include:

    • a description of the tool, including its background, motivation, novelty, overall architecture, detailed design, and preliminary evaluation, as well as a link to download or access the tool,
    • a description of the tool’s design and how to use it in practice,
    • clear installation instructions and example datasets that allow the reviewers to run the tool (see the sketch after this list),
    • if the tool has been used by the authors or others, a description of how the tool was used, including references to previously published papers,
    • ideas for future reusability of the tool, and
    • any limitations of using the tool.
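
For illustration (a hypothetical sketch, not a requirement on any specific tool), the kind of minimal “run it on the bundled example data” snippet that helps reviewers get started could look like the following Python fragment. The package name repominer, its CommitAnalyzer API, and the example file path are invented placeholders for whatever the submitted tool actually ships.

    # Environment setup, typically documented in the README:
    #   python -m venv .venv && source .venv/bin/activate
    #   pip install -r requirements.txt
    from pathlib import Path

    from repominer import CommitAnalyzer  # hypothetical package and class

    # Small example dataset shipped with the tool so reviewers can run it directly.
    example_data = Path("examples/sample_repo_commits.csv")

    analyzer = CommitAnalyzer(example_data)
    report = analyzer.run()                # analyze the bundled sample data
    report.to_json("example_report.json")  # artifact the reviewers can inspect
    print(report.summary())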

The dataset or tool should be made available at the time of submission of the paper for review, but will be considered confidential until publication of the paper. The dataset or tool should include detailed instructions about how to set up the environment (e.g., requirements.txt) and how to use the dataset or tool (e.g., how to import the data or how to access the data once it has been imported, or how to use the tool with a running example). At a minimum, upon publication of the paper, the authors should archive the data or tool in a persistent repository that can provide a digital object identifier (DOI), such as zenodo.org, figshare.com, Archive.org, or an institutional repository. In addition, the DOI-based citation of the dataset or tool should be included in the camera-ready version of the paper. GitHub provides an easy way to make source code citable (with third-party tools and with a CITATION file).
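
As a concrete, hypothetical illustration of the “how to access the data” instructions mentioned above, a dataset README might include a short snippet along the following lines (Python standard library only); the archive file name, table, and column names are assumptions standing in for the dataset’s actual schema.

    import sqlite3

    # Open the database file shipped in the archive (name assumed for illustration).
    conn = sqlite3.connect("dataset.sqlite3")
    conn.row_factory = sqlite3.Row

    # Example query over an assumed commits(project, sha, author, message) table:
    # the ten projects with the most commits.
    query = """
        SELECT project, COUNT(*) AS n_commits
        FROM commits
        GROUP BY project
        ORDER BY n_commits DESC
        LIMIT 10
    """
    for row in conn.execute(query):
        print(row["project"], row["n_commits"])

    conn.close()

A requirements.txt (or equivalent environment file) pinning the versions needed to run such snippets, together with the running example itself, makes the archived dataset or tool much easier for reviewers and later users to pick up.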

Data and Tool Showcase submissions are not:

  • empirical studies, or
  • datasets that are based on poorly explained or untrustworthy heuristics for data collection, or results of trivial application of generic tools.

If custom tools have been used to create the dataset, we expect the paper to be accompanied by the source code of the tools, along with clear documentation on how to run the tools to recreate the dataset. The tools should be open source, accompanied by an appropriate license; the source code should be citable, i.e., refer to a specific release and have a DOI. If you cannot provide the source code or the source code clause is not applicable (e.g., because the dataset consists of qualitative data), please provide a short explanation of why this is not possible.

Evaluation Criteria

The Review Criteria for the Data/Tool Showcase submissions are as follows:

  • value, usefulness, and reusability of the datasets or tools.
  • quality of the presentation.
  • clarity of the relation to related work and its relevance to mining software repositories.
  • availability of the datasets or tools.

Important Dates

  • Abstract Deadline: November 29, 2024
  • Paper Deadline: December 2, 2024
  • Author Notification: January 12, 2025
  • Camera Ready Deadline: February 5, 2025

Awards

The best dataset/tool paper(s) will receive a Distinguished Paper Award.

Submission

Submit your paper (maximum 4 pages, plus 1 additional page of references) via the HotCRP submission site: https://msr2025-data-tool.hotcrp.com/.

Submitted papers will undergo single-anonymous peer review. We opt for single-anonymous peer review (i.e., authors’ names are listed on the manuscript, as opposed to the double-anonymous peer review of the main track) because of the requirement above to describe how the data or tool has been used in previous studies, including bibliographic references to those studies. Such references are likely to disclose the authors’ identity.

To make research datasets and research software accessible and citable, we further encourage authors to follow the FAIR principles, i.e., data should be Findable, Accessible, Interoperable, and Reusable.

Submissions must conform to the IEEE Conference Proceedings Formatting Guidelines (title in 24pt font and full text in 10pt type; LaTeX users must use \documentclass[10pt,conference]{IEEEtran} without including the compsoc or compsocconf options).

Papers submitted for consideration should not have been published elsewhere and should not be under review or submitted for review elsewhere for the duration of consideration. ACM plagiarism policies and procedures shall be followed for cases of double submission. The submission must also comply with the IEEE Policy on Authorship. Please read the ACM Policy on Plagiarism, Misrepresentation, and Falsification and the IEEE - Introduction to the Guidelines for Handling Plagiarism Complaints before submitting.

Upon notification of acceptance, all authors of accepted papers will be asked to complete a copyright form and will receive further instructions for preparing their camera-ready versions. At least one author of each paper is expected to register and present the results at the MSR 2025 conference. All accepted contributions will be published in the conference’s electronic proceedings.