DéjàVu: A Map of Code Duplicates on GitHub
1 INTRODUCTION
The advent of web-hosted open source repository services such as GitHub, BitBucket and SourceForge has transformed how source code is shared. Creating a project takes almost no effort and is free of cost for small teams working in the open. Over the last two decades, millions of projects have been shared, building up a massive trove of free software. A number of these projects have been widely adopted and are part of our daily software infrastructure. More recently there have been attempts to treat the open source ecosystem as a massive dataset and to mine it in the hopes of finding patterns of interest.
Authors’ addresses: Cristina V. Lopes, University of California, Irvine, USA; Petr Maj, Czech Technical University, Czech
Republic; Pedro Martins, University of California, Irvine, USA; Vaibhav Saini, University of California, Irvine, USA; Di Yang,
University of California, Irvine, USA; Jakub Zitny, Czech Technical University, Czech Republic; Hitesh Sajnani, Microsoft
Research, USA; Jan Vitek, Northeastern University, USA.
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
This work is licensed under a Creative Commons Attribution 4.0 International License.
© 2017 Copyright held by the owner/author(s).
2475-1421/2017/10-ART84
https://doi.org/10.1145/3133908
Proc. ACM Program. Lang., Vol. 1, No. OOPSLA, Article 84. Publication date: October 2017.
When working with software, one may want to make statements about the applicability of, say, a compiler optimization or a static bug finding technique. Intuitively, one would expect that a conclusion based on a software corpus made up of thousands of programs randomly extracted from an Internet archive is more likely to hold than one based on a handful of hand-picked benchmarks such as [Blackburn et al. 2006] or [SPEC 1998]. For example, consider [Richards et al. 2011], which demonstrated that the design of the Mozilla optimizing compiler was skewed by the lack of representative benchmarks. Looking at small workloads gave a very different picture from what could be gleaned by downloading thousands of websites.
Scaling to large datasets has its challenges. Whereas small datasets can be curated with care, larger code bases are often obtained by random selection. If GitHub has over 4.5 million projects, how does one pick a thousand projects? If statistical reasoning is to be applied, the projects must be independent. Independence of observations is taken for granted in many settings, but with
[Figure 1: four heatmap panels, one each for Java, Python, C++ and JavaScript.]
Fig. 1. Map of code duplication. The y-axis is the number of commits per project, the x-axis is the number of
files in a project. The value of each tile is the percentage of duplicated files for all projects in the tile. Darker
means more clones.
software there are many ways one project can influence another. Influences can originate from the developers on the team; for instance, the same people will tend to write similar code. Even more common are the various means of software reuse. Projects can include other projects. Apache Commons is used in thousands of projects, Oracle's SDK is universally used by any Java project, JQuery by most websites. StackOverflow and other discussion forums encourage the sharing of code snippets. Cut-and-paste programming, where code is lifted from one project and dropped into another, is a further way to inject dependencies. Lastly, entire files can be copied from one project to the next. Any of these actions, at scale, may bias the results of research.
Several published studies either neglected to account for duplicates, or addressed them before analysis. [Casalnuovo et al. 2015] studied the use of assertions in the top 100 most popular C and C++ projects in GitHub. [Ray et al. 2014] studied software quality using the top 50 most popular projects in 17 languages. Neither addressed file duplication. Conversely, [Hoffa 2016] studied the old "tabs v. spaces" issue in 400K GitHub projects; file duplication was identified as an issue and eliminated before analysis. [Cosentino et al. 2016] present a meta-analysis of studies on GitHub projects where trends and problems related to dataset selection are identified.
This paper provides a tool to assist in selecting projects from GitHub. DéjàVu is a publicly available index of file-level code duplication. The novelty of our work lies partly in its scale; it is an index of duplication for the entire GitHub repository for four popular languages: Java, C++, Python and JavaScript. Figure 1 illustrates the proportion of duplicated files for different project sizes and numbers of commits (section 5 explains how these heatmaps were generated). The heatmaps show that as project size increases the proportion of duplicated files also increases. Projects with more commits tend to have fewer project-level clones. Finally, JavaScript projects have the most project-level clones, while Java projects have the fewest.
Table 1. File-hash duplication in subsets.

             10K Stars   10K Commits
Java             9%          6%
C/C++           41%         51%
Python          28%         44%
JavaScript      44%         66%

The clone map from which the heatmaps were produced is our main contribution. It can be used to understand the similarity relations in samples of projects or to curate samples to reduce duplicates. Consider for instance a subset that focuses on the most active projects, as done in [Borges et al. 2016], by filtering on the number of stars or commits a project has. For example, the clones for the 10K most popular projects are summarized in Table 1. In Java, this filter is reasonably efficient at reducing the number of clones. In other languages clones remain prevalent. DéjàVu can be used to curate datasets, i.e. to remove projects with too many clones. Besides applicability to research, our results can be used by anyone who needs to host large amounts of source code to avoid storing duplicate files. Our clone map can also be used to improve tooling, e.g. being queried when new files are added to projects to filter duplicates.
At the outset of this work, we were planning to study different granularities of duplication. As the results came in, the staggering rate of file-level duplication drove us to select three simple levels of similarity. A file hash gives a measure of files that are copied across projects without changes. A token hash captures minor changes in spaces, comments and ordering. Lastly, SourcererCC captures files with 80% token similarity. This gives an idea of how many files have been edited after cloning. Our choice of languages was driven by their popularity, and by the fact that two are statically typed and two have no type annotations. This can conceivably lead to differences in the way code is reused. We expected to answer the following questions: How much code cloning is there, how does cloning affect datasets of software written in different languages, and through which processes does duplication come about? This paper describes our methodology, details the corpus that we have selected and gives our answers to these questions. Along with the quantitative analysis, we provide a qualitative analysis of duplicates on a small number of examples.
Artifacts. The lists of clones, code for gathering data, computing clones, data analysis and visualiza-
tion are at: http://mondego.ics.uci.edu/projects/dejavu. Processing was done on a Dell PowerEdge
R830 with 56 cores (112 threads) and 256G of RAM. The data took 2 months to download and 6
weeks to process.
2 RELATED WORK
Code clone detection techniques have been documented in the literature since the early 90s.
Readers interested in a survey of the early work are referred to [Koschke 2007; Roy and Cordy
2007]. There are also benchmarks for assessing the performance of tools [Roy and Cordy 2009;
Svajlenko and Roy 2015]. The pipeline we used includes SourcererCC, a token-based code clone
detection tool that is freely available and has been compared to other similar tools using those
benchmarks [Sajnani 2016; Sajnani et al. 2016].1 SourcererCC is the most scalable tool so far for
detecting Type 3 clones. Type 3 clones are syntactically similar code fragments that differ at the statement level. The fragments have statements added/modified/removed with respect to each
other.
One of the earliest studies of inter-project cloning, [Kamiya et al. 2002] analyzed clones across
three different operating systems. They found evidence of about 20% cloning between FreeBSD
and NetBSD and less than 1% between Linux and FreeBSD or NetBSD. This is explained by the fact
that Linux originated and grew independently. [Mockus 2007] performed an analysis of popular
open source projects, including several versions of Unix and several popular packages; 38K projects
and 5M files. The concept of duplication there was simply based on file names. Approximately
half of the file names were used in more than one project. Furthermore, the study also tried to
identify components that were duplicated among projects by detecting directories that share a
large fraction of their files. Both [Mockus 2007] and [Mockus 2009] use only a fraction of our
dataset and a single similarity metric, as opposed to the 3 metrics we provide.
A few studies have focused on block-level cloning, i.e. portions of code smaller than entire files.
[Roy and Cordy 2010] analyzed clones in twenty open source C, Java and C# systems. They found
15% of the C files, 46% of the Java files, and 29% of C# files are associated with exact block-level
clones. Java had a higher percentage of clones because of accessor methods in Swing. [Heinemann
et al. 2011] computed block-level clones consisting of at least 15 statements between 22 commonly
reused Java frameworks consisting of more than 6 MLOC and 20 open source Java projects. They
did not find any clones for 11 projects. For 5 projects, they found cloning to be below 1% and for
the remaining 4, they found up to 10% cloning. These two studies give conflicting accounts of
block-level code duplication.
Closer to our study, an analysis of file-level code cloning on Java projects is presented by [Ossher
et al. 2011]. This work analyzed 13K Java projects with close to 2M files. The authors created a system that merges various clone detection techniques with various degrees of confidence, starting with the highest: MD5 hashes, then name equivalence through Java's fully-qualified names. They report 5.2% file-hash duplication, considerably lower than what we found. Our corpus is three orders of magnitude larger than Ossher's. Furthermore, intra-project duplication meant to deal with versioning was excluded. They looked at Subversion, which may have different practices than Git,
especially related to versioning. We speculate that the practice of copying source code files in open
source has become more pervasive since that study was made, and that sites like GitHub simplify
copying files among projects, but we haven’t reanalyzed the dataset as it is not relevant to the
DéjàVu map.
1 http://github.com/Mondego/SourcererCC
Over the past few years, open source repositories have turned out to be useful to validate beliefs about software development and software engineering in general. The richness of the data and the potential insights it represents have created an entire community of researchers. [Kochhar et al. 2013] used 50K GitHub repositories to investigate the correlation between the presence of test cases and various project development characteristics, including the lines of code and the size of development teams. They removed toy projects and included famous projects such as jQuery and Rails in their dataset. [Vendome et al. 2016] study how licensing usage and adoption change over time on 51K repositories. They chose repositories that (i) were not forks; and (ii) had at least one star. [Borges et al. 2016] analyze 2.5K repositories to investigate the factors that impact their popularity, including the identification of the major patterns that can be used to describe popularity trends.
The software engineering research community is increasingly examining large numbers of projects to test hypotheses or derive new knowledge about the software development process. However, as [Nagappan et al. 2013] point out, more is not necessarily better, and the selection of projects plays an important role, more so now than ever, since anyone can create a repository for any purpose at no cost. Thus, the quality of data gathered from these software repositories might be questionable. For example, as we also found out, repositories often contain school assignments, copies of other repositories, and images and text files without any source code. [Kalliamvakou et al. 2014] manually analyzed a sample of 434 GitHub repositories and found that approximately 37% of them were not used for software development. As a result, researchers have spent significant effort collecting, curating, and analyzing data from open source projects around the world. Flossmetrics [Gonzalez-Barahona et al. 2010] and Sourcerer [Ossher et al. 2009] collect data and provide statistics. [Dyer et al. 2013] have curated a large number of Java repositories and provide a domain specific language to help researchers mine data about software repositories. Similarly, [Bissyande et al. 2013] have created Orion, a prototype for enabling unified search to retrieve projects using complex search queries linking different artifacts of software development, such as source code, version control metadata, bug tracker tickets, and developer activities and interactions extracted from the hosting platform. Black Duck Open Hub (www.openhub.net) is a public directory of free and open source software, offering analytics and search services for discovering, evaluating, tracking, and comparing open source code and projects. It analyzes both the code's history and ongoing updates to provide reports about the composition and activity of project code bases. These platforms provide various filters that help researchers select repositories that are interesting for studying a given phenomenon. While these filters are useful to validate the integrity of the data to some extent, certain subtle factors, when unaccounted for, can heavily impact the validity of a study. Code duplication is one such factor. For example, if the dataset contains hundreds or thousands of duplicated projects, the overall lack of diversity in the dataset might lead to incorrect observations, as pointed out by [Nagappan et al. 2013].
3 ANALYSIS PIPELINE
Our analysis pipeline is outlined in Figure 2. The pipeline starts with local copies of the projects that
constitute our corpus. From here, code files are scanned for fact extraction and tokenization. Two
of the facts are the hashes of the files and the hashes of the tokens of the files. File hashes identify
exact duplicates; token hashes allow catch clones up with minor diferences. While permutations
of same tokens may have the same hash, they are unlikely. Clones are dominated by exact copies,
and we did not observe any such collision in randomly sampled pairs. Files with distinct token
hashes are used as input to the near-miss clone detection tool, SourcererCC. While our JavaScript
[Figure 2: the analysis pipeline. Software projects are tokenized and facts are extracted for each file; a file-hash reduction yields distinct source code, a token-hash reduction yields the distinct files handed to SourcererCC, which produces the clones.]
While our JavaScript pipeline was developed independently, its data formats, database schema and analysis scripts are identical.
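As a rough illustration of these reduction steps (this is not the pipeline's actual code; the record format is an assumption), the following Python sketch counts how many files survive each stage, the last count being the input handed to SourcererCC.

def reduction_stages(file_records):
    """file_records: iterable of dicts with 'file_hash' and 'token_hash' keys,
    as produced per file by the tokenizer described in Section 3.1."""
    total = 0
    file_hashes, token_hashes = set(), set()
    for rec in file_records:
        total += 1
        file_hashes.add(rec["file_hash"])
        token_hashes.add(rec["token_hash"])
    return {
        "all files": total,                        # everything in the corpus
        "file-hash distinct": len(file_hashes),    # exact copies collapsed
        "token-hash distinct": len(token_hashes),  # input to SourcererCC
    }

# Example: an exact copy and a layout-only variant of the same file.
records = [
    {"file_hash": "a", "token_hash": "x"},
    {"file_hash": "a", "token_hash": "x"},   # exact copy
    {"file_hash": "b", "token_hash": "x"},   # same tokens, different layout
]
print(reduction_stages(records))
# {'all files': 3, 'file-hash distinct': 2, 'token-hash distinct': 1}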
3.1 Tokenization
Tokenization transforms a file into a "bag of words", where the occurrences of each word are recorded.
Consider, for instance, the Java program:
package foo;
public class Foo { // Example Class
  private int x;
  public Foo(int x) { this.x = x; }
  private void print() { System.out.println("Number: " + x); }
  public static void main() { new FooNumber(4).print(); }
}
Tokenization removes comments, white space, and terminals. Tokens are grouped by frequency,
generating:
Java Foo: [(package,1),(foo,1),(public,3),(class,1),(Foo,2),(private,2),(int,2),(x,5),
           (this,1),(void,2),(print,2),(System,1),(out,1),(println,1),(Number,1),(static,1),
           (main,1),(new,1),(FooNumber,1),(4,1)]
The tokens package and foo appear once, public appears three times, etc. The order is not relevant. During tokenization we also extract additional information: (1) file hash: the MD5 hash of the entire string that composes the input file; (2) token hash: the MD5 hash of the string that constitutes the tokenized output; (3) size in bytes; (4) number of lines; (5) number of lines of code without blanks; (6) number of lines of source without comments; (7) number of tokens; and (8) number of unique tokens. The tokenized input is used both to build a relational database and as input to SourcererCC. The use of MD5 (or any hashing algorithm) runs the risk of collisions, but given the size of our data these are unlikely to skew the results.
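The following is a minimal sketch of such a tokenizer, not the one used in the study: the token and comment regular expressions and the canonical ordering used for the token hash are illustrative assumptions.

import hashlib
import re
from collections import Counter

TOKEN_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*|\d+")       # identifiers, keywords, numbers
COMMENT_RE = re.compile(r"//[^\n]*|/\*.*?\*/", re.DOTALL)   # C-style comments

def tokenize(source):
    """Strip comments, drop whitespace and terminals, and return the bag of words."""
    return Counter(TOKEN_RE.findall(COMMENT_RE.sub(" ", source)))

def md5_hex(text):
    return hashlib.md5(text.encode("utf-8")).hexdigest()

def file_facts(source):
    tokens = tokenize(source)
    canonical = ",".join("%s@%d" % (t, n) for t, n in sorted(tokens.items()))
    return {
        "file_hash": md5_hex(source),      # equal only for byte-for-byte copies
        "token_hash": md5_hex(canonical),  # equal for files with the same token bag
        "tokens": sum(tokens.values()),
        "unique_tokens": len(tokens),
    }

Two files that differ only in comments or layout receive the same token hash but different file hashes, which is exactly the distinction the rest of the pipeline relies on.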
3.2 Database
The data extracted by the tokenizer is imported into a MySQL database. The table Projects
contains a list of projects, with a unique identifier, a path in our local corpus and the project’s URL.
Files contains a unique id for a file, the id of the project the file came from, the relative paths and
URLs of the file and the file hash. The statistics for each file are stored in the table Stats, which
contains the information extracted by the tokenizer. The tokens themselves are not imported.
[Database schema: Projects (Project Id, Project Path, GitHub URL); Files (File Id, Project Id, Relative Path, Relative URL, File Hash); Stats (File Hash, Bytes, Lines, LOC, SLOC, Tokens, ...).]
The Stats table has the file hash as its unique key. With this, we get an immediate reduction from all files to hash-distinct files. Two files with distinct file hashes may produce the exact same tokens, and therefore the same token hash. This could happen when the code of one file is a permutation of another. The converse does not hold: files with distinct token hashes must have come from files with distinct file hashes. For source code analysis, file hashes are not necessarily the best indicators of code duplication; token hashes are more robust to small perturbations. We primarily use token hashes in our analysis.
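The schema lends itself to simple SQL queries for the reductions discussed here. The sketch below uses sqlite3 only so that it is self-contained (the study used MySQL); the lower-case table and column names mirror the schema above but are otherwise an assumption.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE projects (project_id INTEGER PRIMARY KEY, project_path TEXT, github_url TEXT);
CREATE TABLE files (file_id INTEGER PRIMARY KEY, project_id INTEGER,
                    relative_path TEXT, relative_url TEXT, file_hash TEXT);
CREATE TABLE stats (file_hash TEXT PRIMARY KEY, bytes INTEGER, lines INTEGER,
                    loc INTEGER, sloc INTEGER, tokens INTEGER, unique_tokens INTEGER);
""")
conn.executemany("INSERT INTO files VALUES (?,?,?,?,?)", [
    (1, 1, "a.java", "u1", "h1"),
    (2, 1, "b.java", "u2", "h1"),   # exact copy of a.java
    (3, 2, "c.java", "u3", "h2"),
])

# Percentage of files that are file-hash duplicates of some other file.
dup = conn.execute(
    "SELECT 100.0 * (COUNT(*) - COUNT(DISTINCT file_hash)) / COUNT(*) FROM files"
).fetchone()[0]
print(round(dup, 1))  # 33.3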
3.4 SourcererCC
The concept of inexact code similarity has been studied in the code cloning literature. Blocks of code
that are similar are called near-miss clones, or near-duplication [Cordy et al. 2004]. SourcererCC
estimates the amount of near-duplication in GitHub with a "bag of words" model for source code
rather than more sophisticated structure-aware clone detection methods. It has been shown to
have good precision and recall, comparable to more sophisticated tools [Sajnani 2016]. Its input
consists of non-empty files with distinct token hashes. SourcererCC finds clone pairs between
these files at a given level of similarity. We have selected 80% similarity as this has given good
empirical results. Ideally one could imagine varying the level of similarity and reporting a range
of results. But this would be computationally expensive and, given the relatively low numbers of
near-miss clones, would not affect our results.
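SourcererCC's actual implementation relies on indexing and filtering heuristics to scale to a corpus of this size; the sketch below only illustrates one plausible reading of the underlying overlap measure over token bags and the 80% threshold, not the tool's exact similarity function.

from collections import Counter
from math import ceil

def overlap_similar(bag1, bag2, threshold=0.8):
    """True if the token-bag overlap reaches the threshold fraction of the larger bag."""
    overlap = sum(min(count, bag2[token]) for token, count in bag1.items())
    return overlap >= ceil(threshold * max(sum(bag1.values()), sum(bag2.values())))

a = Counter({"public": 3, "class": 1, "Foo": 2, "int": 2, "x": 5})
b = Counter({"public": 3, "class": 1, "Bar": 2, "int": 2, "x": 5})
print(overlap_similar(a, b))  # True: the bags differ only in the class name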
4 CORPUS
The GitHub projects were downloaded using the GHTorrent database and network [Gousios 2013], which contains meta-data such as the number of stars, commits, committers, whether projects are forks, the main programming language, the date of creation, etc., as well as download links. While convenient, GHTorrent has errors: 1.6% of the projects were replicated entries with the same URL; only the youngest of these was kept for the analysis.
Table 2 gives the size of the different language corpora. We skipped forked projects as forks contain a large amount of code from the original projects; retaining those would skew our findings.
Downloading the projects was the most time-consuming step. The order of downloads followed the GHTorrent projects table, which seems to be roughly chronological. Some of the URLs failed to produce valid content. This happened in two cases: when the projects had been deleted or marked private, and when development for the project happened in branches other than master. Thus, the number of downloaded projects was smaller than the number of URLs in GHTorrent. For each language, the files analyzed were those whose extensions represent source code in the target languages. For Java: .java; for Python: .py; for JavaScript: .js; for C/C++: .cpp .hpp .HPP .c .h .C .cc .CPP .c++ and .cp. Some projects did not have any source code with the expected extensions; they were excluded.
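A sketch of this extension filter is shown below; only the extension lists come from the text, the directory-walking helper is illustrative.

import os

EXTENSIONS = {
    "Java":       {".java"},
    "Python":     {".py"},
    "JavaScript": {".js"},
    "C/C++":      {".cpp", ".hpp", ".HPP", ".c", ".h", ".C", ".cc", ".CPP", ".c++", ".cp"},
}

def source_files(project_root, language):
    """Yield files whose extension marks them as source code in the target language."""
    wanted = EXTENSIONS[language]
    for dirpath, _dirs, names in os.walk(project_root):
        for name in names:
            if os.path.splitext(name)[1] in wanted:
                yield os.path.join(dirpath, name)

# A project with no matching files is excluded from the corpus:
# has_source = any(True for _ in source_files("/path/to/project", "Java"))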
The medians in Table 2 give additional properties of the corpus, namely the number of files per
(non-empty) project, the number of Source Lines of Code (SLOC) per file, the number of stars and
the number of commits of the projects. In terms of files per project, Python and JavaScript projects
tend to be smaller than Java and C++ projects. C++ files are considerably larger than any others,
and JavaScript files are considerably smaller. None of these numbers is surprising. They all confirm
the general impression that a large number of projects hosted in GitHub are small, not very active,
and not very popular. Figures 3 and 4 illustrate the basic size-related properties of the projects we
analyzed, namely the distribution of files per project and the distribution of Source Lines of Code
(SLOC) per file. For JavaScript we give data with and without NPM (it is a cause of a large number
of clones). Without NPM means that we ignored files downloaded by the Node Package Manager.
[Figure 3: distribution of the number of files per project (% of projects) for C++, Python, Java and JavaScript, with mean and median marked; JavaScript is shown both with all files and without NPM files.]
5 QUANTITATIVE ANALYSIS
We present analyses of the data at two levels of detail: file and project level. This section focuses
exclusively on quantitative analysis; the next section delves deeper into qualitative observations.
[Figure 4: distribution of SLOC per file (log scale, % of projects) for C++, Python, Java and JavaScript, with mean and median marked; JavaScript is shown both with all files and without NPM files.]
The heatmaps (Figure 1) shown at the beginning of the paper were produced using the number of commits shown in Table 2, the number of files in each project, and the file hashes. The heat intensity corresponds to the ratio of file-hash clones to total files for each cell.
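A minimal sketch of how such a map can be computed is shown below; the logarithmic bin boundaries are an assumption, and the per-tile value follows the definition above (file-hash clones over total files).

import math
from collections import defaultdict

def heatmap(projects, base=2):
    """projects: iterable of (n_commits, n_files, n_file_hash_clones) per project.
    Returns {(commit_bin, file_bin): % of duplicated files} using log-scaled bins."""
    tiles = defaultdict(lambda: [0, 0])             # tile -> [clone files, all files]
    for commits, files, clones in projects:
        tile = (int(math.log(max(commits, 1), base)),
                int(math.log(max(files, 1), base)))
        tiles[tile][0] += clones
        tiles[tile][1] += files
    return {t: 100.0 * c / f for t, (c, f) in tiles.items() if f}

print(heatmap([(10, 100, 40), (12, 120, 80), (1000, 5, 0)]))
# {(3, 6): 54.5..., (9, 2): 0.0}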
Duplication can come in many flavors. Specifically, it could be evenly or unevenly distributed among all token hashes. We found these distributions to be highly skewed towards small groups of files. In Java, 1.5M groups of files with the same token hash have either 2 or 3 files in them; the number of token-hash-equal groups with more than 100 files is minuscule. The same observation holds for the other languages. Another interesting piece of information about clone groups is given by the largest extreme. In Python, the largest group of file-hash clones has over 2.5M files.
In Java, the largest group of SourcererCC clones has over 65K files. In the next section we show which files these are.
Fig. 5. File-level duplication for entire dataset and excluding small files.
Small files account for a slightly higher presence of duplication, but not that much higher than the rest of the corpus.
[Figure 6: percentage of projects, per language, whose files overlap with other projects at the >=50%, >=80% and 100% thresholds.]
Table 5 shows the number of projects whose files exist in other projects at some overlap threshold: 50%, 80% and 100%, respectively. A normalization of these numbers over the total number of
projects for each language is shown in Figure 6. JavaScript comes on top with respect to the amount
of project-level duplication, with 48% of projects having 50% or more files duplicated in some other
project, and an equally impressive 15% of projects being 100% duplicated.3 Not surprisingly, the
3 Again, we remind the reader that our dataset does not contain forks.
Table 6. Number of tokens per file within certain percentiles of the distribution of file size.
45%-55% (medium), 70%-80% (large), and greater than 90% (very large). So, the 45%-55% bin contains files that are between the 45th percentile and the 55th percentile of the number of tokens per file for a given language. The number of tokens for the bins can be seen in Table 6. For example, in Java the first bin includes files containing 47 to 72 tokens, and so on. The gaps between these percentiles (for example, no file is observed between the 30th and the 45th percentile) ensure buffer zones that are large enough to isolate the differently sized files, should differences in their characteristics be observed. For each of these bins, we analyzed the top 20 most cloned files; this grouping was performed twice, using file hashes and token hashes, and this was done for all the languages. In total, for each language, 80 files were analyzed.
• Qualitative Elements. Looking at the names of the most popular files, a first observation was that many of these files came from popular libraries and frameworks, like Apache Cordova. This hinted at the possibility that the origin of file duplication was in well-known, popular libraries copied into many projects; a qualitative analysis of file duplication was better understood from this perspective. Therefore, each file was observed from the perspective of the path relative to the project where it resides, and was then hand coded for its origin.4 For example, project_name/src/external/com/http-lib/src/file.java was considered to be part of the external library http-lib. Each folder assumed to represent an external library was matched with an existing homepage for the library, if we could find it using Google. Continuing the running example, http-lib was only flagged as an external dependency if there was a clear pointer online for a Java library with that name. In some cases, the path name was harder to interpret, for example: project_name/external/include/internal/tobjs.h. In those cases, we searched Google for the last part of the path in order to find the origin (in this particular case, we searched include/internal/tobjs.h). For JavaScript the situation was often simpler: many of the files came from NPM modules, in which case the module name was obvious from the file's location. Some of the files were also minified versions of libraries, in which case the name of the file gave the library name, often with its version (e.g. jquery-3.2.1.min). Using these methods, we were able to trace the origins of all the 320 files. (A small sketch of the path-based part of this coding appears right after this list.)
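The coding itself was done by hand, aided by web searches; the sketch below only automates the first, path-based step of guessing a candidate library name, and the marker directory names and regular expression are assumptions.

import re

# Directory names that frequently indicate vendored third-party code.
VENDOR_MARKERS = {"external", "vendor", "third_party", "thirdparty", "lib", "libs", "node_modules"}

def candidate_origin(relative_path):
    """Return a plausible library name for a vendored file, or None if no hint is found."""
    parts = relative_path.split("/")
    for i, part in enumerate(parts[:-1]):
        if part.lower() in VENDOR_MARKERS and i + 1 < len(parts) - 1:
            return parts[i + 1]          # the directory right under the marker
    # Minified libraries usually encode the library name and version in the file name.
    match = re.match(r"([A-Za-z][\w.-]*?)-\d+(\.\d+)*(\.min)?\.js$", parts[-1])
    return match.group(1) if match else None

print(candidate_origin("node_modules/express/lib/router/index.js"))  # express
print(candidate_origin("assets/js/jquery-3.2.1.min.js"))             # jquery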
6.1.2 Observations. Contrary to our original expectation, we did not find any differences in the nature of file duplication related to either the size of the files, the similarity metric, or the language in the 320 samples we inspected. We also did not find any StackOverflow or tutorial files in these samples. Moreover, the results for these files show a pattern that crosses all of those dimensions: the most
4 For a good tutorial on coding, see [Saldaña 2009].
duplicated files in all ecosystems come from a few well-known libraries and frameworks. The Java files were dominated by ActionBarSherlock and Cordova. C/C++ was dominated by boost and freetype, and JavaScript was dominated by files from various NPM packages; only 2 cases were from the jQuery library. For Python, the origins of file cloning for the 80 files sampled were more diverse, spread across 6 or 7 common frameworks.5
Because the JavaScript sample was so heavily (78 out of 80) dominated by Node packages, we performed the same analysis again, this time excluding the Node files. This uncovered jQuery, in its various versions and parts, accounting for more than half of the sample (43), followed at a distance by other popular frameworks such as Twitter Bootstrap (12), Angular (7) and reveal (4). Language tools such as modernizr, prettify, HTML5Shiv and others were present. We attribute this greater diversity to the fact that, to keep connections small, many libraries are distributed as a single file. It is also a testament to the popularity of jQuery, which still managed to occupy half of the list.
The presence of external libraries within the projects' source code shows a form of dependency management that occurs across languages: some dependencies are source-copied into the projects and committed to the projects' repositories, regardless of whether they are also installed through a package manager. Whether this is due to personal preference, operational necessity, or simple practicality cannot be inferred from our data.
Another interesting observation was that some libraries proliferate by being themselves source-included in other widely-duplicated libraries. Take Cordova, a commonly duplicated presence within the Java ecosystem. Cordova includes the source of okhttp, another common origin of duplication. Similarly, within C/C++, freetype2 was disseminated in great part with the help of another highly dispersed supporting framework, cocos2d. This not only exacerbates the problem, but also provides a clear picture of the tangled hierarchical reliance that exists in modern software, and that is sometimes source-included rather than installed via a package manager.
5 The very small number of libraries and frameworks found in these samples is a consequence of having sampled only 80 files per language, and the most duplicated ones. Many of the files had the same origin, because those original libraries consist of several files.
[Figure: per-language panels (Java, C/C++, Python, JavaScript); axes include file size in bytes and a count scale.]
different files). These findings are summarized in Figure 8. The outlier in the top-right corner of each graph is the empty file. The number of different empty files is explained by the fact that, when using token hashes, any file that does not have any language tokens in it is considered empty.
Given the multitude of sizes observed within token-hash groups, the next step was to analyze the actual difference in sizes within the groups. The results shown in Figure 9 summarize our findings. As expected, for all four languages the empty file again showed up very close to the top. For Java, the biggest empty file was 24.3MB and contains a huge number of comments as a compiler test. For C/C++ the empty file has the second largest difference and consists of a comment with ASCII art. Python's empty file was a JSON dump on a single line, which was commented out, and finally for JavaScript the largest empty file consisted of thousands of repetitions of an identical comment line, totaling 36MB.
More interesting than the largest empty files is the answer to the question: what other, non-trivial, files display the greatest difference in size within the same group?
[Figure: per-language panels (Java, C/C++, Python, JavaScript); axes include the number of different files and a count scale.]
The answer is slightly different for each language. For Java, the greatest size differences exist for binary files disguised as Java files; in these files, very few tokens were identified by the tokenizer, and therefore two unrelated binary files were grouped into a single token group with a small number of very different files. For C/C++ we often found source files with and without hundreds of KB of comments as members of the same group. An outlier was a file with excessive white space on each line (a 2.42MB difference). In Python, formatting was most often the cause: a single file multiplied its size 10 times by switching from tabs to 8 spaces. For JavaScript, we observed minified and non-minified versions. Sometimes the files were false positives because complex JavaScript regular expressions were treated as comments by the simple cross-language parser.
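A sketch of the size-delta measurement used in this subsection (the record format is an assumption):

from collections import defaultdict

def size_deltas(files):
    """files: iterable of (token_hash, size_in_bytes) pairs.
    Returns, for every token-hash group with more than one file, the gap between
    its largest and smallest member."""
    sizes = defaultdict(list)
    for token_hash, nbytes in files:
        sizes[token_hash].append(nbytes)
    return {t: max(s) - min(s) for t, s in sizes.items() if len(s) > 1}

# Groups with the largest deltas are the ones discussed above, e.g. minified vs.
# non-minified JavaScript or C/C++ files with and without huge comment blocks.
deltas = size_deltas([("t1", 1200), ("t1", 310000), ("t2", 900), ("t2", 950)])
print(sorted(deltas.items(), key=lambda kv: -kv[1]))  # [('t1', 308800), ('t2', 50)]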
6.2.3 SourcererCC Duplicates. For SourcererCC, we randomly selected 20 clone pairs and categorized them into three categories: i) intentional copy-paste clones; ii) unintentional, accidental clones; and iii) auto-generated clones. It is interesting to note that the clones in categories ii) and iii) are both unavoidable and are created because of the use of popular frameworks.
[Figure: per-language panels (Java, C/C++, Python, JavaScript); axes include delta size in bytes and a count scale.]
Java. We have categorized 30% (6 out of 20) of the clone pairs into the intentional copy-paste
clones category. It included instances of both inter-project and intra-project clones. Intra-project
clones were created to test/implement functionalities that are similar while keeping them isolated
and easy to maintain. Inter-project clones seemed to come from projects that look like class projects
for a university course and from situations where one project was almost entirely copy-pasted into
the other project. We found 2 instances of unintentional cloning, both inter-project. The files in
such clone pairs implement a lot of similar boilerplate code necessary to create an Android activity
class. We categorized the majority (12 out of 20) of the clone pairs into the auto-generated clones
category. The files in this category are automatically generated by frameworks like Apache
Axis (6 pairs), Android (2 pairs), and Java Architecture for XML Binding (4 pairs). The unintentional
and auto-generated clones together constitute 70% of the sample.
C/C++. The sample was dominated by intentional copy-paste clones (60%, 12 pairs). The origin of these file clone pairs seems to be the same, independent of them being inter- or intra-project
clones, and relates to the reuse of certain pieces of source code, after which they suffer small modifications to cope with different setups or support different frameworks. Five pairs were classified as unintentional cloning. They represented educational situations (one file was composed in large part of the skeleton of a problem, and the difference between the file clones was the small piece of code that implements the solution). Two different versions of the same file were also found (libpng 1.0.9 vs. libpng 1.2.30). Files from two projects sharing a common ancestor (bitcoin vs dotcoin) were also observed. The auto-generated clones were present in three pairs, 2 of them from the Meta-Object compiler.6 The unintentional and auto-generated clones accounted for 40% of the sample.
Python. The sample was dominated by uses of the Django framework (17 pairs), all variants of auto-generated code to initialize a Django application. We classified them as auto-generated clones. Two pairs were intentional copy-paste clones: intra-project copy-paste of unit tests. The last pair, which belonged to the same category, was a model schema for a Django database.
JavaScript. Only one example of intentional copy-paste clones was found, which consisted of a test template with a manually changed name, but nothing else. Five occurrences of unintentional cloning comprised pairs of different file versions of jQuery (2), a Google Maps opacity slider, modernizr, and an Angular socket service. The remaining 14 pairs (70%) were classified as auto-generated clones, dominated by Angular project files (7); project files for Express (3), Angular locales and different Gruntfiles (build files for Node projects) were also present. All of the Angular project files were created with Yeoman, a tool for creating application skeletons with boilerplate code that is also used by the Angular Full Stack Generator. The last pair classified as auto-generated was also the only inter-project clone and consisted of two very similar JSON records in a federal election commission dump stored on GitHub. In total, 95% of the pairs were unintentional or auto-generated.
6 http://doc.qt.io/qt-4.8/moc.html
[Figure 10: file counts over time (9/2006 to 9/2016), broken down into non-test duplicates, test duplicates, unique tests, unique files, and NPM non-test and NPM test files.]
Fig. 10. JavaScript files over time, with and without NPM files.
For Python, the top 3 most reappropriated projects are Cactus, a static site generator using Django templates; Shadowsocks, a fast tunnel proxy that helps bypass firewalls; and SCons, a software construction tool.
Finally, for JavaScript the most reappropriated project is Adobe PhoneGap's Hello World Template7, which has been found intact in a total of 1746 projects. PhoneGap is a framework for building mobile applications using the web stack, and it dominates the most frequently cloned projects: the top 15 most cloned projects are all different versions of its template. PhoneGap is followed by the OctoPress8 blogging framework and by a template for BlueMix.9
These observations show that project reappropriation exists for a variety of reasons: simple reappropriations that could be addressed by Git submodules (e.g. Minecraft API, Arduino), seemingly abandoned derivative development (Cactus, PhoneGap), true forks with the addition of non-source code content (OctoPress) and even unorthodox uses of GitHub (the C++ homework).
6.4 JavaScript
JavaScript has the highest clone ratio of the languages studied. Over 94% of the files are file-hash clones. We wanted to find out what is causing this bloat. After manually inspecting several files, we observed that many projects commit libraries available through NPM as if they were part of the application code.10 As such, we analyzed the data with respect to the effect of NPM libraries, and concluded that this practice is the single biggest cause of the large duplication in JavaScript. What follows are some mostly quantitative perspectives on the effect of NPM libraries, along with some qualitative observations pulled from additional sources. Figure 10 on the left shows the composition of JavaScript repositories over time with respect to unique files, tests and token-hash clones (we considered any file in a test folder to be a test), compared with files and tests coming from unorthodox use of NPM. Figure 10 on the right shows the corpus in the same categories, but without the NPM files, whose number is indicated by the dashed line, which quickly surpasses all other files in the corpus. The huge impact of NPM files can be seen not only in the sheer number of files; Figure 11 shows the percentage of token-hash clones for different subsets
7 http://github.com/phonegap/phonegap-template-hello-world
8 http://octopress.org/
9 https://www.ibm.com/cloud-computing/bluemix/
10 npm is the package manager used by the very popular Node framework.
[Figure 11: percentage of non-unique files over time for all files, non-NPM files, non-NPM tests and NPM files, with the numbers of all files and NPM files shown in the background.]
of the files over time. To help assess influence, the background of the graph shows the numbers of total and NPM files at given times. Few files predate the NPM package manager itself (January 2010). We have found similar outliers in the rest of the files (a small number of them predating not just GitHub and Git, but even JavaScript itself). As soon as NPM files started to appear in the corpus, they took over the global ratio (solid line), while the rest of the files slowly added original content over time. The higher originality of tests is interesting: when people copy and paste entire files, they tend to ignore their tests.
6.4.1 NPM Files. When npm is used in a project, the package.json file contains the description of the project, including its required packages. When the project is built, these packages are loaded and stored in the 'node_modules' directory. If the packages themselves have dependencies, these are stored under the package name in another nested 'node_modules' directory. The 'node_modules' folder will be updated each time the project is built and a new version of some of the packages it transitively requires is available. Therefore it should not be part of the repository itself, a practice GitHub recommends.11 Since NPM allows dependencies to link to specific versions of a package, there is no need to include the 'node_modules' directory even if the application requires a specific package version. Even more surprising than the sheer number of NPM files in the corpus is the number of packages responsible for them. 41.21% (732,991) of the projects use the NPM package manager, but only 6% (106,582) of the projects include their 'node_modules' directory. These 6% of projects are ultimately responsible for almost 70% of all the files. It is therefore not surprising that once a project includes its NPM dependencies, its file count is overwhelmed by the packages' files, as shown in Figure 12 on the left.
There are even projects that seem to contain only NPM files. Often such a project is created using an automated generator which installs various dependencies, pushed to GitHub with the node_modules directory, and never used again. The largest of such projects12 contains only NPM modules used in other projects of the same author and has 46,281 files. If the project is written in
11 https://github.com/github/gitignore/blob/master/Node.gitignore
12 https://github.com/kuangyeheng/workflow-modules
[Figure 12: two histograms (% of projects) of the share of NPM files per project (left) and of the number of directly imported NPM packages per project (right), with mean and median marked.]
Fig. 12. % of NPM files in projects and directly imported NPM packages
a dialect of JavaScript that does not use the .js extension (such as JSX or TypeScript), it would appear that all its files come from NPM. This is the case for the second largest npm-only project,13 which consists of 16,761 JS files from NPM and a handful of .jsx files discovered by manual inspection.
We have also analyzed the depth of nested dependencies in the NPM packages. In the worst case we have observed this nesting to be 47 modules deep, with a median of 5. The number of unique projects included has a median of 63 and maxes out at 1261, but this includes the nested dependencies as well. The number of direct imports, i.e. modules specified as dependencies in the package.json file, is in general much smaller, as shown in Figure 12 on the right. There are, however, outliers which come close to the maximum number of unique projects included. The largest of them was created by the Angular Full Stack Generator,14 an automated service for generating Angular applications.15 Other projects with extraordinarily large direct dependencies are created using similar automated generators, such as Yeoman. In terms of module popularity (Figure 13; note the log scale on the y axis), most modules are imported by a small percentage of projects; however, there are some massively popular ones: Express16 (59277 projects) is a minimalist web UI framework, body parser17 (31807 projects) an HTTP response body parser and debug18 (24413 projects) a debugging utility for Node applications. Surprisingly, many of the NPM packages contain a great deal of tests in them, as shown in Figure 10, which seems unnecessary, as these should be release versions of the packages for users, not packages for developers.
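A sketch of the two NPM measurements described above, nesting depth as the number of node_modules components on a file's path and direct imports as the modules listed in package.json; the example data is made up.

import json

def nesting_depth(path):
    """How many node_modules directories lie on the path of a dependency file."""
    return path.split("/").count("node_modules")

paths = [
    "node_modules/express/index.js",
    "node_modules/express/node_modules/debug/src/node.js",
]
print(max(nesting_depth(p) for p in paths))  # 2

def direct_imports(package_json_text):
    """Modules listed as direct dependencies in a package.json file."""
    return set(json.loads(package_json_text).get("dependencies", {}))

print(direct_imports('{"dependencies": {"express": "^4.0.0", "debug": "*"}}'))
# {'express', 'debug'} (set order may vary)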
7 CONCLUSIONS
The source control system upon which GitHub is built, Git, encourages forking projects and
independent development of those forks. GitHub provides an easy interface for forking a project,
and then for merging code changes back to the original projects. This is a popular feature: the
metadata available from GHTorrent shows an average of 1 fork per project. However, a lot more duplication of code happens in GitHub that does not go through the fork mechanism and, instead, comes in via copy and paste of files and even entire libraries.
13 https://github.com/george-codes/react-skeleton
14 https://github.com/angular-fullstack/generator-angular-fullstack
15 Ironically, the project itself was created to let people "quickly set up a project following best practices".
16 https://www.npmjs.com/package/express
17 https://www.npmjs.com/package/body-parser
18 https://www.npmjs.com/package/debug
[Figure 13: module popularity; the y axis is the number of modules (log scale), with mean and median marked.]
We presented an exhaustive investigation of code cloning in GitHub for four of the most popular object-oriented languages: Java, C++, Python and JavaScript. The amount of file-level duplication is staggering in the four language ecosystems, with the extreme case of JavaScript, where only 6% of the files are original and the rest are copies of those. The Java ecosystem has the least amount of duplication. These results stand even when ignoring very small files. When delving deeper into the data we observed the presence of files from popular libraries that were copy-included in a large number of projects. We also detected cases of reappropriation of entire projects, where developers take over a project without changes. There seemed to be several reasons for this, from abandoned projects to slightly abusive uses of GitHub in educational contexts. Finally, we studied the JavaScript ecosystem, which turns out to be dominated by Node libraries that are committed to the applications' repositories.
This study has some important consequences. First, it would seem that GitHub, itself, might be
able to compress its corpus to a fraction of what it is. Second, more and more research is being done
using large collections of open source projects readily available from GitHub. Code duplication can
severely skew the conclusions of those studies. The assumption of diversity of projects in those
datasets may be compromised. DéjàVu can help researchers and developers navigate through code
cloning in GitHub, and avoid it when necessary.
Table 8. Summary statistics for the minimum set of files (distinct token hashes).
Table 9. Summary statistics for the minimum set of files (distinct file hashes).
ACKNOWLEDGEMENTS
This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation program (grant agreement 695412) and from the United States Defense Advanced Research Projects Agency under the MUSE program, and was partially supported by NSF award 1544542 and ONR award 503353.
REFERENCES
T. F. Bissyande, F. Thung, D. Lo, L. Jiang, and L. Reveillere. 2013. Orion: A Software Project Search Engine with Integrated
Diverse Software Artifacts. In International Conference on Engineering of Complex Computer Systems. https://doi.org/10.
1109/ICECCS.2013.42
Stephen M. Blackburn, Robin Garner, Chris Hoffmann, Asjad M. Khan, Kathryn S. McKinley, Rotem Bentzur, Amer
Diwan, Daniel Feinberg, Daniel Frampton, Samuel Z. Guyer, Martin Hirzel, Antony L. Hosking, Maria Jump, Han Bok
Lee, J. Eliot B. Moss, Aashish Phansalkar, Darko Stefanovic, Thomas VanDrunen, Daniel von Dincklage, and Ben
Wiedermann. 2006. The DaCapo benchmarks: Java benchmarking development and analysis. In Conference on Object
Oriented Programming Systems Languages and Applications (OOPSLA). https://doi.org/10.1145/1167473.1167488
Hudson Borges, André C. Hora, and Marco Tulio Valente. 2016. Understanding the Factors that Impact the Popularity of
GitHub Repositories. (2016). http://arxiv.org/abs/1606.04984
Casey Casalnuovo, Prem Devanbu, Abilio Oliveira, Vladimir Filkov, and Baishakhi Ray. 2015. Assert Use in GitHub Projects.
In International Conference on Software Engineering (ICSE). http://dl.acm.org/citation.cfm?id=2818754.2818846
James R. Cordy, Thomas R. Dean, and Nikita Synytskyy. 2004. Practical Language-independent Detection of Near-miss
Clones. In Conference of the Centre for Advanced Studies on Collaborative Research (CASCON). http://dl.acm.org/citation.
cfm?id=1034914.1034915
V. Cosentino, J. L. C. Izquierdo, and J. Cabot. 2016. Findings from GitHub: Methods, Datasets and Limitations. In Working
Conference on Mining Software Repositories (MSR). https://doi.org/10.1109/MSR.2016.023
John W. Creswell. 2014. Research Design: Qualitative, Quantitative, and Mixed Methods Approaches. SAGE.
Robert Dyer, Hoan Anh Nguyen, Hridesh Rajan, and Tien N. Nguyen. 2013. Boa: A Language and Infrastructure for
Analyzing Ultra-large-scale Software Repositories. In International Conference on Software Engineering (ICSE). http:
//dl.acm.org/citation.cfm?id=2486788.2486844
Jesus M. Gonzalez-Barahona, Gregorio Robles, and Santiago Dueñas. 2010. Collecting Data About FLOSS Development:
The FLOSSMetrics Experience. In International Workshop on Emerging Trends in Free/Libre/Open Source Software Research
and Development (FLOSS). https://doi.org/10.1145/1833272.1833278
Georgios Gousios. 2013. The GHTorrent dataset and tool suite. In Working Conference on Mining Software Repositories
(MSR). https://doi.org/10.1109/MSR.2013.6624034
Lars Heinemann, Florian Deissenboeck, Mario Gleirscher, Benjamin Hummel, and Maximilian Irlbeck. 2011. On the Extent
and Nature of Software Reuse in Open Source Java Projects. Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-21347-2_
16
Felipe Hoffa. 2016. 400,000 GitHub repositories, 1 billion files, 14 terabytes of code: Spaces or Tabs? (2016). https:
//medium.com/@hoffa/400-000-github-repositories-1-billion-files-14-terabytes-of-code-spaces-or-tabs-7cfe0b5dd7fd
Eirini Kalliamvakou, Georgios Gousios, Kelly Blincoe, Leif Singer, Daniel M. German, and Daniela Damian. 2014. The
Promises and Perils of Mining GitHub. In Working Conference on Mining Software Repositories (MSR). https://doi.org/10.
1145/2597073.2597074
Toshihiro Kamiya, Shinji Kusumoto, and Katsuro Inoue. 2002. CCFinder: A Multilinguistic Token-based Code Clone
Detection System for Large Scale Source Code. IEEE Trans. Softw. Eng. 28, 7 (2002). https://doi.org/10.1109/TSE.2002.
1019480
P. S. Kochhar, T. F. Bissyandé, D. Lo, and L. Jiang. 2013. Adoption of Software Testing in Open Source Projects – A
Preliminary Study on 50,000 Projects. In European Conference on Software Maintenance and Reengineering. https:
//doi.org/10.1109/CSMR.2013.48
R. Koschke. 2007. Survey of research on software clones. In Duplication, Redundancy, and Similarity in Software (Dagstuhl
Seminar Proceedings 06301).
A. Mockus. 2007. Large-Scale Code Reuse in Open Source Software. In First International Workshop on Emerging Trends in
FLOSS Research and Development. https://doi.org/10.1109/FLOSS.2007.10
A. Mockus. 2009. Amassing and Indexing a Large Sample of Version Control Systems: Towards the Census of Public Source
Code History. In Working Conference on Mining Software Repositories (MSR). https://doi.org/10.1109/MSR.2009.5069476
Meiyappan Nagappan, Thomas Zimmermann, and Christian Bird. 2013. Diversity in Software Engineering Research. In
Foundations of Software Engineering (FSE). https://doi.org/10.1145/2491411.2491415
J. Ossher, Sushil Bajracharya, E. Linstead, P. Baldi, and Crista Lopes. 2009. SourcererDB: An aggregated repository of
statically analyzed and cross-linked open source Java projects. In Working Conference on Mining Software Repositories
(MSR). https://doi.org/10.1109/MSR.2009.5069501
Joel Ossher, Hitesh Sajnani, and Cristina Lopes. 2011. File Cloning in Open Source Java Projects: The Good, the Bad, and
the Ugly. In International Conference on Software Maintenance (ICSM). https://doi.org/10.1109/ICSM.2011.6080795
Baishakhi Ray, Daryl Posnett, Vladimir Filkov, and Premkumar Devanbu. 2014. A Large Scale Study of Programming
Languages and Code Quality in GitHub. In International Symposium on Foundations of Software Engineering (FSE).
https://doi.org/10.1145/2635868.2635922
Gregor Richards, Andreas Gal, Brendan Eich, and Jan Vitek. 2011. Automated Construction of JavaScript Benchmarks. In
Conference on Object-Oriented Programming Systems, Languages and Applications (OOPSLA). https://doi.org/10.1145/
2048066.2048119
C. K. Roy and J. R. Cordy. 2007. A survey on software clone detection research. Technical Report 541. Queen's University.
Chanchal K. Roy and James R. Cordy. 2009. A Mutation/Injection-Based Automatic Framework for Evaluating Code Clone
Detection Tools. In International Conference on Software Testing, Verification, and Validation. https://doi.org/10.1109/
ICSTW.2009.18
C. K. Roy and J. R. Cordy. 2010. Near-miss Function Clones in Open Source Software: An Empirical Study. J. Softw. Maint.
Evol. 22, 3 (2010). https://doi.org/10.1002/smr.v22:3
Hitesh Sajnani. 2016. Large-Scale Code Clone Detection. Ph.D. Dissertation. University of California, Irvine.
Hitesh Sajnani, Vaibhav Saini, Jeffrey Svajlenko, Chanchal K. Roy, and Cristina V. Lopes. 2016. SourcererCC: Scaling Code
Clone Detection to Big-code. In International Conference on Software Engineering (ICSE). https://doi.org/10.1145/2884781.
2884877
Johnny Saldaña. 2009. The Coding Manual for Qualitative Researchers. SAGE.
SPEC. 1998. SPECjvm98 benchmarks. (1998).
J. Svajlenko and C. K. Roy. 2015. Evaluating clone detection tools with BigCloneBench. In International Conference on
Software Maintenance and Evolution (ICSME). https://doi.org/10.1109/ICSM.2015.7332459
Christopher Vendome, Gabriele Bavota, Massimiliano Di Penta, Mario Linares-Vásquez, Daniel German, and Denys
Poshyvanyk. 2016. License usage and changes: a large-scale study on GitHub. Empirical Software Engineering (2016).
https://doi.org/10.1007/s10664-016-9438-4