This paper is a post-print of a paper accepted at the “International Conference on Future Internet of
Things and Cloud (FiCloud), 2016”.
The final version of this paper is available through IEEE Xplore at the following link:
http://ieeexplore.ieee.org/document/7592719/
J. Morán, B. Rivas, C. De La Riva, J. Tuya, I. Caballero and M. Serrano, "Infrastructure-Aware
Functional Testing of MapReduce Programs," 2016 IEEE 4th International Conference on Future
Internet of Things and Cloud Workshops (FiCloudW), Vienna, 2016, pp. 171-176. doi:
10.1109/W-FiCloud.2016.45
IEEE copyright notice. © 2016 IEEE. Personal use of this material is permitted. Permission from
IEEE must be obtained for all other uses, in any current or future media, including
reprinting/republishing this material for advertising or promotional purposes, creating new
collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted
component of this work in other works
Infrastructure-Aware Functional Testing of MapReduce programs

Jesús Morán
Department of Computing
University of Oviedo
Gijón, Spain
moranjesus@lsi.uniovi.es

Bibiano Rivas
Institute of Technology and Information Systems
University of Castilla-La Mancha
Ciudad Real, Spain
Bibiano.Rivas@uclm.es

Claudio de la Riva, Javier Tuya
Department of Computing
University of Oviedo
Gijón, Spain
{claudio, tuya}@uniovi.es

Ismael Caballero, Manuel Serrano
Institute of Technology and Information Systems
University of Castilla-La Mancha
Ciudad Real, Spain
{Ismael.Caballero, Manuel.Serrano}@uclm.es

Abstract—Programs that process a large volume of data
generally run in a distributed and parallel architecture, such as
the programs implemented in the processing model MapReduce.
In these programs, developers can abstract the infrastructure
where the program will run and focus on the functional issues.
However, the infrastructure configuration and its state cause
different parallel executions of the program, and some of them
could lead to functional faults that are hard to reveal. In general,
the infrastructure that executes the program is not considered
during testing, because the tests usually contain little input
data and parallelization is therefore not necessary. In this paper a
testing technique is proposed to generate different infrastructure
configurations for given test input data, and then the program
is executed in these configurations in order to reveal functional
faults. This testing technique is automated by means of a test
engine and applied in a case study. As a result, several
infrastructure configurations are automatically generated and
executed for a test case, revealing a functional fault that is then
fixed by the developer.
Keywords— Software testing, MapReduce programs, Big Data
Engineering, Hadoop
I. INTRODUCTION
The new trends in massive data processing have brought to
light several technologies and processing models in the field
called Big Data Engineering [1]. Among them, MapReduce [2]
can be highlighted as it permits the analysis of large data based
on the “divide and conquer” principle. These programs run two
phases in a distributed infrastructure: the Mapper phase divides
the problem into several subproblems, and then the Reducer
phase solves each subproblem. Usually, MapReduce programs
run on several computers with heterogeneous resources and
features. This complex infrastructure is managed by a
framework, such as Hadoop [3] which stands out due to its
wide use in the industry [4].
From the developer's point of view, a MapReduce program
can be implemented with only a Mapper and a Reducer, without
any consideration of the infrastructure. The framework
that manages the infrastructure is then responsible for
automatically deploying and running the program over several
computers and for directing the data processing between the input and
output. Among other things, the framework divides the input into
several subsets of data, then processes each one in parallel and
re-runs some parts of the program if necessary.
Although the program can be implemented while abstracting away
the infrastructure, the developer needs to consider
how the infrastructure configuration could affect the program
functionality. A previous work [5] detects and classifies several
faults that depend on how the infrastructure configuration
affects the program execution and produces different output.
These faults are often masked during test execution because
the tests usually run over one infrastructure configuration
without considering the different situations that could occur in
production, such as different levels of parallelism or
infrastructure failures [6]. On the other hand, if the tests are
executed in an environment similar to production, some
faults may not be detected because the test
inputs commonly contain little data, which means that Hadoop does not
parallelize the program execution. There are some tools that
enable the simulation of some of these situations (for example,
computer and network failures) [7, 8, 9], but it is difficult to design,
generate and execute the tests in a deterministic way because
many elements need fine-grained simulation,
including the infrastructure and the framework.
The main contribution of this paper is a technique that can
be used to generate automatically the different infrastructure
configurations for a MapReduce application. The goal is to
execute test cases with these configurations in order to reveal
functional faults. Given a test input data, the configurations are
obtained based on the different executions that can happen in
production. Then each one of the configurations is executed in
the test environment in order to detect functional faults of the
program that may occur in production. The contributions of
this work are:
1. A combinatorial technique to generate the different
infrastructure configurations, taking into account
characteristics related to the MapReduce processing and
the test input data.
2. Automatic support by means of a test engine based on
MRUnit [10] that allows the execution of the
infrastructure configurations, together with the
evaluation to detect failures.
The rest of the paper is organized as follows. Section II introduces the
principles of the MapReduce paradigm. The
generation of the different configurations, their execution and the
automation of the tests are defined in Section III. In Section
IV the technique is applied to a case study. Section V presents the related work
on software testing in the MapReduce paradigm.
The paper ends with conclusions and future work in Section
VI.
II. MAPREDUCE PARADIGM
A MapReduce program processes large quantities of data
in a distributed infrastructure. The developer implements two
functionalities: a Mapper task that splits the problem into several
subproblems and a Reducer task that solves these subproblems.
The final output is obtained from the deployment and
execution over a distributed infrastructure of several instances
of Mapper and Reducer, also called tasks. The deployment and
execution are automatically carried out by Hadoop or another
framework. First, several Mapper tasks each analyse in parallel a
subset of the input data and determine which subproblems need
these data. When the execution of all Mappers has finished,
several Reducers are also executed in parallel in order to solve
the subproblems. Internally MapReduce handles <key, value>
pairs, where the key is the subproblem identifier and the value
contains the information to solve it.
To illustrate MapReduce let us suppose a program that
computes the average temperature per year from historical data
about temperatures. This program solves one subproblem for
each year, so the identifier or key is the year. The Mapper task
receives a subset of temperature data and emits <year,
temperature of this year> pairs. Then Hadoop aggregates all
values per key. Therefore, the Reducer tasks receive
subproblems like <year, [all temperatures of this year]>, that is
all temperatures grouped per year. Finally, the Reducer
calculates the average temperature. For example, in Fig. 1 an
execution of the program considering the input is detailed: year
2000 with 3º, 2002 with 4º, 2000 with 1º, and 2001 with 5º.
The first two inputs are analysed in one Mapper task and the
remainder in another task. Then the temperatures are grouped
per year and sent to the Reducer tasks. The first Reducer
receives all the temperatures for the years 2000 and 2002, and
the other task for the year 2001. Finally, each Reducer emits
the average temperature of the analysed subproblems: 2º in the
year 2000, 4º in 2002 and 5º in 2001. This program with the
same input could be executed in another way by the
framework, for example with three Mappers and three
Reducers. Regardless of how the framework runs the program,
it should generate the expected output.
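The example above can be mimicked in a few lines of Python. This is only an illustrative sketch of the processing model, not Hadoop code: the function names (`mapper`, `reducer`, `run_job`) and the representation of the input as lists of (year, temperature) tuples are assumptions made for the illustration.

```python
from collections import defaultdict

def mapper(records):
    """Emit one <year, temperature> pair per input record."""
    return [(year, temp) for year, temp in records]

def reducer(year, temps):
    """Solve one subproblem: the average temperature of a year."""
    return year, sum(temps) / len(temps)

def run_job(splits):
    """Run one Mapper per split, group the values by key, then reduce each key."""
    grouped = defaultdict(list)
    for split in splits:
        for year, temp in mapper(split):
            grouped[year].append(temp)
    return dict(reducer(year, temps) for year, temps in grouped.items())

data = [(2000, 3), (2002, 4), (2000, 1), (2001, 5)]

# Two Mapper tasks, as in Fig. 1: the first two records and the remainder.
two_mappers = run_job([data[:2], data[2:]])
# A different execution of the same input: one Mapper task per record.
one_per_record = run_job([[record] for record in data])

print(two_mappers)      # {2000: 2.0, 2002: 4.0, 2001: 5.0}
print(one_per_record)   # {2000: 2.0, 2002: 4.0, 2001: 5.0}
```

Both executions give the same averages, which is the behaviour a correct program should show regardless of how the input is divided.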
Fig. 1. Program that calculates the average temperature per year
[Figure: one Mapper task receives <2000, 3º> and <2002, 4º>, and the other receives <2000, 1º> and <2001, 5º>. One Reducer task receives <2000, [3º, 1º]> and <2002, [4º]> and emits <2000, 2º> and <2002, 4º>; the other receives <2001, [5º]> and emits <2001, 5º>.]

Additionally, to optimize the program, a Combiner
functionality can be implemented. This task is run after the
Mapper and the goal is to remove the <key, value>
pairs that are irrelevant to solving the subproblem. In MapReduce there are also
other functionalities, such as Partitioner, which
decides for each <key, value> pair which Reducer analyses it,
Sort, which sorts the <key, value> pairs, and Group, which
aggregates the values of each key before the Reducer.
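As a rough illustration of Partitioner, Sort and Group, the following Python sketch routes the Mapper output of Fig. 1 to two Reducers. The rule `key % num_reducers` is only a stand-in chosen for this example, and the names `partition` and `shuffle` are hypothetical; a real job would use the Partitioner configured in the framework.

```python
from collections import defaultdict

def partition(key, num_reducers):
    """Hypothetical Partitioner: choose the Reducer that analyses this key."""
    return key % num_reducers

def shuffle(pairs, num_reducers):
    """Route each pair to its Reducer, then sort by key and group the values."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)][key].append(value)
    return [sorted(bucket.items()) for bucket in buckets]

mapper_output = [(2000, 3), (2002, 4), (2000, 1), (2001, 5)]
reducer_inputs = shuffle(mapper_output, num_reducers=2)

print(reducer_inputs[0])  # [(2000, [3, 1]), (2002, [4])]
print(reducer_inputs[1])  # [(2001, [5])]
```

With this particular rule the years 2000 and 2002 reach one Reducer and 2001 the other, which matches the distribution shown in Fig. 1.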
An incorrect implementation of these functionalities could
cause a failure in one of the different ways in which Hadoop
can run the program. These faults are difficult to detect during
testing because the test cases usually contain little input data. In
that case it is not necessary to split the inputs and therefore the
program is executed with one Mapper, one Combiner and one Reducer
[2].
III. GENERATION AND EXECUTION OF TESTS
The generation of the infrastructure configurations for the
tests is defined in Section A, and a framework to execute the
tests in Section B.
A. Generation of the test scenarios
To illustrate how the infrastructure configuration affects the
program output, suppose that the example of Section II is
extended with a Combiner in order to decrease the data and
improve the performance. The Combiner receives several
temperatures and then they are replaced by their average in the
Combiner output. In this case, the program does not admit a
Combiner because all the temperatures are needed to obtain the
total average temperature. The error of adding the Combiner in
order to optimize the program injects a functional fault in the
program. Fig. 2 represents three possible executions of this
program that could occur in production considering the
different infrastructure configurations and the same input (year
1999 with temperatures 4º, 2º and 3º).
Fig. 2. Different infrastructure configurations for a program that calculates the average temperature per year with Combiner task
[Figure: the same input (<1999, 4º>, <1999, 2º>, <1999, 3º>) is executed in three scenarios. With one Mapper, the Combiner receives <1999, [4º, 2º, 3º]>, and the Reducer receives <1999, [3º]> and emits <1999, 3º>. With two Mappers whose Combiners receive <1999, [4º, 2º]> and <1999, [3º]>, the Reducer receives <1999, [3º, 3º]> and emits <1999, 3º>. With two Mappers whose Combiners receive <1999, [4º]> and <1999, [2º, 3º]>, the Reducer receives <1999, [4º, 2.5º]> and emits <1999, 3.25º>: a different scenario with a different output.]

Fig. 3. a) General framework of test execution
[Flowchart of the automatic test execution, with steps numbered (1) to (10). Elements: test case (input data and optional expected output), ideal scenario, run scenario, ideal output, generation of new scenario, output, and the decisions "Are equal?" and "Are all scenarios tested?".]

The first configuration consists of one Mapper, one
Combiner and one Reducer, and produces the expected output.
The second configuration also generates the expected output,
executing one Mapper that processes the temperatures 4º and
2º, another Mapper for 3º, two Combiners, and finally one
Reducer. The third configuration also executes two Mappers,
two Combiners and one Reducer, but produces an unexpected
output because the first Mapper processes 4º and the second
Mapper the temperatures 2º and 3º. Then one of the Combiner
tasks calculates the average of 4º, and the other Combiner that of 2º
and 3º. The Reducer receives the previous averages (4º and
2.5º), and calculates the total average for the year. This
configuration produces 3.25º as output instead of the expected 3º.
The program has a functional fault that is only
detected in the third configuration. The failure is produced
whenever this infrastructure configuration is executed,
regardless of computer failures, a slow network or other circumstances. This
fault is difficult to reveal because the test case needs to be
executed in the infrastructure configuration that detects it, and in
a completely controlled way.
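The three executions of Fig. 2 can be reproduced with a small Python sketch. It is a simplified model, assuming a single year so that keys can be omitted; the function names `average` and `run_with_combiner` are hypothetical.

```python
def average(values):
    return sum(values) / len(values)

def run_with_combiner(mapper_splits):
    """Each split holds the temperatures that one Mapper processes for the year.

    The faulty Combiner replaces each Mapper output by its average, and the
    single Reducer averages whatever the Combiners emit.
    """
    combined = [average(split) for split in mapper_splits]
    return average(combined)

first = run_with_combiner([[4, 2, 3]])     # one Mapper, one Combiner
second = run_with_combiner([[4, 2], [3]])  # two Mappers: (4º, 2º) and (3º)
third = run_with_combiner([[4], [2, 3]])   # two Mappers: (4º) and (2º, 3º)

print(first, second, third)  # 3.0 3.0 3.25
```

The second configuration masks the fault only because the average of 4º and 2º happens to equal the remaining value, so the average of averages coincides with the true average.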
Given a test input data, the goal is to generate the different
infrastructure configurations, also called in this context
scenarios. For this purpose, the technique proposed considers
how the MapReduce program can execute these input data in
production. First, the program runs the Mappers, then over
their outputs the Combiners and finally the Reducers. The
execution can be carried out over a different number of
computers and therefore the Mapper-Combiner-Reducer can
analyse a different subset of data in each execution. In order to
generate each one of the scenarios, a combinatorial technique
[11] is proposed to combine the values of the different
parameters that can modify the execution of the MapReduce
program. In this work the following parameters are considered
based on previous work [5] that classifies different types of
faults of the MapReduce applications:
• Mapper parameters: (1) Number of Mapper tasks, (2)
Inputs processed per each Mapper, and (3) Data
processing order of the inputs, that is, which data are
processed before other data in the Mapper and which
data are processed after.
• Combiner parameters for each Mapper output: (1)
Number of Combiner tasks, and (2) Inputs processed
per each Combiner.
• Reducer parameters: (1) Number of Reducer tasks, and
(2) Inputs processed per each Reducer.
The different scenarios are obtained through the combination
of all the values that the above parameters can take, applying
the constraints imposed by the sequential execution of
MapReduce. The constraints considered in this paper are the
following:
1. The values/combinations of the Mapper parameters
depend on the input data because there cannot be more
tasks than data items. For example, if there are three data
items in the input, the maximum number of Mappers is
three.
2. The values/combinations of the Combiner parameters
depend on the output of the Mapper tasks.
3. The values/combinations of the Reducer parameters
depend on the output of the Mapper-Combiner tasks
and on another functionality executed by Hadoop before the
Reducer tasks. This other functionality is called Shuffle:
for each <key, value> pair it determines the Reducer
task that requires these data, then sorts all the data and
aggregates them by key.

Input: Test case with:
    input data
    expected output (optional)
Output: scenario that reveals a fault
(0) /* Generation of scenarios (Section A) */
(1) Scenarios ← Generate scenarios from input data
(2) /* Execution of scenarios */
(3) ideal scenario output ← Execution of ideal scenario
(4) ∀ scenario ∈ Scenarios:
(5)     scenario output ← Execution of scenario
(6)     IF scenario output <> ideal scenario output:
(7)         RETURN scenario with fault
(8) IF ideal scenario output <> expected output:
(9)     RETURN ideal scenario
(10) ELSE:
(11)     RETURN Zero faults detected
Fig. 3. b) Algorithm for test generation and execution of test scenarios
To illustrate how the parameters are combined and how the
constraints are applied, consider the program of Fig. 2. The
input of this program contains three data items, and these data
constrain the values that the Mapper parameters can take
because the maximum number of Mapper tasks is three (one
Mapper per each <key, value> pair). The first scenario is
generated with one Mapper, one Combiner and one Reducer.
For the second scenario the parameter “Number of Mapper
tasks” is modified to 2, where the first Mapper analyses two
<key, value> pairs, and the second processes one pair. The
third scenario maintains the parameter “Number of Mapper
tasks” at 2, but modifies the parameter “Inputs processed per
each Mapper”, so the first Mapper analyses one <key, value>
pair and the other Mapper processes two pairs. The scenarios
are generated by the modification of the values in the
parameters in this way and considering the constraints.
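A minimal sketch of this kind of generation is given below. It is a simplification and not the combinatorial technique of [11]: it varies only the parameters "Number of Mapper tasks" and "Inputs processed per each Mapper", keeps the input order fixed, and enumerates every split exhaustively. The function name `mapper_splits` is hypothetical.

```python
from itertools import combinations

def mapper_splits(items):
    """Yield every way of dividing `items`, in order, into consecutive
    non-empty groups (one group per Mapper task)."""
    n = len(items)
    for k in range(n):  # k cut points give k + 1 Mapper tasks
        for cuts in combinations(range(1, n), k):
            bounds = (0, *cuts, n)
            yield [items[a:b] for a, b in zip(bounds, bounds[1:])]

scenarios = list(mapper_splits([4, 2, 3]))
for scenario in scenarios:
    print(scenario)
# [[4, 2, 3]]
# [[4], [2, 3]]
# [[4, 2], [3]]
# [[4], [2], [3]]
```

The constraint that there cannot be more Mapper tasks than data items is respected by construction, since every group is non-empty. An input with n items produces 2^(n-1) such splits, which suggests why an exhaustive enumeration is only practical for small test inputs.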
B. Execution of the test scenarios
The previous section proposes a technique to generate
scenarios that represent different infrastructure configurations
according to the characteristics of the MapReduce processing.
Fig. 3 describes a framework to execute systematically the tests
with the scenarios generated by the technique of the previous
section.
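As a rough illustration of such a framework, the following Python sketch runs the ideal scenario first and then compares every other scenario against it. It is not the MRUnit-based test engine: the names `execute_test_case` and `faulty_average` are hypothetical, scenarios are modelled simply as lists of Mapper splits, and exact equality stands in for the equivalence check between outputs.

```python
def execute_test_case(program, scenarios, expected=None):
    """Return (verdict, scenario). `scenarios[0]` is taken as the ideal scenario."""
    ideal = scenarios[0]
    ideal_output = program(ideal)
    for scenario in scenarios[1:]:
        if program(scenario) != ideal_output:
            return "fault", scenario       # outputs differ between scenarios
    if expected is not None and ideal_output != expected:
        return "fault", ideal              # ideal output differs from expected
    return "no fault detected", None

def faulty_average(splits):
    """Average-temperature job with the faulty Combiner of Section III.A."""
    partial = [sum(split) / len(split) for split in splits]
    return sum(partial) / len(partial)

scenarios = [[[4, 2, 3]], [[4, 2], [3]], [[4], [2, 3]]]
verdict, failing = execute_test_case(faulty_average, scenarios)
print(verdict, failing)  # fault [[4], [2, 3]]
```

Note that no expected output is passed in this call: the fault is reported only because one scenario disagrees with the ideal scenario.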
The framework takes as input a test case that contains the
input data and optionally the expected output. The test input
data can be obtained with a generic testing technique or one
specifically designed for MapReduce, such as MRFlow [12].
Then, the ideal scenario is generated (1) and executed (2, 3).
This is the scenario formed by one Mapper, one Combiner and
one Reducer, which is the usual configuration executed in
testing. Next, new scenarios are iteratively generated (4) and
executed (5) through the technique of the previous section. The
output of each scenario is checked against the output of the
ideal scenario (6), revealing a fault if the outputs are not
equivalent (7). Finally, if the test case contains the expected
output, the output of the ideal scenario is also checked against the
expected output (8), detecting a fault when both are not
equivalent (9, 10).
Given a test case, the scenarios are generated according to
the previous section, then they are iteratively executed and
evaluated following the pseudocode of Fig. 3. For example,
Fig. 2 contains the generation and execution of a program that
calculates the average temperature per year in three scenarios
considering the same test input: year 1999 with temperatures
4º, 2º and 3º. The first execution is the ideal scenario with one
Mapper, one Combiner and one Reducer, which produces 3º as
output. Then the second scenario is executed and also produces
3º. Finally, a third scenario is executed and produces 3.25º as
output; this temperature is not equivalent to the 3º of the ideal
scenario output. Consequently, a functional fault is revealed
without any knowledge of the expected output of the test case.
This approach is automated by means of a test engine
based on the MRUnit library [10]. This library is used to execute
each scenario. In MRUnit the test cases are executed in the
ideal scenario, but this library is extended to generate other
scenarios and enable parallelism, supporting the execution of
several Mapper, Combiner and Reducer tasks.
IV. CASE STUDY
In order to evaluate the proposed approach, we use as a case
study the MapReduce program described in the I8K|DQ-BigData
framework [13]. This program measures the quality of the data
exchanged between organizations according to part 140 of
ISO/TS 8000 [14]. The program receives (1) the data
exchanged in a row-column fashion, together with (2) a set of
mandatory columns that should contain data and (3) a
percentage threshold that divides the data quality of each row
into two parts: the first part is maximum if all mandatory
columns contain data and zero otherwise, and the second part
of the data quality is calculated as the percentage of the
non-mandatory columns that contain data. The output of the
program is the data quality of each row, and the average of all
rows.
Over the previous program, a test case is obtained using a
specific MapReduce testing technique based on data flow [5].
The test input data and the expected output of the test case
contain two rows represented in Table I. Row 1 contains two
columns (Name and City), and only one column has data, so
the data quality is 50%. Row 2 contains data in all columns, so
the data quality is 100%. The total quality is 75%, which is the
average of both rows.

TABLE I. TEST CASE OF THE I8K|DQ-BIGDATA PROGRAM
Input                                     Expected output
Data quality threshold: 50%
Mandatory columns: “Name”
Row 1   Name: Alice, City: (no data)      50%
Row 2   Name: Bob, City: Vienna           100%
                                          75% (average)

The procedure described in Section III is applied to the
previous program using the previous test case as input. As a
result, a fault is detected and reported to the developer. This
failure occurs when the rows are processed in different
Mappers and only the first Mapper receives the information
related to the mandatory columns and the data quality
threshold, because Hadoop splits the input data into several
subsets. Without this information, the Mapper cannot calculate
the data quality and does not emit any output. The bottom of
Fig. 4 represents the scenario that produces the failure. There
are two Mappers that process different rows. The first Mapper
receives the data quality threshold (value of 50%), the
mandatory column (“Name”) and the two columns of row 1
with data in only one column, so the Mapper emits 50% as the data
quality of row 1. The second Mapper processes only row 2, but
receives no information about the mandatory columns or the data
quality threshold, so this Mapper cannot emit any output. Then
the Reducer receives only the data quality of row 1 and emits
an incorrect output for the average data quality.
This fault is difficult to detect because it requires the parallel
and controlled execution of the program. Moreover, this fault is
not revealed by the execution of the test case in the following
environments: (a) a Hadoop cluster in production with 4
computers, (b) Hadoop in local mode (a simple version of Hadoop
with one computer), and (c) the MRUnit unit testing library. These
environments do not detect the fault because they only execute
one scenario that masks the fault. Normally these
environments run the program in the ideal scenario that is
formed by one Mapper, one Combiner and one Reducer, and
then the fault is masked due to a lack of parallelism.
The test engine proposed in this paper executes the test case
in the different scenarios that can occur in production with
large data and infrastructure failures. In contrast with the other
environments, the test engine proposed does not need the
expected output to detect faults. For example, in this case study
the fault is revealed automatically because the outputs of the
different scenarios are not equivalent to each other. The
execution of some scenarios obtains an average quality of
75%, whereas the execution of other scenarios obtains 50%.
These outputs are not equivalent, and the test engine automatically detects
a fault even though the expected output is unknown.

Fig. 4. Execution of the test case in different scenarios
[Figure: the same input (threshold 50%, mandatory column Name, Row 1 and Row 2) is executed in two ways. In MRUnit or a real environment, one Mapper emits 50% and 100%, and the Reducer outputs Row 1: 50%, Row 2: 100%, Avg: 75%. In one scenario of the test engine, one Mapper receives the threshold, the mandatory column and Row 1 and emits 50%, while the other Mapper receives only Row 2 and emits no output due to the lack of threshold and mandatory columns; the Reducer outputs Row 1: 50%, Avg: 50%. Different scenario, different output.]

After the detection and report of the fault during the test
phase, the developer fixed the program and then the test case
passed.
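The fault found in this case study can be reproduced in miniature with the following Python sketch. It is not the I8K|DQ-BigData code: the scoring in `row_quality` is one possible reading of the description above (the threshold as the first part, the remaining percentage scaled by the filled non-mandatory columns), and all names and the representation of a split as a (header, rows) pair are assumptions made for the illustration.

```python
HEADER = {"threshold": 50, "mandatory": ["Name"]}
ROWS = [("Row 1", {"Name": "Alice", "City": None}),
        ("Row 2", {"Name": "Bob", "City": "Vienna"})]

def row_quality(row, mandatory, threshold):
    """Assumed scoring: `threshold` points if every mandatory column has data,
    plus the remaining points scaled by the filled non-mandatory columns."""
    first = threshold if all(row[c] is not None for c in mandatory) else 0
    optional = [c for c in row if c not in mandatory]
    filled = sum(row[c] is not None for c in optional)
    second = (100 - threshold) * filled / len(optional) if optional else 0
    return first + second

def mapper(header, rows):
    """Faulty Mapper: it emits nothing when its split lacks the header."""
    if header is None:
        return []
    return [(name, row_quality(row, header["mandatory"], header["threshold"]))
            for name, row in rows]

def reducer(pairs):
    """Average data quality over the rows that reached the Reducer."""
    return sum(quality for _, quality in pairs) / len(pairs)

def run(splits):
    pairs = [pair for header, rows in splits for pair in mapper(header, rows)]
    return reducer(pairs)

ideal = run([(HEADER, ROWS)])                           # one Mapper
parallel = run([(HEADER, ROWS[:1]), (None, ROWS[1:])])  # header only in split 1

print(ideal, parallel)  # 75.0 50.0
```

The two outputs differ, so comparing the parallel scenario against the ideal one reveals the fault without consulting the expected output of Table I.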
V. RELATED WORK
Despite the testing challenges of Big Data applications
[15, 16] and the progress in testing techniques [17], little
effort is focused on testing MapReduce programs [18], one
of the principal paradigms of Big Data [19]. A study by
Kavulya et al. [20] analyses several MapReduce programs and finds that
3% of them do not finish, while another study by Ren et al.
[21] places the number between 1.38% and 33.11%.
Many of the works on testing MapReduce
programs focus on performance and, to a lesser degree, on
functionality. A testing approach for Big Data is proposed by
Gudipati et al. [22], specifying several processes, one of which
is about MapReduce validation. In this process Camargo et al.
[23] and Morán et al. [5] identify and classify several
functional faults. Some of these faults are specific to the
MapReduce paradigm and they are not easy to detect because
they depend on the program execution over the infrastructure.
One common type of fault is produced when the data should
reach the Reducer in a specific order, but the parallel execution
causes these data to arrive out of order. This fault was analysed
by Csallner et al. [24] and Chen et al. [25] using
testing techniques based on symbolic execution and model checking.
In contrast to the previous works, the approach of this paper is
not focused on the detection of only one type of fault; it can
also detect other MapReduce-specific faults. To do this, the test
input data is executed over different infrastructure
configurations that could lead to failures.
Several research lines suggest injecting infrastructure
failures [26, 27] during testing, and several tools support
their injection [7, 8, 9]. For example, the work by Marynowski
et al. [28] allows the creation of test cases specifying which
computers fail and when. One possible problem is that some
MapReduce-specific faults may not be revealed by injecting
infrastructure failures, but require full control of Hadoop and
the infrastructure. In this paper, the different ways in which
Hadoop could run the program are automatically generated
from the functional point of view, regardless of
infrastructure failures and Hadoop optimizations.
Furthermore, there are other approaches oriented to obtain
the test input data of MapReduce programs, such as [12] that
employs data flow testing and [29] based on a bacteriological
algorithm. In this paper, given a test input data, several
configurations are generated and then executed in order to
reveal functional faults. The test input data could be obtained
with the previous testing techniques.
The functional tests can be executed directly in the
production cluster or on one computer with Hadoop. Herriot
[30] can be used to execute the tests in a cluster while
providing access to its components and supporting, among
other things, the injection of faults. Another option is to simulate a
cluster in memory with the MiniClusters libraries [31]. For
unit testing, JUnit [32] could be used together with mock tools,
or the MRUnit library [10], which is adapted to the MapReduce
paradigm, could be used directly. These test engines only execute one infrastructure
configuration and usually without parallelization. In this paper
a test engine is implemented as an MRUnit extension that
automatically generates and executes the different
infrastructure configurations that could occur in production.
VI. CONCLUSIONS
A testing technique for MapReduce programs is
introduced and automated in this paper as a test engine that
reproduces the different infrastructure configurations for a
given test case. Automatically and without an expected output,
the test engine can detect functional faults specific to the
MapReduce paradigm that are in general difficult to detect in
the test/production environments. This approach is applied to a
real program using a test case with little data. As a result, a
functional fault is revealed, allowing the developer to fix the
program.
In order to improve the generation of the infrastructure
configurations, as future work we plan to extend the
technique to select efficiently the configurations that are more
likely to detect faults. The current approach is off-line because
the tests are not carried out when the program is in production.
We also plan to extend the approach to on-line
testing, in order to monitor the functionality with the real data
when the program is executed in production and detect the
faults automatically.
ACKNOWLEDGMENTS
This work was supported in part by project TIN2013-46928-C3-1-R, funded by the Spanish Ministry of Science and
Technology, and GRUPIN14-007, funded by the Principality
of Asturias (Spain) and ERDF funds and Vice President for
Research and Science Policy with BIN1637 INITIATION
SCHOLARSHIP.
REFERENCES
[1] ISO/IEC JTC 1 – Big Data, preliminary report 2014, ISO/IEC Std., 2015.
[2] J. Dean and S. Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” in Proc. of the OSDI - Symp. on Operating Systems Design and Implementation. USENIX, 2004, pp. 137–149.
[3] “Apache hadoop: open-source software for reliable, scalable, distributed computing,” https://hadoop.apache.org, accessed: 2016-04-16.
[4] “Institutions that are using apache hadoop for educational or production uses,” http://wiki.apache.org/hadoop/PoweredBy, accessed: 2016-04-16.
[5] J. Morán, C. de la Riva, and J. Tuya, “MRTree: Functional Testing Based on MapReduce’s Execution Behaviour,” in Future Internet of Things and Cloud (FiCloud), 2014 International Conference on, 2014, pp. 379–384.
[6] K. V. Vishwanath and N. Nagappan, “Characterizing cloud computing hardware reliability,” in Proceedings of the 1st ACM symposium on Cloud computing. ACM, 2010, pp. 193–204.
[7] “Anarchyape: Fault injection tool for hadoop cluster from yahoo anarchyape,” https://github.com/david78k/anarchyape, accessed: 2016-04-16.
[8] “Chaos monkey,” https://github.com/Netflix/SimianArmy/wiki/ChaosMonkey, accessed: 2016-04-16.
[9] “Hadoop injection framework,” https://hadoop.apache.org, accessed: 2016-04-16.
[10] “Apache mrunit: Java library that helps developers unit test apache hadoop map reduce jobs,” http://mrunit.apache.org, accessed: 2016-04-16.
[11] M. Grindal, J. Offutt, and S. F. Andler, “Combination testing strategies: a survey,” Software Testing, Verification and Reliability, vol. 15, no. 3, pp. 167–199, 2005.
[12] J. Morán, C. de la Riva, and J. Tuya, “Testing Data Transformations in MapReduce Programs,” in Proceedings of the 6th International Workshop on Automating Test Case Design, Selection and Evaluation, ser. A-TEST 2015. New York, NY, USA: ACM, 2015, pp. 20–25.
[13] B. Rivas, J. Merino, M. Serrano, I. Caballero, and M. Piattini, “I8K|DQ-BigData: I8K architecture extension for data quality in big data,” in Advances in Conceptual Modeling. Springer, 2015, pp. 164–172.
[14] ISO/TS 8000-140, Data quality - Part 140: Master data: Exchange of characteristic data: Completeness, ISO/TS Std., 2009.
[15] S. Nachiyappan and S. Justus, “Getting ready for bigdata testing: A practitioner’s perception,” in Computing, Communications and Networking Technologies (ICCCNT), 2013 Fourth International Conference on. IEEE, 2013, pp. 1–5.
[16] A. Mittal, “Trustworthiness of big data,” International Journal of Computer Applications, vol. 80, no. 9, 2013.
[17] A. Bertolino, “Software testing research: Achievements, challenges, dreams,” in Future of Software Engineering, 2007. FOSE ’07, 2007, pp. 85–103.
[18] L. C. Camargo and S. R. Vergilio, “Mapreduce program testing: a systematic mapping study,” in Chilean Computer Science Society (SCCC), 32nd International Conference of the Computation, 2013.
[19] M. Sharma, N. Hasteer, A. Tuli, and A. Bansal, “Investigating the inclinations of research and practices in hadoop: A systematic review,” in Confluence The Next Generation Information Technology Summit (Confluence), 2014 5th International Conference.
[20] S. Kavulya, J. Tan, R. Gandhi, and P. Narasimhan, “An analysis of traces from a production mapreduce cluster,” in Cluster, Cloud and Grid Computing (CCGrid), 2010 10th IEEE/ACM International Conference on. IEEE, 2010, pp. 94–103.
[21] K. Ren, Y. Kwon, M. Balazinska, and B. Howe, “Hadoop’s adolescence: an analysis of hadoop usage in scientific workloads,” Proceedings of the VLDB Endowment, vol. 6, no. 10, pp. 853–864, 2013.
[22] M. Gudipati, S. Rao, N. D. Mohan, and N. K. Gajja, “Big data: Testing approach to overcome quality challenges,” Big Data: Challenges and Opportunities, pp. 65–72, 2013.
[23] L. C. Camargo and S. R. Vergilio, “Classificação de defeitos para programas mapreduce: resultados de um estudo empírico,” in SAST - 7th Brazilian Workshop on Systematic and Automated Software Testing, 2013.
[24] C. Csallner, L. Fegaras, and C. Li, “New ideas track: testing mapreduce-style programs,” in Proceedings of the 19th ACM SIGSOFT symposium and the 13th European conference on Foundations of software engineering. ACM, 2011, pp. 504–507.
[25] Y.-F. Chen, C.-D. Hong, N. Sinha, and B.-Y. Wang, “Commutativity of reducers,” in Tools and Algorithms for the Construction and Analysis of Systems. Springer, 2015, pp. 131–146.
[26] F. Faghri, S. Bazarbayev, M. Overholt, R. Farivar, R. H. Campbell, and W. H. Sanders, “Failure scenario as a service (fsaas) for hadoop clusters,” in Proceedings of the Workshop on Secure and Dependable Middleware for Cloud Monitoring and Management. ACM, 2012, p. 5.
[27] P. Joshi, H. S. Gunawi, and K. Sen, “Prefail: A programmable tool for multiple-failure injection,” in ACM SIGPLAN Notices, vol. 46, no. 10. ACM, 2011, pp. 171–188.
[28] J. E. Marynowski, A. O. Santin, and A. R. Pimentel, “Method for testing the fault tolerance of mapreduce frameworks,” Computer Networks, vol. 86, pp. 1–13, 2015.
[29] A. J. Mattos, “Test data generation for testing mapreduce systems,” in Master’s degree dissertation, 2011.
[30] “Herriot: Large-scale automated test framework,” https://wiki.apache.org/hadoop/HowToUseSystemTestFramework, accessed: 2016-04-16.
[31] “Minicluster: Apache hadoop cluster in memory for testing,” https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/CLIMiniCluster.html, accessed: 2016-04-16.
[32] “Junit: a simple framework to write repeatable tests,” http://junit.org/, accessed: 2016-04-16.