
Machine Learning Applications in Hydrology
Chapter · February 2020
DOI: 10.1007/978-3-030-26086-6_10
Holger Lange (Norwegian Institute of Bioeconomy Research) · Sebastian Sippel (ETH Zurich)
https://www.researchgate.net/publication/339751225


Chapter 10
Machine Learning Applications
in Hydrology

H. Lange and S. Sippel

10.1 Introduction

Hydrological processes operate on vastly different spatiotemporal scales (Fatichi


et al. 2016). From water transport in the pore space to continental runoff, from rapid
infiltration and diffusion at the scale of seconds to multiyear phenomena, the
hydrosphere exhibits a bewildering diversity. It is without question that very differ-
ent processes are the most relevant at the different scales. Therefore, process-
oriented modeling approaches which assume the same set of processes across all
scales clearly imply inherent limitations and are often inadequate (Sivapalan 2003,
2006; Clark et al. 2015). However, this assumption is often made in hydrological
modeling. Arguably the best-known example is the application of the Richards equation (Richards 1931), basically a nonlinear transport equation derived by averaging microscopic processes within the pore volume and thus operating in pore space at typical spatial scales of some centimeters or decimeters, to hillslopes, whole catchments, or even continental scales in the context of Earth system models.
The problems induced by this scale mismatch are well known (Blöschl 2001; Sivapalan et al.
2004; Viney and Sivapalan 2004); see, however, Ma et al. (2017) for an attempt to
bridge the scale gaps by linking mapping, monitoring, modeling, and management
and by “comprehensive integration.”

Electronic Supplementary Material The online version of this chapter (https://doi.org/10.1007/


978-3-030-26086-6_10) contains supplementary material, which is available to authorized users.

H. Lange (*)
Norwegian Institute of Bioeconomy Research, Ås, Norway
e-mail: Holger.Lange@nibio.no
S. Sippel
Norwegian Institute of Bioeconomy Research, Ås, Norway
Department of Environmental System Science, ETH Zürich, Zürich, Switzerland

© Springer Nature Switzerland AG 2020


D. F. Levia et al. (eds.), Forest-Water Interactions, Ecological Studies 240,
https://doi.org/10.1007/978-3-030-26086-6_10


It is a common feature of process-oriented hydrological models that they require


values for parameters which are difficult to obtain through measurements or are
almost unconstrained by observations. They are often fitted or at least constrained
through inverse modeling: the hydrological system is supposed to “have” the
parameter values which lead to the best fit of the model to the data available.
However, since the fitting routines seeking the parameter estimates typically
operate in high-dimensional spaces, no unique minimum of the objective function is
obtained, and the problem of “equifinality” occurs (Beven and Freer 2001).
In general, process-oriented modeling tries to explain the observed behavior and
can be characterized as a bottom-up approach where phenomena at a higher level
(e.g., runoff from a catchment) are explained through processes at a lower level
(water transport through soil, groundwater movement, etc.). Equifinality and overparametrization are two severe caveats impeding progress along that route. It is an open and rather difficult question whether this is due to our lack of system understanding, to scale errors, to measurements that are insufficiently detailed or simply too few, or to a focus on variables unsuitable to foster proper understanding, to name a few of the basic obstacles one could think of.
Alternatively, a data-oriented approach seeks to detect and describe patterns in
univariate or multivariate hydrological datasets, tries to generalize them, and then
makes predictions on the system state. Hydrology is particularly well-suited to this
line of research since in many cases, it deals with input–output systems and transport
processes (i.e., water flow) along gravitational gradients. The attribution of a variable
as input or output is in many cases obvious. We mention this because in other contexts, e.g., the biogeochemistry of ecosystems, the direction of causality is not intuitive and might even change over time (Sugihara et al. 2012).
“Process-oriented” and “data-oriented” modeling approaches are not mutually exclusive; in many cases, the combination of both carries a large potential for hydrological modeling. For example, data-oriented models occasionally provide a
detailed description of the input–output mapping and thereby support process-
oriented model building and process understanding. Furthermore, hybrid modeling,
i.e., the combination of both approaches within one modular hydrological model,
can yield improved forecasts (Corzo Perez 2009). Hybrid modeling tries to combine
the advantages of both process-oriented and data-oriented modeling approaches. For
example, a key process-oriented modeling principle is that the energy and water
balances across the entire model remain preserved, while determining model param-
eters might be achieved in a parsimonious way with data-driven approaches. Earlier
studies have shown that such pathways are indeed feasible, for instance, for the
parametrization of clouds within Earth system models (Rasp et al. 2018), and in
hydrology automated machine learning (ML)-based upscaling of streamflow obser-
vations to the grid cell level for Earth system modeling has been conducted
(Gudmundsson and Seneviratne 2015). ML is also used to improve the understand-
ing and analysis of process-oriented models (Peters et al. 2016).


Table 10.1 Examples of methods used in data analysis, in four different categories

                       | Unsupervised                           | Supervised
Manual/non-iterative   | Time series analysis, frequency        | ESLLE(a); GLMs, GAMs, GAMMs(b)
                       | decomposition, principal component     |
                       | analysis                               |
Automatic/iterative    | Clustering, autoencoders, data         | Neural networks, decision trees, random
                       | mining, dimensionality reduction       | forests, support vector machines, and many
                       |                                        | more; proper machine learning

(a) Enhanced Supervised Locally Linear Embedding (Zhang 2009)
(b) Generalized Linear Models, Generalized Additive Models, Generalized Additive Mixed Models

Data-oriented detection and description of patterns can be done in a variety of


ways. Two basic distinctions are “manual” (or one-time, non-learning) versus
“automatic” (iterative, learning) algorithms on one hand and supervised versus
nonsupervised ones on the other (Table 10.1).
As can be seen from Table 10.1, machine learning methods (as we use the term)
belong to the class of supervised iterative methods, where the iteration is the process
of learning. Thus, in most applications, the algorithm gets training and validation
datasets presented, learns on them, and is afterward tested on the remaining
(or independent) datasets. For hydrological applications, a well-designed cross-
validation scheme is of utmost importance since, as the underlying data are highly
structured and spatiotemporally correlated, the estimation of the generalization error
can be severely distorted (see, e.g., Roberts et al. (2017); a simple, tutorial-style
introduction to this topic is given in Sect. 10.2 from a bias-variance trade-off
perspective where the R code is provided as online electronic supplementary mate-
rial [see Extras Springer online]).
Moreover, modelers are often concerned that the training data may not contain all situations which occurred throughout the history of the system; proper extrapolation
beyond the range observed (or “hydrological regime”) in the training sample is a
difficult issue as even the most sophisticated machine learning techniques might fail
outside the training range. In many cases less flexible, more parsimonious methods
might actually extrapolate better, and causal learning techniques that would allow incorporating the correct causal structures are still in their infancy (Peters et al. 2017).
The present review of (supervised) ML in the context of hydrological applications
is structured as follows: in the next section, we provide a short overview of the origin
and rapid expansion of machine learning in popular as well as scientific publications
and outline basic statistical principles of machine learning from a practitioner’s
perspective. In Sect. 10.3, some of the key machine learning algorithms that are
already in frequent use in hydrology are described in a nontechnical manner, broadly
along a gradient of more parsimonious to more flexible methods. Sect. 10.4 then
highlights recent notable cases of ML application in hydrology; a brief outlook is
given in Sect. 10.5.


10.2 Machine Learning: Historical Overview and Statistical


Basis

10.2.1 The Origin and Rapid Expansion of ML

The term “Machine Learning” was coined some 60 years ago (Samuel 1959) in the context of game-playing computers (in that case, checkers). The fundamental and then-new insight was that software is able to learn how to play strategic games, with a performance superior to that of the human programmer.
From the start, machine learning was closely connected to Artificial Intelligence
(AI). However, since probabilistic and iterative methods are notoriously data-thirsty
and data availability was always an issue prior to the ability to collect, store and
distribute large datasets automatically, machine learning faded away in favor of
expert systems and knowledge-based AI software.
However, with the advent of the Internet, the corresponding availability of
digitized information and with increasing computing power, machine learning
experienced a revival, now as its own field separated from AI. It has strong roots
in statistics and probability theory, and many applications of machine learning,
including those fashionable in hydrology, are indeed tools based on and designed
for regression and/or classification.
A closely related concept is that of data mining. In this review, we consider data
mining and machine learning as separate approaches under the umbrella of data-
oriented modeling: whereas data mining explores unknown datasets for knowledge
discovery, the typical machine learning algorithm has prediction as its target; to achieve this, it is first fed with training data to learn patterns and dynamics and then validated on hitherto unseen datasets. Still, the two concepts are interwoven: some
unsupervised learning methods are taken from data mining as a preprocessing step of
machine learning; some data mining approaches utilize machine learning methods
but with a different goal. Data mining is often concerned with dimension reduction,
providing an efficient (low-dimensional) representation of the main patterns present
in high-dimensional data, unraveling redundancies, and so on. Whereas some machine learning methods perform dimension reduction as a preparatory step, it is not part of machine learning per se.
It is surely no overstatement that we have been living in the era of machine learning for some years now. Its applications span a wide range, including not only scientific domains but also industry and commercial businesses. Examples are automatic text
and speech recognition, translation from one natural language to another or from
specialist to laymen language, health care and medical data analysis, stock market
predictions, spotting fraud and plagiarism in science and elsewhere, optimizing
search engines like Google, autonomous driving vehicles, or autonomous self-
exploration and social interactions of robots. This list is far from being exhaustive,
and there are many job announcements and project applications mentioning machine
learning in their title. Overall, the scientific and commercial potential is considered
as huge.


A word of caution might be necessary at this stage. The rise of machine learning
is deeply connected to data availability, which includes the ability to automatically
collect, store, and distribute data—we also live in the era of “big data.” The titanic flow of data presents a challenge in its own right, not only to computer storage and processing speed but also to human comprehension. However, large classes of
methods used in machine learning are not genuinely new; others are mere refine-
ments of existing approaches. It is possible that there is a mismatch between data
volume and the sophistication of our tools; the latter is always in danger of lagging behind.1 As of today, skeptics could consider machine learning as just nonlinear fitting on big datasets, vulnerable to wild errors when extrapolating beyond the domain of the training sets provided to the “machine” and, as a consequence, difficult to apply in areas where huge amounts of reliable data are difficult or impossible to obtain, or where the test datasets differ in some underlying aspects from the training datasets.
A recent poll among chemists2 indicated that 45% of the participants agreed or
even strongly agreed to the statement that machine learning is overhyped. An
extreme example of hyping is that of the director of Artificial Intelligence at Tesla calling machine learning (and neural nets in particular) “Software 2.0.”3 While
overrating or overselling are common side effects of new technologies and emerging
markets, one could as well relax and just add machine learning approaches to the
toolbox of the modeler, recognizing that machine learning is good at things where
machines are good at learning and where the appropriate training data are available, no more, no less.
The increasing popularity of machine learning in hydrology originates also in the
availability of data sources other than traditional ones like precipitation, runoff,
groundwater height, and so on. Remote sensing from satellites or airplanes, embed-
ded sensor networks, drones and even internet-based social networks (sensu citizen
science) all contribute to ever-increasing data streams, creating a rich playground for
machine learning approaches.

10.2.2 The Bias-Variance Trade-Off and Cross-Validation


of Spatiotemporally Correlated Data

A key principle of machine learning methods is that unknown functional relation-


ships are approximated via the iterative tuning of hyperparameters. This leads
directly to the bias-variance trade-off (see, e.g., the longer introduction in Hastie
et al. (2008)) with important implications for hydrological ML applications where

1. https://insidebigdata.com/2018/10/19/report-depth-look-big-data-trends-challenges/ (accessed January 30, 2019).
2. https://cen.acs.org/physical-chemistry/computational-chemistry/machine-learning-overhyped/96/i34 (accessed January 30, 2019).
3. https://medium.com/@karpathy/software-2-0-a64152b37c35 (accessed January 30, 2019).


substantial spatial or temporal dependencies are crucial data properties. Here, we


illustrate the bias-variance trade-off with the approximation of an “unknown”
(simple third-order polynomial) function f via non-parametric k-nearest neighbor
(kNN) regression. kNN has one tuning parameter, which is the number of nearest neighbors k (in feature space, i.e., X) that are averaged to obtain the prediction Ŷ (see Sect. 10.3.1 for a more detailed introduction); i.e.,

\hat{Y}(x) = \frac{1}{k} \sum_{x_i \in N_k(x)} y_i

where Nk(x) is the set of the k points xi in the training data which are closest to the
point x. Mean squared error (MSE) curves for approximating f are shown as a
function of k in Fig. 10.1a for the training and test datasets. For large k, kNN regression “predictions” are close to the training sample mean; hence, a bias arises toward the extreme cases of X, resulting in underfitting (e.g., k = 50 in Fig. 10.1b, where the prediction tends toward a constant at both edges). In contrast, for small k, kNN “predictions” follow (arbitrarily) closely the training dataset (Fig. 10.1b). Hence, by successively decreasing k, the training set MSE reduces monotonically as flexibility is added to the model. This continuously reduces bias and leads eventually to a perfect fit of the training dataset (Fig. 10.1b). However, a “perfect” fit on the training dataset (such as shown for k = 1 in Fig. 10.1b) is clearly undesirable, as the model will generalize poorly to a different (“test set”) realization of f (Fig. 10.1c); i.e., the fit has high variance. Hence, the ability to generalize to unseen data is a crucial property of any machine learning model, and consequently training set MSE is not a reliable measure of model performance. MSE on an unseen test (or validation) dataset is a sum of bias (resulting from too little model flexibility, i.e., underfitting, such as for k = 50 in Fig. 10.1b, c), variance (resulting from too much model flexibility, i.e., overfitting, e.g., for k = 1 in Fig. 10.1b, c), and irreducible error (see for more details, e.g., Hastie, Tibshirani, and Friedman (2008)); hence, from Fig. 10.1a, the modeler would choose a k-value that minimizes test MSE.
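The chapter's code for this experiment is provided in R as supplementary material; as a rough stand-in, the following pure-Python sketch (all function names, data, and parameter choices are ours) reproduces the qualitative behavior: for k = 1 the training MSE vanishes while the test MSE stays high, and for very large k the prediction collapses toward the training sample mean.

```python
# Toy kNN regression on a noisy cubic, illustrating the bias-variance
# trade-off. Our own sketch; the chapter's supplementary code uses the
# R package "kknn" instead.
import random

def knn_predict(x, train, k):
    """Average the y-values of the k training points closest to x."""
    neighbors = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in neighbors) / k

def mse(data, train, k):
    return sum((knn_predict(x, train, k) - y) ** 2 for x, y in data) / len(data)

random.seed(0)
f = lambda x: x ** 3
xs = [i / 25 - 1 for i in range(50)]                    # 50 points in [-1, 1]
train = [(x, f(x) + random.gauss(0, 0.1)) for x in xs]  # noisy training set
test = [(x, f(x) + random.gauss(0, 0.1)) for x in xs]   # independent realization

# k = 1 interpolates the training noise (zero training error, high variance);
# k = 50 averages the whole sample (high bias); intermediate k balances both.
for k in (1, 5, 50):
    print(k, round(mse(train, train, k), 4), round(mse(test, train, k), 4))
```

Running this prints, for each k, the training and test MSE; the training column shrinks monotonically with decreasing k, while the test column is typically smallest at an intermediate k, mirroring Fig. 10.1a.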
In most practical cases, hyperparameters are tuned through cross-validation, i.e.,
through a systematic partitioning of the training dataset into internal training/
validation sets.
An important property of hydrological datasets is that they are highly structured
and typically involve substantial spatial and temporal correlations and
nonstationarities (Roberts et al. 2017) that might be induced either through missing
predictors or through underlying correlated noise. This property implies that a random partitioning of the data into folds for cross-validation, which is the default in many standard software cross-validation routines, is prone to fail: the reason is of course that the data used internally for training and validation in cross-validation are then correlated rather than independent, and hence the resulting MSE curve would resemble the “training set MSE” in Fig. 10.1a, resulting in an underestimation of the generalization error of the model and leading to a model fit that is tuned too closely to the


Fig. 10.1 Illustration of the bias-variance trade-off as a key concept in machine learning that is
crucially relevant for hydrological applications. (a) Training and test set mean squared error of
k-nearest neighbor (kNN) regression with various tuning parameter values k used to approximate an
unknown function (third-degree polynomial). (b, c) kNN regression fits for the (b) training and (c)
test datasets. k-nearest neighbor regression fits are based on the “kknn” R-package (Schliep and
Hechenbichler 2016)

available data. Hence, a smart partitioning of the data into training and test datasets
is crucial particularly in real-world hydrological applications (Roberts et al. 2017).
For practical applications, the issue of correlations might be alleviated by choosing
block-wise folds (in space and/or time, i.e., eliminating information leakage
between folds) for cross-validation (Roberts et al. 2017). Although the problem of correlated errors in prediction problems has been known for quite some time (see Schoups and Vrugt (2010) for hydrology and Bergmeir et al. (2018) for an overview), it has been argued that some spectacular prediction failures, such as those concerning the outcome of the 2016 US presidential election or the onset of the 2008 financial crisis, were at least partly caused by ignoring correlated errors among the instances used for prediction (Silver 2012).
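A minimal sketch (our own helper, not a routine from any particular package) of how such block-wise folds can be constructed: indices are split into contiguous blocks rather than drawn at random, so temporally adjacent, correlated points stay within the same fold instead of leaking between the internal training and validation sets.

```python
# Contiguous ("blocked") cross-validation folds for serially correlated
# data, as opposed to the random folds that many software routines use
# by default. Function name and signature are illustrative only.
def blocked_folds(n, n_folds):
    """Split indices 0..n-1 into contiguous blocks, one block per fold."""
    size, rem = divmod(n, n_folds)
    folds, start = [], 0
    for i in range(n_folds):
        stop = start + size + (1 if i < rem else 0)
        folds.append(list(range(start, stop)))
        start = stop
    return folds

# A time series of length 10 split into 3 folds: each validation set is a
# contiguous block of time steps.
for fold in blocked_folds(10, 3):
    print(fold)   # → [0, 1, 2, 3] then [4, 5, 6] then [7, 8, 9]
```

Each block then serves in turn as the internal validation set, with the remaining blocks used for internal training, which gives a more honest estimate of the generalization error under autocorrelation.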

10.3 Machine Learning Algorithms Used in Hydrological


Research

In this section, we provide a short overview on some important machine learning


techniques used in hydrology. Given the pace of development in this field, the
overview is necessarily incomplete and might also be imbalanced; however, each
one of the methods to follow has had an application in the hydrological literature.
Moreover, following up on the bias-variance trade-off considerations in the previous
section and for practical considerations, we present the methods broadly in a
sequence from more parsimonious methods to more flexible methods.
We also add references to some notable algorithms and software
implementations. However, because machine learning algorithms and
implementations have become so widespread and diverse, it is beyond the scope
of this chapter to recommend specific software packages. Therefore, we refer the
interested reader to dedicated overview sites targeted at machine learning applica-
tions and algorithm implementations, for instance, mloss.org4 (“machine learning
open source software”) or the CRAN task view “Machine Learning & Statistical
Learning” (Hothorn 2019) that provides a comprehensive overview of packages and
implementations that are available within the R programming environment.

10.3.1 k-Nearest Neighbors (KNN)

This is a classic algorithm for both classification and regression (Altman 1992) – and
has been used for illustration in Sect. 10.2. In feature space, i.e., the space spanned by the input vectors, one must find the k-nearest neighbors of any sample point, which obviously requires a distance measure. Among the common
choices, the Minkowski distance (with distance parameter q) as a generalization of
the Euclidean distance is the most flexible one. In KNN regression, the output
(prediction) for the sample point is the weighted average of the values for the
k-nearest neighbors, where the weights are inversely proportional to the distances.
It might be necessary to reduce the dimensions of the problem first.
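The pieces just described can be sketched in a few lines; the following toy code (ours, not from a particular library) implements the Minkowski distance with parameter q, where q = 2 recovers the Euclidean distance, and an inverse-distance-weighted KNN prediction.

```python
# Minkowski distance and distance-weighted KNN regression (toy sketch).
def minkowski(a, b, q=2):
    return sum(abs(x - y) ** q for x, y in zip(a, b)) ** (1 / q)

def knn_weighted(x, train, k, q=2):
    """Weighted average of the k nearest y-values, weights ~ 1/distance."""
    nearest = sorted(train, key=lambda p: minkowski(p[0], x, q))[:k]
    pairs = [(minkowski(xi, x, q), yi) for xi, yi in nearest]
    if any(d == 0 for d, _ in pairs):          # exact match: return its y
        return next(yi for d, yi in pairs if d == 0)
    wsum = sum(1 / d for d, _ in pairs)
    return sum(yi / d for d, yi in pairs) / wsum

train = [((0.0,), 0.0), ((1.0,), 1.0), ((2.0,), 4.0)]
print(knn_weighted((1.5,), train, 2))   # → 2.5 (both neighbors equidistant)
```

Since the query point lies exactly halfway between the two nearest samples, both receive equal weight and the prediction is their plain average.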

4. https://mloss.org/software/


10.3.2 Regularized Linear Models: Lasso, Ridge, and Elastic


Net Regression

Regularized regression methods have been developed to account for cases in which
the number of predictors is relatively large compared to the number of samples, i.e.,
where multiple linear regression might overfit (Hastie et al. 2008).5 The idea is to
reduce the regression model’s flexibility by shrinking the regression coefficients,
such as to avoid overfitting and to allow interpretability of the model. Shrinkage is achieved by adding a penalty on the size of the regression coefficients to the least-squares objective function. Lasso and ridge regression are two related
regression methods that differ in the nature of shrinkage (Hastie et al. 2008), where
ridge regression shrinks coefficients toward zero based on the L2-norm, whereas
Lasso regression performs some kind of subset selection by preferably shrinking
coefficients exactly to zero via the L1-norm penalty. The degree of shrinkage is
regulated by a hyperparameter (λ in the equation below) that is typically determined
by cross-validation.
Elastic net models (Zou and Hastie 2005) represent a blend of Lasso and ridge regression and use an additional hyperparameter that allows switching continuously between the two regression methods. The vector of regression coefficients β is
obtained in elastic nets as
\hat{\beta} = \arg\min_{\beta} \left[ \lVert y - X\beta \rVert^2 + \lambda \left( \alpha \lVert \beta \rVert_2^2 + (1 - \alpha) \lVert \beta \rVert_1 \right) \right]

where X is the matrix of predictors, λ is the penalty strength, and α is the new hyperparameter (α = 0 gives Lasso, α = 1 gives ridge regression).
Elastic nets are implemented in the glmnet package in R (Friedman et al. 2010).
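To make the effect of the penalty concrete, consider ridge regression in the simplest case of a single centered predictor and no intercept, where the penalized least-squares solution has the closed form β̂ = Σxᵢyᵢ / (Σxᵢ² + λ). The sketch below (our own Python code, not glmnet) shows the coefficient shrinking toward zero as λ grows; note also that a given package may parameterize α in the opposite sense to the formula above, so its documentation should be checked.

```python
# Closed-form ridge solution for one centered predictor, no intercept:
# beta = sum(x*y) / (sum(x^2) + lambda). Increasing the penalty strength
# shrinks the coefficient toward zero.
def ridge_1d(x, y, lam):
    return sum(xi * yi for xi, yi in zip(x, y)) / (sum(xi * xi for xi in x) + lam)

x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [-4.2, -1.9, 0.1, 2.1, 3.9]        # roughly y = 2x plus noise

for lam in (0.0, 1.0, 10.0):
    print(lam, round(ridge_1d(x, y, lam), 3))
# → 0.0 2.02 / 1.0 1.836 / 10.0 1.01
```

At λ = 0 the ordinary least-squares slope is recovered; larger penalties pull the estimate steadily below the unpenalized value.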

10.3.3 Artificial Neural Networks (ANNs)

These are the most well-known learning algorithms, probably also among the oldest, with applications to hydrology spanning several decades (Maier and Dandy 1995;
Kuligowski and Barros 1998; Zealand et al. 1999; Lischeid 2001; Parasuraman et al.
2006; Wang et al. 2006; Daliakopoulos and Tsanis 2016). There are also many
excellent textbooks on the subject, so we provide only some key elements here. The
interested reader might consult Chapter 11 of Hastie et al. (2008) or Chapter 5 in
Bishop (2006) for a thorough overview on ANNs.

5. This book is freely available as a pdf document from https://web.stanford.edu/~hastie/ElemStatLearn/


Fig. 10.2 A typical three-


layer feed-forward neural
network. The Xi are input
and the Yi output nodes;
between them, there is a
“hidden” layer with nodes
Zi. The links between the
nodes are weighted (not
shown). (Source: Hastie
et al. 2008; reproduced with
permission of Springer)

ANNs are motivated by an analogy to the physiology of the human brain, where
neurons are represented as nodes in a network, whereas the connections between
neurons (synapses) are links between them. Each neuron is equipped with an activation function, which determines, according to its input value, whether it “fires” (produces output) or not (or rather, “firing” is a continuous process between none and maximal). The prototypical architecture of an ANN is shown in Fig. 10.2. It
should be obvious from the figure that the analogy with the human brain should not
be taken too far. In the latter, there are no immediately identifiable “input” and
“output” neurons, let alone hidden layers between them.
In many applications, there will be only one output node Y; it is also important to note that providing more than one hidden layer does not necessarily improve model performance, consistent with the universal approximation theorem (Cybenko 1989)
which states that a feed-forward network with a single hidden layer containing a
finite number of neurons can approximate continuous functions, under mild assump-
tions on the activation function; therefore, one hidden layer is the most common
choice.
ANNs have a long history, in which the invention of the perceptron (Rosenblatt 1958) was a particularly important milestone. Plagued by the bottleneck of insufficient computing power and some theoretical problems, the proper breakthrough came with the invention of the backpropagation algorithm (Werbos 1975), later popularized in particular through the seminal work of Rumelhart et al. (1986); backpropagation is to this day by far the most common method to determine the weights on the links through training.
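To illustrate how backpropagation adjusts the weights, here is a toy, pure-Python sketch (our own code, not from the chapter or any library) of a feed-forward network with one hidden layer, trained on the XOR problem by stochastic gradient descent on a squared loss. Whether the net fully solves XOR depends on the random initialization; the run only demonstrates that the training error decreases.

```python
# Minimal one-hidden-layer network with backpropagation, learning XOR.
import math, random

random.seed(1)
H = 4                                           # hidden-layer width
sig = lambda z: 1 / (1 + math.exp(-z))          # logistic activation

# weights: input(2) -> hidden(H) -> output(1), plus biases
W1 = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(H)]
b1 = [0.0] * H
W2 = [random.uniform(-1, 1) for _ in range(H)]
b2 = 0.0

data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def forward(x):
    h = [sig(sum(w * xi for w, xi in zip(W1[j], x)) + b1[j]) for j in range(H)]
    return h, sig(sum(w * hj for w, hj in zip(W2, h)) + b2)

def sq_err():
    return sum((forward(x)[1] - t) ** 2 for x, t in data)

err_before = sq_err()
lr = 1.0
for _ in range(5000):
    for x, t in data:
        h, y = forward(x)
        d_out = (y - t) * y * (1 - y)           # gradient of 0.5*(y-t)^2 w.r.t. output pre-activation
        for j in range(H):
            d_hid = d_out * W2[j] * h[j] * (1 - h[j])
            W2[j] -= lr * d_out * h[j]          # propagate error backward, layer by layer
            for i in range(2):
                W1[j][i] -= lr * d_hid * x[i]
            b1[j] -= lr * d_hid
        b2 -= lr * d_out

print(round(err_before, 3), round(sq_err(), 3))
```

The second number printed is far smaller than the first: the chain-rule updates drive the squared training error down from its value at the random initialization.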
There are some serious issues with ANNs in practical circumstances. Learning can be quite slow, preventing efficient online updating, in which the network learns continuously while new data are streaming in, rather than working


fully through an offline training set. There is also a severe danger of overparameterization, unless the training sample is huge. Letting backpropagation run for a long time on the same sample bears the danger of overfitting (or overlearning): the network gets to “understand” tiny, insignificant details of that data sample and loses generalization power, and thus performs worse in the subsequent validation phase than after a shorter training run. As Hastie et al. (2008) phrase it, “there is quite an art in training neural networks.”
There is one additional problem which is, strictly speaking, outside proper quality assessment for machine learning, but which has been the subject of intense debate for a long time: trained neural networks which perform well are usually opaque and not open to interpretation. The mapping between input and output, encoded as weights between (many) input neurons, the nodes in the hidden layer(s), and from there to the output neuron(s), eludes human comprehension; trained ANN architectures are nontransparent to the data analyst. One might wonder whether the incomprehensibility of the network architecture is a relevant issue for systems doing analysis and prediction largely automatically and without human interference, but it has limited the spread of ANNs; the aim of scientific applications (rather than, say, industrial or engineering applications) is often to understand why things work.

10.3.3.1 What Is Deep Learning?

The adjective “deep” in front of a learning algorithm is a fashionable attribute. A


machine learning algorithm is called deep when it comprises a multitude of layers of
nonlinear processing units, where each successive layer uses the output of the
previous layer. The canonical example of deep learners is the ANN (Schmidhuber 2015), but not every ANN is necessarily deep; a special variant, recurrent ANNs, have potentially unlimited depth (layers are added during the learning process) and are thus (very) deep. The term deep learning was brought to the
machine learning community by Dechter (1986). The most general framework is
Deep Reinforcement Learning (DRL). DRL agents are forced to learn how to
interact with an initially unknown and only partially observable environment,
unassisted, in order to maximize their cumulative reward signals (Schultz 2007).
Reward signals are produced by reward neurons and are used to influence brain
activity that controls actions, decisions and choices. In the human brain, an important
example of reward neurons are dopamine neurons. DRL became popular recently as
Silver et al. (2016) managed to construct DRL systems beating world elite players in
the board game Go. Nowadays, deep learning is a very active research area with a loose use of the term; it refers to any algorithm capable of extracting knowledge from big data in an efficient manner (Najafabadi et al. 2015).
Implementations of deep learning algorithms are available for all major program-
ming languages.6

6. See, e.g., https://en.wikipedia.org/wiki/Comparison_of_deep_learning_software


10.3.4 Convolutional Neural Networks (CNNs)

For the ANNs discussed up to this point, any spatial neighborhood of input or output neurons (or, more generally, a geodesic distance between them) is ignored, i.e., it does not enter the network architecture. For some tasks, with automatic image analysis as the foremost example, this is unfortunate, since local information (the surrounding pixels) is more important than remote information. This flaw is overcome by
CNNs (LeCun et al. 1989), designed to emulate the behavior of the visual cortex.
The neurons in a CNN are organized in three dimensions (height, width and depth)
(Fig. 10.3), and this spatial locality is utilized through a local connectivity pattern
between neurons of adjacent layers. The neurons in a given layer (to the right in
Fig. 10.3) are connected to only a few neurons or a small region within the layer
before it, called the receptive field (the left and middle box in Fig. 10.3 show two
iterations of this principle).
The core building element of a CNN is the convolutional layer, whose parameters are trainable kernels (filters) with a small viewing field (focusing on close neighbors). Each filter is "convolved" across the width and height of the layer, which means that scalar products between the inputs and the entries of the filter are taken. Strictly speaking, this operation is a cross-correlation rather than a convolution, but "convolutional" is the established term for this type of network.
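This sliding scalar product can be written down in a few lines. Below is a minimal plain-Python sketch of a valid-mode 2-D cross-correlation (no deep-learning library; the 3 × 3 input and the 2 × 2 filter are invented for illustration):

```python
def cross_correlate2d(image, kernel):
    """Valid-mode 2-D cross-correlation: slide the kernel over the
    image and take the scalar product at each position (no flipping
    of the kernel, which is what distinguishes cross-correlation
    from a true convolution)."""
    n, m = len(image), len(image[0])
    k, l = len(kernel), len(kernel[0])
    out = []
    for i in range(n - k + 1):
        row = []
        for j in range(m - l + 1):
            # scalar product of the kernel with the receptive field
            s = sum(kernel[a][b] * image[i + a][j + b]
                    for a in range(k) for b in range(l))
            row.append(s)
        out.append(row)
    return out

image  = [[1, 2, 3],
          [4, 5, 6],
          [7, 8, 9]]
kernel = [[1, 0],
          [0, -1]]          # a simple diagonal-difference filter
print(cross_correlate2d(image, kernel))   # → [[-4, -4], [-4, -4]]
```

In a CNN, the filter entries would be learned during training; frameworks additionally provide padding, strides, and many filters per layer.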

Fig. 10.3 The typical architecture of a convolutional neural network. The convolutional layer to the right is a three-dimensional object which is connected to a small region of the layer before it (box in the middle), the receptive field, which, in turn, has few connections with the layer before it (left). (Source: Wikipedia (https://en.wikipedia.org/wiki/Convolutional_neural_network); reproduced under Creative Commons license type Attribution-Share Alike 4.0 International)

10.3.5 Support Vector Machines (SVMs)

The starting point for support vector machines (SVMs) (Boser et al. 1992) is the observation that there is always noise in any set of observations, and that a (generalized) regression should therefore not be forced through data points which deviate from it by less than an assumed threshold ε; SVM regression is hence sometimes called ε-insensitive regression. Only the data points outside of the threshold are used for the regression; they constitute the support vectors (Fig. 10.4). SVMs are therefore especially suitable for analyzing noisy data. Arbitrary nonlinear regression is possible when using kernel functions (the "kernel trick"; Hofmann et al. 2008). Typical kernels are Gaussian (in this context often called "Radial Basis Functions," RBF) or polynomial ones, but sigmoidal kernels have also been tried. This flexibility comes at a price, however: the selection of the kernel function is largely heuristic and based on trial and error.

Fig. 10.4 Support vector regression. Points inside the ε boundaries are not used as support vectors (non-SVs). (Source: Raghavendra and Deka (2014); reprinted with permission from Elsevier)
The best-known application of SVMs is the automatic identification of handwritten digits, but there are other important application areas such as bioinformatics, biochemistry, and not least the environmental sciences, including hydrology.
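The ε-insensitive idea can be made concrete through the corresponding loss function. Below is a minimal plain-Python sketch (the threshold ε and the residuals are invented for illustration; a real SVR additionally solves a constrained optimization problem, possibly in kernel space):

```python
def eps_insensitive_loss(y_true, y_pred, eps=0.5):
    """ε-insensitive loss used in SVM regression: residuals smaller
    than eps cost nothing; only larger deviations are penalized."""
    return sum(max(0.0, abs(t - p) - eps) for t, p in zip(y_true, y_pred))

def support_vectors(y_true, y_pred, eps=0.5):
    """Indices of points outside the ε tube; only these would enter
    the regression as support vectors."""
    return [i for i, (t, p) in enumerate(zip(y_true, y_pred))
            if abs(t - p) > eps]

y_obs   = [1.0, 2.0, 3.0, 4.0]       # invented observations
y_model = [1.2, 2.9, 3.1, 4.0]       # invented model output

print(round(eps_insensitive_loss(y_obs, y_model), 6))   # → 0.4
print(support_vectors(y_obs, y_model))                  # → [1]
```

Only the second point (residual 0.9 > ε = 0.5) contributes to the loss; the other three lie inside the ε tube of Fig. 10.4 and are ignored.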

10.3.6 Decision Tree Learning

Decision trees are a key ingredient of many machine learning algorithms. They are (graphical) representations of conditional statements about the outcome of decisions; depending on weights (probabilities) of the input features, one branch or the other in a decision tree is taken. Observations are contained in the branches, whereas target values (predictions) are contained in the leaves. Thus, decision trees are examples of prediction models. If the target values are continuous, decision trees are often called regression trees. The prototype, and arguably the oldest regression tree procedure, is the Classification and Regression Tree (CART) introduced by Breiman et al. (1984). An implementation of CART in R is provided by the rpart package.7
A decision tree learning algorithm usually starts with a tree of simple structure (in terms of branches and leaves) and recursively splits the set of observations into subsets, in many cases using simple thresholds on the values of the observations (take one branch if the observation is smaller than a prescribed threshold value, and the other branch otherwise). This process of re-branching is called recursive partitioning. It stops when further splitting no longer improves the predictions. In many algorithms, a secondary step is performed in which insignificant branches are removed again, to end up with a less complex model providing virtually identical prediction performance; this operation is called pruning.
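The threshold-based splitting described above can be sketched as a single step of recursive partitioning; a full tree learner would apply it recursively to the two resulting subsets. A minimal plain-Python sketch with invented toy data:

```python
def best_split(x, y):
    """One step of recursive partitioning: try a threshold between
    every pair of neighboring x values and keep the one minimizing
    the summed squared error of the two leaves (each leaf predicts
    the mean of its target values)."""
    def sse(vals):
        if not vals:
            return 0.0
        m = sum(vals) / len(vals)
        return sum((v - m) ** 2 for v in vals)

    pairs = sorted(zip(x, y))
    best_err, best_thr = float("inf"), None
    for i in range(1, len(pairs)):
        thr = (pairs[i - 1][0] + pairs[i][0]) / 2
        left  = [yy for xx, yy in pairs if xx <  thr]
        right = [yy for xx, yy in pairs if xx >= thr]
        err = sse(left) + sse(right)
        if err < best_err:
            best_err, best_thr = err, thr
    return best_thr

# Toy data with an obvious break between x = 3 and x = 4
x = [1, 2, 3, 4, 5, 6]
y = [1.0, 1.1, 0.9, 5.0, 5.2, 4.8]
print(best_split(x, y))   # → 3.5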

10.3.7 Random Forests

It is a common property of decision trees to be susceptible to overtraining, in particular if the training dataset does not contain all features of the datasets that can occur. It is generally believed that increasing the depth of decision trees (or the complexity of many other learning algorithms, for that matter) comes at the price of overfitting (see, e.g., Fig. 10.1); deep trees tend to learn highly irregular patterns. This disadvantage can be overcome to a certain extent by working with multiple decision trees at once, where the individual trees are generated using a random selection (without replacement) from the training set. This ensemble of trees is called a random forest (Breiman 2001). Each tree makes a prediction of the target variable, and the average of all predictions is then the resulting prediction of the random forest.
A crucial performance improvement is obtained when, repeatedly, a random
sample of the training set (now with replacement) is presented to the generated
trees, and the trees are fitted to these subsamples, a process called “bagging”
(Breiman 2001) which is short for “bootstrap aggregating.” If this is done B times
(say), one has B different regression trees trained. For the prediction of hitherto
unseen samples, one takes the arithmetic average of the predictions of all trained
trees. This has a noise cancelling effect, and as a consequence, the model perfor-
mance is improved in the sense of decreased variance while at the same time the bias
is not increased.
The number of trees B is a parameter of the approach and may be adjusted by considering the out-of-bag error, i.e., the prediction performance for samples not contained in the training sets of the individual trees. This error usually flattens out once B exceeds a few hundred to a few thousand. The limit of B growing arbitrarily large (infinite random forests) is an active area of research in the theory of random forests.
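Bagging itself is independent of the details of the tree-growing step. Below is a minimal plain-Python sketch, in which an invented one-split "stump" stands in for a full regression tree (data and names are ours, for illustration only):

```python
import random

def fit_stump(x, y):
    """Tiny stand-in for a regression tree: split at the median x and
    predict the mean target on each side."""
    thr = sorted(x)[len(x) // 2]
    left  = [yy for xx, yy in zip(x, y) if xx <  thr] or y
    right = [yy for xx, yy in zip(x, y) if xx >= thr] or y
    lm, rm = sum(left) / len(left), sum(right) / len(right)
    return lambda q, thr=thr, lm=lm, rm=rm: lm if q < thr else rm

def bagged_predict(x, y, query, B=200):
    """Bootstrap aggregating: fit B learners on samples drawn with
    replacement and average their predictions."""
    n = len(x)
    preds = []
    for _ in range(B):
        idx = [random.randrange(n) for _ in range(n)]   # bootstrap sample
        model = fit_stump([x[i] for i in idx], [y[i] for i in idx])
        preds.append(model(query))
    return sum(preds) / B

random.seed(1)
x = [1, 2, 3, 4, 5, 6]
y = [1.0, 1.2, 0.8, 5.1, 4.9, 5.0]
print(round(bagged_predict(x, y, query=5.5), 2))   # close to 5, the right-leaf mean
```

A real random forest additionally restricts each split to a random subset of the features, which decorrelates the trees.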

7 https://cran.r-project.org/web/packages/rpart/index.html

The learning algorithm of random forests involves the generation of candidate splits (branching points) in the trees. At each candidate split, the learner is presented with a randomly chosen fraction (the recommended value being 1/3) of the features. This procedure minimizes correlations among the trees which would otherwise occur when there are strong predictors in the input dataset.
Random forests show a relatively strong resistance to overfitting (Kleinberg 1996), arguably the most decisive reason for their popularity in recent years. Further details on the theory of random forests can be found in Hastie et al. (2008); in the R programming environment, the randomForest package (Liaw and Wiener 2002) may be used.
The number of articles in hydrology using random forests has rapidly increased
recently. A first review of applications of random forests in the water sciences
(Tyralis et al. 2019) counts more than 200 papers on that method in water journals;
more than 60% of them have been published since 2016.
A further step, closely connected to random forests, consists in presenting the whole learning sample (rather than just a subsample) to each tree and generating candidate splits at random many times; the split which yields the highest score is finally chosen to split the node. This variation is called extremely randomized trees (Geurts et al. 2006), or ExtraTrees for short.

10.3.8 Gradient Boosting Machine

The generalization of gradient-descent algorithms to (stochastic) gradient boosting, introduced by Friedman (2002), consists in providing an iterative algorithm with a set of "base learners", i.e., simple predictor functions which may be non-adapted to the problem or are in other senses "weak." The most extreme example of a weak base learner would be the mean of the whole dataset. Other, more sophisticated examples include regression trees. The model error is then used in the next iteration to build a set of new learners, whose predictions are added to those of the base learners (this is the boosting step), resulting in a new set of prediction functions. This process is repeated until a user-defined number of trees is reached. Gradient boosting can be fully automated using software like the gbm3 package in R.8
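The boosting loop can be sketched with the weakest base learner mentioned above, the global mean, followed by simple stumps fitted to the residuals. A minimal plain-Python illustration with invented data (not the gbm3 algorithm itself):

```python
def fit_stump(x, y):
    """Weak learner: split at the midpoint of x, predict each side's mean."""
    thr = (min(x) + max(x)) / 2
    left  = [yy for xx, yy in zip(x, y) if xx <  thr] or y
    right = [yy for xx, yy in zip(x, y) if xx >= thr] or y
    lm, rm = sum(left) / len(left), sum(right) / len(right)
    return lambda q, thr=thr, lm=lm, rm=rm: lm if q < thr else rm

def boost(x, y, n_rounds=20, nu=0.5):
    """Gradient boosting for squared error: start from the global mean,
    then repeatedly fit a weak learner to the current residuals and add
    a damped (nu) version of its prediction to the ensemble."""
    base = sum(y) / len(y)                 # the weakest possible base learner
    learners = []
    residuals = [yy - base for yy in y]
    for _ in range(n_rounds):
        stump = fit_stump(x, residuals)
        learners.append(stump)
        residuals = [r - nu * stump(xx) for xx, r in zip(x, residuals)]
    return lambda q: base + nu * sum(s(q) for s in learners)

x = [1, 2, 3, 4, 5, 6]
y = [1.0, 1.1, 0.9, 5.0, 5.2, 4.8]
model = boost(x, y)
err = sum((model(xx) - yy) ** 2 for xx, yy in zip(x, y))
print(round(err, 4))   # the training error shrinks towards the leaf variances
```

Each round fits the part of the signal the ensemble so far fails to explain; the damping factor nu (the learning rate) trades convergence speed against overfitting.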

10.3.9 M5 and M5-cubist

These two algorithms also belong to the family of decision/regression tree approaches; they are sometimes also called model tree approaches. Contrary to usual regression trees, they contain regression functions at their leaves, not single constant values. The M5 model tree treats the standard deviation of the target (class) values that reach a node as the node error and calculates the expected reduction in this error obtained by splitting the node, testing each attribute at that node. Among all candidate splits, M5 chooses the one which maximizes the expected error reduction. In most cases, a subsequent pruning step is required. M5 was introduced by Quinlan (1993).

8 https://github.com/gbm-developers/gbm3
Cubist, or M5-cubist (Loh 2011), is a further extension of M5 which avoids the sudden "jumps" (discontinuities) that can arise in M5 when changing the decision node, if the corresponding linear models have very different coefficients. M5-cubist is available through the Cubist R package.9
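The split criterion of M5 can be stated directly as a standard deviation reduction (SDR). A minimal plain-Python sketch with invented target values:

```python
def sd(vals):
    """Population standard deviation, used by M5 as the node error."""
    m = sum(vals) / len(vals)
    return (sum((v - m) ** 2 for v in vals) / len(vals)) ** 0.5

def sdr(y, y_left, y_right):
    """Standard deviation reduction of a candidate split: the node's
    error minus the size-weighted errors of the two children. M5
    chooses the split maximizing this quantity."""
    n = len(y)
    return sd(y) - (len(y_left) / n) * sd(y_left) - (len(y_right) / n) * sd(y_right)

y = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]          # invented targets reaching the node
print(round(sdr(y, y[:3], y[3:]), 3))       # large reduction: a good split
print(round(sdr(y, y[:1], y[1:]), 3))       # much smaller reduction
```

The first candidate separates the two regimes of the toy data and yields a far larger SDR than the unbalanced second candidate; in M5, each leaf of the chosen split would then receive its own linear regression model.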

10.3.10 Stack Generalization

As the diversity of methods discussed in this chapter indicates, it is not always easy to decide on the "optimal" approach when confronted with a specific modeling problem or dataset. Meta approaches (or ensemble models) combine the predictions of a suite of (machine learning) algorithms (level-0 methods). These predictions serve as input (predictor variables) to another layer of machine learning (level 1). This procedure resembles boosting in the GBM or bagging in random forests and is referred to as stacked regression (Wolpert 1992). It can also be iterated (level 2, i.e., meta-meta models, and so on), although this has rarely been done until now.
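The two levels can be made concrete with a small sketch. Two invented level-0 models (a climatology-like mean and a persistence forecast) produce predictions which a level-1 learner combines by least squares (plain Python; all names and data are ours, for illustration only):

```python
def level0_mean(y_train, x):
    """Level-0 model A: always predict the training mean (climatology)."""
    m = sum(y_train) / len(y_train)
    return [m for _ in x]

def level0_persistence(y_prev):
    """Level-0 model B: predict the previous observation (persistence)."""
    return list(y_prev)

def fit_level1(p1, p2, y):
    """Level-1 learner: least-squares weights (w1, w2) for the stacked
    prediction w1*p1 + w2*p2, via the 2x2 normal equations."""
    a11 = sum(a * a for a in p1); a12 = sum(a * b for a, b in zip(p1, p2))
    a22 = sum(b * b for b in p2)
    b1  = sum(a * t for a, t in zip(p1, y)); b2 = sum(b * t for b, t in zip(p2, y))
    det = a11 * a22 - a12 * a12
    return ((b1 * a22 - b2 * a12) / det, (b2 * a11 - b1 * a12) / det)

# Invented daily "streamflow": persistence is informative, the mean less so
y_prev = [2.0, 3.0, 4.0, 3.0, 2.0]    # yesterday's flow
y_obs  = [2.2, 3.1, 3.9, 2.9, 2.1]    # today's flow (targets)

p1 = level0_mean(y_obs, y_prev)
p2 = level0_persistence(y_prev)
w1, w2 = fit_level1(p1, p2, y_obs)
print(round(w1, 2), round(w2, 2))     # persistence should receive most weight
```

In practice, the level-0 predictions presented to the level-1 learner should be generated out-of-sample (e.g., by cross-validation) to avoid information leakage; this sketch omits that step for brevity.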

10.4 Examples of Machine Learning Applications from Hydrology

10.4.1 A Note on Quality Metrics

As machine learning applications are still relatively new in the hydrological literature, their performance is usually compared to more conventional approaches, or against each other. This raises the question of how to compare observations and predictions (or simulations). There are many ways to do this; it is not obvious which metric serves the purpose best, and the answer also depends on the focus of the data analyst: Is reproduction of the mean and the variance most important? Should the autocorrelation function (or the power spectrum) be reproduced? Should the prediction do well only on short time scales, or also on longer ones (statistically)? Is the reproduction of the seasonal cycle (phase, asymmetry) an issue? There are many more aspects of the time series to be considered. However, the vast majority of papers restrict themselves to very basic metrics, usually delivering just one number for the whole record indicating the data-model mismatch: the correlation coefficient, Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), or the Nash-Sutcliffe Efficiency (Nash and Sutcliffe 1970). There is a risk that, according to these simple metrics, the methods do not differ much from each other, and that their ranking is more a random outcome of the case study at hand than a typical one for the research area. Such simple metrics may also not do justice to methods which describe the general dynamics very well but fail to reproduce the correct scale (or the mean value) accurately enough.

9 https://cran.r-project.org/web/packages/Cubist/index.html
Many more sophisticated model-data comparisons are available; two examples are Lange et al. (2013) and Tongal and Berndtsson (2017), both focusing on the complexity of streamflow. In general, we advocate optimizing for more than just one metric (or for several, mutually independent ones), i.e., a multi-objective optimization (Miettinen 1999), to properly evaluate the different machine learning techniques and to create structure in the zoo of algorithms and their performance in hydrological applications.
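The basic metrics named above are quickly written down explicitly. A minimal plain-Python sketch (the observed and simulated series are invented for illustration):

```python
def rmse(obs, sim):
    """Root Mean Squared Error."""
    return (sum((o - s) ** 2 for o, s in zip(obs, sim)) / len(obs)) ** 0.5

def mae(obs, sim):
    """Mean Absolute Error."""
    return sum(abs(o - s) for o, s in zip(obs, sim)) / len(obs)

def nse(obs, sim):
    """Nash-Sutcliffe Efficiency: 1 is a perfect fit, 0 means the model
    is no better than predicting the observed mean."""
    mean_obs = sum(obs) / len(obs)
    num = sum((o - s) ** 2 for o, s in zip(obs, sim))
    den = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - num / den

obs = [1.0, 2.0, 4.0, 3.0, 2.0]               # invented streamflow record
sim = [1.1, 1.8, 3.5, 3.2, 2.2]               # invented model output

print(round(rmse(obs, sim), 3), round(mae(obs, sim), 3), round(nse(obs, sim), 3))
print(nse(obs, obs))                          # → 1.0 (perfect prediction)
```

The NSE makes the mean-model baseline explicit: a value of 1 is a perfect fit, 0 is no better than predicting the observed mean, and negative values are worse than the mean; each of these single numbers, however, compresses the whole record into one score, which is exactly the limitation discussed above.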

10.4.2 ANNs in Hydrology

As discussed in the Introduction, the focus of machine learning in hydrology (and mostly elsewhere) is on prediction. The most obvious variable to predict in hydrology is streamflow. Here, applications of ANNs have a long history, extending to well before the period when machine learning became a buzzword. In Karunanithi et al. (1994), the network architecture was determined by a so-called cascade-correlation algorithm, and it was observed that the network was capable of adapting to changes in flow history, contrary to analytical models. Likewise, Zealand et al. (1999), working on eight years of streamflow from a larger catchment in Canada, observed the ability of ANNs to model complex nonlinear relationships and to predict flow a couple of time steps ahead. On very short time scales (a few hours), it was also attempted to predict precipitation using ANNs (Kuligowski and Barros 1998), based on wind direction and antecedent precipitation data from a weather gauge.
The literature on ANNs and streamflow exploded with the start of the new
millennium: an ISI Web of Science search lists 4 papers with publication date
prior to 2000, and 602 after 2000. We can only sketch some representative examples.
The lack of transparency of the resulting ANN architecture and weight matrix was the concern of Kingston et al. (2005). These authors tried to implement constraints in the objective function, no longer seeking the global optimum but physically plausible input conditions.
The obtained predictive performance was still comparable to that of the unconstrained fit. Many examples have shown that ANNs (and many other machine learning techniques, for that matter) have difficulties reproducing isolated short-term peaks in runoff records; the summation over different neurons from the hidden layer(s) to produce the output (often a single neuron) has a notorious smoothing effect. To mitigate this smoothing, Parasuraman et al. (2006) added a "spiking" layer to the hidden layer which is active in particular for high flow. A similar layout could also be used successfully for Eddy Covariance gas fluxes (in the same paper). A strictly univariate approach (called "top-down black box"), i.e.,
using only streamflow to predict streamflow, was used in Wang et al. (2006). It turned out that careful preprocessing, in particular deseasonalization without normalization, was the key to obtaining good performance; consistent with that, periodic ANNs performed best in their case study.
The excellent performance of ANNs for real-time forecasting purposes was
documented in Toth and Brath (2007); however, these authors point to the impor-
tance of extensive hydro-meteorological datasets which the ANNs need for training.
Daliakopoulos and Tsanis (2016) compare an ANN with a conceptual model for
rainfall-runoff modeling. They conclude that, on average, their ANN variant (“input
delay neural network”) is superior to the conceptual model, but it is outmatched for
low flow conditions. It could be that extrapolation to yet unseen conditions (not
contained in the training set) is a problem for ANNs. A comparison of ANNs with a newer variant, the extreme learning machine (ELM) (Huang et al. 2006), was performed in Yaseen et al. (2018). ELMs are single-hidden-layer feedforward neural networks; however, the output weights are calculated analytically rather than iteratively using gradient-based methods: they are non-tuned and tremendously faster than ordinary ANNs. In that paper, it was also shown that ELMs exhibit slightly better performance on different time scales.
incorporate online changes in the network structure (number of hidden nodes),
which the authors call variable complexity online sequential ELM (Lima et al.
2017). For daily streamflow prediction, this extension is clearly outperforming the
standard ELM.
In an applied approach (Bozorg-Haddad et al. 2018), ANNs and SVMs are used to provide advice on the optimal operation of water reservoirs. In their case, SVMs outperform ANNs in the forecasting scenarios. In a similar vein, Siqueira et al. (2018) investigate forecasting for Brazilian hydroelectric power plants using "unorganized" machines (ELM and so-called echo state networks) in comparison with standard ANNs. Concerning quality metrics, they present an exception to the rule, as they consider the partial autocorrelation function, mutual information, and maximum relevance/minimum common redundancy as evaluation criteria. Still another exception is the already mentioned paper by Tongal and Berndtsson (2017), which concludes, for daily streamflow data and based on entropy and complexity analysis, that ANNs are well suited for 1-day forecasting but should be used with care beyond 2-day forecasting (their favored algorithm for longer forecasting horizons, the Self-Exciting Threshold AutoRegressive model (SETAR), is not in the class of machine learning techniques).

10.4.3 SVMs in Hydrology

The application of support vector regression (SVR) to hydrological data seems to have started around 15 years ago. One of the earliest examples is Lin et al. (2006), pointing to the important fact that overfitting is rarely an issue for SVMs, contrary to ANNs. For rainfall forecasting, a rather advanced setup of learning techniques is exploited in Hong (2008), combining SVMs with recurrent ANNs and using a particular methodology from the field of genetic algorithms, chaotic particle swarm optimization (Kennedy and Eberhart 1995), to choose the parameters for the SVR. The resulting model framework (called RSVRCPSO in the paper) is shown to yield good forecasting performance.
A comparison of SVR and Bayesian ANNs for prediction of daily streamflow can
be found in Rasouli et al. (2012). Interestingly, they show that climate indices like
the North Atlantic Oscillation (NAO) or the Pacific-North American teleconnection
(PNA) contribute to forecast scores in the case of longer lead times (5-7 days), for a
small watershed in British Columbia, Canada.
In the context of detecting Land Use and Land Cover (LULC) changes, Nourani et al. (2018) combine remote sensing data and a conceptual rainfall-runoff model generating outflow, using a Storage Coefficient (SC) as the parameter representing the LULC. The relation between SC and the streamflow time series is then simulated by SVM and ANNs combined.
The performance of SVR seems to depend on the preprocessing of the hydrolog-
ical data, an observation made also for other analysis methods. In Yu et al. (2018),
SVR is combined with Fourier transformation to generate independent forecasting
models according to the frequency of the components. This FT-SVR appears to yield
remarkably high performance for inflow prediction of the Three Gorges Dam in
China.
For further examples, see the review of SVM applications in the hydrological literature as of 2014 by Raghavendra and Deka (2014).

10.4.4 Papers Combining and Comparing Machine Learning Methods in Hydrology

It should be obvious at this stage that there is no single best machine learning technique for hydrological applications. The demand for analysis and modeling is spread over different temporal and spatial scales, the amount of (training) data varies widely, and even though we focus here on prediction tasks, comparisons with conceptual or process-based models are occasionally desired. This cannot be covered by a single framework. Thus, an increasing number of papers compare several methods and elucidate their respective strengths and weaknesses for the application at hand.
Groundwater potential mapping is the focus of Naghibi and Pourghasemi (2015). Here, boosted regression trees, CART (which is not necessarily in the class of machine learning algorithms), and random forests were trained on spring locations, using 14 predictors. All three performed very well, but were almost indistinguishable, and easily outperformed conventional methods. The heterogeneity of the predictors makes the task a typical application domain for machine learning.

Using wavelets as base functions for ANNs, Shafaei and Kisi (2017) compare ANNs to SVM for daily streamflow prediction. For short prediction times (up to 3 days), the wavelet-based ANN outperformed both the ordinary ANN and the SVM.
In the context of daily streamflow from semiarid mountainous regions, Yin et al. (2018) compare SVR, multivariate adaptive regression splines (MARS), and the M5 tree discussed above; M5 turned out to be the winner for short-term prediction up to 3 days.
A new technique, coined selected model fusion, is developed in Modaresi et al. (2018a, b). This is an example where the individual methods are not run individually and then compared against each other; rather, the outputs of all of them (ANN, SVR, K-NN, and ordinary multiple linear regression) are fused together with an ordered selection algorithm. The fusion of the outputs is superior to even the best of the individual methods for the case of monthly streamflow prediction, demonstrating that it is feasible to combine the respective strengths of the single algorithms.
Finally, Worland et al. (2018) set up most of the methods presented here (i.e., elastic nets, gradient boosting machines, KNN, M5-cubist, and SVM) to predict a peculiar extreme statistic, the 7-day mean streamflow which corresponds to the 10% quantile ("7Q10"), i.e., an indicator of low flow. They tackle the problem of generalizing these 7Q10 values obtained from gauged sites to ungauged ones. They also exploit stack generalization, using M5-cubist as the metamodel, resembling a leave-one-out cross-validation, but now on the level of machine learning techniques (plus three baseline models not related to machine learning). The metamodel ("meta cubist") outcompetes each individual method in terms of the standard quality metrics RMSE and Nash-Sutcliffe.

10.5 Outlook

Machine learning and deep learning are on their way to becoming the key empirical analysis techniques in hydrology, and ML applications in hydrological studies are increasingly made reproducible through code sharing (e.g., Peters et al. 2016).
According to Shen (2018), however, we are still in the early “value discovery”
stage of deep learning. One can expect synergies between deep learning/machine
learning methods and process-based models; it is possible that patterns detected by
the automatic methods initiate new questions on the relevance and nature of pro-
cesses at different scales, leading to new routes in mechanistic modeling. The bet is
open on whether this is going to happen. So far, process-based modeling and ML
approaches are more in competition with each other, and their proponents often belong to different scientific communities.
However, the interplay between process understanding and ML application is a
more complex one. Broadly speaking, ML results are based solely on data presented
to the algorithm, and do not come with any interpretation of what is going on in the system. It is rather easy to produce colorful results devoid of any meaning unless
combined with expert knowledge. In return, this knowledge can be sharpened,
confirmed, or revised with the aid of pattern detection based on ML. It is this
feedback loop which should be pursued further, be it in the field of hydrology proper
or as an interdisciplinary effort.
Arguably the central benefit of deep learning is learning from huge amounts of unsupervised (unlabeled, uncategorized) data.
retrieval, classification, and prediction abilities of deep learning algorithms indicate
their suitability for “Big Data Hydrology” (Irving et al. 2018). Still, the low maturity
of deep learning warrants extensive further research (Najafabadi et al. 2015). A
prerequisite of big data analysis methods to be successful is the presence of big data
in the first place. The monitoring of hydrological systems has to be continued and
extended. This is a challenging and long-lasting task, since some patterns are only
apparent in time series extending over decades.
We expect the field of ML and deep learning to expand rapidly, with a prolifer-
ation of publications also in the field of hydrology. Currently, SVMs, CNNs, and
random forests appear to be the most actively investigated algorithms, but new ones
are forthcoming. Thus, any future chapter on ML in hydrology written in the coming years would probably focus on new additional (or different) methods as well as those that are now being widely utilized.
At the time of writing, a special collection of Water Resources Research on "Big Data and Machine Learning in Water Sciences: Recent Progress and Their Use in Advancing Science"10 is being compiled, in which seven articles have already been published. We await with excitement the rest of the contributions and their discussion within the community of hydrology researchers.

References

Altman NS (1992) An introduction to kernel and nearest-neighbor nonparametric regression. Am Stat 46:175–185. https://doi.org/10.2307/2685209
Bergmeir C, Hyndman RJ, Koo B (2018) A note on the validity of cross-validation for evaluating
autoregressive time series prediction. Comput Stat Data Anal 120:70–83. https://doi.org/10.
1016/j.csda.2017.11.003
Beven K, Freer J (2001) Equifinality, data assimilation, and uncertainty estimation in mechanistic
modelling of complex environmental systems. J Hydrol 249:11–29. https://doi.org/10.1016/
S0022-1694(01)00421-8
Bishop C (2006) Pattern recognition and machine learning. Springer, New York. 738 p
Blöschl G (2001) Scaling in hydrology. Hydrol Process 15:709–711. https://doi.org/10.1002/hyp.
432
Boser BE, Guyon IM, Vapnik VN (1992) A training algorithm for optimal margin classifiers. In:
Proceedings of the fifth annual workshop on computational learning theory. ACM, Pittsburgh,
pp 144–152. https://doi.org/10.1145/130385.130401

10 https://agupubs.onlinelibrary.wiley.com/doi/toc/10.1002/(ISSN)1944-7973.MACHINELEARN

Bozorg-Haddad O, Aboutalebi M, Ashofteh PS, Loaiciga HA (2018) Real-time reservoir operation using data mining techniques. Environ Monit Assess 190:594. https://doi.org/10.1007/s10661-018-6970-2
Breiman L (2001) Random forests. Mach Learn 45:5–32. https://doi.org/10.1023/
A:1010933404324
Breiman L, Friedman JH, Stone CJ, Olshen RA (1984) Classification and regression trees. Chap-
man & Hall, Boca Raton. 368 p
Clark MP, Nijssen B, Lundquist JD, Kavetski D, Rupp DE, Woods RA et al (2015) A unified
approach for process-based hydrological modeling: 1. Modeling concept. Water Resour Res
51:2498–2514. https://doi.org/10.1002/2015WR017198
Corzo Perez GA (2009) Hybrid models for hydrological forecasting: Integration of data-driven and
conceptual modelling techniques. Doctoral thesis, TU Delft. 215 p
Cybenko G (1989) Approximation by superpositions of a sigmoidal function. Math Control Signal
2:303–314. https://doi.org/10.1007/BF02551274
Daliakopoulos IN, Tsanis IK (2016) Comparison of an artificial neural network and a conceptual
rainfall-runoff model in the simulation of ephemeral streamflow. Hydrol Sci J 61:2763–2774.
https://doi.org/10.1080/02626667.2016.1154151
Dechter R (1986) Learning while searching in constraint-satisfaction problems. In: AAAI ’86 proceedings of the fifth AAAI national conference on artificial intelligence, Philadelphia, Pennsylvania, pp 178–183
Fatichi S, Pappas C, Valeriy IY (2016) Modeling plant–water interactions: an ecohydrological
overview from the cell to the global scale. WIRES Water 3:327–368. https://doi.org/10.1002/
wat2.1125
Friedman JH (2002) Stochastic gradient boosting. Comput Stat Data An 38:367–378. https://doi.
org/10.1016/S0167-9473(01)00065-2
Friedman JH, Hastie T, Tibshirani R (2010) Regularization paths for generalized linear models via
coordinate descent. J Stat Softw 33:1–22. https://doi.org/10.18637/jss.v033.i01
Geurts P, Ernst D, Wehenkel L (2006) Extremely randomized trees. Mach Learn 63:3–42. https://
doi.org/10.1007/s10994-006-6226-1
Gudmundsson L, Seneviratne SI (2015) Towards observation-based gridded runoff estimates for
Europe. Hydrol Earth Syst Sci 19:2859–2879. https://doi.org/10.5194/hess-19-2859-2015
Hastie T, Tibshirani R, Friedman JH (2008) The elements of statistical learning. Springer,
New York. 745 p
Hofmann T, Schölkopf B, Smola AJ (2008) Kernel methods in machine learning. Ann Stat
36:1171–1220. https://doi.org/10.1214/009053607000000677
Hong W-C (2008) Rainfall forecasting by technological machine learning models. Appl Math
Comput 200:41–57. https://doi.org/10.1016/j.amc.2007.10.046
Hothorn T (2019) CRAN task view: machine learning and statistical learning. R-project.org.
Accessed 27 Feb 2019. https://cran.r-project.org/web/views/MachineLearning.html
Huang G-B, Zhu Q-Y, Siew C-K (2006) Extreme learning machine: theory and applications.
Neurocomputing 70:489–501. https://doi.org/10.1016/j.neucom.2005.12.126
Irving K, Kuemmerlen M, Kiesel J, Kakouei K, Domisch S, Jähnig SC (2018) A high-resolution
streamflow and hydrological metrics dataset for ecological modeling using a regression model.
Sci Data 5:180224. https://doi.org/10.1038/sdata.2018.224
Karunanithi N, Grenney WJ, Whitley D, Bovee K (1994) Neural networks for river flow prediction.
J Comput Civil Eng 8:201–220. https://doi.org/10.1061/(ASCE)0887-3801(1994)8:2(201)
Kennedy J, Eberhart R (1995) Particle swarm optimization. In: Proceedings of the ICNN’95
international conference on neural networks, vol 4, pp 1942–1948. https://doi.org/10.1109/
ICNN.1995.488968
Kingston GB, Maier HR, Lambert MF (2005) Calibration and validation of neural networks to
ensure physically plausible hydrological modeling. J Hydrol 314:158–176. https://doi.org/10.
1016/j.jhydrol.2005.03.013

Kleinberg EM (1996) An overtraining-resistant stochastic modeling method for pattern recognition. Ann Stat 24:2319–2349
Kuligowski RJ, Barros AP (1998) Experiments in short-term precipitation forecasting using artificial neural networks. Mon Weather Rev 126:470–482. https://doi.org/10.1175/1520-0493(1998)126<0470:EISTPF>2.0.CO;2
Lange H, Rosso OA, Hauhs M (2013) Ordinal pattern and statistical complexity analysis of daily
stream flow time series. Eur Phys- J Spec Top 222:535–552. https://doi.org/10.1140/epjst/
e2013-01858-3
LeCun Y, Boser B, Denker JS, Henderson D, Howard RE, Hubbard W et al (1989) Backpropagation applied to handwritten zip code recognition. Neural Comput 1:541–551. https://doi.org/10.1162/neco.1989.1.4.541
Liaw A, Wiener M (2002) Classification and regression by randomForest. R News 2:18–22
Lima AR, Hsieh WW, Cannon AJ (2017) Variable complexity online sequential extreme learning
machine, with applications to streamflow prediction. J Hydrol 555:983–994. https://doi.org/10.
1016/j.jhydrol.2017.10.037
Lin JY, Cheng CT, Chau KW (2006) Using support vector machines for long-term discharge
prediction. Hydrol Sci J 51:599–612. https://doi.org/10.1623/hysj.51.4.599
Lischeid G (2001) Investigating short-term dynamics and long-term trends of SO4 in the runoff of a
forested catchment using artificial neural networks. J Hydrol 243:31–42. https://doi.org/10.
1016/S0022-1694(00)00399-1
Loh W-Y (2011) Classification and regression trees. WIRES Data Min Knowl 1:14–23. https://doi.
org/10.1002/widm.8
Ma Y, Li XY, Guo L, Lin H (2017) Hydropedology: interactions between pedologic and hydrologic
processes across spatiotemporal scales. Earth-Sci Rev 171:181–195. https://doi.org/10.1016/j.
earscirev.2017.05.014
Maier HR, Dandy GC (1995) Comparison of the Box-Jenkins procedure with artificial neural
network methods for univariate time series modelling. Research Report No R 127, June 1995.
Department of Civil and Environmental Engineering, University of Adelaide, Adelaide,
Australia
Miettinen K (1999) Nonlinear multiobjective optimization. Springer, New York., 298 p. https://doi.
org/10.1007/978-1-4615-5563-6
Modaresi F, Araghinejad S, Ebrahimi K (2018a) A comparative assessment of artificial neural
network, generalized regression neural network, least-square support vector regression, and
K-nearest neighbor regression for monthly streamflow forecasting in linear and nonlinear
conditions. Water Resour Manag 32:243–258. https://doi.org/10.1007/s11269-017-1807-2
Modaresi F, Araghinejad S, Ebrahimi K (2018b) Selected model fusion: an approach for improving
the accuracy of monthly streamflow forecasting. J Hydroinform 20:917–933. https://doi.org/10.
2166/hydro.2018.098
Naghibi SA, Pourghasemi HR (2015) A comparative assessment between three machine learning
models and their performance comparison by bivariate and multivariate statistical methods in
groundwater potential mapping. Water Resour Manag 29:5217–5236. https://doi.org/10.1007/
s11269-015-1114-8
Najafabadi MM, Villanustre F, Khoshgoftaar TM, Seliya N, Wald R, Muharemagic E (2015) Deep
learning applications and challenges in big data analytics. J Big Data 2(1). https://doi.org/10.
1186/s40537-014-0007-7
Nash JE, Sutcliffe JV (1970) River flow forecasting through conceptual models, I. A discussion of
principles. J Hydrol 10:282–290. https://doi.org/10.1016/0022-1694(70)90255-6
Nourani V, Roushangar K, Andalib G (2018) An inverse method for watershed change detection
using hybrid conceptual and artificial intelligence approaches. J Hydrol 562:371–384. https://
doi.org/10.1016/j.jhydrol.2018.05.018
Parasuraman K, Elshorbagy A, Carey SK (2006) Spiking modular neural networks: a neural
network modeling approach for hydrological processes. Water Resour Res 42:W05412.
https://doi.org/10.1029/2005WR004317
Peters J, Janzing D, Schölkopf B (2017) Elements of causal inference: foundations and learning
algorithms. MIT Press, Cambridge, MA. 288 p
Peters R, Lin Y, Berger U (2016) Machine learning meets individual-based modelling: self-
organising feature maps for the analysis of below-ground competition among plants. Ecol
Model 326:142–151. https://doi.org/10.1016/j.ecolmodel.2015.10.014
Quinlan JR (1993) Combining instance-based and model-based learning. In: Proceedings of the
tenth international conference on machine learning. Morgan Kaufmann, Amherst, MA, pp
236–243
Raghavendra SN, Deka PC (2014) Support vector machine applications in the field of hydrology: a
review. Appl Soft Comput 19:372–386. https://doi.org/10.1016/j.asoc.2014.02.002
Rasouli K, Hsieh WW, Cannon AJ (2012) Daily streamflow forecasting by machine learning
methods with weather and climate inputs. J Hydrol 414–415:284–293. https://doi.org/10.
1016/j.jhydrol.2011.10.039
Rasp S, Pritchard MS, Gentine P (2018) Deep learning to represent subgrid processes in climate
models. Proc Natl Acad Sci USA 115:9684–9689. https://doi.org/10.1073/pnas.1810286115
Richards LA (1931) Capillary conduction of liquids in porous mediums. Physics 1:318–333.
https://doi.org/10.1063/1.1745010
Roberts DR, Bahn V, Ciuti S, Boyce MS, Elith J, Guillera-Arroita G et al (2017) Cross-validation
strategies for data with temporal, spatial, hierarchical, or phylogenetic structure. Ecography
40:913–929. https://doi.org/10.1111/ecog.02881
Rosenblatt F (1958) The perceptron: a probabilistic model for information storage and organization
in the brain. Psychol Rev 65:386–408. https://doi.org/10.1037/h0042519
Rumelhart DE, Hinton GE, Williams RJ (1986) Learning representations by back-propagating
errors. Nature 323:533–536. https://doi.org/10.1038/323533a0
Samuel AL (1959) Some studies in machine learning using the game of checkers. IBM J Res Dev
3:210–229. https://doi.org/10.1147/rd.33.0210
Schliep K, Hechenbichler K (2016) kknn: Weighted k-Nearest Neighbors. R package version 1.3.1.
https://CRAN.R-project.org/package=kknn
Schmidhuber J (2015) Deep learning in neural networks: an overview. Neural Netw 61:85–117.
https://doi.org/10.1016/j.neunet.2014.09.003
Schoups G, Vrugt JA (2010) A formal likelihood function for parameter and predictive inference of
hydrologic models with correlated, heteroscedastic, and non-Gaussian errors. Water Resour Res
46:W10531. https://doi.org/10.1029/2009WR008933
Schultz W (2007) Reward signals. Scholarpedia 2:2184. https://doi.org/10.4249/scholarpedia.2184
Shafaei M, Kisi O (2017) Predicting river daily flow using wavelet-artificial neural networks based
on regression analyses in comparison with artificial neural networks and support vector machine
models. Neural Comput Appl 28:S15–S28. https://doi.org/10.1007/s00521-016-2293-9
Shen C (2018) Deep learning: a next-generation big-data approach for hydrology. EOS Trans 99.
https://doi.org/10.1029/2018EO095649
Silver D, Huang A, Maddison CJ, Guez A, Sifre L, van den Driessche G et al (2016) Mastering the
game of Go with deep neural networks and tree search. Nature 529:484–489. https://doi.org/10.
1038/nature16961
Silver N (2012) The signal and the noise: why so many predictions fail--but some don’t. Penguin
Books, New York. 560 p
Siqueira H, Boccato L, Luna I, Attux R, Lyra C (2018) Performance analysis of unorganized
machines in streamflow forecasting of Brazilian plants. Appl Soft Comput 68:494–506. https://
doi.org/10.1016/j.asoc.2018.04.007
Sivapalan M (2003) Process complexity at hillslope scale, process simplicity at the watershed scale:
is there a connection? Hydrol Process 17:1037–1041. https://doi.org/10.1002/hyp.5109
Sivapalan M (2006) Pattern, process and function: elements of a unified theory of hydrology at the
catchment scale. Encycl Hydrol Sci. https://doi.org/10.1002/0470848944.hsa012
Sivapalan M, Grayson R, Woods R (2004) Scale and scaling in hydrology. Hydrol Process
18:1369–1371. https://doi.org/10.1002/hyp.1417
Sugihara G, May R, Ye H, Hsieh C-H, Deyle E, Fogarty M et al (2012) Detecting causality in
complex ecosystems. Science 338:496–500. https://doi.org/10.1126/science.1227079
Tongal H, Berndtsson R (2017) Impact of complexity on daily and multi-step forecasting of
streamflow with chaotic, stochastic, and black-box models. Stoch Environ Res Risk Assess
31:661–682. https://doi.org/10.1007/s00477-016-1236-4
Toth E, Brath A (2007) Multistep ahead streamflow forecasting: role of calibration data in
conceptual and neural network modeling. Water Resour Res 43:W11405. https://doi.org/10.
1029/2006WR005383
Tyralis H, Papacharalampous G, Langousis A (2019) A brief review of Random Forests for water
scientists and practitioners and their recent history in water resources. Water 11:910. https://doi.
org/10.3390/w11050910
Viney NR, Sivapalan M (2004) A framework for scaling of hydrologic conceptualizations based on
a disaggregation-aggregation approach. Hydrol Process 18:1395–1408. https://doi.org/10.1002/
hyp.1419
Wang W, Van Gelder P, Vrijling JK, Ma J (2006) Forecasting daily streamflow using hybrid ANN
models. J Hydrol 324:383–399. https://doi.org/10.1016/j.jhydrol.2005.09.032
Werbos PJ (1975) Beyond regression: new tools for prediction and analysis in the behavioral
sciences. Harvard University Press, Cambridge, MA. 906 p
Wolpert DH (1992) Stacked generalization. Neural Netw 5:241–259. https://doi.org/10.1016/
S0893-6080(05)80023-1
Worland SC, Farmer WH, Kiang JE (2018) Improving predictions of hydrological low-flow indices
in ungaged basins using machine learning. Environ Model Softw 101:169–182. https://doi.org/
10.1016/j.envsoft.2017.12.021
Yaseen ZM, Allawi MF, Yousif AA, Jaafar O, Hamzah FM, El-Shafie A (2018) Non-tuned
machine learning approach for hydrological time series forecasting. Neural Comput Appl
30:1479–1491. https://doi.org/10.1007/s00521-016-2763-0
Yin ZL, Feng Q, Wen XH, Deo RC, Yang LS, Si JH et al (2018) Design and evaluation of SVR,
MARS and M5Tree models for 1, 2 and 3-day lead time forecasting of river flow data in a
semiarid mountainous catchment. Stoch Environ Res Risk Assess 32:2457–2476. https://doi.
org/10.1007/s00477-018-1585-2
Yu X, Zhang XQ, Qin H (2018) A data-driven model based on Fourier transform and support vector
regression for monthly reservoir inflow forecasting. J Hydro-Environ Res 18:12–24. https://doi.
org/10.1016/j.jher.2017.10.005
Zealand CM, Burn DH, Simonovic SP (1999) Short term streamflow forecasting using artificial
neural networks. J Hydrol 214:32–48. https://doi.org/10.1016/S0022-1694(98)00242-X
Zhang S-Q (2009) Enhanced supervised locally linear embedding. Pattern Recogn Lett
30:1208–1218. https://doi.org/10.1016/j.patrec.2009.05.011
Zou H, Hastie T (2005) Regularization and variable selection via the elastic net. J R Stat Soc B
67:301–320. https://doi.org/10.1111/j.1467-9868.2005.00503.x
