All content following this page was uploaded by Harvey Miller on 26 April 2016.
Data-driven geography
Harvey J. Miller • Michael F. Goodchild
Abstract The context for geographic research has shifted from a data-scarce to a data-rich environment, in which the most fundamental changes are not just the volume of data, but the variety and the velocity at which we can capture georeferenced data; trends often associated with the concept of Big Data. A data-driven geography may be emerging in response to the wealth of georeferenced data flowing from sensors and people in the environment. Although this may seem revolutionary, in fact it may be better described as evolutionary. Some of the issues raised by data-driven geography have in fact been longstanding issues in geographic research, namely, large data volumes, dealing with populations and messy data, and tensions between idiographic and nomothetic knowledge. The belief that spatial context matters is a major theme in geographic thought and a major motivation behind approaches such as time geography, disaggregate spatial statistics and GIScience. There is potential to use Big Data to inform both geographic knowledge-discovery and spatial modeling. However, there are challenges, such as how to formalize geographic knowledge to clean data and to ignore spurious patterns, and how to build data-driven models that are both true and understandable.

Keywords Big data · GIScience · Spatial statistics · Geographic knowledge discovery · Geographic thought · Time geography

H. J. Miller (corresponding author)
Department of Geography, The Ohio State University, Columbus, OH, USA
e-mail: miller.81@osu.edu

M. F. Goodchild
Department of Geography, University of California, Santa Barbara, Santa Barbara, CA, USA
e-mail: good@geog.ucsb.edu

GeoJournal (2015) 80:449–461

Introduction

A great deal of attention is being paid to the potential impact of data-driven methods on the sciences. The ease of collecting, storing, and processing digital data may be leading to what some are calling the fourth paradigm of science, following the millennia-old tradition of empirical science describing natural phenomena, the centuries-old tradition of theoretical science using models and generalization, and the decades-old tradition of computational science simulating complex systems. Instead of looking through telescopes and microscopes, researchers are increasingly interrogating the world through large-scale, complex instruments and systems that relay observations to large databases to be processed and stored as information and knowledge in computers (Hey et al. 2009).

This fundamental change in the nature of the data available to researchers is leading to what some call Big Data. Big Data refer to data that outstrip our
capabilities to analyze. This has three dimensions, the so-called "three Vs": (1) volume: the amount of data that can be collected and stored; (2) velocity: the speed at which data can be captured; and (3) variety: encompassing both structured (organized and stored in tables and relations) and unstructured (text, imagery) data (Dumbill 2012). Some of these data are generated from massive simulations of complex systems such as cities (e.g., TRANSIMS; see Cetin et al. 2002), but a large portion of the flood is from sensors and software that digitize and store a broad spectrum of social, economic, political, and environmental patterns and processes (Graham and Shelton 2013; Kitchin 2014). Sources of geographically (and often temporally) referenced data include location-aware technologies such as the Global Positioning System and mobile phones; in situ sensors carried by individuals in phones, attached to vehicles, and embedded in infrastructure; remote sensors carried by airborne and satellite platforms; radiofrequency identification (RFID) tags attached to objects; and georeferenced social media (Miller 2007, 2010; Sui and Goodchild 2011; Townsend 2013).

Yet despite the enthusiasm over Big Data and data-driven methods, the role they can play in scholarly research, and specifically research in geography, may not be immediately apparent. Are theory and explanation archaic when we can measure and describe so much, so quickly? Does data velocity really matter in research, with its traditions of careful reflection? Can the obvious problems associated with variety (lack of quality control, lack of rigorous sampling design) be overcome? Can we make valid generalizations from ongoing, serendipitous (instead of carefully designed and instrumented) data collection? In short, can Big Data and data-driven methods lead to significant discoveries in geographic research? Or will the research community continue to rely on what for the purposes of this paper we will term Scarce Data: the products of public-sector statistical programs that have long provided the major input to research in quantitative human geography?

Our purpose in this paper is to explore the implications of these tensions (theory-driven versus data-driven research, prediction versus discovery, law-seeking versus description-seeking) for research in geography. We anticipate that geography will provide a distinct context for several reasons: the specific issues associated with location, the integration of the social and the environmental, and the existence within the discipline of traditions with very different approaches to research. Moreover, although data-driven geography may seem revolutionary, in fact it may be better described as evolutionary since its challenges have long been themes in the history of geographic thought and the development of geographical techniques.

The next section of this paper discusses the concepts of Big Data and data-driven geography, addressing the question of what is special about the new flood of georeferenced data. The "Data-driven geography: challenges" section of this paper discusses major challenges facing data-driven geography; these include dealing with populations (not samples), messy (not clean) data, and correlations (not causality). The "Theory in data-driven geography" section discusses the role of theory in data-driven geography. "Approaches to data-driven geography" identifies ways to incorporate Big Data into geographic research. The final section concludes this paper with a summary and some cautions on the broader impacts of data-driven geography on society.

Big data and data-driven geography

Humanity's current ability to acquire, process, share, and analyze huge quantities of data is without precedent in human history. It has led to the coining of such terms as the "exaflood" and the metaphor of "drinking from a firehose" (Sui et al. 2013; Waldrop 1990). It has also led to the suggestion that we are entering a new, fourth phase of science that will be driven not so much by careful observation by individuals, or theory development, or computational simulation, as by this new abundance of digital data (Hey et al. 2009).

It is worth recognizing immediately, however, that the firehose metaphor has a comparatively long history in geography, and that the discipline is by no means new to an abundance of voluminous data. The Landsat program of satellite-based remote sensing began in the early 1970s by acquiring data at rates that were well in excess of the analytic capacities of the computational systems of the time; subsequent improvements in sensor resolution and the proliferation of military and civilian satellites have meant that four decades later data volumes continue to challenge even the most powerful computational systems.
Volume is clearly not the only characteristic that distinguishes today's data supply from that of previous eras. Today, data are being collected from many sources, including social media, crowd sourcing, ground-based sensor networks, and surveillance cameras, and our ability to integrate such data and draw inferences has expanded along with the volume of the supply. The phrase Big Data implies a world in which predictions are made by mining data for patterns and correlations among these new sources, and some very compelling instances of surprisingly accurate predictions have surfaced in the past few years with respect to the results of the Eurovision song contest (O'Leary 2012), the stock market (Preis et al. 2013), and the flu (Butler 2008). The theme of Big Data is often associated not only with volume but with variety, reflecting these multiple sources, and velocity, given the speed with which such data can now be analyzed to make predictions in close-to-real time.

Ubiquitous, ongoing data flows are a big deal because they allow us to capture spatio-temporal dynamics directly (rather than inferring them from snapshots) and at multiple scales. The data are collected on an ongoing basis, meaning that both mundane and unplanned events can be captured. To borrow Nassim Taleb's metaphor for probable and inconsequential versus improbable but consequential events (Taleb 2007): we do not need to sort the white swans from the black swans before collecting data: we can measure all swans and then figure out later which are white or black. White swans may also combine in surprising ways to form black-swan events.

Big Data is leading to new approaches to research methodology. Fotheringham (1998) defines geocomputation as quantitative spatial analysis where the computer plays a pivotal role. The use of the computer drives the form of the analysis rather than just being a convenient vehicle: analysts design geocomputational techniques with the computer in mind. Similarly, data play a pivotal role in data-driven methods. From this perspective data are not just a convenient way to calibrate, validate, and test but rather the driving force behind the analysis. Consequently, analysts design data-driven techniques with data in mind, and not just large volumes of data, but a wider spectrum of data flowing at higher speeds from the world. In this sense we may indeed be entering a fourth scientific paradigm where scientific methods are configured to satisfy data rather than data configured to satisfy methods.

Data-driven geography: challenges

In Big Data: A Revolution That Will Transform How We Live, Work, and Think, Mayer-Schonberger and Cukier (2013) identify three main challenges of Big Data in science: (1) populations, not samples; (2) messy, not clean data; and (3) correlations, not causality. We discuss these three challenges for geographic research in the following subsections.

Populations, not samples

Back when analysis was largely performed by hand rather than by machines, dealing with large volumes of data was impractical. Instead, researchers developed methods for collecting representative samples and for generalizing to inferences about the population from which they were drawn. Random sampling was thus a strategy for dealing with information overload in an earlier era. In statistical programs such as the US Census of Population it was also a means for controlling costs.

Random sampling works well, but it is fragile: it works only as long as the sampling is representative. A sampling rate of one in six (the rate previously used by the US Bureau of the Census for its more elaborate Long Form) may be adequate for some purposes, but becomes increasingly problematic when analysis focuses on comparatively rare subcategories. Random sampling also requires a process for enumerating and selecting from the population (a sampling frame), which is problematic if enumeration is incomplete. Sample data also lack extensibility for secondary uses. Because randomness is so critical, one must carefully plan for sampling, and it may be difficult to re-analyze the data for purposes other than those for which it was collected (Mayer-Schonberger and Cukier 2013).

In contrast, many of the new data sources consist of populations, not samples: the ease of collecting, storing, and processing digital data means that instead of dealing with a small representation of the population we can work with the entire population and thus escape one of the constraints of the past. But one problem with populations is that they are often self-selected rather than sampled: for example, all people who signed up for Facebook, all people who carry smartphones, or all cars that happened to travel within the City of London between 8:00 a.m. and 11:00 a.m. on 2 September 2013. Geolocated tweets are an attractive
source of information on current trends (e.g., Tsou et al. 2013), but only a small fraction of tweets are accurately geolocated using GPS. Since we do not know the demographic characteristics of any of these groups, it is impossible to generalize from them to any larger populations from which they might have been drawn.

Yet geographers have long had to contend with the issues associated with samples and their parent populations. Consider, for example, an analysis of the relationship between people over 65 years old and people registered as Republicans, the case studied by Openshaw and Taylor in their seminal article on the modifiable areal unit problem (Openshaw and Taylor 1979). The 99 counties of Iowa (their source of data) are all of the counties that exist in Iowa. They are therefore not a random sample of Iowa counties, or even a representative sample of counties of the US, so the methods of inferential statistics that assume random and independent sampling are not applicable. In remote sensing it is common to analyze all of the pixels in a given scene; again, these are not a random sample of any larger population.

However, the cases discussed above are ones in which we can be assured that the entire population of interest is included: we are interested in all of the land cover in a scene, or all of the people over 65 and Republicans in Iowa. This is often not true with many new sources of data. A challenge is how to identify the niches to which monitored population data can be applied with reasonable generality. This inverts the classic sampling problem where we identify a question and collect data to answer that question. Instead, we collect the data and determine what questions we can answer.

Another issue concerns what people are volunteering when they volunteer geographic and other information (Goodchild 2007). Social media such as Facebook may have high penetration rates with respect to population, but do not necessarily have high penetration rates into people's lives. Checking in at an orchestra concert or lecture provides a noble image that a person would like to promote, while checking in at a bar at 10 a.m. is an image that a person may be less keen to share. In the classic sociology text The Presentation of Self in Everyday Life, Erving Goffman uses theater as a metaphor and distinguishes between stage and backstage behaviors, with stage behaviors being consistent with the role people wish to play in public life and backstage behaviors being private actions that people wish to keep private (Goffman 1959). While there are certainly cases of over-sharing behavior (especially among celebrities) we cannot be assured that the information people volunteer is an accurate depiction of their complete lives or just of the lives they wish to present to the social sphere. Several geographic questions follow from these observations. What is the geography of stage versus backstage realms in a city or region? Does this distribution vary by age, gender, socioeconomic status, or culture? What do these imply for what we can know about human spatial behavior?

In addition to selective volunteering of information about their lives, there also may be selection biases in the information people volunteer about environments. OpenStreetMap (OSM) is often identified as a successful crowdsourced mapping project: many cities of the world have been mapped by people on a voluntary basis to a remarkable degree of accuracy. However, some regions get mapped more quickly than others, such as tourist locations, recreation areas, and affluent neighborhoods, while locations of less interest to those who participate in OSM (such as poorer neighborhoods) receive less attention (Haklay 2010). While biases exist in official, administrative maps (e.g., governments in developing nations often do not map informal settlements such as favelas), the biases in crowdsourced maps are likely to be more subtle. Similarly, the rise of civic hacking where citizens generate data, maps, and tools to solve social problems tends to focus on the problems that citizens with laptops, fast internet connections, technical skills, and available time consider to be important (Townsend 2013).

Messy, not clean

The new data sources are often messy, consisting of data that are unstructured, collected with no quality control, and frequently accompanied by no documentation or metadata. There are at least two ways of dealing with such messiness. On the one hand, we can restrict our use of the data to tasks that do not attempt to generalize or to make assumptions about quality. Messy data can be useful in what one might term the softer areas of science: initial exploration of study areas, or the generation of hypotheses. Ethnography, qualitative research, and investigations of Grounded Theory (Glaser and Strauss 1967) often focus on using interviews, text, and other sources to reveal what was
otherwise not known or recognized, and in such contexts the kinds of rigorous sampling and documentation associated with Scarce Data are largely unnecessary. We discuss this option in greater detail later in the paper.

On the other hand, we can attempt to clean and verify the data, removing as much as possible of the messiness, for use in traditional scientific knowledge construction. Goodchild and Li (2012) discuss this approach in the context of crowdsourced geographic information. They note that traditional production of geographic information has relied on multiple sources, and on the expertise of cartographers and domain scientists to assemble an integrated picture of the landscape. For example, terrain information may be compiled from photogrammetry, point measurements of elevation, and historic sources; as a result of this process of synthesis the published result may well be more accurate than any of the original sources.

Goodchild and Li (2012) argue that this traditional process of synthesis, which is largely hidden from popular view and not apparent in the final result, will become explicit and of critical importance in the new world of Big Data. They identify three strategies for cleaning and verifying messy data: (1) the crowd solution; (2) the social solution; and (3) the knowledge solution. The crowd solution is based on Linus's Law, named in honor of the developer of Linux, Linus Torvalds: "Given enough eyeballs, all bugs are shallow" (Raymond 2001). In other words, the more people who can access and review your code, the greater the accuracy of the final product. Geographic facts that can be synthesized from multiple original reports are likely to be more accurate than single reports. This is of course the strategy used by Wikipedia and its analogs: open contributions and open editing are evidently capable of producing reasonably accurate results when assisted by various automated editing procedures.

In the geographic case, however, several issues arise that limit the success of the crowd solution. Reports of events at some location may be difficult to compare if the means used to specify location (place names, street address, GPS) are uncertain, and if the means used to describe the event is ambiguous. Geographic facts may be obscure, such as the names of mountains in remote parts of the world, and the crowd may therefore have little interest or ability to edit errors.

Goodchild and Li (2012) describe the social solution as implementing a hierarchical structure of volunteer moderators and gatekeepers. Individuals are nominated to roles in the hierarchy based on their track record of activity and the accuracy of their contributions. Volunteered facts that appear questionable or contestable are referred up the hierarchy, to be accepted, queried, or rejected as appropriate. Schemes such as this have been implemented by many projects, including OSM and Wikipedia. Their major disadvantage is speed: since humans are involved, the solution is best suited to applications where time is not critical.

The third, the knowledge solution, asks how one might know if a purported fact is false, or likely to be false. Spelling errors and mistakes of syntax are simple indicators which all of us use to triage malicious email. In the geographic case, one can ask whether a purported fact is consistent with what is already known about the geographic world, in terms both of facts and theories. Moreover, such checks of consistency can potentially be automated, allowing triage to occur in close to real time; this approach has been implemented, although on a somewhat unstructured basis, by companies that daily receive thousands of volunteered corrections to their geographic databases.

Fig. 1 Syntactical geographic knowledge: Highway on-ramp feature geometry

A purported fact can deviate from established geographic knowledge in either syntax or semantics, or both. Syntax refers to the rules by which the world is constructed, while semantics refers to the meaning of those facts. Syntactical knowledge is often easier to check than semantic knowledge. For example, Fig. 1
illustrates an example of syntactical geographic knowledge. We know from engineering specifications that an on-ramp can only intersect a freeway at a small angle (typically 30 degrees or less). If a road-network database appears to have on-ramp intersections of more than 30 degrees we know that the data are likely to be wrong; in the case of Fig. 1, many of the apparent intersections of the light-blue segments are more likely to be overpasses or underpasses. Such errors have been termed errors of logical consistency in the literature of geographic information science (e.g., Guptill and Morrison 1995).

In contrast, Fig. 2 illustrates semantic geographic knowledge: a photograph of a lake that has been linked to the Google Earth map of The Ohio State University campus. However, this photograph seems to be located incorrectly: we recognize the scene as Mirror Lake, a campus icon to the southeast of the purported location indicated on the map. The purported location must be wrong, but can we be sure? Perhaps the university moved Mirror Lake to make way for a new Geography building? Or perhaps Mirror Lake was so popular that the university created a mirror Mirror Lake to handle the overflow? We cannot immediately and with complete confidence dismiss this empirical fact without additional investigation since it does not violate any known rules by which the world is constructed: there is nothing preventing Mirror Lake from being moved or mirrored. Of course, there are some semantic facts that can be dismissed confidently as absurd: one would not expect to see a lake scene on the top of Mt. Everest or in the Sahara Desert. Nevertheless, there is no firm line between clearly absurd and non-absurd semantic facts: for example, one would not expect to see Venice or New York City in the Mojave Desert, but Las Vegas certainly exists.

Fig. 2 Semantic geographic knowledge: Where is Mirror Lake? (Google Earth; last accessed 24 September 2013 10:00 a.m. EDT)

A major task for the knowledge solution is formalizing knowledge to support automated triage of asserted facts and automated data fusion. Knowledge can be derived empirically or as predictions from theories, models, and simulations. In the latter case, we may be looking for data at variance with predictions as part of the knowledge-discovery and construction processes.

There are at least two major challenges to formalizing geographic knowledge. First, geographic concepts such as neighborhood, region, the Midwest, and developing nations can be vague, fluid, and contested. A second challenge is the development of explicit, formal, and computable representations of geographic knowledge. Much geographic knowledge is buried in formal theories, models, and equations that must be solved or processed, or in informal language that must be interpreted. In contrast, knowledge-discovery techniques require explicit representations such as rules, hierarchies, and concept networks that can be accessed directly without processing (Miller 2010).
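The on-ramp constraint described earlier is one case where geographic knowledge can be written as an explicit, computable rule. The following is a minimal sketch in Python, not an implementation taken from Goodchild and Li (2012) or from any production system. It assumes planar (projected) coordinates, treats each ramp and the freeway as a single straight segment near the junction, and uses the 30-degree figure quoted in the text as the threshold. The function names and the sample data are hypothetical.

```python
import math

# Threshold quoted in the text; actual design standards vary (assumption).
MAX_RAMP_ANGLE_DEG = 30.0


def bearing_deg(p, q):
    """Planar direction of the segment p -> q, in degrees within [0, 360)."""
    return math.degrees(math.atan2(q[1] - p[1], q[0] - p[0])) % 360.0


def intersection_angle_deg(seg_a, seg_b):
    """Acute angle between two undirected segments, in degrees within [0, 90]."""
    diff = abs(bearing_deg(*seg_a) - bearing_deg(*seg_b)) % 180.0
    return min(diff, 180.0 - diff)


def flag_suspect_ramps(ramps, freeway, max_angle=MAX_RAMP_ANGLE_DEG):
    """Return ids of ramps whose angle with the freeway exceeds max_angle."""
    return [
        ramp_id
        for ramp_id, segment in ramps.items()
        if intersection_angle_deg(segment, freeway) > max_angle
    ]


# Hypothetical sample data: each segment is ((x1, y1), (x2, y2)).
freeway = ((0.0, 0.0), (10.0, 0.0))
ramps = {
    "r1": ((0.0, -1.0), (5.0, 0.0)),   # shallow merge, about 11 degrees
    "r2": ((5.0, -5.0), (5.0, 0.0)),   # perpendicular, 90 degrees
    "r3": ((10.0, 2.0), (5.0, 0.0)),   # digitized in reverse, about 22 degrees
}

print(flag_suspect_ramps(ramps, freeway))  # ['r2']
```

A check of this kind only flags candidates for review. As the discussion of Fig. 1 notes, a large apparent angle may indicate an overpass or underpass that was digitized as an intersection rather than a badly drawn ramp.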
Table 1 A brief history of partnerships and tensions between nomothetic (law-seeking) and idiographic (description-seeking) knowledge in geographic thought

Path to geographic knowledge    Advocates
Nomothetic ↔ idiographic        Strabo
                                Ptolemy
Nomothetic → idiographic        Varenius
Nomothetic ← idiographic        Humboldt
                                Ritter
Idiographic                     Hartshorne
Nomothetic                      Schaefer
Nomothetic ↔ idiographic        Hägerstrand (time geography)
                                Fotheringham/Anselin (local spatial statistics)
                                Tomlinson/Goodchild (GIScience)

early history of geography in the time of Strabo (64/63 BCE–24 CE) and Ptolemy (90–168 CE) involved both generalizations about the Earth and intimate descriptions of specific places and regions; these were two sides of the same coin. Bernhardus Varenius (1622–1650) conceptualized geography as consisting of general (scientific) and special (regional) knowledge, although he considered the latter to be subsidiary to the former (Warntz 1989; Goodchild et al. 1999). Alexander von Humboldt (1769–1859) and Carl Ritter (1779–1859), often regarded as the founders of modern geography, tried to derive general laws through careful measurement of geographic phenomena at particular locations and times. In more recent times, the historic balance between nomothetic and idiographic geographic knowledge has become more unstable. The early twentieth century witnessed the dominance of nomothetic geography in the guise of environmental determinism, followed by a backlash against its abuses and the subsequent rise of idiographic geography in the form of areal differentiation: Richard Hartshorne famously declared in The Nature of Geography that the only law in geography is that all areas are unique (Hartshorne 1939). The dominance of idiographic geography and the concurrent crisis in American academic geography (in particular, the closing of Harvard's geography program in 1948; Smith 1992) led to the Quantitative Revolution of the 1950s and 1960s, with geographers such as Fred Schaefer, William Bunge, Peter Haggett, and Edward Ullman asserting that geography should be a law-seeking science that answers the question "why?" rather than building a collection of facts describing what is happening in particular regions. Physical geographers have, perhaps wisely, disengaged themselves from these debates, but the tension between nomothetic and idiographic approaches persists in human geography (see Cresswell 2013; DeLyser and Sui 2013; Schuurman 2000; Sui 2004; Sui and DeLyser 2012).

However, attempts to reconcile nomothetic and idiographic knowledge did not die with Humboldt and Ritter. Approaches such as time geography seek to capture context and history and recognize the roles of both agency and structure in human behavior (Cresswell 2013). In spatial analysis, the trend towards local statistics, exemplified by Geographically Weighted Regression (Fotheringham et al. 2002) and Local Indicators of Spatial Association (Anselin 1995), represents a compromise in which the general principles of nomothetic geography are allowed to express themselves differently across geographic space. Goodchild (2004) has characterized GIS as combining the nomothetic, in its software and algorithms, with the idiographic in its databases.

In a sense, the paths to geographic knowledge engendered by data-intensive approaches such as time geography, disaggregate spatial statistics and GIScience are a return to the early foundation of geography where neither law-seeking nor description-seeking was privileged. Geographic generalizations and laws are possible but space matters: spatial dependency and spatial heterogeneity create local context that shapes physical and human processes as they evolve on the surface of the Earth. Geographers have believed this for a long time, but this belief is also supported by recent breakthroughs in complex systems theory, which suggests that patterns of local interactions lead to emergent behaviors that cannot be understood in isolation at either the local or global levels. Understanding the interactions among agents within an environment is the scientific glue that binds the local with the global (Flake 1998).

In short, data-driven geography is not necessarily a radical break with the geographic tradition: geography has a longstanding belief in the value of idiographic knowledge by itself as well as its role in constructing nomothetic knowledge. Although this belief has been tenuous and contested at times, data-driven geography
may provide the paths between idiographic and nomothetic knowledge that geographers have been seeking for two millennia. However, while complexity theory supports this belief, it also suggests that this knowledge may have inherent limitations: emergent behavior is by definition surprising.

Approaches to data-driven geography

If we accept the premise, at least until proven otherwise, that Big Data and data-driven science harmonize with longstanding themes and beliefs in geography, the question that follows is: how can data-driven approaches fit into geographic research? Data-driven approaches can support both geographic knowledge-discovery and spatial modeling. However, there are some challenges and cautions that must be recognized.

Data-driven geographic knowledge discovery

Geographic knowledge-discovery refers to the initial stage of the scientific process where the investigator forms his or her conceptual view of the system, develops hypotheses to be tested, and performs groundwork to support the knowledge-construction process. Geographic data facilitate this crucial phase of the scientific process by supporting activities such as study-site selection and reconnaissance, ethnography, experimental design, and logistics.

Perhaps the most transformative impact of data-driven science on geographic knowledge-discovery will be through data-exploration and hypothesis generation. Similar to a telescope or microscope, systems for capturing, storing, and processing massive amounts of data can allow investigators to augment their perceptions of reality and see things that would otherwise be hidden or too faint to perceive. From this perspective, data-driven science is not necessarily a radically new approach, but rather a way to enhance inference for the longstanding processes of exploration and hypothesis generation prior to knowledge-construction through analysis, modeling, and verification (Miller 2010).

Data-driven knowledge-discovery has a philosophical foundation: abductive reasoning, a form of inference articulated by astronomer and mathematician [...] starts with data describing something and ends with a hypothesis that explains the data. It is a weaker form of inference relative to deductive or inductive reasoning: deductive reasoning shows that X must be true, inductive reasoning shows that X is true, while abductive reasoning shows only that X may be true. Nevertheless, abductive reasoning is critically important in science, particularly in the initial discovery stage that precedes the use of deductive or inductive approaches to knowledge-construction (Miller 2010).

Abductive reasoning requires four capabilities: (1) the ability to posit new fragments of theory; (2) a massive set of knowledge to draw from, ranging from common sense to domain expertise; (3) a means of searching through this knowledge collection for connections between data patterns and possible explanations; and (4) complex problem-solving strategies such as analogy, approximation, and guesses. Humans have proven to be more successful than machines in performing these complex tasks, suggesting that data-driven knowledge-discovery should try to leverage these human capabilities through methods such as geovisualization rather than try to automate the discovery process. Gahegan (2009) envisions a human-centered process where geovisualization serves as the central framework for creating chains of inference among abductive, inductive, and deductive approaches in science, allowing more interactions and synergy among these approaches to geographic knowledge building.

One of the problems with Big Data is the size and complexity of the information space implied by a massive multivariate database. A good data-exploration system should generate all of the interesting patterns in a database, but only the interesting ones to avoid overwhelming the analyst. Two ways to manage the large number of potential patterns are background knowledge and interestingness measures. Background knowledge guides the search for patterns by representing accepted knowledge about the system to focus the search for novel patterns. In contrast, we can use interestingness measures a posteriori to filter spurious patterns by rating each pattern based on dimensions such as simplicity, certainty, utility, and novelty. Patterns with ratings below a user-specified threshold are discarded or ignored (Miller 2010). Both of these approaches require formalization of geographic
cian C. S. Peirce (1894–1914). Abductive reasoning knowledge, a challenge discussed earlier in this paper.
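
The a posteriori use of interestingness measures described above can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration rather than a method taken from the sources cited: the Pattern class, the scoring of each dimension on a 0 to 1 scale, the weighted-mean rating, and the example patterns are all hypothetical.

```python
from dataclasses import dataclass

# Rating dimensions named in the text; scoring them in [0, 1] is an assumption.
DIMENSIONS = ("simplicity", "certainty", "utility", "novelty")


@dataclass(frozen=True)
class Pattern:
    """A candidate pattern with one hypothetical score per dimension."""
    description: str
    simplicity: float
    certainty: float
    utility: float
    novelty: float


def rating(pattern, weights=None):
    """Weighted mean of the dimension scores (equal weights by default)."""
    weights = weights or {name: 1.0 for name in DIMENSIONS}
    total = sum(weights[name] for name in DIMENSIONS)
    return sum(weights[name] * getattr(pattern, name) for name in DIMENSIONS) / total


def filter_patterns(patterns, threshold, weights=None):
    """Keep patterns rated at or above the threshold, highest rating first."""
    kept = [p for p in patterns if rating(p, weights) >= threshold]
    return sorted(kept, key=lambda p: rating(p, weights), reverse=True)


# Made-up candidate patterns, for illustration only.
candidates = [
    Pattern("commute flows peak near transit hubs", 0.9, 0.8, 0.7, 0.2),
    Pattern("taxi pickups cluster after stadium events", 0.7, 0.9, 0.8, 0.8),
    Pattern("sensor ID parity correlates with rainfall", 0.5, 0.25, 0.0, 0.75),
]

for p in filter_patterns(candidates, threshold=0.6):
    print(f"{rating(p):.2f}  {p.description}")
```

The threshold and the weights are the user-specified parts of this sketch. Choosing them, and scoring the dimensions in the first place, is where the formalization of geographic knowledge would have to enter.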
principle: the best model is the one that explains the most with the least. This is sometimes referred to as “Occam’s Razor”: given two models with equal validity, the simpler model is better. Model interpretation is an informal but key test: the model builder must be able to explain what the model results say about reality. Models derived computationally from data and fine-tuned based on feedback from predictions can generate reliable predictions from processes that are too complex for the human brain (Townsend 2013; Weinberger 2011). For example, Openshaw’s automated system for breeding spatial interaction models has been known to generate very complex, non-intuitive models (Fotheringham 1998), many of which are also dimensionally inconsistent. Figure 3 illustrates some of the spatial interaction models generated by Openshaw’s automated system; as can be seen, they defy easy comprehension.

The knowledge from data-driven models can be complex and non-compressible: the data are the explanation. But if the explanation is not understandable, do we really have an explanation? Perhaps the nature of explanation is evolving. Perhaps computers are fundamental in data-driven science not only for discovering but also for representing complex patterns that are beyond human comprehension. Perhaps this is a temporary stopgap until we achieve convergence between human and machine intelligence as some predict (Kurzweil 1999). While we cannot hope to resolve this question (or its philosophical implications) within this paper, we can add a cautionary note from Nate Silver: telling stories about data instead of reality is dangerous and can lead to mistaking noise for signal (Silver 2012).

A final challenge in data-driven spatial modeling is de-skilling: a loss of modeling and analysis skills. While allocating mundane tasks to computers frees humans to perform sophisticated activities, there are times when mundane skills become crucial. For example, there are documented cases of airline pilots who, due to a lack of manual flying experience, reacted badly in emergencies when the autopilot shut off (Carr 2013). Although the consequences are rarely life-threatening, one could make a similar argument about automatic model building: if a data-driven modeling process generates anomalous results, will the analyst be able to determine if they are artifacts or genuine? With Openshaw’s automated spatial interaction modeling system, the analyst may become less skilled at spatial interaction modeling and more skilled at combinatorial optimization techniques. While these skills are valuable and may allow the analyst to reach greater scientific heights, they are another level removed from the empirical system being modeled. However, the more anomalous the results, the deeper the thinking required.

A solution to de-skilling is to force the skill: require it as part of education and certification, or design software that encourages or requires analysts to maintain some basic skills. However, this is a difficult case to make compared to the hypnotic call of sophisticated methods with user-friendly interfaces
(Carr 2013). Re-reading Jerry Dobson’s prescient essay on automated geography thirty years later (Dobson 1983), one is impressed by the number of activities in geography that used to be painstaking but are now push-button. Geographers of a certain age may recall courses in basic and production cartography without much nostalgia. What skills that we consider essential today will be considered the pen, ink, and lettering kits of tomorrow? What will we lose?

Conclusion

The context for geographic research has shifted from a data-scarce to a data-rich environment, in which the most fundamental changes are not the volume of data, but the variety and the velocity at which we can capture georeferenced data. A data-driven geography may be emerging in response to the wealth of georeferenced data flowing from sensors and people in the environment. Some of the issues raised by data-driven geography have in fact been longstanding issues in geographic research, namely, large data volumes, dealing with populations and messy data, and tensions between idiographic versus nomothetic knowledge. However, the belief that spatial context matters is a major theme in geographic thought and a major motivation behind approaches such as time geography, disaggregate spatial statistics, and GIScience. There is potential to use Big Data to inform both geographic knowledge-discovery and spatial modeling. However, there are challenges, such as how to formalize geographic knowledge to clean data and to ignore spurious patterns, and how to build data-driven models that are both true and understandable.

Cautionary notes need to be sounded about the impact of data-driven geography on broader society (see Mayer-Schonberger and Cukier 2013). We must be cognizant about where this research is occurring: in the open light of scholarly research where peer review and reproducibility are possible, or behind the closed doors of private-sector companies and government agencies, as proprietary products without peer review and without full reproducibility. Privacy is a vital concern, not only as a human right but also as a potential source of backlash that will shut down data-driven research. We must be careful to avoid pre-crimes and pre-punishments (Zedner 2010): categorizing and reacting to people and places based on potentials derived from correlations rather than actual behavior. Finally, we must avoid a data dictatorship: data-driven research should support, not replace, decision-making by intelligent and skeptical humans. Some of the other papers in this special issue explore these challenges in depth.

References

Anderson, C. (2008). The end of theory: The data deluge makes the scientific method obsolete. Wired, 16, 07.
Anselin, L. (1995). Local indicators of spatial association: LISA. Geographical Analysis, 27(2), 93–115.
Batty, M. (2012). Smart cities, big data. Environment and Planning B, 39(2), 191–193.
Butler, D. (2008). Web data predict flu. Nature, 456, 287–288.
Carr, N. (2013). The great forgetting. The Atlantic, pp. 77–81.
Cetin, N., Nagel, K., Raney, B., & Voellmy, A. (2002). Large-scale multi-agent transportation simulations. Computer Physics Communications, 147(1–2), 559–564.
Charlton, M. (2008). Geographical Analysis Machine (GAM). In K. Kemp (Ed.), Encyclopedia of Geographic Information Science (pp. 179–180). London: Sage.
Cresswell, T. (2013). Geographic thought: A critical introduction. New York: Wiley-Blackwell.
DeLyser, D., & Sui, D. (2013). Crossing the qualitative-quantitative divide II: Inventive approaches to big data, mobile methods, and rhythmanalysis. Progress in Human Geography, 37(2), 293–305.
Diplock, G. (1998). Building new spatial interaction models by using genetic programming and a supercomputer. Environment and Planning A, 30(10), 1893–1904.
Dobson, J. E. (1983). Automated geography. The Professional Geographer, 35, 135–143.
Dumbill, E. (2012). What is big data? An introduction to the big data landscape, http://strata.oreilly.com/2012/01/what-is-big-data.html. Last accessed 17 April 2014.
Flake, G. W. (1998). The computational beauty of nature: computer explorations of fractals, chaos, complex systems, and adaptation. Cambridge: MIT Press.
Fotheringham, A. S. (1998). Trends in quantitative methods II: Stressing the computational. Progress in Human Geography, 22(2), 283–292.
Fotheringham, A. S., Brunsdon, C., & Charlton, M. (2002). Geographically weighted regression: The analysis of spatially varying relationships. Chichester: Wiley.
Gahegan, M. (2000). On the application of inductive machine learning tools to geographical analysis. Geographical Analysis, 32(1), 113–139.
Gahegan, M. (2009). Visual exploration and explanation in geography: Analysis with light. In H. J. Miller & J. Han (Eds.), Geographic data mining and knowledge discovery (2nd ed., pp. 291–324). London: Taylor and Francis.
Gibbings, J. C. (2011). Dimensional analysis. New York: Springer.
Glaser, B. G., & Strauss, A. L. (1967). The discovery of grounded theory. Chicago: Aldine.
Goffman, E. (1959). The presentation of self in everyday life. New York: Anchor Books.
Goodchild, M. F. (2004). GIScience, geography, form, and process. Annals of the Association of American Geographers, 94(4), 709–714.
Goodchild, M. F. (2007). Citizens as sensors: The world of volunteered geography. GeoJournal, 69(4), 211–221.
Goodchild, M. F., Egenhofer, M. J., Kemp, K. K., Mark, D. M., & Sheppard, E. (1999). Introduction to the Varenius project. International Journal of Geographical Information Science, 13(8), 731–745.
Goodchild, M. F., & Li, L. (2012). Assuring the quality of volunteered geographic information. Spatial Statistics, 1, 110–120. doi:10.1016/j.spasta.2012.03.002.
Graham, M., & Shelton, T. (2013). Geography and the future of big data, big data and the future of geography. Dialogues in Human Geography, 3(3), 255–261.
Guptill, S. C., & Morrison, J. L. (Eds.). (1995). Elements of spatial data quality. Oxford: Elsevier.
Haklay, M. (2010). How good is volunteered geographical information? A comparative study of OpenStreetMap and Ordnance Survey datasets. Environment and Planning B: Planning and Design, 37(4), 682–703.
Hand, D. J. (1999). Discussion contribution on ‘data mining reconsidered: Encompassing and the general-to-specific approach to specification search’ by Hoover and Perez. Econometrics Journal, 2(2), 241–243.
Hartshorne, R. (1939). The nature of geography: A critical survey of current thought in the light of the past. Washington, DC: Association of American Geographers.
Hey, T., Tansley, S., & Tolle, K. (Eds.). (2009). The fourth paradigm: Data-intensive scientific discovery.
Hoover, K. D., & Perez, S. J. (1999). Data mining reconsidered: Encompassing and the general-to-specific approach to specification search. Econometrics Journal, 2(2), 167–191.
Kitchin, R. (2014). Big data and human geography: Opportunities, challenges and risks. Dialogues in Human Geography, 3(3), 262–267.
Kurzweil, R. (1999). The age of spiritual machines: when computers exceed human intelligence. New York: Vintage.
Mayer-Schonberger, V., & Cukier, K. (2013). Big Data: A revolution that will transform how we live, work, and think.
Merton, R. K. (1967). On sociological theories of the middle range. In R. K. Merton (Ed.), On theoretical sociology (pp. 39–72). New York: The Free Press.
Miller, H. J. (2007). Place-based versus people-based geographic information science. Geography Compass, 1(3), 503–535.
Miller, H. J. (2010). The data avalanche is here. Shouldn’t we be digging? Journal of Regional Science, 50(1), 181–201.
O’Leary, M. (2012). Eurovision statistics: post-semifinal update, Cold Hard Facts (May 23). Available: http://mewo2.com/nerdery/2012/05/23/eurovision-statistics-post-semifinal-update/. Accessed October 25, 2013.
Openshaw, S. (1988). Building an automated modeling system to explore a universe of spatial interaction models. Geographical Analysis, 20(1), 31–46.
Openshaw, S., Charlton, M., Wymer, C., & Craft, A. (1987). A Mark I geographical analysis machine for the automated analysis of point data sets. International Journal of Geographical Information Systems, 1(4), 335–358.
Openshaw, S., & Taylor, P. J. (1979). A million or so correlation coefficients: three experiments on the modifiable areal unit problem. In N. Wrigley (Ed.), Statistical methods in the social sciences (pp. 127–144). London: Pion.
Preis, T., Moat, H. S., & Stanley, H. E. (2013). Quantifying trading behavior in financial markets using Google Trends. Scientific Reports, 3 (1684). doi:10.1038/srep01684.
Raymond, E. S. (2001). The cathedral and the bazaar: Musings on linux and open source by an accidental revolutionary. Sebastopol: O’Reilly Media.
Schuurman, N. (2000). Trouble in the heartland: GIS and its critics in the 1990s. Progress in Human Geography, 24(4), 569–589.
Silver, N. (2012). The signal and the noise: Why most predictions fail – but some don’t.
Smith, N. (1992). History and philosophy of geography: Real wars, theory wars. Progress in Human Geography, 16(2), 257–271.
Sui, D. (2004). GIS, cartography, and the “Third Culture”: Geographic imaginations in the computer age. Professional Geographer, 56(1), 62–72.
Sui, D., & DeLyser, D. (2012). Crossing the qualitative-quantitative chasm I: Hybrid geographies, the spatial turn, and volunteered geographic information (VGI). Progress in Human Geography, 36(1), 111–124.
Sui, D., & Goodchild, M. F. (2011). The convergence of GIS and social media: Challenges for GIScience. International Journal of Geographical Information Science, 25(11), 1737–1748.
Sui, D., Goodchild, M. F., & Elwood, S. (2013). Volunteered geographic information, the exaflood, and the growing digital divide. In D. Sui, S. Elwood, & M. F. Goodchild (Eds.), Crowdsourcing geographic knowledge (pp. 1–12). New York: Springer.
Taleb, N. N. (2007). The black swan: The impact of the highly improbable. New York: Random House.
The Economist. (19 October 2013). Trouble at the lab, pp. 26–30.
Townsend, A. (2013). Smart cities: Big data, civic hackers, and the quest for a new utopia. New York: Norton.
Tsou, M. H., Yang, J. A., Lusher, D., Han, S., Spitzberg, B., Gawron, J. M., et al. (2013). Mapping social activities and concepts with social media (Twitter) and web search engines (Yahoo and Bing): a case study in 2012 US Presidential Election. Cartography and Geographic Information Science, 40(4), 337–348.
Waldrop, M. M. (1990). Learning to drink from a fire hose. Science, 248(4956), 674–675.
Warntz, W. (1989). Newton, the Newtonians, and the Geographia Generalis Varenii. Annals of the Association of American Geographers, 79(2), 165–191.
Watts, D. J. (2011). Everything is Obvious – Once You Know the Answer. United States of America: Crown Business.
Weinberger, D. (2011). The machine that would predict the future, Scientific American, November 15, 2011. http://www.scientificamerican.com/article.cfm?id=the-machine-that-would-predict.
Zedner, L. (2010). Pre-crime and pre-punishment: a health warning. Criminal Justice Matters, 81(1), 24–25.