Six Degrees of “Who Cares?”[1]

Rick Grannis
University of California, Los Angeles

The concept that we live in small world networks connected by short paths has proved fascinating. These networks, however, do not typically emerge as linear responses to individual-level changes; rather, subtle changes in relations produce extraordinarily different macrolevel outcomes. Similarly, nuances in how we conceptualize, define, and measure relations can lead to widely different network characterizations. The author demonstrates this variability using a spectrum of interaction types and argues that the dependence of results on subtleties in definition or measurement makes theoretical interpretation difficult. He offers an index to calculate how much inaccuracy or imprecision relational definitions or data-gathering techniques can tolerate before results yield utterly different interpretations.

It’s a small world, or so we are often told. Both the popular imagination and some academic theories have been fascinated by the social network concept that we are all connected to each other by relatively short paths.[2] Six degrees of separation has been an intriguing concept since it was first explored by Stanley Milgram four decades ago, popularized in a play by John Guare and a movie starring Will Smith, and, more recently, repopularized by Duncan Watts and Steven Strogatz (not to mention the eponymous television show, the Kevin Bacon game, etc.). The Internet has only fueled these ideas, as individuals regularly compare how popular, powerful, and interesting they are by assessing how many hundreds of thousands of people they have in their online social or business network (Ulanoff 2005). These networks of short paths intrigue us because they appear to make present the impact of otherwise distant relational contacts.

[1] I would like to thank the AJS reviewers for thoughtful comments. Direct correspondence to Rick Grannis, Sociology Department, University of California, Los Angeles, 264 Haines Hall, Box 951551, Los Angeles, California 90095-1551. E-mail: grannis@soc.ucla.edu

[2] While the term “path” has some intuitive clarity, the formal graph-theoretic definition of a path is a sequence of nodes and edges, beginning and ending with nodes, with no repeated nodes in the sequence.

© 2010 by The University of Chicago. All rights reserved. 0002-9602/2010/11504-0001$10.00
American Journal of Sociology (AJS), Volume 115, Number 4 (January 2010): 991–1017

Networks, however, are built relation by relation, and how one defines the underlying relation determines whether the network that we discover connects almost everyone via short paths or leaves virtually everyone essentially isolated. This is not surprising. What is surprising is how very subtle, both conceptually and empirically, the relational distinctions creating these divergent outcomes often are. While we can accurately define and precisely measure many of the social relations we wish to model, even slight inconsistencies in the data we elicit, often apparently trivial at the individual level, can lead to extraordinarily different network characterizations. These subtle inconsistencies that yield dramatically different system-level results and that profoundly affect theoretical interpretations are the focus of this article.

THE PROMISE OF LARGE-SCALE NETWORKS

One of the fundamental issues in social science concerns how the interactions of individuals translate into the characteristics of the social systems they compose (Schelling 1978). Social networks have a potentially powerful role in this process.
In recent years, models proposed by mathematicians and statistical physicists have modeled large-scale networks in many ways other than random networks (Erdos and Rényi 1960).[3] These complex networks have included small world models (Watts and Strogatz 1998) with significantly higher clustering than a random network (with the same number of nodes and edges)[4] but maintaining relatively short characteristic path length (similar to a random network)[5] as well as centralized scale-free models (Barabasi and Albert 1999)[6] within which most nodes also tend to be connected by unusually short path lengths. While invoking different models, based on how they perceive local-level interactions (e.g., based on transitivity, preferential attachment, etc.),[7] these complex network models are all founded on the concept of a giant component, a single component containing the majority of nodes in a network.

[3] What are often labeled “random networks” generally refer to two related ensembles of graphs, one with a given number of nodes and a given number of edges that are distributed randomly among the nodes and the other with a given number of nodes and a given probability of edges connecting them.

[4] Clustering (Newman, Strogatz, and Watts 2001; Maslov, Sneppen, and Zaliznyak 2004) or transitivity (Rapoport 1953, 1957, 1963; Davis 1967, 1970; Holland and Leinhardt 1972) both refer to the increased likelihood for two social actors with a mutual neighbor to also be connected to each other.

[5] A component is a maximal subset of nodes, together with the edges connecting members of the subset, in which all nodes are reachable from every other node using paths formed by nodes that are also members of the component. Two nodes are reachable from each other if there exists a path (of any length) between them. Maximal means that it is the largest possible subset (i.e., you could not find another node anywhere in the population under study such that it could be added to the subset and all the nodes in the subset would still be reachable). If a single component contains the majority of all network members, it is a giant component. The length of a path is the number of edges in it. A geodesic is the shortest path between any two nodes. The characteristic path length is the mean of the geodesic length among all pairs of nodes in the giant component, if it exists. Smaller, distinct components are, of course, unreachable, and, if a giant component does not exist, then the characteristic path length is essentially meaningless. By “relatively short” I specifically mean that the characteristic path length of a component varies more directly with the logarithm of the number of its constituent nodes (similar to a random network) rather than with the direct number of nodes.

[6] As defined by Barabasi and Albert (1999), scale-free networks have degree distributions that asymptotically approximate a power law: $P(k) \sim k^{-\gamma}$, where $P(k)$ is the fraction of nodes in the network having $k$ connections, and $\gamma$ is a constant, typically, although not exclusively, between 2 and 3.

[7] While preferential attachment could refer to any selective process, in network terms it most often implies disproportionately selecting alters who already have many relations.

THE DISPROPORTIONAL RESPONSE OF LARGE-SCALE NETWORKS TO LOCAL-LEVEL CHANGES

Watts and Strogatz’s seminal work (Watts and Strogatz 1998; Watts 1999a, 1999b) established an elegant model of how a network could have a relatively short path length despite relatively substantial clustering. They limited their model, however, to networks that were already connected in a single giant component. In his simulations, Watts (1999a, p. 499) introduced an artificial substrate to force the graphs to be connected and, in his real world examples, Watts intentionally focused on networks with a substantial average degree ($k \gg 1$) in order to guarantee their connectivity.

In this article, I examine the underlying presupposition of such models, that the network is connected within a giant component at all at any number of steps. Giant components, single components that connect the majority of network members, do not emerge as a linear response to individual-level changes; rather, subtle changes in relations potentially produce extraordinarily different macrolevel outcomes. As the average number of relations among individuals increases, the size of the components that they form does not grow relatively smoothly from small to large.[8] Instead, in virtually all networks, a sharp threshold point exists combinatorially such that only a relatively small increase in the proportion of relations transitions the network from the situation in which virtually no network member is connected via a path to any other network member to the situation in which most network members are connected via a path to most other network members (Erdos and Rényi 1960; Janson, Luczak, and Rucinski 2002). This has been referred to as the “critical point” or “the double-jump threshold” (Molloy and Reed 1995, 1998). I refer to this simply as the phase transition.[9]

The concept of a phase transition originated in thermodynamics, where it refers to an abrupt change in physical properties as a result of a relatively small change in temperature. A familiar example would be water that entirely transforms from a crystalline solid to a liquid over a relatively small temperature threshold. As the temperature rises, ice maintains its crystalline structure until, during a very short interval of degrees, it completely transforms into liquid water.
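
The threshold described above can be illustrated with a short simulation. The following Python sketch is only an illustration and is not the procedure used to produce any figure in this article. It assumes a random network with a fixed number of nodes and a fixed number of edges placed uniformly at random, and it tracks components with a union-find structure. The function name `largest_component_fraction` and the chosen average degrees are hypothetical.

```python
import random


def largest_component_fraction(n, avg_degree, rng):
    """Fraction of nodes in the largest component of a random network.

    Assumption: the network has n nodes and round(n * avg_degree / 2)
    edges, each joining two distinct nodes chosen uniformly at random.
    Repeated edges are allowed; they do not affect connectivity.
    """
    parent = list(range(n))   # union-find forest
    size = [1] * n            # component size stored at each root

    def find(x):
        # Follow parent links to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for _ in range(round(n * avg_degree / 2)):
        a = rng.randrange(n)
        b = rng.randrange(n)
        while b == a:          # redraw to avoid self-loops
            b = rng.randrange(n)
        ra, rb = find(a), find(b)
        if ra != rb:           # merge the smaller component into the larger
            if size[ra] < size[rb]:
                ra, rb = rb, ra
            parent[rb] = ra
            size[ra] += size[rb]
    return max(size[find(v)] for v in range(n)) / n


rng = random.Random(0)
for k in (0.5, 1.0, 1.5, 2.0, 4.0):
    runs = [largest_component_fraction(1000, k, rng) for _ in range(20)]
    print(f"average degree {k:.1f}: mean largest-component share "
          f"{sum(runs) / len(runs):.2f}")
```

Under these assumptions, the share of nodes in the largest component should stay small while the average degree is below one and rise steeply once it passes one, which is the qualitative pattern the text describes.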
If one imagines a sparse network with nodes existing only in small, disconnected components, as the density of relations increases, there would be no large-scale effects until a threshold point was achieved, when the addition of a relatively few relations transforms the population from many small, disconnected, insular communities into a network composed primarily of one dominant, comprehensive community whose constituent members are mutually reachable via paths. Figure 1 illustrates this. It represents 10,000 randomly generated networks,[10] each with 1,000 nodes and each with an average degree varying between zero and six. Figure 1 plots the proportion of nodes in each network that is part of the largest component as a function of the average degree of the network. The three lines drawn on the graph represent three distinct theoretical slopes clearly evidenced in this plot: before the phase transition, during the phase transition, and after the phase transition. Clearly, as the average degree rises, so does the size of the largest component in the network. While the increase is rather gradual as the average degree rises to one, the slope changes spectacularly, increasing almost 20-fold, as the average degree goes to two and then, just as dramatically, returns to a slope similar to its previous one. The size of the largest component, which is gradually increasing along most of the continuum of increasing degree, suddenly “jumps” to a new threshold, one that it would not have achieved until the network was 20 times as dense as it currently is, if the phase transition had not occurred.

Fig. 1. The phase transition simulated by 10,000 randomly generated networks, each with 1,000 nodes. Lines represent theoretical slopes: before, during, and after the phase transition.

[8] In graph theoretical terminology, “size” specifically refers to the number of edges, or relations, in a graph while “order” is used to denote the number of nodes, or individuals. I use the term “size,” however, because I suspect the language will be more intuitive to most readers. I beg the tolerance and understanding of graph theorists.

[9] While the phase transition concept originated in the study of random graphs, it has been shown to occur in small world graphs and scale-free graphs, and the configuration model (Newman et al. 2001) suggests that it occurs in all types of graphs.

[10] While the networks generated for figure 1 were all random, using clustered (i.e., small world) or preferentially attached (i.e., scale-free) networks changes the horizontal positioning and slope somewhat; the overall story, however, remains essentially the same.

THE DISPROPORTIONAL RESPONSE OF LARGE-SCALE NETWORK MODELS TO DEFINITIONS AND MEASUREMENTS OF LOCAL-LEVEL CHANGES

Because the phase transition is so sensitive to local-level phenomena, our models of the phase transition are sensitive to the data we elicit about those local-level phenomena. Whether social researchers discover numerous small, isolated social networks or a pervasive giant one depends heavily on definitional nuances and measurement subtleties.

One way of avoiding these issues would be to follow Watts’s example and limit our focus to relations that are ubiquitous enough to guarantee connectivity (while, as Watts [1999a, 1999b] showed, raw density is certainly not the only determinant of connectivity [as I shall discuss below], it does substantially influence all other parameters). Sometimes, there is minimal risk to theoretical interpretation from merely defining and measuring relations in such a way that we guarantee a sufficient density for a giant component to form.
For example, one of the beauties of Granovetter’s (1973) “weak ties” was that he discovered a highly efficacious effect (finding a job) that could be carried out by an extended path consisting of relatively superficial acquaintances because relatively little was demanded of any intermediary to pass along the information. Each intermediary could benefit from providing a major charitable function (finding someone a job) with little risk and very little investment of effort.

Some sociologically important effects, however, require stronger ties. Friedkin (1983) showed that two academics who discussed their work with the same other third academic were aware of each other’s current research (a two-step path) but generally no further; Friedkin (1983) noted that since observability of both behavior and the responses to behavior is necessary for influence to occur, if we are not even aware of what is happening three steps away, we certainly do not have any influence on it. For academics to influence each other across even a two-step path would appear to require relations not only more potent than those of mere acquaintances but at least as potent (if not more so) than discussing current research with each other.

In still other scenarios, weak ties may seem the appropriate choice, but the hypothesized paths are composed of ties so utterly weak that not only will they not be useful for any sociologically important process but it is also unlikely that they offer any utility whatsoever. In the social network analyzed by Watts (1999a, 1999b),[11] which involved collaborations among actors in feature films, the average actor had a relatively high degree, appearing in a film with 61 distinct other actors at some point in their career.

[11] He also analyzed two nonsocial networks, a power grid and a worm’s neural network.

While what primarily interested Watts was the mathematically intriguing finding that the average path length (3.65 steps) was nearly as short as would be expected if actors were connected in a random graph (despite the evidence of substantial clustering), from a sociological perspective, this path length was quite long precisely because of the relatively high degree required to create these paths. When one considers a path length of 3.65 steps formed by actors with an average degree of 61 what this means is that, for the average actor, one of the 61 or so other actors they costarred in a movie with at some point in their career had as one of the 61 or so other actors they costarred in a movie with at some point in their career someone who had as one of the 61 or so other actors they costarred in a movie with at some point in their career someone who had as one of the 61 or so other actors they costarred in a movie with at some point in their career any other typical actor. What sociological significance would there be for the typical actor to know that once in their life they were in a movie with someone who once in their life was in a movie with someone who once in their life was in a movie with someone who once in their life was in a movie with Tom Cruise (or any other actor) and that this is the closest they are? Furthermore, given that these relations are not contemporaneous but rather span the entirety of these actors’ careers, they could not spread even the most contagious of diseases across this network, much less anything else. As a final example, in Milgram’s (1967) original small world experiment, which first heralded “six degrees of separation,” 71% of the chains did not make it to their targets at all. Milgram (1967) dismissed this in terms of noncooperation, but this is exactly the point. 
If people in the 1960s were not willing to invest the effort to send along “a passport of thick royal blue cardboard with the name ‘Harvard University’ embossed in gold letters on the cover and a stylish gold logo [and a roster of signatures with] each person’s name written with a fountain pen in different colors of ink” (Kleinfeld 2002, p. 63) then it calls into question what efficacy, if any, could be derived from these hypothesized, noncooperating paths.

In sum, while some ties can have a pronounced long-range effect, not all ties, weak or strong, have an effect across even a small world. Not all relations that are ubiquitous enough to guarantee connectivity may be sociologically meaningful; not all sociologically meaningful relations may generate a giant component; and the difference between these two scenarios may be far from obvious. This fact makes it imperative that the relations hypothesized to form the underlying network be defined and measured with great care. While sociologists are quite adept at this, they may not realize the extreme sensitivity of large-scale networks to relational definitions and measurement.

Just as subtle changes in local-level relations have the potential to produce extraordinarily different macrolevel outcomes, so subtle distinctions in how we define and measure relations, often apparently trivial at the individual level, have the potential to lead to extraordinarily different network characterizations if the network being studied is near the point of the phase transition. These measurement subtleties, which could lead to dramatically different global interpretations, include not only sampling variation or data collection errors but also very ordinary changes in data collection techniques (e.g., slightly different questions asked in slightly different ways or under slightly different conditions, etc.),[12] which could elicit different lists of alters.
While many social relations we wish to model, such as economic exchange, can be measured with precision, other relations have more inherent ambiguity. It is often difficult to appropriately balance emotional closeness, frequency of contact, and the duration of a relation, to name but a few concerns. Similarly, while actual economic exchange is clear, any related meaning may be less so. Burt’s (1997) review of nine name generators asking about similar kinds of relations found little overlap in the contacts elicited: “redundant questions elicit non-redundant names” (p. 359).[13]

[12] Even the best of data collections can have error. Tom Smith at the National Opinion Research Center reported in a draft memo, “2004 Social-Network Module,” on September 22, 2008, that he had “thoroughly examined the GSS procedures and discovered 41 cases that had been coded as ‘zero confidants’ but should have been coded as missing data” (Fischer 2008, p. 1). It might seem reasonable to assume that this relatively small amount of error might have a relatively small effect on results generated using it or that whatever effect it did have on our analytic “outputs” would be in the same order of magnitude as its effect on the data “inputs.” In many cases, this would be true; these minimal data errors would have minimal effects. In some cases, however, the effects would be overwhelming. Distinguishing scenarios such as these is a primary focus of this article.

[13] Other articles, in contrast, find that major differences in name-generator wording may in some situations have little or no effect (Straits 2000). It is not my intent to enter the debate about the robustness of name generators but rather to highlight that when similar name generators produce even slight amounts of nonredundancy the potential effect on theoretical interpretations can be profound.

ARTICLE OVERVIEW

In this article, I review three quite different data sets, one involving organizations and two involving individuals, covering a spectrum of types of relations, to demonstrate that subtleties and nuances in how relations are conceptualized, defined, measured, or categorized can have profound effects on theoretical interpretations emerging from network analyses based on these relations. Finally, I offer an index to calculate how much imprecision or inaccuracy can be tolerated before results yield utterly different interpretations.

SOCIOLOGY GRADUATE PROGRAM HIRING NETWORKS

I begin by examining a data set that has seen much study recently (Han 2003; Burris 2004), the exchange of Ph.D.’s among sociology graduate programs. Fundamentally, we interest ourselves in this network because it represents the flow of social goods and resources, norms, and values amidst sociology graduate programs.[14] Burris (2004, p. 244) argued that much of the sense of the unity of U.S. sociology comes from this exchange of Ph.D. students, the mechanism by which they “reaffirm the boundaries of the group” and express “mutual affirmation.” If these hiring relations united most programs into a single network with short paths, the flow of norms, values, and influence might generate a single, unified social system.

To create this data set I consulted the American Sociological Association’s guide to graduate programs of sociology (2003) for information on the institution in which each full-time faculty member received his or her Ph.D. Seventeen hundred twenty-nine (1,729) professors were listed at the 95 U.S. graduate programs ranked by the National Research Council. If one represents graduate programs as nodes with a directed arc from graduate program A to graduate program B if graduate program B hired as a professor an individual who received their Ph.D.
from graduate program A, a giant strong component is evident containing 84 graduate programs.[15] These graduate programs within the giant strong component are connected by a mean path length of 2.477 steps. Thus, the typical graduate program hired someone from a graduate program who hired someone from any other typical graduate program (with an additional “hired someone from a graduate program” about half the time).

[14] Certainly, relations among graduate programs are highly multiplex, and Ph.D. exchange is only one relation among the many ways that they influence each other, others including (but not limited to) coauthorship, comembership in research organizations, co-attendance at conferences, and friendships.

[15] Because these relations are directed both in theory (i.e., program A hires from program B) and in empirical reality (only 307 of the 1,729 relations [18%] were reciprocated), we use strong components. For a directed graph, a pair of nodes, i and j, is strongly connected if there is a path from i to j and a path from j to i, although the path from i to j may contain different nodes and directed arcs than the path from j to i. A strong component consists of a maximal set of strongly connected nodes in which the nodes that form the paths are also members of the component.

Moody, McFarland, and Bender-deMoll (2005), however, effectively argued that much social network research has a “structural bias” artificially creating structure by “aggregating dead past events” (p. 1208). In this case, the connections that form the Ph.D. exchange network are typically 20 years old (a mean age of 19.1 years and a median of 20 years), and some date back over half a century. Thus, while most graduate programs are apparently connected with a path length of about two and a half steps, a two-step path from graduate programs A to B to C means that, on average, 20 years ago graduate program A sent one of its Ph.D.
students to graduate program B and, on average, 20 years ago graduate program B sent one of its Ph.D. students to graduate program C. Thus, the path from A to C may take an average of 40 years. It seems unlikely that this implies any current flow of ideas, symbols, values, or influence between graduate programs A and C through graduate program B. Furthermore, the essentially as common, three-step path from graduate programs A to B to C to D may take an average of 60 years to complete. Many relations (including many professional contacts) clearly lose efficacy as they become less current. For example, Uzzi and Spiro’s (2005) study of the networks of Broadway musical artists noted that artists who had not worked with others for 5, 7, or 10 years had to “break back into the business” (p. 466). They were no longer connected. How long does hiring a Ph.D. from another graduate program maintain an efficacious connection—five years, two decades, half a century? When no “natural” cutoff exists in continuous relational data, we often do not impose one. In this context, however, failing to impose a cutoff assumes implicitly that either all relations, even the oldest (over a half a century old), are equally meaningful or (more likely) that we believe that a time-scaled version of our current data would produce a scaled-down, perhaps just less dense, but essentially similar version of our results. I explore the second assumption. If the size of the largest component was a linear function of the age of the Ph.D. hiring relation, we might expect that if we excluded the older half of all links (i.e., those formed by people who received their Ph.D.’s more than 20 years ago), then path lengths would be a little longer or a few more graduate programs might be disconnected. In fact, that is what we find. Instead of containing 84 graduate programs, the giant strong component contains only 70 programs, and path lengths have increased from 2.477 steps to 3.163 steps. 
This trend continues a little more as we restrict our data to even more recent Ph.D.’s. If we include only those who received their Ph.D.’s within the past 16 years, or 43% of all currently working faculty members, path lengths are longer still (3.344 steps), more graduate programs have become disconnected, and the giant strong component contains only 60 instead of 70 graduate programs.

It would seem that the trend would continue back to the present, but something distinctive happens if we restrict our data set to Ph.D.’s exchanged less than 16 years ago. If we eliminate those hired either 15 or 16 years ago, we have now reduced the number of faculty included by 4% of the total, but we have cut the size of the largest strong component in half. This is the interval of the phase transition. Only 33 graduate programs remain in the giant strong component.[16] Furthermore, instead of getting longer (as was the previous pattern), average path lengths within the largest strong component have dropped radically (to 2.483 steps) primarily because so many previously distantly connected graduate programs are now simply unreachable. Figure 2 plots the average shortest path length among graduate programs in the giant strong component by the number of years of data included. Note that after its initial rise upon formation, path length falls slowly as years of data are added until the phase transition occurs and it spikes radically only to fall slowly once again.

Fig. 2. Ph.D. exchange network

[16] The 33 graduate programs are Arizona, Brown, Berkeley, UCLA, UCSD, Chicago, Columbia, Cornell, Duke, Florida State, Harvard, Indiana Bloomington, Iowa, Johns Hopkins, Maryland College Park, Michigan, Minnesota, New York, North Carolina Chapel Hill, North Carolina State, Northwestern, Notre Dame, Ohio State, Pennsylvania, Pennsylvania State, Princeton, Stanford, SUNY Albany, SUNY Binghamton, SUNY Stony Brook, Texas Austin, Washington, and Wisconsin Madison.

As we excluded the oldest data and progressively moved our cutoff criteria to a more recent point, year by year, path lengths became slightly longer and a few graduate programs became disconnected in a linear fashion until we moved our chronological cutoff point from 16 to 14 years ago, at which point not only did the size of the largest strong component instantly halve but, instead of lengthening, average path lengths became a full step shorter. Finally, the remaining 33 graduate programs overlap 79% with the top 33 ranked graduate programs and include all of the top 18. Thus, the difference between including the last 14 years of data and the last 16 seems subtle both conceptually and empirically, at least when one considers that it only involves 4% of faculty hires, but profound when one considers that it determines whether we “discover” a distantly connected, but somewhat comprehensive, world of sociology or an intensely connected core of highly ranked graduate programs.

In this case, the phase transition is somewhat mirrored in network centralization. Clearly, component size bounds how centralized the network could be since a network cannot be centralized unless it is connected. Figure 3 displays the betweenness centralization of the Ph.D. exchange network as a function of the age of relations included. Betweenness centralization clearly begins to rise at the phase transition point, when we include more than 14 years of data.

Fig. 3. Ph.D. exchange network
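
The cutoff analysis described in this section can be sketched in a few lines of Python. This is an illustrative sketch only, not the code used for the analysis reported here. The toy `hires` list, the program labels A through F, and the function names are all hypothetical. Each hire is a directed arc from the Ph.D.-granting program to the hiring program, tagged with the age of the hire in years. For each age cutoff, the sketch finds the largest strong component and the mean geodesic length among its members.

```python
from collections import deque


def reachable(adj, start):
    """Set of nodes reachable from start along directed arcs in adj."""
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen


def largest_strong_component(arcs):
    """Largest set of nodes in which every node can reach every other."""
    fwd, rev, nodes = {}, {}, set()
    for a, b in arcs:
        fwd.setdefault(a, set()).add(b)
        rev.setdefault(b, set()).add(a)
        nodes.update((a, b))
    best, assigned = set(), set()
    for v in sorted(nodes):
        if v in assigned:
            continue
        # Nodes that v can reach and that can also reach v.
        comp = reachable(fwd, v) & reachable(rev, v)
        assigned |= comp
        if len(comp) > len(best):
            best = comp
    return best


def mean_geodesic(arcs, comp):
    """Mean shortest directed path length over ordered pairs in comp."""
    adj = {}
    for a, b in arcs:
        if a in comp and b in comp:
            adj.setdefault(a, set()).add(b)
    total = pairs = 0
    for s in comp:
        dist = {s: 0}
        queue = deque([s])
        while queue:  # breadth-first search gives geodesic lengths
            u = queue.popleft()
            for v in adj.get(u, ()):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else float("nan")


def component_by_cutoff(hires, max_age):
    """Size and mean geodesic of the largest strong component, using
    only hires that are at most max_age years old."""
    arcs = [(a, b) for a, b, age in hires if age <= max_age]
    comp = largest_strong_component(arcs)
    return len(comp), mean_geodesic(arcs, comp)


# Hypothetical data: (Ph.D. program, hiring program, years since hire).
hires = [
    ("A", "B", 3), ("B", "C", 5), ("C", "A", 8),
    ("C", "D", 15), ("D", "E", 12), ("E", "A", 18),
    ("E", "F", 25), ("F", "D", 30),
]

for cutoff in (10, 20, 40):
    size, length = component_by_cutoff(hires, cutoff)
    print(f"hires within {cutoff} years: component size {size}, "
          f"mean geodesic {length:.2f}")
```

The strong components are found by intersecting forward and backward reachability, which is simple but slow. For a network much larger than 95 programs, a linear-time algorithm such as Tarjan's would be a more reasonable choice.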
These dramatic, systemwide changes, however, are not readily visible at the individual level. Figure 4 shows that average degree, the average number of other graduate programs a graduate program hires from, grows in a nearly perfect linear fashion for almost four decades with only trivial fluctuations.

Fig. 4. Ph.D. exchange network

E-MAIL NETWORKS AMONG EMERGENCY RESPONDERS

As a second example, I explore task-oriented e-mail communications among 150 potential first responders, from the federal, state, and local levels, who were joint participants in a 45-hour training exercise simulating a natural disaster and ensuing public disorder.[17] E-mail communication among participants was logged and, fortunately, is likely to be a near-perfect record of their actual interaction since the participants were not allowed to use cellular phones or other remote forms of contact and most were never in direct contact with each other during the exercise (and none were for very long).

I model these e-mail messages as a network with participants representing nodes and an arc from node A to node B indicating that an e-mail was sent from participant A to participant B. If short paths of e-mails connected most participants, then information, insights, and ideas gathered by individuals would be readily available to most, and a consensus understanding of the crisis and how best to respond could be expected to emerge. Emergencies both before and after this exercise have confirmed the importance of efficient communication for a timely response by authorities.

The 150 participants sent 868 e-mail messages to each other, which were categorized by those managing the exercise as occurring within one of 270 distinct 10-minute time periods.[18] Given that the entirety of the exercise occurred over two days, it would be easy to evaluate such a network as a timeless entity and, in fact, all of the in-house reports generated for the study did just that, noting that absolutely everyone was connected into a single, giant strong component.

[17] The exercise, conducted in August 2003, went from 3 a.m. on the first day to 11:59 p.m. on the second day.

In this network, however, as in the hiring network, time is a crucial factor, although on a much smaller scale. Given that, if one includes data from all time periods, everyone is connected into a single, giant strong component, one might expect that the size of the largest strong component would decrease somewhat linearly with time so that, for instance, when the exercise was one-third complete, one-third of all individuals might be included, or when the exercise was two-thirds complete, two-thirds of all individuals might be included, or at least that the size of the largest strong component would relate to some linear function of time.

In fact, the largest strong component grows by an average of one member every two hours until 21 hours into the exercise, at which point it includes only 11 of the 150 participants. At this rate, it would take almost 12 days to include everyone. During the next 20 minutes, however, the size of the largest strong component explodes, nearly quintupling in size, and within five hours, it has included more than 10 times the number of members it took the first 21 hours to achieve. Figure 5 displays this.

In this case, not only is the temporal nature of the data relevant but how one categorizes the time element is crucial as well. For example, if one chose to examine only the first day of this exercise, one could choose to sample either the first calendar day or the first 24-hour period. Since the exercise began at 3 a.m., these would appear similar.
Furthermore, if one looked at the density of ties present, one would notice that the 24-hour period, though three hours longer, included only 20% more e-mails than did the first calendar day. It would not be unusual to read an article reporting that the authors defined a “day” as a calendar day with a footnote stating that they had also examined the 24-hour period and found the density of ties to be not substantially different. Including these additional three hours of data, however, causes the phase transition to occur. This 20% increase in e-mails represents the difference between a situation in which a giant strong component unites most participants by short paths of information flow and one in which the largest strong component includes only 11 of the 150 participants and most individuals are members of even more trivially sized strong components.

18 While an average of only three e-mails a day sent per participant may not seem like a lot, the e-mails were not trivial; most were hundreds of words in length and tended to resemble minireports and action plans.

Fig. 5.—Emergency responders’ e-mail network. Hours of data included begin with the start of the exercise.

How quickly the network grows from its initial starting point is only relevant, however, when considering the spread of information, insights, and ideas that individuals had immediately available to them when the exercise began. Since one can only communicate ideas or understanding one has already apprehended, any information or insight gained by participants either directly or through communication with others could only be transmitted forward in time. A path, a set of contiguous interactions, from person i to person j through person k cannot transmit anything unless the interaction between persons i and k happens either before or concurrently with the interaction between persons k and j (Moody 2002).
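The ordering constraint can be made concrete with a short sketch that follows information forward in time. It is an illustration under stated assumptions, not code from the study: the `(period, sender, receiver)` layout, the function name `informed_by`, and the example messages are hypothetical, and messages that share a period are allowed to chain, reflecting the "before or concurrently" condition.

```python
from itertools import groupby


def informed_by(messages, source, start_period=0):
    """Participants who could hold information that `source` has at
    `start_period`, if it travels only along messages sent in that period
    or later and only from senders who already hold it."""
    informed = {source}
    usable = sorted(
        (m for m in messages if m[0] >= start_period), key=lambda m: m[0]
    )
    for _, group in groupby(usable, key=lambda m: m[0]):
        batch = [(sender, receiver) for _, sender, receiver in group]
        changed = True
        while changed:  # messages in the same period may chain together
            changed = False
            for sender, receiver in batch:
                if sender in informed and receiver not in informed:
                    informed.add(receiver)
                    changed = True
    return informed


# Hypothetical messages: k writes to j BEFORE i writes to k.
MESSAGES = [(1, "k", "j"), (2, "i", "k"), (3, "k", "m")]
print(sorted(informed_by(MESSAGES, "i")))  # ['i', 'k', 'm']
```

Treated as a timeless network, these three messages put j within two steps of i; respecting time order, nothing that i knows can reach j, because the message from k to j was sent before k heard from i.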
Thus, another relevant network for the transmission of informational resources is the network that is available after the information has been acquired, the reverse of the scenario presented in figure 5. I model this in figure 6. The x-axis plots negative numbers counting hours before the conclusion of the exercise, while the y-axis plots the size of the largest strong component formed by e-mails occurring within that time frame. This figure is reminiscent of figure 5. As one travels back in time from the conclusion of the exercise, adding each previous period’s e-mail messages to the network, the size of the largest strong component increases gradually, adding approximately one member for every 50 minutes of time in the past. When one has included all e-mail messages sent within the last 24 hours (actually 20 minutes short of this) before the conclusion of the exercise, the largest strong component contains only 29 members. Including just one more 10-minute time period, however, nearly doubles the size of the largest strong component to 55 members. The size of the largest strong component then continues to rise steeply, adding approximately two members for every 10 minutes, until, 33.5 hours before the conclusion of the exercise, everyone is a member of the giant strong component.

Fig. 6.—Emergency responders’ e-mail network. Hours of data included begin with conclusion of the exercise and count backward (e.g., negative numbers on x-axis represent hours before conclusion of exercise).

Despite the fact that this network, when considered as a timeless entity, is a single strong component including everyone, if crucial information had been discovered 23 hours and 40 minutes before the end of the exercise (less than halfway through the exercise), no more than 29 of the 150 individuals would have had access to it (and this would only have been the case if it was discovered by one of the 29 individuals comprising the largest strong component that would form).

Unlike in the Ph.D. hiring network, the abruptness of the change in the size of the largest strong component is not as apparent in network centralization. Figure 7 shows that, unlike the size of the largest strong component, betweenness centralization is much more linearly related to time, with more gradual fluctuations in the slope. Considering these data as well as the Ph.D. exchange network, it is clear that the presence of a giant component in no way guarantees centralization; however, as noted with the Ph.D. hiring example, the absence of a giant component guarantees lack of centralization. Furthermore, figure 8 shows that, similar to what we discovered with the Ph.D. hiring networks, average degree grows in a nearly perfect linear fashion as time passes, gaining approximately one degree per 12 hours. Thus, when the size of the largest strong component quintuples after 21 hours (a 400% increase), average degree only increases 5%, rendering the phase transition imperceptible from the perspective of individuals involved in the system.

Fig. 7.—Emergency responders’ e-mail network. Hours of data included begin with the start of the exercise.

Fig. 8.—Emergency responders’ e-mail network. Hours of data included begin with the start of the exercise.

ESTIMATING THE PHASE TRANSITION FROM SAMPLE DATA

Having explored the phase transition in two complete data sets, I now consider how we would investigate it if complete network data were unavailable and all we had was sample data. We would not be able to calculate precisely whether or not a giant component existed but instead would have to estimate the phase transition point.
Several methods have been proposed that could be modified for this purpose, but most have neglected the very clustering foundational to small world network models and have assumed the number of second neighbors (neighbors of neighbors) to be simply a function of the number of first neighbors.19 In contrast, I illustrate a method to estimate large-scale network properties even in the presence of clustering.

To begin, figure 9 illustrates clustering. In the figure, node 1 connects to four others, and each of these also connects to four others (assume that the network continues on past the nodes labeled 6 and higher but that those edges simply are not illustrated here). While node 1 has four first neighbors, it has nine, not 12 ($4 \times 3$), second neighbors (each of the four nodes that node 1 is connected to has already spent one of its four relations connecting to node 1). This occurs because nodes 2 and 3, both of which are first neighbors of node 1, each spend one of their relations connecting to each other and because nodes 4 and 5, both first neighbors of node 1, share a common second neighbor in node 12.

To account for clustering such as this, Grannis (2004) differentiated the number of first and second neighbors as two distinct variables. In Grannis’s formulation, $f_1$ equals the mean number of neighbors of a typical randomly chosen node, while $f_2$ equals the mean number of distinct second neighbors, regardless of how this number arises, whether influenced by transitivity or clustering or any other process that acts upon the distribution of second neighbors. The variable $f_2$ is measured independently of $f_1$ and, by ignoring those edges that do not contribute to unique second neighbors, explicitly accounts for clustering (as well as the necessary connection with the original node). The ratio $g = f_2 / f_1$, $f_1 \neq 0$, equals the proportional increase in the number of new neighbors. Thus, in figure 9, $f_1 = 4$, $f_2 = 9$, and $g = 2.25$.
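The counts for figure 9 can be checked mechanically. The sketch below is illustrative only: the helper names are hypothetical, and the edge list is an assumed reading of the figure. The text fixes only that node 1 has neighbors 2 through 5, that nodes 2 and 3 are tied to each other, that nodes 4 and 5 share node 12, and that the second neighbors are labeled 6 through 14; which of the remaining labels attaches to which first neighbor is a guess that does not affect the counts.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Undirected adjacency sets from a list of (a, b) pairs."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency


def neighbor_counts(adjacency, ego):
    """(first neighbors, distinct second neighbors) of `ego`.
    Second neighbors exclude ego itself and ego's first neighbors."""
    first = set(adjacency[ego])
    second = set()
    for neighbor in first:
        second.update(adjacency[neighbor])
    second -= first
    second.discard(ego)
    return len(first), len(second)


def growth_ratio(adjacency, egos):
    """g = f2 / f1, with f1 and f2 averaged over the sampled egos."""
    counts = [neighbor_counts(adjacency, ego) for ego in egos]
    f1 = sum(c[0] for c in counts) / len(counts)
    f2 = sum(c[1] for c in counts) / len(counts)
    if f1 == 0:
        raise ValueError("g is undefined when f1 is zero")
    return f2 / f1


# Assumed edge list for figure 9 (see the caveat in the lead-in).
FIGURE_9 = build_adjacency([
    (1, 2), (1, 3), (1, 4), (1, 5),   # node 1 and its four first neighbors
    (2, 3),                           # clustering: 2 and 3 are tied
    (2, 6), (2, 7), (3, 8), (3, 9),
    (4, 10), (4, 11), (4, 12),
    (5, 12), (5, 13), (5, 14),        # 4 and 5 share second neighbor 12
])

print(neighbor_counts(FIGURE_9, 1))   # (4, 9)
print(growth_ratio(FIGURE_9, [1]))    # 2.25
```

Because second neighbors are collected into a set, the tie between nodes 2 and 3 and the shared neighbor 12 are discounted automatically, which is the sense in which measuring the second-neighbor count directly absorbs clustering.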
Using this notation, we expect that the average node has $f_1$ first neighbors, $f_2 = f_1 g^{1}$ second neighbors, $f_3 = f_1 g^{2}$ third neighbors, and $f_m = f_1 g^{m-1}$ $m$th neighbors. The total number of neighbors reached in $l$ (or fewer) steps is given by the geometric series

$$\sum_{m=1}^{l} f_m = \sum_{m=1}^{l} f_1 g^{m-1} = f_1 \, \frac{g^{l} - 1}{g - 1}.$$

19 Newman (2009, p. 1) has accounted for some clustering in determining network-level properties (such as the phase transition) from local-level information using multivariate generating functions and an intriguing distinction between triangles and stubs. However, while Newman argues that “in principle, one could generalize the model further to include higher-order elements of four or more vertices,” he does not do so. These higher-order elements, which would be likely to appear in any social network with degree greater than two, would need to include not only complete graphs of size four, five, etc. but also triangles (and higher-order complete graphs) with shared edges.

Fig. 9.—Node 1 has four first neighbors, labeled 2–5, and nine second neighbors, neighbors of neighbors, labeled 6–14.

The expected size of the connected component to which a typical node belongs equals one (itself) plus the number of neighbors it could reach after an infinite number of steps. Substituting $\infty$ for $l$ in the formula above and adding 1 yields

$$f_1 \, \frac{g^{\infty} - 1}{g - 1} + 1.$$

If $g < 1$, $g^{\infty}$ asymptotically approaches zero and the expression reduces to

$$1 + \frac{f_1}{1 - g}.$$

If $g > 1$, the first term (and thus the entire expression) approaches infinity; the average component size is infinite (i.e., a giant component has formed). If, however, $g = 1$, then the first term becomes indeterminate, the phase transition point. A giant component exists when $g > 1$ and does not exist when $g < 1$.

Intuitively, one can understand this as follows. Assume that node A connects to nodes B, C, and D. We can consider these nodes as the starting points on branches originating from A.
Regardless of B’s, C’s, or D’s initial degrees, they must use one tie connecting to A (else they would not be A’s neighbor), and they may use some (perhaps none, perhaps all) of their other ties (if any) connecting to each other (i.e., clustering). Any remaining ties will ramify out into new branches. Assume that, after connecting to A and perhaps to each other, B, C, and D have zero, two, and one remaining ties, respectively, available to connect to new nodes. By connecting to B, the number of branches originating from A has decreased; by connecting to C, the number of branches originating from A has increased; and, by connecting to D, the number of branches originating from A has stayed the same. In general, given the context’s propensity for clustering, if the neighbors any typical node is likely to connect to will, after accounting for ties spent in clustering and the initial connection, on average, each yield more than one new branch, this process will expand throughout the network and a giant component can be expected to form.20

As an example application of this index, recall from the discussion of the Ph.D. hiring networks that the onset of the phase transition occurred between years 14 and 16 when the largest component doubled in size. The proportional increase in the number of new neighbors, $g$, rose steadily and, at 14 years, had a value of 0.94. In year 15, however, its value was 1.13, indicating that the phase transition had begun (and, by year 16, it was valued at 1.34). Similarly, recall from the discussion of the e-mail network among the emergency responders that, during the 20 minutes immediately following 21 event hours, the largest strong component quintupled in size from 11 members to 59 members and within the hour increased to 89 members; $g$, which had been rising steadily, reached a value of precisely 1.00 at 21 hours, the onset of the phase transition.
It then rose to 1.04 within the next 20 minutes and to 1.10 by the end of the hour.

20 At least, among the nonisolate portion of the population.

GSS DISCUSSION NETWORKS

I now demonstrate how to estimate the phase transition using sample data, specifically the General Social Survey (GSS) data on the confidants with whom Americans discuss important matters. Examinations of trends in GSS discussion networks (which were collected in 1985, 1987, and 2004) at the individual level have reported important changes in the last generation. For example, McPherson, Smith-Lovin, and Brashears (2006, p. 353) noted that individual networks are a third smaller in 2004 than in 1985 (about two people instead of three) and that the number of people saying there is no one with whom they discuss important matters nearly tripled.

Besides being sample data from which we will have to estimate the phase transition and account for sampling error, this data set differs from the previous networks we explored in one other important way. In the previous two networks, time was the factor we used to distinguish relations; in this case, however, we will use a qualitative criterion.

As discussed in the previous section, we can estimate whether the phase transition has occurred and a giant component exists by calculating the proportional increase in the number of new neighbors,

$$g = \frac{f_2}{f_1} \quad (f_1 \neq 0).$$

If $g > 1$, then the phase transition has occurred. To perform this calculation, we need to know the number of a node’s neighbors, $f_1$, as well as the number of distinct second neighbors that are not also first neighbors, $f_2$. The first variable, $f_1$, the number of others each individual nominates as someone they discuss important matters with, is readily available from the GSS. The second variable, $f_2$, requires some calculation.
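One way such a calculation might proceed, when nothing is known about ties among a respondent's alters, is sketched below. It rests on assumptions that should be read as such: alters are taken to be drawn with probability proportional to their own number of discussion partners, each alter spends exactly one tie on the respondent, and no ties are spent among alters. Under those assumptions an alter with k partners is reached with weight k and contributes k − 1 new ties, so the expected number of second neighbors per respondent is the sample mean of k(k − 1). The function name and the degree list are hypothetical; the list is not GSS data.

```python
def no_clustering_growth_ratio(degrees):
    """Return (f1, f2, g) from ego degree reports, assuming
    degree-proportional linking and no clustering among alters."""
    n = len(degrees)
    if n == 0:
        raise ValueError("need at least one respondent")
    f1 = sum(degrees) / n                         # mean first neighbors
    f2 = sum(k * (k - 1) for k in degrees) / n    # mean second neighbors
    if f1 == 0:
        raise ValueError("g is undefined when nobody names a partner")
    return f1, f2, f2 / f1


# Hypothetical numbers of discussion partners named by five respondents.
SAMPLE_DEGREES = [0, 1, 2, 3, 3]
f1, f2, g = no_clustering_growth_ratio(SAMPLE_DEGREES)
print(f1, f2, round(g, 4))
```

Note that the same expression results if each respondent is instead assumed to link only to others reporting the same number of partners, since a respondent with k partners then again contributes k(k − 1) second neighbors.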
The simplest, although probably not the most accurate, way to do this is to assume that one’s discussion partners do not discuss important matters with each other (i.e., that there is no clustering) but, rather, that they link to others randomly, with the only stipulation being that each discussion partner has used one tie connecting to the respondent; all other ties extend outward.21 Table 1 shows that, using this model, the value of $g$ is 2.975 in 1985 and 2.565 in 2004.22

21 By randomly, I mean that the probability that they will link to another individual is directly related to the number of discussion partners that individual has. For example, they are five times as likely to link to a person with five discussion partners as to someone with one, and, obviously, they will not have someone as a discussion partner who has no discussion partners.

22 Table 1 presents results using both the random method (described in n. 21 above) as well as the “Preferential Attachment” method, which assumes that respondents link to others who report the same number of alters that they do. This second method assumes that a respondent with three alters would connect to three persons, each of whom would also have three alters (one of which would be the respondent).

TABLE 1
Values for Individual-Level Predictors of the Phase Transition

                                            1985                              2004
                                            Preferential                      Preferential
Model                                       Attachment       Random           Attachment       Random
Proportional increase in new neighbors (g):
  No clustering                             2.975 (.06935)   2.975 (.05055)   2.565 (.09166)   2.565 (.09771)
  Especially close                          1.815 (.04975)   1.894 (.03968)   1.505 (.06020)   1.592 (.06161)
  Neither especially close nor strangers    .6858 (.02996)   .8277 (.03326)   .5046 (.03302)   .6818 (.03902)
Number of respondents                       1,525                             1,466
Average number of first neighbors (f1)      2.980 (.04409)                    1.987 (.04409)

Note.—Numbers in parentheses are SEs.

Alternatively, we could theorize that some of one’s discussion partners also discuss important matters with each other. While the GSS did not ask respondents which of those with whom they discussed important matters also discussed important matters with each other, it did ask respondents to place the relationship between each pair of the people they mentioned into one of three categories: “especially close, as close or closer” than their relationship to the respondent; “total strangers”; or somewhere in between. Which, if any, of these corresponds to discussing important matters is unknown. If we theorize that all pairs of individuals identified as “especially close” are, in fact, discussion partners,23 then members of pairs so identified each spend one of their ties connecting to the other. Thus, fewer ties will extend outward to others. Table 1 shows that, under this model, the value of $g$ is 1.894 in 1985 and 1.592 in 2004. For the “no clustering” model as well as the model of those who were “especially close” as discussion partners, $g > 1$ indicates that a giant component clearly unites most isolates.24

23 When dealing with clustered networks, at least some alters are constrained to have a minimum degree. For example, if a respondent with a degree of five reports that all five of those it connects to have as close or closer relationship to each other as they do to the respondent, then each of them must necessarily also have a degree of at least five.

24 In all cases, the size of the GSS data set keeps the standard errors relatively small compared to the estimates (see table 1), and thus their confidence intervals would not include the phase transition point.

Some might assume that only the more exclusive “especially close” relation represents those who would have in fact nominated each other as someone they discuss important matters with if the GSS had surveyed them. However, while it seems reasonable to assume that not all pairs of individuals whom respondents categorized as neither “especially close” nor “total strangers” would have nominated each other as someone they discuss important matters with, it is certainly arguable that some of them might have, given that this intermediate category implicitly included those “almost as close.” For example, in the author’s case, two of those whom the author discusses important matters with would not be “as close or closer” to each other as they are to the author, but they do discuss important matters with each other. Thus, if we further theorize that not only those who are “especially close” but those in the intermediate category, neither “especially close” nor “strangers,” are also discussion partners, then an even larger number of pairs of alters will spend ties connecting to each other. Table 1 shows that, using this model, the value of $g$ is 0.8277 in 1985 and 0.6818 in 2004. If this model is correct, $g < 1$ tells us that in neither year has the phase transition occurred and all components are minuscule.

While the distinction between identifying discussion partners with alters who were “especially close” and those in the intermediate category is undoubtedly the least subtle of the three examples presented in this article, correct classification is far from clear. While it is unlikely that all in the middle category are discussion partners, at least some may be (as is the case with the author). This definitional distinction, however, has dramatic effects when one considers the network model it generates. The difference between these two scenarios is not merely that one component is somewhat larger; it is a difference in orders of magnitude. It signals the difference theoretically between a society that is primarily united into a single discussion network and a society that has utterly disintegrated.
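To make the contrast between the scenarios concrete, the sketch below checks each estimate of g against the phase transition point and, where g falls below 1, evaluates the subcritical average component size 1 + f1/(1 − g) from the derivation earlier in the article. The estimates and standard errors are transcribed from the Random columns of table 1; the function names and the use of a normal-approximation 95% interval (z = 1.96) are assumptions of this sketch, since the article does not state how its confidence intervals were formed.

```python
# (year, model) -> (estimate of g, standard error); Random columns, table 1.
G_ESTIMATES = {
    (1985, "no clustering"): (2.975, 0.05055),
    (1985, "especially close"): (1.894, 0.03968),
    (1985, "intermediate also partners"): (0.8277, 0.03326),
    (2004, "no clustering"): (2.565, 0.09771),
    (2004, "especially close"): (1.592, 0.06161),
    (2004, "intermediate also partners"): (0.6818, 0.03902),
}
# Average number of first neighbors (f1) by year, from table 1.
F1 = {1985: 2.980, 2004: 1.987}


def classify(g, se, z=1.96):
    """Compare an approximate interval g +/- z*se with the threshold g = 1.
    z = 1.96 is an assumed normal-approximation multiplier."""
    low, high = g - z * se, g + z * se
    if low > 1:
        return "giant component"
    if high < 1:
        return "no giant component"
    return "indeterminate"


def average_component_size(f1, g):
    """1 + f1 / (1 - g) when g < 1; infinite once g reaches 1."""
    if g >= 1:
        return float("inf")
    return 1 + f1 / (1 - g)


for (year, model), (g, se) in sorted(G_ESTIMATES.items()):
    size = average_component_size(F1[year], g)
    print(year, model, classify(g, se), round(size, 1))
```

None of the six intervals straddles 1 under this approximation, so the qualitative conclusion turns entirely on which definition of a discussion tie is adopted rather than on sampling error.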
In the first case, it is possible that the typical person is involved in an extended discussion network (e.g., I discuss important matters with someone who discusses important matters with someone who discusses important matters with someone, etc.) that ultimately includes multiplied millions of people.25 While it is unlikely that the specifics of one’s discussions transmit over any distance, it is possible that general norms or values could diffuse and a general awareness, if not consensus, could form.

25 While if a “national” discussion network exists, a myriad of factors sociologists routinely study would undoubtedly guide and constrain it, my point is that it would be a society-level phenomenon, not an individual-level one.

The second case is quite distinct. To understand just how tiny the nonphased components are, we can use the formula derived above for calculating average component size when $g < 1$,

$$1 + \frac{f_1}{1 - g}.$$

We find that, in this case, the size of the average discussion component is 18 in 1985 and 7 in 2004. Thus, in this case, most persons’ discussion networks do not extend much beyond those they have direct discussion with. Instead of society consisting of an extended network diffusing norms and values, it would have been pulverized into tiny groups, perhaps no larger than a single individual’s discussion network.

IMPLICATIONS

Subtle distinctions in how we define and measure relations can lead to profoundly divergent global network characterizations, characterizations that would be invisible from the perspective of individuals. For the network formed by Ph.D. hires among sociology graduate programs, how recently we determined a hire must have been made to be considered influential led to severe discontinuities in our model of systemic properties.
As we excluded the oldest data and progressively moved our cutoff criteria to a more recent point, year by year, path lengths became slightly longer and a few graduate programs became disconnected in a linear fashion until we moved our chronological cutoff point from 16 to 14 years ago, at which point not only did the size of the largest strong component instantly halve but, instead of lengthening, average path lengths became a full step shorter. This subtle definitional distinction (whether we focused on hires made within the last 14 years or the last 16) determined whether we “discovered” a generally connected world of sociology or a small core of highly ranked graduate programs.

This network also highlighted an important point. When no “natural” cutoff exists in continuous relational data, we often do not impose one. This seems reasonable, but it is an assumption that needs to be justified and sometimes cannot be. In this context, it assumes implicitly either that all relations, even the oldest (over half a century old), are equally meaningful or (more likely) that a time-scaled version of our current data would produce a scaled-down, perhaps just less dense, but essentially similar version of our results. In this case, this would be true if we scale time back to 16 years ago, but past that point this assumption becomes completely invalid. The reality of the phase transition forces us to decide theoretically if we are interested only (or primarily) in prephase relations, in this case less than 14 years old, or if we believe older ones (although in this case only necessarily two years older) should somehow have equal merit.

The network formed by task-oriented e-mails among first responders also involved time, but at a much smaller scale. Our determination of how recently two individuals must have e-mailed each other to be considered to be “in contact” led to severe discontinuities in our model of systemic properties.
Furthermore, not only is the temporal nature of the data relevant in this case but how one categorizes the time element is crucial as well. If one looked at all of the data for the first 24 hours of the exercise as a set, one would find a giant strong component uniting most participants by short paths of information flow; however, if one had truncated this exercise, which began at 3 a.m., at the end of the first calendar day (three hours earlier), the largest strong component would have included only 11 of the 150 participants, and most individuals would have been members of even more trivially sized strong components. These superficially interchangeable categorization schemes would have yielded totally distinct theoretical conclusions.

Finally, I used a qualitative definitional distinction, which of a respondent’s alters we assumed to also be discussion partners, to explore GSS sample data about the network formed by individuals discussing important matters with each other. While this is arguably the least subtle definitional distinction of the three cases, because the phase transition is so very sensitive to local-level phenomena, the repercussions of potential inaccuracy in this case are overwhelming. If we are even somewhat wrong in our categorization of which alters are also discussion partners (if at least some of those categorized as not “especially close” would also be discussion partners), this signals the difference theoretically between an American society that is primarily united into a single discussion network and a society that has utterly disintegrated.

Because the phase transition is so sensitive to local-level phenomena, models of the phase transition are sensitive to the data we elicit about those phenomena.
In many cases, nuances in definition and measurement (e.g., the number of years for which a hiring relation between two graduate programs is believed to represent something meaningful, how to best organize the flux of e-mail communication into meaningful temporal categories, or the accuracy and precision ascribed to one’s understanding of their discussion partners’ relations to each other) may have dramatic impacts on our results, so dramatic that it might prove difficult, if not impossible, to theoretically interpret what one has empirically measured. To overcome this, I have proposed an index that offers researchers a way of assessing how concerned they should be about the sensitivity of their results to subtleties in definition and measurement and that will further allow them to more confidently study the less ubiquitous, but sometimes more meaningful, relations that occur near the point of the phase transition.

REFERENCES

American Sociological Association. 2003. Guide to Graduate Departments of Sociology. Washington, D.C.: American Sociological Association.
Barabási, Albert-László, and Réka Albert. 1999. “Emergence of Scaling in Random Networks.” Science 286:509–12.
Burris, Val. 2004. “The Academic Caste System: Prestige Hierarchies in Ph.D. Exchange Networks.” American Sociological Review 69:239–64.
Burt, Ronald S. 1997. “A Note on Social Capital and Network Content.” Social Networks 19:355–73.
Davis, James A. 1967. “Clustering and Structural Balance in Graphs.” Human Relations 20:181–87.
———. 1970. “Clustering and Hierarchy in Interpersonal Relations: Testing Two Theoretical Models on 742 Sociograms.” American Sociological Review 35:843–52.
Erdős, P., and A. Rényi. 1960. “On the Evolution of Random Graphs.” Publications of the Mathematical Institute of the Hungarian Academy of Sciences 5:17–61.
Fischer, Claude. 2008. “The 2004 GSS Finding of Shrunken Social Networks: An Artifact?” http://sociology.berkeley.edu/faculty/fischer/pdf/Fischer_2004%20GSS%20Networks_9-23-08.pdf. Retrieved September 23, 2008.
Friedkin, Noah E. 1983. “Horizons of Observability and Limits of Informal Control in Organizations.” Social Forces 62:54–77.
Grannis, Rick. 2004. “Sampling the Structure of Large-Scale Social Networks.” Presented at the annual meeting of the Research Committee on Logic and Methodology of the International Sociological Association, Amsterdam, August.
Granovetter, Mark. 1973. “The Strength of Weak Ties.” American Journal of Sociology 78:1360–80.
Han, Shin-Kap. 2003. “Tribal Regimes in Academia: A Comparative Analysis of Market Structure across Disciplines.” Social Networks 25:251–80.
Holland, Paul W., and Samuel Leinhardt. 1972. “Some Evidence on the Transitivity of Positive, Interpersonal Sentiment.” American Journal of Sociology 77:1205–9.
Janson, Svante, Tomasz Luczak, and Andrzej Rucinski. 2002. “The Phase Transition.” Chap. 5 (pp. 103–38) in Random Graphs. New York: Wiley.
Kleinfeld, Judith S. 2002. “Could It Be a Big World after All? The ‘Six Degrees of Separation’ Myth.” Society 39:61–66.
Maslov, Sergei, Kim Sneppen, and Alexei Zaliznyak. 2004. “Detection of Topological Patterns in Complex Networks: Correlation Profile of the Internet.” Physica A 333:529–40.
McPherson, Miller, Lynn Smith-Lovin, and Matthew E. Brashears. 2006. “Social Isolation in America: Changes in Core Discussion Networks over Two Decades.” American Sociological Review 71:353–75.
Milgram, Stanley. 1967. “The Small World Problem.” Psychology Today 1:61–67.
Molloy, Michael, and Bruce Reed. 1995. “A Critical Point for Random Graphs with a Given Degree Sequence.” Random Structures and Algorithms 6:161–79.
———. 1998. “The Size of the Giant Component of a Random Graph with a Given Degree Sequence.” Combinatorics, Probability and Computing 7:295–305.
Moody, James. 2002. “The Importance of Relationship Timing for Diffusion: Indirect Connectivity and STD Infection Risk.” Social Forces 81:25–56.
Moody, James, Daniel McFarland, and Skye Bender-deMoll. 2005. “Dynamic Network Visualization.” American Journal of Sociology 110:1206–41.
Newman, Mark E. J. 2009. “Random Graphs with Clustering.” Physical Review Letters, vol. 103, art. 058701, pp. 1–4.
Newman, M. E. J., S. H. Strogatz, and D. J. Watts. 2001. “Random Graphs with Arbitrary Degree Distributions and Their Applications.” Physical Review E, vol. 64, art. 026118.
Rapoport, Anatol. 1953. “Spread of Information through a Population with Sociostructural Bias: I. Assumption of Transitivity.” Bulletin of Mathematical Biophysics 15:523–33.
———. 1957. “A Contribution to the Theory of Random and Biased Nets.” Bulletin of Mathematical Biophysics 19:257–71.
———. 1963. “Mathematical Models of Social Interaction.” Pp. 493–579 in Handbook of Mathematical Psychology, vol. 1. Edited by R. Duncan Luce, Robert R. Bush, and Eugene Galanter. New York: Wiley & Sons.
Schelling, Thomas C. 1978. Micromotives and Macrobehavior. New York: Norton.
Straits, Bruce C. 2000. “Ego’s Important Discussants or Significant People: An Experiment in Varying the Wording of Personal Network Name Generators.” Social Networks 22:123–40.
Ulanoff, Lance. 2005. “Six Degrees of ‘Who Cares?’” PC Magazine, August.
Uzzi, Brian, and Jarrett Spiro. 2005. “Collaboration and Creativity: The Small World Problem.” American Journal of Sociology 111:447–504.
Watts, Duncan. 1999a. “Networks, Dynamics and the Small World Phenomenon.” American Journal of Sociology 105 (2): 493–527.
———. 1999b. Small Worlds: The Dynamics of Networks between Order and Randomness. Princeton, N.J.: Princeton University Press.
Watts, Duncan, and Steven Strogatz. 1998. “Collective Dynamics of ‘Small World’ Networks.” Nature 393:440–42.