05-document-term-matrices.Rmd (10 additions, 2 deletions)
@@ -14,7 +14,13 @@ In the previous chapters, we've been analyzing text arranged in the tidy text fo
However, most of the existing R tools for natural language processing, besides the tidytext package, aren't compatible with this format. The [CRAN Task View for Natural Language Processing](https://cran.r-project.org/web/views/NaturalLanguageProcessing.html) lists a large selection of packages that take other structures of input and provide non-tidy outputs. These packages are very useful in text mining applications, and many existing text datasets are structured according to these formats.
-Computer scientist Hal Abelson has observed that "No matter how complex and polished the individual operations are, it is often the quality of the glue that most directly determines the power of the system" [@Friedman:2008:EPL:1378240]. In that spirit, this chapter will discuss the "glue" that connects the tidy text format with other important packages and data structures, allowing you to rely on both existing text mining packages and the suite of tidy tools to perform your analysis. In particular, we'll examine the process of tidying document-term matrices, as well as casting a tidy data frame into a sparse matrix.
+Computer scientist Hal Abelson has observed that "No matter how complex and polished the individual operations are, it is often the quality of the glue that most directly determines the power of the system" [@Friedman:2008:EPL:1378240]. In that spirit, this chapter will discuss the "glue" that connects the tidy text format with other important packages and data structures, allowing you to rely on both existing text mining packages and the suite of tidy tools to perform your analysis.
+
+```{r flowchart, echo = FALSE, out.width = '100%', fig.cap = "A flowchart of a typical text analysis that combines tidytext with other tools and data formats. This chapter shows how to convert between document-term matrices and tidy data frames, as well as converting from a Corpus object to a text data frame. The topicmodels and mallet packages are explored in Chapter 6."}
+knitr::include_graphics("images/flowchart.png")
+```
+
+Figure \@ref(fig:flowchart) illustrates how an analysis might switch between tidy and non-tidy data structures and tools, a process that will be explored in these next two chapters. This chapter will focus on the process of tidying document-term matrices, as well as casting a tidy data frame into a sparse matrix. We'll also explore how to tidy Corpus objects into text data frames, including a case study of ingesting and analyzing financial articles.
## Tidying a document-term matrix {#tidy-dtm}
@@ -31,6 +37,8 @@ DTM objects cannot be used directly with tidy tools, just as tidy data frames ca
* `tidy()` turns a document-term matrix into a tidy data frame. This verb comes from the broom package [@R-broom], which provides similar tidying functions for many statistical models and objects.
* `cast()` turns a tidy one-term-per-row data frame into a matrix. tidytext provides three variations of this verb, each converting to a different type of matrix: `cast_sparse()` (converting to a sparse matrix from the Matrix package), `cast_dtm()` (converting to a `DocumentTermMatrix` object from tm), and `cast_dfm()` (converting to a `dfm` object from quanteda).
+As shown in Figure \@ref(fig:flowchart), a DTM is typically comparable to a tidy data frame after a `count` or a `group_by`/`summarize` that contains counts or another statistic for each combination of a term and document.
+
### Tidying DocumentTermMatrix objects
Perhaps the most widely used implementation of DTMs in R is the `DocumentTermMatrix` class in the tm package. Many available text mining datasets are provided in this format. For example, consider the collection of Associated Press newspaper articles included in the topicmodels package.
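To make the `tidy()`/`cast()` round trip described above concrete, here is a minimal sketch (not part of the diff itself) using the AssociatedPress data just mentioned; it assumes the tidytext and topicmodels packages are installed:

```r
# A minimal sketch of the tidy()/cast() round trip, assuming the
# tidytext and topicmodels packages are installed
library(tidytext)
library(topicmodels)

data("AssociatedPress")

# tidy() converts the sparse DTM into a data frame with one row per
# document-term pair, with columns `document`, `term`, and `count`
ap_td <- tidy(AssociatedPress)

# cast_dtm() goes the other direction, from tidy counts back to a DTM
ap_dtm <- cast_dtm(ap_td, document, term, count)
```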
@@ -353,7 +361,7 @@ stock_tokens %>%
labs(y = "Frequency of word * AFINN score")
```
-In the context of these financial articles, there are a few big red flags here. The words "share" and "shares" are counted as positive verbs by the AFINN lexicon ("Alice will **share** her cake with Bob"), but they're actually neutral nouns ("The stock price is $12 per **share**") that could just as easily be in a positive sentence as a negative one. The word "fool" is even more deceptive: it refers to Motley Fool, a financial services company. In short, we can see that the AFINN sentiment lexicon is entirely unsuited to the context of financial data (as are the NRC and Bing).
+In the context of these financial articles, there are a few big red flags here. The words "share" and "shares" are counted as positive verbs by the AFINN lexicon ("Alice will **share** her cake with Bob"), but they're actually neutral nouns ("The stock price is $12 per **share**") that could just as easily be in a positive sentence as a negative one. The word "fool" is even more deceptive: it refers to Motley Fool, a financial services company. In short, we can see that the AFINN sentiment lexicon is entirely unsuited to the context of financial data (as are the NRC and Bing lexicons).
Instead, we introduce another sentiment lexicon: the Loughran and McDonald dictionary of financial sentiment terms [@loughran2011liability]. This dictionary was developed based on analyses of financial reports, and intentionally avoids words like "share" and "fool", as well as subtler terms like "liability" and "risk" that may not have a negative meaning in a financial context.
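As a hedged sketch of how this lexicon can be loaded: tidytext's `get_sentiments()` accepts `"loughran"` (the lexicon itself is distributed via the textdata package, which may prompt for a one-time download):

```r
# A sketch of loading the Loughran-McDonald lexicon; get_sentiments()
# may prompt to download it via the textdata package on first use
library(tidytext)
library(dplyr)

loughran <- get_sentiments("loughran")

# words are tagged with financial sentiment categories such as
# "positive", "negative", "litigious", and "uncertainty"
loughran %>% count(sentiment, sort = TRUE)
```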

06-topic-models.Rmd (12 additions, 5 deletions)
@@ -14,7 +14,7 @@ In text mining, we often have collections of documents, such as blog posts or ne
Latent Dirichlet allocation (LDA) is a particularly popular method for fitting a topic model. It treats each document as a mixture of topics, and each topic as a mixture of words. This allows documents to "overlap" each other in terms of content, rather than being separated into discrete groups, in a way that mirrors typical use of natural language.
-We can use tidy text principles to approach topic modeling with the same set of tidy tools we've used throughout this book. In this chapter, we'll learn to tidy `LDA` objects from the [topicmodels package](https://cran.r-project.org/package=topicmodels) to examine model results using ggplot2 and dplyr. We'll also explore an example of clustering chapters from several books, where we can see that a topic model "learns" to tell the difference between the four books based on the text content.
+We can use tidy text principles to approach topic modeling with the same set of tidy tools we've used throughout this book. In this chapter, we'll learn to work with `LDA` objects from the [topicmodels package](https://cran.r-project.org/package=topicmodels). As shown in Figure \@ref(fig:flowchart) in the last chapter, the topicmodels package produces a model object that can be tidied with tidytext to make it compatible with ggplot2 and dplyr. We'll also explore an example of clustering chapters from several books, where we can see that a topic model "learns" to tell the difference between the four books based on the text content.
## Latent Dirichlet allocation
@@ -34,7 +34,13 @@ data("AssociatedPress")
AssociatedPress
```
-We can use the `LDA()` function from the topicmodels package, setting `k = 2`, to create a two-topic LDA model. (Almost any topic model in practice will use a larger `k`, but we will see in Chapter \@ref(library-heist) that this analysis approach extends to a larger number of topics). This function returns an object containing the full details of the model fit, such as how words are associated with topics and how topics are associated with documents.
+We can use the `LDA()` function from the topicmodels package, setting `k = 2`, to create a two-topic LDA model.
+
+```{block, type = "rmdnote"}
+Almost any topic model in practice will use a larger `k`, but we will soon see that this analysis approach extends to a larger number of topics.
+```
+
+This function returns an object containing the full details of the model fit, such as how words are associated with topics and how topics are associated with documents.
```{r ap_lda}
# set a seed so that the output of the model is predictable
@@ -253,7 +259,7 @@ chapters_gamma
Each of these values is an estimated proportion of words from that document that are generated from that topic. For example, the model estimates that each word in the `r chapters_gamma$document[1]` document has only a `r percent(chapters_gamma$gamma[1])` probability of coming from topic 1 (Pride and Prejudice).
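The per-document-per-topic probabilities discussed here come from the `matrix = "gamma"` option of tidytext's `tidy()` verb; a minimal sketch, assuming `chapters_lda` is the LDA model fitted earlier in the chapter:

```r
# A sketch of extracting gamma values, assuming `chapters_lda` is an
# LDA model already fitted with topicmodels::LDA()
library(tidytext)

# one row per document-topic pair, with a `gamma` column giving the
# estimated proportion of the document's words generated by that topic
chapters_gamma <- tidy(chapters_lda, matrix = "gamma")
chapters_gamma
```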
-Now that we have these topic-probabilities, we can see how well our unsupervised learning did at distinguishing the four books. We'd expect that chapters within a book would be found to be mostly (or entirely), generated from the corresponding topic.
+Now that we have these topic probabilities, we can see how well our unsupervised learning did at distinguishing the four books. We'd expect that chapters within a book would be found to be mostly (or entirely) generated from the corresponding topic.
First we re-separate the document name into title and chapter, after which we can visualize the per-document-per-topic probability for each (Figure \@ref(fig:chaptersldagamma)).
@@ -340,7 +346,7 @@ assignments %>%
fill = "% of assignments")
```
-We notice that almost all the words for *Pride and Prejudice*, *Twenty Thousand Leagues Under the Sea*, and *War of the Worlds* were correctly assigned, while *Great Expectations* had a fair amount of misassigned words (which, as we saw above, led to two chapters getting misclassified).
+We notice that almost all the words for *Pride and Prejudice*, *Twenty Thousand Leagues Under the Sea*, and *War of the Worlds* were correctly assigned, while *Great Expectations* had a fair number of misassigned words (which, as we saw above, led to two chapters getting misclassified).
@@ -418,4 +425,4 @@ We could use ggplot2 to explore and visualize the model in the same way we did t
## Summary
-This chapter introduces topic modeling for finding clusters of words within documents, and shows how the `tidy()` verb lets us explore and understand these models using dplyr and ggplot2. This is one of the advantages of the tidy approach to model exploration; the challenges of different output formats are handled by the tidying functions, and we can explore model results using a standard set of tools. In particular, we examined
+This chapter introduces topic modeling for finding clusters of words that characterize a set of documents, and shows how the `tidy()` verb lets us explore and understand these models using dplyr and ggplot2. This is one of the advantages of the tidy approach to model exploration: the challenges of different output formats are handled by the tidying functions, and we can explore model results using a standard set of tools. In particular, we saw that topic modeling is able to separate and distinguish chapters from four separate books, and explored the limitations of the model by finding words and chapters that it assigned incorrectly.

09-usenet.Rmd (16 additions, 4 deletions)
@@ -13,7 +13,11 @@ In our final chapter, we'll use what we've learned in this book to perform a sta
## Pre-processing
-We'll start by reading in all the messages from the `20news-bydate` folder, which are organized in sub-folders with one file for each message. We can read in files like these with a combination of `read_lines()`, `map()` and `unnest()`. Note that this step takes several minutes to read all the documents.
+We'll start by reading in all the messages from the `20news-bydate` folder, which are organized in sub-folders with one file for each message. We can read in files like these with a combination of `read_lines()`, `map()` and `unnest()`.
+
+```{block, type = "rmdwarning"}
+Note that this step may take several minutes to read all the documents.
+```
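The ingestion step just described can be sketched as follows; the folder path and helper name here are assumptions for illustration, not the chapter's verbatim code:

```r
# A hedged sketch of reading the 20news-bydate messages; the path
# below is an assumption, not necessarily the chapter's exact layout
library(dplyr)
library(tidyr)
library(purrr)
library(readr)

training_folder <- "data/20news-bydate/20news-bydate-train/"

# read every file in one sub-folder into a data frame of message lines
read_folder <- function(infolder) {
  tibble(file = dir(infolder, full.names = TRUE)) %>%
    mutate(text = map(file, read_lines)) %>%
    transmute(id = basename(file), text) %>%
    unnest(text)
}
```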
```{r libraries}
library(dplyr)
@@ -80,7 +84,11 @@ cleaned_text <- raw_text %>%
ungroup()
```
-Many lines also have nested text representing quotes from other users, typically starting with a line like "so-and-so writes..." These can be removed with a few regular expressions. (We also choose to manually remove two messages, `9704` and `9985` that contained a large amount of non-text content).
+Many lines also have nested text representing quotes from other users, typically starting with a line like "so-and-so writes..." These can be removed with a few regular expressions.
+
+```{block, type = "rmdnote"}
+We also choose to manually remove two messages, `9704` and `9985`, that contained a large amount of non-text content.
+```
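One way such quote-stripping can look with stringr; the patterns below are illustrative assumptions, not the chapter's exact regular expressions:

```r
# A hedged sketch of removing quoted text; these patterns are
# illustrative assumptions, not the chapter's verbatim code
library(dplyr)
library(stringr)

remove_quotes <- function(df) {
  df %>%
    filter(
      !str_detect(text, "writes(:|\\.\\.\\.)$"),  # "so-and-so writes:" lines
      !str_detect(text, "^>")                     # lines quoted with ">"
    )
}
```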
```{r cleaned_text2, dependson = "cleaned_text1"}
cleaned_text <- cleaned_text %>%
@@ -145,7 +153,7 @@ tf_idf %>%
coord_flip()
```
-We see lots of characteristic words specific to particular newsgroup, such as "wiring" and "circuit" on the sci.electronics topic and "orbit" and "lunar" for the space newsgroup. You could use this same code to explore other newsgroups yourself.
+We see lots of characteristic words specific to a particular newsgroup, such as "wiring" and "circuit" on the sci.electronics topic and "orbit" and "lunar" for the space newsgroup. You could use this same code to explore other newsgroups yourself.
-As a simple measure to reduce the role of randomness, we filtered out messages that had fewer than five words that contributed to sentiment. What were the most positive messages?
+```{block, type = "rmdnote"}
+As a simple measure to reduce the role of randomness, we filtered out messages that had fewer than five words that contributed to sentiment.
+```
+
+What were the most positive messages?