Commit cc108e0

David Robinson authored and committed
Changed blocks to notes, in a way that can be compiled for O'Reilly
1 parent 876f7f5 commit cc108e0

4 files changed: +11 -11 lines changed

04-word-combinations.Rmd

Lines changed: 4 additions & 4 deletions
@@ -31,7 +31,7 @@ austen_bigrams

 This data structure is still a variation of the tidy text format. It is structured as one-token-per-row (with extra metadata, such as `book`, still preserved), but each token now represents a bigram.

-```{block, type = "rmdnote"}
+```NOTE
 Notice that these bigrams overlap: "sense and" is one token, while "and sensibility" is another.
 ```
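For context, the `austen_bigrams` object named in the hunk header comes from bigram tokenization. A minimal sketch consistent with the chapter (package and column names as used in the book):

```r
library(dplyr)
library(tidytext)
library(janeaustenr)

# One token per row, where each token is a pair of adjacent words;
# metadata columns such as `book` are preserved by unnest_tokens().
austen_bigrams <- austen_books() %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)
```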

@@ -273,15 +273,15 @@ ggraph(bigram_graph, layout = "fr") +

 It may take some experimentation with ggraph to get your networks into a presentable format like this, but the network structure is a useful and flexible way to visualize relational tidy data.

-```{block, type = "rmdnote"}
+```NOTE
 Note that this is a visualization of a **Markov chain**, a common model in text processing. In a Markov chain, each choice of word depends only on the previous word. In this case, a random generator following this model might spit out "dear", then "sir", then "william/walter/thomas/thomas's", by following each word to the most common words that follow it. To make the visualization interpretable, we chose to show only the most common word-to-word connections, but one could imagine an enormous graph representing all connections that occur in the text.
 ```

 ### Visualizing bigrams in other texts

 We put a good amount of work into cleaning and visualizing bigrams on a text dataset, so let's collect it into a function so that we can easily perform it on other text datasets.

-```{block, type = "rmdnote"}
+```NOTE
 To make it easy to use the `count_bigrams()` and `visualize_bigrams()` functions yourself, we've also reloaded the packages necessary for them.
 ```
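The Markov chain note in the hunk above is easy to make concrete. A toy sketch, not from the book, that follows each word to its most common successor in a `bigram_counts` table (columns `word1`, `word2`, `n`, as built earlier in the chapter):

```r
library(dplyr)

follow_chain <- function(bigram_counts, start, steps = 5) {
  out <- start
  current <- start
  for (i in seq_len(steps)) {
    # The most common word following the current one
    nxt <- bigram_counts %>%
      filter(word1 == current) %>%
      slice_max(n, n = 1, with_ties = FALSE) %>%
      pull(word2)
    if (length(nxt) == 0) break
    out <- c(out, nxt)
    current <- nxt
  }
  out
}

# follow_chain(bigram_counts, "dear")  # might yield "dear" "sir" ...
```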
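The bodies of `count_bigrams()` and `visualize_bigrams()` aren't shown in this diff; a plausible implementation consistent with the chapter's pipeline (bigram tokenization, stop-word filtering, ggraph plotting) would be:

```r
library(dplyr)
library(tidyr)
library(tidytext)
library(igraph)
library(ggraph)  # attaches ggplot2

count_bigrams <- function(dataset) {
  dataset %>%
    unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
    separate(bigram, c("word1", "word2"), sep = " ") %>%
    filter(!word1 %in% stop_words$word,
           !word2 %in% stop_words$word) %>%
    count(word1, word2, sort = TRUE)
}

visualize_bigrams <- function(bigrams) {
  bigrams %>%
    graph_from_data_frame() %>%
    ggraph(layout = "fr") +
    geom_edge_link(aes(edge_alpha = n), show.legend = FALSE) +
    geom_node_point(color = "lightblue", size = 5) +
    geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
    theme_void()
}
```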

@@ -410,7 +410,7 @@ For example, that $n_{11}$ represents the number of documents where both word X

 $$\phi=\frac{n_{11}n_{00}-n_{10}n_{01}}{\sqrt{n_{1\cdot}n_{0\cdot}n_{\cdot0}n_{\cdot1}}}$$

-```{block, type = "rmdnote"}
+```NOTE
 The phi coefficient is equivalent to the Pearson correlation, which you may have heard of elsewhere, when it is applied to binary data.
 ```
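The note's equivalence claim is easy to verify numerically. A quick check with made-up binary vectors (any 0/1 data would do):

```r
# Toy binary data
x <- c(1, 1, 1, 0, 0, 1, 0, 0, 1, 0)
y <- c(1, 0, 1, 0, 0, 1, 0, 1, 1, 0)

# Cell counts of the 2x2 table
n11 <- sum(x == 1 & y == 1)
n10 <- sum(x == 1 & y == 0)
n01 <- sum(x == 0 & y == 1)
n00 <- sum(x == 0 & y == 0)

# Phi as defined in the equation above (margin totals in the denominator)
phi <- (n11 * n00 - n10 * n01) /
  sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))

phi        # 0.6
cor(x, y)  # 0.6, the Pearson correlation agrees
```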

05-document-term-matrices.Rmd

Lines changed: 3 additions & 3 deletions
@@ -69,7 +69,7 @@ ap_td

 Notice that we now have a tidy three-column `tbl_df`, with variables `document`, `term`, and `count`. This tidying operation is similar to the `melt()` function from the reshape2 package [@R-reshape2] for non-sparse matrices.

-```{block, type = "rmdnote"}
+```NOTE
 Notice that only the non-zero values are included in the tidied output: document 1 includes terms such as "adding" and "adult", but not "aaron" or "abandon". This means the tidied version has no rows where `count` is zero.
 ```
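For context, the tidying step the hunk describes is a one-liner; a minimal sketch using the `AssociatedPress` matrix that ships with topicmodels:

```r
library(tidytext)

# AssociatedPress is a sparse DocumentTermMatrix; tidy() converts it to
# one row per non-zero document-term pair.
data("AssociatedPress", package = "topicmodels")
ap_td <- tidy(AssociatedPress)
ap_td  # columns: document, term, count
```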

@@ -153,7 +153,7 @@ inaug_tf_idf %>%

 As another example of a visualization possible with tidy data, we could extract the year from each document's name, and compute the total number of words within each year.

-```{block, type = "rmdnote"}
+```NOTE
 Note that we've used tidyr's `complete()` function to include zeroes (cases where a word didn't appear in a document) in the table.
 ```
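A sketch of the described steps, assuming a tidied inaugural-address table `inaug_td` with columns `document`, `term`, and `count` (the object name and regex are assumptions consistent with the chapter):

```r
library(dplyr)
library(tidyr)

year_term_counts <- inaug_td %>%
  # Pull the numeric year out of document names like "1793-Washington"
  extract(document, "year", "(\\d+)", convert = TRUE) %>%
  # Fill in zero counts for year/term pairs that never occur
  complete(year, term, fill = list(count = 0)) %>%
  group_by(year) %>%
  mutate(year_total = sum(count)) %>%
  ungroup()
```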

@@ -276,7 +276,7 @@ acq_tokens %>%

 Here we'll retrieve recent articles relevant to nine major technology stocks: Microsoft, Apple, Google, Amazon, Facebook, Twitter, IBM, Yahoo, and Netflix.

-```{block, type = "rmdnote"}
+```NOTE
 These results were downloaded in January 2017, when this chapter was written, but you'll certainly find different results if you run it yourself. Note that this code takes several minutes to run.
 ```
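The download code itself isn't shown in this hunk. A sketch of the approach the chapter used, via the tm.plugin.webmining package as it worked in early 2017 (the helper name is an assumption, and the underlying feeds have since changed):

```r
library(dplyr)
library(purrr)
library(tm.plugin.webmining)

company <- c("Microsoft", "Apple", "Google", "Amazon", "Facebook",
             "Twitter", "IBM", "Yahoo", "Netflix")
symbol <- c("MSFT", "AAPL", "GOOG", "AMZN", "FB",
            "TWTR", "IBM", "YHOO", "NFLX")

# Fetch a corpus of recent articles for one ticker
download_articles <- function(symbol) {
  WebCorpus(GoogleFinanceSource(paste0("NASDAQ:", symbol)))
}

stock_articles <- tibble(company = company, symbol = symbol) %>%
  mutate(corpus = map(symbol, download_articles))
```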

06-topic-models.Rmd

Lines changed: 1 addition & 1 deletion
@@ -40,7 +40,7 @@ AssociatedPress

 We can use the `LDA()` function from the topicmodels package, setting `k = 2`, to create a two-topic LDA model.

-```{block, type = "rmdnote"}
+```NOTE
 Almost any topic model in practice will use a larger `k`, but we will soon see that this analysis approach extends to a larger number of topics.
 ```
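A sketch of the call the text describes; the seed is there so the randomized initialization is reproducible (the published chapter uses `seed = 1234`):

```r
library(topicmodels)

data("AssociatedPress", package = "topicmodels")

# A two-topic LDA model of the AP articles
ap_lda <- LDA(AssociatedPress, k = 2, control = list(seed = 1234))
ap_lda
```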

09-usenet.Rmd

Lines changed: 3 additions & 3 deletions
@@ -15,7 +15,7 @@ In our final chapter, we'll use what we've learned in this book to perform a sta

 We'll start by reading in all the messages from the `20news-bydate` folder, which are organized in sub-folders with one file for each message. We can read in files like these with a combination of `read_lines()`, `map()` and `unnest()`.

-```{block, type = "rmdwarning"}
+```WARNING
 Note that this step may take several minutes to read all the documents.
 ```
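A sketch of the reading step, with the folder path and helper name as assumptions; the pattern is `read_lines()` per file, `map()` over files, and `unnest()` into one row per line:

```r
library(dplyr)
library(purrr)
library(readr)
library(tidyr)

training_folder <- "data/20news-bydate/20news-bydate-train/"

# Read every file in one newsgroup folder into a one-line-per-row table
read_folder <- function(infolder) {
  tibble(file = dir(infolder, full.names = TRUE)) %>%
    mutate(text = map(file, read_lines)) %>%
    transmute(id = basename(file), text) %>%
    unnest(text)
}

raw_text <- tibble(folder = dir(training_folder, full.names = TRUE)) %>%
  mutate(folder_out = map(folder, read_folder)) %>%
  unnest(folder_out) %>%
  transmute(newsgroup = basename(folder), id, text)
```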

@@ -86,7 +86,7 @@ cleaned_text <- raw_text %>%

 Many lines also have nested text representing quotes from other users, typically starting with a line like "so-and-so writes..." These can be removed with a few regular expressions.

-```{block, type = "rmdnote"}
+```NOTE
 We also choose to manually remove two messages, `9704` and `9985`, that contained a large amount of non-text content.
 ```
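The regular expressions aren't shown in this hunk; a sketch of what such a filter could look like, with the patterns as assumptions modeled on the text's description:

```r
library(dplyr)
library(stringr)

cleaned_text <- cleaned_text %>%
  filter(# keep blank lines and lines that don't start with quote markers
         str_detect(text, "^[^>]+[A-Za-z\\d]") | text == "",
         # drop "so-and-so writes:" / "writes..." attribution lines
         !str_detect(text, "writes(:|\\.\\.\\.)$"),
         # drop Usenet "In article <...>" reply headers
         !str_detect(text, "^In article <"),
         # drop the two messages with large non-text payloads
         !id %in% c(9704, 9985))
```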

@@ -361,7 +361,7 @@ sentiment_messages <- usenet_words %>%
   filter(words >= 5)
 ```

-```{block, type = "rmdnote"}
+```NOTE
 As a simple measure to reduce the role of randomness, we filtered out messages that had fewer than five words that contributed to sentiment.
 ```
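For context, a sketch of the pipeline whose tail the hunk shows. Column names are assumptions: current tidytext returns the AFINN score in a `value` column (it was `score` when the book was written):

```r
library(dplyr)
library(tidytext)

sentiment_messages <- usenet_words %>%
  inner_join(get_sentiments("afinn"), by = "word") %>%
  group_by(newsgroup, id) %>%
  # Average AFINN value per message, plus how many words contributed
  summarize(sentiment = mean(value), words = n()) %>%
  ungroup() %>%
  filter(words >= 5)
```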
