
Lecture 06.

Text As Data
(Chapter 10 of DSB and Chapter 8 of BDS)

Ping Yu

HKU Business School


The University of Hong Kong



Introduction

"Text mining, also referred to as text data mining, roughly equivalent to text analytics, refers to the process of deriving high-quality information from text." - Wikipedia
"Another way to view text data mining is as a process of exploratory data analysis that leads to heretofore unknown information, or to answers for questions for which the answer is not currently known." - Marti A. Hearst (UC-Berkeley), 1999

Text Mining = Text Data + Data Mining

For new forms of data, we must either engineer the data representation to match
the tools, or build new tools to match the data. The former is generally simpler.
We will do both: apply the techniques we learned in previous lectures to text data,
and also develop new techniques specific to text data.
- In the former category, we can use the regression techniques from Lectures 2, 4
and 5 or from future Lectures 7-10.
- In the latter category, we will discuss one unsupervised generative model (LDA) and one supervised generative model (MNIR).
- Recall that regression means modelling from x to y, and generative means modelling from y to x.
Continued

Why is text important? There are many applications in finance, macroeconomics, media economics, marketing, political economy, etc.
Why is text difficult? Text is "unstructured" (i.e., it does not have the usual data structure) and "dirty" (e.g., misspelling, abbreviating, etc.); the meaning of text is context dependent; and p may be much greater than n (so ML is useful).
So text must undergo a good amount of preprocessing before it can be used as input to a data mining algorithm.
- The general strategy in text mining is to use the simplest (least expensive) technique that works.
A typical text analysis includes three steps:
1. Represent raw text as a numerical array X.
2. Map X to predictions $\hat{y}$ of unknown outcomes y.
3. Use $\hat{y}$ in subsequent descriptive or causal analysis.

Terminology from Information Retrieval (IR): A document is one piece of text, no matter how large or small, and is one sample point; a document is composed of individual tokens or terms; a collection of documents is called a corpus,¹ and is the dataset.
¹ Latin for "body". The plural is corpora.
Good References for Practitioners (not theorists)

Gentzkow, Kelly and Taddy, 2019, Text as Data, Journal of Economic Literature,
57(3), 535-574.


Text Representation


Tokenization

Meaning is built from ordered sequences of words that refer to each other, often scattered across sentences or paragraphs.
However, text analyses often ignore this complexity and rely solely on counts of language tokens: words, phrases, or other basic elements (separated by whitespace).
- See Lecture 10 for the extensions in deep learning.
Much of the information in text is contained in these simple counts, and token-based learning is fast and effective, especially when $n_{\text{train}}$ is not big.
The bag-of-words representation of a text just counts the frequency of some key terms in the text.
- Treat documents as if they were constructed by randomly drawing words with replacement from a bucket containing the fixed vocabulary, and throw away all more complex information (e.g., grammar, word order, sentence structure, and (usually) punctuation); simplicity is its advantage.
- This approach treats every word in a document as an independent potential keyword (feature) of the document.
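As a rough illustration of the bag-of-words idea, here is a minimal base-R sketch that builds a small document-term count matrix; the two-sentence corpus, the lowercasing, and the whitespace tokenization are illustrative assumptions rather than any package's pipeline.

```r
# Build a tiny bag-of-words document-term matrix in base R (toy corpus).
docs <- c("the food was great and the service was great",
          "the service was slow but the food was fine")

tokens <- lapply(docs, function(d) {
  d <- tolower(gsub("[[:punct:]]", " ", d))   # normalize case, strip punctuation
  strsplit(d, "\\s+")[[1]]                    # split on whitespace
})

vocab <- sort(unique(unlist(tokens)))         # the fixed vocabulary
X <- t(sapply(tokens, function(tk) table(factor(tk, levels = vocab))))
rownames(X) <- paste0("doc", seq_along(docs))
X                                             # n x p count matrix (word order discarded)
```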


More Details

First remove the following:
- Overly common or rare words [see Zipf's Law in the next slide]
- Isolated punctuation (e.g., a lone comma, but not emoticons like :-P) and numbers
- Common suffixes like s or ing (so-called stemming), e.g., ladies → lady, referring → refer, forgotten → forget
Note that case is usually normalized: every term is in lowercase.
This pruning can improve computational and statistical efficiency.
But this operation is not strict; e.g., stop words like if, but, and, of or on, and the frequency of the, can reflect the writing style of an author.
- A classical example is Mosteller and Wallace (1963), who use counts of common words to classify authorship for the disputed Federalist papers.
Exclusion of rare words (which are rich in meaning) is only because they are too rare to convey much information (e.g., in clustering), but when more data are accumulated, previously rare words can be added.
- Typically remove all words that are not in at least N (e.g., 75%) documents, where N is some intuitive minimum threshold. Also check the sensitivity of results to N.


(*) Zipf’s Law

Zipf's law [figure here] states that the frequency of a word is inversely proportional to its rank in the frequency table:
\[
f(k \mid s, n) = \frac{1/k^s}{\sum_{i=1}^n 1/i^s} =: f_k, \qquad (1)
\]
[figure here]
where n is the total number of elements in the frequency table, k is the rank, and $s \geq 1$ is the value of the exponent characterizing the distribution.
- Since $\log(Nf_k) = \log N - \log\left(\sum_{i=1}^n 1/i^s\right) - s\log k$, where N is the total number of words and $Nf_k$ is the frequency of the k-th most frequent word, we have a straight line with slope $-s$ if plotting the data on a log-log graph, with $x = \log k$ and $y = \log(Nf_k)$.
In the Brown Corpus of American English text, the word "the" is the most frequently occurring word, accounting for nearly 7% of all word occurrences; the second-place word "of" accounts for slightly over 3.5% of words, followed by "and" (about 2.7%). So s is close to 1 in (1).
Popular words account for a large proportion of occurrences, but they are semantically meaningless.
Rare words account for a major proportion of the vocabulary, but they rarely occur in documents.
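As a quick illustration, a log-log regression of frequency on rank recovers the slope $-s$ in (1); the counts below are rough, assumed approximations to the Brown Corpus figures quoted above, used only for illustration.

```r
# Check Zipf's law: regress log frequency on log rank; the slope estimates -s.
word_counts <- c(the = 69971, of = 36411, and = 28852, to = 26149,
                 a = 23237, in_ = 21341, that = 10595, is = 10099)   # assumed counts
word_counts <- sort(word_counts, decreasing = TRUE)
rank <- seq_along(word_counts)

fit <- lm(log(word_counts) ~ log(rank))
coef(fit)["log(rank)"]                           # close to -1 when s is close to 1
plot(log(rank), log(word_counts)); abline(fit)   # roughly linear on the log-log scale
```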

History of Zipf’s Law

George K. Zipf (1902-1950, Harvard, linguist)




Measuring Sparseness: Inverse Document Frequency

In addition to imposing upper and lower limits on term frequency to avoid too common or rare words, the distribution of the term over a corpus is also important.
The sparseness of a term j is commonly measured by inverse document frequency (IDF):
\[
\mathrm{IDF}(j) = 1 + \log\left(\frac{\text{Total number of documents, } n}{\text{Number of documents containing } j\text{, say, } n_j}\right).
\]
This is a decreasing function of $n_j$; when $n_j \to n$, $\mathrm{IDF}(j) \to 1$.
A very popular representation for a document is the product of Term Frequency (TF) and IDF:
\[
\mathrm{TFIDF}(i, j) = \mathrm{TF}(i, j) \times \mathrm{IDF}(j),
\]
where $\mathrm{TF}(i, j) = x_{ij}$ is the frequency of term j in document i.
- TF(i, j) is related only to document i, but IDF(j) is related to all documents.
- TFIDF(i, j) boosts the frequency of a term j by its rarity in the entire corpus, i.e., it will keep words that occur frequently in some documents but do not appear in others.
A common practice is to keep only the words within each document with TFIDF scores above some cutoff.
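The TFIDF weighting above is easy to compute directly from a count matrix; the following is a minimal base-R sketch on a toy DTM (illustrative numbers).

```r
# TFIDF(i, j) = TF(i, j) * IDF(j) on a toy document-term count matrix.
X <- rbind(doc1 = c(food = 2, great = 2, service = 1, slow = 0),
           doc2 = c(food = 1, great = 0, service = 1, slow = 1))

n   <- nrow(X)
n_j <- colSums(X > 0)               # number of documents containing term j
idf <- 1 + log(n / n_j)             # IDF(j) = 1 + log(n / n_j)

tfidf <- sweep(X, 2, idf, "*")      # multiply each column j by IDF(j); TF = raw counts
tfidf
```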

Details Continued

Removing punctuation is fine when analyzing news articles or literature, but may not be in online corpora such as Twitter.
Numbers are usually unimportant for text processing, but whether to discard them is still purpose-dependent, e.g., "911".
Stemming is the process of cutting words to their root, e.g., tax from taxing, taxed, taxes, taxation, or taxable.
In all pruning procedures, we need not be too aggressive (like the Porter (1980) stemmer for English); we can let the ML tools sort out extra vocabulary elements.
- Always balance efficiency and robustness!
The resulting document-term matrix (DTM) is an $n \times p$ sparse numeric matrix X with $x_{ij}$ being the count for word j in document i. [figure here]
Often, divide $x_{ij}$ by $m_i = \sum_{j=1}^p x_{ij} = \|x_i\|_1$ to transform counts to proportions (to eliminate the length difference of two documents), or change $x_{ij}$ to $1(x_{ij} > 0)$ to indicate whether a term appears in the document.
- Which of the three strategies of tokenization (i.e., frequency, proportion, and presence) or TFIDF² is the best depends on the data.
² TF(i, j) in TFIDF can also be normalized to proportions.

Use Word Cloud to Represent DTM




Other Considerations

'A is better than B' and 'B is better than A' are not distinguishable.
So besides unigrams, we can use bigrams (i.e., adjacent pairs) or even m-grams (i.e., combinations of m words) to capture local dependency and order.
- For example, bigrams (with nonzero frequency) in the example above: A.is, is.better, better.than, than.B; B.is, is.better, better.than, than.A.
- But this will dramatically increase p (to $C_m^p = O(p^m)$), and many features become rare, so it is seldom worth the computational effort in practice (m is usually ≤ 3).
- The m-gram counts are sufficient for estimation of an m-order homogeneous Markov model across words (i.e., the word choice depends only on the previous m − 1 words).
Named Entity Extraction: a preprocessing procedure to recognize common named entities like Silicon Valley, R.A. Fisher, World Trade Organization, etc.
- Different from m-grams, this requires expertise.
- Related are the so-called syntactic m-grams, where words are grouped together whenever their meaning depends upon each other, according to a model of language syntax.


[Example] Learning to Rank

In the Jazz Musicians example of DSB, we query for the name of a jazz musician by inputting one sentence from his Wikipedia biography.
After the TFIDF representation of that sentence, we calculate its cosine similarity with each musician's biography in Wikipedia, and then choose the most similar musician as the query result, where the cosine similarity between two vectors x and y is defined as
\[
s_{\cos}(x, y) = \frac{\langle x, y\rangle}{\|x\|\,\|y\|},
\]
with $\langle\cdot,\cdot\rangle$ being the Euclidean inner product and $\|x\| = \sqrt{\langle x, x\rangle}$ being the Euclidean norm of x.
- The name "cosine" similarity comes from the fact that it is the cosine of the angle between x and y.
- Cosine similarity is exactly the correlation-based distance in Lecture 3, and is commonly used in text classification to measure the similarity between documents; another popular distance is the Euclidean distance.
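A minimal base-R sketch of this query-by-cosine-similarity step, with made-up TFIDF vectors standing in for the Wikipedia biographies:

```r
# Rank documents by cosine similarity with a TFIDF query vector (toy numbers).
cos_sim <- function(x, y) sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))

tfidf_docs  <- rbind(armstrong = c(2.1, 0.0, 1.3, 0.7),
                     ellington = c(0.4, 1.8, 0.0, 1.1))
tfidf_query <- c(1.9, 0.1, 1.0, 0.5)          # TFIDF of the query sentence

sims <- apply(tfidf_docs, 1, cos_sim, y = tfidf_query)
sims
names(which.max(sims))                        # most similar biography = query answer
```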


Vector Space Model

Because we represent a document as a vector above, the resulting model is called the vector space model.
Advantages:
- Intuitive and effective in practice.
- Easy to implement and widely studied.
Disadvantages:
- Assumes term independence.
- Somewhat ad hoc term weighting and similarity measure.
- Many tuning parameters.


Refinement I: LSA

Latent Semantic Analysis (LSA): using PCA to factorize token counts (or transformations, e.g., $x_{ij}/m_i$ or $x_{ij}\log n_j$) and achieve dimension reduction.
From the SVD,
\[
X \approx U\Sigma V^T =: A,
\]
where $U_{n\times K}$ refers to documents and $V_{p\times K}$ refers to terms, but both are of lower rank, and $\Sigma$ is a $K\times K$ diagonal matrix.
With LSA, we can compare documents i and i′ by comparing $\Sigma u_i$ and $\Sigma u_{i'}$, where $u_i$ denotes the ith row of U:
\[
s_{\cos}(a_i, a_{i'}) = \frac{u_i^T \Sigma^2 u_{i'}}{\sqrt{u_i^T \Sigma^2 u_i}\,\sqrt{u_{i'}^T \Sigma^2 u_{i'}}} = s_{\cos}(\Sigma u_i, \Sigma u_{i'}),
\]
where $a_i = V\Sigma u_i \in \mathbb{R}^p$ is the ith row of A.
We can also compare terms j and j′ by comparing $\Sigma v_j$ and $\Sigma v_{j'}$, where $v_j$ denotes the jth row of V:
\[
s_{\cos}(a_j, a_{j'}) = \frac{v_j^T \Sigma^2 v_{j'}}{\sqrt{v_j^T \Sigma^2 v_j}\,\sqrt{v_{j'}^T \Sigma^2 v_{j'}}} = s_{\cos}(\Sigma v_j, \Sigma v_{j'}),
\]
where $a_j = U\Sigma v_j \in \mathbb{R}^n$ is the jth column of A.


Continued

Given a query q, first translate it to
\[
\tilde{q} = \Sigma^{-1} V^T q \in \mathbb{R}^K
\]
and compare $\tilde{q}$ to the docs in the low-dimensional space (spanned by the rows of U).
Cluster docs and terms in the low-dimensional space.
Doc classification in the low-dimensional space.
Challenges to LSA:
- Scalability: SVD of a large matrix X.
- Determination of K: around 300–400 dimensions usually provide the best performance; retaining 70% of the variance is also suggested.


Refinement II: Word Embedding

Word embedding, also known as distributed language representation, is a continuous word representation that stores "most" of the information in a dense vector with fixed dimension.
Instead of capturing co-occurrence counts directly, it predicts the surrounding words of every word; in other words, words with similar meaning will colocate with each other and have similar vector representations. [figure here]
Specifically, it attempts to maximize the log probability of any context word given the current center word, which can be solved via a neural network with only one hidden layer.
- We will discuss more on word embedding in Lecture 10, which covers deep learning.
- The next slide uses PCA (instead of the one-hidden-layer neural network) to achieve the embedding, to aid intuition.
Interfaces are available in most software, including R.
- The word2vec package includes the word2vec algorithm by a team led by Tomáš Mikolov at Google [figure here].
- The text2vec package includes the GloVe (Global Vectors for Word Representation) algorithm developed as an open-source project at Stanford.


Figure: A Graphical Example of Word Embedding

In the left panel, {child, man, woman} are grouped together, and {king, queen, prince} are grouped together.
In the right panel, woman + (king − man) is close to queen; in other words, king − man ≈ queen − woman.
Besides distance, we can also measure the similarity between two words by cosine similarity.
Appendix A shows how to visualize word embeddings in the two-dimensional space using the t-SNE algorithm.

History of the Word2vec Model

Tomáš Mikolov (Czech Institute of Informatics)


Some Details

The key preliminary step in both word2vec and GloVe is to settle on a notion of co-occurrence among terms.
For example, let C be a $p \times p$ matrix whose (i, j) entry counts the number of times in your corpus that terms i and j appear within, say, b words of each other within the same sentence; this is known as the skip-gram co-occurrence matrix.
- C is symmetric if the co-occurrence count does not take into account the order of terms i and j.
We approximate
\[
C \approx U V^T
\]
by PCA, where U and V are $p \times K$ matrices with $K \ll p$.
- When C is symmetric, we can set U = V, and $u_i$ is the word embedding of term i.
- U and V can be obtained by SVD, but more advanced algorithms in deep learning are designed to deal with the high amount of sparsity in C.
This approximation implies that
\[
c_{ij} \approx u_i^T v_j = \sum_{k=1}^K u_{ik} v_{jk},
\]
i.e., the inner product of the terms' embeddings, which measures the closeness of the pair in the K-dimensional vector space, describes how likely the pair is to co-occur.
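As a rough sketch of this factorization (not word2vec or GloVe themselves), a symmetric toy co-occurrence matrix can be factorized by a truncated SVD to get K-dimensional term embeddings; splitting $\Sigma$ symmetrically between the two factors is just one common convention.

```r
# Embed terms by factorizing a toy symmetric skip-gram co-occurrence matrix C.
set.seed(1)
C <- matrix(rpois(25, lambda = 2), 5, 5)
C <- C + t(C)                                    # symmetric co-occurrence counts
rownames(C) <- colnames(C) <- c("king", "queen", "man", "woman", "child")

K <- 2
s <- svd(C, nu = K, nv = K)
U <- s$u %*% diag(sqrt(s$d[1:K]))                # rows of U are the term embeddings
rownames(U) <- rownames(C)

cos_sim <- function(x, y) sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
cos_sim(U["king", ], U["queen", ])               # embedding similarity of two terms
```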

Text Regression


[Example] Gentzkow and Shapiro (2010)

Given the DTM, any method learned in previous lectures, e.g., OLS, ridge, lasso, PCR, PLS, etc., can be used to analyze the data.
Gentzkow and Shapiro (2010) study how newspapers target content to the politics of their readership.
They show that the demand-side forces dominate: variation in consumer political attitudes explains roughly 20% of the variation in measured slant; the supply-side forces, e.g., owner ideology, pressure from incumbent politicians and the tastes of reporters and editors, are less important.
The dataset congress109Counts summarizes the first year of the 109th Congressional Record (2005-2006), containing all speech in that year for 529 members of the U.S. House and Senate.
The text is tokenized (through their Pearson $\chi^2$-test statistic) into 500 bigrams and 500 trigrams after stopword removal and stemming (using the Porter stemmer).
So $X \in \mathbb{R}^{529 \times 1000}$, which is a large sparse matrix.
In addition, the dataset congress109Ideology contains the variable repshare, which is the proportion of the two-party vote obtained by George W. Bush in the 2004 presidential election in each member's constituency (district for representatives, states for senators).

PLS

We can use repshare as a surrogate for ideology and partisanship for each congressperson.
repshare is closely related to the phrases that a congressperson would use. For example, a Republican used death.tax and tax.relief more often than a Democrat, and a Democrat used estate.tax and tax.break more often than a Republican.
Thus, we can build an index of slant that sums across a speaker's term usage weighted by the direction of slant for each term.
Specifically, we run PLS of y = repshare on X to build a model that connects language to partisanship.
- In the figure in the next slide, the $R^2$ is 0.17, 0.32, and 0.43 for PLS(1) (called marginal regression), PLS(2), and PLS(3), respectively, so PLS(2) seems necessary, while Gentzkow and Shapiro use PLS(1).
- Gentzkow and Shapiro normalize each row of X to have sum one, so it doesn't matter how much a politician speaks but just which words s/he chooses.
Then, regressing the relative frequency of the 1000 phrases of a newspaper on the PLS coefficients $\hat{\beta}^{\text{pls}}(2)$ from PLS(2), the slope can be used to quantify the political slant of that newspaper.

Continued

Figure: The PLS fits for repshare on relative term usage in the 1000 phrases: points are colored
by political party — grey for Republicans and black for Democrats.


Lasso
We can also run a lasso regression of y on X.
Often we impose a weight $\omega_j = \mathrm{sd}(x_j)$ on the lasso penalty $|\beta_j|$, called rare-feature up-weighting in a similar spirit as TFIDF: rare words are given smaller penalty weights, as they are most useful in differentiating between documents.
The following figure shows how to choose K in PLS and λ in the lasso by CV: the lasso outperforms PLS here.

Figure: The best K for PLS is 1 or 2, and the best λ for the lasso is 0.01483774.


Bias-Corrected AIC and BIC

In high-dimensional problems, we often use a bias-corrected AIC objective to choose λ, called AICc.
Specifically,
\[
\mathrm{AICc}(\lambda) = -2\log\ell\big(\hat{\beta}_{0\lambda}, \hat{\beta}_{1\lambda}, \ldots, \hat{\beta}_{p\lambda}\big) + 2\,\mathrm{df}_\lambda \frac{n}{n - \mathrm{df}_\lambda - 1},
\]
where $\mathrm{df}_\lambda$ can be unbiasedly estimated by the number of nonzero estimated coefficients (including the intercept).
Similarly, the BIC objective is
\[
\mathrm{BIC}(\lambda) = -2\log\ell\big(\hat{\beta}_{0\lambda}, \hat{\beta}_{1\lambda}, \ldots, \hat{\beta}_{p\lambda}\big) + \mathrm{df}_\lambda \log n.
\]
The BIC tends to choose simpler models than CV or AICc.
If variable selection is the primary goal, BIC is suggested for lasso penalty selection, while if predictive performance is the primary goal, AICc is preferred.
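A minimal sketch of applying the AICc rule along a lasso path, assuming we already have the deviance ($-2\log\ell$) and the nonzero-coefficient count at each candidate λ (the path values below are purely hypothetical):

```r
# AICc(lambda) = deviance + 2 * df * n / (n - df - 1); pick the minimizing lambda.
aicc <- function(deviance, df, n) deviance + 2 * df * n / (n - df - 1)

dev_path <- c(520, 430, 410, 405, 404)   # hypothetical -2 log-likelihoods along the path
df_path  <- c(1, 5, 12, 30, 60)          # nonzero coefficients (including the intercept)
n        <- 529

scores <- aicc(dev_path, df_path, n)
scores
which.min(scores)                        # index of the AICc-selected lambda
```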


Topic Models


Topic Models

Many observations are not labeled, so we can first conduct PCA on X to generate factors, and then apply a supervised learning procedure to the labeled observations.
- For a really big sparse DTM, we usually first compute the covariance matrix of X before conducting PCA, i.e., compute the eigenvectors of the covariance matrix of X rather than the SVD of X itself, to avoid running out of memory.
For example, in the dataset we8there of restaurant reviews, we observe a text review and also a multidimensional rating on overall experience, atmosphere, food, service, and value, each rated on a five-point scale.
The review is constructed from a number of underlying topics, e.g., the type of food, the quality of the restaurant, or the geographic location; in other words, we add a topic layer between the document and the ultimate model, like in PCR.
Our target is to uncover these topics via factorization.
LSA was common before the 2000s; in 2003, David Blei, Andrew Ng and Michael I. Jordan introduced another unsupervised text analysis tool – topic modeling, aka Latent Dirichlet Allocation (LDA) – to replace the arbitrary transformations of LSA with a plausible generative model; see Blei (2012) for an introduction.


History of LDA

David Blei (Columbia),³ Andrew Ng (Stanford), Michael I. Jordan (Berkeley)
³ Both Blei and Ng were supervised by Jordan at Berkeley.

LDA

The squared error loss implied by PCA is inappropriate for analysis of sparse word-count data.
Rather, we take the bag-of-words representation seriously and model token counts as realizations from a multinomial distribution, i.e., we treat a topic model as a multinomial factor model.
Specifically, $x_i \in \mathbb{R}^p$ has a multinomial factor distribution:
\[
x_i \sim \mathrm{MN}(\omega_{i1}\theta_1 + \cdots + \omega_{iK}\theta_K, \, m_i),
\]
where $m_i = \sum_{j=1}^p x_{ij}$, $\omega_i = (\omega_{i1}, \ldots, \omega_{iK})^T$ is a probability vector for document i, and $\theta_k \in \mathbb{R}^p$ is a probability vector over the vocabulary of words and represents "topic" k.
- The probability of word j appearing in document i is $p_{ij} := \sum_{k=1}^K \omega_{ik}\theta_{jk}$.
- The term "Dirichlet" comes from the Bayesian modelling, which treats $\omega_i$ and $\theta_k$ as generated from Dirichlet-distributed priors.
LDA can account for latent clustering and dependency within and across documents, and can be used for managing, organizing, and annotating large archives of texts by themes (i.e., topics).
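A minimal generative sketch of this multinomial factor model (toy dimensions; the Dirichlet draws use the standard gamma construction):

```r
# Simulate a tiny corpus from the LDA generative model.
set.seed(1)
n <- 5; p <- 10; K <- 2; m_i <- 50                 # docs, vocabulary size, topics, doc length

rdirichlet <- function(alpha) { g <- rgamma(length(alpha), alpha); g / sum(g) }

theta <- sapply(1:K, function(k) rdirichlet(rep(0.5, p)))    # p x K topic-word probabilities
omega <- t(sapply(1:n, function(i) rdirichlet(rep(1, K))))   # n x K document-topic weights

X <- t(sapply(1:n, function(i) {
  q_i <- as.vector(theta %*% omega[i, ])           # word probabilities for document i
  rmultinom(1, size = m_i, prob = q_i)             # x_i ~ MN(sum_k omega_ik * theta_k, m_i)
}))
dim(X)                                             # an n x p document-term matrix
```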


Comparison with PCA

In PCA,
\[
E[x_i] = z_{i1}\phi_1 + \cdots + z_{iK}\phi_K,
\]
and in LDA,
\[
E[x_i / m_i] = \omega_{i1}\theta_1 + \cdots + \omega_{iK}\theta_K =: \Theta\omega_i,
\]
so the topic score $\omega_{ik}$ is like the PC score $z_{ik}$, and the topic probability $\theta_k$ is like the factor loading $\phi_k$.
The difference is that the multinomial deviance rather than the squared error loss is minimized; also, $\omega_i$ and $\theta_k$ are probability vectors rather than $\|\phi_k\| = 1$ as in PCA.
- LDA models relative usage of words rather than absolute usage as in PCA.
Topic models have been widely adopted since their introduction because they tend to provide easily interpretable factorizations.
- There are also extensions like hierarchical topic models to account for an unknown number of topics, dynamic topic models to allow topics to change slowly in time, correlated topic models to allow $\theta_k$, $k = 1, \ldots, K$, to be correlated, or supervised topic models where some observed attributes are used to supervise the selection of topics.

An Illustrative Example


Continued

Note that the names of topics do not exist a priori and are added a posteriori to
enhance interpretability.


Computation Details

Stochastic Gradient Descent (SGD) algorithms are developed in Hoffman et al. (2013) to solve the optimization problem in LDA:
\[
\max_{\{\omega_i\}_{i=1}^n,\{\theta_k\}_{k=1}^K}\ \prod_{i=1}^n \left[\frac{m_i!}{x_{i1}!\cdots x_{ip}!}\prod_{j=1}^p \big(\omega_{i1}\theta_{j1}+\cdots+\omega_{iK}\theta_{jK}\big)^{x_{ij}}\, P(\omega_i)\right]\prod_{k=1}^K P(\theta_k) \qquad (2)
\]
- These algorithms stream data during estimation rather than loading it into memory as a block; see Lecture 10 for more details.
- For many EM- and MAP-type algorithms, see Taddy (2012).
K can be chosen by CV, or by maximizing the marginal likelihood (an idea embodied in the Bayes factor),
\[
P(D \mid M) = \int P(\Omega, \Theta \mid M)\, P(D \mid \Omega, \Theta)\, d\Omega\, d\Theta,
\]
where D is the data X, M is indexed by K, $(\Omega, \Theta)$ is the parameter in M with $\Omega = (\omega_1, \ldots, \omega_n)$, and $P(\Omega, \Theta \mid M)\, P(D \mid \Omega, \Theta)$ is the posterior distribution in (2); or initialize K at the order of ten and search over the neighborhood to improve interpretability.
- If lasso is employed to fit $y_i$ on $\omega_i$, then we can choose a larger K and let the lasso determine the appropriate K.
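A hedged sketch of fitting LDA in R with the topicmodels package mentioned in the Lab slide (toy simulated counts; the coercion via slam and the accessor functions follow the package documentation):

```r
# Fit a K-topic LDA by VEM on a toy document-term count matrix.
library(topicmodels)
library(slam)

set.seed(1)
X <- matrix(rpois(50 * 20, lambda = 1), nrow = 50)  # 50 documents, 20 terms
colnames(X) <- paste0("w", 1:20)

fit <- LDA(as.simple_triplet_matrix(X), k = 3, method = "VEM")
posterior(fit)$topics[1:3, ]    # estimated omega_i for the first documents
terms(fit, 5)                   # top 5 terms per topic
```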

Interpretation for Topics

Often, the goal of text analysis is to provide an intuitive description of text, rather than inference on some underlying "true" parameters.
We can work from the bottom up or the top down to build a narrative.
Bottom up: look at the "top" words for each topic.
- Checking the words associated with large $\theta_{jk}$'s may not be appropriate, since some words are on top for many topics (especially when only a small set of stop words are pruned).
- Alternatively, use the lift:
\[
\mathrm{lift}_{jk} = \frac{\theta_{jk}}{\bar{q}_j},
\]
where $\bar{q}_j = \frac{1}{n}\sum_{i=1}^n \frac{x_{ij}}{m_i}$ is the average proportion of a document allocated to word j. So the lift is high for words that are much more common in topic k than in general speech.
Top down: plot $\omega_{ik}$ against $y_i$ for the first few k's to search for patterns.
Analogue: PCR vs. PLS.
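A small base-R sketch of the lift calculation for interpreting a fitted topic; both the count matrix and the "fitted" $\theta$ below are toy stand-ins rather than actual estimates:

```r
# lift_jk = theta_jk / qbar_j, where qbar_j is the average share of word j in a document.
set.seed(1)
words <- paste0("w", 1:8)
X     <- matrix(rpois(6 * 8, lambda = 2), nrow = 6, dimnames = list(NULL, words))
theta <- prop.table(matrix(runif(8 * 2), nrow = 8, dimnames = list(words, NULL)), 2)

qbar <- colMeans(X / rowSums(X))        # average proportion of each word across documents
lift <- theta / qbar                    # 8 x 2 matrix of lifts
sort(lift[, 1], decreasing = TRUE)      # words with the highest lift in topic 1
```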


Collaborative Filtering

Consider the Netflix problem in Lecture 5 again.
The watcher and the movies chosen are analogous to a document and the words in the document, respectively, where $x_{ij} \in \{0, 1\}$ rather than a rating between 1 and 5.
The collaborative filtering task is as follows: We train a model on a fully observed set of users. Then, for each unobserved user, we are shown all but one of the movies preferred by that user and are asked to predict what the held-out movie is, i.e., we are predicting a watcher's future choices from their and others' past choices.
- This can be applied to recommend new music to a listener, or new books to a buyer (by Amazon).
We can use topic modelling to recommend movies with high $\mathrm{lift}_{jk}$ to a watcher whose $\omega_{ik}$ is large.
It is more efficient than traditional market basket analysis (MBA) such as the Apriori algorithm (see Appendix B for an introduction), especially when the set of products is large and the target association rules go beyond pairs.


Multinomial Inverse Regression


Multinomial Inverse Regression (MNIR)

Try to understand how text connects to a set of related covariates, e.g., how to connect the restaurant reviews simultaneously to all five aspect ratings, which allows us to determine which content is predictive of the rating on, say, atmosphere separate from food or service.
Given document attributes $y_i$ (author characteristics, date, beliefs, sentiment, etc.), multinomial inverse regression (MNIR) of Taddy (2013) assumes $x_i$ follows a logistic multinomial model:
\[
x_i \sim \mathrm{MN}(q_i, m_i) \quad\text{with}\quad q_{ij} = \frac{\exp(\alpha_j + y_i'\phi_j)}{\sum_{l=1}^p \exp(\alpha_l + y_i'\phi_l)} =: \frac{\exp(\eta_{ij})}{\Lambda_i} \quad\text{for } j = 1, \ldots, p. \qquad (3)
\]
- The "inverse" in MNIR comes from the inverse operation in text regression, where one attribute $y_{ij}$ is regressed on $x_i$, while here $x_i$ is regressed on a few attributes $y_i$.
- This model is similar to topic models but replaces unknown topics with known attributes.
- Note that $x_{ij}$, $j = 1, \ldots, p$, are correlated conditional on $y_i$ – an increase in some $x_{ij}$ implies decreases in other $x_{ij}$'s.⁴
⁴ When $\dim(y_i) = 1$, this is different from Naive Bayes, where we assume $x_{ij} \mid y_i$, $j = 1, \ldots, p$, are independent and follow Poisson or negative binomial distributions.

History of MNIR

Matt Taddy (VP at Amazon; previously Chicago Booth)


An Illustrative Example

Suppose $y_i$ is a scalar and can take only discrete values in $\mathcal{Y}$, e.g., $\mathcal{Y} = \{-1, 0, 1\}$ corresponding to negative, neutral and positive sentiment on a Twitter post ("tweet").
We can collapse token counts as $x_y = \sum_{i: y_i = y} x_i$. If the $x_i$ are independent, then
\[
x_y \sim \mathrm{MN}(q_y, m_y) \quad\text{with}\quad q_{yj} = \frac{\exp(\alpha_j + y\phi_j)}{\sum_{l=1}^p \exp(\alpha_l + y\phi_l)} \quad\text{for } j = 1,\ldots,p \text{ and } y \in \mathcal{Y},
\]
where $m_y = \sum_{i: y_i = y} m_i$.
Although independence of the $x_i$ is surely incorrect, within-individual correlation in $x_i$ is quickly overwhelmed in aggregation, and the multinomial becomes a decent model for $x_y$.


Estimation
Taddy considers the GL scheme in Appendix B of Lecture 4 and suggests using coordinate descent combined with majorization to obtain the MAP estimate.⁵
(s, r) in the Gamma-Lasso can be chosen by CV or set at some pilot value; it seems that the minimization is insensitive to the choice of (s, r).
When p and/or $d := \dim(y_i)$ is large, Taddy (2015) suggests using distributed multinomial regression (DMR) to lighten computational burdens.
- Suppose $x_{ij} \overset{iid}{\sim} \mathrm{Poisson}(\exp(\eta_{ij}))$; then
\[
p(x_i) = \prod_{j=1}^p \mathrm{Poisson}\big(x_{ij} \mid \exp(\eta_{ij})\big) = \mathrm{MN}(x_i \mid q_i, m_i)\,\mathrm{Poisson}(m_i \mid \Lambda_i), \qquad (4)
\]
i.e., whether or not we condition on $m_i$ does not affect the inference on $q_i$.
- Extend $\eta_{ij} = \alpha_j + \mu_i + y_i'\phi_j$ in (3); conditional on $m_i$, the MLE of $\mu_i$ is $\hat{\mu}_i = \log(m_i / \Lambda_i)$.
- If we instead set $\hat{\mu}_i = \log m_i$, then the factorized Poisson likelihood (4) can be analyzed independently across j, with the negative log likelihood proportional to
\[
-\log\ell(\alpha_j, \phi_j) = \sum_{i=1}^n \big[m_i \exp(\alpha_j + y_i'\phi_j) - x_{ij}(\alpha_j + y_i'\phi_j)\big],
\]
so that different $(\alpha_j, \phi_j)$'s can be estimated on different computers.
⁵ He sets $\alpha_j \sim N(0, 1)$ to avoid specifying a null category, and also adds a random effect $u_{ij}$ to $\eta_{ij}$ in (3) to take account of omitted attributes, and specifies $e^{u_{ij}} \sim \mathrm{Ga}(1, 1)$.

Prediction

To predict future document attributes, we need to invert MNIR to obtain $y_i$ from $x_i$.
When d > 1, this inversion by applying Bayes' formula is hard.
Taddy (2013) uses inverse regression ideas by deriving a sufficient reduction (SR) projection from the fitted model, where the token count projection is $\Phi x_i$ with $\Phi = [\phi_1, \ldots, \phi_p]$, and $z_i := \Phi x_i$ is a sufficient statistic for $y_i$ in the sense that $y_i \perp x_i \mid z_i$.
- Since $\dim(z_i) = d$, presumably small, dimension reduction is achieved by using $z_i$ instead of $x_i$.
For example, to predict the continuous attribute of a new document with token count $w_i$, we can run a forward regression of $y_i$ on $\hat{z}_i := \hat{\Phi} x_i$ to obtain coefficients $\hat{\Gamma}$, and predict the new attribute as $\hat{\Gamma}^T(\hat{\Phi} w_i)$, where we often normalize $\hat{z}_i$ by the length of reviews, $\hat{z}_i / m_i$.
If the attribute is ordered and discrete like in the tweet example, then the forward regression applies a logistic proportional odds model of the form
\[
P(y_i < c) = \big(1 + \exp[-(\gamma_c + \beta z_i)]\big)^{-1}.
\]
In spite of possible misapplication of population-average summary projections, the ad hoc forward regression is expected to compensate for this misspecification.
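A toy base-R sketch of the SR projection and forward regression; Phi_hat and the data are simulated stand-ins, not the output of an actual MNIR fit:

```r
# Sufficient-reduction score z_i = Phi x_i (normalized by m_i), then forward regression.
set.seed(1)
n <- 100; p <- 30
Phi_hat <- rnorm(p, sd = 0.2)                          # pretend fitted loadings (d = 1)
X       <- matrix(rpois(n * p, lambda = 1), nrow = n)  # token counts
m       <- rowSums(X)

z_hat <- as.vector(X %*% Phi_hat) / m                  # normalized SR scores
y     <- 2 + 3 * z_hat + rnorm(n, sd = 0.1)            # toy continuous attribute

fwd   <- lm(y ~ z_hat)                                 # forward regression of y on z
w_new <- rpois(p, lambda = 1)                          # a new document's token counts
z_new <- sum(Phi_hat * w_new) / sum(w_new)
predict(fwd, newdata = data.frame(z_hat = z_new))      # predicted attribute
```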

[Example] Gentzkow and Shapiro (2010)

Taddy (2013) specifies $y_i = (\text{gop}, \widetilde{\text{repshare}})^T$ to run MNIR, and the fitted count against the observed count is shown in the upper-right panel of the figure in the next slide, where gop is an indicator for "Grand Old Party" (i.e., the Republican Party), and $\widetilde{\text{repshare}}$ is the residual from regressing repshare on gop.
- Note that $\phi_{jk}$, $k = 1, \ldots, d$, are partial effects (i.e., depending on other attributes); for each k, we can list the top loadings by the size of their $|\phi_{jk}|$ as in the lower panel of the figure in the next slide.
- $z_{ik} = x_i'\phi_k = \sum_{j=1}^p x_{ij}\phi_{jk}$ contains all of the information in $x_i$ that is relevant to $y_{ik}$, and is especially useful for sorting documents according to the various attributes.
He then regresses repshare on $(z_{\text{gop}}, z_{\text{votediff}}, z_{\text{gop}} z_{\text{votediff}})$ to obtain
\[
E[\text{repshare}] = 51.9 + 6.2\, z_{\text{gop}} + 5.2\, z_{\text{votediff}} - 1.9\, z_{\text{gop}} z_{\text{votediff}},
\]
where $z_{\text{gop}}$ and $z_{\text{votediff}}$ are the normalized SR scores, and the interaction term is included to reduce misspecification bias.
- The original repshare against the fitted repshare is shown in the upper-left panel of the figure in the next slide (red for Republicans and blue for Democrats).

Results


Discussions on Text Regression, Topic Models, and MNIR

In text regression, $\dim(y_i) = 1$; in topic models, $\dim(y_i) = 0$; and in MNIR, $\dim(y_i) > 1$.
When p is large relative to n, MNIR performs better than text regression, but these gains diminish with n.
It is usually unwise to attempt to learn more flexible (than linear) functional forms unless n is much larger than p.
When $\dim(y_i) > 1$, MNIR, especially DMR, can be used to resolve or control for interdependencies between these attributes and their effects on language. Speed and scalability are the main advantages of MNIR over other methods.
- Caution: Do not naively use only the text data for prediction; instead, use the text as a complement to existing methods – use it to predict residuals or to automate analysis that is expensive and time-consuming.
It is not advisable to overinterpret topics. To improve interpretability for topic models, it is best to add some supervision (i.e., to incorporate external information on the topics for some set of cases).
- Generally, when $y_i$ is available, we can supervise the topic selection using $y_i$.
Often, we have a large corpus of unlabeled documents and a smaller set of labeled ones. We can first fit a topic model on the larger corpus, and then use these topics as well as the token counts for text regression or MNIR on the smaller labeled corpus.

Lab: Text As Data

Data: congress109 and we8there in textir.
PLS: try the pls function of the textir package rather than the plsr function of the pls package in Lecture 5.
Lasso: try gamlr for the Gamma-Lasso rather than glmnet in Lecture 4, and apply AICc.
Topic Models: topicmodels for the variational EM (VEM) algorithm of Blei, Ng and Jordan (2003), and maptpx for the block-relaxation MAP algorithm of Taddy (2012).
- VEM augments the data by the topic membership of individual terms $z_{ij}$, and treats the hyperparameter α in the prior $P(\omega_i \mid \alpha)$ and Θ as fixed, and $z_{ij}$, $x_{ij}$ and $\omega_{ik}$ as random.
- Block-relaxation MAP augments the data by the topic totals $T_i = (t_{i1}, \ldots, t_{iK}) \sim \mathrm{MN}(\omega_i, m_i)$ such that each document is expanded as
\[
x_i \sim \mathrm{MN}(\theta_1, t_{i1}) + \cdots + \mathrm{MN}(\theta_K, t_{iK}),
\]
and maximizes (2) over Ω and Θ jointly.
- For the SGD algorithm of Hoffman et al. (2013), see the gensim library in Python.
Multinomial Inverse Regression: textir.
Apriori Algorithm: arules.

Appendix A: t-SNE


History of t-SNE

Geoffrey Hinton (1947-, Google and UToronto) and Sam Roweis (1972-2010, NYU)⁶
Stochastic Neighbor Embedding (SNE) was originally developed by Sam Roweis and Geoffrey Hinton in 2002, and Laurens van der Maaten proposed the t-distributed variant in 2008.
⁶ He jumped over the 16th-floor balcony in an argument with his wife over caring for preemie twins.

Basic Ideas

t-SNE is a nonlinear dimensionality reduction technique well-suited for embedding high-dimensional data for visualization in a low-dimensional space of two or three dimensions.
- It can be treated as a nonlinear extension of Fisher's discriminant plot in Lecture 8.
Specifically, it models each high-dimensional object by a two- or three-dimensional point in such a way that similar objects are modeled by nearby points and dissimilar objects are modeled by distant points with high probability.
The t-SNE algorithm comprises two main stages.
- First, t-SNE constructs a probability distribution over pairs of high-dimensional objects in such a way that similar objects are assigned a higher probability while dissimilar points are assigned a lower probability.
- Second, t-SNE defines a similar probability distribution over the points in the low-dimensional map, and it minimizes the Kullback–Leibler divergence (KL divergence) between the two distributions with respect to the locations of the points in the map.


Stage 1

Given a set of n high-dimensional objects $x_1, \ldots, x_n$, t-SNE first computes probabilities $p_{ij}$ that are proportional to the similarity of objects $x_i$ and $x_j$ as follows: for $i \neq j$, define
\[
p_{j\mid i} = \frac{\exp\big(-\|x_i - x_j\|^2 / 2\sigma_i^2\big)}{\sum_{k \neq i} \exp\big(-\|x_i - x_k\|^2 / 2\sigma_i^2\big)}
\]
and set $p_{i\mid i} = 0$.
- The similarity of datapoint $x_j$ to datapoint $x_i$ is the conditional probability, $p_{j\mid i}$, that $x_i$ would pick $x_j$ as its neighbor if neighbors were picked in proportion to their probability density under a Gaussian centered at $x_i$.
Now define
\[
p_{ij} = \frac{p_{j\mid i} + p_{i\mid j}}{2n}.
\]
- This is motivated because $p_i$ is estimated as 1/n, so the conditional probability can be written as $p_{j\mid i} = np_{ij}$ and $p_{i\mid j} = np_{ji}$, while $p_{ij} = p_{ji}$.

Continued

The bandwidth of the Gaussian kernels $\sigma_i$ is set in such a way that the entropy of the conditional distribution equals a predefined entropy, using the bisection method.
Specifically, any particular value of $\sigma_i$ induces a probability distribution, $P_i$, over all of the other datapoints. The perplexity of $P_i$, defined as
\[
\mathrm{Perp}(P_i) = 2^{H(P_i)},
\]
where $H(P_i) = -\sum_j p_{j\mid i}\log_2 p_{j\mid i}$ is the Shannon entropy of $P_i$ measured in bits [see Lecture 8 for more discussion on entropy], can be interpreted as a smooth measure of the effective number of neighbors.
As a result, the bandwidth is adapted to the density of the data: smaller values of $\sigma_i$ are used in denser parts of the data space.
The performance of SNE is fairly robust to changes in the perplexity, and typical values are between 5 and 50.
(**) Since the Gaussian kernel uses the Euclidean distance $\|x_i - x_j\|$, it is affected by the curse of dimensionality, and in high-dimensional data when distances lose the ability to discriminate, the $p_{ij}$ become too similar (asymptotically, they would converge to a constant). It has been proposed to adjust the distances with a power transform, based on the intrinsic dimension of each point, to alleviate this.
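The perplexity defined above is easy to compute for any row of a conditional-probability matrix; a tiny self-contained sketch:

```r
# Perp(P_i) = 2^H(P_i), with the Shannon entropy H measured in bits.
row_perplexity <- function(p) {
  p <- p[p > 0]                          # treat 0 * log2(0) as 0
  2^(-sum(p * log2(p)))
}

P_cond <- rbind(c(0.00, 0.50, 0.30, 0.20),   # toy conditional rows, each summing to one
                c(0.25, 0.00, 0.25, 0.50))
apply(P_cond, 1, row_perplexity)             # the "effective number of neighbors" per point
```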

Stage 2

t-SNE aims to learn a d-dimensional map $y_1, \ldots, y_n$ (with $y_i \in \mathbb{R}^d$ and d typically chosen as 2 or 3) that reflects the similarities $p_{ij}$ as well as possible.
To this end, it measures similarities $q_{ij}$ between two points similarly: for $i \neq j$, define
\[
q_{ij} = \frac{\big(1 + \|y_i - y_j\|^2\big)^{-1}}{\sum_k \sum_{l \neq k} \big(1 + \|y_k - y_l\|^2\big)^{-1}}
\]
and set $q_{ii} = 0$.
- A heavy-tailed Student t-distribution (with one degree of freedom, which is the same as a Cauchy distribution) is used to measure similarities between low-dimensional points in order to allow dissimilar objects to be modeled far apart in the map.
The locations of the points $y_i$ in the map are determined by minimizing the (non-symmetric) Kullback–Leibler divergence of the distribution P from the distribution Q, that is,
\[
\mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log\frac{p_{ij}}{q_{ij}}.
\]
The minimization of the Kullback–Leibler divergence with respect to the points $y_i$ is performed using gradient descent [see Lecture 10 for more discussion on gradient descent].
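A base-R sketch of the Stage-2 quantities for one candidate map Y (toy P, no gradient-descent loop; in practice Y is updated iteratively to reduce this KL objective):

```r
# Low-dimensional similarities q_{ij} and the KL(P || Q) objective.
set.seed(1)
n <- 10
P <- matrix(runif(n * n), n, n); diag(P) <- 0
P <- (P + t(P)); P <- P / sum(P)             # toy symmetric p_{ij} summing to one
Y <- matrix(rnorm(n * 2), n, 2)              # a candidate 2-D map

D2y <- as.matrix(dist(Y))^2
Q <- 1 / (1 + D2y); diag(Q) <- 0
Q <- Q / sum(Q)                              # Student-t (Cauchy) kernel similarities

kl <- sum(P[P > 0] * log(P[P > 0] / Q[P > 0]))   # objective minimized over Y
kl
```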

Appendix B: Association Rules


(Section 14.2 of ESL)


Goals

Goal: find the mode of $X = (X_1, \ldots, X_p)$, i.e., $\arg\max_{x\in\mathcal{X}} \Pr(X = x)$, where $\mathcal{X} = \prod_{j=1}^p \mathcal{S}_j$ is the support of X, and $\mathcal{S}_j$ is the support of $X_j$.
Often $X_j \in \{0, 1\}$, representing whether the jth item was purchased, so this is referred to as market basket analysis.
- A "mode" with many $X_j$'s equal to 1 means those items are frequently purchased together, so modes are useful for stocking shelves, cross-marketing in sales promotions, catalog design, and consumer segmentation based on buying patterns.
If p is relatively large (e.g., $p \approx 10^4$ and $n \approx 10^8$ in practice), and $\mathcal{S}_j$ includes more than a small number of values,⁷ then $\mathcal{X}$ includes $\prod_{j=1}^p |\mathcal{S}_j|$ values, which is huge, implying $\Pr(X = x)$ is too small to be reliably estimated for any $x \in \mathcal{X}$ – another manifestation of the curse of dimensionality, where $|\mathcal{S}_j|$ is the cardinality of $\mathcal{S}_j$.
Modified Goal: seek regions of $\mathcal{X}$ with high probability relative to their size.
- Mathematically, we try to find $s_j \subseteq \mathcal{S}_j$ such that
\[
\Pr\left(\bigcap_{j=1}^p \{X_j \in s_j\}\right) \qquad (5)
\]
is relatively large.
- The intersection of subsets $\bigcap_{j=1}^p \{X_j \in s_j\}$ is called a conjunctive rule. [What does a non-conjunctive rule look like?]
⁷ If $X_j$ is continuous, $|\mathcal{S}_j| = \infty$.

Market Basket Analysis


Even for the modified goal, solving (5) is not feasible, so we simplify the rules further.
Either $s_j$ consists of a single value of $X_j$, $s_j = v_{0j}$, or $s_j = \mathcal{S}_j$ (then $X_j$ is said not to appear in the rule).
Now, (5) is reduced to finding a subset of the integers $\mathcal{J} \subseteq \{1, \ldots, p\}$, and values $v_{0j}$, $j \in \mathcal{J}$, such that
\[
\Pr\left(\bigcap_{j\in\mathcal{J}} \{X_j = v_{0j}\}\right) \qquad (6)
\]
is large. [see Figure 14.1 of ESL]

Continued

Create dummy variables $Z_k = I(X_j = v_{lj})$ for some $j \in \{1, \ldots, p\}$ and some $l \in \{1, \ldots, |\mathcal{S}_j|\}$; then (6) can be rewritten as
\[
\Pr\left(\bigcap_{k\in\mathcal{K}} \{Z_k = 1\}\right) = \Pr\left(\prod_{k\in\mathcal{K}} Z_k = 1\right), \qquad (7)
\]
where $\mathcal{K} \subseteq \{1, \ldots, K\}$ with $K = \sum_{j=1}^p |\mathcal{S}_j|$.
- The set $\mathcal{K}$ is called an item set, and $|\mathcal{K}|$ is called its size, which is no bigger than p. [How many possible item sets? $< 2^K$]
The probability in (7) can be estimated by
\[
\widehat{\Pr}\left(\prod_{k\in\mathcal{K}} Z_k = 1\right) = \frac{1}{n}\sum_{i=1}^n \prod_{k\in\mathcal{K}} z_{ik}, \qquad (8)
\]
which is called the support or prevalence $T(\mathcal{K})$ of the item set $\mathcal{K}$. An observation i for which $\prod_{k\in\mathcal{K}} z_{ik} = 1$ is said to contain the item set $\mathcal{K}$.
- Often, we search over only $\mathcal{K}$'s such that $T(\mathcal{K}) > t$ for some threshold t.

The Apriori Algorithm

Principle: for a given threshold t,
I. $|\{\mathcal{K} \mid T(\mathcal{K}) > t\}|$ is relatively small.
II. $\mathcal{L} \subseteq \mathcal{K} \implies T(\mathcal{L}) \geq T(\mathcal{K})$. [Why? $z_{ik} \in \{0, 1\}$ for any $k \in \{1, \ldots, K\}$]
Pass 1: set $|\mathcal{K}| = 1$, and search for (single-item) sets $\mathcal{K}$ such that $T(\mathcal{K}) > t$.
Pass 2: set $|\mathcal{K}| = 2$, and add one more item to the surviving $\mathcal{K}$'s in Pass 1 such that $T(\mathcal{K}) > t$ for the new $\mathcal{K}$'s.
Pass m with m > 2: set $|\mathcal{K}| = m$, and consider only candidates such that all of their m ($= C_m^{m-1}$) ancestral item sets of size m − 1 are frequent [the list of such frequent item sets of size m − 1 is obtained from Pass m − 1].
- In forming the candidates, we can add one item that is retained from Pass 1 [Why? Principle II] to the item sets that survived the previous pass (i.e., of size m − 1).
Termination: all candidate rules from the previous pass have support less than t.
Output: the set $\{\mathcal{K} \mid T(\mathcal{K}) > t\}$, which is the union of the rules that survive any pass.
The Apriori algorithm requires only one pass over the data for each value of $|\mathcal{K}|$, which is crucial since we assume the data size is too big to fit into a computer's main memory.
If the data is sparse enough or t is high enough, the algorithm will terminate in reasonable time even if n is large.
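A hedged sketch of running the algorithm with the arules package listed on the Lab slide (toy transactions; the thresholds play the roles of t and c, and argument names follow the package documentation):

```r
# Mine association rules with arules::apriori on a toy list of baskets.
library(arules)

baskets <- list(c("bread", "jelly", "peanut_butter"),
                c("bread", "milk"),
                c("bread", "jelly", "peanut_butter", "milk"),
                c("jelly", "peanut_butter"))
trans <- as(baskets, "transactions")

rules <- apriori(trans, parameter = list(supp = 0.4, conf = 0.7))
inspect(rules)                         # each rule with its support, confidence, and lift
```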

History of the Apriori Algorithm

Rakesh Agrawal (Microsoft) and Ramakrishnan Srikant (Google)⁸
⁸ The former is one of the two PhD supervisors of the latter; the Apriori algorithm is one chapter of the dissertation of the latter; both graduated from the University of Wisconsin-Madison.

Association Rules

Each $\mathcal{K}$ in the output of the Apriori algorithm can be partitioned into two disjoint subsets, $A \cup B = \mathcal{K}$, and written as an association rule $A \Rightarrow B$.
- A is called the antecedent, and B the consequent.
The support of the rule, $T(A \Rightarrow B)$, is $\widehat{\Pr}(\prod_{k\in\mathcal{K}} Z_k = 1) =: \widehat{\Pr}(A \cup B)$ – the probability of simultaneously observing both A and B.
The confidence or predictability $C(A \Rightarrow B)$ of the rule is
\[
C(A \Rightarrow B) = \frac{T(A \Rightarrow B)}{T(A)} =: \widehat{\Pr}(B \mid A).
\]
The expected confidence is $T(B) =: \widehat{\Pr}(B)$.
The lift is
\[
L(A \Rightarrow B) = \frac{C(A \Rightarrow B)}{T(B)} =: \frac{\widehat{\Pr}(A \cup B)}{\widehat{\Pr}(A)\,\widehat{\Pr}(B)}.
\]
Output of the entire analysis: the association rules that satisfy the constraints
\[
T(A \Rightarrow B) > t \quad\text{and}\quad C(A \Rightarrow B) > c \qquad (9)
\]
for some thresholds t and c.



An Example

Suppose $\mathcal{K}$ = {peanut butter, jelly, bread} and consider the rule {peanut butter, jelly} ⇒ {bread}.
If T({peanut butter, jelly} ⇒ {bread}) = 3%, then the three appear together in 3% of the market baskets.
If C({peanut butter, jelly} ⇒ {bread}) = 82%, then when peanut butter and jelly were purchased, 82% of the time bread was also purchased.
If bread appeared in 43% of all market baskets, then the rule {peanut butter, jelly} ⇒ {bread} would have a lift of 82%/43% ≈ 1.91 > 1.
Output like (9) is stored in a database that can be queried by the user.
A usual query requires particular items to be in the consequent, e.g.,
Display all transactions in which ice skates are the consequent that have confidence over 80% and support of more than 2%.
Such queries provide information on the antecedents that predict sales of ice skates.
Focusing on a particular consequent casts the problem into the framework of supervised learning.

Drawbacks of the Apriori Algorithm

The Apriori algorithm can be applied to huge databases, so it represents one of the major advances in data mining technology.
Although association rules are among data mining's biggest successes, they do suffer from some drawbacks.
- $X_j$ can only be discrete or must be transformed into discrete values.
- The number of solution item sets, their size, and the number of passes required over the data can grow exponentially with decreasing t, which implies that rules with high confidence or lift, but low support, will not be discovered; e.g., a high-confidence rule such as vodka ⇒ caviar will not be discovered due to the low sales volume of the consequent caviar.


Unsupervised as Supervised Learning

Let $g(x)$ be the unknown density to be estimated, and $g_0(x)$ be a reference density, e.g., the uniform density on $\mathcal{X}$.
Assume the data size is n, and we simulate $n_0$ ($\approx n$) data points from $g_0(x)$.
Pooling these two data sets, and assigning mass $w = n_0/(n + n_0)$ to those from $g(x)$ (i.e., the original data set) and $w_0 = n/(n + n_0)$ to those drawn from $g_0(x)$, results in a random sample drawn from the mixture density
\[
\frac{w\, n\, g(x) + w_0\, n_0\, g_0(x)}{wn + w_0 n_0} = \big(g(x) + g_0(x)\big)/2.
\]
Assign Y = 1 to the samples from $g(x)$ and Y = 0 to those from $g_0(x)$; then
\[
\mu(x) = E[Y \mid x] = \frac{g(x)}{g(x) + g_0(x)} = \frac{g(x)/g_0(x)}{1 + g(x)/g_0(x)}
\]
can be estimated by supervised learning using the combined sample $(y_1, x_1), \ldots, (y_{n+n_0}, x_{n+n_0})$.
Given $\hat{\mu}(x)$,
\[
\hat{g}(x) = g_0(x)\,\frac{\hat{\mu}(x)}{1 - \hat{\mu}(x)}.
\]
For example, if we use logistic regression (with weights w on $y_i = 1$ and $w_0$ on $y_i = 0$) to estimate $f(x) = \log\frac{g(x)}{g_0(x)}$, then $\hat{g}(x) = g_0(x)\, e^{\hat{f}(x)}$.
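A toy one-dimensional base-R sketch of this trick, with a uniform reference density and an unweighted logistic fit (taking $n_0 = n$ so the weights w and $w_0$ coincide):

```r
# Estimate a density g by contrasting the data with a uniform reference sample.
set.seed(1)
n  <- 2000
x  <- rnorm(n, mean = 0.5, sd = 0.15)       # draws from the "unknown" g (mostly in [0,1])
x0 <- runif(n)                              # reference draws from g0 = Uniform(0, 1)

dat <- data.frame(y = rep(c(1, 0), each = n), x = c(x, x0))
fit <- glm(y ~ x + I(x^2), family = binomial, data = dat)   # mu(x) via logistic regression

g_hat <- function(xnew) {                   # g_hat = g0 * mu / (1 - mu), with g0 = 1 on [0,1]
  mu <- predict(fit, newdata = data.frame(x = xnew), type = "response")
  mu / (1 - mu)
}
g_hat(c(0.5, 0.9))                          # compare with dnorm(c(0.5, 0.9), 0.5, 0.15)
```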


Choice of g0

The accuracy of $\hat{g}(x)$ can greatly depend on the choice of $g_0$; a good choice of $g_0$ should make it easy to approximate the resulting $\mu(x)$ or $f(x)$ by the method being used.
On the other hand, $\mu(x)$ and $f(x)$ are functions of the "contrast" $g/g_0$, so they measure the departure of g from $g_0$.
As a result, the choice of $g_0$ is dictated by the types of departures that are deemed most interesting.
- If departures from uniformity are of interest, then $g_0$ might be the uniform density over the range of the variables.
- If departures from normality are of interest, then $g_0$ could be a Gaussian distribution with the same mean vector and covariance matrix as the data.
- If departures from independence are of interest, then
\[
g_0(x) = \prod_{j=1}^p g_j(x_j), \qquad (10)
\]
where $g_j(x_j)$ is the marginal data density of $X_j$.
This method of converting an unsupervised learning problem into a supervised one is not as popular as expected; the reason may be the increased computational and storage resources – simulating the extra $n_0$ data points from $g_0$.

Generalized Association Rules

For the general problem (5), we can apply the supervised learning approach, although such an approach cannot be applied to as huge a database as in the market basket analysis above.
Problem (5) tries to find $\mathcal{J} \subseteq \{1, 2, \ldots, p\}$ and subsets $s_j \subseteq \mathcal{S}_j$, $j \in \mathcal{J}$, such that
\[
\widehat{\Pr}\left(\bigcap_{j\in\mathcal{J}} \{X_j \in s_j\}\right) = \frac{1}{n}\sum_{i=1}^n I\left(\bigcap_{j\in\mathcal{J}} \{x_{ij} \in s_j\}\right) \qquad (11)
\]
is large, where $\{X_j \in s_j\}_{j\in\mathcal{J}}$ is called a "generalized" item set.
- The $s_j$ corresponding to quantitative variables are taken to be contiguous intervals in $\mathcal{S}_j$, and those for categorical variables can involve more than a single value.
For such a general search problem, one can only apply heuristic search methods to find a useful collection of such generalized item sets.
Since both (8) and (11) care only about frequency, the implicit reference distribution is the uniform distribution.
This favors the discovery of item sets whose marginal probability $\frac{1}{n}\sum_{i=1}^n I(x_{ij}\in s_j)$ is large.
So we can concentrate on conjunctions of such frequent subsets.


Better Choice of g0

Reference to the uniform distribution can cause highly frequent item sets with low associations among their constituents to dominate the collection of highest-support item sets and neglect high-association but low-frequency item sets such as (vodka, caviar).
To avoid this problem, we can choose (10) as our reference distribution, where the simulated sample from (10) can be easily generated from the data itself by applying a different random permutation to the data values of each $X_j$.
- $g(x)/g_0(x)$ would be uniform if there are no associations among the $X_j$'s, regardless of the magnitude of the marginal densities $g_j(x_j)$.
- It is not clear how to incorporate reference distributions other than the uniform into the Apriori algorithm, so a supervised learning method would be applied.
The goal is to find regions
\[
R = \bigcap_{j\in\mathcal{J}} \{X_j \in s_j\} \qquad (12)
\]
for which the target function $\mu(x) = E[Y \mid x]$ is relatively large, and in addition, the data support
\[
T(R) = \int_{x\in R} g(x)\, dx
\]
is not too small.

Choice of Supervised Learning Method

To learn conjunctive rules as in (12), we can apply a CART decision tree [see Lecture 8] whose terminal nodes take exactly the form of (12).
Those terminal nodes t with high average y-values
\[
\bar{y}_t = \mathrm{ave}(y_i \mid x_i \in t)
\]
are candidates for high-support generalized item sets.
- The actual (data) support is given by
\[
T(R) = \bar{y}_t \,\frac{N_t}{N + N_0},
\]
where $N_t$ is the number of (pooled) data points in the terminal node t.
These can then be partitioned into antecedents and consequents in a search for generalized association rules of high confidence and/or lift.
Other methods such as the patient rule induction method (PRIM) can also be applied.
Comparison with the Apriori algorithm: (i) The Apriori algorithm is exhaustive – it finds all rules with support greater than a threshold, while CART or PRIM is a greedy algorithm, so it may not find the "optimal" set of rules. (ii) The Apriori algorithm can deal only with dummy variables and cannot find a rule where $s_j$ is a subset of $\mathcal{S}_j$ containing more than one value.
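A hedged sketch of this CART-based search with rpart (toy data; the permuted reference sample implements the product-of-marginals $g_0$ in (10), and every threshold here is illustrative):

```r
# Find terminal nodes with high average y on the pooled (real vs. permuted) sample.
library(rpart)
set.seed(1)
n <- 1000
real <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
real$x2 <- real$x2 + real$x1                              # build in an association
ref  <- data.frame(x1 = sample(real$x1), x2 = sample(real$x2))  # permuted: independent marginals

dat <- rbind(cbind(real, y = 1), cbind(ref, y = 0))
fit <- rpart(y ~ x1 + x2, data = dat, method = "anova")

leaves <- fit$frame[fit$frame$var == "<leaf>", c("n", "yval")]
leaves[order(-leaves$yval), ]   # high-mean leaves are candidate generalized item sets
```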