Lecture 6: Text As Data
Text As Data
(Chapter 10 of DSB and Chapter 8 of BDS)
Ping Yu
“Text mining, also referred to as text data mining, roughly equivalent to text analytics, refers to the process of deriving high-quality information from text." - Wikipedia
“Another way to view text data mining is as a process of exploratory data analysis that leads to heretofore unknown information, or to answers for questions for which the answer is not currently known." - Marti A. Hearst (UC-Berkeley), 1999
For new forms of data, we must either engineer the data representation to match
the tools, or build new tools to match the data. The former is generally simpler.
We will do both: apply the techniques we learned in previous lectures to text data,
and also develop new techniques specific to text data.
- In the former category, we can use the regression techniques from Lectures 2, 4
and 5 or from future Lectures 7-10.
- In the latter category, we will discuss one unsupervised generative model (LDA) and one supervised generative model (MNIR).
- Recall that regression means modelling from x to y and generative means modelling from y to x.
Continued
Gentzkow, Kelly and Taddy, 2019, Text as Data, Journal of Economic Literature,
57(3), 535-574.
Text Representation
Tokenization
Meaning is built from ordered sequences of words that refer to each other, but often
scattered across sentences or paragraphs.
However, text analyses often ignore this complexity and rely solely on counts for language tokens: words, phrases, or other basic elements (separated by whitespace).
- See Lecture 10 for the extensions in deep learning.
Much of the information in text is contained in these simple counts, and token-based learning is fast and effective, especially when n_train is not big.
Bag-of-words representation of a text just counts the frequency of some key terms
in the text.
- Treat documents as if they were constructed by randomly drawing words with replacement from a bucket containing the fixed vocabulary, and throw away all more complex information (e.g., grammar, word order, sentence structure, and (usually) punctuation), but simplicity is its advantage.
- This approach treats every word in a document as an independent potential keyword (feature) of the document.
More Details
Zipf’s law [figure here] states that the frequency of a word is inversely proportional to its rank in the frequency table:

f(k|s, n) = (1/k^s) / (∑_{i=1}^n 1/i^s) =: f_k, [figure here] (1)

where n is the total number of elements in the frequency table, k is the rank, and s ≥ 1 is the value of the exponent characterizing the distribution.
- Since log(N f_k) = log N - log(∑_{i=1}^n 1/i^s) - s log k, where N is the total number of words and N f_k is the frequency of the k-th most frequent word, we have a straight line with slope -s when plotting the data on a log-log graph, with x = log k and y = log(N f_k).
In the Brown Corpus of American English text, the word “the" is the most frequently
occurring word, accounting for nearly 7% of all word occurrences; the second-place
word “of" accounts for slightly over 3.5% of words, followed by “and" (about 2.7%).
So s is close to 1 in (1).
Popular words account for a large proportion of occurrences, but they are semantically meaningless.
Rare words account for a major proportion of the vocabulary, but they rarely occur in documents.
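As a rough illustration (not from the lecture; the toy documents below are made up), the following Python sketch counts word frequencies and estimates s from the slope of the log-log plot of frequency against rank.

```python
# Minimal sketch: check Zipf's law on a toy corpus (hypothetical documents).
import re
import numpy as np
from collections import Counter

docs = ["the cat sat on the mat", "the dog ate the bone", "a cat and a dog and a bird"]
tokens = [w for d in docs for w in re.findall(r"[a-z']+", d.lower())]

counts = Counter(tokens)
freqs = np.array(sorted(counts.values(), reverse=True), dtype=float)  # N * f_k
ranks = np.arange(1, len(freqs) + 1)                                  # k

# log(N f_k) = const - s log k, so a line fit on the log-log scale estimates -s.
slope, intercept = np.polyfit(np.log(ranks), np.log(freqs), 1)
print("estimated s:", -slope)
```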
Text Representation
In addition to imposing upper and lower limits on term frequency to avoid too common or too rare words, the distribution of a term over the corpus is also important.
The sparseness of a term j is commonly measured by its inverse document frequency (IDF):

IDF(j) = 1 + log(n / n_j),

where n is the total number of documents and n_j is the number of documents containing term j.
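A minimal sketch of this IDF formula on a hypothetical count matrix (documents in rows, terms in columns); multiplying raw counts by IDF, as done in the last line, is just one common TFIDF variant.

```python
# Minimal sketch: IDF(j) = 1 + log(n / n_j) on a made-up document-term count matrix.
import numpy as np

X = np.array([[2, 0, 1],
              [0, 1, 1],
              [3, 0, 1]])        # n = 3 documents, p = 3 terms
n = X.shape[0]
n_j = (X > 0).sum(axis=0)        # number of documents containing term j
idf = 1 + np.log(n / n_j)        # IDF(j)
tfidf = X * idf                  # one common TFIDF variant: count times IDF
print(idf, tfidf, sep="\n")
```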
Details Continued
Removing punctuation is fine when analyzing news articles or literature, but may not be in online corpora such as Twitter.
Numbers are usually unimportant for text processing, but whether to discard them is still purpose-dependent, e.g., "911".
Stemming is the process of cutting words to their root, e.g., tax from taxing, taxed,
taxes, taxation, or taxable.
In all pruning procedures, we need not be too aggressive (like the Porter (1980) stemmer for English); we can let the ML tools sort out extra vocabulary elements.
- Always balance efficiency and robustness!
The resulting document-term matrix (DTM) is an n × p sparse numeric matrix X with x_ij being the count for word j in document i. [figure here]
Often, we divide x_ij by m_i = ∑_{j=1}^p x_ij = ‖x_i‖_1 to transform counts into proportions (to eliminate the length difference between two documents), or change x_ij to 1(x_ij > 0) to indicate whether a term appears in the document.
- Which of the three strategies of tokenization (i.e., frequency, proportion, and presence) or TFIDF² is the best depends on the data.
² TF(i, j) in TFIDF can also be normalized to proportions.
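As a rough illustration, the sketch below builds a small DTM with scikit-learn's CountVectorizer (an assumed tool choice, not prescribed by the lecture) and computes the proportion and presence transformations; the toy documents are hypothetical.

```python
# Minimal sketch: a sparse DTM plus the frequency / proportion / presence encodings.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the death tax hurts family farms",
        "the estate tax is a fair tax",
        "we need tax relief now"]
vec = CountVectorizer(stop_words="english")    # drop common English stopwords
X = vec.fit_transform(docs)                    # n x p sparse count matrix (frequency)

counts = X.toarray().astype(float)
m = counts.sum(axis=1, keepdims=True)          # m_i = sum_j x_ij
proportions = counts / m                       # x_ij / m_i
presence = (counts > 0).astype(float)          # 1(x_ij > 0)
print(vec.get_feature_names_out(), proportions, presence, sep="\n")
```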
Text Representation
Other Considerations
In the Jazz Musicians example of DSB, we query for the name of a jazz musician by inputting one sentence from his Wikipedia biography.
After the TFIDF representation of that sentence, we calculate its cosine similarity with each musician's biography in Wikipedia, and then choose the most similar musician as the query result, where the cosine similarity between two vectors x and y is defined as

s_cos(x, y) = ⟨x, y⟩ / (‖x‖ ‖y‖),

with ⟨·, ·⟩ being the Euclidean inner product and ‖x‖ = √⟨x, x⟩ being the Euclidean norm of x.
- The name "cosine" similarity comes from the fact that it is the cosine of the angle between x and y.
- Cosine similarity is exactly the correlation-based distance in Lecture 3, and is commonly used in text classification to measure the similarity between documents; another popular distance is the Euclidean distance.
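A minimal sketch of this definition on two hypothetical vectors:

```python
# Minimal sketch: cosine similarity s_cos(x, y) = <x, y> / (||x|| ||y||).
import numpy as np

def cosine_similarity(x, y):
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

x = np.array([0.0, 1.2, 0.3, 0.0])   # e.g., a TFIDF-weighted query sentence
y = np.array([0.1, 0.9, 0.0, 0.4])   # e.g., a TFIDF-weighted biography
print(cosine_similarity(x, y))       # 1 = same direction, 0 = orthogonal
```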
Refinement I: LSA
Latent Semantic Analysis (LSA): using PCA to factorize token counts (or transformations, e.g., x_ij/m_i or x_ij log n_j) and achieve dimension reduction.
From the SVD,

X ≈ UΣV^T =: A,

where U (n × K) refers to documents and V (p × K) refers to terms, both of lower rank, and Σ is a K × K diagonal matrix.
With LSA, we can compare documents i and i' by comparing Σu_i and Σu_{i'}, where u_i denotes the ith row of U:

s_cos(a_i, a_{i'}) = u_i^T Σ² u_{i'} / (√(u_i^T Σ² u_i) √(u_{i'}^T Σ² u_{i'})) = s_cos(Σu_i, Σu_{i'}),
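A minimal sketch of LSA on a simulated DTM: take a rank-K truncated SVD and compare documents through their Σu_i scores; the matrix and the choice K = 5 are assumptions for the example.

```python
# Minimal sketch: LSA by truncated SVD, X ~ U S V^T with K factors.
import numpy as np

rng = np.random.default_rng(0)
X = rng.poisson(0.3, size=(20, 50)).astype(float)     # toy 20 x 50 count DTM
K = 5

U, s, Vt = np.linalg.svd(X, full_matrices=False)
U_K, S_K, Vt_K = U[:, :K], np.diag(s[:K]), Vt[:K, :]  # rank-K approximation

doc_scores = U_K @ S_K                                # row i is Sigma u_i

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cos(doc_scores[0], doc_scores[1]))              # s_cos(Sigma u_i, Sigma u_i')
```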
Continued
Challenges to LSA:
- Scalability: the SVD of a large matrix X.
- Determination of K: around 300-400 dimensions usually provide the best performance; retaining 70% of the variance is also suggested.
In the left panel, {child, man, woman} are grouped together, and {king, queen, prince} are grouped together.
In the right panel, woman + (king - man) is close to queen; in other words, king - man ≈ queen - woman.
Besides distance, we can also measure the similarity between two words by cosine
similarity.
Appendix A shows how to visualize word embedding in the two-dimensional space
using the t-SNE algorithm.
Ping Yu (HKU) Text As Data 20 / 69
Text Representation
Some Details
The key preliminary step in both word2vec and GloVe is to settle on a notion of
co-occurrence among terms.
For example, let C be a p × p matrix whose (i, j) entry counts the number of times in your corpus that terms i and j appear within, say, b words of each other within the same sentence, which is known as the skip-gram co-occurrence matrix.
- C is symmetric if the co-occurrence count does not take into account the order of terms i and j.
We approximate

C ≈ UV^T

by PCA, where U and V are p × K matrices with K ≪ p.
- When C is symmetric, we can set U = V, and u_i is the word embedding of term i.
- U and V can be obtained by SVD, but more advanced algorithms in deep learning are designed to deal with the high amount of sparsity in C.
This approximation implies that C_ij ≈ u_i^T v_j, i.e., the inner product of the terms' embeddings, which measures the closeness of the pair in the K-dimensional vector space, describes how likely the pair is to co-occur.
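As a rough illustration, the sketch below builds a symmetric skip-gram co-occurrence matrix from two toy sentences and factorizes it with an SVD; the sentences, the window size b, and K are assumptions made for the example.

```python
# Minimal sketch: skip-gram co-occurrence counts within a window of b words,
# then a rank-K factorization C ~ U V^T whose rows serve as word embeddings.
import numpy as np

sentences = [["the", "king", "rules", "the", "land"],
             ["the", "queen", "rules", "the", "land"]]
vocab = sorted({w for s in sentences for w in s})
idx = {w: i for i, w in enumerate(vocab)}
p, b, K = len(vocab), 2, 2

C = np.zeros((p, p))
for s in sentences:
    for i, w in enumerate(s):
        for j in range(max(0, i - b), min(len(s), i + b + 1)):
            if j != i:                       # unordered co-occurrence, so C is symmetric
                C[idx[w], idx[s[j]]] += 1

U, s_vals, Vt = np.linalg.svd(C)
embeddings = U[:, :K] * s_vals[:K]           # absorb singular values into U
print({w: embeddings[idx[w]].round(2) for w in vocab})
```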
Text Regression
Text Regression
Given the DTM, any method learned in previous lectures, e.g., OLS, ridge, lasso,
PCR, PLS, etc., can be used to analyze the data.
Gentzkow and Shapiro (2010) study on how newspapers target content to the poli-
tics of their readership.
They show that the demand-side forces dominate: variation in consumer political attitudes explains roughly 20% of the variation in measured slant; the supply-side forces, e.g., owner ideology, pressure from incumbent politicians, and the tastes of reporters and editors, are less important.
The dataset congress109Counts summarizes the first year of the 109th Congressional Record (2005-2006), containing all speech in that year for 529 members of the U.S. House and Senate.
The text is tokenized (through their Pearson χ²-test statistics) into 500 bigrams and 500 trigrams after stopword removal and stemming (using the Porter stemmer).
So X ∈ R^{529×1000}, which is a large sparse matrix.
In addition, the dataset congress109Ideology contains the variable repshare, which is the proportion of the two-party vote obtained by George W. Bush in the 2004 presidential election in each member's constituency (district for representatives, state for senators).
Text Regression
PLS
We can use repshare as a surrogate for ideology and partisanship for each con-
gressperson.
repshare is closely related to the phrases that a congressperson would use. For example, a Republican used death.tax and tax.relief more often than a Democrat, and a Democrat used estate.tax and tax.break more often than a Republican.
Thus, we can build an index of slant that sums across a speaker’s term usage
weighted by the direction of slant for each term.
Specifically, we run PLS of y = repshare on X to build a model that connects lan-
guage to partisanship.
- In the figure in the next slide, the R² is 0.17, 0.32, and 0.43 for PLS(1) (called marginal regression), PLS(2), and PLS(3), respectively, so PLS(2) seems necessary, while Gentzkow and Shapiro use PLS(1).
- Gentzkow and Shapiro normalize each row of X to have sum one, so it doesn’t
matter how much a politician speaks but just which words s/he chooses.
Then, regressing the relative frequency of the 1000 phrases of a newspaper on the coefficients β̂^pls(2) from PLS(2), the slope can be used to quantify the political slant of that newspaper.
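A minimal sketch of this kind of fit using scikit-learn's PLSRegression; the counts and vote shares below are simulated stand-ins for the congress109 data, so the numbers are meaningless.

```python
# Minimal sketch: PLS(2) of a response on a row-normalized document-term matrix.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(1)
X_counts = rng.poisson(0.2, size=(100, 300)).astype(float)  # fake phrase counts
repshare = rng.uniform(0.2, 0.8, size=100)                  # fake vote shares

F = X_counts / X_counts.sum(axis=1, keepdims=True)  # each row sums to one
pls2 = PLSRegression(n_components=2).fit(F, repshare)
print(pls2.score(F, repshare))                      # in-sample R^2 of PLS(2)
slant_weights = pls2.coef_.ravel()                  # per-phrase loadings for a slant index
print(slant_weights[:5])
```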
Continued
Figure: The PLS fits for repshare on relative term usage in the 1000 phrases: points are colored
by political party — grey for Republicans and black for Democrats.
Lasso
We can also run a lasso regression for y on X.
Often we impose a weight ω_j = sd(x_j) on the lasso penalty for β_j, called rare-feature up-weighting, in a similar spirit to TFIDF: rare words receive smaller penalty weights, so they are most useful in differentiating between documents.
The following figure shows how to choose K in PLS and λ in lasso by CV: the lasso
outperforms PLS here.
Figure: The best K for PLS is 1 or 2, and the best λ for lasso is 0.01483774.
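A minimal sketch of rare-feature up-weighting on simulated data, using the standard reformulation: a lasso with penalty ∑_j ω_j|β_j| is equivalent to an ordinary lasso on the rescaled columns x_j/ω_j, followed by dividing the fitted coefficients by ω_j.

```python
# Minimal sketch: weighted lasso via column rescaling, with w_j = sd(x_j).
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
X = rng.poisson(0.3, size=(200, 50)).astype(float)
y = 0.5 * X[:, 0] - 0.3 * X[:, 3] + rng.normal(0, 0.5, 200)

w = X.std(axis=0)                   # penalty weights w_j = sd(x_j)
fit = Lasso(alpha=0.05).fit(X / w, y)
beta = fit.coef_ / w                # map back to the original scale
print(np.nonzero(beta)[0])          # indices of the selected phrases
```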
Topic Models
Topic Models
Many observations are not labeled, so we can first conduct PCA on X to generate factors, and then apply a supervised learning procedure to the labeled observations.
- For a really big sparse DTM, we usually first compute the covariance matrix of X before conducting PCA, i.e., compute the eigenvectors of the covariance matrix of X rather than the SVD of X itself, to avoid running out of memory.
For example, in the dataset we8there of restaurant reviews, we observe a text re-
view and also a multidimensional rating on overall experience, atmosphere, food,
service, and value with each rated on a five-point scale.
The review is constructed from a number of underlying topics, e.g., the type of food,
the quality of the restaurant, or the geographic location; in other words, we add a
topic layer between the document and the ultimate model like in PCR.
Our target is to uncover these topics via factorization.
LSA was common before the 2000s; in 2003, David Blei, Andrew Ng and Michael
I. Jordan introduced another unsupervised text analysis tool – topic modeling, aka
Latent Dirichlet Allocation (LDA), to replace arbitrary transformations of LSA with a
plausible generative model; see Blei (2012) for an introduction.
History of LDA
³ Both Blei and Ng were supervised by Jordan at Berkeley.
Topic Models
LDA
The squared error loss implied by PCA is inappropriate for the analysis of sparse word-count data.
Rather, we take the bag-of-words representation seriously and model token counts as realizations from a multinomial distribution, i.e., we treat the topic model as a multinomial factor model.
Specifically, x_i ∈ R^p has a multinomial factor distribution:

x_i ~ MN(ω_i1 θ_1 + ⋯ + ω_iK θ_K, m_i),

where m_i = ∑_{j=1}^p x_ij, ω_i = (ω_i1, …, ω_iK)^T is a probability vector for document i, and θ_k ∈ R^p is a probability vector over the vocabulary of words and represents "topic" k.
- The probability of word j appearing in document i is p_ij := ∑_{k=1}^K ω_ik θ_jk.
- The term "Dirichlet" is from the Bayesian modelling which treats ω_i and θ_k as generated from a Dirichlet-distributed prior.
LDA can account for latent clustering and dependency within and across docu-
ments, and can be used for managing, organizing, and annotating large archives of
texts by themes (i.e., topics).
In PCA,

E[x_i] = z_i1 φ_1 + ⋯ + z_iK φ_K,

and in LDA,

E[x_i/m_i] = ω_i1 θ_1 + ⋯ + ω_iK θ_K =: Θω_i,

so the topic score ω_ik is like the PC score z_ik and the topic probability θ_k is like the factor loading φ_k.
The difference is that the multinomial deviance rather than the squared error loss is minimized; also, ω_i and θ_k are probability vectors rather than ‖φ_k‖ = 1 as in PCA.
- LDA is modeling relative usage of words rather than absolute usage as in PCA.
Topic models have been widely adopted since their introduction because they tend to provide easily interpretable factorizations.
- There are also extensions like hierarchical topic models to account for an unknown number of topics, dynamic topic models to allow topics to change slowly in time, correlated topic models to allow θ_k, k = 1, …, K, to be correlated, or supervised topic models where some observed attributes are used to supervise the selection of topics.
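A minimal sketch of fitting LDA with scikit-learn's LatentDirichletAllocation on a simulated count DTM; the data and the choice K = 4 are assumptions for the example.

```python
# Minimal sketch: LDA topic weights omega (n x K) and topic probabilities theta (K x p).
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(3)
X = rng.poisson(0.5, size=(50, 40))              # toy 50 x 40 count DTM
K = 4

lda = LatentDirichletAllocation(n_components=K, random_state=0).fit(X)
omega = lda.transform(X)                         # document-topic weights omega_i
theta = lda.components_ / lda.components_.sum(axis=1, keepdims=True)  # topics theta_k
print(omega[0].round(2), theta.shape)
```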
An Illustrative Example
Continued
Note that the names of topics do not exist a priori and are added a posteriori to
enhance interpretability.
Computation Details
- These algorithms stream data during estimation rather than loading it into memory
as a block; see Lecture 10 for more details.
- For many EM- and MAP-type algorithms, see Taddy (2012).
K can be chosen by CV, or maximizing the marginal likelihood (an idea embodied in the Bayes factor),

P(D|M) = ∫ P(Ω, Θ|M) P(D|Ω, Θ) dΩ dΘ.
Often, the goal of text analysis is to provide an intuitive description of text, rather
than inference on some underlying "true" parameters.
We can work from bottom up or top down to build a narrative.
Bottom up: look at "top" words for each topic.
- Checking the words associated with large θ_jk's may not be appropriate since some words are on top for many topics (especially when only a small set of stop words is pruned).
- Alternatively, use lift:

lift_jk = θ_jk / q̄_j,

where q̄_j = (1/n) ∑_{i=1}^n x_ij/m_i is the average proportion of a document allocated to word j.
So the lift is high for words that are much more common in topic k than in general speech.
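A minimal sketch of this lift diagnostic on hypothetical θ and X:

```python
# Minimal sketch: lift_jk = theta_jk / q_bar_j, with q_bar_j the average word share.
import numpy as np

rng = np.random.default_rng(4)
X = rng.poisson(0.5, size=(30, 20)).astype(float)     # toy counts
theta = rng.dirichlet(np.ones(20), size=3)            # 3 made-up topics over 20 words

q_bar = (X / X.sum(axis=1, keepdims=True)).mean(axis=0)   # q_bar_j
lift = theta / q_bar                                      # lift[k, j] = theta_jk / q_bar_j
print(np.argsort(-lift, axis=1)[:, :5])                   # top-5 words per topic by lift
```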
Top down: plot ω_ik against y_i for the first few k's to search for patterns.
Analogue: PCR vs. PLS.
Collaborative Filtering
Try to understand how text connects to a set of related covariates, e.g., how to connect the restaurant reviews simultaneously to all five aspect ratings, which allows us to determine which content is predictive of the rating on, say, atmosphere separately from food or service.
Given document attributes y_i (author characteristics, date, beliefs, sentiment, etc.), multinomial inverse regression (MNIR) of Taddy (2013) assumes x_i follows a logistic multinomial model:

x_i ~ MN(q_i, m_i) with q_ij = exp(η_ij) / ∑_{l=1}^p exp(η_il), η_ij = α_j + y_i^T φ_j. (3)

- The "inverse" in MNIR comes from the inverse operation in text regression, where one attribute y_ij is regressed on x_i, while here x_i is regressed on a few attributes y_i.
- This model is similar to topic models but replaces unknown topics with known attributes.
- Note that x_ij, j = 1, …, p, are correlated conditional on y_i – an increase in some x_ij implies decreases in other x_ij's.⁴
⁴ When dim(y_i) = 1, this is different from Naive Bayes, where we assume x_ij|y_i, j = 1, …, p, are independent and follow Poisson or negative binomial distributions.
Multinomial Inverse Regression
History of MNIR
An Illustrative Example
x_y ~ MN(q_y, m_y) with q_yj = exp(α_j + yφ_j) / ∑_{l=1}^p exp(α_l + yφ_l) for j = 1, …, p and y ∈ Y,

where m_y = ∑_{i: y_i = y} m_i.
Although independence of x_i is surely incorrect, within-individual correlation in x_i is quickly overwhelmed in aggregation, and the multinomial becomes a decent model for x_y.
Estimation
Taddy considers the GL scheme in Appendix B of Lecture 4 and suggests using coordinate descent combined with majorization to obtain the MAP estimate.⁵
(s, r) in the Gamma-Lasso can be chosen by CV or set at some pilot value; it seems that the minimization is insensitive to the choice of (s, r).
When p and/or d := dim(y_i) is large, Taddy (2015) suggests using distributed multinomial regression (DMR) to lighten the computational burden.
- Suppose x_ij ~ iid Poisson(exp(η_ij)); then

p(x_i) = ∏_{j=1}^p Poisson(x_ij | exp(η_ij)) = MN(x_i | q_i, m_i) × Poisson(m_i | Λ_i),  (4)

i.e., whether or not we condition on m_i does not affect the inference on q_i.
- Extend η_ij = α_j + µ_i + y_i^T φ_j in (3); conditional on m_i, the MLE of µ_i is

µ̂_i = log(m_i / Λ_i).

- If we instead set µ̂_i = log m_i, then the factorized Poisson likelihood (4) can be analyzed independently across j, with the negative log-likelihood proportional to

-log ℓ(α_j, φ_j) = ∑_{i=1}^n [m_i exp(α_j + y_i^T φ_j) - x_ij (α_j + y_i^T φ_j)].
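A minimal sketch of this plug-in idea on simulated data: with µ̂_i = log m_i, each word j gets its own small Poisson regression minimizing the displayed objective (the Gamma-Lasso penalty is omitted here for simplicity).

```python
# Minimal sketch: DMR-style independent fits across words j of
#   sum_i [ m_i * exp(alpha_j + y_i'phi_j) - x_ij * (alpha_j + y_i'phi_j) ].
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(5)
n, p, d = 200, 10, 2
Y = rng.normal(size=(n, d))                       # document attributes y_i
X = rng.poisson(1.0, size=(n, p)).astype(float)   # word counts x_ij
m = X.sum(axis=1)                                 # m_i

def neg_loglik(params, xj):
    alpha, phi = params[0], params[1:]
    eta = alpha + Y @ phi
    return np.sum(m * np.exp(eta) - xj * eta)

coefs = np.array([minimize(neg_loglik, np.zeros(d + 1), args=(X[:, j],),
                           method="BFGS").x for j in range(p)])
print(coefs.shape)                                # p rows of (alpha_j, phi_j)
```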
Prediction
Taddy (2013) specifies y_i = (gop, repshare*)^T to run MNIR, and the fitted count against the observed count is shown in the upper-right panel of the figure in the next slide, where gop is an indicator for "Grand Old Party" (i.e., the Republican Party), and repshare* is the residual from regressing repshare on gop.
- Note that φ_jk, k = 1, …, d, are partial effects (i.e., depending on other attributes); for each k, we can list the top loadings by the size of their |φ_jk|, as in the lower panel of the figure in the next slide.
- z_ik = x_i^T φ_k = ∑_{j=1}^p x_ij φ_jk contains all of the information in x_i that is relevant to y_ik, and is especially useful for sorting documents according to the various attributes.
He then regresses repshare on (z_gop, z_votediff, z_gop × z_votediff), where z_gop and z_votediff are the normalized SR scores, and the interaction term is to reduce misspecification bias.
- The original repshare against the fitted repshare is shown in the upper-left panel of the figure in the next slide (red for Republicans and blue for Democrats).
Results
In text regression, dim(y_i) = 1; in topic models, dim(y_i) = 0; and in MNIR, dim(y_i) > 1.
When p is large relative to n, MNIR performs better than text regression, but these gains diminish with n.
It is usually unwise to attempt to learn more flexible (than linear) functional forms unless n is much larger than p.
When dim(y_i) > 1, MNIR, especially DMR, can be used to resolve or control for interdependencies between these attributes and their effects on language. The speed and scalability are the main advantages of MNIR over other methods.
- Caution: Do not naively use only the text data for prediction; instead, use the text
as a complement to the existing methods – use it to predict residuals or automate
analysis that is expensive and time-consuming.
It is not advisable to overinterpret topics. To improve interpretability for topic models, it is best to add some supervision (i.e., to incorporate external information on the topics for some set of cases).
- Generally, when yi is available, we can supervise the topic selection using yi .
Often, we have a large corpus of unlabeled documents and a smaller set of labeled ones. We can first fit a topic model on the large corpus, and then use these topics as well as the token counts for text regression or MNIR on the smaller labeled corpus.
Lab: Text As Data
x_i ~ MN(θ_1, t_i1) + ⋯ + MN(θ_K, t_iK),
Appendix A: t-SNE
History of t-SNE
Geoffrey Hinton (1947-, Google and UToronto), Sam Roweis (1972-2010, NYU)⁶
⁶ He jumped over the 16th-floor balcony in an argument with his wife over caring for preemie twins.
Appendix A: t-SNE
Basic Ideas
Stage 1
Continued
The bandwidth of the Gaussian kernels, σ_i, is set in such a way that the entropy of the conditional distribution equals a predefined entropy, using the bisection method.
Specifically, any particular value of σ_i induces a probability distribution, P_i, over all of the other datapoints. The perplexity of P_i, defined as Perp(P_i) = 2^{H(P_i)}, where H(P_i) = -∑_j p_{j|i} log₂ p_{j|i} is the Shannon entropy of P_i measured in bits [see Lecture 8 for more discussion on entropy], can be interpreted as a smooth measure of the effective number of neighbors.
As a result, the bandwidth is adapted to the density of the data: smaller values of
σ i are used in denser parts of the data space.
The performance of SNE is fairly robust to changes in the perplexity, and typical
values are between 5 and 50.
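A minimal sketch (mirroring the description above, not taken from any particular t-SNE implementation) of calibrating σ_i by bisection so that the perplexity of P_i hits a chosen target; the squared distances are simulated.

```python
# Minimal sketch: choose sigma_i so that 2^{H(P_i)} matches a target perplexity.
import numpy as np

def conditional_probs(dist2, sigma):
    w = np.exp(-dist2 / (2.0 * sigma ** 2))
    return w / w.sum()

def perplexity(p):
    return 2.0 ** (-np.sum(p * np.log2(p + 1e-12)))   # 2^{H(P_i)}, entropy in bits

def calibrate_sigma(dist2, target=30.0, lo=1e-3, hi=1e3, iters=60):
    for _ in range(iters):                 # bisection: perplexity increases with sigma
        mid = 0.5 * (lo + hi)
        if perplexity(conditional_probs(dist2, mid)) > target:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

rng = np.random.default_rng(6)
dist2 = rng.uniform(0.1, 5.0, size=200)    # squared distances from point i to the others
sigma_i = calibrate_sigma(dist2)
print(sigma_i, perplexity(conditional_probs(dist2, sigma_i)))
```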
(**) Since the Gaussian kernel uses the Euclidean distance ‖x_i - x_j‖, it is affected by the curse of dimensionality, and in high-dimensional data, when distances lose the ability to discriminate, the p_ij become too similar (asymptotically, they would converge to a constant). It has been proposed to adjust the distances with a power transform, based on the intrinsic dimension of each point, to alleviate this.
Appendix A: t-SNE
Stage 2
Appendix B: Association Rules
Goals
Goal: find the mode of X = (X_1, …, X_p), i.e., arg max_{x∈𝒳} Pr(X = x), where 𝒳 = ∏_{j=1}^p S_j is the support of X, and S_j is the support of X_j.
Often X_j ∈ {0, 1}, representing whether the jth item was purchased, so it is referred to as market basket analysis.
- "Mode" with many X_j's equal to 1 means those items are frequently purchased together, so are useful for stocking shelves, cross-marketing in sales promotions, catalog design, and consumer segmentation based on buying patterns.
If p is relatively large (e.g., p ≈ 10^4 and n ≈ 10^8 in practice), and S_j includes more than a small number of values⁷, then 𝒳 includes ∏_{j=1}^p |S_j| values, which is huge, implying Pr(X = x) is too small to be reliably estimated for any x ∈ 𝒳 – another manifestation of the curse of dimensionality, where |S_j| is the cardinality of S_j.
Modified Goal: seek regions of 𝒳 with high probability relative to their size.
- Mathematically, we try to find s_j ⊆ S_j such that

Pr(⋂_{j=1}^p {X_j ∈ s_j})  (5)

is relatively large.
- The intersection of subsets ⋂_{j=1}^p {X_j ∈ s_j} is called a conjunctive rule. [What does a non-conjunctive rule look like?]
⁷ If X_j is continuous, d_j = ∞.
Appendix B: Association Rules
Continued
p
where K f1, , K g with K = ∑j =1 Sj .
- The set K is called an item set, and jK j is called its size which is no bigger than
p. [How many possible item sets? < 2K ]
The probability in (7) can be estimated by
1 n
∏k 2K Zk = 1 n i∑ ∏k 2K zik ,
b
Pr = (8)
=1
⁸ The former is one of the two PhD supervisors of the latter; the Apriori algorithm is one chapter of the dissertation of the latter; both graduated from the University of Wisconsin-Madison.
Appendix B: Association Rules
Association Rules
Each 𝒦 in the output of the Apriori algorithm can be partitioned into two disjoint subsets, A ∪ B = 𝒦, and written as an association rule A ⇒ B.
- A is called the antecedent, and B the consequent.
The support of the rule, T(A ⇒ B), is P̂r(∏_{k∈𝒦} Z_k = 1) =: P̂r(A ∪ B) - the probability of simultaneously observing both A and B.
The confidence or predictability C(A ⇒ B) of the rule is

C(A ⇒ B) = T(A ⇒ B) / T(A) =: P̂r(B|A).

The expected confidence is T(B) =: P̂r(B).
The lift is

L(A ⇒ B) = C(A ⇒ B) / T(B) =: P̂r(A ∪ B) / (P̂r(A) P̂r(B)).
Output of the entire analysis: the association rules that satisfy the constraints
An Example
Suppose K = fpeanut butter, jelly, breadg and consider the rule fpeanut butter, jellyg
fbreadg.
If T (fpeanut butter, jellyg ) fbreadg) = 3%, then the three appear together in 3%
of the market baskets.
If C (fpeanut butter, jellyg ) fbreadg) = 82%, then when peanut butter and jelly
were purchased, 82% of the time bread was also purchased.
If bread appeared in 43% of all market baskets, then the rule fpeanut butter, jellyg )
fbreadg would have a lift of 82%/43% = 1.95 > 1.
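A minimal sketch computing support, confidence, and lift for this rule on a hypothetical list of baskets:

```python
# Minimal sketch: T, C, and L for the rule {peanut butter, jelly} => {bread}.
baskets = [
    {"peanut butter", "jelly", "bread"},
    {"peanut butter", "jelly", "bread", "milk"},
    {"bread", "milk"},
    {"peanut butter", "jelly"},
    {"milk", "eggs"},
]
A, B = {"peanut butter", "jelly"}, {"bread"}
n = len(baskets)

def freq(items):
    return sum(items <= basket for basket in baskets) / n  # share of baskets containing items

support = freq(A | B)               # T(A => B)
confidence = support / freq(A)      # C(A => B) = T(A => B) / T(A)
lift = confidence / freq(B)         # L(A => B) = C(A => B) / T(B)
print(support, confidence, lift)
```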
The output like (9) is stored in a database that can be queried by the user.
A usual query requires particular items to be in the consequent, e.g.,
Display all transactions in which ice skates are the consequent that have confidence over 80% and support of more than 2%.
Such queries provide information on the antecedents that predict sales of ice skates.
Focusing on a particular consequent casts the problem into the framework of supervised learning.
The Apriori algorithm can be applied to huge databases, so it represents one of the major advances in data mining technology.
Although association rules are among data mining's biggest successes, they suffer from some drawbacks.
- X_j can only be discrete or must be transformed into a discrete variable.
- The number of solution item sets, their size, and the number of passes required over the data can grow exponentially with decreasing t, which implies that rules with high confidence or lift, but low support, will not be discovered; e.g., a high-confidence rule such as vodka ⇒ caviar will not be discovered due to the low sales volume of the consequent caviar.
Choice of g0
For the general problem (5), we can apply the supervised learning approach, although such an approach cannot handle as huge a database as the market basket analysis above.
The problem (5) tries to find J ⊆ {1, 2, …, p} and subsets s_j ⊆ S_j, j ∈ J, such that

P̂r(⋂_{j∈J} {X_j ∈ s_j}) = (1/n) ∑_{i=1}^n I(⋂_{j∈J} {x_ij ∈ s_j})  (11)

is relatively large.
Better Choice of g0
Reference to the uniform distribution can cause highly frequent item sets with low
associations among their constituents to dominate the collection of highest support
item sets and neglect the high-association-low-frequency item sets such as (vodka,
caviar).
To avoid this problem, we can choose (10) as our reference distribution, where the
simulated sample from (10) can be easily generated from the data itself by applying
a different random permutation to the data values of each Xj .
- g(x)/g0(x) would be uniform if there are no associations among the X_j's, regardless of the magnitude of the marginal densities g_j(x_j).
- It is not clear how to incorporate reference distributions other than the uniform into
the Apriori algorithm, so a supervised learning method would be applied.
The goal is to find regions

R = ⋂_{j∈J} {X_j ∈ s_j}  (12)

for which the target function µ(x) = E[Y|x] is relatively large and, in addition, the data support

T(R) = ∫_{x∈R} g(x) dx

is not too small.
Appendix B: Association Rules
To learn conjunctive rules as in (12), we can apply a CART decision tree [see
Lecture 8] whose terminal nodes take exactly the form of (12).
Those terminal nodes t with high average y-values

ȳ_t = ave(y_i | x_i ∈ t)

are candidates for high-support generalized item sets.
- The actual (data) support is given by

T(R) = ȳ_t · N_t / (N + N_0),

where N_t is the number of (pooled) data points in the terminal node t.
These can then be partitioned into antecedents and consequents in a search for
generalized association rules of high confidence and/or lift.
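A minimal sketch of this CART-based search on simulated data: real observations get Y = 1, a column-permuted reference sample (playing the role of g0) gets Y = 0, a tree approximates µ(x) = E[Y|x], and high-ȳ_t leaves are reported together with their support ȳ_t N_t/(N + N_0); scikit-learn's DecisionTreeRegressor is an assumed tool choice.

```python
# Minimal sketch: terminal nodes of a CART fit as candidate generalized item sets.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(7)
N = 500
real = np.column_stack([rng.integers(0, 2, N), rng.integers(0, 3, N)]).astype(float)
real[:, 1] = np.where(real[:, 0] == 1, 2, real[:, 1])   # plant an association

reference = real.copy()
for j in range(reference.shape[1]):                      # permute each column separately:
    reference[:, j] = rng.permutation(reference[:, j])   # keeps marginals, kills associations

X_pool = np.vstack([real, reference])
y_pool = np.concatenate([np.ones(N), np.zeros(N)])       # Y = 1 real, Y = 0 reference

tree = DecisionTreeRegressor(max_leaf_nodes=8, min_samples_leaf=25).fit(X_pool, y_pool)
leaf = tree.apply(X_pool)                                # terminal node of each point
for t in np.unique(leaf):
    mask = leaf == t
    y_bar = y_pool[mask].mean()                          # y_bar_t
    support = y_bar * mask.mean()                        # y_bar_t * N_t / (N + N_0)
    if y_bar > 0.55:
        print(f"leaf {t}: y_bar = {y_bar:.2f}, support = {support:.2f}")
```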
Other methods such as patient rule induction method (PRIM) can also be applied.
Comparison with the Apriori algorithm: (i) The Apriori algorithm is exhaustive - it finds all rules with support greater than a threshold, while CART or PRIM is a greedy algorithm, so it may not find the "optimal" set of rules. (ii) The Apriori algorithm can deal only with dummy variables and cannot find a rule where s_j is a subset of S_j containing more than one value.