Date:
▪ Monday, March 11, 2:00 PM – Wednesday, March 13, 2:00 PM Pacific Time
Logistics:
▪ Administered on Gradescope
▪ 3 hours long (the timer starts once you open the exam)
▪ Submitting answers (all questions are visible at the same time):
▪ One PDF for the entire exam (uploaded at the top of the exam), or
▪ One PDF for each question (uploaded to each question), or
▪ Write answers directly in the text boxes
▪ You can do this as you go through the questions (no need to wait until the end)
▪ Please budget your time for submission (~10 min) and solve the questions you find easy first – the exam tends to be on the longer side
If you think a question isn't clear on the exam...
▪ Ask on Ed, or state your (reasonable and valid) assumptions in your answer
▪ We will actively monitor Ed on:
▪ Monday: 2 PM – 10 PM PT
▪ Tuesday: 8 AM – 3 PM, 5 PM – 10 PM PT
▪ Wednesday: 8 AM – 2 PM PT
▪ We will answer clarifying questions only
Exam Review Session: Friday, 6 PM – 7 PM PT via Zoom (see Ed and Canvas for details)
The final exam is open book and open notes
A calculator or computer is REQUIRED
▪ You may only use your computer to do arithmetic calculations (i.e., the buttons found on a standard scientific calculator)
▪ You may also use your computer to read the course notes or the textbook
▪ No use of AI chatbots (including, but not limited to, ChatGPT)
▪ No collaboration with other students
Practice finals are posted on Ed and Gradescope
Good luck with the exam! ☺
You Have Done a Lot!!!
And (hopefully) learned a lot!!!
▪ Answered questions and proved many interesting results
▪ Implemented a number of methods
Thank You for the Hard Work!
Note to other teachers and users of these slides: We would be delighted if you found our
material useful for giving your own lectures. Feel free to use these slides verbatim, or to
modify them to fit your own needs. If you make use of a significant portion of these slides
in your own lecture, please include this message, or a link to our web site: http://www.mmds.org
CS246: Mining Massive Datasets
Jure Leskovec, Stanford University
http://cs246.stanford.edu
Redundancy leads to a bad user experience
▪ Uncertainty around the user's information need ⇒ don't put all your eggs in one basket
How do we optimize for diversity directly?
Example front page, Monday, January 14:
▪ France intervenes
▪ Chuck for Defense
▪ Argo wins big
▪ Hagel expects fight (redundant: another Chuck Hagel story)
A more diverse front page for the same day:
▪ France intervenes
▪ Chuck for Defense
▪ Argo wins big
▪ New gun proposals
Idea: Encode diversity as a coverage problem
Example: Word cloud of news for a single day
▪ Want to select articles so that most words are "covered"
Q: What is being covered?
A: Concepts (in our case: named entities)
[Figure: word cloud of concepts – France, Mali, Hagel, Pentagon, Obama, Romney, Zero Dark Thirty, Argo, NFL – with the document "Hagel expects fight" covering some of them]
Q: Who is doing the covering?
A: Documents
Suppose we are given a set of documents D
▪ Each document d covers a set X_d of words/topics/named entities from W
For a set of documents A ⊆ D we define
F(A) = |⋃_{d∈A} X_d|
Goal: We want to
max_{|A| ≤ k} F(A)
Note: F(A) is a set function, F : Sets → ℕ (it counts the covered elements, which is why we take the size of the union)
Given a universe of elements W = {w₁, …, w_n}
and sets X₁, …, X_m ⊆ W
[Figure: sets X₁, X₂, X₃, X₄ inside universe W]
Goal: Find k sets X_i that cover the most of W
▪ More precisely: find the k sets X_i whose union is largest
▪ Bad news: a known NP-complete problem
Simple Heuristic: Greedy Algorithm:
Start with A₀ = { }
For i = 1 … k:
▪ Find the document d maximizing F(A_{i−1} ∪ {d})
▪ Let A_i = A_{i−1} ∪ {d}
where F(A) = |⋃_{d∈A} X_d|
Example:
▪ Evaluate F({d₁}), …, F({d_m}); pick the best (say d₁)
▪ Evaluate F({d₁} ∪ {d₂}), …, F({d₁} ∪ {d_m}); pick the best (say d₂)
▪ Evaluate F({d₁, d₂} ∪ {d₃}), …, F({d₁, d₂} ∪ {d_m}); pick the best
▪ And so on…
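This loop is easy to state in code. Below is a minimal sketch (mine, not the lecture's) of the greedy heuristic in Python; the document-to-concept map X and the budget k are hypothetical inputs:

```python
# Greedy max coverage: repeatedly add the document whose concepts
# add the most not-yet-covered elements.
def greedy_max_coverage(X, k):
    """X: dict doc_id -> set of covered concepts; k: number of docs to pick."""
    selected, covered = [], set()
    for _ in range(min(k, len(X))):
        best = max(X.keys() - set(selected),
                   key=lambda d: len(X[d] - covered))  # marginal gain
        selected.append(best)
        covered |= X[best]
    return selected, covered

# Tiny example: d1 has the largest initial gain, then d2 adds the most new concepts
X = {"d1": {"France", "Mali", "Argo"},
     "d2": {"Hagel", "Pentagon"},
     "d3": {"France", "Hagel"}}
print(greedy_max_coverage(X, 2))  # (['d1', 'd2'], all five concepts covered)
```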
Goal: Maximize the covered area
[Figure: greedy grows the covered area one set at a time]
[Figure: three overlapping sets A, B, and C]
Goal: Maximize the size of the covered area
Greedy first picks A and then C,
but the optimal choice would be to pick B and C
Greedy produces a solution A where:
F(A) ≥ (1 − 1/e) · OPT, i.e., F(A) ≥ 0.63 · OPT
[Nemhauser, Fisher, Wolsey '78]
The claim holds for functions F(·) with 2 properties:
▪ F is monotone (adding more docs doesn't decrease coverage):
if A ⊆ B then F(A) ≤ F(B), and F({}) = 0
▪ F is submodular:
adding an element to a set gives less improvement than adding it to one of its subsets
Definition:
A set function F(·) is called submodular if:
For all A, B ⊆ W:
F(A) + F(B) ≥ F(A ∪ B) + F(A ∩ B)
[Figure: Venn diagrams of A and B, their union A ∪ B, and their intersection A ∩ B]
Diminishing returns characterization
Equivalent definition:
A set function F(·) is called submodular if:
For all A ⊆ B (and d ∉ B):
F(A ∪ {d}) − F(A) ≥ F(B ∪ {d}) − F(B)
Gain of adding d to a small set ≥ gain of adding d to a large set
[Figure: adding d to the small set A yields a large improvement; adding d to the large set B yields a small improvement]
F(·) is submodular: for A ⊆ B,
F(A ∪ {d}) − F(A) ≥ F(B ∪ {d}) − F(B)
Gain of adding d to a small set ≥ gain of adding d to a large set
Natural example:
▪ Sets d₁, …, d_m (regions of the plane)
▪ F(A) = |⋃_{i∈A} d_i| (size of the covered area)
▪ Claim: F(A) is submodular!
Intuition: d's gain is the part of d not yet covered, and a larger collection already covers more of d
[Figure: region d overlaps the area covered by the small collection A less than the area covered by the large collection B]
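A quick numeric sanity check of the claim on a tiny hypothetical instance (sets of integers standing in for areas):

```python
# Coverage function F(A) = |union of X[d] for d in A| and a check of
# the diminishing-returns inequality F(A+{d}) - F(A) >= F(B+{d}) - F(B).
X = {"d1": {1, 2}, "d2": {2, 3}, "d3": {3, 4, 5}}

def F(A):
    return len(set().union(*(X[d] for d in A))) if A else 0

A, B, d = {"d1"}, {"d1", "d2"}, "d3"        # A is a subset of B
gain_small = F(A | {d}) - F(A)              # d3 adds {3, 4, 5}: gain 3
gain_large = F(B | {d}) - F(B)              # 3 is already covered: gain 2
assert gain_small >= gain_large             # 3 >= 2: submodularity holds here
```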
Submodularity is the discrete analogue of concavity
[Figure: a concave curve of F(·) against solution size; since A ⊆ B, the step from F(A) to F(A ∪ {d}) is larger than the step from F(B) to F(B ∪ {d}) – adding d to B helps less than adding it to A!]
F(A ∪ {d}) − F(A) ≥ F(B ∪ {d}) − F(B)
Gain of adding d to a small set ≥ gain of adding d to a large set
Marginal gain:
Δ_F(d | A) = F(A ∪ {d}) − F(A)
Submodularity: for A ⊆ B,
F(A ∪ {d}) − F(A) ≥ F(B ∪ {d}) − F(B)
Concavity: for a ≤ b,
f(a + d) − f(a) ≥ f(b + d) − f(b)
[Figure: a concave curve of F(A) against |A|]
Let F₁, …, F_m be submodular and λ₁, …, λ_m > 0;
then F(A) = Σ_{i=1}^{m} λ_i F_i(A) is submodular
▪ Submodularity is closed under non-negative linear combinations!
This is an extremely useful fact:
▪ The average of submodular functions is submodular: F(A) = Σ_i P(i) · F_i(A)
▪ Multicriterion optimization: F(A) = Σ_i λ_i F_i(A)
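The closure fact follows in one line from the diminishing-returns definition; a sketch of the argument in LaTeX:

```latex
% For A \subseteq B, each F_i satisfies diminishing returns:
%   F_i(A \cup \{d\}) - F_i(A) \;\ge\; F_i(B \cup \{d\}) - F_i(B).
% Multiplying by \lambda_i \ge 0 preserves each inequality, and summing gives
\sum_{i=1}^{m} \lambda_i \bigl[F_i(A \cup \{d\}) - F_i(A)\bigr]
  \;\ge\; \sum_{i=1}^{m} \lambda_i \bigl[F_i(B \cup \{d\}) - F_i(B)\bigr],
% i.e. F(A \cup \{d\}) - F(A) \ge F(B \cup \{d\}) - F(B), so F is submodular.
```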
Objective: pick k docs that cover most concepts
[Figure: concepts France, Mali, Hagel, Pentagon, Obama, Romney, Zero Dark Thirty, Argo, NFL covered by the documents "Enthusiasm for Inauguration wanes" and "Inauguration weekend"]
F(A): the number of concepts covered by A
▪ Elements are concepts; each document's set is the concepts it mentions
▪ F(A) is submodular and monotone!
▪ So we can use the greedy algorithm to optimize F
Objective: pick k docs that cover most concepts
The good:
▪ Penalizes redundancy
▪ Submodular
The bad:
▪ Ignores concept importance
▪ All-or-nothing coverage is too harsh
Objective: pick k docs that cover most concepts
Each concept c has an importance weight w_c
Document coverage function: cover_d(c) is the
probability that document d covers concept c
[e.g., how strongly d covers c]
[Figure: the document "Enthusiasm for Inauguration wanes" covers the concept Obama strongly and Romney weakly]
Document coverage function:
cover_d(c) = probability that document d covers concept c
▪ cover_d(c) can also model how relevant concept c is for user u
Set coverage function:
▪ The probability that at least one document in A covers c:
cover_A(c) = 1 − ∏_{d∈A} (1 − cover_d(c))
Objective, with concept weights w_c:
F(A) = Σ_c w_c · cover_A(c)
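A minimal sketch of this probabilistic-coverage objective in Python (the documents, probabilities, and weights are made-up examples):

```python
import math

# cover[d][c]: hypothetical probability that document d covers concept c
cover = {"doc1": {"Obama": 0.8, "Romney": 0.1},
         "doc2": {"Obama": 0.3, "Argo": 0.9}}
w = {"Obama": 0.5, "Romney": 0.2, "Argo": 0.3}   # concept importance weights

def F(A):
    """Weighted probability that at least one doc in A covers each concept."""
    total = 0.0
    for c, wc in w.items():
        p_uncovered = math.prod(1 - cover[d].get(c, 0.0) for d in A)
        total += wc * (1 - p_uncovered)
    return total

print(F({"doc1"}), F({"doc1", "doc2"}))  # adding doc2 gives a diminishing gain
```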
This objective function is also submodular
▪ Intuitively, it has the diminishing-returns property
▪ So the greedy algorithm gives a (1 − 1/e) ≈ 63% approximation, i.e., a near-optimal solution
Objective: pick k docs that cover most concepts
▪ Each concept c has an importance weight w_c
▪ Documents partially cover concepts: cover_d(c)
The greedy algorithm is slow!
▪ At each iteration we must re-evaluate the marginal gain F(A ∪ {x}) − F(A) of every remaining document, then add the document with the highest marginal gain
▪ Runtime O(|D| · K) function evaluations for selecting K documents out of a set D of them
[Figure: bar chart of marginal gains for candidate documents a, b, c, d]
[Leskovec et al., KDD '07]
In round i: so far we have A_{i−1} = {d₁, …, d_{i−1}}
▪ Now we pick d_i = argmax_{d∈V} F(A_{i−1} ∪ {d}) − F(A_{i−1})
▪ The greedy algorithm maximizes the "marginal benefit"
Δ_i(d) = F(A_{i−1} ∪ {d}) − F(A_{i−1})
By the submodularity property:
F(A_i ∪ {d}) − F(A_i) ≥ F(A_j ∪ {d}) − F(A_j) for i < j
Observation: By submodularity, for every d ∈ D:
Δ_i(d) ≥ Δ_j(d) for i < j, since A_i ⊆ A_j
Marginal benefits Δ_i(d) only shrink as i grows!
(Selecting document d in step i covers more new words than selecting d at step j > i)
[Figure: Δ_i(d) ≥ Δ_j(d) for a document d]
[Leskovec et al., KDD '07]
Idea:
▪ Use Δ_i as an upper bound on Δ_j (j > i)
Lazy Greedy:
▪ Keep an ordered list of the marginal benefits Δ_i from the previous iteration
▪ Re-evaluate Δ_i only for the top element
▪ If it stays on top, select it; otherwise re-sort and prune
This works because F(A ∪ {d}) − F(A) ≥ F(B ∪ {d}) − F(B) for A ⊆ B, so stale gains are valid upper bounds
[Figure: iteration 1 – documents a–e sorted by (upper bounds on) marginal gain, a is selected, A₁ = {a}; iteration 2 – only the top stale bounds are re-evaluated, b is selected, A₂ = {a, b}]
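A minimal sketch of Lazy Greedy using a max-heap of stale gains (F here stands for any monotone submodular set function; everything below is illustrative, not the paper's code):

```python
import heapq

def lazy_greedy(F, docs, k):
    """Select k docs (approximately) maximizing a monotone submodular F."""
    A = []
    # Max-heap (via negated values) of stale upper bounds on marginal gains,
    # initialized with each document's singleton gain.
    heap = [(-(F([d]) - F([])), d) for d in docs]
    heapq.heapify(heap)
    while len(A) < k and heap:
        _, d = heapq.heappop(heap)
        gain = F(A + [d]) - F(A)               # refresh this stale bound
        if not heap or gain >= -heap[0][0]:
            A.append(d)                        # still the best: select it
        else:
            heapq.heappush(heap, (-gain, d))   # reinsert with the tighter bound
    return A
```

Because marginal gains only shrink, a refreshed gain that still beats every other (stale, hence optimistic) bound must belong to the true argmax, so most documents are never re-evaluated.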
Summary so far:
▪ Diversity can be formulated as a set-cover objective
▪ Set cover is a submodular optimization problem
▪ It can be (approximately) solved using the greedy algorithm
▪ Lazy Greedy gives a significant speedup
[Figure: running time in seconds (0–400, lower is better) vs. number of blogs selected (1–10): exhaustive search over all subsets is slowest, naive greedy is much faster, and Lazy Greedy is fastest]
But what about personalization?
[Figure: a recommendation model produces personalized picks such as "Election trouble", "Songs of Syria", and "Sandy delays"]
We assumed the same concept weighting for all users
[Figure: a single shared weight profile over the concepts (France, Mali, Hagel, Pentagon, Obama, Romney, Zero Dark Thirty, Argo, NFL) producing the picks "France intervenes", "Chuck for Defense", "Argo wins big"]
But each user has different preferences over concepts
[Figure: two weight profiles over the same concepts – a "politico" and a "movie buff" put high weight on different concepts]
Assume each user u has a different preference vector w_c(u) over concepts c
Goal: Learn the personal concept weights from user feedback
[Figure: the recommended headlines ("France intervenes", "Chuck for Defense", "Argo wins big") shown to a user, whose feedback will update the concept weights]
Multiplicative Weights algorithm:
▪ Assume each concept c has weight w_c
▪ We recommend document d and receive feedback, say r = +1 or −1
▪ Update the weights:
▪ For each c ∈ X_d set w_c = β^r · w_c
▪ If concept c appears in doc d and we received positive feedback (r = +1),
we increase the weight w_c by multiplying it by β (β > 1);
otherwise we decrease the weight (divide by β)
▪ Normalize the weights so that Σ_c w_c = 1
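A minimal sketch of this update rule in Python (β and the example weights are made up):

```python
BETA = 1.5  # beta > 1: how strongly one piece of feedback reshapes the weights

def mw_update(w, X_d, r):
    """Multiplicative-weights update after feedback r in {+1, -1} on a doc.

    w: dict concept -> weight; X_d: set of concepts the doc covers.
    """
    for c in X_d:
        w[c] *= BETA ** r          # multiply by beta if r = +1, divide if r = -1
    total = sum(w.values())
    return {c: wc / total for c, wc in w.items()}  # renormalize to sum to 1

w = {"France": 0.25, "Hagel": 0.25, "Argo": 0.25, "NFL": 0.25}
w = mw_update(w, {"Argo"}, +1)     # user liked a movie story
print(w)                           # Argo's weight rises; the rest shrink slightly
```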
Steps of the algorithm:
1. Identify the pool of items to recommend from
2. Identify concepts [what makes items redundant?]
3. Weigh concepts by general importance
4. Define the item-concept coverage function cover_d(c)
5. Select items using probabilistic set cover (greedy / Lazy Greedy)
6. Obtain user feedback, update the weights