Homework 1 Data
Problem 1
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder

trans_encode = TransactionEncoder()
fitted = trans_encode.fit(transactions).transform(transactions)
df = pd.DataFrame(fitted, columns=trans_encode.columns_)
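For reference, the frequent itemsets printed below appear to have been produced with mlxtend's apriori function; the following is a minimal sketch, where the 0.3 minimum-support threshold is an assumption inferred from the output:

from mlxtend.frequent_patterns import apriori

# Mine frequent itemsets from the one-hot encoded DataFrame (0.3 support threshold assumed)
freq_itemsets = apriori(df, min_support=0.3, use_colnames=True)
print(freq_itemsets)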
support itemsets
0 0.4 (Data Science)
1 0.5 (Introduction to AI)
2 0.6 (Machine Learning)
3 0.4 (Mathematics)
4 0.6 (Python)
5 0.3 (Machine Learning, Data Science)
6 0.3 (Python, Data Science)
7 0.3 (Machine Learning, Introduction to AI)
8 0.3 (Python, Introduction to AI)
9 0.4 (Machine Learning, Python)
10 0.3 (Machine Learning, Python, Introduction to AI)
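The rules table below matches the output format of mlxtend's association_rules; a minimal sketch, where the 0.6 minimum-confidence threshold is an assumption inferred from which rules appear:

from mlxtend.frequent_patterns import association_rules

# Generate rules from the frequent itemsets; the 0.6 confidence cutoff is assumed
rules = association_rules(freq_itemsets, metric="confidence", min_threshold=0.6)
print(rules)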
   antecedents                              consequents                   antecedent support  consequent support  support  confidence      lift  leverage  conviction  zhangs_metric
0  (Data Science)                           (Machine Learning)                           0.4                 0.6      0.3    0.750000  1.250000      0.06         1.6       0.333333
1  (Data Science)                           (Python)                                     0.4                 0.6      0.3    0.750000  1.250000      0.06         1.6       0.333333
2  (Introduction to AI)                     (Machine Learning)                           0.5                 0.6      0.3    0.600000  1.000000      0.00         1.0       0.000000
3  (Introduction to AI)                     (Python)                                     0.5                 0.6      0.3    0.600000  1.000000      0.00         1.0       0.000000
4  (Python)                                 (Machine Learning)                           0.6                 0.6      0.4    0.666667  1.111111      0.04         1.2       0.250000
5  (Machine Learning)                       (Python)                                     0.6                 0.6      0.4    0.666667  1.111111      0.04         1.2       0.250000
6  (Machine Learning, Python)               (Introduction to AI)                         0.4                 0.5      0.3    0.750000  1.500000      0.10         2.0       0.555556
7  (Machine Learning, Introduction to AI)   (Python)                                     0.3                 0.6      0.3    1.000000  1.666667      0.12         inf       0.571429
8  (Python, Introduction to AI)             (Machine Learning)                           0.3                 0.6      0.3    1.000000  1.666667      0.12         inf       0.571429
9  (Introduction to AI)                     (Machine Learning, Python)                   0.5                 0.4      0.3    0.600000  1.500000      0.10         1.5       0.666667
Some of the key strengths of the Apriori algorithm are its relative simplicity and readability: its level-wise candidate itemset generation and its use of support/confidence thresholds are straightforward and intuitive. Another strength is that it incorporates pruning to remove unlikely itemsets from consideration early on, which reduces the number of support calculations and speeds up computation.
On the other hand, a glaring disadvantage of this algorithm is that it needs multiple scans of the database, one for each candidate itemset size, in order to calculate supports and generate itemsets. For modern applications with enormous datasets, this leads to high computational cost and long runtimes.
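To make the candidate generation, pruning, and repeated database scans concrete, here is a small illustrative sketch of the level-wise Apriori procedure (a toy implementation written for this discussion, not the mlxtend code used above):

from itertools import combinations

def apriori_levelwise(transactions, min_support=0.3):
    """Toy level-wise Apriori: one full database scan per candidate size k."""
    transactions = [set(t) for t in transactions]
    n = len(transactions)
    # First scan: find frequent 1-itemsets
    items = {item for t in transactions for item in t}
    frequent = {frozenset([i]) for i in items
                if sum(i in t for t in transactions) / n >= min_support}
    all_frequent, k = set(frequent), 2
    while frequent:
        # Candidate generation: join pairs of frequent (k-1)-itemsets
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        # Pruning: drop candidates with an infrequent (k-1)-subset (downward closure)
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
        # Another full scan of the database to count candidate supports
        frequent = {c for c in candidates
                    if sum(c <= t for t in transactions) / n >= min_support}
        all_frequent |= frequent
        k += 1
    return all_frequent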
Problem 2
Compared to the Apriori algorithm, the FP-Growth algorithm's greatest strength is its computational speed. Because FP-Growth requires only two scans of the database (rather than one scan per candidate size) and its scanning cost grows only linearly with the number of transactions, it is significantly faster than the Apriori algorithm.
In terms of relative weaknesses, the FP-Growth algorithm requires a substantial amount of setup, since the FP-tree and the conditional FP-trees for each frequent item must be built before mining. These steps are relatively complex to implement and read compared to the simplicity of the Apriori algorithm.
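Since mlxtend also provides an fpgrowth function with the same interface as apriori, the comparison can be made directly on the Problem 1 DataFrame (df); a minimal sketch of the drop-in swap:

from mlxtend.frequent_patterns import apriori, fpgrowth

# Same interface, different algorithm: FP-Growth builds an FP-tree in two database
# scans and mines it recursively, instead of scanning once per candidate level.
freq_ap = apriori(df, min_support=0.3, use_colnames=True)
freq_fp = fpgrowth(df, min_support=0.3, use_colnames=True)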
Problem 3
# One-hot encode the Problem 3 transactions (t2) into a boolean DataFrame
te = TransactionEncoder()
te_ary = te.fit(t2).transform(t2)
df1 = pd.DataFrame(te_ary, columns=te.columns_)
support itemsets
1 0.7 (Python)
4 0.3 (Mathematics)
After generating the above frequent itemsets, we then apply the budget and timeslot constraints. Because no individual course costs more than 3800 dollars, the following single-item itemsets comply with the constraints: {Machine Learning}, {Python}, {Introduction to AI}, {Data Science}, and {Mathematics}. Next, we eliminate itemsets whose items' combined cost exceeds 3800 dollars, removing {Machine Learning, Introduction to AI} and {Machine Learning, Python, Introduction to AI}. This leaves {Machine Learning, Python}, {Python, Introduction to AI}, {Python, Data Science}, and {Machine Learning, Mathematics}. In short, the compliant frequent itemsets are the following (represented by their index numbers in the above chart): 0, 1, 2, 3, 4, 5, 7, 9, and 10.
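One way to apply the budget constraint is as a post-filter over the mined itemsets. In the sketch below, course_costs is a hypothetical placeholder dictionary; the real per-course prices come from the assignment data:

from mlxtend.frequent_patterns import apriori

# Placeholder prices (all set to 1000 here) -- substitute the actual course costs
course_costs = dict.fromkeys(df1.columns, 1000)
budget = 3800

freq = apriori(df1, min_support=0.3, use_colnames=True)
within_budget = freq[freq["itemsets"].apply(
    lambda s: sum(course_costs[c] for c in s) <= budget)]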
The budget constraint is in fact anti-monotone: for any itemset that satisfies the budget constraint, all of its subsets also satisfy it. This is useful to consider because anti-monotone constraints can be used very handily in the pruning process for frequent itemset mining. If an itemset violates a given anti-monotone constraint, then we can prune away all of its supersets, which decreases computational cost and runtime (there are fewer candidates to consider).
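A sketch of how this anti-monotone check could be pushed inside the level-wise loop from Problem 1 (fits_budget and course_costs are hypothetical names, not part of the assignment code):

def fits_budget(itemset, costs, budget=3800):
    # Anti-monotone check: if an itemset fails this, every superset fails it too
    return sum(costs[c] for c in itemset) <= budget

# Placed right after candidate generation, this discards over-budget candidates
# before any support counting is done for them or for their supersets:
#     candidates = {c for c in candidates if fits_budget(c, course_costs)}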
One monotone constraint that could be applied to this dataset is a budget minimum of 3000 dollars (this situation could arise if an individual had to spend over a certain amount to receive government or company financial aid). This constraint is monotone because if any itemset violates it, all of that itemset's subsets violate it as well; equivalently, once an itemset satisfies the minimum, every superset also satisfies it. For a similar reason as above, this constraint is useful to consider: once an itemset is known to satisfy it, the constraint never needs to be re-checked for any of its supersets, which eliminates superfluous checks during mining.
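A matching sketch for the monotone minimum-spend constraint (again with hypothetical names); the key difference is when the check can be skipped rather than when candidates can be pruned:

def meets_minimum(itemset, costs, minimum=3000):
    # Monotone check: once an itemset satisfies this, every superset satisfies it too
    return sum(costs[c] for c in itemset) >= minimum

# During mining, supersets of an itemset that already meets the minimum never need
# to be re-checked; only itemsets still below 3000 dollars require further testing.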