Preliminaries
Data Mining
The art of extracting knowledge from large
bodies of structured data.
What does the process look like?
Data Mining Tasks
Descriptive Tasks
1. Here, the objective is to derive patterns (correlations, trends,
clusters, trajectories, and anomalies) that are able to summarize
the underlying relationships in data. Descriptive data mining
tasks are often exploratory in nature and frequently require
postprocessing techniques to validate and explain the results.
Predictive Tasks
2. The objective of these tasks is to predict the value of a particular
attribute based on the values of other attributes. The attribute
to be predicted is commonly known as the target or dependent
variable, while the attributes used for making the prediction
are known as the explanatory or independent variables.
Data Mining Tasks
• Anomaly Detection: the task of detecting unusual deviations.
• Association Analysis: the task of discovering patterns that
describe relationships.
• Clustering: the task of discovering groups and structures.
• Classification: the task of assigning (discrete) target variables
to one of several predefined categories.
• Regression: the task of finding a function that models
(continuous) target variables.
• Collaborative Filtering: the task of filtering patterns for an
unknown user based on patterns for known users.
Defining a Data Mining Task
• Generate a problem statement.
• Utilize background knowledge.
• Posit the right question.
• Understand the data.
• Implement one or more modeling approaches.
• Identify performance measurement criteria.
• Interpret the model(s).
• Visualize and present the results.
The Goal
Using the knowledge discovery process
to turn data into knowledge.
Knowledge Discovery Process
Raw Data → Data Understanding → Data Preprocessing → Modeling → Validation & Interpretation → Knowledge
Let’s look at what we mean by data.
What’s in Data?
What’s in Data?
• Instance:
– Thing to be classified, associated, or clustered.
– Individual, independent example of target concept.
– Characterized by a predetermined set of features or
attributes.
• Input to learning scheme: set of instances
– Usually represented as a single relation.
– Traditional form for data mining.
– Advanced methods now exist for relational data.
What’s in an Instance?
• Each instance is described by a set of “features.”
• A feature is a property or characteristic of an instance.
• A feature can take several values (feature values).
– Can be categorical (nominal or ordinal)
– Can be numeric (interval or ratio)
• Features can be discrete or continuous.
Discrete Features
• Qualitative features.
– Enough information to distinguish one object from another.
• Take only a reasonably small set of values.
– Rule of thumb: you can count them on your fingers.
• Often represented as integer variables.
– For example: 0 for red; 1 for blue; etc.
• Note: binary attributes are a special case of discrete
features.
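As a minimal sketch (the color values and integer codes here are hypothetical), a nominal feature can be mapped to integers like this:

```python
# Hypothetical integer codes for a nominal "color" feature.
color_codes = {"red": 0, "blue": 1, "green": 2}

def encode(values, codes):
    """Map each nominal feature value to its integer code."""
    return [codes[v] for v in values]

print(encode(["red", "blue", "red"], color_codes))  # [0, 1, 0]
```

The integers are labels only: arithmetic on them (e.g., red + blue) is meaningless, which is exactly what distinguishes nominal features from numeric ones.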
Continuous Features
• Quantitative features; most numeric properties hold.
• Can be integer or real number.
• Examples: temperature, height, weight, age, counts.
• Practically, real values can only be measured and represented
using a finite number of digits.
Supervised Learning
• Data in the form of tuples or input-output pairs (𝑥𝑖 , 𝑦𝑖 ) that
come from a deterministic mapping 𝑋 → 𝑌.
∀𝑥𝑖 ∈ 𝑋, 𝑥 ≡ (𝑥1 , 𝑥2 , … , 𝑥𝑛 )
∀𝑦𝑖 ∈ 𝑌 (the set of classes/concepts 𝑐(𝑥))
– An example of “supervised” learning.
• Develop an approximation to the mapping that is “consistent”
with the data and “not too complex.”
– Learn a function: the hypothesis.
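As a sketch with made-up data, supervised data is a set of input-output pairs, and a hypothesis is any function whose outputs we can compare against the known 𝑦 values:

```python
# Hypothetical input-output pairs (x_i, y_i): x is a feature tuple, y the label.
pairs = [((1, 0), 1), ((0, 1), 0), ((1, 1), 1), ((0, 0), 0)]

def h(x):
    """A candidate hypothesis: predict the label from the first feature."""
    return x[0]

# Fraction of pairs on which h agrees with the true output.
accuracy = sum(h(x) == y for x, y in pairs) / len(pairs)
print(accuracy)  # 1.0
```

Here the simple hypothesis happens to be fully consistent with the data; on real data we settle for an approximation that is consistent enough without being too complex.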
Instances, Features, and Classes
[Schematic: a table with instances as rows; the input features 𝑥1 , 𝑥2 , … , 𝑥𝑚 form the columns of 𝑋, and the class 𝑦 is the final column 𝑌.]
Instances, Features, and Classes
Make Cylinders Length Weight Style
Honda Four 150 1956 Hatchback
Toyota Four 167.9 2280 Wagon
BMW Six 176.8 2765 Sedan
Given car make, cylinders, length, and weight,
learn a function for the body style.
Instances, Features, and Classes
Temperature Wind Speed Decision
80° Low Bike Day
40° Low Couch Day
60° Medium Couch Day
80° High Bike Day
Go out and bike or laze on the couch.
Features and Classes
1. If wind-speed = high, then Bike Day.
2. If wind-speed = medium, then Couch Day.
3. If wind-speed = low and temp ≤ 40, then Couch Day.
4. If wind-speed = low and temp > 40, then Bike Day.
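Rules 1–4 above can be sketched as a small rule-based classifier (a hypothetical encoding; the rules are checked in order):

```python
def decide(wind_speed, temp):
    """Hypothetical encoding of rules 1-4 above, checked in order."""
    if wind_speed == "high":
        return "Bike Day"
    if wind_speed == "medium":
        return "Couch Day"
    if temp <= 40:            # wind_speed == "low"
        return "Couch Day"
    return "Bike Day"         # wind_speed == "low" and temp > 40

print(decide("low", 80))  # Bike Day
```

Note that the learned function reproduces every row of the training table: it is consistent with the data.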
Now Consider this Problem
• An advertising company wants to group customers based on
similarities. There are no predefined labels for these groups;
based on groups formed from demographics and past buying
behavior, the company will run targeted marketing and
advertising initiatives.
• What is this?
– An example of unsupervised learning.
– No predefinition of groups, a.k.a. classes.
– Find similarities in data based on features.
– This is a simplified view of clustering.
Unsupervised Learning
• Given data points 𝑋:
– Develop a model or representation of the data such that
“important” structure or “regularities” (and irregularities)
are captured.
– Organize instances into groups that share similar features.
– The model can be, for example, a probability distribution
estimate 𝑝(𝑥) over the entire 𝑋.
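One simple way to capture such groups is a toy 1-D k-means sketch (the data points and starting centers below are made up):

```python
def kmeans_1d(points, centers, iters=10):
    """A toy 1-D k-means sketch: alternate assignment and mean-update steps."""
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            # Assign each point to the index of its nearest center.
            i = min(range(len(centers)), key=lambda j: abs(p - centers[j]))
            clusters[i].append(p)
        # Move each center to the mean of its assigned points.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

# Two hypothetical groups along one numeric feature (e.g., spend level).
print(kmeans_1d([1.0, 1.2, 0.8, 9.0, 9.5, 10.1], [0.0, 5.0]))
```

No labels are used anywhere: the structure (two groups near 1 and near 9.5) is discovered purely from similarity between instances.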
Examples of Kinds of Data
• Financial transactions
• Genetic sequence data
• Documents
• WWW
• Molecular structures
• Medical data
• Geographical data
Knowledge Discovery Process
Raw Data → Data Understanding → Data Preprocessing → Modeling → Validation & Interpretation → Knowledge
So that’s data. The process also often
involves the concept of learning.
What is Machine Learning?
Definition of Machine Learning
“A computer program is said to learn from experience E
with respect to some class of tasks T and performance
measure P, if its performance at tasks T, as measured by P,
improves with experience E.”
—Tom Mitchell/Machine Learning
• Improve over class of tasks 𝑇
• With respect to performance measure 𝑃
• Based on experience 𝐸
We need to be able to formulate 𝑇, 𝑃, and 𝐸.
Learning to Play Checkers
• Task: Playing checkers.
• Performance: Percent of games won in world tournament.
• What training experience?
• What exactly should be learned? (Target function)
• How to represent the target function?
• What specific algorithm to learn it?
Formalizing the Learning Task
• Training experience → training data.
• Task → target function required.
– What is the outcome or what is to be predicted?
• Identify the objective or learning function required to fit the
data.
– For example, rules or decision trees.
• Performance measurement criteria → evaluate the learning
function on the testing data.
• How accurate is the function?
Concept Learning
• Acquire general concepts from a set of training examples.
• A concept can describe some objects or events.
– People continually attach “descriptions” to objects or events.
• I will enjoy sports today if the sky is sunny, air temperature is warm, humidity
is normal, and wind is not strong.
• Typically, inferring a boolean-valued function.
– True or false.
• Definition of concept learning: approximate a boolean-valued
function from training examples.
The Weather Problem
Outlook Temperature Humidity Windy Play
sunny 85° 85 false no
sunny 80° 90 true no
overcast 83° 86 false yes
rainy 70° 96 false yes
Given past data, can you come up with the
rules for determining the value of Play?
Formalizing Concept Learning
• Given:
– Instances 𝑋: Possible days, each described by the attributes Outlook,
Temperature, Humidity, Windy.
– Target function 𝑐: PlayGolf: 𝑋 → {0, 1}.
– Hypothesis set 𝐻: Conjunctions of attributes.
– Training examples 𝐷: Positive and negative examples of the target
function: ⟨𝑥1 , 𝑐(𝑥1 )⟩, … , ⟨𝑥𝑛 , 𝑐(𝑥𝑛 )⟩.
• Determine a hypothesis ℎ in 𝐻 such that ℎ(𝑥) = 𝑐(𝑥) for all 𝑥 in 𝐷.
Consistency
“A hypothesis ℎ is consistent with a set of training
examples 𝐷 of target concept 𝑐 if and only if ℎ(𝑥) =
𝑐(𝑥) for each training example ⟨𝑥, 𝑐(𝑥)⟩ in 𝐷.”
𝐶𝑜𝑛𝑠𝑖𝑠𝑡𝑒𝑛𝑡(ℎ, 𝐷) ≡ ∀⟨𝑥, 𝑐(𝑥)⟩ ∈ 𝐷: ℎ(𝑥) = 𝑐(𝑥)
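The definition of Consistent(h, D) translates directly into code; a minimal sketch with a made-up training set:

```python
def consistent(h, D):
    """Consistent(h, D): h(x) == c(x) for every example (x, c(x)) in D."""
    return all(h(x) == c_x for x, c_x in D)

# Hypothetical training set: instances are numbers, concept is "less than 3".
D = [(1, True), (2, True), (3, False)]
print(consistent(lambda x: x < 3, D))  # True
print(consistent(lambda x: x < 2, D))  # False (wrong on x = 2)
```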
The Weather Problem
Conditions for playing sport:
Outlook Temperature Humidity Windy Play
sunny hot high false no
sunny hot high true no
overcast hot high false yes
rainy mild normal false yes
. . . . .
. . . . .
If Outlook = sunny and Humidity = high, then Play = no.
If Outlook = rainy and Windy = true, then Play = no.
If Outlook = overcast, then Play = yes.
If Humidity = normal, then Play = yes.
If none of the above, then Play = yes.
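The rule set above can be sketched as an ordered sequence of checks (a hypothetical encoding, evaluated top to bottom):

```python
def play(outlook, humidity, windy):
    """Hypothetical encoding of the rules above, checked top to bottom."""
    if outlook == "sunny" and humidity == "high":
        return "no"
    if outlook == "rainy" and windy:
        return "no"
    if outlook == "overcast":
        return "yes"
    if humidity == "normal":
        return "yes"
    return "yes"  # none of the above

print(play("sunny", "high", False))  # no
```

Rule order matters: a rainy, windy, normal-humidity day must hit the windy rule ("no") before the humidity rule ("yes").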
The Weather Problem (Mixed Features)
Conditions for playing sport (with some numeric attributes):
Outlook Temperature Humidity Windy Play
sunny 85° 85 false no
sunny 80° 90 true no
overcast 83° 86 false yes
rainy 70° 96 false yes
. . . . .
. . . . .
If Outlook = sunny and Humidity > 83, then Play = no.
If Outlook = rainy and Windy = true, then Play = no.
If Outlook = overcast, then Play = yes.
If Humidity < 85, then Play = yes.
If none of the above, then Play = yes.
The Premise of Learning
• Given a training set, the (concept) learning algorithm has to
estimate the function 𝑓 or hypothesize about it.
• We have just formed a premise behind inductive learning.
– What is inductive learning?
Inductive Learning Hypothesis
“Any hypothesis found to approximate the target
function well over a large set of training examples will
also approximate the target function well over other
unobserved examples.”
—Tom Mitchell/Machine Learning
Inductive Learning Hypothesis
• In other words:
– Given a training set (known information), at best we can build (induce) a
hypothesis around it.
– Think about studying for an exam.
– Different learners find different solutions because each has a different “inductive” bias.
Inductive Learning Hypothesis
[Diagram: a set of training examples feeds the learning algorithm, which produces a hypothesis ℎ; the hypothesis is then applied to classify a new (test) instance.]
Use training instances to formulate or induce or discover a theory.
Go from specific to general.
Types of Learning
• Supervised learning
– Given input vectors 𝑋 and their labels 𝑐(𝑥), predict 𝑐 on
future unseen 𝑥’s.
– e.g., classification, regression
• Unsupervised learning
– Given 𝑋, automatically discover the structure, representations,
etc.
– e.g., clustering
Supervised Learning
• Classification
– The predictions or outputs 𝑐(𝑥) are categorical, while 𝑥 can
take any set of values (real or categorical). The goal is to select
the correct class for a new instance.
• Regression
– Given the value of an input 𝑋, the output 𝑌, denoted by 𝑦,
belongs to the set of real values ℝ. The goal is to predict the
output accurately for new inputs.
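A minimal regression sketch on made-up 1-D data: fit y = a·x + b using the closed-form least-squares formulas (sample means, then slope and intercept):

```python
# Hypothetical training points, roughly following y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 4.0, 6.2, 7.9]

n = len(xs)
mx = sum(xs) / n                    # mean of inputs
my = sum(ys) / n                    # mean of outputs
# Slope: covariance of x and y divided by variance of x.
a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
    sum((x - mx) ** 2 for x in xs)
b = my - a * mx                     # intercept through the means
print(a, b)
```

The fitted line can then predict a real-valued output for any new input, e.g., a * 5.0 + b.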
Supervised Learning
• Time series prediction
• Data is in the form of a moving time series. The goal is to
perform classification/regression on future time series based
on data known so far.
Classification
• Find ways to separate data items into pre-defined groups.
– We know 𝑋 and 𝑌 belong together; find other items in the
same group.
• Requires “training data”: data items where group is known.
Unsupervised Learning
• Anomaly detection
• 𝑋 can be anything; the goal is to detect deviations from
normal behavior.
• Association rules
• Find joint values of the variables 𝑋 = (𝑥1 , 𝑥2 , … , 𝑥𝑛 ) that
tend to appear more frequently in the database.
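One common way to flag unusual deviations is a z-score test; a sketch with a hypothetical threshold of 3 standard deviations:

```python
def anomalies(xs, k=3.0):
    """Return values more than k standard deviations from the mean."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    std = var ** 0.5
    return [x for x in xs if abs(x - mean) > k * std]

# Twenty normal readings and one outlier.
print(anomalies([10] * 20 + [100]))  # [100]
```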
Unsupervised Learning
• Clustering
• 𝑋 is provided; 𝑐(𝑥) or 𝑌 is unknown. Group a collection of
objects into “clusters” such that objects within a cluster are
more closely related than those in different clusters.
• Density estimation
• Describing your data.
Anomaly Detection
• Find unexpected values and outliers.
Association Rules
• Identify dependencies in the data.
– 𝑋 makes 𝑌 likely
• Indicate significance of each dependency.
Clustering
• Find groups of similar data items.
• Statistical techniques require some definition of “distance”
(e.g., between travel profiles) while conceptual techniques use
background concepts and logical descriptions.
Inductive Learning Bias
• Consider:
– Concept learning algorithm 𝐿.
– Instances 𝑋, target concept 𝑐.
– Training examples 𝐷𝑐 = {⟨𝑥, 𝑐(𝑥)⟩}.
– Let 𝐿(𝑥𝑖 , 𝐷𝑐 ) denote the classification assigned to the instance 𝑥𝑖 by 𝐿 after
training on data 𝐷𝑐 .
– The inductive bias of 𝐿 is any minimal set of assertions 𝐵 such that for any
target concept 𝑐 and corresponding training examples 𝐷𝑐 :
∀𝑥𝑖 ∈ 𝑋: (𝐵 ∧ 𝐷𝑐 ∧ 𝑥𝑖 ) ⊢ 𝐿(𝑥𝑖 , 𝐷𝑐 )
The Simpler the Better
• If there are two hypotheses describing the data, one complex
and one simple, which one should we take?
• William of Occam in the year 1320 said, “Prefer the simplest
hypothesis that fits the data.”
– “One should not increase, beyond what is necessary, the number of
entities required to explain anything.”
– There is solid machine learning theory behind it.
• Remember our set of hypotheses 𝐻.
– Given a set of data, there are multiple ways to model it: a set 𝐻.
– Choose the simplest ℎ in 𝐻 that fits the data.
Inductive Learning Issues
• Along the way, we also learned that the learning algorithm
should be able to generalize well.
• The algorithm should induce knowledge of a domain from the
given training examples, rather than merely and completely
“overfitting” the training examples.
Overfitting
• Overfitting means doing very well on training data but poorly
on the test data (unless the test data exactly mimics the training data).
– Counterintuitive?
– Think about preparing for an exam. What is better: rote memorization or
understanding a concept?
– The learner failed to induce a function that generalizes well.
• Generalization is often a tenet of machine learning.
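A rote-memorizing “learner” illustrates the exam analogy on made-up data: perfect on the training set, useless on unseen inputs:

```python
# Hypothetical labeled data split into training and test sets.
train = {1: "a", 2: "b", 3: "a"}
test = {4: "a", 5: "b"}

def memorizer(x):
    """A lookup-table learner: memorizes training pairs, induces nothing."""
    return train.get(x, "unknown")

train_acc = sum(memorizer(x) == y for x, y in train.items()) / len(train)
test_acc = sum(memorizer(x) == y for x, y in test.items()) / len(test)
print(train_acc, test_acc)  # 1.0 0.0
```

The gap between training accuracy (1.0) and test accuracy (0.0) is the signature of overfitting.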
Generalization Behavior
Summarizing the Basics of Learning
• Define the “domain” (training data and testing data): What are
we trying to learn? What is our data source? What
characteristics of the data are relevant?
• Define the “learner.” Select what is appropriate (no free lunch).
• Do we have any prior knowledge about the domain?
• Define the objective function, Φ. Adjust the learning to
minimize the objective function.
• How do we define success? What is the output? How does the
learner demonstrate that it has learned something? What
decisions can I take based on what the learner tells me?
Summarizing the “Preliminaries”
Summarizing the “Preliminaries”
• Data are composed of instances.
• Each instance is described by a set of features.
• Sets of instances are used to perform a task.
• Data mining tasks can be descriptive or predictive.
• Often, these data mining tasks involve learning.
• Learning involves a tradeoff of bias and variance.