R Lect1 Introduction

This document introduces R programming for data science. It discusses how data science involves analyzing large volumes of diverse data like images, text, and sensor data to find patterns. These patterns are used to create models that can predict and describe data. The document then outlines the key steps in knowledge discovery in databases (KDD): data collection, preprocessing, transformation, data mining to generate models, and interpretation. Finally, it discusses common data mining methods like classification, regression, clustering, association analysis, and visualization that are used to extract patterns from data.

Uploaded by

Aakash Raj

R PROGRAMMING FOR DATA SCIENCES
Dr. Athira B
Lect1 - Introduction
Data Science
• “Analyze a huge volume of data about a
specific problem with the purpose of creating
patterns in scientific fields like Statistics,
Machine Learning and Pattern Recognition”.
• These patterns take multiple forms, such as
associations, anomalies, clusters and classes.
• These patterns are also termed models.
• Data generation: shopping cart data, medical
records, social media announcements,
banking and stock market operations and so
on.
• The data come in a wide variety of types: images,
videos, real-time data, DNA sequences (Big Data)
• Methods and tools used: Hadoop, MapReduce,
Hive, MongoDB, GraphDB
• The two main goals of practical Data Science
are to create models that predict data and
models that describe data.
KNOWLEDGE DISCOVERY IN DATABASES (KDD)

• 5 steps: Data Collection, Preprocessing,
Transformation, Data Mining,
Interpretation/Evaluation
1. DATA COLLECTION
– automatically, e.g. by using sensors, or manually,
e.g. via a questionnaire
2. PREPROCESSING
– cleansing data: handling defective, false or missing
data.
3. TRANSFORMATION
– converting data into a common format so that they
can be processed later. It is mostly used for smoothing
data and removing noise.
4. DATA MINING
– an algorithm is used for model generation. Clean
and transformed data are now ready to be used
by an algorithm in order to create a model,
usually for categorization or prediction.
5. INTERPRETATION AND EVALUATION
– interpret and evaluate the results
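A minimal sketch of steps 2 and 3 above, written in Python for illustration. The sensor readings, the choice to drop missing values, and the 3-point moving average are all assumptions made for this example; they are only one of many ways to cleanse and smooth data.

```python
from statistics import mean

# Step 1 (data collection): hypothetical sensor readings; None marks a missing value.
raw_readings = [21.0, 21.4, None, 22.1, 35.0, 22.3, None, 22.8]

def preprocess(readings):
    """Step 2: cleanse the data by dropping missing values (one simple strategy)."""
    return [r for r in readings if r is not None]

def smooth(values, window=3):
    """Step 3: smooth the data with a moving average to reduce noise."""
    return [mean(values[i:i + window]) for i in range(len(values) - window + 1)]

clean = preprocess(raw_readings)
smoothed = smooth(clean)
print(len(raw_readings), len(clean), len(smoothed))
```

The smoothed list is what a data mining algorithm (step 4) would receive as input.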
• Examples
• By using data from older recorded temperatures during the
summer season of the previous 15 years, we try to predict
the temperatures for the summer season of the next 15
years.
• Telecommunication companies not only reward clients who
spend lots of money but also clients named as “guides”.
• After 9/11, Bill Clinton announced that after examining lots
of databases, FBI agents discovered that 5 of the
perpetrators were registered in these databases. One of
them owned 30 credit cards with a negative balance of
$250,000 and had lived in the US for less than two years.
• Finding a phone number from a phonebook
• Finding information about Paris on the
internet
• Finding the average of exams grades
• Searching for the medical records of a
patient with a particular disease, in order to
further analyze them.
Data Mining Methods
• Depending on the data types and the type of knowledge
extracted, data mining methods are classified into different categories.
1. CLASSIFICATION:
– a predictive method – ‘classifier’
• In classification, the outcome we want to predict is the
class of the samples.
• A class can have discrete values from a finite set.
• In contrast, when predicting with methods like
regression, the target variable can be any real number.
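To illustrate that a classifier predicts a value from a finite set of classes, here is a minimal Python sketch of a 1-nearest-neighbour classifier. The technique, the two features and the class labels are assumptions chosen for this example; the slides do not prescribe a particular classifier.

```python
# Hypothetical training samples: (height_cm, weight_kg) -> class label.
training = [
    ((150, 50), "small"),
    ((160, 55), "small"),
    ((180, 85), "large"),
    ((190, 95), "large"),
]

def squared_distance(a, b):
    """Squared Euclidean distance between two feature tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def classify(sample):
    """Return the class of the closest training sample (1-nearest-neighbour)."""
    _, label = min(training, key=lambda item: squared_distance(item[0], sample))
    return label

print(classify((155, 52)))
print(classify((184, 88)))
```

Whatever the input, the result is always one of the labels seen in training, never an arbitrary number.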
2. REGRESSION
• Regression is a process similar to classification,
whose goal is to learn (that is, train) a
function.
• It is also a predictive method.
• By using some independent variables its goal
is to predict the values of a dependent
variable.
• The variables in this example are the square meters of a house and the
selling price in thousands of dollars.
• Linear regression fits a line to the samples of the dataset
• With the optimal line we can then answer, fairly accurately,
questions like: “What is the selling price of a 150-square-meter house?”.
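The line fit described above can be sketched with ordinary least squares. The house sizes and prices below are made-up values for illustration only, and the Python code is a sketch rather than the method used in the lecture.

```python
# Hypothetical data: house size (square meters) and selling price (thousands of dollars).
sizes = [60, 80, 100, 120, 200]
prices = [130, 165, 210, 235, 410]

def fit_line(xs, ys):
    """Ordinary least squares for one independent variable: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = fit_line(sizes, prices)

def predict(size):
    """Estimated selling price (thousands of dollars) for a house of the given size."""
    return intercept + slope * size

print(round(predict(150), 1))
```

Here the size is the independent variable and the price is the dependent variable, so `predict(150)` answers the question posed on the slide for this toy dataset.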
3. CLUSTERING
• Clustering is a descriptive method.
• Given a dataset, the goal of clustering is to
create clusters (groups with the same or
similar features).
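As one possible illustration of clustering, the following Python sketch runs a simple k-means procedure on one-dimensional data. The algorithm choice, the ages and the starting centers are assumptions for this example.

```python
from statistics import mean

def k_means_1d(values, centers, iterations=10):
    """Assign each value to its nearest center, then move each center to its group's mean."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Keep the old center if a cluster ends up empty.
        centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
    return clusters, centers

# Hypothetical customer ages that form two visible groups.
ages = [18, 20, 22, 24, 58, 60, 62, 65]
clusters, centers = k_means_1d(ages, centers=[18, 65])
print(clusters)
print(centers)
```

No class labels are given in advance: the groups emerge from the similarity of the values alone, which is what makes the method descriptive rather than predictive.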
4. EXTRACTION AND ASSOCIATION ANALYSIS
• Association rules uncover hidden relationships
between features of a dataset.
• A classic example of association rules in practice is
the analysis of a shopping cart in a supermarket, where
the data are customers' transactions.
• E.g., some transactions: {bread, milk}, {bread, diapers, beer,
eggs}, {milk, diapers, beer, soda}, {bread, milk, diapers, beer}
and {bread, milk, diapers, soda}.
• It is quite possible that whoever buys milk and bread might
also buy eggs and soda.
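How strong a candidate rule is can be measured on the five transactions above with support and confidence, the two standard measures in association analysis. The Python sketch below is illustrative; the function names and the rules that are checked are choices made for this example.

```python
# The five shopping-cart transactions listed above.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "soda"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "soda"},
]

def count(itemset):
    """Number of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions)

def support(itemset):
    """Fraction of all transactions that contain itemset."""
    return count(itemset) / len(transactions)

def confidence(antecedent, consequent):
    """Of the transactions containing antecedent, the fraction that also contain consequent."""
    return count(antecedent | consequent) / count(antecedent)

print(support({"diapers", "beer"}))
print(confidence({"diapers"}, {"beer"}))
print(confidence({"bread", "milk"}, {"diapers"}))
```

A rule is usually kept only if both its support and its confidence exceed thresholds chosen by the analyst.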
5. VISUALIZATION
• Data visualization helps in better understanding
not only the data themselves but also correlations
that might occur between them.
6. ANOMALY DETECTION
• Anomaly detection focuses on finding deviations in
data relative to similar data collected in the past
or to typical values of these data
• Some other examples of anomaly detection are the
following:
• Fraud detection based on a user profile
• Finding dysfunctional objects in industrial production
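One simple way to find deviations from typical past values is a standard-deviation rule. The Python sketch below assumes a hypothetical user spending profile and a threshold of 3 standard deviations; both are illustrative choices, not part of the lecture.

```python
from statistics import mean, stdev

def find_anomalies(history, new_values, threshold=3.0):
    """Flag new values more than `threshold` standard deviations from the historical mean."""
    mu = mean(history)
    sigma = stdev(history)
    return [v for v in new_values if abs(v - mu) > threshold * sigma]

# Hypothetical user profile: past daily card spending, then new amounts to check.
past_spending = [42, 38, 45, 40, 44, 39, 41, 43]
new_spending = [41, 46, 250]
print(find_anomalies(past_spending, new_spending))
```

Only the amount far outside the user's usual range is flagged; the other new amounts stay within the typical spread of the past data.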
