R PROGRAMMING FOR DATA
SCIENCES
Dr. Athira B
Lect1 - Introduction
Data Science
• “Analyze a huge volume of data about a
specific problem with the purpose of creating
patterns in scientific fields like Statistics,
Machine Learning and Pattern Recognition”.
• These patterns take many forms, such as
associations, anomalies, clusters and classes.
• These patterns are also termed models.
• Data are generated from many sources:
shopping cart data, medical records, social
media posts, banking and stock market
operations, and so on.
• Data come in a wide variety of types (images,
videos, real-time data, DNA sequences) – Big Data
• Methods and tools used: Hadoop, MapReduce,
Hive, MongoDB, GraphDB
• The main goal of practical Data Science is to
create models that can be used for two
purposes: predicting and describing data.
KNOWLEDGE DISCOVERY IN DATABASES (KDD)
• 5 steps: Data Collection, Preprocessing,
Transformation, Data Mining,
Interpretation/Evaluation
1. DATA COLLECTION
– automatically, e.g. by using sensors, or manually,
e.g. via a questionnaire
2. PREPROCESSING
– cleansing data: handling defective, false or missing
data.
3. TRANSFORMATION
– converting data into a common format so that they
can be processed later. It is mostly used for
smoothing data and removing noise.
4. DATA MINING
– an algorithm is used for model generation. Clean
and transformed data are now ready to be used
by an algorithm in order to create a model,
usually for categorization or prediction.
5. INTERPRETATION AND EVALUATION
– interpret and evaluate the results
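The five steps above can be sketched as a tiny pipeline. This is only an illustrative sketch, not a prescribed implementation: the sensor readings are hypothetical, dropping missing values is just one possible cleansing strategy, the moving average is one possible smoothing method, and the "model" is deliberately trivial (it predicts the mean).

```python
from statistics import mean

# Step 1, data collection: hypothetical sensor readings (None = missing value).
raw_readings = [21.0, 21.4, None, 22.1, 35.0, 22.3, 22.8, None, 23.1, 23.4]

def preprocess(readings):
    """Step 2: cleanse the data by dropping missing values."""
    return [r for r in readings if r is not None]

def smooth(values, window=3):
    """Step 3: centered moving average to damp noise."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(mean(values[lo:hi]))
    return smoothed

def fit_mean_model(values):
    """Step 4: a deliberately trivial 'model' that always predicts the mean."""
    return mean(values)

def evaluate(model, values):
    """Step 5: mean absolute error of the model on the data."""
    return mean(abs(v - model) for v in values)

clean = preprocess(raw_readings)
transformed = smooth(clean)
model = fit_mean_model(transformed)
error = evaluate(model, transformed)
print(f"model={model:.2f}, mean absolute error={error:.2f}")
```

In a real project each function would be replaced by a proper method; the point here is only the order in which the steps feed into each other.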
• Examples
• Using temperatures recorded during the summer seasons
of the previous 15 years, we try to predict the
temperatures for the summer seasons of the next 15
years.
• Telecommunication companies not only reward clients who
spend lots of money but also clients named as “guides”.
• After 9/11, Bill Clinton announced that, after examining many
databases, FBI agents discovered that 5 of the
perpetrators were registered in those databases. One of
them owned 30 credit cards with a negative balance of
$250,000 and had lived in the US for less than two years.
• Finding a phone number from a phonebook
• Finding information about Paris on the
internet
• Finding the average of exams grades
• Searching for the medical records of a
patient with a particular disease, in order to
analyze those records further.
Data Mining Methods
• Depending on the data types and the type of knowledge
extracted, they are classified in different categories.
1. CLASSIFICATION:
– a predictive method – ‘classifier’
• In classification, the outcome we want to predict is the
class of the samples.
• A class can have discrete values from a finite set.
• In contrast, when predicting with methods like
regression, the target variable can be any real
number.
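As a rough illustration of a classifier that predicts a class from a finite set, here is a minimal 1-nearest-neighbour sketch. The algorithm choice, the features (height and weight) and the labels are all assumptions made for the example; the slides do not name a specific classifier.

```python
import math

# Hypothetical training samples: (height_cm, weight_kg) -> class label.
training = [
    ((150.0, 50.0), "small"),
    ((155.0, 55.0), "small"),
    ((180.0, 85.0), "large"),
    ((185.0, 90.0), "large"),
]

def classify(sample):
    """Return the label of the closest training sample (1-nearest-neighbour)."""
    _, nearest_label = min(
        training, key=lambda item: math.dist(item[0], sample)
    )
    return nearest_label

print(classify((152.0, 52.0)))
print(classify((182.0, 88.0)))
```

The predicted value is always one of the labels seen in training, which is what distinguishes classification from regression.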
2. REGRESSION
• Regression is a process similar to
classification, whose goal is to learn (i.e.
train) a function.
• It is also a predictive method.
• By using some independent variables its goal
is to predict the values of a dependent
variable.
• In a typical example, the variables are the area of a house in square
meters and its selling price in thousands of dollars.
• Linear regression fits a line to the samples of the dataset.
• Given the best-fitting line, we can estimate answers to questions
like: “What is the selling price of a 150-square-meter house?”.
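The house-price example can be sketched with ordinary least squares, written out by hand. The areas and prices below are made-up numbers used only for illustration; they are not data from the lecture.

```python
# Hypothetical dataset: house area (square meters) and price (thousands of dollars).
areas = [60.0, 80.0, 100.0, 120.0, 180.0, 200.0]
prices = [125.0, 158.0, 205.0, 236.0, 362.0, 398.0]

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return intercept, slope

intercept, slope = fit_line(areas, prices)

def predict(area):
    """Estimate the price (thousands of dollars) for a given area."""
    return intercept + slope * area

print(f"Estimated price for 150 square meters: {predict(150.0):.1f} thousand dollars")
```

Here the area is the independent variable and the price is the dependent variable; the fitted line can return any real number, unlike a classifier.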
3. CLUSTERING
• Clustering is a descriptive method.
• Given a dataset, the goal of clustering is to
create clusters (groups with the same or
similar features).
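One common way to form such clusters is k-means, sketched below in a naive form. The choice of k-means, the 2-D points and the initial centroids are assumptions for illustration only; the slides do not prescribe a particular clustering algorithm.

```python
import math

def k_means(points, centroids, iterations=10):
    """Naive k-means: alternate assignment and centroid-update steps."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: math.dist(p, centroids[i]),
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (keep the old centroid if the cluster is empty).
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return clusters, centroids

# Hypothetical 2-D samples forming two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
clusters, centroids = k_means(points, centroids=[points[0], points[3]])
for cluster in clusters:
    print(cluster)
```

No class labels are given to the algorithm; the groups emerge only from the similarity of the samples, which is why clustering is called descriptive.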
4. EXTRACTION AND ASSOCIATION ANALYSIS
• Association rules capture hidden relationships
between features of a dataset.
• A classic example of association rules in practice is the
analysis of shopping carts in a supermarket, where the
data are customers' transactions.
• Eg: some transactions - {bread, milk}, {bread, diapers, beer,
eggs}, {milk, diapers, beer, soda}, {bread, milk, diapers, beer}
and {bread, milk, diapers, soda}.
• It is quite possible that whoever buys milk and bread might
also buy eggs and soda.
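How strongly a candidate rule is backed by the transactions is usually measured with support and confidence. The sketch below computes both for the five transactions listed above; the rule {diapers} -> {beer} is picked here only as an illustration and is not a rule stated in the slides.

```python
# The five transactions from the example above.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "soda"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "soda"},
]

def count(itemset):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset):
    """Fraction of all transactions that contain the itemset."""
    return count(itemset) / len(transactions)

def confidence(antecedent, consequent):
    """Among transactions with the antecedent, the fraction that also has the consequent."""
    return count(antecedent | consequent) / count(antecedent)

print(support({"diapers", "beer"}))       # 3 of 5 transactions
print(confidence({"diapers"}, {"beer"}))  # 3 of the 4 transactions with diapers
```

The same two functions can be applied to any other candidate rule to check how well the data actually support it.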
5. VISUALIZATION
• Data visualization helps in better understanding
not only the data themselves but also correlations
that might occur between them.
6. ANOMALY DETECTION
• Anomaly detection focuses on finding deviations in
data relative to similar data collected in the past
or to the typical values of these data.
• Some examples of anomaly detection are the
following:
• Fraud detection based on a user profile
• Finding defective items in industrial production
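A minimal sketch of comparing a new value against data collected in the past uses a z-score test. The historical values, the threshold of 3 standard deviations and the choice of the z-score itself are assumptions for illustration; many other detection methods exist.

```python
from statistics import mean, stdev

# Hypothetical measurements collected in the past.
history = [102.0, 98.5, 101.2, 99.8, 100.4, 97.9, 100.9, 99.3]

def is_anomaly(value, past, threshold=3.0):
    """Flag a value lying more than `threshold` standard deviations from the past mean."""
    mu = mean(past)
    sigma = stdev(past)
    return abs(value - mu) / sigma > threshold

print(is_anomaly(100.5, history))
print(is_anomaly(150.0, history))
```

In a fraud-detection setting, `history` would play the role of the user profile and `value` the new transaction being checked.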