CSD101 Fundamentals of Data Science
Sessions 1 and 2: Introduction to Key Terminology in Data Science
Course Outline
1. Introduction to key terminology in data science (2)
2. Types of data, scales of measurement, and methods for collection (2)
3. Data cleaning using MS Excel (2)
4. Simple descriptive statistics (4)
5. Data visualisation using MS Excel and other software (3)
6. Introduction to Geographical Information Systems (4)
7. Introduction to Computer Programming (5)
8. Infographics project (4)
What is Data?
Data is a collection of
numbers and words (Textual Data)
images, sound, and videos (Multimedia Data)
It is collected through surveys, interviews, extraction from the web, etc.
Data - Storage Perspective
Data is stored physically, written or printed on paper and
other materials (e.g. sign boards), and digitally on memory devices.
To process data quickly, it needs to be “DIGITIZED”.
Digitization is the process of transferring data to electronic/magnetic
devices for storage and processing by computation.
Data is stored in computers in the form of files. Files are
arranged/categorized into folders.
Data Formats
The common formats used to store data and make it available for further
processing are as follows:
1. Textual Data - .txt, .doc, .docx, .pdf, .csv, etc.
2. Alpha-Numeric Data (Tabular Data) - .xls, .xlsx (spreadsheet
formats), .dbf, .mdb (database formats), etc.
3. Images - .jpeg, .gif, .bmp, .cdr, .png, .tiff
4. Sound - .mp3, .wav, .wma
5. Video - .mp4 (MPEG-4), .wmv
6. Data Interchange Formats - .csv, .json, .xml
Data Storage
● Where the data resides
○ Cloud or Computing Clusters or Individual Computer/Server
● Storage System
○ SQL (Structured Query Language) DBMS: e.g. MySQL, Oracle, MS SQL Server, ... (Structured Data)
○ NoSQL: e.g. MongoDB, Cassandra, Couchbase, HBase, Hive, etc.
(Unstructured Data)
○ Text Indexing: e.g. Solr, Elasticsearch
Data can be (will be covered later in detail)
1. Qualitative
2. Quantitative
1. Qualitative: Here the information is grouped by category,
hence it is also known as Categorical.
e.g. I was born in India.
e.g. He is a fast runner.
e.g. He has black hair.
2. Quantitative: (Discrete Data – countable, distinct values)
e.g. Jatin has 2 cars.
e.g. There are 13 players on the field.
e.g. The farmer has 200 cows.
(Continuous Data – any value within a range)
e.g. His weight is 37 kg.
e.g. He is sick. His body temperature is 101 °F.
Relationship between the four Components
• Pyramid showing transformation of Data to Information to Knowledge to Wisdom (levels listed from top to bottom):
Wisdom
Knowledge (Predictability)
Information (Patterns)
Data (Unfiltered)
Data in different domains
Data comes from many domains, to name a few:
Business (Sales, Finance, Marketing, etc.)
Geographical (Climate, Meteorological, Location, etc.)
Transport
Scientific (Biological, Astrophysics, Medical, etc.)
Statistical, etc.
Where is the Data?
Sources of Data
Data is: Big
● 2.5 quintillion (2.5 × 10^18) bytes of data are generated every day
● Everything around us collects/generates data
● Social media sites
● Business transactions
● Location-based data
● Sensors data
● Digital photos, videos
● Consumer behaviour (online and store transactions)
● Cloud based & mobile applications are widespread
Benefits of having Data
Recommendations (based on learned preferences,
recommendation engines can refer you to movies, restaurants,
and books you may like)
Classification (e.g. an email server classifying emails as
“Important”, “Social”, “Promotions”, or “Junk”)
Pattern detection (weather patterns, financial market patterns,
etc.)
Forecasting (sales and customer retention)
Recognition (facial, voice, text, etc.)
Anomaly detection (fraud, disease, crime, etc.)
Data and current scenario
Traditionally, data was mostly structured and small in size, and could
be analyzed using simple Business Intelligence tools.
But today, most data is unstructured or semi-structured.
(Image: by 2025, more than 80% of the data will be unstructured.)
Data Science
It is the field of study that combines domain expertise, programming skills, and knowledge of
statistics/maths to extract meaningful insights from data.
It incorporates techniques like machine learning, cluster analysis, data mining and
visualization.
Data Science is: multidisciplinary
● Statisticians
● Mathematicians
● Computer Scientists in
○ Data mining
○ Artificial Intelligence & Machine Learning
○ Systems Development and Integration
○ Database development
○ Analytics
● Domain Experts
○ Medical experts
○ Geneticists
○ Finance, Business, Economy experts
○ etc.
Data Science in various Domains
What if you could understand the precise requirements of your customers from
existing data such as their past browsing history, purchase history, age, and
income? You had this data earlier too, but with today's volume and
variety of data you can train models more effectively and recommend products to
your customers with more precision, which can bring more business
to your organization.
Let’s take a different scenario to understand the role of Data Science in decision
making. What if your car had the intelligence to drive you home? Self-driving
cars collect live data from sensors, including radars, cameras, and lasers, to create a map
of their surroundings. Based on this data, the car decides when to speed up, when to
slow down, when to overtake, and where to turn, making use of advanced machine
learning algorithms.
Another example is the traffic density feature in Google Maps.
Let’s see how Data Science can be used in predictive analytics, taking weather
forecasting as an example. Data from ships, aircraft, radars, and satellites can be collected and
analyzed to build models. These models not only forecast the weather but also help in
predicting natural calamities, so that appropriate
measures can be taken beforehand to save lives.
Data Analysis Overview
Data Analysis is a process of inspecting, cleaning, transforming, and
modeling data with the goal of discovering useful information,
suggesting conclusions, and supporting decision-making.
Types of Data Analysis
Data Mining (focuses on statistical modeling), e.g. does a customer
who buys product X also choose product Y?
Business Intelligence (focuses on business information), e.g.
financial planning for the year.
Predictive Analytics (uses past data to predict the future), e.g. a
customer's salary has increased; will they buy costlier brands?
Text Analytics (focuses on linguistic data), e.g. Sentiment Analysis
(emotions: positive, negative, neutral).
Data Analysis Process
Data Analysis is a process of collecting, transforming, cleaning, and
modeling data with the goal of discovering the required information.
The results are then communicated, suggesting conclusions
and supporting decision-making.
Data visualization is used to portray the data so that useful patterns
are easier to discover.
In this context, the terms Data Modeling and Data Analysis are used interchangeably.
Data Analysis Process
● Data Requirement Specification - The data required as input is
identified for the analysis (e.g. if the aim is weather forecasting,
then vegetation coverage and humidity can be considered as
required variables). Domain knowledge is required.
● Data Collection - The process of gathering information on the targeted
variables identified as data requirements in the first step.
● Data Processing - Structuring the data as required by the relevant
analysis tools (e.g. the data might have to be stored in a table for a
statistical application).
Data Cleaning
● Data are often incomplete or incorrect, for example:
○ Typographical Errors: e.g. text data in numeric fields
○ Missing Values: some fields may not be collected for some
of the examples
○ Impossible Data Combinations:
e.g. gender=MALE, pregnant=TRUE
○ Out-of-Range Values: e.g. age=1000
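The four kinds of problems listed above can be checked with a few simple rules. The following is a minimal sketch, not a complete cleaning tool: the records, the field names, and the accepted age range of 0 to 120 are all assumptions made up for illustration.

```python
# Hypothetical survey records; every value is stored as text, as it
# often is right after data entry. Field names are assumptions.
records = [
    {"name": "Asha", "age": "34", "gender": "FEMALE", "pregnant": "TRUE"},
    {"name": "Ravi", "age": "abc", "gender": "MALE", "pregnant": "FALSE"},
    {"name": "Meena", "age": "", "gender": "FEMALE", "pregnant": "FALSE"},
    {"name": "Karan", "age": "29", "gender": "MALE", "pregnant": "TRUE"},
    {"name": "Lata", "age": "1000", "gender": "FEMALE", "pregnant": "FALSE"},
]


def find_problems(record):
    """Return a list of data-quality problems found in one record."""
    problems = []
    age_text = record["age"].strip()
    if age_text == "":
        problems.append("missing value: age")
    elif not age_text.isdigit():
        problems.append("typographical error: age is not a number")
    elif not 0 <= int(age_text) <= 120:  # assumed plausible range
        problems.append("out-of-range value: age")
    if record["gender"] == "MALE" and record["pregnant"] == "TRUE":
        problems.append("impossible combination: gender=MALE, pregnant=TRUE")
    return problems


# Map each person's name to the problems found in their record.
report = {record["name"]: find_problems(record) for record in records}
for name, problems in report.items():
    print(name, "->", problems or "ok")
```

In practice the rules depend on the domain, which is why domain knowledge matters during cleaning as well as during requirement specification.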
Data Analysis
Statistical Data Models such as Correlation, Regression Analysis
can be used to identify the relations among the data variables.
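As a rough illustration of correlation and regression, the sketch below computes the Pearson correlation coefficient and a least-squares straight line for two variables. The data (hours studied and marks obtained) is invented for the example, and the function names are hypothetical.

```python
import math

# Made-up data: hours studied and marks obtained by five students.
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
marks = [52.0, 55.0, 61.0, 64.0, 68.0]


def mean(values):
    return sum(values) / len(values)


def pearson_correlation(x, y):
    """Strength of the linear relation between x and y, from -1 to +1."""
    mx, my = mean(x), mean(y)
    covariance = sum((a - mx) * (b - my) for a, b in zip(x, y))
    spread_x = sum((a - mx) ** 2 for a in x)
    spread_y = sum((b - my) ** 2 for b in y)
    return covariance / math.sqrt(spread_x * spread_y)


def linear_regression(x, y):
    """Least-squares line y = slope * x + intercept."""
    mx, my = mean(x), mean(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum(
        (a - mx) ** 2 for a in x
    )
    intercept = my - slope * mx
    return slope, intercept


r = pearson_correlation(hours, marks)
slope, intercept = linear_regression(hours, marks)
print("correlation:", round(r, 3))
print("marks is approximately", round(slope, 2), "* hours +", round(intercept, 2))
```

A correlation close to +1 suggests that marks rise with hours in this sample; the regression line then gives a simple way to estimate marks for a given number of hours.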
Communication
The results are presented through visualization techniques, such
as tables and charts, which help in communicating the message
clearly and efficiently to the users.
Concepts of Big Data
Big Data is also data, but of huge size and growing exponentially
with time.
Such data is so large and complex that traditional data management
tools are unable to store or process it efficiently.
Initially, the main focus was on building frameworks and solutions to store
data.
Hadoop and cloud platforms from providers such as Amazon, Google, and
Rackspace have largely solved the
problem of storage.
Now the focus has shifted to the processing of this data.
Examples of Big-Data
Social Media Data eg. Facebook, Twitter, WhatsApp, Instagram, etc.
Data from GPS Systems.
Sensor Data
Big Data
Big Data can be categorized as
Structured
Unstructured
Semi-structured
Structured – The format in which the data is stored and processed is fixed.
e.g. data stored in a DBMS (Database Management System)
Enrollment No.   Name            BirthDate    Stream
145633           Shilpan Desai   12/3/1998    Science
321123           Manish Gor      05/06/1997   General
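To show what a fixed format means in practice, here is a small sketch that stores the two student rows above in a database table and queries them with SQL. It uses Python's built-in sqlite3 module with an in-memory database; the table name and column names are assumptions chosen for the example.

```python
import sqlite3

# An in-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# The table definition fixes the columns and their types in advance;
# this is what makes the data "structured".
conn.execute(
    "CREATE TABLE students ("
    " enrollment_no INTEGER PRIMARY KEY,"
    " name TEXT NOT NULL,"
    " birth_date TEXT,"
    " stream TEXT)"
)
conn.executemany(
    "INSERT INTO students VALUES (?, ?, ?, ?)",
    [
        (145633, "Shilpan Desai", "12/3/1998", "Science"),
        (321123, "Manish Gor", "05/06/1997", "General"),
    ],
)

# Query the table: names of all students in the Science stream.
science_students = conn.execute(
    "SELECT name FROM students WHERE stream = ?", ("Science",)
).fetchall()
print(science_students)
conn.close()
```

Because every row follows the same column layout, a single query can filter or summarize the whole table.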
Unstructured - The data has no fixed format.
Processing unstructured data is a big challenge because there is no
fixed format to rely on.
e.g. a combination of text files, images, videos, data in books, journals,
email messages, web pages.
Semi-structured – A mixture of structured and unstructured. The data looks structured, but
it is not represented/stored/processed in the form of tabular structures.
e.g. XML-formatted data, JSON data, CSV data.
XML – eXtensible Markup Language
JSON – JavaScript Object Notation
CSV – Comma-Separated Values
Ex. of XML Format
<students>
  <student>
    <enrollmentno>145633</enrollmentno>
    <name>Shilpan Desai</name>
    <stream>Science</stream>
    <photo>myimages/sd.jpg</photo>
  </student>
</students>
Ex. of JSON Format (JavaScript Object Notation)
{"student": {"enrollmentno": "145633", "name": "Shilpan Desai", "stream": "Science"}}
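Semi-structured data such as the XML and JSON examples can be read by a program without a predefined table. The sketch below parses both using Python's standard library. The JSON text is written here as a valid object with a top-level "student" key, which is an assumption about the intended structure; the variable names are also illustrative.

```python
import json
import xml.etree.ElementTree as ET

xml_text = """
<students>
  <student>
    <enrollmentno>145633</enrollmentno>
    <name>Shilpan Desai</name>
    <stream>Science</stream>
    <photo>myimages/sd.jpg</photo>
  </student>
</students>
""".strip()

json_text = (
    '{"student": {"enrollmentno": "145633", '
    '"name": "Shilpan Desai", "stream": "Science"}}'
)

# XML: each <student> element becomes a dictionary of tag -> text.
root = ET.fromstring(xml_text)
xml_students = [
    {child.tag: (child.text or "").strip() for child in student}
    for student in root.findall("student")
]

# JSON: the text maps directly onto Python dictionaries.
json_student = json.loads(json_text)["student"]

print(xml_students[0]["name"], "-", xml_students[0]["stream"])
print(json_student["name"], "-", json_student["stream"])
```

Both formats carry their field names along with the values, which is why the same record can hold extra fields (such as the photo path in the XML) without changing a table definition.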
Data Mining
• Data Mining is also known as Knowledge Discovery in Data
(KDD).
• The goal of Data Mining is to discover patterns and trends in large
sources of data.
• It also aims to predict outcomes and derive actionable information from data.
• It is about the creation of a model.
• A model uses an algorithm to act on data to make predictions and find
patterns.
• Data can be mined whether it is stored in flat files,
spreadsheets, database tables, or some other storage format;
hence the process is independent of data formats.
Relationship between Artificial Intelligence (AI), Machine
Learning (ML) and Deep Learning(DL).
• Analytics is a collection of techniques and tools for creating value
from data.
• The techniques include AI, ML, and DL, which are interrelated.
• Artificial Intelligence: algorithms that exhibit human-like
intelligence.
• Machine Learning: a subset of AI in which algorithms learn to perform a
task from data.
• Deep Learning: a subset of ML whose algorithms imitate
the way the human brain learns in order to solve a task.
• Hence all three are algorithms that, by looking at
large amounts of data, learn to solve a task much as we humans
do.
• Humans learn to perform a task through multiple experiences.
Two common types of Learning Algorithms
1. Supervised Learning
2. Unsupervised Learning
Types of Learning Algorithms
1. Supervised Learning
• It requires labeled training data.
• Labeled data means that each input is already
paired with its correct output.
(Figure: several images of apples, each labeled "Apple".)
• The training data works as the supervisor that teaches the
machine (algorithm) to predict the correct output.
• This is analogous to a student learning under the supervision of
a teacher (the labeled data).
• Technically, the algorithm learns a mapping function from the
input variable (x) (an apple image) to the output
variable (y) (the label "apple"). If we show an image of an apple
to the algorithm, it will give us the output text “apple”.
• The algorithm is trained to look at colour, shape, edges
in parts of the image, and colour patterns.
• Colour, shape, edges, number of sides, angles, and colour
patterns are known as features in the terminology of learning
algorithms.
• e.g. Classification algorithms
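The idea of learning from labeled examples can be sketched with one of the simplest classification methods, the nearest-neighbour rule: a new item gets the label of the most similar training example. This is only an illustration under strong assumptions. Real image classifiers work on pixels, whereas here each fruit is reduced to two made-up feature scores (redness and roundness, each between 0 and 1).

```python
import math

# Labeled training data: (features, label).
# features = (redness, roundness); all numbers are invented.
training_data = [
    ((0.9, 0.9), "apple"),
    ((0.8, 0.8), "apple"),
    ((0.2, 0.3), "banana"),
    ((0.1, 0.2), "banana"),
]


def predict(features):
    """Return the label of the training example closest to `features`."""
    nearest_features, nearest_label = min(
        training_data,
        key=lambda example: math.dist(example[0], features),
    )
    return nearest_label


print(predict((0.85, 0.95)))  # a red, round fruit
print(predict((0.15, 0.25)))  # a pale, elongated fruit
```

The labels in `training_data` play the role of the teacher: without them the function would have nothing to return.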
Types of Learning Algorithms
2. Unsupervised Learning
• There is no labeled training data, so the algorithm has no
knowledge of the outcome variable.
• Hence the algorithm has to decide the possible outcomes on its own.
• It tries to find hidden patterns and insights in the
given data by itself (no teacher, no supervisor).
• It is closer to the learning that takes place in the human brain while
learning new things.
• Useful for categorizing things.
• e.g. Clustering algorithms.
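A common clustering method is k-means, which groups points around k centres without using any labels. The sketch below is a bare-bones version under stated assumptions: the points are invented, k is fixed at 2, and the starting centres are picked by hand so that the result is repeatable (real implementations usually choose them randomly).

```python
import math

# Unlabeled points (made-up 2-D measurements).
points = [
    (1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
    (8.0, 8.0), (9.0, 8.5), (8.5, 9.0),
]


def centre_of(cluster):
    """Average position of the points in one cluster."""
    xs, ys = zip(*cluster)
    return (sum(xs) / len(xs), sum(ys) / len(ys))


def k_means(points, initial_centroids, iterations=10):
    centroids = list(initial_centroids)
    clusters = []
    for _ in range(iterations):
        # Step 1: assign every point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: math.dist(point, centroids[i]),
            )
            clusters[nearest].append(point)
        # Step 2: move each centroid to the centre of its cluster
        # (an empty cluster keeps its old centroid).
        centroids = [
            centre_of(cluster) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


centroids, clusters = k_means(points, initial_centroids=[points[0], points[3]])
for centroid, cluster in zip(centroids, clusters):
    print("centre", centroid, "members", cluster)
```

No point carries a label, yet the algorithm separates the two groups purely from how close the points are to each other.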
NETFLIX CASE STUDY
Netflix Data Science Case Study to Improve its
Recommendation System
• Do you remember the last movie you watched on Netflix?
• After watching the movie, were you recommended similar
movies?
• How does Netflix know what you’d like?
• The secret here is Data Science.
• Netflix uses Data Science to provide relevant and interesting
recommendations to us.
Netflix Case Study
Data Science at Netflix
• Netflix started as a DVD rental service in 1998.
• It relied on third-party postal services to deliver its DVDs to
users.
• It introduced its online streaming service in 2007.
• Netflix invested in many algorithms to provide a smooth viewing
experience to its users.
• One such algorithm is the recommendation system that
Netflix uses to provide suggestions to users.
• A recommendation system understands the needs of the users
and suggests films and shows accordingly.
What is a Recommendation System?
• A recommendation system provides its users with
content based on their preferences and likings.
• It takes information about the user as
input.
• This information can be the user's past usage of products or
the ratings the user gave to products.
• It then processes this information to predict how highly the user
would rate or prefer a product.
• A recommendation system makes use of a variety of machine
learning algorithms.
• Another important role of a recommendation system is to
search for similarity between different products.
• It searches for movies that are similar to
the ones you have watched or liked previously.
• Based on the movies you have watched, Netflix
recommends films that share a degree of similarity.
There are two main types of Recommendation Systems
1. Content-based recommendation systems
• In a content-based recommendation system, knowledge of the
products and customer information are taken into consideration.
• Based on the content that you have viewed on Netflix, it provides you
with similar suggestions.
• e.g. if you have watched a science-fiction film, the
content-based recommendation system will provide you with
suggestions for other films of the same genre.
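A content-based recommender can be sketched by comparing the genres of films. The example below is a simplified illustration and not how Netflix actually works: the film titles and genres are invented, and similarity is measured with the Jaccard index (shared genres divided by all genres of the two films), which is one possible choice among many.

```python
# Hypothetical catalogue: title -> set of genres.
catalogue = {
    "Star Voyage": {"science-fiction", "adventure"},
    "Galaxy Wars": {"science-fiction", "action"},
    "Quiet Village": {"drama"},
    "Robot Dawn": {"science-fiction", "thriller"},
}


def jaccard(a, b):
    """Share of genres two films have in common (0 = none, 1 = identical)."""
    return len(a & b) / len(a | b)


def recommend_similar(watched_title, top_n=2):
    """Return up to top_n titles whose genres overlap with the watched film."""
    watched_genres = catalogue[watched_title]
    scores = [
        (jaccard(watched_genres, genres), title)
        for title, genres in catalogue.items()
        if title != watched_title
    ]
    # Highest similarity first; ties are broken alphabetically by title.
    scores.sort(key=lambda pair: (-pair[0], pair[1]))
    return [title for score, title in scores[:top_n] if score > 0]


print(recommend_similar("Star Voyage"))
```

Only knowledge about the films themselves is used here; no other user's behaviour is needed.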
2. Collaborative filtering recommendation systems
• Collaborative filtering provides recommendations based on similarities between
the profiles of its users.
• One key advantage of collaborative filtering is that it does not depend on
product knowledge.
• For example, if person A watches crime, science-fiction, and thriller genres
and person B watches science-fiction, thriller, and action genres, then A is likely to also like
action and B is likely to like the crime genre.
Person A watches    Person B watches
crime               action
science-fiction     science-fiction
thriller            thriller
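The example of persons A and B can be turned into a small sketch of user-based collaborative filtering: find the most similar user, then suggest what that user watches and you do not. Person C and the similarity measure (share of genres in common) are assumptions added for illustration; production systems use far richer data and models.

```python
# Genres each person watches. A and B follow the example above;
# C is a hypothetical third user with different tastes.
watch_history = {
    "A": {"crime", "science-fiction", "thriller"},
    "B": {"science-fiction", "thriller", "action"},
    "C": {"romance", "drama"},
}


def similarity(user, other):
    """Share of genres two users have in common (0 = none, 1 = identical)."""
    a, b = watch_history[user], watch_history[other]
    return len(a & b) / len(a | b)


def recommend(user):
    """Suggest genres watched by the most similar user but not by `user`."""
    others = [u for u in watch_history if u != user]
    most_similar = max(others, key=lambda other: similarity(user, other))
    if similarity(user, most_similar) == 0:
        return set()  # nobody has similar tastes, so nothing to suggest
    return watch_history[most_similar] - watch_history[user]


print("A might like:", recommend("A"))
print("B might like:", recommend("B"))
```

Note that the code never looks at what a genre means; it relies only on overlaps between users, which is why this approach is independent of product knowledge.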