Data Science Process
End-to-End Machine Learning Project
Important Basics
Python Programming Basics
• Python Data Analysis Libraries Basics
• NumPy, Pandas, Matplotlib, and Seaborn
Tools
Installing Anaconda and Python (Watch Video)
Data Collection
What is a dataset?
A dataset is a collection of data in which data is arranged in some order.
• A tabular dataset can be understood as a database table or matrix, where each column corresponds to a particular variable and each row to a single record.
• The most widely supported file type for a tabular dataset is the comma-separated values (CSV) file.
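As a minimal sketch of how a CSV file maps onto a table, the snippet below loads a small CSV with pandas. The column names and rows are invented for illustration, and `io.StringIO` stands in for a real file path so the example is self-contained.

```python
import io
import pandas as pd

# Hypothetical CSV content, made up purely for illustration.
csv_text = """name,age,city
Alice,30,Cairo
Bob,25,Alexandria
Carol,35,Giza
"""

# pd.read_csv accepts a file path or a file-like object;
# StringIO is used here only to avoid depending on a file on disk.
df = pd.read_csv(io.StringIO(csv_text))

n_rows, n_cols = df.shape          # rows are records, columns are variables
column_names = list(df.columns)

print(n_rows, n_cols)
print(column_names)
```

With a real dataset, the `io.StringIO(...)` argument would be replaced by the path to the downloaded `.csv` file.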
An attribute is a property or characteristic of a data object.
Attribute values are numbers or symbols assigned to an attribute.
object → row = record = entity = instance
attribute → field = feature = characteristic
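The object/attribute vocabulary can be illustrated with a small pandas table, where each row is one object and each column is one attribute. This is only a sketch; the names and values are invented.

```python
import pandas as pd

# Made-up records, for illustration only.
people = pd.DataFrame(
    {
        "name": ["Alice", "Bob"],
        "height_cm": [165.0, 180.5],
        "smoker": ["No", "Yes"],
    }
)

# One object (row / record / instance) with its attribute values.
first_object = people.iloc[0].to_dict()

# The attributes (fields / features) of every object.
attribute_names = list(people.columns)

# The values of a single attribute across all objects.
smoker_values = people["smoker"].tolist()

print(first_object)
print(attribute_names)
print(smoker_values)
```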
Types of data in statistics
Categorical Data (Qualitative):
Represents categories or groups with qualitative distinctions, such as gender (Male/Female), Yes/No, True/False, or color (Blue/Green).
Numerical Data (Quantitative):
Represents measurable quantities expressed in numerical form, such as height, weight, house price, or temperature.
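One rough way to separate the two kinds of data in practice is by column dtype. The sketch below uses pandas `select_dtypes` on an invented table; the column names and values are assumptions made for the example.

```python
import pandas as pd

# Invented example table mixing categorical and numerical attributes.
df = pd.DataFrame(
    {
        "gender": ["Male", "Female", "Female"],
        "owns_car": ["Yes", "No", "Yes"],
        "height_cm": [178.0, 162.5, 170.2],
        "house_price": [250000, 310000, 190000],
    }
)

# Columns stored as numbers are treated as numerical; everything else as categorical.
numerical_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

print(numerical_cols)
print(categorical_cols)
```

The dtype is only a heuristic: a category encoded as a number (for example a postal code) would be reported as numerical here and needs to be reclassified by hand.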
Types of Categorical Data: Nominal and Ordinal
1. Nominal Data
Categories without any inherent order or ranking.
Examples:
• Gender (Male, Female, Other)
• Eye color (Brown, Blue, Green)
• Marital status (Single, Married, Divorced, Widowed)
• Types of vehicles (Car, Truck, Motorcycle)
• Blood type (A, B, AB, O)
2. Ordinal Data
Categories with a clear order or ranking, where the intervals between categories may not be equal.
Examples:
• Educational level (High School Diploma, Bachelor’s Degree, Master’s Degree, PhD)
• Rating scales (1 star, 2 stars, 3 stars, 4 stars, 5 stars)
• Severity of illness (Mild, Moderate, Severe)
• Frequency of Travel (Rarely, Occasionally, Frequently, Regularly)
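The nominal/ordinal distinction can be made explicit in pandas with `pd.Categorical`, whose `ordered` flag records whether the categories have a ranking. The sketch below reuses the severity and eye color examples from the lists above; the individual observations are invented.

```python
import pandas as pd

# Ordinal: the categories are listed from lowest to highest and marked as ordered.
severity = pd.Categorical(
    ["Moderate", "Mild", "Severe", "Mild"],
    categories=["Mild", "Moderate", "Severe"],
    ordered=True,
)

# Nominal: no ordering is declared, so ranking operations are not meaningful.
eye_color = pd.Categorical(["Brown", "Blue", "Green", "Brown"])

worst = severity.max()                 # largest category under the declared order
codes = severity.codes.tolist()        # position of each value in the category list
worse_than_mild = (severity > "Mild").tolist()

print(worst)
print(codes)
print(worse_than_mild)
print(severity.ordered, eye_color.ordered)
```

The integer codes preserve rank only. They do not imply that the gap between Mild and Moderate equals the gap between Moderate and Severe.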
Types of Numerical Data: Discrete and Continuous
1. Discrete Data
Consists of distinct, separate values that can be counted, often whole numbers. These values cannot be broken down into smaller units and typically arise from counting, such as the number of students in a class (5, 10, 15, ...).
2. Continuous Data
Consists of measurements that can take on any value within a given range. The set of possible values is infinite and uncountable, and the data usually result from measurement. Continuous data can be broken down into smaller and smaller units and can take fractional and decimal values, such as:
• Temperature (measured in °C or °F), e.g. 37.5
• Time taken to complete a task (measured in sec, min, or hr), e.g. 2.5 h
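In code, discrete counts are usually stored as integers and continuous measurements as floating-point numbers. The sketch below, which reuses the example values from this section plus a few invented ones, shows the distinction with pandas dtypes.

```python
import pandas as pd

# Discrete: counts of students per class (whole numbers only).
students_per_class = pd.Series([5, 10, 15])

# Continuous: hours taken to complete a task (any value in a range).
task_hours = pd.Series([2.5, 0.75, 1.2])

counts_are_integers = pd.api.types.is_integer_dtype(students_per_class)
hours_are_floats = pd.api.types.is_float_dtype(task_hours)

# A continuous value can always be subdivided further; half of 2.5 h is still a valid duration.
half_first_task = task_hours.iloc[0] / 2

print(students_per_class.dtype, task_hours.dtype)
print(half_first_task)
```

The dtype reflects how the data are stored, not necessarily what they mean, so it should be read as a hint rather than a rule.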
Types of datasets
Image Datasets:
Image datasets contain collections of images and are typically used in computer vision tasks such as image classification, object detection, and image segmentation.
Examples:
o ImageNet
o MNIST
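In code, an image dataset is commonly held as a numeric array with one entry per pixel. The sketch below builds a random stand-in with NumPy for ten 28x28 grayscale images (the MNIST image size); the pixel values and labels are random placeholders, not real digits.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder "dataset": 10 grayscale images of 28x28 pixels, values 0..255.
images = rng.integers(0, 256, size=(10, 28, 28), dtype=np.uint8)
# Placeholder class labels 0..9, one per image.
labels = rng.integers(0, 10, size=10)

# Common preprocessing: scale pixel values to [0, 1] and flatten each image to a vector.
scaled = images.astype(np.float32) / 255.0
flat = scaled.reshape(len(images), -1)

print(images.shape, flat.shape)
```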
Text Datasets:
Text datasets consist of textual data such as articles, books, reviews, or posts. These datasets are used in NLP tasks like sentiment analysis, text classification, and machine translation.
Examples:
o IMDb film reviews dataset
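A text dataset for sentiment analysis is typically a list of (text, label) pairs. The sketch below uses two invented review snippets and a deliberately simple whitespace tokenizer; the `tokenize` helper is hypothetical and not part of any library.

```python
from collections import Counter

# Invented review snippets with invented labels, for illustration only.
reviews = [
    ("A great, moving film", "positive"),
    ("Boring and far too long", "negative"),
]


def tokenize(text):
    """Lowercase the text and split it into words, dropping surrounding punctuation."""
    words = (w.strip(".,!?") for w in text.lower().split())
    return [w for w in words if w]


# Word frequencies across the whole dataset, and the number of examples per label.
word_counts = Counter(w for text, _ in reviews for w in tokenize(text))
label_counts = Counter(label for _, label in reviews)

print(tokenize(reviews[0][0]))
print(label_counts)
```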
Time Series Datasets:
Time series datasets contain data points collected over time. They are generally used in forecasting, anomaly detection, and trend analysis.
Examples:
o Climate data
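A minimal sketch of trend analysis and anomaly detection on a time series, using pandas. The daily temperature readings are invented, and the 1.5 standard deviation threshold is an arbitrary choice for the example rather than a recommended rule.

```python
import pandas as pd

# Made-up daily temperature readings in °C, indexed by date.
dates = pd.date_range("2024-01-01", periods=6, freq="D")
temps = pd.Series([20.0, 21.0, 19.0, 35.0, 20.0, 22.0], index=dates)

# Trend: 3-day rolling mean (the first two entries are NaN because the window is incomplete).
rolling_mean = temps.rolling(window=3).mean()

# Simple anomaly flag: readings more than 1.5 standard deviations from the overall mean.
deviation = (temps - temps.mean()).abs()
anomalies = temps[deviation > 1.5 * temps.std()]

print(rolling_mean)
print(anomalies)
```

Here only the 35.0 reading on 2024-01-04 is flagged, since every other value lies within a few degrees of the mean.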
Tabular Datasets:
Tabular datasets are structured data organized in tables or spreadsheets.
Practical
• How to get datasets:
o https://www.kaggle.com/datasets
o https://archive.ics.uci.edu/ml/index.php
• Google Colab configuration.
• Example on Titanic dataset
• EDA
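A first pass of EDA on a tabular dataset might look like the sketch below. The `basic_eda` helper is hypothetical, and the four rows are invented stand-ins that only borrow a few column names from the Kaggle Titanic data (`Survived`, `Pclass`, `Sex`, `Age`); they are not real passenger records.

```python
import pandas as pd


def basic_eda(df: pd.DataFrame) -> dict:
    """Collect a few first-look summaries of a tabular dataset."""
    return {
        "shape": df.shape,                           # (rows, columns)
        "dtypes": df.dtypes.astype(str).to_dict(),   # storage type of each column
        "missing": df.isna().sum().to_dict(),        # missing values per column
        "numeric_summary": df.describe(),            # count, mean, std, quartiles
    }


# Toy rows invented for illustration. With the real data, this would instead be
# something like pd.read_csv("<path to the downloaded Titanic CSV>").
toy = pd.DataFrame(
    {
        "Survived": [0, 1, 1, 0],
        "Pclass": [3, 1, 3, 2],
        "Sex": ["male", "female", "female", "male"],
        "Age": [22.0, 38.0, None, 35.0],
    }
)

report = basic_eda(toy)

# Relationship between a categorical attribute and the target: mean survival per group.
survival_by_sex = toy.groupby("Sex")["Survived"].mean()

print(report["shape"])
print(report["missing"])
print(survival_by_sex)
```

In Google Colab the same code runs unchanged once the CSV has been uploaded or mounted, since pandas is preinstalled there.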