Week 1 Data Science
1. Structured Data
Definition:
Data that is organized in a predefined format, usually in tables (rows and columns). It is easy to
search, filter, and analyze using tools like Excel, SQL, etc.
Example:
A table of student information:
Student ID | Name  | Age | Grade
101        | Asha  | 20  | A
102        | Rohan | 21  | B
This is structured data because it is clearly organized.
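Because every row has the same named fields, structured data can be filtered and summarized with very little code. The sketch below holds the student table as a list of Python dictionaries; the variable names are illustrative and not part of the original notes.

```python
# The student table from the example, one dictionary per row.
students = [
    {"student_id": 101, "name": "Asha", "age": 20, "grade": "A"},
    {"student_id": 102, "name": "Rohan", "age": 21, "grade": "B"},
]

# Filter: names of students with grade "A".
grade_a = [row["name"] for row in students if row["grade"] == "A"]

# Aggregate: average age across all rows.
average_age = sum(row["age"] for row in students) / len(students)

print(grade_a)      # ['Asha']
print(average_age)  # 20.5
```

The same filter and average would be a one-line query in SQL or a formula in Excel, which is what makes structured data easy to analyze.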
2. Unstructured Data
Definition:
Data that has no specific format or structure. It's harder to process and analyze directly using
traditional tools.
Example:
A WhatsApp message:
"Hey Usha! Congrats on your research in lung cancer detection 🎉. Let's meet this weekend!"
This is unstructured — it's just text, and not in a tabular format.
Summary:
Type         | Format                    | Example
Structured   | Tables (rows/columns)     | Student database
Unstructured | Free text, images, videos | WhatsApp messages, images
1. In Business:
o Customer Insights: Businesses use data science to understand customer
preferences, behaviors, and trends, leading to better-targeted marketing and
improved customer service.
o Operational Efficiency: Data science helps optimize supply chains, streamline
operations, and reduce costs.
o Risk Management: Companies use data science for fraud detection, financial
analysis, and to assess various risks.
2. In Healthcare:
o Medical Research: Data science is used in drug discovery, genomics, and clinical
trials to uncover new insights in medicine.
o Healthcare Predictions: It helps in predicting disease outbreaks, patient
conditions, and improving treatment outcomes.
3. In Government:
o Public Policy: Governments use data science for decision-making in areas like
education, healthcare, and public safety.
o Social Good: Data science is used in managing resources, tackling poverty,
disaster response, and addressing climate change.
4. In Technology:
o AI and Machine Learning: Data science is the backbone of artificial
intelligence, enabling the creation of smarter machines and applications.
o Big Data: With the growth of big data, the role of data science becomes crucial in
analyzing massive datasets to uncover insights.
Conclusion
Data Science is more than just a technical field—it's a critical driver of innovation and efficiency
across multiple industries. By enabling smarter decision-making, improving predictions, and
fostering automation, data science has become indispensable in the modern world. Whether
you're in business, healthcare, technology, or any other field, the ability to harness and interpret
data is a key factor for success.
Scenario:
Amazon uses data science to recommend products to its users based on their browsing history,
purchase history, cart items, and search queries.
What (How Data Science Works Here):
1. Data Collection: Amazon collects vast amounts of data from its users—what they search
for, what they buy, how much time they spend on a product page, and even what they add
to their wish lists.
2. Data Cleaning: This raw data often contains missing or irrelevant entries. Data scientists
clean the data to remove inconsistencies.
3. Exploratory Data Analysis (EDA): Patterns like frequent purchases, trending items, or
seasonal behaviours are identified.
4. Modelling and Prediction: Amazon uses machine learning algorithms (like
collaborative filtering) to predict what a customer might want to buy next.
5. Data Visualization: Dashboards are used internally to help teams visualize what
products are trending, which marketing campaigns are working, etc.
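The modelling step can be illustrated with a toy version of collaborative filtering. This is only a sketch of the general idea (recommend what a similar user bought), not Amazon's actual system; the purchase data, the `jaccard` helper, and the `recommend` function are all hypothetical.

```python
# Hypothetical purchase histories: user -> set of purchased items.
purchases = {
    "u1": {"laptop", "mouse", "keyboard"},
    "u2": {"laptop", "mouse", "headphones"},
    "u3": {"novel", "bookmark"},
}

def jaccard(a, b):
    """Overlap between two sets: size of intersection / size of union."""
    return len(a & b) / len(a | b)

def recommend(user):
    """Suggest items bought by the most similar other user."""
    others = [u for u in purchases if u != user]
    nearest = max(others, key=lambda u: jaccard(purchases[user], purchases[u]))
    # Recommend what the nearest neighbour bought that this user has not.
    return sorted(purchases[nearest] - purchases[user])

print(recommend("u1"))  # ['headphones']
```

User u1 shares two of four distinct items with u2 and none with u3, so u2 is the nearest neighbour and the item u1 has not yet bought is suggested.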
Conclusion:
This example of Amazon demonstrates how data science turns raw data into valuable
insights, leading to smarter business decisions, increased efficiency, and better user satisfaction.
It clearly showcases the what, why, and importance of data science in real life.
1. Artificial Intelligence (AI)
Definition: The simulation of human intelligence in machines that can perform tasks like
reasoning, learning, and decision-making.
Example:
Google Maps using AI to suggest faster routes by analyzing traffic in real time.
ChatGPT answering your questions like a human.
2. Machine Learning (ML)
Definition: A subset of AI where machines learn from data without being explicitly
programmed.
Example:
Netflix recommending movies based on your watch history.
A model predicting if a credit card transaction is fraudulent.
3. Deep Learning
Definition: A type of ML that uses neural networks with many layers to model complex patterns.
Example:
Facial recognition in smartphones.
Self-driving cars detecting pedestrians and signs.
4. Big Data
Definition: Extremely large datasets that are too complex for traditional tools to process. It is
characterized by the 5Vs – Volume, Velocity, Variety, Veracity, and Value.
Example:
Facebook handling billions of posts, messages, and images every day.
Amazon analyzing massive customer transaction data for personalized deals.
6. Data Mining
Definition: The process of discovering patterns and relationships in large datasets.
Example:
Analyzing supermarket purchases to see that people often buy bread and butter together.
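A simple way to find items that are bought together is to count how often each pair appears in the same basket. The baskets below are made-up data for illustration; real market-basket analysis usually adds measures such as support and confidence on top of raw counts.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets, one set of items per customer visit.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]

pair_counts = Counter()
for basket in baskets:
    # Sort so that each pair is always counted under the same key.
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

most_common_pair, count = pair_counts.most_common(1)[0]
print(most_common_pair, count)  # ('bread', 'butter') 3
```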
7. Data Analytics
Definition: The science of analyzing raw data to draw conclusions. It includes descriptive,
diagnostic, predictive, and prescriptive analytics.
Example:
Analyzing app usage to understand which features users love.
Predicting next month’s sales using past data.
8. Data Engineering
Definition: Preparing and building data pipelines so that data is clean, usable, and accessible for
analysis.
Example:
Designing the backend to collect and process data from a fitness app to store in a data
warehouse.
9. Data Visualization
Definition: Representing data through charts, graphs, and visuals to make insights easy to
understand.
Example:
A line graph showing COVID-19 cases over time.
Pie chart showing user distribution by country.
1. Data
Definition: Raw facts and figures without context.
Example: A list of temperatures recorded every hour in Chennai for a week.
2. Dataset
Definition: A collection of related data, usually organized in a table.
Example: An Excel sheet containing columns like Date, City, Temperature, Humidity, etc.
3. Data Cleaning
Definition: The process of correcting or removing inaccurate records from a dataset.
Example: Removing rows with missing values or fixing typos in city names like "Chenai" to
"Chennai".
4. Feature
Definition: An individual measurable property or characteristic of a data point.
Example: In a house price dataset, features could be the number of bedrooms, size in square
feet, and location.
5. Label
Definition: The target variable the model is trying to predict.
Example: In a house price prediction model, the price of the house is the label.
6. Model
Definition: A mathematical representation trained on data to make predictions or decisions.
Example: A machine learning model trained to detect spam emails based on content and sender
information.
7. Training Data
Definition: The data used to train a machine learning model.
Example: 80% of an email dataset used to teach the model which emails are spam.
8. Test Data
Definition: The data used to evaluate the model's performance.
Example: The remaining 20% of the email dataset is used to test if the model correctly classifies
emails as spam or not.
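The 80/20 split described in the Training Data and Test Data entries can be sketched in a few lines. The list of ten integers stands in for labelled emails and the seed value is arbitrary; both are assumptions made for the example.

```python
import random

# Hypothetical dataset: 10 email IDs standing in for labelled emails.
emails = list(range(10))

# Shuffle with a fixed seed so the split is reproducible.
rng = random.Random(42)
shuffled = emails[:]
rng.shuffle(shuffled)

split_point = len(shuffled) * 8 // 10  # 80% of the rows go to training
train_data = shuffled[:split_point]
test_data = shuffled[split_point:]

print(len(train_data), len(test_data))  # 8 2
```

Shuffling before splitting matters: if the data is sorted (for example, all spam first), an unshuffled split would give the model a misleading training set.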
9. Overfitting
Definition: When a model learns the training data too well, including its noise and errors, and therefore performs poorly on new data.
Example: A stock prediction model that works perfectly on past data but fails on future trends.
10. Underfitting
Definition: When a model is too simple to learn the underlying patterns in the data.
Example: A linear model trying to predict a non-linear relationship between hours studied and
exam scores.
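Underfitting can be seen numerically by fitting a straight line to data that follows a curve. The data below is hypothetical (y is simply x squared), and the fit is an ordinary least-squares line computed by hand rather than with a library.

```python
# Hypothetical curved data: y grows with the square of x.
xs = [0, 1, 2, 3, 4]
ys = [x ** 2 for x in xs]

# Ordinary least-squares fit of a straight line y = slope * x + intercept.
mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

# Mean squared error of the line on the same data it was fitted to.
predictions = [slope * x + intercept for x in xs]
mse = sum((p - y) ** 2 for p, y in zip(predictions, ys)) / len(xs)

print(slope, intercept, mse)  # 4.0 -2.0 2.8
```

Even on its own training data the line leaves a noticeable error, because no straight line can follow the curve. That is the signature of underfitting: the error is high on training data as well as on new data.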
13. Clustering
Definition: Grouping similar data points together.
Example: Grouping YouTube viewers into clusters based on watch history.
14. Classification
Definition: Predicting categories or labels.
Example: Predicting whether a tumor is benign or malignant.
15. Regression
Definition: Predicting continuous values.
Example: Predicting the salary of an employee based on years of experience.
16. Accuracy
Definition: The percentage of correct predictions made by the model.
Example: If the model correctly classifies 90 out of 100 emails, the accuracy is 90%.
18. Precision
Definition: Out of all predicted positive cases, how many are actually positive.
Example: If 100 emails are marked spam and only 80 are truly spam, precision is 80%.
19. Recall
Definition: Out of all actual positive cases, how many were correctly predicted.
Example: If there are 100 spam emails and 90 are detected, recall is 90%.
20. F1 Score
Definition: The harmonic mean of precision and recall.
Example: With a precision of 80% and a recall of 90%, F1 = 2 × 0.8 × 0.9 / (0.8 + 0.9) ≈ 0.85. It is a balanced measure when precision and recall are equally important.
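The four metrics can be computed directly from counts. The sketch below reuses the figures from the Accuracy, Precision, and Recall examples and treats them as independent inputs, since the notes present them as separate scenarios; the function names are illustrative.

```python
def accuracy(correct, total):
    """Share of all predictions that were correct."""
    return correct / total

def precision(true_positives, predicted_positives):
    """Share of predicted positives that are actually positive."""
    return true_positives / predicted_positives

def recall(true_positives, actual_positives):
    """Share of actual positives that were found."""
    return true_positives / actual_positives

def f1_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

a = accuracy(90, 100)   # 90 of 100 emails classified correctly
p = precision(80, 100)  # 80 of 100 flagged emails are truly spam
r = recall(90, 100)     # 90 of 100 real spam emails are detected
print(a, p, r, round(f1_score(p, r), 3))  # 0.9 0.8 0.9 0.847
```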
Education
Data science is used to analyze student performance, learning patterns, and drop-out
prediction.
Example: EdTech platforms like BYJU’S use learning analytics to personalize lessons for
each student.
Manufacturing
Predictive maintenance and process optimization are powered by data science.
Example: GE (General Electric) uses data from sensors on machines to predict failures
before they happen.
Sports
Analyzing player performance, game strategies, and injury risks.
Example: In the NBA, teams use player tracking data to improve performance and make
recruitment decisions.
Energy Sector
Forecasting demand, optimizing energy usage, and detecting faults.
Example: Smart grids use data science to predict peak demand times and manage load
distribution efficiently.
Retail
Inventory optimization, customer behavior analysis, and sales forecasting.
Example: Walmart uses predictive analytics to manage stock levels and plan promotions.
Weather Forecasting
Data science models are used to predict weather patterns, storms, and climate changes.
Example: The India Meteorological Department (IMD) uses data models to issue
cyclone warnings.
Banking
Customer credit scoring, default prediction, and loan approval automation.
Example: HDFC Bank uses data science to evaluate customer eligibility for pre-approved
loans.
Cybersecurity
Intrusion detection, anomaly detection, and phishing attack prevention.
Example: Google uses machine learning to detect and block phishing emails in Gmail.
Telecommunications
Churn prediction, network optimization, and user behavior analysis.
Example: Jio uses customer usage data to offer targeted data packs and reduce user
churn.
Aviation
Application: Flight delay prediction, route optimization, fuel efficiency
Real-time Example: Delta Air Lines uses predictive analytics to reduce flight delays by
analyzing weather, air traffic, and historical data.
Insurance
Application: Risk assessment, fraud detection, claims prediction
Real-time Example: Progressive Insurance uses telematics data (from vehicle sensors)
to offer personalized car insurance premiums based on driving behavior.
Real Estate
Application: Price prediction, market trend analysis, property recommendation
Real-time Example: Zillow uses data science to estimate property values and suggest
homes to buyers based on preferences and location trends.
Environmental Science
Application: Air quality monitoring, deforestation tracking, wildlife conservation
Real-time Example: NASA uses satellite data and machine learning to detect illegal
deforestation and monitor climate change indicators globally.
Automotive Industry
Application: Autonomous driving, vehicle safety, predictive maintenance
Real-time Example: Tesla collects data from its fleet to train self-driving models and
release over-the-air updates to improve driving performance.
Space Exploration
Application: Mission planning, spacecraft health monitoring, anomaly detection
Real-time Example: NASA's Mars Rover missions use machine learning to analyze
Martian terrain and autonomously select safe navigation paths.
Data Types – Structured, Unstructured, Semi-structured, Metadata
In data science, understanding different data types is crucial for selecting appropriate
storage, processing, and analysis techniques. Here is a breakdown of the four major data
types:
1. Structured Data
Definition:
Structured data refers to data that is organized in a fixed format, typically rows and
columns (like in relational databases). It's easily searchable and manageable with SQL or
spreadsheet tools.
Characteristics:
Clearly defined fields
Stored in tables (rows & columns)
Easy to input, query, and analyze
Examples:
Excel spreadsheet with columns: Customer_ID, Name, Age, Purchase_Amount
2. Unstructured Data
Definition:
Unstructured data doesn't follow a predefined format or model. It's typically text-heavy,
image, audio, or video-based, and harder to analyze directly.
Characteristics:
No fixed structure
Needs preprocessing (e.g., NLP for text)
Requires advanced tools to extract meaning
Examples:
Text files: Emails, social media posts
Media files: Photos, videos, audio recordings
PDF documents
A short text message: "Hey Usha! I loved your latest blog on machine learning. Super helpful!"
3. Semi-structured Data
Definition:
Semi-structured data doesn't reside in a traditional table format, but still contains tags
or markers to separate elements, making it easier to process than unstructured data.
Characteristics:
Has structure, but not as rigid as relational databases
Often found in formats like JSON, XML, YAML
Examples:
JSON data:
{
  "id": 101,
  "name": "Usha",
  "skills": ["Python", "Machine Learning", "Marketo"]
}
XML documents
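The keys in semi-structured data make it straightforward to load into a program, even though there is no table. A minimal sketch using Python's built-in json module on the record shown above:

```python
import json

# The JSON record from the example above, as a string.
raw = '{"id": 101, "name": "Usha", "skills": ["Python", "Machine Learning", "Marketo"]}'

record = json.loads(raw)  # parse the text into a Python dictionary

print(record["name"])                # Usha
print(len(record["skills"]))         # 3
print("Python" in record["skills"])  # True
```

Note that "skills" holds a list of varying length, something a fixed-column table cannot represent directly. This flexibility is what separates semi-structured data from structured data.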
4. Metadata
Definition:
Metadata is "data about data." It describes or provides information about other data,
helping users understand, find, or manage the actual data.
Characteristics:
Descriptive
Enhances data usability
Doesn't include the content, only information about the content
Examples:
For an image:
o File type: JPEG
o Size: 1.2MB
o Resolution: 1920x1080
o Date created: May 30, 2025
For a document:
o Title: "Lung Cancer Detection Thesis"
o Author: Usha
o Word count: 15,000
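File systems store metadata alongside every file, and it can be read without opening the content. The sketch below creates a small temporary file so that it is self-contained; the file content and the keys of the `metadata` dictionary are chosen only for illustration.

```python
import os
import tempfile

# Create a small temporary file so the example is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as handle:
    handle.write("Lung Cancer Detection Thesis")
    path = handle.name

info = os.stat(path)
metadata = {
    "extension": os.path.splitext(path)[1],  # file type hint
    "size_bytes": info.st_size,              # size of the content in bytes
    "modified": info.st_mtime,               # last-modified timestamp
}
print(metadata["extension"], metadata["size_bytes"])  # .txt 28

os.remove(path)  # clean up the temporary file
```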
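Summary Table
Type            | Structure                       | Examples
Structured      | Fixed rows and columns          | Excel spreadsheets, relational database tables
Unstructured    | No predefined format            | Emails, social media posts, photos, videos, audio, PDF documents
Semi-structured | Tags or markers, no rigid table | JSON, XML, YAML
Metadata        | Data that describes other data  | File type, size, resolution, author, word count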
🐍 Python
General-purpose programming language widely used in Data Science.
Libraries:
o Pandas – data manipulation (df.describe())
o NumPy – numerical computations (np.array())
o Matplotlib / Seaborn – data visualization (sns.heatmap())
o Scikit-learn – machine learning (model.fit())
Example: Predicting house prices using linear regression.
📊 R
Statistical computing language popular in academia and research.
Libraries:
o ggplot2 – powerful data visualization
o dplyr – data wrangling (filter(), mutate())
o caret – machine learning modeling
Example: Analyzing survey data and visualizing trends.
📓 Jupyter Notebook
Web-based interactive environment for writing and running code.
Supports Python, R, etc.
Great for:
o Data exploration and visualization
o Documentation with Markdown
Example: Step-by-step EDA (Exploratory Data Analysis) with plots and comments.
Other Tools
Excel – Data manipulation, pivot tables, simple charts
Tableau / Power BI – Drag-and-drop data visualization tools
Apache Spark – Big Data processing, often used with PySpark
Google Colab – Cloud-based Jupyter notebook with free GPU
KNIME / RapidMiner – GUI-based Data Science platforms
Goal:
Predict if a patient has lung cancer (Yes or No) based on features like age, smoking habit,
and shortness of breath.
df = pd.DataFrame(data)
Output:
Accuracy: 1.0
Classification Report:
              precision    recall  f1-score   support
    accuracy                           1.00         3
   macro avg       1.00      1.00      1.00         3
weighted avg       1.00      1.00      1.00         3
The input data has information for 10 people, but the output (accuracy and classification report)
shows only 3 test cases.
test_size=0.3 means 30% of the data is used for testing.
30% of 10 people = 3 people for testing, 7 for training.
The other 7 people were used in the training phase to train the logistic regression model, so they are not
part of y_test or y_pred, which are used for evaluation.
Because the test set has only 3 cases, an accuracy of 1.0 here should not be read as proof that the model is reliable.
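The split arithmetic can be checked in a few lines. This sketch assumes the number of test rows is rounded up when the fraction does not divide evenly, which is my understanding of how scikit-learn's train_test_split behaves; treat that rounding rule as an assumption rather than a quotation of the library.

```python
import math

n_samples = 10   # people in the dataset
test_size = 0.3  # fraction held out for testing

# Round up so that a fractional test size never yields zero test rows.
n_test = math.ceil(n_samples * test_size)
n_train = n_samples - n_test

print(n_train, n_test)  # 7 3
```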
Precision: For each class (cancer or no cancer), out of all patients predicted to be in that class,
how many actually belong to it.
Recall: For each class, out of all patients who actually belong to it, how many the model correctly
found.
F1-score: A balance between precision and recall (useful when the classes are imbalanced).
Support: The actual number of patients in each class in the test data.
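The per-class figures in a classification report can be reproduced by hand. The labels below are hypothetical: the notes do not show which 3 patients ended up in the test set, so `y_test` and `y_pred` are invented to match a perfect score, and `per_class_report` is an illustrative helper, not a library function.

```python
def per_class_report(y_true, y_pred):
    """Return {label: (precision, recall, f1, support)} for each label."""
    report = {}
    for label in sorted(set(y_true)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        predicted = sum(1 for p in y_pred if p == label)
        support = sum(1 for t in y_true if t == label)
        precision = tp / predicted if predicted else 0.0
        recall = tp / support
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[label] = (precision, recall, f1, support)
    return report

# Hypothetical labels for 3 test patients (1 = cancer, 0 = no cancer).
y_test = [1, 0, 1]
y_pred = [1, 0, 1]
print(per_class_report(y_test, y_pred))
# {0: (1.0, 1.0, 1.0, 1), 1: (1.0, 1.0, 1.0, 2)}
```

The support values (1 and 2) add up to the 3 test cases, matching the total shown in the report.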