19.
Statistical significance is
A. The science of collecting, organizing, and applying numerical facts
B. Measure of the probability that a certain hypothesis is incorrect given certain observations
C. One of the defining aspects of a data warehouse, which is specially built around all the
existing applications of the operational data
D. None of these
1) Explain Data Science
Ans:
Data Science is the practice of extracting useful information from data. It combines mathematics, programming, statistics, and machine learning to solve problems.
Steps:
1. Collect – Get data.
2. Clean – Fix mistakes.
3. Analyze – Find patterns.
4. Model – Make predictions.
5. Use – Apply results.
In short: Collect → Clean → Analyze → Model → Use
Examples:
● Fraud detection
● Movie suggestions (Netflix)
● Health predictions
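As a rough sketch only, the steps above could look like this in Python with pandas and scikit-learn; the file name sales.csv and the column names are hypothetical placeholders, not part of the notes.

```python
# A minimal sketch of the Collect -> Clean -> Analyze -> Model -> Use pipeline.
# The file "sales.csv" and columns "ads_spend", "visits", "revenue" are made up.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# 1. Collect - load raw data
df = pd.read_csv("sales.csv")

# 2. Clean - drop duplicates and fill missing numeric values
df = df.drop_duplicates()
df = df.fillna(df.mean(numeric_only=True))

# 3. Analyze - look at basic patterns
print(df.describe())

# 4. Model - fit a simple predictive model
X, y = df[["ads_spend", "visits"]], df["revenue"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_train, y_train)

# 5. Use - apply the model to unseen data
print("Test R^2:", model.score(X_test, y_test))
```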
2) Compare box plot and histogram.
Ans:
Aspect | Box Plot | Histogram
Purpose | Summarizes data | Shows data distribution and frequency
Data type | Continuous | Continuous and discrete
Granularity | Low | High
Use case | Summary and comparison | Detailed distribution analysis
Displays | Median, quartiles, IQR, outliers | Frequency of values in bins
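A small matplotlib sketch of the two plots side by side on the same synthetic data:

```python
# Box plot vs. histogram on the same (synthetic) data.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
data = rng.normal(loc=50, scale=10, size=500)   # made-up example data

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

# Box plot: median, quartiles, IQR and outliers in one compact summary
ax1.boxplot(data)
ax1.set_title("Box plot (summary)")

# Histogram: frequency of values in bins, showing the distribution's shape
ax2.hist(data, bins=20)
ax2.set_title("Histogram (distribution)")

plt.tight_layout()
plt.show()
```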
3) Explain briefly about Data science tools.
Ans:
Category | Tools
Programming languages | Python, R
Data manipulation | Pandas, SQL
Databases | SQL, NoSQL
Cloud | Google Cloud, AWS
Data storage | MySQL, MongoDB
IDEs | Jupyter Notebook, VS Code
4) Interpret applications of classification with example
Ans:
Email Spam Detection: Classifying emails as spam or not spam
Customer Segmentation: Grouping customers by purchasing behavior.
Speech Recognition: Converting spoken language into text.
Image Recognition: Detecting objects in images.
Medical Diagnosis: Classifying diseases as present or absent (e.g., cancer detection).
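A minimal, hypothetical spam-detection sketch with scikit-learn; the example messages and labels are made up:

```python
# Classification example: spam vs. not spam on tiny made-up messages.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

messages = ["win a free prize now", "meeting at 10 am tomorrow",
            "claim your free reward", "project report attached"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam

# Convert text to word-count features, then train a Naive Bayes classifier
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(messages)
model = MultinomialNB().fit(X, labels)

# Classify a new, unseen email
new_email = vectorizer.transform(["free prize waiting for you"])
print(model.predict(new_email))  # expected output: [1] -> spam
```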
5) List down the conditions for Overfitting and Underfitting.
Ans:
Overfitting Conditions:
● Model is too complex
● Too many parameters
● Too many features
● Training data is too small or insufficient
● Training for too long
● Lack of regularization
Underfitting Conditions:
● Model is too simple (low complexity relative to the data's complexity)
● Insufficient training
● Too much regularization
● Too few features
● Data preprocessing issues
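A small scikit-learn sketch (on made-up data) showing how too little and too much model complexity lead to underfitting and overfitting:

```python
# Underfitting vs. overfitting by varying polynomial degree on noisy synthetic data.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = np.linspace(0, 1, 40).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.2, size=40)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=1)

for degree in (1, 4, 15):  # too simple, reasonable, too complex
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    print(f"degree={degree:2d}  train R^2={model.score(X_train, y_train):.2f}"
          f"  test R^2={model.score(X_test, y_test):.2f}")

# degree 1 tends to underfit (low score on both sets); degree 15 tends to
# overfit (high train score, much lower test score).
```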
6) Summarize the reason why Python is used for data cleaning in Data Science
Ans:
Python is used for data cleaning because:
● Libraries: Pandas and NumPy make cleaning easy.
● Flexibility: Works with many different data formats.
● Efficiency: Handles large datasets quickly.
● Community support: Lots of resources and help from the community.
● Automation: Cleaning tasks can be scripted and repeated.
7) Define data analytics.
Ans:
Data analytics is the process of using tools and techniques to analyze data and find patterns.
It involves various techniques and tools to collect, clean, analyze, and describe data.
Advantages: improved efficiency, risk management, enhanced operational performance, and better financial management.
8) Illustrate supervised and unsupervised learning.
Ans:
Supervised Learning:
Supervised learning is a machine learning technique that uses labeled data to train algorithms
to predict outcomes
You train the model with labeled data (input and the correct output).
Goal: The model learns to predict the output for new, unseen data.
Example: Email Spam Classification
Unsupervised Learning:
Unsupervised learning is a machine learning technique that finds patterns in unlabeled data, without human-provided labels.
You train the model with unlabeled data (just inputs, no outputs).
Goal: The model finds hidden patterns or groups in the data.
Example: Customer Segmentation
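A tiny sketch contrasting the two, using scikit-learn on made-up numbers:

```python
# Supervised vs. unsupervised learning on tiny made-up data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

# Supervised: inputs X with known labels y (e.g., 1 = spam, 0 = not spam)
X = np.array([[1, 20], [2, 25], [8, 2], [9, 1]])
y = np.array([0, 0, 1, 1])
clf = LogisticRegression().fit(X, y)
print("Predicted label:", clf.predict([[7, 3]]))

# Unsupervised: the same inputs but no labels; the model finds groups on its own
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("Cluster assignments:", km.labels_)
```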
9) Write briefly about Data mining concept.
Ans:
Data mining is the process of discovering hidden patterns, trends, and valuable information within large datasets. It involves sorting through large data sets to identify patterns and relationships that can help solve business problems through data analysis.
● Data collection: gather data from various sources.
● Analysis: use techniques such as machine learning and statistical algorithms to find patterns.
● Goal: extract valuable knowledge.
Common techniques in data mining: Classification, Clustering, Association.
10) Organize Data cleaning and sampling with an example
Ans:
Data Cleaning
Data cleaning is the process of fixing or removing incorrect, incomplete, or duplicate data to
improve its quality.
Example:
● Removing duplicate customer records.
● Filling missing phone numbers.
● Correcting invalid email formats.
Sampling
Sampling is the process of selecting a smaller, representative subset of data from a larger
dataset for analysis.
Example:
● From a dataset of 100,000 customers, you randomly select a sample of 1,000 customers
to analyze sales patterns.
11) Explain briefly about Data Science.
Ans: Same as Question 1.
12) List down the conditions for Overfitting and Underfitting.
Ans: Same as Question 5.
13) Explain briefly about the libraries used in Data Science
Ans:
Pandas: For working with tabular data.
NumPy: For math and numerical arrays.
Matplotlib: For making charts and graphs.
Seaborn: For attractive, easy-to-read statistical charts.
Scikit-learn: For machine learning tasks.
TensorFlow/Keras: For building deep learning models.
SciPy: For scientific computing.
Statsmodels: For statistical analysis.
Plotly: For interactive charts and graphs.
NLTK/SpaCy: For working with text data.
14) Write a short note about Data cleaning.
Ans:
Data Cleaning is the process of identifying and correcting errors in a dataset to improve its
quality. The goal is to ensure the data is accurate, complete, and ready for analysis.
Key Steps:
1. Remove duplicates.
2. Handle missing values.
3. Correct errors
4. Standardize formats.
5. Remove outliers.
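A short pandas sketch of these steps on a made-up customer table:

```python
# Key data-cleaning steps in pandas; the customer data below is made up.
import pandas as pd

df = pd.DataFrame({
    "name":  ["Asha", "Asha", "Ravi", "Meena"],
    "email": ["asha@x.com", "asha@x.com", "RAVI@X.COM", None],
    "age":   [25, 25, 40, 200],   # 200 is an obvious outlier
})

df = df.drop_duplicates()                      # 1. remove duplicates
df["email"] = df["email"].fillna("unknown")    # 2. handle missing values
df["email"] = df["email"].str.lower()          # 3/4. correct and standardize formats
df = df[df["age"].between(0, 120)]             # 5. remove impossible outliers

print(df)
```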
15) Write a brief note about Data sampling
Ans:
Data Sampling is the process of selecting a smaller, representative subset of data from a larger dataset for analysis. It helps make analysis more manageable, especially when dealing with large datasets.
Types:
1. Random Sampling: Random selection of data points.
2. Stratified Sampling: Sampling from specific groups.
3. Systematic Sampling: Selecting every nth data point.
4. Convenience Sampling: Choosing easily available data.
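A pandas sketch of random, stratified, and systematic sampling on a made-up customer table:

```python
# Sampling a smaller subset from a larger (made-up) dataset with pandas.
import pandas as pd

df = pd.DataFrame({"customer_id": range(1, 101),
                   "region": ["North", "South"] * 50})

# Random sampling: pick 10 rows at random
random_sample = df.sample(n=10, random_state=0)

# Stratified sampling: pick 10% from each region
stratified_sample = df.groupby("region").sample(frac=0.1, random_state=0)

# Systematic sampling: take every 10th row
systematic_sample = df.iloc[::10]

print(len(random_sample), len(stratified_sample), len(systematic_sample))
```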
16) How can outlier values be determined?
Ans:
Z-Score → values with a z-score above 3 or below −3.
IQR Method → values below Q1 − 1.5×IQR or above Q3 + 1.5×IQR.
Box Plot → Dots outside whiskers.
ML Methods → Isolation Forest, DBSCAN.
Visual Check → Scatter plots, histograms.
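A quick NumPy sketch of the Z-score and IQR checks on synthetic data:

```python
# Z-score and IQR outlier checks on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
data = np.append(rng.normal(loc=50, scale=5, size=200), 120)  # 120 is an injected outlier

# Z-score method: flag values more than 3 standard deviations from the mean
z = (data - data.mean()) / data.std()
print("Z-score outliers:", data[np.abs(z) > 3])

# IQR method: flag values outside Q1 - 1.5*IQR and Q3 + 1.5*IQR
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]
print("IQR outliers:", outliers)
```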
17) Compare between data analytics and data science.
Ans:
Aspect | Data Analytics | Data Science
Focus | "What happened?" | "What will happen?"
Data | Mostly structured data | Both structured and unstructured data
Tools | Excel, SQL | Python, R
Complexity | Lower | Higher
Jobs | Data Analyst | Data Scientist
Methods | Reports, charts, SQL | AI, ML, coding
Outcome | Finds patterns in past data | Predicts future trends
18) Explain briefly about Eigenvectors and Eigenvalues.
Ans:
Eigenvectors are special vectors that don't change direction when a matrix is applied to them; only their length changes.
Eigenvalues are the numbers that tell how much each eigenvector is stretched or shrunk.
In symbols: if A·v = λ·v, then v is an eigenvector of the matrix A and λ is its eigenvalue.
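A quick NumPy check of this relationship for a small example matrix:

```python
# Check that A @ v equals lambda * v for a small 2x2 example matrix.
import numpy as np

A = np.array([[2.0, 0.0],
              [0.0, 3.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)
print("Eigenvalues:", eigenvalues)   # [2. 3.]

# Each column of `eigenvectors` is an eigenvector; applying A only scales it.
v = eigenvectors[:, 0]
print(A @ v, "==", eigenvalues[0] * v)
```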
19) Interpret what do you understand by Imbalanced Data?
Ans:
Imbalanced data means one class appears much more often than the other.
Example:
● Fraud detection: 99 out of 100 transactions are normal, only 1 is fraud.
● Medical tests: 95 out of 100 people are healthy, only 5 are sick.
Advantages:
● Matches real-world data (e.g., fraud, diseases).
● Helps find rare but important cases.
● Faster training on the more common class.
Disadvantages:
● The model may ignore rare cases.
● Harder to train good models.
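One common remedy is to weight the rare class more heavily during training; a small scikit-learn sketch on synthetic fraud-like data (the 99%/1% split is made up):

```python
# Handling imbalanced data with class weighting on synthetic data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, weights=[0.99, 0.01],
                           n_features=5, random_state=0)
print("Class counts:", np.bincount(y))   # heavily imbalanced

# class_weight="balanced" makes the rare class count more during training,
# so the model is less likely to ignore it.
model = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)
```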
20) Compare expected value and mean value.
Ans:
Aspect | Expected Value | Mean Value
Meaning | Predicted (theoretical) average | Actual (observed) average
Used for | Future predictions | Past data analysis
Example | Average roll of a die = 3.5 | Rolling a die 10 times and averaging the results
Purpose | Predicts long-term average outcomes | Summarizes a given dataset
Depends on | Probability of values | Total sum of observed values
Stability | Stays the same for a given probability distribution | Changes with different data samples
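A quick NumPy illustration with a fair six-sided die:

```python
# Expected value vs. sample mean for a fair six-sided die.
import numpy as np

# Expected value: probability-weighted average of all possible outcomes
faces = np.arange(1, 7)
expected = np.sum(faces * (1 / 6))
print("Expected value:", expected)        # 3.5, fixed by the probabilities

# Mean value: average of actual observed rolls; changes from sample to sample
rng = np.random.default_rng(0)
rolls = rng.integers(1, 7, size=10)
print("Sample mean of 10 rolls:", rolls.mean())
```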
21) Define bias-variance trade-off.
Ans:
Bias-Variance Trade-off
It is about finding the right balance in a model:
● Bias (too simple) → the model makes mistakes because it does not learn enough from the data (underfitting).
● Variance (too complex) → the model learns the training data too closely and makes mistakes on new data (overfitting).
Goal:
Find a balance where the model is not too simple or too complex.
22) Define the confusion matrix.
Ans:
Confusion Matrix
A confusion matrix helps check how well a model predicts things. It compares actual vs.
predicted results.
Table Example:
Outcome | Predicted: Yes | Predicted: No
Actual: Yes | Correct (TP) | Wrong (FN)
Actual: No | Wrong (FP) | Correct (TN)
Simple Meaning:
● True Positive (TP) → predicted Yes and it was actually Yes (model is right)
● True Negative (TN) → predicted No and it was actually No (model is right)
● False Positive (FP) → predicted Yes but it was actually No (model is wrong)
● False Negative (FN) → predicted No but it was actually Yes (model is wrong)
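A small scikit-learn example on made-up actual/predicted labels:

```python
# Confusion matrix on made-up labels (1 = Yes, 0 = No).
from sklearn.metrics import confusion_matrix

actual    = [1, 0, 1, 1, 0, 1, 0, 0]
predicted = [1, 0, 1, 0, 0, 1, 1, 0]

# Rows = actual classes, columns = predicted classes (ordered Yes, No here)
cm = confusion_matrix(actual, predicted, labels=[1, 0])
print(cm)
# [[TP FN]    -> [[3 1]
#  [FP TN]]       [1 3]]
```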
23) List the major drawbacks in Linear model.
Ans:
● Needs a straight-line pattern
● Limited flexibility
● Can’t capture variable interactions
● Overfits with too many features
● Not good for categories
● Struggles with related inputs
24) Develop RMSE and MSE in a linear regression model.
Ans:
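The answer is left blank in the notes; the standard definitions are: MSE = (1/n) Σ (yᵢ − ŷᵢ)², the average squared difference between actual values yᵢ and predictions ŷᵢ, and RMSE = √MSE, which is in the same units as y. A short sketch on made-up data:

```python
# MSE and RMSE for a linear regression fit on made-up data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

model = LinearRegression().fit(X, y)
y_pred = model.predict(X)

mse = mean_squared_error(y, y_pred)   # average of squared errors
rmse = np.sqrt(mse)                   # same units as y
print("MSE:", mse, "RMSE:", rmse)
```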
25) Compare between correlation and covariance
Ans:
Aspect | Correlation | Covariance
Meaning | Shows how strongly two variables are related | Shows how two variables change together
Range | Between −1 and +1 | Any value (positive or negative)
Units | Independent of units | Depends on units
Interpretation | +1 strong positive, 0 no relation, −1 strong negative | Positive: move together; Negative: move in opposite directions
Uses | Comparing relationships | Checking how variables move together
Scale | Standardized | Not standardized
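A quick NumPy check of the units/standardization difference on made-up numbers:

```python
# Covariance vs. correlation on made-up data.
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 6])

print("Covariance:", np.cov(x, y)[0, 1])        # unit-dependent, any magnitude
print("Correlation:", np.corrcoef(x, y)[0, 1])  # unitless, between -1 and +1

# Rescaling x changes the covariance but not the correlation
print("Covariance after scaling x by 100:", np.cov(100 * x, y)[0, 1])
print("Correlation after scaling x by 100:", np.corrcoef(100 * x, y)[0, 1])
```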