19.
Statistical significance is
A. The science of collecting, organizing, and applying numerical facts
B. Measure of the probability that a certain hypothesis is incorrect given certain observations
C. One of the defining aspects of a data warehouse, which is specially built around all the
existing applications of the operational data
D. None of these
1) Explain Data Science
Ans:
Data Science is the practice of extracting useful information from data. It combines mathematics, programming, statistics, and machine learning to solve problems.
Steps:
1. Collect – Get data.
2. Clean – Fix mistakes.
3. Analyze – Find patterns.
4. Model – Make predictions.
5. Use – Apply results.
In short: Collect → Clean → Analyze → Model → Use
Examples:
● Fraud detection
● Movie suggestions (Netflix)
● Health predictions
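As a rough sketch only, the steps above could look like this in Python with pandas and scikit-learn; the file name sales.csv and the column names are hypothetical placeholders, not part of the notes.

```python
# A minimal sketch of the Collect -> Clean -> Analyze -> Model -> Use pipeline.
# The file "sales.csv" and columns "ads_spend", "visits", "revenue" are made up.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# 1. Collect - load raw data
df = pd.read_csv("sales.csv")

# 2. Clean - drop duplicates and fill missing numeric values
df = df.drop_duplicates()
df = df.fillna(df.mean(numeric_only=True))

# 3. Analyze - look at basic patterns
print(df.describe())

# 4. Model - fit a simple predictive model
X, y = df[["ads_spend", "visits"]], df["revenue"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_train, y_train)

# 5. Use - apply the model to unseen data
print("Test R^2:", model.score(X_test, y_test))
```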
2) Compare box plot and histogram.
Ans:
Aspect | Box Plot | Histogram
Purpose | Summarizes data | Shows data distribution and frequency
Data type | Continuous | Continuous and discrete
Granularity | Low | High
Use case | Summary and comparison | Detailed distribution analysis
Displays | Median, quartiles, IQR, outliers | Frequency of values in bins
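A small matplotlib sketch of the two plots side by side on the same synthetic data:

```python
# Box plot vs. histogram on the same (synthetic) data.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
data = rng.normal(loc=50, scale=10, size=500)   # made-up example data

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

# Box plot: median, quartiles, IQR and outliers in one compact summary
ax1.boxplot(data)
ax1.set_title("Box plot (summary)")

# Histogram: frequency of values in bins, showing the distribution's shape
ax2.hist(data, bins=20)
ax2.set_title("Histogram (distribution)")

plt.tight_layout()
plt.show()
```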
3) Explain briefly about Data science tools.
Ans:
Category | Tools
Programming languages | Python, R
Data manipulation | Pandas, SQL
Databases | SQL, NoSQL
Cloud | Google Cloud, AWS
Data storage | MySQL, MongoDB
IDEs | Jupyter Notebook, VS Code
4) Interpret applications of classification with example
Ans:
Email Spam Detection: Classifying emails as spam or not spam
Customer Segmentation: Grouping customers by purchasing behavior.
Speech Recognition: Converting spoken language into text.
Image Recognition: Detecting objects in images.
Medical Diagnosis: Classifying diseases as present or absent (e.g., cancer detection).
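A minimal, hypothetical spam-detection sketch with scikit-learn; the example messages and labels are made up:

```python
# Classification example: spam vs. not spam on tiny made-up messages.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

messages = ["win a free prize now", "meeting at 10 am tomorrow",
            "claim your free reward", "project report attached"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam

# Convert text to word-count features, then train a Naive Bayes classifier
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(messages)
model = MultinomialNB().fit(X, labels)

# Classify a new, unseen email
new_email = vectorizer.transform(["free prize waiting for you"])
print(model.predict(new_email))  # expected output: [1] -> spam
```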
5) List down the conditions for Overfitting and Underfitting.
Ans:
Overfitting Conditions:
● Model is too complex
● Too many parameters
● Too many features
● Training data is too small or insufficient
● Training for too long
● Lack of regularization
Underfitting Conditions:
● Model is too simple (low complexity relative to the data's complexity)
● Insufficient training
● Too much regularization
● Too few features
● Data preprocessing issues
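A small scikit-learn sketch (on made-up data) showing how too little and too much model complexity lead to underfitting and overfitting:

```python
# Underfitting vs. overfitting by varying polynomial degree on noisy synthetic data.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = np.linspace(0, 1, 40).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.2, size=40)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=1)

for degree in (1, 4, 15):  # too simple, reasonable, too complex
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    print(f"degree={degree:2d}  train R^2={model.score(X_train, y_train):.2f}"
          f"  test R^2={model.score(X_test, y_test):.2f}")

# degree 1 tends to underfit (low score on both sets); degree 15 tends to
# overfit (high train score, much lower test score).
```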
6) Summarize the reason why Python is used for data cleaning in Data Science
Ans:
Python is used for data cleaning because:
● Libraries: Pandas and NumPy make cleaning easy.
● Flexibility: Works with many different data formats.
● Efficiency: Handles large datasets quickly.
● Community support: Lots of resources and help from the community.
● Automation: Cleaning tasks can be scripted and repeated.
7) Define data analytics.
Ans:
Data analytics is the process of using tools and techniques to analyze data and find patterns.
It involves various techniques and tools to collect, clean, analyze, and describe data.
Advantages: improved efficiency, risk management, enhanced operational performance, and better financial management.
8) Illustrate supervised and unsupervised learning.
Ans:
Supervised Learning:
Supervised learning is a machine learning technique that uses labeled data to train algorithms
to predict outcomes
You train the model with labeled data (input and the correct output).
Goal: The model learns to predict the output for new, unseen data.
Example: Email Spam Classification
Unsupervised Learning:
Unsupervised learning is a machine learning technique that finds patterns in unlabeled data, without human-provided labels.
You train the model with unlabeled data (just inputs, no outputs).
Goal: The model finds hidden patterns or groups in the data.
Example: Customer Segmentation
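A tiny sketch contrasting the two, using scikit-learn on made-up numbers:

```python
# Supervised vs. unsupervised learning on tiny made-up data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

# Supervised: inputs X with known labels y (e.g., 1 = spam, 0 = not spam)
X = np.array([[1, 20], [2, 25], [8, 2], [9, 1]])
y = np.array([0, 0, 1, 1])
clf = LogisticRegression().fit(X, y)
print("Predicted label:", clf.predict([[7, 3]]))

# Unsupervised: the same inputs but no labels; the model finds groups on its own
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("Cluster assignments:", km.labels_)
```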
9) Write briefly about Data mining concept.
Ans:
Data mining is the process of discovering hidden patterns, trends, and valuable information within large datasets. It involves sorting through large data sets to identify patterns and relationships that can help solve business problems through data analysis.
● Data collection: gather data from various sources.
● Analysis: use techniques such as machine learning and statistical algorithms to find patterns.
● Goal: extract valuable knowledge.
Common techniques in data mining: Classification, Clustering, Association.
10) Organize Data cleaning and sampling with an example
Ans:
Data Cleaning
Data cleaning is the process of fixing or removing incorrect, incomplete, or duplicate data to
improve its quality.
Example:
● Removing duplicate customer records.
● Filling missing phone numbers.
● Correcting invalid email formats.
Sampling
Sampling is the process of selecting a smaller, representative subset of data from a larger
dataset for analysis.
Example:
● From a dataset of 100,000 customers, you randomly select a sample of 1,000 customers
to analyze sales patterns.
11) Explain briefly about Data Science.
Ans: Same as Question 1.
12) List down the conditions for Overfitting and Underfitting.
Ans: Same as Question 5.
13) Explain briefly about the libraries used in Data Science
Ans:
Pandas: For working with tabular data.
NumPy: For math and numerical arrays.
Matplotlib: For making charts and graphs.
Seaborn: For attractive, easy-to-read statistical charts.
Scikit-learn: For machine learning tasks.
TensorFlow/Keras: For building deep learning models.
SciPy: For scientific computing.
Statsmodels: For statistical analysis.
Plotly: For interactive charts and graphs.
NLTK/SpaCy: For working with text data.
14) Write a short note about Data cleaning.
Ans:
Data Cleaning is the process of identifying and correcting errors in a dataset to improve its
quality. The goal is to ensure the data is accurate, complete, and ready for analysis.
Key Steps:
1. Remove duplicates.
2. Handle missing values.
3. Correct errors
4. Standardize formats.
5. Remove outliers.
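A short pandas sketch of these steps on a made-up customer table:

```python
# Key data-cleaning steps in pandas; the customer data below is made up.
import pandas as pd

df = pd.DataFrame({
    "name":  ["Asha", "Asha", "Ravi", "Meena"],
    "email": ["asha@x.com", "asha@x.com", "RAVI@X.COM", None],
    "age":   [25, 25, 40, 200],   # 200 is an obvious outlier
})

df = df.drop_duplicates()                      # 1. remove duplicates
df["email"] = df["email"].fillna("unknown")    # 2. handle missing values
df["email"] = df["email"].str.lower()          # 3/4. correct and standardize formats
df = df[df["age"].between(0, 120)]             # 5. remove impossible outliers

print(df)
```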
15) Write a brief note about Data sampling
Ans:
Data Sampling is the process of selecting a smaller, representative subset of data from a larger dataset for analysis. It helps make analysis more manageable, especially when dealing with large datasets.
Types:
1. Random Sampling: Random selection of data points.
2. Stratified Sampling: Sampling from specific groups.
3. Systematic Sampling: Selecting every nth data point.
4. Convenience Sampling: Choosing easily available data.
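A pandas sketch of random, stratified, and systematic sampling on a made-up customer table:

```python
# Sampling a smaller subset from a larger (made-up) dataset with pandas.
import pandas as pd

df = pd.DataFrame({"customer_id": range(1, 101),
                   "region": ["North", "South"] * 50})

# Random sampling: pick 10 rows at random
random_sample = df.sample(n=10, random_state=0)

# Stratified sampling: pick 10% from each region
stratified_sample = df.groupby("region").sample(frac=0.1, random_state=0)

# Systematic sampling: take every 10th row
systematic_sample = df.iloc[::10]

print(len(random_sample), len(stratified_sample), len(systematic_sample))
```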
16) How can outlier values be determined?
Ans:
Z-Score → values with a z-score above 3 or below −3.
IQR Method → values below Q1 − 1.5×IQR or above Q3 + 1.5×IQR.
Box Plot → Dots outside whiskers.
ML Methods → Isolation Forest, DBSCAN.
Visual Check → Scatter plots, histograms.
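A quick NumPy sketch of the Z-score and IQR checks on synthetic data:

```python
# Z-score and IQR outlier checks on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
data = np.append(rng.normal(loc=50, scale=5, size=200), 120)  # 120 is an injected outlier

# Z-score method: flag values more than 3 standard deviations from the mean
z = (data - data.mean()) / data.std()
print("Z-score outliers:", data[np.abs(z) > 3])

# IQR method: flag values outside Q1 - 1.5*IQR and Q3 + 1.5*IQR
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]
print("IQR outliers:", outliers)
```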
17) Compare between data analytics and data science.
Ans:
Aspect | Data Analytics | Data Science
Focus | "What happened?" | "What will happen?"
Data | Mostly structured data | Both structured and unstructured data
Tools | Excel, SQL | Python, R
Complexity | Lower | Higher
Jobs | Data Analyst | Data Scientist
Methods | Reports, charts, SQL | AI, ML, coding
Outcome | Finds patterns in past data | Predicts future trends
18) Explain briefly about Eigenvectors and Eigenvalues.
Ans:
Eigenvectors are special vectors that don't change direction when a matrix is applied to them; only their length changes.
Eigenvalues are the numbers that tell how much each eigenvector is stretched or shrunk.
In symbols: if A·v = λ·v, then v is an eigenvector of the matrix A and λ is its eigenvalue.
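A quick NumPy check of this relationship for a small example matrix:

```python
# Check that A @ v equals lambda * v for a small 2x2 example matrix.
import numpy as np

A = np.array([[2.0, 0.0],
              [0.0, 3.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)
print("Eigenvalues:", eigenvalues)   # [2. 3.]

# Each column of `eigenvectors` is an eigenvector; applying A only scales it.
v = eigenvectors[:, 0]
print(A @ v, "==", eigenvalues[0] * v)
```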
19) Interpret what do you understand by Imbalanced Data?
Ans:
Imbalanced data means one class appears much more often than the other.
Example:
● Fraud detection: 99 out of 100 transactions are normal, only 1 is fraud.
● Medical tests: 95 out of 100 people are healthy, only 5 are sick.
Advantages:
● Matches real-world data (e.g., fraud, diseases).
● Helps find rare but important cases.
● Faster training on the more common class.
Disadvantages:
● The model may ignore rare cases.
● Harder to train good models.
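One common remedy is to weight the rare class more heavily during training; a small scikit-learn sketch on synthetic fraud-like data (the 99%/1% split is made up):

```python
# Handling imbalanced data with class weighting on synthetic data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, weights=[0.99, 0.01],
                           n_features=5, random_state=0)
print("Class counts:", np.bincount(y))   # heavily imbalanced

# class_weight="balanced" makes the rare class count more during training,
# so the model is less likely to ignore it.
model = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)
```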
20) Compare expected value and mean value.
Ans:
Aspect | Expected Value | Mean Value
Meaning | Predicted (theoretical) average | Actual (observed) average
Used for | Future predictions | Past data analysis
Example | Average roll of a die = 3.5 | Rolling a die 10 times and averaging the results
Purpose | Predicts long-term average outcomes | Summarizes a given dataset
Depends on | Probability of values | Total sum of observed values
Stability | Stays the same for a given probability distribution | Changes with different data samples
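A quick NumPy illustration with a fair six-sided die:

```python
# Expected value vs. sample mean for a fair six-sided die.
import numpy as np

# Expected value: probability-weighted average of all possible outcomes
faces = np.arange(1, 7)
expected = np.sum(faces * (1 / 6))
print("Expected value:", expected)        # 3.5, fixed by the probabilities

# Mean value: average of actual observed rolls; changes from sample to sample
rng = np.random.default_rng(0)
rolls = rng.integers(1, 7, size=10)
print("Sample mean of 10 rolls:", rolls.mean())
```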
21) Define bias-variance trade-off.
Ans:
Bias-Variance Trade-off
It is about finding the right balance in a model:
● Bias (too simple) → the model makes mistakes because it does not learn enough from the data (underfitting).
● Variance (too complex) → the model learns the training data too closely and makes mistakes on new data (overfitting).
Goal:
Find a balance where the model is not too simple or too complex.
22) Define the confusion matrix.
Ans:
Confusion Matrix
A confusion matrix helps check how well a model predicts things. It compares actual vs.
predicted results.
Table Example:
Outcome | Predicted: Yes | Predicted: No
Actual: Yes | Correct (TP) | Wrong (FN)
Actual: No | Wrong (FP) | Correct (TN)
Simple Meaning:
● True Positive (TP) → predicted Yes and it was actually Yes (model is right)
● True Negative (TN) → predicted No and it was actually No (model is right)
● False Positive (FP) → predicted Yes but it was actually No (model is wrong)
● False Negative (FN) → predicted No but it was actually Yes (model is wrong)
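A small scikit-learn example on made-up actual/predicted labels:

```python
# Confusion matrix on made-up labels (1 = Yes, 0 = No).
from sklearn.metrics import confusion_matrix

actual    = [1, 0, 1, 1, 0, 1, 0, 0]
predicted = [1, 0, 1, 0, 0, 1, 1, 0]

# Rows = actual classes, columns = predicted classes (ordered Yes, No here)
cm = confusion_matrix(actual, predicted, labels=[1, 0])
print(cm)
# [[TP FN]    -> [[3 1]
#  [FP TN]]       [1 3]]
```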
23) List the major drawbacks in Linear model.
Ans:
● Needs a straight-line pattern
● Limited flexibility
● Can’t capture variable interactions
● Overfits with too many features
● Not good for categories
● Struggles with related inputs
24) Develop RMSE and MSE in a linear regression model.
Ans:
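The answer is left blank in the notes; the standard definitions are: MSE = (1/n) Σ (yᵢ − ŷᵢ)², the average squared difference between actual values yᵢ and predictions ŷᵢ, and RMSE = √MSE, which is in the same units as y. A short sketch on made-up data:

```python
# MSE and RMSE for a linear regression fit on made-up data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

model = LinearRegression().fit(X, y)
y_pred = model.predict(X)

mse = mean_squared_error(y, y_pred)   # average of squared errors
rmse = np.sqrt(mse)                   # same units as y
print("MSE:", mse, "RMSE:", rmse)
```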
25) Compare between correlation and covariance
Ans:
Aspect | Correlation | Covariance
Meaning | Shows how strongly two variables are related | Shows how two variables change together
Range | Between −1 and +1 | Any value (positive or negative)
Units | Independent of units | Depends on units
Interpretation | +1 strong positive, 0 no relation, −1 strong negative | Positive: move together; Negative: move in opposite directions
Uses | Comparing relationships | Checking how variables move together
Scale | Standardized | Not standardized
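A quick NumPy check of the units/standardization difference on made-up numbers:

```python
# Covariance vs. correlation on made-up data.
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 6])

print("Covariance:", np.cov(x, y)[0, 1])        # unit-dependent, any magnitude
print("Correlation:", np.corrcoef(x, y)[0, 1])  # unitless, between -1 and +1

# Rescaling x changes the covariance but not the correlation
print("Covariance after scaling x by 100:", np.cov(100 * x, y)[0, 1])
print("Correlation after scaling x by 100:", np.corrcoef(100 * x, y)[0, 1])
```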