Module 1 - Foundations of Data Science
(Simple Explanation for Exam)
1. What is Data Science?
Data Science is the process of extracting useful insights from data using a combination of
statistics, computer science, and domain knowledge. It helps answer questions like:
- What happened?
- Why did it happen?
- What will happen?
- What can be done next?
2. AI, ML, and DL
• Artificial Intelligence (AI): Systems that simulate human intelligence.
• Machine Learning (ML): A subset of AI where computers learn from data.
• Deep Learning (DL): A subset of ML that uses neural networks with many layers.
3. Types of Machine Learning
• Supervised Learning: Learns from labelled data (e.g., spam or not spam).
• Unsupervised Learning: Finds patterns in unlabelled data (e.g., grouping customers).
4. Classification vs Regression
• Classification: Predicts categories (e.g., cat or dog).
• Regression: Predicts continuous values (e.g., house price).
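The difference can be sketched in a few lines of plain Python. This is only an illustration: the numbers are made up, and the helper names (fit_line, classify_nearest) are hypothetical, not part of any library. The regression part fits a straight line by least squares and returns a number; the classification part uses a nearest-neighbour rule and returns a category.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept (simple linear regression)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / spread
    intercept = mean_y - slope * mean_x
    return slope, intercept


def classify_nearest(train_values, train_labels, x):
    """1-nearest-neighbour classifier on a single numeric feature."""
    nearest = min(range(len(train_values)), key=lambda i: abs(train_values[i] - x))
    return train_labels[nearest]


# Regression: house size -> price (made-up data; output is a continuous value).
sizes = [50, 80, 100, 120]
prices = [150, 240, 300, 360]
slope, intercept = fit_line(sizes, prices)
predicted_price = slope * 90 + intercept

# Classification: animal weight -> "cat" or "dog" (made-up data; output is a category).
weights = [3, 4, 5, 20, 25, 30]
labels = ["cat", "cat", "cat", "dog", "dog", "dog"]
predicted_label = classify_nearest(weights, labels, 6)

print(predicted_price, predicted_label)
```

The regression output can be any real number, while the classifier can only return one of the labels it has seen.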
5. Feature Vector and Feature Selection
• Feature: An individual measurable property of the data (e.g., height, age).
• Feature Vector: A list of features used to describe an object.
• Feature Selection: Choosing the most relevant features to improve model accuracy and
reduce complexity.
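As a rough sketch, each object can be stored as a list of numbers (its feature vector), and a very simple form of feature selection is to drop features that never change, since they cannot help tell objects apart. The feature names, the data, and the select_by_variance helper below are assumptions made for illustration; real feature selection usually uses stronger criteria.

```python
from statistics import pvariance

# Hypothetical features describing three animals; each row is one feature vector.
feature_names = ["height_cm", "weight_kg", "num_legs"]
samples = [
    [25.0, 4.0, 4],
    [60.0, 30.0, 4],
    [23.0, 3.5, 4],
]


def select_by_variance(rows, names, threshold=0.0):
    """Keep only the features whose variance across all rows exceeds the threshold."""
    columns = list(zip(*rows))  # one tuple of values per feature
    keep = [i for i, column in enumerate(columns) if pvariance(column) > threshold]
    selected_names = [names[i] for i in keep]
    selected_rows = [[row[i] for i in keep] for row in rows]
    return selected_names, selected_rows


selected_names, selected_rows = select_by_variance(samples, feature_names)
print(selected_names)
print(selected_rows)
```

Here num_legs is the same for every sample, so it is removed and each feature vector shrinks from three values to two.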
6. Overfitting, Underfitting & Generalization
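• Overfitting: Model memorizes the training data, including its noise, so it scores well on training data but poorly on new data (high variance).
• Underfitting: Model is too simple to learn the pattern, so it scores poorly on both training and new data (high bias).
• Generalization: Model performs well on new, unseen data.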
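The three ideas can be seen on a tiny made-up dataset. In this sketch the true pattern is "label 1 when x is at least 5", and two training labels are deliberately wrong (noise). The three models are hypothetical stand-ins: a memorizer (nearest training point), a hand-picked threshold rule, and a constant model. The threshold is chosen by hand here, not learned.

```python
def accuracy(model, data):
    """Fraction of (x, y) pairs for which the model predicts y."""
    return sum(model(x) == y for x, y in data) / len(data)


# Made-up data. The labels at x=3 and x=8 are noisy (they break the true pattern).
train = [(1, 0), (2, 0), (3, 1), (4, 0), (6, 1), (7, 1), (8, 0), (9, 1)]
test = [(1.5, 0), (2.6, 0), (3.4, 0), (7.6, 1), (8.4, 1), (9.5, 1)]


def memorizer(x):
    """Overfits: copies the label of the nearest training point, noise included."""
    _, nearest_label = min(train, key=lambda pair: abs(pair[0] - x))
    return nearest_label


def threshold_rule(x):
    """Generalizes: a simple rule that matches the underlying pattern."""
    return 1 if x >= 5 else 0


def constant_model(x):
    """Underfits: ignores the input completely."""
    return 0


for name, model in [("memorizer", memorizer),
                    ("threshold", threshold_rule),
                    ("constant", constant_model)]:
    print(name, accuracy(model, train), accuracy(model, test))
```

The memorizer is perfect on the training set but worst on the test set, the constant model is poor on both, and the threshold rule gives up some training accuracy in exchange for good test accuracy.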
7. Curse of Dimensionality
• As the number of features grows, the data becomes sparse, so far more samples are needed and model performance can drop.
• Solution: Dimensionality reduction (e.g., PCA, LDA) or feature selection.
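One way to see why more features need more data: assume each feature is split into 10 ranges (bins). The number of distinct combinations then grows as 10 to the power of the number of features. The sketch below uses that assumption; the function names are hypothetical and the bin count is arbitrary.

```python
def cells_to_cover(num_features, bins_per_feature=10):
    """Number of distinct bin combinations when every feature has the same bin count."""
    return bins_per_feature ** num_features


def max_coverage(num_samples, num_features, bins_per_feature=10):
    """Best-case fraction of combinations that the given number of samples can touch."""
    return min(1.0, num_samples / cells_to_cover(num_features, bins_per_feature))


# With a fixed budget of 1000 samples, coverage collapses as features are added.
for num_features in (1, 2, 3, 6):
    print(num_features, cells_to_cover(num_features), max_coverage(1000, num_features))
```

With 3 features, 1000 samples could in the best case touch every combination once; with 6 features the same 1000 samples reach at most 0.1% of them.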
8. Evaluation and Model Selection
• Confusion Matrix: Table counting True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN).
• Accuracy = (TP + TN) / (TP + TN + FP + FN)
• Precision = TP / (TP + FP)
• Recall = TP / (TP + FN)
• ROC Curve: Plot of True Positive Rate (Recall) against False Positive Rate, FP / (FP + TN), as the decision threshold changes.
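A small worked example of the formulas above, using made-up labels where 1 is the positive class. The helper names are hypothetical, and returning 0.0 when a denominator is zero is a convention assumed here, not something the formulas themselves define.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, TN, FN for a two-class problem."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1
            else:
                fp += 1
        else:
            if actual == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def safe_divide(numerator, denominator):
    """Return 0.0 instead of failing when the denominator is zero (assumed convention)."""
    return numerator / denominator if denominator else 0.0


# Made-up ground truth and predictions for 10 examples.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
accuracy = safe_divide(tp + tn, tp + fp + tn + fn)
precision = safe_divide(tp, tp + fp)
recall = safe_divide(tp, tp + fn)

print(tp, fp, tn, fn)
print(accuracy, precision, recall)
```

Here TP = 3, FP = 1, TN = 5, FN = 1, which gives Accuracy = 8/10 = 0.8, Precision = 3/4 = 0.75, and Recall = 3/4 = 0.75.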
9. Bias-Variance Tradeoff
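• Bias: Error from wrong or overly simple assumptions in the model (leads to underfitting).
• Variance: Error from sensitivity to small changes in training data (leads to overfitting).
• Goal: Keep both low. Reducing one usually increases the other, so choose the model complexity that gives the lowest total error.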
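For squared error, the tradeoff can be written as a standard decomposition. The notation is assumed here, not taken from the notes: y = f(x) + ε is the observed value, ε is noise with mean 0 and variance σ², and f̂(x) is the model's prediction, with the expectation taken over different training sets and the noise.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{Bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{Variance}}
  + \underbrace{\sigma^2}_{\text{Irreducible noise}}
```

Only the first two terms depend on the model, so the aim is to minimize their sum rather than either one alone.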
10. Training, Validation, Test Sets
• Training Set: Used to train the model.
• Validation Set: Used to tune hyperparameters and compare candidate models.
• Test Set: Used only once, at the end, to evaluate final model performance on unseen data.
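A minimal sketch of such a split in plain Python. The 60/20/20 proportions, the fixed seed, and the function name train_val_test_split are assumptions chosen for illustration; other ratios are common too.

```python
import random


def train_val_test_split(items, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle a copy of the data, then cut it into train, validation, and test parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test_part = shuffled[:n_test]
    val_part = shuffled[n_test:n_test + n_val]
    train_part = shuffled[n_test + n_val:]
    return train_part, val_part, test_part


data = list(range(10))  # stand-in for 10 labelled examples
train_set, val_set, test_set = train_val_test_split(data)
print(len(train_set), len(val_set), len(test_set))
```

Shuffling before cutting avoids accidental ordering effects, and the three parts never share an example, so the test set stays unseen until the final evaluation.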