Tutorial Sheet1 (M.L.)
Short Questions:
Q1. What is machine learning?
Training Data: The data used to teach the machine learning model.
Model: The algorithm that learns from data and makes predictions
or decisions.
Avinash Shukla (27)
1. Supervised Learning:
Description: In supervised learning, the model is trained on
labelled data, where each input is paired with a correct output. The
goal is for the algorithm to learn a mapping from inputs to outputs
and make accurate predictions when given new data.
Example Use Cases:
Classification: Identifying whether an email is spam or not
(spam detection).
Regression: Predicting continuous values like house prices
based on features like size, location, etc.
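As an illustration of learning from labelled examples, here is a minimal sketch of a 1-nearest-neighbour classifier. The data, the feature choice, and the names `training_data` and `predict` are hypothetical and chosen only for this example; real systems use far more data and usually a library implementation.

```python
# Hypothetical labelled data: (size in square metres, number of rooms) -> price band.
training_data = [
    ((50.0, 2.0), "low"),
    ((60.0, 2.0), "low"),
    ((120.0, 4.0), "high"),
    ((150.0, 5.0), "high"),
]

def squared_distance(a, b):
    """Squared Euclidean distance between two feature tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def predict(features):
    """Return the label of the closest training example (1-nearest neighbour)."""
    _, label = min(
        training_data,
        key=lambda pair: squared_distance(pair[0], features),
    )
    return label

print(predict((140.0, 5.0)))  # closest labelled example is (150.0, 5.0)
```

The model here is simply the stored examples plus a distance rule; each input is paired with a correct output, which is what makes the setting supervised.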
2. Unsupervised Learning:
3. Reinforcement Learning:
5. Self-supervised Learning:
2. Handling Complexity:
3. Adaptability:
4. Data Dependency:
7. Use of Examples:
2. Model Development:
3. Minimizing Errors:
During the training phase, the model adjusts its internal parameters
(weights, biases, etc.) to minimize the difference between its
predictions and the actual outputs in the training data. The goal is to
reduce errors so that the model can make accurate predictions on
unseen data.
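The parameter adjustment described above can be sketched with gradient descent on a one-feature linear model. This is a minimal illustration under assumed settings (mean squared error, a fixed learning rate of 0.05, toy data generated from y = 2x + 1); the function name `train_linear_model` is hypothetical.

```python
def train_linear_model(xs, ys, learning_rate=0.05, epochs=2000):
    """Fit y ~ w * x + b by repeatedly stepping against the MSE gradient."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Difference between predictions and actual outputs.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to w and b.
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / n * sum(errors)
        # Adjust the parameters to reduce the error.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # toy data generated from y = 2x + 1
w, b = train_linear_model(xs, ys)
print(round(w, 3), round(b, 3))
```

On this toy data the learned weight and bias approach 2 and 1, the values used to generate it.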
5. Feature Importance:
Training data helps the model identify which features (variables) are
important for making predictions. For instance, in predicting house
prices, features like the number of bedrooms or the house’s location
may have a significant influence, and the model learns which of
these are most important based on the training data.
6. Avoiding Overfitting:
Proper use of training data also helps the model avoid overfitting,
which happens when the model learns not only the patterns in the
data but also the noise (irrelevant details) specific to the training
set. A model that overfits will perform well on the training data but
poorly on new, unseen data. Using a large and diverse set of
training data reduces the risk of overfitting.
7. Hyperparameter Tuning:
How It Works:
1. Data Collection:
o Netflix collects large amounts of data from users, including
what shows or movies they watch, how long they watch them,
whether they liked or rated them, and even when they pause
or stop watching.
o Additional data like the genres, actors, directors, and viewing
times are also tracked.
2. Training a Model:
o Using this data, Netflix trains a machine learning model to find
patterns in user behavior and content preferences.
o The model learns to predict what content a user might enjoy
based on similarities with other users and the content they
have watched.
3. Making Recommendations:
o When a user logs in, Netflix uses the trained model to
recommend new content that is likely to match the user’s
interests. The recommendation engine considers what similar
users have enjoyed, trends in popular content, and even
personalized genres based on the user’s previous activity.
4. Continuous Learning:
o Netflix continuously updates its recommendation model as
new data is collected. Every time a user watches a new movie
or series, the model adapts and refines its recommendations
to reflect the user’s evolving tastes.
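The idea of recommending what similar users enjoyed can be sketched with user-based collaborative filtering. This is only an illustrative toy, not Netflix's actual system: the ratings, titles, and the helper names `cosine_similarity` and `recommend` are all hypothetical.

```python
import math

# Hypothetical ratings (1-5). A missing title means the user has not watched it.
ratings = {
    "alice": {"Drama A": 5, "Comedy B": 1, "Thriller C": 4},
    "bob": {"Drama A": 4, "Comedy B": 2, "Thriller C": 5, "Drama D": 5},
    "carol": {"Drama A": 1, "Comedy B": 5, "Comedy E": 4},
}

def cosine_similarity(a, b):
    """Cosine similarity of two users over the titles both have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[t] * b[t] for t in shared)
    norm_a = math.sqrt(sum(a[t] ** 2 for t in shared))
    norm_b = math.sqrt(sum(b[t] ** 2 for t in shared))
    return dot / (norm_a * norm_b)

def recommend(user):
    """Rank unseen titles by similarity-weighted ratings from other users."""
    scores = {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine_similarity(ratings[user], other_ratings)
        for title, rating in other_ratings.items():
            if title not in ratings[user]:
                scores[title] = scores.get(title, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)

print(recommend("alice"))  # titles alice has not seen, best match first
```

Because alice's ratings resemble bob's more than carol's, bob's unseen title ranks first. Continuous learning in this sketch would amount to adding new ratings to the dictionary and recomputing.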
Techniques Used:
True/False:
Q1. Machine learning models require labelled data for training in all
cases. (False)
Q2. Machine learning is a subset of artificial intelligence. (True)
Section 2: Data
Short Questions:
Components of a Dataset:
4. Data Splits:
o Datasets are often divided into different subsets for various
purposes:
Training Set: Used to train the model, typically
comprising the majority of the dataset.
Validation Set: Used to tune hyperparameters and
select the best model during the training process.
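A minimal sketch of splitting a dataset into subsets follows. The 70/15/15 proportions, the fixed seed, and the function name `split_dataset` are assumptions made for illustration; the third subset is the held-out test data used for final evaluation.

```python
import random

def split_dataset(rows, train_frac=0.7, val_frac=0.15, seed=0):
    """Shuffle the rows, then cut them into train, validation, and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains is held out
    return train, val, test

train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))
```

Shuffling before slicing avoids putting, for example, only the earliest records into the training set.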
Types of Datasets:
Example:
Features
Labels
Key Differences
Datasets often contain missing values that can distort analysis and
model performance. Preprocessing includes techniques to handle
missing data, such as imputation (filling in missing values) or
removal of incomplete records, ensuring that the dataset is
complete.
Data may come from various sources and have inconsistent formats
(e.g., date formats, categorical encodings). Preprocessing ensures
that all data is standardized and formatted correctly, facilitating
easier analysis and modeling.
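As a small illustration of standardizing inconsistent formats, the sketch below converts dates written in several styles to ISO format. The list of accepted formats in `KNOWN_FORMATS` and the function name `to_iso_date` are assumptions; a real dataset may need other formats.

```python
from datetime import datetime

# Assumed input formats; extend this tuple to match the actual data sources.
KNOWN_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%B %d, %Y")

def to_iso_date(text):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # try the next format
    return None  # treat as a missing value to be handled separately

raw_dates = ["2024-03-05", "05/03/2024", "March 5, 2024", "unknown"]
print([to_iso_date(d) for d in raw_dates])
```

Values that cannot be parsed come back as `None`, so they can then be handled like any other missing data.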
9. Enhances Generalization:
2. Removing Duplicates:
2. Imputation
Mean/Median/Mode Imputation:
o For numerical data, replace missing values with the mean or
median of the available values in that feature.
o For categorical data, replace missing values with the mode
(the most frequent category).
K-Nearest Neighbors (KNN) Imputation: Use the values of the
nearest neighbors (based on feature similarity) to fill in missing
values. This method takes into account the local structure of the
data.
Regression Imputation: Predict the missing value using a
regression model based on the other available features. This
involves training a model on the complete cases and predicting
missing values for incomplete cases.
Multiple Imputation: Create multiple copies of the dataset with
different imputed values and then combine the results. This
technique accounts for uncertainty associated with the imputed
values.
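Mean, median, and mode imputation can be sketched with the standard library alone. Missing values are represented here as `None`, and the function name `impute` and the sample columns are hypothetical.

```python
from statistics import mean, median, mode

def impute(values, strategy="mean"):
    """Replace None entries with the mean, median, or mode of the observed values."""
    observed = [v for v in values if v is not None]
    fill = {"mean": mean, "median": median, "mode": mode}[strategy](observed)
    return [fill if v is None else v for v in values]

ages = [20, None, 30, 40, None]          # numerical feature
colours = ["red", "blue", None, "red"]   # categorical feature

print(impute(ages, "mean"))
print(impute(colours, "mode"))
```

KNN, regression, and multiple imputation follow the same pattern of computing a fill value from the observed data, but they use the other features as well, so they are usually done with a library.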
5. Interpolation
6. Domain-Specific Methods
True/False:
1. Features are the input variables used to make predictions in
machine learning. (True)
2. All datasets in machine learning have labelled outputs. (False)
1. Model Development:
2. Automatic Differentiation:
3. Performance Optimization:
4. Model Training:
6. Deployment:
1. User-Friendly API:
3. Data Preprocessing:
4. Model Evaluation:
5. Hyperparameter Tuning:
6. Pipelines:
1. Data Structures:
2. Data Cleaning:
3. Data Transformation:
Filtering and Subsetting: Users can easily filter rows and columns
based on specific conditions, allowing for focused analysis of
relevant data.
Aggregation and Grouping: Pandas supports powerful grouping
and aggregation functions, enabling users to summarize data based
on categories or groups (e.g., calculating averages or sums).
Merging and Joining: It allows for combining multiple DataFrames
using various join operations (like SQL joins), which is essential for
integrating data from different sources.
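The three operations above can be sketched on two small DataFrames. The tables, column names, and values are hypothetical; the sketch assumes pandas is installed.

```python
import pandas as pd

# Hypothetical data from two sources.
orders = pd.DataFrame({
    "customer_id": [1, 2, 1, 3],
    "amount": [20.0, 35.0, 15.0, 50.0],
})
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "city": ["Pune", "Delhi", "Pune"],
})

# Filtering: keep only the rows that satisfy a condition.
large_orders = orders[orders["amount"] > 18]

# Merging: attach each customer's city, like a SQL left join.
merged = orders.merge(customers, on="customer_id", how="left")

# Aggregation and grouping: average order amount per city.
avg_by_city = merged.groupby("city")["amount"].mean()

print(large_orders)
print(avg_by_city)
```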
4. Data Exploration:
1. Data Visualization:
4. Results Comparison:
6. Custom Visualizations:
Matplotlib works well with other scientific libraries like NumPy and
Pandas, making it easy to visualize data stored in these formats. For
instance, you can quickly plot data from a Pandas DataFrame.
True/False:
The machine learning process typically involves several key steps that
guide practitioners from understanding the problem to deploying a model.
Here’s an overview of the main steps:
2. Collect Data
3. Data Preparation
5. Select a Model
Fit the selected model to the training data. This involves adjusting
the model parameters to learn from the data.
8. Hyperparameter Tuning
Assess the final model using the test dataset to obtain an unbiased
evaluation of its performance. This step helps confirm that the
model can generalize to unseen data.
Forward Pass: For each instance in the training data, the model
processes the input features through its architecture to produce a
prediction or output.
4. Iterations
6. Model Evaluation
2. Evaluation Metrics
3. Train-Test Split
4. Cross-Validation
6. Model Comparison
7. Iterative Process
1. Generalization Assessment
3. Overfitting Detection
5. Real-World Simulation
6. Benchmarking
The results obtained from testing data can provide valuable insights
into the strengths and weaknesses of the model. This feedback is
crucial for guiding future work, including data collection, feature
engineering, and model selection.
2. Deployment Scenarios
3. Deployment Platforms
4. Model Serving
6. Feedback Loop
True/False:
1. The machine learning process ends after the model is trained.
(False)
2. Model training involves finding the best fit for the data. (True)
1. Basic Formula:
3. Assumptions:
5. Evaluation Metrics:
6. Applications:
Where:
Where:
$\beta_0$ = Intercept
During the training phase, the model learns the coefficients $\beta$
by minimizing the prediction error on the training dataset. This is
typically done using the Ordinary Least Squares (OLS) method,
which seeks to minimize the sum of the squared differences
between the actual target values and the predicted values.
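For a single feature, the least-squares coefficients have a closed form, which the sketch below computes directly. The function name `fit_ols` and the toy data (generated from y = 2x + 1) are assumptions for illustration; with several features the same idea is usually solved with a linear algebra library.

```python
def fit_ols(xs, ys):
    """Return (beta_0, beta_1) minimizing the sum of squared residuals."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    beta_1 = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    beta_0 = y_mean - beta_1 * x_mean
    return beta_0, beta_1

beta_0, beta_1 = fit_ols([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
print(beta_0, beta_1)
```

The sketch assumes the feature takes at least two distinct values; otherwise `sxx` is zero and the slope is undefined.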
3. Making Predictions
For a new observation, the model requires values for all independent
variables (features). For example, if you have two features $X_1$
and $X_2$, you would input their values.
The model substitutes the input feature values into the regression
equation. For example, with the equation:
$Y_{\text{pred}} = \beta_0 + \beta_1 X_1 + \beta_2 X_2$
You would replace $X_1$ and $X_2$ with their actual values:
$Y_{\text{pred}} = \beta_0 + \beta_1 \cdot (\text{value of } X_1) + \beta_2 \cdot (\text{value of } X_2)$
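The substitution step can be written as a short function. The coefficient and feature values below are made up for illustration, and the name `predict` is hypothetical.

```python
def predict(intercept, coefficients, features):
    """Compute beta_0 + beta_1 * X_1 + beta_2 * X_2 + ... for one observation."""
    return intercept + sum(b * x for b, x in zip(coefficients, features))

# Assumed values: beta_0 = 1.0, beta_1 = 2.0, beta_2 = 0.5, X_1 = 3.0, X_2 = 4.0
y_pred = predict(1.0, [2.0, 0.5], [3.0, 4.0])
print(y_pred)  # 1.0 + 2.0 * 3.0 + 0.5 * 4.0 = 9.0
```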
4. Interpreting Predictions
1. Linear Relationship
2. Equation Representation
In this equation:
4. Limitations
5. Extensions
True/False:
1. Linear regression can model nonlinear relationships. (False)
2. The line that best fits the data points in linear regression is called
the regression line.
$y = a_0 + a_1 x + a_2 x^2 + \dots + a_n x^n + \epsilon$
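Once the coefficients are known, a polynomial regression prediction is just the polynomial evaluated at x (the noise term is not part of the prediction). A minimal sketch, with assumed coefficients and a hypothetical function name `polynomial_predict`:

```python
def polynomial_predict(coefficients, x):
    """Evaluate a_0 + a_1*x + ... + a_n*x^n using Horner's rule."""
    result = 0.0
    # Start from the highest-degree coefficient and work down to a_0.
    for a in reversed(coefficients):
        result = result * x + a
    return result

# Assumed coefficients a_0 = 1, a_1 = 2, a_2 = 3, evaluated at x = 2.
print(polynomial_predict([1.0, 2.0, 3.0], 2.0))  # 1 + 2*2 + 3*4 = 17.0
```

The model is still linear in the coefficients, which is why the same least-squares fitting used for linear regression applies after adding the powers of x as extra features.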
True/False:
1. Polynomial regression is used when the relationship between
variables is linear. (False)
Section 7: Features
Short Questions:
1. What are features in a machine learning model?
Features are individual measurable properties or characteristics of the
data used as input to a machine learning model. They represent the input
variables that the model uses to learn and make predictions.
True/False:
1. Features are the inputs used by the model to make predictions.
(True)
2. The process of creating new features from raw data is called Ans:
Section 8: Scaling
Short Questions:
Q1. Why is feature scaling important in machine
learning?
Feature scaling is important because many machine learning algorithms
rely on the distance between data points. If features have different scales,
it can lead to biased results, as some features may dominate others.
Scaling helps to ensure that all features contribute equally to the model’s
performance.
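Two common scaling methods, min-max scaling and standardization, can be sketched as follows. The sample values and the function names `min_max_scale` and `standardize` are hypothetical, and the sketch assumes the feature is not constant (otherwise the denominators would be zero).

```python
from statistics import mean, pstdev

def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

incomes = [20000.0, 50000.0, 80000.0]  # large scale
ages = [25.0, 40.0, 55.0]              # small scale

print(min_max_scale(incomes), min_max_scale(ages))
print(standardize(incomes))
```

After scaling, income and age lie in the same range, so a distance-based algorithm no longer treats income differences as thousands of times more important than age differences.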
True/False:
1. Feature scaling ensures that all features contribute equally to the
model. (True)
True/False:
1. The cost function helps measure the accuracy of a model. (True)
Ans: minimize
True/False:
1. Gradient descent is used to find the minimum of the cost function.
(True)
Ans: speed
True/False:
1. A higher learning rate leads to faster convergence but may
overshoot the minimum. (True)
2. The learning rate controls how much the model's weights are
adjusted during training. (True)
Ans: slowly
2. The _______ controls the size of steps taken during gradient descent.
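The learning-rate statements above can be checked numerically on a simple cost function. The sketch below minimizes f(w) = w^2, whose gradient is 2w, with three step sizes; the function name `gradient_descent`, the starting point, and the chosen rates are assumptions for illustration.

```python
def gradient_descent(learning_rate, steps=20, start=10.0):
    """Run gradient descent on f(w) = w**2 and return the final w."""
    w = start
    for _ in range(steps):
        w -= learning_rate * 2 * w  # the gradient of w**2 is 2w
    return w

small = gradient_descent(0.01)   # tiny steps: still far from the minimum at 0
medium = gradient_descent(0.4)   # moderate steps: very close to 0
large = gradient_descent(1.1)    # steps too big: overshoots and moves away from 0

print(small, medium, large)
```

With the smallest rate the weight moves toward the minimum only slowly, the moderate rate converges within the 20 steps, and the largest rate overshoots on every step so the weight grows instead of shrinking.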