Linear Regression
Supervised machine learning: a type of machine learning in which the
algorithm learns from labeled data.
Labeled data means a dataset whose target values are already known.
Supervised learning has two types:
Classification: predicts the class of an observation from the independent
input variables. Classes are categorical or discrete values, e.g. whether an
image shows a cat or a dog.
Regression: predicts a continuous output variable from the independent
input variables, e.g. predicting house prices from parameters such as house
age, distance from the main road, location, and area.
In Simple Linear Regression, the change in the value of the target variable
'Y' is proportional to the change in the value of the feature 'X'.
Y : dependent or target variable.
X : independent variable.
Regression Line: the best-fit line of the model, from which we can predict the
value of 'Y' for new values of 'X'.
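In the usual textbook notation (an added aside, not in the original notebook), the simple linear regression model is

$$Y = \beta_0 + \beta_1 X + \varepsilon$$

where $\beta_0$ is the intercept, $\beta_1$ is the slope, and $\varepsilon$ is the error term. Fitting the model means estimating $\beta_0$ and $\beta_1$ from the data, and the regression line is $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X$.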
Assumptions of Linear Regression:
Linear regression makes several key assumptions about the data and the relationships it
models. Violations of these assumptions can affect the validity and reliability of the
regression results. Here are the main assumptions of linear regression:
Linearity: The relationship between the independent variable(s) and the dependent
variable is linear. This means that the change in the dependent variable for a unit
change in the independent variable is constant.
Independence of Errors: The errors (residuals) of the model are assumed to be
independent of each other. In other words, the error of one observation should not be
influenced by the errors of other observations.
Homoscedasticity: Homoscedasticity refers to the assumption that the variance of
the residuals is constant across all levels of the independent variables. This means
that the spread of residuals should be roughly the same throughout the range of the
predictor variables.
Normality of Errors: The errors (residuals) should be normally distributed. This
assumption is important for hypothesis testing and constructing confidence intervals.
No or Little Multicollinearity: Multicollinearity occurs when two or more independent
variables in the model are highly correlated. This can make it difficult to interpret the
individual effects of each variable on the dependent variable.
No Endogeneity: Endogeneity refers to the situation where an independent variable is
correlated with the error term. This can arise due to omitted variable bias or
simultaneous causation and can lead to biased and inconsistent coefficient
estimates.
No Autocorrelation: Autocorrelation occurs when the residuals of the model are
correlated with each other. This assumption is especially important for time
series data, where observations depend on previous observations. (The
homoscedasticity and normality assumptions above can be checked visually;
see the sketch after this list.)
No Perfect Collinearity: Perfect collinearity exists when one independent variable can
be perfectly predicted by a linear combination of other independent variables. This
situation leads to a rank-deficient matrix, making it impossible to estimate unique
regression coefficients.
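The notebook itself does not verify these assumptions, but a minimal illustrative sketch of the visual checks looks like this (synthetic placeholder residuals are used here, since no model has been fitted yet; a real check would use the fitted model's residuals):

# Illustrative sketch only: visual checks for homoscedasticity and normality,
# using synthetic placeholder residuals
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
fitted = rng.uniform(40000, 120000, 30)   # placeholder fitted values
residuals = rng.normal(0, 5000, 30)       # placeholder residuals

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(fitted, residuals)            # flat, even band -> homoscedastic
ax1.axhline(0, color='red')
ax1.set(xlabel='Fitted values', ylabel='Residuals', title='Residuals vs fitted')
ax2.hist(residuals, bins=10)              # roughly bell-shaped -> normal errors
ax2.set(xlabel='Residual', title='Residual distribution')
plt.show()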
Salary Prediction using Simple Linear Regression
# Step 1: Import the required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Step 2: Import the dataset (Kaggle path; filename assumed from the dataset slug)
data = pd.read_csv('/kaggle/input/salary-dataset-simple-linear-regression/Salary_dataset.csv')
print(data.head())
Unnamed: 0 YearsExperience Salary
0 0 1.2 39344.0
1 1 1.4 46206.0
2 2 1.6 37732.0
3 3 2.1 43526.0
4 4 2.3 39892.0
data.shape
(30, 3)
# Get information of the Dataset
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30 entries, 0 to 29
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 30 non-null int64
1 YearsExperience 30 non-null float64
2 Salary 30 non-null float64
dtypes: float64(2), int64(1)
memory usage: 848.0 bytes
Exploratory Data Analysis (EDA):
# 1. NULL Value Treatment
data.isna().sum()
# So, no null values present
Unnamed: 0 0
YearsExperience 0
Salary 0
dtype: int64
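No nulls are present, so no treatment is needed. For reference, a sketch of what the treatment could look like if missing values existed (illustration only, left commented out):

# Example null treatment, not needed for this dataset:
# data = data.dropna()                                             # drop rows with nulls, or
# data['Salary'] = data['Salary'].fillna(data['Salary'].median())  # impute with the median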
# 2. Check for duplicate rows
data.duplicated()
# No duplicates present
0 False
1 False
2 False
3 False
4 False
5 False
6 False
7 False
8 False
9 False
10 False
11 False
12 False
13 False
14 False
15 False
16 False
17 False
18 False
19 False
20 False
21 False
22 False
23 False
24 False
25 False
26 False
27 False
28 False
29 False
dtype: bool
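If duplicate rows had been found, they could be removed before modeling; a minimal sketch (illustration only, left commented out):

# Example duplicate removal, not needed here since every value above is False:
# data = data.drop_duplicates().reset_index(drop=True)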
# 3. Calculate summary statistics
data.describe()

       Unnamed: 0  YearsExperience         Salary
count   30.000000        30.000000      30.000000
mean    14.500000         5.413333   76004.000000
std      8.803408         2.837888   27414.429785
min      0.000000         1.200000   37732.000000
25%      7.250000         3.300000   56721.750000
50%     14.500000         4.800000   65238.000000
75%     21.750000         7.800000  100545.750000
max     29.000000        10.600000  122392.000000

# 4. No categorical variables present
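One observation from the summary: the Unnamed: 0 column is just the row index saved along with the CSV (it runs 0 to 29) and carries no predictive information, so it could be dropped; a one-line sketch (illustration only, not needed below since only YearsExperience and Salary are used):

# Optional: drop the redundant index column
# data = data.drop(columns=['Unnamed: 0'])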
Split Dataset:
# Extract the dependent (target) variable, denoted Y, and the
# independent feature, denoted X, from the dataset
X = data['YearsExperience']
Y = data['Salary']
Splitting Training and Testing Dataset:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)  # seed value assumed, for reproducibility
# Convert Series to DataFrame (scikit-learn expects 2-D feature input)
x_train = pd.DataFrame(x_train)
x_test = pd.DataFrame(x_test)
y_train = pd.DataFrame(y_train)
y_test = pd.DataFrame(y_test)
Model Fitting:
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(x_train, y_train)
LinearRegression()
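After fitting, the learned slope and intercept can be inspected (a quick added sketch, not part of the original notebook):

# Inspect the fitted parameters: Salary ≈ intercept + slope * YearsExperience
print('slope:', regressor.coef_, 'intercept:', regressor.intercept_)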
# Predict output for the x_test dataset
y_pred = regressor.predict(x_test)
y_pred
array([[39297.22202233],
[75603.43359409],
[37386.36878171],
[60316.60766914],
[63182.88753007],
[52673.19470666]])
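To judge the fit more concretely, the predictions can be placed next to the actual test salaries (an added illustration):

# Compare actual and predicted salaries on the test set
comparison = pd.DataFrame({'Actual': y_test.values.ravel(),
                           'Predicted': y_pred.ravel()})
print(comparison)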
Evaluating Model Performance:
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
# Mean Squared Error
mse = mean_squared_error(y_test, y_pred)
mse
36064238.493955195
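Since MSE is in squared salary units, it is hard to read directly; taking its square root gives an error in the same units as Salary (an added aside):

# Root Mean Squared Error, in the same units as Salary (~6005 here)
rmse = np.sqrt(mse)
rmse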
# R² score
r2 = r2_score(y_test, y_pred)
r2
# R² ≈ 0.81, which is close to 1, so the regression line explains most of
# the variance in the test data
0.8143022783109011
# Mean Absolute Error
mae = mean_absolute_error(y_test, y_pred)
mae
5392.453356511894
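matplotlib was imported at the start but no plot survives in this export; a minimal sketch of how the data and the fitted line could be visualized:

# Plot the data points and the fitted regression line
plt.scatter(X, Y, label='Data')
x_line = pd.DataFrame({'YearsExperience': np.linspace(X.min(), X.max(), 100)})
plt.plot(x_line['YearsExperience'], regressor.predict(x_line), color='red', label='Fitted line')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.legend()
plt.show()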