CHAPTER 2: REGRESSION
1. CHECKING LINEARITY:[pg.no:21-22]
PROGRAM:
from pandas import DataFrame
import matplotlib.pyplot as plt
Stock_Market = {'Year': [2017, 2017, 2017, 2017, 2017, 2017,
2017, 2017, 2017, 2017, 2017, 2017, 2016, 2016, 2016, 2016,
2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016],
'Month': [12, 11, 10, 9, 8, 7, 6, 5, 4, 3,
2, 1, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1],
'Interest_Rate': [2.75, 2.5, 2.5, 2.5, 2.5,
2.5, 2.5, 2.25, 2.25, 2.25, 2, 2, 2, 1.75, 1.75, 1.75, 1.75,
1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75],
'Unemployment_Rate': [5.3, 5.3, 5.3, 5.3,
5.4, 5.6, None, 5.5, None, 5.6, 5.7, 5.9, 6, 5.9, 5.8, 6.1,
6.2, 6.1, 6.1, 6.1, 6.1, 5.9, 6.2, 6.2],
'Stock_Index_Price': [1464, 1394, 1357,
1293, 1256, 1254, 1234, 1195, 1159, 1167, 1130, 1075, 1047,
965, 943, 958, 971, 949, 884, 866, 876, 822, 704, 719]}
df = DataFrame(Stock_Market, columns=['Year', 'Month',
'Interest_Rate', 'Unemployment_Rate', 'Stock_Index_Price'])
plt.scatter(df['Interest_Rate'], df['Stock_Index_Price'],
color='red')
plt.title('Stock Index Price Vs Interest Rate', fontsize=14)
plt.xlabel('Interest Rate', fontsize=14)
plt.ylabel('Stock Index Price', fontsize=14)
plt.grid(True)
plt.show()
plt.scatter(df['Unemployment_Rate'],
df['Stock_Index_Price'], color='green')
plt.title('Stock Index Price Vs Unemployment Rate',
fontsize=14)
plt.xlabel('Unemployment Rate', fontsize=14)
plt.ylabel('Stock Index Price', fontsize=14)
plt.grid(True)
plt.show()
OUTPUT:
2. SIMPLE LINEAR REGRESSION:[pg.no:22-23]
PROGRAM:
from pandas import DataFrame
from sklearn import linear_model
Stock_Market = {'Year': [2017, 2017, 2017, 2017, 2017, 2017,
2017, 2017, 2017, 2017, 2017, 2017, 2016, 2016, 2016, 2016,
2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016],
'Month': [12, 11, 10, 9, 8, 7, 6, 5, 4, 3,
2, 1, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1],
'Interest_Rate': [2.75, 2.5, 2.5, 2.5, 2.5,
2.5, 2.5, 2.25, 2.25, 2.25, 2, 2, 2, 1.75, 1.75, 1.75, 1.75,
1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75],
'Unemployment_Rate': [5.3, 5.3, 5.3, 5.3,
5.4, 5.6, None, 5.5, None, 5.6, 5.7, 5.9, 6, 5.9, 5.8, 6.1,
6.2, 6.1, 6.1, 6.1, 6.1, 5.9, 6.2, 6.2],
'Stock_Index_Price': [1464, 1394, 1357,
1293, 1256, 1254, 1234, 1195, 1159, 1167, 1130, 1075, 1047,
965, 943, 958, 971, 949, 884, 866, 876, 822, 704, 719]}
df = DataFrame(Stock_Market, columns=['Year', 'Month',
'Interest_Rate', 'Unemployment_Rate', 'Stock_Index_Price'])
# Here we have 1 variable for linear regression
X = df[['Interest_Rate']]
Y = df['Stock_Index_Price']
# Model fitting with sklearn linear regression
regr = linear_model.LinearRegression()
regr.fit(X, Y)
# Displaying Intercept and coefficients
print('Intercept:\n', regr.intercept_)
print('\nCoefficients:\n', regr.coef_)
# Prediction with sklearn
new_interest_rate = 2.75
print('Predicted Stock Index Price:\n',
regr.predict([[new_interest_rate]]))
OUTPUT:
INTERPRETATION:
Simple linear regression is of the form y = w0 + w1*x. The output shows w0 (intercept)
as -99.46431881371655 and w1 (coefficient) as 564.20389249. According to the
above example, the equation becomes
Stock_Index_Price = w0 + w1 * Interest_Rate
i.e., Stock_Index_Price = -99.46431881371655 + 564.20389249 * Interest_Rate
For new_interest_rate = 2.75, this gives Stock_Index_Price = 1452.09638554, which is exactly the predicted stock index price.
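As a quick cross-check, the same prediction can be reproduced by hand from the fitted parameters (a minimal sketch; regr is the fitted LinearRegression model from the program above, and the commented values are the ones printed in its output):
# Reproducing the prediction manually from intercept and coefficient
w0 = regr.intercept_      # -99.46431881371655
w1 = regr.coef_[0]        # 564.20389249
print(w0 + w1 * 2.75)     # 1452.0963855..., same as regr.predict([[2.75]])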
3. READING FROM A CSV FILE AND PREDICTING A SET OF DEPENDENT VARIABLE VALUES:[pg.no:24-25]
PROGRAM:
import pandas as pd
from pandas import DataFrame
from sklearn import linear_model
# Reading the input data from a csv file
df = pd.read_csv("stock.csv")
# Here we have 1 variable for linear regression
X = df[['Interest_Rate']]
Y = df['Stock_Index_Price']
# Model fitting with sklearn linear regression
regr = linear_model.LinearRegression()
regr.fit(X, Y)
# Displaying Intercept and coefficients
print('Intercept:\n', regr.intercept_)
print('Coefficients:\n', regr.coef_)
# Prediction with sklearn for all the interest rates
new_interest_rate = df[['Interest_Rate']]
df1 = DataFrame(regr.predict(new_interest_rate))
print('Predicted Stock Index Price:\n', df1)
Output:
4. MULTIPLE LINEAR REGRESSION :[pg.no:25-27]
PROGRAM:
from pandas import DataFrame
from sklearn import linear_model
import statsmodels.api as sm
Stock_Market = {'Year': [2017, 2017, 2017, 2017, 2017, 2017,
2017, 2017, 2017, 2017, 2017, 2017, 2016, 2016, 2016, 2016,
2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016],
'Month': [12, 11, 10, 9, 8, 7, 6, 5, 4, 3,
2, 1, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1],
'Interest_Rate': [2.75, 2.5, 2.5, 2.5, 2.5,
2.5, 2.5, 2.25, 2.25, 2.25, 2, 2, 2, 1.75, 1.75, 1.75, 1.75,
1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75],
'Unemployment_Rate': [5.3, 5.3, 5.3, 5.3,
5.4, 5.6, 5.5, 5.5, 5.5, 5.6, 5.7, 5.9, 6, 5.9, 5.8, 6.1,
6.2, 6.1, 6.1, 6.1, 6.1, 5.9, 6.2, 6.2],
'Stock_Index_Price': [1464, 1394, 1357,
1293, 1256, 1254, 1234, 1195, 1159, 1167, 1130, 1075, 1047,
965, 943, 958, 971, 949, 884, 866, 876, 822, 704, 719]}
df = DataFrame(Stock_Market, columns=['Year', 'Month',
'Interest_Rate', 'Unemployment_Rate', 'Stock_Index_Price'])
# Here we have 2 variables for multiple regression.
X = df[['Interest_Rate', 'Unemployment_Rate']]
Y = df['Stock_Index_Price']
# Model fitting with sklearn linear regression
regr = linear_model.LinearRegression()
regr.fit(X, Y)
# Displaying Intercept and coefficients
print('Intercept:\n', regr.intercept_)
print('Coefficients:\n', regr.coef_)
# Prediction with sklearn
new_interest_rate = 2.75
new_unemployment_rate = 5.3
print('Stock Index Price:')
print(regr.predict([[new_interest_rate, new_unemployment_rate]]))
# Prediction with statsmodels
X = sm.add_constant(X) # adding a constant
model = sm.OLS(Y, X).fit()
predictions = model.predict(X)
print(model.summary())
Output:
INTERPRETATION OF RESULT:
This output includes the intercept and coefficients. We can use this information to
build the multiple linear regression equation as follows:
Stock_Index_Price = (Intercept) + (Interest_Rate coef)*X1 + (Unemployment_Rate coef)*X2
Substituting the values of the intercept and coefficients, we get
Stock_Index_Price = (1798.4040) + (345.5401)*X1 + (-250.1466)*X2
Let Interest_Rate = 2.75 (i.e., X1 = 2.75) and Unemployment_Rate = 5.3
(i.e., X2 = 5.3). Substituting these values into the regression equation gives exactly the
predicted result displayed above:
Stock_Index_Price = (1798.4040) + (345.5401)*(2.75) + (-250.1466)*(5.3) = 1422.86
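The same substitution is easy to verify in Python (using the coefficient values as reported in the output above):
# Verifying the multiple regression equation by direct substitution
x1, x2 = 2.75, 5.3
print(1798.4040 + 345.5401 * x1 + (-250.1466) * x2)   # ~1422.86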
The OLS Regression Results table displays comprehensive statistical information
generated by statsmodels. The following are some important entries from the table.
Adj. R-squared reflects the fit of the model.
R-squared values range from 0 to 1, where a higher value generally indicates a better fit,
assuming certain conditions are met.
The const coefficient is our Y-intercept: it is the expected output (i.e., the Y) when both
Interest_Rate and Unemployment_Rate are zero.
The Interest_Rate coefficient represents the change in the output Y due to a change of one unit in
the interest rate (everything else held constant).
The Unemployment_Rate coefficient represents the change in the output Y due to a change of one
unit in the unemployment rate (everything else held constant).
std err reflects the level of accuracy of the coefficients. The lower it is, the higher the level of
accuracy.
P>|t| is the p-value. A p-value of less than 0.05 is considered to be statistically
significant. The confidence interval represents the range in which our coefficients are likely to fall
(with a likelihood of 95%).
Notice that the coefficients captured in this table match the coefficients generated
by sklearn. We got consistent results by applying both sklearn and statsmodels.
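These quantities can also be read programmatically from the fitted results object (a minimal sketch; model is the fitted OLS result from the program above):
# Accessing the key statistics of the OLS results object
print(model.params)        # const, Interest_Rate and Unemployment_Rate coefficients
print(model.pvalues)       # the P>|t| column
print(model.rsquared_adj)  # adjusted R-squared
print(model.conf_int())    # 95% confidence intervals for the coefficients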
5. LINEAR REGRESSION:[pg.no:29-30]
PROGRAM:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('position_salaries.csv')
X = dataset.iloc[:, 1:2].values
y = dataset.iloc[:, 2].values
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.2, random_state=0)
# Fitting Linear Regression to the dataset
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
# Visualizing the Linear Regression results
plt.scatter(X_train, y_train, color='red')
plt.plot(X_train, regressor.predict(X_train), color='blue')
plt.title('Linear Regression')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()
Output:
Explanation:
In this example, we have used 4 libraries, namely numpy, pandas, matplotlib and
sklearn. We first imported the libraries and loaded the dataset. The dataset is a table
which contains all the values in our csv file: X is the 2nd column, which holds the Years of
Experience values, and y is the last column, which holds the Salary values. We then split
our dataset into a training set and a test set (both X and y values for each set).
test_size=0.2: We split our dataset (10 observations) into 2 parts (training
set, test set), and the ratio of the test set to the whole dataset is 0.2 (2 observations
go into the test set). We could equally write 1/5 instead of 0.2; they are the same. We
should not make the test set too big: if it is too big, we will lack data to train on. Normally, we
should pick around 5% to 30%.
train_size: If we specify the test size, the rest of the data will
automatically be assigned to train_size.
random_state: This is the seed for the random number generator. We can pass
an instance of the RandomState class as well. If we leave it blank (None), the global
RandomState instance used by np.random is used instead; passing a fixed integer such as 0
makes the split reproducible. We now have the training set and the test set, and have built
the linear regression model. Next, we will build a polynomial regression model and visualize it.
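For illustration, the same split can be written with all parameters spelled out (a minimal sketch on made-up data; the keyword arguments are the ones discussed above):
import numpy as np
from sklearn.model_selection import train_test_split
X = np.arange(10).reshape(-1, 1)   # 10 observations, 1 feature
y = np.arange(10)
# 20% test / 80% train, reproducible because of the fixed integer seed
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, train_size=0.8, random_state=0)
print(X_train.shape, X_test.shape)   # (8, 1) (2, 1)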
6. POLYNOMIAL REGRESSION:[pg.no:30-31]
PROGRAM:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('position_salaries.csv')
X = dataset.iloc[:, 1:2].values
y = dataset.iloc[:, 2].values
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.2, random_state=0)
# Fitting polynomial regression to the dataset
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
poly_reg = PolynomialFeatures(degree=4)
X_poly = poly_reg.fit_transform(X)
lin_reg = LinearRegression()
lin_reg.fit(X_poly, y)
# Visualizing the Polynomial Regression results
def viz_polynomial():
    plt.scatter(X, y, color='red')
    plt.plot(X, lin_reg.predict(poly_reg.fit_transform(X)),
             color='blue')
    plt.title('Polynomial Regression')
    plt.xlabel('Years of Experience')
    plt.ylabel('Salary')
    plt.show()

viz_polynomial()
OUTPUT:
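Note that PolynomialFeatures(degree=4) expands the single input column x into the columns [1, x, x^2, x^3, x^4] (the leading 1 is the bias term added by default), and LinearRegression is then fitted on these expanded features. A quick way to see the expansion for one value:
# Inspecting the polynomial feature expansion for a single value
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
print(PolynomialFeatures(degree=4).fit_transform(np.array([[2.0]])))
# [[ 1.  2.  4.  8. 16.]]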
7. LOGISTIC REGRESSION:[pg.no:32-33]
PROGRAM FOR CONFUSION MATRIX:
import pandas as pd
import seaborn as sn
import matplotlib.pyplot as plt
data = {'y_Predicted': [1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0],
'y_Actual': [1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0]}
df = pd.DataFrame(data, columns=['y_Actual', 'y_Predicted'])
# Creating confusion matrix
confusion_matrix = pd.crosstab(df['y_Actual'], df['y_Predicted'], rownames=['Actual'],
colnames=['Predicted'],margins=True)
# Generating heatmap and displaying it
ax = sn.heatmap(confusion_matrix, annot=True)
plt.show()
# Getting the statistics of the confusion matrix
print(confusion_matrix)
Output:
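Because margins=True was passed to pd.crosstab, the printed table also includes row and column totals ('All'). The four core cells of the confusion matrix can be pulled out programmatically (a minimal sketch; df is the frame built in the program above, with 0/1 labels):
# Extracting TN, FP, FN, TP from a crosstab without margins
cm = pd.crosstab(df['y_Actual'], df['y_Predicted'])
TN, FP = cm.loc[0, 0], cm.loc[0, 1]
FN, TP = cm.loc[1, 0], cm.loc[1, 1]
print('Accuracy:', (TP + TN) / (TP + TN + FP + FN))   # 9/12 = 0.75 for the data above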
8. PROGRAM (LOGISTIC REGRESSION WITH TRAIN/TEST SPLIT AND ACCURACY):[pg.no:33-35]
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
import seaborn as sn
import matplotlib.pyplot as plt
candidates = {
'gmat': [780, 750, 690, 710, 680, 730, 690, 720, 740,
690, 610, 690, 710, 680, 770, 610, 580, 650, 540,
590, 620, 600, 550, 550, 570, 670, 660, 580, 650,
660, 640, 620, 660, 660, 680, 650, 670, 580, 590, 690],
'gpa': [4, 3.9, 3.3, 3.7, 3.9, 3.7, 2.3, 3.3, 3.3,
1.7, 2.7, 3.7, 3.7, 3.3, 3.3, 3, 2.7, 3.7, 2.7, 2.3,
3.3, 2, 2.3, 2.7, 3, 3.3, 3.7, 2.3, 3.7,
3.3, 3, 2.7, 4, 3.3, 3.3, 2.3, 2.7, 3.3, 1.7,
3.7],
'work experience': [3, 4, 3, 5, 4, 6, 1, 4, 5, 1, 3, 5, 6,
4, 3, 1, 4, 6, 2, 3, 2, 1, 4, 1, 2, 6, 4, 2, 6, 5, 1, 2, 4,
6, 5, 1, 2, 1, 4, 5],
'admitted': [1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1,
1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1,
1, 0, 0, 0, 0, 1]}
df = pd.DataFrame(candidates, columns=['gmat', 'gpa',
'work experience', 'admitted'])
X = df[['gmat', 'gpa', 'work experience']]
y = df['admitted']
# Splitting the dataset into training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.25, random_state=0)
# Fitting logistic regression to the dataset
logistic_regression = LogisticRegression()
logistic_regression.fit(X_train, y_train)
y_pred = logistic_regression.predict(X_test)
# Creating confusion matrix
confusion_matrix = pd.crosstab(y_test, y_pred,
rownames=['Actual'], colnames=['Predicted'], margins=True)
# Generating heatmap and displaying it
ax = sn.heatmap(confusion_matrix, annot=True)
plt.show()
print(confusion_matrix)
# Displaying accuracy
print('Accuracy:', metrics.accuracy_score(y_test,y_pred))
Output:
9. PROGRAM (PREDICTING ADMISSION FOR NEW CANDIDATES):[pg.no:37-38]
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
import seaborn as sn
import matplotlib.pyplot as plt
candidates = {
'gmat': [780, 750, 690, 710, 680, 730, 690, 720, 740,
690, 610, 690, 710, 680, 770, 610, 580, 650, 540, 590, 620,
600, 550, 550, 570, 670, 660, 580, 650, 660, 640, 620, 660,
660, 680, 650, 670, 580, 590, 690],
'gpa': [4, 3.9, 3.3, 3.7, 3.9, 3.7, 2.3, 3.3, 3.3,
1.7, 2.7, 3.7, 3.7, 3.3, 3.3, 3, 2.7, 3.7, 2.7, 2.3,
3.3, 2, 2.3, 2.7, 3, 3.3, 3.7, 2.3, 3.7,
3.3, 3, 2.7, 4, 3.3, 3.3, 2.3, 2.7, 3.3, 1.7,
3.7],
'work experience': [3, 4, 3, 5, 4, 6, 1, 4, 5, 1, 3, 5, 6,
4, 3, 1, 4, 6, 2, 3, 2, 1, 4, 1, 2, 6, 4, 2, 6, 5, 1, 2, 4,
6, 5, 1, 2, 1, 4, 5],
'admitted': [1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1,
0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1]}
df = pd.DataFrame(candidates, columns=['gmat', 'gpa',
'work experience', 'admitted'])
X = df[['gmat', 'gpa', 'work experience']]
y = df['admitted']
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.25, random_state=0)
logistic_regression = LogisticRegression()
logistic_regression.fit(X_train, y_train)
new_candidates = {
'gmat': [590, 740, 680, 610, 710],
'gpa': [2, 3.7, 3.3, 2.3, 3],
'work experience': [3, 4, 6, 1, 5]}
df2 = pd.DataFrame(new_candidates, columns=['gmat', 'gpa',
'work experience'])
y_pred = logistic_regression.predict(df2)
print(df2)
print(y_pred)
Output: