Load Dataset: Import As

The document loads and analyzes an advertising dataset with 200 entries and 4 variables: TV, Radio, Newspaper, and Sales. It performs exploratory data analysis including calculating statistics, visualizing the distributions with histograms, and splitting the data into training and test sets for linear regression. Linear regression is performed to predict Sales with TV, Radio, and Newspaper as features. The model performance is evaluated using mean absolute error. Pairwise relationships between variables are also visualized with a scatter plot.

Uploaded by

ZESTY

5/7/2019 lab06

Load dataset

In [2]:

import seaborn as sns

In [3]:

import pandas as pd
# %matplotlib inline
import matplotlib.pyplot as plt
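# The call was cut off in the export; index_col=0 is inferred from the 1-to-200 index shown in the outputs.
data = pd.read_csv(r"D:\8 semester\Data warehousing and data mining\Labs\LAB6\Advertising.csv", index_col=0)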

In [4]:

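# The call was cut off in the export; index_col=0 is inferred from the 1-to-200 index shown in the outputs.
data = pd.read_csv(r"D:\8 semester\Data warehousing and data mining\Labs\LAB6\Advertising.csv", index_col=0)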

Task no 1

In [5]:

data.shape

Out[5]:

(200, 4)


In [6]:

data.describe()

Out[6]:

               TV       Radio   Newspaper       Sales
count  200.000000  200.000000  200.000000  200.000000
mean   147.042500   23.264000   30.554000   14.022500
std     85.854236   14.846809   21.778621    5.217457
min      0.700000    0.000000    0.300000    1.600000
25%     74.375000    9.975000   12.750000   10.375000
50%    149.750000   22.900000   25.750000   12.900000
75%    218.825000   36.525000   45.100000   17.400000
max    296.400000   49.600000  114.000000   27.000000
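
As a rough illustration of where the figures in the describe() table come from, the sketch below recomputes the same statistics for a small sample using only the standard library. The five values are the Sales figures of the first five rows of the dataset, and the helper name summarize is hypothetical. It assumes the pandas defaults: std is the sample standard deviation (divisor n - 1) and the quartiles use linear interpolation.

```python
import statistics

def summarize(values):
    """Return describe()-style statistics for a list of numbers."""
    # "inclusive" interpolates linearly between the sorted data points.
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "std": statistics.stdev(values),  # sample std, divisor n - 1
        "min": min(values),
        "25%": q1,
        "50%": q2,
        "75%": q3,
        "max": max(values),
    }

# Sales values of the first five rows of the dataset
sales_sample = [22.1, 10.4, 9.3, 18.5, 12.9]
for name, value in summarize(sales_sample).items():
    print(f"{name:>5} {value:.4f}")
```

On the full 200-row column the same steps should give the Sales column of the table above.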

In [7]:

data.max()

Out[7]:

TV 296.4
Radio 49.6
Newspaper 114.0
Sales 27.0
dtype: float64

In [8]:

data.min()

Out[8]:

TV 0.7
Radio 0.0
Newspaper 0.3
Sales 1.6
dtype: float64


In [9]:

data.mean()

Out[9]:

TV 147.0425
Radio 23.2640
Newspaper 30.5540
Sales 14.0225
dtype: float64

In [10]:

data.count()

Out[10]:

TV 200
Radio 200
Newspaper 200
Sales 200
dtype: int64

In [11]:

#data.count

In [12]:

data.columns.values

Out[12]:

array(['TV', 'Radio', 'Newspaper', 'Sales'], dtype=object)

In [13]:

data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 200 entries, 1 to 200
Data columns (total 4 columns):
TV 200 non-null float64
Radio 200 non-null float64
Newspaper 200 non-null float64
Sales 200 non-null float64
dtypes: float64(4)
memory usage: 7.8 KB


In [14]:

data.dtypes

Out[14]:

TV float64
Radio float64
Newspaper float64
Sales float64
dtype: object

In [15]:

data.ndim

Out[15]:

2

In [16]:

data.size

Out[16]:

800
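
The shape, ndim and size attributes are linked: ndim is the number of axes in the shape, and size is the product of the shape's entries. A minimal sketch using only the shape reported in Out[5], with no pandas required:

```python
shape = (200, 4)            # value of data.shape from Out[5]
ndim = len(shape)           # number of axes: rows and columns
size = shape[0] * shape[1]  # total number of cells in the table
print(ndim, size)           # prints: 2 800
```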

In [17]:

data.values

Out[17]:

array([[230.1,  37.8,  69.2,  22.1],
       [ 44.5,  39.3,  45.1,  10.4],
       [ 17.2,  45.9,  69.3,   9.3],
       [151.5,  41.3,  58.5,  18.5],
       [180.8,  10.8,  58.4,  12.9],
       [  8.7,  48.9,  75. ,   7.2],
       [ 57.5,  32.8,  23.5,  11.8],
       [120.2,  19.6,  11.6,  13.2],
       [  8.6,   2.1,   1. ,   4.8],
       [199.8,   2.6,  21.2,  10.6],
       [ 66.1,   5.8,  24.2,   8.6],
       [214.7,  24. ,   4. ,  17.4],
       [ 23.8,  35.1,  65.9,   9.2],
       [ 97.5,   7.6,   7.2,   9.7],
       [204.1,  32.9,  46. ,  19. ],
       [195.4,  47.7,  52.9,  22.4],
       [ 67.8,  36.6, 114. ,  12.5],


In [18]:

data.empty

Out[18]:

False

Task no 2 compare "conda" vs "pip"


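pip installs only Python packages, whereas conda installs packages that may contain software written in any
language, including non-Python libraries and the Python interpreter itself. pip ships with most Python
installations and installs a package with a single command; conda must first be set up through a distribution
such as Anaconda or Miniconda. A conda environment can include pip, whereas pip has no way of managing conda.
pip installs prebuilt wheels when they are available and otherwise builds from source, while conda installs
precompiled binaries. conda can create and manage multiple isolated environments on its own; pip cannot, and
relies on a separate tool such as venv or virtualenv for that.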
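
To make the environment point concrete, here is a small heuristic sketch that guesses which tool manages the interpreter currently running. The function name describe_environment is hypothetical. The check relies on two conventions: conda environments contain a conda-meta directory, and inside a venv sys.prefix differs from sys.base_prefix. It is a rough guess, not an official API.

```python
import os
import sys

def describe_environment():
    """Guess which tool manages the running interpreter's environment."""
    # Conda environments keep their package records in <prefix>/conda-meta.
    if os.path.isdir(os.path.join(sys.prefix, "conda-meta")):
        return "conda"
    # Inside a venv, sys.prefix differs from the base installation's prefix.
    if sys.prefix != getattr(sys, "base_prefix", sys.prefix):
        return "venv"
    return "system"

print(describe_environment())
```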

Task no 3

In [19]:

fig,ax=plt.subplots(1,4,figsize=(15, 3))
data['Radio'].plot(kind="hist", ax=ax[0],color ='blue',alpha=0.6)
data['Sales'].plot(kind="hist", ax=ax[1],color='green',alpha=0.6)
data['Newspaper'].plot( kind="hist",ax=ax[2],color='cyan',alpha=0.6)
data['TV'].plot( kind="hist",ax=ax[3],color='red',alpha=0.6)

Out[19]:

<matplotlib.axes._subplots.AxesSubplot at 0x20e1e4f78d0>

In [20]:

X = data[['TV', 'Radio', 'Newspaper']]

In [21]:

y = data['Sales']


In [22]:

from sklearn.model_selection import train_test_split

In [23]:

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
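
When neither test_size nor train_size is given, train_test_split holds out 25% of the rows, so the 200 rows here become 150 training rows and 50 test rows. The sketch below shows the idea with the standard library only. The helper simple_train_test_split is hypothetical and does not reproduce scikit-learn's exact shuffling, so the rows it selects will differ from those in the notebook.

```python
import math
import random

def simple_train_test_split(rows, test_fraction=0.25, seed=1):
    """Shuffle the rows and hold out a fraction of them for testing."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # reproducible shuffle
    n_test = math.ceil(test_fraction * len(rows))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]

rows = list(range(1, 201))  # stand-in for the 200 row labels
train, test = simple_train_test_split(rows)
print(len(train), len(test))  # prints: 150 50
```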

In [24]:

from sklearn.linear_model import LinearRegression

In [25]:

lr = LinearRegression()

In [26]:

lr.fit(X_train, y_train)

Out[26]:

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [27]:

print(lr.intercept_)

2.8769666223179318

In [28]:

print(lr.coef_)

[0.04656457 0.17915812 0.00345046]
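
The intercept and coefficients define the fitted equation: Sales is approximately 2.877 + 0.0466 * TV + 0.1792 * Radio + 0.0035 * Newspaper, with the coefficients in the same order as the columns of X. As a sketch of what lr.predict does for a single row, the hypothetical helper predict_sales below applies that equation by hand to the first row of the dataset, giving roughly 20.6 against an observed Sales value of 22.1.

```python
intercept = 2.8769666223179318                       # lr.intercept_ from the output above
coefficients = [0.04656457, 0.17915812, 0.00345046]  # lr.coef_: TV, Radio, Newspaper

def predict_sales(tv, radio, newspaper):
    """Apply the fitted linear equation to one set of advertising budgets."""
    features = [tv, radio, newspaper]
    return intercept + sum(c * x for c, x in zip(coefficients, features))

# First row of the dataset: TV=230.1, Radio=37.8, Newspaper=69.2
print(round(predict_sales(230.1, 37.8, 69.2), 2))
```

The Newspaper coefficient is much smaller than the other two, so in this fit newspaper spending barely moves the prediction.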

In [29]:

y_pred = lr.predict(X_test)


In [30]:

# Mean Absolute Error
from sklearn import metrics
print(metrics.mean_absolute_error(y_test, y_pred))

1.0668917082595215
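
Mean absolute error is the average of the absolute differences between actual and predicted values, so the result above means the predictions are off by about 1.07 Sales units on average. A minimal hand-rolled version is sketched below; the sample numbers are hypothetical and chosen only to show the arithmetic.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of |actual - predicted| over all samples."""
    if len(y_true) != len(y_pred):
        raise ValueError("inputs must have the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical actual and predicted sales, not taken from the dataset
actual = [10.0, 12.0, 15.0, 20.0]
predicted = [11.0, 11.5, 14.0, 22.0]
print(mean_absolute_error(actual, predicted))  # (1 + 0.5 + 1 + 2) / 4 = 1.125
```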

What occurs inside a linear regression model?
Compare it with a KNN classifier.

Linear regression fits a straight line (with several features, a plane or hyperplane) through the data so that
the total squared vertical distance between the observed points and the line is as small as possible.
Conceptually, many candidate lines are tried by varying the intercept and the coefficients, and the one with the
smallest total error is kept; in practice scikit-learn's LinearRegression computes this least-squares solution
directly. The result is a numeric prediction. A k-nearest-neighbours (KNN) classifier fits no equation. It stores
the labelled training points, and when a new point arrives it computes the distance to the stored points, takes
the k closest, and assigns the new point to the class that is most common among them.
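
The contrast can be sketched in a few lines of plain Python. This is a toy illustration with invented data, not the notebook's model: fit_line is a hypothetical helper that uses the closed-form least-squares solution for a single feature, and knn_classify is a hypothetical helper that uses Euclidean distance and a majority vote.

```python
from collections import Counter
import math

def fit_line(xs, ys):
    """Least-squares slope and intercept for one feature (closed form)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def knn_classify(train_points, train_labels, query, k=3):
    """Label a query point by majority vote among its k nearest neighbours."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Toy data, invented for illustration only
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])  # points on y = 2x + 1
print(slope, intercept)

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels = ["low", "low", "low", "high", "high", "high"]
print(knn_classify(points, labels, (2, 2)))
```

The regression returns two numbers that summarise all the training data, while the classifier has to keep every training point and consult them at prediction time.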

Contribution

How Sales relates to the other variables, shown with scatter plots

In [31]:

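# The original call was cut off after "aspect=1," in the export; any further arguments are unknown.
sns.pairplot(data, x_vars=['TV', 'Newspaper', 'Radio'], y_vars='Sales', height=4, aspect=1)
plt.show()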

Numerical data distribution

List the data types present in the dataset and keep only the numerical columns.


In [32]:

list(set(data.dtypes.tolist()))

Out[32]:

[dtype('float64')]

In [33]:

data = data.select_dtypes(include = ['float64', 'int64'])


data.head()

Out[33]:

      TV  Radio  Newspaper  Sales
1  230.1   37.8       69.2   22.1
2   44.5   39.3       45.1   10.4
3   17.2   45.9       69.3    9.3
4  151.5   41.3       58.5   18.5
5  180.8   10.8       58.4   12.9

In [34]:

data.hist(figsize=(6, 6), bins=50, xlabelsize=8, ylabelsize=8);
# The trailing semicolon suppresses matplotlib's verbose text output.
