
NAME :

REGISTER NUMBER :

20ITPL702 – DATA SCIENCE LABORATORY

(IV YEAR / VII SEM)


(BATCH: 2021 – 2025)

B. TECH INFORMATION TECHNOLOGY

ACADEMIC YEAR: 2024–25


Bonafide Certificate
Register No:

Certified that this is the Bonafide Record of work done by


_________________________________________in the B.Tech Degree Course
INFORMATION TECHNOLOGY in the 20ITPL702- DATA SCIENCE
LABORATORY during the Academic Year 2024-25.

Station: Chennai

Date:

STAFF IN CHARGE HEAD OF THE DEPARTMENT

Submitted for Practical Examination held on at


Sri Sai Ram Engineering College, Chennai – 600 044.

INTERNAL EXAMINER EXTERNAL EXAMINER


INDEX

1. Install, configure, and run R programs with necessary packages
2a. Simple R programs using numbers
2b. Computing the mean of a vector
2c. R program that demonstrates the use of objects

Calculator Application using Python
3a (i). Python script to develop a calculator application without using Python objects
3a (ii). Python script to develop a calculator application using Python objects
3b. Python script to develop a calculator application using mathematical functions
3c. Python script to create Python objects for a calculator application and save them in a specified location on disk

Descriptive Statistics in Python
4a. Python script to find basic descriptive statistics using summary, str, quartile on the mtcars and cars datasets
4b. Python script to find a subset of a dataset by using the subset() and aggregate() functions on the iris dataset

Reading and Writing Different Types of Datasets
5a. Python script to read different types of data sets from web and disk and write to a file in a specific location
5b. Python script to read Excel data sheets
5c. Python script to read XML data sets

Visualization
6a. R program to find the data distributions using box and scatter plots
6b. R program to find the outliers using boxplots, Z-Score, and Interquartile Range (IQR)
6c. R program to plot the histogram, bar chart, and pie chart for the given data

Correlation and Covariance
7a. R program to find the correlation matrix of the given iris data
7b. R program to plot the correlation plot on the dataset and visualize, giving an overview of relationships among data, on iris data
7c. R program for analysis of covariance: variance (ANOVA) if data have categorical variables, on iris data

Regression Model
8. R program to perform logistic regression for student admission prediction

Multiple Regression Model
9. Multiple regression analysis with continuous predictors

Regression Model for Prediction
10. Predicting data using regression models

Classification Model
11a. Install relevant packages for classification
11b. Choose a classifier for a classification problem
11c. Evaluate the performance of the classifier

Clustering Model
12a. Clustering algorithms for unsupervised classification
12b. Plot the cluster data using Python visualizations
EX.NO.1 INSTALL, CONFIGURE, AND RUN R PROGRAMS WITH NECESSARY PACKAGES
DATE:
AIM:
To install, configure, and run R with the necessary packages, follow the steps below.
ALGORITHM:
Step 1: Install R
• Go to the official R website at https://www.r-project.org/

• Click on "CRAN" under the "Download" section.


• Choose a mirror site close to your location.
• Download and install R according to your operating system (Windows, macOS, or Linux).

Step 2: Install RStudio (optional but recommended)


• RStudio is an integrated development environment (IDE) for R that provides a user-friendly
interface. It's not required, but highly recommended.
• Go to the RStudio website at https://www.rstudio.com/products/rstudio/download/#download
• Download and install RStudio Desktop according to your operating system.

Step 3: Launch R or RStudio


• If you installed RStudio, open it.
• Otherwise, open the R console directly.

Step 4: Install necessary packages


• To install packages, use the install.packages() function. For example, to install the "dplyr" package, type the following command and press Enter:
install.packages("dplyr")
• Repeat this step for all the packages you need. Make sure to include all the necessary packages for
your specific analysis or task.

Step 5: Load packages
• Once the packages are installed, load them into the R session using the library() function. For example, to load the "dplyr" package, type the following command and press Enter:
library(dplyr)
• Repeat this step for all the packages you installed.

Step 6: Start coding


• You can now start writing R code using the installed and loaded packages.
• In RStudio, you can create a new script file by clicking on "File" -> "New File" -> "R Script".
• Alternatively, you can directly type your code in the R console.

Step 7: Run R code


• To run your R code, you can either execute the code line by line or run the entire script.
• In RStudio, you can execute a single line of code by placing the cursor on that line and pressing
"Ctrl + Enter".
• To run the entire script, click on the "Source" button or press "Ctrl + Shift + S". That's it! You have
now installed, configured, and can run R with necessary packages.

RESULT: Thus, the R package has been successfully installed.

EX.NO:2A SIMPLE R PROGRAMS USING NUMBERS
DATE:
AIM:
To write a simple R program using numbers.
ALGORITHM:
1. Assign the value 5 to variable a.
2. Assign the value 3 to variable b.
3. Add a and b together and store the result in variable sum.
4. Print the value of sum.

PROGRAM:
a <- 5
b <- 3
sum <- a+b
print(sum)

OUTPUT:
[1] 8

RESULT: Thus, simple R programs using numbers have been executed and verified successfully.

EX.NO:2B COMPUTING THE MEAN OF A VECTOR
DATE:
AIM:
To write a simple R program to compute the mean of a vector.
ALGORITHM:
1. Create a vector vec with values 2, 4, 6, 8, and 10.
2. Calculate the mean of the elements in vec using the mean() function and store the result in
mean_val.
3. Print the value of mean_val.

PROGRAM:
vec <- c(2, 4, 6, 8, 10)
mean_val <- mean(vec)
print(mean_val)

OUTPUT:
[1] 6

RESULT: Thus, computing the mean of a vector using R program has been executed and verified
successfully.
EX.NO:2C R PROGRAM THAT DEMONSTRATES THE USE OF OBJECTS
DATE:
AIM:
To write a simple R program that demonstrates the use of objects.
ALGORITHM:
1. Create an empty object using the `list()` function.
2. Assign values to different attributes of the object.
3. Access the attributes and print the student's information.

PROGRAM:
This program creates an object representing a student and displays their information:
# Create an empty object
student <- list()
# Assign the value to attributes
student$name <- "John Doe"
student$age <- 20
student$major <- "Computer Science"
# Access attributes and print student's information
cat(paste("Name:", student$name), paste("\nAge:", student$age), paste("\nMajor:", student$major))

OUTPUT:
[1] "Name: John Doe"
[1] "Age: 20"
[1] "Major: Computer Science"

RESULT: Thus, the execution of R programs using objects has been executed and verified successfully.

EX.NO.3A (i) PYTHON SCRIPT TO DEVELOP CALCULATOR APPLICATIONS
WITHOUT USING PYTHON OBJECTS
DATE:
AIM:
To write a simple Python script to develop a calculator application without using Python objects on
console.

ALGORITHM:
1. Prompt the user to enter the first number.
2. Prompt the user to enter the second number.
3. Prompt the user to select an operation (addition, subtraction, multiplication, or division).
4. Perform the selected operation on the numbers.
5. Display the result.
PROGRAM:
num1 = float(input("Enter the first number: "))
num2 = float(input("Enter the second number: "))
print("Select operation:")
print("1. Addition")
print("2. Subtraction")
print("3. Multiplication")
print("4. Division")
choice = int(input("Enter your choice (1-4): "))
if choice == 1:
    result = num1 + num2
elif choice == 2:
    result = num1 - num2
elif choice == 3:
    result = num1 * num2
elif choice == 4:
    result = num1 / num2
else:
    print("Invalid choice!")
    exit(1)
print("Result:", result)

OUTPUT:
Enter the first number: 5
Enter the second number: 3
Select operation:
1. Addition
2. Subtraction
3. Multiplication
4. Division
Enter your choice (1-4): 3
Result: 15.0

RESULT: Thus, calculator application without using Python objects on the console has been developed
and the output has been executed and verified successfully.

EX.NO.3A (ii) PYTHON SCRIPT TO DEVELOP CALCULATOR APPLICATIONS
WITH USING PYTHON OBJECTS

DATE:
AIM:
To write a simple Python script to develop a calculator application using Python objects on
console.

ALGORITHM:
1. Create a Calculator class with methods for addition, subtraction, multiplication, and division.
2. Initialize the Calculator object.
3. Prompt the user to enter the first number.
4. Prompt the user to enter the second number.
5. Prompt the user to select an operation (addition, subtraction, multiplication, or division).
6. Perform the selected operation using the appropriate method of the Calculator object.
7. Display the result.

PROGRAM:
class Calculator:
    def addition(self, num1, num2):
        return num1 + num2
    def subtraction(self, num1, num2):
        return num1 - num2
    def multiplication(self, num1, num2):
        return num1 * num2
    def division(self, num1, num2):
        return num1 / num2

calculator = Calculator()
num1 = float(input("Enter the first number: "))
num2 = float(input("Enter the second number: "))
# Prompt the user to select an operation
print("Select operation:")
print("1. Addition")
print("2. Subtraction")
print("3. Multiplication")
print("4. Division")
choice = int(input("Enter your choice (1-4): "))
if choice == 1:
    result = calculator.addition(num1, num2)
elif choice == 2:
    result = calculator.subtraction(num1, num2)
elif choice == 3:
    result = calculator.multiplication(num1, num2)
elif choice == 4:
    result = calculator.division(num1, num2)
else:
    print("Invalid choice!")
    exit(1)
print("Result:", result)

OUTPUT:
Enter the first number: 5
Enter the second number: 3
Select operation:
1. Addition
2. Subtraction
3. Multiplication
4. Division
Enter your choice (1-4): 3
Result: 15.0

RESULT: Thus, a calculator application using Python objects on the console has been developed
and the output has been executed and verified successfully.
EX.NO.3B PYTHON SCRIPT TO DEVELOP CALCULATOR APPLICATION USING
MATHEMATICAL FUNCTIONS

DATE:
AIM:
To write a simple Python script to develop a calculator application using mathematical functions on
console.

ALGORITHM:
1. Prompt the user to enter the first number.
2. Prompt the user to enter the second number.
3. Prompt the user to select an operation (addition, subtraction, multiplication, division, or
exponentiation).
4. Perform the selected operation on the numbers using appropriate mathematical functions.
5. Display the result.

PROGRAM:
import math
num1 = float(input("Enter the first number: "))
num2 = float(input("Enter the second number: "))
print("Select operation:")
print("1. Addition")
print("2. Subtraction")
print("3. Multiplication")
print("4. Division")
print("5. Exponentiation ")
choice = int(input("Enter your choice (1-5): "))
if choice == 1:
    result = num1 + num2
elif choice == 2:
    result = num1 - num2
elif choice == 3:
    result = num1 * num2
elif choice == 4:
    result = num1 / num2
elif choice == 5:
    result = math.pow(num1, num2)
else:
    print("Invalid choice!")
    exit(1)
print("Result:", result)

OUTPUT:
Enter the first number: 5
Enter the second number: 3
Select operation:
1. Addition
2. Subtraction
3. Multiplication
4. Division
5. Exponentiation
Enter your choice (1-5): 4
Result: 1.6666666666666667

RESULT: Thus, calculator application using mathematical functions has been developed and the output
has been executed and verified successfully.
EX.NO.3C PYTHON SCRIPT TO CREATE PYTHON OBJECTS FOR CALCULATOR
APPLICATION AND SAVE IN A SPECIFIED LOCATION IN DISK

DATE:
AIM:
To write a simple Python script to create Python objects for calculator application and save in a
specified location in the disk.

ALGORITHM:
1. We start by defining a Calculator class with the basic arithmetic operations: addition,
subtraction, multiplication, and division. The result is stored in the result attribute.
2. We create an instance of the Calculator class called calculator.
3. We perform some calculations by calling the appropriate methods on the calculator object.
4. Next, we specify the location where we want to save the calculator object. In this example, we
use the filename calculator_object.pkl, but you can change it to any desired location.
5. We use the pickle module to save the calculator object to the specified location on the disk.
6. Finally, we print a message indicating the successful saving of the calculator object.

PROGRAM:
import pickle

class Calculator:
    def __init__(self):
        self.result = 0
    def add(self, num):
        self.result += num
    def subtract(self, num):
        self.result -= num
    def multiply(self, num):
        self.result *= num
    def divide(self, num):
        if num != 0:
            self.result /= num
        else:
            print("Error: Cannot divide by zero.")

calculator = Calculator()
calculator.add(5)
calculator.subtract(2)
calculator.multiply(3)
calculator.divide(4)
save_location = 'calculator_object.pkl'
with open(save_location, 'wb') as file:
    pickle.dump(calculator, file)
print(f"Calculator object saved to: {save_location}")

OUTPUT : Calculator object saved to: calculator_object.pkl


After running the script, you will find the saved calculator object (calculator_object.pkl) in the specified
location on your disk
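To verify the save, the object can be loaded back with pickle.load(); a minimal sketch, assuming it runs in the same session (or in a script where the Calculator class is defined):

# Load the calculator object back from disk to verify the save
with open(save_location, 'rb') as file:
    restored = pickle.load(file)
print("Restored result:", restored.result)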

RESULT: Thus, simple Python script to create Python objects for calculator application and save in a
specified location on disk has been executed and verified successfully.

EX.NO:4A PYTHON SCRIPT TO FIND BASIC DESCRIPTIVE STATISTICS USING
SUMMARY, STR, QUARTILE ON MTCARS AND CARS DATASETS

DATE:
AIM:
To write a simple python script to find basic descriptive statistics using summary, str, quartile on
mtcars and cars datasets.
ALGORITHM:
1. Import the pandas library to work with datasets.
2. Load the mtcars dataset using the read_csv() function and assign it to the variable mtcars.
3. Use the describe() function on the mtcars dataset to compute summary statistics and assign
the result to mtcars_summary.
4. Print the summary statistics for the mtcars dataset using print(mtcars_summary).
5. Use the info() function on the mtcars dataset to display information about the data types of the
columns.
6. Print the data type information for the mtcars dataset using print(mtcars.info()).
7. Use the quantile() function on the mtcars dataset to calculate quartiles at 25%, 50%, and 75%.
8. Assign the calculated quartiles to mtcars_quartiles.
9. Print the quartiles for the mtcars dataset using print(mtcars_quartiles).
10. Repeat steps 2-9 for the cars dataset, replacing mtcars with cars in the variable names and
filenames.

PROGRAM:
import pandas as pd
mtcars = pd.read_csv('mtcars.csv')
mtcars_summary = mtcars.describe()
print("Summary Statistics for mtcars dataset:",mtcars_summary)
print("\nData Type Information for mtcars dataset:",mtcars.info())
mtcars_quartiles = mtcars.quantile([0.25, 0.5, 0.75])
print("\nQuartiles for mtcars dataset:",mtcars_quartiles)

OUTPUT :
Summary Statistics for mtcars dataset: mpg cyl disp ... am gear carb
count 32.000000 32.000000 32.000000 ... 32.000000 32.000000 32.0000
mean 20.090625 6.187500 230.721875 ... 0.406250 3.687500 2.8125
std 6.026948 1.785922 123.938694 ... 0.498991 0.737804 1.6152
min 10.400000 4.000000 71.100000 ... 0.000000 3.000000 1.0000
25% 15.425000 4.000000 120.825000 ... 0.000000 3.000000 2.0000
50% 19.200000 6.000000 196.300000 ... 0.000000 4.000000 2.0000
75% 22.800000 8.000000 326.000000 ... 1.000000 4.000000 4.0000
max 33.900000 8.000000 472.000000 ... 1.000000 5.000000 8.0000

[8 rows x 11 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32 entries, 0 to 31
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 model 32 non-null object
1 mpg 32 non-null float64
2 cyl 32 non-null int64
3 disp 32 non-null float64
4 hp 32 non-null int64
5 drat 32 non-null float64
6 wt 32 non-null float64
7 qsec 32 non-null float64
8 vs 32 non-null int64
9 am 32 non-null int64
10 gear 32 non-null int64
11 carb 32 non-null int64
dtypes: float64(5), int64(6), object(1)
memory usage: 3.1+ KB

Data Type Information for mtcars dataset: None

Quartiles for mtcars dataset: mpg cyl disp hp drat ... qsec vs am gear carb
0.25 15.425 4.0 120.825 96.5 3.080 ... 16.8925 0.0 0.0 3.0 2.0
0.50 19.200 6.0 196.300 123.0 3.695 ... 17.7100 0.0 0.0 4.0 2.0
0.75 22.800 8.0 326.000 180.0 3.920 ... 18.9000 1.0 1.0 4.0 4.0
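Step 10 of the algorithm repeats the same computations for the cars dataset; a minimal sketch, assuming the data is saved locally as cars.csv (with the speed and dist columns of R's built-in cars data):

import pandas as pd

# Load the cars dataset (assumed local file)
cars = pd.read_csv('cars.csv')
# Summary statistics, data type information, and quartiles
print("Summary Statistics for cars dataset:", cars.describe())
print("\nData Type Information for cars dataset:", cars.info())
print("\nQuartiles for cars dataset:", cars.quantile([0.25, 0.5, 0.75]))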

RESULT: Thus, a simple Python script to find basic descriptive statistics using summary, str,
and quartile on mtcars and cars datasets has been executed and verified successfully.

EX.NO:4B PYTHON SCRIPT TO FIND SUBSET OF DATASET BY USING
SUBSET(), AND AGGREGATE() FUNCTIONS ON IRIS DATASET
DATE:
AIM:
To write a simple python script to find subset of dataset by using subset(), aggregate() functions on
iris dataset.
ALGORITHM:
1. Import the pandas library to work with the dataset.
2. Load the Iris dataset using the read_csv() function and assign it to the variable iris.
3. Use the subset() function to filter the dataset based on a condition. In this example, we subset
the dataset where the species is equal to 'setosa' and assign the result to the variable subset.
4. Print the subset of the Iris dataset using print(subset).
5. Use the aggregate() function on the subset to compute aggregate statistics. In this example, we
calculate the minimum, maximum, and mean values for the 'sepal_length' column, as well as the
mean value for the 'sepal_width' column.
6. Assign the aggregated result to the variable aggregate_result.
7. Print the aggregated result using print(aggregate_result).

PROGRAM:
import pandas as pd
iris = pd.read_csv("C:\\Users\\Dell\\Downloads\\iris\\iris.csv") # your File Path
subset = iris[iris['variety'] == 'Setosa']
print("Subset of Iris dataset (variety='Setosa'):")
print(subset)
aggregate_result = subset.aggregate({'sepal.length': ['min', 'max', 'mean'], 'sepal.width': 'mean'})
print("\nAggregate result for the subset:")
print(aggregate_result)

OUTPUT :
Subset of Iris dataset (variety='Setosa'):
sepal.length sepal.width petal.length petal.width variety
0 5.1 3.5 1.4 0.2 Setosa

1 4.9 3.0 1.4 0.2 Setosa
2 4.7 3.2 1.3 0.2 Setosa
3 4.6 3.1 1.5 0.2 Setosa
4 5.0 3.6 1.4 0.2 Setosa
5 5.4 3.9 1.7 0.4 Setosa
6 4.6 3.4 1.4 0.3 Setosa
7 5.0 3.4 1.5 0.2 Setosa
8 4.4 2.9 1.4 0.2 Setosa
9 4.9 3.1 1.5 0.1 Setosa
10 5.4 3.7 1.5 0.2 Setosa
11 4.8 3.4 1.6 0.2 Setosa
12 4.8 3.0 1.4 0.1 Setosa
13 4.3 3.0 1.1 0.1 Setosa
14 5.8 4.0 1.2 0.2 Setosa
15 5.7 4.4 1.5 0.4 Setosa
16 5.4 3.9 1.3 0.4 Setosa
17 5.1 3.5 1.4 0.3 Setosa
18 5.7 3.8 1.7 0.3 Setosa
19 5.1 3.8 1.5 0.3 Setosa
20 5.4 3.4 1.7 0.2 Setosa
21 5.1 3.7 1.5 0.4 Setosa
22 4.6 3.6 1.0 0.2 Setosa
23 5.1 3.3 1.7 0.5 Setosa
24 4.8 3.4 1.9 0.2 Setosa
25 5.0 3.0 1.6 0.2 Setosa
26 5.0 3.4 1.6 0.4 Setosa
27 5.2 3.5 1.5 0.2 Setosa
28 5.2 3.4 1.4 0.2 Setosa
29 4.7 3.2 1.6 0.2 Setosa
30 4.8 3.1 1.6 0.2 Setosa

Aggregate result for the subset:


sepal.length sepal.width
min 4.300 NaN
max 5.800 NaN
mean 5.006 3.428

RESULT: Thus, a simple python script to find subset of dataset by using subset(), aggregate() functions
on iris dataset has been executed and verified successfully.

EX.NO:5A PYTHON SCRIPT TO READ DIFFERENT TYPES OF DATA SETS
FROM WEB AND DISK AND WRITE TO A FILE IN A SPECIFIC LOCATION
DATE:
AIM:
To write a simple Python script to read different types of data sets from web and disk and
write to a file in a specific location.
ALGORITHM:
1. Start the program
2. To read data from a CSV file using the pandas package.
3. To read data from excel file using the pandas package.
4. To read data from an html file using the pandas package.
5. Display the output.
6. Stop the program.

PROGRAM:
import numpy as np
import pandas as pd
df = pd.read_csv("C:\\Users\\Dell\\Downloads\\sample_day.csv") # Your File Path
print(df)
OUTPUT :
Day Temperature Humidity Weather
0 Monday 20 55 Sunny
1 Tuesday 22 60 Cloudy
2 Wednesday 21 58 Rainy
3 Thursday 23 57 Sunny
4 Friday 19 54 Cloudy

EXCEL:
import pandas as pd
df=pd.read_excel("C:\\Users\\Dell\\Downloads\\sample_weather_data.xlsx")
print(df)

OUTPUT:

Day Temperature Humidity Weather


0 Monday 20 55 Sunny
1 Tuesday 22 60 Cloudy
2 Wednesday 21 58 Rainy
3 Thursday 23 57 Sunny
4 Friday 19 54 Cloudy
HTML:
import pandas as pd
url = "https://www.fdic.gov/resources/resolutions/bank-failures/failed-bank-list"
match = "Silicon Valley Bank"
df_list = pd.read_html(url, match=match)
print(df_list[0])

OUTPUT:
Bank Name Fund
0 Republic First Bank dba Republic Bank 10546
1 Citizens Bank 10545
2 Heartland Tri-State Bank 10544
3 First Republic Bank 10543
4 Signature Bank 10540
5 Silicon Valley Bank 10539
6 Almena State Bank 10538
7 First City Bank of Florida 10537
8 The First State Bank 10536
9 Ericson State Bank 10535
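The algorithm also calls for writing data to a file in a specific location; a minimal sketch, assuming the DataFrame df read above and an illustrative output path:

# Write the DataFrame to a CSV file at a specific location (path is illustrative)
df.to_csv("C:\\Users\\Dell\\Downloads\\output.csv", index=False)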

RESULT: Thus, the commands for reading data from CSV files, Excel files, and HTML have been
successfully executed.

EX.NO:5B PYTHON SCRIPT TO READ EXCEL DATA SHEETS
DATE:
AIM:
To write a simple Python script to read Excel data sheets.
ALGORITHM:
1. Start the program
2. Install openpyxl library using pip from command line.
3. Import openpyxl library.
4. Read data from an existing spreadsheet.
5. Also users can perform calculations on existing data.
6. Install the xlwings library using pip from the command line.
7. Import the xlwings library.
8. Read data from an existing spreadsheet.
9. Write data to the existing spreadsheet.
10. Display the output.
11. Stop the program.

PROGRAM:
import xlwings as xw
ws = xw.Book("C:\\Users\\Dell\\Downloads\\sample_weather_data.xlsx").sheets['Sheet1'] # your File Path
v1 = ws.range("A1:A4").value
v2 = ws.range("F5").value
print("Result:", v1, v2)

OUTPUT:
Result: ['Day', 'Monday', 'Tuesday', 'Wednesday'] None
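Steps 2-5 of the algorithm mention openpyxl; a minimal sketch reading and writing the same workbook with openpyxl (path is illustrative):

from openpyxl import load_workbook

# Open the workbook and read/write cells with openpyxl
wb = load_workbook("C:\\Users\\Dell\\Downloads\\sample_weather_data.xlsx")
ws = wb['Sheet1']
print(ws['A1'].value)   # read a single cell
ws['F5'] = 'Updated'    # write a value
wb.save("C:\\Users\\Dell\\Downloads\\sample_weather_data.xlsx")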

RESULT: Thus, a Python script to read data from and write data to an Excel file using Python
libraries has been successfully executed and verified.

EX.NO:5C PYTHON SCRIPT TO READ XML DATASETS
DATE:
AIM:
To write a simple Python script to read XML data set.
ALGORITHM:
1. Start the program
2. Install the BeautifulSoup library using pip from the command line.
3. Also install lxml, a third-party Python parser, using pip.
4. Read data from an XML file, find tags, and extract them.
5. Import the ElementTree class found inside the xml library.
6. Read and write data from an XML file.
7. Display the output.
8. Stop the program.

PROGRAM:
from bs4 import BeautifulSoup
with open("C:\\Users\\Dell\\Downloads\\dict.xml", 'r') as f:
    data = f.read()
Bs_data = BeautifulSoup(data, "xml")
b_unique = Bs_data.find_all('unique')
print(b_unique)
b_name = Bs_data.find('child', {'name':'Frank'})
print(b_name)
value = b_name.get('test')
print(value)

OUTPUT:
[<unique>
<child name="Frank" test="1234">Hello</child>
</unique>, <unique>
<child name="Alice" test="5678">Hi</child>
</unique>, <unique>
<child name="Bob" test="91011">Greetings</child>
</unique>]
<child name="Frank" test="1234">Hello</child>
1234
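Steps 5 and 6 of the algorithm mention ElementTree; a minimal sketch parsing and rewriting the same file (paths are illustrative):

import xml.etree.ElementTree as ET

# Parse the XML file and walk the 'child' elements
tree = ET.parse("C:\\Users\\Dell\\Downloads\\dict.xml")
root = tree.getroot()
for child in root.iter('child'):
    print(child.get('name'), child.text)
# Write the tree back to disk
tree.write("C:\\Users\\Dell\\Downloads\\dict_copy.xml")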

RESULT: Thus, a Python script to read data from and write data to an XML file using
Python libraries has been successfully executed and verified.

EX.NO:6A R PROGRAM TO FIND THE DATA DISTRIBUTIONS USING BOX AND SCATTER PLOTS
DATE:
AIM:
To find the data distributions using box and scatter plots.
ALGORITHM:
1. Start the Program
2. Import any dataset.
3. Pick any two columns.
4. Plot the boxplot using the boxplot function.
5. Draw the boxplot with a notch.
6. Display the result.
7. Plot the scatter plot using the plot function.
8. Print the result.
9. Stop the process.

• The basic syntax to create a boxplot in R is: boxplot(x, data, notch, varwidth, names, main)
• The basic syntax for creating a scatterplot in R is: plot(x, y, main, xlab, ylab, xlim, ylim, axes)

PROGRAM:
input <- mtcars[,c('mpg','cyl')]
# Give the chart file a name.
png(file = "boxplot.png")
# Plot the chart.
boxplot(mpg ~ cyl, data = mtcars, xlab = "Number of Cylinders", ylab = "Miles Per Gallon",
main = "Mileage Data")
# Save the file.
dev.off()

OUTPUT:

BOXPLOT WITH NOTCH:


# Give the chart file a name.
png(file = "boxplot_with_notch.png")
# Plot the chart.
boxplot(mpg ~ cyl, data = mtcars, xlab = "Number of Cylinders", ylab = "Miles Per Gallon",
main = "Mileage Data", notch = TRUE, varwidth = TRUE, col = c("green","yellow","purple"),
names = c("High","Medium","Low"))
# Save the file.
dev.off()

OUTPUT:

CREATING SCATTERPLOT:
input <- mtcars[,c('wt','mpg')]
# Give the chart file a name.
png(file = "scatterplot.png")
# Plot the chart for cars with weight between 2.5 to 5 and mileage between 15 and 30.
plot(x = input$wt,y = input$mpg,xlab = "Weight", ylab = "Mileage", xlim = c(2.5,5), ylim =
c(15,30),main = "Weight vs Mileage")
pairs(~wt+mpg+disp+cyl,data = mtcars, main = "Scatterplot Matrix")
# Save the file.
dev.off()

OUTPUT:

RESULT: Thus, the R program to find the data distributions using box and scatter plots has been
executed and verified successfully.

EX.NO:6B R PROGRAM TO FIND THE OUTLIERS USING BOXPLOTS, Z-SCORE AND
INTERQUARTILE RANGE (IQR)
DATE:
AIM:
To find the outliers using boxplots, Z-Score and Interquartile Range(IQR).
ALGORITHM:
1. Start the Program
2. Import any dataset.
3. Pick any two columns.
4. Plot the boxplot using the boxplot function.
5. Calculate the Z-Score using the formula
i. Z-Score = (xi -Mean)/SD
6. Define a threshold and compare each Z-score with it (e.g., thresh = 3).
7. Sort the data in ascending order for calculating IQR.
8. First Calculate Q1(25th percentile) and Q3(75th percentile)
9. Calculate IQR =Q3-Q1.
10. Compute Lower bound = Q1 - 1.5*IQR
11. Compute Upper bound = Q3 + 1.5*IQR
12. Mark each data point that falls outside the lower and upper bounds as an outlier.
13. Display the result
14. Stop the program.

PROGRAM:
# Define x as a random dataset
x <- rnorm(100) # 100 random values from a normal distribution
# Create a boxplot with red outliers
boxplot(x, outcol = "red")

OUTPUT:

R code for calculating Z-score:
# Load necessary library
library(tidyverse)

# Sample data
x <- c(10, 12, 9, 14, 15, 13, 7, 8, 6, 20, 25, 30)

# Drawing boxplot with customized outlier color


boxplot(x, col = "lightblue", outcol = "red", main = "Boxplot with Outliers")

# Set Z-score threshold


thresh <- 3

# Calculate mean and standard deviation


mean_x <- mean(x)
std_x <- sd(x)
# Calculate Z-score for each value and identify outliers based on Z-score
data <- data.frame(x = x)
data <- data %>%
mutate(zscore = (x - mean(x)) / sd(x)) %>%
mutate(outlier = ifelse(abs(zscore) > thresh, "Outlier", "Not Outlier"))

# Display Z-score results


print(data)

OUTPUT:

R Code for an IQR function that returns exact values of your outliers:

# Function to calculate IQR-based outliers


IQR_function <- function(x) {
Q1 <- quantile(x, 0.25) # First quartile (25th percentile)
Q3 <- quantile(x, 0.75) # Third quartile (75th percentile)
IQR <- Q3 - Q1 # Interquartile range
left <- Q1 - 1.5 * IQR # Lower bound for outliers
right <- Q3 + 1.5 * IQR # Upper bound for outliers
# Return the outliers
c(x[x < left], x[x > right])
}
x <- c(10, 12, 9, 14, 15, 13, 7, 8, 6, 20, 25, 30)
# Call the IQR function on the data
outliers <- IQR_function(x)
print("IQR-based Outliers:")
print(outliers)

OUTPUT:

RESULT: Thus, the program to find the outliers using boxplots, Z-Score, and Interquartile
Range (IQR) was implemented and executed successfully.
EX.NO.6C R PROGRAM TO PLOT THE HISTOGRAM, BAR CHART, AND PIE CHART FOR
THE GIVEN DATA
DATE:
AIM:
To plot the histogram, bar chart and pie chart for the given data.

PROCEDURE:
Bar Plot or Bar Chart
A bar plot or bar chart in R represents the values in a data vector as the heights of bars. The
data vector passed to the function is represented over the y-axis of the graph. A bar chart can
behave like a histogram by passing the output of the table() function instead of a raw data vector.
Syntax: barplot(data, xlab, ylab)
Pie Diagram or Pie Chart
A pie chart is a circular chart divided into segments according to the ratio of the data
provided. The whole pie represents 100% and each segment shows its fraction of the whole.
It is another method to represent statistical data in graphical form, and the pie() function
is used to draw it.
Syntax: pie(x, labels, col, main, radius)
A 3D pie chart can also be created in R using the following syntax, but it requires the plotrix
library.
Syntax: pie3D(x, labels, radius, main)

Histogram
A histogram is a graphical representation with bars showing the frequency of grouped data in
a vector. It resembles a bar chart, but a histogram represents the frequency of grouped data
rather than the data values themselves.
Syntax: hist(x, col, border, main, xlab, ylab)
PROGRAM:
BAR CHART:
# defining vector
x <- c(7, 15, 23, 12, 44, 56, 32)
# output to be present as PNG file
png(file = "barplot.png")
# plotting vector
barplot(x, xlab = "GeeksforGeeks Audience",
ylab = "Count", col = "white", col.axis = "darkgreen", col.lab = "darkgreen")

dev.off()

OUTPUT:

PIE CHART:
# defining vector x with number of articles
x <- c(210, 450, 250, 100, 50, 90)
# defining labels for each value in x
names(x) <- c("Algo", "DS", "Java", "C", "C++", "Python")
# output to be present as PNG file
png(file = "piechart.png")
# creating pie chart
pie(x, labels = names(x), col = "white",main = "Articles on GeeksforGeeks", radius = -1,
col.main = "darkgreen")
# saving the file
dev.off()

OUTPUT:

PIE 3D:
# Install and load plotrix if not already installed
install.packages("plotrix")
library(plotrix)

# Defining vector x with the number of articles


x <- c(210, 450, 250, 100, 50, 90)

# Defining labels for each value in x


names(x) <- c("Algo", "DS", "Java", "C", "C++", "Python")

# Output to be saved as PNG file


png(file = "piechart3d.png")

# Creating the 3D pie chart


pie3D(x,
      labels = names(x),
      col = c("red", "blue", "green", "yellow", "orange", "purple"),
      main = "Articles on GeeksforGeeks",
      labelcol = "darkgreen",
      col.main = "darkgreen")

# Saving the file


dev.off()

OUTPUT:

HISTOGRAM:

# defining vector
x <- c(21, 23, 56, 90, 20, 7, 94, 12,
57, 76, 69, 45, 34, 32, 49, 55, 57)
# output to be present as PNG file
png(file = "hist.png")
hist(x, main = "Histogram of Vector x",xlab = "Values", col.lab = "darkgreen",col.main =
"darkgreen")
# saving the file
dev.off()

OUTPUT:

RESULT: Thus, the program to plot the histogram, bar chart, and pie chart for the given data
was implemented and executed successfully.

EX.NO:7A R PROGRAM TO FIND THE CORRELATION MATRIX OF THE GIVEN IRIS DATA
DATE:
AIM:
To find the correlation matrix of the given iris data.
ALGORITHM:
1. Start the Program
2. Import your data in R
3. Compute the Correlation matrix
4. Identify its Significance levels.
5. Format Correlation matrix
6. Visualize the Correlation matrix.
7. Stop the program

PROGRAM:
The R function cor() can be used to compute a correlation matrix.
cor(x, method = c("pearson", "kendall", "spearman"))

Import your data in R


# If .txt tab file, use this
my_data <- read.delim(file.choose())

# Or, if .csv file, use this


my_data <- read.csv(file.choose())

library(ggplot2)
library(tidyr)
library(datasets)
data("iris")
summary(iris)

Compute correlation matrix:


library(DataExplorer)
library(corrplot)
title="matrix_iris"
plot_correlation(iris)

OUTPUT:

RESULT: Thus, the program to find the correlation matrix for the given iris data was implemented and
executed successfully.

EX.NO:7B R PROGRAM TO PLOT THE CORRELATION PLOT ON THE DATASET AND VISUALIZE
GIVING AN OVERVIEW OF RELATIONSHIPS AMONG DATA ON IRIS DATA
DATE:
AIM:
To plot the correlation plot on the dataset and visualize giving an overview of relationships among
data on iris data.
ALGORITHM:
1. Start the Program
2. Import your data in R
3. Compute the Correlation matrix
4. Use any plot to represent the feature by values. (Box plot, Jitter Scatter plot)
5. Visualize the relationship between any two features
6. Stop the program

PROGRAM:
Box plot of Petal Length by flower species:

library(ggplot2)
ggplot(data = iris) + geom_boxplot(mapping = aes(x = iris$Species, y = iris$Petal.Length), fill = "white",
color = c("yellow", "blue", "orange")) + ggtitle("Box plot of Petal Length by flower species")

OUTPUT:

Scatter jitter plot of Petal Width on the x axis vs. Petal Length on y axis, for the species of flower
you identify in your boxplot that has the smallest median Petal Length:

seto <- iris[iris$Species == 'setosa', c("Petal.Length", "Petal.Width", "Species")]
ggplot(seto) +
  geom_point(mapping = aes(x = seto$Petal.Width, y = seto$Petal.Length), position = "jitter") +
  ggtitle("Petal Width vs. Petal Length - Jitter scatter plot")
OUTPUT:

Scatter Point Plot without the jitter - Petal Width Vs. Petal Length:

ggplot(seto) +
  geom_point(mapping = aes(seto$Petal.Width, seto$Petal.Length, colour = Petal.Width >= 0.6)) +
  ggtitle("Petal Width vs. Petal Length - outlier")
OUTPUT:

RESULT: Thus, the program to plot the correlation plot on the dataset and visualize giving an overview
of relationships among iris data was implemented and executed successfully.
EX.NO:7C R PROGRAM FOR ANALYSIS OF COVARIANCE: VARIANCE(ANOVA), IF DATA
HAVE CATEGORICAL VARIABLES ON IRIS DATA
DATE:
AIM:
To implement Analysis of Covariance: Variance (ANOVA) if data have categorical variables on iris
data
ALGORITHM:
1. Start the Program
2. Load the data set into the working environment under the name df
3. Ensure that the data meets the key assumption of "homogeneity of variance" by running
Levene's test.
4. Install car package to execute Levene’s test.
5. Run the ANOVA command.
6. Report the finding by using the describeBy command from psych package.
7. Plot any type of graph for pictorial representation of the finding
8. Stop the program

PROGRAM:
df=iris
install.packages("car")
library(car)
leveneTest(Petal.Length~Species,df)

OUTPUT :

Run the ANOVA command


fit = aov(Petal.Length ~ Species, df)
summary(fit)

OUTPUT:

Reporting Results of ANOVA:

install.packages("psych") l
ibrary(psych)
describeBy(df$Petal.Length, df$Species)

OUTPUT:

Pictorial representation of the finding:

install.packages("ggplot2")
library(ggplot2)
ggplot(df, aes(y = Petal.Length, x = Species, fill = Species)) +
stat_summary(fun.y = "mean", geom = "bar", position = "dodge") +
stat_summary(fun.data = "mean_se", geom = "errorbar", position = "dodge", width = 0.8)

OUTPUT:

RESULT: Thus, the analysis of covariance (ANOVA) on the iris data with categorical variables has been
executed and verified successfully.
EX.NO:8 R PROGRAM TO PERFORM LOGISTIC REGRESSION FOR STUDENT
ADMISSION PREDICTION

DATE:
AIM:
To perform Logistic regression, find out the relation between variables that are affecting the
admission of a student in an institute based on his or her GRE score, GPA obtained, and rank of the student
and also to check whether the model fits it or not.
ALGORITHM:
1. Start the Program.
2. The dataset can be downloaded from “https://stats.idre.ucla.edu/stat/data/binary.csv”
3. Load necessary libraries and the dataset from the local file path into mydata. Convert the rank
variable to a factor.
4. Fit a logistic regression model with admit as the outcome variable and gre, gpa, and rank as
predictors.
5. Summarize the model and calculate confidence intervals for the coefficients.
6. Predict admission probabilities for different ranks and prepare the data for visualization.
7. Plot the predicted probabilities using ggplot2, showing confidence intervals over gre scores by
rank.
PROGRAM:
# Load necessary libraries
library(aod)
library(ggplot2)
# Read the dataset from a local CSV file
mydata <- read.csv("c:/users/Downloads/desktop/binary.csv") # Your File Path
# View the first few rows of the data
head(mydata)
# Summary statistics of the dataset
summary(mydata)
# Standard deviations of the dataset
sapply(mydata, sd)
# Two-way contingency table of categorical outcome and predictors
xtabs(~admit + rank, data = mydata)
# Convert 'rank' to a factor
mydata$rank <- factor(mydata$rank)
# Fit a logistic regression model
mylogit <- glm(admit ~ gre + gpa + rank, data = mydata, family = "binomial")
# Summary of the logistic regression model
summary(mylogit)
# Confidence intervals using profiled log-likelihood
confint(mylogit)
# Confidence intervals using standard errors
confint.default(mylogit)
# Wald test for specific terms (rank2, rank3, rank4)
wald.test(b = coef(mylogit), Sigma = vcov(mylogit), Terms = 4:6)
# Perform Wald test comparing rank 2 and rank 3
l <- cbind(0, 0, 0, 1, -1, 0)
wald.test(b = coef(mylogit), Sigma = vcov(mylogit), L = l)
# Odds ratios
exp(coef(mylogit))
# Odds ratios and 95% CI
exp(cbind(OR = coef(mylogit), confint(mylogit)))
# Create a new data frame with average gre and gpa for each rank
newdata1 <- with(mydata, data.frame(gre = mean(gre), gpa = mean(gpa), rank = factor(1:4)))
# View the new data frame
newdata1
# Predict the probability of admission for each rank
newdata1$rankP <- predict(mylogit, newdata = newdata1, type = "response")
# View the updated data frame
newdata1
# Create a new data frame for plotting predicted probabilities over a range of gre scores
newdata2 <- with(mydata, data.frame(gre = rep(seq(from = 200, to = 800, length.out = 100), 4),
gpa = mean(gpa), rank = factor(rep(1:4, each = 100))))
# Add predictions and standard errors to the new data frame
newdata3 <- cbind(newdata2, predict(mylogit, newdata = newdata2, type = "link", se = TRUE))
# Calculate the predicted probabilities and confidence intervals
newdata3 <- within(newdata3, {
PredictedProb <- plogis(fit)
LL <- plogis(fit - (1.96 * se.fit))
UL <- plogis(fit + (1.96 * se.fit))
})
# View the first few rows of the final dataset
head(newdata3)
# Plot the predicted probabilities with confidence intervals
ggplot(newdata3, aes(x = gre, y = PredictedProb)) +
geom_ribbon(aes(ymin = LL, ymax = UL, fill = rank), alpha = 0.2) +
geom_line(aes(colour = rank), linewidth = 1)
with(mylogit, null.deviance - deviance)
with(mylogit, df.null - df.residual)
with(mylogit, pchisq(null.deviance - deviance, df.null - df.residual, lower.tail = FALSE))
logLik(mylogit)

OUTPUT:

admit gre gpa rank


1 0 380 3.61 3
2 1 660 3.67 3
3 1 800 4.00 1
4 1 640 3.19 4
5 0 520 2.93 4
6 1 760 3.00 2

admit gre gpa rank


Min. :0.0000 Min. :220.0 Min. :2.260 Min. :1.000
1st Qu.:0.0000 1st Qu.:520.0 1st Qu.:3.130 1st Qu.:2.000
Median :0.0000 Median :580.0 Median :3.395 Median :2.000
Mean :0.3175 Mean :587.7 Mean :3.390 Mean :2.485
3rd Qu.:1.0000 3rd Qu.:660.0 3rd Qu.:3.670 3rd Qu.:3.000
Max. :1.0000 Max. :800.0 Max. :4.000 Max. :4.000

admit gre gpa rank


0.4660867 115.5165364 0.3805668 0.9444602

rank
admit 1 2 3 4
0 28 97 93 55
1 33 54 28 12

Call:
glm(formula = admit ~ gre + gpa + rank, family = "binomial",
data = mydata)

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.989979 1.139951 -3.500 0.000465 ***
gre 0.002264 0.001094 2.070 0.038465 *
gpa 0.804038 0.331819 2.423 0.015388 *
rank2 -0.675443 0.316490 -2.134 0.032829 *
rank3 -1.340204 0.345306 -3.881 0.000104 ***
rank4 -1.551464 0.417832 -3.713 0.000205 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 499.98 on 399 degrees of freedom


Residual deviance: 458.52 on 394 degrees of freedom

AIC: 470.52

Number of Fisher Scoring iterations: 4

Waiting for profiling to be done...


2.5 % 97.5 %
(Intercept) -6.2716202334 -1.792547080
gre 0.0001375921 0.004435874
gpa 0.1602959439 1.464142727
rank2 -1.3008888002 -0.056745722
rank3 -2.0276713127 -0.670372346
rank4 -2.4000265384 -0.753542605

2.5 % 97.5 %
(Intercept) -6.2242418514 -1.755716295
gre 0.0001202298 0.004408622
gpa 0.1536836760 1.454391423
rank2 -1.2957512650 -0.055134591
rank3 -2.0169920597 -0.663415773
rank4 -2.3703986294 -0.732528724

Wald test:
----------

Chi-squared test:
X2 = 20.9, df = 3, P(> X2) = 0.00011

Wald test:
----------

Chi-squared test:
X2 = 5.5, df = 1, P(> X2) = 0.019

(Intercept) gre gpa rank2 rank3 rank4


0.0185001 1.0022670 2.2345448 0.5089310 0.2617923 0.2119375

Waiting for profiling to be done...


OR 2.5 % 97.5 %
(Intercept) 0.0185001 0.001889165 0.1665354
gre 1.0022670 1.000137602 1.0044457
gpa 2.2345448 1.173858216 4.3238349
rank2 0.5089310 0.272289674 0.9448343
rank3 0.2617923 0.131641717 0.5115181
rank4 0.2119375 0.090715546 0.4706961

gre gpa rank
1 587.7 3.3899 1
2 587.7 3.3899 2
3 587.7 3.3899 3
4 587.7 3.3899 4

gre gpa rank rankP


1 587.7 3.3899 1 0.5166016
2 587.7 3.3899 2 0.3522846
3 587.7 3.3899 3 0.2186120
4 587.7 3.3899 4 0.1846684

gre gpa rank fit se.fit residual.scale UL LL PredictedProb


1 200.0000 3.3899 1 -0.8114870 0.5147714 1 0.5492064 0.1393812 0.3075737
2 206.0606 3.3899 1 -0.7977632 0.5090986 1 0.5498513 0.1423880 0.3105042
3 212.1212 3.3899 1 -0.7840394 0.5034491 1 0.5505074 0.1454429 0.3134499
4 218.1818 3.3899 1 -0.7703156 0.4978239 1 0.5511750 0.1485460 0.3164108
5 224.2424 3.3899 1 -0.7565919 0.4922237 1 0.5518545 0.1516973 0.3193867
6 230.3030 3.3899 1 -0.7428681 0.4866494 1 0.5525464 0.1548966 0.3223773

RESULT: Thus, the program has been executed and the output has been verified successfully.
EX.NO:9 MULTIPLE REGRESSION ANALYSIS WITH CONTINUOUS PREDICTORS
DATE:
AIM:
To apply multiple regression when the data have continuous independent variables, and to apply it
on the above dataset.
ALGORITHM:
1. Load or prepare the dataset for multiple regression.
2. The dataset can be found at: https://www.kaggle.com/datasets/yasserh/housing-prices-dataset
3. Split the dataset into training and testing sets (e.g., using the train_test_split function).
4. Choose a multiple regression algorithm (e.g., Multiple Linear Regression, Ridge Regression).
5. Train the regression model on the training set with multiple independent variables.
6. Make predictions on the testing set.
7. Evaluate the performance of the multiple regression model using appropriate metrics (e.g., mean
squared error, R-squared score).
8. Dependent variable (Y): a continuous variable that you want to predict or explain.
9. Independent variables (X1, X2, X3, ...): one or more continuous variables that you believe may
influence the dependent variable.
Once you have the dataset, you can follow these steps to apply multiple regression:
Step 1: Prepare the data
• Ensure that your dataset is clean and free of missing values.
• Split your dataset into a training set and a testing set (optional but recommended). The training
set will be used to train the model, and the testing set will be used to evaluate its performance.
Step 2: Explore the data
• Calculate descriptive statistics and visualize the relationship between the dependent variable and
each independent variable. This can help you identify any patterns or potential outliers in the data.
Step 3: Build the regression model
• Specify the multiple regression model using the equation: Y = b0 + b1*X1 + b2*X2 + ... + bn*Xn, where
b0 is the intercept and b1, b2, ..., bn are the coefficients for each independent variable.
• Estimate the coefficients using a suitable method, such as ordinary least squares (OLS).
Step 4: Assess the model
• Evaluate the goodness of fit of the model by analyzing metrics such as R-squared, adjusted R-
squared, and p-values for the coefficients.
• Check for assumptions of multiple regression, including linearity, independence,
homoscedasticity, and normality of residuals.

Step 5: Make predictions
• Once the model is assessed and deemed satisfactory, you can use it to make predictions on new
or unseen data.
• It's important to note that the steps mentioned above provide a general framework for applying
multiple regression. The specific implementation may vary depending on the programming
language or statistical software you are using.
To apply multiple regression on a dataset with a continuous independent variable, you'll need
to have at least one dependent variable and one or more additional independent variables.
Multiple regression allows you to analyze the relationship between the dependent variable and
multiple independent variables simultaneously
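Steps 3 and 4 mention OLS estimation and fit statistics such as R-squared, adjusted R-squared, and p-values, which scikit-learn does not report directly; a minimal sketch using statsmodels (file path and column names are assumptions matching the program below):

import pandas as pd
import statsmodels.api as sm

# Fit Y = b0 + b1*X1 + ... + bn*Xn by ordinary least squares
data = pd.read_csv("housing.csv").dropna()        # illustrative path
X = sm.add_constant(data[['area', 'bedrooms']])   # add the intercept term b0
ols = sm.OLS(data['price'], X).fit()
print(ols.summary())  # reports R-squared, adjusted R-squared, and p-values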
PROGRAM:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

data = pd.read_csv("C:\\Users\\Desktop\\housing.csv") # Replace with your File Path


data = data.dropna() # Drop missing values (if any)
# Convert 'furnishingstatus' to numeric values (encoding)
data['furnishingstatus'] = data['furnishingstatus'].map({'furnished': 1, 'semi-furnished': 0.5,
'unfurnished': 0})
print(data.info())

features = ['area', 'bedrooms', 'furnishingstatus'] # Select relevant features and target


target = 'price'
# Split data into training and testing sets
X = data[features]
y = data[target]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the linear regression model


model = LinearRegression()
model.fit(X_train, y_train)
# Make predictions on the test set
y_pred = model.predict(X_test)
# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("Mean Squared Error:", mse)
print("R-squared:", r2)
# Predict house price based on user input
def predict_house_price():
    print("|---------------House Price Prediction-----------------|")
    area = float(input("Enter the area (in square feet): "))
    bedrooms = int(input("Enter number of bedrooms: "))
    furnishingstatus = float(
        input("Enter furnishing status (1 for furnished, 0.5 for semi-furnished, 0 for unfurnished): "))
    # Pass input as DataFrame with matching column names
    user_input = pd.DataFrame({'area': [area], 'bedrooms': [bedrooms],
                               'furnishingstatus': [furnishingstatus]})
    predicted_price = model.predict(user_input)[0]
    print("|-----------------------------------------------|")
    print(f"Predicted house price: {predicted_price}")
    print("|-----------------------------------------------|")

# Call the function to make a prediction


predict_house_price()

OUTPUT:

RESULT: Thus, the program to apply multiple regression when the data have continuous
independent variables was implemented and executed successfully.

EX.NO:10 PREDICTING DATA USING REGRESSION MODELS

DATE:
AIM:
To apply regression model techniques to predict the data on the dataset.
ALGORITHM:
1. Load or prepare the dataset for regression.
2. Split the dataset into training and testing sets (e.g., using train_test_split function).
3. Choose a regression algorithm (e.g., Linear Regression, Random Forest Regression).
4. Train the regression model on the training set.
5. Make predictions on the testing set.
6. Evaluate the performance of the regression model using appropriate metrics (e.g., mean squared
error, R-squared score).

Step 1: Import Libraries: Start by importing the necessary libraries for data manipulation,
visualization, and regression modeling. Commonly used libraries include NumPy, Pandas,
Matplotlib, and scikit-learn.
Step 2: Load the Dataset: Load the dataset you want to work with into your program. This can
be done using the appropriate functions provided by the chosen library (e.g., Pandas).
Step 3: Explore the Data: Perform exploratory data analysis (EDA) to gain insights into the
dataset. This step involves checking the data types, summary statistics, missing values, and
visualizing the distribution of the variables.
Step 4: Split the Data: Divide the dataset into two subsets: the training set and the testing set.
The training set is used to train the regression model, while the testing set is used to evaluate its
performance.
Step 5: Feature Selection/Engineering: Identify the relevant features (independent variables)
that are likely to have an impact on the target variable (dependent variable). Perform any
necessary feature engineering steps, such as handling missing values, categorical encoding, or
feature scaling.
Step 6: Choose a Regression Model: Select an appropriate regression model based on the
problem you are trying to solve and the nature of your data. Some common regression models
include Linear Regression and Random Forest Regression.
Step 7: Train the Regression Model: Fit the chosen regression model to the training data. This
involves finding the best parameters or coefficients that minimize the error between the predicted
values and the actual target values.
Step 8: Evaluate the Model: Use the testing set to evaluate the performance of the trained
regression model. Common evaluation metrics for regression include mean squared error (MSE),
root mean squared error (RMSE), mean absolute error (MAE), and R-squared.

Step 9: Predict New Data: Once the regression model is trained and evaluated, it can be used to
make predictions on new, unseen data by inputting the relevant feature values.
Step 10: Fine-tune and Improve: If the model's performance is not satisfactory, you can fine-
tune the model by adjusting hyperparameters, trying different regression algorithms, or
incorporating more advanced techniques like regularization.

PROGRAM:
# Predicting Miles Per Gallon (MPG) using the mtcars dataset
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
# Set dimensions for plot
sns.set(rc={'figure.figsize':(11.7,8.27)})
# Load the mtcars dataset from seaborn
mtcars = sns.load_dataset('mpg').dropna() # Dropping missing values
# Display basic info about the dataset
print(mtcars.head())
# Selecting important features for simplicity (drop unnecessary columns)
mtcars_simple = mtcars[['mpg', 'cylinders', 'displacement', 'horsepower', 'weight', 'acceleration']]
# Define input (X) and output (y) variables
X = mtcars_simple.drop(['mpg'], axis=1) # 'mpg' is the target variable
y = mtcars_simple['mpg']
# Split data into train and test sets (70% training, 30% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=3)
# Model: Linear Regression (Multiple Regression)
lin_reg = LinearRegression(fit_intercept=True)
lin_reg.fit(X_train, y_train)
y_pred_lin = lin_reg.predict(X_test)
# Evaluate Linear Regression
lin_rmse = np.sqrt(mean_squared_error(y_test, y_pred_lin))
print(f"Linear Regression RMSE: {lin_rmse}")
# ---- Visualization Code ----
# 1. Plot Actual vs Predicted values (on test set)
plt.figure(figsize=(10,6))
plt.scatter(y_test, y_pred_lin, alpha=0.3)
plt.title('Actual vs Predicted MPG (Linear Regression)')
plt.xlabel('Actual MPG')
plt.ylabel('Predicted MPG')
plt.show()
# 2. Residuals Plot (Errors between actual and predicted)
plt.figure(figsize=(10,6))
sns.histplot((y_test - y_pred_lin), bins=50, kde=True)
plt.title('Residuals Distribution (Linear Regression)')
plt.xlabel('Residuals')
plt.ylabel('Frequency')
plt.show()

# ---- Predicting MPG based on user input ----
def predict_mpg():
    print("\n--- MPG Prediction ---")
    # Input values for the model
    cylinders = float(input("Enter the number of cylinders: "))
    displacement = float(input("Enter the displacement (in cubic inches): "))
    horsepower = float(input("Enter the horsepower: "))
    weight = float(input("Enter the weight (in lbs): "))
    acceleration = float(input("Enter the acceleration: "))
    # Create a dataframe with these input values (based on the predictors in the model)
    user_input = pd.DataFrame({
        'cylinders': [cylinders],
        'displacement': [displacement],
        'horsepower': [horsepower],
        'weight': [weight],
        'acceleration': [acceleration],
    })
    # Predict MPG for the input data
    predicted_mpg = lin_reg.predict(user_input)
    print(f"\nPredicted MPG: {predicted_mpg[0]:.2f}")

# Run prediction function to take input and predict MPG
predict_mpg()
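Step 8 also lists MAE and R-squared among the evaluation metrics; a minimal sketch computing them on the same test-set predictions produced above:

from sklearn.metrics import mean_absolute_error, r2_score

# Additional Step 8 metrics on the same test-set predictions
print("MAE:", mean_absolute_error(y_test, y_pred_lin))
print("R-squared:", r2_score(y_test, y_pred_lin))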

OUTPUT:

RESULT: Thus, a regression model suited to the dataset, problem domain, and
requirements has been executed successfully.

EX.NO.11A INSTALL RELEVANT PACKAGES FOR CLASSIFICATION
DATE:
AIM:
To install relevant packages for classification.
ALGORITHM:
1. Identify the relevant packages required for classification tasks.
2. Determine the preferred method of package installation.
3. Install the packages using the chosen method.

PROCEDURE:
1. Scikit-learn: A popular library for machine learning providing various classification algorithms
and tools.
Installation: pip install scikit-learn
2. NumPy: A fundamental package for scientific computing, supporting numerical operations and
multidimensional arrays used in machine learning.
Installation: pip install numpy
3. Pandas: A powerful library for data manipulation and analysis, offering structures like
DataFrame, useful for data preprocessing.
Installation: pip install pandas
4. Matplotlib: A plotting library for creating visualizations, helpful for analyzing and visualizing
classification model results.
Installation: pip install matplotlib
5. Seaborn: A statistical visualization library built on Matplotlib that offers additional functionality
and visually appealing charts.
Installation: pip install seaborn
6. TensorFlow and PyTorch: Deep learning frameworks suitable for deep learning-based
classification models.
• For TensorFlow: pip install tensorflow
• For PyTorch: pip install torch torchvision
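After installation, the packages can be verified by importing them and printing their versions; a minimal sketch:

# Verify the installed packages by importing them and printing their versions
import sklearn, numpy, pandas, matplotlib, seaborn
for pkg in (sklearn, numpy, pandas, matplotlib, seaborn):
    print(pkg.__name__, pkg.__version__)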

OUTPUT:

RESULT: Thus, the commonly used packages necessary for classification in Python have
been downloaded successfully.

EX.NO.11B CHOOSE A CLASSIFIER FOR A CLASSIFICATION PROBLEM
DATE:
AIM:
To choose a classifier for a classification problem. For a classification problem, one popular
classifier algorithm is the Decision Tree algorithm. A Decision Tree uses a tree-like model of decisions
and their possible consequences. It is easy to interpret and can handle both numerical and categorical
data.
ALGORITHM:
1. Load or generate the dataset for classification (here, the income.csv dataset is used).
2. The dataset can be found at: https://www.kaggle.com/datasets/wenruliu/adult-income-dataset
3. Split the dataset into training and testing sets (e.g., using the train_test_split function).
4. Choose the Decision Tree classifier based on the problem requirements and dataset
characteristics.
5. Train the Decision Tree classifier on the training set.
6. Make predictions on the testing set.
7. Evaluate the performance of the classifier using appropriate metrics (e.g., accuracy, precision,
recall, F1 score) and cross-validation.

PROGRAM:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.preprocessing import LabelEncoder, StandardScaler
import matplotlib.pyplot as plt

# Load and preprocess data
data = pd.read_csv('C:\\Users\\Desktop\\income.csv')  # Add your File Path here

# Convert the income column to binary values
data['income'] = data['income'].apply(lambda x: 1 if x == '>50K' else 0)

# Drop missing values if any
data.dropna(inplace=True)

# Features and target
X = data[['age', 'educational-num', 'capital-gain', 'hours-per-week']]
y = data['income']

# Label encoding for the target (not strictly needed here, since income is already 0/1)
le = LabelEncoder()
y = le.fit_transform(y)

# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.25, random_state=42)

# Decision tree model
tree_model = DecisionTreeClassifier(random_state=42, class_weight='balanced')
tree_model.fit(X_train, y_train)

# Visualize the decision tree
plt.figure(figsize=(20, 10))
plot_tree(tree_model, feature_names=['age', 'educational-num', 'capital-gain', 'hours-per-week'],
          class_names=['<=50K', '>50K'], filled=True, rounded=True)
plt.title("Decision Tree Visualization")
plt.show()

# Feature importance plot
feature_importances = tree_model.feature_importances_
features = ['age', 'educational-num', 'capital-gain', 'hours-per-week']

plt.figure(figsize=(10, 6))
plt.barh(features, feature_importances, color='skyblue')
plt.xlabel('Importance Score')
plt.title('Feature Importance in Decision Tree')
plt.show()

# Input new data for prediction
age = int(input("Enter age: "))
education_num = int(input("Enter education number: "))
capital_gain = int(input("Enter capital gain: "))
hours_per_week = int(input("Enter hours per week: "))
new_data = pd.DataFrame([[age, education_num, capital_gain, hours_per_week]],
                        columns=['age', 'educational-num', 'capital-gain', 'hours-per-week'])

# Standardize new data
new_data_scaled = scaler.transform(new_data)

# Predict income
prediction = tree_model.predict(new_data_scaled)
if prediction[0] == 1:
    print("Prediction: Income >50K")
else:
    print("Prediction: Income <=50K")
OUTPUT:

RESULT: Thus, the Decision Tree classifier was trained on the income dataset, visualized, and used to
predict the income class for new input successfully.

EX.NO:11C EVALUATE THE PERFORMANCE OF THE CLASSIFIER
DATE:
AIM:
To evaluate the performance of the classifier using common evaluation metrics chosen according
to the requirements of the classification problem, such as accuracy, precision, recall, F1 score, and
the confusion matrix.
ALGORITHM:
1. Obtain the true labels and predicted labels from the classifier.
2. Calculate and store the evaluation metrics: accuracy, precision, recall, F1 score, and
confusion matrix.
The following are key performance metrics used to evaluate the effectiveness of a classification
model (the sketch after this list shows how the first four follow from the confusion-matrix counts):
1. Accuracy: The proportion of correctly classified instances out of the total instances.
2. Precision: The ratio of true positive predictions to the total predicted positives, indicating
the quality of positive predictions.
3. Recall: The ratio of true positive predictions to the total actual positives, measuring the
ability to identify relevant instances.
4. F1 Score: The harmonic mean of precision and recall, providing a balance between the two
metrics.
5. Confusion Matrix: A table that displays the counts of true positive, true negative, false
positive, and false negative predictions, helping visualize the model's performance.
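
For a binary problem, the first four metrics follow directly from the four confusion-matrix counts.
The following is a minimal sketch using purely hypothetical counts (TP, TN, FP, FN are illustrative
numbers, not taken from any dataset):

# Hypothetical confusion-matrix counts for a binary classifier
TP, TN, FP, FN = 40, 45, 5, 10

accuracy = (TP + TN) / (TP + TN + FP + FN)          # 85/100 = 0.85
precision = TP / (TP + FP)                          # 40/45 ≈ 0.89
recall = TP / (TP + FN)                             # 40/50 = 0.80
f1 = 2 * precision * recall / (precision + recall)  # ≈ 0.84

print(accuracy, precision, recall, f1)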
PROGRAM:
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

# Load the Iris dataset as an example
data = load_iris()
X = data.data
y = data.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a Random Forest classifier
clf = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the classifier on the training data
clf.fit(X_train, y_train)

# Make predictions on the testing data
y_pred = clf.predict(X_test)

# Evaluate the performance of the classifier
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='macro')
recall = recall_score(y_test, y_pred, average='macro')
f1 = f1_score(y_test, y_pred, average='macro')
confusion = confusion_matrix(y_test, y_pred)

# Print the evaluation metrics
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)
print("Confusion Matrix:")
print(confusion)

OUTPUT:

RESULT: Thus, the performance of the classifier was evaluated successfully using accuracy, precision, recall, F1 score, and the confusion matrix.

EX.NO:12A CLUSTERING ALGORITHMS FOR UNSUPERVISED CLASSIFICATION
DATE:
AIM:
To apply clustering algorithms for unsupervised classification.
ALGORITHM:
1. Choose the number of clusters K.
2. Initialize K cluster centroids randomly.
3. Dataset can be found at: https://www.kaggle.com/datasets/wenruliu/adult-income-dataset
4. Repeat until convergence (see the NumPy sketch after this list):
• Assign each data point to the nearest centroid.
• Recalculate the centroid of each cluster based on the assigned data points.
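
The assign-and-recompute loop in steps 2 and 4 can be written directly in NumPy. The following is a
minimal from-scratch sketch for illustration only; the program below uses scikit-learn's KMeans
instead, and kmeans_sketch is a hypothetical helper name:

import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=42):
    rng = np.random.default_rng(seed)
    # Step 2: initialize K centroids by picking K random data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 4a: assign each point to its nearest centroid
        distances = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = distances.argmin(axis=1)
        # Step 4b: recompute each centroid as the mean of its assigned points
        # (an empty cluster keeps its old centroid)
        new_centroids = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # converged
        centroids = new_centroids
    return labels, centroids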

PROGRAM:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import seaborn as sns

# Load and preprocess data
data = pd.read_csv('C:\\Users\\Desktop\\income.csv')  # Add your File Path here

# Drop missing values if any
data.dropna(inplace=True)

# Features (excluding the target 'income' because clustering is unsupervised)
X = data[['age', 'educational-num', 'capital-gain', 'hours-per-week']]

# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# K-Means clustering with 4 clusters
kmeans = KMeans(n_clusters=4, random_state=42)
kmeans.fit(X_scaled)

# Add cluster labels to the original data
data['cluster'] = kmeans.labels_

# Visualize the clusters using two features (e.g., 'age' and 'hours-per-week')
plt.figure(figsize=(10, 6))
sns.scatterplot(x=data['age'], y=data['hours-per-week'], hue=data['cluster'], palette='coolwarm')
plt.title("K-Means Clustering (Age vs Hours Per Week)")
plt.show()

# Elbow method to determine the optimal number of clusters
inertia = []
k_values = range(1, 10)
for k in k_values:
    kmeans_temp = KMeans(n_clusters=k, random_state=42)  # Temporary model for the elbow method
    kmeans_temp.fit(X_scaled)
    inertia.append(kmeans_temp.inertia_)

# Plot the elbow curve
plt.figure(figsize=(8, 6))
plt.plot(k_values, inertia, marker='o')
plt.title('Elbow Method for Optimal k')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia')
plt.show()

# Print the number of clusters used by the fitted model
print("Number of clusters:", 4)

# Input new data for prediction
age = int(input("Enter age: "))
education_num = int(input("Enter education number: "))
capital_gain = int(input("Enter capital gain: "))
hours_per_week = int(input("Enter hours per week: "))

new_data = pd.DataFrame([[age, education_num, capital_gain, hours_per_week]],
                        columns=['age', 'educational-num', 'capital-gain', 'hours-per-week'])

# Standardize new data
new_data_scaled = scaler.transform(new_data)

# Predict the cluster for the new data using the fitted 4-cluster model
predicted_cluster = kmeans.predict(new_data_scaled)
print(f"The new data point belongs to cluster: {predicted_cluster[0]}")

OUTPUT:

RESULT: Thus, the program generates a scatter plot showing the relationship between age
and hours worked per week, with data points colored by their assigned cluster label, and the
output was verified.

EX.NO:12B PLOT THE CLUSTER DATA USING PYTHON VISUALIZATION
DATE:
AIM:
To plot the cluster data using Python visualization techniques.
ALGORITHM:
1. Obtain the cluster labels and data points.
2. Create a scatter plot where each data point is assigned a different color based on its cluster.
3. Customize the plot by adding labels, titles, and other visual elements.
4. Display the plot.

PROGRAM:
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Generate sample data
X, y = make_blobs(n_samples=200, centers=4, random_state=0)

# Create a K-means clustering object with K=4
kmeans = KMeans(n_clusters=4, random_state=0)

# Fit the data to the K-means model
kmeans.fit(X)

# Get the cluster labels
labels = kmeans.labels_

# Plot the cluster data
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Cluster Data')
plt.colorbar(label='Cluster Label')
plt.show()
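
As an optional refinement not in the original program, the fitted cluster centers can be overlaid on
the same scatter to mark where each cluster sits. This sketch reuses the kmeans, X, and labels
objects defined above:

# Overlay the fitted cluster centers on the scatter plot
centers = kmeans.cluster_centers_
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
plt.scatter(centers[:, 0], centers[:, 1], c='red', marker='X', s=200, label='Centroids')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.show()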

OUTPUT:

RESULT: The code generates a scatter plot where each data point is colored according to its assigned
cluster label, with a colorbar displayed to represent the cluster labels.
