20ITPL702 - DataScienceWithMachineLearning
REGISTER NUMBER :
Station: Chennai
Date:
Classification Model
Step 5: Load packages
• Once the packages are installed, you need to load them into the R session using the library()
function. For example, to load the "dplyr" package, type the following command and press Enter:
library(dplyr)
Repeat this step for each package you installed.
EX.NO:2A SIMPLE R PROGRAMS USING NUMBERS
DATE:
AIM:
To write a simple R program using numbers.
ALGORITHM:
1. Assign the value 5 to variable a.
2. Assign the value 3 to variable b.
3. Add a and b together and store the result in variable sum.
4. Print the value of sum.
PROGRAM:
a <- 5
b <- 3
sum <- a + b
print(sum)
OUTPUT:
[1] 8
RESULT: Thus, simple R programs using numbers have been executed and verified successfully.
EX.NO:2B COMPUTING THE MEAN OF A VECTOR
DATE:
AIM:
To write a simple R program to compute the mean of a vector.
ALGORITHM:
1. Create a vector vec with values 2, 4, 6, 8, and 10.
2. Calculate the mean of the elements in vec using the mean() function and store the result in
mean_val.
3. Print the value of mean_val.
PROGRAM:
vec <- c(2, 4, 6, 8, 10)
mean_val <- mean(vec)
print(mean_val)
OUTPUT:
[1] 6
RESULT: Thus, computing the mean of a vector using R program has been executed and verified
successfully.
EX.NO:2C R PROGRAM THAT DEMONSTRATES THE USE OF OBJECTS
DATE:
AIM:
To write a simple R program that demonstrates the use of objects.
ALGORITHM:
1. Create an empty object using the `list()` function.
2. Assign values to different attributes of the object.
3. Access the attributes and print the student's information.
PROGRAM:
This program creates an object representing a student and displays their information:
# Create an empty object
student <- list()
# Assign the value to attributes
student$name <- "John Doe"
student$age <- 20
student$major <- "Computer Science"
# Access attributes and print student's information
print(paste("Name:", student$name))
print(paste("Age:", student$age))
print(paste("Major:", student$major))
OUTPUT:
[1] "Name: John Doe"
[1] "Age: 20"
[1] "Major: Computer Science"
RESULT: Thus, the execution of R programs using objects has been executed and verified successfully.
EX.NO.3A (i) PYTHON SCRIPT TO DEVELOP CALCULATOR APPLICATIONS
WITHOUT USING PYTHON OBJECTS
DATE:
AIM:
To write a simple Python script to develop a calculator application without using Python objects on
console.
ALGORITHM:
1. Prompt the user to enter the first number.
2. Prompt the user to enter the second number.
3. Prompt the user to select an operation (addition, subtraction, multiplication, or division).
4. Perform the selected operation on the numbers.
5. Display the result.
PROGRAM:
num1 = float(input("Enter the first number: "))
num2 = float(input("Enter the second number: "))
print("Select operation:")
print("1. Addition")
print("2. Subtraction")
print("3. Multiplication")
print("4. Division")
choice = int(input("Enter your choice (1-4): "))
if choice == 1:
    result = num1 + num2
elif choice == 2:
    result = num1 - num2
elif choice == 3:
    result = num1 * num2
elif choice == 4:
    result = num1 / num2
else:
    print("Invalid choice!")
    exit(1)
print("Result:", result)
OUTPUT:
Enter the first number: 5
Enter the second number: 3
Select operation:
1. Addition
2. Subtraction
3. Multiplication
4. Division
Enter your choice (1-4): 3
Result: 15.0
RESULT: Thus, calculator application without using Python objects on the console has been developed
and the output has been executed and verified successfully.
EX.NO.3A (ii) PYTHON SCRIPT TO DEVELOP CALCULATOR APPLICATIONS
WITH USING PYTHON OBJECTS
DATE:
AIM:
To write a simple Python script to develop a calculator application using Python objects on
console.
ALGORITHM:
1. Create a Calculator class with methods for addition, subtraction, multiplication, and division.
2. Initialize the Calculator object.
3. Prompt the user to enter the first number.
4. Prompt the user to enter the second number.
5. Prompt the user to select an operation (addition, subtraction, multiplication, or division).
6. Perform the selected operation using the appropriate method of the Calculator object.
7. Display the result.
PROGRAM:
class Calculator:
    def addition(self, num1, num2):
        return num1 + num2
    def subtraction(self, num1, num2):
        return num1 - num2
    def multiplication(self, num1, num2):
        return num1 * num2
    def division(self, num1, num2):
        return num1 / num2
calculator = Calculator()
num1 = float(input("Enter the first number: "))
num2 = float(input("Enter the second number: "))
# Prompt the user to select an operation
print("Select operation:")
print("1. Addition")
print("2. Subtraction")
print("3. Multiplication")
print("4. Division")
choice = int(input("Enter your choice (1-4): "))
if choice == 1:
    result = calculator.addition(num1, num2)
elif choice == 2:
    result = calculator.subtraction(num1, num2)
elif choice == 3:
    result = calculator.multiplication(num1, num2)
elif choice == 4:
    result = calculator.division(num1, num2)
else:
    print("Invalid choice!")
    exit(1)
print("Result:", result)
OUTPUT:
Enter the first number: 5
Enter the second number: 3
Select operation:
1. Addition
2. Subtraction
3. Multiplication
4. Division
Enter your choice (1-4): 3
Result: 15.0
RESULT: Thus, calculator application using Python objects on the console has been developed
and the output has been executed and verified successfully.
EX.NO.3B PYTHON SCRIPT TO DEVELOP CALCULATOR APPLICATION USING
MATHEMATICAL FUNCTIONS
DATE:
AIM:
To write a simple Python script to develop a calculator application using mathematical functions on
console.
ALGORITHM:
1. Prompt the user to enter the first number.
2. Prompt the user to enter the second number.
3. Prompt the user to select an operation (addition, subtraction, multiplication, division, or
exponentiation).
4. Perform the selected operation on the numbers using appropriate mathematical functions.
5. Display the result.
PROGRAM:
import math
num1 = float(input("Enter the first number: "))
num2 = float(input("Enter the second number: "))
print("Select operation:")
print("1. Addition")
print("2. Subtraction")
print("3. Multiplication")
print("4. Division")
print("5. Exponentiation ")
choice = int(input("Enter your choice (1-5): "))
if choice == 1:
    result = num1 + num2
elif choice == 2:
    result = num1 - num2
elif choice == 3:
    result = num1 * num2
elif choice == 4:
    result = num1 / num2
elif choice == 5:
    result = math.pow(num1, num2)
else:
    print("Invalid choice!")
    exit(1)
print("Result:", result)
OUTPUT:
Enter the first number: 5
Enter the second number: 3
Select operation:
1. Addition
2. Subtraction
3. Multiplication
4. Division
5. Exponentiation
Enter your choice (1-5): 4
Result: 1.6666666666666667
RESULT: Thus, calculator application using mathematical functions has been developed and the output
has been executed and verified successfully.
EX.NO.3C PYTHON SCRIPT TO CREATE PYTHON OBJECTS FOR CALCULATOR
APPLICATION AND SAVE IN A SPECIFIED LOCATION IN DISK
DATE:
AIM:
To write a simple Python script to create Python objects for calculator application and save in a
specified location in the disk.
ALGORITHM:
1. We start by defining a Calculator class with the basic arithmetic operations: addition,
subtraction, multiplication, and division. The result is stored in the result attribute.
2. We create an instance of the Calculator class called calculator.
3. We perform some calculations by calling the appropriate methods on the calculator object.
4. Next, we specify the location where we want to save the calculator object. In this example, we
use the filename calculator_object.pkl, but you can change it to any desired location.
5. We use the pickle module to save the calculator object to the specified location on the disk.
6. Finally, we print a message indicating the successful saving of the calculator object.
PROGRAM:
import pickle
class Calculator:
    def __init__(self):
        self.result = 0
    def add(self, num):
        self.result += num
    def subtract(self, num):
        self.result -= num
    def multiply(self, num):
        self.result *= num
    def divide(self, num):
        if num != 0:
            self.result /= num
        else:
            print("Error: Cannot divide by zero.")
calculator = Calculator()
calculator.add(5)
calculator.subtract(2)
calculator.multiply(3)
calculator.divide(4)
save_location = 'calculator_object.pkl'
with open(save_location, 'wb') as file:
    pickle.dump(calculator, file)
print(f"Calculator object saved to: {save_location}")
RESULT: Thus, simple Python script to create Python objects for calculator application and save in a
specified location on disk has been executed and verified successfully.
EX.NO:4A PYTHON SCRIPT TO FIND BASIC DESCRIPTIVE STATISTICS USING
SUMMARY, STR, QUARTILE ON MTCARS AND CARS DATASETS
DATE:
AIM:
To write a simple python script to find basic descriptive statistics using summary, str, quartile on
mtcars and cars datasets.
ALGORITHM:
1. Import the pandas library to work with datasets.
2. Load the mtcars dataset using the read_csv() function and assign it to the variable mtcars.
3. Use the describe() function on the mtcars dataset to compute summary statistics and assign
the result to mtcars_summary.
4. Print the summary statistics for the mtcars dataset using print(mtcars_summary).
5. Use the info() function on the mtcars dataset to display information about the data types of the
columns.
6. Print the data type information for the mtcars dataset by calling mtcars.info().
7. Use the quantile() function on the mtcars dataset to calculate quartiles at 25%, 50%, and 75%.
8. Assign the calculated quartiles to mtcars_quartiles.
9. Print the quartiles for the mtcars dataset using print(mtcars_quartiles).
10. Repeat steps 2-9 for the cars dataset, replacing mtcars with cars in the variable names and
filenames.
PROGRAM:
import pandas as pd
mtcars = pd.read_csv('mtcars.csv')
mtcars_summary = mtcars.describe()
print("Summary Statistics for mtcars dataset:",mtcars_summary)
print("\nData Type Information for mtcars dataset:",mtcars.info())
mtcars_quartiles = mtcars.quantile([0.25, 0.5, 0.75])
print("\nQuartiles for mtcars dataset:",mtcars_quartiles)
OUTPUT :
Summary Statistics for mtcars dataset: mpg cyl disp ... am gear carb
count 32.000000 32.000000 32.000000 ... 32.000000 32.000000 32.0000
mean 20.090625 6.187500 230.721875 ... 0.406250 3.687500 2.8125
std 6.026948 1.785922 123.938694 ... 0.498991 0.737804 1.6152
min 10.400000 4.000000 71.100000 ... 0.000000 3.000000 1.0000
25% 15.425000 4.000000 120.825000 ... 0.000000 3.000000 2.0000
50% 19.200000 6.000000 196.300000 ... 0.000000 4.000000 2.0000
75% 22.800000 8.000000 326.000000 ... 1.000000 4.000000 4.0000
max 33.900000 8.000000 472.000000 ... 1.000000 5.000000 8.0000
[8 rows x 11 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32 entries, 0 to 31
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 model 32 non-null object
1 mpg 32 non-null float64
2 cyl 32 non-null int64
3 disp 32 non-null float64
4 hp 32 non-null int64
5 drat 32 non-null float64
6 wt 32 non-null float64
7 qsec 32 non-null float64
8 vs 32 non-null int64
9 am 32 non-null int64
10 gear 32 non-null int64
11 carb 32 non-null int64
dtypes: float64(5), int64(6), object(1)
memory usage: 3.1+ KB
Quartiles for mtcars dataset: mpg cyl disp hp drat ... qsec vs am gear carb
0.25 15.425 4.0 120.825 96.5 3.080 ... 16.8925 0.0 0.0 3.0 2.0
0.50 19.200 6.0 196.300 123.0 3.695 ... 17.7100 0.0 0.0 4.0 2.0
0.75 22.800 8.0 326.000 180.0 3.920 ... 18.9000 1.0 1.0 4.0 4.0
RESULT: Thus, a simple Python script to find basic descriptive statistics using summary, str,
and quartile on mtcars and cars datasets has been executed and verified successfully.
EX.NO:4B PYTHON SCRIPT TO FIND SUBSET OF DATASET BY USING
SUBSET(), AND AGGREGATE() FUNCTIONS ON IRIS DATASET
DATE:
AIM:
To write a simple python script to find subset of dataset by using subset(), aggregate() functions on
iris dataset.
ALGORITHM:
1. Import the pandas library to work with the dataset.
2. Load the Iris dataset using the read_csv() function and assign it to the variable iris.
3. Filter the dataset based on a condition using boolean indexing (the pandas equivalent of R's
subset()). In this example, we keep the rows where variety equals 'Setosa' and assign the result to the variable subset.
4. Print the subset of the Iris dataset using print(subset).
5. Use the aggregate() function on the subset to compute aggregate statistics. In this example, we
calculate the minimum, maximum, and mean values for the 'sepal.length' column, as well as the
mean value for the 'sepal.width' column.
6. Assign the aggregated result to the variable aggregate_result.
7. Print the aggregated result using print(aggregate_result).
PROGRAM:
import pandas as pd
iris = pd.read_csv("C:\\Users\\Dell\\Downloads\\iris\\iris.csv") # your File Path
subset = iris[iris['variety'] == 'Setosa']
print("Subset of Iris dataset (variety='Setosa'):")
print(subset)
aggregate_result = subset.aggregate({'sepal.length': ['min', 'max', 'mean'], 'sepal.width': 'mean'})
print("\nAggregate result for the subset:")
print(aggregate_result)
OUTPUT :
Subset of Iris dataset (variety='Setosa'):
sepal.length sepal.width petal.length petal.width variety
0 5.1 3.5 1.4 0.2 Setosa
1 4.9 3.0 1.4 0.2 Setosa
2 4.7 3.2 1.3 0.2 Setosa
3 4.6 3.1 1.5 0.2 Setosa
4 5.0 3.6 1.4 0.2 Setosa
5 5.4 3.9 1.7 0.4 Setosa
6 4.6 3.4 1.4 0.3 Setosa
7 5.0 3.4 1.5 0.2 Setosa
8 4.4 2.9 1.4 0.2 Setosa
9 4.9 3.1 1.5 0.1 Setosa
10 5.4 3.7 1.5 0.2 Setosa
11 4.8 3.4 1.6 0.2 Setosa
12 4.8 3.0 1.4 0.1 Setosa
13 4.3 3.0 1.1 0.1 Setosa
14 5.8 4.0 1.2 0.2 Setosa
15 5.7 4.4 1.5 0.4 Setosa
16 5.4 3.9 1.3 0.4 Setosa
17 5.1 3.5 1.4 0.3 Setosa
18 5.7 3.8 1.7 0.3 Setosa
19 5.1 3.8 1.5 0.3 Setosa
20 5.4 3.4 1.7 0.2 Setosa
21 5.1 3.7 1.5 0.4 Setosa
22 4.6 3.6 1.0 0.2 Setosa
23 5.1 3.3 1.7 0.5 Setosa
24 4.8 3.4 1.9 0.2 Setosa
25 5.0 3.0 1.6 0.2 Setosa
26 5.0 3.4 1.6 0.4 Setosa
27 5.2 3.5 1.5 0.2 Setosa
28 5.2 3.4 1.4 0.2 Setosa
29 4.7 3.2 1.6 0.2 Setosa
30 4.8 3.1 1.6 0.2 Setosa
RESULT: Thus, a simple python script to find subset of dataset by using subset(), aggregate() functions
on iris dataset has been executed and verified successfully.
EX.NO:5A PYTHON SCRIPT TO READ DIFFERENT TYPES OF DATA SETS
FROM WEB AND DISK AND WRITE TO A FILE IN A SPECIFIC LOCATION
DATE:
AIM:
To write a simple python script to find, read different types of data sets from web and disk and
write to a file in a specific location.
ALGORITHM:
1. Start the program
2. Read data from a CSV file using the pandas package.
3. Read data from an Excel file using the pandas package.
4. Read data from an HTML file using the pandas package.
5. Display the output.
6. Stop the program.
PROGRAM:
import numpy as np
import pandas as pd
df = pd.read_csv("C:\\Users\\Dell\\Downloads\\sample_day.csv") # Your File Path
print(df)
OUTPUT :
Day Temperature Humidity Weather
0 Monday 20 55 Sunny
1 Tuesday 22 60 Cloudy
2 Wednesday 21 58 Rainy
3 Thursday 23 57 Sunny
4 Friday 19 54 Cloudy
EXCEL:
import pandas as pd
df=pd.read_excel("C:\\Users\\Dell\\Downloads\\sample_weather_data.xlsx")
print(df)
OUTPUT:
Bank Name Fund Sort ascending
0 Republic First Bank dba Republic Bank 10546
1 Citizens Bank 10545
2 Heartland Tri-State Bank 10544
3 First Republic Bank 10543
4 Signature Bank 10540
5 Silicon Valley Bank 10539
6 Almena State Bank 10538
7 First City Bank of Florida 10537
8 The First State Bank 10536
9 Ericson State Bank 10535
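The table above resembles the FDIC failed-bank list, which pandas can also read directly from the web; a minimal sketch using pd.read_html (the URL is an assumption):
import pandas as pd
# read_html returns a list of DataFrames, one per table found on the page
tables = pd.read_html("https://www.fdic.gov/resources/resolutions/bank-failures/failed-bank-list/")
print(tables[0].head())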
RESULT: Thus, the commands for reading data from CSV, Excel, and HTML files have been executed
successfully.
EX.NO:5B PYTHON SCRIPT TO READ EXCEL DATA SHEETS
DATE:
AIM:
To write a simple Python script to read Excel data sheets.
ALGORITHM:
1. Start the program.
2. Install the openpyxl library using pip from the command line.
3. Import the openpyxl library.
4. Read data from an existing spreadsheet.
5. Perform calculations on the existing data if required.
6. Install the xlwings library using pip from the command line.
7. Import the xlwings library.
8. Read data from an existing spreadsheet.
9. Write data to the existing spreadsheet.
10. Display the output.
11. Stop the program.
PROGRAM:
import xlwings as xw
ws = xw.Book("C:\\Users\\Dell\\Downloads\\sample_weather_data.xlsx").sheets['Sheet1'] # your File Path
v1 = ws.range("A1:A4").value
v2 = ws.range("F5").value
print("Result:", v1, v2)
OUTPUT:
Result: ['Day', 'Monday', 'Tuesday', 'Wednesday'] None
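The algorithm also lists openpyxl; a minimal sketch reading the same range with openpyxl (same file path assumed):
import openpyxl
# Open the workbook and select the sheet
wb = openpyxl.load_workbook("C:\\Users\\Dell\\Downloads\\sample_weather_data.xlsx")
ws = wb['Sheet1']
# Read the A1:A4 range cell by cell
for row in ws['A1:A4']:
    for cell in row:
        print(cell.value)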
RESULT: Thus, a Python script to read data from and write data to an Excel file using Python libraries
has been executed and verified successfully.
EX.NO:5C PYTHON SCRIPT TO READ XML DATASETS
DATE:
AIM:
To write a simple Python script to read XML data set.
ALGORITHM:
1. Start the program
2. Install the BeautifulSoup library using pip from the command line.
3. Also install the third-party Python parser lxml using pip.
4. Read data from an XML file and find tags and extract them.
5. Import Element Tree class found inside XML library.
6. Read and write data from an XML file.
7. Display the output.
8. Stop the program.
PROGRAM:
from bs4 import BeautifulSoup
with open("C:\\Users\\Dell\\Downloads\\dict.xml", 'r') as f:
data = f.read()
Bs_data = BeautifulSoup(data, "xml")
b_unique = Bs_data.find_all('unique')
print(b_unique)
b_name = Bs_data.find('child', {'name':'Frank'})
print(b_name)
value = b_name.get('test')
print(value)
OUTPUT:
[<unique>
<child name="Frank" test="1234">Hello</child>
</unique>, <unique>
<child name="Alice" test="5678">Hi</child>
</unique>, <unique>
<child name="Bob" test="91011">Greetings</child>
</unique>]
<child name="Frank" test="1234">Hello</child>
1234
RESULT: Thus, a Python script to read data from an XML file using Python libraries has been executed
and verified successfully.
EX.NO:6A R PROGRAM TO FIND THE DATA DISTRIBUTIONS USING BOX AND SCATTER PLOTS
DATE:
AIM:
To find the data distributions using box and scatter plots.
ALGORITHM:
1. Start the Program.
2. Import any dataset.
3. Pick any two columns.
4. Plot the boxplot using the boxplot function.
5. Draw the boxplot with a notch.
6. Display the result.
7. Plot the scatter plot using the plot function.
8. Print the result.
9. Stop the process.
PROGRAM:
input <- mtcars[,c('mpg','cyl')]
# Give the chart file a name.
png(file = "boxplot.png")
# Plot the chart.
boxplot(mpg ~ cyl, data = mtcars, xlab = "Number of Cylinders", ylab = "Miles Per Gallon",
main = "Mileage Data")
# Save the file.
dev.off()
OUTPUT:
CREATING SCATTERPLOT:
input <- mtcars[,c('wt','mpg')]
# Give the chart file a name.
png(file = "scatterplot.png")
# Plot the chart for cars with weight between 2.5 to 5 and mileage between 15 and 30.
plot(x = input$wt, y = input$mpg, xlab = "Weight", ylab = "Mileage",
     xlim = c(2.5, 5), ylim = c(15, 30), main = "Weight vs Mileage")
pairs(~wt+mpg+disp+cyl,data = mtcars, main = "Scatterplot Matrix")
# Save the file.
dev.off()
OUTPUT:
RESULT: Thus, the R program to find the data distributions using box and scatter plots has been
executed and verified successfully.
EX.NO:6B R PROGRAM TO FIND THE OUTLIERS USING BOXPLOTS, Z-SCORE AND
INTERQUARTILE RANGE (IQR)
DATE:
AIM:
To find the outliers using boxplots, Z-Score and Interquartile Range(IQR).
ALGORITHM:
1. Start the Program
2. Import any dataset.
3. Pick any two columns.
4. Plot the boxplot using the boxplot function.
5. Calculate the Z-Score using the formula
   Z-Score = (x_i - Mean) / SD
6. Define a threshold and compare it with the Z-Score (e.g., threshold = 3).
7. Sort the data in ascending order for calculating the IQR.
8. First calculate Q1 (25th percentile) and Q3 (75th percentile).
9. Calculate IQR = Q3 - Q1.
10. Compute Lower bound = Q1 - 1.5*IQR.
11. Compute Upper bound = Q3 + 1.5*IQR.
12. Mark each data point that falls outside the lower and upper bounds as an outlier.
13. Display the result
14. Stop the program.
PROGRAM:
# Define x as a random dataset
x <- rnorm(100) # 100 random values from a normal distribution
# Create a boxplot with red outliers
boxplot(x, outcol = "red")
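The values flagged as outliers by the boxplot rule can also be listed directly; a minimal sketch:
# List the outlier values identified by the boxplot rule
print(boxplot.stats(x)$out)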
OUTPUT:
R code for calculating the Z-Score:
# Load necessary library
library(tidyverse)
# Sample data
x <- c(10, 12, 9, 14, 15, 13, 7, 8, 6, 20, 25, 30)
# Compute Z-Scores and flag values beyond the threshold (|Z| > 3)
z_scores <- (x - mean(x)) / sd(x)
print(x[abs(z_scores) > 3])
OUTPUT:
R Code for an IQR function that returns exact values of your outliers:
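A minimal sketch of such a function, reusing the sample vector x from above:
find_outliers_iqr <- function(v) {
  q <- quantile(v, probs = c(0.25, 0.75))
  iqr_val <- q[2] - q[1]
  lower <- q[1] - 1.5 * iqr_val
  upper <- q[2] + 1.5 * iqr_val
  v[v < lower | v > upper]  # the exact outlier values
}
print(find_outliers_iqr(x))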
OUTPUT:
RESULT: Thus, the program to find the outliers using boxplots, Z-Score, and Interquartile
Range (IQR) was implemented and executed successfully.
EX.NO.6C R PROGRAM TO PLOT THE HISTOGRAM, BAR CHART, AND PIE CHART FOR
THE GIVEN DATA
DATE:
AIM:
To plot the histogram, bar chart and pie chart for the given data.
PROCEDURE:
Bar Plot or Bar Chart
A bar plot or bar chart in R is used to represent the values in a data vector as the heights of
bars. The data vector passed to the function is represented on the y-axis of the graph. A bar
chart can behave like a histogram by passing the output of table() instead of the raw data vector.
Syntax: barplot(data, xlab, ylab)
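For example, a minimal sketch with an illustrative vector:
# Tabulating the vector first makes barplot() show frequencies, like a histogram
v <- c(1, 2, 2, 3, 3, 3)
barplot(table(v), xlab = "Value", ylab = "Frequency")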
Pie Diagram or Pie Chart
A pie chart is a circular chart divided into segments according to the ratio of the data provided.
The whole pie represents the total of the data, and each segment shows its fraction of the whole.
It is another method to represent statistical data in graphical form, and the pie() function is
used to draw it.
Syntax: pie(x, labels, col, main, radius)
Pie chart in 3D can also be created in R by using following syntax but requires a plotrix
library.
Syntax: pie3D(x, labels, radius, main)
Histogram
A histogram is a graphical representation that uses bars to show the frequency of grouped data in
a vector. It looks like a bar chart, but the difference is that a histogram represents the
frequency of grouped data rather than the data values themselves.
Syntax: hist(x, col, border, main, xlab, ylab)
PROGRAM:
BAR CHART:
# defining vector
x <- c(7, 15, 23, 12, 44, 56, 32)
# output to be present as PNG file
png(file = "barplot.png")
# plotting vector
barplot(x, xlab = "GeeksforGeeks Audience",
ylab = "Count", col = "white", col.axis = "darkgreen", col.lab = "darkgreen")
dev.off()
OUTPUT:
PIE CHART:
# defining vector x with number of articles
x <- c(210, 450, 250, 100, 50, 90)
# defining labels for each value in x
names(x) <- c("Algo", "DS", "Java", "C", "C++", "Python")
# output to be present as PNG file
png(file = "piechart.png")
# creating pie chart
pie(x, labels = names(x), col = "white",main = "Articles on GeeksforGeeks", radius = -1,
col.main = "darkgreen")
# saving the file
dev.off()
OUTPUT:
PIE 3D:
# Install and load plotrix if not already installed
install.packages("plotrix")
library(plotrix)
# 3D pie chart of the same vector x, following the pie3D syntax above
pie3D(x, labels = names(x), radius = 1, main = "Articles on GeeksforGeeks")
OUTPUT:
HISTOGRAM:
# defining vector
x <- c(21, 23, 56, 90, 20, 7, 94, 12,
57, 76, 69, 45, 34, 32, 49, 55, 57)
# output to be present as PNG file
png(file = "hist.png")
hist(x, main = "Histogram of Vector x",xlab = "Values", col.lab = "darkgreen",col.main =
"darkgreen")
# saving the file
dev.off()
OUTPUT:
RESULT: Thus, the program to plot the histogram, bar chart and pie chart for the given data
were implemented and executed successfully.
EX.NO:7A R PROGRAM TO FIND THE CORRELATION MATRIX OF THE GIVEN IRIS DATA
DATE:
AIM:
To find the correlation matrix of the given iris data.
ALGORITHM:
1. Start the Program
2. Import your data in R
3. Compute the Correlation matrix
4. Identify its Significance levels.
5. Format Correlation matrix
6. Visualize the Correlation matrix.
7. Stop the program
PROGRAM:
The R function cor() can be used to compute a correlation matrix:
cor(x, method = c("pearson", "kendall", "spearman"))
library(ggplot2)
library(tidyr)
library(datasets)
data("iris")
summary(iris)
# Correlation matrix of the four numeric columns of iris
cor(iris[, 1:4])
OUTPUT:
RESULT: Thus, the program to find the correlation matrix for the given iris data was implemented and
executed successfully.
EX.NO:7B R PROGRAM TO PLOT THE CORRELATION PLOT ON THE DATASET AND VISUALIZE
GIVING AN OVERVIEW OF RELATIONSHIPS AMONG DATA ON IRIS DATA
DATE:
AIM:
To plot the correlation plot on the dataset and visualize giving an overview of relationships among
data on iris data.
ALGORITHM:
1. Start the Program
2. Import your data in R
3. Compute the Correlation matrix
4. Use any plot to represent the feature by values. (Box plot, Jitter Scatter plot)
5. Visualize the relationship between any two features
6. Stop the program
PROGRAM:
Box plot of Petal Length by flower species:
library(ggplot2)
ggplot(data = iris) +
  geom_boxplot(mapping = aes(x = Species, y = Petal.Length), fill = "white",
               color = c("yellow", "blue", "orange")) +
  ggtitle("Box plot of Petal Length by flower species")
OUTPUT:
Scatter jitter plot of Petal Width on the x axis vs. Petal Length on y axis, for the species of flower
you identify in your boxplot that has the smallest median Petal Length:
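A minimal sketch using geom_jitter, with the setosa subset (the species with the smallest median Petal Length) stored in seto:
seto <- subset(iris, Species == "setosa")
ggplot(seto) +
  geom_jitter(mapping = aes(Petal.Width, Petal.Length), width = 0.02, height = 0.02) +
  ggtitle("Jittered scatter plot: Petal Width vs. Petal Length (setosa)")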
Scatter point plot without the jitter - Petal Width vs. Petal Length (using the seto subset defined above):
ggplot(seto) +
  geom_point(mapping = aes(Petal.Width, Petal.Length, colour = Petal.Width >= 0.6)) +
  ggtitle("Petal Width vs. Petal Length -- outlier")
OUTPUT:
RESULT: Thus, the program to plot the correlation plot on the dataset and visualize giving an overview
of relationships among iris data was implemented and executed successfully.
EX.NO:7C R PROGRAM FOR ANALYSIS OF COVARIANCE: VARIANCE(ANOVA), IF DATA
HAVE CATEGORICAL VARIABLES ON IRIS DATA
DATE:
AIM:
To implement Analysis of Covariance: Variance (ANOVA) if data have categorical variables on iris
data
ALGORITHM:
1. Start the Program
2. Load the data set into the working environment under the name df
3. Ensure whether the data meets the key assumptions “homogeneity of variance” by running
Levene’s test.
4. Install car package to execute Levene’s test.
5. Run the ANOVA command.
6. Report the finding by using the describeBy command from psych package.
7. Plot any type of graph for pictorial representation of the finding
8. Stop the program
PROGRAM:
df=iris
install.packages("car")
library(car)
leveneTest(Petal.Length~Species,df)
OUTPUT:
Reporting Results of ANOVA:
install.packages("psych") l
ibrary(psych)
describeBy(df$Petal.Length, df$Species)
OUTPUT:
install.packages("ggplot2")
library(ggplot2)
ggplot(df, aes(y = Petal.Length, x = Species, fill = Species)) +
stat_summary(fun = "mean", geom = "bar", position = "dodge") +
stat_summary(fun.data = "mean_se", geom = "errorbar", position = "dodge", width = 0.8)
OUTPUT:
RESULT: Thus, the program to implement Analysis of Covariance/Variance (ANOVA) on the iris data has
been executed and verified successfully.
EX.NO:8 R PROGRAM TO PERFORM LOGISTIC REGRESSION FOR STUDENT
ADMISSION PREDICTION
DATE:
AIM:
To perform Logistic regression, find out the relation between variables that are affecting the
admission of a student in an institute based on his or her GRE score, GPA obtained, and rank of the student
and also to check whether the model fits it or not.
ALGORITHM:
1. Start the Program.
2. The dataset can be downloaded from “https://stats.idre.ucla.edu/stat/data/binary.csv”
3. Load necessary libraries and the dataset from the local file path into mydata. Convert the rank
variable to a factor.
4. Fit a logistic regression model with admit as the outcome variable and gre, gpa, and rank as
predictors.
5. Summarize the model and calculate confidence intervals for the coefficients.
6. Predict admission probabilities for different ranks and prepare the data for visualization.
7. Plot the predicted probabilities using ggplot2, showing confidence intervals over gre scores by
rank.
PROGRAM:
# Load necessary libraries
library(aod)
library(ggplot2)
# Read the dataset from a local CSV file
mydata <- read.csv("c:/users/Downloads/desktop/binary.csv") # Your File Path
# View the first few rows of the data
head(mydata)
# Summary statistics of the dataset
summary(mydata)
# Standard deviations of the dataset
sapply(mydata, sd)
# Two-way contingency table of categorical outcome and predictors
xtabs(~admit + rank, data = mydata)
# Convert 'rank' to a factor
mydata$rank <- factor(mydata$rank)
# Fit a logistic regression model
mylogit <- glm(admit ~ gre + gpa + rank, data = mydata, family = "binomial")
# Summary of the logistic regression model
summary(mylogit)
# Confidence intervals using profiled log-likelihood
confint(mylogit)
# Confidence intervals using standard errors
confint.default(mylogit)
# Wald test for specific terms (rank2, rank3, rank4)
wald.test(b = coef(mylogit), Sigma = vcov(mylogit), Terms = 4:6)
# Perform Wald test comparing rank 2 and rank 3
l <- cbind(0, 0, 0, 1, -1, 0)
wald.test(b = coef(mylogit), Sigma = vcov(mylogit), L = l)
# Odds ratios
exp(coef(mylogit))
# Odds ratios and 95% CI
exp(cbind(OR = coef(mylogit), confint(mylogit)))
# Create a new data frame with average gre and gpa for each rank
newdata1 <- with(mydata, data.frame(gre = mean(gre), gpa = mean(gpa), rank = factor(1:4)))
# View the new data frame
newdata1
# Predict the probability of admission for each rank
newdata1$rankP <- predict(mylogit, newdata = newdata1, type = "response")
# View the updated data frame
newdata1
# Create a new data frame for plotting predicted probabilities over a range of gre scores
newdata2 <- with(mydata, data.frame(gre = rep(seq(from = 200, to = 800, length.out = 100), 4),
gpa = mean(gpa), rank = factor(rep(1:4, each = 100))))
# Add predictions and standard errors to the new data frame
newdata3 <- cbind(newdata2, predict(mylogit, newdata = newdata2, type = "link", se = TRUE))
# Calculate the predicted probabilities and confidence intervals
newdata3 <- within(newdata3, {
PredictedProb <- plogis(fit)
LL <- plogis(fit - (1.96 * se.fit))
UL <- plogis(fit + (1.96 * se.fit))
})
# View the first few rows of the final dataset
head(newdata3)
# Plot the predicted probabilities with confidence intervals
ggplot(newdata3, aes(x = gre, y = PredictedProb)) +
geom_ribbon(aes(ymin = LL, ymax = UL, fill = rank), alpha = 0.2) +
geom_line(aes(colour = rank), linewidth = 1)
with(mylogit, null.deviance - deviance)
with(mylogit, df.null - df.residual)
with(mylogit, pchisq(null.deviance - deviance, df.null - df.residual, lower.tail = FALSE))
logLik(mylogit)
OUTPUT:
rank
admit 1 2 3 4
0 28 97 93 55
1 33 54 28 12
Call:
glm(formula = admit ~ gre + gpa + rank, family = "binomial",
data = mydata)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.989979 1.139951 -3.500 0.000465 ***
gre 0.002264 0.001094 2.070 0.038465 *
gpa 0.804038 0.331819 2.423 0.015388 *
rank2 -0.675443 0.316490 -2.134 0.032829 *
rank3 -1.340204 0.345306 -3.881 0.000104 ***
rank4 -1.551464 0.417832 -3.713 0.000205 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
AIC: 470.52
2.5 % 97.5 %
(Intercept) -6.2242418514 -1.755716295
gre 0.0001202298 0.004408622
gpa 0.1536836760 1.454391423
rank2 -1.2957512650 -0.055134591
rank3 -2.0169920597 -0.663415773
rank4 -2.3703986294 -0.732528724
Wald test:
----------
Chi-squared test:
X2 = 20.9, df = 3, P(> X2) = 0.00011
Wald test:
----------
Chi-squared test:
X2 = 5.5, df = 1, P(> X2) = 0.019
gre gpa rank
1 587.7 3.3899 1
2 587.7 3.3899 2
3 587.7 3.3899 3
4 587.7 3.3899 4
RESULT: Thus, the program has been executed and the output has been verified successfully.
EX.NO:9 MULTIPLE REGRESSION ANALYSIS WITH CONTINUOUS PREDICTORS
DATE:
AIM:
To apply multiple regression, if data have a continuous independent variable, apply on a above
dataset.
ALGORITHM:
1. Load or prepare the dataset for multiple regression.
2. You can find the dataset at :https://www.kaggle.com/datasets/yasserh/housing-prices-dataset
3. Split the dataset into training and testing sets (e.g., using train_test_split function). Choose a
multiple regression algorithm (e.g., Multiple Linear Regression, Ridge Regression).
4. Train the regression model on the training set with multiple independent variables.
5. Make predictions on the testing set.
6. Evaluate the performance of the multiple regression model using appropriate metrics (e.g., mean
squared error, R-squared score).
7. Dependent variable (Y): A continuous variable that you want to predict or explain.
8. Independent variables (X1, X2, X3, ...): One or more continuous variables that you believe may
influence the dependent variable.
Once you have the dataset, you can follow these steps to apply multiple regression:
Step 1: Prepare the data
• Ensure that your dataset is clean and free of missing values.
• Split your dataset into a training set and a testing set (optional but recommended). The training
set will be used to train the model, and the testing set will be used to evaluate its performance.
Step 2: Explore the data
• Calculate descriptive statistics and visualize the relationship between the dependent variable and
each independent variable. This can help you identify any patterns or potential outliers in the data.
Step 3: Build the regression model
• Specify the multiple regression model using the equation: Y = b0 + b1*X1 + b2*X2 + ... + bn*Xn, where
b0 is the intercept and b1, b2, ..., bn are the coefficients for each independent variable.
• Estimate the coefficients using a suitable method, such as ordinary least squares (OLS).
Step 4: Assess the model
• Evaluate the goodness of fit of the model by analyzing metrics such as R-squared, adjusted R-
squared, and p-values for the coefficients.
• Check for assumptions of multiple regression, including linearity, independence,
homoscedasticity, and normality of residuals.
Step 5: Make predictions
• Once the model is assessed and deemed satisfactory, you can use it to make predictions on new
or unseen data.
• It's important to note that the steps mentioned above provide a general framework for applying
multiple regression. The specific implementation may vary depending on the programming
language or statistical software you are using.
To apply multiple regression on a dataset with continuous variables, you need at least one
dependent variable and one or more independent variables. Multiple regression lets you analyze
the relationship between the dependent variable and multiple independent variables
simultaneously.
PROGRAM:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
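The remaining steps are sketched below, continuing from the imports above and assuming the Kaggle Housing.csv file with columns price, area, bedrooms, and bathrooms (file name and column names are assumptions):
# Load the housing dataset (file name and column names are assumptions)
housing = pd.read_csv("Housing.csv")
X = housing[["area", "bedrooms", "bathrooms"]]  # continuous independent variables
y = housing["price"]                            # continuous dependent variable
# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Train the multiple linear regression model
model = LinearRegression()
model.fit(X_train, y_train)
# Predict on the testing set and evaluate
y_pred = model.predict(X_test)
print("Mean squared error:", mean_squared_error(y_test, y_pred))
print("R-squared score:", r2_score(y_test, y_pred))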
OUTPUT:
RESULT: Thus, the program to apply multiple regression, if data have a continuous
independent variable were implemented and executed successfully.
EX.NO:10 PREDICTING DATA USING REGRESSION MODELS
DATE:
AIM:
To apply regression model techniques to predict the data on the dataset.
ALGORITHM:
1. Load or prepare the dataset for regression.
2. Split the dataset into training and testing sets (e.g., using train_test_split function).
3. Choose a regression algorithm (e.g., Linear Regression, Random Forest Regression).
4. Train the regression model on the training set.
5. Make predictions on the testing set.
6. Evaluate the performance of the regression model using appropriate metrics (e.g., mean squared
error, R-squared score).
Step 1: Import Libraries: Start by importing the necessary libraries for data manipulation,
visualization, and regression modeling. Commonly used libraries include NumPy, Pandas,
Matplotlib, and scikit-learn.
Step 2: Load the Dataset: Load the dataset you want to work with into your program. This can
be done using the appropriate functions provided by the chosen library (e.g., Pandas).
Step 3: Explore the Data: Perform exploratory data analysis (EDA) to gain insights into the
dataset. This step involves checking the data types, summary statistics, missing values, and
visualizing the distribution of the variables.
Step 4: Split the Data: Divide the dataset into two subsets: the training set and the testing set.
The training set is used to train the regression model, while the testing set is used to evaluate its
performance.
Step 5: Feature Selection/Engineering: Identify the relevant features (independent variables)
that are likely to have an impact on the target variable (dependent variable). Perform any
necessary feature engineering steps, such as handling missing values, categorical encoding, or
feature scaling.
Step 6: Choose a Regression Model: Select an appropriate regression model based on the
problem you are trying to solve and the nature of your data. Some common regression models
include Linear Regression and Random Forest Regression.
Step 7: Train the Regression Model: Fit the chosen regression model to the training data. This
involves finding the best parameters or coefficients that minimize the error between the predicted
values and the actual target values.
Step 8: Evaluate the Model: Use the testing set to evaluate the performance of the trained
regression model. Common evaluation metrics for regression include mean squared error (MSE),
root mean squared error (RMSE), mean absolute error (MAE), and R-squared.
Step 9: Predict New Data: Once the regression model is trained and evaluated, it can be used to
make predictions on new, unseen data by inputting the relevant feature values.
Step 10: Fine-tune and Improve: If the model's performance is not satisfactory, you can fine-
tune the model by adjusting hyperparameters, trying different regression algorithms, or
incorporating more advanced techniques like regularization.
PROGRAM:
# Predicting Miles Per Gallon (MPG) using the seaborn 'mpg' dataset
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
# Set dimensions for plot
sns.set(rc={'figure.figsize':(11.7,8.27)})
# Load the mtcars dataset from seaborn
mtcars = sns.load_dataset('mpg').dropna() # Dropping missing values
# Display basic info about the dataset
print(mtcars.head())
# Selecting important features for simplicity (drop unnecessary columns)
mtcars_simple = mtcars[['mpg', 'cylinders', 'displacement', 'horsepower', 'weight', 'acceleration']]
# Define input (X) and output (y) variables
X = mtcars_simple.drop(['mpg'], axis=1) # 'mpg' is the target variable
y = mtcars_simple['mpg']
# Split data into train and test sets (70% training, 30% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=3)
# Model: Linear Regression (Multiple Regression)
lin_reg = LinearRegression(fit_intercept=True)
lin_reg.fit(X_train, y_train)
y_pred_lin = lin_reg.predict(X_test)
# Evaluate Linear Regression
lin_rmse = np.sqrt(mean_squared_error(y_test, y_pred_lin))
print(f"Linear Regression RMSE: {lin_rmse}")
# ---- Visualization Code ----
# 1. Plot Actual vs Predicted values (on test set)
plt.figure(figsize=(10,6))
plt.scatter(y_test, y_pred_lin, alpha=0.3)
plt.title('Actual vs Predicted MPG (Linear Regression)')
plt.xlabel('Actual MPG')
plt.ylabel('Predicted MPG')
plt.show()
# 2. Residuals Plot (Errors between actual and predicted)
plt.figure(figsize=(10,6))
sns.histplot((y_test - y_pred_lin), bins=50, kde=True)
plt.title('Residuals Distribution (Linear Regression)')
plt.xlabel('Residuals')
plt.ylabel('Frequency')
plt.show()
# Predict MPG from user-supplied feature values (prompts assume the columns in X above)
def predict_mpg():
    print("\n--- MPG Prediction ---")
    # Read one value for each feature used in training
    values = [float(input(f"Enter {col}: ")) for col in X.columns]
    pred = lin_reg.predict(pd.DataFrame([values], columns=X.columns))
    print(f"Predicted MPG: {pred[0]:.2f}")
predict_mpg()
OUTPUT:
RESULT: Thus, the program of regression model depends on the dataset, problem domain,
and the requirements have been executed successfully.
EX.NO.11A INSTALL RELEVANT PACKAGES FOR CLASSIFICATION
DATE:
AIM:
To install relevant packages for classification.
ALGORITHM:
1. Identify the relevant packages required for classification tasks.
2. Determine the preferred method of package installation.
3. Install the packages using the chosen method.
PROCEDURE:
1. Scikit-learn: A popular library for machine learning providing various classification algorithms
and tools.
Installation: pip install scikit-learn
2. NumPy: A fundamental package for scientific computing, supporting numerical operations and
multidimensional arrays used in machine learning.
Installation: pip install numpy
3. Pandas: A powerful library for data manipulation and analysis, offering structures like
DataFrame, useful for data preprocessing.
Installation: pip install pandas
4. Matplotlib: A plotting library for creating visualizations, helpful for analyzing and visualizing
classification model results.
Installation: pip install matplotlib
5. Seaborn: A statistical visualization library built on Matplotlib that offers additional functionality
and visually appealing charts.
Installation: pip install seaborn
6. TensorFlow and PyTorch: Deep learning frameworks suitable for deep learning-based
classification models.
• For TensorFlow: pip install tensorflow
• For PyTorch: pip install torch torchvision
OUTPUT:
RESULT: Thus, the commonly used packages necessary for classification in Python have
been downloaded successfully.
EX.NO.11B CHOOSE A CLASSIFIER FOR A CLASSIFICATION PROBLEM
DATE:
AIM:
To choose a classifier for a classification problem. For a classification problem, one popular
classifier algorithm is the Decision Tree algorithm. A Decision Tree uses a tree-like model of decisions
and their possible consequences. It is easy to interpret and can handle both numerical and categorical
data.
ALGORITHM:
1. Load or generate the dataset for classification. (here, income.csv dataset is used)
2. Dataset can be found at:https://www.kaggle.com/datasets/wenruliu/adult-income-dataset
3. Split the dataset into training and testing sets (e.g., using the train_test_split function).
4. Choose the Decision Tree classifier based on the problem requirements and dataset
characteristics.
5. Train the Decision Tree classifier on the training set.
6. Make predictions on the testing set.
7. Evaluate the performance of the classifier using appropriate metrics (e.g., accuracy, precision,
recall, F1 score) and cross-validation.
PROGRAM:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.preprocessing import LabelEncoder, StandardScaler
import matplotlib.pyplot as plt
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.25, random_state=42)
plt.figure(figsize=(10, 6))
plt.barh(features, feature_importances, color='skyblue')
plt.xlabel('Importance Score')
plt.title('Feature Importance in Decision Tree')
plt.show()
# Predict income
prediction = tree_model.predict(new_data_scaled)
if prediction[0] == 1:
print("Prediction: Income >50K")
else:
print("Prediction: Income < 50K")
OUTPUT:
RESULT: Thus, the accuracy of the classifier is calculated using the accuracy_score function and printed
successfully.
EX.NO:11C EVALUATE THE PERFORMANCE OF THE CLASSIFIER
DATE:
AIM:
To evaluate the performance of the classifier, several common evaluation metrics can be used
depending on the specific requirements of your classification problem. Some of the commonly used
metrics include accuracy, precision, recall, F1 score, and confusion matrix.
ALGORITHM:
1. Obtain the true labels and predicted labels from the classifier.
2. Calculate and store the evaluation metrics: accuracy, precision, recall, F1 score, and
confusion matrix
The following are key performance metrics used to evaluate the effectiveness of a classification
model
1. Accuracy: The proportion of correctly classified instances out of the total instances.
2. Precision: The ratio of true positive predictions to the total predicted positives, indicating
the quality of positive predictions.
3. Recall: The ratio of true positive predictions to the total actual positives, measuring the
ability to identify relevant instances.
4. F1 Score: The harmonic means of precision and recall, providing a balance between the two
metrics.
5. Confusion Matrix: A table that displays the counts of true positive, true negative, false
positive, and false negative predictions, helping visualize the model's performance.
PROGRAM:
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score,
confusion_matrix
57
# Train the classifier on the training data
clf.fit(X_train, y_train)
OUTPUT:
RESULT: Thus, the performance of the classifier was evaluated using accuracy, precision, recall,
F1 score, and the confusion matrix, and the output was verified successfully.
EX.NO:12A CLUSTERING ALGORITHMS FOR UNSUPERVISED CLASSIFICATION
DATE:
AIM:
To apply clustering algorithms for unsupervised classification.
ALGORITHM:
1. Choose the number of clusters K.
2. Initialize K cluster centroids randomly.
3.Dataset can be found at :https://www.kaggle.com/datasets/wenruliu/adult-income-dataset
4. Repeat until convergence:
• Assign each data point to the nearest centroid.
• Recalculate the centroid of each cluster based on the assigned data points.
PROGRAM:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import seaborn as sns
# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Visualize the clusters using two features (e.g., 'age' and 'hours-per-week')
plt.figure(figsize=(10, 6))
59
sns.scatterplot(x=data['age'], y=data['hours-per-week'], hue=data['cluster'], palette='coolwarm')
plt.title("K-Means Clustering (Age vs Hours Per Week)")
plt.show()
# Predict cluster for new data using the original kmeans model (with 4 clusters)
predicted_cluster = kmeans.predict(new_data_scaled)
print(f"The new data point belongs to cluster: {predicted_cluster[0]}")
OUTPUT:
RESULT: Thus, the program generates a scatter plot showing the relationship between age
and hours worked per week, with data points colored by their assigned cluster, and the
output was verified successfully.
EX.NO:12B PLOT THE CLUSTER DATA USING PYTHON VISUALIZATION
DATE:
AIM:
To plot the cluster data using python visualization techniques.
ALGORITHM:
1. Obtain the cluster labels and data points.
2. Create a scatter plot where each data point is assigned a different color based on its cluster.
3. Customize the plot by adding labels, titles, and other visual elements.
4. Display the plot.
PROGRAM:
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
# Generate sample data
X, y = make_blobs(n_samples=200, centers=4, random_state=0)
# Create a K-means clustering object with K=4
kmeans = KMeans(n_clusters=4, random_state=0)
# Fit the data to the K-means model
kmeans.fit(X)
#Get the cluster labels
labels = kmeans.labels_
# Plot the cluster data
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Cluster Data')
plt.colorbar(label='Cluster Label')
plt.show()
OUTPUT:
RESULT: The code generates a scatter plot where each data point is colored according to its assigned
cluster label, and a color bar is displayed to map colors to cluster labels.