Data Science Lab Manual
Data Science
(3151608)
B.E. Semester 5
(Information Technology Department)
Institute logo
Directorate of Technical Education, Gandhinagar, Gujarat
Vishwakarma Government Engineering College, Chandkheda
Department of Information Technology
Certificate
Place: __________
Date: __________
Preface
The main motto of any laboratory/practical/field work is to enhance the required skills and to create the ability amongst students to solve real-time problems by developing relevant competencies in the psychomotor domain. Keeping this in view, GTU has designed a competency-focused, outcome-based curriculum for engineering degree programs in which sufficient weightage is given to practical work. This underlines the importance of enhancing skills amongst students and pays attention to utilizing every second of the time allotted for practicals, so that students, instructors, and faculty members achieve the relevant outcomes by performing experiments rather than merely studying them. For effective implementation of a competency-focused, outcome-based curriculum, it is essential that every practical is keenly designed to serve as a tool to develop and enhance the relevant competency required by industry in every student. These psychomotor skills are very difficult to develop through the traditional chalk-and-board content delivery method in the classroom. Accordingly, this lab manual is designed to focus on industry-defined, relevant outcomes rather than the old practice of conducting practicals merely to prove a concept or theory.
Data Science is a rapidly growing field that combines statistical and computational techniques
to extract knowledge and insights from data. The goal of this lab manual is to provide students
with hands-on experience in using data science tools and techniques to analyze and interpret
real-world data.
This manual is designed to accompany a course in Data Science and assumes a basic knowledge
of programming concepts and statistical analysis. The labs are structured to guide students
through the process of collecting, cleaning, analyzing, and visualizing data, using popular
programming languages and software tools such as Python, R, SQL, and Tableau.
Each lab in this manual consists of a set of instructions that guide students through a specific
data analysis project. The labs are organized in a progressive sequence, with each lab building
on the skills and concepts covered in the previous lab. The exercises within each lab are
designed to be completed in a single class session, with additional time required for preparation
and follow-up analysis.
Throughout the manual, we emphasize the importance of critical thinking and data ethics,
providing guidance on how to analyze data responsibly and communicate findings effectively.
By the end of this manual, students will have gained a solid foundation in data science and be
well-equipped to apply these skills to real-world problems.
Data Science (3151608)
Sr. No.  Objective(s) of Experiment  CO-1  CO-2  CO-3  CO-4
1.  Exploration and Visualization Using Mathematical and Statistical Tools  √ √
2.  Study of Measures of Central Tendency, Correlation, Percentile, Decile, Quartile, Measure of Variation, and Measure of Shape (Skewness and Kurtosis) with Excel Functions  √ √
3.  Study of Basics of Python Data Types, NumPy, Matplotlib, Pandas  √ √
4.  Implementation of Various Probability Distributions with NumPy Random Library Functions  √
5.  Implementation of Estimation of Parameters for the Best-Fit Probability Distribution using the Fitter Class in Python  √
6.  Implementation of Linear Regression with Scikit-learn Library in Python  √
7.  Implementation of Logistic Regression with Scikit-learn Library in Python  √
8.  Implementation of Decision Tree for Student Classification  √
Data Science (3151608)
Instructions for Students:
1. Students are expected to carefully listen to all the theory classes delivered by the faculty
members and understand the COs, content of the course, teaching and examination
scheme, skill set to be developed etc.
2. Students will have to perform experiments as per the practical list given.
3. Students have to show the output of each program in their practical file.
4. Students are instructed to submit the practical list as per the given sample list shown on the next page.
5. Students should develop a habit of submitting the experimentation work as per the schedule, and they should be well prepared for the same.
Index
(Progressive Assessment Sheet)
Experiment No: 1
Date:
AIM: Data Exploration and Visualization Using Mathematical and Statistical Tools
Introduction:
Data exploration and visualization are important steps in the data analysis process. In this lab,
students will learn how to explore and visualize data using mathematical and statistical tools such
as histograms, box plots, scatter plots, and correlation matrices. Students will also learn how to
use Excel/R to perform these analyses.
Objectives:
Materials:
Procedure:
Example:
Age  Gender  Income  Education Level  Marital Status  Employment Status  Industry
32 Female 45000 Bachelor's Single Employed Technology
45 Male 65000 Master's Married Employed Finance
28 Female 35000 High School Single Unemployed None
52 Male 80000 Doctorate Married Employed Education
36 Female 55000 Bachelor's Divorced Employed Healthcare
40 Male 70000 Bachelor's Married Self-Employed Consulting
29 Female 40000 Associate's Single Employed Retail
55 Male 90000 Master's Married Employed Engineering
33 Female 47000 Bachelor's Single Employed Government
47 Male 75000 Bachelor's Married Self-Employed Entertainment
41 Female 60000 Master's Single Employed Nonprofit
38 Male 52000 High School Divorced Employed Construction
31 Female 48000 Bachelor's Married Employed Technology
49 Male 85000 Doctorate Married Employed Finance
27 Female 30000 High School Single Unemployed None
54 Male 92000 Master's Married Employed Education
39 Female 58000 Bachelor's Married Self-Employed Consulting
30 Male 42000 Associate's Single Employed Retail
56 Female 96000 Doctorate Married Employed Healthcare
35 Male 55000 Bachelor's Single Employed Government
48 Female 73000 Bachelor's Married Self-Employed Entertainment
42 Male 65000 Master's Divorced Employed Nonprofit
37 Female 50000 High School Married Employed Construction
34 Male 49000 Bachelor's Single Unemployed None
51 Female 82000 Master's Married Employed Engineering
This dataset includes information on age, gender, income, education level, marital status, employment status, and industry for a sample of 25 individuals. The data can be used to explore and visualize various relationships and patterns, such as the relationship between age and income, or the distribution of income by education level. A few more relationships and patterns that could be explored and visualized using this sample dataset:
1. Relationship between age and income: Create a scatter plot to see if there is a relationship
between age and income. Also calculate the correlation coefficient to determine the
strength and direction of the relationship.
2. Distribution of income by gender: Create a box plot to compare the distribution of income
between males and females. This could reveal any differences in the median, quartiles, and
outliers for each gender.
3. Distribution of income by education level: Create a box plot to compare the distribution of
income for each level of education. This could reveal any differences in the median,
quartiles, and outliers for each education level.
4. Relationship between education level and marital status: Create a contingency table and
calculate the chi-square test statistic to see if there is a relationship between education
level and marital status. This could reveal whether certain education levels are more or
less likely to be associated with certain marital statuses.
5. Relationship between age and education level: Create a histogram to see the distribution
of ages for each education level. This could reveal any differences or similarities in the age
distribution across education levels.
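Although the analyses above are typically carried out in Excel/R for this experiment, the same ideas can be sketched in Python, one of the tools listed in the preface. The file name sample_people.csv and the column names below are assumptions made for illustration; adjust them to match the actual sheet.

# Python sketch (hypothetical file and column names)
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("sample_people.csv")   # columns assumed: Age, Gender, Income, EducationLevel, ...

# 1. Relationship between age and income: scatter plot and correlation coefficient
plt.scatter(df["Age"], df["Income"])
plt.xlabel("Age"); plt.ylabel("Income"); plt.title("Age vs Income")
plt.show()
print("Correlation (Age, Income):", df["Age"].corr(df["Income"]))

# 2. Distribution of income by gender: box plot
df.boxplot(column="Income", by="Gender")
plt.show()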
Output:
//give screenshots here
Conclusion:
In this lab, students learned how to explore and visualize data using mathematical and statistical tools such as histograms, box plots, scatter plots, and correlation matrices. These tools are useful for identifying patterns and relationships in data and for making informed decisions based on data analysis. The skills learned in this lab will be helpful in students' future studies and careers in data analysis.
Quiz: (Sufficient space to be provided for the answers or use extra file pages to write answers)
1. What are the measures of central tendency? Provide examples and explain when each
measure is appropriate to use.
2. How can you calculate the correlation coefficient between two variables using
mathematical and statistical tools? Interpret the correlation coefficient value.
3. Explain the concept of skewness and kurtosis in statistics. How can you measure and
interpret these measures using mathematical and statistical tools?
Suggested References:
1. "Python for Data Analysis" by Wes McKinney
2. "Data Visualization with Python and Matplotlib" by Benjamin Root
Rubrics wise marks obtained
02 02 05 01 10
Experiment No: 2
Date:
AIM: Study of Measures of Central Tendency, Correlation, Percentile, Decile, Quartile, Measure
of Variation, and Measure of Shape (Skewness and Kurtosis) with Excel Functions
Objective:
The objective of this lab practical is to provide students with hands-on experience in using Excel
functions to explore and analyze a sample data sheet. Students will learn to calculate measures of
central tendency, correlation, percentile, decile, quartile, measure of variation, and measure of
shape using Excel functions. Additionally, students will learn to create visualizations to better
understand the data.
Materials:
- Computer with Microsoft Excel installed
- Sample data sheet (provided below or dataset may be provided by subject teacher)
StudentID  Test1 Score  Test2 Score  Age  Gender
1 85 92 19 Male
2 92 87 20 Female
3 78 80 18 Male
4 85 89 19 Male
5 90 95 21 Female
6 75 82 18 Male
7 83 87 20 Female
8 92 90 19 Male
9 80 85 18 Female
10 87 88 20 Female
Procedure:
Part 2: Correlation
1. Calculate the correlation between test 1 score and test 2 score using Excel functions.
2. Create a scatter plot to visualize the relationship between test 1 score and test 2 score.
3. Write a brief interpretation of the results.
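For reference, if the Test1 scores of the sample sheet are entered in cells B2:B11 and the Test2 scores in C2:C11 (the exact ranges depend on how the sheet is laid out), the correlation can be computed with Excel's built-in function =CORREL(B2:B11, C2:C11), and the scatter plot can be inserted from the Insert → Charts → Scatter option.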
Interpretation/Program/Code:
//write here
Output:
// Paste here
Conclusion:
//Student needs to write down following:
//Write a brief conclusion summarizing the findings from the analysis of the sample data sheet.
//Discuss the relationships between the variables and the overall trends observed.
//Also, mention any limitations or assumptions made during the analysis.
Quiz:
1) What Excel function can be used to calculate the mean of a dataset?
a) AVERAGE
b) MEDIAN
c) MODE
d) STANDARDIZE
2) What does the correlation coefficient measure in terms of the relationship between two
variables?
a) Strength of the linear relationship
b) Variability of the data
c) Difference between mean and median
d) Skewness of the distribution
Suggested References:
1. "Microsoft Excel Data Analysis and Business Modeling" by Wayne L. Winston
2. "Excel 2021: Data Analysis and Business Modeling" by Wayne L. Winston
Rubrics wise marks obtained
02 02 05 01 10
Experiment No: 3
Date:
AIM: Study of Basics of Python Data Types, NumPy, Matplotlib, and Pandas
Objective:
The objective of this lab practical is to gain hands-on experience with NumPy, Matplotlib, and
Pandas libraries to manipulate and visualize data. Through this practical, students will learn how
to use different functions of these libraries to perform various data analysis tasks.
Materials Used:
- Python programming environment
- NumPy library
- Matplotlib library
- Pandas library
- Dataset file (provided by faculty)
//Example of a dataset file such as sales_data.csv with the following columns:
o Date: Date of sale
o Product: Name of the product sold
o Units Sold: Number of units sold
o Revenue: Total revenue generated from the sale
o Region: Geographic region where the sale took place
o Salesperson: Name of the salesperson who made the sale
Procedures:
Part 1: NumPy
1. Import the NumPy library into Python.
2. Create a NumPy array with the following specifications:
a. Dimensions: 5x5
b. Data type: integer
c. Values: random integers between 1 and 100
3. Reshape the array into a 1x25 array and calculate the mean, median, variance, and standard
deviation using NumPy functions.
4. Generate a random integer array of length 10 and find the percentile, decile, and quartile values
using NumPy functions.
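A minimal sketch of Part 1, assuming only that NumPy is installed (the random values change on every run):

# Python
import numpy as np

arr = np.random.randint(1, 101, size=(5, 5))   # 5x5 random integers between 1 and 100
flat = arr.reshape(1, 25)
print("Mean:", np.mean(flat), "Median:", np.median(flat))
print("Variance:", np.var(flat), "Std dev:", np.std(flat))

vec = np.random.randint(1, 101, size=10)       # random integer array of length 10
print("90th percentile:", np.percentile(vec, 90))
print("Deciles:", np.percentile(vec, np.arange(10, 100, 10)))
print("Quartiles:", np.percentile(vec, [25, 50, 75]))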
Part 2: Matplotlib
1. Import the Matplotlib library into Python.
2. Create a simple bar chart using the following data:
a. X-axis values: ['A', 'B', 'C', 'D']
b. Y-axis values: [10, 20, 30, 40]
3. Customize the plot by adding a title, axis labels, and changing the color and style of the bars.
4. Create a pie chart using the following data:
a. Labels: ['Red', 'Blue', 'Green', 'Yellow']
b. Values: [20, 30, 10, 40]
5. Customize the pie chart by adding a title, changing the colors of the slices, and adding a legend.
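A minimal sketch of Part 2; the colours, styles, and titles are arbitrary choices:

# Python
import matplotlib.pyplot as plt

# Bar chart with title, axis labels, and customized bars
plt.bar(['A', 'B', 'C', 'D'], [10, 20, 30, 40], color='teal', edgecolor='black')
plt.title("Sample Bar Chart")
plt.xlabel("Category"); plt.ylabel("Value")
plt.show()

# Pie chart with title, custom slice colours, and a legend
labels = ['Red', 'Blue', 'Green', 'Yellow']
plt.pie([20, 30, 10, 40], labels=labels, colors=['red', 'blue', 'green', 'yellow'])
plt.title("Sample Pie Chart")
plt.legend(labels, loc="upper right")
plt.show()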
Part 3: Pandas
1. Import the Pandas library into Python.
2. Load the "sales_data.csv" file into a Pandas data frame.
3. Calculate the following statistics for the Units Sold and Revenue columns:
a. Mean
b. Median
c. Variance
d. Standard deviation
4. Group the data frame by Product and calculate the mean, median, variance, and standard
deviation of Units Sold and Revenue for each product using Pandas functions.
5. Create a line chart to visualize the trend of Units Sold and Revenue over time for each product.
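A minimal sketch of Part 3, assuming sales_data.csv has the columns described earlier (Date, Product, Units Sold, Revenue, Region, Salesperson):

# Python
import pandas as pd
import matplotlib.pyplot as plt

sales = pd.read_csv("sales_data.csv", parse_dates=["Date"])

# Overall statistics for Units Sold and Revenue
for col in ["Units Sold", "Revenue"]:
    print(col, "mean:", sales[col].mean(), "median:", sales[col].median(),
          "variance:", sales[col].var(), "std dev:", sales[col].std())

# Per-product statistics
print(sales.groupby("Product")[["Units Sold", "Revenue"]].agg(["mean", "median", "var", "std"]))

# Trend of Units Sold over time for each product
sales.pivot_table(index="Date", columns="Product", values="Units Sold").plot()
plt.title("Units Sold over time by product")
plt.show()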
Interpretation/Program/code:
//write here
Output:
// Paste here
Conclusion:
In conclusion, this lab practical provided hands-on experience with NumPy, Matplotlib, and
Pandas libraries in Python for data manipulation and visualization. These libraries have wide-
ranging applications in various fields, enabling researchers and analysts to gain insights from
large datasets quickly and efficiently. Through exercises such as calculating statistical measures
and visualizing data using charts, we explored the functionality and flexibility of these powerful
data analysis tools. Overall, gaining proficiency in these libraries equips individuals to tackle
complex data analysis challenges and contribute to their respective fields of study or industries.
Quiz:
Suggested References:-
1. Dinesh Kumar, Business Analytics: The Science of Data-Driven Decision Making, Wiley India
2. V.K. Jain, Data Science & Analytics, Khanna Book Publishing, New Delhi
3. Data Science For Dummies by Lillian Pierson, Jake Porway
Rubrics wise marks obtained
02 02 05 01 10
Experiment No: 4
Date:
AIM: Implementation of Various Probability Distributions with NumPy Random Library Functions
Materials Used:
- Python environment (Anaconda, Jupyter Notebook, etc.)
- NumPy library
- Matplotlib library
Procedure:
1. Introduction to Probability Distributions:
o Probability theory is the branch of mathematics that deals with the study of random events
or phenomena. In probability theory, a probability distribution is a function that describes
the likelihood of different outcomes in a random process. Probability distributions can be
categorized into two types: discrete and continuous.
o Discrete probability distributions are used when the possible outcomes of a random
process are countable and can be listed. The most commonly used discrete probability
distributions are Bernoulli, Binomial, and Poisson distributions.
o Continuous probability distributions are used when the possible outcomes of a random
process are not countable and can take any value within a certain range. The most
commonly used continuous probability distributions are Normal and Exponential
distributions.
o Each probability distribution has its own set of properties, such as mean, variance,
skewness, and kurtosis. Mean represents the average value of the random variable,
variance represents how much the values vary around the mean, skewness represents the
degree of asymmetry of the distribution, and kurtosis represents the degree of peakedness
or flatness of the distribution.
o Probability distributions are widely used in fields such as finance, engineering, physics,
and social sciences to model real-world phenomena and make predictions about future
events. Understanding different probability distributions and their properties is an
important tool for analyzing data and making informed decisions.
2. Example:
# python
import numpy as np
import matplotlib.pyplot as plt

# Generate 1000 random numbers following a normal distribution
# with mean 0 and standard deviation 1
normal_dist = np.random.normal(0, 1, 1000)
print("Mean:", np.mean(normal_dist), "Std dev:", np.std(normal_dist))

# Generate 1000 random numbers following a Poisson distribution with lambda 5
poisson_dist = np.random.poisson(5, 1000)
print("Mean:", np.mean(poisson_dist), "Variance:", np.var(poisson_dist))
In this example, we generate 1000 random numbers following a normal distribution with mean 0
and standard deviation 1 using the `np.random.normal()` function. We then calculate the mean
and standard deviation of the distribution using the `np.mean()` and `np.std()` functions.
We also generate 1000 random numbers following a Poisson distribution with lambda 5 using the
`np.random.poisson()` function. We calculate the mean and variance of the Poisson distribution
using the `np.mean()` and `np.var()` functions.
We then plot the probability density function (PDF) and cumulative distribution function (CDF)
of both distributions using the `plt.hist()` and `plt.plot()` functions from the Matplotlib library.
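A minimal plotting sketch continuing from the arrays generated above (the bin count and figure size are arbitrary choices):

# Histogram (empirical PDF) and empirical CDF of the normal sample
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.hist(normal_dist, bins=30, density=True)
ax1.set_title("Normal sample: PDF (histogram)")
sorted_vals = np.sort(normal_dist)
ax2.plot(sorted_vals, np.arange(1, len(sorted_vals) + 1) / len(sorted_vals))
ax2.set_title("Normal sample: empirical CDF")
plt.show()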
3. Exercise:
- Generate a dataset of your choice or given by faculty with a given probability distribution
using NumPy random library functions
- Plot the probability density function and cumulative distribution function for the generated
data
- Calculate the descriptive statistics of the generated data
Interpretation/Program/code:
//write here
Output:
// Paste here
Conclusion:
This lab practical provided an opportunity to explore and implement various probability
distributions using NumPy random library functions. By understanding and applying different
probability distributions, one can model real-world phenomena and make predictions about future
events. With the knowledge gained in this lab practical, students will be equipped to work with
probability distributions and analyze data in a wide range of fields, including finance, engineering,
and social sciences.
Quiz:
1. Which NumPy function can be used to generate random numbers from a normal distribution?
a) numpy.random.uniform
b) numpy.random.poisson
c) numpy.random.normal
d) numpy.random.exponential
2. What is the purpose of the probability density function (PDF) in probability distributions?
Suggested References:-
1. Dinesh Kumar, Business Analytics: The Science of Data-Driven Decision Making, Wiley India
2. V.K. Jain, Data Science & Analytics, Khanna Book Publishing, New Delhi
3. Data Science For Dummies by Lillian Pierson, Jake Porway
Rubrics wise marks obtained
02 02 05 01 10
Experiment No: 5
Date:
AIM: Implementation of Estimation of Parameters for the Best-Fit Probability Distribution using
the Fitter Class in Python.
Objectives: The objective of this lab practical is to learn how to estimate the parameters for the
best-fit probability distribution for a given dataset using the Fitter class in Python.
Materials Used:
1. Python 3.x
2. Jupyter Notebook
3. NumPy library
4. Fitter library
Theory:
Dataset:
Consider the following dataset, which represents the heights of individuals in centimeters:
170, 165, 180, 172, 160, 175, 168, 155, 185, 190, 162, 178, 168, 172, 180, 160, 165, 172, 168,
175
Procedure:
Parameter estimation is important because it allows us to make inferences, predictions, and draw
meaningful conclusions from the data. By estimating the parameters, we can effectively model
and analyze various phenomena, summarizing complex datasets in a more simplified and
interpretable manner.
The concept of the best-fit probability distribution refers to finding the distribution that provides
the closest match to the observed data. The best-fit distribution is determined by estimating the
parameters in such a way that the observed data exhibits the highest likelihood or best matches the
underlying characteristics of the data. Selecting the best-fit distribution helps us understand the
data's behavior, make accurate predictions, and gain insights into its properties.
Commonly used probability distributions include the normal (Gaussian) distribution, uniform
distribution, exponential distribution, Poisson distribution, and binomial distribution. Each
distribution has its own characteristics and applications in various fields.
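For reference, a minimal sketch of how the Fitter class can be applied to the height data above (the candidate distribution list is an arbitrary choice; the methods used are the ones described in the Conclusion section of this experiment):

# Python
from fitter import Fitter

data = [170, 165, 180, 172, 160, 175, 168, 155, 185, 190,
        162, 178, 168, 172, 180, 160, 165, 172, 168, 175]

f = Fitter(data, distributions=['norm', 'expon', 'uniform'])
f.fit()
f.summary()             # goodness-of-fit table and histogram with fitted PDFs
print(f.get_best())     # best-fit distribution and its estimated parameters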
7. Conclusion:
- Summarize the importance of parameter estimation and the best-fit distribution in data
analysis.
- Highlight the capabilities of the Fitter class in Python for automating the estimation of
parameters.
- Discuss potential applications and further exploration in different domains.
Interpretation/Program/code:
//write here
Output:
// Paste here
Conclusion:
In this example, we have a dataset of heights of individuals. We use the Fitter class from the
`fitter` library to estimate the parameters for the best-fit probability distribution.
We instantiate the Fitter class with the dataset `data`. Then, we use the `.fit()` method to fit the
data to various distributions available in the Fitter class. The `.fit()` method automatically
estimates the parameters for each distribution and selects the best-fit distribution based on the
goodness-of-fit metrics.
Finally, we retrieve the best-fit distribution using the `.get_best()` method and print the summary
of the distribution using the `.summary()` method. We also plot the histogram of the dataset and
overlay the probability density function (PDF) of the best-fit distribution using the `.plot_pdf()`
method.
Note: Before running the code, make sure you have the `numpy`, `fitter`, and `matplotlib` libraries
installed. You can install the `fitter` library using pip: `pip install fitter`.
Through this practical, we learned the importance of parameter estimation in probability
distributions and the significance of selecting the best-fit distribution for accurate modeling and
analysis. The Fitter class provided a convenient and efficient way to fit the dataset to various
distributions and evaluate their goodness of fit using metrics such as AIC or BIC.
Quiz:
1. Which method of the Fitter class is used to fit the dataset to the candidate distributions?
a) fit
b) predict
c) evaluate
d) transform
Suggested References:-
1. Dinesh Kumar, Business Analytics: The Science of Data-Driven Decision Making, Wiley India
2. V.K. Jain, Data Science & Analytics, Khanna Book Publishing, New Delhi
3. Data Science For Dummies by Lillian Pierson, Jake Porway
Rubrics wise marks obtained
02 02 05 01 10
Experiment No: 6
Date:
AIM: Implementation of Linear Regression with Scikit-learn Library in Python
Objective:
The objective of this lab practical is to implement linear regression to predict the value of
a variable in a given dataset. Linear regression is a statistical technique used to model the
relationship between a dependent variable and one or more independent variables. In this
lab, we will explore how to build a linear regression model and use it to make
predictions.
Materials Used:
- Python 3.x
- Jupyter Notebook
- NumPy library
- Pandas library
- Matplotlib library
- Scikit-learn library
Dataset:
For this lab, we will use a dataset that contains information about houses and their sale
prices. The dataset has the following columns:
Procedure:
y = β0 + β1*x + ε
where:
- y is the dependent (target) variable,
- x is the independent (predictor) variable,
- β0 is the intercept and β1 is the slope coefficient, and
- ε is the random error term.
4. Data Preprocessing:
- Handle missing values, if any, by imputation or removal.
- Convert categorical variables into numerical representations, if required.
- Split the dataset into input features (independent variables) and the target variable
(dependent variable).
8. Visualization of Results:
- Visualize the actual values versus the predicted values using scatter plots or other
suitable plots.
- Plot the regression line to show the relationship between the independent and
dependent variables.
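A minimal sketch of the overall workflow; the file name house_prices.csv and the columns area and price are assumptions for illustration, since the actual dataset is provided by the faculty:

# Python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

df = pd.read_csv("house_prices.csv")          # hypothetical file name
X = df[["area"]]                              # independent variable(s)
y = df["price"]                               # dependent (target) variable

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)
print("MSE:", mean_squared_error(y_test, y_pred))
print("Intercept:", model.intercept_, "Slope:", model.coef_[0])

# Actual versus predicted values and the fitted regression line
plt.scatter(X_test["area"], y_test, label="actual")
xs = X_test.sort_values("area")
plt.plot(xs["area"], model.predict(xs), color="red", label="regression line")
plt.xlabel("area"); plt.ylabel("price"); plt.legend(); plt.show()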
Interpretation/Program/code:
//write here
Output:
// Paste here
Conclusion:
Quiz:
1. Which scikit-learn function is used to create a linear regression model object in Python?
a) sklearn.linear_model.LinearRegression
b) sklearn.preprocessing.StandardScaler
c) sklearn.model_selection.train_test_split
d) sklearn.metrics.mean_squared_error
a) To measure the average squared difference between predicted and actual values
b) To evaluate the significance of predictor variables
c) To quantify the proportion of variance in the dependent variable explained by the independent
variables
d) To determine the optimal number of features for the regression model
Suggested References:-
Rubrics wise marks obtained
04 04 10 04 20
Experiment No: 7
Date:
AIM: Implementation of Logistic Regression with Scikit-learn Library in Python
Objective:
The objective of this lab practical is to implement logistic regression using Scikit-learn
library in Python. Logistic regression is a popular classification algorithm used to model
the relationship between input variables and categorical outcomes. In this lab, we will
explore how to build a logistic regression model and use it for classification tasks.
Materials Used:
- Python 3.x
- Jupyter Notebook
- Scikit-learn library
- Pandas library
- NumPy library
- Matplotlib library
Dataset:
For this lab, we will use a dataset that contains information about customers and whether
they churned or not from a telecommunications company. The dataset has the following
columns:
CustomerID,Gender,Age,Income,Churn
1,Male,32,50000,0
2,Female,28,35000,0
3,Male,45,80000,1
4,Male,38,60000,0
5,Female,20,20000,1
6,Female,55,75000,0
7,Male,42,90000,0
8,Female,29,40000,1
Procedure:
4. Data Preprocessing:
- Split the dataset into input features (independent variables) and the target variable
(dependent variable).
- Convert categorical variables into numerical representations using one-hot encoding
or label encoding.
- Split the dataset into training and testing sets for model evaluation.
7. Visualization of Results:
- Visualize the model's performance using confusion matrix, ROC curve, or other
suitable visualizations.
- Plot the decision boundary to demonstrate the classification boundaries.
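A minimal sketch of the workflow using the sample churn columns shown above; the file name customer_churn.csv is an assumption for illustration:

# Python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix

df = pd.read_csv("customer_churn.csv")                       # hypothetical file name
df["Gender"] = df["Gender"].map({"Male": 0, "Female": 1})    # label encoding
X = df[["Gender", "Age", "Income"]]                          # input features
y = df["Churn"]                                              # target variable

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))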
Interpretation/Program/code:
//write here
Output:
// Paste here
8. Conclusion:
Quiz:
1. Which scikit-learn function is used to create a logistic regression model object in Python?
a) sklearn.linear_model.LogisticRegression
b) sklearn.preprocessing.StandardScaler
c) sklearn.model_selection.train_test_split
d) sklearn.metrics.accuracy_score
Suggested References:-
Rubrics wise marks obtained
04 04 10 02 20
Experiment No: 8
Date:
AIM: Implementation of Decision Tree for Student Classification
Relevant CO:- CO4
Objective:
The objective of this lab practical is to implement a decision tree algorithm to classify
students as either average or clever based on given student data. Decision trees are
widely used in machine learning and data mining for classification and regression tasks.
In this lab, we will explore how to build a decision tree model and use it to classify
students based on their attributes.
Materials Used:
- Python 3.x
- Jupyter Notebook
- Scikit-learn library
- Pandas library
- NumPy library
- Matplotlib library
Dataset:
For this lab, we will use a dataset that contains information about students and their
performance. The dataset has the following columns:
Procedure:
//Write here
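For reference, a minimal illustrative sketch of one possible implementation; the file name students.csv, the feature columns, and the label column are assumptions, since the actual dataset is provided by the faculty:

# Python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

df = pd.read_csv("students.csv")                     # hypothetical file name
X = df[["StudyHours", "Attendance", "TestScore"]]    # assumed feature columns
y = df["Category"]                                   # assumed label: 'average' / 'clever'

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3).fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, clf.predict(X_test)))

plot_tree(clf, feature_names=list(X.columns), class_names=list(clf.classes_), filled=True)
plt.show()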
Interpretation/Program/code:
//write here
Output:
// Paste here
Conclusion:
- The implementation of the decision tree algorithm proved effective in classifying
students as average or clever based on their attributes. Decision trees provide
interpretable results and can be used in various domains for classification tasks. The
decision tree model offers insights into the important features contributing to the
classification. This lab demonstrates the practical application of decision trees for student
classification.
Quiz:
1. In decision tree classification, what is the main objective of the splitting criterion?
Suggested References:-
Rubrics wise marks obtained
02 02 05 01 10