Data Analysis Lab - Final - 23-24

The document describes a data analysis lab course that covers: 1. Using Python libraries like NumPy and Pandas for data manipulation and visualization. 2. Performing data cleaning, wrangling, and various operations on data. 3. Visualizing data using Matplotlib and seaborn. The course content is divided into 4 units covering topics like NumPy, Pandas, data loading/storage, cleaning, wrangling, aggregation, time series analysis, and visualization. Experiments include tasks with NumPy, the Iris dataset, Series, DataFrames, and predictive analysis on various real-world datasets.

Uploaded by

forallofus435

CS352 Data Analysis Lab

Course Objectives:

The main objectives of the course are to:

1. Introduce Python libraries used for data manipulation and visualization
2. Create awareness of data cleaning, wrangling and various operations on data
3. Impart knowledge of visualizing data using various plots

Course Outcomes:

On successful completion of the course, students will be able to:

1. Perform operations on data using basic concepts of NumPy and Pandas
2. Perform data cleaning and data wrangling operations
3. Visualize data using Matplotlib
4. Perform aggregation operations and work with time series data

Course Content:

UNIT-I

NumPy Basics: Arrays and Vectorized Computation: The NumPy ndarray,
Universal Functions, Array-Oriented Programming with Arrays, File Input and
Output with Arrays, Linear Algebra, Pseudorandom Number Generation, Example:
Random Walks

Pandas Data Structure: Introduction to pandas Data Structures, Essential
Functionality, Summarizing and Computing Descriptive Statistics

UNIT-II

Data Loading, Storage, and File Formats: Reading and Writing Data in Text
Format, Binary Data Formats, Interacting with Web APIs, Interacting with
Databases.

Data Cleaning and Preparation: Handling Missing Data, Data Transformation,
String Manipulation.

UNIT-III

Data Wrangling: Join, Combine, and Reshape: Hierarchical Indexing,
Combining and Merging Datasets, Reshaping and Pivoting

Plotting and Visualization: A Brief matplotlib API Primer, Plotting with pandas
and seaborn.

UNIT-IV

Data Aggregation and Group Operations: GroupBy Mechanics, Data
Aggregation, Apply: General split-apply-combine, Pivot Tables and
Cross-Tabulation.

Time Series: Date and Time Data Types and Tools, Time Series Basics, Date
Ranges, Frequencies, and Shifting, Time Zone Handling, Periods and Period
Arithmetic, Resampling and Frequency Conversion, Moving Window Functions

Learning Resources:

Textbook(s):

1. Wes McKinney, Python for Data Analysis - Data Wrangling with Pandas,
NumPy, and IPython 2nd Edition. O’Reilly/SPD

References:

1. Jake VanderPlas, Python Data Science Handbook: Essential Tools for
Working with Data. O’Reilly/SPD
2. David Taieb, Data Analysis with Python: A Modern Approach, 1st
Edition, Packt Publishing

List of Experiments:

1. NumPy Array Operations
2. Iris Dataset
3. Pandas Series
4. Pandas DataFrames
5. Canada Pizza Price Prediction
6. Mobile Phone Price Dataset
7. National Universities Rankings
8. Adidas Sales Dataset
9. Movies Dataset
10. Avocado Prices

1. NumPy Array Operations

Write a Python program to do the following operations: Library: NumPy

a) Create a one-dimensional array and perform basic operations on it.
b) Create multi-dimensional arrays and find their shape and dimensions.
c) Create matrices full of zeros and full of ones.
d) Reshape and flatten data in the array.
e) Perform arithmetic operations on multi-dimensional arrays.
f) Append data vertically and horizontally.
g) Apply indexing and slicing on an array.
h) Use statistical functions on an array – Min, Max, Mean, Median and Standard Deviation.
i) Compute the dot (matrix) product of two arrays.
j) Compute the eigenvalues of a matrix.
k) Solve a linear matrix equation such as 3 * x0 + x1 = 9, x0 + 2 * x1 = 8.
l) Compute the multiplicative inverse of a matrix.
m) Compute the rank of a matrix.
n) Compute the determinant of an array.
o) Perform transpose and change of axes operations on arrays.
p) Perform splitting operations on arrays.
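
Several of the tasks above can be tried on the 2 x 2 system given in item k). The following is a minimal sketch, not a model answer; the variable names (A, b, m and so on) are chosen only for illustration.

```python
import numpy as np

# Coefficient matrix and right-hand side for 3*x0 + x1 = 9, x0 + 2*x1 = 8
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([9.0, 8.0])

x = np.linalg.solve(A, b)            # solution vector [x0, x1]
A_inv = np.linalg.inv(A)             # multiplicative inverse
rank = np.linalg.matrix_rank(A)      # rank of the matrix
det = np.linalg.det(A)               # determinant
eigenvalues = np.linalg.eigvals(A)   # eigenvalues

# Reshape, flatten, append and split a small array
m = np.arange(6).reshape(2, 3)       # 2-D array with shape (2, 3)
flat = m.flatten()                   # back to one dimension
stacked_v = np.vstack([m, m])        # append vertically, shape (4, 3)
stacked_h = np.hstack([m, m])        # append horizontally, shape (2, 6)
left, right = np.hsplit(stacked_h, 2)  # split into two (2, 3) halves

print("solution:", x)
print("rank:", rank, "determinant:", round(det, 2))
print("transpose shape:", m.T.shape)
```

The same calls work for larger square matrices as long as the matrix is non-singular; np.linalg.solve raises LinAlgError otherwise.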

2. Fisher’s Iris Dataset

Description:

This famous (Fisher’s or Anderson’s) iris data set gives the measurements in centimeters of
the variables sepal length and width and petal length and width, respectively, for 50
flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and
virginica.

Format:

iris is a data frame with 150 cases (rows) and 5 variables (columns) named Sepal.Length,
Sepal.Width, Petal.Length, Petal.Width, and Species.
The header is: sepal length, sepal width, petal length, petal width, iris, Species No. Species No. has
the value 1 for Iris setosa, 2 for Iris virginica and 3 for Iris versicolor.

Questions:

a) Load the data in the file Iris.txt in a 2-D array called iris.
b) Drop column whose index=4 from the array iris.
c) Display the shape, dimensions and size of iris.
d) Split iris into three 2-D arrays, one for each species. Call them iris1,
iris2, and iris3.
e) Print the three arrays iris1, iris2, iris3.
f) Create a 1-D array header having the elements “sepal length”, “sepal width”,
“petal length”, “petal width”, “species No” in that order.
g) Display the array header.
h) Find the max, min, mean, and standard deviation for the columns of iris and
store the results in the arrays iris_max, iris_min, iris_avg, iris_std, iris_var respectively. The
results must be rounded to not more than two decimal places.
i) Similarly find the max, min, mean, and standard deviation for the columns of
iris1, iris2, iris3 and store the results in arrays with appropriate names.
j) Check the minimum value for sepal length, sepal width, petal length, petal width
of the three species in comparison to the minimum value of sepal
length, sepal width, petal length, petal width for the data set as a whole and fill the
table below with True if the species value is greater than the dataset value and
False otherwise.
                Iris setosa    Iris virginica    Iris versicolor
Sepal length
Sepal width
Petal length
Petal width

k) Compare Iris setosa’s average sepal width to that of Iris virginica.
l) Compare Iris setosa’s average petal length to that of Iris virginica.
m) Compare Iris setosa’s average petal width to that of Iris virginica.
n) Save the array iris_avg in a comma-separated file named IrisMeanValues.txt on
the hard disk.
o) Save the arrays iris_max, iris_avg, iris_min in a comma-separated file named
IrisStat.txt on the hard disk.
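
The loading, splitting and summary steps above can be sketched as follows. Since Iris.txt is not reproduced here, the sketch assumes a comma-separated file and substitutes six made-up rows (two per species) with a placeholder value in column 4; the real file layout may differ, and the output is written to a temporary directory rather than a fixed path.

```python
import io
import os
import tempfile

import numpy as np

# Hypothetical stand-in for Iris.txt (made-up rows, not the real data).
# Columns: four measurements, a placeholder at index 4, species number.
sample = io.StringIO(
    "5.1,3.5,1.4,0.2,0,1\n"
    "4.9,3.0,1.4,0.2,0,1\n"
    "6.3,3.3,6.0,2.5,0,2\n"
    "5.8,2.7,5.1,1.9,0,2\n"
    "7.0,3.2,4.7,1.4,0,3\n"
    "6.4,3.2,4.5,1.5,0,3\n"
)
iris = np.loadtxt(sample, delimiter=",")
iris = np.delete(iris, 4, axis=1)          # drop the column with index 4
print(iris.shape, iris.ndim, iris.size)

# Split by species number (last column after the drop)
species_no = iris[:, -1]
iris1 = iris[species_no == 1]
iris2 = iris[species_no == 2]
iris3 = iris[species_no == 3]

# Column-wise statistics, rounded to two decimal places
iris_max = np.round(iris.max(axis=0), 2)
iris_min = np.round(iris.min(axis=0), 2)
iris_avg = np.round(iris.mean(axis=0), 2)
iris_std = np.round(iris.std(axis=0), 2)
iris1_min = np.round(iris1.min(axis=0), 2)

# True where the species minimum exceeds the whole-dataset minimum
setosa_gt = iris1_min[:4] > iris_min[:4]

# Save three statistics rows to a comma-separated file
out_path = os.path.join(tempfile.mkdtemp(), "IrisStat.txt")
np.savetxt(out_path, np.vstack([iris_max, iris_avg, iris_min]),
           delimiter=",", fmt="%.2f")
reloaded = np.loadtxt(out_path, delimiter=",")
```

With the real file, replace the StringIO object by the file path; the comparison for the other two species follows the same pattern as setosa_gt.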

3. Pandas Series Programs


Write a Python program to do the following operations: Library: Pandas Series

a) To add, subtract, multiply and divide two pandas Series.
b) To convert all the string values to upper and lower case in a given pandas Series.
Also find the length of the string values.
c) To remove whitespace, left-side whitespace and right-side whitespace from the
string values of a given pandas Series.
d) To create a Series from a list, NumPy array and dict.
e) To calculate the number of characters in each word in a Series.
f) To compare the elements of two pandas Series.
Sample Series: [2, 4, 6, 8, 10], [1, 3, 5, 7, 10]
g) To convert a pandas Series to a Python list and display its type.
h) To create a Series from a list, NumPy array and dict.
i) To combine many Series to form a DataFrame.
j) To stack two Series vertically and horizontally.
k) To create and display a DataFrame from a specified dictionary which has the
index labels.
l) To identify frequency counts of unique items of a Series.
m) To get the items of Series A not present in Series B.
n) To convert a numpy array to a dataframe of given shape.
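
A few of these Series tasks are sketched below using the sample Series from item f). The string values and the names a, b and words are made up for illustration.

```python
import numpy as np
import pandas as pd

a = pd.Series([2, 4, 6, 8, 10])
b = pd.Series([1, 3, 5, 7, 10])

# Element-wise arithmetic and comparison
total = a + b
ratio = a / b
equal = a == b
greater = a > b

# Items of a that are not present in b
only_in_a = a[~a.isin(b)]

# Series from a dict and from a NumPy array
from_dict = pd.Series({"x": 1, "y": 2})
from_array = pd.Series(np.array([1.5, 2.5]))

# String handling through the .str accessor
words = pd.Series(["  NumPy ", "pandas  ", " Series"])
stripped = words.str.strip()          # lstrip()/rstrip() trim one side only
lengths = stripped.str.len()
upper = stripped.str.upper()

# Frequency counts of unique items
counts = pd.Series(["a", "b", "a", "c", "a"]).value_counts()

# Stack two Series horizontally (as columns) and vertically (end to end)
side_by_side = pd.concat([a, b], axis=1)
end_to_end = pd.concat([a, b], axis=0, ignore_index=True)

# NumPy array to a DataFrame of a given shape
frame = pd.DataFrame(np.arange(12).reshape(4, 3))

as_list = a.tolist()
print(type(as_list))
```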

4. Pandas DataFrames Programs

Write a Python program to do the following operations: Library: Pandas DataFrames

I. Import and read a CSV file.
II. Generate a basic understanding of the given data.
a. Print the first 5 rows and the last 5 rows of the data.
b. Check the basic information of the data.
c. Extract the shape of the data.
d. Print the unique values of the marital status column.
e. Make the data consistent: “widow” and “widowed” are two different names for the same
category in the column values.
f. Check for duplicates and null values in the whole dataset.
III. Select and filter data based on conditions.
a. Select a subset of the columns (Birthdate, Education, and Income of every customer)
from the data frame.
b. Use the .loc[] and .iloc[] indexers to retrieve the first seven data points.
c. Filter data using .loc[] and the isin() method. (Note: choose the variable of interest
and select the categories.)
d. Select the rows that satisfy two conditions, such as customers with an income
higher than 75,000 and with a master’s degree (using Python operators), and display the
output.
IV. Apply various data operations such as creating new variables or changing data types.
We can apply different operations on the dataset using Pandas, such as:
a. setting a new index with the variable of interest using the .set_index() method;
b. sorting the data frame by one of the variables using .sort_values() in ascending or
descending order;
c. creating a new variable which could be the result of a mathematical operation such as
the sum of other variables;
d. changing the data type of variables into datetime or integer types;
e. determining the age based on year of birth;
f. creating the week date (calendar week and year) from the purchase date.
V. Perform data aggregation using group by and pivot table methods.
After creating new variables, we can further aggregate and analyze the data by groups.
a. Apply the groupby() method to find the mean of income, recency, and number of web and
store purchases by educational group.
b. Apply the pivot_table() method to find the aggregated sum of purchases and mean of
recency per education and marital status group.
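
The cleaning, filtering and aggregation steps above can be sketched on a small in-memory table. The CSV file is not named in this manual, so the column names (Education, Marital_Status, Income, Recency, Purchases) and all values below are assumptions made only for illustration.

```python
import pandas as pd

# Hypothetical customer data; column names and values are assumptions.
df = pd.DataFrame({
    "Education": ["Master", "PhD", "Master", "Graduation", "Master"],
    "Marital_Status": ["Widow", "Married", "Widowed", "Single", "Married"],
    "Income": [80000, 60000, 90000, 40000, 70000],
    "Recency": [10, 20, 30, 40, 50],
    "Purchases": [5, 3, 8, 2, 6],
})

# II.e: make the two spellings of the same category consistent
df["Marital_Status"] = df["Marital_Status"].replace({"Widow": "Widowed"})

# II.f: duplicates and null values
n_duplicates = df.duplicated().sum()
n_nulls = df.isnull().sum()

# III.b: first rows by label and by position (.loc and .iloc use brackets)
first_by_label = df.loc[0:2]     # label slice, end label included
first_by_position = df.iloc[0:3]  # position slice, end excluded

# III.c: filter with .loc and isin()
selected = df.loc[df["Marital_Status"].isin(["Married", "Single"]),
                  ["Education", "Income"]]

# III.d: two conditions combined with &
high_income_masters = df[(df["Income"] > 75000) & (df["Education"] == "Master")]

# V.a: group means by education
by_education = df.groupby("Education")[["Income", "Recency"]].mean()

# V.b: pivot table of summed purchases per education and marital status
pivot = df.pivot_table(index="Education", columns="Marital_Status",
                       values="Purchases", aggfunc="sum")
print(by_education)
print(pivot)
```

Note that each condition in a combined filter needs its own parentheses, because & binds more tightly than the comparison operators.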

5. Canada Pizza Price Prediction


Columns:

company, price_cad, diameter, topping, variant, size, extra_sauce, extra_cheese,
extra_mushrooms

Questions:

a) Count the number of null values in the pizza dataset and replace null values with the
average of the concerned columns.
b) Calculate the average price of pizza prepared by each company.
c) Find the companies that prepared pizzas of different variants with the same
diameter.
d) Which company has more pizzas? Show the result with a graph.
e) Check whether the pizza dataset contains null values, count the number of null
values in the dataset, and find the number of missing data points per column.
f) Rename the column price_cad as price.
g) Identify the number of companies in each category.
h) Identify which type of pizza is more expensive.
i) Find the diameter of jumbo size pizzas.
j) If any jumbo pizza with a diameter less than 16 exists, remove such rows.
k) Calculate the average price of a pizza prepared by company A.
l) Find the mean of the diameter and the average price of pizzas prepared by company C.
m) Find the companies that prepared pizzas of different variants with the same
diameter.
n) Find the pizza variant with extra_mushrooms and a chicken topping.
o) What is the most expensive pizza in each company?
p) Which company has more pizzas on the menu? Show the result with a graph.
q) What is the average price of pizza in each company?
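
The missing-value and per-company questions can be approached as in the sketch below. It uses a handful of made-up rows with only four of the listed columns; the values are not taken from the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical rows using a subset of the listed columns.
pizza = pd.DataFrame({
    "company": ["A", "A", "B", "B", "C", "C"],
    "price_cad": [10.0, np.nan, 20.0, 30.0, 15.0, 12.0],
    "diameter": [12.0, 14.0, np.nan, 16.0, 18.0, 14.0],
    "size": ["medium", "large", "large", "jumbo", "jumbo", "jumbo"],
})

# Missing data points per column
nulls = pizza.isnull().sum()

# Replace nulls in numeric columns with the column average
num_cols = pizza.select_dtypes("number").columns
pizza[num_cols] = pizza[num_cols].fillna(pizza[num_cols].mean())

# Rename price_cad to price
pizza = pizza.rename(columns={"price_cad": "price"})

# Average price and number of pizzas per company
avg_price = pizza.groupby("company")["price"].mean()
pizzas_per_company = pizza["company"].value_counts()

# Remove jumbo pizzas with a diameter below 16
pizza = pizza[~((pizza["size"] == "jumbo") & (pizza["diameter"] < 16))]

# Most expensive pizza in each company
most_expensive = pizza.loc[pizza.groupby("company")["price"].idxmax(),
                           ["company", "price"]]
print(avg_price)
print(most_expensive)
```

For the graph questions, pizzas_per_company.plot(kind="bar") is one option once Matplotlib is installed.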

6. Mobile Phone Price Dataset

Columns:

o Brand: the manufacturer of the phone
o Model: the name of the phone model
o Storage (GB): the amount of storage space (in gigabytes) available on the phone
o RAM (GB): the amount of RAM (in gigabytes) available on the phone
o Screen Size (inches): the size of the phone's display screen in inches
o Camera (MP): the megapixel count of the phone's rear camera(s)
o Battery Capacity (mAh): the capacity of the phone's battery in milliampere hours
o Price ($): the retail price of the phone in US dollars

Questions:

a) Identify the models released by each brand, along with their prices.
b) Identify the correlation between Battery Capacity and Price.
c) Find how many models there are for each Battery Capacity with the same price.
d) Count the number of models in each brand with the highest storage. Draw the graph.
e) Identify how many models are released by each brand.
f) Find the RAM capacity of all models of every brand.
g) Identify the correlation between Battery Capacity and Price.
h) Find how many models there are for each Battery Capacity.
i) Calculate the average price of each brand.
j) Find which mobile brand has the highest price.
k) Identify whether there are any missing values in the mobile phone price dataset.
l) Display all models associated with the Apple brand.
m) Find the mobile prices based on Camera (MP).
n) List the models, along with their brands, which have the highest storage.
o) Find how many models in each brand have RAM > 6.
p) List the models having price >600 and Storage between 100 and 200.
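
A possible starting point for the correlation, grouping and filtering questions is sketched below. The column names follow the list above; the brands' model names (P1, S1, ...) and all numbers are made up.

```python
import pandas as pd

# Hypothetical rows; model names and values are invented for illustration.
phones = pd.DataFrame({
    "Brand": ["Apple", "Apple", "Samsung", "Samsung", "Xiaomi"],
    "Model": ["P1", "P2", "S1", "S2", "X1"],
    "Storage (GB)": [128, 256, 128, 512, 64],
    "RAM (GB)": [6, 8, 8, 12, 4],
    "Battery Capacity (mAh)": [3000, 4000, 4500, 5000, 5000],
    "Price ($)": [799, 999, 699, 1199, 199],
})

# Pearson correlation between battery capacity and price
corr = phones["Battery Capacity (mAh)"].corr(phones["Price ($)"])

# Models and average price per brand
models_per_brand = phones.groupby("Brand")["Model"].count()
avg_price = phones.groupby("Brand")["Price ($)"].mean()

# Brand of the single most expensive phone
top_brand = phones.loc[phones["Price ($)"].idxmax(), "Brand"]

# Models with RAM above 6 GB, counted per brand
ram_gt6 = phones[phones["RAM (GB)"] > 6].groupby("Brand")["Model"].count()

# Price above 600 and storage between 100 and 200 (inclusive)
selected = phones[(phones["Price ($)"] > 600)
                  & phones["Storage (GB)"].between(100, 200)]

# Models with the highest storage
max_storage = phones[phones["Storage (GB)"] == phones["Storage (GB)"].max()]
print(selected[["Brand", "Model"]])
```

If the real file stores prices as text (for example with a currency symbol), they would need converting to numbers before these calls work.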

7. National Universities Rankings

Columns:

o Name – Institution name
o Location – City and state where located
o Rank – Ranking according to U.S. News & World Report
o Description – Snippet of text overview from U.S. News
o Tuition and fees – Combined tuition and fees for out–of–state students
o In–state – Tuition and fees for in–state students
o Undergraduate Enrollment – Number of enrolled undergraduate students

Questions:

a) Find the universities, along with their state, whose fee is between $25,000 and $30,000.
b) Find the universities where undergraduate enrollment is more than 25,000, including
in-state students.
c) Find the states where universities are located in three or more cities.
d) Find the max & min tuition fee in each state.
e) Find the city & state where the maximum tuition fee difference among the universities in
that city is greater than 5,000.
f) Print the names of universities that have branches, along with the number and names of the
branches.
g) Print each university name and where it is located.
h) Find cities having more than 2 universities, along with their state.
i) Find the no. of states and cities in which the top 100 universities are located.
j) Draw a plot to show the undergraduate enrollment of each university.
k) Draw a plot to show each university name and its corresponding tuition fee.
l) Plot the no. of universities in each state having ranks>100.
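
Most of these questions need the city and state as separate columns and the fees as numbers. The sketch below assumes that Location has the form "City, ST" and that fees are stored as text such as "$45,000"; both are assumptions about the file, and the four rows are made up.

```python
import pandas as pd

# Hypothetical rows; the formats of Location and fees are assumed.
unis = pd.DataFrame({
    "Name": ["U1", "U2", "U3", "U4"],
    "Location": ["Boston, MA", "Cambridge, MA", "Austin, TX", "Boston, MA"],
    "Tuition and fees": ["$45,000", "$47,000", "$28,000", "$26,500"],
    "Undergraduate Enrollment": [6000, 7000, 40000, 18000],
})

# Split "City, ST" into two columns
unis[["City", "State"]] = unis["Location"].str.split(", ", expand=True)

# Strip "$" and "," and convert the fee to a number
unis["Tuition"] = (unis["Tuition and fees"]
                   .str.replace(r"[$,]", "", regex=True)
                   .astype(float))

# Universities with a fee between 25,000 and 30,000 (inclusive)
in_range = unis[unis["Tuition"].between(25000, 30000)][["Name", "State"]]

# Max and min tuition per state
fee_by_state = unis.groupby("State")["Tuition"].agg(["max", "min"])

# Number of distinct cities per state
cities_per_state = unis.groupby("State")["City"].nunique()

# Cities where the fee difference between universities exceeds 5,000
spread = unis.groupby(["City", "State"])["Tuition"].agg(lambda s: s.max() - s.min())
big_spread = spread[spread > 5000]

# Universities per city
unis_per_city = unis.groupby(["City", "State"]).size()
print(fee_by_state)
```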

8. Adidas Sales Dataset

Columns:

o Retailer ID
o Invoice Date
o Region
o State
o City
o Product
o Price per Unit
o Units Sold
o Total Sales
o Operating Profit
o Operating Margin
o Sales Method

Questions:

a) List all the products sold in every region.
b) Find the cities & the retailers who sold women’s products.
c) Find the total sales of each women’s product sold by the in-store method.
d) For each product, find region-wise total sales & units sold.
e) For men’s & women’s products, find state-wise units sold & total sales.
f) Find states where more women’s products were sold than men’s products.
g) Find region-wise units sold for each product.
h) Find region-wise profit for every retailer.
i) Find the states, along with units sold, where products were sold in more than one city in
the state.
j) Draw a plot to show monthly sales in 2020 in every region.
k) Draw a plot to show year-wise sales in every region.
l) Draw plots to show Region wise sales in every year.
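
The region-wise and date-based questions rely on grouping by one or more keys, including parts of the invoice date. The sketch below uses a subset of the listed columns with made-up rows; the product labels and the spelling "In-store" are assumptions about how the file encodes them.

```python
import pandas as pd

# Hypothetical rows; product labels and values are invented.
sales = pd.DataFrame({
    "Invoice Date": pd.to_datetime(
        ["2020-01-05", "2020-01-20", "2020-02-10", "2021-01-15"]),
    "Region": ["West", "West", "South", "West"],
    "Product": ["Women's Apparel", "Men's Apparel",
                "Women's Apparel", "Women's Apparel"],
    "Units Sold": [10, 20, 5, 7],
    "Total Sales": [500.0, 800.0, 250.0, 350.0],
    "Sales Method": ["In-store", "Online", "In-store", "In-store"],
})

# Region-wise total sales and units sold for each product
region_product = (sales.groupby(["Product", "Region"])
                  [["Total Sales", "Units Sold"]].sum())

# Total in-store sales of each women's product
women_instore = sales[sales["Product"].str.startswith("Women")
                      & (sales["Sales Method"] == "In-store")]
women_instore_total = women_instore.groupby("Product")["Total Sales"].sum()

# Monthly sales in 2020 per region (rows: month, columns: region)
s2020 = sales[sales["Invoice Date"].dt.year == 2020]
monthly_2020 = (s2020.groupby([s2020["Invoice Date"].dt.month, "Region"])
                ["Total Sales"].sum().unstack(fill_value=0))

# Year-wise sales per region
yearly = (sales.groupby([sales["Invoice Date"].dt.year, "Region"])
          ["Total Sales"].sum())
print(monthly_2020)
```

A table shaped like monthly_2020 can be passed to its .plot() method to draw one line per region when Matplotlib is available.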

9. Movies Dataset

Columns:

o Title
o US Gross
o Worldwide Gross
o US DVD Sales
o Production Budget
o Release Date
o MPAA Rating
o Running Time (min)
o Distributor
o Source
o Major Genre
o Creative Type
o Director
o Rotten Tomatoes Rating
o IMDB Rating
o IMDB Votes

Questions:

a) Find the number of movies released under each genre in each year.
b) Find the movies that made a loss, every year, for each distributor.
c) Find the directors who directed for each creative type with an IMDB rating above 6.
d) Draw a plot to compare the number of movies released till now by each director.
e) Find the genres of the movies released in each year, in ascending order.
f) Find the budgets of the movies released by each distributor, along with the movie
names.
g) Find the movies with the same IMDB rating but with a different no. of IMDB votes.
h) Write a Pandas program to get those movies whose revenue is more than 2 million
and whose budget is less than 1 million.
i) Find the no. of movies in each genre under each source.
j) Find the no. of movies released in each decade.
k) Draw a plot showing the no. of movies released in each genre.
l) Show the no. of movies not rated under each genre in each fiction.
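
The year, decade and profit questions can be sketched as follows. The rows are made up, and treating a movie as a loss when Worldwide Gross is below Production Budget is an assumption; the manual does not define loss or revenue.

```python
import pandas as pd

# Hypothetical rows using a subset of the listed columns.
movies = pd.DataFrame({
    "Title": ["M1", "M2", "M3", "M4"],
    "Release Date": pd.to_datetime(
        ["1995-06-01", "1999-12-10", "2004-03-05", "2008-07-18"]),
    "Major Genre": ["Drama", "Comedy", "Drama", "Drama"],
    "Production Budget": [500_000, 5_000_000, 800_000, 20_000_000],
    "Worldwide Gross": [3_000_000, 1_000_000, 2_500_000, 50_000_000],
})

# Derive year and decade from the release date
movies["Year"] = movies["Release Date"].dt.year
movies["Decade"] = (movies["Year"] // 10) * 10

# Number of movies per decade and per (year, genre)
per_decade = movies.groupby("Decade")["Title"].count()
genre_year = movies.groupby(["Year", "Major Genre"]).size()

# Assumed definition of a loss: gross below budget
losses = movies[movies["Worldwide Gross"] < movies["Production Budget"]]

# Revenue above 2 million with a budget below 1 million
low_budget_hits = movies[(movies["Worldwide Gross"] > 2_000_000)
                         & (movies["Production Budget"] < 1_000_000)]

# Number of movies per genre
per_genre = movies["Major Genre"].value_counts()
print(per_decade)
```
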

10. Avocado Prices

Historical data on avocado prices and sales volume in multiple US markets.

Some relevant columns in the dataset:

o Date - The date of the observation
o AveragePrice - The average price of a single avocado
o type - Conventional or organic
o year - The year
o Region - The city or region of the observation
o Total Volume - Total number of avocados sold
o 4046 - Total number of avocados with PLU 4046 sold
o 4225 - Total number of avocados with PLU 4225 sold
o 4770 - Total number of avocados with PLU 4770 sold

About this file:
Numerical column names refer to price lookup (PLU) codes.
4046 - small Hass
4225 - large Hass
4770 - extra large Hass
Questions:

a) Identify the unique values in the region column.
b) What is the maximum price for an avocado in the dataset?
c) Identify the type distribution and, for the price of a single avocado in the dataset, find
the median, mean, and standard deviation.
d) Find the highest and lowest price for conventional avocados, along with the year and location.
e) Draw plots of the distribution of average price for the different types of avocados.
f) Find the correlation matrix to measure the strength of the correlation between
variables.
g) Find out whether the volume of avocado sales has increased in the last 5 years.
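
The summary and trend questions can be sketched on a few made-up rows. The column names follow the list above (the real file may spell Region differently), and checking that yearly totals rise every year is only one possible reading of "has increased".

```python
import pandas as pd

# Hypothetical rows; values are invented for illustration.
avocado = pd.DataFrame({
    "AveragePrice": [1.0, 1.2, 1.8, 2.0],
    "type": ["conventional", "conventional", "organic", "organic"],
    "year": [2015, 2016, 2015, 2016],
    "Region": ["Albany", "Boston", "Albany", "Boston"],
    "Total Volume": [1000.0, 1500.0, 100.0, 200.0],
})

# Unique regions and maximum price
regions = avocado["Region"].unique()
max_price = avocado["AveragePrice"].max()

# Type distribution and price statistics per type
type_counts = avocado["type"].value_counts()
stats = avocado.groupby("type")["AveragePrice"].agg(["median", "mean", "std"])

# Highest and lowest price for conventional avocados, with year and region
conv = avocado[avocado["type"] == "conventional"]
highest = conv.loc[conv["AveragePrice"].idxmax(), ["year", "Region", "AveragePrice"]]
lowest = conv.loc[conv["AveragePrice"].idxmin(), ["year", "Region", "AveragePrice"]]

# Correlation matrix of the numeric variables
corr = avocado[["AveragePrice", "Total Volume"]].corr()

# Total volume per year and whether it rises from year to year
volume_by_year = avocado.groupby("year")["Total Volume"].sum()
increased = volume_by_year.is_monotonic_increasing
print(stats)
print(volume_by_year)
```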
