BDA File

The document provides an introduction to the Matplotlib and Pandas libraries in Python. It discusses key concepts like figures, axes, artists in Matplotlib and series, DataFrames, data loading and manipulation in Pandas. Examples are given to demonstrate plotting common charts like line, bar, pie, histogram using Matplotlib and working with DataFrames to load, explore and analyze data using Pandas.

Uploaded by

sahil raturi

MATU RAM INSTITUTE OF

ENGINEERING & MANAGEMENT


ROHTAK

Practical File
“Big Data Analytics”

Submitted To:
Dr. Kiran Malik
A.P. in C.S.E. Dept.
MRIEM, Rohtak

Submitted By:
Lalit Rohilla
5241/C.S.E./20
Matplotlib Introduction
Matplotlib is one of the most popular plotting libraries for Python. It is an open-source,
cross-platform library for making 2D plots from data held in arrays. It is generally used for
data visualization, presenting data through various kinds of graphs.

Matplotlib was originally conceived by John D. Hunter in 2003. Version 2.2.0 was released in
2018, and newer versions have been released since.

Before working with the Matplotlib library, we need to install it in our Python environment,
for example with pip install matplotlib.

Figure: The whole figure, which may hold one or more Axes (plots). We can think of a Figure
as a canvas that holds plots.

Axes: A Figure can contain several Axes. Each Axes contains two (or three, in the case of 3D)
Axis objects, and each Axes has a title, an x-label, and a y-label.

Axis: Axis objects are number-line-like objects that set the graph limits and generate the
ticks and tick labels.

Artist: Everything we see on the figure is an Artist, including Text objects, Line2D objects,
and collection objects. Most Artists are tied to an Axes.
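
The relationship between Figure, Axes, Axis, and Artist can be seen in a short sketch using Matplotlib's object-oriented interface. The variable names and data values here are illustrative assumptions, not part of the original file.

```python
import matplotlib.pyplot as plt

# One Figure (the canvas) holding two Axes (the plots) side by side
fig, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(8, 3))

# Plotting calls return Artist objects; plot() returns a list of Line2D
(line,) = ax_left.plot([1, 2, 3], [4, 5, 1])
ax_left.set_title("Left plot")
ax_left.set_xlabel("x")
ax_left.set_ylabel("y")

# A second Axes on the same Figure
ax_right.bar(["a", "b", "c"], [3, 7, 5])
ax_right.set_title("Right plot")

# Each Axes owns Axis objects that control limits and ticks
ax_left.xaxis.set_ticks([1, 2, 3])
ax_left.set_ylim(0, 6)

print(type(fig).__name__)    # Figure
print(len(fig.axes))         # 2
print(type(line).__name__)   # Line2D
```

In an interactive session, calling plt.show() afterwards displays the figure.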
Introduction to pyplot
Matplotlib provides the pyplot module, which is used to plot graphs of given data.
matplotlib.pyplot is a set of command-style functions that make Matplotlib work like
MATLAB. The pyplot module contains many functions that create a figure, create a plotting
area in a figure, plot lines in a plotting area, decorate the plot with labels, and so on.

Program
from matplotlib import pyplot as plt
# plot the data onto our canvas
plt.plot([1,2,3],[4,5,1])
#display the graph
plt.show()

Output
1. Line Graph

A line chart displays information as a series of data points connected by line segments. It is
easy to plot. Consider the following example.

Program

from matplotlib import pyplot as plt


x = [1,2,3]
y = [10,11,12]
plt.plot(x,y)
plt.title("Line graph")
plt.ylabel('Y axis')
plt.xlabel('X axis')
plt.show()

Output
Program
from matplotlib import pyplot as plt
from matplotlib import style
style.use('ggplot')
x = [10, 12, 13]
y = [8, 16, 6]
x2 = [8, 15, 11]
y2 = [6, 15, 7]
fig = plt.figure() # Create figure before plotting
plt.plot(x, y, 'b', label='line one', linewidth=5)
plt.plot(x2, y2, 'r', label='line two', linewidth=5)
plt.title('Epic Info')
plt.ylabel('Y axis')
plt.xlabel('X axis')
plt.legend() # show the labels given to the two lines above
plt.show()

Output
2. Bar Graph

A bar graph is one of the most common graphs and is used to represent data associated with
categorical variables. In the example below, the bar() function is given three arguments: the
categories, their values, and a color.

Program
from matplotlib import pyplot as plt
Names = ['Arun','James','Ricky','Patrick']
Marks = [51,87,45,67]
plt.bar(Names,Marks,color = 'blue')
plt.title('Result')
plt.xlabel('Names')
plt.ylabel('Marks')
plt.show()

Output
3. Pie Chart

A pie chart is a circular graph divided into segments (slices). It is used to represent
percentage or proportional data, where each slice of the pie represents a particular category.

Program
from matplotlib import pyplot as plt
# Pie chart, where the slices will be ordered and plotted counter-clockwise:
Aus_Players = 'Smith', 'Finch', 'Warner', 'Lumberchane'
Runs = [42, 32, 18, 24]
explode = (0.1, 0, 0, 0) # "explode" (offset) only the 1st slice
fig1, ax1 = plt.subplots()
ax1.pie(Runs, explode=explode, labels=Aus_Players, autopct='%1.1f%%',
shadow=True, startangle=90)
ax1.axis('equal') # Equal aspect ratio ensures that pie is drawn as a circle.
plt.show()

Output
4. Histogram

A histogram and a bar graph look quite similar, but they differ in purpose. A histogram
represents the distribution of a numerical variable, whereas a bar chart compares different
entities. A histogram plots how many values fall into each of a set of value ranges (bins).

In the following example, we take the percentage scores of a group of students and plot a
histogram of the number of students in each score range.

Program

from matplotlib import pyplot as plt


percentage = [97, 54, 45, 10, 20, 10, 30, 97, 50, 71, 40, 49, 40, 74, 95, 80, 65, 82, 70,
65, 55, 70, 75, 60, 52, 44, 43, 42, 45]
# Bin edges: score ranges 0-10, 10-20, ..., 90-100
bin_edges = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
plt.hist(percentage, bins=bin_edges, histtype='bar', rwidth=0.8)
plt.xlabel('Percentage')
plt.ylabel('Number of Students')
plt.title('Histogram')
plt.show()

Output
Pandas Introduction
Pandas is a popular Python library for data manipulation and analysis, offering powerful data
structures and functions to work with structured data effectively. Its versatility, simplicity, and
integration with other libraries make it a fundamental tool for data scientists, analysts, and
developers.
At the heart of Pandas are two primary data structures: Series and DataFrame. A Series is a
one-dimensional array-like object that can hold various data types such as integers, floats,
strings, or even Python objects. On the other hand, a DataFrame is a two-dimensional labeled
data structure resembling a table or spreadsheet, consisting of rows and columns. DataFrames
are particularly useful for working with tabular data and offer a wide range of functionalities
for data manipulation and analysis.
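
As a minimal sketch of the two structures, the following builds a Series and then a DataFrame from it. The values are taken from the first three rows of the location example in this file; the variable names are assumptions made for illustration.

```python
import pandas as pd

# A Series: one-dimensional values with labels (here, city names)
population = pd.Series(
    [500000, 200000, 150000],
    index=["Rohtak", "Jhajjar", "Bhiwani"],
    name="population",
)

# A DataFrame: a table whose columns are Series sharing one index
df = pd.DataFrame({
    "population": population,
    "distance_from_rohtak_km": pd.Series([0, 30, 50], index=population.index),
})

print(population["Jhajjar"])              # 200000
print(df.shape)                           # (3, 2)
print(type(df["population"]).__name__)    # Series
```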
Pandas simplifies data ingestion from various sources such as CSV files, Excel spreadsheets,
SQL databases, and JSON files, making it easy to load and preprocess data for analysis. Once
data is loaded into a DataFrame, Pandas provides a rich set of functions for data cleaning,
transformation, and manipulation. Users can perform tasks like filtering rows, selecting
columns, handling missing values, merging datasets, and performing group-by operations with
ease.
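
The tasks listed above might look like the following sketch. Both tables are hypothetical and invented for illustration; only the Pandas calls themselves (fillna, loc, groupby, merge) are standard.

```python
import numpy as np
import pandas as pd

# Hypothetical order data, invented for illustration
orders = pd.DataFrame({
    "category": ["Office", "Tech", "Office", "Tech", "Furniture"],
    "sales": [20.0, 150.0, np.nan, 90.0, 300.0],
    "quantity": [2, 1, 3, 2, 1],
})

# Handle missing values: replace the missing sale with 0
orders["sales"] = orders["sales"].fillna(0.0)

# Filter rows and select columns in one step
big_orders = orders.loc[orders["sales"] > 50, ["category", "sales"]]

# Group-by: total sales per category
totals = orders.groupby("category")["sales"].sum()

# Merge with a second (also hypothetical) table
managers = pd.DataFrame({
    "category": ["Office", "Tech", "Furniture"],
    "manager": ["A", "B", "C"],
})
merged = orders.merge(managers, on="category", how="left")

print(len(big_orders))    # 3
print(totals["Tech"])     # 240.0
print(merged.shape)       # (5, 4)
```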
Pandas also excels in data analysis and exploration, offering powerful tools for summarizing
and visualizing data. Users can calculate descriptive statistics, compute aggregate metrics, and
generate visualizations such as histograms, scatter plots, and time series plots directly from
DataFrame objects. Additionally, Pandas integrates seamlessly with other Python libraries such
as Matplotlib and Seaborn for advanced data visualization.
Another key feature of Pandas is its robust indexing and selection capabilities. Users can access
and modify data within DataFrames using label-based indexing (loc), integer-based indexing
(iloc), or boolean indexing, enabling precise data manipulation and analysis.
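
A short sketch of the three indexing styles, using the location data that appears in the program below. Setting the location column as the index is an assumption made here so that label-based access reads naturally.

```python
import pandas as pd

df = pd.DataFrame({
    "location": ["Rohtak", "Jhajjar", "Bhiwani", "Hisar", "Sirsa"],
    "population": [500000, 200000, 150000, 300000, 250000],
    "distance_from_rohtak_km": [0, 30, 50, 80, 110],
}).set_index("location")

# Label-based (loc): row "Hisar", column "population"
print(df.loc["Hisar", "population"])    # 300000

# Integer-based (iloc): first two rows, first column
print(df.iloc[:2, 0].tolist())          # [500000, 200000]

# Boolean: rows where the distance exceeds 40 km
far = df[df["distance_from_rohtak_km"] > 40]
print(far.index.tolist())               # ['Bhiwani', 'Hisar', 'Sirsa']

# loc can also be used to modify a value in place
df.loc["Sirsa", "population"] = 260000
```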
Overall, Pandas is a versatile and indispensable library for data manipulation and analysis in
Python. Its intuitive data structures, rich functionalities, and seamless integration with other
libraries make it an essential tool for a wide range of data-related tasks, from data cleaning and
preprocessing to exploratory data analysis and modeling. Whether working with small or large
datasets, Pandas provides the tools needed to efficiently handle and analyze structured data in
Python.
Program
import pandas as pd
# Define the data
data = {
"location": ["Rohtak", "Jhajjar", "Bhiwani", "Hisar", "Sirsa"],
"population": [500000, 200000, 150000, 300000, 250000],
"distance_from_rohtak_km": [0, 30, 50, 80, 110]
}
# Create a DataFrame
df = pd.DataFrame(data)
# Print the DataFrame
print(df)
Output
  location  population  distance_from_rohtak_km
0   Rohtak      500000                        0
1  Jhajjar      200000                       30
2  Bhiwani      150000                       50
3    Hisar      300000                       80
4    Sirsa      250000                      110

Program
import pandas as pd
# Read CSV file from our computer
# Make sure 'data.csv' is in the same directory or provide the correct path
data_frame = pd.read_csv('data.csv')
# Print the 'Order ID' column
print(data_frame['Order ID'])
Output
0 CA-2013-138688
1 CA-2011-115812
2 CA-2011-115812
3 CA-2011-115812
4 CA-2011-115812
...
3198 CA-2013-125794
3199 CA-2014-121258
3200 CA-2014-121258
3201 CA-2014-121258
3202 CA-2014-119914
Name: Order ID, Length: 3203, dtype: object

Program
Examples based on Amazon sales data.
Download the sales data from Kaggle and save it as a CSV file.
import pandas as pd
import numpy as np
data=pd.read_csv('data.csv') # read_csv already returns a DataFrame
df=pd.DataFrame(data)
df # display the DataFrame
Output
Program
import pandas as pd
import numpy as np
data=pd.read_csv('data.csv')
df=pd.DataFrame(data)
df.tail(20) # show the last 20 rows

Output

df.describe() # summary statistics (count, mean, std, min, quartiles, max) of the numeric columns

Output
df.dtypes # shows the data type of each column

Output

df.index # shows the row index (its range and step)

Output

RangeIndex(start=0, stop=3203, step=1)

df.columns # lists the column labels
Output
Index(['Order ID', 'Order Date', 'Ship Date', 'EmailID', 'Geography', 'Category', 'Product Name', 'Sales', 'Quantity', 'Profit'], dtype='object')
df.to_numpy() # converts the DataFrame to a NumPy array

Output

df.T # transposes the data: rows become columns and columns become rows

df.sort_index(axis=0, ascending=False) # sorts the rows by index in descending order, reversing the whole CSV file

Output

df.loc[[1,6,8,10],:] # selects only the rows labeled 1, 6, 8, and 10

Output
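
Since data.csv is not included here, the inspection calls above can be tried on a small stand-in table. The column names follow the sales data; the values are invented for illustration. Note that describe() reports the mean but not the median or mode, which have their own methods.

```python
import pandas as pd

# Small stand-in for data.csv; the values are invented for illustration
df = pd.DataFrame({
    "Category": ["Office", "Tech", "Office", "Furniture"],
    "Sales": [20.0, 150.0, 35.0, 300.0],
    "Quantity": [2, 1, 3, 1],
})

summary = df.describe()                   # numeric columns only
print(summary.loc["mean", "Sales"])       # 126.25
print(df["Sales"].median())               # 92.5
print(df["Category"].mode()[0])           # Office

print(df.T.shape)                         # (3, 4)
print(df.to_numpy().shape)                # (4, 3)
print(df.sort_index(ascending=False).index.tolist())   # [3, 2, 1, 0]
print(df.loc[[1, 3], "Sales"].tolist())   # [150.0, 300.0]
```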
NumPy Introduction
NumPy, short for Numerical Python, is a fundamental Python library for numerical computing. It
provides powerful data structures and functions to efficiently manipulate large arrays and matrices
of numerical data, making it essential for scientific computing, data analysis, and machine learning
tasks.
At the core of NumPy is the ndarray (n-dimensional array) data structure, which represents multi-
dimensional arrays of homogeneous data types. NumPy arrays are significantly more efficient than
Python lists for numerical operations, as they are stored in contiguous memory blocks and support
vectorized operations. This enables faster execution of mathematical computations and makes
NumPy ideal for handling large datasets and performing complex numerical computations.
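
As a small sketch of the difference, the same calculation is written once with a Python list and once with a NumPy array. The prices and the multiplier 1.18 are arbitrary values chosen for illustration.

```python
import numpy as np

prices = [10.0, 20.0, 30.0, 40.0]

# Python list: an explicit loop (here a comprehension) visits each element
scaled_list = [p * 1.18 for p in prices]

# NumPy array: one vectorized expression operates on the whole array
prices_arr = np.array(prices)
scaled_arr = prices_arr * 1.18

print(np.allclose(scaled_list, scaled_arr))    # True
print(prices_arr.dtype)                        # float64
print(prices_arr.flags["C_CONTIGUOUS"])        # True
```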
NumPy offers a vast array of functions for array manipulation and mathematical operations. Users
can perform a wide range of tasks, including array creation, reshaping, slicing, indexing, sorting,
and aggregations. NumPy also provides mathematical functions for basic arithmetic operations,
trigonometric functions, logarithms, exponentials, and more. These functions are optimized for
performance and can operate on entire arrays in a single operation, eliminating the need for explicit
looping over array elements.
Another key feature of NumPy is its broadcasting mechanism, which allows for arithmetic
operations between arrays of different shapes and sizes. Broadcasting automatically adjusts the
shape of smaller arrays to match the shape of larger arrays, enabling element-wise operations
without the need for explicit array manipulation or copying.
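
A minimal sketch of broadcasting with arrays of shapes (2, 3), (3,), and (2, 1). The array names and values are assumptions chosen for illustration.

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])        # shape (2, 3)
row = np.array([10, 20, 30])          # shape (3,)
column = np.array([[100], [200]])     # shape (2, 1)

# The row is added to every row of the matrix
print(matrix + row)
# [[11 22 33]
#  [14 25 36]]

# The column is added to every column of the matrix
print(matrix + column)
# [[101 102 103]
#  [204 205 206]]

# Two small arrays broadcast against each other
print((row + column).shape)           # (2, 3)
```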
NumPy's capabilities extend beyond basic array manipulation to include linear algebra, Fourier
transforms, random number generation, and more. It seamlessly integrates with other scientific
computing libraries in the Python ecosystem, such as SciPy, Matplotlib, and Pandas, providing a
powerful foundation for scientific computing workflows.
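
The following sketch touches each of the three areas mentioned above: np.linalg, np.fft, and np.random. The matrices, the test signal, and the seed are arbitrary choices made for illustration.

```python
import numpy as np

# Linear algebra: solve the system A @ x = b
A = np.array([[2.0, 1.0], [1.0, 3.0]])
b = np.array([3.0, 5.0])
x = np.linalg.solve(A, b)
print(np.allclose(A @ x, b))       # True

# Fourier transform: a sine wave with 3 cycles over 32 samples
t = np.arange(32) / 32
signal = np.sin(2 * np.pi * 3 * t)
spectrum = np.abs(np.fft.rfft(signal))
print(int(np.argmax(spectrum)))    # 3

# Random numbers: a seeded generator gives reproducible samples in [0, 1)
rng = np.random.default_rng(seed=0)
samples = rng.random(5)
print(samples.shape)               # (5,)
```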
In summary, NumPy is a critical library for numerical computing in Python, offering efficient data
structures and functions for array manipulation, mathematical operations, and scientific computing
tasks. Its ndarray data structure and vectorized operations enable fast and efficient computation on
large datasets, making it indispensable for data scientists, researchers, and engineers working in
fields ranging from physics and engineering to finance and machine learning.
Program
Consider a hypothetical dataset related to monthly sales performance for a retail business.
Month Sales (in thousands)
January 50
February 55
March 40
April 75
May 80
June 65
July 90
August 85
September 70
October 60
November 55
December 95
a. Draw a line chart to visualize the trend in monthly sales over the year.
import matplotlib.pyplot as plt
# Given data
months = ["January", "February", "March", "April", "May", "June", "July", "August",
"September", "October", "November", "December"]
sales = [50, 55, 40, 75, 80, 65, 90, 85, 70, 60, 55, 95]
# Plotting the line chart
plt.plot(months, sales, marker='o', linestyle='-', color='blue')
plt.title('Monthly Sales Trend')
plt.xlabel('Month')
plt.ylabel('Sales (in thousands)')
plt.grid(True)
plt.show()
Output

b. Draw a bar chart to highlight the differences in sales between months.
# Plotting the bar chart
plt.bar(months, sales, color='skyblue')
plt.title('Monthly Sales Comparison')
plt.xlabel('Month')
plt.ylabel('Sales (in thousands)')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()
c. Draw a pie chart to represent the proportion of sales for each quarter of the year.
# Calculating quarterly sales
quarterly_sales = [sum(sales[:3]), sum(sales[3:6]), sum(sales[6:9]), sum(sales[9:])]
# Plotting the pie chart
labels = ['Q1', 'Q2', 'Q3', 'Q4']
plt.pie(quarterly_sales, labels=labels, autopct='%1.1f%%', startangle=90,
colors=['lightcoral', 'lightgreen', 'lightblue', 'gold'])
plt.title('Quarterly Sales Distribution')
plt.show()
Output

Program
How is Matplotlib useful for data visualization in Python? Discuss the bar chart, line chart, and
scatter plot in Matplotlib with suitable examples.
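
Bar and line charts are demonstrated in other programs in this file. For the scatter plot, a minimal sketch could look like the following; the data is hypothetical and invented for illustration.

```python
import matplotlib.pyplot as plt

# Hypothetical data, invented for illustration
hours_studied = [1, 2, 3, 4, 5, 6, 7, 8]
marks = [35, 42, 50, 55, 63, 70, 78, 85]

fig, ax = plt.subplots()
# scatter() draws one marker per (x, y) pair without connecting them
points = ax.scatter(hours_studied, marks, color="purple", marker="o",
                    label="Student")
ax.set_title("Marks vs. Hours Studied")
ax.set_xlabel("Hours studied")
ax.set_ylabel("Marks")
ax.legend()

print(len(points.get_offsets()))    # 8
```

In an interactive session, calling plt.show() afterwards displays the chart.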
Plot the following data on a line chart:
Day Income
Monday 510
Tuesday 350
Wednesday 475
Thursday 580
Friday 600
a) Set the title of the chart to “The Weekly Income Report”
b) Give appropriate titles to both axes
c) Write code to display the legend
d) Draw the line in red
e) Use a dashed line style
f) Display diamond-style markers on the data points
import matplotlib.pyplot as plt
days = ["Monday", "Tuesday", "Wednesday",
"Thursday", "Friday"]
income = [510, 350, 475, 580, 600]
# Plotting the line chart
plt.plot(days, income, color='red',
linestyle='--', marker='D', label='Weekly Income')
# Adding titles and labels
plt.title('The Weekly Income Report')
plt.xlabel('Day')
plt.ylabel('Income')
# Displaying legends
plt.legend()
# Display the plot
plt.show()
Plot a line chart for quarterly revenue with the title "Quarterly Revenue Performance." Label the
x-axis as "Quarter" and the y-axis as "Revenue." Ensure the line is displayed in blue and with
circular markers.
import matplotlib.pyplot as plt
quarters = ["Q1", "Q2", "Q3", "Q4"]
revenue = [1200, 1500, 1300, 1400]
plt.plot(quarters, revenue, color='blue',
marker='o', label='Quarterly Revenue')
plt.title('Quarterly Revenue Performance')
plt.xlabel('Quarter')
plt.ylabel('Revenue')
plt.legend()
plt.show()

Visualize the daily website traffic using a line chart. Title the chart "Daily Website Traffic
Analysis," and set the x-axis label to "Day" and the y-axis label to "Traffic." Display the line in
green color with a solid line style.
Program
import matplotlib.pyplot as plt
days = ["Monday", "Tuesday",
"Wednesday", "Thursday", "Friday"]
traffic = [1200, 1800, 1500, 2000, 1700]
plt.plot(days, traffic, color='green',
linestyle='-', label='Website Traffic')
plt.title('Daily Website Traffic Analysis')
plt.xlabel('Day')
plt.ylabel('Traffic')
plt.legend()
plt.show()
Create a line chart to illustrate the weekly customer satisfaction scores. Title the chart "Weekly
Customer Satisfaction," and label the x-axis as "Week" and the y-axis as "Satisfaction Score." Use
a red line with a dashed line style and triangular markers.

Program
import matplotlib.pyplot as plt
weeks = ["Week 1", "Week 2", "Week 3",
"Week 4"]
satisfaction_scores = [80, 85, 90, 88]
plt.plot(weeks, satisfaction_scores,
color='red', linestyle='--', marker='^',
label='Customer Satisfaction')
plt.title('Weekly Customer Satisfaction')
plt.xlabel('Week')
plt.ylabel('Satisfaction Score')
plt.legend()
plt.show()
Program
import numpy as np
# Creating a 1D array
array_1d = np.array([1, 2, 3, 4, 5])
# Creating a 2D array
array_2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
# Creating an array of zeros
zeros_array = np.zeros((3, 3))
# Creating an array of ones
ones_array = np.ones((2, 4))
# Creating an array with a range of values
range_array = np.arange(1, 10, 2)
# Creating an array with random values
random_array = np.random.random((2, 3))
# Print each array under its name (the random values differ on every run)
for name, arr in [("array_1d", array_1d), ("array_2d", array_2d),
("zeros_array", zeros_array), ("ones_array", ones_array),
("range_array", range_array), ("random_array", np.round(random_array, 3))]:
    print(f"{name}:")
    print(arr)
Output
array_1d:
[1 2 3 4 5]
array_2d:
[[1 2 3]
[4 5 6]
[7 8 9]]
zeros_array:
[[0. 0. 0.]
[0. 0. 0.]
[0. 0. 0.]]
ones_array:
[[1. 1. 1. 1.]
[1. 1. 1. 1.]]
range_array:
[1 3 5 7 9]
random_array:
[[0.432 0.532 0.739]
[0.116 0.825 0.641]]
Program
import numpy as np
# Basic arithmetic operations
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
sum_array = a + b
difference_array = b - a
product_array = a * b
quotient_array = b / a
# Array manipulation
c = np.array([[1, 2], [3, 4]])
transpose_c = c.T
inverse_c = np.linalg.inv(c)
# Element-wise operations
d = np.array([10, 20, 30])
e = np.array([2, 3, 4])
square_d = np.square(d)
sqrt_e = np.sqrt(e)
# Print the results
print("sum_array:", sum_array)
print("difference_array:", difference_array)
print("product_array:", product_array)
print("quotient_array:", quotient_array)
print("transpose_c:", transpose_c, sep="\n")
print("inverse_c:", inverse_c, sep="\n")
print("square_d:", square_d)
print("sqrt_e:", np.round(sqrt_e, 3))
Output
sum_array: [5 7 9]
difference_array: [3 3 3]
product_array: [ 4 10 18]
quotient_array: [4. 2.5 2.]
transpose_c:
[[1 3]
[2 4]]
inverse_c:
[[-2. 1. ]
[ 1.5 -0.5]]
square_d: [100 400 900]
sqrt_e: [1.414 1.732 2. ]

Program
import numpy as np
# Statistical functions
data = np.array([1, 2, 3, 4, 5])
mean = np.mean(data)
median = np.median(data)
std_dev = np.std(data)
variance = np.var(data)
# Print the results
print("mean:", mean)
print("median:", median)
print("std_dev:", std_dev)
print("variance:", variance)

Output
mean: 3.0
median: 3.0
std_dev: 1.4142135623730951
variance: 2.0
Program
import numpy as np
# Indexing and slicing
array = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
element_00 = array[0, 0]
row_1 = array[1]
column_2 = array[:, 2]
sub_array = array[:2, 1:]
# Print the results
print("element_00:", element_00)
print("row_1:", row_1)
print("column_2:", column_2)
print("sub_array:", sub_array, sep="\n")

Output
element_00: 1
row_1: [4 5 6]
column_2: [3 6 9]
sub_array:
[[2 3]
[5 6]]
