
SRM INSTITUTE OF SCIENCE AND TECHNOLOGY, CHENNAI.
21CSS101J – Programming for Problem Solving
Unit 5

LEARNING RESOURCES

TEXT BOOKS
1. Python Data Science Handbook, Jake VanderPlas, O'Reilly, 2017. [Chapters 2 & 3]
2. Python for Beginners, Timothy C. Needham, 2019. [Chapters 1 to 4]
3. https://www.tutorialspoint.com/python/index.htm
4. https://www.w3schools.com/python/
UNIT V (TOPICS COVERED)
Creating NumPy Array - NumPy Indexing - NumPy Array Attributes - Slicing using NumPy - Descriptive Statistics in NumPy: Percentile - Variance in NumPy - Introduction to Pandas - Creating Series Objects, DataFrame Objects - Simple Operations with DataFrames - Querying from DataFrames - Applying Functions to DataFrames - Comparison between NumPy and Pandas - Speed Testing between NumPy and Pandas - Other Python Libraries
UNIT 5: NumPy (Numerical Python)
NumPy
• Stands for Numerical Python.
• Is the fundamental package required for high-performance computing and data analysis.
• NumPy is important for numerical computations in Python because it is designed for efficiency on large arrays of data.
• It provides:
  - ndarray, for creating multidimensional arrays.
  - Internal storage of data in a contiguous block of memory, independent of other built-in Python objects, using much less memory than built-in Python sequences.
  - Standard math functions for fast operations on entire arrays of data without having to write loops.
• NumPy arrays are important because they enable you to express batch operations on data without writing any for loops. We call this vectorization.
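For instance, a minimal sketch of vectorization (the array values here are just for illustration):

import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0])   # small example array
doubled = data * 2                       # the scalar is applied to every element, no loop
total = (data ** 2).sum()                # element-wise square, then a reduction
print(doubled)                           # [2. 4. 6. 8.]
print(total)                             # 30.0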
NumPy ndarray vs list
One of the key features of NumPy is its N-dimensional array
object, or ndarray, which is a fast, flexible container for
large datasets in Python.
Whenever you see “array,” “NumPy array,” or “ndarray” in the
text, with few exceptions they all refer to the same thing: the
ndarray object.
NumPy-based algorithms are generally 10 to 100 times faster
(or more) than their pure Python counterparts and use
significantly less memory.
import numpy as np
my_arr = np.arange(1000000)
my_list = list(range(1000000))
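A rough timing sketch of the claim above; the exact ratio depends on the machine, but the vectorized multiply is typically well over an order of magnitude faster than the pure Python loop:

import time
import numpy as np

my_arr = np.arange(1000000)
my_list = list(range(1000000))

start = time.time()
for _ in range(10):
    my_arr2 = my_arr * 2                    # vectorized multiply
print("ndarray:", time.time() - start)

start = time.time()
for _ in range(10):
    my_list2 = [x * 2 for x in my_list]     # pure Python loop
print("list:   ", time.time() - start)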
ndarray
ndarray is used for storage of homogeneous data
Every array must have a shape and a dtype
Supports convenient slicing, indexing and efficient vectorized
computation
1-D Arrays
An array whose elements are scalars is a 1-D array.

2-D Arrays
An array that has 1-D arrays as its elements is called a 2-D array.
These are often used to represent matrices or 2nd-order tensors.

3-D Arrays
An array that has 2-D arrays (matrices) as its elements is called a 3-D array.
These are often used to represent a 3rd-order tensor.

NumPy arrays provide the ndim attribute, which returns an integer that tells us how many dimensions the array has.
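A minimal example of ndim on arrays of different dimensions:

import numpy as np

a = np.array([1, 2, 3])                               # 1-D
b = np.array([[1, 2, 3], [4, 5, 6]])                  # 2-D
c = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])    # 3-D
print(a.ndim, b.ndim, c.ndim)                         # 1 2 3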

Higher Dimensional Arrays


An array can have any number of dimensions.
When the array is created, you can define the number of dimensions by using the
ndmin argument.
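For example, a small sketch using the ndmin argument:

import numpy as np

arr = np.array([1, 2, 3, 4], ndmin=5)       # force at least 5 dimensions
print(arr)                                  # [[[[[1 2 3 4]]]]]
print('number of dimensions:', arr.ndim)    # 5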
Access Array Elements
Array indexing is the same as accessing an array element.
You can access an array element by referring to its index number.
The indexes in NumPy arrays start with 0, meaning that the first element has
index 0, and the second has index 1 etc.

Access 2-D Arrays
Think of 2-D arrays like a table with rows and columns, where the first index selects the row (the dimension) and the second index selects the column.
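A short sketch of indexing 1-D and 2-D arrays (values chosen for illustration):

import numpy as np

arr1 = np.array([1, 2, 3, 4])
print(arr1[0])        # 1  -> first element

arr2 = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])
print(arr2[0, 1])     # 2  -> row 0, column 1
print(arr2[1, 4])     # 10 -> row 1, column 4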
Access 3-D Arrays
To access elements of a 3-D array, use comma-separated indexes, one for each dimension.

Negative Indexing
Use negative indexing to access an array from the end.
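A short sketch of 3-D access and negative indexing (example values only):

import numpy as np

arr3 = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
print(arr3[0, 1, 2])   # 6  -> first 2-D block, second row, third element

arr2 = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])
print(arr2[1, -1])     # 10 -> last element of the second row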
Slicing arrays
• Slicing in Python means taking elements from one given index to another given index.
• We pass a slice instead of an index, like this: [start:end].
• We can also define the step, like this: [start:end:step].
• If we don't pass start, it is considered 0.
• If we don't pass end, it is considered the length of the array in that dimension.
• If we don't pass step, it is considered 1.

Slicing a 2-D array

From both elements, return index 2:

Converting Data Type on Existing Arrays
The astype() function creates a copy of the array, and allows you to specify the data type as
a parameter.
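For example, a small astype() sketch:

import numpy as np

arr = np.array([1.1, 2.7, 3.5])
newarr = arr.astype(int)    # returns a copy with the requested dtype (truncates)
print(newarr)               # [1 2 3]
print(newarr.dtype)         # int64 (platform dependent)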

The Difference Between Copy and View


• The main difference between a copy and a view of an array is that the
copy is a new array, and the view is just a view of the original array.

• The copy owns the data and any changes made to the copy will not
affect original array, and any changes made to the original array will not
affect the copy.

• The view does not own the data and any changes made to the view will
affect the original array, and any changes made to the original array will
affect the view.
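A short sketch that shows the difference in behaviour:

import numpy as np

arr = np.array([1, 2, 3, 4, 5])
c = arr.copy()   # the copy owns its own data
v = arr.view()   # the view shares data with the original

arr[0] = 42
print(c)   # [1 2 3 4 5]      -> copy is unaffected
print(v)   # [42  2  3  4  5] -> view reflects the change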
Joining NumPy Arrays
We pass a sequence of arrays that we want to join to the concatenate() function, along with the axis. If axis is not explicitly passed, it is taken as 0.

Splitting NumPy Arrays
For splitting arrays we use array_split(); we pass it the array we want to split and the number of splits.
If the array has fewer elements than required, it will adjust from the end accordingly.
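A minimal sketch of concatenate() and array_split():

import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
print(np.concatenate((a, b)))    # [1 2 3 4 5 6]  (axis defaults to 0)

parts = np.array_split(np.array([1, 2, 3, 4, 5]), 3)
print(parts)   # [array([1, 2]), array([3, 4]), array([5])] -> last part adjusted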
Searching Arrays
You can search an array for a certain value, and return the indexes that get a
match. To search an array, use the where() method.
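For example:

import numpy as np

arr = np.array([1, 2, 3, 4, 5, 4, 4])
idx = np.where(arr == 4)    # indexes where the value is 4
print(idx)                  # (array([3, 5, 6]),)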

Sorting
Operations between arrays and scalars
Array creation functions
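The three topics above are only named on the slide; a brief sketch of each:

import numpy as np

# Sorting: np.sort() returns a sorted copy of the array
print(np.sort(np.array([3, 1, 2])))     # [1 2 3]

# Operations between arrays and scalars apply element-wise
arr = np.array([1.0, 2.0, 3.0])
print(arr * 10)                         # [10. 20. 30.]
print(1 / arr)                          # [1.  0.5  0.33333333] (approximately)

# Common array creation functions
print(np.zeros(3))                      # [0. 0. 0.]
print(np.ones((2, 2)))                  # [[1. 1.] [1. 1.]]
print(np.arange(0, 10, 2))              # [0 2 4 6 8]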
NumPy Indexing
Contents of an ndarray object can be accessed and modified by indexing or slicing, just like Python's built-in container objects.
Items in an ndarray object follow a zero-based index. Three types of indexing methods are available: field access, basic slicing, and advanced indexing.
Basic slicing is an extension of Python's basic concept of slicing to n dimensions. A Python slice object is constructed by giving start, stop, and step parameters to the built-in slice function. This slice object is passed to the array to extract a part of the array.
Example:

import numpy as np
a = np.arange(10)
s = slice(2, 7, 2)
print(a[s])

# Equivalent, using slice notation directly:
import numpy as np
a = np.arange(10)
b = a[2:7:2]
print(b)

Output:
[2 4 6]

The ndarray object is prepared by the arange() function. Then a slice object is defined with start, stop, and step values of 2, 7, and 2 respectively. When this slice object is passed to the ndarray, the part of it starting at index 2 up to (but not including) 7, with a step of 2, is sliced.
If a : is inserted in front of an index, all items from that index onwards will be extracted.
Descriptive Statistics in NumPy
Descriptive statistics allow us to summarise data sets quickly with just a couple of numbers, and are in general easy to explain to others.

Descriptive statistics fall into two general categories:
1) Measures of central tendency, which describe a 'typical' or common value (e.g. mean, median, and mode); and
2) Measures of spread, which describe how far apart values are (e.g. percentiles, variance, and standard deviation).
Percentile:
The numpy.percentile() function is used to compute the nth percentile of the given data (array elements) along the specified axis.

Syntax: numpy.percentile(arr, n, axis=None, out=None)

Parameters:
arr : input array.
n : percentile value.
axis : axis along which we want to calculate the percentile value. If it is None, arr is flattened and the percentile is computed over all axes. axis = 0 means working along the columns and axis = 1 means working along the rows.
out : a different array in which to place the result. The array must have the same dimensions as the expected output.

Return: nth percentile of the array (a scalar value if axis is None) or an array with percentile values along the specified axis.
Example:

# Python program illustrating
# the numpy.percentile() method

import numpy as np

# 1-D array
arr = [20, 2, 7, 1, 34]
print("arr : ", arr)

print("50th percentile of arr : ", np.percentile(arr, 50))
print("25th percentile of arr : ", np.percentile(arr, 25))
print("75th percentile of arr : ", np.percentile(arr, 75))

Output:
arr :  [20, 2, 7, 1, 34]
50th percentile of arr :  7.0
25th percentile of arr :  2.0
75th percentile of arr :  20.0

(Sorted, the data is 1 2 7 20 34, so the 50th percentile is the middle value, 7.)
Example:
NumPy has two related functions, percentile and quantile. The percentile function uses q in the range [0, 100], e.g. for the 90th percentile use 90, whereas the quantile function uses q in the range [0, 1], so the equivalent q would be 0.9. They can be used interchangeably.

p25 = np.percentile(data_sample_even, q=25, interpolation='linear')
p75 = np.percentile(data_sample_even, q=75, interpolation='linear')
iqr = p75 - p25
Variance in NumPy
Variance is the average of the squared differences between each value and the mean. The mathematical formula for variance is:

    variance = (1/N) * Σ (x_i - mean)^2

where N is the total number of elements or the frequency of the distribution.

Calculate the variance by using the numpy.var() function.


Syntax:
numpy.var(a, axis=None, dtype=None, out=None, ddof=0, keepdims=<no value>)

Parameters:
a: Array containing the data whose variance is to be computed.
axis: Axis or axes along which the variance is computed.
dtype: Type to use in computing the variance.
out: Alternate output array in which to place the result.
ddof: Delta Degrees of Freedom.
keepdims: If this is set to True, the axes which are reduced are left in the result as dimensions with size one.
Example:

# Python program to get the variance of a list

# Importing the NumPy module
import numpy as np

# Taking a list of elements
data = [2, 4, 4, 4, 5, 5, 7, 9]

# Calculating variance using var()
print(np.var(data))

Output:
4.0
Introduction to Pandas -
What is Pandas?
• Pandas is a Python library used for working with data sets.
• It has functions for analyzing, cleaning, exploring, and
manipulating data.
• The name "Pandas" has a reference to both "Panel Data", and
"Python Data Analysis" and was created by Wes McKinney in
2008.
Why Use Pandas?
• Pandas allows us to analyze big data and make conclusions
based on statistical theories.
• Pandas can clean messy data sets, and make them readable
and relevant.
• Relevant data is very important in data science.
Installation of pandas:
C:\Users\Your Name>pip install pandas

Once Pandas is installed, import it in your applications by adding the import keyword:

import pandas

or, with the commonly used alias:

import pandas as pd

Creating Series Objects


What is a Series?
• A Pandas Series is like a column in a table.
• It is a one-dimensional array holding data of any type.
• Example
• Create a simple Pandas Series from a list:

import pandas as pd
a = [1, 7, 2]
myvar = pd.Series(a)
print(myvar)
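Running this prints the values with their default integer labels:

0    1
1    7
2    2
dtype: int64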
Labels
• If nothing else is specified, the values are labeled with their
index number. First value has index 0, second value has index 1
etc.
• This label can be used to access a specified value.
Create Labels
With the index argument, you can name your own labels.
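For example, a small sketch using the index argument:

import pandas as pd

a = [1, 7, 2]
myvar = pd.Series(a, index=["x", "y", "z"])
print(myvar)
print(myvar["y"])   # 7, accessed by its label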
Key/Value Objects as Series
You can also use a key/value object, like a dictionary, when
creating a Series. The keys of the dictionary become the labels.

Example
Create a simple Pandas Series from a dictionary:

import pandas as pd

calories = {"day1": 420, "day2": 380, "day3": 390}

myvar = pd.Series(calories)

print(myvar)
To select only some of the items in the dictionary, use the index
argument and specify only the items you want to include in the
Series.

Example
Create a Series using only data from "day1" and "day2":

import pandas as pd

calories = {"day1": 420, "day2": 380, "day3": 390}

myvar = pd.Series(calories, index = ["day1", "day2"])

print(myvar)
Pandas DataFrame
• A DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns).
• Data is aligned in a tabular fashion in rows and columns.
• A Pandas DataFrame consists of three principal components: the data, the rows, and the columns.
DataFrame Objects
Data sets in Pandas are usually multi-dimensional tables, called DataFrames.
A Series is like a column; a DataFrame is the whole table.

Example
Create a DataFrame from two Series:

import pandas as pd
data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}
myvar = pd.DataFrame(data)
print(myvar)
What is a DataFrame?
A Pandas DataFrame is a 2 dimensional data structure, like a 2
dimensional array, or a table with rows and columns.
Example
Create a simple Pandas DataFrame:

import pandas as pd
data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}
#load data into a DataFrame object:
df = pd.DataFrame(data)
print(df)
Locate Row
As you can see from the result above, the DataFrame is like a table with rows and columns.
Pandas uses the loc attribute to return one or more specified row(s).

Example
Return row 0:

#refer to the row index:


print(df.loc[0])
Return row 0 and 1:
#use a list of indexes:
print(df.loc[[0, 1]])
Named Indexes
With the index argument, you can name your own indexes.

Example
Add a list of names to give each row a name:

import pandas as pd

data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}

df = pd.DataFrame(data, index = ["day1", "day2", "day3"])

print(df)
Locate Named Indexes
Use the named index in the loc attribute to return the specified
row(s).

Example
Return "day2":

#refer to the named index:


print(df.loc["day2"])
Read CSV Files

• A simple way to store big data sets is to use CSV (Comma Separated Values) files.

• CSV files contain plain text and are a well-known format that can be read by everyone, including Pandas.

• In our examples we will be using a CSV file called 'data.csv'.


Load Files Into a DataFrame
If your data sets are stored in a file, Pandas can load them into a
DataFrame.

Example
Load a comma separated file (CSV file) into a DataFrame:

import pandas as pd

df = pd.read_csv('data.csv')

print(df)
max_rows
• The number of rows returned is defined in the Pandas option settings.
• You can check your system's maximum rows with the pd.options.display.max_rows statement.
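For example (the default limit is commonly 60, but it can be changed):

import pandas as pd

print(pd.options.display.max_rows)   # check the current setting
pd.options.display.max_rows = 9999   # allow print(df) to show more rows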
Read JSON
• Big data sets are often stored, or extracted as JSON (JavaScript Object Notation).
• JSON is plain text, but has the format of an object, and is well known in the world of
programming, including Pandas.
• In our examples we will be using a JSON file called 'data.json'.
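A minimal sketch, assuming a local file called 'data.json' exists:

import pandas as pd

df = pd.read_json('data.json')
print(df.to_string())   # to_string() prints the entire DataFrame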
Simple Operations with DataFrames
Basic operations which can be performed on a Pandas DataFrame:

Creating a DataFrame
Dealing with Rows and Columns
Indexing and Selecting Data
Working with Missing Data
Iterating over rows and columns
Create a Pandas DataFrame from Lists
DataFrame can be created using a single list or a list of lists.

# import pandas as pd
import pandas as pd

# list of strings
lst = ['Geeks', 'For', 'Geeks', 'is',
'portal', 'for', 'Geeks']

# Calling DataFrame constructor on list


df = pd.DataFrame(lst)
print(df)
Dealing with Rows and Columns
A Data frame is a two-dimensional data structure, i.e., data is aligned
in a tabular fashion in rows and columns. We can perform basic
operations on rows/columns like selecting, deleting, adding, and
renaming.

Column Selection: In order to select a column in a Pandas DataFrame, we can access the columns by calling them by their column names.
# Import pandas package
import pandas as pd

# Define a dictionary containing employee data


data = {'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'],
'Age':[27, 24, 22, 32],
'Address':['Delhi', 'Kanpur', 'Allahabad', 'Kannauj'],
'Qualification':['Msc', 'MA', 'MCA', 'Phd']}

# Convert the dictionary into DataFrame


df = pd.DataFrame(data)

# select two columns


print(df[['Name', 'Qualification']])
Column Addition
Dropping Columns
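The two headings above are sketched below on a small, made-up employee DataFrame:

import pandas as pd

df = pd.DataFrame({'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj'],
                   'Age': [27, 24, 22, 32]})

# Column Addition: assign a list (or Series) to a new column label
df['Address'] = ['Delhi', 'Kanpur', 'Allahabad', 'Kannauj']

# Dropping Columns: drop() with axis=1 removes columns
df = df.drop(['Age'], axis=1)
print(df)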
Row Selection: Pandas provides a unique method to retrieve rows from a DataFrame. The DataFrame.loc[] method is used to retrieve rows from a Pandas DataFrame by label. Rows can also be selected by passing an integer location to the iloc[] function.

# importing pandas package


import pandas as pd
# making data frame from csv file
data = pd.read_csv("nba.csv",
index_col ="Name")

# retrieving row by loc method


first = data.loc["Avery Bradley"]
second = data.loc["R.J. Hunter"]
print(first, "\n\n\n", second)
Adding New Row
Dropping Row
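The two headings above are sketched below with a small, made-up DataFrame (pd.concat is used for adding a row, since DataFrame.append has been removed in recent pandas versions):

import pandas as pd

df = pd.DataFrame({'Name': ['Jai', 'Princi'], 'Age': [27, 24]})

# Adding a new row: concatenate a one-row DataFrame
new_row = pd.DataFrame({'Name': ['Gaurav'], 'Age': [22]})
df = pd.concat([df, new_row], ignore_index=True)

# Dropping a row by its index label
df = df.drop(0)   # removes the row labelled 0
print(df)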
Click and refer to more problems in pandas:
https://www.geeksforgeeks.org/dealing-with-rows-and-columns-in-pandas-dataframe/?ref=lbp
Working with Missing Data
Checking for missing values using isnull() and notnull():
In order to check missing values in a Pandas DataFrame, we use the functions isnull() and notnull(). Both functions help in checking whether a value is NaN or not. These functions can also be used on a Pandas Series in order to find null values in a series.

# importing pandas as pd
import pandas as pd
# importing numpy as np
import numpy as np

# dictionary of lists (np.nan marks the missing values)
scores = {'First Score': [100, 90, np.nan, 95],
          'Second Score': [30, 45, 56, np.nan],
          'Third Score': [np.nan, 40, 80, 98]}

# creating a DataFrame from the dictionary
df = pd.DataFrame(scores)

# using the isnull() function
print(df.isnull())
Querying from Data Frames

The query() method allows you to query the DataFrame.
The query() method takes a query expression as a string parameter, which has to evaluate to either True or False.
It returns the DataFrame rows where the result is True according to the query expression.

Syntax
dataframe.query(expr, inplace)
Example:
Return the rows where age is over 35:

import pandas as pd

data = {
"name": ["Sally", "Mary", "John"],
"age": [50, 40, 30]
}

df = pd.DataFrame(data)

print(df.query('age > 35'))


Querying a CSV file stored on Google Drive from Colab:

Step 1:
from google.colab import drive
drive.mount('/content/drive')

Step 2:
import pandas as pd
path = "/content/drive/MyDrive/CT2.csv"
df = pd.read_csv(path)
print(df.query('mark > 30'))
Applying Functions to DataFrames
The apply() function is used to apply a function along an axis of the DataFrame.
Objects passed to the function are Series objects whose index is either the DataFrame's index (axis=0) or the DataFrame's columns (axis=1).
By default (result_type=None), the final return type is inferred from the return type of the applied function. Otherwise, it depends on the result_type argument.

Syntax:
dataframe.apply(func, axis, raw, result_type, args, kwds)
Returns: Series or DataFrame


Result of applying func along the given axis of the DataFrame.
Example:
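A minimal apply() sketch along both axes (values chosen for illustration):

import pandas as pd
import numpy as np

df = pd.DataFrame([[4, 9]] * 3, columns=['A', 'B'])

print(df.apply(np.sqrt))           # element-wise, applied to each column Series
print(df.apply(np.sum, axis=0))    # sum of each column -> A 12, B 27
print(df.apply(np.sum, axis=1))    # sum of each row    -> 13 13 13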
Comparison between Numpy and Pandas
Speed Testing between Numpy and Pandas

For Data Scientists, Pandas and NumPy are both essential tools in Python.
NumPy runs vector and matrix operations very efficiently, while Pandas provides the R-like data frames allowing intuitive tabular data analysis.
A consensus is that NumPy is more optimized for arithmetic computations.
Ref: https://towardsdatascience.com/speed-testing-pandas-vs-numpy-ffbf80070ee7
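A rough timing sketch of the comparison (the exact numbers depend on the machine and the data size):

import time
import numpy as np
import pandas as pd

values = np.random.rand(1000000)
series = pd.Series(values)

start = time.time()
for _ in range(100):
    values * 2                 # NumPy vectorized multiply
print("NumPy :", time.time() - start)

start = time.time()
for _ in range(100):
    series * 2                 # same operation through a Pandas Series
print("Pandas:", time.time() - start)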
Other Python Libraries

A Python library is a collection of related modules. It contains bundles of code that can be used repeatedly in different programs.
It makes Python programming simpler and more convenient for the programmer, as we don't need to write the same code again and again for different programs.
Python libraries play a very vital role in fields such as Machine Learning, Data Science, Data Visualization, etc.
1.TensorFlow: This library was developed by Google in collaboration
with the Brain Team. It is an open-source library used for high-level
computations. It is also used in machine learning and deep learning
algorithms. It contains a large number of tensor operations. Researchers
also use this Python library to solve complex computations in Mathematics
and Physics.

2. Matplotlib: This library is responsible for plotting numerical data, which is why it is used in data analysis. It is also an open-source library and plots high-quality figures like pie charts, histograms, scatterplots, graphs, etc.

3. Pandas: Pandas is an important library for data scientists. It is an open-source machine learning library that provides flexible high-level data structures and a variety of analysis tools. It eases data analysis, data manipulation, and cleaning of data. Pandas supports operations like sorting, re-indexing, iteration, concatenation, conversion of data, visualizations, aggregations, etc.
4. Numpy: The name "Numpy" stands for "Numerical Python". It is a commonly used library. It is a popular machine learning library that supports large matrices and multi-dimensional data. It consists of in-built mathematical functions for easy computations. Even libraries like TensorFlow use Numpy internally to perform several operations on tensors. The array interface is one of the key features of this library.

5. SciPy: The name "SciPy" stands for "Scientific Python". It is an open-source library used for high-level scientific computations. This library is built as an extension of Numpy and works with Numpy to handle complex computations. While Numpy provides the array data and basic operations such as sorting and indexing, the higher-level numerical routines live in SciPy. It is also widely used by application developers and engineers.
6. Scrapy: It is an open-source library that is used for extracting data
from websites. It provides very fast web crawling and high-level
screen scraping. It can also be used for data mining and automated
testing of data.

7. Scikit-learn: It is a famous Python library for working with complex data. Scikit-learn is an open-source library that supports machine learning. It supports various supervised and unsupervised algorithms like linear regression, classification, clustering, etc. This library works in association with Numpy and SciPy.

8. PyGame: This library provides an easy interface to the Simple DirectMedia Layer (SDL) platform-independent graphics, audio, and input libraries. It is used for developing video games using computer graphics and audio libraries along with the Python programming language.
9. PyTorch: PyTorch is a widely used machine learning library that optimizes tensor computations. It has rich APIs to perform tensor computations with strong GPU acceleration. It also helps to solve application issues related to neural networks.

10. PyBrain: The name "PyBrain" stands for Python-Based Reinforcement Learning, Artificial Intelligence, and Neural Networks library. It is an open-source library built for beginners in the field of Machine Learning. It provides fast and easy-to-use algorithms for machine learning tasks. It is flexible and easy to understand, which is why it is really helpful for developers who are new to research fields.
