BCSE206L FDS Module-5 – S M Satapathy
L T P C: 3 0 0 3
Dr. S M SATAPATHY
Associate Professor,
School of Computer Science and Engineering,
VIT Vellore, TN, India – 632 014.
Module – 5
❖ Python IDE
PYTHON
FOR
DATA SCIENCE
Introduction
❖ A notable reason to use Python in data science is the availability of
different libraries for data science and data analytics.
❖ NumPy (for array-based data manipulation), pandas (for labeled data
manipulation), SciPy (for scientific computing), and scikit-learn (for ML
algorithms) are a few among the many libraries that are generally used
in data science.
Python Identifiers
❖ A Python identifier is a name that consists of a series of letters,
numbers, and underscore characters.
❖ A Python identifier should follow the below naming conventions:
➢ Variable and function names start with a lowercase letter.
➢ An identifier starting with an underscore character generally
indicates it to be a private identifier.
➢ Class names begin with an uppercase letter.
➢ An identifier cannot be one of the reserved Python keywords.
Python Variables
❖ Python has no syntax or command to declare a variable.
❖ The name assigned to a variable is the variable's name. It can also be
called a Python identifier.
❖ These variables are case sensitive and should ideally begin with a
letter.
❖ To improve the readability of the variable name, it is better to use
lowercase-lettered words which are separated using an underscore
character, for example, mixed_case, data_des, etc.
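❖ A small sketch of these conventions (the variable names here are illustrative):
data_des = "Iris dataset"   # lowercase words separated by underscores
num_rows = 150              # created by assignment alone; no declaration needed
_cache = {}                 # a leading underscore hints at a private identifier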
Python Data Types
❖ Python supports many basic data types, such as integer, floating-point,
character, or string.
❖ In Python, we need not declare the data type of a variable.
Instead, the Python interpreter identifies the data type of the variable
from the value assigned to it.
❖ The other two data types are Boolean and None.
➢ The Boolean data type takes two values: True or False.
➢ The None data type indicates a null value.
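❖ A short illustration of dynamic typing across these types:
x = 10            # int
pi = 3.14         # float
name = "iris"     # str
flag = True       # bool
empty = None      # NoneType
print(type(x), type(pi), type(name), type(flag), type(empty))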
Python Data Structures
❖ There are five data structures in Python:
➢ List
➢ Tuple
➢ Dictionary
➢ Set
➢ String
Python Data Structures
List:
❖ Python lists are ordered with a certain count. The values in the lists are
enclosed in square brackets and are separated using commas.
❖ The list indices start with 0.
❖ The list can hold both homogeneous (e.g., [1, 6, 4, 7]) as well as
heterogeneous (e.g., [1, 'a', 'd', 4, 6]) data.
❖ Lists are mutable in Python and thus can be altered by adding new
items, deleting some items, or modifying the existing items.
❖ A list can be created in four different ways:
➢ Comma-separated values – [2, 5, 3, 8, 3]
➢ Empty list – []
➢ Singleton list – [a]
➢ Using the list() function
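❖ A quick sketch of the four creation forms and of list mutability:
nums = [2, 5, 3, 8, 3]        # comma-separated values
empty = []                    # empty list
single = ['a']                # singleton list
from_tuple = list((1, 2, 3))  # using the list() function
nums.append(7)                # mutable: add a new item
nums[0] = 9                   # modify an existing item
del nums[1]                   # delete an item
print(nums)                   # [9, 3, 8, 3, 7]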
Python Data Structures
Tuple:
❖ A tuple is a collection of ordered elements and is immutable.
❖ Any sort of modification in the tuple is prohibited and needs the
creation of a new tuple.
❖ Same as lists, a tuple can hold homogeneous and heterogeneous
data. All the items in the tuple are enclosed in parentheses and are
separated using commas.
❖ Tuple can be created in the following ways:
➢ Comma-separated values – (2, 5, 3, 8, 3)
➢ Empty tuple – ()
➢ Singleton tuple – (a,) Note: Even when creating a single-valued
tuple, a comma must be included after the value.
➢ Using the tuple() function
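❖ A short sketch showing why the trailing comma matters:
t = (2, 5, 3, 8, 3)        # comma-separated values
empty_t = ()               # empty tuple
single_t = ('a',)          # singleton tuple: trailing comma required
not_a_tuple = ('a')        # without the comma, this is just the string 'a'
from_list = tuple([1, 2])  # using the tuple() function
print(type(single_t), type(not_a_tuple))  # <class 'tuple'> <class 'str'>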
Python Data Structures
Dictionary:
❖ A dictionary is a sequence of key-value pairs (unordered prior to Python
3.7, insertion-ordered since). A dictionary has unique keys, but the
values can be repeated.
❖ The keys in the dictionary must be of an immutable type, whereas the
values can have any type.
❖ Keys and values are separated using a colon, and the whole
dictionary is enclosed in curly brackets.
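❖ A minimal sketch of dictionary creation and access:
iris_counts = {'setosa': 50, 'versicolor': 50, 'virginica': 50}
print(iris_counts['setosa'])  # values are accessed by key
iris_counts['setosa'] = 49    # values can be modified
iris_counts['unknown'] = 1    # new key-value pairs can be added
print(iris_counts)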
Python Data Structures
Set:
❖ A set is an unordered collection of unique data items.
❖ It is a built-in data structure in Python with the following characteristics:
➢ Unordered: a set doesn't maintain the order of data insertion
➢ Unchangeable: set items cannot be modified in place, though items
can be added and removed
➢ Unique: a set doesn't allow duplicate items
➢ Heterogeneous: a set can contain data of mixed (hashable) types
❖ There are two ways to create a set in Python:
➢ Using curly brackets
➢ Using set() constructor
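❖ A brief sketch of both creation forms and the characteristics above:
s1 = {1, 2, 3, 2}     # curly brackets; the duplicate 2 is dropped
s2 = set([3, 4, 5])   # set() constructor
print(s1)             # {1, 2, 3}
s1.add(9)             # items can be added
s1.discard(2)         # and removed, but not modified in place
print(s1 & s2)        # set operations, e.g. intersection: {3}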
Python Data Structures
String:
❖ A string is a combination of zero or more characters enclosed in single
quotes or double quotes.
❖ Python treats single quotes the same as double quotes.
❖ Creating a string is the same as assigning the value to a variable in
Python.
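❖ A minimal sketch:
s1 = 'single quotes'
s2 = "double quotes"         # treated the same as single quotes
greeting = "Hello"
print(greeting[0])           # strings are indexable: prints 'H'
print(greeting + ", world")  # concatenation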
Python Libraries: Pandas and NumPy
❖ Python provides extensive library support for the various stages of data
science.
❖ Python libraries contain various tools, functions, and methods to
analyze data. Each library has its own focus, such as data mining, data
visualization, etc.
❖ There are various Python libraries used for data science. Some of
them are listed below:
❖ Data Mining
➢ Scrapy
➢ BeautifulSoup
Python Libraries: Pandas and NumPy
❖ Data Processing and Modeling
➢ NumPy
➢ pandas
➢ SciPy
➢ Scikit-learn
➢ XGBoost
❖ Data Visualization
➢ Matplotlib
➢ Seaborn
➢ Pydot
➢ Plotly
Pandas
❖ pandas is a free Python data analysis and data handling software
library.
❖ pandas provides a variety of high-performance and convenient data
structures along with operations for data manipulation.
❖ pandas is also known for its range of tools for reading and writing data
between in-memory data structures and various file formats.
❖ The standard form to import the pandas module is:
import pandas as pd
❖ pd is just an alias. Any name can be used as an alias.
Pandas
❖ Load and examine the dataset:
❖ A DataFrame is a two-dimensional array where each feature (column) is
simply a Series sharing a common index.
❖ A DataFrame can be viewed as a generalization of the NumPy array.
❖ A dataset can be read and loaded into a pandas DataFrame using the
below code syntax:
df = pd.read_csv('iris.csv')
Pandas
import pandas as pd
cols = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', names=cols)
df.head()                                     # first 5 rows
df.tail()                                     # last 5 rows
df.sample(10)                                 # 10 random rows
df.info()                                     # column dtypes and non-null counts
df.describe()                                 # summary statistics
df.loc[df['class']=='Iris-setosa']            # label-based filtering
df.loc[0:10, ['sepal_length','petal_width']]  # rows 0-10, two named columns
df.iloc[0, 0]                                 # position-based: first cell
df.iloc[0:4, 2:5]                             # first 4 rows, columns 2-4
df.sort_index(ascending=False).head()         # sort by index, descending
df.sort_values(by='sepal_length', ascending=True).head()  # sort by a column
NumPy
❖ NumPy is an acronym for Numerical Python.
❖ The NumPy library is mainly used for working with arrays. It can also
be used for linear algebra and matrices.
❖ In a field such as data science, speed and resources play a vital role.
NumPy arrays are used to overcome the performance limitations of
Python lists.
❖ The major difference that makes NumPy arrays more prominent
than Python lists is the vectorized operations supported by
NumPy arrays and not by Python lists.
❖ ndarray is the array class used in NumPy. All the items in NumPy
arrays are enclosed in square brackets and are initialized using lists.
NumPy
import numpy as np
arr1 = np.array(['a', 'b', 'c', 'd'])          # 1-D (rank 1) array from a list
print("Rank 1 Array : \n", arr1)
print()
arr2 = np.array([[1, 2, 3], ['a', 'b', 'c']])  # mixed types are coerced to strings
print("Rank 2 Array: \n", arr2)
print()
arr3 = np.array((4, 5, 6))                     # arrays can also be built from tuples
print("\n Tuple passed Array:\n", arr3)
NumPy
# Accessing the array using its index
import numpy as np
arr = np.arange(10)
print(arr)
print("arr[:5]", arr[:5])      # prints elements up to index 4
print("arr[3:]", arr[3:])      # prints elements from index 3 to the end
print("arr[::5]", arr[::5])    # prints every 5th element
print("arr[1::2]", arr[1::2])  # prints elements from index 1, stepping by 2
arr1 = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print("Initial Array: ")
print(arr1)
sliced_arr = arr1[:2, :2]
print("Array with first 2 rows and 2 columns", sliced_arr)
NumPy
# Basic array operations
import numpy as np
arr1 = np.array([1,2,3,4])
arr2 = np.array([6,7,8,9])
print("arr1:", arr1)
print ("arr1 after adding 2 to every element:", arr1 + 2)
print("\narr2:", arr2)
print("arr1 after subtracting 1 from every element:", arr2-1)
print ("\nSum of all array elements in arr1: ", arr1.sum())
print ("\nArray sum:\n", arr1 + arr2)
Exploratory Data Analysis (EDA)
❖ Statistically, EDA is a technique that evaluates datasets to contextualize
their primary characteristics, often with visualization.
❖ The main objective of EDA is to make the dataset fit to work with any ML
algorithm.
❖ Objectives of EDA:
➢ Description of the dataset
➢ Removing corrupted data
➢ Handling outliers
➢ Evaluation of the relationship between variables through
visualization
Exploratory Data Analysis (EDA)
Description of the dataset:
❖ A data scientist should always know the data and other vital statistics
of the dataset before moving further.
❖ The most fundamental tool that can be used here is pandas'
describe() function.
❖ Applying the describe() function to a DataFrame provides
comprehensive statistics summarizing dispersion and shape
through count, mean, std, min, max, and the quartiles.
Exploratory Data Analysis (EDA)
Removing corrupted data
❖ Missing values, when not handled properly, can degrade the
performance metrics.
❖ They can also lead to false classifications, which can harm the model
thus built.
❖ Missing or NULL values can be handled in various ways:
❖ Drop NULL or missing values:
➢ Although this is the simplest way to handle missing or NULL
values, it is usually not used as it shrinks the dataset by deleting
such observations.
➢ In Python, the dropna() function is used to delete the missing
values from the dataset.
Exploratory Data Analysis (EDA)
Removing corrupted data
❖ Fill missing values:
➢ In this method, missing values are replaced with a statistic such as
mean, median, mode, min, max, or even any user-defined value.
➢ The fillna() function is used in Python for filling the missing values.
❖ Predict missing values with an ML algorithm:
➢ Predicting missing values is the most efficient way to handle the
missing data.
➢ One can use the regression model or classification model
(depending on the dataset) to predict the missing data.
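❖ A minimal sketch of the first two strategies, assuming a small
DataFrame with NaN values:
import numpy as np
import pandas as pd

df = pd.DataFrame({'sepal_length': [5.1, np.nan, 4.7],
                   'petal_width': [0.2, 0.4, np.nan]})
print(df.dropna())           # drop observations holding missing values
print(df.fillna(df.mean()))  # fill missing values with a statistic (column means)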
Exploratory Data Analysis (EDA)
Handling outliers:
❖ Outliers are the data points that are far different from the crowd of data
points.
❖ This can be an indication of variation in the dataset. Outliers can
be detected using boxplots, scatterplots, the interquartile range (IQR),
and the Z-score.
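❖ A small sketch of IQR-based detection (toy values):
import pandas as pd

s = pd.Series([4.9, 5.0, 5.1, 5.2, 5.3, 12.0])  # 12.0 sits far from the crowd
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]
print(outliers)                                 # flags 12.0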
Exploratory Data Analysis (EDA)
Evaluation of the relationship between variables through visualization:
❖ Visualization helps us understand the various relationships between
features.
❖ Some widely used techniques for data visualization in Python are the
histograms and the heatmaps.
❖ The histogram helps assess a probability distribution and is quite easy
to interpret. Python provides various options to build and plot
histograms.
❖ The heatmap is a representation where the data points are denoted
using colors.
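❖ A short sketch of both, assuming df is the iris DataFrame loaded earlier
and that matplotlib and seaborn are installed:
import matplotlib.pyplot as plt
import seaborn as sns

df['sepal_length'].plot(kind='hist', bins=20)  # histogram of one feature
plt.show()

corr = df.drop(columns='class').corr()         # pairwise feature correlations
sns.heatmap(corr, annot=True)                  # colors encode the values
plt.show()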
Time Series Data
❖ Time series data is a type of structured data and is used in various
domains, such as accounting, ecology, economics, neuroscience, etc.
❖ A time series can refer to anything and everything that is measured
over time.
❖ The data points in a time series typically appear at regular intervals
according to some rule.
❖ A time series may also be uneven, without a fixed unit of time.
❖ Date and time data come in various types:
➢ Timestamps
➢ Time Intervals and Periods
➢ Time Deltas or Durations
Time Series Data
❖ Timestamps – The timestamp refers to particular instants in time (e.g.,
July 24, 2020, at 8:00 a.m.)
❖ Time Intervals and Periods – Time intervals and periods refer to a time
length between the initial and the final point. Periods generally refer to
a specific case of time intervals where each time interval is of a uniform
length and does not coincide (e.g., 24-hour period of days).
❖ Time Deltas or Durations – It refers to an exact length of time (e.g., a
duration of 12.43 seconds)
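❖ A minimal pandas sketch of the three types:
import pandas as pd

ts = pd.Timestamp('2020-07-24 08:00')    # timestamp: a particular instant
per = pd.Period('2020-07-24', freq='D')  # period: a uniform 24-hour day
delta = pd.Timedelta(seconds=12.43)      # time delta: an exact duration
print(ts, per, delta)
print(ts + delta)                        # a timestamp shifted by a duration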
Time Series Data
# Time Series Analysis
# Dates and Times in Python
Time Series Data
# NumPy time series
import numpy as np
date = np.datetime64('2014-12-12 20:20:20')     # a single timestamp, second precision
print(date)
date = np.empty(1, dtype='datetime64[s]')       # an uninitialized datetime array
print(date)
date[0] = np.datetime64('2014-12-12 20:20:20')  # assign into the array
print(date)
date = np.array(np.datetime64('2019-08-26'))    # a 0-d datetime array
print(date)
date = date + np.arange(8)                      # vectorized: 8 consecutive days
print(date)
Time Series Data
# Pandas Time Series
import pandas as pd
dt_index_obj = pd.DatetimeIndex(['2017-12-31 16:00:00-08:00',
                                 '2017-12-31 17:00:00-08:00',
                                 '2017-12-31 18:00:00-08:00'],
                                dtype='datetime64[ns, US/Pacific]', freq='H')
print(dt_index_obj.year)
print(dt_index_obj.month)
print(dt_index_obj.day)
print(dt_index_obj.hour)
print(dt_index_obj.minute)
print(dt_index_obj.date)
print(dt_index_obj.time)
print(dt_index_obj.timetz)
Time Series Data
print(dt_index_obj.is_month_start)
print(dt_index_obj.is_month_end)
print(dt_index_obj.is_year_start)
print(dt_index_obj.is_year_end)
print(dt_index_obj.is_leap_year)
print(dt_index_obj.dayofweek)
print(dt_index_obj.round('H'))
Clustering with Python
❖ Clustering is a form of unsupervised learning.
❖ In unsupervised learning, the features of a dataset are modelled
without any label references.
❖ Clustering is the process of dividing the data points into clusters
in such a way that data points in the same cluster are more similar to
one another than they are to data points in other clusters. It is generally
a grouping of objects in terms of the similarity and dissimilarity among
them.
❖ There are various clustering algorithms in ML. The following are the
most common clustering techniques used in data science:
➢ k-Means clustering
➢ Agglomerative clustering
➢ DBSCAN clustering
Clustering with Python
k-Means clustering
❖ The simplest unsupervised learning algorithm is the k-means clustering
algorithm. The k-means algorithm classifies "n" observations into "k"
clusters, where each observation belongs to the cluster whose nearest
mean serves as the cluster prototype.
❖ Suppose we are given a database of "n" objects and the partitioning
method constructs "k" partitions of the data; each partition will represent
a cluster, and k ≤ n.
❖ It means that it will classify the data into “k” groups, which satisfies the
following requirements:
➢ Each group contains at least one object.
➢ Each object must belong to exactly one group
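❖ A minimal scikit-learn sketch with toy 2-D points:
from sklearn.cluster import KMeans
import numpy as np

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])  # n = 6 observations
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)  # k = 2
print(km.labels_)           # each observation belongs to exactly one group
print(km.cluster_centers_)  # the cluster means serve as prototypes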
K-means clustering
• K-means is a partitional clustering algorithm.
• Let the set of data points (or instances) D be {x1, x2, …, xn}, where
xi = (xi1, xi2, …, xir) is a vector in a real-valued space X ⊆ R^r, and r is
the number of attributes (dimensions) in the data.
• The k-means algorithm partitions the given data into k clusters.
• Each cluster has a cluster center, called the centroid.
• k is specified by the user.
Strength - K-means clustering
• Strengths:
• Simple: easy to understand and to implement
• Efficient: Time complexity: O(tkn),
where n is the number of data points,
k is the number of clusters, and
t is the number of iterations.
• Since both k and t are small, k-means is considered a linear
algorithm.
• K-means is the most popular clustering algorithm.
• Note that it terminates at a local optimum if SSE (sum of squared
errors) is used as the objective; the global optimum is hard to find
due to complexity.
Weakness - K-means clustering
• The algorithm is only applicable if the mean is defined.
• For categorical data, k-modes is used, where the centroid is represented
by the most frequent values.
• The user needs to specify k.
• The algorithm is sensitive to outliers
• Outliers are data points that are very far away from other data points.
• Outliers could be errors in the data recording or some special data points
with very different values.
Weaknesses of k-means: Problems with outliers
Apply the k-means algorithm to the above problem until the convergence
criteria are met (or for at least 2 iterations).
Given
• K=2
• Initial Centroids
• m1 = {185,72}
• m2 = {170,56}
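• A sketch of how such an exercise can be checked in scikit-learn with the
given initial centroids; the data points below are hypothetical stand-ins,
since the exercise's own data table is not reproduced here:
from sklearn.cluster import KMeans
import numpy as np

# hypothetical height-weight observations (substitute the exercise's table)
X = np.array([[185, 72], [170, 56], [168, 60],
              [179, 68], [182, 72], [188, 77]])
init_centroids = np.array([[185, 72], [170, 56]])  # m1 and m2 as given
km = KMeans(n_clusters=2, init=init_centroids, n_init=1).fit(X)
print(km.labels_)           # cluster membership after convergence
print(km.cluster_centers_)  # final centroids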
Clustering with Python
Types of Hierarchical clustering
• Agglomerative (bottom up) clustering: It builds the dendrogram (tree)
from the bottom level, and
• merges the most similar (or nearest) pair of clusters
• stops when all the data points are merged into a single cluster (i.e.,
the root cluster).
• Divisive (top down) clustering: It starts with all data points in one
cluster, the root.
• Splits the root into a set of child clusters. Each child cluster is
recursively divided further
• stops when only singleton clusters of individual data points remain,
i.e., each cluster with only a single point
Clustering with Python
Agglomerative clustering
❖ The agglomerative approach is a type of hierarchical clustering
method.
❖ Hierarchical clustering produces a nested sequence of clusters, a tree,
also called a dendrogram. The clusters here are formed in a tree
structure based on hierarchy.
Clustering with Python
❖ This approach is also known as the bottom-up approach. In this, we
start with each object forming a separate group. It keeps on merging
the objects or groups that are close to one another. It keeps on doing
so until all of the groups are merged into one or until the termination
condition holds. Linkage criteria for agglomerative clustering:
❖ Ward's Method – This method looks for the two clusters giving the least
increase in total variance among all clusters, and then merges them.
❖ Single Linkage – Single linkage (minimum linkage) merges the two
clusters with the least distance between their closest points.
❖ Average Linkage – Average linkage searches for the two clusters to
merge which have the least average distance between the points.
❖ Complete Linkage – This linkage, also called the maximum linkage,
considers the maximum distance between points of two clusters and
merges the pair for which this distance is smallest.
Agglomerative Clustering :: Practice Tutorial
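❖ A practice sketch using SciPy and scikit-learn, with toy 2-D points:
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering
import matplotlib.pyplot as plt
import numpy as np

X = np.array([[1, 1], [1, 2], [5, 5], [5, 6], [9, 1]])
Z = linkage(X, method='ward')  # bottom-up merges under Ward's criterion
dendrogram(Z)                  # the nested sequence of clusters as a tree
plt.show()

agg = AgglomerativeClustering(n_clusters=2, linkage='ward').fit(X)
print(agg.labels_)             # cluster membership for each point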
Clustering with Python
DBSCAN
❖ It is an acronym for the density-based spatial clustering of applications
with noise.
❖ In DBSCAN, one need not specify the number of clusters in advance.
❖ Also, DBSCAN is well suited for datasets with complex cluster shapes.
❖ In the DBSCAN method, clusters refer to areas of higher density in the
data space, separated by areas that are much less densely populated
or empty.
Clustering with Python
❖ Minimum points (MinPts) and epsilon (eps) play a very important role in
the DBSCAN algorithm.
❖ Points lying in the denser regions are referred to as core samples.
Core samples lying within a distance of eps units of one another are put
into the same cluster.
❖ After the points are distinguished as core samples, the data points
not making it into any cluster are labelled as noise.
❖ Boundary points are the points that fall within the eps units' distance
of a core point but cannot be termed core points themselves.
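❖ A minimal scikit-learn sketch; clusters are discovered without specifying
their number, and noise is marked -1:
from sklearn.cluster import DBSCAN
import numpy as np

X = np.array([[1, 2], [2, 2], [2, 3],
              [8, 7], [8, 8], [25, 80]])  # the last point is isolated
db = DBSCAN(eps=3, min_samples=2).fit(X)  # eps and MinPts drive the result
print(db.labels_)                         # noise points get the label -1
print(db.core_sample_indices_)            # indices of the core samples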
ARCH and GARCH
The ARCH Model
❖ ARCH is an acronym that stands for autoregressive conditional
heteroskedasticity.
❖ It is an approach that offers a way to model time-dependent changes in
the variance of a time series, such as an increase or decrease in
volatility. It is a process that directly models the change of variance
over time in a time sequence.
❖ A lag factor must be defined to determine the number of previous
squared error terms to be included in the ARCH model. An ARCH model
is built in three stages:
1) Model definition
2) Model fitting
3) Forecasting
ARCH and GARCH
The GARCH Model
❖ GARCH is an acronym for generalized autoregressive conditional
heteroskedasticity. The GARCH model is an extended version of the
ARCH model.
❖ Explicitly, the model integrates lag variation terms, along with residual
errors during the mean process.
❖ In GARCH model:
➢ p – No. of lag variances.
➢ q – No. of lag residual errors.
❖ Note: The “p” parameter used in the ARCH model is denoted as the “q”
parameter in the GARCH model.
❖ The GARCH model subsumes the ARCH model, so GARCH(0, q) is the
same as ARCH(q).
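❖ A minimal sketch of the three stages, assuming the third-party arch
package is installed; the return series here is a random placeholder:
from arch import arch_model
import numpy as np

returns = np.random.normal(0, 1, 1000)  # placeholder return series

# stage 1: model definition - an ARCH(1) and a GARCH(1,1)
arch1 = arch_model(returns, vol='ARCH', p=1)
garch11 = arch_model(returns, vol='GARCH', p=1, q=1)

res = garch11.fit(disp='off')           # stage 2: model fitting
print(res.summary())
fcast = res.forecast(horizon=5)         # stage 3: forecasting
print(fcast.variance.iloc[-1])          # forecast variance, 5 steps ahead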
Dimensionality Reduction
❖ Dimensionality reduction is an example of an unsupervised algorithm,
in which the structure of the dataset is inferred without reference to
labels or other provided information.
❖ Dimensionality reduction usually aims to retrieve a few low-dimensional
representations of data that, in a way, maintains the relevant qualities
of its entire dataset.
❖ It takes an original dataset that consists of, say, 250 dimensions
and finds an approximate representation of it that uses, say, only ten
dimensions.
❖ The most popular dimensionality reduction techniques used in data
science are as follows:
➢ Principal component analysis (PCA)
➢ Manifold learning
Dimensionality Reduction
PCA
❖ It is a technique which extracts a new set of variables from an existing
wide range of variables. Such freshly extracted variables are regarded
as the principal components.
❖ The principal components are retrieved in such a manner that the very
first principal component describes the maximum variance in the
dataset.
❖ The second principal component seeks to outline the remaining
variance in the dataset and is uncorrelated with the first principal
component.
❖ The next principal component aims to describe the variation that
remains unexplained by the first two principal components, and so on.
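❖ A minimal scikit-learn sketch on the iris data:
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data                  # 150 samples in 4 dimensions
pca = PCA(n_components=2)             # keep the first two principal components
X_2d = pca.fit_transform(X)
print(pca.explained_variance_ratio_)  # variance described, in decreasing order
print(X_2d.shape)                     # (150, 2)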
Dimensionality Reduction
Manifold Learning
❖ Manifold learning is a class of unsupervised learning that aims to
describe datasets as low-dimensional manifolds embedded in
high-dimensional spaces.
❖ One commonly used method of manifold learning is called multi-
dimensional scaling (MDS). There are many types of MDS, but they all
have a general aim: to visualize a high-dimensional space and project
it into a lower dimensional space.
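❖ A minimal MDS sketch on the same iris data:
from sklearn.datasets import load_iris
from sklearn.manifold import MDS

X = load_iris().data                       # high-dimensional (4-D) points
mds = MDS(n_components=2, random_state=0)  # project into two dimensions
X_low = mds.fit_transform(X)
print(X_low.shape)                         # (150, 2)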
Python IDEs for Data Science
❖ IDE is an acronym for “integrated development environment” which
combines all the different facets of writing code — code editor,
compiler/interpreter, and debugger — in a single application.
❖ With IDEs, it is easier to start programming new applications, as we do
not need to set up various utilities and learn about different tools to
run a program.
❖ The debugger tool in the IDEs helps us evaluate the variables and
inspect the code to isolate the errors.
❖ Some of the Python IDEs that are used for data science are given as
follows: Jupyter Notebook, Spyder, PyCharm, Visual Studio Code
Thank You for Your Attention !