BCSE206L FDS Module-5 – S M Satapathy
L T P C: 3 0 0 3
Dr. S M SATAPATHY
Associate Professor,
School of Computer Science and Engineering,
VIT Vellore, TN, India – 632 014.
Module – 5
❖ Python IDE
PYTHON
FOR
DATA SCIENCE
Introduction
❖ A notable reason to use Python in data science is the availability of
different libraries for data science and data analytics.
❖ NumPy (for array-based data manipulation), pandas (for labeled data
manipulation), SciPy (for scientific computing), and scikit-learn (for ML
algorithms) are a few among the many libraries that are generally used
in data science.
Python Identifiers
❖ A Python identifier is a name that consists of a series of letters,
numbers, and underscore characters.
❖ A Python identifier should follow the below naming conventions:
➢ Variable and function names start with a lowercase letter.
➢ An identifier starting with an underscore character generally
indicates it to be a private identifier.
➢ Class names begin with an uppercase letter.
➢ An identifier cannot be one of the reserved Python keywords.
Python Variables
❖ Python has no syntax or command to declare a variable.
❖ The name assigned to a variable is the variable's name. It can also be
called a Python identifier.
❖ These variables are case sensitive and should ideally begin with a
letter.
❖ To improve the readability of the variable name, it is better to use
lowercase-lettered words which are separated using an underscore
character, for example, mixed_case, data_des, etc.
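❖ A small sketch of these conventions (the variable names here are illustrative):
data_des = "Iris dataset"   # lowercase words separated by underscores
num_rows = 150              # created by assignment alone; no declaration needed
_cache = {}                 # a leading underscore hints at a private identifier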
Python Data Types
❖ Python supports many basic data types, such as integer, floating-point,
character, or string.
❖ In Python, we need not declare the data type of a variable.
Instead, the Python interpreter identifies the data type of the variable
from the value assigned to it.
❖ The other two data types are Boolean and None.
➢ The Boolean data type takes two values: True or False.
➢ The None data type indicates a null value.
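❖ A short illustration of dynamic typing across these types:
x = 10            # int
pi = 3.14         # float
name = "iris"     # str
flag = True       # bool
empty = None      # NoneType
print(type(x), type(pi), type(name), type(flag), type(empty))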
Python Data Structures
❖ There are five data structures in Python:
➢ List
➢ Tuple
➢ Dictionary
➢ Set
➢ String
Python Data Structures
List:
❖ Python lists are ordered with a certain count. The values in the lists are
enclosed in square brackets and are separated using commas.
❖ The list indices start with 0.
❖ The list can hold both homogeneous (e.g., [1, 6, 4, 7]) as well as
heterogeneous (e.g., [1, 'a', 'd', 4, 6]) data.
❖ Lists are mutable in Python and thus can be altered by adding new
items, deleting some items, or modifying the existing items.
❖ A list can be created in four different ways:
➢ Comma-separated values – [2, 5, 3, 8, 3]
➢ Empty list – []
➢ Singleton list – [a]
➢ Using the list() function
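❖ A quick sketch of the four creation forms and of list mutability:
nums = [2, 5, 3, 8, 3]        # comma-separated values
empty = []                    # empty list
single = ['a']                # singleton list
from_tuple = list((1, 2, 3))  # using the list() function
nums.append(7)                # mutable: add a new item
nums[0] = 9                   # modify an existing item
del nums[1]                   # delete an item
print(nums)                   # [9, 3, 8, 3, 7]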
Python Data Structures
Tuple:
❖ A tuple is a collection of ordered elements and is immutable.
❖ Any sort of modification in the tuple is prohibited and needs the
creation of a new tuple.
❖ Same as lists, a tuple can hold homogeneous and heterogeneous
data. All the items in the tuple are enclosed in parentheses and are
separated using commas.
❖ Tuple can be created in the following ways:
➢ Comma-separated values – (2, 5, 3, 8, 3)
➢ Empty tuple – ()
➢ Singleton tuple – (a,) Note: Even when creating a single-valued
tuple, a comma must be included after the value.
➢ Using the tuple() function
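❖ A short sketch showing why the trailing comma matters:
t = (2, 5, 3, 8, 3)        # comma-separated values
empty_t = ()               # empty tuple
single_t = ('a',)          # singleton tuple: trailing comma required
not_a_tuple = ('a')        # without the comma, this is just the string 'a'
from_list = tuple([1, 2])  # using the tuple() function
print(type(single_t), type(not_a_tuple))  # <class 'tuple'> <class 'str'>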
Python Data Structures
Dictionary:
❖ A dictionary is a sequence of key-value pairs (unordered prior to Python
3.7, insertion-ordered since). A dictionary has unique keys, but the
values can be repeated.
❖ The keys in the dictionary must be of an immutable type, whereas the
values can have any type.
❖ Keys and values are separated using a colon, and the whole
dictionary is enclosed in curly brackets.
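❖ A minimal sketch of dictionary creation and access:
iris_counts = {'setosa': 50, 'versicolor': 50, 'virginica': 50}
print(iris_counts['setosa'])  # values are accessed by key
iris_counts['setosa'] = 49    # values can be modified
iris_counts['unknown'] = 1    # new key-value pairs can be added
print(iris_counts)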
Python Data Structures
Set:
❖ A set is an unordered collection of unique data items.
❖ It is a built-in data structure in Python with the following characteristics:
➢ Unordered: a set doesn't maintain the order of data insertion
➢ Unchangeable: set items cannot be modified in place, though items
can be added and removed
➢ Unique: a set doesn't allow duplicate items
➢ Heterogeneous: a set can contain data of mixed (hashable) types
❖ There are two ways to create a set in Python:
➢ Using curly brackets
➢ Using set() constructor
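❖ A brief sketch of both creation forms and the characteristics above:
s1 = {1, 2, 3, 2}     # curly brackets; the duplicate 2 is dropped
s2 = set([3, 4, 5])   # set() constructor
print(s1)             # {1, 2, 3}
s1.add(9)             # items can be added
s1.discard(2)         # and removed, but not modified in place
print(s1 & s2)        # set operations, e.g. intersection: {3}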
Python Data Structures
String:
❖ A string is a combination of zero or more characters enclosed in single
quotes or double quotes.
❖ Python treats single quotes the same as double quotes.
❖ Creating a string is the same as assigning the value to a variable in
Python.
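❖ A minimal sketch:
s1 = 'single quotes'
s2 = "double quotes"         # treated the same as single quotes
greeting = "Hello"
print(greeting[0])           # strings are indexable: prints 'H'
print(greeting + ", world")  # concatenation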
Python Libraries: Pandas and NumPy
❖ Python provides extensive library support for the various stages of data
science.
❖ Python libraries contain various tools, functions, and methods to
analyze data. Each library has its own focus, such as data mining, data
visualization, etc.
❖ There are various Python libraries used for data science. Some of
them are listed below:
❖ Data Mining
➢ Scrapy
➢ BeautifulSoup
Python Libraries: Pandas and NumPy
❖ Data Processing and Modeling
➢ NumPy
➢ pandas
➢ SciPy
➢ Scikit-learn
➢ XGBoost
❖ Data Visualization
➢ Matplotlib
➢ Seaborn
➢ Pydot
➢ Plotly
Pandas
❖ pandas is a free Python data analysis and data handling software
library.
❖ pandas provides a variety of high-performance and convenient data
structures along with operations for data manipulation.
❖ pandas is also known for its range of tools for reading and writing data
between in-memory data structures and various file formats.
❖ The standard form to import the pandas module is:
import pandas as pd
❖ pd is just an alias. Any name can be used as an alias.
Pandas
❖ Load and examine the dataset:
❖ A DataFrame is a two-dimensional array where each feature (column) is
simply a Series sharing a common index.
❖ A DataFrame can be viewed as a generalization of the NumPy array.
❖ A dataset can be read and loaded into a pandas DataFrame using the
below code syntax:
df = pd.read_csv('iris.csv')
Pandas
import pandas as pd
cols = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', names=cols)
df.head()                                     # first 5 rows
df.tail()                                     # last 5 rows
df.sample(10)                                 # 10 random rows
df.info()                                     # column dtypes and non-null counts
df.describe()                                 # summary statistics
df.loc[df['class']=='Iris-setosa']            # label-based filtering
df.loc[0:10, ['sepal_length','petal_width']]  # rows 0-10, two named columns
df.iloc[0, 0]                                 # position-based: first cell
df.iloc[0:4, 2:5]                             # first 4 rows, columns 2-4
df.sort_index(ascending=False).head()         # sort by index, descending
df.sort_values(by='sepal_length', ascending=True).head()  # sort by a column
NumPy
❖ NumPy is an acronym for Numerical Python.
❖ The NumPy library is mainly used for working with arrays. It can also
be used for linear algebra and matrices.
❖ In a field such as data science, speed and resources play a vital role.
NumPy arrays are used to overcome the performance limitations of
Python lists.
❖ The major difference that makes NumPy arrays more prominent
than Python lists is the vectorized operations supported by
NumPy arrays and not by Python lists.
❖ ndarray is the array class used in NumPy. All the items in NumPy
arrays are enclosed in square brackets and are initialized using lists.
NumPy
import numpy as np
arr1 = np.array(['a', 'b', 'c', 'd'])          # 1-D (rank 1) array from a list
print("Rank 1 Array : \n", arr1)
print()
arr2 = np.array([[1, 2, 3], ['a', 'b', 'c']])  # mixed types are coerced to strings
print("Rank 2 Array: \n", arr2)
print()
arr3 = np.array((4, 5, 6))                     # arrays can also be built from tuples
print("\n Tuple passed Array:\n", arr3)
NumPy
# Accessing the array using its index
import numpy as np
arr = np.arange(10)
print(arr)
print("arr[:5]", arr[:5])      # prints elements up to index 4
print("arr[3:]", arr[3:])      # prints elements from index 3 to the end
print("arr[::5]", arr[::5])    # prints every 5th element
print("arr[1::2]", arr[1::2])  # prints elements from index 1, stepping by 2
arr1 = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print("Initial Array: ")
print(arr1)
sliced_arr = arr1[:2, :2]
print("Array with first 2 rows and 2 columns", sliced_arr)
NumPy
# Basic array operations
import numpy as np
arr1 = np.array([1,2,3,4])
arr2 = np.array([6,7,8,9])
print("arr1:", arr1)
print ("arr1 after adding 2 to every element:", arr1 + 2)
print("\narr2:", arr2)
print("arr1 after subtracting 1 from every element:", arr2-1)
print ("\nSum of all array elements in arr1: ", arr1.sum())
print ("\nArray sum:\n", arr1 + arr2)
Exploratory Data Analysis (EDA)
❖ Statistically, EDA is a technique that evaluates datasets to contextualize
their primary characteristics, often with visualization.
❖ The main objective of EDA is to make the dataset fit to work with any ML
algorithm.
❖ Objectives of EDA:
➢ Description of the dataset
➢ Removing corrupted data
➢ Handling outliers
➢ Evaluation of the relationship between variables through
visualization
Exploratory Data Analysis (EDA)
Description of the dataset:
❖ A data scientist should always know the data and other vital statistics
of the dataset before moving further.
❖ The most fundamental tool that can be used here is pandas'
describe() function.
❖ Applying the describe() function to a DataFrame provides
comprehensive statistics summarizing dispersion and shape
through count, mean, std, min, max, and the quartiles.
Exploratory Data Analysis (EDA)
Removing corrupted data
❖ Missing values, when not handled properly, can degrade the
performance metrics.
❖ They can also lead to false classifications, which can harm the model
thus built.
❖ Missing or NULL values can be handled in various ways:
❖ Drop NULL or missing values:
➢ Although this is the simplest way to handle missing or NULL
values, it is usually not used as it shrinks the dataset by deleting
such observations.
➢ In Python, the dropna() function is used to delete the missing
values from the dataset.
Exploratory Data Analysis (EDA)
Removing corrupted data
❖ Fill missing values:
➢ In this method, missing values are replaced with a statistic such as
mean, median, mode, min, max, or even any user-defined value.
➢ The fillna() function is used in Python for filling the missing values.
❖ Predict missing values with an ML algorithm:
➢ Predicting missing values is the most efficient way to handle the
missing data.
➢ One can use the regression model or classification model
(depending on the dataset) to predict the missing data.
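❖ A minimal sketch of the first two strategies, assuming a small
DataFrame with NaN values:
import numpy as np
import pandas as pd

df = pd.DataFrame({'sepal_length': [5.1, np.nan, 4.7],
                   'petal_width': [0.2, 0.4, np.nan]})
print(df.dropna())           # drop observations holding missing values
print(df.fillna(df.mean()))  # fill missing values with a statistic (column means)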
Exploratory Data Analysis (EDA)
Handling outliers:
❖ Outliers are the data points that are far different from the crowd of data
points.
❖ This can be an indication of variation in the dataset. Outliers can
be detected using boxplots, scatterplots, the interquartile range (IQR),
and the Z-score.
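❖ A small sketch of IQR-based detection (toy values):
import pandas as pd

s = pd.Series([4.9, 5.0, 5.1, 5.2, 5.3, 12.0])  # 12.0 sits far from the crowd
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]
print(outliers)                                 # flags 12.0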
Exploratory Data Analysis (EDA)
Evaluation of the relationship between variables through visualization:
❖ Visualization helps us understand the various relationships between
features.
❖ Some widely used techniques for data visualization in Python are the
histograms and the heatmaps.
❖ The histogram helps assess a probability distribution and is quite easy
to interpret. Python provides various options to build and plot
histograms.
❖ The heatmap is a representation where the data points are denoted
using colors.
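❖ A short sketch of both, assuming df is the iris DataFrame loaded earlier
and that matplotlib and seaborn are installed:
import matplotlib.pyplot as plt
import seaborn as sns

df['sepal_length'].plot(kind='hist', bins=20)  # histogram of one feature
plt.show()

corr = df.drop(columns='class').corr()         # pairwise feature correlations
sns.heatmap(corr, annot=True)                  # colors encode the values
plt.show()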
Time Series Data
❖ Time series data is a type of structured data and is used in various
domains, such as accounting, ecology, economics, neuroscience, etc.
❖ A time series can refer to anything and everything that is measured
over time.
❖ The data points in a time series typically appear at regular intervals
according to some rule.
❖ A time series may also be uneven, without a fixed unit of time.
❖ Date and time data come in various types:
➢ Timestamps
➢ Time Intervals and Periods
➢ Time Deltas or Durations
Time Series Data
❖ Timestamps – The timestamp refers to particular instants in time (e.g.,
July 24, 2020, at 8:00 a.m.)
❖ Time Intervals and Periods – Time intervals and periods refer to a time
length between the initial and the final point. Periods generally refer to
a specific case of time intervals where each time interval is of a uniform
length and does not coincide (e.g., 24-hour period of days).
❖ Time Deltas or Durations – It refers to an exact length of time (e.g., a
duration of 12.43 seconds)
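❖ A minimal pandas sketch of the three types:
import pandas as pd

ts = pd.Timestamp('2020-07-24 08:00')    # timestamp: a particular instant
per = pd.Period('2020-07-24', freq='D')  # period: a uniform 24-hour day
delta = pd.Timedelta(seconds=12.43)      # time delta: an exact duration
print(ts, per, delta)
print(ts + delta)                        # a timestamp shifted by a duration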
Time Series Data
# Time Series Analysis
# Dates and Times in Python
Time Series Data
# NumPy time series
import numpy as np
date = np.datetime64('2014-12-12 20:20:20')     # a single timestamp, second precision
print(date)
date = np.empty(1, dtype='datetime64[s]')       # an uninitialized datetime array
print(date)
date[0] = np.datetime64('2014-12-12 20:20:20')  # assign into the array
print(date)
date = np.array(np.datetime64('2019-08-26'))    # a 0-d datetime array
print(date)
date = date + np.arange(8)                      # vectorized: 8 consecutive days
print(date)
Time Series Data
# Pandas Time Series
import pandas as pd
dt_index_obj = pd.DatetimeIndex(['2017-12-31 16:00:00-08:00',
                                 '2017-12-31 17:00:00-08:00',
                                 '2017-12-31 18:00:00-08:00'],
                                dtype='datetime64[ns, US/Pacific]', freq='H')
print(dt_index_obj.year)
print(dt_index_obj.month)
print(dt_index_obj.day)
print(dt_index_obj.hour)
print(dt_index_obj.minute)
print(dt_index_obj.date)
print(dt_index_obj.time)
print(dt_index_obj.timetz)
Time Series Data
print(dt_index_obj.is_month_start)
print(dt_index_obj.is_month_end)
print(dt_index_obj.is_year_start)
print(dt_index_obj.is_year_end)
print(dt_index_obj.is_leap_year)
print(dt_index_obj.dayofweek)
print(dt_index_obj.round('H'))
Clustering with Python
❖ Clustering is a form of unsupervised learning.
❖ In unsupervised learning, the features of a dataset are modelled
without any label references.
❖ Clustering is the process of dividing the data points into clusters
in such a way that data points in the same cluster are more similar to
one another than they are to data points in other clusters. It is generally
a grouping of objects in terms of the similarity and dissimilarity among
them.
❖ There are various clustering algorithms in ML. The following are the
most common clustering techniques used in data science:
➢ k-Means clustering
➢ Agglomerative clustering
➢ DBSCAN clustering
Clustering with Python
k-Means clustering
❖ The simplest unsupervised learning algorithm is the k-means clustering
algorithm. The k-means algorithm classifies "n" observations into "k"
clusters, where each observation belongs to the cluster whose nearest
mean serves as the cluster prototype.
❖ Suppose we are given a database of "n" objects and the partitioning
method constructs "k" partitions of the data; each partition will represent
a cluster, and k ≤ n.
❖ It means that it will classify the data into “k” groups, which satisfies the
following requirements:
➢ Each group contains at least one object.
➢ Each object must belong to exactly one group
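❖ A minimal scikit-learn sketch with toy 2-D points:
from sklearn.cluster import KMeans
import numpy as np

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])  # n = 6 observations
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)  # k = 2
print(km.labels_)           # each observation belongs to exactly one group
print(km.cluster_centers_)  # the cluster means serve as prototypes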
K-means clustering
• K-means is a partitional clustering algorithm.
• Let the set of data points (or instances) D be {x1, x2, …, xn}, where
xi = (xi1, xi2, …, xir) is a vector in a real-valued space X ⊆ R^r, and r is
the number of attributes (dimensions) in the data.
• The k-means algorithm partitions the given data into k clusters.
• Each cluster has a cluster center, called the centroid.
• k is specified by the user.
Strength - K-means clustering
• Strengths:
• Simple: easy to understand and to implement
• Efficient: Time complexity: O(tkn),
where n is the number of data points,
k is the number of clusters, and
t is the number of iterations.
• Since both k and t are small, k-means is considered a linear
algorithm.
• K-means is the most popular clustering algorithm.
• Note that it terminates at a local optimum if SSE (sum of squared
errors) is used as the objective; the global optimum is hard to find
due to complexity.
Weakness - K-means clustering
• The algorithm is only applicable if the mean is defined.
• For categorical data, k-modes is used, where the centroid is represented
by the most frequent values.
• The user needs to specify k.
• The algorithm is sensitive to outliers
• Outliers are data points that are very far away from other data points.
• Outliers could be errors in the data recording or some special data points
with very different values.
Weaknesses of k-means: Problems with outliers
Apply the k-means algorithm to the above problem until the convergence
criteria are met (or for at least 2 iterations).
Given
• K=2
• Initial Centroids
• m1 = {185,72}
• m2 = {170,56}
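• A sketch of how such an exercise can be checked in scikit-learn with the
given initial centroids; the data points below are hypothetical stand-ins,
since the exercise's own data table is not reproduced here:
from sklearn.cluster import KMeans
import numpy as np

# hypothetical height-weight observations (substitute the exercise's table)
X = np.array([[185, 72], [170, 56], [168, 60],
              [179, 68], [182, 72], [188, 77]])
init_centroids = np.array([[185, 72], [170, 56]])  # m1 and m2 as given
km = KMeans(n_clusters=2, init=init_centroids, n_init=1).fit(X)
print(km.labels_)           # cluster membership after convergence
print(km.cluster_centers_)  # final centroids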
Clustering with Python
Types of Hierarchical clustering
• Agglomerative (bottom up) clustering: It builds the dendrogram (tree)
from the bottom level, and
• merges the most similar (or nearest) pair of clusters
• stops when all the data points are merged into a single cluster (i.e.,
the root cluster).
• Divisive (top down) clustering: It starts with all data points in one
cluster, the root.
• Splits the root into a set of child clusters. Each child cluster is
recursively divided further
• stops when only singleton clusters of individual data points remain,
i.e., each cluster with only a single point
Clustering with Python
Agglomerative clustering
❖ The agglomerative approach is a type of hierarchical clustering
method.
❖ Hierarchical clustering produces a nested sequence of clusters, a tree,
also called a dendrogram. The clusters here are formed in a tree
structure based on hierarchy.
Clustering with Python
❖ This approach is also known as the bottom-up approach. In this, we
start with each object forming a separate group. It keeps on merging
the objects or groups that are close to one another. It keeps on doing
so until all of the groups are merged into one or until the termination
condition holds. Linkage criteria for agglomerative clustering:
❖ Ward's Method – This method looks for the two clusters giving the least
increase in total variance among all clusters, and then merges them.
❖ Single Linkage – Single linkage (minimum linkage) merges the two
clusters with the least distance between their closest points.
❖ Average Linkage – Average linkage searches for the two clusters to
merge which have the least average distance between the points.
❖ Complete Linkage – This linkage, also called the maximum linkage,
considers the maximum distance between points of two clusters and
merges the pair for which this distance is smallest.
Agglomerative Clustering :: Practice Tutorial
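❖ A practice sketch using SciPy and scikit-learn, with toy 2-D points:
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering
import matplotlib.pyplot as plt
import numpy as np

X = np.array([[1, 1], [1, 2], [5, 5], [5, 6], [9, 1]])
Z = linkage(X, method='ward')  # bottom-up merges under Ward's criterion
dendrogram(Z)                  # the nested sequence of clusters as a tree
plt.show()

agg = AgglomerativeClustering(n_clusters=2, linkage='ward').fit(X)
print(agg.labels_)             # cluster membership for each point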
Clustering with Python
DBSCAN
❖ It is an acronym for the density-based spatial clustering of applications
with noise.
❖ In DBSCAN, one need not specify the number of clusters in advance.
❖ Also, DBSCAN is well suited for datasets with complex cluster shapes.
❖ In the DBSCAN method, clusters refer to areas of higher density in the
data space, separated by areas that are much less densely populated
or empty.
Clustering with Python
❖ Minimum points (MinPts) and epsilon (eps) play a very important role in
the DBSCAN algorithm.
❖ Points lying in the denser regions are referred to as core samples.
Core samples lying within a distance of eps units of one another are put
into the same cluster.
❖ After the points are distinguished as core samples, the data points
not making it into any cluster are labelled as noise.
❖ Boundary points are the points that fall within the eps units' distance
of a core point but cannot be termed core points themselves.
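❖ A minimal scikit-learn sketch; clusters are discovered without specifying
their number, and noise is marked -1:
from sklearn.cluster import DBSCAN
import numpy as np

X = np.array([[1, 2], [2, 2], [2, 3],
              [8, 7], [8, 8], [25, 80]])  # the last point is isolated
db = DBSCAN(eps=3, min_samples=2).fit(X)  # eps and MinPts drive the result
print(db.labels_)                         # noise points get the label -1
print(db.core_sample_indices_)            # indices of the core samples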
ARCH and GARCH
The ARCH Model
❖ ARCH is an acronym that stands for autoregressive conditional
heteroskedasticity.
❖ It is an approach that offers a way to model time-dependent changes in
the variance of a time series, such as an increase or decrease in
volatility. It is a process that directly models the change of variance
over time in a time sequence.
❖ A lag factor must be defined to determine the number of previous
squared error terms to be included in the ARCH model. An ARCH model
is built in three stages:
1) Model definition
2) Model fitting
3) Forecasting
ARCH and GARCH
The GARCH Model
❖ GARCH is an acronym for generalized autoregressive conditional
heteroskedasticity. The GARCH model is an extended version of the
ARCH model.
❖ Explicitly, the model integrates lag variation terms, along with residual
errors during the mean process.
❖ In GARCH model:
➢ p – No. of lag variances.
➢ q – No. of lag residual errors.
❖ Note: The “p” parameter used in the ARCH model is denoted as the “q”
parameter in the GARCH model.
❖ The GARCH model subsumes the ARCH model, so GARCH(0, q) is the
same as ARCH(q).
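❖ A minimal sketch of the three stages, assuming the third-party arch
package is installed; the return series here is a random placeholder:
from arch import arch_model
import numpy as np

returns = np.random.normal(0, 1, 1000)  # placeholder return series

# stage 1: model definition - an ARCH(1) and a GARCH(1,1)
arch1 = arch_model(returns, vol='ARCH', p=1)
garch11 = arch_model(returns, vol='GARCH', p=1, q=1)

res = garch11.fit(disp='off')           # stage 2: model fitting
print(res.summary())
fcast = res.forecast(horizon=5)         # stage 3: forecasting
print(fcast.variance.iloc[-1])          # forecast variance, 5 steps ahead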
Dimensionality Reduction
❖ Dimensionality reduction is an example of an unsupervised algorithm,
in which the structure of the dataset is inferred without reference to
labels or other provided information.
❖ Dimensionality reduction usually aims to retrieve a few low-dimensional
representations of data that, in a way, maintains the relevant qualities
of its entire dataset.
❖ It takes an original dataset that consists of, say, 250 dimensions
and finds an approximate representation of it that uses, say, only ten
dimensions.
❖ The most popular dimensionality reduction techniques used in data
science are as follows:
➢ Principal component analysis (PCA)
➢ Manifold learning
Dimensionality Reduction
PCA
❖ It is a technique which extracts a new set of variables from an existing
wide range of variables. Such freshly extracted variables are regarded
as the principal components.
❖ The principal components are retrieved in such a manner that the very
first principal component describes the maximum variance in the
dataset.
❖ The second principal component seeks to outline the remaining
variance in the dataset and is uncorrelated with the first principal
component.
❖ The next principal component aims to describe the variation that
remains unexplained by the first two principal components, and so on.
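❖ A minimal scikit-learn sketch on the iris data:
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data                  # 150 samples in 4 dimensions
pca = PCA(n_components=2)             # keep the first two principal components
X_2d = pca.fit_transform(X)
print(pca.explained_variance_ratio_)  # variance described, in decreasing order
print(X_2d.shape)                     # (150, 2)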
Dimensionality Reduction
Manifold Learning
❖ Manifold learning is a class of unsupervised learning that aims to
describe datasets as low-dimensional manifolds embedded in
high-dimensional spaces.
❖ One commonly used method of manifold learning is called multi-
dimensional scaling (MDS). There are many types of MDS, but they all
have a general aim: to visualize a high-dimensional space and project
it into a lower dimensional space.
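❖ A minimal MDS sketch on the same iris data:
from sklearn.datasets import load_iris
from sklearn.manifold import MDS

X = load_iris().data                       # high-dimensional (4-D) points
mds = MDS(n_components=2, random_state=0)  # project into two dimensions
X_low = mds.fit_transform(X)
print(X_low.shape)                         # (150, 2)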
Python IDEs for Data Science
❖ IDE is an acronym for “integrated development environment” which
combines all the different facets of writing code — code editor,
compiler/interpreter, and debugger — in a single application.
❖ With IDEs, it is easier to start programming new applications, as we do
not need to set up various utilities and learn about different tools to
run a program.
❖ The debugger tool in the IDEs helps us evaluate the variables and
inspect the code to isolate the errors.
❖ Some of the Python IDEs that are used for data science are given as
follows: Jupyter Notebook, Spyder, PyCharm, Visual Studio Code
Thank You for Your Attention !