DEPARTMENT OF COMPUTER
SCIENCE & ENGINEERING
LECTURE NOTES
ON
DATA SCIENCE
SYLLABUS
UNIT-I INTRODUCTION TO DATA SCIENCE
Definition — Big Data and Data Science Hype — Why data science — Getting Past the Hype —
The Current Landscape — Data Scientist - Data Science Process Overview — Defining goals —
Retrieving data — Data preparation — Data exploration — Data modeling — Presentation.
Data Scientist
Looking at most of the data scientist job descriptions, we notice that:
➢ they ask data scientists to be experts in computer science, Mathematics, statistics,
Machine Learning, communication and Presentation Skills, data visualization, and to
have extensive domain expertise.
➢ Nobody is an expert in everything, which is why it makes more sense to create teams of
people who have different profiles and different expertise—together, as a team, they can
specialize in all those things.
Step 2: Retrieving data
➢ Data can be stored in many forms, ranging from simple text files to tables in a database.
➢ The objective now is acquiring all the data you need.
➢ This may be difficult, and even if you succeed, data is often like a diamond in the rough:
it needs polishing to be of any use to you.
Step 3: Cleansing, integrating, and transforming data
Cleansing data
➢ Data cleansing is a subprocess of the data science process that focuses on removing
errors in your data so your data becomes a true and consistent representation of the
processes it originates from.
➢ By “true and consistent representation” we imply that at least two types of errors exist.
➢ The first type is the interpretation error, such as when you take the value in your data for
granted, like saying that a person’s age is greater than 300 years.
➢ The second type of error points to inconsistencies between data sources or against your
company’s standardized values.
➢ An example of this class of errors is putting “Female” in one table and “F” in another
when they represent the same thing: that the person is female.
a) DATA ENTRY ERRORS
➢ Data collection and data entry are error-prone processes.
➢ They often require human intervention, and because humans are only human, they make
typos or lose their concentration for a second and introduce an error into the chain.
Most errors of this type are easy to fix with simple assignment statements and if-then-else rules:
if x == "Godo":
    x = "Good"
if x == "Bade":
    x = "Bad"
b) REDUNDANT WHITESPACE
➢ Whitespaces tend to be hard to detect but cause errors like other redundant characters
would.
➢ Who hasn’t lost a few days in a project because of a bug that was caused by whitespaces
at the end of a string?
➢ You ask the program to join two keys and notice that observations are missing from the
output file.
➢ After looking for days through the code, you finally find the bug.
➢ Most programming languages provide string functions that remove leading and trailing whitespace. For instance, in Python you can use the strip() function to remove leading and trailing spaces, as in the sketch below.
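A minimal Python sketch (the country keys and values here are hypothetical, not from the notes) of how a trailing space breaks an exact-match join and how stripping fixes it:

# Hypothetical example: a key with a trailing space fails an exact-match join.
left = {"NL ": "Netherlands", "BR": "Brazil"}
right = {"NL": 17000000, "BR": 214000000}

# Exact match misses "NL " because of the trailing space.
matched = {k: right[k] for k in left if k in right}
print(matched)  # {'BR': 214000000}

# Stripping the whitespace from the keys first fixes the join.
cleaned = {k.strip(): v for k, v in left.items()}
matched = {k: right[k] for k in cleaned if k in right}
print(matched)  # {'NL': 17000000, 'BR': 214000000}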
c) FIXING CAPITAL LETTER MISMATCHES
➢ Capital letter mismatches are common.
➢ Most programming languages make a distinction between “Brazil” and “brazil”.
➢ In this case you can solve the problem by applying a function that returns both strings in lowercase, such as lower() in Python: "Brazil".lower() == "brazil".lower() evaluates to True.
TRANSFORMING DATA
Data transformation is the process of converting data from one format to another, typically from
the format of a source system into the required format of a destination system.
➢ APPENDING TABLES
➢ Appending tables means adding the observations of one table to another table with the same structure (stacking rows).
➢ Exploratory Data Analysis (EDA) is a robust technique for familiarizing yourself with
Data and extracting useful insights.
➢ Data Scientists analyze Unstructured Data to find patterns and interrelationships between Data elements.
➢ Data Scientists use Statistics and Visualization tools to summarize Central Measurements and variability to perform EDA.
➢ If Data skewness persists, appropriate transformations are used to scale the distribution
around its mean.
➢ When Datasets have a lot of features, exploring them can be difficult.
➢ As a result, to reduce the complexity of Model inputs, Feature Selection is used to rank
them in order of significance in Model Building for enhanced efficiency.
➢ Using Business Intelligence tools like Tableau, MicroStrategy, etc. can be quite
beneficial in this step.
➢ This step is crucial in Data Science Modeling as the Metrics are studied carefully for
validation of Data Outcomes.
I. Supervised Learning
It is based on the results of a previous operation that is related to the existing business
operation. Based on previous patterns, Supervised Learning aids in the prediction of an
outcome. Some of the Supervised Learning Algorithms are:
➢ Linear Regression
➢ Random Forest
➢ Support Vector Machines
II. Unsupervised Learning
This form of learning has no pre-existing consequence or pattern. Instead, it concentrates on
examining the interactions and connections between the presently available Data points.
Some of the Unsupervised Learning Algorithms are:
➢ KNN (k-Nearest Neighbors)
➢ K-means Clustering
➢ Hierarchical Clustering
➢ Anomaly Detection
III. Reinforcement Learning
➢ It is a fascinating Machine Learning technique that uses a dynamic Dataset that interacts
with the real world. In simple terms, it is a mechanism by which a system learns from its
mistakes and improves over time. Some of the Reinforcement Learning Algorithms are:
➢ Q-Learning
➢ State-Action-Reward-State-Action (SARSA)
➢ Deep Q Network
➢ The Data Model is applied to the Test Data to check if it’s accurate and houses all
desirable features.
➢ You can further test your Data Model to identify any adjustments that might be required
to enhance the performance and achieve the desired results.
➢ If the required precision is not achieved, you can go back to Step 5 (Machine Learning
Algorithms), choose an alternate Data Model, and then test the model again.
➢ The Model which provides the best result based on test findings is completed and
deployed in the production environment whenever the desired result is achieved
through proper testing as per the business needs.
➢ This concludes the process of Data Science Modeling.
Summary
In this chapter you learned the data science process consists of six steps:
■ Setting the research goal—Defining the what, the why, and the how of your project in a project charter.
■ Retrieving data—Finding and getting access to the data needed in your project. This data is either found within the company or retrieved from a third party.
■ Data preparation—Checking and remediating data errors, enriching the data with data from other data sources, and transforming it into a suitable format for your models.
■ Data exploration—Diving deeper into your data using descriptive statistics and visual
techniques.
■ Data modeling—Using machine learning and statistical techniques to achieve your
project goal.
■ Presentation and automation—Presenting your results to the stakeholders and industrializing your analysis process for repetitive reuse and integration with other tools.
Bit         1 Bit
Byte        8 Bits
Kilobyte    1,024 Bytes
Megabyte    1,024 Kilobytes
Gigabyte    1,024 Megabytes
Terabyte    1,024 Gigabytes
Petabyte    1,024 Terabytes
Exabyte     1,024 Petabytes
Zettabyte   1,024 Exabytes
Yottabyte   1,024 Zettabytes
What is Big Data?
➢ Big Data is a collection of data that is huge in volume, yet growing exponentially with
time.
➢ It is data of such large size and complexity that none of the traditional data management tools can store or process it efficiently.
➢ In short, big data is simply data of very large size.
➢ Digital data can be structured, semi-structured or unstructured data.
1. Unstructured data: This is data which does not conform to a data model or is not in a form which can be used easily by a computer program. About 80% of an organization's data is in this format; for example, memos, chat rooms, PowerPoint presentations, images, videos, letters, research papers, the body of an email, etc.
2. Semi-structured data: Semi-structured data is also referred to as having a self-describing structure. This is data which does not conform to a data model but has some structure.
The whole process is described in the following six steps and depicted in figure 5.4.
➢ Reading the input files.
➢ Passing each line to a mapper job.
➢ The mapper job parses the colors (keys) out of the file and outputs a file for each color with the number of times it has been encountered (value). Or, more technically said, it maps a key (the color) to a value (the number of occurrences).
➢ The keys get shuffled and sorted to facilitate the aggregation.
➢ The reduce phase sums the number of occurrences per color and outputs one file per key with the total number of occurrences for each color.
➢ The keys are collected in an output file.
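A minimal Python sketch of the same map, shuffle/sort, and reduce phases (the list of colors stands in for the input files and is purely hypothetical):

from collections import defaultdict

lines = ["green", "blue", "green", "red", "blue", "green"]  # stand-in for the input files

# Map phase: emit a (color, 1) pair for every occurrence.
mapped = [(color, 1) for color in lines]

# Shuffle/sort phase: group the pairs by key (the color).
grouped = defaultdict(list)
for color, count in sorted(mapped):
    grouped[color].append(count)

# Reduce phase: sum the occurrences per color.
reduced = {color: sum(counts) for color, counts in grouped.items()}
print(reduced)  # {'blue': 2, 'green': 3, 'red': 1}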
Case Studies:
Scientific explorations:
➢ The data collected from various sensors are analyzed to extract the useful
information for societal benefits.
➢ For example, physics and astronomical experiments – a large number of scientists
collaborating for designing, operating, and analyzing the products of sensor
networks and detectors for scientific studies.
➢ Earth observation systems – information gathering and analytical approaches
about earth’s physical, chemical, and biological systems via remote-sensing
technologies.
➢ To improve social and economic well-being and its applications for weather
forecasting, monitoring, and responding to natural disasters, climate change
predictions, and so on.
Health care:
➢ Healthcare organizations would like to predict the locations from where the
diseases are spreading so as to prevent further spreading.
➢ However, predicting the exact origin of a disease is not possible until there is statistical data from several locations.
➢ In 2009, when a new flu virus similar to H1N1 was spreading, Google predicted its spread by looking at what people were searching for on the Internet, and published a paper on it in the scientific journal Nature.
Governance:
➢ Surveillance systems analyzing and classifying streaming acoustic signals; transportation departments using real-time traffic data to predict traffic patterns and update public transportation schedules.
➢ Security departments analyzing images from aerial cameras, news feeds, and social networks for items of interest.
➢ Social program agencies gaining a clearer understanding of beneficiaries and proper payments; tax agencies identifying fraudsters and supporting investigations by analyzing complex identity information and tax returns.
➢ Sensor applications that stream air, water, and temperature data to support cleanup, fire prevention, and other programs.
Web analytics:
➢ Several websites are experiencing millions of unique visitors per day, in turn
creating a large range of content.
➢ Increasingly, companies want to be able to mine this data to understand
limitations of their sites, improve response time, offer more targeted ads, and so
on.
➢ This requires tools to perform complicated analytics on data that far exceeds the memory of a single machine or even of a cluster of machines.
Unit – III
Machine Learning
Introduction to Machine Learning
➢ Machine learning is a growing technology which enables computers to learn
automatically from past data.
➢ Machine learning uses various algorithms for building mathematical models and
making predictions using historical data or information.
➢ Currently, it is being used for various tasks such as image recognition, speech recognition, email filtering, Facebook auto-tagging, recommender systems, and many more.
What is Machine Learning
➢ In the real world, we are surrounded by humans who can learn everything from their
experiences with their learning capability, and we have computers or machines which
work on our instructions.
➢ But can a machine also learn from experiences or past data like a human does? So here
comes the role of Machine Learning.
Case Tools:
➢ Currently, machine learning is used in self-driving cars, cyber fraud detection, face
recognition, and friend suggestion by Facebook, etc.
➢ Various top companies such as Netflix and Amazon have built machine learning models that use vast amounts of data to analyze user interests and recommend products accordingly.
Following are some key points which show the importance of Machine Learning:
➢ Rapid increase in the production of data
➢ Solving complex problems which are difficult for a human
➢ Decision making in various sectors, including finance
➢ Finding hidden patterns and extracting useful information from data.
The Modeling Process
➢ The modeling phase consists of four steps:
➢ 1. Feature engineering and model selection
➢ 2. Training the model
➢ 3. Model validation and selection
➢ 4. Applying the trained model to unseen data
Feature engineering and model selection:
➢ There are several models that you can choose according to the objective that you might
have:
➢ You will use algorithms for classification, prediction, linear regression, clustering (e.g., k-means or K-Nearest Neighbors), deep learning (e.g., neural networks), Bayesian methods, etc.
➢ There are various models to be used depending on the data you are going to process such
as images, sound, text, and numerical values.
In the following table, we will see some models and their applications that you can
apply in your projects:
Model       Applications
K-means     Segmentation
1. Regression
Regression algorithms are used if there is a relationship between the input variable and
the output variable. It is used for the prediction of continuous variables, such as Weather
forecasting, Market Trends, etc. Below are some popular Regression algorithms which
come under supervised learning:
➢ Linear Regression
➢ Regression Trees
➢ Non-Linear Regression
➢ Bayesian Linear Regression
➢ Polynomial Regression
2. Classification
Classification algorithms are used when the output variable is categorical, which means
there are two classes such as Yes-No, Male-Female, True-false, etc.
Example : Spam Filtering
➢ Random Forest
➢ Decision Trees
➢ Logistic Regression
➢ Support vector Machines
Linear Regression:
➢ Linear regression is a statistical regression method which is used for predictive analysis.
➢ It is one of the very simple and easy algorithms which works on regression and shows
the relationship between the continuous variables.
➢ Linear regression shows the linear relationship between the independent variable (X-
axis) and the dependent variable (Y-axis), hence called linear regression.
➢ If there is only one input variable (x), then such linear regression is called simple linear
regression. And if there is more than one input variable, then such linear regression is
called multiple linear regression.
➢ The relationship between variables in the linear regression model can be explained
using the below image. Here we are predicting the salary of an employee on the basis
of the year of experience.
➢ The equation for polynomial regression is also derived from the linear regression equation: the linear regression equation Y = b0 + b1x is transformed into the polynomial regression equation Y = b0 + b1x + b2x^2 + b3x^3 + ... + bnx^n.
➢ Here Y is the predicted/target output, b0, b1, ..., bn are the regression coefficients, and x is our independent/input variable.
➢ The model is still linear because it is linear in the coefficients, even though it contains quadratic and higher-order terms of x.
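A minimal scikit-learn sketch of simple linear regression (the years-of-experience and salary figures below are made up for illustration; the notes only describe the scenario):

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data: years of experience vs. salary.
X = np.array([[1], [2], [3], [4], [5]])             # years of experience
y = np.array([30000, 35000, 41000, 46000, 52000])   # salary

model = LinearRegression()
model.fit(X, y)

# b0 and b1 in Y = b0 + b1x, and a prediction for 6 years of experience.
print(model.intercept_, model.coef_)
print(model.predict([[6]]))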
Decision Trees:
➢ Decision Tree is a supervised learning algorithm which can be used for solving both
classification and regression problems.
➢ It can solve problems for both categorical and numerical data
➢ Decision Tree regression builds a tree-like structure in which each internal node represents a "test" on an attribute, each branch represents the result of the test, and each leaf node represents the final prediction (outcome).
Random Forest:
➢ Random forest is one of the most powerful supervised learning algorithms which is
capable of performing regression as well as classification tasks.
➢ The Random Forest regression is an ensemble learning method which combines
multiple decision trees and predicts the final output based on the average of each tree
output.
➢ The combined decision trees are called base models, and the ensemble can be represented more formally as:
g(x) = f0(x) + f1(x) + f2(x) + ....
With the help of Random Forest regression, we can prevent Overfitting in the model by
creating random subsets of the dataset.
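A minimal scikit-learn sketch of Random Forest regression (the generated dataset and parameter values are illustrative assumptions):

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Hypothetical regression dataset.
X, y = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=0)

# Each tree is trained on a random subset of the data, which helps prevent overfitting.
forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X, y)

# The final output is based on the average of the individual tree predictions.
print(forest.predict(X[:3]))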
Logistic Regression:
➢ Logistic regression is another supervised learning algorithm which is used to solve the
classification problems. In classification problems, we have dependent variables in a
binary or discrete format such as 0 or 1.
➢ Logistic regression algorithm works with the categorical variable such as 0 or 1, Yes or
No, True or False, Spam or not spam, etc.
➢ It is a predictive analysis algorithm which works on the concept of probability.
➢ Logistic regression uses the sigmoid function or logistic function, which is a complex cost function. This sigmoid function is used to model the data in logistic regression. The function can be represented as: f(x) = 1 / (1 + e^(-x)).
➢ It uses the concept of threshold levels: values above the threshold level are rounded up to 1, and values below the threshold level are rounded down to 0.
➢ There are three types of logistic regression:
➢ Binary(0/1, pass/fail)
➢ Multi(cats, dogs, lions)
➢ Ordinal(low, medium, high)
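A minimal scikit-learn sketch of binary logistic regression (the tiny hours-studied / pass-fail dataset is hypothetical):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: hours studied vs. pass (1) / fail (0).
X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression()
clf.fit(X, y)

# predict_proba applies the sigmoid; predict applies the threshold (0.5 by default).
print(clf.predict_proba([[3.5]]))
print(clf.predict([[3.5]]))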
Support Vector Machine:
➢ Support Vector Machine or SVM is one of the most popular Supervised Learning
algorithms, which is used for Classification as well as Regression problems. However,
primarily, it is used for Classification problems in Machine Learning.
➢ The goal of the SVM algorithm is to create the best line or decision boundary that can
segregate n-dimensional space into classes so that we can easily put the new data point
in the correct category in the future. This best decision boundary is called a hyperplane.
➢ SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called support vectors, and hence the algorithm is termed Support Vector Machine.
➢ Consider the below diagram in which there are two different categories that are
classified using a decision boundary or hyperplane:
➢ Here, the Solid line is called hyperplane, and the other two lines are known as boundary
lines.
Example:
➢ SVM can be understood with the example that we have used in the KNN classifier.
➢ Suppose we see a strange cat that also has some features of dogs, so if we want a model
that can accurately identify whether it is a cat or dog, so such a model can be created by
using the SVM algorithm.
➢ We will first train our model with lots of images of cats and dogs so that it can learn
about different features of cats and dogs, and then we test it with this strange creature.
➢ So the SVM creates a decision boundary between these two classes (cat and dog) and chooses the extreme cases (support vectors) of cats and dogs.
➢ On the basis of the support vectors, it will classify it as a cat. Consider the below
diagram:
➢ SVM algorithm can be used for Face detection, image classification, text
categorization, etc.
Types of SVM
SVM can be of two types:
➢ Linear SVM: Linear SVM is used for linearly separable data, which means that if a dataset can be classified into two classes by using a single straight line, then such data is termed linearly separable data, and the classifier used is called a Linear SVM classifier.
➢ Non-linear SVM: Non-Linear SVM is used for non-linearly separable data, which means that if a dataset cannot be classified by using a straight line, then such data is termed non-linear data, and the classifier used is called a Non-linear SVM classifier.
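A minimal scikit-learn sketch of a linear SVM classifier (the toy two-class dataset and the kernel choice are illustrative assumptions):

from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Hypothetical linearly separable data with two classes.
X, y = make_blobs(n_samples=100, centers=2, random_state=0)

# Linear SVM: finds the hyperplane that separates the classes with the maximum margin.
clf = SVC(kernel="linear")
clf.fit(X, y)

print(clf.support_vectors_[:3])   # the extreme points that define the hyperplane
print(clf.predict(X[:5]))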
➢ Here, we have taken unlabeled input data, which means it is not categorized and the corresponding outputs are also not given.
➢ Now, this unlabeled input data is fed to the machine learning model in order to train it.
➢ Firstly, it will interpret the raw data to find the hidden patterns from the data and then
will apply suitable algorithms such as k-means clustering, Decision tree, etc.
Types of Unsupervised Learning Algorithm:
➢ The unsupervised learning algorithm can be further categorized into two types of
problems:
Clustering: Clustering is a method of grouping objects into clusters such that objects with the most similarities remain in a group and have few or no similarities with the objects of another group. Cluster analysis finds the commonalities between the data objects and categorizes them as per the presence and absence of those commonalities.
Association: An association rule is an unsupervised learning method which is used for finding relationships between variables in a large database. It determines the set of items that occur together in the dataset. Association rules make marketing strategy more effective; for example, people who buy item X (say, bread) also tend to purchase item Y (butter/jam). A typical example of an association rule is Market Basket Analysis.
Unsupervised Learning algorithms:
Below is the list of some popular unsupervised learning algorithms:
➢ K-means clustering
➢ KNN (k-nearest neighbors)
➢ Hierarchical clustering
➢ Anomaly detection
➢ Neural Networks
➢ Principal Component Analysis
➢ Independent Component Analysis
➢ Apriori algorithm
➢ Singular value decomposition
Difference between Supervised and Unsupervised Learning
Supervised Learning:
➢ The supervised learning model takes direct feedback to check if it is predicting the correct output or not.
➢ The supervised learning model predicts the output.
Unsupervised Learning:
➢ The unsupervised learning model does not take any feedback.
➢ The unsupervised learning model finds the hidden patterns in data.
➢ Let's take number k of clusters, i.e., K=2, to identify the dataset and to put them into
different clusters. It means here we will try to group these datasets into two different
clusters.
➢ We need to choose some random k points or centroid to form the cluster. These points
can be either the points from the dataset or any other point. So, here we are selecting the
below two points as k points, which are not the part of our dataset. Consider the below
image:
➢ The performance of the K-means clustering algorithm depends upon highly efficient
clusters that it forms.
➢ But choosing the optimal number of clusters is a big task.
➢ There are some different ways to find the optimal number of clusters, but here we are
discussing the most appropriate method to find the number of clusters or value of K.
➢ The method is given below:
Elbow Method
➢ The Elbow method is one of the most popular ways to find the optimal number of
clusters.
➢ This method uses the concept of WCSS value. WCSS stands for Within Cluster Sum of
Squares, which defines the total variations within a cluster.
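A minimal scikit-learn sketch of the elbow method (the generated dataset and the range of k values are illustrative assumptions; inertia_ is scikit-learn's name for the WCSS of a fitted model):

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Hypothetical dataset whose natural number of clusters is unknown in advance.
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Fit K-means for k = 1..10 and record the WCSS (inertia) for each k.
wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0)
    km.fit(X)
    wcss.append(km.inertia_)

# Plotting wcss against k shows an "elbow"; the k at the bend is a good choice.
for k, value in enumerate(wcss, start=1):
    print(k, round(value, 1))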
Steps of the Apriori algorithm (worked on an example transaction dataset):
➢ Step-1: Calculating C1 and L1:
➢ In the first step, we will create a table that contains support count (The frequency of each
itemset individually in the dataset) of each itemset in the given dataset. This table is
called the Candidate set or C1.
➢ Now, we will take out all the itemsets that have a greater support count than the Minimum Support (2). It will give us the table for the frequent itemset L1.
➢ Since all the itemsets have a support count greater than or equal to the minimum support except E, the E itemset will be removed.
➢ Step-2: Candidate Generation C2, and L2:
➢ In this step, we will generate C2 with the help of L1. In C2, we will create the pair of the
itemsets of L1 in the form of subsets.
➢ After creating the subsets, we will again find the support count from the main
transaction table of datasets, i.e., how many times these pairs have occurred together in
the given dataset. So, we will get the below table for C2:
➢ Again, we need to compare the C2 Support count with the minimum support count, and
after comparing, the itemset with less support count will be eliminated from the table
C2. It will give us the below table for L2
Now we will create the L3 table. As we can see from the above C3 table, there is only one
combination of itemset that has support count equal to the minimum support count. So,
the L3 will have only one combination, i.e., {A, B, C}.
Step-4: Finding the association rules for the subsets:
➢ To generate the association rules, first, we will create a new table with the possible rules from the occurred combination {A, B, C}. For all the rules, we will calculate the Confidence using the formula sup(A ∧ B) / sup(A). After calculating the confidence value for all rules, we will exclude the rules that have less confidence than the minimum threshold (50%).
➢ Consider the below table:
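A minimal pure-Python sketch of the support and confidence computations behind such a rule table (the five transactions below are hypothetical, chosen only so that {A, B, C} is frequent at a minimum support of 2):

from itertools import combinations

# Hypothetical transaction dataset.
transactions = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B", "C"},
    {"A", "B", "C", "D"},
]

def support_count(itemset):
    # Number of transactions that contain every item of the itemset.
    return sum(1 for t in transactions if itemset <= t)

def confidence(antecedent, consequent):
    # Rule X -> Y: confidence = sup(X and Y) / sup(X).
    return support_count(antecedent | consequent) / support_count(antecedent)

frequent = {"A", "B", "C"}
for r in range(1, len(frequent)):
    for items in combinations(sorted(frequent), r):
        X = set(items)
        Y = frequent - X
        print(f"{sorted(X)} -> {sorted(Y)}: confidence = {confidence(X, Y):.2f}")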
➢ In the example given above, we provide the raw data of images to the first layer of the
input layer.
➢ Then, this input layer will determine the patterns of local contrast, which means it will differentiate on the basis of colors, luminosity, etc.
➢ Then the 1st hidden layer will determine the face feature, i.e., it will fixate on eyes, nose,
and lips, etc. And then, it will fixate those face features on the correct face template.
➢ So, in the 2nd hidden layer, it will actually determine the correct face here as it can be
seen in the above image, after which it will be sent to the output layer.
➢ Likewise, more hidden layers can be added to solve more complex problems, for
example, if you want to find out a particular kind of face having large or light
complexions.
➢ So, as and when the hidden layers increase, we are able to solve complex problems.
Deep Feedforward Networks
➢ The simplest form of neural network, in which input data travels in one direction only, passing through artificial neural nodes and exiting through output nodes.
➢ Hidden layers may or may not be present, but input and output layers are always present.
➢ Based on this, they can be further classified as a single-layered or multi-layered feed-
forward neural network.
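A minimal NumPy sketch of a single-hidden-layer feed-forward pass (the layer sizes, random weights, and choice of ReLU activation are illustrative assumptions, not taken from the notes):

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical network: 3 inputs -> 4 hidden units -> 1 output.
W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)

def relu(z):
    return np.maximum(0, z)

def forward(x):
    # Data flows in one direction only: input -> hidden layer -> output.
    hidden = relu(x @ W1 + b1)
    return hidden @ W2 + b2

x = np.array([0.5, -1.2, 3.0])
print(forward(x))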
➢ The equation for the cost function in ridge regression will be: Cost = Σ(yi − ŷi)² + λ Σ(bj)², i.e., the sum of squared errors plus an L2 penalty on the coefficients.
➢ In the above equation, the penalty term regularizes the coefficients of the model, and
hence ridge regression reduces the amplitudes of the coefficients that decreases the
complexity of the model.
➢ As we can see from the above equation, if the values of λ tend to zero, the equation
becomes the cost function of the linear regression model. Hence, for the minimum value
of λ, the model will resemble the linear regression model.
Lasso Regression:
➢ Lasso regression is another regularization technique to reduce the complexity of the model. It stands for Least Absolute Shrinkage and Selection Operator.
➢ It is similar to the Ridge Regression except that the penalty term contains only the
absolute weights instead of a square of weights.
➢ Since it takes absolute values, hence, it can shrink the slope to 0, whereas Ridge
Regression can only shrink it near to 0.
➢ It is also called L1 regularization. The equation for the cost function of Lasso regression will be: Cost = Σ(yi − ŷi)² + λ Σ|bj|.
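A minimal scikit-learn sketch comparing the two penalties (the generated data and the alpha values, scikit-learn's name for λ, are illustrative assumptions):

import numpy as np
from sklearn.linear_model import Ridge, Lasso

# Hypothetical data where only the first feature really matters.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=100)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

# Ridge shrinks the irrelevant coefficients toward 0; Lasso can set them exactly to 0.
print("Ridge:", np.round(ridge.coef_, 3))
print("Lasso:", np.round(lasso.coef_, 3))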
Convolutional Networks:
➢ Convolution neural network contains a three-dimensional arrangement of neurons,
instead of the standard two-dimensional array.
➢ The first layer is called a convolutional layer. Each neuron in the convolutional layer
only processes the information from a small part of the visual field.
➢ Input features are taken in batch-wise like a filter.
➢ The network understands the images in parts and can compute these operations
multiple times to complete the full image processing.
➢ Processing involves conversion of the image from RGB or HSI scale to grey-scale.
➢ Furthering the changes in the pixel value will help to detect the edges and images can be
classified into different categories.
➢ Propagation is uni-directional where the CNN contains one or more convolutional layers followed by pooling, and bidirectional where the output of the convolution layer goes to a fully connected neural network for classifying the images, as shown in the above diagram.
➢ Filters are used to extract certain parts of the image. In MLP the inputs are multiplied
with weights and fed to the activation function.
➢ Convolution uses ReLU (Rectified Linear Unit), and the MLP uses a nonlinear activation function followed by softmax. Convolutional neural networks show very effective results in image and video recognition, semantic parsing, and paraphrase detection.
Applications on Convolution Neural Network
➢ Image processing
➢ Computer Vision
➢ Speech Recognition
➢ Machine translation
Advantages of Convolution Neural Network:
➢ Used for deep learning with few parameters
➢ Less parameters to learn as compared to fully connected layer
Disadvantages of Convolution Neural Network:
➢ Comparatively complex to design and maintain
➢ Comparatively slow [depends on the number of hidden layers]
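A minimal Keras sketch of a small CNN for image classification (the input shape, layer sizes, and number of classes are illustrative assumptions):

from tensorflow import keras
from tensorflow.keras import layers

# Hypothetical task: classify 28x28 grayscale images into 10 classes.
model = keras.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(16, kernel_size=3, activation="relu"),  # convolutional layer with ReLU
    layers.MaxPooling2D(pool_size=2),                     # pooling reduces the spatial size
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),               # fully connected layer + softmax
])

model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()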
Recurrent Neural Networks
➢ A Recurrent Neural Network is designed to save the output of a layer and feed it back to the input, to help in predicting the outcome of the layer.
➢ The first layer is typically a feed forward neural network followed by recurrent neural
network layer where some information it had in the previous time-step is remembered
by a memory function.
➢ Forward propagation is implemented in this case. It stores the information required for its future use. If the prediction is wrong, the learning rate is employed to make small changes.
➢ Hence, it gradually moves towards making the right prediction during backpropagation.
4. Set the filter values. How you set the values depends upon the data type that you’re
filtering.
➢ To set filters on columns such as Cost or Quantity Ordered,
➢ To set filters on columns such as Product Category or Product Name,
➢ To set filters on columns such as Ship Date or Order Date
Dashboard Development Tools:
➢ There are tools which help you to visualize all your data. They are already there; all you need to do is pick the right data visualization tool as per your requirements.
➢ Data visualization allows you to interact with data. Google, Apple, Facebook, and Twitter all ask better questions of their data and make better business decisions by using data visualization.
➢ Here are the top 10 data visualization tools that help you to visualize the data:
1. Tableau
➢ Tableau is a data visualization tool. You can create graphs, charts, maps, and many other
graphics.
➢ A Tableau desktop app is available for visual analytics. If you don't want to install Tableau software on your desktop, then a server solution allows you to visualize your reports online and on mobile.
2. Infogram
➢ Infogram is also a data visualization tool. Working with it takes a few simple steps:
➢ First, you choose among many templates, personalize them with additional visualizations
like maps, charts, videos, and images.
➢ Then you are ready to share your visualization.
➢ Infogram supports team accounts for journalists and media publishers, classroom accounts for educational projects, and branded designs for companies and enterprises.
3. Chartblocks
Chartblocks is an easy-to-use online tool which requires no coding and builds visualizations from databases, spreadsheets, and live feeds.
4. Datawrapper
➢ Datawrapper is an easy visualization tool, and it requires zero coding. You can upload your data and easily create and publish a map or a chart. Custom layouts to integrate your visualizations perfectly on your site and access to local area maps are also available.
5. Plotly
Plotly will help you to create a slick and sharp chart in just a few minutes, starting from a simple spreadsheet.
6. RAW
RAW creates the missing link between spreadsheets and vector graphics on its home page.
Your Data can come from Google Docs, Microsoft Excel, Apple Numbers, or a simple
comma-separated list.
7. Visual.ly
Visual.ly is a visual content service. It has a dedicated data visualization service, and its impressive portfolio includes work for Nike, VISA, Twitter, Ford, The Huffington Post, and National Geographic.
8. D3.js
D3.js is one of the best data visualization libraries for manipulating documents. D3.js runs on JavaScript, and it uses CSS, HTML, and SVG. D3.js is open source and applies data-driven transformations to a webpage. It is applied when the data is in JSON or XML files.
9. Ember Charts
Ember Charts is based on the Ember.js and D3.js frameworks and uses D3.js under the hood. It is also applied when the data is in JSON or XML files.
10. NVD3
NVD3 is a project that attempts to build reusable charts and components. The project aims to keep all your charts neat and customizable.