
Data Science (20AI401)

DEPARTMENT OF COMPUTER
SCIENCE & ENGINEERING

LECTURE NOTES
ON

DATA SCIENCE

COURSE CODE : 20AI401


BRANCH/YEAR : AI/II
SEMESTER : IV


SYLLABUS
UNIT-I INTRODUCTION TO DATA SCIENCE
Definition — Big Data and Data Science Hype — Why data science — Getting Past the Hype —
The Current Landscape — Data Scientist - Data Science Process Overview — Defining goals —
Retrieving data — Data preparation — Data exploration — Data modeling — Presentation.

UNIT-II BIG DATA


Problems when handling large data — General techniques for handling large data — Case study —
Steps in big data — Distributing data storage and processing with Frameworks — Case study.
UNIT-III MACHINE LEARNING
Machine learning — Modeling Process — Training model — Validating model — Predicting new
observations —Supervised learning algorithms — Unsupervised learning algorithms.

UNIT-IV DEEP LEARNING


Introduction — Deep Feedforward Networks — Regularization — Optimization of Deep Learning
Convolutional Networks — Recurrent and Recursive Nets — Applications of Deep Learning.
UNIT-V DATA VISUALIZATION
Introduction to data visualization — Data visualization options — Filters — MapReduce —
Dashboard development tools — Creating an interactive dashboard with dc.js-summary.
Text Books:
1. Introducing Data Science - Davy Cielen, Arno D. B. Meysman, Mohamed Ali - Manning
Publications Co. - 1st edition - 2016
2. An Introduction to Statistical Learning: with Applications in R - Gareth James, Daniela Witten,
Trevor Hastie, Robert Tibshirani - Springer - 1st edition - 2013
3. Deep Learning - Ian Goodfellow, Yoshua Bengio, Aaron Courville - MIT Press - 1st edition -
2016.
Reference Books:
1. Data Science from Scratch: First Principles with Python - Joel Grus - O’Reilly - 1st edition - 2015
2. Doing Data Science: Straight Talk from the Frontline - Cathy O'Neil, Rachel Schutt - O’Reilly - 1st
edition - 2013
3. Mining of Massive Datasets - Jure Leskovec, Anand Rajaraman, Jeffrey David Ullman -
Cambridge University Press - 2nd edition - 2014


UNIT-I INTRODUCTION TO DATA SCIENCE


What is Data Science?
➢ Data science enables businesses to process huge amounts of structured and unstructured
big data to detect patterns.
➢ This in turn allows companies to increase efficiency, manage costs, identify new market
opportunities, and boost their market advantage.
➢ Asking a personal assistant like Alexa or Siri for a recommendation demands data
science.
➢ So does operating a self-driving car, using a search engine that provides useful results, or
talking to a chatbot for customer service.
➢ These are all real-life applications of data science.

Data Science Definition


➢ Data science is the practice of mining large data sets of raw data, both structured and
unstructured, to identify patterns and extract actionable insight from them.
➢ This is an interdisciplinary field, and the foundations of data science include statistics,
inference, computer science, predictive analytics, machine learning algorithm
development, and new technologies to gain insights from big data.
➢ The first stage in the data science pipeline workflow involves capture: acquiring data,
sometimes extracting it, and entering it into the system.
➢ The next stage is maintenance, which includes data warehousing, data cleansing, data
processing, data staging, and data architecture.
➢ Data processing follows, and constitutes one of the data science fundamentals. It is
during data exploration and processing that data scientists stand apart from data
engineers. This stage involves data mining, data classification and clustering, data
modeling, and summarizing insights gleaned from the data—the processes that create
effective data.
➢ Next comes data analysis, an equally critical stage. Here data scientists conduct
exploratory and confirmatory work, regression, predictive analysis, qualitative analysis,
and text mining. This stage is why there is no such thing as cookie cutter data science—
when it’s done properly.
➢ During the final stage, the data scientist communicates insights. This involves data
visualization, data reporting, the use of various business intelligence tools, and assisting
businesses, policymakers, and others in smarter decision making.

Big Data and Data Science Hype


Introduction
Over the past few years, there’s been a lot of hype in the media about “data science” and “Big
Data.” A reasonable first reaction to all of this might be some combination of skepticism and
confusion.
Big Data and Data Science Hype
➢ Many of you are likely skeptical of data science already for many of the reasons we were.
➢ We want to address this up front to let you know: we’re right there with you.
➢ If you’re a skeptic too, it probably means you have something useful to contribute to
making data science into a more legitimate field that has the power to have a positive
impact on society.
So, what is eyebrow-raising about Big Data and data science? Let’s count the ways:
➢ There’s a lack of definitions around the most basic terminology.
➢ What is “Big Data” anyway?
➢ What does “data science” mean?
➢ What is the relationship between Big Data and data science?
➢ Is data science the science of Big Data?
➢ Is data science only the stuff going on in companies like Google and Facebook and tech
companies?
➢ Why do many people refer to Big Data as crossing disciplines (astronomy, finance, tech,
etc.) and to data science as only taking place in tech?
➢ Just how big is big? Or is it just a relative term?
➢ These terms are so ambiguous, they’re well-nigh meaningless.
➢ There’s a distinct lack of respect for the researchers in academia and industry labs who
have been working on this kind of stuff for years, and whose work is based on decades
(in some cases, centuries) of work by statisticians, computer scientists, mathematicians,
engineers, and scientists of all types.
➢ From the way the media describes it, you would think machine learning algorithms and data were
never “big” until Google came along.
➢ This is simply not the case. Many of the methods and techniques we’re using—and the
challenges we’re facing now—are part of the evolution of everything that’s come before.
➢ The hype is crazy—people throw around tired phrases straight out of the height of the

pre-financial crisis era like “Masters of the Universe” to describe data scientists, and that
doesn’t bode well.
➢ In general, hype masks reality and increases the noise-to-signal ratio.
➢ The longer the hype goes on, the more many of us will get turned off by it, and the
harder it will be to see what’s good underneath it all, if anything.
➢ Statisticians already feel that they are studying and working on the “Science of Data.”
➢ Although we will make the case that data science is not just a rebranding of statistics or
machine learning but rather a field unto itself,
➢ the media often describes data science in a way that makes it sound as if it’s simply
statistics or machine learning in the context of the tech industry.
Getting Past the Hype
➢ It’s generally true that, whenever you go from school to a real job, you realize there’s a
gap between what you learned in school and what you do on the job.
➢ In other words, you were simply facing the difference between academic statistics and
industry statistics.
➢ Even so, the gap doesn’t represent simply a difference between industry statistics and
academic statistics.
➢ The general experience of data scientists is that, at their job, they have access to a larger
body of knowledge and methodology, as well as a process, which we now define as the
data science process that has foundations in both statistics and computer science.
Why Now?
➢ We have massive amounts of data about many aspects of our lives, and, simultaneously,
an abundance of inexpensive computing power. Shopping, communicating, reading
news, listening to music, searching for information, expressing our opinions—all this is
being tracked online, as most people know.
➢ It’s not just Internet data, though—it’s finance, the medical industry, pharmaceuticals,
bioinformatics, social welfare, government, education, retail, and the list goes on. There
is a growing influence of data in most sectors and most industries. In some cases, the
amount of data collected might be enough to be considered “big”; in other cases, it’s not.
➢ But it’s not only the massiveness that makes all this new data interesting. It’s that the
data itself, often in real time, becomes the building blocks of data products.
➢ On the Internet, this means Amazon recommendation systems, friend recommendations
on Facebook, film and music recommendations, and so on. In finance, this means credit
ratings, trading algorithms, and models.
➢ In education, this is starting to mean dynamic personalized learning and assessments.
➢ In government, this means policies based on data.
➢ We’re witnessing the beginning of a massive, culturally saturated feedback loop where
our behavior changes the product and the product changes our behavior.
➢ Technology makes this possible: infrastructure for large-scale data processing, increased
memory, and bandwidth, as well as a cultural acceptance of technology in the fabric of
our lives. This wasn’t true a decade ago.

➢ Considering the impact of this feedback loop, we should start thinking seriously about
how it’s being conducted, along with the ethical and technical responsibilities for the
people responsible for the process
Datafication
➢ In the May/June 2013 issue of Foreign Affairs, Kenneth Neil Cukier and Viktor Mayer-
Schoenberger wrote an article called “The Rise of Big Data”. In it they discuss the
concept of datafication, and their example is how we quantify friendships with “likes”:
it’s the way everything we do, online or otherwise, ends up recorded for later
examination in someone’s data storage units. Or maybe multiple storage units, and
maybe also for sale.
➢ They define datafication as a process of “taking all aspects of life and turning them into
data.”
➢ As examples, they mention that “Google’s augmented-reality glasses datafy the gaze.
➢ Twitter datafies stray thoughts.
➢ LinkedIn datafies professional networks.”

The Current Landscape:


On Quora there’s a discussion from 2010 about “What is Data Science?” and here’s Metamarkets
CEO Mike Driscoll’s answer:
Data science, as it’s practiced, is a blend of Red-Bull-fueled hacking and espresso-inspired
statistics.
➢ But data science is not merely hacking—because when hackers finish debugging their Bash
one-liners and Pig scripts, few of them care about non-Euclidean distance metrics.
➢ And data science is not merely statistics, because when statisticians finish theorizing the
perfect model, few could read a tab-delimited file into R if their job depended on it.
➢ Data science is the civil engineering of data. Its practitioners possess a practical knowledge of
tools and materials, coupled with a theoretical understanding of what’s possible.
➢ Driscoll then refers to Drew Conway’s Venn diagram of data science.

Data Scientist
We noticed about most of the job descriptions:
➢ they ask data scientists to be experts in computer science, Mathematics, statistics,
Machine Learning, communication and Presentation Skills, data visualization, and to
have extensive domain expertise.
➢ Nobody is an expert in everything, which is why it makes more sense to create teams of
people who have different profiles and different expertise—together, as a team, they can
specialize in all those things.

Step 1: Frame the problem


➢ The first thing you have to do before you solve a problem is to define exactly what it is.
You need to be able to translate data questions into something actionable.
➢ You’ll often get ambiguous inputs from the people who have problems. You need to turn
scarce inputs into actionable outputs–and to ask the questions that nobody else is asking.
➢ A great way to do this is to ask the right questions.
You should ask questions like the following:-
✓ Who are the customers?
✓ Why are they buying our product?
✓ How do we predict if a customer is going to buy our product?
✓ What is different between segments that are performing well and those that are performing
below expectations?
✓ How much money will we lose if we don’t actively sell the product to these groups?

Step 2: Collect the raw data needed for your problem


➢ Once you’ve defined the problem, you’ll need data to give you the insights needed to
turn the problem around with a solution.
➢ This part of the process involves thinking through what data you’ll need and finding
ways to get that data, whether it’s querying internal databases, or purchasing external
datasets.
➢ You might find out that your company stores all of their sales data in a customer
relationship management software platform.
➢ You can export the CRM data in a CSV file for further analysis.
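As a minimal sketch of this retrieval step (the file name and the printed columns are illustrative
assumptions, not taken from these notes), the exported CRM file can be read with the pandas
library for further analysis:

import pandas as pd

# Load the CSV file exported from the (hypothetical) CRM system.
sales = pd.read_csv("crm_sales_export.csv")

# A quick first look at what was retrieved.
print(sales.shape)    # number of rows and columns
print(sales.head())   # first few records
print(sales.dtypes)   # column types, to spot obvious import problems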
Step 3: Process the data for analysis
➢ Now that you have all of the raw data, you’ll need to process it before you can do any
analysis. Oftentimes, data can be quite messy, especially if it hasn’t been well-
maintained.

You’ll want to check for the following common errors:
➢ Missing values, perhaps customers without an initial contact date
➢ Corrupted values, such as invalid entries
➢ Timezone differences, perhaps your database doesn’t take into account the different
timezones of your users
➢ Date range errors, perhaps you’ll have dates that make no sense, such as data registered
from before sales started
➢ You’ll need to look through aggregates of your file rows and columns and sample some
test values to see if your values make sense. If you detect something that doesn’t make
sense, you’ll need to remove that data or replace it with a default value.
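A small sketch of such checks with pandas (the column names and the assumed launch date are
illustrative, not from these notes):

import pandas as pd

sales = pd.read_csv("crm_sales_export.csv")

# Missing values per column, e.g. customers without an initial contact date.
print(sales.isnull().sum())

# Date range errors, e.g. records registered before sales started (launch date is an assumption).
sales["order_date"] = pd.to_datetime(sales["order_date"], errors="coerce")
print(sales[sales["order_date"] < "2015-01-01"])

# Corrupted numeric entries: coerce them to missing, then replace with a default value.
sales["amount"] = pd.to_numeric(sales["amount"], errors="coerce").fillna(0)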

Step 4: Explore the data


➢ When your data is clean, you should start playing with it!
➢ The difficulty here isn’t coming up with ideas to test, it’s coming up with ideas that are
likely to turn into insights.
➢ You’ll have a fixed deadline for your data science project,
➢ so you’ll have to prioritize your questions.
➢ You’ll have to look at some of the most interesting patterns that can help explain why
sales are reduced for this group.
➢ You might notice that they don’t tend to be very active on social media, with few of
them having Twitter or Facebook accounts.
➢ You might also notice that most of them are older than your general audience.
➢ From that you can begin to trace patterns you can analyze more deeply.

Step 5: Perform in-depth analysis


➢ This step of the process is where you’re going to have to apply your statistical,
mathematical and technological knowledge and use all of the data science tools at your
disposal to crunch the data and find every insight you can.
➢ In this case, you might have to create a predictive model that compares your
underperforming group with your average customer. You might find out that the age and
social media activity are significant factors in predicting who will buy the product.
➢ If you’d asked a lot of the right questions while framing your problem, you might realize
that the company has been concentrating heavily on social media marketing efforts, with
messaging that is aimed at younger audiences.
➢ You would know that certain demographics prefer being reached by telephone rather
than by social media. You begin to see how the way the product has been
marketed is significantly affecting sales.
➢ You can now combine all of those qualitative insights with data from your quantitative
analysis to move people to action.

Step 6: Communicate results of the analysis
➢ Proper communication will mean the difference between action and inaction on your
proposals.
➢ You start by explaining the reasons behind the underperformance of the older
demographic.
➢ Then you move to concrete solutions that address the problem: we could shift some
resources from social media to personal calls.
➢ You tie it all together into a narrative that solves the pain of your head of Sales: she now has
clarity on how she can reclaim sales and hit her objectives.
➢ Throughout the data science process, your day-to-day will vary significantly depending
on where you are–and you will definitely receive tasks that fall outside of this standard
process!

Data Science Process Overview

❖ Step 1: Defining research goals and creating a project charter


➢ A project starts by understanding the what, the why, and the how of your project
➢ What does the company expect you to do?
➢ And why does management place such a value on your research?


➢ Is it part of a bigger strategic picture originating from an opportunity someone detected?


➢ Answering these three questions (what, why, how) is the goal of the first phase, so that
everybody knows what to do and can agree on the best course of action.
➢ The outcome should be a clear research goal, a good understanding of the context, well-
defined deliverables, and a plan of action with a timetable.
➢ This information is then best placed in a project charter.
➢ The length and formality can, of course, differ between projects and companies.
A project charter requires teamwork, and your input covers at least the following:
➢ A clear research goal
➢ The project mission and context
➢ How you’re going to perform your analysis
➢ What resources you expect to use
➢ Proof that it’s an achievable project, or proof of concepts
➢ Deliverables and a measure of success
➢ A timeline
Your client can use this information to make an estimation of the project costs and the data
and people required for your project to become a success.

❖ Step 2: Retrieving data


➢ The next step in data science is to retrieve the required data (figure 2.3).
➢ Sometimes you need to go into the field and design a data collection process yourself,
but most of the time you won’t be involved in this step.
➢ Many companies will have already collected and stored the data for you, and what they
don’t have can often be bought from third parties.

➢ Don’t be afraid to look outside your organization for data, because more and more
organizations are making even high-quality data freely available for public and
commercial use.


➢ Data can be stored in many forms, ranging from simple text files to tables in a database.
➢ The objective now is acquiring all the data you need.
➢ This may be difficult, and even if you succeed, data is often like a diamond in the rough:
it needs polishing to be of any use to you.
Step 3: Cleansing, integrating, and transforming data

Cleansing data
➢ Data cleansing is a subprocess of the data science process that focuses on removing
errors in your data so your data becomes a true and consistent representation of the
processes it originates from.
➢ By “true and consistent representation” we imply that at least two types of errors exist.
➢ The first type is the interpretation error, such as when you take the value in your data for
granted, like saying that a person’s age is greater than 300 years.
➢ The second type of error points to inconsistencies between data sources or against your
company’s standardized values.
➢ An example of this class of errors is putting “Female” in one table and “F” in another
when they represent the same thing: that the person is female.
a) DATA ENTRY ERRORS
➢ Data collection and data entry are error-prone processes.
➢ They often require human intervention, and because humans are only human, they make
typos or lose their concentration for a second and introduce an error into the chain.

Most errors of this type are easy to fix with simple assignment statements and if-then-else
rules:
if x == "Godo":
    x = "Good"
if x == "Bade":
    x = "Bad"
b) REDUNDANT WHITESPACE
➢ Whitespaces tend to be hard to detect but cause errors like other redundant characters
would.
➢ Who hasn’t lost a few days in a project because of a bug that was caused by whitespaces
at the end of a string?
➢ You ask the program to join two keys and notice that observations are missing from the
output file.
➢ After looking for days through the code, you finally find the bug.
➢ Most programming languages provide string functions that will remove leading and trailing
whitespace. For instance, in Python you can use the strip() function to remove leading and
trailing spaces.
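A quick illustration (not part of the original example) of how trailing whitespace breaks a join
on keys, and how strip() fixes it:

key_a = "customer42  "          # trailing whitespace introduced by a data entry error
key_b = "customer42"
print(key_a == key_b)           # False: a join on these keys would silently drop the record
print(key_a.strip() == key_b)   # True: after removing leading and trailing whitespace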
c) FIXING CAPITAL LETTER MISMATCHES
➢ Capital letter mismatches are common.
➢ Most programming languages make a distinction between “Brazil” and “brazil”.
➢ In this case you can solve the problem by applying a function that returns both strings in
lowercase, such as lower() in Python.
➢ "Brazil".lower() == "brazil".lower()

➢ should result in True.
d) IMPOSSIBLE VALUES AND SANITY CHECKS
➢ Sanity checks are another valuable type of data check.
➢ Here you check the value against physically or theoretically impossible values such as
people taller than 3 meters or someone with an age of 299 years.
➢ Sanity checks can be directly expressed with rules:
➢ check = 0 <= age <= 120
e) OUTLIERS
➢ An outlier is an observation that seems to be distant from other observations or, more
specifically, one observation that follows a different logic or generative process than the
other observations.
➢ The easiest way to find outliers is to use a plot or a table with the minimum and
maximum values. An example is shown in figure 2.6.
➢ The normal distribution, or Gaussian distribution, is the most common distribution in
natural sciences.
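A minimal sketch of spotting such an outlier with a minimum/maximum check and a simple
rule of thumb based on the standard deviation (the numbers are made up for illustration):

import numpy as np

ages = np.array([23, 25, 31, 28, 35, 29, 27, 299])     # 299 follows a different generative process

# A table with the minimum and maximum quickly reveals suspicious values.
print("min:", ages.min(), "max:", ages.max())

# Rule of thumb: flag values more than two standard deviations away from the mean.
mean, std = ages.mean(), ages.std()
print(ages[np.abs(ages - mean) > 2 * std])              # prints [299]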

f) DEALING WITH MISSING VALUES
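Common ways of dealing with missing values are to omit the affected observations or to impute
a default value such as 0, the column mean, or an estimate from a model. A minimal pandas
sketch with made-up column names:

import pandas as pd

df = pd.DataFrame({"age": [25, None, 31, 40], "income": [30000, 45000, None, 52000]})

dropped = df.dropna()                                   # omit observations with missing values
filled_zero = df.fillna(0)                              # impute a fixed default value
filled_mean = df.fillna(df.mean(numeric_only=True))     # impute the column mean
print(filled_mean)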

g) DEVIATIONS FROM A CODE BOOK
➢ Detecting errors in larger data sets against a code book or against standardized values
can be done with the help of set operations.
➢ A code book is a description of your data, a form of metadata.
➢ It contains things such as the number of variables per observation, the number of
observations, and what each encoding within a variable means.
➢ (For instance “0” equals “negative”, “5” stands for “very positive”.)
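A small sketch of using set operations to flag values that deviate from a code book (the
encodings below are illustrative):

codebook_values = {0, 1, 2, 3, 4, 5}         # allowed encodings, e.g. 0 = "negative", 5 = "very positive"
observed_values = {0, 2, 5, 7, 9}            # distinct values actually found in the data column

invalid = observed_values - codebook_values  # set difference: values not described in the code book
print(invalid)                               # {7, 9} deviate from the code book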

TRANSFORMING DATA
Data transformation is the process of converting data from one format to another, typically from
the format of a source system into the required format of a destination system.

a) THE DIFFERENT WAYS OF COMBINING DATA


➢ JOINING TABLES

➢ APPENDING TABLES


b) USING VIEWS TO SIMULATE DATA JOINS AND APPENDS

c) TURNING VARIABLES INTO DUMMIES
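As a minimal pandas sketch of joining tables, appending tables, and turning variables into
dummies (the table and column names are illustrative assumptions):

import pandas as pd

clients = pd.DataFrame({"client_id": [1, 2, 3], "region": ["North", "South", "North"]})
orders_2022 = pd.DataFrame({"client_id": [1, 2], "amount": [100, 250]})
orders_2023 = pd.DataFrame({"client_id": [2, 3], "amount": [300, 120]})

# Joining tables: enrich the orders with client information through a shared key.
joined = orders_2022.merge(clients, on="client_id", how="left")

# Appending tables: stack observations with the same structure on top of each other.
appended = pd.concat([orders_2022, orders_2023], ignore_index=True)

# Turning variables into dummies: one indicator column per category.
dummies = pd.get_dummies(clients, columns=["region"])
print(joined, appended, dummies, sep="\n\n")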

Step 4: Exploratory data analysis


➢ During exploratory data analysis you take a deep dive into the data, mainly using graphical
techniques to gain an understanding of your data and of the interactions between variables.


➢ Exploratory Data Analysis (EDA) is a robust technique for familiarizing yourself with
data and extracting useful insights.
➢ Data scientists sift through unstructured data to find patterns and interrelationships
between data elements.
➢ Data scientists use statistics and visualization tools to summarize central tendency and
variability when performing EDA.
➢ If data skewness persists, appropriate transformations are used to scale the distribution
around its mean.
➢ When datasets have a lot of features, exploring them can be difficult.
➢ As a result, to reduce the complexity of model inputs, feature selection is used to rank
features in order of significance for model building and enhanced efficiency.
➢ Using business intelligence tools like Tableau, MicroStrategy, etc. can be quite
beneficial in this step.
➢ This step is crucial in data science modeling, as the metrics are studied carefully to
validate the outcomes.
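A brief sketch of this step with pandas and Matplotlib on synthetic data: describe() summarizes
central tendency and variability, and histograms reveal skewness that a log transform can reduce:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic, right-skewed variable standing in for a real feature.
df = pd.DataFrame({"income": np.random.lognormal(mean=10, sigma=0.5, size=1000)})

# Central tendency and variability in one table.
print(df["income"].describe())

# Visual check for skewness; the log transform rescales the distribution around its mean.
fig, axes = plt.subplots(1, 2)
axes[0].hist(df["income"], bins=50)
axes[1].hist(np.log(df["income"]), bins=50)
plt.show()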

Step 5: Data modeling
This is one of the most crucial processes in data science modeling, as the machine learning
algorithm aids in creating a usable data model. There are a lot of algorithms to pick from; the
model is selected based on the problem. There are three types of machine learning methods
that are incorporated:
I. Supervised Learning
II. Unsupervised Learning
III. Reinforcement Learning

I. Supervised Learning
It is based on the labeled results of previous operations related to the existing business
operation. Based on previous patterns, supervised learning aids in the prediction of an
outcome. Some of the Supervised Learning Algorithms are:
➢ Linear Regression
➢ Random Forest
➢ Support Vector Machines
II. Unsupervised Learning
This form of learning has no pre-existing labels or outcomes. Instead, it concentrates on
examining the interactions and connections between the presently available data points.
Some of the Unsupervised Learning Algorithms are:
➢ K-means Clustering
➢ Hierarchical Clustering
➢ Anomaly Detection
III. Reinforcement Learning
➢ It is a fascinating Machine Learning technique that uses a dynamic Dataset that interacts
with the real world. In simple terms, it is a mechanism by which a system learns from its
mistakes and improves over time. Some of the Reinforcement Learning Algorithms are:
➢ Q-Learning
➢ State-Action-Reward-State-Action (SARSA)
➢ Deep Q Network
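A small scikit-learn sketch of this modeling step, fitting one supervised and one unsupervised
model from the lists above on synthetic data (the data and parameter choices are illustrative):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Supervised learning: learn from labeled outcomes, then predict a new observation.
reg = LinearRegression().fit(X, y)
print(reg.predict([[1.0, -1.0]]))            # should be close to 5

# Unsupervised learning: find structure in the data without any labels.
labels = KMeans(n_clusters=3, n_init=10).fit_predict(X)
print(labels[:10])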

Step 6: Presenting findings and building applications on top of them


➢ After you’ve successfully analyzed the data and built a well-performing model, you’re
ready to present your findings to the world.
➢ This is an exciting part; all your hours of hard work have paid off and you can explain
what you found to the stakeholders.


➢ The data model is applied to the test data to check whether it is accurate and houses all
desirable features.
➢ You can further test your data model to identify any adjustments that might be required
to enhance the performance and achieve the desired results.
➢ If the required precision is not achieved, you can go back to Step 5 (data modeling),
choose an alternate model, and test it again.
➢ The model that provides the best result on the test findings is finalized and deployed in
the production environment once the desired result is achieved through proper testing
as per the business needs.
➢ This concludes the process of data science modeling.

Summary
In this chapter you learned the data science process consists of six steps:
■ Setting the research goal—Defining the what, the why, and the how of your project
in a project charter.
■ Retrieving data—Finding and getting access to data needed in your project. This
data is either found within the company or retrieved from a third party.
■ Data preparation—Checking and remediating data errors, enriching the data with data
from other data sources, and transforming it into a suitable format for your models.
■ Data exploration—Diving deeper into your data using descriptive statistics and visual
techniques.
■ Data modeling—Using machine learning and statistical techniques to achieve your
project goal.
■ Presentation and automation—Presenting your results to the stakeholders and
industrializing your analysis process for repetitive reuse and integration with other tools.

UNIT-II BIG DATA
Problems when handling large data — General techniques for handling large data — Case
study —Steps in big data — Distributing data storage and processing with Frameworks — Case
study.
Introduction to Big Data
What is Data?
➢ Data can be defined as a representation of facts, concepts, or instructions in a formalized
manner, which should be suitable for communication, interpretation, or processing by
human or electronic machine.
➢ Data is represented with the help of characters such as alphabets (A-Z, a-z), digits (0-9) or
special characters (+,-,/,*,<,>,= etc.)
What is Data Size?
Name Equal To

Bit 1 Bit
Byte 8 Bits
Kilobyte 1024 Bytes
Megabyte 1, 024 Kilobytes
Gigabyte 1, 024 Megabytes
Terrabyte 1, 024 Gigabytes
Petabyte 1, 024 Terabytes
Exabyte 1, 024 Petabytes
Zettabyte 1, 024 Exabytes
Yottabyte 1, 024 Zettabytes
What is Big Data?
➢ Big Data is a collection of data that is huge in volume, yet growing exponentially with
time.
➢ It is data of such large size and complexity that none of the traditional data management
tools can store it or process it efficiently.
➢ Big data is also data, but with a huge size.
➢ Digital data can be structured, semi-structured or unstructured data.
1. Unstructured data: This is the data which does not conform to a data model or is not in a
form which can be used easily by a computer program. About 80% of an organization's
data is in this format; for example, memos, chat rooms, PowerPoint presentations, images,
videos, letters, research papers, white papers, the body of an email, etc.
2. Semi-structured data: Semi-structured data is also referred to as having a self-describing
structure. This is the data which does not conform to a data model but has some structure.

However, it is not in a form which can be used easily by a computer program. About 10%
of an organization's data is in this format; for example, HTML, XML, JSON, email data, etc.
3. Structured data: When data follows a pre-defined schema/structure we say it is
structured data. This is the data which is in an organized form (e.g., in rows and
columns) and can be easily used by a computer program. Relationships exist between
entities of data, such as classes and their objects. About 10% of an organization's data is in
this format. Data stored in databases is an example of structured data.
3 V's of Big Data (Characteristics)
1. Velocity: Velocity essentially refers to the speed at which data is being created in
real time. We have moved from simple desktop applications like payroll applications
to real-time processing applications.
2. Variety: Data can be structured data, semi-structured data and unstructured data. Data
stored in a database is an example of structured data. HTML data, XML data, email
data, and CSV files are examples of semi-structured data. PowerPoint presentations,
images, videos, research papers, white papers, the body of an email, etc. are examples of
unstructured data.
3. Volume: Volume can be in Terabytes or Petabytes or Zettabytes. According to the Gartner
glossary, big data is high-volume, high-velocity and/or high-variety information assets that
demand cost-effective, innovative forms of information processing that enable enhanced
insight and decision making.
Sources of Big Data
These data come from many sources, such as:
➢ Social networking sites: Facebook, Google, LinkedIn; all these sites generate huge
amounts of data on a day-to-day basis as they have billions of users worldwide.
➢ E-commerce sites: Sites like Amazon, Flipkart, Alibaba generate huge amounts of logs
from which users' buying trends can be traced.
➢ Weather stations: All the weather stations and satellites give very large amounts of data
which are stored and manipulated to forecast weather.
➢ Telecom companies: Telecom giants like Airtel, Vodafone study the user trends and
accordingly publish their plans, and for this they store the data of their millions of users.
➢ Share market: Stock exchanges across the world generate huge amounts of data through
their daily transactions.
Problems when handling large data

➢ Data volume: Data today is growing at an exponential rate, and this high tide of data will
continue to rise. The key questions are –
➢ “Will all this data be useful for analysis?”,
➢ “Do we work with all this data or a subset of it?”,
➢ “How will we separate the knowledge from the noise?”, etc.
➢ Storage: Cloud computing is the answer to managing infrastructure for big data as far as
cost-efficiency, elasticity, and easy upgrading/downgrading are concerned. This further
complicates the decision to host big data solutions outside the enterprise.
➢ Data retention: How long should one retain this data? Some data may be required for long-
term decisions, but some data may quickly become irrelevant and obsolete.
➢ Skilled professionals: In order to develop, manage and run those applications that
generate insights, organizations need professionals who possess a high-level proficiency
in data sciences.
➢ Other challenges: Other challenges of big data are with respect to capture, storage,
search, analysis, transfer and security of big data.
➢ Visualization: Big data refers to datasets whose size is typically beyond the storage
capacity of traditional database software tools. There is no explicit definition of how big
the data set should be for it to be considered big data. Data visualization (computer
graphics) is becoming popular as a separate discipline. There are very few data
visualization experts.
General techniques for handling large data
There are seven widely used big data analysis techniques that are generally used for handling large
data. They are:
1. Association rule learning
2. Classification tree analysis
3. Genetic algorithms
4. Machine learning
5. Regression analysis
6. Sentiment analysis
7. Social network analysis
1. Association rule learning
Are people who purchase tea more or less likely to purchase carbonated drinks?
➢ Association rule learning is a method for discovering interesting correlations between
variables in large databases. It was first used by major supermarket chains to discover
interesting relations between products, using data from supermarket point-of-sale (POS)
systems.
Association rule learning is being used to help:
➢ place products in better proximity to each other in order to increase sales
➢ extract information about visitors to websites from web server logs
➢ analyze biological data to uncover new relationships
➢ monitor system logs to detect intruders and malicious activity
➢ identify if people who buy milk and butter are more likely to buy diapers
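A small plain-Python sketch of the support and confidence behind a rule such as “people who
buy tea also buy soda” (the baskets are made up for illustration):

# Toy point-of-sale baskets.
baskets = [{"tea", "soda"}, {"tea", "milk"}, {"tea", "soda", "milk"}, {"milk", "butter"}]
n = len(baskets)

def support(itemset):
    # Fraction of baskets that contain every item in the itemset.
    return sum(itemset <= basket for basket in baskets) / n

# Rule "tea -> soda": how often soda appears in the baskets that contain tea.
rule_support = support({"tea", "soda"})
confidence = rule_support / support({"tea"})
print(f"support={rule_support:.2f}, confidence={confidence:.2f}")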
2. Classification tree analysis
Which categories does this document belong to?
Statistical classification is a method of identifying categories that a new observation
belongs to. It requires a training set of correctly identified observations – historical data
in other words.
Statistical classification is being used to:
➢ automatically assign documents to categories
➢ categorize organisms into groupings
➢ develop profiles of students who take online courses
3. Genetic algorithms
Which TV programs should we broadcast, and in what time slot, to maximize our ratings?
Genetic algorithms are inspired by the way evolution works – that is, through
mechanisms such as inheritance, mutation and natural selection. These mechanisms are
used to “evolve” useful solutions to problems that require optimization.
Genetic algorithms are being used to:
➢ schedule doctors for hospital emergency rooms
➢ return combinations of the optimal materials and engineering practices required to
develop fuel-efficient cars
➢ generate “artificially creative” content such as puns and jokes
4. Machine Learning
Which movies from our catalogue would this customer most likely want to watch next,
based on their viewing history?
Machine learning includes software that can learn from data. It gives computers the
ability to learn without being explicitly programmed, and is focused on making
predictions based on known properties learned from sets of “training data.”
Machine learning is being used to help:
➢ distinguish between spam and non-spam email messages
➢ learn user preferences and make recommendations based on this information
➢ determine the best content for engaging prospective customers
➢ determine the probability of winning a case, and setting legal billing rates
5. Regression Analysis
How does your age affect the kind of car you buy?
At a basic level, regression analysis involves manipulating some independent variable
(e.g. background music) to see how it influences a dependent variable (e.g. time spent in
store). It describes how the value of a dependent variable changes when the independent
variable is varied. It works best with continuous quantitative data like weight, speed or
age.
Regression analysis is being used to determine how:
➢ levels of customer satisfaction affect customer loyalty
➢ the number of support calls received may be influenced by the weather forecast given
the previous day
➢ neighbourhood and size affect the listing price of houses
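A tiny sketch of a regression fit relating house size to listing price, using purely synthetic
numbers chosen for illustration:

import numpy as np

size_sqft = np.array([800, 950, 1100, 1300, 1500, 1800], dtype=float)
price = np.array([120, 140, 165, 190, 215, 260], dtype=float)   # in thousands

# Least-squares fit of price = a * size + b.
a, b = np.polyfit(size_sqft, price, deg=1)
print(f"slope = {a:.3f} per square foot, intercept = {b:.1f}")
print("predicted price for 1200 sq ft:", a * 1200 + b)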
6. Sentiment Analysis
How well is our new return policy being received?
Sentiment analysis helps researchers determine the sentiments of speakers or writers
with respect to a topic.
Sentiment analysis is being used to help:
➢ improve service at a hotel chain by analyzing guest comments
➢ customize incentives and services to address what customers are really asking for
➢ determine what consumers really think based on opinions from social media
7. Social Network Analysis
How many degrees of separation are you from Kevin Bacon?
Social network analysis is a technique that was first used in the telecommunications
industry, and then quickly adopted by sociologists to study interpersonal relationships.
It is now being applied to analyze the relationships between people in many fields and
commercial activities. Nodes represent individuals within a network, while ties
represent the relationships between the individuals.
Social network analysis is being used to:
➢ see how people from different populations form ties with outsiders
➢ find the importance or influence of a particular individual within a group
➢ find the minimum number of direct ties required to connect two individuals
➢ understand the social structure of a customer base
STEPS IN BIG DATA
Data Mining
➢ There are two focus terms: data extraction & data mining.
➢ Simply put, data extraction is a process of collecting all data from web pages into your
database.
➢ Whereas data mining is a process of identifying valuable insights within that database.
Such data is collected by data scientists.
➢ For example, you are an e-commerce grocery site owner. After using various research
techniques, you conclude that approximately 70% of people wear jeans. This is called data
extraction.
➢ Now you have to go deeper to understand which age, gender, and type of people use
Brand 1 and Brand 2 jeans. This process is known as data mining. Some of the useful
data mining tools include RapidMiner, Teradata & Kaggle.

Data Collection
➢ Big data doesn’t have an “END” button.
➢ As the world grows, data will keep on streaming in.
➢ Data needs to be extracted constantly.
➢ From the above example: there will be people who wore Brand 1 and have switched to Brand
2, and so on.
➢ The possibilities are endless! Data extraction becomes easier with tools like import.io.
Data Storing
➢ Ever imagined how Google must be storing so much of the world's data? Of course not on
traditional systems – files, CDs, DVDs, etc. Google, Facebook, Apple, etc. run on
hyperscale computing environments.
➢ Which type of storage you should use depends on the scale of your business.
➢ A good data storage system provides an infrastructure which has all the latest data
analytics tools and storage space.
➢ You can store your data on data storage providers like Cloudera, Hadoop (not for
beginners) and Talend.
➢ Data storage is a step which, from here on, can be inserted in between any other step.
Data Cleaning
➢ Data sets can come in all forms and degrees – some good and some not so good,
especially if extracted from the web.
➢ Therefore, all the data extracted needs to be cleaned.
➢ In the cleaning process, all the unwanted and inaccurate data is filtered out.
➢ After this process, you will only be left with what you actually want to focus on.
➢ Cleaning promotes structuring your data well.
➢ For example, you know the number and type of people wearing jeans overall.
➢ While cleaning, you can remove all the duplicate entries, wrong data, unwanted regions
or information, and more.
➢ You can make use of DataCleaner or OpenRefine for this purpose.
Data Analysis
➢ The biggest part of big data is the analytics! What is big data analytics?
➢ While analyzing the data you come across your audience pattern, behavior and so on.
➢ Exploratory research method proves to be very helpful in analyzing big data.
➢ Analytics is about asking a specific question and finding answers to it.
➢ Qubole and Statwing are powerful data analytics tools.
➢ For example, you might ask – does my audience like to wear two-pocket jeans? Which
color do they prefer most, etc.
Data Consumption
➢ Data is consumed in various verticals which include:

➢ Identifying retail trends in the market using which businesses can highlight their top
selling products.
➢ It is used by Government bodies in order to reach out to the correct demographics,
geographies, and ethnicities.
➢ Marketers find big data extremely useful to figure out which advertisement works for
their products.

Distributing data storage and processing with Frameworks


➢ You’ll create a dashboard that allows you to explore data from lenders of a bank through
the following steps:
➢ Load data into Hadoop, the most common big data platform.
➢ Transform and clean data with Spark.
➢ Store it into a big data database called Hive.
➢ Interactively visualize this data with Qlik Sense, a visualization tool.
➢ All this (apart from the visualization) will be coordinated from within a Python
script.
➢ The end result is a dashboard that allows you to explore the data, as shown in figure

Distributing data storage and processing with frameworks


➢ New big data technologies such as Hadoop and Spark make it much easier to work with
and control a cluster of computers.
➢ Hadoop can scale up to thousands of computers, creating a cluster with petabytes of
storage.
➢ This enables businesses to grasp the value of the massive amount of data available.
Hadoop: a framework for storing and processing large data sets
Apache Hadoop is a framework that simplifies working with a cluster of computers. It
aims to be all of the following things and more:
➢ Reliable—By automatically creating multiple copies of the data and redeploying
processing logic in case of failure.
➢ Fault tolerant —It detects faults and applies automatic recovery.
➢ Scalable—Data and its processing are distributed over clusters of computers
(horizontal scaling).
➢ Portable—Installable on all kinds of hardware and operating systems.
Modules of Hadoop
➢ HDFS: Hadoop Distributed File System. Google published its GFS paper, and on the
basis of that, HDFS was developed. It states that files will be broken into blocks and
stored in nodes over the distributed architecture.
➢ YARN: Yet Another Resource Negotiator, used for job scheduling and managing the cluster.
➢ MapReduce: This is a framework which helps Java programs do parallel
computation on data using key-value pairs. The Map task takes input data and converts it
into a data set which can be computed as key-value pairs. The output of the Map task is
consumed by the Reduce task, and the output of the reducer gives the desired result.
➢ Hadoop Common: These Java libraries are used to start Hadoop and are used by other
Hadoop modules.
Hadoop Architecture
➢ The Hadoop architecture is a package of the file system, MapReduce engine and the
HDFS (Hadoop Distributed File System).
➢ The MapReduce engine can be MapReduce/MR1 or YARN/MR2.
➢ A Hadoop cluster consists of a single master and multiple slave nodes.
➢ The master node includes the JobTracker and NameNode, whereas
the slave nodes include the TaskTracker and DataNode.

Hadoop Distributed File System


➢ The Hadoop Distributed File System (HDFS) is a distributed file system for Hadoop.
➢ It has a master/slave architecture. This architecture consists of a single NameNode
performing the role of master, and multiple DataNodes performing the role of slaves.
➢ Both NameNode and DataNode are capable enough to run on commodity machines.
➢ The Java language is used to develop HDFS.
➢ So any machine that supports Java language can easily run the NameNode and
DataNode software.
NameNode
➢ It is a single master server that exists in the HDFS cluster.
➢ As it is a single node, it may become a single point of failure.
➢ It manages the file system namespace by executing operations like opening,
renaming and closing files.
➢ It simplifies the architecture of the system.
DataNode
➢ The HDFS cluster contains multiple DataNodes.
➢ Each DataNode contains multiple data blocks.
➢ These data blocks are used to store data.
➢ It is the responsibility of the DataNode to serve read and write requests from the file system's
clients.
➢ It performs block creation, deletion, and replication upon instruction from the
NameNode.
Job Tracker
➢ The role of Job Tracker is to accept the MapReduce jobs from client and process the data
by using NameNode.
➢ In response, NameNode provides metadata to Job Tracker.
Task Tracker
➢ It works as a slave node for the JobTracker.
➢ It receives tasks and code from the JobTracker and applies that code to the file. This process
can also be called a Mapper.
MapReduce Layer
➢ The MapReduce comes into existence when the client application submits the
MapReduce job to Job Tracker.
➢ In response, the Job Tracker sends the request to the appropriate Task Trackers.
➢ Sometimes, the TaskTracker fails or times out. In such a case, that part of the job is
rescheduled.


The whole process is described in the following six steps and depicted in figure 5.4.
➢ Reading the input files.
➢ Passing each line to a mapper job.
➢ The mapper job parses the colors (keys) out of the file and outputs a file for each color
with the number of times it has been encountered (value). Or, more technically said, it
maps a key (the color) to a value (the number of occurrences).
➢ The keys get shuffled and sorted to facilitate the aggregation.
➢ The reduce phase sums the number of occurrences per color and outputs one file per
key with the total number of occurrences for each color.
➢ The keys are collected in an output file.
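A plain-Python sketch that mirrors these six steps on a toy list of colors; in a real cluster
Hadoop's MapReduce engine distributes this work over many machines, so the sketch only
illustrates the key-value logic:

from itertools import groupby

lines = ["green", "blue", "green", "red", "blue", "green"]   # stand-in for the input files

# Map: emit a (key, value) pair per line, i.e. the color and one occurrence.
mapped = [(color, 1) for color in lines]

# Shuffle and sort: bring identical keys together.
mapped.sort(key=lambda kv: kv[0])

# Reduce: sum the number of occurrences per color.
for color, pairs in groupby(mapped, key=lambda kv: kv[0]):
    print(color, sum(count for _, count in pairs))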

Case Studies:
Scientific explorations:
➢ The data collected from various sensors are analyzed to extract the useful
information for societal benefits.
➢ For example, physics and astronomical experiments – a large number of scientists
collaborating for designing, operating, and analyzing the products of sensor
networks and detectors for scientific studies.
➢ Earth observation systems – information gathering and analytical approaches
about earth’s physical, chemical, and biological systems via remote-sensing
technologies.
➢ These improve social and economic well-being, with applications in weather
forecasting, monitoring and responding to natural disasters, climate change
predictions, and so on.

Health care:
➢ Healthcare organizations would like to predict the locations from where the
diseases are spreading so as to prevent further spreading.
➢ However, to predict exactly the origin of the disease would not be possible until
there is statistical data from several locations.
➢ In 2009, when a new flu virus similar to H1N1 was spreading, Google predicted this and
published a paper in the scientific journal Nature, based on what people were searching
for on the Internet.

Governance:
➢ Surveillance systems analyzing and classifying streaming acoustic signals, and
transportation departments using real-time traffic data to predict traffic patterns
and update public transportation schedules.
➢ Security departments analyzing images from aerial cameras, news feeds, and
social networks for items of interest.
➢ Social program agencies gaining a clearer understanding of beneficiaries and proper
payments. Tax agencies identifying fraudsters and supporting investigations by
analyzing complex identity information and tax returns.
➢ Sensor applications that stream air, water, and temperature data to support
cleanup, fire prevention, and other programs.

Financial and business analytics:


➢ Retaining customers and satisfying consumer expectations are among the most
serious challenges facing financial institutions.
➢ Sentiment analysis and predictive analysis would play a key role in several fields,
such as the travel industry (for optimal cost estimations) and the retail industry
(products targeted at potential customers).
➢ Forecast analysis – estimating the best prices, and so on.

Web analytics:
➢ Several websites are experiencing millions of unique visitors per day, in turn
creating a large range of content.
➢ Increasingly, companies want to be able to mine this data to understand
limitations of their sites, improve response time, offer more targeted ads, and so
on.
➢ This requires tools to perform complicated analytics on data that far exceeds the
memory of a single machine or even a cluster of machines.

Unit – III
Machine Learning
Introduction to Machine Learning
➢ Machine learning is a growing technology which enables computers to learn
automatically from past data.
➢ Machine learning uses various algorithms for building mathematical models and
making predictions using historical data or information.
➢ Currently, it is being used for various tasks such as image recognition, speech
recognition, email filtering, Facebook auto-tagging, recommender systems, and many
more.
What is Machine Learning
➢ In the real world, we are surrounded by humans who can learn everything from their
experiences with their learning capability, and we have computers or machines which
work on our instructions.
➢ But can a machine also learn from experiences or past data like a human does? So here
comes the role of Machine Learning.

➢ Machine learning is a subset of artificial intelligence that is mainly concerned
with the development of algorithms which allow a computer to learn from data and
past experiences on its own.
➢ The term machine learning was first introduced by Arthur Samuel in 1959. We can
define it in a summarized way as:
“Machine learning enables a machine to automatically learn from data, improve
performance from experiences, and predict things without being explicitly programmed.”
➢ With the help of sample historical data, which is known as training data, machine
learning algorithms build a mathematical model that helps in making predictions or
decisions without being explicitly programmed.
➢ Machine learning brings computer science and statistics together for creating predictive
models.
➢ Machine learning constructs or uses algorithms that learn from historical data. The
more information we provide, the higher the performance will be.
How does Machine Learning work
➢ A Machine Learning system learns from historical data, builds the prediction models,
and whenever it receives new data, predicts the output for it.
➢ The accuracy of predicted output depends upon the amount of data, as the huge amount
of data helps to build a better model which predicts the output more accurately.
➢ Suppose we have a complex problem where we need to perform some predictions. Instead
of writing code for it, we just need to feed the data to generic algorithms, and
with the help of these algorithms, the machine builds the logic as per the data and predicts
the output.
➢ Machine learning has changed our way of thinking about the problem.
The working of a machine learning algorithm can be summarized as a simple flow: historical
(past) data is used to train a model, and the trained model is then used to predict the output
for new data.
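A minimal scikit-learn sketch of this workflow on synthetic data (in practice the historical data
and the new observations come from the problem at hand):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Historical (past) data with known outcomes.
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X_train, X_new, y_train, y_new = train_test_split(X, y, test_size=0.2, random_state=0)

# Build the prediction model from the historical data.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Whenever new data arrives, predict the output for it.
predictions = model.predict(X_new)
print("accuracy on held-out data:", accuracy_score(y_new, predictions))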

Features of Machine Learning:


➢ Machine learning uses data to detect various patterns in a given dataset.
➢ It can learn from past data and improve automatically.
➢ It is a data-driven technology.
➢ Machine learning is very similar to data mining, as it also deals with huge amounts of
data.

Dept. of CSE, SANK Page 15 Dr. N. Krishna Kumar


Introduction to Data Science (20DS101)


Unit – III
Machine Learning
Introduction to Machine Learning
➢ Machine learning is a growing technology which enables computers to learn
automatically from past data.
➢ Machine learning uses various algorithms for building mathematical models and
making predictions using historical data or information.
➢ Currently, it is being used for various tasks such as image recognition, speech
recognition, email filtering, Facebook auto-tagging, recommender systems, and many
more.
What is Machine Learning
➢ In the real world, we are surrounded by humans who can learn everything from their
experiences with their learning capability, and we have computers or machines which
work on our instructions.
➢ But can a machine also learn from experiences or past data like a human does? So here
comes the role of Machine Learning.

➢ Machine Learning is a subset of artificial intelligence that is mainly concerned
with the development of algorithms which allow a computer to learn from data and
past experiences on its own.
➢ The term machine learning was first introduced by Arthur Samuel in 1959. We can
define it in a summarized way as:
“Machine learning enables a machine to automatically learn from data, improve
performance from experiences, and predict things without being explicitly programmed.”
➢ With the help of sample historical data, which is known as training data, machine
learning algorithms build a mathematical model that helps in making predictions or
decisions without being explicitly programmed.
➢ Machine learning brings computer science and statistics together for creating predictive
models.
➢ Machine learning constructs or uses algorithms that learn from historical data. The
more information we provide, the better the performance.
How does Machine Learning work
➢ A Machine Learning system learns from historical data, builds the prediction models,
and whenever it receives new data, predicts the output for it.
➢ The accuracy of predicted output depends upon the amount of data, as the huge amount
of data helps to build a better model which predicts the output more accurately.
➢ Suppose we have a complex problem where we need to perform some predictions. Instead
of writing code for it, we just need to feed the data to generic algorithms; with the
help of these algorithms, the machine builds the logic as per the data and predicts
the output.
➢ Machine learning has changed our way of thinking about the problem.
The below block diagram explains the working of Machine Learning algorithm:

Features of Machine Learning:


➢ Machine learning uses data to detect various patterns in a given dataset.
➢ It can learn from past data and improve automatically.
➢ It is a data-driven technology.
➢ Machine learning is much like data mining, as it also deals with huge amounts of data.


Applications of Machine Learning:
➢ Currently, machine learning is used in self-driving cars, cyber fraud detection, face
recognition, friend suggestions by Facebook, etc.
➢ Various top companies such as Netflix and Amazon have built machine learning models
that use vast amounts of data to analyze user interests and recommend products
accordingly.
Following are some key points which show the importance of Machine Learning:
➢ Rapid increase in the production of data
➢ Solving complex problems which are difficult for a human
➢ Decision making in various sectors, including finance
➢ Finding hidden patterns and extracting useful information from data.
The Modeling Process
➢ The modeling phase consists of four steps:
➢ 1. Feature engineering and model selection
➢ 2. Training the model
➢ 3. Model validation and selection
➢ 4. Applying the trained model to unseen data
Feature engineering and model selection :
➢ There are several models that you can choose from according to the objective that you
might have:
➢ you might use classification, prediction, or linear regression algorithms; clustering
algorithms such as k-means or K-Nearest Neighbors; deep learning models such as neural
networks; Bayesian models, etc.
➢ There are various models to be used depending on the data you are going to process such
as images, sound, text, and numerical values.
In the following table, we will see some models and their applications that you can
apply in your projects:


Model and typical applications:
➢ Linear Regression: Price prediction
➢ Fully connected networks: Classification
➢ Convolutional Neural Networks: Image processing
➢ Recurrent Neural Networks: Voice recognition
➢ Random Forest: Fraud detection
➢ Reinforcement Learning: Learning by trial and error
➢ Generative Models: Image creation
➢ K-means: Segmentation
➢ k-Nearest Neighbors: Recommendation systems
➢ Bayesian Classifiers: Spam and noise filtering

Training the model

➢ You will need to train the model on your datasets so that it runs smoothly and shows an
incremental improvement in the prediction rate.
➢ Remember to initialize the weights of your model randomly (the weights are the values
that multiply or affect the relationships between the inputs and outputs);
➢ they will be automatically adjusted by the selected algorithm the more you train them.
Validating the model:
➢ You will have to check the trained model against your evaluation data set, which contains
inputs that the model has not seen, and verify the precision of your already trained
model.
➢ If the accuracy is less than or equal to 50%, that model will not be useful since it would

be like tossing a coin to make decisions.
➢ If you reach 90% or more, you can have good confidence in the results that the model
gives you.
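➢ The hold-out validation step described above can be sketched in a few lines of Python; this is only an illustration, assuming the scikit-learn library is available, and the dataset and model are arbitrary choices.

# Minimal model-validation sketch with a hold-out (evaluation) set.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)

# Keep 30% of the data aside as an evaluation set that the model never sees during training.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)

# Accuracy around 0.5 would be no better than tossing a coin; 0.9 or more gives good confidence.
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Validation accuracy: {accuracy:.2f}")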
Parameter Tuning (Predicting new observations)
➢ If during the evaluation you did not obtain good predictions and your precision is not
the minimum desired, it is possible that you have overfitting or underfitting
problems, and you must return to the training step and try a new configuration of
parameters in your model.
➢ You can increase the number of times you iterate your training data- termed epochs.
➢ Another important parameter is the one known as the “learning rate”, which is usually a
value that multiplies the gradient to gradually bring it closer to the global -or local-
minimum to minimize the cost of the function.
➢ Increasing it in steps of 0.1 is not the same as increasing it in steps of 0.001; the
step size chosen can significantly affect the model's execution time.
➢ You can also indicate the maximum error allowed for your model.
➢ You can go from taking a few minutes to hours, and even days, to train your machine.
➢ These parameters are often called Hyperparameters.
➢ This “tuning” is still more of an art than a science and will improve as you experiment.
➢ There are usually many parameters to adjust, and when combined they can multiply the
number of configurations to try.
➢ Each algorithm has its own parameters to adjust. To name a few more, in Artificial
Neural Networks (ANNs) you must define in the architecture the number of hidden
layers and how many neurons each layer will have, and gradually test with more or
fewer.
➢ This will be a work of great effort and patience to give good results.
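➢ As a rough illustration of this tuning step, the sketch below (assuming scikit-learn) tries a small grid of learning rates and iteration counts and keeps the combination with the best cross-validated score; the parameter names eta0 and max_iter belong to scikit-learn's SGDClassifier and are used here only as an example.

# Hyperparameter-tuning sketch: grid search over a learning rate and the number of iterations.
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_digits(return_X_y=True)

# eta0 plays the role of the learning rate; max_iter is the number of passes over the data (epochs).
param_grid = {"eta0": [0.001, 0.01, 0.1], "max_iter": [500, 1000, 2000]}

search = GridSearchCV(
    SGDClassifier(learning_rate="constant", random_state=0),
    param_grid,
    cv=3,  # 3-fold cross-validation for every combination of hyperparameters
)
search.fit(X, y)
print("Best hyperparameters:", search.best_params_)
print("Best cross-validated accuracy:", round(search.best_score_, 3))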
Supervised Learning
➢ Supervised learning is the type of machine learning in which machines are trained
using well "labelled" training data, and on the basis of that data, machines predict the
output.
➢ The labelled data means some input data is already tagged with the correct output.
➢ In supervised learning, the training data provided to the machines work as the
supervisor that teaches the machines to predict the output correctly.
➢ It applies the same concept as a student learns in the supervision of the teacher.
➢ Supervised learning is a process of providing input data as well as correct output data to
the machine learning model.
➢ The aim of a supervised learning algorithm is to find a mapping function to map the
input variable(x) with the output variable(y).
➢ In the real-world, supervised learning can be used for Risk Assessment, Image
classification, Fraud Detection, spam filtering, etc.


Types of supervised Machine learning Algorithms:


Supervised learning can be further divided into two types of problems:

1. Regression
Regression algorithms are used if there is a relationship between the input variable and
the output variable. It is used for the prediction of continuous variables, such as Weather
forecasting, Market Trends, etc. Below are some popular Regression algorithms which
come under supervised learning:
➢ Linear Regression
➢ Regression Trees
➢ Non-Linear Regression
➢ Bayesian Linear Regression
➢ Polynomial Regression
2. Classification
Classification algorithms are used when the output variable is categorical, which means
there are classes such as Yes-No, Male-Female, True-False, etc.
Example : Spam Filtering
➢ Random Forest
➢ Decision Trees
➢ Logistic Regression
➢ Support vector Machines
Linear Regression:
➢ Linear regression is a statistical regression method which is used for predictive analysis.
➢ It is one of the very simple and easy algorithms which works on regression and shows
the relationship between the continuous variables.
➢ Linear regression shows the linear relationship between the independent variable (X-
axis) and the dependent variable (Y-axis), hence called linear regression.
➢ If there is only one input variable (x), then such linear regression is called simple linear
regression. And if there is more than one input variable, then such linear regression is
called multiple linear regression.
➢ The relationship between variables in the linear regression model can be explained
using the below image. Here we are predicting the salary of an employee on the basis
of the year of experience.

➢ Below is the mathematical equation for Linear regression:


➢ Y= aX+b
➢ Here,
➢ Y = dependent variables (target variables),
➢ X = Independent variables (predictor variables),
a and b are the linear coefficients
➢ Some popular applications of linear regression are:
➢ Analyzing trends and sales estimates
➢ Salary forecasting
➢ Real estate prediction
➢ Analyzing the Impact of Price Changing
➢ Assessment of risk in the financial services and insurance domain
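➢ A minimal Python sketch of fitting Y = aX + b is shown below; NumPy is assumed to be available, and the salary/experience numbers are made up purely for illustration.

# Simple linear regression: predict salary (Y) from years of experience (X).
import numpy as np

years = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)           # X: independent variable
salary = np.array([30, 35, 42, 48, 55, 60, 68, 75], dtype=float)  # Y: dependent variable (thousands)

# np.polyfit with degree 1 returns the slope a and intercept b of the best-fit straight line.
a, b = np.polyfit(years, salary, deg=1)
print(f"Y = {a:.2f} * X + {b:.2f}")

# Predict the salary for an employee with 10 years of experience.
print("Predicted salary for 10 years of experience:", round(a * 10 + b, 1))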

Polynomial Regression:
➢ Polynomial Regression is a type of regression which models the non-linear dataset using
a linear model.
➢ It is similar to multiple linear regression, but it fits a non-linear curve between the value
of x and corresponding conditional values of y.
➢ Suppose there is a dataset which consists of datapoints which are present in a non-linear
fashion, so for such case, linear regression will not best fit to those datapoints.
➢ To cover such datapoints, we need Polynomial regression.
➢ In Polynomial regression, the original features are transformed into polynomial features
of given degree and then modeled using a linear model. Which means the datapoints are
best fitted using a polynomial line.

➢ The equation for polynomial regression is also derived from the linear regression
equation: the linear regression equation Y = b0 + b1x is transformed into the polynomial
regression equation Y = b0 + b1x + b2x^2 + b3x^3 + ... + bnx^n.
➢ Here Y is the predicted/target output, b0, b1, ..., bn are the regression coefficients,
and x is our independent/input variable.
➢ The model is still considered linear because the coefficients b0, ..., bn remain linear;
only the features (x, x^2, x^3, ...) are non-linear.
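➢ A small Python sketch of this idea is given below, assuming NumPy; the synthetic data and the choice of degree 3 are arbitrary illustration choices.

# Polynomial regression: fit Y = b0 + b1*x + b2*x^2 + b3*x^3 by least squares.
import numpy as np

x = np.linspace(-3, 3, 30)
y = 0.5 * x**3 - x**2 + 2 + np.random.normal(0, 1, size=x.shape)  # non-linear synthetic data

coeffs = np.polyfit(x, y, deg=3)   # returns [b3, b2, b1, b0], highest degree first
poly = np.poly1d(coeffs)           # convenient callable polynomial

print("Fitted coefficients (highest degree first):", np.round(coeffs, 2))
print("Prediction at x = 1.5:", round(poly(1.5), 2))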

Decision Trees:
➢ Decision Tree is a supervised learning algorithm which can be used for solving both
classification and regression problems.
➢ It can solve problems for both categorical and numerical data
➢ Decision Tree regression builds a tree-like structure in which each internal node
represents a "test" on an attribute,
each branch represents the result of the test, and each leaf node represents the final
decision or result.
➢ A decision tree is constructed starting from the root node/parent node (dataset),
➢ which splits into left and right child nodes (subsets of dataset).
➢ These child nodes are further divided into their children node, and themselves become
the parent node of those nodes.
➢ Consider the below image:
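➢ Alongside the diagram, a minimal scikit-learn sketch (the Iris dataset and the depth limit are illustrative choices) prints the learned node/branch/leaf structure as text.

# Decision tree sketch: internal nodes test an attribute, branches are outcomes, leaves are decisions.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Each indented line is either a test on a feature (internal node) or a predicted class (leaf).
print(export_text(tree))
print("Predicted class for the first sample:", tree.predict(X[:1]))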

Random Forest:
➢ Random forest is one of the most powerful supervised learning algorithms which is
capable of performing regression as well as classification tasks.
➢ The Random Forest regression is an ensemble learning method which combines
multiple decision trees and predicts the final output based on the average of each tree
output.
➢ The combined decision trees are called base models, and the ensemble can be represented
more formally as:
g(x) = f0(x) + f1(x) + f2(x) + ....
With the help of Random Forest regression, we can prevent Overfitting in the model by
creating random subsets of the dataset.
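➢ A short scikit-learn sketch of Random Forest regression follows; the dataset and the number of trees are illustrative choices only.

# Random Forest regression: many decision trees (the base models f0, f1, f2, ...) are trained on
# random subsets of the data and their predictions are averaged.
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestRegressor(n_estimators=100, random_state=0)  # 100 base trees
forest.fit(X_train, y_train)

print("Prediction for one unseen sample:", round(forest.predict(X_test[:1])[0], 1))
print("R^2 score on unseen data:", round(forest.score(X_test, y_test), 2))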

Logistic Regression:
➢ Logistic regression is another supervised learning algorithm which is used to solve the
classification problems. In classification problems, we have dependent variables in a
binary or discrete format such as 0 or 1.
➢ Logistic regression algorithm works with the categorical variable such as 0 or 1, Yes or
No, True or False, Spam or not spam, etc.
➢ It is a predictive analysis algorithm which works on the concept of probability.
➢ Logistic regression uses the sigmoid function or logistic function to model the data,
and the cost function of logistic regression is derived from it. The function can be
represented as:

f(x) = 1 / (1 + e^(-x))

▪ f(x) = output, always between 0 and 1
▪ x = input to the function
▪ e = base of the natural logarithm.
When we provide the input values (data) to the function, it gives the S-curve as follows:

➢ It uses the concept of threshold levels: values above the threshold level are rounded up
to 1, and values below the threshold level are rounded down to 0.
➢ There are three types of logistic regression:
➢ Binary (0/1, pass/fail)
➢ Multinomial (cats, dogs, lions)
➢ Ordinal (low, medium, high)
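➢ The sigmoid function and the threshold rule can be sketched directly in Python (NumPy assumed; the input scores below are made up).

# Sigmoid (logistic) function and thresholding into classes 0 and 1.
import numpy as np

def sigmoid(x):
    # f(x) = 1 / (1 + e^(-x)); the output always lies between 0 and 1.
    return 1.0 / (1.0 + np.exp(-x))

scores = np.array([-3.0, -0.5, 0.0, 0.8, 2.5])  # example inputs to the function
probs = sigmoid(scores)                          # points on the S-curve

threshold = 0.5
labels = (probs >= threshold).astype(int)        # above the threshold -> 1, below -> 0

print("Probabilities:", np.round(probs, 2))
print("Predicted classes:", labels)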
Support Vector Machine:
➢ Support Vector Machine or SVM is one of the most popular Supervised Learning
algorithms, which is used for Classification as well as Regression problems. However,
primarily, it is used for Classification problems in Machine Learning.
➢ The goal of the SVM algorithm is to create the best line or decision boundary that can
segregate n-dimensional space into classes so that we can easily put the new data point
in the correct category in the future. This best decision boundary is called a hyperplane.
➢ SVM chooses the extreme points/vectors that help in creating the hyperplane. These
extreme cases are called as support vectors, and hence algorithm is termed as Support
Vector Machine.
➢ Consider the below diagram in which there are two different categories that are
classified using a decision boundary or hyperplane:

➢ Here, the Solid line is called hyperplane, and the other two lines are known as boundary
lines.
Example:
➢ SVM can be understood with the example that we have used in the KNN classifier.
➢ Suppose we see a strange cat that also has some features of dogs; if we want a model
that can accurately identify whether it is a cat or a dog, such a model can be created by
using the SVM algorithm.
➢ We will first train our model with lots of images of cats and dogs so that it can learn
about different features of cats and dogs, and then we test it with this strange creature.
➢ So as support vector creates a decision boundary between these two data (cat and dog)
and choose extreme cases (support vectors), it will see the extreme case of cat and dog.
➢ On the basis of the support vectors, it will classify it as a cat. Consider the below
diagram:

➢ SVM algorithm can be used for Face detection, image classification, text
categorization, etc.
Types of SVM
SVM can be of two types:
➢ Linear SVM: Linear SVM is used for linearly separable data, which means if a dataset
can be classified into two classes by using a single straight line, then such data is termed
linearly separable data, and the classifier used is called a Linear SVM classifier.
➢ Non-linear SVM: Non-linear SVM is used for non-linearly separable data, which means
if a dataset cannot be classified by using a straight line, then such data is termed non-
linear data, and the classifier used is called a Non-linear SVM classifier.
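➢ A brief scikit-learn sketch contrasting a linear and a non-linear (RBF-kernel) SVM on data that is not linearly separable; the synthetic dataset is purely illustrative.

# Linear vs. non-linear SVM: the kernel decides whether the decision boundary is straight or curved.
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)  # not linearly separable

linear_svm = SVC(kernel="linear").fit(X, y)
nonlinear_svm = SVC(kernel="rbf").fit(X, y)

print("Linear SVM accuracy    :", round(linear_svm.score(X, y), 2))
print("Non-linear SVM accuracy:", round(nonlinear_svm.score(X, y), 2))
print("Number of support vectors (rbf):", nonlinear_svm.support_vectors_.shape[0])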

What is Unsupervised Learning?
➢ As the name suggests, unsupervised learning is a machine learning technique in which
models are not supervised using a labelled training dataset.
➢ Instead, the models themselves find the hidden patterns and insights from the given data.
➢ It can be compared to learning which takes place in the human brain while learning new
things. It can be defined as:
“Unsupervised learning is a type of machine learning in which models are trained using
unlabeled dataset and are allowed to act on that data without any supervision.”
Why use Unsupervised Learning?
Below are some main reasons which describe the importance of Unsupervised Learning:
➢ Unsupervised learning is helpful for finding useful insights from the data.
➢ Unsupervised learning is much like how a human learns to think through their own
experiences, which makes it closer to real AI.
➢ Unsupervised learning works on unlabeled and uncategorized data which make
unsupervised learning more important.
➢ In real-world, we do not always have input data with the corresponding output so to
solve such cases, we need unsupervised learning.
Working of Unsupervised Learning
➢ Working of unsupervised learning can be understood by the below diagram:

➢ Here, we have taken an unlabeled input data, which means it is not categorized and
corresponding outputs are also not given.
➢ Now, this unlabeled input data is fed to the machine learning model in order to train it.
➢ Firstly, it will interpret the raw data to find the hidden patterns from the data and then
will apply suitable algorithms such as k-means clustering, Decision tree, etc.
Types of Unsupervised Learning Algorithm:
➢ The unsupervised learning algorithm can be further categorized into two types of
problems:

Clustering: Clustering is a method of grouping the objects into clusters such that objects
with most similarities remains into a group and has less or no similarities with the
objects of another group. Cluster analysis finds the commonalities between the data
objects and categorizes them as per the presence and absence of those commonalities.
Association: An association rule is an unsupervised learning method which is used for
finding the relationships between variables in a large database. It determines the set of
items that occur together in the dataset. Association rules make marketing strategies more
effective; for example, people who buy item X (say, bread) also tend to purchase item Y
(butter/jam). A typical example of association rules is Market Basket Analysis.
Unsupervised Learning algorithms:
Below is the list of some popular unsupervised learning algorithms:
➢ K-means clustering
➢ KNN (k-nearest neighbors)
➢ Hierarchal clustering
➢ Anomaly detection
➢ Neural Networks
➢ Principle Component Analysis
➢ Independent Component Analysis
➢ Apriori algorithm
➢ Singular value decomposition
Difference between Supervised and Unsupervised Learning
➢ Supervised learning algorithms are trained using labeled data, whereas unsupervised
learning algorithms are trained using unlabeled data.
➢ A supervised learning model takes direct feedback to check if it is predicting the
correct output or not; an unsupervised learning model does not take any feedback.
➢ A supervised learning model predicts the output; an unsupervised learning model finds
the hidden patterns in data.
➢ In supervised learning, input data is provided to the model along with the output; in
unsupervised learning, only input data is provided to the model.
➢ The goal of supervised learning is to train the model so that it can predict the output
when it is given new data; the goal of unsupervised learning is to find the hidden
patterns and useful insights from the unknown dataset.
➢ Supervised learning needs supervision to train the model; unsupervised learning does
not need any supervision.
➢ Supervised learning can be categorized into Classification and Regression problems;
unsupervised learning can be classified into Clustering and Association problems.
➢ Supervised learning can be used for cases where we know the input as well as the
corresponding outputs; unsupervised learning can be used for cases where we have only
input data and no corresponding output data.
➢ A supervised learning model produces a more accurate result; an unsupervised learning
model may give a less accurate result in comparison.
➢ Supervised learning is not close to true Artificial Intelligence, as we first train the
model for each data point and only then can it predict the correct output; unsupervised
learning is closer to true Artificial Intelligence, as it learns similarly to how a child
learns daily routine things from his experiences.
➢ Supervised learning includes algorithms such as Linear Regression, Logistic Regression,
Support Vector Machine, Multi-class Classification, Decision Tree, Bayesian Logic, etc.;
unsupervised learning includes algorithms such as Clustering, KNN, and the Apriori
algorithm.
K-Means Clustering Algorithm
➢ K-Means Clustering is an unsupervised learning algorithm that is used to solve the
clustering problems in machine learning or data science.
➢ What is K-Means Algorithm?
➢ K-Means Clustering is an Unsupervised Learning algorithm, which groups the
unlabeled dataset into different clusters. Here K defines the number of pre-defined
clusters that need to be created in the process, as if K=2, there will be two clusters, and
for K=3, there will be three clusters, and so on.
➢ “It is an iterative algorithm that divides the unlabeled dataset into k different
clusters in such a way that each data point belongs to only one group that has similar
properties.”
➢ It allows us to cluster the data into different groups and a convenient way to discover the
categories of groups in the unlabeled dataset on its own without the need for any
training.
➢ It is a centroid-based algorithm, where each cluster is associated with a centroid. The
main aim of this algorithm is to minimize the sum of distances between the data point
and their corresponding clusters.
➢ The algorithm takes the unlabeled dataset as input, divides the dataset into k-number of
clusters, and repeats the process until it does not find the best clusters. The value of k
should be predetermined in this algorithm.
➢ The k-means clustering algorithm mainly performs two tasks:
➢ Determines the best value for K center points or centroids by an iterative process.
➢ Assigns each data point to its closest k-center. Those data points which are near to the
particular k-center, create a cluster.
➢ Hence each cluster has datapoints with some commonalities, and it is away from other
clusters.
➢ The below diagram explains the working of the K-means Clustering Algorithm:

How does the K-Means Algorithm Work?


The working of the K-Means algorithm is explained in the below steps:
Step-1: Select the number K to decide the number of clusters.
Step-2: Select random K points or centroids. (They need not come from the input dataset.)
Step-3: Assign each data point to their closest centroid, which will form the predefined K
clusters.
Step-4: Calculate the variance and place a new centroid of each cluster.
Step-5: Repeat the third step, which means reassign each data point to the new closest
centroid of each cluster.
Step-6: If any reassignment occurs, then go to step-4 else go to FINISH.
Step-7: The model is ready
Let's understand the above steps by considering the visual plots:
Suppose we have two variables M1 and M2. The x-y axis scatter plot of these two
variables is given below:

➢ Let's take number k of clusters, i.e., K=2, to identify the dataset and to put them into
different clusters. It means here we will try to group these datasets into two different
clusters.
➢ We need to choose some random k points or centroid to form the cluster. These points
can be either the points from the dataset or any other point. So, here we are selecting the
below two points as k points, which are not the part of our dataset. Consider the below
image:


➢ The performance of the K-means clustering algorithm depends upon highly efficient
clusters that it forms.
➢ But choosing the optimal number of clusters is a big task.
➢ There are some different ways to find the optimal number of clusters, but here we are
discussing the most appropriate method to find the number of clusters or value of K.
➢ The method is given below:
Elbow Method
➢ The Elbow method is one of the most popular ways to find the optimal number of
clusters.
➢ This method uses the concept of WCSS value. WCSS stands for Within Cluster Sum of
Squares, which defines the total variations within a cluster.

➢ The formula to calculate the value of WCSS (for 3 clusters) is given below:
WCSS = Σ(Pi in Cluster1) distance(Pi, C1)^2
     + Σ(Pi in Cluster2) distance(Pi, C2)^2
     + Σ(Pi in Cluster3) distance(Pi, C3)^2
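➢ A minimal elbow-method sketch is given below, assuming scikit-learn (which exposes the WCSS of a fitted model as the inertia_ attribute); the synthetic data and the range of K values are illustrative.

# Elbow method: run K-Means for several K values and watch where the WCSS stops dropping sharply.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)  # synthetic unlabeled data

for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    print(f"K = {k}  WCSS = {model.inertia_:.1f}")

# Fit the final model with the chosen K and read off the cluster assignments.
final = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print("First ten cluster labels:", final.labels_[:10])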
Apriori Algorithm in Machine Learning
➢ The Apriori algorithm uses frequent itemsets to generate association rules, and it is
designed to work on the databases that contain transactions.
➢ With the help of these association rule, it determines how strongly or how weakly two
objects are connected.
➢ This algorithm uses a breadth-first search and Hash Tree to calculate the itemset
associations efficiently.
➢ It is the iterative process for finding the frequent itemsets from the large dataset.
➢ This algorithm was proposed by R. Agrawal and R. Srikant in 1994.
➢ What is Frequent Itemset?
➢ Frequent itemsets are those itemsets whose support is greater than the threshold value
or user-specified minimum support.
➢ It means that if {A, B} is a frequent itemset, then A and B individually must also be
frequent itemsets.
➢ Suppose there are two transactions, A = {1,2,3,4,5} and B = {2,3,7}; in these two
transactions, 2 and 3 are the frequent items.
➢ Steps for Apriori Algorithm
➢ Below are the steps for the apriori algorithm:
➢ Step-1: Determine the support of itemsets in the transactional database, and select the
minimum support and confidence.
➢ Step-2: Keep all itemsets in the transactions with a support value higher than the
minimum (selected) support value.
➢ Step-3: Find all the rules of these subsets that have a confidence value higher than the
threshold (minimum confidence).
➢ Step-4: Sort the rules in decreasing order of lift.
➢ Apriori Algorithm Working
➢ We will understand the apriori algorithm using an example and mathematical
calculation:
➢ Example: Suppose we have the following dataset that has various transactions, and from
this dataset, we need to find the frequent itemsets and generate the association rules
using the Apriori algorithm:



➢ Step-1: Calculating C1 and L1:
➢ In the first step, we will create a table that contains support count (The frequency of each
itemset individually in the dataset) of each itemset in the given dataset. This table is
called the Candidate set or C1.


➢ Now, we will take out all the itemsets that have a greater support count than the
Minimum Support (2). It will give us the table for the frequent itemset L1.
➢ Since all the itemsets have greater or equal support count than the minimum support,
except the E, so E itemset will be removed.


➢ Step-2: Candidate Generation C2, and L2:
➢ In this step, we will generate C2 with the help of L1. In C2, we will create the pair of the
itemsets of L1 in the form of subsets.
➢ After creating the subsets, we will again find the support count from the main
transaction table of datasets, i.e., how many times these pairs have occurred together in
the given dataset. So, we will get the below table for C2:


➢ Again, we need to compare the C2 Support count with the minimum support count, and
after comparing, the itemset with less support count will be eliminated from the table
C2. It will give us the below table for L2

Step-3: Candidate generation C3, and L3:


For C3, we will repeat the same two processes, but now we will form the C3 table with
subsets of three itemsets together, and will calculate the support count from the dataset.
It will give the below table:

Now we will create the L3 table. As we can see from the above C3 table, there is only one
combination of itemset that has support count equal to the minimum support count. So,
the L3 will have only one combination, i.e., {A, B, C}.
Step-4: Finding the association rules for the subsets:
➢ To generate the association rules, first, we will create a new table with the possible rules
from the occurring combination {A, B, C}. For a rule A -> B, we will calculate the
confidence using the formula sup(A ∪ B) / sup(A). After calculating the confidence value
for all rules, we will exclude the rules that have a confidence lower than the minimum
threshold (50%).
➢ Consider the below table:
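➢ In addition to the table, the support and confidence calculations can be sketched by hand in Python; the transactions below are made up for illustration only (a library such as mlxtend provides a complete Apriori implementation).

# Hand-rolled support/confidence sketch for small transaction data.
from itertools import combinations

transactions = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B", "C"},
    {"A", "B", "C"},
]
min_support = 2        # minimum support count
min_confidence = 0.5   # minimum confidence threshold (50%)

def support(itemset):
    # Number of transactions that contain every item of the itemset.
    return sum(1 for t in transactions if itemset <= t)

# Frequent pairs (the C2 / L2 step): keep only pairs that meet the minimum support.
items = sorted(set().union(*transactions))
frequent_pairs = [set(p) for p in combinations(items, 2) if support(set(p)) >= min_support]
print("Frequent pairs:", frequent_pairs)

# Association rules X -> Y with confidence = sup(X ∪ Y) / sup(X).
for pair in frequent_pairs:
    for antecedent in pair:
        consequent = (pair - {antecedent}).pop()
        confidence = support(pair) / support({antecedent})
        if confidence >= min_confidence:
            print(f"{antecedent} -> {consequent}  confidence = {confidence:.2f}")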

Unit – IV
Deep Learning


Introduction to Deep Learning


➢ Deep learning is based on the branch of machine learning, which is a subset of artificial
intelligence.
➢ Since neural networks imitate the human brain, so does deep learning.
➢ In deep learning, nothing is programmed explicitly. Basically, it is a machine learning
class that makes use of numerous nonlinear processing units so as to perform feature
extraction as well as transformation.
➢ The output from each preceding layer is taken as input by each one of the successive
layers.
➢ Deep learning has evolved from machine learning, which itself is a subset of artificial
intelligence; as the idea behind artificial intelligence is to mimic human behaviour, the
idea of deep learning is to build algorithms that can mimic the brain.
➢ Deep learning is implemented with the help of Neural Networks, and the motivation
behind Neural Networks comes from biological neurons, which are nothing but brain
cells.
➢ Deep learning is a collection of statistical machine learning techniques for learning
feature hierarchies, based on artificial neural networks.
➢ So basically, deep learning is implemented with the help of deep networks, which are
nothing but neural networks with multiple hidden layers.


➢ In the example given above, we provide the raw data of images to the first layer of the
input layer.
➢ After that, this input layer will determine the patterns of local contrast, that is, it
will differentiate on the basis of colors, luminosity, etc.
➢ Then the 1st hidden layer will determine the face features, i.e., it will fixate on eyes,
nose, and lips, etc. And then, it will fixate those face features on the correct face template.
➢ So, in the 2nd hidden layer, it will actually determine the correct face here as it can be
seen in the above image, after which it will be sent to the output layer.
➢ Likewise, more hidden layers can be added to solve more complex problems, for
example, if you want to find out a particular kind of face having large or light
complexions.
➢ So, as and when the hidden layers increase, we are able to solve complex problems.
Deep Feedforward Networks
➢ The simplest form of neural network, where input data travels in one direction only,
passing through artificial neural nodes and exiting through output nodes.
➢ Hidden layers may or may not be present, but input and output layers are always
present.
➢ Based on this, they can be further classified as a single-layered or multi-layered feed-
forward neural network.


➢ The number of layers depends on the complexity of the function. It has uni-directional
forward propagation but no backward propagation.
➢ Weights are static here. An activation function is fed by inputs which are multiplied by
weights.
➢ To do so, a classifying activation function or step activation function is used.
➢ For example: the neuron is activated if its input is above the threshold (usually 0), and
the neuron produces 1 as an output.
➢ The neuron is not activated if its input is below the threshold (usually 0), which is
considered as -1.
➢ They are fairly simple to maintain and are equipped to deal with data which
contains a lot of noise.
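➢ A tiny NumPy sketch of one forward pass with a step activation function follows the description above; the layer sizes and inputs are made up, and the weights are random rather than trained.

# One forward pass through a small feed-forward network with a step activation.
import numpy as np

def step(x):
    # The neuron fires (+1) above the threshold 0, otherwise it outputs -1.
    return np.where(x > 0, 1, -1)

x = np.array([0.5, -1.2, 0.3])            # input layer with 3 features

W_hidden = np.random.randn(3, 4)          # static weights: input -> 4 hidden neurons
W_output = np.random.randn(4, 2)          # static weights: hidden -> 2 output neurons

hidden = step(x @ W_hidden)               # one-way (forward) propagation only
output = step(hidden @ W_output)
print("Network output:", output)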
Advantages of Feed Forward Neural Networks
➢ Less complex, easy to design & maintain
➢ Fast and speedy [One-way propagation]
➢ Highly responsive to noisy data
Disadvantages of Feed Forward Neural Networks:
➢ Cannot be used for deep learning [due to absence of dense layers and back propagation]
Applications on Feed Forward Neural Networks:
➢ Simple classification (where traditional Machine-learning based classification
algorithms have limitations)
➢ Face recognition [Simple straight forward image processing]
➢ Computer vision [Where target classes are difficult to classify]
➢ Speech Recognition
What is Regularization?
➢ Regularization is one of the most important concepts of Deep learning. It is a technique
to prevent the model from overfitting by adding extra information to it.
➢ Sometimes the machine learning model performs well with the training data but does
not perform well with the test data.
➢ It means the model is not able to predict the output correctly when it deals with unseen
data, because it has learned noise from the training data; such a model is called
overfitted.
➢ This problem can be addressed with the help of a regularization technique.
➢ This technique can be used in such a way that it allows us to maintain all variables or
features in the model while reducing their magnitude. Hence, it maintains
accuracy as well as the generalization of the model.
➢ It mainly regularizes or reduces the coefficient of features toward zero. In simple words,
"In regularization technique, we reduce the magnitude of the features by keeping the same
number of features."


How does Regularization Work?


➢ Regularization works by adding a penalty or complexity term to the complex model.
Let's consider the simple linear regression equation:
➢ y = β0 + β1x1 + β2x2 + β3x3 + ... + βnxn + b
➢ In the above equation, y represents the value to be predicted,
➢ x1, x2, ..., xn are the features for y,
➢ β0, β1, ..., βn are the weights or magnitudes attached to the features, and b represents
the intercept (bias) of the model.
➢ Linear regression models try to optimize the β values and b to minimize the cost
function; regularization then adds a penalty term based on λ and the magnitude of the
coefficients (the "slope") to that cost function.
Ridge Regression
➢ Ridge regression is one of the types of linear regression in which a small amount of bias
is introduced so that we can get better long-term predictions.
➢ Ridge regression is a regularization technique, which is used to reduce the complexity of
the model. It is also called as L2 regularization.
➢ In this technique, the cost function is altered by adding a penalty term to it. The
amount of bias added to the model is called the Ridge Regression penalty. We can
calculate it by multiplying lambda with the squared weight of each individual feature.
➢ The equation for the cost function in ridge regression will be:
Cost = Σ (yi - ŷi)^2 + λ Σ (βj)^2
➢ In the above equation, the penalty term regularizes the coefficients of the model, and
hence ridge regression reduces the amplitudes of the coefficients, which decreases the
complexity of the model.
➢ As we can see from the above equation, if the value of λ tends to zero, the equation
becomes the cost function of the linear regression model. Hence, for the minimum value
of λ, the model will resemble the linear regression model.
Lasso Regression:
➢ Lasso regression is another regularization technique to reduce the complexity of the
model. It stands for Least Absolute and Selection Operator.
➢ It is similar to the Ridge Regression except that the penalty term contains only the
absolute weights instead of a square of weights.
➢ Since it takes absolute values, hence, it can shrink the slope to 0, whereas Ridge
Regression can only shrink it near to 0.
➢ It is also called L1 regularization. The equation for the cost function of Lasso
regression will be:
Cost = Σ (yi - ŷi)^2 + λ Σ |βj|
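➢ A short scikit-learn sketch comparing plain, Ridge (L2) and Lasso (L1) regression is shown below; the dataset and the alpha value are illustrative, and scikit-learn calls the λ parameter "alpha".

# Ridge shrinks coefficients towards zero; Lasso can shrink some of them exactly to zero.
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression, Ridge, Lasso

X, y = load_diabetes(return_X_y=True)

plain = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)

print("Largest |coefficient| without penalty:", round(np.abs(plain.coef_).max(), 1))
print("Largest |coefficient| with Ridge     :", round(np.abs(ridge.coef_).max(), 1))
print("Coefficients set exactly to zero by Lasso:", int(np.sum(lasso.coef_ == 0)))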

Convolutional Networks :
➢ Convolution neural network contains a three-dimensional arrangement of neurons,
instead of the standard two-dimensional array.
➢ The first layer is called a convolutional layer. Each neuron in the convolutional layer
only processes the information from a small part of the visual field.
➢ Input features are taken in batch-wise like a filter.
➢ The network understands the images in parts and can compute these operations
multiple times to complete the full image processing.
➢ Processing involves conversion of the image from RGB or HSI scale to grey-scale.
➢ Furthering the changes in the pixel value will help to detect the edges and images can be
classified into different categories.
➢ Propagation is uni-directional: a CNN contains one or more convolutional layers
followed by pooling, and the output of the convolutional layers then goes to a fully
connected neural network for classifying the images, as shown in the diagram above.
➢ Filters are used to extract certain parts of the image. In an MLP the inputs are multiplied
with weights and fed to the activation function.
➢ Convolution uses ReLU (Rectified Linear Unit) and the MLP part uses a nonlinear
activation function followed by softmax. Convolutional neural networks show very
effective results in image and video recognition, semantic parsing and paraphrase detection.
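➢ A minimal convolutional-network sketch is given below, assuming TensorFlow/Keras is installed; the layer sizes and the 28x28 grayscale input shape are illustrative only.

# Convolution -> pooling -> fully connected layers -> softmax over 10 classes.
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)),  # convolutional layer with ReLU
    layers.MaxPooling2D((2, 2)),                                            # pooling shrinks the feature maps
    layers.Flatten(),                                                       # unroll feature maps into a vector
    layers.Dense(64, activation="relu"),                                    # fully connected layer
    layers.Dense(10, activation="softmax"),                                 # class probabilities
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()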



Applications on Convolution Neural Network
➢ Image processing
➢ Computer Vision
➢ Speech Recognition
➢ Machine translation
Advantages of Convolution Neural Network:
➢ Used for deep learning with few parameters
➢ Less parameters to learn as compared to fully connected layer
Disadvantages of Convolution Neural Network:
➢ Comparatively complex to design and maintain
➢ Comparatively slow [depends on the number of hidden layers]
Recurrent Neural Networks
➢ A Recurrent Neural Network is designed to save the output of a layer and feed it back to
the input to help in predicting the outcome of the layer.
➢ The first layer is typically a feed forward neural network followed by recurrent neural
network layer where some information it had in the previous time-step is remembered
by a memory function.
➢ Forward propagation is implemented in this case. It stores information required for its
future use. If the prediction is wrong, the learning rate is employed to make small
changes, so that the network gradually moves towards making the right prediction during
backpropagation.
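➢ A tiny NumPy sketch of a single recurrent layer is shown below; the hidden state h carries the memory of earlier time-steps, and all shapes and values are made up for illustration.

# One recurrent layer unrolled over a short input sequence.
import numpy as np

rng = np.random.default_rng(0)
W_x = rng.normal(size=(3, 5))        # input -> hidden weights
W_h = rng.normal(size=(5, 5))        # hidden -> hidden (recurrent) weights

h = np.zeros(5)                      # initial hidden state: no memory yet
sequence = rng.normal(size=(4, 3))   # 4 time-steps, 3 features each

for x_t in sequence:
    # The new state depends on the current input AND on what was remembered so far.
    h = np.tanh(x_t @ W_x + h @ W_h)

print("Final hidden state:", np.round(h, 2))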

Advantages of Recurrent Neural Networks


➢ The ability to model sequential data, where each sample can be assumed to be dependent
on historical ones, is one of the advantages.
➢ Used with convolution layers to extend the pixel effectiveness.
Disadvantages of Recurrent Neural Networks
➢ Gradient vanishing and exploding problems
➢ Training recurrent neural nets could be a difficult task
➢ Difficult to process long sequential data using ReLU as an activation function.
Applications of Recurrent Neural Networks
➢ Text processing like auto suggest, grammar checks, etc.
➢ Text to speech processing
➢ Image tagger
➢ Sentiment Analysis
➢ Translation
Deep learning applications
➢ Self-Driving Cars
A self-driving car captures the images around it by processing a huge amount of data,
and then decides which action to take: turn left, turn right, or stop. Acting accordingly
will further reduce the accidents that happen every year.
➢ Voice Controlled Assistance
When we talk about voice control assistance, then Siri is the one thing that comes into
our mind. So, you can tell Siri whatever you want it to do it for you, and it will search it
for you and display it for you.
➢ Automatic Image Caption Generation
Whatever image that you upload, the algorithm will work in such a way that it will
generate caption accordingly. If you say blue colored eye, it will display a blue-colored
eye with a caption at the bottom of the image.
➢ Automatic Machine Translation
With the help of automatic machine translation, we are able to convert one language into
another with the help of deep learning.

Unit – V
Data Visualization
Data Visualization:
➢ Data visualization is a graphical representation of quantitative information and data by
using visual elements like graphs, charts, and maps.
➢ Data visualization converts large and small data sets into visuals, which are easy for
humans to understand and process.
➢ Data visualization tools provide accessible ways to understand outliers, patterns, and
trends in the data.
➢ In the world of Big Data, the data visualization tools and technologies are required to
analyze vast amounts of information.
➢ Data visualizations are common in your everyday life and often appear in the form of
graphs and charts.
➢ A combination of multiple visualizations and bits of information is referred to as an
infographic.
➢ Data visualizations are used to discover unknown facts and trends.
➢ You can see visualizations in the form of line charts to display change over time.
➢ Bar and column charts are useful for observing relationships and making comparisons.
➢ A pie chart is a great way to show parts-of-a-whole. And maps are the best way to share
geographical data visually.
Use of Data Visualization:
➢ To make data easier to understand and remember.
➢ To discover unknown facts, outliers, and trends.
➢ To visualize relationships and patterns quickly.
➢ To ask better questions and make better decisions.
➢ To perform competitive analysis.
➢ To improve insights.
Data Visualization Options
➢ Bar Charts
➢ Line Charts
➢ Pie Charts
➢ Bubble Charts
➢ Stacked Charts
Scatterplots, etc


Creating Filters on a Visualization


➢ You can add filters to limit the data that’s displayed in a specific visualization on the
canvas.
➢ Visualization filters can be automatically created by selecting Drill on the visualization’s
Menu
➢ Instead of or in addition to adding filters to an individual visualization, you can add filters
to the project or to an individual canvas.
➢ Any filters included on the canvas are applied before the filters that you add to an
individual visualization.
1. Confirm that the Visualize canvas is displayed.
2. In the Visualize canvas, select the visualization that you want to add a filter to.
3. From the Data Elements pane, drag a data element to the Filter section in the Visualization
Grammar Pane.
4. If the two data sets aren’t joined, then you can't use data elements of a data set as a filter
in the visualization of another data set.


5. Set the filter values. How you set the values depends upon the data type that you’re
filtering.
➢ To set filters on columns such as Cost or Quantity Ordered,
➢ To set filters on columns such as Product Category or Product Name,
➢ To set filters on columns such as Ship Date or Order Date
Dashboard Development Tools:
➢ There are tools which help you to visualize all your data. They are already there; all you
need to do is pick the right data visualization tool as per your requirements.
➢ Data visualization allows you to interact with data. Google, Apple, Facebook,
and Twitter all ask better questions of their data and make better business
decisions by using data visualization.
➢ Here are the top 10 data visualization tools that help you to visualize the data:
1. Tableau
➢ Tableau is a data visualization tool. You can create graphs, charts, maps, and many other
graphics.
➢ A tableau desktop app is available for visual analytics. If you don't want to install tableau
software on your desktop, then a server solution allows you to visualize your reports
online and on mobile.



2. Infogram
➢ Infogram is also a data visualization tool. It has some simple steps to process that:
➢ First, you choose among many templates, personalize them with additional visualizations
like maps, charts, videos, and images.
➢ Then you are ready to share your visualization.
➢ Infogram supports team accounts for journalists and media publishers, branded designs
of classroom accounts for educational projects, companies, and enterprises.

3. Chartblocks
Chartblocks is an easy-to-use online tool which requires no coding and builds
visualizations from databases, spreadsheets, and live feeds.


4. Datawrapper
➢ Datawrapper is easy visualization tool, and it requires zero codings. You can upload your
data and easily create and publish a map or a chart. The custom layouts to integrate your
visualizations perfectly on your site and access to local area maps are also available.

5. Plotly
Plotly will help you to create a slick and sharp chart in just a few minutes or in a very short
time. It also starts from a simple spreadsheet.


6. RAW
RAW creates the missing link between spreadsheets and vector graphics on its home page.
Your Data can come from Google Docs, Microsoft Excel, Apple Numbers, or a simple
comma-separated list.

7. Visual.ly
Visual.ly is a visual content service. It has a dedicated data visualization service and their
impressive portfolio that includes work for Nike, VISA, Twitter, Ford, The Huffington
post, and the national geographic.


8. D3.js
D3.js is one of the best data visualization libraries for manipulating documents based on
data. D3.js runs on JavaScript, and it uses CSS, HTML, and SVG. D3.js is open source and
applies data-driven transformations to a webpage. It works with data in formats such as
JSON, CSV, and XML.

9. Ember Charts
Ember charts are based on the ember.js and D3.js framework, and it uses the D3.js under
the hood. It also applied when the data is in JSON and XML file.


10. NVD3
NVD3 is a project that attempts to build reusable charts and components. This project
aims to keep all your charts neat and customizable.

Creating an interactive dashboard with dc.js


➢ DC.js is a charting library for exploring large multi-dimensional datasets.
➢ It relies on the D3.js engine to render charts in a CSS-friendly SVG format.
➢ It allows complex data visualizations to be rendered and dashboards to be designed
with bar charts, scatter plots, heat maps, etc.
➢ DC.js is built to work with Crossfilter for data manipulation.
➢ DC.js enables a single (large) dataset to be visualized with many
interconnected charts with an advanced auto-filtering option.
Why Do We Need DC.js?
➢ In general, data visualization is quite a complex process and carrying it out
on the client side requires extra skill.
➢ DC.js enables us to create almost any kind of complex data visualization
using a simpler programming model.
➢ It is an open source, extremely easy to pick up JavaScript library, which
allows us to implement neat custom visualizations in a very short time.
➢ DC.js charts are data driven and very reactive. In addition, it delivers instant
feedback to user interaction using the Crossfilter Library.
DC.js Features
➢ DC.js is one of the best data visualization frameworks, and it can be used to
generate simple as well as complex visualizations. Some of the salient
features are listed below:
➢ Easy to use.
➢ Fast rendering of the charts.
➢ Supports large multi-dimensional datasets.
➢ Open source JavaScript library.
Dc.js Benefits
➢ DC.js is an open source project and it requires lesser code when compared to
others. It comes with the following benefits
➢ Great data visualization.
➢ Performs graphical filtering.
➢ Fast creation of charts and dashboards.
➢ Creation of highly interactive dashboards.
➢ The following steps to draw a pie chart in DC.
➢ Step 1: Include a Script
➢ Let us add D3, DC and Crossfilter using the following code −
➢ <script src = "js/d3.js"></script>
➢ <script src = "js/crossfilter.js"></script>
➢ <script src = "js/dc.js"></script>
➢ Step 2: Define a Variable
➢ Create an object of type, dc.pieChart as shown below −
➢ var pieChart = dc.pieChart('#pie');
➢ Here, the Pie id is mapped with a pie.
➢ Step 3: Read the Data
➢ Read your data (say, from people.csv) using the d3.csv() function.
➢ It is defined as follows −
➢ d3.csv("data/people.csv", function(errors, people) {
➢ console.log(people);
➢ });
➢ Step 4: Define the Crossfilter
➢ Define a variable for Crossfilter and assign the data to Crossfilter. It is
defined below −
➢ var mycrossfilter = crossfilter(people);
➢ Step 5: Create a Dimension
➢ Create a dimension for gender using the function below −
➢ var genderDimension = mycrossfilter.dimension(function(data) {
➢ return data.gender;
➢ });
➢ Here, the Gender of the people is used for dimension.
➢ Step 6: reduceCount()
➢ Create a Crossfilter group by applying the group() and the reduceCount()
functions on the gender dimension created above (genderDimension).
➢ var genderGroup = genderDimension.group().reduceCount();
➢ Step 7: Generate Pie
➢ Generate the pie using the function below −
➢ pieChart
.width(800)
.height(300)
.dimension(genderDimension)
.group(genderGroup)
.on('renderlet', function(chart) {
chart.selectAll('rect').on('click', function(d) {
console.log('click!', d);
});
});
➢ dc.renderAll();
Here,
➢ Width of the pie chart is set to 800.
➢ Height of the pie chart is set to 300.
➢ Dimension of the pie chart is set to genderDimension using the dimension()
method.
➢ Group of the pie chart is set to genderGroup using the group() method.

➢ Added a click event to log the data using the DC.js built-in event, renderlet().
The renderlet is invoked, whenever the chart is rendered or drawn.
