Basics of Data Science
Unit 1
Introduction to core concepts & technologies
Contents :-
Introduction
Terminology
data science process
data science toolkit
Types of data
Example
Applications
Mathematical Foundations for Data Science : linear algebra
Analytical and numerical solutions of linear equations
Mathematical structures
concepts and notations used in discrete mathematics
Introduction :
Data Science is the area of study which involves extracting insights from vast
amounts of data using various scientific methods, algorithms, and processes.
The term Data Science has emerged because of the evolution of mathematical
statistics, data analysis, and big data.
Data Science is an interdisciplinary field that allows you to extract knowledge from
structured or unstructured data.
Data science enables you to translate a business problem into a research project and
then translate it back into a practical solution.
Terminology :-
Data Science is a field that combines programming skills and
knowledge of mathematics and statistics to derive insights from data.
In short: Data Scientists work with large amounts of data, which are
systematically analyzed to provide meaningful information that can be
used for decision making and problem solving.
Data Science Process :-
The data science process is a systematic approach to solving a data problem. It
provides a structured framework for articulating your problem as a question,
deciding how to solve it, and then presenting the solution to stakeholders.
• Discovery
To begin with, it is extremely important to understand the various specifications, requirements, priorities and budget associated with the project. You must be able to ask the right questions, such as: do you have the required resources? These resources can be in terms of people, technology, time and data. In this stage, you also need to frame the business problem and formulate initial hypotheses (IH) to test.
• Data Preparation
In this stage, you need to explore, preprocess and condition the data for modeling. You can perform data cleaning, transformation, and visualization. This will help you spot the outliers and establish relationships between the variables. Once you have cleaned and prepared the data, it is time to do exploratory analytics on it.
• Model Planning
Here, you determine the methods and techniques to draw the relationships between variables. These relationships will set the base for the algorithms which you will implement in the next stage. You will apply Exploratory Data Analysis (EDA) using various statistical formulas and visualization tools.
• Model Building
In this stage, you create datasets for training and testing purposes. You analyze various learning techniques like classification, association, and clustering and, at last, implement the best-fit technique to build the model.
• Operationalize
In this stage, you deliver the final briefings, code, and technical reports. In addition, a pilot project is also implemented in a real-time production environment. This will give you a clear picture of the performance and other related constraints.
• Communicate Results
Now, it is important to evaluate whether you have achieved your goal. So, in the final stage, you identify all the key findings, communicate them to the stakeholders and determine whether the results of the project are a success or a failure based on the criteria developed in Stage 1.
Data Science Toolkit :-
A Data Scientist's primary role is to apply machine learning, statistical methods and exploratory analysis to data to extract insights and aid decision making. Programming and the use of computational tools are essential to this role.
The Data Science Toolkit is a collection of the best open data sets and open-source tools for data science, wrapped in an easy-to-use REST/JSON API with command line, Python and JavaScript interfaces.
1. Programming Languages:
- Python: Widely used for data analysis, machine learning, and data visualization. Libraries like NumPy, Pandas, Matplotlib, and Scikit-Learn are popular for data science tasks (a short sketch using some of them follows this list).
- R: Another popular language for data analysis and statistics, known for its extensive set of statistical packages.
2. Integrated Development Environments (IDEs):
- Jupyter Notebook: An interactive and web-based environment for writing and running code, making it easy to document and share data analysis.
- RStudio: An IDE specifically designed for R programming.
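As a quick illustration of how these Python libraries fit together, here is a minimal sketch that builds a tiny table with Pandas and summarizes it with NumPy; the column names and values are made up purely for demonstration.

import numpy as np
import pandas as pd

# Hypothetical dataset: daily sales figures (values invented for illustration)
df = pd.DataFrame({
    "day": ["Mon", "Tue", "Wed", "Thu", "Fri"],
    "sales": [120, 135, 128, 150, 142],
})

print(df.describe())                        # Pandas: quick summary statistics
print("Mean sales:", np.mean(df["sales"]))  # NumPy: numerical computation on a column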
Types of data :-
There are two types of data: qualitative and quantitative data, which are further classified into four types of data: nominal, ordinal, discrete, and continuous.
Numerical data: Numerical, or quantitative, data is a type of data that represents numbers rather than natural language descriptions, so it can only be collected in numeric form. Quantitative data supports arithmetic operations (addition, subtraction, division, and multiplication); examples include measurements of a person's weight and height.
It is also divided into two subsets: discrete data and continuous
data.
Discrete data:
The main feature of this data type is that it is countable, meaning that it can take only certain values like the numbers 1, 2, 3 and so on.
Examples of this type of data are age and the number of children you want to have (the number is a non-negative integer because you can't have 1.5 or −2 kids).
Continuous data:
Continuous data is a type of data with uncountable elements; it can take any value within a given range.
Examples of continuous data are measures of weight, height, area, distance, time, etc.
This type of data can be further divided into interval data and ratio
data.
Interval data:
Interval data is measured along a scale, in which each point is
placed at an equal distance, or interval, from one another.
Ratio data:
Ratio data is almost the same as the previous type, but the main difference is that it has a true zero point. For instance, temperature measured in Kelvin has a zero point (0 K), which is equal to −273.15 degrees Celsius.
Categorical data
Categorical, or qualitative data, is information divided into groups or
categories using labels or names. In such dataset, each item is
placed in a single category depending on its qualities. All categories
are mutually exclusive.
Numbers in this type of data do not have mathematical meaning, i.e.
no arithmetical operations can be performed with numerical
variables.
Categorical data is further divided into nominal data and ordinal data.
Nominal data:
Nominal data, also known as naming data, is descriptive and has a
function of labeling or naming
variables. Elements of this type of data do not have any order, or
numerical value, and cannot be measured. Nominal data is usually
collected via questionnaires or surveys. E.g.: Person's name, eye color,
clothes brand.
Ordinal data:
This type of data represents elements that are ordered, ranked, or used
on a rating scale. Generally
speaking, these are categories with an implied order. Though ordinal data can be counted, like nominal data it cannot be measured.
Examples and Applications :-
Data science has found its applications in almost every industry:
• Healthcare
• Gaming
• Image Recognition
• Recommendation Systems
• Logistics
• Fraud Detection
• Internet Search
• Speech Recognition
Mathematical Foundations for Data Science:
Linear Algebra :-
Mathematical topics fundamental to computing and statistics including trees and other
graphs, counting in combinatorics, principles of elementary probability theory, linear
algebra, and fundamental concepts of calculus in one and several variables.
Linear Algebra is used in machine learning to understand how algorithms work under the hood. It is all about vector/matrix/tensor operations; a short NumPy sketch follows.
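As a minimal sketch of the vector/matrix operations mentioned above, assuming NumPy is available (the arrays themselves are arbitrary examples):

import numpy as np

v = np.array([1.0, 2.0, 3.0])          # a vector
A = np.array([[2.0, 0.0, 1.0],
              [1.0, 3.0, 2.0],
              [0.0, 1.0, 1.0]])        # a 3x3 matrix

print(np.dot(v, v))   # dot product of the vector with itself
print(A @ v)          # matrix-vector product
print(A.T)            # transpose of the matrix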
Analytical & Numerical Solutions of Linear Equations :-
An analytical solution involves framing the problem in a well-understood form and
calculating the exact solution.
A numerical solution means making guesses at the solution and testing whether the
problem is solved well enough to stop.
Comparison: Analytical vs. Numerical Solutions
Complexity: analytical solutions are computationally expensive for large systems, whereas numerical methods are optimized for large and sparse systems.
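To make the distinction concrete, the sketch below solves a small made-up system Ax = b in both ways: an exact (analytical) solve and a simple iterative (numerical) scheme that keeps refining a guess until it is good enough. This is only an illustration, not a recommendation of a particular solver.

import numpy as np

A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])

# Analytical solution: exact solve of Ax = b
x_exact = np.linalg.solve(A, b)

# Numerical solution: Jacobi-style iteration that refines a guess until convergence
x = np.zeros(2)
for _ in range(100):
    x_new = (b - (A - np.diag(np.diag(A))) @ x) / np.diag(A)
    if np.allclose(x_new, x, atol=1e-10):
        break
    x = x_new

print("Analytical:", x_exact)
print("Numerical :", x)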
Mathematical Structures :-
Algebraic Structure in Discrete Mathematics
The algebraic structure is a non-empty set G equipped with one or more binary operations. Let us assume that * describes a binary operation on the non-empty set G. In this case, (G, *) is known as an algebraic structure. (N, +), (Z, −) and (R, +, .) are all algebraic structures.
(R, +, .) is an algebraic structure equipped with two operations (+ and .).
Examples of Algebraic Structures
(N, +): Here, the set is the natural numbers N and the operation is addition (+). Adding any two elements of the set results in another element of the set, so the set is closed under the operation.
(Z, −): The set of integers Z with subtraction (−). Subtracting any two integers again gives an integer, so this is also an algebraic structure.
Discrete mathematics is used in computer science to design the apps and programs we use every day. While there is no hard and fast definition of discrete mathematics, it is well known for what it excludes: continuously varying quantities and everything related to them.
Discrete mathematics is vital to digital devices. With tech continually on the rise,
studying this overlooked area of mathematics could prove valuable for your career
and your future.
The purpose of this course is to understand and use (abstract) discrete structures
that are backbones of computer science. In particular, this class is meant to
introduce logic, proofs, sets, relations, functions, counting, and probability, with an
emphasis on applications in computer science.
Introduction
Sources of data
• The process of gathering and analyzing accurate data from various sources to find answers to research problems, trends and probabilities, etc., and to evaluate possible outcomes, is known as Data Collection.
• The main objective of data collection is to gather information-rich and reliable data, and
analyze them to make critical business decisions.
• During data collection, the researchers must identify the data types, the
sources of data, and what methods are being used
The data collection process has had to change and grow with the times, keeping pace with
technology.
Here are several key points to consider regarding data collection in data science:
•Data collection is essential for both business and research.
•Data collection helps in the gathering of information, the testing of hypotheses, and the production of relevant findings.
•It enables scientists to find correlations, trends, and patterns that lead to important
discoveries.
•In business, data collection offers insights into consumer behavior, market trends, as
well as operational effectiveness.
•It enables businesses to streamline operations and make data-driven decisions.
•Data collection provides a competitive advantage in the market.
Data management refers to the process by which data is effectively acquired, stored,
processed, and applied, aiming to bring the role of data into full play.
In terms of business, data management includes metadata management, data quality
management, and data security management.
A data source can be a database, a flat file, real-time measurements from physical
equipment, scraped online data, or any of the numerous static and streaming data
providers available on the internet.
Primary data:
• The data which is Raw, original, and extracted directly from the
official sources is known as primary data.
• Interview method
• Survey method
• Observation method
• Experimental method
Secondary data:
Secondary data is data that has already been collected and published by other agencies or researchers (for example, reports, journals and government records), rather than being extracted first-hand.
Sources of data can also be classified based on their collection methods, which are –
Data collection is the process of acquiring, collecting, extracting, and storing the
voluminous amount of data which may be in the structured or unstructured form
like text, video, audio, XML files, records, or other image files used in later stages
of data analysis.
In the process of big data analysis, “Data collection” is the initial step before
starting to analyze the patterns or useful information in data. The data which is to be
analyzed must be collected from different valid sources.
Data collection starts with asking some questions such as what type of data is to be
collected and what is the source of collection.
Most of the data collected is of two types: “qualitative data”, which is a group of non-numerical data such as words and sentences and mostly focuses on the behavior and actions of the group, and “quantitative data”, which is in numerical form and can be calculated using different scientific tools and sampling data.
APIs
APIs, or Application Program Interfaces, and Web Services are provided by many
data providers and websites, allowing various users or programmes to communicate
with and access data for processing or analysis.
APIs and Web Services often listen for incoming requests from users or applications,
which might be in the form of web requests or network requests, and return data in
plain text, XML, HTML, JSON, or media files.
APIs (Application Programming Interfaces) are widely used to retrieve data from a
number of data sources. APIs are used by apps that demand data and access an end-
point that contains the data. Databases, online services, and data markets are examples
of end-points.
APIs are also used to validate data. For example, an API might be used by a data analyst to validate postal addresses and zip codes.
Types of APIs for Data Science
1. REST APIs (Representational State Transfer)
   - Most common; uses HTTP requests (GET, POST, PUT, DELETE)
   - Example: Twitter API, Google Maps API
2. SOAP APIs (Simple Object Access Protocol)
   - More secure but heavier compared to REST
   - Example: Some banking and enterprise APIs
3. GraphQL APIs
   - More flexible; allows fetching only the required data
   - Example: GitHub GraphQL API
4. Streaming APIs
   - Used for real-time data
   - Example: Twitter Streaming API for live tweets
Making API Requests in Python
In order to work with an API, some tools are required, such as the requests library, so we first need to install them on our system.
import requests
response = requests.get('https://google.com/')
print(response)
>> <Response [200]>
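Building on the snippet above, the sketch below also checks the status code and parses a JSON response; the GitHub API root is used only as an example of a public endpoint that returns JSON.

import requests

response = requests.get("https://api.github.com", timeout=10)

if response.status_code == 200:       # 200 means the request succeeded
    data = response.json()            # parse the JSON body into Python objects
    print("Some keys returned:", list(data.keys())[:5])
else:
    print("Request failed with status:", response.status_code)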
Exploring and fixing data :-
Data exploration refers to the initial step in data analysis. Data analysts use data visualization and statistical techniques to describe dataset characteristics, such as size, quantity, and accuracy, to understand the nature of the data better.
Data exploration techniques include both manual analysis and automated data
exploration software solutions that visually explore and identify relationships between
different data variables, the structure of the dataset, the presence of outliers, and the
distribution of data values to reveal patterns and points of interest, enabling data
analysts to gain greater insight into the raw data.
Data Cleaning
• Identifying Missing Values: Locate and address missing data points strategically (e.g., removal, imputation); a short pandas sketch follows.
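A minimal pandas sketch of the two strategies mentioned above, removal and imputation, using a small made-up table:

import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 31, 40],
                   "city": ["Pune", "Mumbai", None, "Delhi"]})

print(df.isnull().sum())                       # identify missing values per column

dropped = df.dropna()                          # strategy 1: remove rows with missing data
imputed = df.fillna({"age": df["age"].mean(),  # strategy 2: impute a reasonable value
                     "city": "Unknown"})
print(imputed)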
Data Visualization
•Creating Visualizations: Use charts and graphs (bar charts, line
charts, heatmaps) to effectively communicate patterns and trends
within the data.
•Choosing the Right Charts: Select visualizations that best suit the
type of data and the insights you're looking for.
Data storage management helps organizations understand where they have data,
which is a major piece of compliance. Compliance best practices include
documentation, automation,and use of governance tools. Immutable data storage also
helps achieve compliance.
Data storage and management in data science include:
•Informed Decision-Making Process
•Data Quality and Efficiency.
•Compliance and Customer Trust
•Strategy Development and Innovation
•Long-term Sustainability
Data Management Tools and Technologies
Relational Database Management Systems (RDBMS):
Data Warehouse
•Amazon Redshift
•Google BigQuery
•Snowflake
•Apache NiFi
•Talend
•Apache Spark
By using multiple data sources for your model, you can reduce the total volume of data
processed . If used in combination with calculated columns, multiple data sources can
minimize or eliminate the need to create database table joins in an external data access
tool. Using multiple data sources also enables measure allocation.
For example, suppose your product, customer, and order data is stored in a set of tables.
If you were to use this data from a single source, you would need separate tables for
Product, Customer, CustomerSite, Order, and OrderDetail. This source would contain
many duplicate values, and the joins between the tables would be relatively complex.
Instead, you create three separate sources for Products, Customer/Site, and Order/Order
Detail data. The volume of data contained in each is less than that in the single source,
and there are only simple joins between Customer and CustomerSite tables, and Order
and OrderDetail.
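As a rough illustration of keeping sources separate and joining them only where needed, here is a pandas sketch with made-up Customer and Order tables (the table contents are invented for demonstration):

import pandas as pd

# Hypothetical, simplified sources
customers = pd.DataFrame({"customer_id": [1, 2], "name": ["Asha", "Ravi"]})
orders = pd.DataFrame({"order_id": [10, 11, 12],
                       "customer_id": [1, 1, 2],
                       "amount": [250.0, 99.0, 410.0]})

# A simple join between the two sources, instead of one wide denormalized table
report = orders.merge(customers, on="customer_id", how="left")
print(report)
print(report.groupby("name")["amount"].sum())   # allocate the amount measure per customer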
Here are some key considerations and benefits when
using multiple data sources in data science
Introduction to statistics
Variance
Linear regression
SVM
Naive Bayes
•Take your time to analyze and think about why you need to learn
data analytics, whether your interest lies in this field or not, and
why you wish to pursue a career in the field of data analytics.
Terminology and concepts :-
There are four key types of data analytics: descriptive, diagnostic, predictive,
and prescriptive. Together, these four types of data analytics can help an
organization make data-driven decisions.
Introduction to statistics :-
Statistics is like the heart of Data Science that helps to analyze, transform and
predict data.
Types of Statistics
1. Descriptive Statistics
2. Variability
3. Correlation
4. Probability Distribution
5. Regression
6. Normal Distribution
7. Bias
These were some of the statistics concepts for data science that you need to work on. Apart from these, there are some other statistics topics for data science as well, which include:
Sample/Population Variance: Variance is the measure of spread of data along its central values.
Variance(Population) = ∑(x − x̄)² / n
Variance(Sample) = ∑(x − x̄)² / (n − 1)
Range (R): Range is the difference between the largest and smallest values of the data set.
Range = Largest Data Value − Smallest Data Value
Central tendencies and Distributions :-
The central tendency measure is defined as the number used to represent the
center or middle of a set of data values. The three commonly used measures of
central tendency are the mean, median, and mode.
import statistics

numbers = [5, 2, 9, 2, 7, 3, 2]
print("Mean:", statistics.mean(numbers))
print("Median:", statistics.median(numbers))
print("Mode:", statistics.mode(numbers))
A significant chunk of Data science is about understanding the behaviours and
properties of variables, and this is not possible without knowing what
distributions they belong to. Simply put, the probability distribution is a way to
represent possible values a variable may take and their respective probability.
Binomial distribution -
• This distribution was discovered by the Swiss mathematician James Bernoulli. It is used in situations where an experiment results in two possibilities - success and failure.
• The binomial distribution is a discrete probability distribution which expresses the probability of one of two alternatives - success or failure.
• A binomial distribution thus represents the probability for x
successes in n trials, given a success probability p for each
trial.
• A binomial distribution's expected value, or mean, is
calculated by multiplying the number of trials (n) by the
probability of successes (p), or n × p.
The probability of getting exactly x successes in n trials is given by:
P(X = x) = nCx * p^x * (1 − p)^(n − x)
Where:
•n is the number of trials (occurrences)
•x is the number of successful trials
•p is the probability of success in a single trial
•nCx is the combination of n and x
•Note that nCx = n! / (x! (n − x)!), where ! is factorial
import math

# Example usage
n = 10   # number of trials
p = 0.5  # probability of success (e.g., for a fair coin, p = 0.5)
x = 6    # number of successes

# P(X = x) = nCx * p^x * (1 - p)^(n - x)
probability = math.comb(n, x) * (p ** x) * ((1 - p) ** (n - x))
print("P(X = 6):", probability)
A graphical representation of a normal distribution is sometimes called a bell curve because of its flared shape. The precise shape can vary according to the distribution of the population, but the peak is always in the middle and the curve is always symmetrical. In a normal distribution the mean, mode and median are all the same.
The probability density function is f(x) = (1 / (σ√(2π))) * e^(−(x − μ)² / (2σ²)), where μ is the mean of the distribution and σ is its standard deviation.
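A small sketch, using only the standard library, that evaluates the normal density from the formula above; the mean and standard deviation are assumed values chosen for illustration:

import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    # probability density of the normal distribution at x
    coeff = 1.0 / (sigma * math.sqrt(2 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

# The curve peaks at the mean and is symmetric about it
print(normal_pdf(0.0))                       # highest value, at the mean
print(normal_pdf(1.0), normal_pdf(-1.0))     # equal values on either side of the mean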
Variance :-
Variance measures how far the data values are spread out from their mean (see the formulas given earlier).
import statistics
variance = statistics.variance(numbers)   # 'numbers' is the list from the central tendency example above
print("Variance:", variance)
Standard Deviation
Standard deviation is the measure of the distribution of statistical data. Standard deviation is the square root of the variance and is denoted by the symbol 'σ'.
Properties of Standard Deviation
It describes the square root of the mean of the squared deviations of all values in a data set and is also called the root-mean-square deviation.
The smallest value of the standard deviation is 0, since it cannot be negative.
When the data values of a group are similar, the standard deviation will be very low or close to zero. But when the data values vary from each other, the standard deviation is high or far from zero.
Example: Let there be two cricket players, Pant and Kartik, and you have to select one for the cricket World Cup. The selection is based on the scores of both players in their last five T-20 matches; the player whose scores vary less (i.e., has the lower standard deviation) is the more consistent choice. A small sketch with assumed scores follows.
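A minimal sketch, using made-up scores for both players (the original figures are not reproduced here), that computes and compares their standard deviations:

import math

def std_dev(scores):
    # population standard deviation: square root of the mean squared deviation
    mean = sum(scores) / len(scores)
    return math.sqrt(sum((s - mean) ** 2 for s in scores) / len(scores))

# Hypothetical last-five-match scores (for illustration only)
pant = [45, 50, 48, 52, 47]      # similar scores -> low spread
kartik = [10, 90, 35, 70, 37]    # widely varying scores -> high spread

print("Pant  :", round(std_dev(pant), 2))
print("Kartik:", round(std_dev(kartik), 2))
# The player with the lower standard deviation is the more consistent choice.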
The CLT is a statistical theory which states that if you take a sufficiently large sample size from a population with a finite level of variance, the mean of all samples from that population will be roughly equal to the population mean.
The CLT has several applications. Look at the places where you can use it.
Political/election polling is a great example of how you can use CLT. These polls
are used to estimate the number of people who support a specific candidate. You
may have seen these results with confidence intervals on news channels. The CLT
aids in this calculation.
The CLT is used in various census fields to calculate various population details, such as family income, electricity consumption, individual salaries, and so on. The CLT is useful in a variety of fields.
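A small NumPy sketch of the idea: repeatedly take samples from a clearly non-normal (exponential) population and observe that the sample means cluster around the population mean. The population parameters and sample sizes are assumed values chosen for illustration.

import numpy as np

rng = np.random.default_rng(0)
population_mean = 5.0

# Draw 1,000 samples of size 50 each from a skewed (exponential) population
sample_means = [rng.exponential(scale=population_mean, size=50).mean()
                for _ in range(1000)]

print("Average of sample means:", np.mean(sample_means))  # close to 5.0
print("Spread of sample means :", np.std(sample_means))   # roughly 5 / sqrt(50)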
Basic machine learning algorithms :-
• Linear regression.
• Logistic regression.
• Decision tree.
• SVM algorithm.
• Naive Bayes algorithm.
• KNN algorithm.
• K-means.
• Random forest algorithm.
• Dimensionality reduction algorithms
• Gradient boosting algorithm and AdaBoosting algorithm
Linear regression can be further divided into two types of algorithm:
1. Simple Linear Regression:
If a single independent variable is used to predict the value of a numerical dependent
variable, then such a Linear Regression algorithm is called Simple Linear Regression.
2. Multiple Linear regression:
If more than one independent variable is used to predict the value of a numerical
dependent variable, then such a Linear Regression algorithm is called Multiple Linear
Regression.
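A hedged scikit-learn sketch of Simple Linear Regression on a tiny made-up dataset (one independent variable predicting one numeric target); Multiple Linear Regression works the same way, just with more feature columns:

import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: hours studied (X) vs. exam score (y)
X = np.array([[1], [2], [3], [4], [5]])   # a single independent variable
y = np.array([52, 58, 65, 70, 78])

model = LinearRegression().fit(X, y)
print("Slope:", model.coef_[0], "Intercept:", model.intercept_)
print("Predicted score for 6 hours of study:", model.predict([[6]])[0])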
SVM :-
Support Vector Machine(SVM) is a supervised machine learning algorithm
used for both classification and regression. Though we say regression
problems as well its best suited for classification. The objective of SVM
algorithm is to find a hyperplane in an N-dimensional space that distinctly
classifies the data points.
The goal of the SVM algorithm is to create the best line or decision boundary
that can segregate n-dimensional space into classes so that we can easily put the
new data point in the correct category in the future. This best decision boundary
is called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called support vectors, and hence the algorithm is termed a Support Vector Machine.
Types of SVM :
SVM can be of two types:
1. Linear SVM: Linear SVM is used for linearly separable data, which means that if a dataset can be classified into two classes by using a single straight line, then such data is termed linearly separable data, and the classifier used is called the Linear SVM classifier.
2. Non-linear SVM: Non-linear SVM is used for non-linearly separable data, which means that if a dataset cannot be classified by using a straight line, then such data is termed non-linear data, and the classifier used is called the Non-linear SVM classifier.
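A minimal scikit-learn sketch of a Linear SVM classifier on a small, made-up, linearly separable dataset:

from sklearn.svm import SVC

# Tiny toy dataset: two features per point, two classes
X = [[1, 2], [2, 3], [3, 3], [6, 5], [7, 8], [8, 8]]
y = [0, 0, 0, 1, 1, 1]

clf = SVC(kernel="linear")        # Linear SVM classifier
clf.fit(X, y)

print("Support vectors:", clf.support_vectors_)    # the extreme points near the boundary
print("Prediction for [4, 4]:", clf.predict([[4, 4]])[0])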
A Naive Bayes classifier assumes that the presence of a particular feature in a class
is unrelated to the presence of any other feature.
For example, a fruit may be considered to be an apple if it is red, round, and about
3 inches in diameter.
The Naïve Bayes algorithm is comprised of the two words Naïve and Bayes, which can be described as:
• Naïve: It assumes that the occurrence of a certain feature is independent of the occurrence of other features.
• Bayes: It is based on Bayes' theorem, which determines the probability of a hypothesis with prior knowledge.
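A minimal Gaussian Naive Bayes sketch in scikit-learn; the fruit measurements and labels below are made up for illustration:

from sklearn.naive_bayes import GaussianNB

# Toy fruit data: [weight in grams, diameter in cm]; 1 = apple, 0 = orange (invented values)
X = [[150, 7.0], [170, 7.5], [140, 6.8], [130, 9.0], [120, 9.5], [125, 9.2]]
y = [1, 1, 1, 0, 0, 0]

model = GaussianNB().fit(X, y)
print("Prediction for [145, 7.2]:", model.predict([[145, 7.2]])[0])
print("Class probabilities:", model.predict_proba([[145, 7.2]]))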
Introduction
Data encodings
Retinal variables
Visual encodings
Introduction :-
There are many advantages of data visualization. Data visualization is used to:
Communicate your results or findings with your audience
Identify trends, patterns and correlations between variables
Monitor the model’s performance
Clean data
Validate the model’s assumptions
Disadvantages
1. Univariate Analysis
In univariate analysis, as the name suggests, we analyze only one variable at a time. In other words, we analyze each variable separately. Bar charts, pie charts, box plots and
histograms are common examples of univariate data visualization. Bar charts and pie
charts are created for categorical variables, while box plots and histograms are
created for numerical variables.
2. Bivariate Analysis
In bivariate analysis, we analyze two variables at a time. Often, we see whether there
is a relationship between the two variables. The scatter plot is a classic example of
bivariate data visualization.
3. Multivariate Analysis
In multivariate analysis, we analyze more than two variables simultaneously. The
heatmap is a classic example of multivariate data visualization. Other examples are
cluster analysis and principal component analysis (PCA).
Tools and Software for Data Visualization
There are multiple tools and software available for data visualization.
• Weather reports: Maps and other plot types are commonly used in weather
reports.
• Internet websites: Social media analytics websites such as Social Blade and
Google Analytics use data visualization techniques to analyze and compare
the performance of websites.
• Geography
• Gaming industry
There are many data visualization types. The following are the commonly
used data visualization charts.
1. Distribution plot
2. Box and whisker plot
3. Violin plot
4. Line plot
5. Bar plot
6. Scatter plot
7. Histogram
8. Pie chart
9. Area plot
10. Hexbin plot
11. Heatmap
1. Distribution plot
A distribution plot, also known as a distplot, is used to visualize the data distribution. Example: a probability distribution plot or density curve.
4. Line plot
A line plot is created by connecting a series of data points with straight lines. The
number of periods is on the x-axis.
5. Bar plot
A bar plot is used to plot the frequency of occurring categorical data. Each category
is represented by a bar. The bars can be created vertically or horizontally. Their
heights or lengths are proportional to the values they represent.
6. Scatter plot
Scatter plots are created to see whether there is a relationship (linear or non-linear and
positive or negative) between two numerical variables. They are commonly used in
regression analysis.
7. Histogram
A histogram represents the distribution of numerical data. Looking at a histogram,
we can decide whether the values are normally distributed (a bell-shaped curve),
skewed to the right or skewed left. A histogram of residuals is useful to validate
important assumptions in regression analysis.
8. Pie chart
A categorical variable pie chart includes each category's values as slices whose
sizes are proportional to the quantity they represent. It is a circular graph made with
slices equal to the number of categories.
9. Area plot
The area plot is based on the line chart. We get the area plot when we cover the
area between the line and the x-axis.
11. Heatmap
A heatmap visualizes the correlation coefficients of numerical features with a color map; depending on the color scale used, lighter or darker colors indicate higher or lower correlation. The heatmap is extremely useful for identifying multicollinearity, which occurs when the input features are highly correlated with one or more of the other features in the dataset.
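A short sketch, assuming pandas and Matplotlib are available, that computes the correlation matrix of a made-up dataset and displays it as a heatmap:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
df = pd.DataFrame({"height": rng.normal(170, 10, 100)})
df["weight"] = df["height"] * 0.5 + rng.normal(0, 5, 100)   # correlated with height
df["shoe_size"] = rng.normal(42, 2, 100)                    # largely independent

corr = df.corr()                       # correlation coefficients between features
plt.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
plt.xticks(range(len(corr)), corr.columns, rotation=45)
plt.yticks(range(len(corr)), corr.columns)
plt.colorbar(label="correlation")
plt.title("Correlation heatmap")
plt.show()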
Encoding is the process of converting data into a format required for a number
of information processing needs, including: Program compiling and execution.
The data encoding technique is divided into the following types, depending upon the
type of data conversion.
1 Quantitative: These are the data types that represent the quantity of certain data. Some attributes of this type include position, length, volume, area, etc.
2 Ordinal: These are the data types that hold data in some order, for example, days of the week, which hold the order in which they should be represented.
3 Nominal: In this kind of data type, the data is represented in the form of names and categories.
Data mapping tools are commonly included in BI and analytics platforms. Be sure
to choose a platform that includes a tool that adequately fulfills your organizational
needs for customization and development. This will ensure that you get the most
comprehensive, precise, and valuable results from your BI and analytics.
Three popular techniques for converting categorical values to numeric values are Label Encoding, One-Hot Encoding, and Binary Encoding.
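A small pandas/scikit-learn sketch of two of these techniques, Label Encoding and One-Hot Encoding, applied to a made-up categorical column:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# Label Encoding: each category becomes an integer code
df["color_label"] = LabelEncoder().fit_transform(df["color"])

# One-Hot Encoding: each category becomes its own 0/1 column
one_hot = pd.get_dummies(df["color"], prefix="color")

print(df)
print(one_hot)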
Mathematical model includes the raw data and the operations over the data,
whereas the conceptual model includes the semantics and their domain
knowledge.
With the input from either of the models, certain relevant tasks are performed to deliver the output images using the visual encoding patterns that exist.
1 Retinal: Human beings are very sensitive to these kinds of retinal variables. Some of the retinal variables are colours, shapes, sizes and other kinds of properties. Human beings can easily differentiate between these kinds of retinal variables.
2 Planar: Planar variables are another kind which can be applied to all types of data
that are available
Size
Size specifies how big or small objects are in relation to the space they occupy. The primary roles size plays in design are given below:
• Function – for example, the age of the audience: older people would need type set larger to aid reading.
• Attractiveness – adding interest by cropping or scaling the elements.
• Organization – making the most important element the largest and the least important the smallest.
Texture
Texture is the look or feel of any object or surface. The appearance is either visual
(illusionary) or tactile (physical to touch). Patterns are good examples of visual
texture.
Unit 5
Data mining
Network protocols
Computer security
Software engineering
Computer architecture
Operating systems
Distributed systems
Bioinformatics
Machine learning
Computer science and engineering applications :-
Artificial Intelligence
Internet of Behaviours
Machine Learning
5G
Internet of Behaviours
This is another trending technology when we talk of Computer Science engineering; it deals with the basic entity, DATA. IoB, i.e. the Internet of Behaviours, refers to the use of data to derive specific behaviour patterns from the data collected. It involves gathering, combining and processing data from various sources like the public domain, social media, government agencies, etc.
For example, both Facebook and Google are using the behavioral data of their users to
display advertisements to the people accordingly. This is helping businesses in getting
connected with their potential audience .
Robotic Process Automation
RPA refers to automation of processes. The amount of human intervention is reduced and tasks are taken over by bots. A lot of coding needs to be done to enable the automation of computerised or non-computerised processes without human intervention, e.g. automatic email replies, automated data analysis, and automatic processing and approval of financial transactions. Robotic process automation is very fast as it is programmed to do the automation work.
Machine Learning
This trending technology cannot be missed from the list, as it empowers our machines to be smart: it works on learning models and improves the decision making happening in the artificial devices we are making smart or intelligent. It is often called a subset of AI because it aims at making the device smart, but through a learning model.
Learning can be of two types: Supervised Learning and Unsupervised Learning.
Supervised Machine Learning is a type of machine learning technique that makes use of a supervisor, i.e. labelled training data.
Unsupervised Learning is a machine learning method that focuses on unsorted information, which is grouped according to similarities and differences even though no categories are provided.
5G
Some of the unique features of 5G that make it worthwhile are high speed, massive interconnections, low latency and low power consumption. Also, when 5G is integrated with technologies like cloud computing, IoT and edge computing, it can help businesses grow and can deliver significant economic advantages.
VR is a technology, generated with the help of computers, that gives a real sense of surroundings, scenes and objects. Simulation is one technique that is used to achieve this.
VR is used extensively in applications like gaming, education, healthcare, etc. This immersive technology has a lot of applications that are helping trainers as well as students to embark on their learning journey.
Big Data Analytics deals with collecting data from different sources and merging it in a way that it becomes available for consumption by the personnel who will analyse it later, finally delivering products useful to businesses. The unstructured, raw data collected from various sources is converted into a useful product for organisations. It is one of the trending and most significant technologies for businesses.
Data mining :-
Data mining is the process of sorting through large data sets to identify patterns and
relationships that can help solve business problems through data analysis. Data mining
techniques and tools enable enterprises to predict future trends and make more-informed
business decisions.
Data mining is a key part of data analytics overall and one of the core disciplines in
data science, which uses advanced analytics techniques to find useful information in data
sets. At a more granular level, data mining is a step in the knowledge discovery in
databases (KDD) process, a data science methodology for gathering, processing and
analyzing data. Data mining and KDD are sometimes referred to interchangeably, but
they're more commonly seen as distinct things.
Effective data mining aids in various aspects of planning business strategies and
managing operations. That includes customer-facing functions such as marketing,
advertising, sales and customer support, plus manufacturing, supply chain management,
finance and HR. Data mining supports fraud detection, risk management, cybersecurity
planning and many other critical business use cases. It also plays an important role in
healthcare, government, scientific research, mathematics, sports and more.
The data mining process can be broken down into these four primary stages:
Data gathering.
Relevant data for an analytics application is identified and assembled. The data may be
located in different source systems, a data warehouse or a data lake, an increasingly
common repository in big data environments that contain a mix of structured and
unstructured data. External data sources may also be used. Wherever the data comes from,
a data scientist often moves it to a data lake for the remaining steps in the process.
Data preparation.
This stage includes a set of steps to get the data ready to be mined. It starts with data
exploration, profiling and pre-processing, followed by data cleansing work to fix errors
and other data quality issues. Data transformation is also done to make data sets
consistent, unless a data scientist is looking to analyze unfiltered raw data for a particular
application.
• Classification. This approach assigns the elements in data sets to different categories
defined as part of the data mining process. Decision trees, Naive Bayes classifiers, k-nearest
neighbor and logistic regression are some examples of classification methods.
• Clustering. In this case, data elements that share particular characteristics are grouped
together into clusters as part of data mining applications. Examples include k-means
clustering, hierarchical clustering and Gaussian mixture models.
• Regression. This is another way to find relationships in data sets, by calculating predicted
data values based on a set of variables. Linear regression and multivariate regression are
examples. Decision trees and some other classification methods can be used to do
regressions, too.
• Sequence and path analysis. Data can also be mined to look for patterns in which a
particular set of events or values leads to later ones.
• Neural networks. A neural network is a set of algorithms that simulates the activity of the
human brain. Neural networks are particularly useful in complex pattern recognition
applications involving deep learning.
Benefits of data mining :-
• More effective marketing and sales. Data mining helps marketers better understand
customer behavior and preferences, which enables them to create targeted marketing and
advertising campaigns.
• Better customer service. Companies can identify potential customer service issues more
promptly and give contact center agents up-to-date information to use in calls and online
chats with customers.
• Improved supply chain management. Organizations can spot market trends and forecast
product demand more accurately, enabling them to better manage inventories of goods and
supplies.
• Stronger risk management. Risk managers and business executives can better assess
financial, legal, cyber security and other risks to a company and develop plans for
managing them.
• Lower costs. Data mining helps drive cost savings through operational efficiencies in
business processes and reduced redundancy and waste in corporate spending.
Network protocols :-
Network protocols take large-scale processes and break them down into small, specific
tasks or functions. This occurs at every level of the network, and each function must
cooperate at each level to complete the larger task at hand. The term protocol suite
refers to a set of smaller network protocols working in conjunction with each other.
There are three main types of network protocols. These include network management
protocols, network communication protocols and network security protocols:
While network protocol models generally work in similar ways, each protocol is unique
and operates in the specific way detailed by the organization that created it.
There are thousands of different network protocols, but they all perform one of three
primary actions:
i. Communication
ii. Network management
iii. Security
Each type is necessary to use network devices swiftly and safely, and they work together
to facilitate that usage.
Hypertext Transfer Protocol (HTTP): This Internet Protocol defines how data is
transmitted over the internet and determines how web servers and browsers should
respond to commands. This protocol (or its secure counterpart, HTTPS) appears at the
beginning of various URLs or web addresses online.
Secure Socket Shell (SSH): This protocol provides secure access to a computer, even if
it’s on an unsecured network. SSH is particularly useful for network administrators who
need to manage different systems remotely.
Short Message Service (SMS): This communications protocol was created to send and
receive text messages over cellular networks. SMS refers exclusively to text-based
messages. Pictures, videos or other media require Multimedia Messaging Service (MMS),
an extension of the SMS protocol.
Network protocols do not simply define how devices and processes work; they define how
devices and processes work together. Without these predetermined conventions and rules, the
internet would lack the necessary infrastructure it needs to be functional and useable.
Network protocols are the foundation of modern communications, without which the digital
world could not stand.
Types of network protocols :-
There are various types of protocols that play a major and significant role in communicating with different devices across the network. These are:
Web Analytics is a technique that you can employ to collect, measure, report, and analyze
your website data. It is normally carried out to analyze the performance of a website and
optimize its web usage.
Web Analytics is an indispensable technique for all those people who run their business
online. This is a comprehensive tutorial that covers all the basics of web analytics.
Website traffic analysis is the process of collecting and interpreting key data points
that describe web traffic to and from your site. (Web traffic is information about every
user that visits your site.)
Web traffic analytics refers to collecting data about who comes to your website and
what they do when they get there. That data is crucial to building effective sales and
marketing strategies.
Web traffic analytics tells you who visits your website and what they do. Ideally, it'll tell
you what content your users love and give you insights to help improve conversions.
When you're equipped with accurate and immediate website traffic data, it’s possible to
develop pattern models that identify potential weak points in your web design and inform
ongoing development decisions.
Computer hardware is typically protected by the same means used to protect other valuable
or sensitive equipment—namely, serial numbers, doors and locks, and alarms.
Computer Security (Cyber security) can be categorized into five distinct types:
Both Data Science and Software Engineering require you to have programming skills. While Data Science includes statistics and Machine Learning, Software Engineering focuses more on coding languages. Both career choices are in demand and highly rewarding. Ultimately, it depends on your choice of interest.
The features that good software engineers should possess are as follows:
• Exposure to systematic methods, i.e., familiarity with software
engineering principles.
• Good technical knowledge of the project range (Domain knowledge).
• Good programming abilities.
• Good communication skills. These skills comprise oral, written, and interpersonal skills.
• High motivation.
Adaptability: If the software procedure were not based on scientific and engineering
ideas, it would be simpler to re-create new software than to scale an existing one.
Cost: As the hardware industry has demonstrated its skills and huge manufacturing has
let down the cost of computer and electronic hardware. But the cost of programming
remains high if the proper process is not adapted.
Dynamic Nature: The continually growing and adapting nature of programming hugely
depends upon the environment in which the client works. If the quality of the software is
continually changing, new upgrades need to be done in the existing one.
3) To decrease time:
Anything that is not made according to the project always wastes time. And if you are
making great software, then you may need to run many codes to get the definitive
running code. This is a very time-consuming procedure, and if it is not well handled,
then this can take a lot of time. So if you are making your software according to the
software engineering method, then it will decrease a lot of time.
5) Reliable software:
Software should be secure, meaning that once you have delivered the software, it should work for at least its given time or subscription, and if any bugs appear in the software, the company is responsible for solving them. Because testing and maintenance are part of software engineering, there is no worry about its reliability.
6) Effectiveness:
Effectiveness comes when something is made according to standards. Software standards are a major focus for companies that want to make their software more effective, and software becomes more effective with the help of software engineering.
The term software specifies to the set of computer programs, procedures and
associated documents (Flowcharts, manuals, etc.) that describe the program and how
they are to be used.
A software process is the set of activities and associated outcome that produce a
software product. Software engineers mostly carry out these activities. These are four
key process activities, which are common to all software processes.
These activities are:
• Software specifications:
The functionality of the software and constraints on its operation must be defined.
• Software development:
The software to meet the requirement must be produced.
• Software validation:
The software must be validated to ensure that it does what the customer wants.
• Software evolution:
The software must evolve to meet changing client needs.
Some examples of the types of software process models that may be produced are:
1. A workflow model: This shows the series of activities in the process along with their inputs, outputs and dependencies. The activities in this model represent human actions.
2. A dataflow or activity model: This represents the process as a set of activities, each
of which carries out some data transformations. It shows how the input to the process,
such as a specification is converted to an output such as a design. The activities here
may be at a lower level than activities in a workflow model. They may perform
transformations carried out by people or by computers.
3. A role/action model: This means the roles of the people involved in the software
process and the activities for which they are responsible.
Software Development Life Cycle (SDLC) :
A software life cycle model (also termed process model) is a pictorial and
diagrammatic representation of the software life cycle. A life cycle model represents all
the methods required to make a software product transit through its life cycle stages. It
also captures the structure in which these methods are to be undertaken
Stage5: Testing
Stage6: Deployment
Stage7: Maintenance
Once the requirement analysis is done, the next stage is to clearly represent and document the software requirements and get them accepted by the project stakeholders.
This is accomplished through "SRS"- Software Requirement Specification document
which contains all the product requirements to be constructed and developed during the
project life cycle.
The next phase brings together all the knowledge of requirements and analysis into the design of the software project. This phase uses the products of the previous two, such as inputs from the customer and requirement gathering.
In this phase of SDLC, the actual development begins, and the programming is built.
The implementation of design begins concerning writing code. Developers have to
follow the coding guidelines described by their management and programming tools
like compilers, interpreters, debuggers, etc. are used to develop and implement the
code.
Stage5: Testing
After the code is generated, it is tested against the requirements to make sure that the
products are solving the needs addressed and gathered during the requirements stage.
During this stage, unit testing, integration testing, system testing, acceptance testing are
done.
Stage6: Deployment
Once the software is certified, and no bugs or errors are stated, then it is deployed.
Then based on the assessment, the software may be released as it is or with suggested
enhancement in the object segment.
After the software is deployed, then its maintenance begins.
Stage7: Maintenance
Once the client starts using the developed system, the real issues come up and need to be solved from time to time.
This procedure, in which care is taken of the developed product, is known as maintenance.
SDLC Models :
There are various software development life cycle models defined and designed which are followed during the software development phase. These models are also called "Software Development Process Models." Each process model follows a series of phases unique to its type to ensure success in the process of software development.
The spiral model is a risk-driven process model. This SDLC model helps the group to adopt elements of one or more process models, such as waterfall, incremental, etc. The spiral technique is a combination of rapid prototyping and concurrency in design and development activities.
Advantages -
High amount of risk analysis
Useful for large and mission-critical projects.
Disadvantages -
Can be a costly model to use.
Risk analysis needed highly particular expertise
Doesn't work well for smaller projects.
Early computer programs followed computer architecture, with data in one block of
memory and program statements in another.
The first step in understanding any computer architecture is to learn its language. The
words in a computer’s language are called instructions.
• Device Management: The operating system keeps track of all the devices. So, it is
also called the Input / Output controller that decides which process gets the device,
when, and for how much time.
• File Management: It allocates and de-allocates the resources and also decides who
gets the resource.
• Job Accounting: It keeps the track of time and resources used by various jobs or
users.
• Error-detecting Aids: It contains methods that include the production of dumps,
traces, error messages, and other debugging and error-detecting methods.
• Memory Management: It keeps track of the primary memory, like what part of it is in
use by whom, or what part is not in use, etc. and It also allocates the memory when a
process or program requests it.
• Processor Management: It allocates the processor to a process and then de-allocates
the processor when it is no longer required or the job is done.
• Control over System Performance: It records the delays between a request for a service and the response from the system.
• Security: It prevents unauthorized access to programs and data by means of
passwords or some kind of protection technique.
Types of Operating Systems (OS) :
In the 1970s, Batch processing was very popular. In this technique, similar types of jobs
were batched together and executed in time. People were used to having a single computer
which was called a mainframe.
In Batch operating system, access is given to more than one person; they submit their
respective jobs to the system for the execution.
The system put all of the jobs in a queue on the basis of first come first serve and then
executes the jobs one by one. The users collect their respective output when all the jobs
get executed.
Advantages of Batch OS :-
• The use of a resident monitor
improves computer efficiency
as it eliminates CPU time
between two jobs.
Disadvantages of Batch OS :-
1. Starvation
2. Not Interactive
Multiprogramming is an extension to batch processing where the CPU is always kept busy.
Each process needs two types of system time: CPU time and IO time.
In a multiprogramming environment, when a process does its I/O, The CPU can start the
execution of other processes. Therefore, multiprogramming improves the efficiency of the
system.
Advantages of Multiprogramming OS :-
• Throughput is increased, as the CPU always has one program to execute.
• Response time can also be reduced.
Disadvantages of Multiprogramming OS :-
• Multiprogramming systems provide an
environment in which various systems
resources are used efficiently, but they do
not provide any user interaction with the
computer system.
Multiprocessing Operating System :
In Multiprocessing, Parallel computing is achieved. There are more than one processors
present in the system which can execute more than one process at the same time. This
will increase the throughput of the system.
Advantages of Multitasking OS :-
• This operating system is more suited to
supporting multiple users
simultaneously.
• The multitasking operating systems
have well-defined memory
management.
Disadvantages of Multitasking OS :-
• The multiple processors are busier at the
same time to complete any task in a
multitasking environment, so the CPU
generates more heat.
An Operating system, which includes software and associated protocols to communicate with
other computers via a network conveniently and cost-effectively, is called Network
Operating System.
Types of Network OS : Peer-to-peer & Client-Server
Advantages of Network OS :
• In this type of operating system, network
traffic reduces due to the division between
clients and the server.
• This type of system is less expensive to set
up and maintain.
Disadvantages of Network OS :
• In this type of operating system, the failure
of any node in a system affects the whole
system.
• Security and performance are important
issues. So trained network administrators
are required for network administration.
Real Time Operating System :
In Real-Time Systems, each job carries a certain deadline within which the job is
supposed to be completed, otherwise, the huge loss will be there, or even if the result is
produced, it will be completely useless.
In the Time Sharing operating system, computer resources are allocated in a time-
dependent fashion to several programs simultaneously. Thus it helps to provide a large
number of user's direct access to the main computer. It is a logical extension of
multiprogramming. In time-sharing, the CPU is switched among multiple programs
given by different users on a scheduled basis.
The Distributed Operating system is not installed on a single machine, it is divided into
parts, and these parts are loaded on different machines. A part of the distributed Operating
system is installed on each machine to make their communication possible. Distributed
Operating systems are much more complex, large, and sophisticated than Network
operating systems because they also have to take care of varying networking protocols.
Advantages of Distributed OS :
• The distributed operating system
provides sharing of resources.
• This type of system is fault-
tolerant.
Disadvantages of Distributed OS :
• Protocol overhead can dominate
computation cost.
Some typical operating system functions may include managing memory, files,
processes, I/O system & devices, security, etc.
• If any issue occurs in OS, you may lose all the contents which have been stored in your
system.
• Operating system’s software is quite expensive for small size organization which adds
burden on them. Example Windows.
• It is never entirely secure as a threat can occur at any time.
Distributed computing is the method of making multiple computers work together to solve
a common problem. It makes a computer network appear as a powerful single computer
that provides large-scale resources to deal with complex challenges.
• Resource Sharing: It is the ability to use any Hardware, Software, or Data anywhere
in the System.
• Openness: It is concerned with Extensions and improvements in the system (i.e.,
How openly the software is developed and shared with others)
• Concurrency: It is naturally present in distributed systems: the same activity or
functionality can be performed at the same time by separate users in remote
locations. Every local system has its own independent operating system and
resources.
• Scalability: The system can be scaled up by adding processors so that it can serve
more users while maintaining, or improving, its responsiveness.
• Fault tolerance: It concerns the reliability of the system: if there is a failure in
hardware or software, the system continues to operate properly without degrading
its performance.
• Transparency: It hides the complexity of the distributed system from users and
application programs, so the system appears as a single coherent whole.
• Heterogeneity: Networks, computer hardware, operating systems, programming
languages, and developer implementations can all vary and differ among dispersed
system components.
Applications of Distributed Systems :
• Education: E-learning.
Bioinformatics focuses on parsing and analyzing biological data, while data science is a
much broader field that can analyze data from any number of sources, like sales or
financial markets.
Bioinformatics entails the storage and management of biological data via the creation and
maintenance of powerful databases, as well as the retrieval, analysis, and interpretation of
data via algorithms and other computational tools. As such, it has applications for a wide
range of fields.
Here are just a few examples of how bioinformatics helps tackle real-world problems:
• It can help cancer researchers identify which gene mutations cause cancer. Scientists can
then develop targeted therapies exploiting that knowledge.
• It can help biologists map evolutionary connections and ancestry.
• It can help pharmaceutical companies develop new drugs customized to a person’s
individual genome.
• It can aid in the development of new vaccines.
• It can enable the development of crops that are more resistant to insects and disease.
• It can identify microbes that have the ability to clean up environmental waste.
• It can improve the health of livestock.
• It can help forensic scientists identify incriminating DNA evidence.
The demand for data science and information technology is being pushed forward by the
growing popularity of cloud computing, Augmented Reality, Virtual Reality, Artificial
Intelligence, Machine Learning, Decision Intelligence, quantum computing, big data
analytics, and other related technologies.
Machine Learning is the core subarea of artificial intelligence. It puts computers into a
self-learning mode without explicit programming. When fed new data, these computers
learn, grow, change, and develop by themselves.
The concept of machine learning has been around for a while now. However, the ability
to automatically and quickly apply mathematical calculations to big data is now gaining
momentum.
Machine learning has been used in several places like the self-driving Google car, online
recommendation engines – friend recommendations on Facebook, offer suggestions from
Amazon – and in cyber fraud detection. In this section, we will learn about the importance
of machine learning and why every data scientist needs it.
The machine learning life cycle involves seven major steps, which are given below:
1) Gathering data
2) Data preparation
3) Data wrangling
4) Analyse data
5) Train the model
6) Test the model
7) Deployment
2. Data preparation :
After collecting the data, we need to prepare it for the next steps. Data preparation is the
step where we put our data into a suitable place and prepare it for use in machine learning
training.
In this step, we first put all the data together and then randomize its ordering.
This step can be further divided into two processes:
• Data exploration
• Data pre-processing
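A minimal sketch of this preparation step, assuming pandas and scikit-learn are available; the file name and split ratio are illustrative:

# Put the data together, shuffle it, explore it, and split it for training.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("sales.csv")                        # put all data together
df = df.sample(frac=1, random_state=42)              # randomize the ordering
print(df.describe())                                 # quick data exploration
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)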
3. Data Wrangling :
Data wrangling is the process of cleaning and converting raw data into a usable format.
It involves cleaning the data, selecting the variables to use, and transforming the data into
a proper format to make it more suitable for analysis in the next step. It is one of the most
important steps of the complete process, because cleaning the data is required to address
quality issues.
In real-world applications, collected data may have various issues, including:
• Missing Values
• Duplicate data
• Invalid data
• Noise
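A minimal pandas sketch that addresses these issues; the file and column names are illustrative assumptions:

# Clean raw data: duplicates, missing values, invalid entries, and noise.
import pandas as pd

df = pd.read_csv("raw_data.csv")
df = df.drop_duplicates()                              # duplicate data
df["age"] = df["age"].fillna(df["age"].median())       # missing values
df = df[(df["age"] >= 0) & (df["age"] <= 120)]         # invalid data
df["amount"] = df["amount"].rolling(3, min_periods=1).mean()   # smooth noise
df.to_csv("clean_data.csv", index=False)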
4. Data Analysis :
The aim of this step is to build a machine learning model that analyzes the data using
various analytical techniques, and to review the outcome. It starts with determining the
type of problem, where we select machine learning techniques such as classification,
regression, cluster analysis, association, etc.; we then build the model using the prepared
data and evaluate it.
The cleaned and prepared data is passed on to the analysis step, which involves:
• Selection of analytical techniques
• Building models
• Reviewing the result
5. Train Model :
The next step is to train the model. In this step, we train our model to improve its
performance and obtain a better outcome for the problem.
We use datasets to train the model using various machine learning algorithms. Training a
model is required so that it can learn the various patterns, rules, and features.
6. Test Model :
Once our machine learning model has been trained on a given dataset, we test the model.
In this step, we check the accuracy of our model by providing a test dataset to it.
Testing the model determines the percentage accuracy of the model as per the requirements
of the project or problem.
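A minimal scikit-learn sketch covering the train and test steps together; the DataFrame, the "label" column, and the file name are illustrative assumptions:

# Train a model on the prepared data (step 5) and test it (step 6).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

df = pd.read_csv("clean_data.csv")
X, y = df.drop(columns=["label"]), df["label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = DecisionTreeClassifier(max_depth=5)    # 5. train the model
model.fit(X_train, y_train)

predictions = model.predict(X_test)            # 6. test the model
print("Accuracy:", accuracy_score(y_test, predictions))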
7. Deployment :
The last step of machine learning life cycle is deployment, where we deploy the model in
the real-world system.
If the above-prepared model produces accurate results as per our requirements with
acceptable speed, then we deploy the model in the real system. Before deploying the
project, we check whether it continues to perform well on the available data.
The deployment phase is similar to making the final report for a project.
Purpose –
Machine learning allows the user to feed a computer algorithm an immense amount of
data and have the computer analyze and make data-driven recommendations and
decisions based on only the input data.
Types –
Machine learning models rely on four primary data types. These include numerical data,
categorical data, time series data, and text data.
Working -
A Machine Learning system learns from historical data, builds the prediction models,
and whenever it receives new data, predicts the output for it.
The machine learning field is continuously evolving. And along with evolution comes
a rise in demand and importance. There is one crucial reason why data scientists need
machine learning, and that is: ‘High-value predictions that can guide better decisions
and smart actions in real-time without human intervention.’
Machine learning as a technology helps analyze large chunks of data, easing the tasks of
data scientists through automation, and it is gaining a lot of prominence and recognition.
Machine learning has changed the way data extraction and interpretation work by
replacing traditional statistical techniques with automatic sets of generic methods.
Types of Machine Learning :
1) Supervised learning – the model is trained on labelled data, i.e., inputs paired with
known outputs, and learns to predict the output for new inputs.
2) Unsupervised learning – the model is trained on unlabelled data and discovers hidden
patterns or groupings on its own.
3) Reinforcement learning – an agent learns by interacting with an environment and
receiving rewards or penalties for its actions.
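A minimal sketch contrasting supervised and unsupervised learning on the same toy data, assuming scikit-learn is available; the synthetic blobs are illustrative:

# Supervised learning uses the labels y; unsupervised learning ignores them.
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X, y = make_blobs(n_samples=200, centers=2, random_state=0)

supervised = LogisticRegression().fit(X, y)                   # learns from labelled data
unsupervised = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)  # finds structure without labels

print(supervised.predict(X[:5]))      # predicted labels
print(unsupervised.labels_[:5])       # discovered cluster assignments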
Applications of Machine Learning :-
1. Image Recognition :
Image recognition is one of the most common applications of machine learning. It is used
to identify objects, persons, places, digital images, etc. A popular use case of image
recognition and face detection is the automatic friend tagging suggestion:
Facebook provides us with a feature of auto friend tagging suggestions. Whenever we upload
a photo with our Facebook friends, we automatically get a tagging suggestion with names,
and the technology behind this is machine learning's face detection and recognition
algorithm.
It is based on the Facebook project named "DeepFace", which is responsible for face
recognition and person identification in pictures.
2. Speech Recognition :
While using Google, we get an option of "Search by voice"; this comes under speech
recognition and is a popular application of machine learning.
Speech recognition is the process of converting voice instructions into text, and it is also
known as "Speech to text" or "Computer speech recognition". At present, machine
learning algorithms are widely used in various speech recognition applications.
Google Assistant, Siri, Cortana, and Alexa use speech recognition technology to follow
voice instructions.
3. Traffic prediction:
If we want to visit a new place, we take the help of Google Maps, which shows us the
correct path with the shortest route and predicts the traffic conditions.
It predicts traffic conditions, such as whether traffic is clear, slow-moving, or heavily
congested, with the help of two things:
• The real-time location of vehicles from the Google Maps app and sensors
• The average time taken on past days at the same time of day
Everyone who uses Google Maps is helping to make the app better: it takes information
from the user and sends it back to its database to improve performance.
4. Product recommendations:
Machine learning is widely used by e-commerce and entertainment platforms to recommend
products and content to users, for example the offer suggestions from Amazon mentioned
earlier.
5. Self-driving cars:
One of the most exciting applications of machine learning is self-driving cars, where it
plays a significant role. Tesla, the most popular car manufacturing company, is working
on self-driving cars and uses an unsupervised learning method to train the car models to
detect people and objects while driving.
6. Virtual Personal Assistants:
We have various virtual personal assistants such as Google Assistant, Alexa, Cortana, and
Siri. As the name suggests, they help us find information using our voice instructions.
These assistants can help us in various ways just through our voice instructions, such as
playing music, calling someone, opening an email, scheduling an appointment, etc.
Machine learning algorithms are an important part of these virtual assistants. They record
our voice instructions, send them to a server on the cloud, decode them using ML
algorithms, and act accordingly.
7. Online Fraud Detection:
Machine learning is making our online transactions safe and secure by detecting fraudulent
transactions. Whenever we perform an online transaction, there are various ways a
fraudulent transaction can take place, such as fake accounts, fake IDs, and money being
stolen in the middle of a transaction. To detect this, a feed-forward neural network helps us
by checking whether a transaction is genuine or fraudulent.
For each genuine transaction, the output is converted into hash values, and these values
become the input for the next round. Each genuine transaction follows a specific pattern,
which changes for a fraudulent transaction; the network therefore detects the change and
makes our online transactions more secure.
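As a hedged sketch only, a feed-forward network can be trained to separate genuine from fraudulent transactions; the example below uses scikit-learn's MLPClassifier on synthetic, imbalanced data and is not the exact hashing pipeline described above:

# Illustrative feed-forward classifier for imbalanced "fraud vs genuine" data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# 3% of the synthetic transactions are labelled fraudulent.
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.97, 0.03], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

clf = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500, random_state=1)
clf.fit(X_train, y_train)                      # learn the "genuine" pattern
print("Transactions flagged as fraud:", int(clf.predict(X_test).sum()))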
8. Stock Market Trading:
Machine learning is widely used in stock market trading. In the stock market, there is
always a risk of shares going up and down, so machine learning's long short-term memory
(LSTM) neural networks are used to predict stock market trends.
9. Medical Diagnosis:
In medical science, machine learning is used for disease diagnosis. With this, medical
technology is growing very fast and is able to build 3D models that can predict the exact
position of lesions in the brain.
It helps in finding brain tumors and other brain-related diseases easily.
10. Automatic Language Translation:
Nowadays, if we visit a new place and are not familiar with the language, it is not a
problem at all: machine learning helps us by converting the text into languages we know.
Google's GNMT (Google Neural Machine Translation) provides this feature; it is a neural
machine learning model that translates text into our familiar language, and this is called
automatic translation.
The technology behind automatic translation is a sequence-to-sequence learning algorithm,
which is used with image recognition to translate text from one language to another.
Common machine learning algorithms –
Neural networks: Neural networks simulate the way the human brain works, with a
huge number of linked processing nodes. Neural networks are good at recognizing
patterns and play an important role in applications including natural language translation,
image recognition, speech recognition, and image creation.
Linear regression: This algorithm is used to predict numerical values, based on a linear
relationship between different values. For example, the technique could be used to
predict house prices based on historical data for the area.
Decision trees: Decision trees can be used for both predicting numerical values
(regression) and classifying data into categories. Decision trees use a branching
sequence of linked decisions that can be represented with a tree diagram. One of the
advantages of decision trees is that they are easy to validate and audit, unlike the black
box of the neural network.
Random forests: In a random forest, the machine learning algorithm predicts a value
or category by combining the results from a number of decision trees.
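A minimal scikit-learn sketch of the regression-style algorithms listed above, fitted to illustrative synthetic house-price data (neural networks are omitted here for brevity):

# Fit linear regression, a decision tree, and a random forest to the same data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
area = rng.uniform(50, 200, size=(100, 1))            # house area in square metres
price = 3000 * area[:, 0] + rng.normal(0, 20000, 100)  # noisy linear price

for model in (LinearRegression(),
              DecisionTreeRegressor(max_depth=3),
              RandomForestRegressor(n_estimators=50, random_state=0)):
    model.fit(area, price)
    print(type(model).__name__, "predicted price for 120 m^2:", model.predict([[120]]))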
Bokeh (Python)
Most commonly used tools provide basic charting options, but more advanced
visualization techniques have hardly been integrated as features yet. This especially
applies to interactive exploratory data analysis, a shortfall that has been described in the
literature as the 'interactive visualization gap'.
In contrast, interviews with professional data analysts confirm strong interest in learning
and applying new tools and techniques; users are especially interested in techniques that
can support their exploratory analysis workflow. These findings argue for a better
integration of visualization techniques into current data science workflows, and libraries
such as Bokeh help close this gap.
Bokeh is a data visualization library that allows a developer to code in Python and
output JavaScript charts and visuals in web browsers.
Bokeh is a Python data visualization library that provides high performance. Bokeh output
is available in a variety of formats, including notebook, html, and server. Bokeh plots can
be included in Flask applications.
Users can choose between two different visualization interfaces offered by Bokeh: a
low-level interface that gives application developers a great deal of freedom, and a
high-level interface for producing visual glyphs. Bokeh is designed to display its output in
web browsers, and this is where it differs from many other visualization libraries.
It enables us to quickly create complex statistical plots using simple commands, and the
resulting visualizations can also be integrated into Flask and Django apps. Bokeh can
additionally render plots created with other libraries such as matplotlib, seaborn, and
ggplot.
It has the ability to apply interaction and various styling options to visualization.
Bokeh is primarily used to convert source data into JSON, which is then used as input
for BokehJS. One of the most appealing aspects of Bokeh is that it provides standard
charts as well as custom charts for complex use cases.
It has an easy-to-use interface and can be used with Jupyter notebooks. We have complete
control over our chart and can easily modify it with custom JavaScript.
It includes a plethora of examples and ideas to get us started, and it is distributed under
the BSD license.
It is very useful for creating interactive, browser-based visualizations in Python.
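A minimal Bokeh sketch that renders an interactive line chart to a standalone HTML page; the data and file name are illustrative:

# Build a simple interactive line chart and open it in the browser.
from bokeh.plotting import figure, output_file, show

x = [1, 2, 3, 4, 5]
y = [6, 7, 2, 4, 5]

output_file("lines.html")          # output format: standalone HTML
p = figure(title="Simple line example", x_axis_label="x", y_axis_label="y")
p.line(x, y, legend_label="Temp.", line_width=2)
show(p)                            # opens the chart in a web browser

The companion output_notebook function displays the same chart inside a Jupyter notebook instead of writing an HTML file.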
Common data visualization techniques include:
• Box plots
• Histograms
• Heat maps
• Charts
• Tree maps
• Word Cloud/Network diagram
Bar Graph: It has rectangular bars in which the lengths are proportional to the values
which are represented.
Stacked Bar Graph: It is a bar style graph that has various components stacked
together so that apart from the bar, the components can also be compared to each
other.
Stacked Column Chart: It is similar to a stacked bar graph; however, here the data is
stacked vertically.
Area Chart: It combines the line chart and bar chart to show how the numeric values
of one or more groups change over a continuous interval, typically time.
Dual Axis Chart: It combines a column chart and a line chart and then compares the
two variables.
Line Graph: The data points are connected through a straight line, creating a
representation of the changing trend.
Mekko Chart: It can be called a two-dimensional stacked chart with varying column
widths.
Pie Chart: It is a chart where various components of a data set are presented in the form
of a pie which represents their proportion in the entire data set.
Waterfall Chart: With the help of this chart, the increasing effect of sequentially
introduced positive or negative values can be understood.
Scatter Plot Chart: It is also called a scatter chart or scatter graph. Dots are used to
denote values for two different numeric variables.
Bullet Graph: It is a variation of a bar graph. A bullet graph is used in place of dashboard
gauges and meters.
Funnel Chart: The chart shows the flow of users through a business or sales process.
Heat Map: It is a technique of data visualization that shows the level of instances as color
in two dimensions.
Box Plots –
A box plot is a graph that gives you a good indication of how the values in the data are
spread out. Although box plots may seem primitive in comparison to
a histogram or density plot, they have the advantage of taking up less space, which is
useful when comparing distributions between many groups or datasets. For some
distributions/datasets, you will find that you need more information than the measures
of central tendency (median, mean, and mode). You need to have information on the
variability or dispersion of the data.
Histograms
A histogram is a graphical display of data using bars of different heights. In a
histogram, each bar groups numbers into ranges. Taller bars show that more data falls
in that range. A histogram displays the shape and spread of continuous sample data.
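A minimal matplotlib sketch that draws both a box plot and a histogram of the same illustrative random sample:

# Compare a box plot (spread, median, outliers) with a histogram (shape).
import numpy as np
import matplotlib.pyplot as plt

data = np.random.normal(loc=50, scale=10, size=500)   # illustrative sample

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.boxplot(data)                 # spread, median, and outliers
ax1.set_title("Box plot")
ax2.hist(data, bins=20)           # shape and spread of the distribution
ax2.set_title("Histogram")
plt.show()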
Heat Maps
A heat map is a data visualization technique that uses colour the way a bar graph uses
height and width: to encode the value of each cell.
Line Chart
The simplest technique, a line plot is used to plot the relationship or dependence of
one variable on another.
Bar Charts
Bar charts are used for comparing the quantities of different categories or groups.
Pie Chart
It is a circular statistical graph which is divided into slices to illustrate numerical proportion.
Scatter Charts
Another common visualization technique is a scatter plot that is a two-dimensional
plot representing the joint variation of two data items.
Bubble Charts
It is a variation of scatter chart in which the data points are replaced with bubbles, and
an additional dimension of data is represented in the size of the bubbles.
Timeline Charts
Timeline charts illustrate events in chronological order, for example the progress of a
project, an advertising campaign, or an acquisition process, in whatever unit of time the
data was recorded, for example week, month, quarter, or year.
The variety of big data brings challenges because semi-structured, and unstructured data
require new visualization techniques. A word cloud visual represents the frequency of a
word within a body of text with its relative size in the cloud. This technique is used on
unstructured data as a way to display high- or low-frequency words.
Another visualization technique that can be used for semi-structured or unstructured
data is the network diagram. Network diagrams represent relationships as nodes
(individual actors within the network) and ties (relationships between the individuals).
They are used in many applications, for example for analysis of social networks or
mapping product sales across geographic areas.
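A hedged sketch of both techniques, assuming the third-party wordcloud and networkx packages are installed; the text and the social ties are illustrative:

# Word cloud: word size reflects frequency in the text.
import matplotlib.pyplot as plt
import networkx as nx
from wordcloud import WordCloud

text = "data science machine learning data analysis data visualization"
wc = WordCloud(width=400, height=200, background_color="white").generate(text)
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()

# Network diagram: nodes are actors, edges are relationships between them.
G = nx.Graph()
G.add_edges_from([("Alice", "Bob"), ("Bob", "Carol"),
                  ("Carol", "Alice"), ("Carol", "Dave")])
nx.draw(G, with_labels=True)
plt.show()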
Applications of Data Science :-
• Healthcare
• Internet Search
• Targeted Advertising
• Website Recommendations
• Speech Recognition
• Gaming
• Augmented Reality