Padre Conceição
College Of Engineering
CEAM-03 – Data Science and Analytics
(T.E. Computer, Sem-VI)
Presented by: Asst Prof. Vidya G
30/07/2025 INTRODUCTION TO DATA SCIENCE 1
Syllabus
UNIT-01
Introduction to Data Science
Data Science
Data Science, also known as data-driven science, is an interdisciplinary field of scientific
methods, processes, and systems for extracting knowledge or insights from data in various forms,
either structured or unstructured, similar to data mining.
It is a convergence of various knowledge domains that lets experts use a variety of analysis
methods effectively and improve the output of their activities (refer to Fig. 1.1).
Data Science is a recent field that combines big data and unstructured data with statistics,
analytics and business intelligence.
Data Science is the discipline of using quantitative methods from statistics and mathematics,
along with technology, to develop algorithms designed to discover patterns, predict
outcomes, and find optimal solutions to complex problems.
Data science employs techniques and theories drawn from many fields within the broad areas of
mathematics, statistics, information science, and computer science, in particular from the
sub-domains of machine learning, classification, cluster analysis, data mining, data lakes
and warehousing, databases, and visualization (refer to Fig. 1.2).
Terminology Related with Data Science
1. Big Data.
Big Data is a term applied to datasets whose size or type is beyond the ability of traditional
relational databases to capture, manage and process with low latency.
Big data usually includes data sets with sizes beyond the ability of commonly used software
tools to capture, curate, manage and process data within a tolerable elapsed time.
2. Business Intelligence (BI)
BI is the technology that uses transformed and loaded historical data to create reports.
It is a set of methodologies, processes and theories that transform raw data into useful information
to help companies make better decisions.
BI is a process for analyzing data and presenting actionable information to help executives,
managers and other corporate end users make informed business decisions.
Functions in BI technologies include reporting, online analytical processing, analytics, data
mining, process mining, complex event processing, business performance management,
benchmarking, text mining, predictive analytics and prescriptive analytics.
BI can be used by enterprises to support a wide range of business decisions ranging from
operational to strategic.
3. Data Analytics
The terms data analytics and analytics are used to describe both the field and the comprehensive
collection of associated methods.
Data analysts collect, process and perform statistical analyses of data.
Difference between Big Data and Business Intelligence
BIG DATA
1. Big Data refers to the act of generating, capturing and processing enormous amounts of data on a
continuous basis.
2. Big Data is the technology that collects and transforms huge volumes of data held in an
unstructured form.
BUSINESS INTELLIGENCE (BI)
1. Although BI encompasses only commercial activities, its domain is larger. The data is collected in
data lakes and refined in data warehousing through data mining techniques.
2. BI refers to software and systems that import data streams of any size and use them to generate
informational displays that point to specific decisions.
4. Data Wrangling
The process of converting data, typically with scripting languages, into a form that is easier to
work with is known as data wrangling or data munging.
Example: given 900,000 birth year values in the format yyyy-dd-mm and 100,000 in the format
mm/dd/yyyy, writing a Perl script that converts the latter to look like the former, so that all of
them can be used together, is data wrangling.
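The conversion in the example above can be sketched in a few lines. This is a minimal illustration in Python rather than Perl; the function name `normalize_birth_value` and the sample values are assumptions made for the sketch, and it simply treats any value containing a slash as mm/dd/yyyy.

```python
from datetime import datetime

def normalize_birth_value(value: str) -> str:
    """Return the value in yyyy-dd-mm form, converting mm/dd/yyyy input."""
    if "/" in value:
        # Assumption: every slash-separated value is in mm/dd/yyyy format.
        parsed = datetime.strptime(value, "%m/%d/%Y")
        return parsed.strftime("%Y-%d-%m")
    # Values already in the target format are passed through unchanged.
    return value

# Hypothetical sample mixing both formats.
records = ["1990-25-12", "03/14/1985", "07/04/2001"]
cleaned = [normalize_birth_value(v) for v in records]
print(cleaned)  # ['1990-25-12', '1985-14-03', '2001-04-07']
```

After this step all values share one format and can be analysed together.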
5. Algorithm
A series of repeatable steps for carrying out a certain type of task with data.
6. Machine Learning
Analytics in which computers “learn” from data to produce models or rules that apply to those
data and other similar data.
Examples include predictive modelling techniques such as neural nets, classification and regression
trees, naïve Bayes, k-nearest neighbour, and support vector machines.
7. Web Analytics
Statistical or machine learning methods applied to web data such as page views, hits, clicks,
and conversions, generally with a view to learning which web presentations are most effective in
achieving the organizational goal.
This goal might be to sell products and services on a site, to serve and sell advertising space, or to
purchase advertising on other sites.
An advantage of web data is its volume and constant flow.
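As a rough illustration of comparing web presentations, the sketch below computes a conversion rate per page layout and picks the best one. The layout names and counts are hypothetical, and a real analysis would also test whether the difference is statistically significant.

```python
# Hypothetical web data: page views and conversions per page layout.
page_stats = {
    "layout_a": {"views": 1200, "conversions": 36},
    "layout_b": {"views": 1000, "conversions": 45},
}

def conversion_rate(stats: dict) -> float:
    """Conversions divided by views (0.0 when there are no views)."""
    return stats["conversions"] / stats["views"] if stats["views"] else 0.0

rates = {name: conversion_rate(s) for name, s in page_stats.items()}
best = max(rates, key=rates.get)  # layout with the highest conversion rate
print(rates, best)
```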
Methods of Data Repository
Data repository is the term used for data storage.
A data repository is a data storage entity in which data has been specifically partitioned for an
analytical or reporting purpose.
It can take several forms:
Data lakes
Data marts
Data warehouses
Big Data, Hadoop and similar frameworks.
1. Data lake
A data lake is a storage repository that holds a vast amount of raw data in its native format until it is
needed and refined.
A data lake is a shared data environment that comprises multiple repositories and capitalizes on big data
technologies.
It provides data to an organization for a variety of analytic processes.
A data lake is associated with Hadoop-oriented object storage, in which an organization's data is
loaded into the Hadoop platform.
Business analytics and data mining tools are applied to the data where it resides on the Hadoop
cluster.
The data lake concept takes Hadoop deployments to their extreme, creating a potentially
limitless reservoir for disparate collections of structured, unstructured and semi-structured data
generated by transaction systems, social networks, server logs, sensors and other sources.
Characteristics of Data Lake:
1. All data is loaded from source systems. No data is turned away.
2. Data is stored at the leaf level in an untransformed or nearly untransformed state.
3. Data is transformed and schema is applied to fulfil the needs of analysis.
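The three characteristics above are often summarised as "schema on read". The sketch below is a toy illustration only: `raw_zone`, `ingest` and `read_with_schema` are hypothetical names, and an in-memory list stands in for real data lake storage.

```python
import json

raw_zone = []  # stand-in for data lake storage; records are kept as raw text

def ingest(raw_record: str) -> None:
    """Load every record as-is: no data is turned away or transformed."""
    raw_zone.append(raw_record)

def read_with_schema(schema: dict) -> list:
    """Apply a schema (field -> type) only when the data is read for analysis."""
    rows = []
    for raw in raw_zone:
        record = json.loads(raw)
        rows.append({f: cast(record[f]) for f, cast in schema.items() if f in record})
    return rows

ingest('{"sensor": "t1", "reading": "21.5", "unit": "C"}')
ingest('{"user": "asha", "action": "login"}')
ingest('{"sensor": "t2", "reading": "19.0"}')

# The analysis decides which fields and types it needs at read time.
sensor_rows = [r for r in read_with_schema({"sensor": str, "reading": float}) if "reading" in r]
print(sensor_rows)
```

The login record stays in the lake untouched; it is simply not selected by this particular analysis.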
2. Data Warehouse
A data warehouse is constructed by integrating data from multiple heterogeneous sources to
support analytical reporting, structured and/or ad hoc queries, and decision making.
Data warehousing involves data cleaning, data integration, and data consolidation.
A core component of BI, a data warehouse is a central repository of integrated data from one or
more disparate sources, and it is used for reporting and data analytics.
Whereas a hierarchical data warehouse stores data in files or folders, a data lake uses a flat architecture to
store data.
Example: transactions updated on a daily basis.
A data warehouse provides generalized and consolidated data in a multidimensional view.
A data warehouse also provides online analytical processing (OLAP) tools.
These tools help in interactive and effective analysis of data in a multidimensional space. This analysis
results in data mining.
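To make the multidimensional view concrete, the sketch below performs a simple OLAP-style roll-up: the same facts are totalled along different sets of dimensions. The sales records and the `roll_up` helper are hypothetical; real OLAP tools do this over much larger cubes.

```python
from collections import defaultdict

# Hypothetical fact records with three dimensions and one measure.
sales = [
    {"region": "North", "quarter": "Q1", "product": "laptop", "amount": 120},
    {"region": "North", "quarter": "Q2", "product": "laptop", "amount": 150},
    {"region": "South", "quarter": "Q1", "product": "phone", "amount": 90},
    {"region": "South", "quarter": "Q1", "product": "laptop", "amount": 60},
]

def roll_up(facts: list, dimensions: list) -> dict:
    """Total the 'amount' measure for each combination of the given dimensions."""
    totals = defaultdict(int)
    for fact in facts:
        key = tuple(fact[d] for d in dimensions)
        totals[key] += fact["amount"]
    return dict(totals)

by_region = roll_up(sales, ["region"])
by_region_quarter = roll_up(sales, ["region", "quarter"])
print(by_region)
print(by_region_quarter)
```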
Understanding a Data Warehouse
1. A data warehouse is a database, which is kept separate from the organization’s operational database.
2. There is no frequent updating done in data warehouse.
3. It possesses consolidated historical data, which helps the organization to analyze the business.
4. A data warehouse helps executives to organize, understand and use their data to take strategic decisions.
5. Data warehouse systems help in the integration of a diversity of application systems.
6. Data warehouse systems help in consolidated historical data analysis.
Data warehouse models
From the perspective of data warehouse architecture, we have the following data warehouse
models:
1. Virtual warehouse
2. Data marts
3. Enterprise warehouse
1. Virtual warehouse
A view over an operational data warehouse is known as a virtual warehouse.
It is easy to build a virtual warehouse.
Building a virtual warehouse requires excess capacity on the operational database servers.
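The idea of a warehouse defined as a view can be sketched with Python's built-in `sqlite3` module. The `orders` table, the `customer_totals` view and the sample rows are hypothetical; the point is only that the view stores no data of its own and is computed from the operational table each time it is queried, which is why it needs spare capacity on the operational server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory operational database

# Hypothetical operational table.
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("asha", 250.0), ("ravi", 100.0), ("asha", 50.0)],
)

# The "virtual warehouse": a summary view defined over the operational data.
conn.execute(
    "CREATE VIEW customer_totals AS "
    "SELECT customer, SUM(amount) AS total FROM orders GROUP BY customer"
)

totals = conn.execute(
    "SELECT customer, total FROM customer_totals ORDER BY customer"
).fetchall()
print(totals)  # [('asha', 300.0), ('ravi', 100.0)]
```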
2. Data marts:
A data mart contains a subset of organization-wide data.
This subset is valuable to a specific group within an organization.
Example: the marketing data mart may contain data related to items, customers, and sales.
Data marts are confined to subjects.
3. Enterprise warehouse
An enterprise warehouse collects all the information and the subjects spanning an entire
organization.
1. It provides enterprise-wide data integration.
2. The data is integrated from operational systems and external information providers.
3. This information can vary from a few gigabytes to hundreds of gigabytes, terabytes or
beyond.
Process flow in data warehouse
Fig 1.3. Processes in Data Warehouse
Functions of data warehouse tools and utilities
Fig. 1.4. Functions of data warehousing
Personnel involved with data science
1. Data Scientist
A data scientist is someone who is better at statistics than any software engineer and better at
software engineering than any statistician.
The title data scientist implies the ability to work with large volumes of data generated not by studies, but
by ongoing organizational processes.
Due to the complexity of dealing with large datasets and data flows, most of the day-to-day work lies in
data pipeline challenges.
2. Data Analyst
Data analysts collect, process and perform statistical analyses of data.
Their skills may not be as advanced as those of a data scientist.
E.g. they may not be able to create new algorithms, but the goals are the same: to discover how data
can be used to answer questions and solve problems.
3. Data Engineer
A specialist in data wrangling.
Data engineers are the ones that take the messy data and build the infrastructure for real,
tangible analysis.
They run ETL software, enrich and clean all the data that companies have been storing for years.
4. Data Architect
Data architects create blueprints for data management systems.
After assessing a company’s potential data sources, architects design a plan to integrate,
centralize, protect and maintain them.
This allows employees to access critical information in the right place at the right time.
Types of Data
Unstructured data
Semi-Structured data
Meta data
Structured data
The Data Science Process (DSP)
DSP is an agile, iterative data science methodology to deliver predictive analytics solutions and
intelligent applications efficiently.
DSP helps improve team collaboration and learning.
It contains a distillation of the best practices and structures from Microsoft and others in the
industry that facilitate the successful implementation of data science initiatives.
The goal is to help companies fully realize the benefits of their analytics program.
A generic description of the process is provided here; it can be implemented with a variety of
tools.
The process may involve seven clear-cut steps for data analytics.
Data Science Project’s Lifecycle
The lifecycle has been designed for data science projects that ship as part of intelligent
applications.
The applications deploy machine learning and artificial intelligence models for predictive
analysis.
Data science projects or ad hoc analytics projects can also benefit from using this process. Some
of the steps may not be needed.
CRISP-DM remains the top methodology for data mining projects.
CRISP-DM was invented around 1996.
The 6 high-level phases of CRISP-DM are still a good description of the analytics process, but the
details and specifics need to be updated.
CRISP-DM needs to be maintained and adapted to the challenges of Big Data and modern data
science.
The lifecycle outlines the major stages that projects typically execute, often iteratively:
1. Business understanding
2. Data acquisition and understanding
3. Modeling
4. Deployment
5. Customer acceptance
Popular Data Science toolkits
Tools are an important element of the data science field.
The open source community has been contributing to the data science toolkits for years which
has led to major advancements to the field.
There is a wide variety of open source tools available, from data-mining platforms to
programming languages.
The tools covered here are a mix of technologies that data scientists could add to their toolkits.
The tools listed here are free software.
1. R programming language
2. Python
3. KNIME
4. SQL
5. Apache Hadoop and Big Data tools
6. D3 Data Science tools
7. TensorFlow
8. RStudio
R programming language
Python
KNIME
SQL
Apache Hadoop and other big data tools:
D3 Data Science tools
TensorFlow
RStudio
Recent Trends in Data Science
Recent Trends in Various Data Collection and analysis techniques
Artificial Intelligence, immersive experiences, digital twins, event-thinking and continuous adaptive security create a
foundation for the next generation of digital business models and ecosystems.
As businesses and services aim more for the connected world, overlapping of physical and digital layers around us will
probably gain more relevance.
1. Intelligent Digital Mesh
Gartner calls the entwining of people, devices, content and services the intelligent digital
mesh.
It is enabled by digital models, business platforms and a rich, intelligent set of services that support
digital business.
1. Intelligent: AI is seeping into virtually every technology and, with a defined, well-scoped
focus, can allow more dynamic, flexible and potentially autonomous systems.
2. Digital: Blending the virtual and real worlds to create an immersive digitally enhanced and
connected environment.
3. Mesh: The connections between an expanding set of people, business, devices, content and
services to deliver digital outcomes.
Fig. 6.1: Intelligent Digital Mesh
I. Artificial Intelligence (AI)
This trend is the widespread adoption of AI in all scientific and business systems and decision-making
applications.
Artificial Intelligence is classified into two parts:
1. General AI and
2. Narrow AI.
General AI refers to making machines intelligent in a wide array of activities that involve
thinking and reasoning.
Narrow AI involves the use of artificial Intelligence for a very specific task.
General AI would mean an algorithm that is capable of playing all kinds of board games.
Narrow AI is within the reach of developers and researchers.
General AI is still a dream of researchers and a perception among the masses; it will take the human
race a long time to achieve.
Machine learning is the ability of a computer system to learn from the environment and improve
itself from experience without the need for any explicit programming.
Machine learning focuses on enabling algorithms to learn from the data provided, gather insights
and make predictions on previously unanalyzed data using information gathered.
Machine learning can be performed using multiple approaches.
The three basic models of machine learning are supervised, unsupervised and reinforcement
learning.
i. Supervised Learning:
Labelled data is used to help machines recognize characteristics and apply them to future data.
Example: if you want to classify pictures of cats and dogs, you can feed the machine a few
labelled pictures, and it will then classify all the remaining pictures.
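The cats-and-dogs example can be sketched with a k-nearest-neighbour classifier, one of the techniques named earlier in this unit. The two numeric features (weight and ear length) and all the values are made up for illustration; real image classification would work on pixel data with far richer models.

```python
from collections import Counter
import math

# Hypothetical labelled examples: (weight in kg, ear length in cm) -> label.
labelled = [
    ((4.0, 6.5), "cat"),
    ((3.5, 6.0), "cat"),
    ((5.0, 7.0), "cat"),
    ((20.0, 11.0), "dog"),
    ((25.0, 12.0), "dog"),
    ((18.0, 10.0), "dog"),
]

def classify(point, examples, k=3):
    """Label a new point by majority vote among its k nearest labelled examples."""
    nearest = sorted(examples, key=lambda ex: math.dist(point, ex[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Unlabelled points are classified using what was learned from the labelled ones.
predictions = [classify(p, labelled) for p in [(4.2, 6.2), (22.0, 11.5)]]
print(predictions)  # ['cat', 'dog']
```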
ii. Unsupervised learning:
We supply unlabelled data and let the machine work out the characteristics and classify it.
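One common unsupervised technique is clustering. The sketch below is a minimal one-dimensional k-means, chosen here purely as an illustration (the slides do not name a specific algorithm); the values and starting centres are hypothetical.

```python
def k_means_1d(values, centers, iterations=10):
    """Group unlabelled numbers around the nearest centre, then re-centre; repeat."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Move each centre to the mean of its cluster (keep it if the cluster is empty).
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters

# No labels are given; the two groups emerge from the data itself.
values = [1.0, 1.5, 2.0, 9.0, 10.0, 11.0]
centers, clusters = k_means_1d(values, centers=[0.0, 5.0])
print(centers, clusters)
```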
iii. Reinforcement machine learning:
The algorithms interact with the environment by producing actions and then analyze the resulting
errors or rewards.
Example: to understand a game of chess an ML algorithm will not analyze individual moves but
will study the game as a whole.
II. Intelligent Apps and Analytics (Smart Apps)
Smart apps are applications that allow users to tap into the capabilities of their devices to
automate their lives.
Most smart apps are installed by the user via the SmartThings mobile client application.
Smart applications incorporate data-driven, actionable insights into the user experience.
Insights are delivered in context as features in applications that enable users to more efficiently
complete a desired task or action.
They often take the form of recommendations, estimates, and suggested next actions.
Smart applications can be consumer-facing or employee-facing.
They support users' operational processes with data-driven insights.
Example: Retail smart applications make product recommendations based on analysis of customer
buying behaviour while logistics applications provide data-driven estimates of delivery times of goods
and products.
Healthcare smart applications offer possible patient diagnosis & treatment recommendations to
clinicians based on analyses of patient & research data.
III. Intelligent Things (IoT)
Intelligent things use AI and machine learning to interact in a more intelligent way with people
and surroundings.
Some intelligent things wouldn’t exist without AI, but others are existing things that AI makes
intelligent.
These things operate semi-autonomously or autonomously in an unsupervised environment for a set
amount of time to complete a particular task.
Examples include a self-directing vacuum or an autonomous farming vehicle.
As technology develops, AI & machine learning will increasingly appear in a variety of objects
ranging from smart healthcare equipment to autonomous harvesting robots for farms.
IV. Digital Twins
Digital twins will bring together the connected world of sensors and humans.
Digital twins are linked to real-world objects and offer information on the state of their
counterparts, respond to changes, improve operations and add value.
The idea behind a digital twin is to harness the data and use algorithms to make reasonable
projections about the future.
The major applications of digital twins are in the following sectors.
1. Manufacturing:
The digital twin is poised to change the current face of the manufacturing sector.
It has a significant impact on the way products are designed, manufactured and maintained.
It makes manufacturing more efficient and optimized while reducing throughput times.
2. Automobile:
Digital twins can be used in the automobile sector to create a virtual model of a connected
vehicle.
It captures the behavioral and operational data of the vehicle and helps in analyzing the overall
vehicle performance as well as the connected features.
It also helps in delivering a truly personalized/ customized service for customers.
3. Retail:
Appealing customer experience is key in the retail sector.
Digital twin implementation can play a key role in augmenting the retail customer experience by
creating virtual twins for customers & modelling fashions for them on it.
Digital twins also help with better in-store planning, security implementation and optimized
energy management.
4. Healthcare:
Digital twins, along with data from IoT, can play a key role in the health care sector, from cost savings
to patient monitoring, preventative maintenance and personalized health care.
5. Smart Cities:
Smart city planning and implementation with digital twins and IoT data helps enhance
economic development, manage resources efficiently, reduce the ecological footprint and
increase the overall quality of citizens’ lives.
The digital twin model can help city planners and policymakers with smart city planning by providing
insights from various sensor networks and intelligent systems.
The data from the digital twins also helps them arrive at informed decisions regarding the future.
6. Industrial IoT:
Industrial firms with a digital twin implementation can now monitor, track and control industrial systems
digitally.
The digital twin captures environmental data such as location, configuration and financial models.
V. Cloud to the edge: edge computing:
In the context of IoT, ‘edge’ refers to the computing infrastructure that exists close to the sources of data.
Examples: industrial machines, industrial controllers such as SCADA and integrated building management
systems (IBMS), and time series databases aggregating data from a variety of equipment and sensors.
These devices typically reside away from the centralized computing available in the cloud.
SCADA (supervisory control and data acquisition) systems supervise and control equipment remotely, and an
IBMS is an automation system for the control of machinery and services in large building complexes.
While edge computing can enable many outcomes for industrial organizations, the Edge
Computing Consortium identifies the following:
1. Predictive maintenance
2. Reducing costs
3. Security assurance
4. Product to service extension
5. Energy efficiency management
6. Lower energy consumption
7. Lower maintenance costs
VI. Intelligent platforms: Conversational platforms
Conversational platforms will drive a paradigm shift in which the burden of translating intent
shifts from user to computer.
These systems are capable of answering simple questions (“how is the weather?”) or handling more
complicated interactions (“book a reservation at the Italian restaurant on Parker Ave.”).
The intelligent platforms business provides industrial software control systems and embedded
computing platforms to optimize customers’ assets and equipment.
VII. Immersive Experience
Augmented reality (AR), virtual reality (VR) and mixed reality are changing the way that people
perceive and interact with the digital world.
Combined with conversational platforms, a fundamental shift in user experience towards an invisible
and immersive experience will emerge.
Application vendors, system software vendors and development platform vendors will all
compete to deliver this model.
VIII. Blockchain
Blockchain will become a much more important technology for businesses across the globe.
Blockchain enables untrusted parties to engage in transactions.
Blockchain holds promise for many industry sectors such as finance, healthcare, and content
delivery.
A blockchain is a continuously growing list of records, called blocks, which are linked and secured
using cryptography.
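The "linked and secured using cryptography" idea can be sketched with a hash chain: each block stores the hash of the block before it, so altering an earlier record breaks the link. This is a toy illustration with hypothetical records and helper names; it leaves out consensus, digital signatures and everything else a real blockchain needs.

```python
import hashlib
import json

def block_hash(block: dict) -> str:
    """SHA-256 digest of a block's contents."""
    payload = json.dumps(block, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def append_block(chain: list, record: str) -> None:
    """Add a block that stores the hash of the previous block."""
    previous = block_hash(chain[-1]) if chain else "0" * 64
    chain.append({"index": len(chain), "record": record, "previous_hash": previous})

def is_valid(chain: list) -> bool:
    """Check that every block still points at the true hash of its predecessor."""
    return all(
        chain[i]["previous_hash"] == block_hash(chain[i - 1])
        for i in range(1, len(chain))
    )

chain = []
for record in ["alice pays bob 5", "bob pays carol 2", "carol pays dan 1"]:
    append_block(chain, record)

valid_before = is_valid(chain)             # True: links are intact
chain[1]["record"] = "bob pays carol 200"  # tamper with an earlier record
valid_after = is_valid(chain)              # False: the next block's link no longer matches
print(valid_before, valid_after)
```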
IX. Event Driven Techs
Digital businesses rely on the ability to sense and be ready to exploit new digital business
moments.
Business events reflect the discovery of notable state changes, such as the completion of a purchase
order.
X. Security: continuous adaptive risk and trust assessment (CARTA)
Digital business creates a complex, evolving security environment.
The use of increasingly sophisticated tools increases the threat potential.
CARTA allows for real-time, risk- and trust-based decision making with adaptive responses to
security-enable digital business.
1) Various Big Data Visualization Tools
To make highly informed decisions quickly, organizational leaders need to be able to access and
interpret data in real-time.
Google Charts
Tableau
QlikView
Datawrapper
Oracle Visual Analyzer
FusionCharts
Highcharts
Microsoft Power BI
Plotly
Sisense
i. Importance of Big Data Visualization
1. Review large amounts of data: data presented in graphical form enables decision makers to
take in large amounts of data & gain an understanding.
2. Spot trends: time sequence data often captures trends but spotting trends hidden in data is
notoriously hard to do especially when the sources are diverse and the quantity of data is
large.
3. Identify correlations and unexpected relationships: data visualization enables users to
explore data sets, not to find answers to specific questions, but to discover what unexpected
insights the data can reveal.
4. Present the data to others: an oft-overlooked feature of big data visualization is that it
provides a highly effective way to communicate any insights that it surfaces to others.
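The correlations that a scatter plot reveals visually can also be expressed as a number. The sketch below computes the Pearson correlation coefficient by hand for two pairs of hypothetical series; a visualization tool would typically show the same relationship as a plot.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical monthly figures.
ad_spend = [10, 20, 30, 40, 50]
sales = [12, 25, 31, 48, 55]
returns = [9, 7, 6, 3, 2]

r_sales = pearson(ad_spend, sales)      # close to +1: the series rise together
r_returns = pearson(ad_spend, returns)  # close to -1: one falls as the other rises
print(round(r_sales, 3), round(r_returns, 3))
```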
ii. Key Issues in Big Data Visualization
1. Availability of visualization specialists: many big data visualization tools are designed to be easy enough
for anyone in an organization to use.
2. Visualization hardware resources: big data visualization is essentially a computing task, and the ability to
carry out this task quickly, so that organizations can make timely decisions using real-time data, depends on
the available hardware resources.
3. Data quality: the insights drawn from big data visualization are only as accurate as the data being
visualized; if it is inaccurate or out of date, the value of the insights is questionable.
iii. Benefits of Data Visualization Tools
1. Absorb information in new and more constructive ways.
Data visualization enables users to receive vast amounts of information regarding operational and
business conditions.
Data visualization allows decision makers to see connections between multi-dimensional data sets and
provides new ways to interpret data through the use of heat maps, fever charts, other rich graphical
representations.
2. Visualize relationships and patterns between operational and business activities.
Data visualization enables users to more effectively see connections as they are occurring between
operating conditions and business performance.
Example: an executive team for an electronics retailer is viewing monthly customer data. The team is
presented with a bar chart that shows the company’s net promoter score.
3. Identify and act on emerging trends faster.
The volume of data that companies are able to gather about customers and market conditions can
provide business leaders with insights into new revenue and business opportunities, presuming
they can spot the opportunities in data.
4. Manipulate and interact directly with data:
One of the greatest strengths of data visualization is how it brings actionable insights to the
surface.
Unlike one-dimensional tables and charts that can only be viewed, data visualization tools enable users to
interact with data.
5. Foster a new business language:
One advantage of data visualization is its ability to tell a story through data.
Example: business leaders at a consumer packaged goods company track key performance
indicators such as EBITDA and net profit margin.
They can gather only part of the story about current business conditions using a static bar chart.
6.3 Visualizing Big Data
Big data is characterized by its volume, variety and velocity.
Volume refers to the size of the data.
Variety describes whether the data is structured, semistructured, or unstructured.
Velocity is speed at which data pours in and how frequently it changes.
In SAS Visual Analytics, "autocharting" refers to a feature that automatically
selects the most suitable data visualization based on the type and amount of data
you provide. Users can visualize their data without manually choosing a chart
type, which makes the feature particularly useful for non-technical users and
business analysts.
6.3.2 Visualizing semistructured and unstructured
data using word clouds and network diagrams.
Refer pg no. 234
Semistructured and unstructured data, such as large amounts of text or complex
relationships, can be visualized with word clouds and network diagrams.
Word clouds highlight the frequency of words within the data by displaying them
in different sizes.
Network diagrams represent connections and relationships between entities within
the data, showing nodes (entities) and edges (links) between them.
Both methods are particularly useful for identifying key themes and patterns in
unstructured data sets.
•Word Clouds:
•Data preparation: Text is processed by cleaning, tokenizing, and removing stop
words (common words like "the" and "a") to focus on meaningful keywords.
•Visualization: Each unique word is displayed in a cloud, with the font size
proportional to its frequency within the text.
•Benefits: Quickly identify dominant topics, understand the overall sentiment of a
text corpus, and compare different text sets by visually seeing which keywords
are most prevalent.
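The word cloud steps above (clean, tokenize, remove stop words, then size each word by its frequency) can be sketched in a few lines of Python. This is only an illustrative sketch: the stop-word list, the sample review text, and the function names `word_frequencies` and `font_sizes` are assumptions made for this example, and a real tool would also handle layout and drawing.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption); real pipelines use fuller lists.
STOP_WORDS = {"the", "a", "an", "and", "is", "it", "of", "to", "was", "but"}

def word_frequencies(text):
    """Lower-case the text, split it into word tokens, and count non-stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)

def font_sizes(freqs, max_pt=40):
    """Give each word a font size proportional to its frequency."""
    top = max(freqs.values())
    return {word: max_pt * count / top for word, count in freqs.items()}

# Made-up customer review text.
reviews = "The battery is great. Great screen, but the battery drains fast. Great value."
freqs = word_frequencies(reviews)
sizes = font_sizes(freqs)
for word, count in freqs.most_common(3):
    print(word, count, round(sizes[word], 1))
```

The most frequent word receives the largest font size, and every other word is scaled relative to it.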
•Network Diagrams:
•Data preparation: Entities and their relationships are identified from the data,
creating nodes and edges.
•Visualization: Nodes are represented as points on a graph, and edges connect
relevant nodes, often with line thickness or color indicating the strength of the
relationship.
•Benefits: Analyze complex relationships between entities, identify clusters
within the data, and understand the flow of information or influence within a
network
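As a minimal sketch of the data preparation step for a network diagram, the code below turns relationship records into nodes and weighted edges and counts the connections of each node. The user names, the interaction counts, and the helper names `build_graph` and `degree` are hypothetical; the edge weight is what a drawing tool could map to line thickness.

```python
from collections import defaultdict

# Hypothetical records: (user_a, user_b, number_of_interactions).
interactions = [
    ("ana", "ben", 5),
    ("ana", "cara", 2),
    ("ana", "dev", 1),
    ("ben", "cara", 3),
]

def build_graph(records):
    """Return an undirected adjacency map: node -> {neighbour: edge weight}."""
    graph = defaultdict(dict)
    for a, b, weight in records:
        graph[a][b] = weight
        graph[b][a] = weight
    return graph

def degree(graph):
    """Count the edges touching each node, a simple indicator of influence."""
    return {node: len(neighbours) for node, neighbours in graph.items()}

graph = build_graph(interactions)
degrees = degree(graph)
most_connected = max(degrees, key=degrees.get)
print(most_connected, degrees[most_connected])
```

Degree is only one of several possible measures of influence; it is used here because it is the simplest to read off the adjacency map.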
Examples of use cases:
•Analyzing customer feedback:
•Use a word cloud to identify frequently mentioned positive and negative aspects of a product
or service from customer reviews.
•Social media analysis:
•Create a network diagram to visualize the connections between users on a social platform
and identify influential individuals.
•Research paper analysis:
•Use a word cloud to identify key themes and recurring concepts within a collection of
research papers.
•Market research:
•Visualize the relationships between different brands or products in a market using a network
diagram
A visualization using a correlation matrix, often called a "correlogram," is a
graphical representation of the relationships between multiple variables in a
dataset, where each cell in the matrix displays the correlation coefficient between
two variables, allowing you to quickly identify patterns of strong positive or
negative correlations between different features in your data.
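To make the idea of a correlation matrix concrete, the sketch below computes the Pearson correlation coefficient for every pair of variables and prints the result as a small text grid. The variable names and values are made up for illustration, and the helper names `pearson` and `correlation_matrix` are assumptions; a correlogram would colour each cell by its coefficient instead of printing numbers.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Return a nested dict: matrix[a][b] is the correlation of columns a and b."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Made-up data for five students.
data = {
    "hours_studied": [1, 2, 3, 4, 5],
    "exam_score": [52, 58, 65, 70, 78],
    "hours_gaming": [9, 7, 6, 4, 2],
}
matrix = correlation_matrix(data)
for name, row in matrix.items():
    print(name.ljust(14), "  ".join(f"{row[other]:+.2f}" for other in matrix))
```

Values close to +1 or -1 indicate strong positive or negative correlation, and the diagonal is always 1 because each variable correlates perfectly with itself.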
3) Preattentive Attributes
These attributes are what immediately catch our eyes when we look at a visualization.
They can be perceived within a fraction of a second (commonly cited as under about 250 milliseconds),
even before we make a conscious effort to notice them.
4) Challenges of Big Data Visualization
Scalability and dynamics are two major challenges in visual analytics.
Fig. 6.1 shows the research status for static data and dynamic data according to data size.
For big dynamic data, solutions for type A problems or type B problems often do not work for
combined A and B problems.
5) Potential Solutions
Future Progress of Big Data Visualization
Refer Pg No: 239
Data Visualisation
1) Data visualisation is a term that describes any effort to help people understand the significance of data
by placing it in a visual context.
Patterns, trends, and correlations that might go undetected in text-based data can be exposed and
recognized more easily with data visualisation software.
Data visualisation is the presentation of data in a pictorial or graphical format.
It enables decision makers to see analytics presented visually so they can grasp difficult concepts
or identify new patterns.
2) Data Attributes.
There are two categories: quantitative data and qualitative data.
Quantitative data is exactly what it sounds like: a numerical value placed on an ascending scale.
Qualitative data refers to values that cannot be measured numerically but can be described
through language.
2.1 Quantitative Data
Quantitative data can be:
1. Ratio (cost: $10, $20, $30; age: 10 years old, 20 years old)
Data you can perform all arithmetic operations on (add, divide, etc.).
2. Interval (temperature: -5 degrees, 10 degrees, 25 degrees; time: 1 am, 5 am)
Data with set values that you cannot perform all arithmetic operations on.
Example: You cannot meaningfully calculate the sum of temperatures during a week, but you can calculate the
average temperature per day and the high/low for each day.
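The difference between ratio and interval data can be checked with a little arithmetic. The sketch below assumes the temperatures are in degrees Celsius (the slide does not state a unit) and uses made-up weekly readings: dividing costs gives a meaningful ratio, while dividing Celsius temperatures does not, because the zero point of the scale is arbitrary.

```python
def celsius_to_kelvin(c):
    """Convert Celsius to Kelvin, a scale whose zero is a true zero."""
    return c + 273.15

# Ratio data: zero means "nothing", so 20 dollars really is twice 10 dollars.
cost_a, cost_b = 10, 20
print(cost_b / cost_a)

# Interval data: 20 degrees divided by 10 degrees gives 2, but that ratio
# changes when the same temperatures are expressed on another scale.
temp_a, temp_b = 10.0, 20.0
naive_ratio = temp_b / temp_a
kelvin_ratio = celsius_to_kelvin(temp_b) / celsius_to_kelvin(temp_a)
print(naive_ratio, round(kelvin_ratio, 3))

# Averages, highs, and lows remain meaningful on an interval scale.
week = [-5, 10, 25, 12, 8, 15, 5]  # made-up daily readings
average = sum(week) / len(week)
print(average, max(week), min(week))
```

The two ratios disagree, which is why "twice as hot" is not a valid statement for Celsius readings, whereas differences and averages are unaffected by where the zero sits.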
Other data types for Visualisation
There are eight types of quantitative messages that users may attempt to understand or communicate from a
set of data, each with associated graphs used to help communicate the message.
1. Time-series:
A single variable is captured over a period of time, such as the unemployment rate over a 10-year period.
A line chart may be used to demonstrate the trend.
2. Ranking:
Categorical subdivisions are ranked in ascending or descending order, such as a ranking of sales
performance by salespersons during a single period.
A bar chart may be used to show the comparison across the salespersons.
3. Part-to-whole:
Categorical subdivisions are measured as a ratio to the whole.
A pie chart or bar chart can show the comparison of ratios, such as the market share represented by
competitors.
4. Deviation:
Categorical subdivisions are compared against a reference, such as a comparison of actual vs budget
expenses for several departments of a business for a given time period.
A bar chart can show the comparison of the actual versus the reference amount.
5. Frequency distribution:
Shows the number of observations of a particular variable for a given interval, such as the number of years in
which the stock market return falls in intervals such as 0-10%, 11-20%.
A histogram, a type of bar chart, may be used for this analysis.
A boxplot helps visualize key statistics about the distribution, such as the median, quartiles, and outliers.
6. Correlation:
Comparison between observations represented by two variables (X, Y) to determine if they tend to
move in the same or opposite directions.
Example: plotting unemployment (X) and inflation (Y) for a sample of months.
A scatter plot is typically used for this message.
7. Nominal comparison:
Comparing categorical subdivisions in no particular order, such as the sales volume by product code.
A bar chart may be used for this comparison.
8. Geographic or geospatial:
Comparison of a variable across a map or layout, such as the unemployment rate by state or the number
of persons on various floors of a building.
A cartogram is a typical graphic used.
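The frequency distribution message (item 5 in the list above) can be illustrated with a short sketch that bins observations into intervals, prints a text histogram, and computes the quartiles that a boxplot would display. The return values are made up, the function name `frequency_distribution` is an assumption, and the bins here are half-open intervals such as 0% to under 10%.

```python
import statistics

# Made-up annual stock market returns in percent.
returns = [4.2, 12.5, 7.8, 18.1, 3.3, 9.9, 15.0, 6.4, 11.2, 8.7]

def frequency_distribution(values, bin_width=10):
    """Count observations per half-open interval [low, low + bin_width)."""
    counts = {}
    for v in values:
        low = int(v // bin_width) * bin_width
        counts[low] = counts.get(low, 0) + 1
    return dict(sorted(counts.items()))

bins = frequency_distribution(returns)
for low, count in bins.items():
    # One '#' per observation gives a simple text histogram.
    print(f"{low:>3}% to <{low + 10}%: {'#' * count}")

# Quartiles split the sorted observations into four parts, as a boxplot does.
q1, median, q3 = statistics.quantiles(returns, n=4)
print(round(q1, 2), round(median, 2), round(q3, 2))
```

Each bar of the histogram corresponds to one interval, and the three quartile values mark the box and the centre line of a boxplot.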
2.2 Qualitative Data
Ordinal (size: small, medium, large; position: 1st place, 2nd place, 3rd place)
Data with a fixed ranking but an indeterminate distance between the values.
Example: a large elephant in India is very different from a large elephant in Africa, but we don’t know
exactly how much larger one is.
Nominal (sports: NFL football vs English football; computers: laptop vs desktop)
Data where you can distinguish between values, but not order them.
3) Importance of Data Visualisation
Primary goal is to communicate information clearly & efficiently via statistical graphics, plots, and
information graphics.
Numerical data may be encoded using dots, lines, or bars, to visually communicate a quantitative
message.
It helps users analyze & reason about data & evidence.
It makes complex data more accessible, understandable & usable.
4) Conventional data visualisation methods
Conventional data visualisation methods that are often used include: table, histogram, scatter plot, line chart, bar
chart, pie chart, area chart, flow chart, bubble chart, multiple data series or combination of charts, time line,
Venn diagram, data flow diagram, and entity relationship diagram.
Parallel coordinates are used to plot individual data elements across many dimensions.
Parallel coordinates are very useful for displaying multidimensional data (Fig. 4.2).
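A rough sketch of how parallel coordinates place one data element across many dimensions: every dimension gets its own vertical axis, each value is rescaled to a height between 0 and 1 on that axis, and the heights are joined into one polyline per element. The car records and the function name `parallel_coordinates` are hypothetical, and drawing the axes and lines is left out.

```python
def parallel_coordinates(rows, dimensions):
    """Rescale every dimension to [0, 1] and return one polyline per row.

    Each polyline is a list of (axis_index, height) vertices.
    """
    lows = {d: min(r[d] for r in rows) for d in dimensions}
    highs = {d: max(r[d] for r in rows) for d in dimensions}
    lines = []
    for r in rows:
        line = []
        for i, d in enumerate(dimensions):
            span = highs[d] - lows[d]
            # A constant dimension has no spread, so place it mid-axis.
            height = (r[d] - lows[d]) / span if span else 0.5
            line.append((i, height))
        lines.append(line)
    return lines

# Hypothetical records with three dimensions each.
cars = [
    {"mpg": 30, "horsepower": 70, "weight": 2000},
    {"mpg": 20, "horsepower": 120, "weight": 3000},
    {"mpg": 10, "horsepower": 220, "weight": 4000},
]
lines = parallel_coordinates(cars, ["mpg", "horsepower", "weight"])
for car, line in zip(cars, lines):
    print(car["mpg"], [(axis, round(height, 2)) for axis, height in line])
```

Lines that cross between two neighbouring axes hint at a negative relationship between those dimensions, as with mpg and horsepower in this made-up data.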
A treemap is an effective method of visualizing hierarchies.
The size of each sub-rectangle represents one measure, while color is often used to represent another
measure of the data (Fig. 4.3).
Example: a treemap of a collection of choices for streaming music and video tracks in a social network community.
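One simple way to compute treemap rectangles is the "slice-and-dice" layout, sketched below as an illustration; it is not necessarily the layout used for the figure in the slides. The hierarchy, its sizes, and the function names are made up. Each child receives a share of its parent's rectangle proportional to its size, and the split direction alternates at each level. Colour, the second measure, is not modelled here.

```python
def total_size(node):
    """Size of a leaf, or the summed size of all leaves under a branch."""
    name, payload = node
    if isinstance(payload, list):
        return sum(total_size(child) for child in payload)
    return payload

def slice_and_dice(node, x, y, width, height, depth=0, out=None):
    """Lay out a hierarchy as nested rectangles.

    A node is (name, size) for a leaf or (name, [children]) for a branch.
    Returns a list of (name, x, y, width, height) tuples, one per leaf.
    """
    if out is None:
        out = []
    name, payload = node
    if not isinstance(payload, list):
        out.append((name, round(x, 2), round(y, 2), round(width, 2), round(height, 2)))
        return out
    total = total_size(node)
    offset = 0.0
    for child in payload:
        share = total_size(child) / total
        if depth % 2 == 0:  # even depth: split left to right
            slice_and_dice(child, x + offset * width, y, share * width, height, depth + 1, out)
        else:  # odd depth: split top to bottom
            slice_and_dice(child, x, y + offset * height, width, share * height, depth + 1, out)
        offset += share
    return out

# Hypothetical hierarchy of streaming choices with made-up play counts.
tree = ("media", [
    ("music", [("rock", 30), ("jazz", 10)]),
    ("video", [("films", 40), ("clips", 20)]),
])
rects = slice_and_dice(tree, 0, 0, 100, 50)
for rect in rects:
    print(rect)
```

Because every share is proportional to size, the area of each leaf rectangle is proportional to its measure, which is the property a treemap relies on.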
4.1 Visual Perception and Data Visualisation
Humans can readily distinguish differences in line length, shape, orientation and colour without
significant processing effort; these are referred to as “pre-attentive attributes”.
Example: it may require significant time & effort to identify the number of times the digit “5” appears
in a series of numbers.
If that digit is different in size, orientation, or colour, instances of the digit can be noticed quickly through
pre-attentive processing.
Visualisations are not only static; they can also be interactive.
Interactive visualisation can be performed through approaches such as zooming, overview and detail,
zoom and pan, and focus and context.
The steps for interactive visualisation are as follows:
1. Selecting
2. Linking
3. Filtering
4. Rearranging or remapping.
4.2 Mapping of data visualisation
5) Retinal variables
5.1 Seven variables for visualisation
1. Two planar variables
2. Five so-called “retinal” variables
The two planar variables are the X and Y positions on the map plane.
Five “Retinal” variables
1. Size
2. Color value
3. Color hue
4. Shape
5. Orientation
5.2 Types of visual variables
A visual variable can be categorised in three different categories:
1. Selective (e.g., colour hue)
2. Associative (e.g., shape)
3. Ordered
Selective (e.g., colour hue)
Associative (e.g., shape)
Ordered
5.3 Effectiveness of mappings
6) Mapping Variables to Encodings
6.1 Choosing Appropriate Visual Encodings
Natural ordering and the number of distinct values indicate whether a visual property is best suited to
one of the main data types: quantitative, ordinal, categorical, or relational data.
6.2 Natural Ordering
Whether a visual property has a natural ordering is determined by whether our visual system and the
“software” in our brains automatically assign an order to its different values.
Example: position has a natural ordering, shape doesn’t; length has a natural ordering, texture doesn’t.
6.3 Distinct Values
When choosing a visual property, select one that has a number of useful differentiable values and an
ordering similar to that of your data (Fig. 4.9).
Fig. 4.10 shows another way to think about visual properties, depending on what kind of data you
need to encode.
6.4 Mapping of data types according to Mackinlay
Mackinlay prepared a chart showing the quality of mapping of data types, shown in Fig. 4.11.
THANK YOU