DSA Notes Unit-01

The document provides an introduction to Data Science, detailing its interdisciplinary nature and the methodologies used to extract insights from data. It covers key terminologies such as Big Data, Business Intelligence, Data Analytics, and the various types of data repositories like Data Lakes and Data Warehouses. Additionally, it outlines the roles of personnel involved in data science, including Data Scientists, Analysts, Engineers, and Architects.

Uploaded by

Sohail Agha

Padre Conceição

College Of Engineering

CEAM-03 – Data Science and Analytics


(T.E. Computer, Sem-VI)

Presented by: Asst Prof. Vidya G

30/07/2025 INTRODUCTION TO DATA SCIENCE 1


Syllabus

UNIT-01
Introduction to Data Science


Data Science
Data Science, also known as data-driven science, is an interdisciplinary field of scientific
methods, processes, and systems for extracting knowledge or insights from data in various forms,
either structured or unstructured, similar to Data Mining.
It is a convergence of various knowledge domains that makes effective use of different analysis
methods to improve the output of experts in their activities (refer Fig. 1.1).
Data Science is one of the more recent fields, combining big data, unstructured data,
statistics, analytics and business intelligence.

Data Science is the discipline of using quantitative methods from statistics and mathematics,
along with technology, to develop algorithms designed to discover patterns, predict
outcomes, and find optimal solutions to complex problems.

Data science employs techniques and theories drawn from many fields within the broad areas of
mathematics, statistics, information science, and computer science, in particular from the
sub-domains of machine learning, classification, cluster analysis, data mining, data lakes
and warehousing, databases, and visualization (refer Fig. 1.2).

Terminology Related with Data Science
1. Big Data

Big Data is a term applied to datasets whose size or type is beyond the ability of traditional
relational databases to capture, manage and process with low latency.

Big data usually includes data sets with sizes beyond the ability of commonly used software
tools to capture, curate, manage and process data within a tolerable elapsed time.

2. Business Intelligence (BI)

BI is the technology that uses transformed and loaded historical data to produce reports.

It is a set of methodologies, processes and theories that transform raw data into useful information
to help companies make better decisions.

BI is a process for analyzing data and presenting actionable information to help executives,
managers and other corporate end users make informed business decisions.

Functions in BI technologies include reporting, online analytical processing, analytics, data
mining, process mining, complex event processing, business performance management,
benchmarking, text mining, predictive analytics and prescriptive analytics.

BI can be used by enterprises to support a wide range of business decisions ranging from
operational to strategic.

3. Data Analytics

The terms Data Analytics and analytics are used to describe the field and the comprehensive
collection of associated methods.

Data analysts collect, process and perform statistical analyses of data.

Difference between Big Data and Business Intelligence

BIG DATA | BUSINESS INTELLIGENCE (BI)
Big Data refers to the act of generating, capturing, and processing enormous amounts of data on a continuous basis. | BI encompasses only commercial activities, its domain is larger. The data is collected in data lakes and refined in data warehousing through data mining techniques.
Big Data is the technology which collects and transforms huge data that is in an unstructured form. | BI refers to software and systems that import data streams of any size and use them to generate informational displays that point to specific decisions.

4. Data Wrangling

The process of converting data, typically with scripting languages, into a form that is easier to
work with is known as data wrangling or data munging.

Example: given 900,000 birth year values in the format yyyy-dd-mm and 100,000 in the format
mm/dd/yyyy, writing a Perl script that converts the latter to match the former, so that all
values can be used together, is data wrangling.
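
The conversion in the example above could look roughly like the sketch below. It is written in Python rather than Perl, and the function name and the sample records are hypothetical, chosen only for illustration.

```python
from datetime import datetime

TARGET_FORMAT = "%Y-%d-%m"   # yyyy-dd-mm, the majority format in the example
LEGACY_FORMAT = "%m/%d/%Y"   # mm/dd/yyyy, the minority format to be converted


def normalize_birth_date(value: str) -> str:
    """Return the value in yyyy-dd-mm form, converting from mm/dd/yyyy if needed."""
    value = value.strip()
    if "/" in value:
        parsed = datetime.strptime(value, LEGACY_FORMAT)
        return parsed.strftime(TARGET_FORMAT)
    # Values without slashes are expected to be in the target format already;
    # parsing them raises ValueError if they are malformed.
    datetime.strptime(value, TARGET_FORMAT)
    return value


# Hypothetical sample records mixing both formats.
records = ["1990-25-12", "12/25/1990", " 07/04/1985 "]
cleaned = [normalize_birth_date(r) for r in records]
```

After this step every value shares one format, so the whole column can be analysed together.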

5. Algorithm

A series of repeatable steps for carrying out a certain type of task with data.

6. Machine Learning

Analytics in which computers “learn” from data to produce models or rules that apply to those
data and other similar data.

It includes predictive modelling techniques such as neural nets, classification and regression trees,
naïve Bayes, k-nearest neighbour, and support vector machines.

7. Web Analytics

Statistical or machine learning methods applied to web data such as page views, hits, clicks,
and conversions, generally with a view to learning which web presentations are most effective in
achieving the organizational goal.

This goal might be to sell products and services on a site, to serve and sell advertising space, or to
prompt purchases on other sites.

Its advantage is the volume and constant flow of data.
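
As a minimal sketch of the idea, the snippet below compares two web presentations by conversion rate. The page names and counts are made-up illustrative data, not figures from these notes.

```python
def conversion_rate(page_views: int, conversions: int) -> float:
    """Fraction of page views that ended in a conversion."""
    if page_views <= 0:
        return 0.0
    return conversions / page_views


# Hypothetical web data for two versions of a landing page.
pages = {
    "landing_a": {"views": 2000, "conversions": 90},
    "landing_b": {"views": 1500, "conversions": 90},
}

rates = {
    name: conversion_rate(stats["views"], stats["conversions"])
    for name, stats in pages.items()
}
# The presentation with the highest rate is the most effective for this goal.
best_page = max(rates, key=rates.get)
```

In practice a statistical test would be needed before declaring one page better; this only shows the basic measurement.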

Methods of Data Repository
Data repository is the term used for data storage.

A data repository is a data storage entity into which data has been specifically partitioned for an
analytical or reporting purpose.

It takes several different forms:

• Data lakes
• Data marts
• Data warehouses
• Big Data, Hadoop and similar frameworks

1. Data lake
• A data lake is a storage repository that holds a vast amount of raw data in its native format until it is
needed and refined.
• A data lake is a shared data environment that comprises multiple repositories and capitalizes on big data
technologies.
• It provides data to an organization for a variety of analytic processes.

• A data lake is often associated with Hadoop-oriented object storage, in which an organization's data is
loaded into the Hadoop platform.

• Business analytics and data mining tools are applied to the data where it resides on the Hadoop
cluster.

• The data lake concept takes Hadoop deployments to their extreme, creating a potentially
limitless reservoir for disparate collections of structured, unstructured and semi-structured data
generated by transaction systems, social networks, server logs, sensors and other sources.

Characteristics of Data Lake:

1. All data is loaded from source systems. No data is turned away.
2. Data is stored at the leaf level in an untransformed or nearly untransformed state.
3. Data is transformed and schema is applied to fulfil the needs of analysis.
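
The three characteristics above can be sketched in a few lines: raw records are kept as they arrived, and a schema is applied only when an analysis reads them. The sample records, the `read_with_schema` helper and the `sales_schema` mapping are all hypothetical names used for illustration.

```python
import json

# Raw events stored exactly as they arrived; nothing is turned away.
raw_lake = [
    '{"user": "u1", "amount": "19.99", "ts": "2025-07-30"}',
    '{"user": "u2", "clicks": 3}',
    '{"user": "u3", "amount": "5.00", "ts": "2025-07-31", "note": "gift"}',
]


def read_with_schema(lines, schema):
    """Apply a schema at read time: keep records that have every field and cast each one."""
    rows = []
    for line in lines:
        record = json.loads(line)
        if all(field in record for field in schema):
            rows.append({field: cast(record[field]) for field, cast in schema.items()})
    return rows


# The schema is chosen by the analysis, not by the storage layer.
sales_schema = {"user": str, "amount": float}
sales_rows = read_with_schema(raw_lake, sales_schema)
```

A different analysis could read the same raw records with a different schema, for example one that looks only at `clicks`.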

2. Data Warehouse

• A data warehouse is constructed by integrating data from multiple heterogeneous sources to
support analytical reporting, structured and/or ad hoc queries, and decision making.

• Data warehousing involves data cleaning, data integration, and data consolidation.

• A core component of BI, a data warehouse is a central repository of integrated data from one or
more disparate sources, and it is used for reporting and data analytics.

• Unlike a hierarchical database that stores data in files or folders, a data lake uses a flat architecture to
store data.

• Example: transactions that are updated on a daily basis.

• A data warehouse provides generalized and consolidated data in a multidimensional view.

• A data warehouse also provides online analytical processing (OLAP) tools.

• These tools help in the interactive and effective analysis of data in a multidimensional space. This analysis
results in data mining.
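
To make the multidimensional view concrete, here is a minimal sketch of an OLAP-style roll-up in plain Python. The fact rows, the dimension names and the `roll_up` helper are assumptions made for illustration; real OLAP tools do this at much larger scale.

```python
from collections import defaultdict

# Hypothetical fact rows: (region, product, quarter, sales amount).
facts = [
    ("North", "Laptop", "Q1", 120),
    ("North", "Phone", "Q1", 80),
    ("South", "Laptop", "Q1", 100),
    ("South", "Laptop", "Q2", 150),
    ("North", "Phone", "Q2", 60),
]
DIMENSIONS = ("region", "product", "quarter")


def roll_up(rows, keep):
    """Sum the sales amount over every dimension not named in keep."""
    indexes = [DIMENSIONS.index(name) for name in keep]
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[i] for i in indexes)
        totals[key] += row[3]
    return dict(totals)


by_region = roll_up(facts, ["region"])
by_region_quarter = roll_up(facts, ["region", "quarter"])
```

Moving from `by_region` to `by_region_quarter` is a drill-down: the same facts viewed along one more dimension.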

Understanding a Data Warehouse

1. A data warehouse is a database, which is kept separate from the organization’s operational database.

2. There is no frequent updating done in a data warehouse.

3. It possesses consolidated historical data, which helps the organization to analyze the business.

4. A data warehouse helps executives to organize, understand and use their data to take strategic decisions.

5. A data warehouse system helps in the integration of a diversity of application systems.

6. A data warehouse system helps in consolidated historical data analysis.

Data warehouse models
From the perspective of data warehouse architecture, we have the following data warehouse
models:
1. Virtual warehouse
2. Data marts
3. Enterprise warehouse

1. Virtual warehouse
• The view over an operational data warehouse is known as a virtual warehouse.
• It is easy to build a virtual warehouse.
• Building a virtual warehouse requires excess capacity on operational database servers.
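
One way to picture a virtual warehouse is as a database view defined over operational tables. The sketch below uses Python's built-in `sqlite3` module; the `orders` table and `customer_totals` view are hypothetical names chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL);
    INSERT INTO orders (customer, amount) VALUES
        ('alice', 30.0), ('bob', 20.0), ('alice', 50.0);
    -- The view stores no data; each query runs against the operational table.
    CREATE VIEW customer_totals AS
        SELECT customer, SUM(amount) AS total
        FROM orders
        GROUP BY customer;
""")

QUERY = "SELECT customer, total FROM customer_totals ORDER BY customer"
totals = conn.execute(QUERY).fetchall()

# A new operational row is visible through the view immediately.
conn.execute("INSERT INTO orders (customer, amount) VALUES ('bob', 5.0)")
totals_after = conn.execute(QUERY).fetchall()
```

Because every query on the view is executed by the operational database, analytical load lands on the operational servers, which is why excess capacity is needed there.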

2. Data marts:
• A data mart contains a subset of organization-wide data.
• This subset is valuable to a specific group within an organization.
• Example: the marketing data mart may contain data related to items, customers, and sales.
Data marts are confined to subjects.

3. Enterprise warehouse
• An enterprise warehouse collects all the information and the subjects spanning an entire
organization.
1. It provides enterprise-wide data integration.
2. The data is integrated from operational systems and external information providers.
3. This information can vary from a few gigabytes to hundreds of gigabytes, terabytes or
beyond.

Process flow in data warehouse

Fig. 1.3. Processes in Data Warehouse

Functions of data warehouse tools and utilities

Fig. 1.4. Functions of data warehousing


Personnel involved with data science
1. Data Scientist

• A data scientist is someone who is better at statistics than any software engineer and better at
software engineering than any statistician.

The title data scientist implies the ability to work with large volumes of data generated not by studies, but
by ongoing organizational processes.

Due to the complexity of dealing with large datasets and data flows, most of the day-to-day work lies in
data pipeline challenges.

2. Data Analyst

• Data analysts collect, process and perform statistical analyses of data.

• Their skills may not be as advanced as those of a data scientist.

• E.g. they may not be able to create new algorithms, but the goals are the same: to discover how data
can be used to answer questions and solve problems.

3. Data Engineer

• A data engineer is a specialist in data wrangling.

• Data engineers are the ones who take the messy data and build the infrastructure for real,
tangible analysis.

• They run ETL software, and enrich and clean all the data that companies have been storing for years.
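
A toy version of the extract, transform, load (ETL) work described above is sketched below. The CSV content, the cleaning rules and the in-memory `warehouse_table` are assumptions for illustration; production ETL tools handle far more cases.

```python
import csv
import io

# Hypothetical messy source data: stray spaces, mixed case, a missing value.
RAW_CSV = """name,city,spend
 Asha ,goa,120.50
ravi,GOA,
Meera,Pune,80
"""


def extract(text):
    """Read CSV text into a list of dictionaries, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Drop rows with no spend value and normalize names, cities and numbers."""
    cleaned = []
    for row in rows:
        if not row["spend"]:
            continue
        cleaned.append({
            "name": row["name"].strip().title(),
            "city": row["city"].strip().title(),
            "spend": float(row["spend"]),
        })
    return cleaned


def load(rows, target):
    """Append cleaned rows to the target store and report how many were loaded."""
    target.extend(rows)
    return len(rows)


warehouse_table = []
loaded = load(transform(extract(RAW_CSV)), warehouse_table)
```

Keeping the three stages as separate functions makes each one easy to test and replace independently.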

4. Data Architect
• Data architects create blueprints for data management systems.
• After assessing a company’s potential data sources, architects design a plan to integrate,
centralize, protect and maintain them.
• This allows employees to access critical information in the right place at the right time.

Types of Data

Unstructured data

Semi-Structured data

Meta data

Structured data


The Data Science Process (DSP)
• DSP is an agile, iterative data science methodology for delivering predictive analytics solutions and
intelligent applications efficiently.

• DSP helps improve team collaboration and learning.

• It contains a distillation of the best practices and structures from Microsoft and others in the
industry that facilitate the successful implementation of data science initiatives.

• The goal is to help companies fully realize the benefits of their analytics program.

A generic description of the process is provided here that can be implemented with a variety of
tools.

The process may involve 7 clear-cut steps for data analytics.


Data Science Project’s Lifecycle

The lifecycle has been designed for data science projects that ship as part of intelligent
applications.

The applications deploy machine learning and artificial intelligence models for predictive
analysis.

Data science projects or ad hoc analytics projects can also benefit from using this process. Some
of the steps may not be needed.

CRISP-DM remains the top methodology for data mining projects.

CRISP-DM was invented around 1996.

The 6 high-level phases of CRISP-DM are still a good description of the analytics process, but the
details and specifics need to be updated.

CRISP-DM needs to be maintained and adapted to the challenges of Big Data and modern data
science.

• The lifecycle outlines the major stages that projects typically execute, often iteratively:
1. Business understanding
2. Data acquisition and understanding
3. Modeling
4. Deployment
5. Customer acceptance

Popular Data Science toolkits
Tools are an important element of the data science field.

The open source community has been contributing to data science toolkits for years, which
has led to major advancements in the field.

There is a wide variety of open source tools available, from data-mining platforms to
programming languages.

Data scientists can mix all of these technologies in their data science toolkits.

The tools listed here are free software.
1. R programming language
2. Python
3. KNIME
4. SQL
5. Apache Hadoop and Big Data tools
6. D3 Data Science tools
7. TensorFlow
8. RStudio



Recent Trends in Data Science
Recent Trends in Various Data Collection and analysis techniques
Artificial Intelligence, immersive experiences, digital twins, event-thinking and continuous adaptive security create a
foundation for the next generation of digital business models and ecosystems.

As businesses and services aim more for the connected world, the overlap of the physical and digital layers around us will
probably gain more relevance.

1. Intelligent Digital Mesh

• Gartner calls the entwining of people, devices, content and services the intelligent digital
mesh.

• It is enabled by digital models, business platforms and a rich, intelligent set of services to support
digital business.

1. Intelligent: AI is seeping into virtually every technology and, with a defined, well-scoped
focus, can allow more dynamic, flexible and potentially autonomous systems.

2. Digital: Blending the virtual and real worlds to create an immersive digitally enhanced and
connected environment.

3. Mesh: The connections between an expanding set of people, business, devices, content and
services to deliver digital outcomes.

Fig. 6.1: Intelligent Digital Mesh


I. Artificial Intelligence (AI)

AI is being widely adopted into scientific and business systems and decision-making
applications.

Artificial Intelligence is classified into two parts:

1. General AI

2. Narrow AI

General AI refers to making machines intelligent in a wide array of activities that involve
thinking and reasoning.

Narrow AI involves the use of artificial intelligence for a very specific task.

General AI would mean, for example, an algorithm that is capable of playing all kinds of board games.

Narrow AI is within the reach of developers and researchers.

General AI remains a dream of researchers and a popular perception; it will take the human race a long
time to achieve.

Machine learning is the ability of a computer system to learn from the environment and improve
itself from experience without the need for any explicit programming.

Machine learning focuses on enabling algorithms to learn from the data provided, gather insights
and make predictions on previously unanalyzed data using information gathered.

Machine learning can be performed using multiple approaches.

The three basic models of machine learning are supervised, unsupervised and reinforcement
learning.

i. Supervised Learning:

Labelled data is used to help machines recognize characteristics and apply them to future data.

Example: if you want to classify pictures of cats and dogs, you can feed in a few
labelled pictures and the machine will then classify all the remaining pictures.
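
The cat and dog example can be sketched with a nearest-neighbour classifier, one of the techniques named earlier in these notes. Real pictures would first be turned into numeric features; the two features and all values below are invented for illustration.

```python
import math

# Hypothetical labelled examples: (weight in kg, ear length in cm) -> label.
labelled = [
    ((4.0, 6.5), "cat"),
    ((3.5, 7.0), "cat"),
    ((20.0, 11.0), "dog"),
    ((30.0, 12.5), "dog"),
]


def classify(features, training):
    """Return the label of the closest labelled example (1-nearest neighbour)."""
    nearest = min(training, key=lambda item: math.dist(item[0], features))
    return nearest[1]


# Unlabelled animals are assigned the label of their nearest labelled neighbour.
unlabelled = [(4.2, 6.8), (25.0, 12.0)]
predictions = [classify(x, labelled) for x in unlabelled]
```

The labels do all the teaching here: without them the algorithm would have nothing to copy.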

ii. Unsupervised learning:

Unlabelled data is supplied and the machine is left to work out the characteristics and group the data itself.
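
A minimal sketch of this idea is k-means clustering on one-dimensional data, shown below. K-means is one common unsupervised method, chosen here as an assumption; the values and starting centroids are illustrative.

```python
def k_means_1d(values, centroids, iterations=10):
    """Group 1-D values around the given starting centroids; returns (centroids, clusters)."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: each value joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            idx = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[idx].append(v)
        # Update step: each centroid moves to the mean of its cluster.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# No labels are given; the two groups emerge from the data alone.
values = [1.0, 1.2, 0.8, 9.0, 9.5, 10.5]
centroids, clusters = k_means_1d(values, centroids=[0.0, 5.0])
```

Note that the algorithm finds groups but cannot name them; interpreting what each cluster means is left to the analyst.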

iii. Reinforcement machine learning:

The algorithm interacts with its environment by producing actions and then analyzes the resulting errors or
rewards.

Example: to understand a game of chess, an ML algorithm will not analyze individual moves but
will study the game as a whole.

II. Intelligent Apps and Analytics (Smart Apps)

Smart apps are applications that allow users to tap into the capabilities of their devices to
automate their lives.

Most smart apps are installed by the user via the SmartThings mobile client application.

Smart applications incorporate data-driven, actionable insights into the user experience.

Insights are delivered in context as features in applications that enable users to more efficiently
complete a desired task or action.

They often take the form of recommendations, estimates, and suggested next actions.

Smart applications can be consumer-facing or employee-facing.

They support user operational processes based on data-driven insights.

Example: Retail smart applications make product recommendations based on analysis of customer
buying behaviour while logistics applications provide data-driven estimates of delivery times of goods
and products.

Healthcare smart applications offer possible patient diagnosis & treatment recommendations to
clinicians based on analyses of patient & research data.

III. Intelligent Things

Intelligent things use AI and Machine learning to interact in a more intelligent way with people
and surroundings.

Some intelligent things wouldn’t exist without AI, but others are existing things that AI makes
intelligent.

These things operate semi-autonomously or autonomously in an unsupervised environment for a set
amount of time to complete a particular task.

Examples include a self-directing vacuum or an autonomous farming vehicle.

As technology develops, AI & machine learning will increasingly appear in a variety of objects
ranging from smart healthcare equipment to autonomous harvesting robots for farms.

IV. Digital Twins

Digital twins will bring together the connected world of sensors and humans.

Digital twins are linked to real-world objects and offer information on the state of their
counterparts, respond to changes, improve operations and add value.

The idea behind a digital twin is to harness the data and use algorithms for making reasonable
projections about the future.
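
As a small illustration of making a projection from a twin's data, the sketch below fits a straight line to equally spaced sensor readings and extrapolates it. The linear model, the function names and the temperature values are assumptions for illustration; real digital twins use much richer models.

```python
def fit_line(readings):
    """Least-squares line through readings taken at times t = 0, 1, 2, ..."""
    n = len(readings)
    mean_t = (n - 1) / 2
    mean_v = sum(readings) / n
    slope_num = sum((t - mean_t) * (v - mean_v) for t, v in enumerate(readings))
    slope_den = sum((t - mean_t) ** 2 for t in range(n))
    slope = slope_num / slope_den
    intercept = mean_v - slope * mean_t
    return slope, intercept


def project(readings, steps_ahead):
    """Extrapolate the fitted line steps_ahead time steps past the last reading."""
    slope, intercept = fit_line(readings)
    return intercept + slope * (len(readings) - 1 + steps_ahead)


# Hypothetical temperature readings streamed from a machine's sensors.
temperature_log = [60.0, 62.0, 64.0, 66.0]
forecast = project(temperature_log, steps_ahead=2)
```

The function needs at least two readings, since a single point does not define a trend.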

The major applications of digital twins are in the following sectors.

1. Manufacturing:

The digital twin is poised to change the current face of the manufacturing sector.

It has a significant impact on the way products are designed, manufactured and maintained.

It makes manufacturing more efficient and optimized while reducing throughput times.

2. Automobile:

Digital twins can be used in the automobile sector for creating a virtual model of a connected
vehicle.

The twin captures the behavioral and operational data of the vehicle and helps in analyzing the overall
vehicle performance as well as the connected features.

It also helps in delivering a truly personalized and customized service for customers.

3. Retail:

An appealing customer experience is key in the retail sector.

Digital twin implementation can play a key role in augmenting the retail customer experience by
creating virtual twins for customers and modelling fashions for them on these twins.

Digital twins also help in better in-store planning, security implementation and energy
management in an optimized manner.

4. Healthcare:

Digital twins, along with data from IoT, can play a key role in the health care sector, from cost savings
to patient monitoring, preventative maintenance and providing personalized health care.

5. Smart Cities:

Smart city planning and implementation with digital twins and IoT data helps enhance
economic development, manage resources efficiently, reduce the ecological footprint and
increase the overall quality of citizens’ lives.

The digital twin model can help city planners and policymakers in smart city planning by giving them
insights from various sensor networks and intelligent systems.

The data from the digital twins also helps them arrive at informed decisions regarding the future.

6. Industrial IoT:

Industrial firms with digital twin implementations can now monitor, track and control industrial systems
digitally.

The digital twins capture environmental data such as location, configuration and financial models.

V. Cloud to the edge: edge computing

In the context of IoT, ‘edge’ refers to the computing infrastructure that exists close to the sources of data.

Example: industrial machines, industrial controllers such as SCADA and integrated building management
systems (IBMS), and time series databases aggregating data from a variety of equipment and sensors.

These devices typically reside away from the centralized computing available in the cloud.

SCADA (supervisory control and data acquisition) systems monitor and control industrial equipment remotely,
and an IBMS is an automation system for the control of machinery and services in large building complexes.
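
One common reason to compute at the edge is to summarize raw readings locally and send only a small payload to the cloud. The sketch below illustrates that pattern; the function name, the threshold and the readings are hypothetical.

```python
from statistics import mean


def summarize_at_edge(readings, alert_threshold):
    """Reduce raw sensor readings to a compact summary before sending it to the cloud."""
    return {
        "count": len(readings),
        "mean": mean(readings),
        "max": max(readings),
        # Only readings above the threshold are forwarded individually.
        "alerts": [r for r in readings if r > alert_threshold],
    }


# Hypothetical readings collected by a device close to the machine.
raw_readings = [70.1, 70.4, 69.8, 95.2, 70.0]
payload = summarize_at_edge(raw_readings, alert_threshold=90.0)
```

The cloud receives one small dictionary instead of every raw reading, which reduces bandwidth while still surfacing the abnormal value.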

While edge computing can enable many outcomes for industrial organizations, the Edge
Computing Consortium identifies the following:
1. Predictive maintenance
2. Reducing costs
3. Security assurance
4. Product to service extension
5. Energy efficiency management
6. Lower energy consumption
7. Lower maintenance costs

VI. Intelligent platforms: Conversational platforms

Conversational platforms will drive a paradigm shift in which the burden of translating intent
shifts from user to computer.

These systems are capable of simple answers (“How is the weather?”) or more complicated
interactions (“Book a reservation at the Italian restaurant on Parker Ave.”).

An intelligent platforms business provides industrial software, control systems and embedded
computing platforms to optimize its customers’ assets and equipment.

VII. Immersive Experience

Augmented reality (AR), virtual reality (VR) and mixed reality are changing the way that people
perceive and interact with the digital world.

Combined with conversational platforms, a fundamental shift in user experience to an invisible
and immersive experience will emerge.

Application vendors, system software vendors and development platform vendors will all
compete to deliver this model.

VIII. Blockchain

Blockchain will become a much more important technology for businesses across the globe.

Blockchain enables untrusted parties to engage in transactions.

Blockchain holds promise for many industry sectors, such as finance, healthcare, and content
delivery.

A blockchain is a continuously growing list of records, called blocks, which are linked and secured
using cryptography.
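
The linking of blocks can be sketched with a hash chain: each block stores the hash of the previous one, so changing an old record breaks the chain. This is a simplified illustration with hypothetical field names; it leaves out consensus, signatures and networking, which real blockchains require.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first block


def block_hash(index, data, previous_hash):
    """SHA-256 digest over the block's contents, including the previous block's hash."""
    payload = json.dumps(
        {"index": index, "data": data, "previous_hash": previous_hash},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def add_block(chain, data):
    """Append a block that is linked to the current last block by its hash."""
    previous_hash = chain[-1]["hash"] if chain else GENESIS_HASH
    index = len(chain)
    chain.append({
        "index": index,
        "data": data,
        "previous_hash": previous_hash,
        "hash": block_hash(index, data, previous_hash),
    })


def is_valid(chain):
    """Check every link and every stored hash in the chain."""
    for i, block in enumerate(chain):
        expected_previous = chain[i - 1]["hash"] if i else GENESIS_HASH
        if block["previous_hash"] != expected_previous:
            return False
        if block["hash"] != block_hash(block["index"], block["data"], block["previous_hash"]):
            return False
    return True


ledger = []
add_block(ledger, "A pays B 5")
add_block(ledger, "B pays C 2")
valid_before = is_valid(ledger)

# Tampering with an old record invalidates the chain.
ledger[0]["data"] = "A pays B 500"
valid_after = is_valid(ledger)
```

This tamper evidence is what lets parties that do not trust each other rely on a shared record.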

IX. Event Driven Techs

Digital businesses rely on the ability to sense and be ready to exploit new digital business
moments.

Business events reflect the discovery of notable state changes, such as the completion of a purchase
order.
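
A minimal publish and subscribe sketch shows how software can react to such a business event. The `EventBus` class, the event names and the handler are hypothetical and only illustrate the pattern.

```python
from collections import defaultdict


class EventBus:
    """Tiny in-process event bus: handlers subscribe to named event types."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._handlers[event_type].append(handler)

    def publish(self, event_type, payload):
        """Call every handler for the event and return how many were notified."""
        handlers = self._handlers.get(event_type, [])
        for handler in handlers:
            handler(payload)
        return len(handlers)


notifications = []
bus = EventBus()
bus.subscribe(
    "purchase_order_completed",
    lambda order: notifications.append(f"Ship order {order['id']}"),
)

# The state change is published once; every interested handler reacts to it.
delivered = bus.publish("purchase_order_completed", {"id": 42})
ignored = bus.publish("order_cancelled", {"id": 7})
```

The publisher does not need to know who is listening, which makes it easy to add new reactions to the same event later.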

X. Security: continuous adaptive risk and trust assessment (CARTA)

Digital business creates a complex, evolving security environment.

The use of increasingly sophisticated tools increases the threat potential.

CARTA allows for real-time, risk- and trust-based decision making with adaptive responses to
security-enable digital business.



Recent Trends in Data Science
1) Various Big Data Visualization Tools

To make highly informed decisions quickly, organizational leaders need to be able to access and
interpret data in real-time.
Google Charts
Tableau
QlikView
Datawrapper
Oracle Visual Analyzer
FusionCharts
Highcharts
Microsoft Power BI
Plotly
Sisense



Recent Trends in Data Science
i. Importance of Big Data Visualization

1. Review large amounts of data: data presented in graphical form enables decision makers to
take in large amounts of data & gain an understanding.

2. Spot trends: time-sequence data often captures trends, but spotting trends hidden in data is
notoriously hard, especially when the sources are diverse and the quantity of data is
large.



Recent Trends in Data Science
3. Identify correlations and unexpected relationships: one strength of data visualization is that it enables
users to explore data sets, not to find answers to specific questions, but to discover what unexpected
insights the data can reveal.

4. Present the data to others: an oft-overlooked feature of big data visualization is that it
provides a highly effective way to communicate any insights that it surfaces to others.



Recent Trends in Data Science
ii. Key Issues in Big Data Visualization

1. Availability of visualization specialists: many big data visualization tools are designed to be easy enough
for anyone in an organization to use.

2. Visualization hardware resources: big data visualization is essentially a computing task, and the ability to
carry out this task quickly is needed to enable organizations to make timely decisions using real-time data.

3. Data quality: the insights that can be drawn from big data visualization are only as accurate as the data being
visualized; if the data is inaccurate or out of date, the value of the insights is questionable.



Recent Trends in Data Science
iii. Benefits of Data Visualization Tools

1. Absorb information in new and more constructive ways.

Data visualization enables users to receive vast amounts of information regarding operational and
business conditions.

Data visualization allows decision makers to see connections between multi-dimensional data sets and
provides new ways to interpret data through the use of heat maps, fever charts, and other rich graphical
representations.



Recent Trends in Data Science
2. Visualize relationships and patterns between operational and business activities.

Data visualization enables users to more effectively see connections as they are occurring between
operating conditions and business performance.

Example: an executive team for an electronics retailer is viewing monthly customer data. The team is
presented with a bar chart that shows the company’s net promoter score.



Recent Trends in Data Science
3. Identify and act on emerging trends faster.

The volume of data that companies are able to gather about customers and market conditions can
provide business leaders with insights into new revenue and business opportunities, presuming
they can spot the opportunities in data.



Recent Trends in Data Science
4. Manipulate and interact directly with data:

One of the greatest strengths of data visualization is how it brings actionable insights to the
surface.

Unlike one-dimensional tables and charts that can only be viewed, data visualization tools enable users to
interact with data.



Recent Trends in Data Science
5. Foster a new business language:

One advantage of data visualization is its ability to tell a story through data.

Example: business leaders at a consumer packaged goods company track key performance
indicators such as EBITDA and net profit margin.

They can gather only part of the story about current business conditions using a static bar chart.



Recent Trends in Data Science
6.3 Visualizing Big Data
Big data is data characterized by its volume, variety and velocity.
Volume refers to the size of the data.
Variety describes whether the data is structured, semistructured, or unstructured.
Velocity is the speed at which data pours in and how frequently it changes.



In SAS Visual Analytics, "autocharting" refers to a feature that automatically
selects the most suitable data visualization based on the type and amount of data
you provide. Users can visualize their data without manually choosing a chart
type, which makes the feature particularly useful for non-technical users and
business analysts.



6.3.2 Visualizing semistructured and unstructured
data using word clouds and network diagrams.

Refer pg no. 234



Visualizing semistructured and unstructured
data using word clouds and network diagrams
Semistructured and unstructured data, such as large amounts of text or complex
relationships, can be visualized with word clouds and network diagrams. Word
clouds highlight the frequency of words within the data by displaying them in
different sizes. Network diagrams represent connections and relationships
between entities within the data, showing nodes (entities) and edges (links)
between them. Both methods are particularly useful for identifying key themes
and patterns in unstructured data sets.



Word Clouds:
•Data preparation: Text is processed by cleaning, tokenizing, and removing stop
words (common words like "the" and "a") to focus on meaningful keywords.
•Visualization: Each unique word is displayed in a cloud, with the font size
proportional to its frequency within the text.
•Benefits: Quickly identify dominant topics, understand the overall sentiment of a
text corpus, and compare different text sets by visually seeing which keywords
are most prevalent.
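
The word cloud steps above can be sketched with the Python standard library. This is only an illustration of the data preparation and sizing steps, not a renderer: the stop-word set, the sample review text and the 40-point maximum font size are assumptions chosen for the example.

```python
import re
from collections import Counter

# A tiny, hypothetical stop-word list; real lists are much longer
STOP_WORDS = {"the", "a", "an", "and", "is", "was", "it", "but", "of", "to"}


def word_frequencies(text):
    """Clean and tokenize the text, then count the non-stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)


def font_sizes(freqs, max_size=40):
    """Give each word a font size proportional to its frequency."""
    if not freqs:
        return {}
    top = max(freqs.values())
    return {word: max_size * count / top for word, count in freqs.items()}


reviews = "The battery is great and the screen is great, but the charger was slow."
freqs = word_frequencies(reviews)
sizes = font_sizes(freqs)

for word, count in freqs.most_common():
    print(f"{word:<8} count={count} size={sizes[word]:.0f}pt")
```

The most frequent word receives the maximum size and every other word is scaled relative to it, which is the mapping a word cloud tool then draws.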



Network Diagrams:
•Data preparation: Entities and their relationships are identified from the data,
creating nodes and edges.
•Visualization: Nodes are represented as points on a graph, and edges connect
relevant nodes, often with line thickness or color indicating the strength of the
relationship.
•Benefits: Analyze complex relationships between entities, identify clusters
within the data, and understand the flow of information or influence within a
network.
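
The data preparation step for a network diagram amounts to turning relationships into nodes and edges. The sketch below assumes a made-up list of "who replied to whom" pairs and ranks nodes by degree (number of connections) as a simple stand-in for influence; drawing the graph is left to a visualization tool.

```python
from collections import defaultdict

# Hypothetical relationships: each pair is one edge between two entities
edges = [
    ("ana", "ben"),
    ("ana", "cho"),
    ("ana", "dev"),
    ("ben", "cho"),
    ("eli", "dev"),
]


def build_graph(edge_list):
    """Build an undirected adjacency map: node -> set of neighbours."""
    graph = defaultdict(set)
    for a, b in edge_list:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def degree_ranking(graph):
    """Order nodes by number of connections, ties broken alphabetically."""
    return sorted(graph, key=lambda node: (-len(graph[node]), node))


graph = build_graph(edges)
ranking = degree_ranking(graph)

for node in ranking:
    print(f"{node}: {len(graph[node])} connection(s) -> {sorted(graph[node])}")
```

In a drawn diagram, the highest-degree node would typically appear as the most connected point, and the edge list decides which points are joined.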



Examples of use cases:
•Analyzing customer feedback: use a word cloud to identify frequently mentioned positive and negative
aspects of a product or service from customer reviews.
•Social media analysis: create a network diagram to visualize the connections between users on a
social platform and identify influential individuals.
•Research paper analysis: use a word cloud to identify key themes and recurring concepts within a
collection of research papers.
•Market research: visualize the relationships between different brands or products in a market using
a network diagram.



A visualization using a correlation matrix, often called a "correlogram," is a
graphical representation of the relationships between multiple variables in a
dataset, where each cell in the matrix displays the correlation coefficient between
two variables, allowing you to quickly identify patterns of strong positive or
negative correlations between different features in your data.
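
The numbers behind a correlogram can be computed directly. The sketch below uses the Pearson correlation coefficient (an assumption, since the notes do not name a specific coefficient) and a small invented data set; each cell of the resulting matrix is what a correlogram would colour.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Return a nested dict: matrix[a][b] is the correlation of a with b."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical variables measured over five periods
data = {
    "ad_spend": [1, 2, 3, 4, 5],
    "sales": [2, 4, 6, 8, 10],
    "returns": [5, 4, 3, 2, 1],
}
matrix = correlation_matrix(data)

for name, row in matrix.items():
    print(f"{name:<9}", [round(value, 2) for value in row.values()])
```

Values near +1 indicate a strong positive relationship, values near -1 a strong negative one, and the diagonal is always 1 because every variable correlates perfectly with itself.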



Recent Trends in Data Science
3) Preattentive Attributes

These attributes are what immediately catch our eyes when we look at a visualization.

They can be perceived in less than 10 milliseconds even before we make a conscious effort to
notice them.



Recent Trends in Data Science
4) Challenges of Big Data Visualization

Scalability and dynamics are two major challenges in visual analytics.

Fig. 6.1 shows the research status for static data and dynamic data according to data size.

For big dynamic data, solutions for type A problems or type B problems often do not work for combined A
and B problems.



Recent Trends in Data Science
5) Potential Solutions



Future Progress of Big Data Visualization

Refer Pg No: 239



Data Visualisation
1) Data visualisation is a term that describes any effort to help people understand the significance of data
by placing it in a visual context.

Patterns, trends and correlations that might go undetected in text-based data can be exposed and
recognized more easily with data visualisation software.

Data visualisation is the presentation of data in a pictorial or graphical format.

It enables decision makers to see analytics presented visually so they can grasp difficult concepts
or identify new patterns.



Data Visualisation
2) Data Attributes.

There are two categories: quantitative data and qualitative data.

Quantitative data is a numerical value placed on an ascending scale.

Qualitative data refers to values that cannot be measured numerically but can be described
through language.



Data Visualisation
2.1 Quantitative Data

Quantitative data can be:

1. Ratio (cost: $10, $20, $30; age: 10 years old, 20 years old)

Data you can perform arithmetic operations on (add, divide, etc.).

2. Interval (temperature: -5 degrees, 10 degrees, 25 degrees; time: 1 am, 5 am)

Data with a set value that you cannot perform all arithmetic operations on.

Example: you cannot meaningfully calculate the sum of temperatures during a week, but you can calculate the
average temperature per day and the high/low for each day.
Data Visualisation
Other data types for Visualisation

There are eight types of quantitative messages that users may attempt to understand or communicate from a set of
data, each with associated graphs used to help communicate the message.

1. Time-series:

A single variable is captured over a period of time, such as the unemployment rate over a 10-year period.

A line chart may be used to demonstrate the trend.



Data Visualisation
2. Ranking:

Categorical subdivisions are ranked in ascending or descending order, such as a ranking of sales
performance by salespersons during a single period.

A bar chart may be used to show the comparison across the salespersons.

3. Part-to-whole:

Categorical subdivisions are measured as a ratio to the whole.

A pie chart or bar chart can show the comparison of ratios, such as market share represented by
competitors.
Data Visualisation
4. Deviation:

Categorical subdivisions are compared against a reference, such as a comparison of actual vs budget
expenses for several departments of a business for a given time period.

A bar chart can show the comparison of the actual versus the reference amount.

5. Frequency distribution:

Shows the number of observations of a particular variable for a given interval, such as the number of years in
which the stock market return falls within intervals such as 0-10%, 11-20%, etc.



Data Visualisation
A histogram, a type of bar chart, may be used for this analysis.

A boxplot helps visualize key statistics about the distribution, such as the median, quartiles and outliers.

6. Correlation:

Comparison between observations represented by two variables (X, Y) to determine if they tend to
move in the same or opposite directions.

Example: plotting unemployment (X) and inflation (Y) for a sample of months.

A scatter plot is typically used for this message.



Data Visualisation
7. Nominal comparison:

Comparing categorical subdivisions in no particular order, such as the sales volume by product code.

A bar chart may be used for this comparison.

8. Geographic or geospatial:

Comparison of a variable across a map or layout, such as the unemployment rate by state or the number
of persons on various floors of a building.

A cartogram is a typical graphic used.
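
Several of the message types above reduce to small computations before anything is drawn. The sketch below uses only the Python standard library and invented figures (the `sales` and `returns` values are assumptions, not data from these notes) to derive a ranking, part-to-whole shares, a frequency distribution and the statistics that a boxplot displays. The outlier test uses the common 1.5 x IQR rule, which is one convention among several.

```python
import statistics
from collections import Counter

# Hypothetical sales per salesperson for a single period
sales = {"Asha": 120, "Bruno": 300, "Chen": 180}

# Ranking: order the categories from highest to lowest value
ranking = sorted(sales, key=sales.get, reverse=True)

# Part-to-whole: each category as a fraction of the total
total = sum(sales.values())
shares = {name: amount / total for name, amount in sales.items()}

# Hypothetical yearly stock market returns, in percent
returns = [3, 7, 12, 15, 18, 22, 8, 14, 40]

# Frequency distribution: count observations per 10-point interval.
# Bins here are [0, 10), [10, 20), ... and are keyed by their lower edge.
bin_width = 10
histogram = Counter((r // bin_width) * bin_width for r in returns)

# Boxplot statistics: quartiles plus outliers by the 1.5 * IQR rule
q1, median, q3 = statistics.quantiles(returns, n=4)
iqr = q3 - q1
outliers = [r for r in returns if r < q1 - 1.5 * iqr or r > q3 + 1.5 * iqr]

print("ranking:", ranking)
for name in ranking:
    print(f"  {name:<6} {'#' * (sales[name] // 20)} {shares[name]:.0%}")
for low in sorted(histogram):
    print(f"{low:>3}-{low + bin_width - 1:<3} {'#' * histogram[low]}")
print("median:", median, "outliers:", outliers)
```

The text bars are a crude stand-in for a bar chart and a histogram; a charting tool would draw the same numbers graphically.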



Data Visualisation
2.2 Qualitative Data

Ordinal (size: small, medium, large; position: 1st place, 2nd place, 3rd place)

Data with a fixed ranking but an indeterminate distance between the values.

Example: a large elephant in India is very different from a large elephant in Africa, but we do not know
exactly how much larger one is.

Nominal (sports: NFL football vs English football; computers: laptop vs desktop)

Data where you can distinguish between values, but not order them.
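
The difference between ordinal and nominal data shows up in which operations make sense. In this small sketch (the `SIZE_RANK` mapping and the sample lists are assumptions for illustration), ordinal labels can be sorted once a rank is assigned, while nominal labels can only be compared for equality and counted.

```python
from collections import Counter

# Ordinal: an explicit rank gives the labels an order, but not a distance
SIZE_RANK = {"small": 0, "medium": 1, "large": 2}
shirts = ["large", "small", "medium", "small"]
ordered = sorted(shirts, key=SIZE_RANK.get)

# Nominal: values can be told apart and counted, but not ordered
devices = ["laptop", "desktop", "laptop"]
counts = Counter(devices)

print(ordered)
print(counts.most_common())
```

Note that the rank numbers only encode order: the gap between "small" and "medium" is not claimed to equal the gap between "medium" and "large".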



Data Visualisation
3) Importance of Data Visualisation

The primary goal is to communicate information clearly & efficiently via statistical graphics, plots, and
information graphics.

Numerical data may be encoded using dots, lines, or bars, to visually communicate a quantitative
message.

It helps users analyze & reason about data & evidence.

It makes complex data more accessible, understandable & usable.





Data Visualisation
4) Conventional data visualisation methods

Many conventional data visualisation methods are often used: table, histogram, scatter plot, line chart, bar
chart, pie chart, area chart, flow chart, bubble chart, multiple data series or combination of charts, time line,
Venn diagram, data flow diagram, and entity relationship diagram.

Parallel coordinates are used to plot individual data elements across many dimensions.

Parallel coordinates are very useful for displaying multidimensional data (Fig. 4.2).
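
The core of a parallel coordinates plot is a rescaling step: each dimension gets its own vertical axis, and every record becomes a polyline that crosses each axis at its rescaled value. The sketch below shows only that computation, under the assumption of min-max scaling to the range 0 to 1; the `cars` records and the `to_polylines` helper are hypothetical.

```python
def to_polylines(records, dimensions):
    """Turn each record into a list of (axis_index, height) points."""
    lows = {d: min(r[d] for r in records) for d in dimensions}
    highs = {d: max(r[d] for r in records) for d in dimensions}
    lines = []
    for record in records:
        points = []
        for axis, d in enumerate(dimensions):
            span = highs[d] - lows[d]
            # Min-max scale to [0, 1]; a constant column sits mid-axis
            height = (record[d] - lows[d]) / span if span else 0.5
            points.append((axis, height))
        lines.append(points)
    return lines


# Hypothetical multidimensional records
cars = [
    {"mpg": 30, "weight": 1000, "power": 60},
    {"mpg": 20, "weight": 1500, "power": 110},
    {"mpg": 10, "weight": 2000, "power": 160},
]
lines = to_polylines(cars, ["mpg", "weight", "power"])

for car, line in zip(cars, lines):
    print(car, "->", line)
```

Lines that cross between two neighbouring axes, as between `mpg` and `weight` here, suggest that the two dimensions move in opposite directions.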



Data Visualisation
Treemap is an effective method of visualizing hierarchies.

The size of each sub-rectangle represents one measure, while colour is often used to represent another
measure of the data (Fig. 4.3).

Example: a treemap of a collection of choices for streaming music and video tracks in a social network community.
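
How a hierarchy becomes nested rectangles can be sketched with the simple "slice-and-dice" layout, which splits a rectangle among the children in proportion to their sizes and alternates the split direction at each level. This is one of several treemap layouts and is used here as an assumption; the `tree` of media categories and its numbers are invented.

```python
def node_size(node):
    """Size of a node: its own value for a leaf, else the sum of its children."""
    _, payload = node
    if isinstance(payload, list):
        return sum(node_size(child) for child in payload)
    return payload


def layout(node, x, y, w, h, horizontal=True, out=None):
    """Return (name, x, y, width, height) for every leaf in the hierarchy."""
    out = [] if out is None else out
    name, payload = node
    if not isinstance(payload, list):
        out.append((name, x, y, w, h))
        return out
    total = node_size(node)
    for child in payload:
        frac = node_size(child) / total
        if horizontal:
            # Split the width among children, then switch direction
            layout(child, x, y, w * frac, h, False, out)
            x += w * frac
        else:
            # Split the height among children, then switch direction
            layout(child, x, y, w, h * frac, True, out)
            y += h * frac
    return out


# Hypothetical hierarchy: (name, value) for leaves, (name, [children]) otherwise
tree = ("media", [
    ("music", [("rock", 30), ("jazz", 20)]),
    ("video", [("films", 50)]),
])
rects = layout(tree, 0, 0, 100, 60)

for name, x, y, w, h in rects:
    print(f"{name:<6} x={x:.0f} y={y:.0f} w={w:.0f} h={h:.0f} area={w * h:.0f}")
```

Each leaf's area is proportional to its value, so "films" (50 of 100) covers half of the 100 x 60 canvas; a second measure could then be mapped to colour.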



Data Visualisation
4.1 Visual Perception and Data Visualisation

Humans can readily distinguish differences in line length, shape, orientation and colour without
significant processing effort; these are referred to as “pre-attentive attributes”.

Example: it may require significant time and effort to identify the number of times the digit “5” appears
in a series of numbers.

If that digit is different in size, orientation, or colour, instances of the digit can be noticed quickly through
pre-attentive processing.
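
The digit example can be tried out in plain text. The sketch below is only a rough stand-in for a real visual attribute: it marks each target digit with brackets instead of changing its colour or size, and the `series` string and `highlight` helper are made up for the illustration.

```python
def highlight(digits, target="5"):
    """Wrap every occurrence of the target digit in brackets."""
    return "".join(f"[{ch}]" if ch == target else ch for ch in digits)


series = "9876543210123456789055"

print(series)             # the 5s must be searched for one by one
print(highlight(series))  # the marked 5s stand out at a glance
print("count:", series.count("5"))
```

In a chart, the same effect comes from giving the values of interest a distinct hue, size or orientation so that they are picked up before conscious search begins.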



Data Visualisation
Visualisations are not only static; they can be interactive.

Interactive visualisation can be performed through approaches such as zooming, overview and detail,
zoom and pan, and focus and context.

The steps for interactive visualisation are as follows:

1. Selecting

2. Linking

3. Filtering

4. Rearranging or remapping.
Data Visualisation
4.2 Mapping of data visualisation



Data Visualisation
5) Retinal variables

5.1 Seven variables for visualisation

1. Two planar variables

2. Five so-called “retinal” variables

The two planar variables are the X and Y positions on the map plane.



Data Visualisation
Five “Retinal” variables

1. Size

2. Colour value

3. Colour hue

4. Shape

5. Orientation



Data Visualisation
5.2 Types of visual variables

Visual variables fall into three categories:

1. Selective (e.g., colour hue)

2. Associative (e.g., shape)

3. Ordered



Data Visualisation
Selective (e.g., colour hue)



Data Visualisation
Associative (e.g., shape)



Data Visualisation
Ordered



Data Visualisation
5.3 Effectiveness of mappings



Data Visualisation
6) Mapping Variables to Encodings



Data Visualisation
6.1 Choosing Appropriate Visual Encodings

The natural ordering and the number of distinct values indicate whether a visual property is best suited to
one of the main data types: quantitative, ordinal, categorical, or relational data.

6.2 Natural Ordering

Whether a visual property has a natural ordering is determined by the visual system and the software used.

Example: position has a natural ordering, shape doesn’t; length has a natural ordering, texture doesn’t.



Data Visualisation
6.3 Distinct Values

When choosing a visual property, select one that has a sufficient number of useful differentiable values and an
ordering similar to that of your data (Fig. 4.9).

Fig. 4.10 shows another way to think about visual properties, depending on what kind of data you
need to encode.



Data Visualisation
6.4 Mapping of data types according to Mackinlay

Mackinlay prepared a chart showing the quality of mapping for each data type, shown in Fig. 4.11.



THANK YOU
