UNIT-1
UNIT I: Introduction to Data science, benefits and uses, facets of data, data science process in
brief, big data ecosystem and data science.
Data Science process: Overview, defining goals and creating project charter, retrieving data,
cleansing, integrating and transforming data, exploratory analysis, model building, presenting
findings and building applications on top of them.
Benefits and uses of data science and big data:
❖ Data science and big data are rapidly growing fields that offer a wide
range of benefits and uses across various industries. Some of the benefits
and uses of data science and big data are:
1. Improved decision-making: Data science and big data help
organizations make better decisions by analyzing and interpreting
large amounts of data. Data scientists can identify patterns,
trends, and insights that can be used to make informed decisions.
2. Increased efficiency: Data science and big data can help
organizations automate tasks, streamline processes, and optimize
operations. This can result in significant time and cost savings.
3. Personalization: With data science and big data, organizations
can personalize their products and services to meet the specific
needs and preferences of individual customers. This can lead to
increased customer satisfaction and loyalty.
4. Predictive analytics: Data science and big data can be used to
build predictive models that can forecast future trends and
behavior. This can be useful for businesses that need to anticipate
customer needs, market trends, or supply chain disruptions.
5. Fraud detection: Data science and big data can be used to detect
fraud and other types of financial crimes. By analyzing patterns
in financial data, data scientists can identify suspicious behavior
and prevent fraud.
6. Healthcare: Data science and big data can be used to improve
patient outcomes by analyzing large amounts of medical data.
This can lead to better diagnosis, treatment, and prevention of
diseases.
7. Marketing: Data science and big data can be used to improve
marketing strategies by analyzing consumer behavior and
preferences. This can help businesses target their marketing
campaigns more effectively and generate more leads and sales.
Facets of Data
• Very large amounts of data are generated in big data and data science. This data is of various types, and the main categories are as follows:
a) Structured
b) Natural language
c) Graph-based
d) Streaming
e) Unstructured
f) Machine-generated
g) Audio, video and images
Structured Data
• Structured data is arranged in a rows-and-columns format, which makes it easy for applications to retrieve and process it. Database management systems are used to store structured data.
• The term structured data refers to data that is identifiable because it is organized in a structure. The most common form of structured data, or records, is a database where specific information is stored in rows and columns.
• Structured data is also searchable by data type within content. Structured data is
understood by computers and is also efficiently organized for human readers.
• An Excel table is an example of structured data.
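As a minimal illustration (assuming the pandas library is available; the table contents are invented), structured data can be represented and queried as rows and columns:

```python
import pandas as pd

# A small table of structured data: every record has the same columns,
# so it could equally be stored in a relational database or an Excel sheet.
customers = pd.DataFrame({
    "customer_id": [101, 102, 103],
    "name": ["Asha", "Ravi", "Meena"],
    "city": ["Hyderabad", "Chennai", "Pune"],
    "purchase_amount": [2500.0, 1200.5, 3999.9],
})

# Because the structure is known, records can be filtered by column and data type.
print(customers[customers["purchase_amount"] > 2000])
```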
Unstructured Data
• Unstructured data is data that does not follow a specified format. It is not organized into rows and columns, so it is difficult to retrieve the required information. Unstructured data has no identifiable structure.
• Unstructured data can be in the form of text (documents, email messages, customer feedback), audio, video and images. Email is an example of unstructured data.
• Even today, more than 80% of the data in most organizations is in unstructured form. It carries a lot of information, but extracting that information from these varied sources is a very big challenge.
• Characteristics of unstructured data:
1. There is no structural restriction or binding for the data.
2. Data can be of any type.
3. Unstructured data does not follow any structural rules.
4. There are no predefined formats, restrictions or sequences for unstructured data.
5. Since there is no structural binding for unstructured data, it is unpredictable in
nature.
Natural Language
• Natural language is a special type of unstructured data.
• Natural language processing enables machines to recognize characters, words
and sentences, then apply meaning and understanding to that information. This
helps machines to understand language as humans do.
• Natural language processing is the driving force behind machine intelligence in
many modern real-world applications. The natural language processing
community has had success in entity recognition, topic recognition,
summarization, text completion and sentiment analysis.
• For natural language processing to help machines understand human language, it must go through speech recognition, natural language understanding and machine translation. It is an iterative process composed of several layers of text analysis.
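As a small sketch of a basic text-analysis building block (using only the Python standard library on an invented sample passage, not a full NLP pipeline), the snippet below tokenizes text and counts word frequencies:

```python
import re
from collections import Counter

text = ("Natural language processing helps machines understand language. "
        "Machines learn patterns in language from large amounts of text.")

# Tokenize: lowercase the text and split it into words.
tokens = re.findall(r"[a-z']+", text.lower())

# Count word frequencies, a basic building block for topic and sentiment analysis.
word_counts = Counter(tokens)
print(word_counts.most_common(5))
```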
Machine - Generated Data
• Machine-generated data is information that is created without human interaction, as a result of a computer process or application activity. This means that data entered manually by an end user is not considered machine-generated.
• Machine data contains a definitive record of all activity and behavior of our
customers, users, transactions, applications, servers, networks, factory machinery
and so on.
• It includes configuration data, data from APIs and message queues, change events, the output of diagnostic commands, call detail records, sensor data from remote equipment and more.
• Examples of machine data are web server logs, call detail records, network event
logs and telemetry.
• Both Machine-to-Machine (M2M) and Human-to-Machine (H2M) interactions
generate machine data. Machine data is generated continuously by every
processor-based system, as well as many consumer-oriented systems.
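For illustration, the sketch below parses a single web server access-log line, a typical piece of machine-generated data; the log line and field layout are assumptions modelled on the common log format:

```python
import re

# One line of a (hypothetical) web server access log in common log format.
log_line = '192.168.1.10 - - [10/Jan/2024:13:55:36 +0530] "GET /index.html HTTP/1.1" 200 2326'

# Regular expression capturing the client IP, timestamp, request, status code and size.
pattern = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+)'
)

match = pattern.match(log_line)
if match:
    # Once parsed into named fields, machine-generated data becomes structured data.
    print(match.groupdict())
```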
Graph-based or Network Data
• Graphs are data structures that describe relationships and interactions between entities in complex systems. In general, a graph contains a collection of entities called nodes and another collection of interactions between pairs of nodes called edges.
• Nodes represent entities, which can be of any object type that is relevant to our
problem domain. By connecting nodes with edges, we will end up with a graph
(network) of nodes.
• A graph database stores nodes and relationships instead of tables or documents.
Data is stored just like we might sketch ideas on a whiteboard. Our data is stored
without restricting it to a predefined model, allowing a very flexible way of
thinking about and using it.
• Graph databases are used to store graph-based data and are queried with
specialized query languages such as SPARQL.
Figure 1: Friends in a social network are an example of graph-based data.
• Graph databases are capable of sophisticated fraud prevention. With graph
databases, we can use relationships to process financial and purchase transactions
in near-real time. With fast graph queries, we are able to detect that, for example,
a potential purchaser is using the same email address and credit card as included
in a known fraud case.
• Graph databases can also help users easily detect relationship patterns, such as multiple people associated with the same personal email address or multiple people sharing the same IP address but residing at different physical addresses.
• Graph databases are a good choice for recommendation applications. With graph
databases, we can store in a graph relationships between information categories
such as customer interests, friends and purchase history. We can use a highly
available graph database to make product recommendations to a user based on
which products are purchased by others who follow the same sport and have
similar purchase history.
• Graph theory was probably the main method of social network analysis in the early history of the social network concept. The approach is applied to social network analysis in order to determine important features of the network, such as its nodes and links (for example, influencers and their followers).
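The following minimal sketch (plain Python with invented names, not an actual graph database) stores a small social network as nodes and edges and derives friend-of-friend recommendations, the kind of relationship query described above:

```python
# A tiny social network: nodes are people, edges are friendships (stored as adjacency sets).
friends = {
    "Alice": {"Bob", "Carol"},
    "Bob": {"Alice", "David"},
    "Carol": {"Alice", "David"},
    "David": {"Bob", "Carol", "Eve"},
    "Eve": {"David"},
}

def friend_suggestions(person):
    """Suggest friends-of-friends who are not already direct friends."""
    direct = friends[person]
    suggestions = set()
    for friend in direct:
        suggestions |= friends[friend]
    return suggestions - direct - {person}

# Alice's friends-of-friends: David (reachable via both Bob and Carol).
print(friend_suggestions("Alice"))   # {'David'}
```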
Audio, Image and Video
• Audio, image and video are data types that pose specific challenges to a data
scientist. Tasks that are trivial for humans, such as recognizing objects in pictures,
turn out to be challenging for computers.
• The terms audio and video commonly refer to time-based media storage formats for sound/music and moving-picture information. Digital audio and video recordings, encoded with audio and video codecs, can be uncompressed, losslessly compressed or lossy compressed depending on the desired quality and use case.
• It is important to note that multimedia data is one of the most important sources of information and knowledge; the integration, transformation and indexing of multimedia data bring significant challenges to data management and analysis. Many challenges have to be addressed, including the scale of big data, the multidisciplinary nature of Data Science and the heterogeneity of the data.
• Data Science is playing an important role to address these challenges in
multimedia data. Multimedia data usually contains various forms of media, such
as text, image, video, geographic coordinates and even pulse waveforms, which
come from multiple sources. Data Science can be a key instrument covering big
data, machine learning and data mining solutions to store, handle and analyze
such heterogeneous data.
Streaming Data
• Streaming data is data that is generated continuously by thousands of data sources, which typically send the data records simultaneously and in small sizes (on the order of kilobytes).
• Streaming data includes a wide variety of data such as log files generated by
customers using your mobile or web applications, ecommerce purchases, in-game
player activity, information from social networks, financial trading floors or
geospatial services and telemetry from connected devices or instrumentation in
data centers.
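As a small, hedged sketch (pure Python with simulated telemetry records, rather than a real streaming platform), streaming data can be modelled as records that arrive one at a time and are processed incrementally:

```python
import random
import time

def sensor_stream(n_records):
    """Simulate a stream of small telemetry records arriving one by one."""
    for i in range(n_records):
        yield {"device_id": i % 3, "temperature": round(random.uniform(20, 30), 2)}
        time.sleep(0.1)  # records arrive continuously, not as one big batch

# Process each record as soon as it arrives, keeping a running average.
total, count = 0.0, 0
for record in sensor_stream(10):
    total += record["temperature"]
    count += 1
    print(f"device {record['device_id']}: running average = {total / count:.2f}")
```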
The data science process:
The data science process typically involves the following steps:
1. Define the problem: The first step in the data science process is to define
the problem that you want to solve. This involves identifying the business
or research question that you want to answer and determining what data
you need to collect.
2. Collect and clean the data: Once you have identified the data that you
need, you will need to collect and clean the data to ensure that it is
accurate and complete. This involves checking for errors, missing values,
and inconsistencies.
3. Explore and visualize the data: After you have collected and cleaned the
data, the next step is to explore and visualize the data. This involves
creating summary statistics, visualizations, and other descriptive analyses
to better understand the data.
4. Prepare the data: Once you have explored the data, you will need to
prepare the data for analysis. This involves transforming and
manipulating the data, creating new variables, and selecting relevant
features.
5. Build the model: With the data prepared, the next step is to build a model
that can answer the business or research question that you identified in
step one. This involves selecting an appropriate algorithm, training the
model, and evaluating its performance.
6. Evaluate the model: Once you have built the model, you will need to evaluate its performance to ensure that it is accurate and effective. This involves using metrics such as accuracy, precision, recall, and F1 score to assess the model's performance (a short sketch of steps 4 to 6 follows this list).
7. Deploy the model: After you have evaluated the model, the final step is to
deploy the model in a production environment. This involves integrating
the model into an application or workflow and ensuring that it can handle
real-world data and user inputs.
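As a compact, hedged sketch of steps 4 to 6 (assuming scikit-learn is installed and using its built-in Iris dataset purely for illustration), the snippet below prepares data, builds a model and evaluates it:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

# Step 4: prepare the data - split it and scale the features.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Step 5: build the model - select an algorithm and train it.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Step 6: evaluate the model with metrics such as accuracy and F1 score.
y_pred = model.predict(X_test)
print("accuracy:", accuracy_score(y_test, y_pred))
print("macro F1:", f1_score(y_test, y_pred, average="macro"))
```

In practice, the scaler and the model are often combined in a scikit-learn Pipeline so that the same preprocessing is applied when the model is deployed in step 7.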
The big data ecosystem and data science:
❖ The big data ecosystem and data science are closely related, as the former
provides the infrastructure and tools that enable the latter.
❖ The big data ecosystem refers to the set of technologies, platforms, and
frameworks that are used to store, process, and analyze large volumes of
data.
❖ Some of the key components of the big data ecosystem include:
1. Storage: Big data storage systems such as Hadoop Distributed File
System (HDFS), Apache Cassandra, and Amazon S3 are designed to
store and manage large volumes of data across multiple nodes.
2. Processing: Big data processing frameworks such as Apache Spark, Apache Flink, and Apache Storm are used to process and analyze large volumes of data in parallel across distributed computing clusters (a brief PySpark sketch follows this list).
3. Querying: Big data querying systems such as Apache Hive, Apache Pig,
and Apache Drill are used to extract and transform data stored in big data
storage systems.
4. Visualization: Big data visualization tools such as Tableau, D3.js, and
Apache Zeppelin are used to create interactive visualizations and
dashboards that enable data scientists and business analysts to explore
and understand data.
5. Machine learning: Big data machine learning platforms such as Apache
Mahout, TensorFlow, and Microsoft Azure Machine Learning are used to
build and deploy machine learning models at scale.
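As a brief, hedged sketch of the processing component (assuming a working PySpark installation; the event data is invented for illustration):

```python
from pyspark.sql import SparkSession

# Start a local Spark session; on a real cluster this would connect to the cluster manager.
spark = SparkSession.builder.appName("ecosystem-sketch").getOrCreate()

# A tiny in-memory DataFrame standing in for a large distributed dataset
# (in practice this would be read from HDFS, S3 or another big data store).
events = spark.createDataFrame(
    [("click", 1), ("view", 3), ("click", 2), ("purchase", 1)],
    ["event_type", "count"],
)

# A simple distributed aggregation: total count per event type.
events.groupBy("event_type").sum("count").show()

spark.stop()
```

Spark builds the aggregation lazily and only executes it across the available cores or cluster when show() is called.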
The data science process: Overview
Defining research goals and creating a project charter:
Defining research goals and creating a project charter are important initial
steps in any data science project, as they set the stage for the entire project and
help ensure that it stays focused and on track.
Here are some key considerations for defining research goals and creating
a project charter in data science:
❖ Identify the problem or question you want to answer: What is the business
problem or research question that you are trying to solve? It's important to
clearly define the problem or question at the outset of the project, so that
everyone involved is on the same page and working towards the same goal.
❖ Define the scope of the project: Once you have identified the problem or
question, you need to define the scope of the project. This includes
specifying the data sources you will be using, the variables you will be
analyzing, and the timeframe for the project.
❖ Determine the project objectives: What do you hope to achieve with the
project? What are your key performance indicators (KPIs)? This will help
you measure the success of the project and determine whether you have
achieved your goals.
❖ Identify the stakeholders: Who are the key stakeholders in the project?
This could include business leaders, data analysts, data scientists, and
other team members. It's important to identify all the stakeholders upfront
so that everyone is aware of their role in the project and can work
together effectively.
❖ Create a project charter: The project charter is a document that
summarizes the key information about the project, including the problem
or question, the scope of the project, the objectives, the stakeholders, and
any constraints or risks. It's a critical document that helps ensure
everyone involved in the project is on the same page and understands
what is expected of them.
Retrieving data:
Retrieving data is an essential step in the data science process, as it provides the raw material needed to analyze and derive insights. There are various ways to retrieve data, and the methods used depend on the type of data and where it is stored.
Here are some common methods for retrieving data in data science (a short sketch follows the list):
➢ File import: Data can be retrieved from files in various formats, such as
CSV, Excel, JSON, or XML. This is a common method used to retrieve
data that is stored locally.
➢ Web scraping: Web scraping involves using scripts to extract data from
websites. This is a useful method for retrieving data that is not readily
available in a structured format.
➢ APIs: Many applications and services provide APIs (Application
Programming Interfaces) that allow data to be retrieved
programmatically. APIs can be used to retrieve data from social media
platforms, weather services, financial data providers, and many other
sources.
➢ Databases: Data is often stored in databases, and SQL (Structured Query
Language) can be used to retrieve data from databases. Non-relational
databases such as MongoDB or Cassandra are also popular for storing
and retrieving data.
➢ Big Data platforms: When dealing with large amounts of data, big data
platforms such as Hadoop, Spark, or NoSQL databases can be used to
retrieve data efficiently.
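A minimal sketch of two of these methods, file import and a database query, is shown below; the file name, table and columns are invented, and the SQLite database is created in memory so the example stays self-contained:

```python
import sqlite3
import pandas as pd

# 1. File import: read a (hypothetical) local CSV file into a DataFrame.
# sales_from_file = pd.read_csv("sales.csv")   # commented out: no such file ships with these notes

# 2. Database retrieval: query a small in-memory SQLite database with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (item TEXT, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("pen", "North", 120.0), ("book", "South", 340.5), ("pen", "South", 95.0)],
)

sales = pd.read_sql_query(
    "SELECT region, SUM(amount) AS total FROM sales GROUP BY region", conn)
print(sales)
conn.close()
```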
Cleansing, integrating and transforming data
Cleansing, integrating, and transforming data are essential steps in the data preparation process in data science. These steps are necessary to ensure that the data is accurate, consistent, and usable for analysis. Here's an overview of each step, followed by a short pandas sketch:
▪ Data Cleansing: This step involves identifying and correcting or
removing any errors, inconsistencies, or missing values in the data. Some
common techniques used for data cleansing include removing duplicates,
filling in missing values, correcting spelling errors, and dealing with
outliers.
▪ Data Integration: In many cases, data comes from multiple sources, and
data integration is needed to combine the data into a single dataset. This
can involve matching and merging datasets based on common fields or
keys, and handling any discrepancies or inconsistencies between the
datasets.
Figure 1: Joining two tables on the Item and Region keys
▪ Data Transformation: Data transformation involves converting the data
into a format that is more suitable for analysis. This can involve
converting categorical variables into numerical variables, scaling or
normalizing data, and creating new variables or features from existing
data.
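A short pandas sketch of all three steps, using two tiny invented tables (joined here on the Item key only, for brevity):

```python
import pandas as pd

# Two small tables to be integrated on a common key.
sales = pd.DataFrame({
    "Item": ["pen", "pen", "book", "book"],
    "Region": ["North", "North", "South", "East"],
    "Units": [10, 10, None, 7],            # contains a duplicate row and a missing value
})
prices = pd.DataFrame({
    "Item": ["pen", "book"],
    "Price": [12.0, 250.0],
})

# Data cleansing: remove duplicate rows and fill the missing value with the mean.
sales = sales.drop_duplicates().reset_index(drop=True)
sales["Units"] = sales["Units"].fillna(sales["Units"].mean())

# Data integration: merge (join) the two tables on the common Item key.
merged = sales.merge(prices, on="Item", how="left")

# Data transformation: create a new feature and encode the categorical Region column.
merged["Revenue"] = merged["Units"] * merged["Price"]
merged = pd.get_dummies(merged, columns=["Region"])
print(merged)
```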
Exploratory data analysis:
Exploratory data analysis (EDA) is the process of analyzing and
summarizing data sets in order to gain insights and identify patterns.
The main goal of EDA is to understand the data, rather than to test a
particular hypothesis. The process typically involves visualizing the data
using graphs, charts, and tables, as well as calculating summary statistics
such as mean, median, and standard deviation.
Figure 2: From top to bottom, a bar chart, a line plot, and a distribution are some of the graphs
used in exploratory analysis.
Some common techniques used in EDA include the following (a brief pandas sketch follows the list):
❖ Descriptive statistics: This involves calculating summary statistics such
as mean, median, mode, standard deviation, and range.
❖ Data visualization: This involves creating graphs, charts, and other
visual representations of the data, such as histograms, scatter plots, and
box plots.
❖ Data transformation: This involves transforming the data to make it
easier to analyze, such as normalizing or standardizing the data, or log
transforming skewed data.
❖ Outlier detection: This involves identifying and analyzing data points
that are significantly different from the other data points.
❖ Correlation analysis: This involves examining the relationship between
different variables in the data set, such as calculating correlation
coefficients or creating correlation matrices.
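A brief pandas sketch of descriptive statistics, outlier detection and correlation analysis (the dataset is randomly generated purely for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(18, 65, size=200),
    "income": rng.normal(50000, 12000, size=200).round(2),
})

# Descriptive statistics: mean, standard deviation, quartiles, etc.
print(df.describe())

# Outlier detection: flag incomes more than 3 standard deviations from the mean.
z_scores = (df["income"] - df["income"].mean()) / df["income"].std()
print(df[z_scores.abs() > 3])

# Correlation analysis: correlation matrix between the variables.
print(df.corr())

# Data visualization (e.g., a histogram with df["income"].hist()) could follow.
```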
Overall, EDA is an important step in any data analysis project, as it helps to
identify any patterns, outliers, or other trends in the data that may be relevant to
the analysis. It also helps to ensure that the data is clean, complete, and ready
for further analysis.
Model building
• To build a model, the data should be clean and its content properly understood. The components of model building are as follows:
a) Selection of model and variable
b) Execution of model
c) Model diagnostics and model comparison
Model and Variable Selection
• For this phase, consider model performance and whether the project meets all the requirements to use the model, as well as other factors:
1. Must the model be moved to a production environment and, if so, would it be
easy to implement?
2. How difficult is the maintenance on the model: how long will it remain relevant
if left untouched?
3. Does the model need to be easy to explain?
Model Execution
• Various programming languages can be used to implement the model. For model execution, Python provides libraries such as StatsModels and Scikit-learn. These packages implement several of the most popular techniques.
• Coding a model is a nontrivial task in most cases, so having these libraries available can speed up the process. The following are remarks on the model output:
a) Model fit: R-squared or adjusted R-squared is used to measure how well the model fits the data.
b) Predictor variables have a coefficient: For a linear model this is easy to
interpret.
c) Predictor significance: Coefficients are great, but sometimes not enough
evidence exists to show that the influence is there.
• Linear regression works if we want to predict a value, but to classify something, classification models are used. The k-nearest neighbors method is one of the simplest and most widely used classification methods.
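A minimal, hedged sketch of model execution (synthetic data and invented variable names), showing a StatsModels linear regression whose summary reports the fit, coefficients and significance noted in remarks a) to c), followed by a small k-nearest neighbors classifier:

```python
import numpy as np
import statsmodels.api as sm
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)

# --- Regression: predict a value with an ordinary least squares linear model ---
x = rng.uniform(0, 10, size=100)
y = 3.0 * x + 5.0 + rng.normal(0, 2.0, size=100)   # synthetic linear relationship plus noise
X = sm.add_constant(x)                             # add an intercept term explicitly
ols_model = sm.OLS(y, X).fit()
# The summary reports the model fit (R-squared), the predictor coefficients and
# their significance (p-values), corresponding to remarks a), b) and c) above.
print(ols_model.summary())

# --- Classification: k-nearest neighbors on a tiny two-class toy dataset ---
features = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])
labels = np.array([0, 0, 0, 1, 1, 1])
knn = KNeighborsClassifier(n_neighbors=3).fit(features, labels)
print(knn.predict([[2, 2], [9, 9]]))               # expected: [0 1]
```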
The following commercial tools are used:
1. SAS Enterprise Miner: This tool allows users to run predictive and descriptive models based on large volumes of data from across the enterprise.
2. SPSS Modeler: It offers methods to explore and analyze data through a GUI.
3. Matlab: Provides a high-level language for performing a variety of data analytics, algorithms and data exploration.
4. Alpine Miner: This tool provides a GUI front end for users to develop analytic workflows and interact with Big Data tools and platforms on the back end.
Open Source tools:
1. R and PL/R: PL/R is a procedural language for PostgreSQL with R.
2. Octave: A free software programming language for computational modeling that has some of the functionality of Matlab.
3. WEKA: It is a free data mining software package with an analytic workbench. The functions created in WEKA can be executed within Java code.
4. Python: A programming language that provides toolkits for machine learning and analysis.
5. SQL in-database implementations, such as MADlib, provide an alternative to in-memory desktop analytical tools.
Model Diagnostics and Model Comparison
• Try to build multiple models and then select the best one based on multiple criteria.
• In the holdout method, the data is split into two different datasets, labelled as a training dataset and a testing dataset. The split can be 60/40, 70/30 or 80/20. This technique is called the hold-out validation technique.
Suppose we have a database with house prices as the dependent variable and two
independent variables showing the square footage of the house and the number of
rooms. Now, imagine this dataset has 30 rows. The whole idea is that you build a
model that can predict house prices accurately.
• To 'train' our model, or to see how well it performs, we randomly subset 20 of those rows and fit the model on them. The second step is to predict the values of the 10 rows that we excluded and measure how good our predictions are.
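A short sketch of this holdout split using scikit-learn; the 30-row house-price dataset is synthetic and generated only for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# A synthetic 30-row dataset: square footage and number of rooms -> house price.
rng = np.random.default_rng(7)
sqft = rng.uniform(500, 2500, size=30)
rooms = rng.integers(1, 6, size=30)
price = 150 * sqft + 10000 * rooms + rng.normal(0, 20000, size=30)

X = np.column_stack([sqft, rooms])

# Hold out 10 of the 30 rows for testing; fit the model on the other 20.
X_train, X_test, y_train, y_test = train_test_split(
    X, price, test_size=10, random_state=42)

model = LinearRegression().fit(X_train, y_train)

# Measure how good the predictions on the held-out rows are.
predictions = model.predict(X_test)
print("mean absolute error:", mean_absolute_error(y_test, predictions))
```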
Presenting Findings and Building Applications
• The team delivers final reports, briefings, code and technical documents.
• In addition, the team may run a pilot project to implement the models in a production environment.
• The last stage of the data science process is where your soft skills will be most useful.
• This involves presenting your results to the stakeholders and industrializing your analysis process for repetitive reuse and integration with other tools.