Introduction To Emerging Technology - Chapter Two

Data science is a high-demand, multi-disciplinary field that extracts insights from various types of data, including structured, semi-structured, and unstructured data. It encompasses roles such as data analysts, data engineers, and data scientists, each with distinct skills and responsibilities. The document emphasizes the importance of data in decision-making, improving quality of life, and driving organizational efficiency.


DATA SCIENCE

August 2022
“DATA SCIENCE IS SEXY?” (HARVARD BUSINESS REVIEW)
Data Science is sexy because it has rare qualities and is in high demand.

Rare Qualities High Demand


Data science takes unstructured data, then finds Data science provides insight and competitive

order, meaning, and value advantage.

 The McKinsey Global Institute projected that, over the next few years, the United States would need 140,000–190,000 more people in deep analytical positions and 1.5 million more data-savvy managers to take full advantage of big data.
DEFINITION

 Data science is a multi-disciplinary field that uses scientific methods, processes, algorithms, and

systems to extract knowledge and insights from structured, semi-structured and unstructured data.
 The scientific method involves these steps: observation; question (ask a question about the observation and gather information); hypothesize (form a statement that attempts to explain the observation and make predictions based on it); test (test the hypothesis using a reproducible experiment); and conclude (analyze the results and draw conclusions).


 The processes involved in solving a data problem include collection, cleaning, exploratory analysis, model building, and model deployment.
 As an academic discipline and profession, data science continues to evolve as one of the most

promising and in-demand career paths for skilled professionals.


DEFINITION

 A data analyst is someone who can query data sources to generate reports and graphical visualizations.

 A Data Analyst is generally comfortable with statistical tools in order to better explore data. Skills and tools include – Excel, Access, SPSS, and others.

 A Data Engineer is someone with a technical background in software development, often a Software Engineer who has moved into Big Data.

Data Engineers set up Big Data systems for processing, and collect and transform data from different sources.
 Skills and tools include - SQL, NoSQL, Hadoop, Data Lake (a centralized repository that allows you to store all your structured and unstructured data), Big Data, Spark, Software Engineering, MapReduce.


 A Data Scientist is a multidisciplinary professional whose primary goal is to extract useful information (insights) from raw data. The Data Scientist's role

lies somewhere between that of a Data Analyst and that of a Data Engineer, while also requiring business knowledge of the field of operation.
 Skills and tools include - SQL, NoSQL, Python, R, Machine Learning, Deep Learning
DATA AND INFORMATION

 Data can be defined as a representation of facts, concepts, or instructions in a formalized manner,

which should be suitable for communication, interpretation, or processing by humans or electronic

machines.
 It can be described as unprocessed facts and figures

 It is represented with the help of characters such as letters (A-Z, a-z), digits (0-9) or special characters (+, -, /, *, <, >, =, etc.).

 Information is the processed data on which decisions and actions are based.

 It is data that has been processed into a form that is meaningful to the recipient and is of real or perceived value for the recipient's current or prospective actions or decisions.

 Information is interpreted data; created from organized, structured, and processed data in a particular context.
WHY DATA/ROLE

1. Improve People’s Lives

Data will help you to improve the quality of life for the people you support: improving quality is first and foremost among the reasons why organizations should be using data. By allowing you to

measure and take action, an effective data system can enable your organization to improve the quality of people’s lives.

2. Make Informed Decisions

Data = Knowledge. Good data provides indisputable evidence, while anecdotal evidence, assumptions, or abstract observation might lead to wasted resources due to taking action based on an

incorrect conclusion.

3. Stop Molehills From Turning Into Mountains

Data allows you to monitor the health of important systems in your organization: by utilizing data for quality monitoring, organizations are able to respond to challenges before they become

full-blown crises. Effective quality monitoring will allow your organization to be proactive rather than reactive and will help the organization maintain best practices over time.
CONT’D…
4. Get The Results You Want

Data allows organizations to measure the effectiveness of a given strategy: When strategies are put into place to overcome a challenge, collecting data will allow you to determine how

well your solution is performing, and whether or not your approach needs to be tweaked or changed over the long-term.

5. Find Solutions To Problems

Data allows organizations to more effectively determine the cause of problems. Data allows organizations to visualize relationships between what is happening in different locations,

departments, and systems. If the number of medication errors has gone up, is there an issue such as staff turnover or vacancy rates that may suggest a cause? Looking at these data

points side-by-side allows us to develop more accurate theories, and put into place more effective solutions.

6. Back Up Your Arguments

Data is a key component to systems advocacy. Utilizing data will help present a strong argument for systems change. Whether you are advocating for increased funding from public or

private sources, or making the case for changes in regulation, illustrating your argument through the use of data will allow you to demonstrate why changes are needed.
CONT’D…

7. Stop The Guessing Game


Data will help you explain (both good and bad) decisions to your stakeholders. Whether or not your strategies and
decisions have the outcome you anticipated, you can be confident that you developed your approach based not
upon guesses, but good solid data.
8. Be Strategic In Your Approaches
Data increases efficiency. Effective data collection and analysis will allow you to direct scarce resources where
they are most needed. If an increase in significant incidents is noted in a particular service area, this data can be
dissected further to determine whether the increase is widespread or isolated to a particular site. If the issue is
isolated, training, staffing, or other resources can be deployed precisely where they are needed, as opposed to
system-wide. Data will also help organizations determine which areas should take priority over others.
9. Know What You Are Doing Well
Data allows you to replicate areas of strength across your organization. Data analysis will help you identify
high-performing programs, service areas, and people. Once you identify your high performers, you can study them
to understand what makes them successful and replicate those practices elsewhere.
CONT’D…
10. Keep Track Of It All
Good data allows organizations to establish baselines, benchmarks, and goals to keep moving forward. Because data allows you
to measure, you will be able to establish baselines, find benchmarks and set performance goals. A baseline is what a certain area
looks like before a particular solution is implemented. Benchmarks establish where comparable organizations in a similar demographic stand.
Collecting data will allow your organization to set goals for performance and celebrate your successes when they are achieved.
11. Make The Most Of Your Money
Funding is increasingly outcome and data-driven. With the shift from funding that is based on services provided to funding that
is based on outcomes achieved, it is increasingly important for organizations to implement evidence-based practice and develop
systems to collect and analyze data.
12. Access The Resources Around You
Your organization probably already has most of the data and expertise you need to begin analysis. Your HR office probably
already tracks data regarding your staff. You are probably already reporting data regarding incidents to your state oversight
agency. You probably have at least one person in your organization who has experience with Excel. But, if you don’t do any of
these things, there is still hope! There are lots of free resources online that can get you started. Do a web search for “how to
analyze data” or “how to make a chart in Excel.”
DATA PROCESSING CYCLE

 Data processing is the re-structuring or re-ordering of data by people or machines to increase its usefulness and add value for a particular purpose.

 Data processing consists of the following basic steps - input, processing, and output (I-P-O).

 For example, in an emotion-recognition system: Input – heart rate, GSR (galvanic skin response), temperature; Process – pre-processing, segmentation, feature extraction, and emotion classification; Output – emotional state, i.e., negative, positive, or neutral.
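A minimal sketch of this I-P-O cycle in Python, assuming invented sensor readings and a toy rule-based classifier standing in for a real emotion-recognition model:

def classify_emotion(heart_rate, gsr, temperature):
    # Process: stands in for pre-processing, segmentation, feature
    # extraction, and emotion classification
    if heart_rate > 100 and gsr > 5.0:
        return "negative"
    if heart_rate < 70:
        return "neutral"
    return "positive"

# Input: raw, unprocessed sensor readings (values are invented)
reading = {"heart_rate": 110, "gsr": 6.2, "temperature": 36.8}

# Output: the emotional state derived from the processed input
print(classify_emotion(**reading))  # negative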


DATA TYPES AND THEIR REPRESENTATION
 Data types can be described from diverse perspectives.

 In computer science and computer programming, for instance, a data type is simply an attribute of data that tells the compiler or interpreter how the programmer intends to use the data

 Integers (int) – used to store whole numbers, mathematically known as integers

 Booleans (bool) – used to represent values restricted to one of two options: true or false
 Characters (char) – used to store a single character
 Floating-point numbers (float) – used to store real numbers
 Alphanumeric strings (string) – used to store a combination of characters and numbers.
 A data type constrains the values that an expression, such as a variable or a function, might take.

 This data type defines the operations that can be done on the data, the meaning of the data, and the way values of that type can be stored.
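A minimal sketch of these data types using Python's built-in types (note that Python has no separate char type; a one-character string plays that role):

age: int = 30            # integer: whole numbers
is_valid: bool = True    # boolean: restricted to True or False
grade: str = "A"         # character: a single character (Python uses str)
price: float = 19.99     # floating-point number: real numbers
user_id: str = "user42"  # alphanumeric string: letters and digits

# The type defines which operations are meaningful for the value:
print(age + 5)           # arithmetic is defined for int
print(user_id.upper())   # string operations are defined for str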
DATA TYPES FROM DATA ANALYTICS PERSPECTIVE

 From a data analytics point of view, it is important to understand that there are three common types of data types or structures: Structured, Semi-structured, and Unstructured data types

 Structured Data - Structured data conforms to a tabular format with a relationship between the different rows and columns.

 Common examples of structured data are Excel files or SQL databases. Each of these has structured rows and columns that can be sorted.

 Semi –Structured Data - Semi-structured data is a form of structured data that does not conform with the formal structure of data models associated with relational databases or other forms of data

tables, but nonetheless, contains tags or other markers to separate semantic elements and enforce hierarchies of records

and fields within the data.


 Therefore, it is also known as a self-describing structure. Examples of semi-structured data include JSON and XML.

STRUCTURED DATA -- EXAMPLES
● SQL Data ● Excel File
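A minimal sketch of structured (tabular) data, assuming the pandas library and invented sales figures; like an Excel sheet or a SQL table, the rows and columns can be sorted:

import pandas as pd

sales = pd.DataFrame({
    "region": ["North", "South", "East"],   # columns play the role of fields
    "units": [120, 85, 210],
    "revenue": [2400.0, 1700.0, 4200.0],
})

# Sorting rows by a column is the kind of operation structured data supports
print(sales.sort_values("revenue", ascending=False))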
SEMI-STRUCTURED DATA -- EXAMPLES
Examples of semi-structured data: JSON and XML
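A minimal sketch of semi-structured data: the JSON document below carries its own tags (keys) and nesting, so it is self-describing even though it is not a fixed table. The field names are invented for illustration, and Python's standard json module parses it:

import json

document = """
{
  "order_id": 1001,
  "customer": {"name": "Abebe", "email": "abebe@example.com"},
  "items": [
    {"sku": "A-12", "qty": 2},
    {"sku": "B-07", "qty": 1}
  ]
}
"""

order = json.loads(document)        # parse the self-describing record
print(order["customer"]["name"])    # Abebe
print(len(order["items"]))          # 2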


DATA TYPES FROM DATA ANALYTICS PERSPECTIVE

 Unstructured Data - is information that either does not have a predefined data model or is not organized in a pre-defined manner.

 Unstructured information is typically text-heavy but may contain data such as dates, numbers, and facts as well.

 This results in irregularities and ambiguities that make it difficult to understand using traditional programs as compared to data stored in structured databases. Common examples of unstructured data include audio, video files or NoSQL databases.

 Meta Data - Data about Data - Metadata is data about data. It provides additional information about a specific set of data.

 In a set of photographs, for example, metadata could describe when and where the photos were taken. The metadata then provides fields for dates and locations which, by themselves, can be

considered structured data. Because of this reason, metadata is frequently used by Big Data solutions for initial analysis.
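A minimal sketch of metadata as "data about data", using only Python's standard library: the file's contents are the data, while its size and modification time are metadata the operating system keeps about it (the file name is hypothetical and assumed to exist):

import os
import datetime

path = "holiday_photo.jpg"  # hypothetical file, assumed to exist

info = os.stat(path)
metadata = {
    "size_bytes": info.st_size,   # describes the file, not the photo's content
    "modified": datetime.datetime.fromtimestamp(info.st_mtime),
}
print(metadata)  # structured fields describing the underlying data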

UNSTRUCTURED DATA -- EXAMPLES
● PDF files ● Images
DATA VALUE CHAIN
 Value Chains have been used as a decision support tool in the field of Business Management to model the chain of activities that an organization performs in order to deliver a valuable product or service to the market (Porter 1985).

 A value chain is made up of subsystems that each have inputs, transformation processes, and outputs.

 Within their work on Virtual Value Chains, (Rayport and Sviokla 1995) were among the first to apply the value chain metaphor to information systems.

 The value chain can be applied to information flows as an analytical tool to understand the value creation of data technology.

 A Data Value Chain describes information flow as a series of steps required to generate value and useful insights from data.

 According to the European Commission, the data value chain will be the "center of the future knowledge economy, bringing the opportunities of digital developments to more traditional sectors (e.g., transportation, financial services, health,

manufacturing, and retail)" (DG Connect 2013).

DATA VALUE CHAIN

DATA ACQUISITION

Data acquisition is the process of gathering, filtering, and cleaning data before storing it in a data warehouse or other storage solution for data analysis.

In terms of infrastructure requirements, data acquisition is one of the most significant big data challenges.

The infrastructure required to support big data acquisition must provide low, predictable latency in both data capture and query execution; handle very

high transaction volumes, often in a distributed environment; and support flexible and dynamic data structures.
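A minimal sketch of the gather-filter-clean-store idea using only Python's standard csv module; the file names and field names are hypothetical:

import csv

cleaned = []
with open("raw_readings.csv", newline="") as src:            # gather from a source
    for row in csv.DictReader(src):
        if not row["value"]:                                  # filter out empty readings
            continue
        cleaned.append({"sensor_id": row["sensor_id"],
                        "value": float(row["value"])})        # clean / normalize types

with open("curated_readings.csv", "w", newline="") as dst:   # store for later analysis
    writer = csv.DictWriter(dst, fieldnames=["sensor_id", "value"])
    writer.writeheader()
    writer.writerows(cleaned)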

DATA ANALYSIS

Data analysis is concerned with transforming raw data into usable information for decision-making and domain-specific applications.

Data analysis entails investigating, transforming, and modeling data with the goal of highlighting relevant data, synthesizing, and

extracting useful hidden information with high business potential. Data mining, business intelligence, and machine learning are all

related fields.
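A minimal sketch of exploratory analysis, assuming the pandas library and invented incident counts; the goal is simply to surface the pattern hidden in the raw rows:

import pandas as pd

incidents = pd.DataFrame({
    "site": ["A", "A", "B", "B", "C"],
    "month": ["Jan", "Feb", "Jan", "Feb", "Jan"],
    "count": [3, 4, 12, 15, 2],
})

# Aggregate to see which site drives the overall trend
print(incidents.groupby("site")["count"].sum().sort_values(ascending=False))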

DATA CURATION

 Data curation is the active management of data throughout its life cycle to ensure that it meets the data quality requirements for effective use (Pennock 2007).

 Content creation, selection, classification, transformation, validation, and preservation are all examples of data curation processes.

 Data curation is carried out by expert curators who are in charge of improving data accessibility and quality.

 Data curators (also known as scientific curators or data annotators) are responsible for ensuring that data is reliable, discoverable, accessible, reusable, and

appropriate for their purpose.


 The process of labeling data in various formats such as video, images, or text so that machines can understand it is known as data annotation.

 A key trend in big data curation is the use of community and crowdsourcing approaches (Curry et al 2010).

DATA STORAGE

 Data storage is the persistence and management of data in a scalable manner that meets the needs of applications

that require quick access to data.


 For nearly 40 years, Relational Database Management Systems (RDBMS) have been the primary, and almost the only,

storage paradigm.

• A transaction is any operation that is treated as a single unit of work: it either completes fully or does not
complete at all, and it leaves the storage system in a consistent state.
 Atomicity

 Definition: Ensures that every transaction is treated as a single unit—either all operations succeed or none

do.
 Examples: In a bank transfer, money is deducted from the sender and added to the receiver’s account only

if both steps complete successfully. If any step fails, no money is moved.


CONT’D…

 Consistency:

 Definition: Guarantees that a transaction brings the database from one valid state to another, adhering to all

defined rules (constraints and triggers).


 Example: When updating an order status in an e-commerce system, the database rules ensure that the status

changes only if the order details remain valid and all related data are correctly updated.
 Isolation:

 Definition: Means that transactions occur independently without interference from other concurrent

transactions.
 Example: Consider multiple users trying to modify the same shopping cart simultaneously. Isolation

ensures that each user's changes are processed in a way that one user's transaction does not corrupt another’s

work.
CONT’D…

Durability:

• Definition: Ensures that once a transaction is committed, the changes persist even in the case of a system
crash or failure.

• Example: After an online purchase is confirmed, the transaction details are permanently recorded so that
even if the server fails immediately afterward, the sale is not lost.
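A minimal sketch of atomicity and durability using SQLite, which ships with Python: either both UPDATE statements are committed, or a rollback leaves the accounts untouched (account names and amounts are invented):

import sqlite3

conn = sqlite3.connect("bank.db")
conn.execute("CREATE TABLE IF NOT EXISTS accounts (name TEXT PRIMARY KEY, balance REAL)")
conn.execute("INSERT OR IGNORE INTO accounts VALUES ('alice', 100.0), ('bob', 50.0)")
conn.commit()

try:
    conn.execute("UPDATE accounts SET balance = balance - 30 WHERE name = 'alice'")
    conn.execute("UPDATE accounts SET balance = balance + 30 WHERE name = 'bob'")
    conn.commit()      # durability: once committed, the transfer survives a crash
except sqlite3.Error:
    conn.rollback()    # atomicity: a failure undoes both steps together
finally:
    conn.close()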
DATA USAGE

 Data Usage: encompasses data-driven business activities that require data access, data analysis, and the tools required to integrate data analysis into the

business activity.
 Data use in business decision-making can boost competitiveness by lowering costs, increasing added value, or measuring any other parameter against

existing performance criteria.

BIG DATA

WHAT IS BIG DATA?

 Big Data is a massive/huge collection of data that is growing exponentially over time.

 It is a data set that is so large and complex that traditional data management tools cannot store or process it efficiently.

 Big data is a type of data that is extremely large in size.

 A "large dataset" in this context refers to a dataset that is too large to process or store with traditional tooling or on a single computer.

 This means that the common scale of large datasets is constantly shifting and may differ significantly between organizations.

WHAT IS AN EXAMPLE OF BIG DATA?

 The New York Stock Exchange, for example, generates approximately one terabyte of new trade data per day.

 Every day, 500+ terabytes of new data are ingested into the databases of the social media site Facebook. This data is primarily

generated through photo and video uploads, message exchanges, commenting, and so on.

CHARACTERISTICS OF BIG DATA

 Big data can be described by the following characteristics:

 Volume

 Variety

 Velocity

 Veracity

 Volume - The term "Big Data" refers to a massive amount of data.

 The size of the data is very important in determining the value of the data.

 Furthermore, whether a particular data set can be considered Big Data or not is determined by the volume of

data.
 As a result, 'Volume' is one characteristic that must be considered when dealing with Big Data solutions.
CHARACTERISTICS OF BIG DATA

 Variety - The next feature of Big Data is its diversity.

 Variety refers to a wide range of data sources and data types, both structured and unstructured.

 Previously, spreadsheets and databases were the only data sources considered by most applications.

 Data in the form of emails, photos, videos, monitoring devices, PDFs, audio, and so on are now considered in

analysis applications.
 This variety of unstructured data raises concerns about data storage, mining, and analysis.

 Velocity - The term 'velocity' refers to the rate at which data is generated.

 The true potential of the data is determined by how quickly it is generated and processed to meet the demands.

 Big Data Velocity is concerned with the rate at which data flows in from various sources such as business

processes, application logs, networks, social media sites, sensors, mobile devices, and so on. The data flow is

massive and continuous.


CHARACTERISTICS OF BIG DATA

 Veracity - A big data characteristic related to consistency, accuracy, quality, and trustworthiness is veracity.

 Bias, noise, and abnormality in data are all aspects of data veracity. It also refers to data that is

incomplete or contains errors, outliers, or missing values.


 Can we trust the data? How accurate is it?
CLUSTERED COMPUTING AND HADOOP ECOSYSTEM

 Because of the qualities of big data, individual computers are often inadequate for handling the

data at most stages.


 To better address the high storage and computational needs of big data, computer clusters are a better fit.

 Big data clustering software combines the resources of many smaller machines, seeking to provide

a number of benefits:
 Resource Pooling: Combining the available storage space to hold data is a clear benefit, but CPU and

memory pooling are also extremely important.


 Processing large datasets requires large amounts of all three of these resources.
CLUSTERED COMPUTING AND HADOOP ECOSYSTEM

 Big data clustering software combines the resources of many smaller machines, seeking to provide

a number of benefits:
 High Availability: Clusters can provide varying levels of fault tolerance and availability guarantees to

prevent hardware or software failures from affecting access to data and processing.
 This becomes increasingly important as we continue to emphasize the importance of real-time analytics.

 Easy Scalability: Clusters make it easy to scale horizontally by adding additional

machines to the group.


 This means the system can react to changes in resource requirements without

expanding the physical resources on a machine.


CLUSTERED COMPUTING AND HADOOP ECOSYSTEM

 Using clusters requires a solution for managing cluster membership, coordinating resource sharing, and scheduling actual work on individual nodes.


 Cluster membership and resource allocation can be handled by software like Hadoop’s

YARN (which stands for Yet Another Resource Negotiator).


 The assembled computing cluster often acts as a foundation that other software

interfaces with to process the data.


 The machines involved in the computing cluster are also typically involved with the

management of a distributed storage system, which we will talk about when we discuss

data persistence.
HADOOP AND ITS ECOSYSTEM

 Hadoop is an open-source framework intended to make interaction with big data easier.

 It is a framework that allows for the distributed processing of large datasets across clusters of computers

using simple programming models.


 It is inspired by a technical document published by Google. The four key characteristics of Hadoop are:

 Economical: Its systems are highly economical as ordinary computers can be used for data processing.

 Reliable: It is reliable as it stores copies of the data on different machines and is resistant to hardware failure.

 Scalable: It is easily scalable, both horizontally and vertically. A few extra nodes help in scaling up the

framework.
 Flexible: It is flexible; you can store as much structured and unstructured data as you need and decide

how to use it later.


HADOOP AND ITS ECOSYSTEM

 Hadoop has an ecosystem that has evolved from its four core components: data management, access, processing,

and storage.
 It is continuously growing to meet the needs of Big Data. It comprises the following components and many

others:
 HDFS: Hadoop Distributed File System

 YARN: Yet Another Resource Negotiator

 MapReduce: Programming based Data Processing

 Spark: In-Memory data processing

 PIG, HIVE: Query-based processing of data services

 HBase: NoSQL Database

 Mahout, Spark MLLib: Machine Learning algorithm libraries


HADOOP AND ITS ECOSYSTEM

 Solr, Lucene: Searching and Indexing

 Zookeeper: Managing cluster

 Oozie: Job Scheduling


BIG DATA LIFE CYCLE WITH HADOOP

 Ingesting data into the system - The first stage of Big Data processing is Ingest.

 The data is ingested or transferred to Hadoop from various sources such as relational databases, systems, or local

files.
 Sqoop transfers data from DBMS to HDFS, whereas Flume transfers event data.

 Processing the data in storage - The second stage is Processing. In this stage, the data is stored and processed.

 The data is stored in the distributed file system, HDFS, and in the NoSQL distributed database, HBase. Spark and

MapReduce perform the data processing.


 Computing and analyzing data - The third stage is to Analyze. Here, the data is analyzed by processing

frameworks such as Pig, Hive, and Impala.


 Pig converts the data using map and reduce operations and then analyzes it. Hive is also based on map and reduce

programming and is most suitable for structured data. HBase is used for NoSQL data.
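A minimal sketch of the map and reduce idea that Pig and Hive build on, written as plain Python rather than a real Hadoop job (the input lines are invented):

from itertools import groupby
from operator import itemgetter

def mapper(line):
    # map: emit a (key, value) pair for every word
    for word in line.lower().split():
        yield word, 1

def reducer(word, counts):
    # reduce: combine all values that share the same key
    return word, sum(counts)

lines = ["big data is big", "data science uses big data"]

pairs = sorted(p for line in lines for p in mapper(line))      # shuffle and sort by key
totals = [reducer(key, (v for _, v in group))
          for key, group in groupby(pairs, key=itemgetter(0))]
print(totals)  # [('big', 3), ('data', 3), ('is', 1), ('science', 1), ('uses', 1)]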
BIG DATA LIFE CYCLE WITH HADOOP

 Visualizing the results - The fourth stage is Access, which is performed by tools such as

Hue and Cloudera Search. In this stage, the analyzed data can be accessed by users.
