
UNIT I INTRODUCTION

Need for data science – benefits and uses – facets of data – data science process – setting
the research goal – retrieving data – cleansing, integrating, and transforming data –
exploratory data analysis – build the models – presenting and building applications.

DATA SCIENCE:

Big data is a term for any collection of data sets so large or complex that it becomes difficult to
process them using traditional data management techniques such as RDBMSs (relational database
management systems).

Data science involves using methods to analyze massive amounts of data and extract the
knowledge it contains.

Data science is the study of data. It involves developing methods of recording, storing, and
analyzing data to effectively extract useful information. The goal of data science is to gain
insights and knowledge from any type of data — both structured and unstructured.

Data science is the field of study that combines domain expertise, programming skills, and
knowledge of mathematics and statistics to extract meaningful information from data. Data
science practitioners apply machine learning algorithms to numbers, text, images, video, audio,
and more to produce artificial intelligence (AI) systems that perform tasks which ordinarily
require human intelligence. These systems, in turn, generate insights from the data.

NEED FOR DATA SCIENCE:

Data Science for Business:

Data science provides knowledge, which is derived from big data through processes of extraction
and analysis. This information is sometimes collected from ongoing sources within the system,
and most of the time it is mined from external sources.

Data is the key component for every business, as businesses need it to analyze their current
scenario based on past facts and performance and make decisions for future challenges. Data
science is the requirement of every business to make business forecasts and predictions based on
facts and figures, which are collected in the form of data and processed through data science.

The influence of data science on business depends on understanding the critical information
given as input to the system, because decisions are based on those results. Data science provides
resolutions for challenges that are crucial for business decisions, as the future of the business is
based on them. Data science is important for marketing promotions and campaigns because it
reveals needs and wants, in the form of trends and consumer behavior, in a competitive market at
the right time for the right consumer.

Data Science for Medical Research:

Data science is necessary for research and analysis in health care, as it makes it easier for
practitioners to understand the challenges and extract results through analysis and insights based
on data. The medical science industry also thrives on data science, as it has provided solutions
for long-standing complexities. In recent years, there has been an immense increase in deadly
disease outbreaks and new fatal viruses due to pollution, unsafe and unhealthy practices,
improper diet, and so on.

Medical science, through advanced technology, has been able to predict problems that may arise
in patients, and treatments have been designed in advance to alter the course of the disease. The
advanced technologies of machine learning and artificial intelligence have been infused into
medical science to conduct biomedical and genetic research more easily. The analysis of this
research is driven by big data, and it has helped scientists develop replica models of human
DNA. This DNA modeling is assisting in understanding the configuration and structure of human
development and its functions.

Scientists can research new medicines and study their possible effects on the human body. Data
science has enabled the data to be turned into visualizations and graphical presentations to study
the behavior patterns and actions of many unseen components of the human body. It helps
scientists find cures for diseases that had no possible treatment in the past.
Data Science for Health Care:

Life-threatening health issues must be dealt with utmost care and diligence to target the actual
cause and find a cure with few side effects. The data accumulating in clinics and hospitals is
huge. The clinical history of a patient and the medicinal treatment given are computerized and
kept as digitized records, which makes it easier for practitioners and doctors to detect complex
diseases at an early stage and understand their complexity.

It helps to design a personalized treatment for the patient. Similarly, this data collected on a
larger scale helps health organizations and scientists generate a clear picture of ongoing diseases.
With the help of data science, it becomes easier to gather quantitative data on the health status of
a region or country and to design health benefit programs efficiently.

Data Science in Education:

A large amount of student data is being gathered through online forums, websites, and online
learning portals, which reveal students' choices concerning field of study, preferred universities,
personal interests, occupational choices, and careers. The facts and figures obtained from this
data help to design the guides and roadmaps suggested for higher studies.

The data is used to forecast the intake of new batches of students and helps policymakers design
new courses and programs based on the most popular and in-demand trends in the market. It also
helps devise the admission policy based on the grading system. This data is also used to find out
which study programs are most in demand among students with diverse backgrounds, and it
helps to understand their behavior in selecting a career path.

Online personality tests and career counseling help students and individuals select a career path
based on their preferences. This is done with the help of data science, which provides algorithms
and predictive models that draw conclusions from the provided set of choices.
Data Science in Technology:

The new technology emerging around us to ease life and alter our lifestyle is all due to data
science. Some applications collect sensory data to track our movements and activities and also
provide knowledge and suggestions concerning our health, such as blood pressure and heart rate.
This data collected is useful in designing health care products, medicines, and fitness equipment
that are tailored for a large group of people sharing the same problems and conditions.

Data science is also being implemented to advance security and safety technology, which is one
of the most important issues nowadays. It has also increased personalization and enhanced
privacy in smart devices.

For instance, voice recognition of a specified user, motion-sensor cameras for surveillance,
fingerprint recognition in mobile phones, and eye detection for passenger verification have all
been made possible through the application of data science and artificial intelligence.

Another example is the driverless cars designed to benefit people with disabilities and those who
cannot drive due to several other medical conditions.

Data Science for Social Media:

Data science applications and integration on social media websites and public socializing
platforms have taken the process of datafication to another level. Most of the data on consumer
behavior, choices, and preferences is collected through online platforms, which helps businesses
grow. The systems are smart enough to predict and analyze a user's mood, behavior, and opinions
towards a specific product, incident, or event with the help of texts, feedback, thoughts, views,
reactions, and suggestions. It also helps in examining and investigating consumer and human
behavior.

Data Science in Policy Making:

Analysis and modeling of factors concerning various aspects of our environment and ecosystem
are also backed by data science. Environmentalists work on data to predict the outcomes of
ongoing energy usage, pollution, and population growth, and their effects on our ecosystem and
global warming. The anticipated results are calculated with the help of data science and could
not have been known otherwise. The technology assists in finding solutions to the climate
emergency with the help of experiments on test models.

It has also provided data that allows scientists to know the prevailing minerals and fuel sources
around the world, along with their quantity and how long they will last. We need data science
tools to provide alternative sources of energy, preserve the earth's scarce resources, and make
our planet more sustainable.

Data science is vital in almost every field. It needs to develop and progress to handle emerging
issues in every industry, business, and organization. A system that solves complex problems
should be advanced enough to provide simple solutions. Further improvement in the field of data
science will be achievable with more developments and innovations in artificial intelligence and
machine learning.

Deep learning gives systems and machines a brain of sorts, allowing them to think and act on
inputs as simple as an image. Data science provides an expansive and conclusive view of the
influencing factors needed for assessments and systematic conclusions. Keeping in view all the
technological innovations, discoveries, and progress, we will need data science more than ever
before.

BENEFITS AND USES:

Data science and big data are used almost everywhere in both commercial and noncommercial
settings.

Commercial companies in almost every industry use data science and big data to gain insights
into their customers, processes, staff, competition, and products. Many companies use data
science to offer customers a better user experience, as well as to cross-sell, up-sell, and
personalize their offerings. A good example of this is Google AdSense, which collects data from
internet users so relevant commercial messages can be matched to the person browsing the
internet.

Financial institutions use data science to predict stock markets, determine the risk of lending
money, and learn how to attract new clients for their services.

Universities use data science in their research but also to enhance the study experience of their
students. The rise of massive open online courses (MOOC) produces a lot of data, which allows
universities to study how this type of learning can complement traditional classes.

FACETS OF DATA:

The main categories of data are:

■ Structured

■ Unstructured

■ Natural language

■ Machine-generated

■ Graph-based

■ Audio, video, and images

■ Streaming

Structured data:

Structured data is data that depends on a data model and resides in a fixed field within a record.
It is easy to store structured data in tables within databases or Excel files. SQL, or Structured
Query Language, is the preferred way to manage and query data that resides in databases.
Unstructured data:

Unstructured data is data that isn’t easy to fit into a data model because the content is
context-specific or varying. One example of unstructured data is regular email.
Natural language:

Natural language is a special type of unstructured data; it's challenging to process because it
requires knowledge of specific data science techniques and linguistics. Natural language
processing is used in entity recognition, topic recognition, summarization, text completion, and
sentiment analysis.

Example: Predictive text, Search results

Machine-generated data:

Machine-generated data is information that's automatically created by a computer, process,
application, or other machine without human intervention. Machine-generated data is becoming
a major data resource and will continue to be one.
The analysis of machine data relies on highly scalable tools, due to its high volume and speed.
Examples of machine data are web server logs, call detail records, network event logs, and
telemetry.

Graph-based or network data:

Graph or network data is, in short, data that focuses on the relationship or adjacency of objects.
The graph structures use nodes, edges, and properties to represent and store graphical data.
Graph-based data is a natural way to represent social networks, and its structure allows you to
calculate specific metrics such as the influence of a person and the shortest path between two
people.

Examples of graph-based data can be found on many social media websites; the follower list on
Twitter is one such example. Graph databases are used to store graph-based data and are queried
with specialized query languages such as SPARQL.

Audio, image, and video:

Audio, image, and video are data types that pose specific challenges to a data scientist. Tasks
that are trivial for humans, such as recognizing objects in pictures, turn out to be challenging for
computers. MLBAM (Major League Baseball Advanced Media) announced in 2014 that it would
increase video capture to approximately 7 TB per game for the purpose of live, in-game
analytics.

High-speed cameras at stadiums will capture ball and athlete movements to calculate in real
time, for example, the path taken by a defender relative to two baselines. Recently a company
called DeepMind succeeded at creating an algorithm that’s capable of learning how to play video
games. This algorithm takes the video screen as input and learns to interpret everything via a
complex process of deep learning. It’s a remarkable feat that prompted Google to buy the
company for their own Artificial Intelligence (AI) development plans.

Streaming data:

While streaming data can take almost any of the previous forms, it has an extra property. The
data flows into the system when an event happens instead of being loaded into a data store in a
batch. Although this isn’t really a different type of data, we treat it here as such because you need
to adapt your process to deal with this type of information. Examples are the “What’s trending”
on Twitter, live sporting or music events, and the stock market.

THE DATA SCIENCE PROCESS:

The data science process typically consists of six steps:

1. Setting the research goal
2. Retrieving data
3. Data preparation
4. Exploratory data analysis
5. Building the models
6. Presentation and automation


1. SETTING THE RESEARCH GOAL:

Data science is mostly applied in the context of an organization. When the business asks to
perform a data science project, the first step is to prepare a project charter. This charter contains
information such as what you’re going to research, how the company benefits from that, what
data and resources are needed, a timetable, and deliverables.

A project charter requires teamwork, and the input covers at least the following:

■ A clear research goal

■ The project mission and context

■ How you’re going to perform your analysis

■ What resources you expect to use

■ Proof that it’s an achievable project, or proof of concepts

■ Deliverables and a measure of success

■ A timeline

2. RETRIEVING DATA

The second step is to collect data. Data can be stored in many forms, ranging from simple text
files to tables in a database. The objective now is acquiring all the data that is needed. The project
charter states which data is needed and where it can be found. In this step it is ensured that the
data can be used in the program, which means checking the existence of the data, its quality, and
access to it. Data can also be delivered by third-party companies and takes many forms, ranging
from Excel spreadsheets to different types of databases.
3. DATA PREPARATION

Data collection is an error-prone process; in this phase the quality of the data is enhanced and is
prepared to be used in subsequent steps. This phase consists of three subphases:

1. Data cleansing- removes false values from a data source and inconsistencies across data
sources,
2. Data integration- enriches data sources by combining information from multiple data
sources,
3. Data transformation- ensures that the data is in a suitable format for use in the models.

DATA CLEANSING:

Data cleansing is a sub-process of the data science process that focuses on removing errors in
your data so the data becomes a true and consistent representation of the processes it originates
from.

Two types of errors exist. The first type is the interpretation error, such as when a value in the
data is taken for granted, like accepting that a person's age is greater than 300 years. The second
type of error points to inconsistencies between data sources or against the company's
standardized values. An example of this class of errors is placing "Female" in one table and "F"
in another when they represent the same thing: that the person is female.
DATA ENTRY ERRORS

Data collection and data entry are error-prone processes. They often require human intervention,
and because humans are only human, they make typos or lose their concentration for a second
and introduce an error into the chain. But data collected by machines or computers isn’t free
from errors either. Errors can arise from human sloppiness, whereas others are due to machine or
hardware failure.

Examples of errors originating from machines are transmission errors or bugs in the extract,
transform, and load phase (ETL).
When you have a variable that can take only two values, "Good" and "Bad", a frequency table
can be created to see if those are truly the only two values present. If the frequency table also
shows values such as "Godo" and "Bade", something went wrong in at least those cases.

Most errors of this type are easy to fix with simple assignment statements and if-then-else rules:

if x == "Godo":
    x = "Good"
if x == "Bade":
    x = "Bad"
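
A minimal sketch of the same check and fix applied to a whole column, assuming the data sits in a pandas DataFrame with a hypothetical column named "quality" (pandas is not prescribed by the text, but is a common choice):

import pandas as pd

df = pd.DataFrame({"quality": ["Good", "Bad", "Godo", "Bade", "Good"]})

# Frequency table: reveals unexpected values such as "Godo" and "Bade"
print(df["quality"].value_counts())

# Fix the typos by mapping the wrong values to the correct ones
df["quality"] = df["quality"].replace({"Godo": "Good", "Bade": "Bad"})
print(df["quality"].value_counts())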

REDUNDANT WHITESPACE:

Whitespaces tend to be hard to detect but cause errors like other redundant characters would. For
instance, in Python the strip() function can be used to remove leading and trailing spaces.
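
A small illustration of strip(), using a hypothetical string value:

raw_value = "  FR  "               # country code padded with redundant whitespace
clean_value = raw_value.strip()    # removes leading and trailing spaces
print(clean_value == "FR")         # True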

FIXING CAPITAL LETTER MISMATCHES:

Capital letter mismatches are common. Most programming languages make a distinction
between “Brazil” and “brazil”. In this case it can be solved by applying a function that returns
both strings in lowercase, such as .lower() in Python. "Brazil".lower() == "brazil".lower()
evaluates to True.

IMPOSSIBLE VALUES AND SANITY CHECKS


Sanity checks are another valuable type of data check. Here the value is checked against
physically or theoretically impossible values such as people taller than 3 meters or someone with
an age of 299 years.

Sanity checks can be directly expressed with rules: check = 0 <= age <= 120

OUTLIERS:

An outlier is an observation that seems to be distant from other observations or, more
specifically, one observation that follows a different logic or generative process than the other
observations. The easiest way to find outliers is to use a plot or a table with the minimum and
maximum values.
In a plot of the data, possible outliers show up as points far on the upper or lower side when a
normal distribution is expected. The normal distribution, or Gaussian distribution, is the most
common distribution in natural sciences.
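
A minimal sketch of such a check, assuming a hypothetical pandas Series of heights in metres; the minimum/maximum summary and the 1.5 x IQR rule of thumb used here are common conventions rather than anything prescribed by the text:

import pandas as pd

heights = pd.Series([1.65, 1.80, 1.72, 5.10, 1.68])   # 5.10 m looks suspicious

# The minimum and maximum immediately expose extreme values
print(heights.min(), heights.max())

# Rule of thumb: flag values outside 1.5 * IQR from the quartiles
q1, q3 = heights.quantile(0.25), heights.quantile(0.75)
iqr = q3 - q1
outliers = heights[(heights < q1 - 1.5 * iqr) | (heights > q3 + 1.5 * iqr)]
print(outliers)   # flags the 5.10 m observation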

DEALING WITH MISSING VALUES:

Missing values aren't necessarily wrong, but they still need to be handled separately; certain
modeling techniques can't handle missing values. They might be an indicator that something
went wrong in data collection or that an error happened in the ETL process. Common techniques
data scientists use are omitting the values, setting the value to null, imputing a static value such
as 0 or the mean, imputing a value from an estimated or theoretical distribution, and modeling
the value.
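
A minimal sketch of two of these techniques with pandas, using a small hypothetical DataFrame with assumed column names:

import pandas as pd

df = pd.DataFrame({"age": [25, None, 42, 31], "city": ["Oslo", "Delhi", None, "Lima"]})

# Option 1: omit the observations that contain a missing value
print(df.dropna())

# Option 2: impute a static value, e.g. the column mean for a numeric variable
df["age"] = df["age"].fillna(df["age"].mean())
print(df)
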
DEVIATIONS FROM A CODE BOOK:

Detecting errors in larger data sets against a code book or against standardized values can be
done with the help of set operations. A code book is a description of data, which is a form of
metadata. It contains things such as the number of variables per observation, the number of
observations, and what each encoding within a variable means. (For instance “0” equals
“negative”, “5” stands for “very positive”.)
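
A minimal sketch of such a check with Python set operations, assuming a hypothetical code book that only allows the encodings 0 to 5:

allowed_codes = {0, 1, 2, 3, 4, 5}    # encodings listed in the code book
observed_codes = {0, 2, 5, 9}         # encodings actually found in the data set

# Values present in the data but absent from the code book are deviations
deviations = observed_codes - allowed_codes
print(deviations)   # {9}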

CORRECT ERRORS AS EARLY AS POSSIBLE

A good practice is to mediate data errors as early as possible in the data collection chain and to
fix as little as possible inside your program while fixing the origin of the problem. Retrieving
data is a difficult task, and organizations spend millions of dollars on it in the hope of making
better decisions. The data collection process is error-prone, and in a big organization it involves
many steps and teams.

Data should be cleansed when acquired for many reasons:

■ Not everyone spots the data anomalies. Decision-makers may make costly mistakes when the
information they rely on comes from applications that fail to correct for the faulty data.

■ If errors are not corrected early on in the process, the cleansing will have to be done for every
project that uses that data.

■Data errors may point to a business process that isn’t working as designed. For instance, both
authors worked at a retailer in the past, and they designed a couponing system to attract more
people and make a higher profit. During a data science project, we discovered clients who
abused the couponing system and earned money while purchasing groceries. The goal of the
couponing system was to stimulate cross-selling, not to give products away for free. This flaw
cost the company money and nobody in the company was aware of it. In this case the data wasn’t
technically wrong but came with unexpected results.

■ Data errors may point to defective equipment, such as broken transmission lines and defective
sensors.

■ Data errors can point to bugs in software or in the integration of software that may be critical
to the company. While doing a small project at a bank we discovered that two software
applications used different locale settings. This caused problems with numbers greater than 1,000.
For one app the number 1.000 meant one, and for the other it meant one thousand.

DATA INTEGRATION:

Data integration enriches data sources by combining information from multiple data sources, and
data transformation ensures that the data is in a suitable format for use in models.
Data comes from several different places and in this step, the data from different sources are
integrated. Data varies in size, type, and structure, ranging from databases and Excel files to text
documents.

The different ways of combining data:

The two operations that are performed in various data types to combine them are as follows:

1. Joining- enriching an observation from one table with information from another table.
2. Appending or stacking- adding the observations of one table to those of another table.

JOINING TABLES:

Joining tables allows you to combine the information of one observation found in one table with
the information found in another table. The focus is on enriching a single observation.

To join tables, variables that represent the same object in both tables are used, such as a date, a
country name, or a Social Security number. These common fields are known as keys. When these
keys also uniquely define the records in the table, they are called primary keys.

APPENDING TABLES:

Appending or stacking tables is effectively adding observations from one table to another table.
The result of appending the tables is a larger one with the observations from both the tables.
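
A minimal sketch of both operations with pandas; the table and column names are hypothetical:

import pandas as pd

clients = pd.DataFrame({"client_id": [1, 2], "name": ["Ann", "Bob"]})
orders = pd.DataFrame({"client_id": [1, 2], "amount": [120, 80]})

# Joining: enrich each client observation with order information via the key
joined = clients.merge(orders, on="client_id", how="left")

# Appending (stacking): add the observations of a second client table
more_clients = pd.DataFrame({"client_id": [3], "name": ["Cleo"]})
appended = pd.concat([clients, more_clients], ignore_index=True)

print(joined)
print(appended)
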
USING VIEWS TO SIMULATE DATA JOINS AND APPENDS:

To avoid duplication of data, data can be virtually combined with views.

ENRICHING AGGREGATED MEASURES

Data enrichment can also be done by adding calculated information to the table, such as the total
number of sales or what percentage of total stock has been sold in a certain region.
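
A minimal pandas sketch of such an aggregated measure, assuming a hypothetical sales table with region and sales columns:

import pandas as pd

sales = pd.DataFrame({"region": ["North", "North", "South"], "sales": [100, 50, 150]})

# Add the total sales per region and each row's share within its region
sales["region_total"] = sales.groupby("region")["sales"].transform("sum")
sales["share_of_region"] = sales["sales"] / sales["region_total"]
print(sales)
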
Transforming data:

Certain models require their data to be in a certain shape. Now that the data is cleansed and
integrated, the next task is: transforming the data so it takes a suitable form for data modeling.

Reducing the number of Variables:


In some data sets there are too many variables, and they need to be reduced because they don't
add new information to the model. For instance, techniques based on a Euclidean distance
perform well only up to about 10 variables.
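
The text does not name a specific reduction technique; principal component analysis (PCA) is one common choice, sketched here with scikit-learn on randomly generated data:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))          # 100 observations, 20 original variables

# Reduce the 20 variables to 5 components that retain most of the variation
pca = PCA(n_components=5)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                  # (100, 5)
print(pca.explained_variance_ratio_)    # variance captured by each component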

Data modeling or model building

In this phase, models, domain knowledge, and insights about the data found in the previous steps
are used to answer the research question. A technique is selected from the fields of statistics,
machine learning, operations research, and so on. Building a model is an iterative process that
involves selecting the variables for the model, executing the model, and model diagnostics.

4. EXPLORATORY DATA ANALYSIS:

During exploratory data analysis you take a deep dive into the data. Information becomes much
easier to grasp when shown in a picture; therefore, graphical techniques are used to gain an
understanding of the data and the interactions between variables.

The visualization techniques used in this phase range from simple line graphs or histograms, to
more complex diagrams such as Sankey and network graphs. Sometimes it’s useful to compose a
composite graph from simple graphs to get even more insight into the data.
These plots can also be combined and linked to provide even more insight, as described below.
Brushing and linking:

With brushing and linking different graphs and tables (or views) can be combined and linked so
the changes in one graph are automatically transferred to the other graphs. This interactive
exploration of data facilitates the discovery of new insights.
Not only does this indicate a high correlation between the answers, but it's easy to see that when
several points on a subplot are selected, they correspond to similar points on the other graphs. In
this case the selected points on the left graph correspond to points on the middle and right
graphs, although they correspond better in the middle and right graphs.
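
As a simple illustration of the graphical techniques mentioned above, a minimal sketch with pandas and matplotlib on randomly generated data (the variables are hypothetical):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
df = pd.DataFrame({"age": rng.integers(18, 70, 200),
                   "income": rng.normal(40000, 10000, 200)})

df["age"].hist(bins=20)                  # distribution of a single variable
plt.show()

df.plot.scatter(x="age", y="income")     # interaction between two variables
plt.show()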

5. BUILD THE MODELS:


With clean data in place and a good understanding of the content, models can be built with the
goal of making better predictions, classifying objects, or gaining an understanding of the system
that you’re modeling. This phase is much more focused than the exploratory analysis step.
Building a model is an iterative process.

The following are the main steps:

1. Selection of a modeling technique and variables to enter in the model

2. Execution of the model

3. Diagnosis and model comparison

Model and variable selection:

The model that performs best for the given data must be selected. Several factors involved in
selecting the model include:

■ Must the model be moved to a production environment and, if so, would it be easy to
implement?

■ How difficult is the maintenance on the model: how long will it remain relevant if left
untouched?

■ Does the model need to be easy to explain?

Model Execution:
Once the model has been selected, it has to be implemented in code. Most programming
languages have libraries for common models, and these library functions can speed up the
process of coding the model.

Linear regression - prediction problems:


Linear regression attempts to model the relationship between two variables by fitting a linear
equation to observed data. One variable is considered to be an explanatory variable, and the other
is considered to be a dependent variable.

The following listing shows the execution of a linear prediction model.

For a linear regression, a “linear relation” between each x (predictor) and the y (target) variable
is assumed. The target variable is created based on the predictor by adding a bit of randomness.
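
A minimal sketch of such a listing, assuming the statsmodels library (which provides the results.summary() call referenced below); the coefficients and the noise level are arbitrary illustration values:

import numpy as np
import statsmodels.api as sm

np.random.seed(42)
x = np.random.normal(size=100)                              # predictor
y = 2 * x + 0.5 + np.random.normal(scale=0.3, size=100)     # target = linear relation + randomness

X = sm.add_constant(x)            # add an intercept term
results = sm.OLS(y, X).fit()      # ordinary least squares linear regression
print(results.summary())          # coefficients, R-squared, p-values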

The results.summary() call outputs a table describing the fitted model, including the following:


● Model fit: for this the R-squared or adjusted R-squared is used. This measure is an indication
of the amount of variation in the data that gets captured by the model. The difference between
the adjusted R-squared and the R-squared is minimal here, because the adjusted one is the
normal one plus a penalty for model complexity. A model gets complex when many variables
(or features) are introduced. For models used in business, a model fit value above 0.85 is often
considered good.
● Predictor coefficients: each predictor variable has a coefficient, which for a linear model
indicates how much the target changes per unit change in that predictor.
● Predictor significance: coefficients are great, but sometimes not enough evidence exists to
show that the influence is there. This is what the p-value is about. A long explanation about
type 1 and type 2 errors is possible here, but the short explanation would be: if the p-value is
lower than 0.05, the variable is considered significant.

K-Nearest Neighbors - classification problems:


A k-nearest-neighbor algorithm, often abbreviated k-nn, is an approach to data classification that
estimates how likely a data point is to be a member of one group or the other depending on what
group the data points nearest to it are in.

The prediction can be compared to the real values using a confusion matrix:

metrics.confusion_matrix(target, prediction)
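
A minimal sketch with scikit-learn, which provides both the classifier and the metrics.confusion_matrix function shown above; the data is randomly generated for illustration:

import numpy as np
from sklearn import metrics
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                   # two predictor variables
target = (X[:, 0] + X[:, 1] > 0).astype(int)    # class depends on the predictors

knn = KNeighborsClassifier(n_neighbors=5)       # k = 5 nearest neighbors
knn.fit(X, target)
prediction = knn.predict(X)

# Rows are the true classes, columns the predicted classes
print(metrics.confusion_matrix(target, prediction))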

Model diagnosis and Model Comparison:


While training the model, a part of the data can be held out for the testing phase, which is used
to evaluate the model. The model is trained using one part of the data, and later the model is
tested using the unseen data. Error measures are then used in the testing phase to evaluate how
well the model performs on the given data.

Example of error measures: Mean square error method

Mean square error is a simple measure: check for every prediction how far it was from the truth,
square this error, and add up the error of every prediction.
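
A minimal sketch of this holdout evaluation with scikit-learn; the model, the 30% test split, and the generated data are assumptions for illustration:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
x = rng.normal(size=(100, 1))
y = 2 * x[:, 0] + 0.5 + rng.normal(scale=0.3, size=100)

# Hold out 30% of the data for the testing phase
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)

model = LinearRegression().fit(x_train, y_train)
predictions = model.predict(x_test)

# Mean square error: average of the squared differences between truth and prediction
print(mean_squared_error(y_test, predictions))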

6. PRESENTATION AND AUTOMATION

Finally, the results are presented to the business. These results can take many forms, ranging
from presentations to research reports. Sometimes the execution of the process is automated
because the business will want to use the insights gained in another project or enable an
operational process to use the outcome of the model. This step involves presenting the results to
the stakeholders and industrializing the analysis process for repetitive reuse and integration with
other tools.
