Applied - Data - Science MODULE 1 SEM8

Uploaded by

Dhruv Suvarna
MODULE 1

Introduction to Data Science

Q Data Science Tasks: Description, Algorithms and Examples


ANS

Q Data Science Techniques.

ANS

The techniques used in the steps of a data science process, and commonly associated with the term
“data science”, are:

➢ Descriptive statistics
➢ Exploratory visualization
➢ Dimensional slicing
➢ Hypothesis testing
➢ Data engineering
➢ Business intelligence

Descriptive statistics: Computing the mean, standard deviation, correlation, and other descriptive
statistics quantifies the aggregate structure of a dataset. This information is essential for
understanding the structure of the data and the relationships within the dataset.

• They are used in the exploration stage of the data science process.
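
As a rough illustration of the statistics named above, the following sketch computes a mean, a sample standard deviation, and a Pearson correlation using only the Python standard library. The `ad_spend` and `revenue` values and the helper `pearson_correlation` are hypothetical and exist only for this example.

```python
import math
import statistics

# Hypothetical sample: advertising spend and revenue for five periods.
ad_spend = [10.0, 20.0, 30.0, 40.0, 50.0]
revenue = [25.0, 41.0, 62.0, 79.0, 103.0]


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x = statistics.mean(xs)
    mean_y = statistics.mean(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


mean_revenue = statistics.mean(revenue)    # central tendency
stdev_revenue = statistics.stdev(revenue)  # sample standard deviation
corr = pearson_correlation(ad_spend, revenue)

print(f"mean={mean_revenue:.2f} stdev={stdev_revenue:.2f} corr={corr:.3f}")
```

A correlation close to 1 here only says that the two made-up series move together; it does not by itself establish a causal relationship.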

Exploratory visualization: Expressing data in visual coordinates enables users to find patterns
and relationships in the data and to comprehend large datasets. Like descriptive statistics,
visualizations are integral to the pre- and post-processing steps in data science.

Dimensional slicing: Online analytical processing (OLAP) applications, which are prevalent in
organizations, mainly provide information on the data through dimensional slicing, filtering, and pivoting.
OLAP analysis is enabled by a database schema design in which the data are organized as dimensions
(e.g., products, regions, dates) and quantitative facts or measures (e.g., revenue, quantity). With a
well-defined database structure, it is easy to slice the yearly revenue by product or by a combination of
region and product.

• These techniques are extremely useful and may unveil patterns in data (e.g., candy sales
decline after Halloween in the United States).
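
The idea of slicing a measure by dimensions can be sketched in a few lines of plain Python. This is only an illustration, not how an OLAP engine is implemented: the `sales` rows, their field names, and the helper `slice_measure` are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical fact rows: dimensions (year, region, product) plus one measure (revenue).
sales = [
    {"year": 2023, "region": "East", "product": "Candy", "revenue": 120.0},
    {"year": 2023, "region": "West", "product": "Candy", "revenue": 80.0},
    {"year": 2023, "region": "East", "product": "Soda", "revenue": 50.0},
    {"year": 2024, "region": "East", "product": "Candy", "revenue": 150.0},
    {"year": 2024, "region": "West", "product": "Soda", "revenue": 70.0},
]


def slice_measure(rows, dimensions, measure="revenue", **filters):
    """Sum a measure grouped by the given dimensions, after optional filtering."""
    totals = defaultdict(float)
    for row in rows:
        # Filtering: keep only rows matching every requested dimension value.
        if any(row[key] != value for key, value in filters.items()):
            continue
        # Slicing: the group key is the tuple of dimension values.
        group = tuple(row[d] for d in dimensions)
        totals[group] += row[measure]
    return dict(totals)


# Yearly revenue sliced by product, and total revenue by region and product.
revenue_by_product_2023 = slice_measure(sales, ["product"], year=2023)
revenue_by_region_product = slice_measure(sales, ["region", "product"])
print(revenue_by_product_2023)
print(revenue_by_region_product)
```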

Hypothesis testing: In confirmatory data analysis, experimental data are collected to evaluate whether
there is enough evidence to support a hypothesis.

• There are many types of statistical tests, and they have a wide variety of business applications (e.g., A/B
testing in marketing).
• In general, data science is a process in which many hypotheses are generated and tested based on
observational data.
• Since data science algorithms are iterative, solutions can be refined in each step.
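
One common way to analyze an A/B test is a two-proportion z-test; the notes do not prescribe a specific test, so this choice is an assumption. The sketch below uses only the standard library, and the visitor and conversion counts, the function name `two_proportion_z_test`, and the 0.05 significance level are all illustrative.

```python
import math


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Return (z, two-sided p-value) for H0: both variants share one conversion rate."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / standard_error
    # Two-sided p-value under the standard normal distribution:
    # P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: conversions out of visitors for variants A and B.
z, p_value = two_proportion_z_test(successes_a=200, n_a=5000, successes_b=260, n_b=5000)
alpha = 0.05  # conventional significance level, an assumption of this sketch
print(f"z={z:.2f}, p={p_value:.4f}, reject H0: {p_value < alpha}")
```

The normal approximation used here is reasonable only when both samples are large; for small samples an exact test would be a safer choice.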

Data engineering: Data engineering is the process of sourcing, organizing, assembling, storing, and
distributing data for effective analysis and usage.

• Database engineering, distributed storage and computing frameworks (e.g., Apache Hadoop, Spark, Kafka),
parallel computing, extract, transform, and load (ETL) processing, and data warehousing constitute data
engineering techniques.
• Data engineering helps source and prepare data for data science learning algorithms.

Business intelligence: Business intelligence helps organizations consume data effectively.

• It helps users query data ad hoc without writing technical query commands, and it uses dashboards or
visualizations to communicate facts and trends.
• Business intelligence specializes in the secure delivery of information to the right roles and the
distribution of information at scale.
• Historical trends are usually reported, but in combination with data science, both past and predicted
future data can be combined.
• BI can hold and distribute the results of data science.
Q Components of Data Science
Ans

1. Statistics
2. Domain Expertise
3. Data engineering
4. Visualization
5. Advanced computing
6. Mathematics
7. Machine learning

1. Statistics: Statistics is one of the most important components of data science. Statistics is
a way to collect and analyze large amounts of numerical data and find meaningful
insights in it.
2. Domain Expertise: In data science, domain expertise binds data science together.
• Domain expertise means specialized knowledge or skills in a particular area. In data
science, there are various areas for which we need domain experts.
3. Data engineering: Data engineering is a part of data science that involves acquiring,
storing, retrieving, and transforming data. Data engineering also involves adding metadata
(data about data) to the data.
4. Visualization: Data visualization means representing data in a visual context so that
people can easily understand the significance of the data.
• Data visualization makes large amounts of data easy to take in through visuals.
5. Advanced computing: Advanced computing does the heavy lifting of data science.
• Advanced computing involves designing, writing, debugging, and maintaining the source
code of computer programs.

6. Mathematics: Mathematics is a critical part of data science. Mathematics involves the
study of quantity, structure, space, and change. For a data scientist, a good knowledge of
mathematics is essential.

7. Machine learning: Machine learning is the backbone of data science. Machine learning is all
about training a machine so that it can act like a human brain. In data science, we
use various machine learning algorithms to solve problems.

Q Difference between Data Science and Data Analytics

What is Data Science
Data science is a field that deals with extracting meaningful information and insights by
applying various algorithms, preprocessing, and scientific methods to structured and
unstructured data. This field is related to artificial intelligence and is currently one of the most
in-demand skills. Data science combines mathematics, computation, statistics, programming,
etc. to gain meaningful insights from large amounts of data provided in various formats.
What is Data Analytics
Data analytics is used to draw conclusions by processing raw data. It is helpful in various
businesses, as it helps a company make decisions based on the conclusions drawn from the data.
Basically, data analytics helps convert a large number of figures in the form of data into
plain English, i.e., conclusions that are further helpful in making in-depth decisions.

Below is a table of differences between Data Science and Data Analytics:

S.No | Feature | Data Science | Data Analytics
1 | Coding Language | Python is the most commonly used language for data science, along with other languages such as C++, Java, Perl, etc. | Knowledge of Python and the R language is essential for data analytics.
2 | Programming Skills | In-depth knowledge of programming is required for data science. | Basic programming skills are necessary for data analytics.
3 | Use of Machine Learning | Data science makes use of machine learning algorithms to get insights. | Data analytics does not use machine learning to get insights from data.
4 | Other Skills | Data science makes use of data mining activities for getting meaningful insights. | Hadoop-based analysis is used for getting conclusions from raw data.
5 | Scope | The scope of data science is large. | The scope of data analytics is micro, i.e., small.
6 | Goals | Data science deals with exploration and new innovations. | Data analytics makes use of existing resources.
7 | Data Type | Data science mostly deals with unstructured data. | Data analytics deals with structured data.
8 | Statistical Skills | Statistical skills are necessary in the field of data science. | Statistical skills play a smaller role in data analytics.
9 | Purpose | Data scientists produce both broad insights by exploring the data and actionable insights that answer specific questions. | Data analytics is more focused on producing insights that answer specific questions and can be put into action.
10 | Scope and Skills | Data science is a multidisciplinary field including data engineering, computer science, statistics, machine learning, and predictive analytics, in addition to presentation of findings. | Data analytics is a broad field that includes data integration, data analysis, and data presentation.
11 | Approach | Data scientists prepare, manage, and explore large datasets and then develop custom analytical models and algorithms to produce the required business insights. They also communicate and collaborate with stakeholders to define project goals and share findings. | Data analysts prepare, manage, and analyze well-defined datasets to identify trends and create visual presentations that help organizations make better, data-driven decisions.
12 | Business Impact | Helps businesses forecast, optimize, and innovate using data. | Helps businesses understand and utilize data.
13 | Key Skills | Advanced statistical analysis; data visualization. | Statistical analysis; data visualization.
14 | Primary Objective | Analyze and model data to predict and optimize outcomes. | Analyze data to find actionable insights.
15 | Tools Used | SQL; Python/R (used for advanced analytics); advanced analytics tools. | SQL; Excel; basic analytics tools (e.g., R).

Q Data Science Process
Ans

• Data Extraction
• Data Preparation
• Exploratory Data Analysis (EDA)
• Predictive analytics
• Model Building
• Model deployment

Data Extraction – Data extraction is the process of collecting or retrieving different types of
data from a variety of sources, many of which may be badly organized or completely
unstructured. Data extraction makes it possible to process, consolidate, and refine data so
that it can be stored in a centralized location, where it can then be transformed.
These locations may be cloud-based, on-site, or a hybrid of the two.
Data extraction is the first step in both ELT (extract, load, transform) and ETL (extract,
transform, load) tasks. ETL/ELT are themselves part of a complete data integration strategy.

Data Preparation – Once the data is extracted, it enters the data preparation stage.
• Data preparation, often referred to as “pre-processing”, is the stage at which raw data is
cleaned and organized for the following stage of data processing. During preparation, raw
data is rigorously checked for errors.
• The purpose of this step is to eliminate poor data (redundant, incomplete, or incorrect data)
and begin to create high-quality data for the best business intelligence.
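
A minimal sketch of the cleaning described above, removing redundant, incomplete, and incorrect records, might look like the following. The `raw_records` data, the field names, and the validity rule (ages between 0 and 120) are assumptions chosen purely for illustration; real preparation rules depend on the dataset.

```python
# Hypothetical raw records; field names and validity rules are illustrative only.
raw_records = [
    {"id": 1, "age": 34, "country": "IN"},
    {"id": 2, "age": None, "country": "US"},  # incomplete: missing age
    {"id": 1, "age": 34, "country": "IN"},    # redundant: exact duplicate
    {"id": 3, "age": -5, "country": "UK"},    # incorrect: impossible age
    {"id": 4, "age": 51, "country": "US"},
]


def is_complete(record):
    """A record is complete when no field is missing (None)."""
    return all(value is not None for value in record.values())


def is_valid(record):
    """Assumed rule for this sketch: ages must fall in a plausible range."""
    return 0 <= record["age"] <= 120


def prepare(records):
    """Drop redundant, incomplete, and incorrect records, preserving order."""
    seen = set()
    cleaned = []
    for record in records:
        key = tuple(sorted(record.items()))
        if key in seen:
            continue  # redundant
        seen.add(key)
        if not is_complete(record):
            continue  # incomplete
        if not is_valid(record):
            continue  # incorrect
        cleaned.append(record)
    return cleaned


clean_records = prepare(raw_records)
print([record["id"] for record in clean_records])
```

Dropping bad rows is the simplest policy; depending on the use case, incomplete values might instead be imputed or sent back to the source for correction.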

Exploratory Data Analysis (EDA) – It refers to the critical process of performing initial
investigations on data so as to discover meaningful patterns, detect anomalies, test
hypotheses, and check assumptions with the support of graphical representations and
summary statistics. It is good practice to understand the data first and try to gather as
many meaningful insights from it as possible.
• EDA is all about making sense of the data in hand before getting your hands dirty with it.
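
The summary-statistics and anomaly-detection parts of EDA can be sketched with the standard library alone (graphical exploration is left out here). The `daily_orders` values are made up, and the 1.5 × IQR rule is one common convention for flagging outliers, not something the notes specify.

```python
import statistics

# Hypothetical daily order counts; the value 95 is planted as an anomaly.
daily_orders = [12, 15, 14, 13, 16, 15, 14, 95, 13, 15]


def summarize(values):
    """Summary statistics commonly inspected at the start of EDA."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {
        "count": len(values),
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
        "mean": statistics.mean(values),
    }


def iqr_outliers(values, k=1.5):
    """Flag values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


summary = summarize(daily_orders)
outliers = iqr_outliers(daily_orders)
print(summary)
print("possible anomalies:", outliers)
```

Note how the single extreme value pulls the mean well above the median, which is exactly the kind of signal EDA is meant to surface before modeling.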

Predictive analytics – It looks at historical and current data patterns to determine whether
those patterns are likely to appear again. This allows investors and businesses to adjust
where they use their resources to take advantage of possible future events. Predictive
analytics can also be used to reduce risk and improve operational efficiency. It forms
predictions about certain unknowns in the future.
• It draws on a series of techniques to make these determinations, including artificial
intelligence (AI), data mining, machine learning, modeling, and statistics.

Model Building – In this step, the model building process actually starts. Here, data
scientists split the dataset into training and testing sets. Techniques like regression,
classification, and clustering are applied to the training dataset. Once the model is
prepared, it is tested against the “testing” dataset.
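
The train/test workflow can be sketched as follows. A nearest-centroid classifier is used here only as a small stand-in for the classification techniques mentioned above; the `dataset`, the 25% test fraction, the fixed seed, and all function names are assumptions of this example.

```python
import random

# Hypothetical labeled data: (feature_1, feature_2) -> class label.
# Two well-separated groups, so this toy problem is easy by construction.
dataset = (
    [((x * 0.1, 1.0 + x * 0.05), "low") for x in range(10)]
    + [((5.0 + x * 0.1, 6.0 + x * 0.05), "high") for x in range(10)]
)


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and split it into training and testing sets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def fit_centroids(rows):
    """'Train' a nearest-centroid classifier: one mean point per class."""
    sums, counts = {}, {}
    for features, label in rows:
        total = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            total[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in total] for label, total in sums.items()}


def predict(centroids, features):
    """Assign the label of the closest centroid (squared Euclidean distance)."""
    def distance(label):
        return sum((f - c) ** 2 for f, c in zip(features, centroids[label]))
    return min(centroids, key=distance)


train_rows, test_rows = train_test_split(dataset)
model = fit_centroids(train_rows)  # fit on the training set only
correct = sum(predict(model, features) == label for features, label in test_rows)
accuracy = correct / len(test_rows)  # evaluate on the held-out testing set
print(f"train={len(train_rows)} test={len(test_rows)} accuracy={accuracy:.2f}")
```

Keeping the testing rows out of `fit_centroids` is the point of the split: the accuracy then estimates how the model behaves on data it has not seen.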

Model deployment: In model deployment, the model is deployed in the desired channel and format.
• After careful evaluation and modifications, the model becomes ready to provide results in
real time.

Q Explain the motivation for data science

Ans
The key motivations for using data science techniques are explored here:

• Volume
• Dimensions
• Complex questions

Volume

As data become more granular, the need to use large volumes of data to extract information
increases.

• A rapid increase in the volume of data exposes the limitations of current analysis
methodologies.

Dimensions

• The three characteristics of the Big Data phenomenon are high volume, high
velocity, and high variety.
• The variety of data relates to the multiple types of values (numerical, categorical),
formats of data (audio files, video files), and applications of the data (location
coordinates, graph data).
• Every single record or data point contains multiple attributes or variables that provide
context for the record.
• For example, every user record of an ecommerce site can contain attributes such as
products viewed, products purchased, user demographics, frequency of purchase,
clickstream, etc.
• Determining the most effective offer for an ecommerce user can involve computing
information across these attributes.
• Each attribute can be thought of as a dimension in the data space.
• The user record has multiple attributes and can be visualized in multidimensional
space. The addition of each dimension increases the complexity of analysis techniques.
• A simple linear regression model that has one input dimension is relatively easy to
build compared to a multiple linear regression model with multiple dimensions.

Complex questions

As more complex data become available for analysis, the complexity of the information that
needs to be extracted from the data increases as well.
• If the natural clusters in a dataset with hundreds of dimensions need to be found,
traditional analysis such as hypothesis testing cannot be used in a scalable fashion.
• Machine-learning algorithms need to be leveraged to automate searching in the
vast search space.
• Traditional statistical analysis approaches the data analysis problem by assuming a
stochastic model in order to predict a response variable based on a set of input
variables.
• Linear regression is a classic example of this technique, where the parameters of the
model are estimated from the data.
• These hypothesis-driven techniques were highly successful in modeling
simple relationships between response and input variables.
• However, there is a significant need to extract nuggets of information from
large, complex datasets, where the use of traditional statistical data analysis
techniques is limited.
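
To make the linear regression example above concrete, here is a minimal sketch of estimating the two parameters of a simple linear regression by ordinary least squares. The data are synthetic (generated from y = 2 + 3x without noise), and the function name `fit_simple_linear_regression` is hypothetical.

```python
def fit_simple_linear_regression(xs, ys):
    """Estimate intercept and slope of y = intercept + slope * x by least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical observations generated from y = 2 + 3x with no noise,
# so the estimates should recover those parameters.
inputs = [0.0, 1.0, 2.0, 3.0, 4.0]
responses = [2.0, 5.0, 8.0, 11.0, 14.0]

intercept, slope = fit_simple_linear_regression(inputs, responses)
predicted = intercept + slope * 5.0  # predict the response for a new input
print(intercept, slope, predicted)
```

With a single input variable the estimates have this closed form; with many input dimensions the same least-squares idea requires solving a system of equations, which is one way the added dimensions increase complexity.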

Q Applications of Data Science

Ans

Image recognition and speech recognition: Data science is currently used for image and
speech recognition.
• When you upload an image on Facebook and start getting suggestions to tag your
friends, this automatic tagging suggestion uses an image recognition algorithm, which
is part of data science. When you speak to a voice assistant such as "Ok Google", Siri,
or Cortana and the device responds to your voice command, this is made possible by
a speech recognition algorithm.

Gaming world: In the gaming world, the use of machine learning algorithms is increasing
day by day. EA Sports, Sony, and Nintendo are widely using data science to enhance the
user experience.

Internet search: When we want to search for something on the internet, we use search
engines such as Google, Yahoo, Bing, Ask, etc. All these search engines use data science
technology to make the search experience better, so that you can get a search result
in a fraction of a second.

Transport: The transport industry is also using data science technology to create
self-driving cars.
• With self-driving cars, it is expected to become easier to reduce the number of road
accidents.

Healthcare: In the healthcare sector, data science provides many benefits. Data science
is being used for tumor detection, drug discovery, medical image analysis, virtual
medical bots, etc.

Recommendation systems: Many companies, such as Amazon, Netflix, Google Play, etc.,
use data science technology to create a better user experience with personalized
recommendations. For example, when you search for something on Amazon and start
getting suggestions for similar products, this is because of data science technology.

Risk detection: The finance industry has always had an issue with fraud and the risk of
losses, but with the help of data science, this can be mitigated. Most finance companies
are looking for data scientists to help avoid risk and losses while increasing
customer satisfaction.
