Applied Data Science MODULE 1 SEM8
Introduction to Data Science
Q Data Science Techniques.
Ans
The techniques used in the steps of a data science process, and commonly associated with the term "data science", are:
➢ Descriptive statistics
➢ Exploratory visualization
➢ Dimensional slicing
➢ Hypothesis testing
➢ Data engineering
➢ Business intelligence
Descriptive statistics: Computing the mean, standard deviation, correlation, and other descriptive statistics quantifies the aggregate structure of a dataset. This is essential information for understanding any dataset: the structure of the data and the relationships within it. Descriptive statistics are used in the exploration stage of the data science process.
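For illustration, here is a minimal sketch that computes such statistics with pandas on a small made-up dataset; the column names and values are assumptions:

```python
import pandas as pd

# Hypothetical dataset: monthly advertising spend and revenue (illustrative values only)
df = pd.DataFrame({
    "ad_spend": [12.0, 15.5, 9.8, 20.1, 18.3, 14.7],
    "revenue":  [48.2, 55.1, 40.9, 70.4, 66.0, 52.8],
})

print(df.describe())                       # mean, standard deviation, min/max, quartiles per column
print(df["ad_spend"].corr(df["revenue"]))  # Pearson correlation between spend and revenue
```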
Exploratory visualization: The process of expressing data in visual coordinates enables users to find patterns and relationships in the data and to comprehend large datasets. Similar to descriptive statistics, these techniques are integral to the pre- and post-processing steps in data science.
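A small sketch of exploratory visualization with matplotlib, reusing the hypothetical spend and revenue values from the previous example:

```python
import matplotlib.pyplot as plt

# Hypothetical values (same as the descriptive-statistics sketch above)
ad_spend = [12.0, 15.5, 9.8, 20.1, 18.3, 14.7]
revenue = [48.2, 55.1, 40.9, 70.4, 66.0, 52.8]

plt.scatter(ad_spend, revenue)   # each point is one month; a pattern becomes visible at a glance
plt.xlabel("Advertising spend")
plt.ylabel("Revenue")
plt.title("Exploratory visualization")
plt.show()
```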
Dimensional slicing: Online analytical processing (OLAP) applications, which are prevalent in organizations, mainly provide information on the data through dimensional slicing, filtering, and pivoting. OLAP analysis is enabled by a unique database schema design where the data are organized as dimensions (e.g., products, regions, dates) and quantitative facts or measures (e.g., revenue, quantity). With a well-defined database structure, it is easy to slice the yearly revenue by product or by a combination of region and product. These techniques are extremely useful and may unveil patterns in the data (e.g., trends in candy sales).
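A minimal sketch of dimensional slicing with a pandas pivot table (the fact table below is invented; a real OLAP system would query a star-schema database rather than an in-memory frame):

```python
import pandas as pd

# Hypothetical fact table: one row per sale, with dimensions (year, region, product) and a measure (revenue)
sales = pd.DataFrame({
    "year":    [2022, 2022, 2022, 2023, 2023, 2023],
    "region":  ["East", "West", "East", "West", "East", "West"],
    "product": ["Candy", "Candy", "Soda", "Candy", "Soda", "Soda"],
    "revenue": [100, 150, 80, 120, 90, 110],
})

# Slice yearly revenue by product (one dimension)...
print(sales.pivot_table(values="revenue", index="year", columns="product", aggfunc="sum"))

# ...or by a combination of region and product (two dimensions)
print(sales.pivot_table(values="revenue", index=["region", "product"], columns="year", aggfunc="sum"))
```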
Hypothesis testing: In confirmatory data analysis, experimental data are collected to evaluate whether a hypothesis has enough evidence to be supported or not. There are many types of statistical tests, and they have a wide variety of business applications (e.g., A/B testing in marketing). In general, data science is a process in which many hypotheses are generated and tested based on observational data. Since data science algorithms are iterative, solutions can be refined in each step.
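As an illustration of the A/B-testing case mentioned above, here is a minimal sketch using scipy; the conversion figures are made up:

```python
from scipy import stats

# Hypothetical daily conversion rates (%) for two versions of a web page
version_a = [2.1, 2.4, 1.9, 2.6, 2.3, 2.0, 2.5]
version_b = [2.8, 3.1, 2.7, 3.0, 2.9, 3.3, 2.6]

# Two-sample t-test: is the difference in mean conversion rate statistically significant?
t_stat, p_value = stats.ttest_ind(version_a, version_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# A small p-value (e.g., < 0.05) means the data provide evidence against the
# null hypothesis that both versions convert equally well.
```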
Data engineering: Data engineering is the process of sourcing, organizing, assembling, storing, and distributing data for effective analysis and usage. Database engineering, distributed storage and computing frameworks (e.g., Apache Hadoop, Spark, Kafka), parallel computing, extraction, transformation, and loading (ETL) processing, and data warehousing constitute data engineering techniques. Data engineering helps source and prepare data for data science learning algorithms.
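A minimal ETL sketch in plain Python (the file name, columns, and cleaning rule are hypothetical; production pipelines typically use frameworks such as Spark or a dedicated warehouse loader):

```python
import sqlite3
import pandas as pd

# Extract: read raw records from a hypothetical CSV export
raw = pd.read_csv("orders_raw.csv")

# Transform: drop incomplete rows and derive a total-price column
clean = raw.dropna(subset=["quantity", "unit_price"])
clean = clean.assign(total_price=clean["quantity"] * clean["unit_price"])

# Load: store the prepared table where analysts and learning algorithms can use it
with sqlite3.connect("warehouse.db") as conn:
    clean.to_sql("orders", conn, if_exists="replace", index=False)
```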
Business intelligence: Business intelligence (BI) helps users query data ad hoc without the need to write technical query commands, and uses dashboards or visualizations to communicate facts and trends. Business intelligence specializes in the secure delivery of information to the right roles and the distribution of information at scale. Historical trends are usually reported, but in combination with data science, both past data and predicted future data can be combined. BI can also hold and distribute the results of data science.
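Behind a BI dashboard tile, an ad hoc report usually resolves to a simple aggregate query; the sketch below assumes the hypothetical warehouse.db / orders table from the ETL example above:

```python
import sqlite3
import pandas as pd

with sqlite3.connect("warehouse.db") as conn:
    # Headline figures of the kind a dashboard widget would display
    report = pd.read_sql(
        "SELECT COUNT(*) AS num_orders, SUM(total_price) AS total_revenue FROM orders",
        conn,
    )
print(report)
```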
Q Components of Data Science
Ans
1. Statistics: Statistics is one of the most important components of data science. Statistics is a way to collect and analyze numerical data in large amounts and find meaningful insights from it.
2. Domain Expertise: Domain expertise binds the other components of data science together. Domain expertise means specialized knowledge or skills in a particular area. In data science, there are various areas for which domain experts are needed.
3. Data engineering: Data engineering is the part of data science that involves acquiring, storing, retrieving, and transforming data. Data engineering also involves adding metadata (data about data) to the data.
4. Visualization: Data visualization means representing data in a visual context so that people can easily understand its significance. Data visualization makes it easy to grasp huge amounts of data through visuals.
5. Advanced computing: Advanced computing does the heavy lifting of data science. It involves designing, writing, debugging, and maintaining the source code of computer programs.
6. Mathematics: Mathematics is a critical part of data science. Mathematics involves the study of quantity, structure, space, and change. For a data scientist, a good knowledge of mathematics is essential.
7. Machine learning: Machine learning is the backbone of data science. It is all about training a machine so that it can act like a human brain. In data science, various machine learning algorithms are used to solve problems, as in the sketch below.
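A tiny illustration of this component (a minimal scikit-learn sketch; the dataset and the choice of classifier are assumptions made purely for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# A small benchmark dataset, split into training and testing portions
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a simple classifier and check how well it generalizes to unseen data
model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```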
Q Difference between Data Science and Data Analytics
Ans
1. Coding
   Data Science: Python is the most commonly used …
   Data Analytics: The knowledge of Python and …
…
   Data Science: … field including data engineering, computer science, statistics, machine learning, and predictive analytics in addition to presentation of findings.
   Data Analytics: … which includes data integration, data analysis, and data presentation.
11. Approach
   Data Science: Data scientists prepare, manage, and explore large data sets and then develop custom analytical models and algorithms to produce the required business insights. They also communicate and collaborate with stakeholders to define project goals and share findings.
   Data Analytics: Data analysts prepare, manage, and analyze well-defined datasets to identify trends and create visual presentations that help organizations make better, data-driven decisions.
12. Business Impact
   Data Science: Helps businesses forecast, optimize, and innovate using data.
   Data Analytics: Helps businesses understand and utilize data.
13. Key Skills
   Data Science: Advanced statistical analysis, data visualization.
   Data Analytics: Statistical analysis, data visualization.
14. Primary Objective
   Data Science: Analyze and model data to predict and optimize outcomes.
   Data Analytics: Analyze data to find actionable insights.
15. Tools Used
   Data Science: SQL, Python/R (used for advanced analytics), advanced analytics tools.
   Data Analytics: SQL, Excel, basic analytics tools (e.g., R).
Q Data Science Process:
Ans
Data Preparation – Once the data is extracted, it then enters the data preparation stage. Data preparation, often referred to as "pre-processing", is the stage at which raw data is cleaned and organized for the following stage of data processing. During preparation, raw data is rigorously checked for the presence of any errors. The purpose of this step is to eliminate poor data (redundant, incomplete, or incorrect data) and begin to create the high-quality data needed for the best business intelligence.
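A minimal sketch of this cleaning step with pandas (the file name, columns, and validity rule are hypothetical):

```python
import pandas as pd

# Hypothetical raw extract
raw = pd.read_csv("customers_raw.csv")

# Eliminate poor data: redundant, incomplete, and incorrect records
prepared = (
    raw.drop_duplicates()                        # redundant rows
       .dropna(subset=["customer_id", "email"])  # incomplete rows
       .query("age > 0 and age < 120")           # obviously incorrect values
)

print(f"kept {len(prepared)} of {len(raw)} rows after preparation")
```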
Model Building – In this step, the model building process actually starts. Here, data scientists split the dataset into training and testing portions. Techniques like regression, classification, and clustering are applied to the training dataset. Once the model has been prepared, it is tested against the "testing" dataset.
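A minimal sketch of the train/test workflow described above (scikit-learn with a synthetic dataset; the specific technique, linear regression, is chosen only for illustration):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data standing in for a prepared dataset
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Distribute the dataset into training and testing portions
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Apply a technique (here, regression) to the training set, then test on the held-out data
model = LinearRegression().fit(X_train, y_train)
print("R^2 on the testing dataset:", model.score(X_test, y_test))
```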
Q Motivation for using Data Science techniques
Ans
Each key motivation for using data science techniques is explored here:
➢ Volume
➢ Dimensions
➢ Complex questions
Volume: As data become more granular, the need to use large volumes of data to extract information increases.
Dimensions: The three characteristics of the Big Data phenomenon are high volume, high velocity, and high variety. The variety of data relates to the multiple types of values (numerical, categorical), formats of data (audio files, video files), and applications of the data (location coordinates, graph data). Every single record or data point contains multiple attributes or variables that provide context for the record. For example, every user record of an e-commerce site can contain attributes such as products viewed, products purchased, user demographics, frequency of purchase, clickstream, etc. Determining the most effective offer for an e-commerce user can involve computing information across these attributes. Each attribute can be thought of as a dimension in the data space.
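To make "attributes as dimensions" concrete, here is a minimal sketch; the user record and the scoring rule are invented purely for illustration:

```python
# One hypothetical user record from an e-commerce site: each attribute is a dimension
user = {
    "products_viewed": 42,
    "products_purchased": 5,
    "age": 31,
    "purchase_frequency_per_month": 1.5,
    "clickstream_length": 310,
}

# Choosing an offer can mean combining information across several dimensions,
# for example with a simple (made-up) engagement score:
engagement = (
    0.1 * user["products_viewed"]
    + 2.0 * user["products_purchased"]
    + 5.0 * user["purchase_frequency_per_month"]
)
print("engagement score:", engagement)
```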
Complex questions: As more complex data become available for analysis, the complexity of the information that needs to be extracted from the data is increasing as well. If the natural clusters in a dataset with hundreds of dimensions need to be found, then traditional analysis like hypothesis testing cannot be used in a scalable fashion; what is needed instead is a technique where the parameters of the model are estimated from the data. Hypothesis-driven techniques were highly successful in modeling simple relationships between response and input variables, but answering these more complex questions calls for such data-driven techniques.
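A minimal sketch of finding natural clusters in a data-driven way (scikit-learn k-means on synthetic high-dimensional data; the algorithm and parameters are illustrative assumptions):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 50 dimensions and a hidden cluster structure
X, _ = make_blobs(n_samples=500, n_features=50, centers=4, random_state=0)

# The cluster centers (the model parameters) are estimated directly from the data,
# without stating an explicit hypothesis about where the clusters lie
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
print("cluster sizes:", [int((kmeans.labels_ == k).sum()) for k in range(4)])
```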
Q Applications of Data Science
Ans
Gaming world: In the gaming world, the use of machine learning algorithms is increasing day by day. EA Sports, Sony, and Nintendo are widely using data science to enhance the user experience.
Product recommendation: Suppose you purchased something on Amazon and you started getting suggestions for similar products; this is because of data science technology.
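A minimal sketch of the idea behind such suggestions (item-to-item similarity on a tiny invented purchase matrix; real recommender systems are far more sophisticated):

```python
import numpy as np

# Rows = users, columns = products; 1 means the user bought the product (made-up data)
purchases = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 1, 0, 1],
])

# Cosine similarity between product columns: products bought by the same users score high
norms = np.linalg.norm(purchases, axis=0)
similarity = (purchases.T @ purchases) / np.outer(norms, norms)

# Products most similar to product 0 (excluding itself) are natural suggestions
ranked = np.argsort(similarity[0])[::-1]
print("suggest after buying product 0:", [int(p) for p in ranked if p != 0])
```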