DATA SCIENCE SELECTION QUESTIONS WITH ANSWERS 2022
1. What is Data Science?
Data Science is a combination of algorithms, tools, and machine learning
techniques that helps you find hidden patterns in raw data.
2. What is logistic regression in Data Science?
Logistic regression is also called the logit model. It is a method for
predicting a binary outcome from a linear combination of predictor
variables.
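As a minimal sketch of the idea above, the Python snippet below turns a
linear combination of predictors into a probability with the logistic
(sigmoid) function and then thresholds it to get a binary outcome. The
weights, bias, and 0.5 threshold are made-up illustrative values, not
coefficients fitted to any data.

```python
import math

def predict_probability(features, weights, bias):
    """Return P(outcome = 1) for one observation under a logit model."""
    # Linear combination of the predictor variables.
    z = bias + sum(w * x for w, x in zip(weights, features))
    # The logistic (sigmoid) function maps z to a value between 0 and 1.
    return 1.0 / (1.0 + math.exp(-z))

def predict_class(features, weights, bias, threshold=0.5):
    """Turn the probability into a binary outcome (0 or 1)."""
    return 1 if predict_probability(features, weights, bias) >= threshold else 0

# Hypothetical coefficients, chosen only for illustration.
weights = [0.8, -0.4]
bias = -0.2

print(predict_probability([1.0, 2.0], weights, bias))
print(predict_class([3.0, 0.5], weights, bias))
```

In practice the weights and bias would be estimated from training data
rather than written by hand.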
3. What is a Linear Regression?
Linear regression is a statistical method in which the score of a
variable ‘A’ is predicted from the score of a second variable ‘B’. B is
referred to as the predictor variable and A as the criterion variable.
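The relationship between A and B can be sketched with ordinary least
squares for a single predictor. This is a minimal illustration: the
function name fit_line and the data are assumptions made up for the
example.

```python
def fit_line(b_scores, a_scores):
    """Ordinary least squares for A = intercept + slope * B."""
    n = len(b_scores)
    mean_b = sum(b_scores) / n
    mean_a = sum(a_scores) / n
    # Slope = covariance(B, A) / variance(B).
    cov = sum((b - mean_b) * (a - mean_a) for b, a in zip(b_scores, a_scores))
    var = sum((b - mean_b) ** 2 for b in b_scores)
    slope = cov / var
    intercept = mean_a - slope * mean_b
    return intercept, slope

# Made-up data that roughly follows A = 1 + 2 * B.
b_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
a_scores = [3.1, 4.9, 7.2, 8.8, 11.0]

intercept, slope = fit_line(b_scores, a_scores)
predicted_a = intercept + slope * 6.0  # predict A for a new B score
print(intercept, slope, predicted_a)
```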
4. Explain the steps for a Data analytics project
The following are important steps involved in an analytics project:
• Understand the Business problem
• Explore the data and study it carefully.
• Prepare the data for modeling by finding missing values and
transforming variables.
• Run the model and analyze the results.
• Validate the model with a new data set.
• Implement the model and track the result to analyze the
performance of the model for a specific period.
5. What is a Random Forest?
Random forest is a machine learning method that builds an ensemble of
decision trees and can be used for both regression and classification
tasks. It is also used for treating missing values and outlier values.
6. Explain the difference between Data Science and Data Analytics
Data scientists slice data to extract valuable insights that a data
analyst can apply to real-world business scenarios. The main difference
between the two is that data scientists have more technical
knowledge than business analysts. Moreover, they don’t need the
business understanding required for data visualization.
7. Explain p-value?
When you conduct a hypothesis test in statistics, a p-value allows you to
determine the strength of your results. It is a number between 0 and 1:
the probability of observing a result at least as extreme as yours if the
null hypothesis were true. The smaller the p-value, the stronger the
evidence against the null hypothesis.
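As a worked illustration, the sketch below computes an exact one-sided
p-value for a coin-flip experiment, where the null hypothesis is that the
coin is fair. The experiment (16 heads in 20 flips) and the 0.05
significance level are assumptions chosen for the example.

```python
from math import comb

def one_sided_p_value(heads, flips, p_null=0.5):
    """P(at least `heads` heads in `flips` tosses) if the null hypothesis holds."""
    return sum(
        comb(flips, k) * p_null**k * (1 - p_null) ** (flips - k)
        for k in range(heads, flips + 1)
    )

# Hypothetical experiment: 16 heads in 20 flips.
p_value = one_sided_p_value(16, 20)
alpha = 0.05  # conventional significance level, an assumption here
print(p_value, p_value < alpha)
```

A p-value below the chosen level suggests the result would be unlikely
under a fair coin.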
8. When do you need to update the algorithm in Data science?
You need to update an algorithm in the following situations:
• You want your data model to evolve as data streams through the
infrastructure
• The underlying data source is changing
• The data is non-stationary
9. Explain why Data Cleansing is essential and which method you
use to maintain clean data
Dirty data often leads to incorrect insights, which can damage the
prospects of any organization. For example, suppose you want to run a
targeted marketing campaign, but your data incorrectly tells you that a
specific product will be in demand with your target audience: the
campaign will fail.
10. Name commonly used algorithms.
The four algorithms most commonly used by data scientists are:
• Linear regression
• Logistic regression
• Random Forest
• KNN
11. Explain cluster sampling technique in Data science
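Cluster sampling is used when the target population is spread across a
wide area and is difficult to study, so simple random sampling can’t be
applied. The population is divided into groups (clusters), and a random
sample of whole clusters is studied instead.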
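A minimal sketch of the technique: group the population into clusters,
pick a few clusters at random, and keep every member of each chosen
cluster. The city names, members, and the cluster_sample helper are
hypothetical.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Pick whole clusters at random and return every member of each chosen cluster."""
    rng = random.Random(seed)
    # Sort the cluster names so a given seed always sees the same ordering.
    chosen = rng.sample(sorted(clusters), n_clusters)
    sample = [member for name in chosen for member in clusters[name]]
    return chosen, sample

# Hypothetical population: customers grouped by city.
clusters = {
    "city_a": ["a1", "a2", "a3"],
    "city_b": ["b1", "b2"],
    "city_c": ["c1", "c2", "c3", "c4"],
    "city_d": ["d1"],
}

chosen, sample = cluster_sample(clusters, n_clusters=2, seed=42)
print(chosen, sample)
```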
12. What is statistical analysis in data science?
Statistical analysis is a scientific tool that helps collect and analyze large
amounts of data to identify common patterns and trends and convert them
into meaningful information. In simple words, statistical analysis is a data
analysis tool that helps draw meaningful conclusions from raw and
unstructured data.
13. What is R Markdown? What is it used for?
R Markdown is a reporting tool provided by R. With the help of
R Markdown, you can create high-quality reports from your R code.
The output format of R Markdown can be:
• HTML
• PDF
• Word
14. Explain what R is.
R is a programming language and software environment for data analysis
that is used by analysts, quants, statisticians, data scientists and
others.
15. List some of the functions that R provides.
The functions that R provides include:
• Mean
• Median
• Distribution
• Covariance
• Regression
• Non-linear...etc.
16. How can you save your data in R?
There are many ways to save data in R. One of the easiest, in the
R Commander menu interface, is to go to Data > Active Data Set > Export
Active Data Set. A dialog box will appear; after you click OK, it lets
you save your data in the usual way.
18. What are the data structures in R that are used to perform
statistical analyses and create graphs?
R has data structures like
• Vectors
• Matrices
• Arrays
• Data frames
19. What are the advantages of R?
The advantages are:
• It can be used for managing and manipulating data.
• No license restrictions
• Free and open source software.
• Graphical capabilities of R are good.
• Runs on many operating systems and different hardware, including
32-bit and 64-bit processors.
20. What is git in data science?
Git is a version control system designed to track changes in source code
over time.
When many people work on the same project without a version control
system, it's total chaos.
22. What is the difference between Git & GitHub?
Git is the underlying technology and its command-line client (CLI) for
tracking and merging changes in source code.
GitHub is a web platform built on top of Git to make working with it
easier. It also offers additional features like user management, pull
requests, and automation.
23. What is RStudio in data science?
RStudio is a powerful and easy way to work with the R programming
language. It is an Integrated Development Environment (IDE) that provides
a one-stop solution for statistical computing and graphics.
24. What is scoping, and what are scoping rules?
The scope of a variable is the part of the code where it can be
referenced and is visible. The scoping rules of a language determine how
a value is associated with a free variable in a function, that is, a
variable that is neither an argument nor defined inside the function.
There are two basic approaches: lexical scoping and dynamic scoping.
Lexical scoping (sometimes known as static scoping), which R uses, looks
up the value of a free variable in the environment where the function
was defined.
With dynamic scoping, the value of a free variable is looked up in the
environment from which the function was called (sometimes referred to as
the calling environment).
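The difference can be sketched in Python, used here as a stand-in
because, like R, it resolves free variables lexically. The function names
below are made up for the illustration. The free variable y inside add_y
takes its value from the environment where add_y was defined, not from
the caller.

```python
def make_adder():
    y = 10  # y lives in the environment where add_y is defined

    def add_y(x):
        # y is a free variable here: it is neither an argument nor a local.
        return x + y

    return add_y

def caller():
    y = 1000  # a different y in the calling environment
    add_y = make_adder()
    return add_y(1)

result = caller()
print(result)  # 11 under lexical scoping; dynamic scoping would give 1001
```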
25. What is simulation in R programming?
In a simulation, you set the ground rules of a random process and then
the computer uses random numbers to generate an outcome that adheres
to those rules.
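A minimal sketch in Python (rather than R) of what this looks like: the
ground rule is "roll two fair dice", and random numbers are used to
estimate how often the sum is 7. The function name and seed are
illustrative assumptions.

```python
import random

def simulate_two_dice(n_rolls, seed=None):
    """Estimate P(sum of two dice == 7) by simulation."""
    rng = random.Random(seed)  # seeded generator makes the run repeatable
    hits = 0
    for _ in range(n_rolls):
        total = rng.randint(1, 6) + rng.randint(1, 6)
        if total == 7:
            hits += 1
    return hits / n_rolls

estimate = simulate_two_dice(100_000, seed=1)
print(estimate)  # should land close to the exact value 1/6
```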
26. What is code profiling?
Code profiling lets you identify bottlenecks and pieces of code that need
to be implemented more efficiently.
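As a rough illustration in Python, the standard-library cProfile and
pstats modules can show where time is spent. The workload below is a
deliberately wasteful, made-up example.

```python
import cProfile
import io
import pstats

def slow_sum_of_squares(n):
    """Deliberately wasteful: builds a full list before summing."""
    squares = [i * i for i in range(n)]
    return sum(squares)

def workload():
    return [slow_sum_of_squares(50_000) for _ in range(20)]

profiler = cProfile.Profile()
profiler.enable()
workload()
profiler.disable()

# Sort by cumulative time so the most expensive call paths come first.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(5)
print(report.getvalue())
```

The functions at the top of the report are the first candidates for
optimization.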
27. What is data cleaning in data science?
Data cleaning is the process of identifying the incorrect, incomplete,
inaccurate, irrelevant or missing parts of the data and then modifying,
replacing or deleting them as necessary.
Data cleaning is considered a foundational element of data science.
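The modify, replace, or delete steps can be sketched on a small made-up
data set. The field names, the plausible age range of 0 to 120, and the
clean_records helper are assumptions for the illustration.

```python
def clean_records(records):
    """Drop incomplete rows, fix obvious formatting issues, and remove duplicates."""
    cleaned = []
    seen = set()
    for row in records:
        # Modify: normalize whitespace and capitalization in names.
        name = (row.get("name") or "").strip().title()
        age = row.get("age")
        # Delete: rows with a missing name or a missing/implausible age.
        if not name or age is None or not (0 <= age <= 120):
            continue
        key = (name, age)
        if key in seen:  # skip exact duplicates after normalization
            continue
        seen.add(key)
        cleaned.append({"name": name, "age": age})
    return cleaned

# Made-up raw data with typical problems.
raw = [
    {"name": " alice ", "age": 30},
    {"name": "Alice", "age": 30},    # duplicate once whitespace/case are fixed
    {"name": "Bob", "age": None},    # missing value
    {"name": "", "age": 41},         # missing name
    {"name": "carol", "age": 250},   # implausible age
    {"name": "Dave", "age": 52},
]

print(clean_records(raw))
```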
28. What is tidy data?
Tidy data is a specific way of organizing data into a consistent format
that plugs into the tidyverse set of packages for R.
There are many ways in which we can organize data. Some of these ways
can make for easy data analysis. Others lead to a lot of frustration. This is
where tidy data comes in.
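In the usual tidyverse sense, tidy means each variable has its own column
and each observation has its own row. The sketch below, in plain Python
rather than R, reshapes a made-up "wide" table (one column per year) into
that layout. The wide_to_long helper and the data are hypothetical.

```python
def wide_to_long(rows, id_column, value_name, variable_name):
    """Turn one column per measurement into one row per observation."""
    tidy = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue
            tidy.append(
                {id_column: row[id_column], variable_name: column, value_name: value}
            )
    return tidy

# Hypothetical "wide" table: one column per year.
wide = [
    {"country": "A", "2020": 10, "2021": 12},
    {"country": "B", "2020": 7, "2021": 9},
]

for record in wide_to_long(wide, "country", "cases", "year"):
    print(record)
```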
29. What is big data in data science?
Big data is data that contains greater variety, arrives in increasing
volumes, and moves with more velocity. These are known as the three Vs.
Volume: The amount of data matters. With big data, you’ll have to
process high volumes of low-density, unstructured data.
Velocity: Velocity is the fast rate at which data is received and
(perhaps) acted on.
Variety: Variety refers to the many types of data that are available.
30. What is EDA?
Exploratory Data Analysis (EDA) is an approach to analyzing data using
visual techniques.
It is used to discover trends and patterns, or to check assumptions, with
the help of statistical summaries and graphical representations.
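A minimal sketch of both halves, a statistical summary and a crude visual
check, using only the Python standard library. The sample and the
text-based histogram are made up for the illustration. A real analysis
would typically use a plotting library.

```python
import statistics
from collections import Counter

def summarize(values):
    """Basic statistical summary of a numeric sample."""
    return {
        "n": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
    }

def text_histogram(values, bin_width):
    """Crude visual check: one '#' per value falling in each bin."""
    bins = Counter((v // bin_width) * bin_width for v in values)
    return [
        f"{start:>3}-{start + bin_width - 1:<3} {'#' * bins[start]}"
        for start in sorted(bins)
    ]

# Made-up sample; 95 is a deliberate outlier.
ages = [22, 25, 27, 31, 33, 34, 35, 38, 41, 95]

print(summarize(ages))
print("\n".join(text_histogram(ages, 10)))
```

The gap between the mean and the median, and the isolated bar at the top
of the histogram, both point to the outlier.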