ML Interview
DATA SCIENTIST
INTERVIEW QNA PDF COLLECTION
Tools covered: SQL & NoSQL, Python/R, Microsoft Excel, and more
AKASH RAJ
Founder & CEO -CloudyML || Data Scientist
1. How does Stacking work?
The idea of stacking is to learn several different weak learners and combine
them by training a meta-model to output predictions based on the multiple
predictions returned by these weak models.
If a stacking ensemble is composed of L weak learners, then to fit the model the
following steps are followed:
Split the training data into two folds.
Choose L weak learners and fit them to the data of the first fold.
For each of the L weak learners, make predictions for observations in the second
fold.
Fit the meta-model on the second fold, using predictions made by the weak
learners as inputs.
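The four steps above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the two weak learners (`fit_mean`, `fit_line`), the blend-weight meta-model (`fit_meta`), and the synthetic data are all assumptions chosen to keep the example short.

```python
import random

def fit_mean(xs, ys):
    """Hypothetical weak learner 1: always predicts the training mean."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def fit_line(xs, ys):
    """Hypothetical weak learner 2: one-variable least-squares line."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return lambda x: my + slope * (x - mx)

def fit_meta(pred_rows, ys, steps=100):
    """Assumed meta-model: the blend weight w in [0, 1] with the lowest squared error."""
    def sse(w):
        return sum((w * p[0] + (1 - w) * p[1] - y) ** 2
                   for p, y in zip(pred_rows, ys))
    best_w = min((k / steps for k in range(steps + 1)), key=sse)
    return lambda p: best_w * p[0] + (1 - best_w) * p[1]

def fit_stacking(xs, ys):
    half = len(xs) // 2
    # Step 1: split the training data into two folds.
    x1, y1, x2, y2 = xs[:half], ys[:half], xs[half:], ys[half:]
    # Step 2: fit the weak learners on the first fold.
    learners = [fit_mean(x1, y1), fit_line(x1, y1)]
    # Step 3: each weak learner predicts the second fold.
    pred_rows = [[learner(x) for learner in learners] for x in x2]
    # Step 4: fit the meta-model on the second fold, using those predictions as inputs.
    meta = fit_meta(pred_rows, y2)
    return lambda x: meta([learner(x) for learner in learners])

random.seed(0)
xs = [random.uniform(0, 10) for _ in range(40)]
ys = [3 * x + 1 + random.gauss(0, 0.5) for x in xs]  # synthetic data: y is roughly 3x + 1
model = fit_stacking(xs, ys)
print(round(model(5.0), 2))
```

The meta-model never sees the first fold, so it learns how much to trust each weak learner from predictions made on data those learners were not fitted to.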
A scatter graph would be more appropriate than a line chart when you are
looking to show the relationship between two variables that are not linearly
related. For example, if you were looking to show the relationship between a
person’s age and their weight, a scatter graph would be more appropriate than a
line chart. A line chart would be more appropriate than a scatter graph when you
are looking to show a trend over time. For example, if you were looking at the
monthly sales of a company over the course of a year, a line chart would be more
appropriate than a scatter graph.
When data is ingested into Power BI, it is basically stored in Fact and Dimension
tables.
Fact tables: The central table in a star schema of a data warehouse, a fact table
stores quantitative information for analysis and is not normalized in most cases.
Dimension tables: It is just another table in the star schema that is used to store
attributes and dimensions that describe objects stored in a fact table.
www.cloudyml.com
Learn Data Science By Doing It.
4. What is Cursor? How to use a Cursor?
Decorators are used to add some design patterns to a function without changing
its structure. Decorators generally are defined before the function they are
enhancing. To apply a decorator we first define the decorator function. Then we
write the function it is applied to and simply add the decorator function above
the function it has to be applied to. For this, we use the @ symbol before the
decorator.
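A small sketch of that pattern follows. The names `log_calls` and `add` are made up for illustration; `functools.wraps` is the standard-library helper that keeps the wrapped function's name and docstring.

```python
import functools

def log_calls(func):
    """Decorator that records each call's arguments before delegating to func."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        wrapper.calls.append((args, kwargs))
        return func(*args, **kwargs)
    wrapper.calls = []
    return wrapper

# The decorator is defined first, then applied with @ above the function.
@log_calls
def add(a, b):
    return a + b

result = add(2, 3)
print(result)        # 5
print(add.calls)     # [((2, 3), {})]
print(add.__name__)  # add
```

Writing `@log_calls` above `add` is equivalent to `add = log_calls(add)`; the body of `add` itself is unchanged.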
7. What is the meaning of KPI in statistics?
Autoencoders are artificial neural networks that learn without any supervision.
Here, these networks have the ability to automatically learn by mapping the
inputs to the corresponding outputs.
Autoencoders, as the name suggests, consist of two entities:
Encoder: Used to fit the input into an internal computation state
Decoder: Used to convert the computational state back into the output
11. What is a Recursive Stored Procedure in SQL?
A stored procedure that calls itself until a boundary condition is reached is
called a recursive stored procedure. Recursion lets programmers reuse the same
set of code as many times as required. Some SQL dialects limit the recursion
depth to prevent an infinite loop of procedure calls from causing a stack
overflow, which can slow down the system and may lead to crashes.
The SUM function (SUM()) adds up all the values in a single column, whereas the
SUMX function (SUMX()) evaluates an expression for each row of a table and adds
up the results. The table passed to SUMX can be filtered first, which lets you
restrict the rows that are added.
SUMX(Table, Expression), where Table contains the rows for calculation.
Expression is a calculation that will be evaluated on each row of the table.
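The difference can be mimicked in Python. This is only an analogy, not DAX: `sum_column` and `sumx` are hypothetical helpers, and the `sales` table is made-up data.

```python
# Made-up table: one dict per row.
sales = [
    {"quantity": 2, "unit_price": 10.0},
    {"quantity": 1, "unit_price": 25.0},
    {"quantity": 4, "unit_price": 5.0},
]

def sum_column(table, column):
    """SUM-like: add up one existing column."""
    return sum(row[column] for row in table)

def sumx(table, expression):
    """SUMX-like: evaluate an expression on each row, then add up the results."""
    return sum(expression(row) for row in table)

total_quantity = sum_column(sales, "quantity")
revenue = sumx(sales, lambda row: row["quantity"] * row["unit_price"])

# Filtering the table before iterating restricts which rows are added.
bulk_rows = [row for row in sales if row["quantity"] >= 2]
bulk_revenue = sumx(bulk_rows, lambda row: row["quantity"] * row["unit_price"])

print(total_quantity, revenue, bulk_revenue)  # 7 65.0 40.0
```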
The loss function captures the difference between the actual and predicted
value for a single record, whereas the cost function aggregates that difference
over the entire training dataset.
The most commonly used loss functions are mean squared error and hinge loss.
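A short sketch of the distinction, assuming the cost is the mean of the per-record losses (other aggregations, such as a sum, are also used). The function names and sample values are illustrative only.

```python
def squared_error(y_true, y_pred):
    """Loss for one record in a regression problem."""
    return (y_true - y_pred) ** 2

def hinge_loss(y_true, score):
    """Loss for one record in a classification problem; y_true must be -1 or +1."""
    return max(0.0, 1.0 - y_true * score)

def cost(loss_fn, targets, predictions):
    """Cost: the per-record losses averaged over the whole dataset."""
    losses = [loss_fn(t, p) for t, p in zip(targets, predictions)]
    return sum(losses) / len(losses)

# Regression: per-record losses are 0.25, 0.0 and 4.0.
mse = cost(squared_error, [3.0, 5.0, 2.0], [2.5, 5.0, 4.0])

# Classification: per-record losses are 0.0, 1.5 and 0.7.
mean_hinge = cost(hinge_loss, [1, -1, 1], [2.0, 0.5, 0.3])

print(round(mse, 4), round(mean_hinge, 4))
```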
15. What is the difference between Python Arrays and
lists?
Arrays in Python can only contain elements of the same data type, i.e., the data
type of an array must be homogeneous. An array is a thin wrapper around C
language arrays and consumes far less memory than a list.
Lists in Python can contain elements of different data types, i.e., a list can
be heterogeneous. Lists have the disadvantage of consuming more memory.
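The homogeneity rule can be seen with the standard-library `array` module. The sample values are arbitrary, and the type code `"i"` (signed C int) is just one of the available codes.

```python
from array import array

numbers = array("i", [1, 2, 3])   # every element must be a C int
mixed = [1, "two", 3.0]           # a list accepts any mix of types

try:
    numbers.append("four")        # a string is not an int
    rejected = False
except TypeError:
    rejected = True

print(rejected)                   # True
print(numbers.itemsize)           # bytes used per element (platform dependent)
print([type(item).__name__ for item in mixed])
```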
Root cause analysis: a method of problem-solving used for identifying the root
cause(s) of a problem [5]
Correlation measures the strength of the relationship between two variables and
ranges from -1 to 1. Causation is when one event causes a second event.
Causation essentially looks at direct relationships, while correlation can
reflect both direct and indirect relationships.
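The -1 to 1 range can be checked with a hand-written Pearson correlation, which is assumed here as the correlation measure since the text does not name one. The sample data is invented.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    covariance = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mx) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - my) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)

hours = [1, 2, 3, 4]
rising = [2, 4, 6, 8]     # moves exactly with hours
falling = [8, 6, 4, 2]    # moves exactly against hours

print(round(pearson(hours, rising), 6))   # 1.0
print(round(pearson(hours, falling), 6))  # -1.0
```

A coefficient near 1 or -1 only says that the two variables move together; it does not by itself show that one causes the other.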
k-means has trouble clustering data where clusters are of various sizes and
densities. Outliers will cause the centroids to be dragged, or the outliers
might get their own cluster instead of being ignored. Outliers should be clipped
or removed before clustering. If the number of dimensions increases, a
distance-based similarity measure converges to a constant value between any
given examples. Dimensions should be reduced before clustering.
18. If your Time-Series Dataset is very long, what
architecture would you use?
If the time-series dataset is very long, LSTMs are ideal for it because they can
process not only single data points but also entire sequences of data. A
time-series is a sequence of data, which makes an LSTM a natural fit. For even
stronger representational capacity, the LSTM can be made multi-layered. Another
method for long time-series datasets is to use CNNs to extract information.
Parsing time series information from various sources and formats. Generating
sequences of fixed-frequency dates and time spans. Manipulating and
converting date times with time zone information. Resampling or converting a
time series to a particular frequency.
The main difference between window functions and aggregate functions is that
aggregate functions group multiple rows into a single result row; all the individual
rows in the group are collapsed and their individual data is not shown. On the
other hand, window functions produce a result for each individual row. This result
is usually shown as a new column value in every row within the window.
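The contrast can be run against an in-memory SQLite database from Python. This is a sketch on a made-up `sales` table, and it assumes the bundled SQLite is version 3.25 or later, which is when window functions were added.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 10), ("east", 20), ("west", 5)],
)

# Aggregate function: each group collapses into a single result row.
aggregated = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()

# Window function: every input row is kept and gains a new column.
windowed = conn.execute(
    "SELECT region, amount, SUM(amount) OVER (PARTITION BY region) "
    "FROM sales ORDER BY region, amount"
).fetchall()

print(aggregated)  # [('east', 30), ('west', 5)]
print(windowed)    # [('east', 10, 30), ('east', 20, 30), ('west', 5, 5)]
```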
21. What is Ribbon in Excel and where does it appear?
The Ribbon is basically your key interface with Excel and it appears at the top of
the Excel window. It allows users to access many of the most important
commands directly. It consists of many tabs such as File, Home, View, Insert, etc.
You can also customize the ribbon to suit your preferences. To customize the
Ribbon, right-click on it and select the “Customize the Ribbon” option.
The memory cell in an LSTM is implemented with a forget gate, an input gate, and
an output gate. The forget gate controls how much information from the
previous cell state is forgotten. The input gate controls how much new
information from the current input is allowed into the cell state. The output
gate controls how much information from the cell state is allowed to pass out
to the hidden state, which is the output of the cell at that time step.
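A single LSTM step can be sketched with scalars instead of vectors. The equations are the standard textbook LSTM formulation, which is an assumption because the text gives no formulas; the parameter names (`wf`, `uf`, `bf`, and so on) and the helper `make_params` are invented for this example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def make_params(**overrides):
    """All weights and biases default to 0.0; keyword arguments override them."""
    params = {prefix + gate: 0.0 for gate in "fiog" for prefix in ("w", "u", "b")}
    params.update(overrides)
    return params

def lstm_step(x, h_prev, c_prev, p):
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])    # forget gate
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])    # input gate
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])    # output gate
    g = math.tanh(p["wg"] * x + p["ug"] * h_prev + p["bg"])  # candidate values
    c = f * c_prev + i * g      # keep part of the old state, admit part of the new
    h = o * math.tanh(c)        # expose part of the cell state as the output
    return h, c

# Forget gate fully open, input gate closed: the old cell state is carried over.
h_keep, c_keep = lstm_step(1.0, 0.0, 0.7, make_params(bf=100.0, bi=-100.0))

# Forget gate closed, input gate fully open: the old cell state is replaced.
h_new, c_new = lstm_step(1.0, 0.0, 0.7, make_params(bf=-100.0, bi=100.0, wg=1.0))

print(round(c_keep, 4), round(c_new, 4))
```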
A CTE (Common Table Expression) is a one-time result set that only exists for the
duration of the query. It allows us to refer to data within a single SELECT, INSERT,
UPDATE, DELETE, CREATE VIEW, or MERGE statement's execution scope. It is
temporary because its result cannot be stored anywhere and will be lost as soon
as a query's execution is completed.
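The following sketch runs a CTE through Python's `sqlite3` module. The `orders` table, its rows, and the CTE name `customer_totals` are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("ann", 50), ("ann", 70), ("bob", 20)],
)

query = """
WITH customer_totals AS (
    SELECT customer, SUM(amount) AS total
    FROM orders
    GROUP BY customer
)
SELECT customer, total FROM customer_totals WHERE total > 100
"""
rows = conn.execute(query).fetchall()
print(rows)  # [('ann', 120)]

# The CTE exists only inside the statement that defined it.
try:
    conn.execute("SELECT * FROM customer_totals")
    cte_persisted = True
except sqlite3.OperationalError:
    cte_persisted = False
print(cte_persisted)  # False
```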
24. List the advantages NumPy Arrays have over Python
lists?
Constraints are used to specify the rules concerning data in the table. They
can be applied to single or multiple fields in an SQL table, either when the
table is created or afterwards using the ALTER TABLE command. The constraints are:
NOT NULL - Restricts NULL value from being inserted into a column.
CHECK - Verifies that all values in a field satisfy a condition.
DEFAULT - Automatically assigns a default value if no value has been specified
for the field.
UNIQUE - Ensures unique values to be inserted into the field.
INDEX - Indexes a field providing faster retrieval of records.
PRIMARY KEY - Uniquely identifies each record in a table.
FOREIGN KEY - Ensures referential integrity for a record in another table.
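Most of these rules can be tried out in SQLite from Python. This is a sketch with invented table and column names; note that SQLite only enforces foreign keys after `PRAGMA foreign_keys = ON`, and other database systems differ in such details.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite-specific switch
conn.execute(
    "CREATE TABLE departments (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE)"
)
conn.execute("""
    CREATE TABLE employees (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        age INTEGER CHECK (age >= 18),
        status TEXT DEFAULT 'active',
        department_id INTEGER REFERENCES departments(id)
    )
""")
conn.execute("CREATE INDEX idx_employees_name ON employees(name)")

conn.execute("INSERT INTO departments (id, name) VALUES (1, 'data')")
conn.execute("INSERT INTO employees (name, age, department_id) VALUES ('Ann', 30, 1)")

def violates(sql):
    """Return True if the statement is rejected by a constraint."""
    try:
        conn.execute(sql)
    except sqlite3.IntegrityError:
        return True
    return False

null_name = violates("INSERT INTO employees (name, age, department_id) VALUES (NULL, 30, 1)")
too_young = violates("INSERT INTO employees (name, age, department_id) VALUES ('Bob', 10, 1)")
bad_dept = violates("INSERT INTO employees (name, age, department_id) VALUES ('Cy', 40, 99)")
dup_dept = violates("INSERT INTO departments (id, name) VALUES (2, 'data')")
default_status = conn.execute("SELECT status FROM employees WHERE name = 'Ann'").fetchone()[0]

print(null_name, too_young, bad_dept, dup_dept, default_status)
```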
27. What Would You Do If Some Countries/Provinces (Any
Geographical Entity) are Missing and Displaying a Null
When You Use Map View in Tableau?
29. What is the AdaBoost Algorithm?
Tableau uses a workbook and sheet file structure, much like Microsoft Excel.
A workbook contains sheets, which can be a worksheet, dashboard, or a story.
A worksheet contains a single view along with shelves, legends, and the Data
pane.
A dashboard is a collection of views from multiple worksheets.
A story contains a sequence of worksheets or dashboards that work together to
convey information.
32. What are the steps involved in training a perceptron
in Deep Learning?
There are five main steps that determine the learning of a perceptron:
Initialize thresholds and weights
Provide inputs
Calculate outputs
Update weights in each step
Repeat steps 2 to 4
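The five steps map onto a few lines of Python. This sketch assumes a step activation, the classic perceptron update rule, and the logical AND function as toy training data; none of these specifics come from the text above.

```python
def predict(weights, bias, inputs):
    """Step activation: output 1 if the weighted sum exceeds 0, else 0."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0

def train_perceptron(samples, epochs=10, learning_rate=1.0):
    # Step 1: initialize the weights and the threshold (stored here as a bias).
    weights, bias = [0.0, 0.0], 0.0
    # Step 5: repeat steps 2 to 4 over the training data.
    for _ in range(epochs):
        for inputs, target in samples:           # Step 2: provide inputs
            output = predict(weights, bias, inputs)  # Step 3: calculate the output
            error = target - output
            # Step 4: update the weights (no change when the output is correct).
            weights = [w + learning_rate * error * x for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias

and_gate = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(and_gate)
predictions = [predict(weights, bias, inputs) for inputs, _ in and_gate]
print(predictions)  # [0, 0, 0, 1]
```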
Hard-Margin SVMs have linearly separable training data. No data points are
allowed in the margin areas. This type of linear classification is known as Hard
margin classification.
Soft-Margin SVMs have training data that are not linearly separable. Margin
violation means choosing a hyperplane, which can allow some data points to stay
either in between the margin area or on the incorrect side of the hyperplane.
Hard-Margin SVMs are quite sensitive to outliers.
Soft-Margin SVMs try to find the best balance between keeping the margin as
large as possible and limiting the margin violations.
35. What is the Right JOIN in SQL?
The Right join is used to retrieve all rows from the right-hand table and only
those rows from the left-hand table that fulfil the join condition. It returns
all the rows from the right-hand table even when there are no matches in the
left-hand table. For right-hand rows with no match, the columns of the
left-hand table are returned as NULL. This join is also known as Right Outer Join.
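The behaviour can be reproduced with Python's `sqlite3` module. Because older SQLite versions do not support the RIGHT JOIN keyword, this sketch writes `employees RIGHT JOIN departments` as the equivalent `departments LEFT JOIN employees`; the tables and rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, dept_id INTEGER)")   # left-hand table
conn.execute("CREATE TABLE departments (id INTEGER, dept TEXT)")      # right-hand table
conn.executemany("INSERT INTO employees VALUES (?, ?)", [("Ann", 1), ("Bob", 1)])
conn.executemany("INSERT INTO departments VALUES (?, ?)", [(1, "data"), (2, "sales")])

# Same result set as: employees AS e RIGHT JOIN departments AS d ON e.dept_id = d.id
rows = conn.execute("""
    SELECT e.name, d.dept
    FROM departments AS d
    LEFT JOIN employees AS e ON e.dept_id = d.id
    ORDER BY d.id, e.name
""").fetchall()

print(rows)  # [('Ann', 'data'), ('Bob', 'data'), (None, 'sales')]
```

The `sales` department has no employees, so it still appears once, with NULL (Python `None`) in the column that comes from the left-hand table.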
The RNN is a stateful neural network, which means that it not only retains
information from the previous layer but also from the previous pass. Thus, this
neuron is said to have connections between passes, and through time.
For the RNN the order of the input matters due to being stateful. The same
words with different orders will yield different outputs.
RNN can be used for unsegmented, connected applications such as handwriting
recognition or speech recognition.
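The effect of input order can be shown with a one-unit recurrent cell. The weights `w_x` and `w_h` are arbitrary values picked for this sketch, not trained parameters.

```python
import math

def run_rnn(sequence, w_x=0.5, w_h=0.8):
    """Feed a sequence through a single tanh unit that carries state between steps."""
    h = 0.0
    for x in sequence:
        # The new state depends on the current input and on the previous state.
        h = math.tanh(w_x * x + w_h * h)
    return h

forward = run_rnn([1.0, 2.0, 3.0])
backward = run_rnn([3.0, 2.0, 1.0])
print(round(forward, 3), round(backward, 3))  # same inputs, different final states
```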
Array elements can be removed using the pop() or remove() methods. pop() takes
an index (the last element by default) and returns the deleted value, whereas
remove() takes a value, deletes its first occurrence, and returns nothing.
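A quick demonstration on a Python list (the `array` module offers the same two methods):

```python
values = [10, 20, 30, 40]

popped = values.pop(1)          # removes by index and returns the element
print(popped, values)           # 20 [10, 30, 40]

returned = values.remove(30)    # removes by value and returns None
print(returned, values)         # None [10, 40]
```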
Advantages of Views:
As there is no physical location where the data in the view is stored, it
generates output without wasting resources.
Data access is restricted as it does not allow commands like insertion,
updation, and deletion.
Disadvantages of Views:
The view becomes irrelevant if we drop a table related to that view.
Much memory space is occupied when the view is created for large tables.
39. How to create a calculated field in Tableau?
Click the drop down to the right of Dimensions on the Data pane and select
“Create > Calculated Field” to open the calculation editor.
Name the new field and create a formula.
One-to-One - This can be defined as the relationship between two tables where
each record in one table is associated with at most one record in the other
table.
One-to-Many & Many-to-One - This is the most commonly used relationship
where a record in a table is associated with multiple records in the other table.
Many-to-Many - This is used in cases when multiple instances on both sides are
needed for defining a relationship.
Self-Referencing Relationships - This is used when a table needs to define a
relationship with itself.
42. What are the main difficulties when training RNNs?
How can you handle them?
The two main difficulties when training RNNs are unstable gradients (exploding
or vanishing) and a very limited short-term memory. These problems both get
worse when dealing with long sequences.
To alleviate the unstable gradients problem, we can:
Use a smaller learning rate.
Use a saturating activation function such as the hyperbolic tangent (which is the
default), and possibly use gradient clipping, Layer Normalization, or dropout at
each time step.
To tackle the limited short-term memory problem, we can use a Long Short-
Term Memory layer or a Gated recurrent unit layer.
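Of the remedies listed for unstable gradients, gradient clipping is the easiest to show in isolation. This sketch assumes clipping by global L2 norm, which is one common variant; the function name `clip_by_norm` is invented for the example.

```python
import math

def clip_by_norm(gradients, max_norm):
    """Rescale the gradient vector so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in gradients))
    if norm <= max_norm:
        return list(gradients)   # already small enough, leave unchanged
    scale = max_norm / norm
    return [g * scale for g in gradients]

exploding = [3.0, 4.0]                   # L2 norm is 5.0
clipped = clip_by_norm(exploding, 1.0)   # same direction, norm reduced to 1.0
small = clip_by_norm([0.3, 0.4], 1.0)    # norm 0.5, returned as is

print([round(g, 3) for g in clipped], small)
```

Clipping keeps the direction of the update and only limits its size, which is why it helps with exploding gradients but not with vanishing ones.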
45. What do you understand by the term silhouette
coefficient?
Trends and seasonality are two characteristics of time series metrics that break
many models. Trends are continuous increases or decreases in a metric’s value.
Seasonality, on the other hand, reflects periodic (cyclical) patterns that occur in a
system, usually rising above a baseline and then decreasing again.
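A synthetic series makes the two components concrete. In this sketch the trend (slope 2 per step), the four-step seasonal pattern, and the use of a trailing moving average to smooth out the seasonality are all assumptions made for illustration.

```python
PERIOD = 4
seasonal_pattern = [1.0, -1.0, 2.0, -2.0]   # repeats every PERIOD steps, sums to zero

# Trend (2.0 * t) plus seasonality.
series = [2.0 * t + seasonal_pattern[t % PERIOD] for t in range(16)]

def moving_average(values, window):
    """Trailing moving average; one output per full window."""
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]

# Averaging over exactly one period cancels the seasonal pattern.
trend = moving_average(series, PERIOD)
steps = [later - earlier for earlier, later in zip(trend, trend[1:])]

print(series[:4])  # [1.0, 1.0, 6.0, 4.0]
print(trend[:3])   # [3.0, 5.0, 7.0]
print(set(steps))  # {2.0}
```

The raw series jumps up and down, while the smoothed values rise by the same amount at every step, which is the underlying trend.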
Bagging trains homogeneous weak learners independently of each other and in
parallel, then combines them, typically by averaging their predictions.
Boosting also uses homogeneous weak learners but works differently from
Bagging. In this model, learners are trained sequentially and adaptively, so
that each learner tries to improve on the predictions of the previous ones.