For any homework-related queries, call us at +1 678 648 4277
You can mail us at: support@statisticshomeworksolver.com or
reach us at: https://www.statisticshomeworksolver.com/
Data Mining Assignment Help
Data Mining and Regression
These questions cover a wide range of data mining and
regression sub-topics, involving concepts such as:
• Training and test sets
• Data reduction
• Sampling
• Data splitting and re-sampling
• Regression
Training and Test Sets
What are the training set and test set used for, respectively?
If a dataset is split by assigning 75% to one set and 25% to
the other, should the 75% or the 25% go to the training set?
Ans: The training set is used to fit the model on known
samples so that the model can learn its parameters. The test set
is used to evaluate model performance on out-of-sample
examples that were not used during training, in order to
assess the real-world performance of the model. The 75%
portion should go to training so that the model has enough
data to estimate its parameters reliably.
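A minimal sketch of the 75/25 split described above, using scikit-learn's `train_test_split` (the data here is synthetic, purely for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))   # 100 samples, 3 predictors
y = rng.normal(size=100)

# 75% of the rows go to the training set, 25% to the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
print(X_train.shape, X_test.shape)  # (75, 3) (25, 3)
```

The model is fitted only on `X_train`/`y_train`; `X_test`/`y_test` stay untouched until the final performance evaluation.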
Data Reduction
Removing predictor(s) is generally known as a data
reduction technique. Explain under what
conditions we should consider removing predictors.
Ans: Predictors can be removed under conditions such as:
a) The predictor adds no value to the problem in a logical
sense, e.g. a name or serial number.
b) The predictor replicates information that is already carried
by another predictor.
c) The predictor has many missing values, which may lead to
a poor fit.
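The three conditions can be sketched with pandas (the column names and the 50% missingness threshold are illustrative assumptions, not fixed rules):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "serial_no": [1, 2, 3, 4, 5],                   # identifier, no predictive value
    "income":    [50, 60, np.nan, np.nan, np.nan],  # mostly missing
    "age":       [25, 32, 41, 29, 37],
    "age_copy":  [25, 32, 41, 29, 37],              # duplicates "age" exactly
})

# a) drop identifier-like columns by name
df = df.drop(columns=["serial_no"])
# b) drop predictors that replicate another column's information
df = df.T.drop_duplicates().T
# c) drop columns with more than 50% missing values
df = df.loc[:, df.isna().mean() <= 0.5]
print(list(df.columns))  # ['age']
```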
Sampling
What is the difference(s) between simple random sampling
and stratified random sampling?
Ans: Simple random sampling takes k out of n objects at
random; under this scheme, every possible sample of size k has
an equal probability of being selected.
In stratified sampling, the data are divided into well-defined
groups (strata), simple random sampling is performed within
each stratum, and the results are combined into the sample.
Stratified sampling is, in most cases, a better way to represent
the actual population, especially when there is class imbalance.
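The difference shows up clearly on an imbalanced label. A sketch with scikit-learn (synthetic data; the 90/10 imbalance is an assumption for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

y = np.array([0] * 90 + [1] * 10)      # 90/10 class imbalance
X = np.arange(len(y)).reshape(-1, 1)

# Simple random split: the minority-class fraction can drift by chance.
_, _, _, y_simple = train_test_split(X, y, test_size=0.2, random_state=0)

# Stratified split: each class is sampled separately, preserving the ratio.
_, _, _, y_strat = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
print(y_strat.mean())  # exactly 0.1 in the stratified test set
```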
Why is model tuning necessary for predictive modelling?
Ans: Hyperparameters are crucial because they control the
overall behaviour of a machine learning model. The goal of
tuning is to find the combination of hyperparameters that
minimises a predefined loss function and thus gives better
results. Model tuning is therefore important for obtaining the
optimum model for a given problem statement. Many candidate
models can be built for any task, but to get the best out of
each, its hyperparameters must be tuned.
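A hedged sketch of model tuning: grid-searching the neighbourhood size k of a KNN regressor with 5-fold cross-validation (the data, candidate grid, and scoring metric are all assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

# Each candidate value of the hyperparameter k is scored by 5-fold CV;
# the value with the lowest cross-validated MSE wins.
search = GridSearchCV(
    KNeighborsRegressor(),
    param_grid={"n_neighbors": [1, 3, 5, 10, 25]},
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X, y)
print(search.best_params_)
```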
Predictive Model Building
Use your words to describe the process of building
predictive models considering data splitting and data
resampling (referring to the graph below).
Ans: The steps of model building are outlined below:
Step 1: Select/Get Data
Step 2: Data cleaning/Data pre-processing
Step 3: Data splitting: Into training and test sets
Step 4: Split training set into Training and Validation set
Step 5: Model Selection and Develop Models (Training)
Step 6: Parameter tuning (Validation set), Optimize
Step 7: Testing and model performance evaluation
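The steps above can be sketched end to end (a minimal illustration with synthetic data; the linear model and the split fractions are assumptions, not prescriptions):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Steps 1-2: get and pre-process data (here: clean synthetic data)
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = X @ np.array([2.0, -1.0]) + rng.normal(scale=0.1, size=200)

# Step 3: split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1
)
# Step 4: carve a validation set out of the training set
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, y_train, test_size=0.2, random_state=1
)

# Steps 5-6: train candidate model(s); the validation set would guide tuning
model = LinearRegression().fit(X_tr, y_tr)
val_mse = mean_squared_error(y_val, model.predict(X_val))

# Step 7: final evaluation on the held-out test set
test_mse = mean_squared_error(y_test, model.predict(X_test))
print(round(val_mse, 3), round(test_mse, 3))
```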
Linear Regression
List three linear regression models we learned in class.
What metrics can be used to compare the linear model
predictive performance?
Ans: Examples of linear regression models are ordinary least
squares regression, kernel regression, k-NN regression, and the
MARS model. Predictive performance can be compared using
metrics such as RMSE, MAE, and R².
What are the two tuning parameters associated with
Multivariate Adaptive Regression Splines (MARS) model?
How to determine the optimal values for the tuning
parameters?
Ans: The two parameters are the degree (of interaction) and
nprune (the maximum number of retained terms). Both are
determined by evaluating model performance on a validation set.
Define K-Nearest Neighbours (KNN) regression method
and indicate whether pre-processing predictors is needed
prior to performing KNN.
Ans: KNN regression is a non-parametric method that, in an
intuitive manner, approximates the association between
independent variables and the continuous outcome
by averaging the observations in the same neighbourhood.
The size of the neighbourhood needs to be set by the analyst
or can be chosen using cross-validation to select the size that
minimises the mean-squared error. Generally, pre-processing
here means making the features numeric and comparable in
scale so that distances can be calculated meaningfully, so we
centre and scale the data.
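A sketch of KNN regression with the pre-processing described above: centring and scaling the features before the distance computation (synthetic data; k=5 is an assumption):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
# Two features on very different scales; without scaling, the second
# feature would dominate the Euclidean distance.
X = np.column_stack([rng.uniform(0, 1, 300), rng.uniform(0, 1000, 300)])
y = 3 * X[:, 0] + 0.001 * X[:, 1]

# StandardScaler centres and scales each feature before KNN averages
# the 5 nearest neighbours' outcomes.
knn = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5))
knn.fit(X, y)
pred = knn.predict(X[:5])
print(np.round(pred, 2))
```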