
DAV Solution (Even Semester)

Academic Year (2023-24)

Q1
(A) List and explain the different key roles for successful data analytics.

Answer:
Responsibilities include collecting and organizing data, analyzing datasets to identify trends and patterns, generating reports and visualizations for stakeholders, maintaining and optimizing databases, problem-solving through data-driven approaches, monitoring business performance, and continuously learning about new tools and techniques in data analytics.

1. Data Collection and Interpretation: Gather, organize, and clean large datasets from various sources, ensuring data integrity and accuracy. Interpret data to identify trends, patterns, and insights relevant to business objectives.

2. Analysis and Reporting: Utilize statistical techniques and data analysis tools to extract meaningful insights from complex datasets. Generate reports and visualizations to communicate findings effectively to stakeholders, aiding in informed decision-making processes.

3. Database Management: Maintain and optimize databases to ensure efficient data storage, retrieval, and manipulation. Implement data governance policies and procedures to safeguard data quality and security.

4. Problem Solving and Decision Support: Collaborate with cross-functional teams to address business challenges and provide data-driven solutions. Support strategic planning and resource allocation through data analysis and modeling.

5. Performance Monitoring and Optimization: Develop and implement metrics and KPIs to monitor business performance and track key trends over time. Identify areas for improvement and optimization based on data-driven insights.

6. Predictive Modeling and Forecasting: Build predictive models and conduct forecasting analysis to anticipate future trends and outcomes. Evaluate model accuracy and refine algorithms to enhance predictive capabilities.

7. Continuous Learning and Development: Stay abreast of industry trends, best practices, and emerging technologies in data analytics. Continuously expand skills in data manipulation, statistical analysis, and data visualization tools to remain competitive in the field.

1(B). What is stepwise regression? Explain its types.

Solution:

Stepwise regression is the step-by-step iterative construction of a regression model that involves the selection of the independent variables to be used in the final model. It involves adding or removing potential explanatory variables in succession and testing for statistical significance after each iteration.

#1 – Forward Stepwise Regression

The forward model starts empty, with no variables. Each predictor variable is tested and then introduced into the model; only the ones that meet the statistical significance criteria are kept. This process is repeated until the desired result is acquired. It is called forward regression because the process moves in the forward direction: variables are tested and added one at a time toward constructing an optimal model.

#2 – Backward Stepwise Regression

It is the opposite of forward regression. When the backward approach is employed, the model starts with all candidate variables. Each variable then undergoes testing, and variables that fail to meet the statistical significance standards are discarded. This process is repeated for all the variables until the desired result is obtained.

#3 – Bidirectional Stepwise Regression

The bidirectional approach is a combination of forward and backward regression and is naturally somewhat more complicated, since variables can be both added and removed at each step. Nevertheless, analysts use this approach to save time when many candidate variables are present.
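To make the idea concrete, here is a minimal sketch of forward stepwise selection in Python, using p-values from statsmodels OLS fits; the synthetic data, the 0.05 significance threshold, and the forward_stepwise helper name are assumptions made for this illustration only.

```python
# Minimal sketch of forward stepwise selection based on p-values.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(100, 4)), columns=["x1", "x2", "x3", "x4"])
y = 2.0 * X["x1"] - 3.0 * X["x3"] + rng.normal(size=100)

def forward_stepwise(X, y, alpha=0.05):
    selected = []
    remaining = list(X.columns)
    while remaining:
        # Fit one candidate model per remaining predictor and record its p-value.
        pvals = {}
        for cand in remaining:
            model = sm.OLS(y, sm.add_constant(X[selected + [cand]])).fit()
            pvals[cand] = model.pvalues[cand]
        best = min(pvals, key=pvals.get)
        if pvals[best] < alpha:        # keep only statistically significant predictors
            selected.append(best)
            remaining.remove(best)
        else:
            break                      # stop when no candidate is significant
    return selected

print(forward_stepwise(X, y))          # expected to select x1 and x3
```

Backward elimination would start with all four columns and drop the least significant one at each step instead.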

1(C) Explain Term Frequency-Inverse Document Frequency (TF-IDF) with a suitable example.

Solution:

TF-IDF stands for Term Frequency-Inverse Document Frequency of records. It can be defined as a calculation of how relevant a word in a series or corpus is to a text. A word's relevance increases proportionally to the number of times it appears in the text, but this is offset by the frequency of the word in the corpus (data set).

Terminologies:

Term Frequency: In document d, the term frequency represents the number of instances of a given word t. A word therefore becomes more relevant the more often it appears in the text, which is rational. Since the ordering of terms is not significant, we can use a vector to describe the text in the bag-of-words model: for each specific term in the document, there is an entry whose value is the term frequency. The weight of a term that occurs in a document is simply proportional to the term frequency.

tf(t,d) = count of t in d / number of words in d

Document Frequency: This measures the importance of a term across the whole corpus collection, and is very similar to TF. The only difference is that TF is the frequency counter for a term t within a single document d, while DF is the number of documents in the document set N in which the term t occurs. In other words, DF is the number of documents in which the word is present.

df(t) = occurrence of t in documents

Inverse Document Frequency: Mainly, this tests how relevant the word is. The key aim of a search is to locate the appropriate records that fit the demand. Since TF considers all terms equally significant, term frequencies alone cannot be used to measure the weight of a term in a document. First, find the document frequency of a term t by counting the number of documents containing the term:

df(t) = N(t)

where

df(t) = Document frequency of a term t

N(t) = Number of documents containing the term t

Term frequency counts the instances of a term within a single document only, whereas document frequency counts the number of separate documents in which the term appears, so it depends on the entire corpus. Now let's look at the definition of inverse document frequency. The IDF of a word is the number of documents in the corpus divided by the document frequency of that word (in practice, the logarithm of this ratio is used, as in Step 3 below):

idf(t) = N / df(t) = N / N(t)


Example with 3 documents:

Document 1: It is going to rain today.

Document 2: Today I am not going outside.

Document 3: I am going to watch the season premiere.

To find TF-IDF we need to perform the steps we laid out above, so let's get to it.

Step 1: Clean the data and tokenize it to obtain the vocabulary of the documents.

Step 2: Find TF.
For Document 1 ("It is going to rain today."):
TF = (Number of repetitions of the word in a document) / (Number of words in the document)
Compute this for sentence 1, then continue for the rest of the sentences to obtain the TF of every word in each document.

Step 3: Find IDF.
Find the IDF for the documents (we do this for the feature names/vocabulary words only, which contain no stop words):
IDF = Log[(Number of documents) / (Number of documents containing the word)]

Step 4: Build the model, i.e., stack all words next to each other with the IDF value and TF value for the 3 documents.

Step 5: Compare the results and use the table to ask questions.
Remember, the final equation: TF-IDF = TF * IDF

Using this table you can easily see that words like 'it', 'is', and 'rain' are important for Document 1 but not for Documents 2 and 3, which means Document 1 differs from Documents 2 and 3 with respect to talking about rain. You can also say that Documents 1 and 2 talk about something happening 'today', and that Documents 2 and 3 discuss something about the writer because of the word 'I'. This table helps you find similarities and differences between documents and words much better than a plain bag-of-words (BoW) representation.
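As an illustration, the following is a minimal sketch in Python that applies the TF and IDF formulas above to the three example documents; the simple whitespace tokenization, the use of the natural logarithm, and the two printed words are simplifying assumptions for this sketch only.

```python
# Minimal sketch: TF-IDF computed by hand on the three example documents.
import math
import string

docs = [
    "It is going to rain today.",
    "Today I am not going outside.",
    "I am going to watch the season premiere.",
]

tokenized = [[w.strip(string.punctuation).lower() for w in d.split()] for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})

N = len(docs)
tfidf = []
for doc in tokenized:
    scores = {}
    for term in vocab:
        tf = doc.count(term) / len(doc)              # TF = count of t in d / words in d
        df = sum(term in d for d in tokenized)       # DF = documents containing t
        idf = math.log(N / df)                       # IDF = log(N / DF)
        scores[term] = tf * idf                      # TF-IDF = TF * IDF
    tfidf.append(scores)

print(round(tfidf[0]["rain"], 3))    # high: 'rain' appears only in Document 1
print(round(tfidf[0]["going"], 3))   # zero: 'going' appears in every document
```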

1(D). Difference between the Matplotlib and Seaborn libraries.

Solution:

Matplotlib: It is a Python library used for plotting graphs with the help of other libraries like Numpy and Pandas. It is a powerful tool for visualizing data in Python and is used for plotting 2D graphs of arrays and basic statistical visualizations. It was first introduced by John D. Hunter in 2002. It uses Pyplot to provide a MATLAB-like interface and is free and open-source. It is capable of dealing with various operating systems and their graphical backends.


Seaborn: It is also a Python library used for plotting graphs, with the help of Matplotlib, Pandas, and Numpy. It is built on top of Matplotlib and is considered a high-level extension of the Matplotlib library. It helps in visualizing univariate and bivariate data. It uses attractive themes for decorating Matplotlib graphics. It is an important tool for picturing linear regression models and for making graphs of statistical time-series data. It eliminates the overlapping of graphs and also aids in their beautification.

Table of differences between Matplotlib and Seaborn

Functionality:
Matplotlib: It is utilized for making basic graphs. Datasets are visualized with the help of bar graphs, histograms, pie charts, scatter plots, lines, and so on.
Seaborn: Seaborn contains several patterns and plots for data visualization. It uses fascinating themes, helps in compiling a whole dataset into a single plot, and also shows the distribution of the data.

Syntax:
Matplotlib: It uses comparatively complex and lengthy syntax. Example syntax for a bar graph: matplotlib.pyplot.bar(x_axis, y_axis).
Seaborn: It uses comparatively simple syntax which is easier to learn and understand. Example syntax for a bar graph: seaborn.barplot(x_axis, y_axis).

Dealing with Multiple Figures:
Matplotlib: We can open and use multiple figures simultaneously; however, they are closed one by one. Syntax to close one figure at a time: matplotlib.pyplot.close(). Syntax to close all the figures: matplotlib.pyplot.close("all").
Seaborn: Seaborn automates the creation of each figure; however, this may sometimes lead to out-of-memory (OOM) issues.

Visualization:
Matplotlib: Matplotlib is well connected with Numpy and Pandas and acts as a graphics package for data visualization in Python. Pyplot provides features and syntax similar to MATLAB, so MATLAB users can learn it easily.
Seaborn: Seaborn is more comfortable handling Pandas data frames. It uses basic sets of methods to provide beautiful graphics in Python.

Pliability:
Matplotlib: Matplotlib is highly customizable and robust.
Seaborn: Seaborn avoids overlapping plots with the help of its default themes.

Data Frames and Arrays:
Matplotlib: Matplotlib works efficiently with data frames and arrays. It treats figures and axes as objects. It contains various stateful APIs for plotting, so plot() methods can work without parameters.
Seaborn: Seaborn is much more functional and organized than Matplotlib and treats the whole dataset as a single unit. Seaborn is not as stateful, so parameters are required when calling methods like plot().

Use Cases:
Matplotlib: Matplotlib plots various graphs using Pandas and Numpy.
Seaborn: Seaborn is the extended version of Matplotlib, which uses Matplotlib along with Numpy and Pandas for plotting graphs.
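To illustrate the syntax difference in the table above, here is a minimal sketch that draws the same bar chart with both libraries; the small DataFrame and its column names are invented purely for this example.

```python
# Minimal sketch comparing the two APIs on the same data.
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

df = pd.DataFrame({"quarter": ["Q1", "Q2", "Q3", "Q4"],
                   "sales": [120, 150, 90, 180]})

# Matplotlib: state-based pyplot interface
plt.bar(df["quarter"], df["sales"])
plt.title("Matplotlib bar chart")
plt.show()

# Seaborn: higher-level, DataFrame-aware interface with built-in themes
sns.set_theme()
sns.barplot(data=df, x="quarter", y="sales")
plt.title("Seaborn bar chart")
plt.show()
```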

1(E). Explain the components of a time series.

Solution:

A time series is a sequence of data points that are recorded over a period of time. The components of a time series are the different patterns that can be observed in the data. These patterns can be used to understand the underlying behavior of the data and to make predictions about the future.


The four main components of a time series are:

Trend: The trend is the long-term direction of the data. It can be upward, downward, or flat.

Seasonality: Seasonality is a repeating pattern in the data that occurs at regular intervals, such as daily, weekly, monthly, or yearly.

Cycle: A cycle is a pattern in the data that repeats itself after a specific number of observations, which is not necessarily related to seasonality.

Irregularity: Irregularity is the random variation in the data that cannot be explained by any of the other components.

These components can be combined to create a model of the time series. This model can then be used to make predictions about the future.
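As an illustration, here is a minimal sketch of decomposing a series into trend, seasonal, and irregular (residual) components with statsmodels; the synthetic monthly series and the additive model choice are assumptions made for this sketch.

```python
# Minimal sketch: additive decomposition of a synthetic monthly series.
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

idx = pd.date_range("2020-01-01", periods=48, freq="MS")
rng = np.random.default_rng(1)
series = pd.Series(
    0.5 * np.arange(48)                              # upward trend
    + 10 * np.sin(2 * np.pi * np.arange(48) / 12)    # yearly seasonality
    + rng.normal(scale=2, size=48),                  # irregular component
    index=idx,
)

result = seasonal_decompose(series, model="additive", period=12)
print(result.trend.dropna().head())     # estimated trend
print(result.seasonal.head())           # estimated seasonal pattern
print(result.resid.dropna().head())     # irregular (residual) part
```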

2. Attempt the following

[A] Explain the different phases in the data analytics lifecycle.

Solution:

Data Analytics Lifecycle:

The data analytics lifecycle is designed for Big Data problems and data science projects. The cycle is iterative to reflect a real project. To address the distinct requirements for performing analysis on Big Data, a step-by-step methodology is needed to organize the activities and tasks involved with acquiring, processing, analyzing, and repurposing data.

Phase 1: Discovery –

The data science team learns and investigates the problem.

It develops context and understanding.

It comes to know about the data sources needed and available for the project.

The team formulates an initial hypothesis that can later be tested with data.

Phase 2: Data Preparation –

Steps to explore, preprocess, and condition data prior to modeling and analysis.

It requires the presence of an analytic sandbox; the team extracts, loads, and transforms data to get it into the sandbox.

Data preparation tasks are likely to be performed multiple times and not in a predefined order.

Several tools commonly used for this phase are Hadoop, Alpine Miner, OpenRefine, etc.

Phase 3: Model Planning –

The team explores the data to learn about the relationships between variables and subsequently selects the key variables and the most suitable models.

Several tools commonly used for this phase are Matlab and STATISTICA.

Phase 4: Model Building –

The team develops datasets for testing, training, and production purposes.

The team builds and executes models based on the work done in the model planning phase.

The team also considers whether its existing tools will suffice for running the models or whether it needs a more robust environment for executing the models.

Free or open-source tools – R and PL/R, Octave, WEKA.

Commercial tools – Matlab, STATISTICA.

Phase 5: Communicate Results –

After executing the model, the team needs to compare the outcomes of modeling to the criteria established for success and failure.

The team considers how best to articulate the findings and outcomes to the various team members and stakeholders, taking into account warnings and assumptions.

The team should identify the key findings, quantify the business value, and develop a narrative to summarize and convey the findings to stakeholders.

Phase 6: Operationalize –

The team communicates the benefits of the project more broadly and sets up a pilot project to deploy the work in a controlled way before broadening the work to the full enterprise of users.

This approach enables the team to learn about the performance and related constraints of the model in a production environment on a small scale and to make adjustments before full deployment.

The team delivers final reports, briefings, and code.

Free or open-source tools – Octave, WEKA, SQL, MADlib.


Q2(B): Explain the ARIMA model in detail. Also state its pros and cons.

Solution:

ARIMA (AutoRegressive Integrated Moving Average) models are a class of statistical models that describe the patterns and trends in time series data. They consist of three main components: autoregression, integration, and moving average. Autoregression means that the current value of the series depends on its past values, with some lag; integration refers to differencing the raw observations to make the series stationary; and the moving average component models the dependence of the current value on past forecast errors.

ARIMA Parameters

Each component in ARIMA functions as a parameter with a standard notation. For ARIMA models, the standard notation is ARIMA(p, d, q), where integer values substitute for the parameters to indicate the type of ARIMA model used. The parameters can be defined as:

p: the number of lag observations in the model, also known as the lag order.

d: the number of times the raw observations are differenced; also known as the degree
of differencing.

q: the size of the moving average window, also known as the order of the moving
average.
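As a brief illustration of the (p, d, q) notation, here is a minimal sketch of fitting an ARIMA model with statsmodels and producing a short forecast; the order (1, 1, 1) and the simulated random-walk series are illustrative assumptions, not recommendations from the original answer.

```python
# Minimal sketch: fit ARIMA(p, d, q) and forecast a few steps ahead.
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(42)
y = pd.Series(np.cumsum(rng.normal(size=200)))   # non-stationary (random-walk) series

model = ARIMA(y, order=(1, 1, 1))   # p=1 lag, d=1 difference, q=1 moving-average term
fitted = model.fit()

print(fitted.summary().tables[1])   # coefficient estimates
print(fitted.forecast(steps=5))     # 5-step-ahead forecast
```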

Pros and Cons of ARIMA

ARIMA models have strong points and are good at forecasting based on past
circumstances, but there are more reasons to be cautious when using ARIMA. In stark
contrast to investing disclaimers that state "past performance is not an indicator of
future performance...," ARIMA models assume that past values have some residual
effect on current or future values and use data from the past to forecast future events.

The following lists summarize other ARIMA traits that demonstrate good and bad characteristics.

Potential pros of using ARIMA models

 Only requires the prior data of a time series to generalize the forecast.
 Performs well on short term forecasts.
 Models non-stationary time series.

Potential cons of using ARIMA models

 Difficult to predict turning points.
 There is quite a bit of subjectivity involved in determining the (p,d,q) order of the model.
 Computationally expensive.
 Poorer performance for long-term forecasts.
 Cannot be used for seasonal time series.
 Less explainable than exponential smoothing.

Q3(A): Explain in detail the seven practice areas of Text Analytics.

Solution:

Text mining and text analytics are broad umbrella terms describing a
range of technologies for analyzing and processing semi-structured and
unstructured text data. The unifying theme behind each of these
technologies is the need to “turn text into numbers” so powerful algorithms
can be applied to large document databases. Converting text into a
structured, numerical format and applying analytical algorithms require
knowing how to both use and combine techniques for handling text,
ranging from individual words to documents to entire document databases.
1. Search and information retrieval (IR): Storage and retrieval of text documents, including search engines and keyword search.

2. Document clustering: Grouping and categorizing terms, snippets, paragraphs, or documents, using data mining clustering methods.

3. Document classification: Grouping and categorizing snippets, paragraphs, or documents, using data mining classification methods, based on models trained on labeled examples.

4. Web mining: Data and text mining on the Internet, with a specific focus on the scale and interconnections of the web.

5. Information extraction (IE): Identification and extraction of relevant facts and relationships from unstructured text; the process of making structured data out of unstructured and semi-structured text.

6. Natural language processing (NLP): Low-level language processing and understanding tasks (e.g., tagging parts of speech); often used synonymously with computational linguistics.

7. Concept extraction: Grouping of words and phrases into semantically similar groups.
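Most of these practice areas rest on the "turn text into numbers" step described above. As a minimal sketch (using scikit-learn, which the original answer does not mention, so treat it as an assumption), here is how three short documents can be converted into a document-term matrix that clustering, classification, or retrieval algorithms can then consume.

```python
# Minimal sketch: build a document-term matrix ("text into numbers").
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "It is going to rain today.",
    "Today I am not going outside.",
    "I am going to watch the season premiere.",
]

vectorizer = CountVectorizer()          # tokenizes and counts terms per document
dtm = vectorizer.fit_transform(docs)    # sparse document-term matrix

print(vectorizer.get_feature_names_out())
print(dtm.toarray())                    # each row represents one document as numbers
```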
Q4(B): How is EDA performed in R?

Solution:

Exploratory Data Analysis (EDA) is a crucial step in the data science process that helps to understand the underlying structure of a data set. One of the most efficient ways to perform EDA is through the use of graphical representations of the data. Graphs can reveal patterns, outliers, and relationships within the data that may not be immediately apparent from the raw data.

R is a popular programming language for data analysis and visualization, and one of the most widely used libraries for creating high-quality, publication-ready graphics is ggplot2.

Some common examples of EDA plots that can be created using ggplot2 include:

Scatter plots: used to visualize the relationship between two variables.

Histograms: used to visualize the distribution of a single variable.

Box plots: used to visualize the distribution of a variable and identify outliers.

Scatter plot matrices (pairs plots): used to identify relationships between all pairs of variables in a data set.

Heatmaps: used to visualize the relationship between two variables by plotting the density of points in 2D space.

Scatter plots with smoothed density estimates: used to understand the distribution of a variable.

Exploratory Data Analysis or EDA is a statistical approach or technique for analyzing data sets in order to summarize their important and main characteristics, generally by using some visual aids. The EDA approach can be used to gather knowledge about the following aspects of data:

Main characteristics or features of the data.

The variables and their relationships.

Finding out the important variables that can be used in our problem.

EDA is an iterative approach that includes:

Generating questions about our data.

Searching for the answers by using visualization, transformation, and modeling of our data.

Exploratory Data Analysis in R

In R Programming Language, we are going to perform EDA under two broad classifications:

Descriptive Statistics, which includes mean, median, mode, inter-quartile range, and so on.

Graphical Methods, which include histograms, density estimation, box plots, and so on.

Before we start working with EDA, we must perform the data inspection properly. Here in our analysis, we will be using the loafercreek data set from the soilDB package in R. We are going to inspect our data in order to find all the typos and blatant errors. Further, EDA can be used to determine and identify outliers and perform the required statistical analysis. For performing the EDA, we will have to install and load the following packages:

“aqp” package

“ggplot2” package

“soilDB” package

We can install these packages from the R console using the install.packages() command and load them into our R script by using the library() command. We will now see how to inspect our data and remove the typos and blatant errors.
Descriptive Statistics in Exploratory Data Analysis in R

For Descriptive Statistics in order to perform EDA in R, we will divide all the
functions into the following categories:

Measures of central tendency

Measures of dispersion

Correlation

We will try to determine the mid-point values using the functions under
the Measures of Central tendency. Under this section, we will be calculating
the mean, median, mode, and frequencies.

Graphical Methods in Exploratory Data Analysis in R

Since we have already checked our data for missing values, blatant errors, and typos, we can now examine our data graphically in order to perform EDA. We will see the graphical representations under the following categories:

Distributions, Scatter and Line plots

Under Distributions, we shall examine our data using the bar plot, histogram, density curve, box plots, and QQ plot.

Q5(A): Enlist and explain the steps of text analysis.

Solution:

Text analysis is a method by which a computer program extracts information from unstructured data and converts it into a form that computers can interpret. In computer science, unstructured data refers to information that either doesn't have a pre-defined data model or doesn't have a defined organizational structure. For example, a customer leaving a text-based product review on a company's website is unstructured data because the reviewer's words don't follow the pre-determined structures and patterns of a computer-readable language, like code. A text analysis program can analyze the comment, compare it to other comments and identify linguistic patterns.

Steps for how to conduct a text analysis:

1. Identify your goals

The first step to conducting a text analysis is identifying your goals. Since
different text analysis methods use data in different ways, taking time to
understand your goals can help you choose the right analytical method for
you. When establishing your goals, consider factors like what type of text you
plan to analyze, what questions you need to answer with data and what
sources you need to use to get relevant information. For example, you may
set a goal to identify customer engagement rates in response to a new social
media marketing campaign and use text analysis to monitor participation.

2. Choose a text analysis method

After establishing what you want to accomplish with your text analysis, choose
the right method for achieving that goal. Some text analysis methods are
better for finding, organizing and storing data, while others can accomplish
goals like flagging information that doesn't translate into computer languages
or summarizing large sets of data. In some cases, you may need to apply
multiple text analysis methods to access the needed information to find, sort,
store and manipulate data.

3. Collect data

Choose the sources from which you plan to collect data. Apply your text
analysis methods to gather data from your selected sources. Some common
sources for text analysis include social media platforms and product review
pages. These sources of data can give you important feedback regarding your
target market, including their needs, preferences and experiences as they
relate to your business and its products.

4. Clean and prepare data

Once you've performed the initial collection, clean and prepare the data for
analysis. Some analytical programs automatically clean the data, which
means removing any data that doesn't meet the needs of the analysis. By
cleaning the data, you eliminate any pieces of information that may reduce the
accuracy of your results. Additionally, you may use a system that
automatically prepares the data by sorting it within defined categories for
future use.

5. Initiate the analysis

You can begin the analysis after preparing the data. The type of analysis you
use depends on the information you need to learn. You may also run multiple
analyses using the same data set to gain insight into it from various
perspectives. For example, you can use the same text sources to analyze the
number of positive product reviews, negative product reviews and neutral
product reviews. This helps you create a comprehensive list of your product's
successes and areas for improvement based on direct
feedback.

6. Organize and visualize data


Once you've completed the analysis, you can interpret and visualize the
results. Review the findings and use data visualization techniques to make the
results easier to understand and share. You may make charts or graphs to
display your findings so it's easy to see the distribution of your data. Apply
these findings to answer questions about your target market and current
business strategies. By identifying patterns in your customer feedback and
other data sources, you can improve your strategies to better meet the
interests and preferences of your target market.
