DAV Solution
Q1
(A) List and explain the different key roles for successful data analytics.
Answer:
Responsibilities include collecting and organizing data, analyzing datasets to identify
trends and patterns, generating reports and visualizations for stakeholders,
maintaining and optimizing databases, solving problems through data-driven
approaches, monitoring business performance, and continuously learning about new
data tools and techniques.
Solution:
Terminologies:
Document Frequency: This measures the importance of a document within the whole
corpus collection, and it is very similar to TF. The only difference is that TF is the
frequency counter for a term t within a single document d, while DF is the count of
documents in the collection N in which the term t occurs. In other words, DF is the
number of documents in which the word is present.
Inverse Document Frequency: Mainly, it tests how relevant a word is. The key aim of
a search is to locate the most appropriate documents matching the query. Since TF
considers all terms equally significant, the term frequencies alone cannot be used to
measure the weight of a term in a document. First, find the document frequency of a
term t by counting the number of documents containing the term:
df(t) = N(t)
where N(t) is the number of documents containing the term t and N is the total
number of documents. The inverse document frequency is then
idf(t) = log(N / df(t))
so that terms appearing in fewer documents receive a higher weight.
The TF-IDF table is then built step by step: list the vocabulary of the documents, find
the TF of each word in each document (and each sentence), and finally build the
model by stacking the scores of all words next to each other in one table.
Using this table, you can easily see that words like 'it', 'is', and 'rain' are important for
Document 1 but not for Documents 2 and 3, which means these words distinguish
Document 1 from the others. You can also say that Documents 1 and 2 talk about
something happening 'today', and Documents 2 and 3 discuss something about the
writer because of the word 'I'. This table helps you find similarities and dissimilarities
between documents, words, and more.
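This calculation can be sketched in a few lines of Python. This is a minimal
illustration; the three documents below are hypothetical, reconstructed to match the
words discussed above:

import math
from collections import Counter

# Hypothetical corpus, chosen to match the discussion above.
docs = [
    "it is going to rain today",
    "today i am not going outside",
    "i am going to watch the season premiere",
]
tokenized = [d.split() for d in docs]
N = len(docs)
vocab = sorted({w for doc in tokenized for w in doc})

# df(t): number of documents in which term t appears.
df = {t: sum(1 for doc in tokenized if t in doc) for t in vocab}

# idf(t) = log(N / df(t)): terms found in fewer documents score higher.
idf = {t: math.log(N / df[t]) for t in vocab}

# tf-idf(t, d) = tf(t, d) * idf(t), computed per document.
for i, doc in enumerate(tokenized):
    tf = Counter(doc)
    scores = {t: (tf[t] / len(doc)) * idf[t] for t in tf}
    print(f"Document {i + 1}:", {t: round(s, 3) for t, s in scores.items()})

Note that a word occurring in every document (such as 'going') gets an IDF of zero,
which is exactly the down-weighting of common terms described above.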
Solution:
Matplotlib: It is a Python library used for plotting graphs with the help of other
libraries like Numpy and Pandas. It is a powerful tool for visualizing data in Python,
used for creating statistical inferences and plotting 2D graphs of arrays. It is free and
open-source and is capable of dealing with various operating systems and their
graphical backends.
Seaborn: It is also a Python library for plotting graphs, which uses Matplotlib,
Pandas, and Numpy. It is built on top of Matplotlib and is considered a superset of
the Matplotlib library. It helps in visualizing univariate and bivariate data, applies
default themes that beautify Matplotlib graphics, and eliminates the overlapping of
graphs.
Features | Matplotlib | Seaborn
Syntax | Uses comparatively complex and lengthy syntax; e.g., for a bar graph: matplotlib.pyplot.bar(x_axis, y_axis). | Uses comparatively simple syntax; e.g., for a bar graph: seaborn.barplot(x_axis, y_axis).
Multiple figures | Multiple figures can be opened, but they need to be closed explicitly; matplotlib.pyplot.close("all") closes all of them. | Automates the creation of multiple figures, which can sometimes lead to out-of-memory issues.
Data frames and arrays | Works efficiently with data frames and arrays, and treats figures and axes as objects. | Is much more functional and organized, treats the whole dataset as a single unit, and its plot() functions take the data set as a parameter.
Use cases | Plots various graphs using Pandas and Numpy. | Uses Matplotlib along with Pandas and Numpy for plotting graphs.
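A short sketch contrasting the two syntaxes on the same data; the column names and
values below are made up for illustration:

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Small made-up dataset.
df = pd.DataFrame({"city": ["A", "B", "C"], "sales": [10, 25, 17]})

# Matplotlib: stateful pyplot API; labels are added manually.
plt.bar(df["city"], df["sales"])
plt.xlabel("city")
plt.ylabel("sales")
plt.title("Matplotlib bar chart")
plt.show()

# Seaborn: operates on the data frame directly, with default themes.
sns.barplot(x="city", y="sales", data=df)
plt.title("Seaborn bar chart")
plt.show()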
Solution:
Time series is a sequence of data points that are recorded over a period of time. The
components of time series are the different patterns that can be observed in the data.
These patterns can be used to understand the underlying behavior of the data and to
forecast future values. The main components are:
Trend: The trend is the long-term direction of the data. It can be upward, downward,
or flat.
Seasonality: Seasonality is a pattern that repeats at a fixed, known interval, such as
daily, weekly, or yearly.
Cycle: A cycle is a pattern in the data that repeats itself after a specific number of
observations, which is not necessarily related to seasonality.
Irregularity: Irregularity is the random variation in the data that cannot be explained
by any of the other components.
These components can be combined to create a model of the time series. This model
can then be used to explain past behavior and to forecast future values.
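As a minimal sketch, these components can be separated with statsmodels'
seasonal_decompose; the monthly series below is synthetic, built from a trend, a
seasonal term, and noise:

import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Synthetic monthly series: upward trend + yearly seasonality + noise.
idx = pd.date_range("2018-01-01", periods=60, freq="MS")
values = (np.linspace(100, 160, 60)                      # trend
          + 10 * np.sin(2 * np.pi * np.arange(60) / 12)  # seasonality
          + np.random.normal(0, 2, 60))                  # irregularity
series = pd.Series(values, index=idx)

# Additive decomposition into trend, seasonal, and residual parts.
result = seasonal_decompose(series, model="additive", period=12)
print(result.trend.dropna().head())
print(result.seasonal.head(12))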
Solution:
Phase 1: Discovery –
The team learns about the data sources needed and available for the project.
The team formulates initial hypotheses that can later be tested with data.
Phase 2: Data Preparation –
This phase includes steps to explore, preprocess, and condition data prior to modeling
and analysis.
It requires the presence of an analytic sandbox; the team executes extract, load, and
transform (ELT) processes to get data into the sandbox.
Data preparation tasks are likely to be performed multiple times, and not in a
predefined order.
Several tools commonly used for this phase are Hadoop, Alpine Miner, OpenRefine,
etc.
Phase 3: Model Planning –
The team explores the data to learn about the relationships between variables and
subsequently selects the key variables and the most suitable models.
Phase 4: Model Building –
In this phase, the data science team develops data sets for training, testing, and
production purposes.
The team builds and executes models based on the work done in the model planning
phase.
Several tools commonly used for this phase are MATLAB and STATISTICA.
The team also considers whether its existing tools will suffice for running the models
or whether a more robust environment is needed for executing them.
Phase 5: Communicate Results –
The team considers how best to articulate the findings and outcomes to the various
team members and stakeholders, taking into account caveats and assumptions.
The team should identify key findings, quantify the business value, and develop a
narrative to summarize and convey the findings to stakeholders.
Phase 6: Operationalize –
The team communicates the benefits of the project more broadly and sets up a pilot
project to deploy the work in a controlled way before broadening it to the full
enterprise of users.
This approach enables the team to learn about the performance and related constraints
of the model in a production environment on a small scale, and to make adjustments
before full deployment.
Solution:
ARIMA models are a class of statistical models that describe the patterns and trends in
time series data. They consist of three main components: autoregression, integration,
and moving average. Autoregression means that the current value of the series depends
on its past values, with some lag; integration refers to differencing the raw series to
make it stationary; and the moving-average component models the current value in
terms of past forecast errors.
ARIMA Parameters
p: the number of lag observations in the model, also known as the lag order.
d: the number of times the raw observations are differenced; also known as the degree
of differencing.
q: the size of the moving average window, also known as the order of the moving
average.
ARIMA models have strong points and are good at forecasting based on past
circumstances, but there are also reasons to be cautious when using ARIMA. In stark
contrast to investing disclaimers that state that "past performance is not an indicator of
future performance," ARIMA models assume that past values have some residual
effect on current or future values and use data from the past to forecast future events.
The following traits demonstrate its good characteristics:
Requires only the prior data of a time series to generalize the forecast.
Performs well on short-term forecasts.
Can model non-stationary time series.
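A minimal sketch of fitting an ARIMA model with statsmodels; the series and the
order=(1, 1, 1) choice are illustrative, not tuned:

import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Illustrative non-stationary series: a random walk with drift.
np.random.seed(0)
y = pd.Series(np.cumsum(np.random.normal(0.5, 1.0, 200)))

# order=(p, d, q): one lag term, one difference, one moving-average term.
model = ARIMA(y, order=(1, 1, 1))
fitted = model.fit()

# Short-term forecast, which is where ARIMA tends to perform best.
print(fitted.forecast(steps=5))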
Solution:
Text mining and text analytics are broad umbrella terms describing a
range of technologies for analyzing and processing semi-structured and
unstructured text data. The unifying theme behind each of these
technologies is the need to “turn text into numbers” so powerful algorithms
can be applied to large document databases. Converting text into a
structured, numerical format and applying analytical algorithms require
knowing how to both use and combine techniques for handling text,
ranging from individual words to documents to entire document databases.
1. Search and information retrieval (IR): Storage and retrieval of text
documents, including search engines and keyword search.
2. Document clustering: Grouping and categorizing terms, snippets, paragraphs,
or documents using data mining clustering methods.
3. Document classification: Grouping and categorizing snippets, paragraphs, or
documents using data mining classification methods trained on labeled examples.
4. Web mining: Data and text mining on the Internet, with a specific focus on
the scale and interconnections of the web.
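Turning text into numbers typically starts with a document-term matrix. Here is a
minimal sketch using scikit-learn's CountVectorizer, with placeholder documents:

from sklearn.feature_extraction.text import CountVectorizer

# Placeholder documents standing in for a large text database.
docs = [
    "search engines retrieve text documents",
    "keyword search finds relevant documents",
    "web mining studies links between pages",
]

# Each row is a document; each column counts one vocabulary term.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())
print(X.toarray())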
Solution:
Exploratory Data Analysis (EDA) is a crucial step in the data science process
that helps to understand the underlying structure of a data set. One of the
most efficient ways to perform EDA is through the use of graphical
representations of the data. Graphs can reveal patterns, outliers, and
relationships within the data that may not be immediately apparent from the
raw data.
Some common examples of EDA plots that can be created using ggplot2 include bar
plots, histograms, density curves, box plots, and QQ plots. EDA also helps in finding
out the important variables that can be used in our problem.
Before we start working with EDA, we must perform the data inspection
properly. Here in our analysis, we will be using the loafercreek data set from
the soilDB package in R. We are going to inspect our data in order to find all
the typos and blatant errors. Further, EDA can be used to identify the outliers
and perform the required statistical analysis. For
performing the EDA, we will have to install and load the following packages:
“aqp” package
“ggplot2” package
“soilDB” package
For descriptive statistics, in order to perform EDA in R, we will divide all the
functions into the following categories:
Measures of central tendency
Measures of dispersion
Correlation
We will try to determine the mid-point values using the functions under
the Measures of Central tendency. Under this section, we will be calculating
the mean, median, mode, and frequencies.
Since we have already checked our data for missing values, blatant errors,
and typos, we can now examine our data graphically in order to perform EDA.
We will see the graphical representation under categories such as Distribution.
Under Distribution, we shall examine our data using bar plots, histograms,
density curves, box plots, and QQ plots.
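The workflow above is written for R (aqp, ggplot2, soilDB). As a language-neutral
sketch of the same descriptive statistics and distribution checks, here is a minimal
pandas version on made-up data:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up numeric column standing in for a real soil-survey variable.
np.random.seed(1)
df = pd.DataFrame({"clay": np.random.normal(25, 5, 200)})

# Measures of central tendency and dispersion.
print(df["clay"].mean(), df["clay"].median(), df["clay"].mode()[0])
print(df["clay"].std(), df["clay"].quantile([0.25, 0.75]))

# Distribution checks: histogram and box plot.
df["clay"].plot(kind="hist", bins=20, title="clay histogram")
plt.show()
df["clay"].plot(kind="box", title="clay box plot")
plt.show()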
Solution:
1. Identify your goals
The first step to conducting a text analysis is identifying your goals. Since
different text analysis methods use data in different ways, taking time to
understand your goals can help you choose the right analytical method for
you. When establishing your goals, consider factors like what type of text you
plan to analyze, what questions you need to answer with data and what
sources you need to use to get relevant information. For example, you may
set a goal to identify customer engagement rates in response to a new social
media marketing campaign and use text analysis to monitor participation.
2. Choose an analysis method
After establishing what you want to accomplish with your text analysis, choose
the right method for achieving that goal. Some text analysis methods are
better for finding, organizing and storing data, while others can accomplish
goals like flagging information that doesn't translate into computer languages
or summarizing large sets of data. In some cases, you may need to apply
multiple text analysis methods to access the needed information to find, sort,
store and manipulate data.
3. Collect data
Choose the sources from which you plan to collect data. Apply your text
analysis methods to gather data from your selected sources. Some common
sources for text analysis include social media platforms and product review
pages. These sources of data can give you important feedback regarding your
target market, including their needs, preferences and experiences as they
relate to your business and its products.
4. Clean and prepare the data
Once you've performed the initial collection, clean and prepare the data for
analysis. Some analytical programs automatically clean the data, which
means removing any data that doesn't meet the needs of the analysis. By
cleaning the data, you eliminate any pieces of information that may reduce the
accuracy of your results. Additionally, you may use a system that
automatically prepares the data by sorting it within defined categories for
future use.
5. Analyze the data
You can begin the analysis after preparing the data. The type of analysis you
use depends on the information you need to learn. You may also run multiple
analyses using the same data set to gain insight into it from various
perspectives. For example, you can use the same text sources to analyze the
number of positive product reviews, negative product reviews and neutral
product reviews. This helps you create a comprehensive list of your product's
successes and areas for improvement based on direct
feedback.
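As a toy illustration of the review analysis described above, the sketch below buckets
reviews as positive, negative, or neutral using invented keyword lists; a real analysis
would use a trained sentiment model:

# Invented reviews and keyword lists, for illustration only.
reviews = [
    "great product, works perfectly",
    "terrible quality, broke in a week",
    "arrived on time",
]
positive = {"great", "perfectly", "love", "excellent"}
negative = {"terrible", "broke", "poor", "awful"}

counts = {"positive": 0, "negative": 0, "neutral": 0}
for review in reviews:
    words = set(review.lower().split())  # punctuation handling omitted for brevity
    if words & positive:
        counts["positive"] += 1
    elif words & negative:
        counts["negative"] += 1
    else:
        counts["neutral"] += 1
print(counts)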