DV Mid Term Notes

UNIT 1

4, 6, 10 MARK QUESTIONS
Explain the difference between qualitative (categorical) and quantitative (numerical) data, with examples.

Feature | Qualitative (Categorical) Data | Quantitative (Numerical) Data
Definition | Data that describes categories or qualities. | Data that represents numbers and can be measured.
Nature | Non-numeric (unless coded) | Numeric
Purpose | To classify or label things | To calculate, measure, and perform mathematical operations
Examples | Gender (Male, Female), Color (Red, Blue), Type of Car | Age (21, 35), Height (160 cm), Salary (₹50,000), Marks (85)
Types | Nominal, Ordinal | Discrete, Continuous

What is statistics? List the key steps involved in a statistical analysis process.
Statistics is the branch of mathematics that deals with collecting, organizing,
analyzing, interpreting, and presenting data to support decision-making and
problem-solving.
It helps in understanding patterns, trends, and making predictions based on data.

Key Steps in Statistical Analysis Process


1. Problem Definition
o Clearly define the question or problem to be solved using data.
2. Data Collection
o Gather relevant data from surveys, experiments, databases, or
observations.
3. Data Organization and Cleaning
o Arrange data in tables or spreadsheets and remove errors or missing
values.
4. Data Analysis
o Apply statistical methods (mean, median, regression, etc.) to
summarize and understand the data.
5. Interpretation of Results
o Explain what the results mean in the context of the problem.
6. Presentation of Findings
o Use charts, graphs, and reports to communicate the results
effectively.

Define population, sample, and variable in the context of statistical analysis.


 Population
A population is the entire group of individuals or items that we want to study or
draw conclusions about.
 Example: All students in a university.
 Sample
A sample is a subset of the population, selected for the actual study. It is used
when studying the whole population is impractical.
 Example: 200 students selected from the university for a survey.
 Variable
A variable is any characteristic or attribute that can be measured or observed
and varies among individuals in the population.
 Example: Age, gender, marks, height.

What is a random variable? Distinguish between discrete and continuous random variables.
A random variable is a numerical outcome of a random experiment. It assigns a
real number to each possible outcome of that experiment.
 It can be thought of as a function that maps outcomes of a chance process
to numbers.

Type | Discrete Random Variable | Continuous Random Variable
Definition | Takes countable values (finite or countably infinite). | Takes uncountable values (within a range or interval).
Nature | Values are separated, often integers. | Values are measurable, can include decimals.
Examples | Number of students in a class, dice rolls (1–6), number of cars. | Height of students, temperature, weight, time taken to run a race.

Given the following dataset:


[5, 10, 15, 20, 25, 30, 35]
Calculate the mean, median, and mode.
1. Mean (Average)
Mean = (Sum of all values) / (Number of values)
     = (5 + 10 + 15 + 20 + 25 + 30 + 35) / 7 = 140 / 7 = 20

2. Median (Middle value)


 Arrange the data in ascending order (already done).
 Number of values = 7 (odd), so median is the middle value.
Median = 4th value = 20

3. Mode (Most frequent value)


 All values appear only once → No repetition.
Mode = No mode (all values are unique).
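
In R, the mean and median are built in; base R has no function for the statistical mode (its mode() reports storage type), so a small helper is sketched below:

x <- c(5, 10, 15, 20, 25, 30, 35)

mean(x)    # 20
median(x)  # 20

# Helper for the statistical mode; returns NA when every value is unique
get_mode <- function(v) {
  tab <- table(v)
  modes <- names(tab)[tab == max(tab)]
  if (length(modes) == length(unique(v))) NA else as.numeric(modes)
}
get_mode(x)  # NA -> no mode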
Analyze the pros and cons of using surveys versus observations in collecting data
for a study on customer behavior in retail stores.
Surveys
Pros:
1. Direct Feedback – Surveys allow researchers to collect specific
information directly from customers about their preferences, opinions, and
motivations.
2. Scalability – Large samples can be reached quickly, especially using online
or digital platforms.
3. Cost-Effective – Especially if conducted online, surveys can be less
expensive than observational studies.
Cons:
1. Response Bias – Customers may give socially desirable answers or may
not answer honestly.
2. Limited Depth – Surveys often fail to capture subtle behaviors or non-
verbal cues.
3. Low Response Rate – Especially in voluntary surveys, many customers
may choose not to participate.
Observations
Pros:
1. Actual Behavior – Observations capture real customer actions rather than
reported behavior, leading to more accurate data.
2. Contextual Insights – Allows researchers to understand customer behavior
in a real-world environment, including interactions with products and staff.
3. Non-intrusive – If done discreetly, it doesn’t interrupt or influence
customer actions.
Cons:
1. Time-Consuming – Observing customers and analyzing footage or notes
takes a lot of time.
2. Subjectivity – Interpretation of observed behavior can vary between
observers.
3. Ethical Concerns – If not done transparently, observation may raise privacy
issues.

You are conducting research on students' study habits in a university. Choose an appropriate sampling technique and explain how you would implement it step by step.
An appropriate sampling technique for studying students' study habits in a
university is Stratified Random Sampling. This method ensures representation
from different groups within the student population, such as year of study or academic program.
Why Stratified Sampling?
 The university has a diverse population (e.g., first-year, second-year,
different departments).
 Stratified sampling ensures fair representation of each subgroup.

Implementation Steps:
1. Identify Strata (Subgroups):
Divide the student population into clear strata based on relevant
characteristics, such as:
o Year of study (1st year, 2nd year, 3rd year, etc.)
o Academic program (Engineering, Arts, Commerce, etc.)
2. Create a Sampling Frame for Each Stratum:
Make a list of all students in each group using the university database.
3. Decide Sample Size for Each Stratum:
Choose a proportional or equal number of students from each group
depending on research goals.
Example: If 30% of the students are from Engineering, then 30% of the
sample should also be from Engineering.
4. Use Random Sampling Within Each Stratum:
Select students randomly from each subgroup. This could be done using
software or a random number generator.
5. Collect Data:
Distribute the questionnaire or conduct interviews with the selected
students from all strata.
6. Analyze by Stratum if Needed:
Compare study habits across different years or departments to gain deeper insights.
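
A minimal implementation sketch in R using dplyr, assuming a hypothetical students data frame with year and program columns (names and sizes are illustrative, not from the source):

library(dplyr)

# Hypothetical sampling frame built from the university database
students <- data.frame(
  id      = 1:500,
  year    = sample(c("1st", "2nd", "3rd"), 500, replace = TRUE),
  program = sample(c("Engineering", "Arts", "Commerce"), 500, replace = TRUE)
)

# Proportional allocation: a 10% random sample within each stratum
set.seed(42)
strat_sample <- students %>%
  group_by(year, program) %>%
  slice_sample(prop = 0.10) %>%
  ungroup()

table(strat_sample$year, strat_sample$program)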

List any four real-life fields where statistics is commonly applied.


1. Healthcare and Medicine:
o Application: Statistics is used in clinical trials to test the
effectiveness of new drugs or treatments, track disease spread, and
analyze patient outcomes.
o Example: Analyzing survival rates of patients with specific types of
cancer based on treatment.
2. Business and Marketing:
o Application: Companies use statistics for market research,
customer satisfaction surveys, sales forecasting, and product
development.
o Example: Analyzing customer purchasing patterns to improve
advertising strategies.
3. Education:
o Application: Statistics helps in analyzing student performance,
evaluating teaching methods, and assessing educational programs.
o Example: Analyzing standardized test scores to evaluate student
achievement levels across regions.
4. Sports and Performance Analytics:
o Application: In sports, statistics is used to analyze player
performance, track team progress, and predict outcomes.
o Example: Using player statistics to make decisions on team lineups
or player contracts.
Given the following data on test scores:
[10, 20, 20, 30, 30, 30, 40, 40, 50, 60]
Create a frequency table
Draw the corresponding histogram
Frequency Table:

Test Score | Frequency
10         | 1
20         | 2
30         | 3
40         | 2
50         | 1
60         | 1
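
The table and histogram can be reproduced in base R (a short sketch; the bin width of 10 is chosen so each distinct score gets its own bar):

scores <- c(10, 20, 20, 30, 30, 30, 40, 40, 50, 60)

table(scores)  # frequency table

hist(scores, breaks = seq(5, 65, by = 10),
     main = "Histogram of Test Scores",
     xlab = "Test Score", ylab = "Frequency",
     col = "steelblue")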
Explain the difference between positively skewed, negatively skewed, and
symmetrically distributed data.
 Positively Skewed (Right Skewed) Data:
 Characteristics:
o The mean is greater than the median.
o The tail on the right side of the distribution is longer.
o It indicates that there are few extreme high values pulling the
distribution to the right.
 Example: Income distribution in many populations, where most people
earn average or lower wages, but a few individuals earn very high salaries,
skewing the data to the right.
 Negatively Skewed (Left Skewed) Data:
 Characteristics:
o The mean is less than the median.
o The tail on the left side of the distribution is longer.
o It indicates that there are few extreme low values pulling the
distribution to the left.
 Example: Age at retirement in a population, where most people retire
between 55-70 years old, but a few people retire much earlier, say in their
30s or 40s, causing a left skew.
 Symmetrically Distributed Data:
 Characteristics:
o The mean is equal to the median.
o There is no skewness; the data is evenly spread.
o The distribution looks similar on both sides of the central value.
 Example: Heights of adult humans (in a well-defined population), where
most people have average heights and there are fewer people who are
extremely short or tall, creating a balanced distribution.
Differentiate between measures of central tendency and measures of dispersion.
Why are both important in descriptive statistics?
1. Measures of Central Tendency: Measures of central tendency are statistical
tools that describe the center or typical value of a dataset. They help summarize
the data with a single value that represents the entire dataset. The three most
common measures of central tendency are:
 Mean: The average of all data points, calculated by summing all the values
and dividing by the number of values.
 Median: The middle value when the data is arranged in ascending or
descending order. If the number of data points is odd, the median is the
middle number; if even, it's the average of the two middle numbers.
 Mode: The value that appears most frequently in the dataset. A dataset can
have more than one mode (bimodal, multimodal) or none at all.
2. Measures of Dispersion: Measures of dispersion (or spread) describe the
extent to which data points in a dataset differ from the central value (mean,
median, or mode). These measures give us an idea of how spread out or clustered
the data points are. Common measures of dispersion include:
 Range: The difference between the maximum and minimum values in the
dataset.
 Variance: The average of the squared differences from the mean. It
measures how spread out the data points are around the mean.
 Standard Deviation: The square root of the variance. It provides a measure
of the average distance between each data point and the mean.

Both measures of central tendency and measures of dispersion are critical in understanding the overall characteristics of a dataset:
 Central Tendency: These measures give a quick summary of the dataset,
allowing us to understand the typical or central value of the data. For
instance, knowing the average salary in a company helps understand the
general compensation level.
 Dispersion: Measures of dispersion show how much variability exists in the
data. Without this, we might misinterpret the data. For example, two
datasets may have the same mean, but if one has a high variance and the
other a low variance, they are very different in terms of data spread. This is important for understanding the consistency or predictability of the data.
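
Both groups of measures are one-liners in base R (a sketch with made-up values):

x <- c(12, 15, 15, 18, 20, 22, 30)

mean(x); median(x)   # central tendency
max(x) - min(x)      # range
var(x)               # sample variance (divides by n - 1)
sd(x)                # sample standard deviation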

Explain how inferential statistics helps draw conclusions about a population using data from a sample. Include the role of hypothesis testing in this process.
Inferential statistics is a branch of statistics that helps us make conclusions
about a larger population based on a sample of data. Since it is often impractical
or impossible to collect data from the entire population, inferential statistics
provides a method to estimate population parameters (such as the mean,
proportion, or variance) using the data from a smaller, representative sample.
The main objective of inferential statistics is to draw conclusions, make
predictions, and test hypotheses about the population based on the sample data.
This is achieved through various techniques like confidence intervals,
hypothesis testing, and regression analysis.

The Process of Inferential Statistics:


 Sample vs. Population: A sample is a subset of the population, selected
for analysis. The sample is assumed to be representative of the population,
meaning that the characteristics of the sample (e.g., mean, variance) can
be used to make inferences about the entire population.
 Estimating Population Parameters: With sample data, we estimate
population parameters. For example, the sample mean is used as an
estimate of the population mean, and the sample standard deviation is
used as an estimate of the population standard deviation.
 Confidence Intervals: A confidence interval is a range of values that is
likely to contain the population parameter with a certain level of
confidence (usually 95% or 99%). It gives us a measure of how reliable our
estimate is. For example, if we calculate a 95% confidence interval for the
mean, it means we are 95% confident that the true population mean lies
within this range.
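For instance, a 95% confidence interval for a mean can be pulled from t.test() in R (a sketch with made-up sample values):

sample_data <- c(52, 48, 55, 60, 47, 51, 58, 53, 49, 56)
t.test(sample_data, conf.level = 0.95)$conf.int  # 95% CI for the population mean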
Role of Hypothesis Testing in Inferential Statistics:
Hypothesis testing is a key method in inferential statistics used to assess whether
there is enough evidence to support a specific claim or hypothesis about a
population. It helps us make data-driven decisions and conclusions.
Steps in Hypothesis Testing:
 Step 1: Formulate Hypotheses
o Null Hypothesis (H₀): The null hypothesis suggests that there is no
effect, difference, or relationship in the population. It is the
hypothesis that is typically tested.
o Alternative Hypothesis (H₁ or Ha): The alternative hypothesis
suggests that there is an effect, difference, or relationship that
contradicts the null hypothesis.
Example:
o H₀: There is no difference in average test scores between two
teaching methods.
o H₁: There is a difference in average test scores between the two
teaching methods.
 Step 2: Choose Significance Level (α)
o The significance level (α) is the probability threshold for rejecting the
null hypothesis. Commonly, α is set to 0.05, meaning that there is a
5% risk of rejecting the null hypothesis when it is actually true (Type I
error).
 Step 3: Collect Sample Data
o Data is collected from a sample that is representative of the
population of interest.
 Step 4: Perform the Statistical Test
o Choose the appropriate statistical test (e.g., t-test, chi-square test,
ANOVA) based on the type of data and the research question. The
test will produce a test statistic (such as t, z, or chi-square) that
measures how far the sample data deviates from the null hypothesis.
 Step 5: Calculate the p-value
o The p-value is the probability of observing the sample data (or
something more extreme) if the null hypothesis were true. It helps to
assess the strength of the evidence against the null hypothesis.
o If the p-value is smaller than the chosen significance level (α), we
reject the null hypothesis.
 Step 6: Make a Decision
o Reject H₀: If the p-value is less than α, we reject the null hypothesis
and conclude that there is enough evidence to support the
alternative hypothesis.
o Fail to Reject H₀: If the p-value is greater than α, we fail to reject the
null hypothesis and conclude that there is insufficient evidence to
support the alternative hypothesis.
Example: If the p-value is 0.03 and α is 0.05, we reject H₀ and conclude that there is a significant difference in test scores between the two teaching methods.
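
The two-teaching-methods example could be run in R roughly as follows (hypothetical score vectors, Welch two-sample t-test):

method_a <- c(72, 75, 78, 80, 74, 77, 79)
method_b <- c(68, 70, 73, 69, 71, 72, 70)

test <- t.test(method_a, method_b)  # H0: equal mean scores
test$p.value                        # reject H0 if below alpha = 0.05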

Explain the key characteristics of a normal distribution. How does it differ from other probability distributions?
Key Characteristics of a Normal Distribution
A normal distribution, often called the Gaussian distribution, is one of the most
important probability distributions in statistics. It is widely used due to its natural
occurrence in many real-world phenomena, such as heights, test scores, and
measurement errors.
The key characteristics of a normal distribution are:
1. Bell-Shaped Curve:
o The normal distribution is symmetrically bell-shaped. The graph of
the distribution has a peak at the mean, and the tails extend
infinitely in both directions, approaching but never touching the x-
axis.
2. Symmetry:
o The normal distribution is perfectly symmetric around its mean. This
means that the left side of the distribution is a mirror image of the
right side.
3. Mean, Median, and Mode Are Equal:
o In a perfectly normal distribution, the mean, median, and mode are
all located at the same point, which is the center of the distribution.
4. Defined by Two Parameters:
o The shape of the normal distribution is determined by two
parameters:
 Mean (μ): This is the center of the distribution and represents
the average of all data points.
 Standard Deviation (σ): This measures the spread or
dispersion of the distribution. A larger standard deviation
results in a wider, flatter curve, while a smaller standard
deviation produces a narrower, taller curve.
5. 68-95-99.7 Rule (Empirical Rule):
o In a normal distribution:
 About 68% of the data lies within 1 standard deviation from
the mean.
 About 95% of the data lies within 2 standard deviations from
the mean.
 About 99.7% of the data lies within 3 standard deviations
from the mean.
o This rule helps to quickly understand how data is spread out in a
normal distribution.
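
The empirical rule can be checked numerically in R with pnorm() (a sketch for the standard normal):

pnorm(1) - pnorm(-1)  # ~0.6827 within 1 SD
pnorm(2) - pnorm(-2)  # ~0.9545 within 2 SD
pnorm(3) - pnorm(-3)  # ~0.9973 within 3 SD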

Characteristic | Normal Distribution | Other Distributions
Shape | Bell-shaped, symmetric | Varies (skewed, multi-modal, etc.)
Symmetry | Perfect symmetry | Not always symmetric (e.g., exponential)
Parameters | 2 parameters (mean, std) | Varies (depends on distribution)
Data Type | Continuous | Can be continuous or discrete
Skewness | No skew (skewness = 0) | Can be skewed (positive or negative)
Use Cases | Natural phenomena, errors | Varies by context (e.g., binomial for trials, Poisson for rare events)
Kurtosis | 3 (mesokurtic) | Varies (e.g., Poisson has kurtosis > 3)

Explain the common methods of data collection used in statistical studies.


1. Surveys and Questionnaires
 Definition: A survey is a method of data collection in which respondents
are asked to answer questions related to the study's objectives. It is
typically done through structured questionnaires that include closed-
ended, open-ended, or a mix of both types of questions.
 Advantages:
o Can reach a large number of people.
o Cost-effective, especially with online surveys.
o Flexible in terms of question types.
 Disadvantages:
o Potential bias if the sample is not representative.
o Non-response or incomplete answers can affect the quality of data.

2. Interviews
 Definition: An interview involves direct interaction between the researcher
and the participant, where the researcher asks questions and records the
answers. Interviews can be structured, semi-structured, or unstructured.
 Advantages:
o Provides in-depth, qualitative data.
o Can clarify responses and ask follow-up questions.
 Disadvantages:
o Time-consuming and resource-intensive.
o Risk of interviewer bias in the interpretation of responses.

3. Observations
 Definition: In observational data collection, the researcher directly
observes and records behaviors, actions, or phenomena as they occur in a
natural or controlled setting. This can be done with or without the
participants' awareness.
 Advantages:
o Provides real-time data and natural behavior.
o Useful for studies where direct questioning is impractical.
 Disadvantages:
o Observer bias may influence the data.
o Can be time-consuming and difficult to manage in large groups.

4. Experiments
 Definition: Experiments involve manipulating one or more independent
variables to observe the effect on a dependent variable, often under
controlled conditions. The researcher intervenes to create different
conditions and collects data based on the outcomes.
 Advantages:
o Can establish cause-and-effect relationships.
o Can be replicated to confirm findings.
 Disadvantages:
o Can be expensive and time-consuming.
o Ethical concerns may arise, especially in certain experiments
involving humans.

5. Focus Groups
 Definition: A focus group is a qualitative method where a small group of
people (usually 6-10) are brought together to discuss a particular topic or
issue. A moderator guides the discussion, ensuring all members
contribute.
 Advantages:
o Provides in-depth insights and diverse perspectives.
o Allows exploration of attitudes, perceptions, and opinions.
 Disadvantages:
o Not statistically representative of the larger population.
o Group dynamics can influence the responses, and dominant
participants may overshadow others.

6. Case Studies
 Definition: A case study is an in-depth analysis of a single case (or a few
cases) within its real-life context. Case studies are often used in fields like
medicine, psychology, business, and law.
 Advantages:
o Provides detailed and comprehensive data.
o Useful for studying rare or unique phenomena.
 Disadvantages:
o Limited generalizability due to the focus on a single case or small
group.
o Potential researcher bias in interpreting the data.

7. Secondary Data Collection


 Definition: Secondary data refers to data that has already been collected
for other purposes but is used by the researcher for a new analysis. This
data is typically obtained from existing records, reports, or databases.
 Types:
o Public Databases: Government reports, census data, research
studies, etc.
o Private Data: Company records, sales data, customer data.

What is sampling? Name any four types of sampling techniques.


Sampling is the process of selecting a representative subset (sample) from a
larger group (population) in order to study characteristics of the whole population
without surveying every individual. It plays a crucial role in descriptive and
inferential statistics by enabling cost-effective and time-efficient data collection.

Why is Sampling Important?


 Reduces the time, effort, and cost required to study large populations
 Enables researchers to make inferences about a population
 Allows focus on accuracy and depth in analysis
 Necessary when population access is limited or impractical

Four Common Types of Sampling Techniques


1. Simple Random Sampling (SRS)
 Type: Probability Sampling
 Definition: Every individual in the population has an equal and
independent chance of being chosen.
 Example: Randomly selecting 50 students from a list of 500 using a
random number generator.
 Advantage: Eliminates selection bias and supports statistical validity.
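
The example above maps directly to base R's sample() (a sketch):

set.seed(1)
student_ids <- 1:500
srs <- sample(student_ids, size = 50)  # each student has an equal chance
head(srs)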

2. Stratified Sampling
 Type: Probability Sampling
 Definition: The population is divided into strata (subgroups) based on
specific characteristics, and random samples are taken from each stratum.
 Example: Dividing a population by income levels and sampling equally
from each income group.
 Advantage: Ensures representation from all relevant subgroups.

3. Systematic Sampling
 Type: Probability Sampling
 Definition: Every kᵗʰ element in the population list is selected after a
random starting point.
 Example: Selecting every 5th person on a list after randomly starting at the
3rd.
 Advantage: Easy to implement and evenly distributes the sample across
the population.
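
A sketch of the same idea in R (hypothetical population of 500, every 5th element after a random start):

set.seed(7)
N <- 500; k <- 5
start <- sample(1:k, 1)                  # random starting point within the first k
systematic_ids <- seq(start, N, by = k)  # every kth element
length(systematic_ids)                   # roughly N / k selected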

4. Cluster Sampling
 Type: Probability Sampling
 Definition: The population is divided into clusters (often geographically),
and entire clusters are randomly selected for study.
 Example: Randomly selecting 5 schools out of 50 and surveying all
students in those schools.
 Advantage: Cost-effective and practical for large, spread-out populations.
 Limitation: May introduce bias if clusters are not similar.

UNIT 2
4, 6, 10 MARK QUESTIONS

Name at least five types of plots or charts that can be created in R.


 Bar Chart
 Used to display categorical data with rectangular bars.
 Function: barplot()
 Histogram
 Shows the distribution of numerical data by dividing it into bins.
 Function: hist()
 Line Chart
 Used to display trends over time or ordered data points.
 Function: plot(type = "l")
 Box Plot (Box-and-Whisker Plot)
 Summarizes data using median, quartiles, and outliers.
 Function: boxplot()
 Scatter Plot
 Displays the relationship between two numerical variables.
 Function: plot()
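
One-line sketches of each, using the built-in mtcars dataset:

barplot(table(mtcars$cyl), main = "Cars by Cylinder Count")   # bar chart
hist(mtcars$mpg, main = "MPG Distribution")                   # histogram
plot(mtcars$mpg, type = "l", main = "MPG as a Line")          # line chart
boxplot(mpg ~ cyl, data = mtcars, main = "MPG by Cylinders")  # box plot
plot(mtcars$wt, mtcars$mpg, main = "Weight vs MPG")           # scatter plot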

Name at least five key functions provided by the dplyr package in R.


 filter()
 Used to select rows based on specific conditions.
 Example: filter(data, age > 18)
 select()
 Used to choose specific columns from a dataset.
 Example: select(data, name, age)
 arrange()
 Used to sort rows by column values, in ascending or descending order.
 Example: arrange(data, salary)
 mutate()
 Used to add new columns or modify existing ones.
 Example: mutate(data, age_in_months = age * 12)
 summarise() (or summarize())
 Used to generate summary statistics, often combined with group_by().
 Example: summarise(data, avg_salary = mean(salary))

What are three key characteristics of R programming that make it popular for data
analysis?
 Rich Set of Packages and Libraries
 R provides powerful packages like ggplot2, dplyr, tidyr, and caret for data
manipulation, visualization, and machine learning.
 Thousands of packages are available via CRAN.
 Excellent Data Visualization Support
 R is known for its high-quality and customizable plots.
 Tools like ggplot2 help create professional graphs easily.
 Statistical and Analytical Strength
 R was built for statistical computing.
 It supports a wide range of statistical tests, models, and techniques,
making it ideal for deep data analysis.
What is descriptive data analysis? Name at least three common techniques used
in descriptive data analysis.
Descriptive data analysis is the process of summarizing and organizing data to
understand its basic features. It focuses on what the data shows, without
making predictions or drawing conclusions beyond the data.
It helps in identifying patterns, trends, and distributions within a dataset.

Three Common Techniques in Descriptive Data Analysis


1. Measures of Central Tendency
o Describe the center of the data.
o Includes Mean, Median, and Mode.
2. Measures of Dispersion
o Show how spread out the data is.
o Includes Range, Variance, and Standard Deviation.
3. Data Visualization
o Helps represent data visually to find patterns.
o Includes Bar Charts, Histograms, Box Plots, and Pie Charts.

Name at least five key functions in the ggplot2 package used for creating plots in
R.
1. ggplot()
o Initializes the plotting system.
o Example: ggplot(data, aes(x, y))
2. geom_point()
o Creates scatter plots.
o Example: geom_point() adds dots to represent data points.
3. geom_bar()
o Creates bar charts for categorical data.
o Example: geom_bar(stat = "count")
4. geom_line()
o Draws line graphs, useful for time series or trends.
o Example: geom_line()
5. labs()
o Adds labels like title, x-axis, and y-axis names.
o Example: labs(title = "Sales", x = "Month", y = "Revenue")

Explain how mutate() and summarise() work differently in terms of how they transform or summarize a dataset. When would you use each function?
1. mutate() – Add or Modify Columns
 Purpose:
mutate() is used to create new columns or modify existing ones in a
dataset.
 How it Works:
It works row-wise, meaning it adds a new value to each row based on a
formula or condition.
 Example Use Case:
Suppose a dataset has two columns: math_score and science_score.
Using mutate(), we can add a new column average_score = (math_score +
science_score)/2.
 Result:
The dataset remains the same size (same number of rows), but with extra
columns.

2. summarise() – Reduce to Summary


 Purpose:
summarise() is used to compute summary statistics, like mean, sum,
count, etc.
 How it Works:
It works column-wise, often in combination with group_by(), and reduces
the dataset to fewer rows (typically 1 row per group or for the entire data).
 Example Use Case:
To find the average score of all students, or average score by gender if used
with group_by(gender).
 Result:
The dataset becomes shorter; only the summary rows remain.
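
A side-by-side sketch (hypothetical scores data, dplyr):

library(dplyr)

scores <- data.frame(
  gender        = c("M", "F", "M", "F"),
  math_score    = c(80, 90, 70, 85),
  science_score = c(75, 95, 65, 88)
)

# mutate(): row count unchanged, one new column per row
scores <- scores %>%
  mutate(average_score = (math_score + science_score) / 2)

# summarise(): collapses to one row per group
scores %>%
  group_by(gender) %>%
  summarise(avg = mean(average_score), .groups = "drop")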

What is the reshape2 package in R? Name at least two functions available in the reshape2 package.
The reshape2 package in R is used for reshaping data, particularly converting data
between "wide" and "long" formats. It is an essential tool for data manipulation
and analysis, especially when preparing data for modeling or visualization. The
package simplifies the process of reshaping data frames and is especially useful
for datasets that have multiple measurements or variables over time or
categories.
Functions in the reshape2 Package:
1. melt():
o The melt() function is used to convert data from a wide format to a
long format. In a wide format, each variable is represented in a
separate column. By "melting" the data, the variables are stacked
into a single column, which can make it easier to perform statistical
analyses or create plots.
Example:
library(reshape2)
data <- data.frame(
  ID    = c(1, 2, 3),
  Time1 = c(4, 5, 6),
  Time2 = c(7, 8, 9)
)
melted_data <- melt(data, id.vars = "ID")
print(melted_data)
2. dcast():
o The dcast() function is the opposite of melt(). It converts data from
long format back to wide format. After performing operations or
manipulations on data in long format, dcast() is used to spread it
back out into a more usable wide format, where each category gets
its own column.
Example:
library(reshape2)
wide_data <- dcast(melted_data, ID ~ variable)
print(wide_data)

Name at least three methods or functions available in R to convert data between long and wide formats.
1. reshape():
 The reshape() function in base R can be used to convert data between long
and wide formats. It allows for both reshaping the data to a "long" format
(with fewer columns) and to a "wide" format (with more columns).
2. melt() and dcast() from the reshape2 package:
 melt(): Converts data from wide format to long format by stacking multiple
columns into a single column.
 dcast(): Converts data from long format to wide format, spreading one
column of values into multiple columns based on a factor.
3. pivot_longer() and pivot_wider() from the tidyr package:
 pivot_longer(): Converts data from wide format to long format by gathering
multiple columns into key-value pairs.
 pivot_wider(): Converts data from long format to wide format by spreading
key-value pairs across multiple columns.
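
A brief round-trip sketch with the tidyr pivot functions (hypothetical scores table):

library(tidyr)

wide <- data.frame(id = 1:2, math = c(85, 90), science = c(88, 92))

# Wide -> long
long <- pivot_longer(wide, cols = c(math, science),
                     names_to = "subject", values_to = "score")

# Long -> wide (back to the original shape)
pivot_wider(long, names_from = subject, values_from = score)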

Analyze how the mutate(), summarise(), and group_by() functions in dplyr can be
combined for more complex data manipulations. Provide an example.
 mutate():
 This function is used to create new columns or modify existing columns in a
data frame. You can apply transformations or calculations to existing
columns to create new ones.
 summarise():
 This function is used to summarize data, typically by applying aggregation
functions (like mean(), sum(), min(), max(), etc.) to one or more variables.
The result is a reduced dataset with summarized values.
 group_by():
 This function is used to group data by one or more categorical variables.
Once the data is grouped, operations like summarise() or mutate() are
typically applied to each group independently.

Example:
library(dplyr)

data <- data.frame(
  Product = c("A", "A", "B", "B", "A", "B", "C", "C"),
  Region  = c("North", "South", "North", "South", "North", "South", "North", "South"),
  Sales   = c(100, 150, 200, 250, 300, 350, 400, 450)
)
print(data)

result <- data %>%
  group_by(Product, Region) %>%                 # group by both keys
  mutate(Average_Sales = mean(Sales)) %>%       # per-group average, added row-wise
  summarise(Total_Sales   = sum(Sales),
            Average_Sales = mean(Average_Sales),
            .groups = "drop") %>%
  arrange(desc(Total_Sales))
print(result)

Name at least three statistical methods that can be implemented in R.


1. Linear Regression:
 Linear regression is used to model the relationship between a dependent
variable (response) and one or more independent variables (predictors). In
R, the lm() function is used to fit linear models.
Example:
model <- lm(y ~ x, data = dataset)
summary(model)
2. Logistic Regression:
 Logistic regression is used when the dependent variable is categorical,
particularly for binary outcomes (e.g., success/failure). In R, logistic
regression is implemented using the glm() function with a binomial family.
Example:
model <- glm(y ~ x1 + x2, data = dataset, family = binomial)
summary(model)
3. Principal Component Analysis (PCA):
 Principal Component Analysis (PCA) is a dimensionality reduction
technique that transforms a large set of variables into a smaller one
(principal components) while retaining as much variance as possible. In R,
PCA can be performed using the prcomp() function.
Example:
pca_result <- prcomp(dataset, center = TRUE, scale. = TRUE)
summary(pca_result)

Explain the dplyr package in R. Explain at least five functions available in the dplyr package, with code.
The dplyr package in R is part of the tidyverse ecosystem and is widely used for
data manipulation tasks. It provides a set of functions that are intuitive, fast, and
efficient for handling and transforming data. The primary focus of dplyr is on
performing data manipulation using a data frame or tibble.
Some key features of dplyr include:
 Piping (%>%): This allows chaining commands in a readable manner.
 Efficient performance: Optimized for speed, making it suitable for large
datasets.
 Clear syntax: Makes data wrangling tasks more readable and accessible.

Key Functions in dplyr

1. filter() - Subset rows based on conditions


The filter() function is used to filter rows based on a condition or multiple
conditions.
Example:
library(dplyr)
data <- tibble(
  name  = c("John", "Alice", "Bob", "Eve"),
  age   = c(25, 30, 22, 35),
  score = c(85, 92, 75, 88)
)
filtered_data <- filter(data, age > 25)
print(filtered_data)

2. select() - Choose specific columns


The select() function allows you to select certain columns of a data frame or
tibble.
Example:
selected_data <- select(data, name, score)
print(selected_data)

3. mutate() - Create or modify columns


The mutate() function is used to create new variables or modify existing ones
based on calculations or transformations.
Example:
data_with_percentage <- mutate(data, score_percentage = score / max(score) * 100)  # score relative to the top score
print(data_with_percentage)

4. arrange() - Sort the rows


The arrange() function is used to sort rows based on the values of one or more
columns.
Example:
arranged_data <- arrange(data, age)
print(arranged_data)

5. summarise() - Aggregate data


The summarise() function is used to aggregate data, such as calculating the sum,
average, or other statistics of one or more variables.
Example:
summary_data <- summarise(data, avg_age = mean(age), avg_score = mean(score))
print(summary_data)

Explain the purpose of the tidyr package in R.


The tidyr package in R is designed for tidy data manipulation. It provides a set of
tools for reshaping and tidying data, making it easier to work with data in a
consistent and structured format. The main purpose of tidyr is to help you
transform your data so that it fits well within the tidy data principles, which are
essential for efficient analysis in R.

Key Functions in tidyr


The tidyr package provides several important functions that help manipulate the
structure of data.

1. gather() - Convert data from wide format to long format


Purpose: The gather() function is used to convert data from wide format to long
format. In wide format, each variable is spread across multiple columns. In long
format, each variable becomes a row, which is preferred for many types of
analysis.
Example:
library(tidyr)
data <- tibble(
  name    = c("John", "Alice", "Bob"),
  math    = c(85, 90, 75),
  science = c(88, 92, 78)
)
long_data <- gather(data, subject, score, math:science)
print(long_data)

2. spread() - Convert data from long format to wide format


Purpose: The spread() function is used to convert data from long format to wide
format, which means turning key-value pairs into separate columns.
Example:
wide_data <- spread(long_data, key = subject, value = score)
print(wide_data)

3. separate() - Split a single column into multiple columns


Purpose: The separate() function is used to split a single column into multiple
columns based on a separator or delimiter (e.g., comma, space).
Example:
data <- tibble(
  name = c("John Doe", "Alice Smith", "Bob Johnson")
)
separated_data <- separate(data, name, into = c("first_name", "last_name"), sep = " ")
print(separated_data)

4. unite() - Combine multiple columns into a single column


Purpose: The unite() function is used to combine multiple columns into a single
column, with an optional separator.
Example:
data <- tibble(
  first_name = c("John", "Alice", "Bob"),
  last_name  = c("Doe", "Smith", "Johnson")
)
united_data <- unite(data, col = "full_name", first_name, last_name, sep = " ")
print(united_data)

5. fill() - Fill missing values with the previous value


Purpose: The fill() function is used to fill missing values (NA) in a column with the
previous non-missing value.
Example:
data <- tibble(
  name  = c("John", "Alice", "Bob"),
  score = c(85, NA, 78)
)
filled_data <- fill(data, score)
print(filled_data)

Explain how the lubridate package simplifies working with dates and times in R.
The lubridate package in R is part of the tidyverse and is designed to make date
and time manipulation easier, faster, and more intuitive. Handling dates and
times in base R can often be confusing and verbose — lubridate solves this with a
cleaner syntax and smart parsing.

1. Parsing Dates and Times Easily


Instead of manually specifying formats like %Y-%m-%d, lubridate provides
functions like:
Example:
library(lubridate)
date1 <- ymd("2023-12-31")
date2 <- mdy("12-31-2023")
date3 <- dmy("31-12-2023")
date4 <- ymd_hms("2023-12-31 23:59:59")

print(date1)
print(date2)
print(date3)
print(date4)

2. Extracting Date-Time Components


You can extract parts of a date/time using intuitive functions:
Example:
dt <- ymd_hms("2023-12-31 23:45:10")
year(dt) # 2023
month(dt) # 12
day(dt) # 31
hour(dt) # 23
minute(dt) # 45
second(dt) # 10

3. Updating Components
You can change parts of a date or time easily:
updated_dt <- update(dt, year = 2025, hour = 10)
print(updated_dt)
# Output: "2025-12-31 10:45:10 UTC"

4. Date Arithmetic
You can add or subtract durations like days, months, or years.
today <- ymd("2024-04-23")

today + days(5)    # Add 5 days
today - months(1)  # Subtract 1 month
today + years(2)   # Add 2 years

5. Working with Intervals and Durations


 interval() creates an interval between two dates.
 duration() represents a specific length of time.
start <- ymd("2023-01-01")
end   <- ymd("2024-01-01")

intv <- interval(start, end)
dur  <- as.duration(intv)

print(intv)  # 2023-01-01 to 2024-01-01
print(dur)   # "31536000s (~1 years)"

6. Time Zones Support


lubridate can handle and convert between time zones:
dt <- ymd_hms("2023-12-31 23:00:00", tz = "UTC")
with_tz(dt, "Asia/Kolkata")

Explain the ggplot2 package in R. Explain at least five types of plots you can create using ggplot2.
The ggplot2 package is part of the tidyverse and is the most popular data
visualization library in R. It is built on the grammar of graphics, which provides a
structured and layered approach to building visualizations.

1. Scatter Plot – geom_point()


Used to visualize the relationship between two numeric variables.
library(ggplot2)

ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(color = "blue", size = 3) +
  labs(title = "Weight vs MPG")

2. Bar Chart – geom_bar()


 Use geom_bar() for count of categories.
ggplot(mpg, aes(x = class)) +
  geom_bar(fill = "skyblue") +
  labs(title = "Count of Car Classes")

3. Line Chart – geom_line()


Used for time series or trends over an ordered variable.
df <- data.frame(
  year  = 2015:2020,
  sales = c(200, 220, 250, 280, 300, 330)
)

ggplot(df, aes(x = year, y = sales)) +
  geom_line(color = "darkgreen", size = 1.2) +
  labs(title = "Yearly Sales Trend")
4. Boxplot – geom_boxplot()
Displays distribution and outliers of a variable, grouped by categories.
ggplot(mpg, aes(x = class, y = hwy)) +
  geom_boxplot(fill = "orange") +
  labs(title = "Highway MPG by Car Class")

5. Histogram – geom_histogram()
Used to view the distribution of a single numeric variable.
ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 2, fill = "steelblue", color = "black") +
  labs(title = "Distribution of MPG")

Explain how data.table differs from data.frame in R. What are some advantages of using data.table for large datasets?

Feature | data.frame | data.table
Basic nature | Base R structure | Extension of data.frame
Syntax | More verbose | Concise and expressive
Speed | Slower with large data | Much faster for large datasets
Memory usage | Moderate | More memory-efficient
In-place updates | Not supported (creates a copy) | Supported (modifies by reference)
Chaining | Less readable for long pipelines | Supports elegant chaining ([])
Grouping operations | Via aggregate() or dplyr | Built-in with by
Keying (indexing) | Not available | Allows fast lookup via setkey()

Advantages of data.table for Large Datasets


1. Speed
 data.table is optimized in C and performs data filtering, grouping, and
aggregation faster than data.frame or dplyr.
2. Memory Efficiency
 It modifies data in place instead of making copies. This is important when
working with big data where memory is limited.
3. Concise Syntax
 Complex operations can be performed with minimal code.
4. In-Place Updates
 You can update columns directly without creating a new object.
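
A small sketch of the dt[i, j, by] syntax (hypothetical sales table):

library(data.table)

dt <- data.table(
  product = c("A", "A", "B", "B"),
  sales   = c(100, 150, 200, 250)
)

# Filter, aggregate, and group in one expression
dt[sales > 100, .(total = sum(sales)), by = product]

# In-place column update by reference (no copy is made)
dt[, sales_k := sales / 1000]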
