Satyam Jha r File
(MORNING) 2023-2026
INDEX
S.NO TOPIC
UNIT 1 RESEARCH DESIGN AND DATA REPRESENTATION
1.1 Introduction
1.2 R commands
1.3 R data types
1.4 R operations
1.5 Functions of R
1.6 Vectors
1.7 Introduction to the objects
Lists
Matrices
Arrays
UNIT 2 DATA ANALYSIS USING R
2.1 Properties of tidy data
2.2 Importing data sheets
2.3 Summary statistics
UNIT 3 GRAPHICAL ANALYSIS OF DATA
3.1 Histogram
3.2 Density plot
3.3 Pie-charts
3.4 Packages
UNIT 4 STATISTICAL TESTS
4.1 T-test
4.2 Anova & Correlation
UNIT 1
RESEARCH DESIGN AND DATA
REPRESENTATION
1.1 INTRODUCTION TO R
What is R?
R is an open-source programming language and software environment specifically
designed for statistical computing, data analysis, and graphical representation.
Originally developed for statisticians, R has grown in popularity for data science due
to its powerful data manipulation capabilities and strong community support.
R supports numerous packages, which are collections of R functions and code that
add specific functionalities, making R highly extensible.
Key Features of R:
1. Data Analysis & Visualization: R excels in data manipulation, analysis, and
graphical representation.
2. Extensive Package Ecosystem: With CRAN (Comprehensive R Archive Network),
R has thousands of packages to enhance capabilities.
3. Community Support: R has a strong community that contributes to libraries and
provides resources, documentation, and tutorials.
4. Cross-Platform: R is compatible with multiple operating systems, including
Windows, macOS, and Linux.
What is RStudio?
RStudio is an integrated development environment (IDE) for R. It provides a user-
friendly interface and tools to write, run, and manage R scripts effectively.
RStudio organizes R's features and simplifies many processes, making it easier to
write and debug code, visualize data, and work with packages.
1.2 BASIC COMMANDS FOR R STUDIO
1. Basic Arithmetic Commands:
2. Variable Assignment:
Use <- or = in R to assign a value to a variable, making it easy to store and reuse data.
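A minimal sketch of basic arithmetic and variable assignment. The variable names (price, quantity, total) are illustrative and are not taken from the original file.

```r
# Basic arithmetic typed directly at the console
3 + 4     # addition
10 / 4    # division
2^3       # exponentiation

# Assignment with <- (the usual R style) and with =
price <- 12.5
quantity = 4

# Stored values can be reused in later calculations
total <- price * quantity
print(total)
```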
3. Creating Vectors:
Vectors are sequences of elements of the same type (e.g., numeric, character) that allow
operations on groups of values simultaneously.
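As a small illustration of operating on a group of values at once, the sketch below applies one calculation to every element of a vector. The data are made up for the example.

```r
# A numeric vector of three made-up scores
scores <- c(72, 85, 90)

# One expression is applied to every element at once
doubled <- scores * 2
above_80 <- scores > 80

print(doubled)
print(above_80)
```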
4. Creating Data Frames:
A data frame is a 2D table-like structure in R, where each column can hold different data
types, such as numbers or characters. It's widely used for data storage and analysis.
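A data frame with columns of different types might be built as follows. The column names and values are assumptions chosen only for illustration.

```r
# Each column has its own type: character, integer, logical
students <- data.frame(
  name   = c("Asha", "Ravi", "Meena"),
  age    = c(20L, 21L, 19L),
  passed = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

str(students)    # shows the type of every column
nrow(students)   # number of rows
ncol(students)   # number of columns
```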
5. Accessing Data in Data Frames:
Access data in data frames by specifying row and column indices or by using column names
directly.
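The access patterns described above can be sketched like this, using a small made-up data frame (the names and ages are illustrative).

```r
students <- data.frame(
  name = c("Asha", "Ravi", "Meena"),
  age  = c(20, 21, 19),
  stringsAsFactors = FALSE
)

students[2, 1]        # row 2, column 1 by index
students[2, "age"]    # row 2 of the column called "age"
students$name         # a whole column by name
students[, "age"]     # the same idea with bracket notation

# Rows that satisfy a condition
older <- students[students$age > 19, ]
print(older)
```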
6. Basic Statistical Functions:
Functions such as mean(), median(), and sd() help calculate descriptive statistics, providing a quick summary of the data.
7. Conditional Statements:
Conditional statements, such as if and else, allow you to execute code based on specific
conditions.
8. Loops:
Loops, like for and while, enable you to repeat code multiple times, iterating over a
sequence or continuing until a condition is met.
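A short sketch of the conditional and loop constructs described above. The variable names and threshold values are made up for the example.

```r
# if / else: choose a label based on a condition
temperature <- 31
if (temperature > 30) {
  label <- "hot"
} else {
  label <- "mild"
}

# for: iterate over a sequence and accumulate a sum
total <- 0
for (i in 1:5) {
  total <- total + i
}

# while: repeat until the condition is no longer met
n <- 1
while (n < 100) {
  n <- n * 2
}

print(label)
print(total)
print(n)
```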
1.3 R DATA TYPES
1. Numeric
Definition: Represents real numbers, including both integers and floating-point values.
Examples: 2.3, -4, 0.7, 15.
2. Integer
Definition: Represents whole numbers without any decimal places; denoted by adding "L"
(e.g., 5L).
Examples: 1L, -8L, 0L.
Usage: Ideal for tasks involving counting, loops, or cases where fractional values aren't
needed.
3. Character (String)
Usage: Used for storing text data, labelling, and naming elements in data sets.
4. Logical (Boolean)
Usage: Primarily used in control statements and for making comparisons within data.
5. Complex
Usage: Used in specialized calculations that require handling of imaginary values, such as in
certain mathematical or engineering tasks.
6. Factor
Definition: Represents categorical data with a limited set of possible values, known as
levels.
Examples: "High," "Medium," "Low" as levels for priority.
Usage: Useful in grouping and segmenting data, often applied in statistical models.
7. Date and Time
Definition: Represents calendar dates and times, with specific R classes (Date, POSIXct,
POSIXlt) to handle these.
Examples: "2024-11-01," "2023-07-15 08:30:00."
Usage: Essential for working with time series data, date calculations, and any analysis
involving dates or times.
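The data types in this section can be inspected with class(). The sketch below creates one value of each type; the variable names are illustrative, and the time zone "UTC" is an assumption made so the example behaves the same on every machine.

```r
x_num  <- 2.3                       # numeric
x_int  <- 5L                        # integer (note the L suffix)
x_chr  <- "hello"                   # character
x_lgl  <- TRUE                      # logical
x_cpx  <- 3 + 2i                    # complex
x_fct  <- factor(c("High", "Low", "Medium", "Low"),
                 levels = c("Low", "Medium", "High"))   # factor with 3 levels
x_date <- as.Date("2024-11-01")     # calendar date
x_time <- as.POSIXct("2023-07-15 08:30:00", tz = "UTC") # date and time

class(x_num)
class(x_int)
class(x_chr)
class(x_lgl)
class(x_cpx)
levels(x_fct)

# Dates support arithmetic: number of days between two dates
days_between <- as.numeric(as.Date("2024-11-15") - x_date)
print(days_between)
```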
1.4 R OPERATORS
1. Arithmetic Operators:
2. Assignment Operators:
These operators are used to store values in variables.
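A minimal sketch of R's arithmetic and assignment operators. The values 17 and 5 are arbitrary choices for the example.

```r
# Assignment operators
a <- 17        # leftward assignment (most common)
b = 5          # equals sign also assigns
a * b -> prod_ab   # rightward assignment

# Arithmetic operators
a + b      # addition
a - b      # subtraction
a * b      # multiplication
a / b      # division
a ^ 2      # exponentiation
a %% b     # remainder (modulus)
a %/% b    # integer division
```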
3. Comparison Operators:
Operator  Description               Example  Result
==        Equal to                  4 == 6   FALSE
!=        Not equal to              8 != 4   TRUE
>         Greater than              6 > 4    TRUE
<         Less than                 2 < 6    TRUE
>=        Greater than or equal to  5 >= 2   TRUE
<=        Less than or equal to     3 <= 6   TRUE
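The comparisons in the table can be run directly at the console. Comparison operators also work element by element on vectors, as the last lines show (the vector v is a made-up example).

```r
4 == 6    # FALSE
8 != 4    # TRUE
6 > 4     # TRUE
2 < 6     # TRUE
5 >= 2    # TRUE
3 <= 6    # TRUE

# Applied to a vector, each element is compared separately
v <- c(1, 5, 9)
over_four <- v > 4
print(over_four)
```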
4. Logical Operators:
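A brief sketch of the logical operators: & (element-wise AND), | (element-wise OR), ! (NOT), and the single-value forms && and ||. The vectors and the age check are illustrative.

```r
x <- c(TRUE, FALSE, TRUE)
y <- c(TRUE, TRUE, FALSE)

both   <- x & y    # TRUE only where both are TRUE
either <- x | y    # TRUE where at least one is TRUE
negate <- !x       # flips each value

# && and || compare single values, typically inside if ()
age <- 25
is_working_age <- age >= 18 && age < 65

print(both)
print(either)
print(negate)
print(is_working_age)
```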
1.5 Built-in Functions
R has numerous built-in functions for data analysis, statistics, data manipulation, and more.
Here are some commonly used ones:
1. Basic Mathematical Functions
abs(): Absolute value
sqrt(): Square root
log(): Natural logarithm
exp(): Exponential
2. Statistical Functions
3. Character Functions
4. Logical Functions
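A few commonly used built-in functions from each group are sketched below. The particular functions shown are a selection chosen for illustration, and the input values are made up.

```r
# Mathematical functions
abs(-7.5)       # 7.5
sqrt(16)        # 4
log(exp(1))     # natural logarithm of e, which is 1
exp(0)          # 1

# Statistical functions
values <- c(4, 8, 15, 16, 23, 42)
mean(values)    # 18
median(values)  # 15.5
sum(values)     # 108

# Character functions
toupper("r studio")    # "R STUDIO"
nchar("statistics")    # 10
paste("Unit", 1)       # "Unit 1"

# Logical functions
any(values > 40)   # TRUE: at least one value exceeds 40
all(values > 5)    # FALSE: 4 is not greater than 5
is.na(c(1, NA))    # FALSE TRUE
```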
1.6 VECTORS
In R, vectors are core data structures that store sequences of elements of the same type, like
numbers, text, or logical values. They are one-dimensional, meaning they hold a simple
sequence of items that can be accessed and modified easily. Vectors are crucial in R for data
analysis, as they allow efficient handling and processing of data.
Creating a Vector
Vectors in R are created using the c() function, which stands for "combine." Below are
examples of different types of vectors:
Types of Vectors in R
Complex Vector: Contains complex numbers with real and imaginary parts.
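The sketch below creates one vector of each basic type with c() and checks its type with typeof(). The last line shows what happens when types are mixed: R converts every element to a common type. All values are illustrative.

```r
num_vec <- c(1.5, 2, 3.25)        # numeric (double)
int_vec <- c(1L, 2L, 3L)          # integer
chr_vec <- c("apple", "banana")   # character
lgl_vec <- c(TRUE, FALSE, TRUE)   # logical
cpx_vec <- c(1 + 2i, 3 - 1i)      # complex

typeof(num_vec)
typeof(int_vec)
typeof(chr_vec)
typeof(lgl_vec)
typeof(cpx_vec)

# Real and imaginary parts of a complex vector
Re(cpx_vec)
Im(cpx_vec)

# Mixing types forces coercion: everything becomes character here
mixed <- c(1, "two", TRUE)
typeof(mixed)
```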
BASIC OPERATIONS ON VECTOR
Indexing:
Access specific elements within a vector using square brackets [].
Arithmetic Operations:
Apply calculations on numeric vectors.
Concatenation:
Append elements to an existing vector.
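The three operations above can be sketched on a small made-up vector as follows.

```r
v <- c(10, 20, 30, 40)

# Indexing (R counts from 1)
v[2]          # second element: 20
v[c(1, 4)]    # first and fourth elements: 10 40
v[-1]         # everything except the first element: 20 30 40

# Arithmetic is applied element by element
plus_five <- v + 5
scaled    <- v * c(1, 2, 3, 4)

# Concatenation: append a new element with c()
v <- c(v, 50)
length(v)     # 5
```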
1.7 INTRODUCTION TO OTHER OBJECTS
In R, beyond vectors, there are several other key data structures: lists, matrices, and arrays.
Each serves a specific purpose, allowing R to handle and organize data in various shapes
and complexities.
1. Lists
Description: Lists in R are flexible containers that can hold elements of various types and
structures. Unlike vectors, which require all elements to be of the same type, lists can store
mixed data types—such as numbers, text, vectors, other lists, and functions.
Example: Lists are useful for complex structures, like the combined results of multiple
analyses or datasets containing a variety of data types.
Use Case: Lists are ideal for handling nested or mixed data types, particularly for data from
diverse sources or objects containing results of statistical tests.
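A minimal sketch of a list holding mixed types. The element names and values are assumptions made for the example.

```r
# A list can mix text, numbers, vectors, and logical values
result <- list(
  title   = "Height study",
  n       = 3L,
  heights = c(150, 162, 171),
  passed  = TRUE
)

result$title             # access by name with $
result[["heights"]][2]   # second element of the vector stored in the list

# New elements can be added at any time
result$mean_height <- mean(result$heights)

str(result)              # compact overview of the structure
```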
2. Matrices
Description: A matrix is a two-dimensional data structure with rows and columns, similar to
a table. Every element in a matrix must be of the same type (typically numeric, character, or
logical).
Example: Matrices are beneficial for organizing numeric data in two dimensions, such as for
mathematical operations, linear algebra, and data transformations.
Basic Operations: You can perform mathematical operations on matrices, such as addition,
subtraction, and multiplication, along with actions like transposing and inverting.
Use Case: Matrices are essential in data analysis for numerical data processing or preparing
data for advanced modeling and machine learning.
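The matrix operations mentioned above can be sketched like this. The numbers are arbitrary; note that matrix() fills column by column unless byrow = TRUE is given.

```r
# A 2 x 3 matrix, filled column by column
m <- matrix(1:6, nrow = 2, ncol = 3)
m[1, 2]      # row 1, column 2: 3
dim(m)       # 2 3

# Element-wise addition and scalar multiplication
m + m
m * 2

# Transpose swaps rows and columns
tm <- t(m)
dim(tm)      # 3 2

# Matrix multiplication and inversion on a square matrix
sq  <- matrix(c(2, 0, 1, 3), nrow = 2)
inv <- solve(sq)       # inverse of sq
sq %*% inv             # gives the 2 x 2 identity matrix
```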
3. Arrays
Example: Arrays are useful for organizing high-dimensional data, such as spatial data or
data requiring three or more dimensions.
Use Case: Arrays are widely applied in statistical programming and scientific computing,
especially for data with multiple dimensions, such as image processing or handling
time-series data across groups and dimensions.
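A small sketch of a three-dimensional array. The dimensions (2 rows, 3 columns, 4 layers) are chosen only for illustration.

```r
# 24 values arranged as 2 rows x 3 columns x 4 layers
arr <- array(1:24, dim = c(2, 3, 4))

dim(arr)          # 2 3 4
arr[1, 2, 3]      # row 1, column 2, layer 3: 15
arr[, , 1]        # the first layer is an ordinary 2 x 3 matrix

# Sum of each layer (apply over the third dimension)
layer_sums <- apply(arr, 3, sum)
print(layer_sums)
```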
UNIT 2
DATA ANALYSIS USING R
2.1 Properties of tidy data
Tidy data is a concept introduced by Hadley Wickham that refers to a specific way of
organizing data to ensure it's structured for easy analysis and visualization. Tidy data
follows certain guidelines to maintain consistency, cleanliness, and usability.
1. Each Variable Has Its Own Column
o In tidy data, each column represents one variable, with the column name clearly
indicating the variable.
o Example: In a dataset with people's height and weight, the columns would be labeled
"Height" and "Weight."
2. Each Observation Has Its Own Row
o Each row represents a single observation or data point, and it should contain values for
all the variables (columns) in the dataset.
o Example: A row would represent one individual, including their height and weight.
3. Each Type of Observational Unit Has Its Own Table
o Each dataset or table should focus on a single type of observational unit. For instance,
a table about students should only contain student-related data, not mixed with other
entities like classes.
o Example: A sales dataset should have one table for sales transactions and another for
product details.
4. Data is Not Mixed Across Different Levels of Measurement
o Different types of data should not be combined in the same column. Each column
should store values at the same level of measurement to avoid confusion.
o Example: Avoid placing both "Product" and "Date" in the same column. Instead, each
piece of information (product, date, etc.) should have its own distinct column.
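As a sketch of these principles, the example below starts from an untidy table in which height and weight share one column and splits it so that each variable has its own column. The data and the "height/weight" text format are made up, and base R is used here rather than tidyr only to keep the example self-contained.

```r
# Untidy: two variables are packed into a single column
untidy <- data.frame(
  name          = c("Asha", "Ravi"),
  height_weight = c("160/55", "172/68"),
  stringsAsFactors = FALSE
)

# Split the combined column at the "/" separator
parts <- strsplit(untidy$height_weight, "/", fixed = TRUE)

# Tidy: one column per variable, one row per person
tidy <- data.frame(
  Name   = untidy$name,
  Height = as.numeric(sapply(parts, `[`, 1)),
  Weight = as.numeric(sapply(parts, `[`, 2)),
  stringsAsFactors = FALSE
)

print(tidy)

# Column-wise operations now work directly
mean(tidy$Height)
```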
Consistency:
The structure remains uniform, making it easier to interpret and manage.
Efficiency:
Analysis becomes more straightforward, as each variable is stored in its own column,
allowing operations to be directly applied to the columns.
Compatibility:
Tidy data works seamlessly with various R packages, such as dplyr, ggplot2, and tidyr,
which are optimized for this structure.
2.2 IMPORTING DATA SHEET
This method provides a simple way to import an Excel file into R without needing to write
any code in the console. All actions are performed through the RStudio environment
window.
Step 3: Browse and Select the Excel File
Click on the "Browse" button to navigate and select the Excel file you wish to import into R.
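For comparison, the same import can be written as code. The sketch below assumes the readxl package is installed and uses a hypothetical file name (sales_data.xlsx); neither comes from the original file. The second half shows the base R equivalent for a CSV file, using a temporary file so that the example runs on its own.

```r
# Hypothetical path: replace with the file chosen in the Browse dialog
excel_path <- "sales_data.xlsx"

if (requireNamespace("readxl", quietly = TRUE) && file.exists(excel_path)) {
  sales <- readxl::read_excel(excel_path, sheet = 1)   # read the first sheet
  print(head(sales))
} else {
  message("Install readxl and point excel_path at a real file to run the Excel import.")
}

# Base R alternative for a sheet saved as CSV (made-up data written to a temp file)
csv_path <- tempfile(fileext = ".csv")
write.csv(data.frame(item = c("laddu", "barfi"), sold = c(30, 40)),
          csv_path, row.names = FALSE)

sales_csv <- read.csv(csv_path)
str(sales_csv)
```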
RESULT
2.3 SUMMARY STATISTICS
sd(): Calculates the standard deviation, which measures how much the data deviates from
the mean.
var(): Computes the variance, which is the square of the standard deviation.
range(): Provides the minimum and maximum values within the dataset.
min() and max(): Return the smallest and largest values, respectively.
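A quick sketch of these functions on a made-up vector. Note that sd() and var() use the sample formulas, which divide by n - 1.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

mean(x)     # 5
sd(x)       # sample standard deviation
var(x)      # sample variance, equal to sd(x)^2
range(x)    # 2 9
min(x)      # 2
max(x)      # 9
```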
2. Quantiles and Percentiles
quantile(): Returns specific percentiles from the dataset, including the 0th, 25th, 50th
(median), 75th, and 100th percentiles by default.
IQR(): Computes the interquartile range, which is the difference between the 25th
and 75th percentiles.
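The quantile functions can be tried on a simple made-up vector. The probs argument of quantile() selects other percentiles.

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)

q <- quantile(x)    # 0%, 25%, 50%, 75%, 100% by default
print(q)

IQR(x)              # 75th percentile minus 25th percentile: 4

# A specific percentile, here the 90th
quantile(x, probs = 0.9)
```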
summary(): Provides a descriptive summary of key statistics like minimum, first quartile,
median, mean, third quartile, and maximum.
When used on a data frame, summary() gives these statistics for each column individually.
For data frames, summary functions can be applied column-wise using functions like
apply(), sapply(), or lapply().
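A short sketch of summary() and column-wise summaries on a made-up data frame of heights and weights.

```r
df <- data.frame(
  height = c(150, 160, 170, 180),
  weight = c(50, 60, 65, 85)
)

summary(df$height)   # six-number summary of one column
summary(df)          # the same summary for every column

# Apply a function to each column
col_means <- sapply(df, mean)
col_sds   <- sapply(df, sd)
print(col_means)
print(col_sds)
```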
5. Visual Summary Statistics
Basic visualizations offer an intuitive way to understand data distribution:
boxplot(): Generates a boxplot to visualize the data spread, including the median and
quartiles.
plot(): Offers various plot types, such as scatter plots, for visual analysis.
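A minimal sketch of both functions with made-up data. The titles and axis labels are illustrative, and the first line is only there so the example also runs as a script outside RStudio.

```r
# When run as a script (not interactively), draw to a null device so no file is written
if (!interactive()) pdf(NULL)

x <- c(1, 2, 3, 4, 5, 6, 7, 8, 30)

# boxplot() draws the plot and also returns the numbers behind it
b <- boxplot(x, main = "Boxplot of x", ylab = "Value")
b$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
b$out     # points drawn as outliers

# plot() with two numeric vectors gives a scatter plot
height <- c(150, 160, 170, 180)
weight <- c(50, 60, 65, 85)
plot(height, weight, main = "Height vs weight", xlab = "Height", ylab = "Weight")
```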
UNIT 3
GRAPHICAL ANALYSIS OF
DATA
3.1 Histogram
In R, a histogram is a graphical tool used to visualize the distribution of a dataset. It divides
the data into intervals, called "bins," and displays the count of data points within each bin.
Histograms are useful for understanding the shape, spread, and central tendency of
continuous data.
You can modify the breaks parameter to control the bins, either by specifying a number of
bins (which R treats as a suggestion), by giving a vector of break points, or by choosing an
automatic method.
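The sketch below draws histograms of a made-up vector of marks using three forms of the breaks argument. The titles and labels are illustrative, and the first line is only there so the example also runs as a script outside RStudio.

```r
# When run as a script (not interactively), draw to a null device so no file is written
if (!interactive()) pdf(NULL)

marks <- c(42, 48, 51, 55, 55, 58, 60, 61, 63, 64, 66, 67, 70, 72, 75, 78, 81, 88)

# A single number is a suggested number of bins
hist(marks, breaks = 5, main = "Marks (about 5 bins)", xlab = "Marks")

# A vector gives the exact bin edges: 40-50, 50-60, ..., 80-90
h <- hist(marks, breaks = seq(40, 90, by = 10),
          main = "Marks (bins of width 10)", xlab = "Marks")
h$counts   # number of values in each bin

# The name of an automatic method
hist(marks, breaks = "Sturges", main = "Marks (Sturges rule)", xlab = "Marks")
```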
2. Changing Colors
To modify the appearance, use col to set the color of the bars and border to change the
color of their outlines.
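A brief sketch of the col and border arguments. The data and the colour choices are illustrative, and the first line is only there so the example also runs as a script outside RStudio.

```r
# When run as a script (not interactively), draw to a null device so no file is written
if (!interactive()) pdf(NULL)

marks <- c(42, 48, 51, 55, 55, 58, 60, 61, 63, 64, 66, 67, 70, 72, 75, 78, 81, 88)

# col fills the bars, border colours their outlines
h_col <- hist(marks,
              breaks = seq(40, 90, by = 10),
              col    = "skyblue",
              border = "white",
              main   = "Distribution of marks",
              xlab   = "Marks")
```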
3.2 Density Plot
A density plot in R is a smooth curve that visualizes the distribution of continuous data.
Unlike histograms, which rely on discrete bins, density plots provide a continuous
approximation of the data’s probability density function (PDF). This makes them useful
for examining the shape, central tendency, and spread of the data in a more fluid,
continuous manner.
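A density plot can be produced in base R by passing the result of density() to plot(), as sketched below with made-up data. The first line is only there so the example also runs as a script outside RStudio.

```r
# When run as a script (not interactively), draw to a null device so no file is written
if (!interactive()) pdf(NULL)

marks <- c(42, 48, 51, 55, 55, 58, 60, 61, 63, 64, 66, 67, 70, 72, 75, 78, 81, 88)

# density() computes a kernel density estimate
d <- density(marks)

# Plot the smooth curve
plot(d, main = "Density of marks", xlab = "Marks")

# The area under the curve is approximately 1, as for any probability density
area <- sum(d$y) * (d$x[2] - d$x[1])
print(area)
```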
3.3 Pie Charts
A pie chart is a circular visualization divided into slices, where each slice represents a
portion of the total dataset. It is commonly used to display categorical data, helping to
highlight how each category contributes to the whole. While pie charts are visually
appealing, they can become difficult to interpret when there are too many categories, so they
should be used with caution.
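A minimal pie chart sketch with made-up sales figures for four kinds of sweets. Adding percentages to the labels makes the slices easier to compare. The first line is only there so the example also runs as a script outside RStudio.

```r
# When run as a script (not interactively), draw to a null device so no file is written
if (!interactive()) pdf(NULL)

sales <- c(Laddu = 40, Barfi = 25, Jalebi = 20, Rasgulla = 15)

# Build labels such as "Laddu (40%)"
pct <- round(100 * sales / sum(sales))
slice_labels <- paste0(names(sales), " (", pct, "%)")

pie(sales, labels = slice_labels, main = "Share of sweets sold")
```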
2. Changing Slice Colors
You can modify the color of each slice by specifying a vector of color names or color codes
in the col argument.
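The col argument accepts colour names, hex codes, or a mix of both, one per slice. The data and colours below are illustrative, and the first line is only there so the example also runs as a script outside RStudio.

```r
# When run as a script (not interactively), draw to a null device so no file is written
if (!interactive()) pdf(NULL)

sales <- c(Laddu = 40, Barfi = 25, Jalebi = 20, Rasgulla = 15)

# One colour per slice: named colours and a hex code
slice_cols <- c("tomato", "gold", "#66C2A5", "steelblue")

pie(sales, col = slice_cols, main = "Share of sweets sold")
```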
3.4 PACKAGES
In R, packages are collections of functions, data, and compiled code that extend the
functionality of R. In RStudio you can install, load, and manage packages to make your
analysis more efficient and to access a wide range of tools.
1. Install a Package
Type the name of the package and click "Install".
2. Load a Package
After installing a package, you need to load it into your R session to use its functions. Use
library() to do this.
4. Update Packages
To keep your packages up to date, use: update.packages()
5. Remove a Package
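The package management steps can also be run from the console. In the sketch below the package name ggplot2 is only an example, and the steps that change the installed library are placed behind a flag (run_changes, a name made up for this example) because they need an internet connection.

```r
pkg <- "ggplot2"        # illustrative package name
run_changes <- FALSE    # set to TRUE to actually install, update, and remove

if (run_changes) {
  install.packages(pkg)           # install from CRAN
  update.packages(ask = FALSE)    # update all installed packages without prompting
  remove.packages(pkg)            # remove the package again
}

# Load a package into the session; 'tools' ships with R, so this always works
library(tools)

# Check whether a package is installed without loading it
is_installed <- requireNamespace(pkg, quietly = TRUE)
print(is_installed)
```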
UNIT 4
STATISTICAL TESTS
4.1 T-TEST
We will try to understand the t-test with the help of an example. Suppose a
businessman with two sweet shops in a town wants to check whether the average number of
sweets sold in a day is the same in both stores.
So, the businessman records the average number of sweets sold to 15 random customers in
each shop. He finds that the first shop sold 30 sweets on average, whereas the
second shop sold 40. From the owner's point of view, the second shop is doing better
business than the first. But the thing to notice is that the data set is based on only a
small number of random customers, who cannot represent all the customers. This is where
t-testing comes into play: it helps us understand whether the difference between the two
means is real or simply due to chance.
Mathematically, the t-test takes a sample from each set and sets up the problem
under the null hypothesis that the two means are the same.
Classification of T-tests
One Sample T-test
Two sample T-test
Paired sample T-test
The One-Sample T-Test is used to test the statistical difference between a sample mean and
a known or assumed/hypothesized value of the mean in the population.
So, for performing a one-sample t-test in R, we would use the syntax t.test(y, mu = 0)
where y is the name of the variable of interest and mu is set equal to the mean specified by
the null hypothesis.
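A minimal sketch of a one-sample t-test. The data are simulated with rnorm() under assumed values (50 observations, mean 140, standard deviation 5) and tested against mu = 150, so the printed numbers will differ from any figures quoted in this section.

```r
set.seed(42)   # make the simulated data reproducible

# Simulated daily sales for one shop (assumed values, not real data)
sweets <- rnorm(50, mean = 140, sd = 5)

# Null hypothesis: the true mean is 150
res <- t.test(sweets, mu = 150)
print(res)

# Individual parts of the result
res$statistic   # t value
res$parameter   # degrees of freedom (n - 1)
res$p.value     # p-value
res$conf.int    # 95% confidence interval for the mean
res$estimate    # sample mean
```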
t = -15.249, df = 49, p-value < 2.2e-16: gives the test statistic (t), the degrees of
freedom (df), and the p-value. The computed t-value in this instance is -15.249, there are 49
degrees of freedom, and the p-value is very small (below 2.2e-16), indicating strong evidence
against the null hypothesis.
alternative hypothesis: true mean is not equal to 150: states the alternative hypothesis,
which contends that the population's actual mean is not 150.
95 percent confidence interval: ranges from 138.8176 to 141.4217, the range of values for the
population mean that is consistent with the sample at the 95% confidence level.
sample estimates: mean of x 140.1197: gives the sample estimate, in this example the sample
mean (x) of 140.1197.
The two-sample t-test is used to help us understand whether the difference between the
means of two independent groups is real or simply due to chance.
The general form of the test is t.test(y1, y2, paired=FALSE). By default, R assumes that
the variances of y1 and y2 are unequal, thus defaulting to Welch's test. To toggle this, we
use the flag var.equal=TRUE.
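The sketch below runs both versions of the two-sample test on simulated data for two shops. The sample sizes, means, and standard deviations are assumptions chosen for illustration.

```r
set.seed(1)   # make the simulated data reproducible

# Simulated daily sales for two shops (assumed values, not real data)
shopOne <- rnorm(50, mean = 140, sd = 5)
shopTwo <- rnorm(50, mean = 150, sd = 5)

# Default: Welch's test, which does not assume equal variances
welch <- t.test(shopOne, shopTwo)
print(welch)

# Classic two-sample test assuming equal variances
pooled <- t.test(shopOne, shopTwo, var.equal = TRUE)
print(pooled)

welch$method      # name of the test that was run
pooled$parameter  # degrees of freedom: n1 + n2 - 2
```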
sample estimates: mean of x 140.1077, mean of y 150.0856: gives the sample means of x
and y, which are the sample estimates. In this instance, shopOne's mean is 140.1077,
whereas shopTwo's mean is 150.0856.
The paired-sample t-test is a statistical procedure that is used to determine whether the
mean difference between two sets of observations is zero. In a paired-sample t-test, each
subject is measured two times, resulting in pairs of observations.
The test is run using the syntax t.test(y1, y2, paired=TRUE)
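A small sketch of a paired t-test on made-up before and after measurements for six subjects. It also shows that the paired test is the same as a one-sample t-test on the differences.

```r
# Two measurements per subject (made-up values)
before <- c(30, 28, 35, 40, 32, 38)
after  <- c(34, 30, 36, 45, 33, 41)

# Paired test: is the mean difference zero?
paired <- t.test(before, after, paired = TRUE)
print(paired)

# Equivalent one-sample test on the differences
one_sample <- t.test(before - after, mu = 0)

paired$statistic
one_sample$statistic
```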
4.2 ANOVA TEST & CORRELATION IN R
ANOVA, also known as analysis of variance, is used to investigate relations between
categorical variables and continuous variables in the R programming language. It is a
type of hypothesis test that compares group means by analysing variance. It enables us to
assess whether observed variations in means are statistically significant or merely the
result of chance by comparing the variation within groups to the variation between groups.
The ANOVA test is frequently used in many disciplines, including business, social sciences,
biology, and experimental research.
R – ANOVA Test
ANOVA tests may be run in R programming, and there are a number of functions and
packages available to do so.
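One common way to run a one-way ANOVA is the aov() function from base R. The sketch below uses PlantGrowth, a dataset that ships with R, chosen here only as a convenient example; it compares plant weight across three treatment groups.

```r
# PlantGrowth has a numeric column 'weight' and a factor 'group' with 3 levels
str(PlantGrowth)

# Fit the one-way ANOVA model: does mean weight differ between groups?
fit <- aov(weight ~ group, data = PlantGrowth)

# ANOVA table with degrees of freedom, sums of squares, F value, and p-value
tab <- summary(fit)[[1]]
print(tab)

# Extract the p-value for the group effect
p_value <- tab[["Pr(>F)"]][1]
print(p_value)

# Follow-up: which pairs of groups differ?
TukeyHSD(fit)
```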
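$$\mathrm{Corr}(x, y) = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \, \sum_i (y_i - \bar{y})^2}}$$
where,
$x_i$ and $y_i$ are the individual observations, and $\bar{x}$ and $\bar{y}$ are the sample means of x and y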
Correlation in R
Syntax: cor(x, y, method)
where,
x and y represent the data vectors
method is the type of correlation to compute: "pearson" (the default), "kendall", or "spearman"
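The sketch below computes the Pearson correlation of two made-up vectors twice: once directly from the sums of deviations from the means, and once with cor(). The two results should agree.

```r
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

# Pearson correlation written out from its definition
dx <- x - mean(x)
dy <- y - mean(y)
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

# The built-in function
r_builtin <- cor(x, y, method = "pearson")

print(r_manual)
print(r_builtin)

# Rank-based alternative
cor(x, y, method = "spearman")
```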