Satyam Jha r File

The document is a lab file for a Research Methods course at Jagannath International Management School, detailing various topics related to R programming for data analysis. It includes sections on research design, data representation, data analysis using R, graphical analysis, and statistical tests, along with practical commands and examples. The file is submitted by Satyam Jha in partial fulfillment of a Bachelor of Commerce (Hons.) degree.

Uploaded by

sachinrawat45500

Jagannath International Management School
MOR Pocket-105, Kalkaji, New Delhi - 110019

(Affiliated to Guru Gobind Singh Indraprastha University, approved under Section 2(f) of the UGC Act 1956, and accredited by the National Assessment and Accreditation Council (NAAC))

Research Methods for Commerce Lab File
BCOM 211

Submitted in partial fulfilment of Bachelor of Commerce (Hons.)
BCOM (H) - III SEMESTER (MORNING) 2023-2026

SUBMITTED TO: Dr. Shivani Sharma, Assistant Professor
SUBMITTED BY: SATYAM JHA, 01214188823

INDEX
S.NO      TOPIC
UNIT 1    RESEARCH DESIGN AND DATA REPRESENTATION
1.1       Introduction
1.2       R commands
1.3       R data types
1.4       R operations
1.5       Functions of R
1.6       Vectors
1.7       Introduction to the objects
          Lists
          Matrices
          Arrays
UNIT 2    DATA ANALYSIS USING R
2.1       Properties of tidy data
2.2       Importing data sheets
2.3       Summary statistics
UNIT 3    GRAPHICAL ANALYSIS OF DATA
3.1       Histogram
3.2       Density plot
3.3       Pie-charts
3.4       Packages
UNIT 4    STATISTICAL TESTS
4.1       T-test
4.2       Anova & Correlation

UNIT 1
RESEARCH DESIGN AND DATA REPRESENTATION

1.1 INTRODUCTION TO R
What is R?
• R is an open-source programming language and software environment specifically designed for statistical computing, data analysis, and graphical representation.
• Originally developed for statisticians, R has grown in popularity for data science due to its powerful data manipulation capabilities and strong community support.
• R supports numerous packages, which are collections of R functions and code that add specific functionalities, making R highly extensible.

Key Features of R:
1. Data Analysis & Visualization: R excels in data manipulation, analysis, and
graphical representation.
2. Extensive Package Ecosystem: With CRAN (Comprehensive R Archive Network),
R has thousands of packages to enhance capabilities.
3. Community Support: R has a strong community that contributes to libraries and
provides resources, documentation, and tutorials.
4. Cross-Platform: R is compatible with multiple operating systems, including
Windows, macOS, and Linux.

What is RStudio?
• RStudio is an integrated development environment (IDE) for R. It provides a user-friendly interface and tools to write, run, and manage R scripts effectively.
• RStudio organizes R's features and simplifies many processes, making it easier to write and debug code, visualize data, and work with packages.

Key Features of RStudio:

1. Script Editor: For writing and editing code.
2. Console: To run R commands and see immediate results.
3. Environment/History Panel: Shows variables, functions, data frames, and command history.
4. File Management: Manage your project files and access R packages.
5. Visualization: RStudio allows you to create, save, and view plots and graphs easily.
6. Integrated Package Management: Easier to install, load, and manage R packages within the IDE.

1.2 BASIC COMMANDS FOR R STUDIO
1. Basic Arithmetic Commands:

Perform simple calculations in R, each of which returns a numeric result.

2. Assigning Values to Variables:

Use <- or = in R to assign a value to a variable, making it easy to store and reuse data.

3. Working with Vectors:

Vectors are sequences of elements of the same type (e.g., numeric, character) that allow
operations on groups of values simultaneously.
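
The first three commands can be sketched as follows. This is a minimal illustration, not the original screenshots; the variable names (price, quantity, marks) are made up for the example.

```r
# Basic arithmetic: each expression returns a numeric result
5 + 3        # addition
10 / 4       # division
2^3          # exponentiation

# Assignment: <- and = both store a value in a variable
price <- 120
quantity = 3
total <- price * quantity
print(total)

# Vectors: an operation is applied to every element at once
marks <- c(65, 72, 88)
bonus_marks <- marks + 5
print(bonus_marks)
```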

4. Creating Data Frames:

A data frame is a 2D table-like structure in R, where each column can hold different data
types, such as numbers or characters. It's widely used for data storage and analysis.

5. Accessing Data in Data Frames:
Access data in data frames by specifying row and column indices or by using column names
directly.
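
Creating a data frame and reading values back out might look like the sketch below. The students table and its columns are hypothetical data invented for the example.

```r
# Create a small data frame; each column holds one data type
students <- data.frame(
  name  = c("Asha", "Ravi", "Meena"),
  age   = c(19, 20, 21),
  marks = c(82.5, 74.0, 91.0)
)

students[2, 3]                    # value in row 2, column 3
students$name                     # a whole column, by name
students[, "marks"]               # the same idea with bracket notation
high <- students[students$marks > 80, ]   # rows that satisfy a condition
print(high)
```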

6. Basic Statistical Functions:

These functions help calculate descriptive statistics, providing a quick summary of the data.
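
A short sketch of these functions on a made-up vector of daily sales (the data are an assumption for illustration only):

```r
sales <- c(12, 15, 9, 20, 18, 15)   # hypothetical daily sales figures

mean(sales)      # arithmetic mean
median(sales)    # middle value of the sorted data
sd(sales)        # sample standard deviation
range(sales)     # minimum and maximum
summary(sales)   # min, quartiles, median, mean, max
```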

7. Conditional Statements:
Conditional statements, such as if and else, allow you to execute code based on specific
conditions.
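
As a minimal sketch, an if / else if / else chain that assigns a grade; the score value and the grade cut-offs are assumptions chosen for the example.

```r
score <- 72   # hypothetical value

if (score >= 90) {
  grade <- "A"
} else if (score >= 60) {
  grade <- "B"
} else {
  grade <- "C"
}
print(grade)
```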

8. Loops:

Loops, like for and while, enable you to repeat code multiple times, iterating over a
sequence or continuing until a condition is met.
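
The two loop forms could be written like this; the counters are illustrative only.

```r
# for loop: iterate over a sequence
total <- 0
for (i in 1:5) {
  total <- total + i
}
print(total)

# while loop: repeat until the condition becomes FALSE
count <- 0
while (count < 3) {
  count <- count + 1
}
print(count)
```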

1.3 R DATA TYPES
1. Numeric

Definition: Represents real numbers, including both integers and floating-point values.
Examples: 2.3, -4, 0.7, 15.

Usage: Used in mathematical computations such as addition, multiplication, and division.

2. Integer
Definition: Represents whole numbers without any decimal places; denoted by adding "L"
(e.g., 5L).
Examples: 1L, -8L, 0L.

Usage: Ideal for tasks involving counting, loops, or cases where fractional values aren't
needed.
3. Character (String)

Definition: Represents sequences of text, enclosed within single or double quotation marks.

Examples: "Data Science", "R Programming", 'Statistics'.

Usage: Used for storing text data, labelling, and naming elements in data sets.
4. Logical (Boolean)

Definition: Holds truth values, which are either TRUE or FALSE.


Examples: TRUE, FALSE.

Usage: Primarily used in control statements and for making comparisons within data.

5. Complex

Definition: Represents numbers with both real and imaginary components.


Examples: 3 + 2i, -1 + 4i.

Usage: Used in specialized calculations that require handling of imaginary values, such as in
certain mathematical or engineering tasks.
6. Factor
Definition: Represents categorical data with a limited set of possible values, known as
levels.
Examples: "High", "Medium", "Low" as levels for priority.

Usage: Useful in grouping and segmenting data, often applied in statistical models.

7. Date and Time

Definition: Represents calendar dates and times, with specific R classes (Date, POSIXct,
POSIXlt) to handle these.
Examples: "2024-11-01", "2023-07-15 08:30:00".

Usage: Essential for working with time series data, date calculations, and any analysis
involving dates or times.
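
One way to see these types in practice is to create a value of each and ask R for its class. The sketch below uses made-up values; class() is the only inspection function assumed.

```r
num  <- 2.3
int  <- 5L
chr  <- "R Programming"
lgl  <- TRUE
cplx <- 3 + 2i
fct  <- factor(c("High", "Low", "Medium", "Low"),
               levels = c("Low", "Medium", "High"))
dt   <- as.Date("2024-11-01")

class(num)    # "numeric"
class(int)    # "integer"
class(chr)    # "character"
class(lgl)    # "logical"
class(cplx)   # "complex"
class(fct)    # "factor"
class(dt)     # "Date"

levels(fct)   # the allowed categories, in the order given
dt + 30       # date arithmetic: 30 days later
```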

1.4 R OPERATORS
1. Arithmetic Operators:

These operators are used to perform fundamental math operations.

Operator   Description           Example    Result
+          Addition              6 + 4      10
-          Subtraction           15 - 5     10
*          Multiplication        7 * 4      28
/          Division              20 / 5     4
^          Exponentiation        3^2        9
%%         Modulus (remainder)   9 %% 4     1
%/%        Integer division      10 %/% 3   3

2. Assignment Operators:
These operators are used to store values in variables.

Operator   Description                          Example
<-         Left assignment                      a <- 6
->         Right assignment                     8 -> b
<<-        Global assignment within functions   c <<- 7
=          Alternate assignment method          d = 15

3. Comparison Operators:

Used to compare values, yielding a logical result (TRUE or FALSE).

Operator   Description                Example   Result
==         Equal to                   4 == 6    FALSE
!=         Not equal to               8 != 4    TRUE
>          Greater than               6 > 4     TRUE
<          Less than                  2 < 6     TRUE
>=         Greater than or equal to   5 >= 2    TRUE
<=         Less than or equal to      3 <= 6    TRUE

4. Logical Operators:

Used to evaluate multiple conditions, returning logical values (TRUE or FALSE).

Operator   Description                  Example              Result
&          Logical AND (element-wise)   (8 > 5) & (2 < 7)    TRUE
|          Logical OR (element-wise)    (9 > 3) | (4 < 2)    TRUE
&&         Logical AND (single values)  (7 > 3) && (5 < 9)   TRUE
||         Logical OR (single values)   (8 > 4) || (1 > 3)   TRUE
!          Logical NOT                  !(3 == 6)            TRUE

Note: & and | apply element-wise across vectors, while && and || compare single (length-one) values. Older versions of R used only the first element of a longer vector; since R 4.3.0, passing a vector longer than one to && or || is an error.
5. Sequence Operators:

These operators generate sequences of numbers.

Operator   Description                                   Example         Result
:          Create a sequence                             2:6             2, 3, 4, 5, 6
seq()      Create a sequence with specified increments   seq(2, 12, 3)   2, 5, 8, 11
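
A few of the operators from the tables above, run on a small made-up vector x, give a feel for the difference between element-wise and single-value logic. This is a sketch for illustration only.

```r
9 %% 4         # remainder after division
10 %/% 3       # integer division

x <- c(2, 8, 5)
x > 4                          # element-wise comparison, one result per element
both <- (x > 4) & (x < 8)      # element-wise AND
print(both)
(7 > 3) && (5 < 9)             # single-value AND, typical inside if ()

2:6                            # sequence in steps of 1
s <- seq(2, 12, by = 3)        # sequence in steps of 3
print(s)
```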

1.5 Built-in Functions
R has numerous built-in functions for data analysis, statistics, data manipulation, and more.
Here are some commonly used ones:
Basic Mathematical Functions

abs(): Absolute value

sqrt(): Square root

log(): Natural logarithm

exp(): Exponential

1. Statistical Functions

2. Character Functions

3. Data Manipulation Functions

4. Logical Functions

5. Applying Functions on Data Structures
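
The original examples for these categories are not reproduced here, so the sketch below picks one or two common base R functions per category as an assumption about what each heading covers. All functions shown are part of base R.

```r
# Mathematical
abs(-7.5); sqrt(16); log(10); exp(1)

# Statistical
values <- c(4, 8, 15, 16, 23, 42)
mean(values); sd(values)

# Character
toupper("research"); nchar("commerce"); paste("R", "Studio", sep = "")

# Data manipulation
sort(values, decreasing = TRUE); rev(values); length(values)

# Logical
any(values > 40); all(values > 0); is.na(c(1, NA))

# Applying a function over a data structure
sums <- sapply(list(a = 1:3, b = 4:6), sum)
print(sums)
```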

1.6 VECTORS

In R, vectors are core data structures that store sequences of elements of the same type, like
numbers, text, or logical values. They are one-dimensional, meaning they hold a simple
sequence of items that can be accessed and modified easily. Vectors are crucial in R for data
analysis, as they allow efficient handling and processing of data.

Key Features of Vectors

• Homogeneous Data Type: Every element in a vector must be of the same type (e.g., all numeric, all character, or all logical).
• One-Dimensional: Vectors are single-dimensional, so elements are stored in a single row or column.
• Indexed: Each element has an index, making it easy to retrieve specific items by their position.

Creating a Vector

Vectors in R are created using the c() function, which stands for "combine." Below are
examples of different types of vectors:

Types of Vectors in R

• Numeric Vector: Contains numbers, either integers or decimals.

• Character Vector: Contains strings or text values.

• Logical Vector: Contains Boolean values (TRUE or FALSE).

• Complex Vector: Contains complex numbers with real and imaginary parts.
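
Each vector type can be created with c(). The sketch below uses made-up values and also shows what happens when types are mixed: R coerces everything to one common type.

```r
numeric_vec   <- c(10.5, 20, 35)
character_vec <- c("apple", "banana", "cherry")
logical_vec   <- c(TRUE, FALSE, TRUE)
complex_vec   <- c(1 + 2i, 3 - 1i)

class(numeric_vec)
class(character_vec)
class(logical_vec)
class(complex_vec)

# Mixing types forces coercion to a single type (here: character)
mixed <- c(1, "two", TRUE)
class(mixed)
```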

BASIC OPERATIONS ON VECTOR

Vectors can be modified and utilized with several operations, including:

Indexing:
Access specific elements within a vector using square brackets [].

Arithmetic Operations:
Apply calculations on numeric vectors.

Concatenation:
Append elements to an existing vector.
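
These three operations might look as follows on a small illustrative vector v:

```r
v <- c(10, 20, 30, 40)

# Indexing (R counts from 1)
v[2]           # second element
v[c(1, 4)]     # first and fourth elements
v[-1]          # everything except the first element

# Arithmetic on every element
doubled <- v * 2
print(doubled)

# Concatenation: append a new element
v <- c(v, 50)
length(v)
```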

1.7 INTRODUCTION TO OTHER OBJECTS

In R, beyond vectors, there are several other key data structures: lists, matrices, and arrays.
Each serves a specific purpose, allowing R to handle and organize data in various shapes
and complexities.

1. Lists

Description: Lists in R are flexible containers that can hold elements of various types and
structures. Unlike vectors, which require all elements to be of the same type, lists can store
mixed data types—such as numbers, text, vectors, other lists, and functions.

Example: Lists are useful for complex structures, like the combined results of multiple
analyses or datasets containing a variety of data types.

Use Case: Lists are ideal for handling nested or mixed data types, particularly for data from
diverse sources or objects containing results of statistical tests.
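
A minimal sketch of a list holding mixed types; the element names and values are hypothetical.

```r
result <- list(
  test_name = "one-sample t-test",   # character
  p_value   = 0.03,                  # numeric
  scores    = c(12, 15, 9),          # numeric vector
  passed    = TRUE                   # logical
)

result$test_name          # access an element by name
result[["scores"]][2]     # second value of the 'scores' element
str(result)               # compact overview of the structure
```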

2. Matrices

Description: A matrix is a two-dimensional data structure with rows and columns, similar to
a table. Every element in a matrix must be of the same type (typically numeric, character, or
logical).

Example: Matrices are beneficial for organizing numeric data in two dimensions, such as for
mathematical operations, linear algebra, and data transformations.

Basic Operations: You can perform mathematical operations on matrices, such as addition,
subtraction, and multiplication, along with actions like transposing and inverting.

Use Case: Matrices are essential in data analysis for numerical data processing or preparing
data for advanced modeling and machine learning.
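
The matrix operations mentioned above can be sketched with base R as follows; the numbers are arbitrary.

```r
m <- matrix(1:6, nrow = 2, ncol = 3)   # filled column by column
m
dim(m)           # 2 rows, 3 columns

t(m)             # transpose: 3 rows, 2 columns
m * 2            # element-wise multiplication
prod_m <- m %*% t(m)   # matrix multiplication (2 x 3 times 3 x 2)
print(prod_m)

sq <- matrix(c(2, 1, 1, 3), nrow = 2)
inv <- solve(sq)       # inverse of a square, non-singular matrix
print(inv)
```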

3. Arrays

Description: Arrays in R generalize matrices to multiple dimensions, allowing data to be stored across more than two dimensions. Each element within an array must be of the same type across all dimensions.

Example: Arrays are useful for organizing high-dimensional data, such as spatial data or data requiring three or more dimensions.

Use Case: Arrays are widely applied in statistical programming and scientific computing,
especially for data with multiple dimensions, such as image processing or handling time-
series data across groups and dimensions.
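
A small three-dimensional array, as an illustrative sketch (the dimensions and values are arbitrary):

```r
arr <- array(1:24, dim = c(2, 3, 4))   # 2 rows, 3 columns, 4 layers
dim(arr)

arr[1, 2, 3]     # row 1, column 2, layer 3
arr[, , 1]       # the whole first layer, which is a 2 x 3 matrix

layer_sums <- apply(arr, 3, sum)   # one sum per layer
print(layer_sums)
```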

UNIT 2
DATA ANALYSIS USING R

2.1 Properties of tidy data

Tidy data is a concept introduced by Hadley Wickham that refers to a specific way of
organizing data to ensure it's structured for easy analysis and visualization. Tidy data
follows certain guidelines to maintain consistency, cleanliness, and usability.

1. Each Variable Forms a Column

o In tidy data, each column represents one variable, with the column name clearly
indicating the variable.
o Example: In a dataset with people's height and weight, the columns would be labeled
"Height" and "Weight."

2. Each Observation Forms a Row

o Each row represents a single observation or data point, and it should contain values for
all the variables (columns) in the dataset.
o Example: A row would represent one individual, including their height and weight.

3. Each Type of Observational Unit Forms a Table

o Each dataset or table should focus on a single type of observational unit. For instance,
a table about students should only contain student-related data, not mixed with other
entities like classes.
o Example: A sales dataset should have one table for sales transactions and another for
product details.

4. Data is Not Mixed Across Different Levels of Measurement

o Different types of data should not be combined in the same column. Each column
should store values at the same level of measurement to avoid confusion.
o Example: Avoid placing both "Product" and "Date" in the same column. Instead, each
piece of information (product, date, etc.) should have its own distinct column.
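
To make the principles concrete, the sketch below reshapes a small made-up table with base R only. It assumes that "subject" is a variable of interest, so a layout with one column per subject counts as untidy; the names wide and tidy are hypothetical.

```r
# Wide layout: the variable "subject" is hidden in the column headers
wide <- data.frame(
  student = c("Asha", "Ravi"),
  math    = c(80, 90),
  science = c(70, 85)
)

# Tidy layout: one row per student-subject observation, one column per variable
tidy <- data.frame(
  student = rep(wide$student, times = 2),
  subject = rep(c("math", "science"), each = nrow(wide)),
  score   = c(wide$math, wide$science)
)
print(tidy)

# Column-wise operations now apply directly, e.g. the mean score per subject
subject_means <- aggregate(score ~ subject, data = tidy, FUN = mean)
print(subject_means)
```

Packages such as tidyr offer dedicated reshaping functions; base R is used here only so the sketch runs without extra installs.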

ADVANTAGES OF TIDY DATA

Consistency:
The structure remains uniform, making it easier to interpret and manage.

Efficiency:
Analysis becomes more straightforward, as each variable is stored in its own column,
allowing operations to be directly applied to the columns.

Compatibility:
Tidy data works seamlessly with various R packages, such as dplyr, ggplot2, and tidyr,
which are optimized for this structure.

2.2 IMPORTING DATA SHEET

Importing an Excel File into R

Using RStudio's Built-in Menu Options

This method provides a simple way to import an Excel file into R without needing to write any code in the console. All actions are performed through the RStudio Environment pane.

Steps to Import an Excel File Using the Import Dataset Option in RStudio:

Step 1: Choose "Import Dataset"
In the Environment pane of RStudio, select the "Import Dataset" option. This initiates the process of importing data into R.

Step 2: Select "From Excel"
In the "Import Dataset" menu, choose the "From Excel" option. This specifies that the data source is an Excel file.

Step 3: Browse and Select the Excel File
Click on the "Browse" button to navigate to and select the Excel file you wish to import into R.

Step 4: Click "Import"
After selecting the file, click the "Import" button. This completes the process and imports the Excel file into R.
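
The same import can also be done in code. The sketch below is an analogue rather than the menu's exact output: it writes a small made-up CSV file to a temporary location and reads it back with base R so that it runs anywhere. For a real Excel file, the commented lines assume the readxl package is installed and use a placeholder path.

```r
# Write a small hypothetical file so the sketch is self-contained
path <- tempfile(fileext = ".csv")
write.csv(data.frame(item = c("pen", "book"), price = c(10, 150)),
          path, row.names = FALSE)

# Import it and inspect the result
sheet <- read.csv(path)
head(sheet)   # first rows
str(sheet)    # column names and types

# For an Excel file, assuming the readxl package is installed:
# library(readxl)
# sheet <- read_excel("path/to/file.xlsx")
```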

RESULT: the imported sheet appears as a data frame in the Environment pane (screenshot not reproduced).

2.3 SUMMARY STATISTICS

In R, summary statistics offer a concise overview of a dataset's key characteristics, including central tendency, variability, and distribution. These statistics are crucial for identifying patterns in the data and performing initial analyses before more advanced statistical methods are applied.

1. Common Summary Statistics Functions in R

Basic Summary Statistics

mean(): Computes the average value of a numeric vector.

median(): Returns the middle value (median) of a numeric vector.

sd(): Calculates the standard deviation, which measures how much the data deviates from
the mean.

var(): Computes the variance, which is the square of the standard deviation.

range(): Provides the minimum and maximum values within the dataset.

min() and max(): Return the smallest and largest values, respectively.

sum(): Sums up all the values in a vector.

2. Quantiles and Percentiles

• quantile(): Returns specific percentiles from the dataset; by default the 0th, 25th, 50th (median), 75th, and 100th percentiles.

• IQR(): Computes the interquartile range, which is the difference between the 75th and 25th percentiles.

3. Descriptive Summary with the summary() Function

summary(): Provides a descriptive summary of key statistics like minimum, first quartile,
median, mean, third quartile, and maximum.

When used on a data frame, summary() gives these statistics for each column individually.

4. Summary Statistics for Data Frames

For data frames, summary functions can be applied column-wise using functions like
apply(), sapply(), or lapply().
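
A brief sketch of quantiles, summary(), and column-wise statistics on a made-up data frame (the height and weight values are invented for illustration):

```r
df <- data.frame(
  height = c(150, 160, 165, 170, 180),
  weight = c(50, 58, 63, 70, 82)
)

quantile(df$height)     # 0%, 25%, 50%, 75%, 100%
IQR(df$height)          # 75th percentile minus 25th percentile

summary(df)             # per-column min, quartiles, median, mean, max

col_means <- sapply(df, mean)   # apply mean() to every column
col_sds   <- sapply(df, sd)     # apply sd() to every column
print(col_means)
print(col_sds)
```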

5. Visual Summary Statistics
Basic visualizations offer an intuitive way to understand data distribution:

• hist(): Creates a histogram to show the distribution of data.

• boxplot(): Generates a boxplot to visualize the data spread, including the median and quartiles.

• plot(): Offers various plot types, such as scatter plots, for visual analysis.

UNIT 3
GRAPHICAL ANALYSIS OF DATA

3.1 Histogram
In R, a histogram is a graphical tool used to visualize the distribution of a dataset. It divides
the data into intervals, called "bins," and displays the count of data points within each bin.
Histograms are useful for understanding the shape, spread, and central tendency of
continuous data.

Creating a Basic Histogram in R


To create a histogram in R, you can use the hist() function. By default, R automatically determines the number of bins, but you have the option to customize this, along with other features. The main arguments are:
• x: The numeric vector containing the data to plot.
• breaks: Controls the bins. You can give a suggested number of bins or a vector of break points, or name a method such as "Sturges", "Scott", or "FD" for automatic calculation.
• main: The title of the histogram.
• xlab and ylab: Labels for the X and Y axes.
• col: Defines the color of the bars in the histogram.
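
A basic call using these arguments might look like the sketch below. The data are simulated with rnorm() purely for illustration, and the title and labels are made up.

```r
set.seed(1)                           # simulated data, for illustration only
x <- rnorm(200, mean = 50, sd = 10)

h <- hist(x,
          breaks = 10,                # a suggestion; R may adjust it to tidy break points
          main   = "Distribution of Scores",
          xlab   = "Score",
          ylab   = "Frequency",
          col    = "lightblue")

h$counts   # number of observations in each bin
```

hist() draws the plot and also returns an object with the break points and counts, which is handy for checking what was plotted.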

Customizing the Histogram


1. Adjusting the Number of Bins

You can modify the breaks argument to control the binning, either by specifying a number of bins or by choosing an automatic method.

2. Changing Colors

To modify the appearance, use col to set the color of the bars and border to change the
color of their outlines.

3. Adding Titles and Labels

Enhance clarity by adding descriptive titles and axis labels.

4. Overlaying a Density Curve

To show the distribution as a smooth curve, draw the histogram with freq = FALSE, which scales the Y-axis to densities rather than counts, and then add the curve with lines(density(x)).
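
The customizations above can be combined in one sketch. The data are again simulated, and the colors and labels are arbitrary choices.

```r
set.seed(1)
x <- rnorm(200, mean = 50, sd = 10)   # simulated data, for illustration only

h2 <- hist(x,
           breaks = "FD",             # automatic binning (Freedman-Diaconis rule)
           freq   = FALSE,            # Y-axis shows density instead of counts
           col    = "grey90",         # bar fill color
           border = "white",          # bar outline color
           main   = "Scores with Density Curve",
           xlab   = "Score")

lines(density(x), col = "red", lwd = 2)   # overlay the smooth density curve
```

With freq = FALSE the bar areas add up to 1, which puts the histogram on the same scale as the density curve.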

3.2 Density Plot
A density plot in R is a smooth curve that visualizes the distribution of continuous data.
Unlike histograms, which rely on discrete bins, density plots provide a continuous
approximation of the data’s probability density function (PDF). This makes them useful
for examining the shape, central tendency, and spread of the data in a more fluid,
continuous manner.

Creating a Basic Density Plot in R

To create a density plot in R, you can use the density() function to compute the density and the plot() function to display it. The main arguments are:
• x: The numeric vector containing the data for which the density is calculated.
• main: The title of the plot.
• xlab and ylab: Labels for the X and Y axes.
• col: The color of the density curve.
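
A minimal sketch of the two-step approach, using simulated data and made-up labels:

```r
set.seed(1)
x <- rnorm(200, mean = 50, sd = 10)   # simulated data, for illustration only

d <- density(x)                       # step 1: compute the kernel density estimate

plot(d,                               # step 2: draw it
     main = "Density of Scores",
     xlab = "Score",
     ylab = "Density",
     col  = "blue")
```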

3.3 Pie Charts
A pie chart is a circular visualization divided into slices, where each slice represents a
portion of the total dataset. It is commonly used to display categorical data, helping to
highlight how each category contributes to the whole. While pie charts are visually
appealing, they can become difficult to interpret when there are too many categories, so they
should be used with caution.

Creating a Basic Pie Chart in R


In R, the pie() function is used to create a pie chart, with the data representing the sizes of the segments. The main arguments are:
• x: A numeric vector containing the values for each segment of the pie chart.
• labels: Optional labels that are added to the slices, often representing the category names.
• main: The title for the pie chart.
• col: A vector of colors for the pie slices (optional).
• radius: Controls the size of the pie chart, with a default value of 0.8.

Customizing the Pie Chart


1. Adding Percentages to Slices
To display the percentage contribution of each category, you can calculate the proportion
of each value in x and include it as part of the slice labels.

2. Changing Slice Colors

You can modify the color of each slice by specifying a vector of color names or color codes
in the col argument.
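
Both customizations can be sketched together. The sales figures, region names, and colors below are made up for the example.

```r
sales   <- c(40, 25, 20, 15)                       # hypothetical values
regions <- c("North", "South", "East", "West")

# Percentage share of each slice, pasted onto the category name
pct          <- round(100 * sales / sum(sales))
slice_labels <- paste0(regions, " (", pct, "%)")

pie(sales,
    labels = slice_labels,
    main   = "Sales by Region",
    col    = c("tomato", "skyblue", "gold", "lightgreen"))
```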

3.4 PACKAGES

In R, packages are collections of functions, data, and compiled code that extend the functionality of R. You can install, load, and manage packages to make your analysis more efficient and to access a wide range of tools.

Here's how you can work with packages in RStudio:

1. Install a Package
• To install a package, use the install.packages() function in the R console, for example install.packages("ggplot2").
• Another way to install packages in RStudio is through the menus:
• Go to the "Tools" menu and click "Install Packages".
• Type the name of the package and click "Install".

2. Load a Package

After installing a package, you need to load it into your R session to use its functions. Use
library() to do this, for example library(ggplot2).

3. Check Installed Packages

To see which packages are installed, you can use: installed.packages()

4. Update Packages
To keep your packages up to date, use: update.packages()

5. Remove a Package

If you want to uninstall a package, you can use: remove.packages("ggplot2")
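
The commands above can be combined into a check-then-load pattern. The sketch below uses the stats package, which ships with R, so that it runs without internet access; the install and load lines for ggplot2 are left commented out as the typical workflow for a CRAN package.

```r
pkg <- "stats"   # ships with R; stands in for any package name

# installed.packages() returns a matrix with one row per installed package
is_installed <- pkg %in% rownames(installed.packages())
print(is_installed)

# requireNamespace() checks availability without attaching the package
if (requireNamespace(pkg, quietly = TRUE)) {
  library(pkg, character.only = TRUE)   # character.only lets library() take a variable
}

# Typical workflow for a CRAN package (not run here; needs internet access):
# install.packages("ggplot2")
# library(ggplot2)
```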

Commonly Used Packages:


o ggplot2 – for data visualization.
o dplyr – for data manipulation.
o tidyr – for data tidying.
o shiny – for creating interactive web apps.
o lubridate – for working with dates and times.
o caret – for machine learning.

UNIT 4
STATISTICAL TESTS

4.1 T-TEST
We will try to understand the t-test with the help of an example. Suppose a businessman with two sweet shops in a town wants to check whether the average number of sweets sold per day is the same in both stores.
So, the businessman records the number of sweets sold to 15 random customers in each shop and takes the average. He finds that the first shop sold 30 sweets on average, whereas the second shop sold 40. From the owner's point of view, the second shop is doing better business than the first. But the data set is based on only a small number of random customers, who cannot represent all the customers. This is where the t-test comes into play: it helps us understand whether the difference between the two means is real or simply due to chance.
Mathematically, the t-test takes a sample from each set and sets up the problem under the null hypothesis that the two means are the same.

Classification of T-tests
• One Sample T-test
• Two Sample T-test
• Paired Sample T-test

One Sample T-Test Approach

The one-sample t-test is used to test the statistical difference between a sample mean and a known or hypothesized value of the population mean.
To perform a one-sample t-test in R, we use the syntax t.test(y, mu = 0), where y is the name of the variable of interest and mu is set equal to the mean specified by the null hypothesis.
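
A minimal sketch of a one-sample test follows. The data are simulated with rnorm(), so the numbers it prints will differ from the output values quoted in this section; the name shop_one and the choice of mu = 150 are assumptions for illustration.

```r
set.seed(42)
shop_one <- rnorm(50, mean = 140, sd = 5)   # simulated daily sales, 50 observations

res <- t.test(shop_one, mu = 150)   # H0: the population mean equals 150
print(res)

res$statistic   # t value
res$parameter   # degrees of freedom (n - 1)
res$p.value     # p-value
res$conf.int    # 95% confidence interval for the mean
res$estimate    # sample mean
```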

The output of this test, run here against mu = 150, is read as follows:
• t = -15.249, df = 49, p-value < 2.2e-16: the output reports the test statistic (t), the degrees of freedom (df), and the p-value. In this instance the computed t-value is -15.249, there are 49 degrees of freedom, and the p-value is very small (below 2.2e-16), which is strong evidence against the null hypothesis.
• "alternative hypothesis: true mean is not equal to 150" states the alternative hypothesis, which contends that the population's actual mean is not 150.
• The 95% confidence interval ranges from 138.8176 to 141.4217. Intervals constructed this way contain the true population mean in 95% of repeated samples.
• "sample estimates: mean of x 140.1197" gives the sample estimate, in this example the sample mean of x, 140.1197.

Two sample T-Test Approach

It is used to help us understand whether the difference between the two means is real or simply due to chance.
The general form of the test is t.test(y1, y2, paired = FALSE). By default, R assumes that the variances of y1 and y2 are unequal and therefore runs Welch's test. To assume equal variances instead, set var.equal = TRUE.
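
The difference between the default Welch test and the equal-variance version can be sketched on simulated data. The vectors shop_one and shop_two are generated with rnorm() for illustration, so the printed numbers are not those of the original example.

```r
set.seed(42)
shop_one <- rnorm(50, mean = 140, sd = 5)   # simulated sales, shop one
shop_two <- rnorm(50, mean = 150, sd = 5)   # simulated sales, shop two

welch  <- t.test(shop_one, shop_two)                    # default: unequal variances
pooled <- t.test(shop_one, shop_two, var.equal = TRUE)  # classic pooled-variance test

welch$method      # name of the test that was run
pooled$method
welch$estimate    # the two sample means
pooled$parameter  # degrees of freedom: n1 + n2 - 2
```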

• "sample estimates: mean of x 140.1077, mean of y 150.0856" gives the sample means of x and y. In this instance, shopOne's mean is 140.1077, whereas shopTwo's mean is 150.0856.

Paired Sample T-test

This is a statistical procedure that is used to determine whether the mean difference
between two sets of observations is zero. In a paired sample t-test, each subject is
measured two times, resulting in pairs of observations.
The test is run using the syntax t.test(y1, y2, paired = TRUE).
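
As a sketch, a paired test on made-up before/after measurements for six subjects. It also shows that the paired test is the same as a one-sample test on the differences, which is one way to understand what it does.

```r
before <- c(30, 28, 35, 40, 32, 38)   # hypothetical first measurement
after  <- c(34, 30, 37, 45, 33, 41)   # hypothetical second measurement, same subjects

paired <- t.test(before, after, paired = TRUE)
print(paired)

# Equivalent formulation: one-sample t-test on the pairwise differences
diff_test <- t.test(before - after, mu = 0)
diff_test$p.value
```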

4.2 ANOVA TEST & CORRELATION IN R
ANOVA, also known as analysis of variance, is used to investigate relations between categorical variables and continuous variables in the R programming language. It is a type of hypothesis test that compares the means of two or more groups by analysing variance. By comparing the variation between groups to the variation within groups, it lets us assess whether observed differences in means are statistically significant or merely the result of chance. The ANOVA test is frequently used in many disciplines, including business, social sciences, biology, and experimental research.

R – ANOVA Test
ANOVA tests may be run in R programming, and there are a number of functions and
packages available to do so.

ANOVA test involves setting up:

• Null Hypothesis: The default assumption, or null hypothesis, is that there is no meaningful relationship or impact between the variables. It stands for the absence of a population-wide link, difference, or effect. The statement that two or more groups are equal or that the effect size is zero is sometimes expressed as the null hypothesis. The null hypothesis is commonly written as H0.
• Alternate Hypothesis: The opposite of the null hypothesis is the alternative hypothesis. It implies that there is a significant relationship, difference, or link among the population's variables. Depending on the study question or the nature of the issue under investigation, it may take several forms. Alternative hypotheses are sometimes referred to as H1 or HA.

ANOVA tests are of two types:

• One-way ANOVA: A one-way ANOVA is employed when there is a single categorical independent variable (also known as a factor) and a single continuous dependent variable. It seeks to ascertain whether there are any notable differences in the dependent variable's means across the levels of the independent variable.
• Two-way ANOVA: When there are two categorical independent variables (factors) and one continuous dependent variable, two-way ANOVA is used as an extension of one-way ANOVA. You can evaluate both the direct effect of each independent variable and how the two interact in their effect on the dependent variable.
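
Both types can be sketched with the base R function aov(). The data frame below is simulated; the factor names region and season and the response units are hypothetical.

```r
set.seed(7)
sales <- data.frame(
  region = factor(rep(c("North", "South", "East"), each = 10)),
  season = factor(rep(c("Summer", "Winter"), times = 15)),
  units  = c(rnorm(10, 50, 5), rnorm(10, 55, 5), rnorm(10, 65, 5))
)

# One-way ANOVA: does mean 'units' differ across regions?
one_way <- aov(units ~ region, data = sales)
summary(one_way)

# Two-way ANOVA with interaction: region, season, and region:season
two_way <- aov(units ~ region * season, data = sales)
summary(two_way)
```

In the summary table, a small value in the Pr(>F) column for a term suggests that the group means differ across that factor's levels.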

4.3 Correlation in R Programming Language


The cor() function in R computes the correlation coefficient. Correlation is a statistical measure, based on covariance, of how strongly two vectors are related to each other. Mathematically,

$$\operatorname{Corr}(x, y) = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}$$

where,

x represents the x data vector
y represents the y data vector
$\bar{x}$ represents the mean of the x data vector
$\bar{y}$ represents the mean of the y data vector

Correlation in R
Syntax: cor(x, y, method)
where,
x and y represent the data vectors
method defines the type of correlation coefficient to compute: "pearson" (the default), "kendall", or "spearman".
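
As a quick check of the definition, the sketch below computes the Pearson coefficient by hand from the sums of deviations and compares it with cor(). The vectors x and y are made-up values.

```r
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

r_builtin <- cor(x, y)   # Pearson by default

# The same quantity from the definition: sum of cross-deviations divided by
# the square root of the product of the sums of squared deviations
dx <- x - mean(x)
dy <- y - mean(y)
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

print(r_builtin)
print(r_manual)

cor(x, y, method = "spearman")   # rank-based alternative
```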
