

Social Science Research Principles Methods and Practices

This document discusses key statistical concepts including types of variables, descriptive statistics, and sample size estimation. It explains the distinction between quantitative and qualitative variables, the importance of descriptive statistics in summarizing data, and the necessity of accurate sample size calculations for effective study design. The authors emphasize the role of variables in data collection and the implications of sampling methods on research outcomes.

Research Snippets

Types of Variables, Descriptive Statistics, and Sample Size

Feroze Kaliyadan, Vinay Kulkarni¹
Department of Dermatology, King Faisal University, Al Hofuf, Saudi Arabia, ¹Department of Dermatology, Prayas Amrita Clinic, Pune, Maharashtra, India

Abstract

This short “snippet” covers three important aspects related to statistics: the concept of variables, the importance and practical aspects of descriptive statistics, and issues related to sampling (types of sampling and sample size estimation).

Keywords: Biostatistics, descriptive statistics, sample size, variables
Address for correspondence: Dr. Feroze Kaliyadan, Department of Dermatology, King Faisal University, Saudi Arabia. E‑mail: ferozkal@hotmail.com

Access this article online: Website: www.idoj.in; DOI: 10.4103/idoj.IDOJ_468_18

This is an open access journal, and articles are distributed under the terms of the Creative Commons Attribution‑NonCommercial‑ShareAlike 4.0 License, which allows others to remix, tweak, and build upon the work non‑commercially, as long as appropriate credit is given and the new creations are licensed under the identical terms.

How to cite this article: Kaliyadan F, Kulkarni V. Types of variables, descriptive statistics, and sample size. Indian Dermatol Online J 2019;10:82-6.

For reprints contact: reprints@medknow.com

Received: December, 2018. Accepted: December, 2018.

Variables

What is a variable?[1,2] To put it in very simple terms, a variable is an entity whose value varies. A variable is an essential component of any statistical data. It is a feature of a member of a given sample or population, which is unique, and can differ in quantity or quality from another member of the same sample or population. Variables either are the primary quantities of interest or act as practical substitutes for the same. The importance of variables is that they help in operationalization of concepts for data collection. For example, if you want to do an experiment based on the severity of urticaria, one option would be to measure the severity using a scale to grade severity of itching. This becomes an operational variable. For a variable to be “good,” it needs to have some properties such as good reliability and validity, low bias, feasibility/practicality, low cost, objectivity, clarity, and acceptance. Variables can be classified in various ways, as discussed below.

Quantitative vs qualitative

A variable can collect either qualitative or quantitative data. A variable differing in quantity is called a quantitative variable (e.g., weight of a group of patients), whereas a variable differing in quality is called a qualitative variable (e.g., the Fitzpatrick skin type).

A simple test which can be used to differentiate between qualitative and quantitative variables is the subtraction test. If you can subtract the value of one variable from the other to get a meaningful result, then you are dealing with a quantitative variable (this of course will not apply to rating scales/ranks).

Quantitative variables can be either discrete or continuous

Discrete variables are variables in which no values may be assumed between the two given values (e.g., number of lesions in each patient in a sample of patients with urticaria).

Continuous variables, on the other hand, can take any value in between the two given values (e.g., duration for which the weals last in the same sample of patients with urticaria). One way of differentiating between continuous and discrete variables is to use the “mid‑way” test. If, for every pair of values of a variable, a value exactly mid‑way between them is meaningful, the variable is continuous. For example, two values for the time taken for a weal to subside can be 10 and 13 min. The mid‑way value would be 11.5 min, which makes sense. However, for the number of weals, suppose you have a pair of values, 5 and 8; the mid‑way value would be 6.5 weals, which does not make sense.

Under the umbrella of qualitative variables, you can have nominal/categorical variables and ordinal variables

Nominal/categorical variables are, as the name suggests, variables which can be slotted into different categories (e.g., gender or type of psoriasis).

82 © 2019 Indian Dermatology Online Journal | Published by Wolters Kluwer ‑ Medknow


Kaliyadan and Kulkarni: Variables, descriptive statistics and sample size

Ordinal variables or ranked variables are similar to categorical, but can be put into an order (e.g., a scale for severity of itching).

Dependent and independent variables

In the context of an experimental study, the dependent variable (also called outcome variable) is directly linked to the primary outcome of the study. For example, in a clinical trial on psoriasis, the PASI (psoriasis area severity index) would possibly be one dependent variable. The independent variable (sometimes also called explanatory variable) is something which is not affected by the experiment itself but which can be manipulated to affect the dependent variable. Other terms sometimes used synonymously include blocking variable, covariate, or predictor variable. Confounding variables are extra variables, which can have an effect on the experiment. They are linked with dependent and independent variables and can cause spurious association. For example, in a clinical trial for a topical treatment in psoriasis, the concomitant use of moisturizers might be a confounding variable. A control variable is a variable that must be kept constant during the course of an experiment.

Descriptive Statistics

Statistics can be broadly divided into descriptive statistics and inferential statistics.[3,4] Descriptive statistics give a summary about the sample being studied without drawing any inferences based on probability theory. Even if the primary aim of a study involves inferential statistics, descriptive statistics are still used to give a general summary. When we describe the population using tools such as frequency distribution tables, percentages, and measures of central tendency like the mean, for example, we are talking about descriptive statistics. When we use a specific statistical test (e.g., Mann–Whitney U‑test) to compare the scores of two groups and express the result in terms of statistical significance, we are talking about inferential statistics. Descriptive statistics can help in summarizing data in the form of simple quantitative measures such as percentages or means or in the form of visual summaries such as histograms and box plots.

Descriptive statistics can be used to describe a single variable (univariate analysis) or more than one variable (bivariate/multivariate analysis). In the case of more than one variable, descriptive statistics can help summarize relationships between variables using tools such as scatter plots.

Descriptive statistics can be broadly put under two categories:
• Sorting/grouping and illustration/visual displays
• Summary statistics.

Sorting and grouping

Sorting and grouping is most commonly done using frequency distribution tables. For continuous variables, it is generally better to use groups in the frequency table. Ideally, group sizes should be equal (except in extreme ends where open groups are used; e.g., age “greater than” or “less than”).

Another form of presenting frequency distributions is the “stem and leaf” diagram, which is considered to be a more accurate form of description.

Suppose the weight in kilograms of a group of 10 patients is as follows:

56, 34, 48, 43, 87, 78, 54, 62, 61, 59

The “stem” records the value of the “tens” place (or higher) and the “leaf” records the value in the “ones” place [Table 1].

Table 1: Stem and leaf plot
Stem    Leaf
0       ‑
1       ‑
2       ‑
3       4
4       3 8
5       4 6 9
6       1 2
7       8
8       7
9       ‑

Illustration/visual display of data

The most common tools used for visual display include frequency diagrams, bar charts (for noncontinuous variables) and histograms (for continuous variables). Composite bar charts can be used to compare variables. For example, the frequency distribution in a sample population of males and females can be illustrated as given in Figure 1.

A pie chart helps show how a total quantity is divided among its constituent variables. Scatter diagrams can be used to illustrate the relationship between two variables, for example, global scores given for improvement in a condition like acne by the patient and the doctor [Figure 2].

Summary statistics

The main tools used for summary statistics are broadly grouped into measures of central tendency (such as mean, median, and mode) and measures of dispersion or variation (such as range, standard deviation, and variance).

Imagine that the data below represent the weights of a sample of 15 pediatric patients arranged in ascending order:

30, 35, 37, 38, 38, 38, 42, 42, 44, 46, 47, 48, 51, 53, 86

Indian Dermatology Online Journal | Volume 10 | Issue 1 | January-February 2019 83



Figure 1: Composite bar chart

Just having the raw data does not mean much to us, so we try to express it in terms of some values, which give a summary of the data.

Mean

The mean is basically the sum of all the values divided by the total number of values. In this case, we get a value of 45 (675 ÷ 15).

The problem is that some extreme values (outliers), like “86” in this case, can skew the value of the mean. In such cases, we consider other values like the median, which is the point that divides the distribution into two equal halves. It is also referred to as the 50th percentile (50% of the values are above it and 50% are below it). In our previous example, since we have already arranged the values in ascending order, we find that the point which divides it into two equal halves is the 8th value, 42. If the total number of values is even, we take the two middle values and average them to reach the median.

The mode is the most common data point. In our example, this would be 38. The mode, as in our case, may not necessarily be in the center of the distribution.
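
As a quick check of the figures above, the following minimal Python sketch recomputes the three measures of central tendency for the 15 example weights using the standard library `statistics` module. The variable names (`weights`, `without_outlier`) are illustrative and not from the article. The last lines, which are an added illustration, drop the outlier (86) to show that the mean shifts noticeably while the median does not.

```python
from statistics import mean, median, mode

# Weights of the 15 pediatric patients from the example, in ascending order.
weights = [30, 35, 37, 38, 38, 38, 42, 42, 44, 46, 47, 48, 51, 53, 86]

sample_mean = mean(weights)      # sum of values / number of values -> 45
sample_median = median(weights)  # 8th of the 15 ordered values -> 42
sample_mode = mode(weights)      # most frequent value -> 38

print(sample_mean, sample_median, sample_mode)

# Added illustration (not in the article): remove the outlier and recompute.
without_outlier = weights[:-1]
print(mean(without_outlier))     # 589 / 14, roughly 42.07
print(median(without_outlier))   # average of the 7th and 8th values -> 42.0
```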
Figure 2: Scatter diagram

When the data are skewed or contain outliers, the median is generally the best measure of central tendency from among the mean, median, and mode. In a “symmetric” distribution, all three are the same, whereas in skewed data the median and mean are not the same; both lie more toward the skew, with the mean lying further toward the skew than the median. For example, in Figure 3, a right skewed distribution is seen (direction of skew is based on the tail); the distribution of data values extends further on the right‑hand (positive) side than on the left‑hand side. The mean is typically greater than the median in such cases.
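
As a rough check of this right‑skew behaviour on the 15 example weights, the sketch below compares the mean with the median and computes a moment coefficient of skewness (third central moment divided by the second central moment raised to the power 1.5). This particular coefficient is a common textbook choice and an assumption here; the article does not state which formula it has in mind, and the function name `moment_skewness` is hypothetical.

```python
from statistics import mean, median

# Weights of the 15 pediatric patients from the example.
weights = [30, 35, 37, 38, 38, 38, 42, 42, 44, 46, 47, 48, 51, 53, 86]


def moment_skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5, using population moments."""
    n = len(values)
    centre = mean(values)
    m2 = sum((x - centre) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - centre) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5


skew = moment_skewness(weights)

# A long right tail (the value 86) pulls the mean above the median
# and gives a positive skewness coefficient.
print(mean(weights) > median(weights))
print(skew > 0)
```

A perfectly symmetric set of values, such as 1, 2, 3, 4, 5, gives a coefficient of zero under this definition.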
Figure 3: Location of mode, median, and mean

Measures of dispersion

The range gives the spread between the lowest and highest values. In our previous example, this will be 86 − 30 = 56.

A more valuable measure is the interquartile range. A quartile is one of the values which break the distribution into four equal parts. The 25th percentile is the data point which divides the group between the first one‑fourth and the last three‑fourths of the data. The first one‑fourth will form the first quartile. The 75th percentile is the data point which divides the distribution into a first three‑fourths and last one‑fourth (the last one‑fourth being the fourth quartile). The range between the 25th percentile and 75th percentile is called the interquartile range.

Variance is also a measure of dispersion. The larger the variance, the further the individual units are from the mean. Let us consider the same example we used for calculating the mean. The mean was 45. For the first value (30), the deviation from the mean will be −15; for the last value (86), the deviation will be +41. Similarly we can calculate the deviations for all values in a sample. Adding these deviations and averaging will give a clue to the total dispersion, but the problem is that since the deviations are a mix of negative and positive values, the final total becomes zero. To calculate the variance, this problem is overcome by adding squares of the deviations. So variance would be the sum of squares of the deviations divided by the total number in the population (for a sample we use “n − 1”). To get a more realistic value of the average dispersion, we take the square root of the variance, which is called the “standard deviation.”

The box plot

The box plot is a composite representation that portrays the median, the quartiles, the range, and the outliers; some versions also mark the mean [Figure 4].
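
The measures of dispersion described above can be checked on the same 15 weights with a short Python sketch using the standard library `statistics` module. The variable names are illustrative. Note, as an assumption worth flagging, that several conventions exist for locating quartiles; `statistics.quantiles` uses its default “exclusive” method here, and other software may report slightly different quartile values.

```python
from statistics import mean, pvariance, quantiles, stdev, variance

# Weights of the 15 pediatric patients from the example.
weights = [30, 35, 37, 38, 38, 38, 42, 42, 44, 46, 47, 48, 51, 53, 86]

data_range = max(weights) - min(weights)          # 86 - 30 = 56

# 25th, 50th and 75th percentiles (default "exclusive" method).
q1, q2, q3 = quantiles(weights, n=4)
iqr = q3 - q1                                     # interquartile range

# Deviations from the mean: negative and positive values cancel out.
deviations = [x - mean(weights) for x in weights]
sum_of_deviations = sum(deviations)               # 0
sum_of_squares = sum(d ** 2 for d in deviations)  # 2350

population_variance = pvariance(weights)  # sum_of_squares / n
sample_variance = variance(weights)       # sum_of_squares / (n - 1)
sample_sd = stdev(weights)                # square root of the sample variance

print(data_range, (q1, q2, q3), iqr)
print(sum_of_deviations, sum_of_squares)
print(round(population_variance, 2), round(sample_variance, 2), round(sample_sd, 2))
```

The interquartile range is unaffected by the single extreme value (86), whereas the range, variance, and standard deviation are all inflated by it.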


The concept of skewness and kurtosis

Skewness is a measure of the symmetry of a distribution. Basically, if the distribution curve is symmetric, it looks the same on either side of the central point. When this is not the case, it is said to be skewed. Kurtosis is a representation of outliers. Distributions with high kurtosis tend to have “heavy tails,” indicating a larger number of outliers, whereas distributions with low kurtosis have light tails, indicating fewer outliers. There are formulas to calculate both skewness and kurtosis [Figures 5-8].

Figure 4: Box plot

Figure 5: Positive skew

Figure 6: Negative skew

Figure 7: Low kurtosis (negative kurtosis, also called “platykurtic”)

Figure 8: High kurtosis (positive kurtosis, also called “leptokurtic”)

Sample Size

In an ideal study, we should be able to include all units of a particular population under study, something that is referred to as a census.[5,6] This would remove the chances of sampling error (the difference between the outcome characteristics in a random sample when compared with the true population values, something that is virtually unavoidable when you take a random sample). However, it is obvious that this would not be feasible in most situations. Hence, we have to study a subset of the population to reach our conclusions. This representative subset is a sample, and we need to have sufficient numbers in this sample to make meaningful and accurate conclusions and reduce the effect of sampling error.

We also need to know that broadly sampling can be divided into two types: probability sampling and nonprobability sampling. Examples of probability sampling include methods such as simple random sampling (each member in a population has an equal chance of being selected), stratified random sampling (in nonhomogeneous populations, the population is divided into subgroups, followed by random sampling in each subgroup), systematic sampling (sampling is based on a systematic technique, e.g., every third person is selected for a survey), and cluster sampling (similar to stratified sampling except that the clusters here are preexisting clusters, unlike stratified sampling where the researcher decides on the stratification criteria). Nonprobability sampling, where every unit in the population does not have an equal chance of inclusion into the sample, includes methods such as convenience sampling (e.g., sample selected based on ease of access) and purposive sampling (where only people who meet specific criteria are included in the sample).

An accurate calculation of sample size is an essential aspect of good study design. It is important to calculate the sample size much in advance, rather than have to go for post hoc analysis. A sample size that is too small may make the study underpowered, whereas a sample size which is more than necessary might lead to a wastage of resources.

We will first go through the sample size calculation for a hypothesis‑based design (like a randomized control trial).

The important factors to consider for sample size calculation include study design, type of statistical test, level of significance, power and effect size, variance (standard deviation for quantitative data), and expected proportions in the case of qualitative data. This is based on previous data, either based on previous studies or based on the clinicians’ experience. In case the study is something being conducted for the first time, a pilot study might be conducted which helps generate these data for further studies based on a larger sample size. It is also important to know whether the data follow a normal distribution or not.

Two essential aspects we must understand are the concepts of Type I and Type II errors. In a study that compares two groups, a null hypothesis assumes that there is no significant difference between the two groups, with any observed difference being due to sampling or experimental error. When we reject a null hypothesis when it is true, we label it as a Type I error (its probability is denoted as “alpha,” correlating with significance levels). In a Type II error (its probability is denoted as “beta”), we fail to reject a null hypothesis when the alternate hypothesis is actually true. The complement of the Type II error probability, “1 − β,” is the power of the test. While there are no absolute rules, the minimal levels accepted are 0.05 for α (corresponding to a significance level of 5%) and 0.20 for β (corresponding to a minimum recommended power of “1 − 0.20,” or 80%).

Effect size and minimal clinically relevant difference

For a clinical trial, the investigator will have to decide in advance what clinically detectable change is significant (for numerical data, this could be based on the anticipated outcome means in the two groups, whereas for categorical data, it could correlate with the proportions of successful outcomes in two groups). While we will not go into details of the formula for sample size calculation, some important points are as follows:

In the context where effect size is involved, the sample size is inversely proportional to the square of the effect size. What this means in effect is that reducing the effect size will lead to an increase in the required sample size.

Reducing the level of significance (alpha) or increasing power (1 − β) will lead to an increase in the calculated sample size.

An increase in variance of the outcome leads to an increase in the calculated sample size.

A note is that for estimation type of studies/surveys, sample size calculation needs to consider some other factors too. This includes an idea about total population size (this generally does not make a major difference when population size is above 20,000, so in situations where population size is not known we can assume a population of 20,000 or more). The other factor is the “margin of error,” the amount of deviation which the investigators find acceptable in terms of percentages. Regarding confidence levels, ideally, a 95% confidence level is the minimum recommended for surveys too. Finally, we need an idea of the expected/crude prevalence, either based on previous studies or based on estimates.

Sample size calculation also needs to add corrections for patient drop‑outs/lost‑to‑follow‑up patients and missing records. An important point is that in some studies dealing with rare diseases, it may be difficult to achieve the desired sample size. In these cases, the investigators might have to rework outcomes or maybe pool data from multiple centers. Although post hoc power can be analyzed, a better approach suggested is to calculate 95% confidence intervals for the outcome and interpret the study results based on this.

Financial support and sponsorship

Nil.

Conflicts of interest

There are no conflicts of interest.

References

1. Variable classification. In: Seltman HJ, editor. Experimental Design and Analysis. 1st ed. Pittsburgh, PA: Carnegie Mellon University; 2012; p. 9‑18.
2. Hoeks S, Kardys I, Lenzen M, van Domburg R, Boersma E. Tools and techniques – Statistics: Descriptive statistics. EuroIntervention 2013;9:1001‑3.
3. Review of probability. In: Seltman HJ, editor. Experimental Design and Analysis. 1st ed. Pittsburgh, PA: Carnegie Mellon University; 2012; p. 19‑60.
4. Nick TG. Descriptive statistics. Methods Mol Biol 2007;404:33‑52.
5. Endacott R, Botti M. Clinical research 3: Sample selection. Accid Emerg Nurs 2007;15:234‑8.
6. Hazra A, Gogtay N. Biostatistics series module 5: Determining sample size. Indian J Dermatol 2016;61:496‑504.
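
As a supplementary illustration of the sample size relationships described under “Effect size and minimal clinically relevant difference” and in the note on surveys, the Python sketch below uses two widely taught normal‑approximation formulas. These formulas are an assumption on our part, since the article deliberately does not give one: per‑group n = 2(z₁₋α/₂ + z₁₋β)²σ²/δ² for comparing two means, and n₀ = z²p(1 − p)/e² with a finite population correction for estimating a proportion. The function names and example inputs are hypothetical, and a real study would normally rely on dedicated sample size software or a statistician.

```python
import math
from statistics import NormalDist


def n_per_group_two_means(delta, sigma, alpha=0.05, power=0.80):
    """Per-group sample size for comparing two means.

    Assumes a two-sided test, equal group sizes, a common standard
    deviation `sigma`, and the normal approximation. `delta` is the
    minimal clinically relevant difference between the group means.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2)


def n_for_proportion(p, margin, population=None, confidence=0.95):
    """Sample size for estimating a prevalence `p` within +/- `margin`.

    If `population` is given, a finite population correction is applied.
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n0 = z ** 2 * p * (1 - p) / margin ** 2
    if population is not None:
        n0 = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n0)


# Halving the effect size roughly quadruples the required sample size.
print(n_per_group_two_means(delta=5, sigma=10))    # -> 63
print(n_per_group_two_means(delta=2.5, sigma=10))  # -> 252

# A stricter alpha, higher power, or larger variance all increase n.
print(n_per_group_two_means(delta=5, sigma=10, alpha=0.01))
print(n_per_group_two_means(delta=5, sigma=10, power=0.90))
print(n_per_group_two_means(delta=5, sigma=20))

# Survey example: 50% expected prevalence, 5% margin of error, 95% confidence.
print(n_for_proportion(0.5, 0.05, population=20000))  # -> 377
print(n_for_proportion(0.5, 0.05))                    # no correction -> 385
```

Under these assumptions, the survey figures differ only slightly between a population of 20,000 and an effectively unlimited one, which is consistent with the article's remark that population size makes little difference above 20,000. Corrections for drop‑outs and missing records are not included in this sketch.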
