Unit 5
Stage – II: Manual desk editing is a traditional method carried out by a
specialized editing team. If the data was collected on paper, it is checked after
collection and before it is entered into the database. If the data was collected
electronically, the forms entered into the database are reviewed individually.
Automated data editing uses computer programs and systems to check all the data
at once after it has been entered electronically. These programs and systems
contain the auditing rules that validate the data, detect errors and identify
unacceptable responses.
▪ What is Code?
▪ A code in research methodology is a short word or phrase describing the
meaning and context of the whole sentence, phrase or paragraph. The code
makes the process of data analysis easier.
▪ Numerical quantities can be assigned to codes, and these quantities can then be
interpreted. Codes help quantify qualitative data and give meaning to raw data.
▪ What is Coding?
▪ Data coding is the process of deriving codes from the observed data. In
qualitative research the data is either obtained from observations, interviews or
from questionnaires.
▪ Preliminary Codes
When a data coder first assigns codes to the observed data, the coder usually cannot
assign well-refined codes at the first attempt.
▪ Final Codes
The final codes will help you observe a better pattern in the data. This pattern is
necessary to reach the final evaluation or analysis stage of the data.
▪ Textual
In this form of presentation, data is simply stated as text, generally in a paragraph. This is commonly used when the
data is not very large.
E.g. "The 2002 earthquake proved to be a mass murderer of humans. As many as 10,000 citizens have been reported dead."
▪ Tabular
The data is organised in rows and columns. This is one of the most widely used forms of presentation of data since data tables are easy
to construct and read
Components:
▪ Table Number
▪ Title
▪ Headnotes
▪ Stubs
▪ Caption
▪ Body or Field
▪ Footnotes
▪ Source
▪ Ease of representation
▪ Ease of analysis
▪ Helps in comparison
▪ Economical
▪ Quantitative Classification: In quantitative classification, data is classified on the
basis of quantitative attributes.
0-50 29
51-100 54
▪ Spatial Classification: When data is classified according to location, it becomes a
spatial classification
India 1,24,000
China 56,000
▪ The frequency of a particular data value is the number of times the data value
occurs. E.g. if four students have a score of 80 in mathematics, then the score
of 80 is said to have a frequency of 4. The frequency of a data value is often
represented by f.
▪ A frequency table is constructed by arranging collected data values in ascending
order of magnitude with their corresponding frequencies
▪ Example: the marks awarded for an assignment set for a year 8 class of 20 students
were as follows:
4) Decide the starting point: The lower class limit or class boundary of the first
class should cover the smallest value in the raw data. It is usually a multiple of
the class interval.
Example: 0, 5, 10, 15, 20, etc. are commonly used.
5) Determine the remaining class limits (boundaries): Once the lowest class boundary
has been decided, add the class interval size to it to compute the upper class
boundary of the first class. The remaining lower and upper class limits are
determined by adding the class interval size repeatedly until the largest value in
the data is covered.
6) Distribute the data into respective classes: Each observation is assigned to its
class using the tally bar method. The tally bars in each class are then counted to
get the frequency of that class. Recording the frequency of every class gives the
grouped frequency distribution of the data. The total of the frequency column must
equal the number of observations.
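Steps 4 to 6 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the marks are invented, the starting point is 0, the class interval is 5, and classes are treated as boundaries of the form "lower to under upper". The function name `grouped_frequency` is hypothetical.

```python
from collections import Counter

def grouped_frequency(values, start, width):
    """Return (lower, upper, frequency) for each class [lower, upper)."""
    if min(values) < start:
        raise ValueError("the starting point must cover the smallest value")
    # Class index of each value: how many whole intervals it lies above start.
    counts = Counter((v - start) // width for v in values)
    table = []
    for k in range(max(counts) + 1):
        lower = start + k * width          # step 5: add the interval repeatedly
        table.append((lower, lower + width, counts.get(k, 0)))
    return table

# Hypothetical marks, invented for illustration only.
marks = [3, 7, 12, 14, 8, 19, 5, 11, 16, 9]
table = grouped_frequency(marks, start=0, width=5)

for lower, upper, f in table:
    tally = "|" * f                        # step 6: one tally bar per observation
    print(f"{lower:>2} to under {upper:<2}  {tally:<5}  {f}")

# The frequencies must add up to the number of observations.
total = sum(f for _, _, f in table)
print("Total:", total)
```

The final total is the check described in step 6: if it differs from the number of observations, a value was missed or counted twice.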
There are four important characteristics of frequency distribution. They are as
follows:
▪ Measures of central tendency and location (mean, median, mode)
▪ Measures of dispersion (range, variance, standard deviation)
▪ The extent of symmetry/asymmetry (skewness)
▪ The flatness or peakedness (kurtosis)
Generally, the central tendency of a dataset can be described using the following
measures:
▪ Mean (Average): Represents the sum of all values in a dataset divided by the total
number of the values
▪ Median: the middle value in a dataset that is arranged in ascending order (from
the smallest value to the largest value). If a dataset contains an even number of
values, the median of the dataset is the mean of the two middle values.
▪ Mode: Defines the most frequently occurring value in a dataset. In some cases, a
dataset may contain multiple modes, while some datasets may not have any mode
at all
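As a small sketch, the three measures can be computed with Python's standard `statistics` module. The scores below are invented for illustration and are not taken from the unit.

```python
from statistics import mean, median, multimode

# Hypothetical scores, invented for illustration only.
scores = [80, 65, 80, 90, 72, 65]

avg = mean(scores)         # sum of all values divided by the number of values
mid = median(scores)       # even count, so the mean of the two middle values
modes = multimode(scores)  # every value that occurs most often

print("Mean:", round(avg, 2))
print("Median:", mid)
print("Mode(s):", modes)
```

Here two values are tied for the highest frequency, so the dataset has two modes. When every value occurs exactly once, `multimode` returns all of them, which is usually read as the dataset having no mode.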
In statistics, the measures of dispersion help to interpret the variability of data i.e. to know
how much homogeneous or heterogeneous the data is. In simple terms, it shows how
squeezed or scattered the variable is.
▪ Range: It is simply the difference between the maximum value and the minimum value
in a data set. Example: 1, 3, 5, 6, 7 => Range = 7 - 1 = 6
▪ Variance: Subtract the mean from each value in the data set, square each difference,
add the squares, and divide the sum by the total number of values in the data set,
i.e. $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$
▪ Standard Deviation: The square root of the variance is known as the standard deviation,
i.e. $\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$
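The three measures can be checked on the data set from the range example (1, 3, 5, 6, 7). This is a minimal sketch that divides by the total number of values, as described above (the population form); the variable names are illustrative.

```python
from statistics import pvariance, pstdev

data = [1, 3, 5, 6, 7]

# Range: maximum minus minimum.
data_range = max(data) - min(data)

# Variance: mean of the squared deviations from the mean.
mu = sum(data) / len(data)
variance = sum((x - mu) ** 2 for x in data) / len(data)

# Standard deviation: square root of the variance.
std_dev = variance ** 0.5

print("Range:", data_range)
print("Variance:", round(variance, 2))
print("Standard deviation:", round(std_dev, 3))

# The standard library's population versions agree with the manual steps.
print(round(pvariance(data), 2), round(pstdev(data), 3))
```

Note that `statistics.variance` and `statistics.stdev` divide by N - 1 instead (the sample form), so they give slightly larger values on the same data.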
Skewness refers to a distortion or asymmetry that deviates from the symmetrical
bell curve, or normal distribution, in a set of data. If the curve is shifted to the left or
to the right, it is said to be skewed. Skewness can be quantified as a representation
of the extent to which a given distribution varies from a normal distribution. A
normal distribution has a skew of zero, while a lognormal distribution, for example,
would exhibit some degree of right-skew.
Like skewness, kurtosis is a statistical measure that is used to describe a distribution.
Whereas skewness differentiates extreme values in one versus the other tail, kurtosis
measures extreme values in either tail. Distributions with large kurtosis exhibit tail
data exceeding the tails of the normal distribution (e.g. five or more standard
deviations from the mean). Distributions with low kurtosis exhibit tail data that are
generally less extreme than the tails of the normal distribution.
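One common way to put numbers on skewness and kurtosis is through central moments. The sketch below assumes the simple moment-based (population) definitions and reports excess kurtosis, i.e. kurtosis minus 3, so that a normal distribution scores zero. The function names and the two data sets are invented for illustration.

```python
def central_moment(data, k):
    """k-th central moment: mean of (x - mean) ** k."""
    mu = sum(data) / len(data)
    return sum((x - mu) ** k for x in data) / len(data)

def skewness(data):
    """Third central moment scaled by the standard deviation cubed."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5

def excess_kurtosis(data):
    """Fourth central moment scaled by the variance squared, minus 3."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2 - 3

# Hypothetical data sets, invented for illustration only.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]   # one large value stretches the right tail

print("Skew (symmetric):", round(skewness(symmetric), 3))
print("Skew (right-skewed):", round(skewness(right_skewed), 3))
print("Excess kurtosis (symmetric):", round(excess_kurtosis(symmetric), 3))
```

A positive skewness indicates a longer right tail and a negative one a longer left tail. Statistical packages often apply small-sample corrections, so their results can differ slightly from these raw moment ratios.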
Apart from diagrams, graphic presentation is another way of presenting data
and information. Graphs are usually used to present time series and frequency
distributions.
▪ Suitable Title
▪ Unit of Measurement
▪ Suitable Scale
▪ Index
▪ Data Sources
▪ Keep it Simple
▪ Neat
▪ Bar Graph – contains a vertical axis and a horizontal axis and displays data as
rectangular bars with lengths proportional to the values they represent; a
useful visual aid for marketing purposes
▪ Histogram – a graphical representation of a frequency distribution that uses
adjacent vertical bars erected over intervals to represent the data frequency within
each interval; a useful visual aid for meteorology and environmental purposes
▪ Pie Chart – shows percentage values as slices of a pie; a useful visual aid for
marketing purposes
▪ A bar chart is used when you want to show a distribution of data points or perform
a comparison of metric values across different subgroups of your data. From a bar
chart, we can see which groups are highest or most common, and how the other groups
compare with them.
▪ A pie chart can only be used if the individual parts add up to a
meaningful whole, and is built for visualizing how each part contributes to that
whole. Meanwhile, a bar chart can be used for a broader range of data types, not
just for breaking down a whole into components.
▪ A histogram is used to summarize discrete or continuous data. In other words, it
provides a visual interpretation of numerical data by showing the number of data
points that fall within a specified range of values (called ‘bins’). It looks similar
to a vertical bar graph.
ADVANTAGES
▪ Understanding Content: Visuals are more effective than text in human understanding
▪ Flexibility of use: graphical representation can be leveraged in nearly every field involving data
▪ Increases structured thinking: users can make quick, data-driven decisions at a glance with visual aids
DISADVANTAGES
▪ Cost of human efforts and resources
▪ Process of selecting the most appropriate graphical and tabular representation of data
▪ Greater design complexity of visualizing data
▪ Potential for human bias
▪ Expectancy – It should state the expected relationships between variables
▪ Simplicity – It should be stated as far as possible in simple terms
▪ Objectivity – It should not include value judgements, relative terms or any moral preaching
▪ Theoretical Relevance – It should be consistent with a substantial body of established or known facts or existing theory
▪ Availability of Techniques – Statistical methods should be available for testing the proposed hypothesis
BASIC CONCEPTS IN HYPOTHESIS TESTING
▪ Null Hypothesis & Alternative Hypothesis
▪ Level of Significance
▪ Decision
▪ Type I and Type II Error
▪ The t-score is a ratio between the difference between two groups and
the difference within the groups
▪ The larger the t-score, the more difference there is between groups
▪ The smaller the t-score, the more similarity there is between groups.
▪ A t-score of 3 means that the difference between the groups is three times as
large as the difference within the groups.
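The "between versus within" ratio can be made concrete with one common form of the statistic, the pooled-variance t-score for two independent samples. This is a sketch under that assumption (equal variances in both groups); the function name `independent_t` and the two groups are invented for illustration.

```python
from math import sqrt
from statistics import mean, variance

def independent_t(a, b):
    """Pooled-variance t-score for two independent samples."""
    na, nb = len(a), len(b)
    # Within-group spread: sample variances pooled across both groups.
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    standard_error = sqrt(pooled * (1 / na + 1 / nb))
    # Between-group difference divided by the within-group spread.
    return (mean(a) - mean(b)) / standard_error

# Hypothetical groups, invented for illustration only.
group_a = [5, 6, 7, 8, 9]
group_b = [1, 2, 3, 4, 5]

t = independent_t(group_a, group_b)
print("t-score:", round(t, 3))
```

The group means differ by 4 while the standard error is 1, so the difference between the groups is four times the spread within them.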
▪ T-Values and P-Values: How big is “big enough”? Every t-value has a p-value to go
with it. A p-value is the probability of obtaining results at least as extreme as
those in your sample data if the null hypothesis were true, i.e. by chance alone.
P-values range from 0% to 100%. They are usually written as a decimal, e.g. a
p-value of 5% is 0.05. A low p-value indicates that your results would be unlikely
to occur by chance alone. E.g. a p-value of 0.01 means that, if the null hypothesis
were true, there would be only a 1% probability of seeing results this extreme. In
most cases, a p-value of 0.05 (5%) or less is accepted as statistically significant.
▪ There are three main types of t-tests:
▪ An Independent Samples t-test compares the means for two groups
▪ A paired sample t-test compares means from the same group at different times (say, one
year apart)
▪ A one sample t-test tests the mean of a single group against a known mean
▪ A z-test is a statistical test used to determine whether two population means are
different when the variances are known and the sample size is large
▪ The test statistic is assumed to have a normal distribution and nuisance parameters
such as standard deviation should be known in order for an accurate z-test to be
performed
▪ Z-test is a hypothesis test in which the z-statistic follows a normal distribution
▪ A z-statistic or z-score is a number representing the result from z-test
▪ Z-tests are closely related to t-tests, but t-tests are best performed when an
experiment has a small sample size
▪ Z-tests assume the standard deviation is known, while t-tests assume it is unknown
Several different types of tests are used in statistics. You would use a Z test if:
▪ Your sample size is greater than 30. Otherwise, use a t-test.
▪ Data points should be independent from each other. In other words, one data
point isn’t related to and doesn’t affect another data point
▪ Your data should be normally distributed. However, for large sample sizes (over 30)
this doesn’t always matter.
▪ Your data should be randomly selected from a population, where each item has an
equal chance of being selected.
▪ Sample sizes should be equal if at all possible.
▪ Let’s say we need to determine whether girls on average score higher than 600 in the
exam. We know that the standard deviation of girls' scores is 100. So,
we collect the data of 20 girls by random sampling and record their marks.
Finally, we also set our α value (significance level) to 0.05
▪ In this example:
▪ Mean Score for Girls is 640
▪ The size of the sample is 20
▪ The population mean is 600
▪ Standard Deviation for Population is 100
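The figures in this example are enough to carry the test through. The sketch below assumes a one-tailed (right-tailed) test, since the question asks whether the girls score higher than 600, with the null hypothesis that the population mean is 600. The variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

sample_mean = 640   # mean score for the sampled girls
n = 20              # sample size
pop_mean = 600      # population mean under the null hypothesis
pop_sd = 100        # known population standard deviation
alpha = 0.05        # significance level

# z = (sample mean - population mean) / (sigma / sqrt(n))
standard_error = pop_sd / sqrt(n)
z = (sample_mean - pop_mean) / standard_error

# Right-tailed p-value: probability of a z at least this large under H0.
p_value = 1 - NormalDist().cdf(z)
reject_null = p_value < alpha

print("z-score:", round(z, 3))
print("p-value:", round(p_value, 4))
print("Reject null hypothesis:", reject_null)
```

Under these assumptions the p-value falls below α, so the null hypothesis is rejected in favour of the girls scoring higher than 600 on average. A two-tailed version would double the p-value and should be checked separately.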
▪ https://youtu.be/3Jhg4tbtIuM
▪ https://youtu.be/qTk4pWRH7Ic
▪ https://youtu.be/tcWpAP0JKSE
▪ https://youtu.be/sCfpA_vycQI
▪ https://youtu.be/yrPsgj6gThY