
DATA PROCESSING AND ANALYSIS

Data Processing And Analysis


• The data, after collection, has to be processed and
analysed in accordance with the outline laid down for
the purpose at the time of developing the research plan
• Processing implies editing, coding, classification and
tabulation of collected data so that they are amenable to
analysis.
• The term analysis refers to the computation of certain
measures along with searching for patterns of
relationship that exist among data-groups.
Processing Operations
1. Editing
• Editing of data is a process of examining the collected
raw data to detect errors and omissions and to correct
these when possible
• As a matter of fact, editing involves a careful scrutiny of
the completed questionnaires and/or schedules.
• Editing is done to assure that the data are accurate,
consistent with other facts gathered, uniformly entered,
as complete as possible, and well arranged to
facilitate coding and tabulation.
Processing Operations…
• There are two stages at which editing should be done:
a. Field editing
• Consists of the review of the reporting forms by the
investigator for completing (translating or rewriting)
what the investigator has written in abbreviated
and/or in illegible form at the time of recording the
respondents’ responses.
• This type of editing is necessary in view of the fact that
individual writing styles often can be difficult for
others to decipher.
Processing Operations…
b. Central editing
• This type of editing implies that all forms should get a
thorough editing by a single editor in a small study and
by a team of editors in case of a large inquiry.
• Editor(s) may correct the obvious errors such as an
entry in the wrong place, entry recorded in months
when it should have been recorded in weeks, and the
like.
• In case of inappropriate or missing replies, the editor
can sometimes determine the proper answer by
reviewing the other information in the schedule.
• At times, the respondent can be contacted for
clarification.
Processing Operations…
2. Coding
• Coding refers to the process of assigning numerals
or other symbols to answers so that responses can be
put into a limited number of categories or classes.
• Such classes should be:
i. appropriate to the research problem under
consideration;
ii. exhaustive (i.e., there must be a class for every
data item) and mutually exclusive;
iii. uni-dimensional, that is, every class is defined in
terms of only one concept (see the sketch below).
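As a rough illustration of these requirements, the Python sketch below pre-codes a hypothetical agreement scale; the response categories, code values and the catch-all "Other" code are assumptions chosen so that the classes stay exhaustive and mutually exclusive.

```python
# A minimal sketch of pre-coding survey responses (hypothetical categories).
RESPONSE_CODES = {
    "strongly agree": 1,
    "agree": 2,
    "neutral": 3,
    "disagree": 4,
    "strongly disagree": 5,
}
OTHER_CODE = 9  # catch-all so every response maps to exactly one class

def code_response(answer: str) -> int:
    """Map a raw answer to its numeric code (mutually exclusive classes)."""
    return RESPONSE_CODES.get(answer.strip().lower(), OTHER_CODE)

print(code_response("Agree"))       # 2
print(code_response("No comment"))  # 9: falls into the exhaustive catch-all
```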
Processing Operations…
Coding…
• Coding is necessary for efficient analysis and through it
the several replies may be reduced to a small number of
classes which contain the critical information required
for analysis.
• Coding decisions should usually be taken at the
designing stage of the questionnaire.
• This makes it possible to pre-code the questionnaire
choices, which in turn is helpful for computer
tabulation, as one can key punch straight from the
original questionnaires.
Processing Operations…
3. Classification
• Most research studies result in a large volume of raw
data which must be reduced into homogeneous groups
if we are to get meaningful relationships.
• This fact necessitates classification of data which
happens to be the process of arranging data in groups or
classes on the basis of common characteristics.
• Classification can be one of the following two types,
depending upon the nature of the phenomenon
involved:
Processing Operations…
Classification…
(a) Classification according to attributes
• Data classified on the basis of common characteristics which
are descriptive (such as literacy, sex, honesty, etc.) is said to
be classified according to attributes.
• Such classification can be simple classification or manifold
classification.
• In simple classification we consider only one attribute and
divide the population into two classes—one class consisting
of items possessing the given attribute and the other class
with items which do not possess the attribute.
• In manifold classification we consider two or more attributes
simultaneously, and divide the data into a number of classes.
Processing Operations…
(b) Classification according to class-intervals
• Unlike descriptive characteristics, the numerical
characteristics refer to quantitative phenomenon
which can be measured through some statistical
units.
• Data relating to income, production, age, weight,
etc. come under this category.
• Such data are known as statistics of variables and are
classified on the basis of class intervals.
• Classification according to class intervals usually
involves the following three main problems:
Processing Operations…
Classification according to class-intervals…
i. How many classes should there be? What should be their
magnitudes?
• There can be no specific answer with regard to the
number of classes. Typically, we may have 5 to 15
classes.
• With regard to the second part of the question, to the
extent possible, class-intervals should be of equal
magnitudes, but in some cases unequal magnitudes
may result in better classification.
• Multiples of 2, 5 and 10 are generally preferred while
determining class magnitudes.
Processing Operations…
• Classification according to class-intervals…
• Some statisticians adopt the following formula,
suggested by H.A. Sturges, for determining the size of
class interval:
i = R/(1 + 3.3 log N)
Where
i = size of class interval;
R = Range (i.e., difference between the values of the largest item
and smallest item among the given items);
N = Number of items to be grouped.
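A minimal sketch of Sturges' formula in Python, with illustrative data assumed; for N = 10 items the denominator is 1 + 3.3 log 10 = 4.3.

```python
import math

def sturges_interval(values):
    """Class-interval size i = R / (1 + 3.3 * log10(N)), per Sturges' rule."""
    r = max(values) - min(values)      # R: range of the data
    n = len(values)                    # N: number of items to be grouped
    return r / (1 + 3.3 * math.log10(n))

data = [12, 47, 35, 20, 58, 41, 33, 26, 49, 15]  # illustrative items
print(round(sturges_interval(data), 2))          # 46 / 4.3, about 10.7
```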
Processing Operations…
Classification according to class-intervals…
(ii) How to choose class limits?
• While choosing class limits, the researcher must take
into consideration the criterion that the mid-point
(generally worked out first by taking the sum of the
upper limit and lower limit of a class and then divide
this sum by 2) of a class-interval and the actual average
of items of that class interval should remain as close to
each other as possible.
• Consistent with this, the class limits should be located
at multiples of 2, 5, 10, 20, 100 and such other figures.
• Class limits may generally be stated in any of the
following forms:
Processing Operations…
Exclusive type class intervals:
• They are usually stated as follows:
10–20
20–30
30–40
40–50
The above intervals should be read as under:
10 and under 20
20 and under 30
30 and under 40
40 and under 50
• Thus, under exclusive type class intervals, the upper limit of a class
interval is excluded: items whose values are equal to the upper limit of
a class are grouped in the next higher class. For example, an item
whose value is exactly 30 would be put in the 30–40 class interval and
not in the 20–30 class interval.
Processing Operations…
Inclusive type class intervals:
• They are usually stated as follows:
11–20
21–30
31–40
41–50
• In inclusive type class intervals the upper limit of a class
interval is also included in the concerning class interval.
• Thus, an item whose value is 20 will be put in the 11–20 class
interval. The stated upper limit of the class interval 11–20 is
20, but the real limit is 20.99999, and as such the 11–20 class
interval really means 11 and under 21.
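The difference between the two conventions can be made concrete with a small sketch; the interval start, width and helper names are assumptions for illustration.

```python
def exclusive_class(value, lower=10, width=10):
    """Exclusive intervals: '10-20' means 10 and under 20, so 30 falls in 30-40."""
    start = lower + ((value - lower) // width) * width
    return f"{start}-{start + width}"

def inclusive_class(value, lower=11, width=10):
    """Inclusive intervals: '11-20' includes 20 (real limits: 11 and under 21)."""
    start = lower + ((value - lower) // width) * width
    return f"{start}-{start + width - 1}"

print(exclusive_class(30))  # '30-40': 30 is excluded from 20-30
print(inclusive_class(20))  # '11-20': 20 is included
```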
Processing Operations…
• Classification according to class-intervals…
(iii) How to determine the frequency of each class?
• This can be done either by tally sheets or by mechanical
aids. Under the technique of tally sheet, the class-
groups are written on a sheet of paper (commonly
known as the tally sheet) and for each item a stroke
(usually a small vertical line) is marked against the class
group in which it falls.
• The general practice is that after every four small
vertical lines in a class group, the fifth line for an item
falling in the same group is drawn as a horizontal line
through the said four lines (see the sketch below).
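The tally technique translates directly into a frequency count; below is a minimal sketch using Python's standard library, with the data and class scheme assumed for illustration.

```python
from collections import Counter

def interval(value, lower=10, width=10):
    """Assign an item to its exclusive class interval."""
    start = lower + ((value - lower) // width) * width
    return f"{start}-{start + width}"

data = [12, 17, 23, 28, 31, 15, 24, 36, 22, 29, 33, 18]  # illustrative items
freq = Counter(interval(v) for v in data)
for cls in sorted(freq):
    print(cls, "|" * freq[cls], freq[cls])  # text 'strokes' echo the tally sheet
```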
Processing Operations…
4. Tabulation:
• Tabulation is the process of summarizing raw data and
displaying the same in compact form (i.e., in the form of
statistical tables) for further analysis.
• In a broader sense, tabulation is an orderly arrangement of
data in columns and rows.
• Tabulation is essential because of the following reasons.
i. It conserves space and reduces explanatory and descriptive
statements to a minimum.
ii. It facilitates the process of comparison.
iii. It facilitates the summation of items and the detection of errors
and omissions.
iv. It provides a basis for various statistical computations.
• Tabulation can be done by hand or by mechanical or
electronic devices.
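As a sketch of electronic tabulation, the snippet below summarizes hypothetical raw survey records into a compact two-way statistical table; it assumes the pandas library is available.

```python
import pandas as pd

raw = pd.DataFrame({
    "sex":      ["F", "M", "F", "M", "F", "M", "F", "M"],
    "literacy": ["literate", "literate", "illiterate", "literate",
                 "literate", "illiterate", "illiterate", "literate"],
})
# Orderly arrangement of data in rows and columns, with marginal totals
table = pd.crosstab(raw["sex"], raw["literacy"], margins=True)
print(table)
```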
Some Problems In Processing
• We can take up the following two problems of
processing the data for analytical purposes:
(a) The problem concerning “Don’t know” (or DK) responses
• One category of such responses may be ‘Don’t Know
Response’ or simply DK response. When the DK
response group is small, it is of little significance.
• But when it is relatively big, it becomes a matter of
major concern in which case the question arises: Is the
question which elicited DK response useless?
• The answer depends on two points, viz., whether the
respondent actually did not know the answer or whether
the researcher failed to obtain the appropriate
information.
Some Problems In Processing…
(b) Use of percentages
• Percentages are often used in data presentation, for
they simplify numbers, reducing them all to a 0
to 100 range. Through the use of percentages, the
data are reduced to a standard form with base
equal to 100, which facilitates relative
comparisons.
• While using percentages, the following rules should
be kept in view by researchers:
Some Problems In Processing…
1. Two or more percentages must not be averaged unless each is
weighted by the group size from which it has been derived (see the
sketch after this list).
2. Use of too large percentages should be avoided, since a large
percentage is difficult to understand and tends to confuse,
defeating the very purpose for which percentages are used.
3. Percentages hide the base from which they have been
computed. If this is not kept in view, the real differences may
not be correctly read.
4. Percentage decreases can never exceed 100 per cent and as
such for calculating the percentage of decrease, the higher
figure should invariably be taken as the base.
5. Percentages should generally be worked out in the direction of
the causal-factor in case of two-dimension tables and for this
purpose we must select the more significant factor out of the
two given factors as the causal factor.
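Rule 1 can be demonstrated with two hypothetical groups: 20% of 50 respondents and 60% of 200 respondents; the figures are assumptions for illustration.

```python
groups = [(20.0, 50), (60.0, 200)]  # (percentage, group size) pairs

unweighted = sum(p for p, _ in groups) / len(groups)
weighted = sum(p * n for p, n in groups) / sum(n for _, n in groups)
print(unweighted)  # 40.0: misleading, ignores the bases
print(weighted)    # 52.0: correct, weighted by group size
```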
DATA ANALYSIS
• The role of statistics in research is to function as a tool in
designing research, analysing its data and drawing
conclusions therefrom. There are two major areas of
statistics:
a) Descriptive statistics
b) Inferential statistics
• Descriptive statistics: concern the development of
certain indices from the raw data, whereas inferential
statistics are concerned with the process of generalisation.
• Inferential statistics: also known as sampling statistics,
are mainly concerned with two major types of problems:
(i) the estimation of population parameters, and (ii) the
testing of statistical hypotheses.
Descriptive Analysis
• Descriptive statistics consists of the following
measures:
1. Measures of central tendency or statistical
averages;
2. Measures of dispersion;
3. Measures of normality (skewness and kurtosis);
4. Measures of relationship; and
5. other measures.
Descriptive Analysis…
• Measures of central tendency: mean, median and mode
are commonly used. Geometric mean and harmonic mean
are sometimes used.
• Measures of dispersion: variance and standard deviation
are often used. Mean deviation and range are sometimes
used. For comparison purposes, the coefficient of standard
deviation and the coefficient of variation are used.
• Measures of skewness and kurtosis: skewness is measured
on the basis of mean and mode or of mean and median. Other
measures of skewness, based on quartiles or on the method
of moments, are also used sometimes.
• Kurtosis is also used to measure the peakedness of the curve
of the frequency distribution.
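A minimal sketch of these descriptive measures using Python's standard statistics module, on assumed illustrative data.

```python
import statistics

data = [4, 8, 6, 5, 3, 7, 8, 9, 4, 8]  # illustrative values

print("mean:    ", statistics.mean(data))      # 6.2
print("median:  ", statistics.median(data))
print("mode:    ", statistics.mode(data))      # 8, the most frequent value
print("variance:", statistics.variance(data))  # sample variance
print("std dev: ", statistics.stdev(data))     # sample standard deviation
# Coefficient of variation (std dev as a percentage of the mean),
# useful for the relative comparisons mentioned above
print("CV (%):  ", 100 * statistics.stdev(data) / statistics.mean(data))
```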
Descriptive Analysis…
• Measures of relationship: Karl Pearson’s coefficient of
correlation is the frequently used measure in case of
statistics of variables, whereas Yule’s coefficient of
association is used in case of statistics of attributes.
• Multiple correlation coefficient, partial correlation
coefficient, regression analysis, etc., are other important
measures often used by a researcher.
• Index numbers, analysis of time series, coefficient of
contingency, etc., are other measures that may as well be
used by a researcher, depending upon the nature of the
problem under study.
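A sketch of Karl Pearson's coefficient of correlation on assumed paired data; it presumes the SciPy library is available.

```python
from scipy import stats

x = [1, 2, 3, 4, 5, 6]  # hypothetical paired observations
y = [2, 4, 5, 4, 6, 8]
r, p_value = stats.pearsonr(x, y)
print(f"Pearson's r = {r:.3f}, p-value = {p_value:.3f}")
```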
Inferential Analysis
• Inferential analysis is concerned with the various tests
of significance for testing hypotheses in order to
determine with what validity data can be said to
indicate some conclusion or conclusions.
• It is also concerned with the estimation of
population values.
• It is mainly on the basis of inferential analysis that
the task of interpretation (the task of drawing
inferences and conclusions) is performed.
Sampling

• Sampling may be defined as the selection of some
part of an aggregate or totality on the basis of which
a judgement or inference about the aggregate or
totality is made.
• It is the process of obtaining information about an
entire population by examining only a part of it, in
this case the sample.
Need For Sampling
• Sampling is used in practice for a variety of reasons such
as:
1. Sampling can save time and money. A sample study is
usually less expensive than a census study and produces
results at a relatively faster speed.
2. Sampling may enable more accurate measurements for
a sample study is generally conducted by trained and
experienced investigators.
3. Sampling remains the only way when population
contains infinitely many members.
Need For Sampling…
4. Sampling remains the only choice when a test
involves the destruction of the item under study.
5. Sampling usually enables us to estimate the sampling
errors and, thus, assists in obtaining information
concerning some characteristic of the population.
Sampling Theory
• Sampling theory is a study of relationships existing
between a population and samples drawn from the
population.
• Sampling theory is applicable only to random
samples.
• The main problem of sampling theory is the problem
of relationship between a parameter and a statistic.
• The theory of sampling is concerned with estimating
the properties of the population from those of the
sample and also with gauging the precision of the
estimate.
Inferential Statistics
• Sampling theory is designed to attain one or more of
the following objectives:
(i) Statistical estimation: Sampling theory helps in
estimating unknown population parameters from a
knowledge of statistical measures based on sample
studies.
• In other words, to obtain an estimate of a parameter
from a statistic is the main objective of sampling
theory.
Inferential Statistics…
(ii) Testing of hypotheses: The second objective of
sampling theory is to enable us to decide whether to
accept or reject a hypothesis; the sampling theory
helps in determining whether observed differences
are actually due to chance or whether they are really
significant.
(iii) Statistical inference: Sampling theory helps in
making generalisation about the population from the
studies based on samples drawn from it. It also
helps in determining the accuracy of such
generalisations.
Sampling Definitions
• Some fundamental definitions concerning sampling
concepts and principles.
1. Universe/Population
• From a statistical point of view, the term
‘Universe’ refers to the total of the items or units in
any field of inquiry, whereas the term ‘population’
refers to the total of items about which information
is desired. The attributes that are the object of study
are referred to as characteristics and the units
possessing them are called elementary units.
Sampling Definitions…
2. Sampling frame
• The elementary units, or groups or clusters of such
units, may form the basis of the sampling process, in
which case they are called sampling units. A list
containing all such sampling units is known as the
sampling frame.
• This frame is either constructed by a researcher for
the purpose of his study or may consist of some
existing list of the population. For instance, one can
use a telephone directory as a frame for conducting an
opinion survey in a city.
Sampling Definitions…
3. Sampling design:
• A sample design is a definite plan for obtaining a
sample from the sampling frame. It refers to the
technique or the procedure the researcher would
adopt in selecting some sampling units from which
inferences about the population are drawn. Sampling
design is determined before any data are collected.
Various sampling designs have already been
explained earlier
Sampling Definitions…
4. Statistic and parameter
• A statistic is a characteristic of a sample, whereas a
parameter is a characteristic of a population.
• Thus, when we work out certain measures such as
mean, median, mode or the like from samples, they
are called statistic(s), for they describe the
characteristics of a sample. But when such measures
describe the characteristics of a population, they are
known as parameter(s). For instance, the population
mean µ is a parameter, whereas the sample mean x̄
is a statistic.
Sampling Definitions…
5. Sampling error
• Sample survey implies the study of a small portion
of the population and as such there would naturally
be a certain amount of inaccuracy in the
information collected. This inaccuracy may be
termed as sampling error or error variance.
• Sampling errors are those errors which arise on
account of sampling and they generally happen to be
random variations (in case of random sampling) in
the sample estimates around the true population
values.
Sampling Definitions…
Sampling error…
• The magnitude of the sampling error depends upon
the nature of the universe; the more homogeneous
the universe, the smaller the sampling error.
• Sampling error is inversely related to the size of the
sample i.e., sampling error decreases as the sample
size increases and vice-versa.
• Sampling error is usually worked out as the product
of the critical value at a certain level of significance
and the standard error.
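A minimal sketch of that product, assuming a 95% confidence level (critical value z = 1.96) and hypothetical population figures.

```python
import math

sigma_p = 15.0  # population standard deviation (assumed known)
n = 100         # sample size
z = 1.96        # critical value at the 5% level of significance

standard_error = sigma_p / math.sqrt(n)   # 1.5
sampling_error = z * standard_error       # 2.94
print(sampling_error)
# Quadrupling n to 400 halves the standard error, illustrating the
# inverse relationship between sampling error and sample size.
```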
Sampling Definitions…
6. Precision
• Precision is the range within which the population
average (or other parameter) will lie in accordance
with the reliability specified in the confidence level
as a percentage of the estimate ± or as a numerical
quantity.
• For instance, if the estimate is Khs 4000 and the
precision desired is ± 4%, then the true value will
be no less than Khs 3840 and no more than Khs
4160.
Sampling Definitions…
7. Confidence level and significance level
• The confidence level or reliability is the expected
percentage of times that the actual value will fall within
the stated precision limits.
• Thus, if we take a confidence level of 95%, then we
mean that there are 95 chances in 100 (or .95 in 1) that
the sample results represent the true condition of the
population within a specified precision range against 5
chances in 100 (or .05 in 1) that it does not.
Sampling Definitions…
• Precision is the range within which the answer may vary and
still be acceptable; confidence level indicates the likelihood
that the answer will fall within that range, and the
significance level indicates the likelihood that the answer
will fall outside that range.
• Always remember that if the confidence level is 95%, then
the significance level will be (100 – 95) i.e., 5%; if the
confidence level is 99%, the significance level is (100 – 99)
i.e., 1%, and so on.
• Also remember that the area of the normal curve within the
precision limits for the specified confidence level constitutes
the acceptance region, and the area of the curve outside
these limits in either direction constitutes the rejection
region.
Sampling Definitions…
8. Sampling distribution
• We are often concerned with sampling distribution in
sampling analysis.
• If we take a certain number of samples and for each
sample compute various statistical measures such as
mean, standard deviation, etc., we find that
each sample may give its own value for the statistic
under consideration.
• All such values of a particular statistic, say mean,
together with their relative frequencies will constitute
the sampling distribution of the particular statistic, say
mean.
Sampling Definitions…
• We can have sampling distribution of mean, or the
sampling distribution of standard deviation or the
sampling distribution of any other statistical measure.
• It may be noted that each item in a sampling
distribution is a particular statistic of a sample. The
sampling distribution tends quite closer to the normal
distribution if the number of samples is large.
• The significance of sampling distribution follows from
the fact that the mean of a sampling distribution is the
same as the mean of the population. Thus, the mean of
the sampling distribution can be taken as the mean of
the population.
IMPORTANT SAMPLING
DISTRIBUTIONS
• Some important sampling distributions commonly
used are:
1) Sampling distribution of mean;
2) Sampling distribution of proportion;
3) Student’s ‘t’ distribution;
4) F distribution; and
5) Chi-square distribution.
1. Sampling distribution of mean
• It refers to the probability distribution of all the
possible means of random samples of a given size that
we take from a population.
• If samples are taken from a normal population, N(µ, σp),
the sampling distribution of mean would also be normal,
with mean µx̄ = µ and standard deviation σx̄ = σp/√n,
where µ is the mean of the population, σp is the
standard deviation of the population and n is the number
of items in a sample.
Sampling distribution of mean…
• In case we want to reduce the sampling distribution
of mean to the unit normal distribution, i.e., N(0, 1), we
can write the normal variate z = (x̄ − µ) / (σp/√n) for the
sampling distribution of mean.
• This characteristic of the sampling distribution of
mean is very useful in several decision situations for
accepting or rejecting hypotheses.
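A sketch of the computation, with the population mean, σp and sample figures assumed for illustration.

```python
import math

def z_for_sample_mean(x_bar, mu, sigma_p, n):
    """Normal variate z = (x_bar - mu) / (sigma_p / sqrt(n))."""
    return (x_bar - mu) / (sigma_p / math.sqrt(n))

# Hypothetical figures: mu = 50, sigma_p = 10, sample of 64 with mean 52.5
print(z_for_sample_mean(52.5, 50, 10, 64))  # 2.0
```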
2. Sampling distribution of proportion
• Like sampling distribution of mean, we can as well have a
sampling distribution of proportion. This happens in case
of statistics of attributes.
• If we work out the proportion of defective parts in a
large number of samples, each with, say, 100 items,
taken from an infinite population, and plot a
probability distribution of the said proportions, we
obtain what is known as the sampling distribution of
proportions.
• Usually the statistics of attributes correspond to the
conditions of a binomial distribution, which tends to the
normal distribution as n becomes larger and larger.
Sampling distribution of proportion…
• If p represents the proportion of defectives (successes)
and q the proportion of non-defectives (failures), i.e.,
q = 1 − p, and if p is treated as a random variable, then the
sampling distribution of the proportion of successes has
a mean p with standard deviation √(p·q/n), where n is the
sample size.
• Presuming that the binomial distribution approximates the
normal distribution for large n, the normal variate of
the sampling distribution of proportion,
z = (p̂ − p) / √(p·q/n),
where p̂ is the sample proportion of successes, can be
used for testing of hypotheses.
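A sketch of the proportion variate, with the claimed defective rate and sample outcome assumed for illustration.

```python
import math

def z_for_proportion(p_hat, p, n):
    """Normal variate z = (p_hat - p) / sqrt(p * q / n), with q = 1 - p."""
    q = 1 - p
    return (p_hat - p) / math.sqrt(p * q / n)

# Hypothetical figures: claimed defective rate 10%, 16 defectives in 100 items
print(z_for_proportion(0.16, 0.10, 100))  # 2.0
```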
3. Student’s t-distribution
• When the population standard deviation (σp) is not known
and the sample is of a small size (n ≤ 30), we use the t-
distribution for the sampling distribution of mean and
work out the t variable as:
t = (x̄ − µ) / (σs/√n)
where σs = √( Σ(Xi − X̄)² / (n − 1) ) is the sample standard
deviation.
• The t-distribution is also symmetrical and is very close to
the distribution of the standard normal variate, z, except
for small values of n.
Student’s t-distribution…
• The variable t differs from z in the sense that we use
sample standard deviation (σs) in the calculation of t,
whereas we use standard deviation of population (σp) in
the calculation of z.
• There is a different t distribution for every possible
sample size (for different degrees of freedom). The
degrees of freedom for a sample of size n is n–1. As the
sample size gets larger, the shape of the t distribution
becomes approximately equal to the normal distribution.
• In fact for sample sizes of more than 30, the t distribution
is so close to the normal distribution that we can use the
normal to approximate the t-distribution.
Student’s t-distribution…
• The t-distribution tables are available which give
the critical values of t for different degrees of
freedom at various levels of significance.
• The table value of t for given degrees of freedom at
a certain level of significance is compared with the
calculated value of t from the sample data, and if the
calculated value is either equal to or exceeds the
table value, we infer that the null hypothesis cannot
be accepted.
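A sketch of that comparison for a hypothetical small sample (n = 8) against a hypothesized mean of 50; SciPy is assumed to be available, standing in for the printed t tables.

```python
import math
from scipy import stats  # assumed available, replaces the printed t tables

sample = [48, 52, 50, 47, 53, 51, 49, 54]  # hypothetical observations
mu_0 = 50                                  # hypothesized population mean
n = len(sample)
x_bar = sum(sample) / n
s = math.sqrt(sum((x - x_bar) ** 2 for x in sample) / (n - 1))

t_calc = (x_bar - mu_0) / (s / math.sqrt(n))
t_table = stats.t.ppf(0.975, df=n - 1)      # two-tailed, 5% level, 7 d.f.
print(round(t_calc, 3), round(t_table, 3))  # 0.577 vs 2.365
# Since |t_calc| < t_table, the null hypothesis is not rejected here.
```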
4. F distribution
• If (σs1)2 and (σs2)2 are the variances of two
independent samples of size n1 and n2 respectively
taken from two independent normal populations,
having the same variance, (σp1)2 =(σp2)2 , the ratio
F = (σs1)2 /(σs2)2 , where ( )   ( X  X ) / n 1
s1
2
1i 1
2
1

And ( s 2 ) 2  ( X 2i  X 2 ) 2 / n2  1
has an F distribution with n1-1 and n2-1 degrees of
freedom.
F distribution…
• F ratio is computed in a way that the larger variance is
always in the numerator.
• Tables have been prepared for F distribution that give
critical values of F for various values of degrees of
freedom for larger as well as smaller variances.
• The calculated value of F from the sample data is
compared with the corresponding table value of F and if
the former is equal to or exceeds the latter, then we
infer that the null hypothesis of the variances being
equal cannot be accepted.
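A sketch of the variance-ratio comparison with hypothetical sample variances; SciPy is assumed to be available, standing in for the printed F tables.

```python
from scipy import stats  # assumed available, replaces the printed F tables

s1_sq, n1 = 25.0, 11  # hypothetical larger sample variance and its size
s2_sq, n2 = 9.0, 9    # hypothetical smaller sample variance and its size

f_calc = s1_sq / s2_sq                               # larger variance on top
f_table = stats.f.ppf(0.95, dfn=n1 - 1, dfd=n2 - 1)  # 5% level, (10, 8) d.f.
print(round(f_calc, 2), round(f_table, 2))           # 2.78 vs about 3.35
# Since f_calc < f_table, equal variances cannot be rejected here.
```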
5. Chi-square (χ2) distribution
• Chi-square distribution is encountered when we
deal with collections of values that involve adding up
squares. Variances of samples require us to add a
collection of squared quantities and thus have
distributions that are related to chi-square
distribution.
• If we take each one of a collection of sample
variances, divide them by the known population
variance and multiply these quotients by (n – 1),
where n means the number of items in the sample,
we shall obtain a chi-square distribution.
Chi-square (χ2) distribution...
• Thus, (σs²/σp²)(n − 1) would have the same
distribution as the chi-square distribution with (n − 1)
degrees of freedom.
• Chi-square distribution is not symmetrical and all
the values are positive. One must know the degrees
of freedom for using chi-square distribution.
• This distribution may also be used for judging the
significance of difference between observed and
expected frequencies and also as a test of goodness
of fit.
Chi-square (χ2) distribution...
• The generalised shape of the χ² distribution depends
upon the d.f. and the χ² value is worked out as under:
χ² = Σ (Oi − Ei)² / Ei, summed over i = 1 to k
• Tables are available that give the value of χ² for given
d.f., which may be compared with the calculated value of χ²
for the relevant d.f. at the desired level of significance for
testing hypotheses.
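A sketch of the goodness-of-fit computation on hypothetical observed and expected frequencies; SciPy is assumed to be available, standing in for the printed χ² tables.

```python
from scipy import stats  # assumed available, replaces the printed chi-square tables

observed = [18, 22, 27, 33]  # hypothetical observed frequencies
expected = [25, 25, 25, 25]  # expected frequencies under the null hypothesis

chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))  # 5.04
critical = stats.chi2.ppf(0.95, df=len(observed) - 1)               # about 7.81
print(round(chi_sq, 2), round(critical, 2))
# Since chi_sq < critical, the fit is judged acceptable at the 5% level.
```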
