DATA PROCESSING
Data remain in raw form unless and until they are processed and analysed.
Processing is the operation by which the collected data are so organised that further
analysis and interpretation of the data become easy. It is an intermediary stage between the
collection of data and their analysis and interpretation.
PROCESSING STAGES
1. Editing
Editing is the process of examining the data collected in questionnaires or schedules to
detect errors and omissions and to correct them as far as possible for further analysis.
It involves a careful scrutiny of the completed questionnaires and schedules.
With regard to the points or stages at which editing should be done, one can talk of field
editing and central editing.
Field editing
This sort of editing should be done as soon as possible after the interview, preferably on
the very day or on the next day.
Central editing
Central editing should take place when all forms or schedules have been completed and
returned to the office. This type of editing implies that all forms should get a thorough
editing by a single editor in a small study and by a team of editors in case of a large
inquiry.
2. Coding
Coding refers to the process of assigning numerals or other symbols to answers so that
responses can be put into a limited number of categories or classes. Such classes should
be appropriate to the research problem under consideration.
In other words, coding involves two main operations: (1) deciding the categories to be
used and (2) allocating individual answers to them.
Coding eliminates much of the information in the raw data: several replies are
reduced to a small number of classes which contain the critical information required for
analysis.
Steps in coding
1. Study the answers carefully.
2. Develop a coding frame by listing the answers and by assigning a code to each of them.
3. Prepare a coding manual with the detail of variable names, codes and instructions.
4. If the coding manual has already been prepared before the collection of the data,
make the required changes in it.
Coding rules
5. Give each respondent a code number for identification.
6. Provide a code number for each question.
7. All responses, including 'don't know', 'no opinion', etc., are to be coded.
8. Assign additional codes to partially coded questions.
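For illustration, the following is a minimal Python sketch of how a coding frame might be applied; the variables, code values and respondent data are hypothetical, not taken from the text.

```python
# Hypothetical coding frame: map raw questionnaire answers to numeric codes.
coding_frame = {
    "gender": {"male": 1, "female": 2},
    "opinion": {"agree": 1, "neutral": 2, "disagree": 3,
                "don't know": 8, "no opinion": 9},  # special responses are also coded
}

def code_response(variable, answer):
    """Return the numeric code for a raw answer, or None if it is not in the frame."""
    return coding_frame[variable].get(answer.strip().lower())

# Each respondent also receives a code number for identification.
respondent = {"id": 101, "gender": "Female", "opinion": "don't know"}
coded = {"id": respondent["id"],
         "gender": code_response("gender", respondent["gender"]),
         "opinion": code_response("opinion", respondent["opinion"])}
print(coded)   # {'id': 101, 'gender': 2, 'opinion': 8}
```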
3. Classification
Classification of data is the process of grouping related facts into classes on the basis of
certain common characteristics. Data having common characteristics are placed in one
class. Classification divides the data into different groups or classes according to their
similarities and dissimilarities.
Objectives
To bring out meaningful relationships
To organize complex data
To facilitate comparison
To identify the most significant features of the data at a glance.
Types of classification
A. Quantitative classification
Quantitative classification refers to the classification of data according to some characteristic
that can be measured, such as height, weight, etc.
B. Qualitative classification
In qualitative classification, data are classified on the basis of some attribute or quality
such as sex, literacy, religion, etc.
C. Geographical classification
Data are classified on the basis of geographical or locational differences between various
items.
D. Chronological classification
If the data are observed over a period of time, it is better to classify the data on the basis
of chronology. Such a classification is known as chronological classification.
Classification according to class interval
Data relating to income, production, age, height, weight, etc. are classified on the basis of
class intervals. Such data are known as statistical variables. For example, data relating to
income between 101 and 200 can form one class and 201 to 300 another class.
Each class interval has an upper limit and a lower limit; the difference between the upper
class limit and the lower class limit is termed the class magnitude.
1. Exclusive type of class interval
In the exclusive type of class interval, items whose values are equal to the upper limit of a
class are grouped in the next higher class. For example:
100-200
200-300
300-400
400-500
In this case, an item whose value is exactly 200 is included in the 200-300 class and not in
the 100-200 class.
2. Inclusive type of class intervals
In this case the upper limit of the class interval is included in the same class interval. For example:
101-200
201-300
301-400
401-500
An item whose value is 200 is included in the 101-200 class.
Size of the class interval
The size of the class interval is determined by the number of classes and the total range
in the data.
C = (Highest value – Lowest value) / (1 + 3.322 log N)
where N is the total number of observations.
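A short sketch of this calculation in Python, using the rule quoted above to fix the number of classes and then the class size; the observations are hypothetical.

```python
import math

# Illustrative observations (hypothetical values).
data = [110, 145, 160, 172, 188, 201, 215, 230, 248, 260, 275, 290]

n = len(data)
value_range = max(data) - min(data)

# Number of classes by the rule quoted above: 1 + 3.322 * log10(N)
num_classes = 1 + 3.322 * math.log10(n)

# Size of the class interval C = range / number of classes
class_size = value_range / num_classes
print(round(num_classes), round(class_size))   # about 5 classes of width about 39
```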
Frequency
The number of observations corresponding to a particular class is termed its frequency.
For example, if the lower limit of a class is 100 and its upper limit is 150, the number of
observations whose values fall between these two limits is the frequency of that class. The
frequency of each class can be determined by using a tally sheet.
4. Tabulation
Tabulation is the next step after classification. It is an orderly arrangement of data in rows
and columns. It may be defined as the "arrangement of data in columns and rows". Data
presented in tabular form are much easier to read and understand than data presented
in the text. The main purpose of tabulation is to prepare the data for final analysis. It is a
stage between the classification of data and the final analysis.
Objectives of Tabulation
1. To clarify the purpose of enquiry
2. To make the significance of data clear.
3. To express the data in least possible space.
4. To enable comparative study.
5. To eliminate unnecessary data
6. To help in further analysis of the data.
TYPES OF TABULATION
Simple tabulation.
In a simple table only one characteristic is shown. Hence this type of table is known as a one-way table.
Complex table.
In a complex table two or more characteristics are shown. It is also called a two-way table.
PRINCIPLES OF TABULATION
• The table should suit the size of the paper
• The table should have a clear, concise and adequate title
• Every table should have a distinct number for easy reference.
• The captions and stubs should be arranged in a systematic manner in all types of tables
• The unit of measurement should be clearly defined and given in the table.
• Figures should be rounded off to avoid unnecessary details
• Explanatory foot notes if any should be given as foot notes along with reference
• Sources from which data in the table is obtained are to be indicated just below the table.
• The columns drawn are to be separated by lines which make the table more attractive and legible.
• The columns may be numbered for reference.
• Table should be clear, accurate and simple
• If figures are to be repeated, they should be shown each time.
DATA ENTRY
Data entry converts information gathered by primary or secondary methods into a medium for
viewing and manipulation, and the researcher can store the data on such a medium. The entire
data set can be entered into the computer for analysis with statistical packages like SPSS (Statistical
Package for the Social Sciences). The information from the questionnaires can be entered into the
computer using SPSS for Windows and subsequently stored, for example on a CD, as an SPSS data file.
DESCRIPTIVE ANALYSIS UNDER DIFFERENT TYPES OF MEASUREMENTS
There are four levels of measurement. They are nominal measurement, ordinal
measurement, interval measurement and ratio measurement.
1. Nominal measurement
At the nominal level of measurement, numbers or other symbols are assigned to a set of
categories for the purpose of naming, labeling, or classifying the observations. Gender is
an example of a nominal level variable. The only comparisons that can be made between
variable values are equality and inequality.
2. Ordinal measurement.
This level of measurement enables one to make ordinal judgements. The numbers have all
the features of nominal measures and also represent the rank order (1st, 2nd, 3rd, etc.) of
the entities measured; the numbers are ordinals. Whenever we assign numbers to rank-
ordered categories ranging from low to high, we have an ordinal level variable.
3. Interval measurement.
This scale or level of measurement has the characteristics of rank order and equal
intervals (i.e., the distance between adjacent points is the same). It does not possess an
absolute zero point. The numbers have all the features of ordinal measurement and also
are separated by the same interval. In this case, differences between arbitrary pairs of
numbers can be meaningfully compared
4. Ratio measurement.
Ratio scales are like interval scales except they have true zero points. This is the highest
level of measurement and is used for quantitative data. The numbers have all the
features of interval measurement and also have meaningful ratios between arbitrary
pairs of numbers.
COMMON DESCRIPTIVE TECHNIQUES
The most common descriptive statistics used in research consist of percentages,
measures of central tendency, index numbers, etc. There are several techniques for
displaying data; one among them is the frequency table.
PERCENTAGES
A percentage is a way of expressing a number as a fraction of 100 (per cent meaning "per
hundred" in Latin). It is often indicated by the percent sign, "%". Percentages are
used to express how large or small one quantity is relative to another quantity.
FREQUENCY TABLE
A frequency table is a simple device for arraying data. In statistics, a frequency
distribution is a tabulation of the values that one or more variables take in a sample.
Each entry in the table contains the frequency or count of the occurrences of values
within a particular group or interval, and in this way the table summarizes the
distribution of values in the sample.
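As a minimal sketch, a frequency table can be built in Python from raw observations; the responses below are hypothetical.

```python
from collections import Counter

# Hypothetical responses from a sample of 12 respondents.
responses = ["yes", "no", "yes", "yes", "no", "undecided",
             "yes", "no", "yes", "undecided", "yes", "no"]

frequency = Counter(responses)          # count of each distinct value
total = sum(frequency.values())

print(f"{'Value':<12}{'Frequency':>10}{'Percent':>10}")
for value, count in frequency.most_common():
    print(f"{value:<12}{count:>10}{100 * count / total:>9.1f}%")
```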
Contingency table
Contingency table is a format for presenting the relationships among variables as
percentage distributions. It is essentially a display format used to analyze and record the
relationship between two or more categorical variables. It is the categorical equivalent of
the scatter plot used to analyze the relationship between two continuous variables. The
term contingency table was first used by the statistician Karl Pearson in 1904.
Contingency tables normally have as many rows and columns as there are categories of the variables being cross-tabulated.
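A small sketch of a two-variable contingency (cross-tabulation) table built with plain Python; the gender and opinion observations are hypothetical.

```python
from collections import Counter

# Hypothetical paired observations (gender, opinion).
observations = [("male", "agree"), ("female", "agree"), ("female", "disagree"),
                ("male", "disagree"), ("female", "agree"), ("male", "agree"),
                ("female", "disagree"), ("male", "agree")]

counts = Counter(observations)                      # joint frequencies
rows = sorted({g for g, _ in observations})
cols = sorted({o for _, o in observations})

# One row per category of the first variable, one column per category of the second.
print(f"{'':<10}" + "".join(f"{c:>10}" for c in cols))
for r in rows:
    print(f"{r:<10}" + "".join(f"{counts.get((r, c), 0):>10}" for c in cols))
```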
GRAPHICAL REPRESENTATION OF DATA
Graphical presentation involves the use of graphs, charts and other pictorial devices. These
forms are very helpful in condensing a large quantity of statistical data into a form that can be
quickly understood.
Rules for Graphical Representation
1. The chart must have a title. The title of the chart should be placed above the chart.
2. The title should be clear, brief, and simple and should portray the nature of data
presented.
3. The numerical data on which the graphic is based should also be provided as a supplementary table.
4. The scale intervals should be equal.
5. Each curve or bar on the chart should be labelled.
TYPES OF GRAPHS
1. Charts or line graphs
2. Bar charts
3. Circle charts or pie diagrams
4. Pictograms
1. Line Graphs
A line graph displays information as a series of data points, each of which represents an
individual measurement or piece of data. The points are then connected by a line to show a
visual trend in the data over a period of time; the line connects each piece of data
chronologically. For example, a line graph could show the birth rate per thousand of
six countries over a period of time.
2.Bar Charts
The bar graph is a common type of graph which consists of parallel bars or rectangles
with lengths that are equal to the quantities that occur in a given data set. The bars can
be presented vertically or horizontally to show the contrast and record information. Bar
graphs are used for plotting discontinuous (discrete) data; discrete data take separate,
distinct values and are not continuous.
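For illustration, a brief sketch of a vertical bar chart with matplotlib; the categories and counts are hypothetical.

```python
import matplotlib.pyplot as plt

# Hypothetical discrete data: number of students in each faculty.
faculties = ["Arts", "Science", "Commerce", "Law"]
students = [120, 95, 140, 60]

plt.bar(faculties, students, color="steelblue")   # one vertical bar per category
plt.title("Students per faculty (illustrative data)")
plt.xlabel("Faculty")
plt.ylabel("Number of students")
plt.show()
```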
Histogram
A histogram is a graph of a frequency distribution. It is a set of vertical bars whose areas are
proportional to the frequencies. While constructing a histogram, the variable is always
taken on the x-axis and the frequencies on the y-axis.
Frequency Polygon
The frequency polygon is a graph of a frequency distribution. Here we draw the histogram of
the data and then join by straight lines the midpoints of the upper horizontal sides of the
bars. Both ends of the frequency polygon are joined to the x-axis.
Frequency Curves
A continuous frequency distribution can be represented by a smoothed curve known as a
frequency curve.
Ogive or Cumulative Frequency Curve
A frequency distribution can be cumulated in two ways, less than cumulative series and
more than cumulative series. Smoothed frequency curves drawn for these two
cumulative series are called cumulative frequency curves or ogives.
The Lorenz Curve
It is a graphical representation of the proportionality of a distribution (the cumulative
percentage of the values). In other words the Lorenz curve is a graphical representation
of the cumulative distribution function of a probability distribution.
3. Circle Charts or Pie Diagram
A pie graph is a circle divided into sectors, each of which displays the size of a relative piece
of information; together the sectors form a whole. In a pie graph, the angle (and hence the
area) of each sector is proportional to the percentage it represents. Pie graphs work
particularly well when each slice of the pie represents roughly 25 to 50 percent of the given data.
4.Pictograms
A pictogram, also called a pictograph, is an ideogram that conveys its
meaning through its pictorial resemblance to a physical object. Pictographs are often
used in writing and graphic systems in which the characters are to a considerable extent
pictorial in appearance.
ANALYSIS AND INTERPRETATION OF DATA
Analysis of data means studying the tabulated material in order to determine
inherent facts or meanings. Analysis can be classified into two,
1. Descriptive analysis.
2. Inferential analysis.
DESCRIPTIVE ANALYSIS
This type of analysis describes the nature of the object or phenomenon
under study. In descriptive analysis we try to work out various measures
that illustrate the size and shape of a distribution, along with measures of
the relationship between two or more variables. It includes the following
types of analysis.
1. UNIDIMENSIONAL ANALYSIS
If the study relates to one variable, it is called unidimensional analysis. It includes averages,
dispersion, skewness and kurtosis. The measures of dispersion include the range, the
standard deviation and the coefficient of variation.
2. BIVARIATE ANALYSIS
If the study is related with two variables it is called bivariate analysis.
A. Correlation analysis
Correlation analysis is a mathematical tool used to describe the degree to which
one variable is linearly related to another. Correlation measures the degree of
relationship between the variables under study.
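A minimal sketch of Pearson's correlation coefficient, the usual measure of the degree of linear relationship; the paired values are hypothetical.

```python
import math

# Hypothetical paired observations of two variables.
x = [2, 4, 6, 8, 10]
y = [3, 7, 5, 10, 12]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

# Pearson's r = sum of products of deviations / (sqrt of sum of squared deviations of x and of y)
cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
r = cov / (sd_x * sd_y)
print(round(r, 3))   # a value near +1 indicates a strong positive linear relationship
```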
B. Causal analysis or regression analysis
It is concerned with the study of how one or more variables affect changes in another
variable. This analysis is more important in experimental research.
3. MULTIVARIATE ANALYSIS
When the study is related with more than two variables it is termed as multivariate
analysis. The following analyses are included in multivariate analysis.
A. Multiple regression analysis
This analysis is suitable when the researcher has one dependent variable which is assumed
to be a function of two or more independent variables.
B. Multiple discriminant analysis
This analysis is adopted when the researcher has one dependent variable that cannot be
measured, but can be classified into two or more groups on the basis of some attribute.
C. Multivariate analysis of variance
This analysis is mainly used by a researcher to test the hypothesis related with
multivariate differences in group responses to experimental manipulations.
4. Factor Analysis
Factor analysis is a technique used to reduce a large number of variables to a smaller
number of factors. This technique extracts the maximum common variance from all
variables and puts it into a common score. As an index of all the variables, this score can
be used for further analysis.
5. Canonical Analysis
It studies the relationship between two sets of data, with several independent and several
dependent variables; it is a form of regression analysis applied to such data.
INFERENTIAL ANALYSIS
It is concerned with testing of hypothesis and significance. It is also related with
estimation of unknown population parameters.
TOOLS FOR STATISTICAL ANALYSIS
1. Measures of the central tendency or averages
2. Measures of the dispersion
3. Other measures
MEASURES OF THE CENTRAL TENDENCY OR AVERAGES
It is the average value of the entire data. In other words, it is a single value that
describes the characteristics of the entire group.
Mean
The mean is also known as the arithmetic mean. It is the most common measure of central tendency.
Its value is obtained by adding all the items and dividing the sum by the total number of items:
Mean = ΣX / N
where ΣX is the sum of all the values and N is the number of items.
Median
The median is the middle value in the distribution when the items are arranged in ascending
or descending order. It is a positional average. It may be defined as the value of the item
which divides the series into two equal parts, one half containing values greater than it and the
other half containing values less than it:
Median = value of the (N + 1)/2 th item
Mode
The mode is the value of the item in a series which occurs most frequently. According to
Kenny, "the value of the variable which occurs most frequently in a distribution is called the
mode". In the case of an individual series, the value which occurs the greatest number of times is the
mode.
For example, a set of students of a class report the following numbers of video movies
they see in a month: 10, 15, 20, 15, 15, 8. Most students see 15 movies
in a month, therefore mode = 15.
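As a quick sketch, Python's statistics module can compute the three averages described above, using the movie example from the text.

```python
import statistics

movies = [10, 15, 20, 15, 15, 8]   # number of movies seen per month (example from the text)

print(statistics.mean(movies))     # arithmetic mean = 13.83...
print(statistics.median(movies))   # middle value of the ordered series = 15.0
print(statistics.mode(movies))     # most frequently occurring value = 15
```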
Geometric mean
It is defined as the nth root of the product of the n values in a given series. Symbolically:
Geometric mean (G.M.) = n√(X1 × X2 × X3 × … × Xn)
where n = number of items and X1, X2, … are the individual values.
Harmonic Mean
It is calculated by dividing the number of observations by the sum of the reciprocals of the
values in the series. Thus, the harmonic mean is the reciprocal of the arithmetic mean of the
reciprocals.
Weighted arithmetic mean
In the arithmetic mean, equal importance is given to all items in the series. But there are cases
where the relative importance of the different items is not the same. In such cases
we compute the weighted arithmetic mean, giving each item a weight in proportion to its importance.
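A short sketch covering the geometric mean, harmonic mean and weighted arithmetic mean described in the preceding subsections; the values and weights are hypothetical.

```python
import statistics

values = [4, 8, 16]

print(statistics.geometric_mean(values))   # cube root of 4*8*16 = approx. 8.0
print(statistics.harmonic_mean(values))    # 3 / (1/4 + 1/8 + 1/16) = approx. 6.857

# Weighted arithmetic mean: each item multiplied by its (hypothetical) weight.
weights = [1, 2, 3]
weighted_mean = sum(v * w for v, w in zip(values, weights)) / sum(weights)
print(weighted_mean)                       # (4*1 + 8*2 + 16*3) / 6 = 11.33...
```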
MEASURES OF DISPERSION
The measures of central tendency indicate the central tendency of a frequency
distribution in the form of an average. The degree to which numerical data tend to spread
about an average value is called the variation or dispersion of data.
1. Range
It is the simplest measure of dispersion. It is defined as the difference between the
highest and the lowest value in a series.
2. Mean deviation
It is the arithmetic mean of the deviations of the items of a series from a measure of central
tendency; that is, it is the average of the absolute differences of the values from an average.
3. Standard deviation
It is defined as the positive square root of the arithmetic mean of the squares of the
deviations of the given observations from their arithmetic mean.
Coefficient of standard deviation
If we divide the standard deviation by the arithmetic average of the series, we get the coefficient of
standard deviation.
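A minimal sketch of the dispersion measures listed above (range, mean deviation, standard deviation and coefficient of standard deviation), using hypothetical data.

```python
import statistics

data = [12, 15, 20, 25, 28]

value_range = max(data) - min(data)                             # range
mean = statistics.mean(data)
mean_deviation = sum(abs(x - mean) for x in data) / len(data)   # mean deviation about the mean
std_dev = statistics.pstdev(data)                               # population standard deviation
coeff_sd = std_dev / mean                                       # coefficient of standard deviation

print(value_range, round(mean_deviation, 2), round(std_dev, 2), round(coeff_sd, 3))
```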
REGRESSION ANALYSIS
Regression analysis is a statistical process for estimating the relationships among variables. It includes
many techniques for modelling and analysing several variables, when the focus is on the relationship
between a dependent variable and one or more independent variables.
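For illustration, a brief sketch of fitting a simple linear regression (one dependent and one independent variable) by least squares; the observations are hypothetical.

```python
import statistics

# Hypothetical observations: x = independent variable, y = dependent variable.
x = [1, 2, 3, 4, 5]
y = [2.1, 4.0, 6.2, 7.9, 10.1]

# Least-squares slope b = sum of products of deviations / sum of squared deviations of x;
# intercept a = mean(y) - b * mean(x)
mx, my = statistics.mean(x), statistics.mean(y)
b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum((xi - mx) ** 2 for xi in x)
a = my - b * mx

print(f"y = {a:.2f} + {b:.2f} * x")   # estimated regression line
print(a + b * 6)                      # predicted y when x = 6
```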
INDEX NUMBERS
Index numbers are designed to measure the magnitude of economic changes over time. An index number
is a statistic which assigns a single number to several individual statistics in order to quantify trends.
Index numbers are indicators of various trends in an economy; an index number is a specialised average.
Simple and weighted index numbers
Simple index numbers are those in the calculation of which all the items are treated as equally important.
Here items are not given any weight. Weighted index numbers are those in the calculation of which each
item is assigned a particular weight.
Price Index Numbers
Price index numbers measure changes in the price of a commodity for a given period in comparison with
another period.
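A small sketch of a simple (unweighted) and a weighted price index; the commodities, prices and weights are hypothetical.

```python
# Hypothetical prices in the base period (p0) and current period (p1), with quantity weights.
items = [
    # (item, p0, p1, weight)
    ("rice",  40, 48, 5),
    ("wheat", 30, 33, 3),
    ("sugar", 50, 60, 2),
]

# Simple aggregate price index: all items treated as equally important.
simple_index = 100 * sum(p1 for _, _, p1, _ in items) / sum(p0 for _, p0, _, _ in items)

# Weighted price index: each item's prices multiplied by its weight.
weighted_index = (100 * sum(p1 * w for _, _, p1, w in items)
                  / sum(p0 * w for _, p0, _, w in items))

print(round(simple_index, 1), round(weighted_index, 1))   # e.g. 117.5 and 117.7
```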
INFERENTIAL ANALYSIS
Statistical inference is the process by which we draw a conclusion about some measure of the
population on the basis of a sample value. Statistical inference can be classified into two:
testing of hypotheses and estimation of population parameters.
Parameters and Statistics
Parameters are numbers that summarize data for an entire population. Statistics are
numbers that summarize data from a sample, i.e. some subset of the entire population.
Testing of hypothesis
A hypothesis is a tentative conclusion logically drawn concerning some parameter of the
population. It is an assumption, or a set of assumptions, to be proved or disproved.
Features of hypothesis.
It should be clear and precise
It should be capable of being tested.
It should be stated in simple terms
It should state the relationship between variables.
Basic concepts in the context of testing of hypotheses need to be explained.
Null hypothesis and alternative hypothesis :
In the context of statistical analysis, we often talk about the null hypothesis and the alternative
hypothesis. If we are to compare method A with method B regarding its superiority, and if we
proceed on the assumption that both methods are equally good, then this assumption is
termed the null hypothesis. As against this, if we think that method A is superior
or that method B is inferior, we are then stating what is termed the alternative hypothesis.
The null hypothesis is generally symbolised as H0 and the alternative hypothesis as H1.
The level of significance :
This is a very important concept in the context of hypothesis testing. It is always some
percentage (usually 5%) which should be chosen with great care, thought and reason. The
5 per cent level of significance means that the researcher is willing to take as much as a 5 per
cent risk of rejecting the null hypothesis when it (H0) happens to be true. Thus the
significance level is the maximum value of the probability of rejecting H0 when it is true,
and it is usually determined in advance, before testing the hypothesis.
Type I and Type II errors :
In the context of testing of hypotheses, there are basically two types of errors we can
make. We may reject H0 when H0 is true; this is called a Type I error. We may accept
H0 when in fact H0 is not true; this is known as a Type II error.
Computation of the standard error :
The standard deviation of the sampling distribution of a statistic is known as its
standard error.
STEPS IN TESTING HYPOTHESIS
1. Statement of problem.
2. Set up hypothesis
Generally we formulate a null hypothesis for the study. The common way of stating the
hypothesis is that there is no difference between the two values, for example when a population
mean is compared with a sample mean.
3. Selection of test statistics.
Acceptance or rejection of the null hypothesis is done on the basis of a statistic computed from
the sample. Such a statistic is called the test statistic and is based on an appropriate probability
distribution. Commonly used tests are the Z test, t test, F test and chi-square test.
4. Select level of significance
5. Calculate the value of the test statistic.
6. Obtain the table value
7. Make a decision to accept or reject hypothesis.
Test procedure to be adopted to test the significance
1. Set up a null hypothesis that there is no significant difference between sample mean
and population mean.
2. Decide the test criterion, t or Z. When the sample is large or the standard deviation of the
population is known, apply the Z test; otherwise use the t test.
3. Calculate the test statistic by applying the following formulae:
SE = σ/√n (when the population standard deviation σ is known)
SE = s/√n (when the population standard deviation is not known and the sample size is large, s being the sample standard deviation)
Z (or t) = (sample mean – population mean) / SE
4. Decide the level of significance, at the 5% or 1% level, and the degrees of freedom. The
degrees of freedom are n − 1 for small samples and may be taken as infinity for large
samples.
5. Find out table value
6. Make a decision either to accept or to reject the null hypothesis. If the calculated value of Z
or t is numerically less than the table value, it falls in the acceptance region and we
accept the null hypothesis; otherwise we reject it.
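As a compact sketch of this procedure for comparing a sample mean with a population mean, the following uses hypothetical figures and a Z test (large sample, population standard deviation assumed known).

```python
import math

# Hypothetical figures: population mean and SD, sample mean and sample size.
pop_mean, pop_sd = 50.0, 8.0
sample_mean, n = 52.5, 100

# Step 3: standard error and test statistic.
se = pop_sd / math.sqrt(n)              # SE = sigma / sqrt(n), sigma known
z = (sample_mean - pop_mean) / se

# Steps 4 to 6: compare with the table value at the 5% level (two-tailed) and decide.
table_value = 1.96
if abs(z) < table_value:
    print(f"Z = {z:.2f}: accept the null hypothesis (no significant difference)")
else:
    print(f"Z = {z:.2f}: reject the null hypothesis")
```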
ANALYSIS OF VARIANCE (ANOVA)
This technique is used when multiple sample cases are involved. As stated earlier, the
significance of the difference between the means of two samples can be judged through
either z-test or the t-test, but the difficulty arises when we happen to examine the
significance of the difference amongst more than two sample means at the same time.
The ANOVA technique enables us to perform this simultaneous test and as such is
considered to be an important tool of analysis in the hands of a researcher. Using this
technique, one can draw inferences about whether the samples have been drawn from
populations having the same mean.
Assumptions of ANOVA:
(i) All populations involved follow a normal distribution.
(ii) All populations have the same variance (or standard deviation).
(iii) The samples are randomly selected and independent of one another.
There are two techniques of analysis of variance: one-way classification and two-way
classification.
One way classification
In one-way classification the data are classified on the basis of one criterion. The following
are the important steps.
1. Calculate variance between samples.
The steps required to calculate the variance between samples are:
1. Calculate the mean of each sample.
2. Calculate the grand average.
3. Take the difference between the means of the various samples and the grand average.
4. Square these deviations and find the sum of squares between the samples (SSC).
5. Divide SSC by its degrees of freedom to get the mean square between samples (MSC).
Degrees of freedom (v): k − 1 between samples and n − k within samples, where k is the number of samples and n is the total number of observations.
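A short sketch of the between-samples steps listed above (sample means, grand average, SSC and MSC), using hypothetical samples.

```python
import statistics

# Hypothetical samples (k = 3 groups).
samples = [
    [10, 12, 14],
    [15, 17, 19],
    [20, 22, 24],
]

k = len(samples)
n = sum(len(s) for s in samples)                 # total number of observations

# Steps 1-2: mean of each sample and the grand average.
sample_means = [statistics.mean(s) for s in samples]
grand_average = statistics.mean(x for s in samples for x in s)

# Steps 3-4: sum of squares between samples (SSC), weighted by sample size.
ssc = sum(len(s) * (m - grand_average) ** 2 for s, m in zip(samples, sample_means))

# Step 5: mean square between samples, with k - 1 degrees of freedom.
msc = ssc / (k - 1)
print(sample_means, round(grand_average, 2), round(ssc, 2), round(msc, 2))
```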
Interpretation of data
Interpretation refers to the technique of drawing inferences from the collected facts and
explaining the significance of those inferences after an analytical and experimental study.
It is a search for the broader and more abstract meaning of the research findings. If the
interpretation is not done very carefully, misleading conclusions may be drawn. The
interpreter must be creative in ideas and should be free from bias and prejudice.
Techniques of interpretation
1. Relationship:
There may be three types of relationship such as symmetrical relationship, reciprocal
relationship and asymmetrical relationship. The data can be interpreted on the basis of
these relationships.
2. Proportion:
It is generally ascertained to determine the nature and form of absolute changes in the
subject of study.
3. Percentages:
It is used to make a comparison between two or more series of data. Percentages are also
used to describe the relationship between variables.
4. Average:
There are three common forms of averages: the mean, the median and the mode. For analysing and
interpreting a long statistical table we use averages or other measures of comparison, which are
considered the most appropriate tools for the purpose of interpretation.
PRE REQUISITES OF INTERPRETATION
1. Adequate data
The data should be large enough and unbiased; only then will the results represent
the whole population. Not all data collected are suitable for statistical study, so the data
collected should be such that they are suitable for statistical treatment.
2. Accurate Data
The collected data should be reliable and accurate. The accuracy of data helps the
researcher to arrive at a true conclusion. If the data collected are not accurate, it is very
difficult to interpret them and reach a true conclusion.
3. Appropriate type of classification and tabulation
If the classification and tabulation are not done properly, they will introduce errors or
lead to wrong conclusions about the study. So the data collected should be
systematically classified and properly tabulated.
4. Requirement of homogeneous data
When we want a uniform and accurate result, the data should be homogeneous.
The use of heterogeneous data will not give the desired result.
5. Data consistency
If the data collected are inconsistent, they will not provide accurate results. For
applying statistical tools, the data should be consistent.
6. Use of statistical tools
If the researcher uses inappropriate statistical tools or makes inadequate or faulty calculations,
the results will be misleading.
CONCLUSIONS AND GENERALIZATIONS
As a last step the researcher arrives at the conclusions of his study. He should review
carefully the evidence for and against each hypothesis while deriving his conclusions.
The generalizations put forward should agree with the facts revealed by the study.
He should also check the generalizations against the facts and experiences of
other researchers.