Quality Assessment Tools in Education
In our previous lessons we have learned that assessment is an integral part of instruction, as it
determines whether or not the goals of education are being met. Assessment affects decisions about
grades, placement, advancement, instructional needs, curriculum, and, in some cases, funding.
Assessments include a variety of methods that allow learners to demonstrate evidence of learning and
can range from observations and student writing samples to performance tasks and large-scale standardized
tests. Thus there is a pressing need to equip education students with the knowledge and skills required to
create and design classroom assessments.
Teachers need reliable assessment data to guide their instruction and make sure that every child meets
rigorous learning standards. But the data they get from a formative or interim assessment is only as good as the
quality of the instrument itself. Without accurate information from an assessment, educators can misdiagnose a
child’s learning needs and spend time focusing on the wrong concepts as a result. Creating high-quality
assessment forms (tests) and items (questions) takes time and expertise, since not all assessment content will
produce valid results.
Objective
An effective assessment is objective and focused on student performance. It should not reflect the
personal opinions, likes, dislikes, or biases of the teacher. Teachers must not permit their judgment of student
performance to be influenced by their personal views of the student, favorable or unfavorable. If an assessment
is to be objective, it must be honest; it must be based on the performance as it was, not as it could have been.
Flexible
The teacher must evaluate the entire performance of a student in the context in which it is accomplished.
Sometimes a good student turns in a poor performance, and a poor student turns in a good one. A friendly student
may suddenly become hostile, or a hostile student may suddenly become friendly and cooperative. The teacher
must fit the tone, technique, and content of the assessment to the occasion, as well as to the student. An
assessment should be designed and executed so that the teacher can allow for variables.
Acceptable
The student must accept the teacher in order to accept his or her assessment willingly. Students must
have confidence in the teacher's qualifications, teaching ability, sincerity, competence, and authority. Usually,
teachers have the opportunity to establish themselves with students before the formal assessment arises. If not,
however, the teacher's manner, attitude, and familiarity with the subject at hand must serve this purpose.
Assessments must be presented fairly, with authority, conviction, sincerity, and from a position of recognizable
competence. Teachers must never rely on their position alone to make an assessment more acceptable to students.
Comprehensive
A comprehensive assessment is not necessarily a long one, nor must it treat every aspect of the
performance in detail. The teacher must decide whether the greater benefit comes from a discussion of a few
major points or a number of minor points. The teacher might assess what most needs improvement, or only what
the student can reasonably be expected to improve. An effective assessment covers strengths as well as
weaknesses. The teacher’s task is to determine how to balance the two.
Constructive
An assessment is pointless unless the student benefits from it. Praise for its own sake is of no value, but
praise can be very effective in reinforcing and capitalizing on things that are done well, in order to inspire the
student to improve in areas of lesser accomplishment. When identifying a mistake or weakness, the teacher must
give positive guidance for correction. Negative comments that do not point toward improvement or a higher level
of performance should be omitted from an assessment altogether.
Organized
An assessment must be organized. Almost any pattern is acceptable, as long as it is logical and makes
sense to the student. An effective organizational pattern might be the sequence of the performance itself.
Sometimes an assessment can profitably begin at the point at which a demonstration failed, and work backward
through the steps that led to the failure. A success can be analyzed in similar fashion. Alternatively, a glaring
deficiency can serve as the core of an assessment. Breaking the whole into parts, or building the parts into a
whole, is another possible organizational approach.
Thoughtful
An effective assessment reflects the teacher’s thoughtfulness toward the student’s need for self-esteem,
recognition, and approval. The teacher must not minimize the inherent dignity and importance of the individual.
Ridicule, anger, or fun at the expense of the student never has a place in assessment. While being straightforward
and honest, the teacher should always respect the student’s personal feelings. For example, the teacher should
try to deliver criticism in private.
Specific
The teacher’s comments and recommendations should be specific. Students cannot act on
recommendations unless they know specifically what the recommendations are. If the teacher has a clear, well-
founded, and supportable idea in mind, it should be expressed with firmness and authority, and in terms that
cannot be misunderstood.
Flexibility - High-quality assessment content should give educators a choice in which approach to take
for different exams, based on their needs and goals. Educators can determine how likely students are to achieve
grade-level standards, and they can even compare a student's achievement to national norms with a percentile
score. Designing lessons, curriculum, and assessments requires flexibility. When we speak of being flexible, we
have to understand that what is applicable to one may not be applicable to others.
Validity - Something valid is something fair. One of the most important characteristics of a quality
assessment tool is validity. Simply put, content validity means that the assessment measures what it is intended
to measure for its intended purpose and nothing more. When data from an assessment that lacks content validity
is used to inform instruction, the results can include wasted time and inappropriate growth expectations for
students. For these reasons, validity is central to a quality educational assessment.
Reliability - Reliability is the consistency of an assessment. It is the degree to which student results are
the same when:
• They take the same test on different occasions
• Different scorers score the same task
• Different but equivalent tests are taken at the same time or at different times.
Reliability is about making sure that different test forms in a single administration are equivalent; that
retests of a given test are equivalent to the original test; and that test difficulty remains constant year to year,
administration to administration.
Variety - Does the assessment content provide a variety of item types, from selected response questions to
technology-enhanced items and performance tasks? Different kinds of assessment questions serve different
purposes. Finding a way to assess students is one of the most important and sometimes difficult things teachers
have to consider in planning lessons and units. This is why it is important for teachers to use a variety of
assessment methods in their classroom. Teachers should also find a way to incorporate different types of
evaluation in their classroom to accommodate students' interests and to be fair to all students.
Insight - Assessment content should include reports that provide actionable data for teachers and
administrators. The whole purpose of assessment is to help educators become more effective by using the
results to target and improve their instruction, and reports that provide key insights make this process easier.
The reports, and the detailed information behind them, can be used to inform classroom instruction, modify
teaching strategies, and assess progress.
Teacher-made tests are normally prepared and administered for testing classroom achievement of students,
evaluating the method of teaching adopted by the teacher, and evaluating other curricular programs of the school.
The teacher-made test is one of the most valuable instruments in the hands of the teacher for serving these
purposes. It is designed to suit the problems or requirements of the class for which it is prepared.
It is prepared to measure the outcomes and content of the local curriculum. It is very flexible, so it can
be adapted to any procedure and material, and it does not require any sophisticated technique for preparation.
Unlike standardized tests, however, teacher-made tests do not have norms, whereas providing norms is quite
essential for standardized tests.
Binomial-choice tests are tests that have only two (2) options, such as true or false, right or wrong, good
or better, and so on. A student who knows nothing of the content of the examination would have a 50% chance
of getting the correct answer by sheer guesswork. Although correction-for-guessing formulas exist (a standard
one is sketched after the rules below), it is best that the teacher ensures that a true-false item is able to
discriminate properly between those who know and those who are just guessing. A modified true-false test can
offset the effect of guessing by requiring students to explain their answer and to disregard a correct answer if
the explanation is incorrect. Here are some rules of thumb in constructing true-false items.
Rule 2. Avoid using the words “always”, “never”, “often" and other adverbs that tend to be either always true or
always false.
Rule 3. Avoid long sentences as these tend to be "True". Keep sentences short.
Rule 4. Avoid trick statements with some minor misleading word or spelling anomaly, misplaced phrases, etc. A
wise student who does not know the subject matter may detect this strategy and thus get the answer correctly.
Example: True or False. The Principle of our school is Mr. Albert P. Panadero.
The Principal's name may actually be correct but since the word is misspelled and the entire sentence
takes a different meaning, the answer would be false! This is an example of a tricky but utterly useless item.
Rule 5. Avoid quoting verbatim from reference materials or textbooks. This practice sends the wrong signal to
the students that it is necessary to memorize the textbook word for word and thus, acquisition of higher level
thinking skills is not given due importance.
Rule 6. Avoid specific determiners or give-away qualifiers. Students quickly learn that strongly worded statements
are more likely to be false than true, for example, statements with "never," "no," "all," or "always." Moderately
worded statements are more likely to be true than false. Statements with "many," "often," "sometimes,"
"generally," "frequently," or "some" should be avoided.
Rule 7. With true or false questions, avoid a grossly disproportionate number of either true or false statements or
even patterns in the occurrence of true and false statements.
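The module does not state the correction-for-guessing formula itself; one standard version (given here as general background, not as part of the module) deducts a fraction of the wrong answers from the number right:

Corrected score = R - W / (k - 1)

where R is the number of right answers, W is the number of wrong answers, and k is the number of options per item. For a two-option true-false item (k = 2), this reduces to R - W, which shows how heavily blind guessing is penalized on binomial-choice tests.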
Example:
The short story "May Day's Eve" was written by which Filipino author?
A. Jose Garcia Villa
B. Nick Joaquin
C. Genoveva Edrosa Matute
D. Robert Frost
E. Edgar Allan Poe
If the distracters had all been Filipino authors, the value of the item would be greatly increased. In this particular
instance, only the first three carry the burden of the entire item, since the last two can be essentially disregarded
by the students.
7) All multiple choice options should be grammatically consistent with the stem.
8) The length, explicitness, or degree of technicality of alternatives should not be the determinants
of the correctness of the answer.
9) Avoid stems that reveal the answer to another item.
10) Avoid alternatives that are synonymous with others or those that include or overlap others.
Example:
What causes ice to transform from solid state to liquid state?
a. Change in temperature
b. Changes in pressure
c. Change in the chemical composition
d. Change in heat levels
The options a and d are essentially the same. Thus, a student who spots these identical choices would right away
narrow down the field of choices to a, b, and c. The last distracter would play no significant role in increasing the
value of the item.
11) Avoid presenting sequenced items in the same order as in the text.
12) Avoid use of assumed qualifiers that many examinees may not be aware of.
13) Avoid use of unnecessary words or phrases, which are not relevant to the problem at hand (unless
such discriminating ability is the primary intent of the evaluation). The item’s value is particularly damaged
if the unnecessary material is designed to distract or mislead. Such items test the students' reading
comprehension rather than knowledge of the subject matter.
14) Avoid use of non-relevant sources of difficulty such as requiring a complex calculation when only
knowledge of a principle is being tested.
Note in the previous example, knowledge of the sine of the 30-degree angle would have led some students to
use the sine formula for calculation even if a simpler approach would have sufficed.
15) Avoid extreme specificity requirements in responses.
16) Include as much of the item as possible in the stem. This allows less repetition and shorter choice
options.
17) Use the "None of the above" option only when the keyed answer is totally correct. When choice of
the "best" response is intended, "none of the above" is not appropriate, since the implication has already
been made that the correct response may be partially inaccurate.
18) Note that use of "all of the above" may allow credit for partial knowledge. In a multiple option item
(allowing only one option choice), if a student only knew that two (2) options were correct, he could then
deduce the correctness of "all of the above." This assumes you are allowed only one correct choice.
19) Having compound response choices may purposefully increase difficulty of an item.
20) The difficulty of a multiple choice item may be controlled by varying the homogeneity or degree of
similarity of responses. The more homogeneous, the more difficult the item.
Normally, column B will contain more items than column A to prevent guessing on the part of the students.
Matching type items, unfortunately, often test lower order thinking skills (knowledge level) and are unable to test
higher order thinking skills such as application and judgment skills.
A variant of the matching type items is the data sufficiency and comparison type of test illustrated below:
Example: Write G if the item on the left is greater than the item on the right; L if the item on the left is less than
the item on the right; E if the item on the left equals the item on the right and D if the relationship cannot be
determined.
        A                          B
1. Square root of 9   _____   a. -3
2. Square of 25       _____   b. 615
3. 36 inches          _____   c. 3 meters
The data sufficiency test above can, if properly constructed, test higher order thinking skills. Each item
goes beyond simple recall of facts and, in fact, requires the students to make decisions.
Another useful device for testing lower order thinking skills is the supply type of tests. Like the multiple choice
test, the items in this kind of test consist of a stem and a blank where the students would write the correct answer.
Example: The study of life and living organisms is called ________.
Supply type tests depend heavily on the way that the stems are constructed. These tests allow for one
and only one answer and hence, often test only the student’s knowledge. It is, however, possible to construct
supply type of tests that will test higher order thinking as the following example shows:
Example: Write an appropriate synonym for each of the following. Each blank corresponds to a letter:
Metamorphose: _ _ _ _ _ _
Flourish: _ _ _ _
The appropriate synonym for the first is CHANGE with six (6) letters, while the appropriate synonym for
the second is GROW with four (4) letters. Notice that these questions require not only mere recall of words but
also an understanding of these words.
D. Essays
Essays, classified as non-objective tests, allow for the assessment of higher order thinking skills. Such
tests require students to organize their thoughts on a subject matter in coherent sentences in order to inform an
audience. In essay tests, students are required to write one or more paragraphs on a specific topic. Essay
questions can be used to measure attainment of a variety of objectives. Stecklein (1955) has listed 14 types of
abilities that can be measured by essay items, all of which involve the higher-level skills mentioned in Bloom's
Taxonomy.
The following are rules of thumb which facilitate the scoring of essays:
Rule 1: Phrase the direction in such a way that students are guided on the key concepts to be included.
Rule 2: Inform the students on the criteria to be used for grading their essays. This rule allows the students to
focus on relevant and substantive materials rather than on peripheral and unnecessary facts and bits of
information.
Example: Write an essay on the topic "Plant Photosynthesis" using the keywords indicated. You will be graded
according to the following criteria: (a) coherence, (b) accuracy of statements, (c) use of keywords, (d) clarity, and
(e) extra points for innovative presentation of ideas.
Rule 3: Put a time limit on the essay test.
Rule 4: Decide on your essay grading system prior to getting the essays of your students.
Rule 5: Evaluate all of the students’ answers to one question before proceeding to the next question.
Scoring or grading essay tests question by question, rather than student by student, makes it possible to maintain
a more uniform standard for judging the answers to each question.
Rule 6: Evaluate answers to essay questions without knowing the identity of the writer. This is another attempt
to control personal bias during scoring.
Rule 7: Whenever possible, have two or more persons grade each answer. The best way to check on the reliability
of the scoring of essay answers is to obtain two or more independent judgments.
LEARNING ACTIVITY 1: Test Construction
I - Construct a multiple choice test (20 items) by following the guidelines stipulated above. Convert your
assignment in PDF and send via MSTeams.
1. For BECEd students (any lesson unit or chapter in Grade 3 Science, Grade 3 Math or Grade 3 English)
2. For BPEd students (any lesson unit or chapter in MAPEH 7, MAPEH 8, MAPEH 9 or MAPEH 10)
3. For BSE Science (any lesson unit or chapter in Science 7, Science 8, Science 9 or Science 10)
I. Direction
II. Multiple Choice Test
III. Answer key
“One way you will know that your targets are clear and usable is if you can determine what kind of learning
is being called for. The accuracy of the assessments you develop will depend in part on your ability to classify
learning targets in any written curriculum in a way that helps ensure a dependable assessment”
–Chappuis et al. 2012
Assessment Methods
1. Selected Response
Selected response assessments are those in which students select the correct or best response from a
list provided.
2. Written Response
Written response assessments require students to construct an answer in response to a question or task
rather than to select the answer from a list. They include short-answer items and extended written
response items. Short-answer items call for a very brief response having one or a limited range of
possible right answers. Extended written response items require a response that is at least several
sentences in length. They generally have a greater number of possible correct answers.
3. Performance Assessment
Performance assessment is assessment based on observation and judgment. It has two parts: the task
and the criteria for judging quality. Students complete a task—give a demonstration or create a product—
and we evaluate it by judging the level of quality using a rubric.
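For instance, a minimal three-level rubric for judging an oral demonstration might look like the following (an invented illustration, not taken from the module):

Criterion        3 - Proficient                     2 - Developing                     1 - Beginning
Accuracy         All steps performed correctly      Minor errors in one or two steps   Major errors throughout
Organization     Clear, logical sequence            Mostly logical, with some gaps     No discernible sequence
Communication    Explains each step clearly         Explains only some steps           Little or no explanation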
4. Personal Communication
Gathering information about students through personal communication is just what it sounds like—we
find out what students have learned through interacting with them.
Student responses are evaluated in one of two ways. Sometimes the questions we ask require students
to provide a simple, short answer, and all we’re looking for is whether the answer is correct or incorrect. This is
parallel to scoring for written selected response questions. Other times, our questions generate longer and more
complex responses, parallel to extended written response questions. Just as with extended written response
methodology, to evaluate the quality of oral responses we can use a rubric.
The accuracy of any classroom assessment turns on selecting the appropriate assessment method that
matches the achievement target to be assessed. To begin thinking about the match between kind of learning
target and assessment method, consider the "Target-Method Match" below. The acceptable matches between
methods and kinds of learning targets result in accurate information gathered as efficiently as possible.
Mismatches occur when the assessment method is not capable of yielding accurate information about the
learning target.
The following table summarizes which assessment methods are generally best matched to the basic
types of instructional learning targets. Note that for most types of learning targets, there are multiple assessment
methods that can yield accurate, reliable information on student learning.
Target-Method Match
In developing assessment instruments, the candidates to be assessed should always be kept in mind at each
step of the process. Different scenarios to be assessed call for different tools and modes of evaluation. Ensure
that the instruments and procedures for assessing are relevant to the audience, the skills and the task for which
they are being evaluated.
1. Table of Specifications
The table of specifications (TOS) is a tool used to ensure that a test or assessment measures the
content and thinking skills that the test intends to measure. Thus, when used appropriately, it can provide
response content and construct (i.e., response process) validity evidence. The purpose of a Table of
Specifications is to identify the achievement domains being measured and to ensure that a fair and representative
sample of questions appear on the test.
Knowledge. The students must be able to identify the subject and the verb in a given sentence.
Comprehension. The students must be able to determine the appropriate form of a verb to be used
given the subject of a sentence.
Application. The students must be able to write sentences observing rules on subject-verb agreement.
Analysis. The students must be able to break down a given sentence into its subject and predicate.
Synthesis/Evaluation. The students must be able to formulate rules to be followed regarding subject-
verb agreement.
Deciding on the type of objective test. The test objectives guide the kind of objective tests that will be
designed and constructed by the teacher. For instance, for the first four (4) levels, we may want to construct a
multiple-choice type of test while for application and judgment, we may opt to give an essay test or a modified
essay test.
Preparing a table of specifications (TOS). A table of specifications or TOS is a test map that guides
the teacher in constructing a test. The TOS ensures that there is a balance between items that test lower-level
thinking skills and those which test higher order thinking skills (or, alternatively, a balance between easy and
difficult items) in the test. The simplest TOS consists of four (4) columns:
(a) level of objective to be tested,
(b) statement of objective,
(c) item numbers where such an objective is being tested, and
(d) number of items and percentage out of the total for that particular objective. A prototype table is
shown below:
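To make the prototype concrete, here is an illustrative TOS built from the subject-verb agreement objectives listed earlier. The item numbers for the knowledge and synthesis rows follow the discussion below; those for comprehension and analysis are assumptions added for illustration:

Level of Objective      Objective                                           Item Numbers         No. of Items (%)
Knowledge               Identify the subject and the verb in a sentence     1, 3, 5, 7, 9        5 (16.7%)
Comprehension           Determine the appropriate form of a verb            2, 4, 6, 8, 10       5 (16.7%)
Analysis                Break down a sentence into subject and predicate    11, 13, 15, 17, 19   5 (16.7%)
Synthesis/Evaluation    Formulate rules on subject-verb agreement           12, 14, 16, 18, 20   5 (16.7%)
Application             Write sentences observing subject-verb agreement    Essay                10 points (33.3%)
                                                                            Total                30 points (100%)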
In the table of specifications we see that there are five items that deal with knowledge and these items
are items 1,3,5,7,9. Similarly, from the same table we see that five items represent synthesis, namely: 12, 14, 16,
18, 20. The first four levels of Bloom’s taxonomy are equally represented in the test while application (tested
through essay) is weighted equivalent to ten (10) points or double the weight given to any of the first four levels.
The table of specifications guides the teacher in formulating the test. As we can see, the TOS also ensures that
each of the objectives in the hierarchy of educational objectives is well represented in the test. As such, the
resulting test constructed by the teacher will be more or less comprehensive. Without the table of
specifications, the tendency for the test maker is to focus too much on facts and concepts at the knowledge level.
Constructing the test items. The actual construction of the test items follows the TOS. As a general
rule, it is advised that the actual number of items constructed in the draft should be double the desired
number of items. For instance, if there are five (5) knowledge level items to be included in the final test form,
then at least ten (10) knowledge level items should be included in the draft. The subsequent test try-out and item
analysis will most likely eliminate many of the constructed items in the draft (either because they are too difficult,
too easy, or non-discriminatory); hence, it is necessary to construct more items than will actually be included in
the final test form.
Item analysis and try-out. The test draft is tried out on a group of pupils or students. The purpose of this
try-out is to determine (a) the item characteristics through item analysis, and (b) the characteristics of the test
itself: validity, reliability, and practicality.
Assessment tools in the affective domain, in particular, those which are used to assess attitudes, interest,
motivation, and self-efficacy, have been developed. Whether one is assessing learning outcomes or objectives
for an academic or co-curricular program or for a single course, it is important to remember that assessment is
an iterative process, intended to provide useful feedback about what and how well students are learning. When
developing the plan, it is important to think through all four steps of the cycle.
1. Set program goals and outcomes - decide and articulate what students should know and/or be able to do
when they complete the program or the class;
2. Develop and implement assessment strategies - design tests, assignments, reports, performances, or
other activities that measure the types and quality of learning expected;
3. Review the assessment data - evaluate the results of the assessments to see what they show about student
learning; and
4. Create an action plan - decide how to address issues raised by the assessment data to improve student
learning.
2. Item analysis
Item analysis is a way to select the appropriate items for the final draft and reject the poor items,
improving the questions for future test administrations. The purpose of item analysis is to examine students'
responses to each item and look into the difficulty and discriminating ability of the item, as well as the
effectiveness of each alternative. It is an important tool for selecting and rejecting test items on the basis of
their difficulty and for assessing how items relate to students' performance; beyond improving the test itself, it
can reveal important data about student learning.
Item difficulty is defined as the number of students who are able to answer the item correctly divided by
the total number of students, usually expressed as a percentage. It portrays the easiness of an item: the higher
the percentage, the easier the item.
The discrimination index is the difference between the proportion of the top scorers and the proportion
of the lowest scorers who answer the item correctly; the formula below shows how it is computed for each item.
• It addresses the validity of the item, i.e., the extent to which the item tests the attribute it was intended to test.
D = (Ru - RL) / (T/2)

where:
Ru = number of students in the upper group who got the item right
RL = number of students in the lower group who got the item right
T/2 = half of the total number of students included in the analysis
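As a concrete illustration of both statistics, here is a minimal Python sketch (the function names and the sample numbers are assumptions for illustration, not taken from the module):

# Item difficulty: percentage of students answering the item correctly.
def item_difficulty(correct_responses, total_students):
    return 100.0 * correct_responses / total_students

# Discrimination index: D = (Ru - RL) / (T/2), comparing the upper and
# lower scoring groups on a single item.
def discrimination_index(upper_correct, lower_correct, total_in_analysis):
    return (upper_correct - lower_correct) / (total_in_analysis / 2)

# Worked example: 40 students in the analysis; 22 answered the item
# correctly overall; 16 of the upper 20 and 6 of the lower 20 got it right.
print(item_difficulty(22, 40))              # 55.0 -> moderately easy item
print(discrimination_index(16, 6, 40))      # (16 - 6) / 20 = 0.5 -> discriminates well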
3. Reliability
Reliability is the consistency of a research or measuring instrument. There are four types of reliability,
each measured in its own way:
1. Test-retest reliability measures the consistency of results when you repeat the same test on the same
sample at a different point in time.
2. Interrater reliability measures the degree of agreement between different people observing or
assessing the same thing.
3. Parallel forms reliability measures the correlation between two equivalent versions of a test.
4. Internal consistency assesses the correlation between multiple items in a test that are intended to
measure the same construct.
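Internal consistency, for example, is commonly estimated with Cronbach's alpha: alpha = k/(k-1) x (1 - sum of item variances / variance of total scores). A minimal Python sketch (the formula is standard, but the function name and sample data are illustrative assumptions):

from statistics import pvariance

def cronbach_alpha(item_scores):
    # item_scores: one list per item, each holding every student's score
    # on that item (e.g., 1 = correct, 0 = wrong for objective items).
    k = len(item_scores)                              # number of items
    totals = [sum(s) for s in zip(*item_scores)]      # per-student total scores
    sum_item_var = sum(pvariance(item) for item in item_scores)
    return k / (k - 1) * (1 - sum_item_var / pvariance(totals))

# Three items answered by four students:
items = [[1, 1, 0, 1],
         [1, 0, 0, 1],
         [1, 1, 0, 1]]
print(round(cronbach_alpha(items), 2))    # 0.88 on this toy data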
a. Measures of relationship
Measures of relationship are statistical measures which show a relationship between two or more variables
or two or more sets of data. For example, there is generally a high relationship, or correlation, between
parents' education and academic achievement. On the other hand, there is generally no relationship or
correlation between a person's height and academic achievement. The major statistical measure of
relationship is the correlation coefficient, sketched below.
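A minimal Python sketch of Pearson's r, the usual correlation coefficient (the data values are invented for illustration):

from statistics import mean, pstdev

def correlation(x, y):
    # Pearson's r between two paired score lists.
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    return cov / (pstdev(x) * pstdev(y))

parent_schooling = [8, 10, 12, 14, 16]    # years of schooling
achievement = [70, 74, 80, 83, 88]        # children's achievement scores
print(round(correlation(parent_schooling, achievement), 2))   # ~1.0: strong positive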
b. Index of determination
The reliability of a test is indicated by the reliability coefficient. It is denoted by the letter "r," and is
expressed as a number ranging between 0 and 1.00, with r = 0 indicating no reliability, and r = 1.00
indicating perfect reliability. Do not expect to find a test with perfect reliability. Generally, you will see
the reliability of a test as a decimal, for example, r = .80 or r = .93. The larger the reliability coefficient,
the more repeatable or reliable the test scores. The table below serves as a general guideline for interpreting
test reliability. However, do not select or reject a test solely based on the size of its reliability
coefficient.
The following table is a standard followed almost universally in educational test and measurement.

Reliability      Interpretation
0.90 and above   Excellent reliability; at the level of the best standardized tests.
0.80-0.90        Very good for a classroom test.
0.70-0.80        Good for a classroom test; in the range of most. There are probably a few items
                 which could be improved.
0.60-0.70        Somewhat low. This test needs to be supplemented by other measures (e.g., more
                 tests) to determine grades. There are some items which could be improved.
0.50-0.60        Suggests need for revision of the test, unless it is quite short (ten or fewer items). The
                 test definitely needs to be supplemented by other measures (e.g., more tests) for grading.
0.50 or below    Questionable reliability. This test should not contribute heavily to the course grade,
                 and it needs revision.
c. Inter-rater reliability
Inter-rater Reliability refers to the degree to which different raters give consistent estimates of
the same behavior. Inter-rater reliability can be used for interviews. In other words, scoring is precise,
as would be the case with selected-response items. Note, it can also be called inter-observer reliability
when referring to observational research. Here, researchers observe the same behavior independently
(to avoided bias) and compare their data. If the data is similar then it is reliable.
Where observers' scores do not significantly correlate, reliability can be improved by training
observers in the observation techniques being used, making sure everyone agrees with them, and
ensuring that behavior categories have been operationalized, meaning that they have been objectively
defined.
For example, if two researchers are observing "aggressive behavior" of children at a nursery, each
would have a subjective opinion regarding what aggression comprises. In this scenario, it is unlikely they
would record aggressive behavior in the same way, and the data would be unreliable.
However, if they were to operationalize the behavior category of aggression, this would be more
objective and make it easier to identify when a specific behavior occurs.
The disadvantage of simple percent agreement as a measure of inter-rater reliability is that it does
not account for agreement that occurs by chance, and so it overstates the true level of agreement. This is
the main reason why percent agreement should not be used for academic work.
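A common chance-corrected alternative is Cohen's kappa, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The module names the problem but not this statistic, so the following minimal Python sketch is offered as an illustrative assumption:

from collections import Counter

def cohens_kappa(rater_a, rater_b):
    n = len(rater_a)
    # Observed agreement: proportion of cases where the raters match.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each category's marginal rates.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Two observers coding ten observations as aggressive (A) or not (N):
a = list("AANNANNNAA")
b = list("AANNNNNNAA")
print(round(cohens_kappa(a, b), 2))    # 0.8: strong agreement beyond chance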
Validity is defined as the extent to which an assessment accurately measures what it is intended to measure.
Validity also refers to the accuracy of an assessment: whether or not it measures what it is supposed to measure.
Validity tells you how accurately a method measures something. If a method measures what it claims to measure,
and the results closely correspond to real-world values, then it can be considered valid.
4 Types of Validity
Construct Validity- is about ensuring that the method of measurement matches the construct you want
to measure. If you develop a questionnaire to diagnose depression, you need to know: does the questionnaire
really measure the construct of depression? Or is it actually measuring the respondent’s mood, self-esteem, or
some other construct?
To achieve construct validity, you have to ensure that your indicators and measurements are carefully developed
based on relevant existing knowledge. The questionnaire must include only relevant questions that measure
known indicators of depression.
Content Validity- assesses whether a test is representative of all aspects of the construct. To produce
valid results, the content of a test, survey, or measurement method must cover all relevant parts of the subject it
aims to measure. For example, a mathematics teacher develops an end-of-semester algebra test for her class.
The test should cover every form of algebra that was taught in the class. If some types of algebra are left out,
the results may not be an accurate indication of students' understanding of the subject. Similarly, if she includes
questions that are not related to algebra, the results are no longer a valid measure of algebra knowledge.
Face Validity- considers how suitable the content of a test seems to be on the surface. It is similar to
content validity, but face validity is a more informal and subjective assessment. For example, IQ tests are
supposed to measure intelligence, and a test would be valid if it accurately measured intelligence. At face value,
a picture-based IQ test might be thought valid and fair to speakers of languages other than English, because
pictures are a universal language. As face validity is a subjective measure, it is often considered the weakest
form of validity. However, it can be useful in the initial stages of developing a method.
Criterion Validity- measures how well one measure predicts an outcome for another measure. A test
has this type of validity if it is useful for predicting performance or behavior in another situation. For example, a
job applicant takes a performance test during the interview process. If this test accurately predicts how well the
employee will perform on the job, the test is said to have criterion validity.
APPLICATION
1. What is the relationship between validity and reliability? Can a test be reliable and yet not valid? Illustrate.
2. Discuss the different measures of reliability. Justify the use of each measure in the context of measuring
reliability.
3. Construct a TOS on any chapter or lesson in your field of specialization, allotting for a 30-item multiple choice
test. From your TOS, construct the multiple choice test. Try it out on at least 10 students and perform a simple
item analysis using item difficulty and the discrimination index.
SUMMARY
• Creating high-quality assessment forms (tests) and items (questions) takes time and expertise. Not all
assessment content will produce valid results.
REFERENCES
▪ Navarro, R. L., & Santos, R. G. (2012). Authentic Assessment of Student Learning Outcomes (Assessment
1), 2nd ed. Quezon City, Philippines: Lorimar Publishing, Inc.
▪ Jones, C. A. (2005). Assessment for Learning, London: Learning and Skills Development Agency.
Retrieved from [Link]
▪ Chappuis, J., Stiggins, R., Chappuis, S., & Arter, J. (2012). Classroom Assessment for Student Learning:
Doing It Right—Using It Well, 2nd ed. Upper Saddle River, NJ: Pearson Education, p. 11.
▪ McTighe, J., & Ferrera, S. (1998). Assessing Learning in the Classroom. Retrieved from [Link]
▪ Krathwohl, D. R., Bloom, B. S., & Masia, B. B. (1964). Taxonomy of Educational Objectives, Book II:
Affective Domain. New York, NY: David McKay Company, Inc.
▪ [Link]
▪ [Link]
▪ [Link]