
Don Mariano Marcos Memorial State University

La Union, Philippines

ASSESSMENT IN LEARNING I
MODULE 3

DR. VERONICA B. CARBONELL


MODULE 3
INTRODUCTION

Lesson 1 Reliability

Lesson 2 Validity

Lesson 3 Practicality and Efficiency

Lesson 4 Ethics
MODULE 3

ASSESSMENT IN LEARNING 1

INTRODUCTION

This is a 3-unit course that focuses on the principles, development and
utilization of conventional assessment tools to improve the teaching-learning process.
It emphasizes the use of testing for measuring knowledge, comprehension and
other thinking skills. It allows students to go through the standard steps in test
construction for quality assessment. It includes competencies contained in the
Trainers’ Methodology I of TESDA.

OBJECTIVES

After studying the module, you should be able to:

1. define reliability, including the different types and how they are assessed;
2. define validity, including the different types and how they are assessed;
3. describe the kinds of evidence that would be relevant to assessing the
reliability and validity of a particular measure;
4. describe how testing can be made more practical and efficient; and
5. recommend actions to observe ethical standards in testing.

DIRECTIONS/ MODULE ORGANIZER

There are four lessons in the module. Read each lesson carefully then answer
the exercises/activities to find out how much you have benefited from it. Work on
these exercises carefully and submit your output to your instructor.

In case you encounter difficulty, discuss this with your instructor during the
face-to-face meeting. If not, contact your instructor at the College of Education
office.

Good luck and happy reading!!!


Lesson 1

 RELIABILITY

Reliability refers to the consistency of a measure. Psychologists consider three types
of consistency: over time (test-retest reliability), across items (internal consistency),
and across different researchers (inter-rater reliability).

Test-Retest Reliability

When researchers measure a construct that they assume to be consistent across time,
then the scores they obtain should also be consistent across time. Test-
retest reliability is the extent to which this is actually the case. For example,
intelligence is generally thought to be consistent across time. A person who is highly
intelligent today will be highly intelligent next week. This means that any good
measure of intelligence should produce roughly the same scores for this individual
next week as it does today. Clearly, a measure that produces highly inconsistent
scores over time cannot be a very good measure of a construct that is supposed to be
consistent.

Assessing test-retest reliability requires using the measure on a group of people at
one time, using it again on the same group of people at a later time, and then looking
at the test-retest correlation between the two sets of scores. This is typically done by
graphing the data in a scatterplot and computing the correlation coefficient. For
example, when several university students took the Rosenberg Self-Esteem Scale
twice, a week apart, the correlation coefficient for the two sets of scores was +.95. In
general, a test-retest correlation of +.80 or greater is considered to indicate good
reliability.
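
To make the computation concrete, the following is a minimal sketch in Python of how
a test-retest correlation might be computed. The score lists are invented for
illustration; they are not the Rosenberg data described above.

# Minimal sketch: test-retest reliability as a Pearson correlation.
# The score lists below are hypothetical, invented for illustration.

def pearson_r(x, y):
    """Pearson correlation coefficient between two lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

time1 = [22, 25, 17, 30, 28, 19, 24]  # scores at the first administration
time2 = [21, 27, 18, 29, 26, 20, 25]  # same students, one week later

print(f"Test-retest r = {pearson_r(time1, time2):.2f}")  # +.80 or greater is considered good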

A Deeper Look at Reliability

TYPES OF RELIABILITY

The reliability of an assessment refers to the consistency of results. The most basic
interpretation generally references something called test-retest reliability, which is
characterized by the replicability of results. That is to say, if a group of students
takes a test twice, both the results for individual students, as well as the relationship
among students’ results, should be similar across tests.

However, there are two other types of reliability: alternate-form and internal
consistency. Alternate form is a measurement of how test scores compare across two
similar assessments given in a short time frame. Alternate form similarly refers to the
consistency of both individual scores and positional relationships. Internal
consistency is analogous to content validity and is defined as a measure of how the
actual content of an assessment works together to evaluate understanding of a
concept.

LIMITATIONS OF RELIABILITY

The three types of reliability work together to produce, according to Schillingburg,
“confidence… that the test score earned is a good representation of a child’s actual
knowledge of the content.” Reliability is important in the design of assessments
because no assessment is truly perfect. A test produces an estimate of a student’s
“true” score, or the score the student would receive if given a perfect test; however,
due to imperfect design, tests can rarely, if ever, wholly capture that score. Thus,
tests should aim to be reliable, or to get as close to that true score as possible.

Imperfect testing is not the only issue with reliability. Reliability is sensitive to the
stability of extraneous influences, such as a student’s mood. Extraneous influences
could be particularly dangerous in the collection of perceptions data, or data that
measures students, teachers, and other members of the community’s perception of
the school, which is often used in measurements of school culture and climate.

Uncontrollable changes in external factors could influence how a respondent
perceives their environment, making an otherwise reliable instrument seem
unreliable. For example, if a student or class is reprimanded the day that they are
given a survey to evaluate their teacher, the evaluation of the teacher may be
uncharacteristically negative. The same survey given a few days later may not yield
the same results. However, most extraneous influences relevant to students tend to
occur on an individual level, and therefore are not a major concern in the reliability
of data for larger samples.

HOW TO IMPROVE RELIABILITY

On the other hand, extraneous influences relevant to other agents in the classroom
could affect the scores of an entire class.

If the grader of an assessment is sensitive to external factors, their given grades may
reflect this sensitivity, therefore making the results unreliable. Assessments that go
beyond cut-and-dry responses engender a responsibility for the grader to review the
consistency of their results.

Some of this variability can be resolved through the use of clear and specific rubrics
for grading an assessment. Rubrics limit the ability of any grader to apply normative
criteria to their grading, thereby controlling for the influence of grader biases.
However, rubrics, like tests, are imperfect tools and care must be taken to ensure
reliable results.

How does one ensure reliability? Measuring the reliability of assessments is often done
with statistical computations.

The three measurements of reliability discussed above all have associated coefficients
that standard statistical packages will calculate. However, schools that don’t have
access to such tools shouldn’t simply throw caution to the wind and abandon these
concepts when thinking about data.

Schillingburg advises that at the classroom level, educators can maintain reliability
by:

 Creating clear instructions for each assignment


 Writing questions that capture the material taught
 Seeking feedback regarding the clarity and thoroughness of the assessment
from students and colleagues.

With such care, the average test given in a classroom will be reliable. Moreover, if
any errors in reliability arise, Schillingburg assures that class-level decisions made
based on unreliable data are generally reversible, e.g. assessments found to be
unreliable may be rewritten based on feedback provided.

However, reliability, or the lack thereof, can create problems for larger-scale
projects, as the results of these assessments generally form the basis for decisions
that could be costly for a school or district to either implement or reverse.

Why is it necessary?

While reliability is necessary, it alone is not sufficient: a reliable test is not
automatically a valid one. For example, if your scale is off by 5 lbs, it reads your
weight every day with an excess of 5 lbs. The scale is reliable because it consistently
reports the same weight every day, but it is not valid because it adds 5 lbs to your
true weight. It is not a valid measure of your weight.

Reliability is the degree to which an assessment tool produces stable and consistent
results. 

Types of Reliability

1. Test-retest reliability is a measure of reliability obtained by administering the
same test twice over a period of time to a group of individuals. The scores
from Time 1 and Time 2 can then be correlated in order to evaluate the test
for stability over time.
Example: A test designed to assess student learning in psychology
could be given to a group of students twice, with the second
administration perhaps coming a week after the first. The obtained
correlation coefficient would indicate the stability of the scores.

2. Parallel forms reliability is a measure of reliability obtained by administering
different versions of an assessment tool (both versions must contain items that probe
the same construct, skill, knowledge base, etc.) to the same group of individuals.
The scores from the two versions can then be correlated in order to evaluate the
consistency of results across alternate versions. 

Example: If you wanted to evaluate the reliability of a critical thinking assessment,
you might create a large set of items that all pertain to critical thinking and then
randomly split the questions up into two sets, which would represent the
parallel forms.
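
As a rough sketch of the random split described above (all data hypothetical), the
two parallel forms and their correlation could be computed like this;
statistics.correlation requires Python 3.10 or later:

import random
import statistics

# Hypothetical item pool: item_scores[s][i] is 1 if student s answered
# critical-thinking item i correctly, 0 otherwise.
item_scores = [
    [1, 0, 1, 1, 0, 1, 1, 0, 1, 1],
    [0, 0, 1, 0, 1, 0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1, 0, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0, 0, 1, 0, 0],
    [1, 1, 0, 1, 1, 1, 1, 0, 1, 1],
]

random.seed(0)                          # fixed seed so the split is repeatable
items = list(range(10))
random.shuffle(items)                   # randomly assign items...
form_a, form_b = items[:5], items[5:]   # ...to two five-item parallel forms

score_a = [sum(student[i] for i in form_a) for student in item_scores]
score_b = [sum(student[i] for i in form_b) for student in item_scores]

print("Parallel-forms r =", round(statistics.correlation(score_a, score_b), 2))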

3. Inter-rater reliability is a measure of reliability used to assess the degree to which
different judges or raters agree in their assessment decisions. Inter-rater reliability is
useful because human observers will not necessarily interpret answers the same way;
raters may disagree as to how well certain responses or material demonstrate
knowledge of the construct or skill being assessed.

Example: Inter-rater reliability might be employed when different judges are
evaluating the degree to which art portfolios meet certain standards. Inter-rater
reliability is especially useful when judgments can be considered relatively
subjective. Thus, the use of this type of reliability would probably be more likely
when evaluating artwork as opposed to math problems.
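
Agreement between raters is commonly summarized with percent agreement or with
Cohen’s kappa, which corrects for chance agreement. A minimal sketch with invented
portfolio ratings:

from collections import Counter

# Hypothetical pass/fail ratings of ten art portfolios by two judges.
rater1 = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail", "pass", "fail"]
rater2 = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "fail", "pass", "pass"]

n = len(rater1)
observed = sum(a == b for a, b in zip(rater1, rater2)) / n  # percent agreement

# Chance agreement: probability both raters pick the same category by accident,
# given each rater's own category frequencies.
c1, c2 = Counter(rater1), Counter(rater2)
expected = sum(c1[cat] * c2[cat] for cat in c1) / n ** 2

kappa = (observed - expected) / (1 - expected)
print(f"Percent agreement = {observed:.2f}, Cohen's kappa = {kappa:.2f}")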

4. Internal consistency reliability is a measure of reliability used to evaluate the
degree to which different test items that probe the same construct produce similar
results.
A. Average inter-item correlation is a subtype of internal consistency
reliability.  It is obtained by taking all of the items on a test that probe
the same construct (e.g., reading comprehension), determining the
correlation coefficient for each pair of items, and finally taking the
average of all of these correlation coefficients.  This final step yields the
average inter-item correlation.  
B. Split-half reliability is another subtype of internal consistency
reliability.  The process of obtaining split-half reliability is begun by
“splitting in half” all items of a test that are intended to probe the same
area of knowledge (e.g., World War II) in order to form two “sets” of
items.  The entire test is administered to a group of individuals, the
total score for each “set” is computed, and finally the split-half
reliability is obtained by determining the correlation between the two
total “set” scores.
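
Both subtypes can be illustrated with a minimal sketch (all item scores invented).
Note that in practice the raw split-half correlation is usually stepped up with the
Spearman-Brown formula, since each half is only half as long as the full test;
statistics.correlation requires Python 3.10 or later.

import statistics
from itertools import combinations

# Hypothetical 0/1 scores of six students on four items probing one construct.
items = [
    [1, 0, 1, 1, 0, 1],  # item 1: one score per student
    [1, 0, 1, 0, 0, 1],  # item 2
    [0, 1, 1, 1, 0, 1],  # item 3
    [1, 0, 0, 1, 0, 1],  # item 4
]

# A. Average inter-item correlation: mean of r over every pair of items.
pair_rs = [statistics.correlation(a, b) for a, b in combinations(items, 2)]
print("Average inter-item r =", round(sum(pair_rs) / len(pair_rs), 2))

# B. Split-half: total each student's score on the two halves, then correlate.
half1 = [i1 + i2 for i1, i2 in zip(items[0], items[1])]
half2 = [i3 + i4 for i3, i4 in zip(items[2], items[3])]
r_half = statistics.correlation(half1, half2)
print("Split-half r =", round(r_half, 2))
print("Spearman-Brown corrected =", round(2 * r_half / (1 + r_half), 2))
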
THINK!

TASK 1: SCENARIO-BASED/PROBLEM SOLVING LEARNING


As the department head or principal, what action would you take on the following
matters? Provide your recommendations based on the principles of validity and
reliability.

Scenario 1:
Mr. Roa taught the different elements and principles of art. After instruction, he
administered a test about the prominent painters and sculptors of the 20th century.
Would you recommend revisions? Why?
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________

Scenario 2:
In a geometry class, the learners have to calculate perimeters and areas of plane
figures like triangles, quadrilaterals and circles. The teacher decided to use
alternative assessment rather than tests. Students came up with Mathematics
portfolios containing their writings about geometry.
What would you tell the teachers? Why?
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________

Scenario 3:
There are two available assessment instruments to measure English skills on grammar
and vocabulary. Test A has high validity but no information concerning its reliability.
Test B was tested to have a high reliability index but no information about its
validity. Which one would you recommend? Explain briefly.
Lesson 2

 VALIDITY

The validity of an instrument is the idea that the instrument measures what it
intends to measure.

Validity pertains to the connection between the purpose of the research and which
data the researcher chooses to quantify that purpose.

For example, imagine a researcher who decides to measure the intelligence of a
sample of students. Some measures, like physical strength, possess no natural
connection to intelligence. Thus, a test of physical strength, like how many push-ups
a student could do, would be an invalid test of intelligence.

Validity refers to how well a test measures what it is purported to measure. 

A Deeper Look at Validity

The most basic definition of validity is that an instrument is valid if it measures what
it intends to measure. It’s easier to understand this definition through looking at
examples of invalidity. Colin Foster, an expert in mathematics education at the
University of Nottingham, gives the example of a reading test meant to measure
literacy that is given in a very small font size. A highly literate student with bad
eyesight may fail the test because they can’t physically read the passages supplied.
Thus, such a test would not be a valid measure of literacy (though it may be a valid
measure of eyesight). Such an example highlights the fact that validity is wholly
dependent on the purpose behind a test. More generally, in a study plagued by weak
validity, “it would be possible for someone to fail the test situation rather than the
intended test subject.” Validity can be divided into several different categories, some
of which relate very closely to one another. We will discuss a few of the most
relevant categories in the following paragraphs.

Types of Validity 

1. Face Validity ascertains that the measure appears to be assessing the intended
construct under study. The stakeholders can easily assess face validity. Although this
is not a very “scientific” type of validity, it may be an essential component in
enlisting the motivation of stakeholders. If the stakeholders do not believe the
measure is an accurate assessment of the ability, they may become disengaged with
the task.
Example: If a measure of art appreciation is created, all of the items should be related
to the different components and types of art.  If the questions are regarding historical
time periods, with no reference to any artistic movement, stakeholders may not be
motivated to give their best effort or invest in this measure because they do not
believe it is a true assessment of art appreciation. 

2. Construct Validity is used to ensure that the measure actually measures what it is
intended to measure (i.e. the construct), and not other variables. Using a panel of
“experts” familiar with the construct is a way in which this type of validity can be
assessed. The experts can examine the items and decide what that specific item is
intended to measure.  Students can be involved in this process to obtain their
feedback.

Example: A women’s studies program may design a cumulative assessment of learning
throughout the major. The questions are written with complicated wording and
phrasing. This can cause the test to inadvertently become a test of reading
comprehension, rather than a test of women’s studies. It is important that the
measure is actually assessing the intended construct, rather than an extraneous
factor.

3. Criterion-Related Validity is used to predict future or current performance - it
correlates test results with another criterion of interest.

Example: If a physics program designed a measure to assess cumulative student
learning throughout the major, the new measure could be correlated with a
standardized measure of ability in this discipline, such as an ETS field test or the GRE
subject test. The higher the correlation between the established measure and new
measure, the more faith stakeholders can have in the new assessment tool.

4. Formative Validity, when applied to outcomes assessment, is used to assess how
well a measure is able to provide information to help improve the program under
study.

Example: When designing a rubric for history, one could assess students’ knowledge
across the discipline. If the measure can provide information that students are
lacking knowledge in a certain area, for instance the Civil Rights Movement, then that
assessment tool is providing meaningful information that can be used to improve the
course or program requirements. 

5. Sampling Validity (similar to content validity) ensures that the measure covers the
broad range of areas within the concept under study.  Not everything can be covered,
so items need to be sampled from all of the domains.  This may need to be completed
using a panel of “experts” to ensure that the content area is adequately sampled. 
Additionally, a panel can help limit “expert” bias (i.e. a test reflecting what an
individual personally feels are the most important or relevant areas). 
Example: When designing an assessment of learning in the theatre department, it
would not be sufficient to only cover issues related to acting.  Other areas of theatre
such as lighting, sound, functions of stage managers should all be included.  The
assessment should reflect the content area in its entirety.

What are some ways to improve validity?

1. Make sure your goals and objectives are clearly defined and operationalized. 
Expectations of students should be written down.
2. Match your assessment measure to your goals and objectives. Additionally,
have the test reviewed by faculty at other schools to obtain feedback from an
outside party who is less invested in the instrument.
3. Get students involved; have the students look over the assessment for
troublesome wording, or other difficulties.
4. If possible, compare your measure with other measures, or data that may be
available.
THINK!

ACTIVITY 1: ASSESSMENT SCENARIO


For each of the following situations, determine whether the assumption is
valid. Explain your answer in two or three sentences citing the type of validity.
Scenario 1:
Test constructors in a secondary school designed a new measurement
procedure to measure intellectual ability. Compared to well-established measures of
intellectual ability, the new test is shorter to reduce the arduous effect of a long test
on students. To determine its effectiveness, a sample of students accomplished two
tests – a standardized intelligence test and the new test with only a few days interval.
Results from both assessments revealed high correlation.
Scenario 2:
After the review sessions, a simulated examination was given to graduating
students a few months before the Licensure Examination for Teachers (LET). When
the results of the LET came out, the review coordinator found that the scores in the
simulated (mock) examination are not significantly correlated with LET scores.
Scenario 3:
A new test was used as a qualifying examination for Secondary Education freshmen
who would like to major in Biological Science. The test was developed to measure
students’ knowledge of Biology and was given both to these freshmen and to those
already majoring in Biological Science. It
was hypothesized that the latter group will score better in the assessment procedure.
Test results indicated that it is so.
Scenario 4:
A science teacher gave a test on volcanoes to Grade 9 students. The test
included the type of volcanoes, volcanic eruptions and energy from volcanoes. The
teacher was only able to cover extensively the first two topics. Several test items
were included on volcanic energy and how energy from volcanoes may be tapped for
human use. The majority of her students got low marks.
Scenario 5:
A teacher handling “Media and Information Literacy” prepared a test on “Current
and Future Trends of Media and Information”. Topics include massive open online
content, wearable technology, 3D environment and ubiquitous learning. Below are the
learning competencies.
The learner should be able to:
a. Evaluate current trends in media and information and how they will affect
individuals and society in general;
b. Describe massive open online content;
c. Predict future media innovation;
d. Synthesize overall knowledge about media and information skills for producing
a prototype of what the learners think is a future media innovation.
The teacher constructed a two-way table of specification indicating the number of
items for each topic. The test items target the remembering, understanding and
applying levels.
LESSON 3

 Practicality and Efficiency


In this lesson, we will cover factors on practicality and efficiency that contribute
to high-quality assessments. These include teacher familiarity with the method; time
required; complexity of administration; ease of scoring and interpretation; and cost
(McMillan, 2007). These factors would have to be considered and balanced with
previously discussed principles.

From the foregoing, it is critical that teachers learn the strengths and
weaknesses of each method of assessment, and how they are developed, administered
and marked. You can see here how practicality and efficiency are intertwined with
other criteria of high-quality classroom assessment.

Time Required
It may be easier said than done, but a desirable assessment is short yet able
to provide valid and reliable results. Considering that time is a commodity that many
full-time teachers do not have, it is best that assessments are quick to develop but
not to the point or reckless construction. Assessments should allow students to
respond readily but not hastily. Assessment should also be scored promptly but not
without basis. It should be noted that time is a matter of choice – it hinges on the
teacher’s choice of assessment method. For instance, a multiple choice test may take
time to prepare but it can be accomplished by students in a relatively short period.
Moreover, the test is easily and objectively scored especially now with the aid of
optical mark readers. However, there is still a need to revisit the first principle: “Is it
appropriate for your learning targets?” An essay may be better on some occasions as
students are allowed to express their ideas with relatively few restraints. However,
essays are time consuming on the part of the teacher and his/her students. Essay
questions can easily be thought of, but essays will require considerable time for
students to organize their thoughts and express them in writing. It will also consume
time for teachers to read and mark students’ responses. Essay items are good for
testing small groups of students but such advantage decreases as the class size grows.
Meanwhile, performance assessments take a lot of time in preparation, student
response and scoring but they offer an opportunity to assess students on several
learning targets and at different levels of performance. Care should be exercised
though especially if performance assessments take away too much of instructional
time.
After considering the time issue, let us now discuss assessments’ reusability.
A multiple choice test may take a substantial amount of time in terms of preparation,
but the items when properly constructed may again be used for different groups.
Security is an issue in large-scale, high-stakes testing. Nonetheless, it is also a
problem in classroom testing because it affects both validity and reliability. If test
items were item-analysed and the test itself was established to be a valid and
reliable evidence of student performance, then many of the items can be used in
succeeding terms or periods as long as the same learning outcomes are targeted.
Thus, it is critical that objective tests are kept secure so that teachers do not have to
prepare an entirely new set of test questions every time. Besides, much time and
energy were spent in the test construction, and maintaining a database of excellent
test items means that tests can be recycled and reused.
One suggestion to improve reliability is to lengthen the assessment. The
longer it is, the higher the reliability. McMillan (2007) claims that for assessments to
provide reliable results, it generally takes thirty to forty minutes for a single score on
a short unit. He added that more time is required if separate scores are needed for
sub skills. He shared a general rule of thumb that six to ten objective items are
required if assessing a concept or specific skill.
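
The link between test length and reliability can be quantified with the standard
Spearman-Brown prophecy formula. A short illustration in Python (the starting
reliability of .60 is hypothetical):

# Spearman-Brown prophecy: predicted reliability when a test is
# lengthened by a factor k (k = 2 means doubling the number of items).
def predicted_reliability(r, k):
    return k * r / (1 + (k - 1) * r)

r_current = 0.60  # hypothetical reliability of a short quiz
for k in (1, 2, 3):
    print(f"{k}x length -> predicted r = {predicted_reliability(r_current, k):.2f}")
# Output: .60 at 1x, .75 at 2x, .82 at 3x - lengthening an assessment
# (with comparable items) generally raises its reliability.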

Ease of Administration
Assessments should be easy to administer. To avoid questions during a test or
performance task, instructions must be clear and complete. Instructions that are
vague will confuse the students and they will consequently provide incorrect
responses. This may be a problem with performance assessments that contain long
directions that are not stated directly and explicitly. If assessment procedures like in
Science experiments are too elaborate, reading and comprehending the procedures
would consume time.
Ease of Scoring
Obviously, selected-response formats are the easiest to score compared to
restricted-response and, more so, extended-response essays. It is also difficult to score
performance assessments like oral presentations, research papers, among others.
Objectivity is also an issue. Selected-response tests are objectively marked because
each item has one correct or best answer. This contributes to the reliability of the
test. Performance assessments however make use of rubrics, and while this facilitates
scoring, care must be observed in rating to ensure objectivity. McMillan (2007)
suggests that for performance assessments, it is more practical to use scales and
checklists rather than writing extended individualized evaluations.
Ease of Interpretation
Oftentimes, students are given a score that reflects their knowledge, skills or
performance. However, this is meaningless if such score is not interpreted. Objective
tests are the easiest to interpret. By establishing a standard, the teacher is able to
determine right away if a student passed the test. By matching the score with the
level of proficiency, teachers can determine if the student had reached mastery or
not. In performance tasks, rubrics are prepared to objectively and expeditiously rate
student’s actual performance or product. It reduces the time spent in grading
because teachers refer to the descriptors and performance levels in the rubric
without the need to write long comments. Nonetheless, the rubric would have to be
shared with the students so that they understand the meaning of the scores or grades
given to them.
Cost
Classroom tests are generally inexpensive compared to national or high-stakes
tests. Hence, citing cost as a reason for being unable to come up with valid and reliable
tests is simply unreasonable. As for performance tasks, examples of tasks that are not
considerably costly are written and oral reports, debates and panel discussions. Of
course, students would have to pay for their use of resources like computers and
copying facilities among others. However, they are not as costly as some performance
assessments that require plenty and/or expensive materials, examples of which are
laboratory experiments, dioramas, paintings, role plays with elaborate costumes,
documentary films and portfolios. Teachers may remedy this by asking students to
consider using second-hand or recycled materials.

While classroom tests may not be costly compared to some performance
assessments, it is relevant to know that excessive testing may just train students on
how to take tests but inadequately prepare them for a productive life as an adult. It
is paramount that other methods of assessment are used in line with the learning
targets. According to Darling-Hammond & Adamson (2013), open-ended assessments –
essay exams and performance tasks - are expensive to score, but they can support
more ambitious teaching and learning.
It is recommended that one chooses the most economical assessment. McMillan (2007,
p. 88) in his explanation of economy said that “economy should be thought of in the
long run, and unreliable less expensive tests (or any assessment for that matter) may
eventually cost more in further assessment”. At the school level, multiple choice tests
are still very popular especially in entrance tests, classroom tests, and simulated
board examinations. However, high quality assessments focus on deep learning which
is essential if students are to develop the skills they need for a knowledge society
(Darling-Hammond & Adamson, 2013). That being said, schools must support
performance assessment.
THINK!

Task: SCENARIO-BASED/ PROBLEM SOLVING LEARNING

As an assessment expert, you were asked for advice on the following matters. Provide
your recommendations.

Scenario 1:

Ms. Loreto handles six Grade 5 classes in English. She would like to test their skills in
spelling. She has two options:

Oral spelling test
The teacher pronounces each word out loud and the students write down each word.
Spelling bee-type test
Each student is asked individually, one at a time, to spell specific words out loud.
In terms of practicality and efficiency, which one would you suggest? Why?
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
_________________________________________________

Scenario 2:

Mr. Chua is preparing a final examination for his third year college students in
Philippine Government and Constitution scheduled two weeks from now. He is
handling six classes. He has no teaching assistant. He realized that after the final
examinations, he has three days to calculate the grades of his students. He is thinking
of giving extended response essays.

What advice would you give him? Which aspect of practicality and efficiency should he
prioritize? What type of assessment should he consider?
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________
Scenario 3:

Ms. Rodriguez will give an end-of-year listening comprehension and speaking test in
English for her grade three pupils in Pampanga. She handles three sections with 45
students. She meets them daily for 40-50 minutes. She has only one teaching
assistant.

Should the test be individually-administered or group-administered?


Should the directions, examples and prompts be in the mother tongue or in English?
Should these be spoken or written?
Should student answers and responses be in mother tongue or in English? Spoken or
written?
Should the method of scoring be based on counting the number of correct answers,
and/or follow a holistic approach (one overall score) or analytic approach (separate
scores for each performance criterion)?

Give your recommendations in view of the principle of practicality and efficiency.


__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__
Lesson 4

 Ethics
Overview
This lesson centers on ethical issues and responsibilities of teachers in the
assessment process. Russell & Airasian (2012) define assessment as more than just a
technical activity – it is a human activity. They explained that assessment has
consequences for students and other stakeholders. If you recall the relevance and
roles of assessment, teachers as well as students, administrators, policy makers, and
other stakeholders have a stake in assessment. Assessment is used to form judgments
on the nature, scope and extent of students’ learning. For summative purposes, it
drives instruction. If students’ assessment scores are used for other purposes different
from what was intended, let’s say to evaluate teachers’ performance, then there is
an ethical issue.
“Teachers’ assessments have important long-term and short-term consequences
for students thus teachers have an ethical responsibility to make decisions using the
most valid and reliable information possible” (Russell & Airasian, 2012, p.21). By and
large, teachers are accountable for ensuring that their assessments are valid and
reliable. Validity and reliability are aspects of fairness. Fairness is an ethical value.
Other aspects of fairness include (1) students’ knowledge of learning targets and
assessments; (2) opportunity to learn; (3) prerequisite knowledge and skills; (4)
avoiding student stereotyping; (5) avoiding bias in assessment tasks and procedures;
and (6) accommodating special needs (McMillan, 2007; Russell & Airasian, 2012).

Students’ Knowledge of Learning Targets and Assessments


This aspect of fairness speaks of transparency. Transparency is defined here as
disclosure of information to students about assessments. This includes what learning
outcomes are to be assessed and evaluated, assessment methods and formats,
weighting of items, allocated time in completing the assessment and grading criteria
or rubric. By informing students regarding the assessment details, they can
adequately prepare and recognize the importance of assessment. They become part
of the assessment process. By doing so, assessment becomes learner-centered.
In regard to written tests, it is important that students know what is included
and excluded in the test. Giving them sample questions may help them evaluate their
strategies and current levels of understanding. Aside from the content, the scoring
criteria should be known or made public (to the students concerned). As for
performance assessments, the criteria should be divulged prior to assessment so that
students will know what the teacher is looking for in the actual performance or
product. Following the criteria, they can reflect on their practices and personally
identify their strengths and weaknesses. They can evaluate their work performance
and product output and make the necessary improvements before the scheduled
assessment or work submission. For product-based assessments, it would be
instrumental if teachers can provide a sample of the work done by previous students
so that students can see the kind and quality of work their teacher is expecting from
them. However, originality should be emphasized.
Now, what about surprise tests or pop quizzes? There are teachers who would defend
and rationalize their giving of unannounced assessments. There are studies that would
support it. For instance, Graham’s (1999) study revealed that unannounced quizzes
raised the test scores of mid-range undergraduate students, and the majority of
students in his sample claimed to appreciate the use of quizzes. This may be due to
the feedback process
involved. Graham also found that unannounced quizzes motivated students to
distribute their learning evenly and encouraged them to read materials in advance.
Kamuche (2007) reported that students given unannounced quizzes showed better
academic performance than a control group given announced quizzes. While unannounced
quizzes compel students to come to class prepared, we could not discount the
possibility that pop quizzes generate anxiety among students. Graham (cited by
Kamuche, 2007) stated that unannounced quizzes tend to increase examination
tension and stress, and do not offer a fair examination.
Test-taking skills are another concern. For instance, some students may be better at
answering multiple choice test items than other students. They may have
developed test-taking skills and strategies like reading the directions carefully,
previewing the test, answering easy items first, reading the stem and all the options
before selecting an answer, marking vital information in the stem, eliminating
alternatives, and managing test time effectively. To level the playing field, all
students should be familiar with the test strategies before they take the assessment.
Relative to the above, teachers should not create unusual hybrids of assessment
formats. For instance, a matching type of three columns may leave test-takers
perplexed, deciphering the test format rather than having them show how well
they’ve learned the subject matter.
Opportunity to Learn
There are teachers who are forced to give reading assignments because of the
breadth of content that has to be covered in addition to limited or lost classroom
time. Then, there is little or no instruction that follows. This would certainly put
students at a disadvantage because they were not given ample time and resources to
sufficiently assimilate the material. They are being penalized for lack of opportunity.
McMillan (2007) asserted that fair assessments are aligned with instruction that
provides adequate time and opportunities for all students to learn. Discussing an
extensive unit in an hour is obviously insufficient. Inadequate instructional
approaches would not be just to the learners because they are not given enough
experiences to process information and develop their skills. They will be ill-prepared
for a summative test or performance assessment.

Prerequisite Knowledge and Skills


Students may perform poorly in an assessment if they do not possess
background knowledge and skills. Suppose grade school pupils were taught about
inverse proportion. Even if they memorize the meaning of proportion, they would not
be able to develop a schema if they are not able to connect new information with
previous knowledge. If they lack adequate knowledge about ratios and direct
proportion, they would not fully grasp the concept of inverse proportion. Moreover,
they would have difficulty solving word problems on proportion if they have weak
skills in multiplication and division which are prerequisite skills. And so, it would be
improper if students are tested on the topic without any attempt or effort to address
the gap in knowledge or skills. The problem is compounded if there are
misconceptions. The need for action and correction is more critical.
Another problem emerges if the assessment focuses heavily on prior
knowledge and prerequisite skills. Going back to the previous example, if students are
tasked to solve problems on proportion that were written verbosely, then their
performance in the assessment is considerably reflective of their reading
comprehension skills and vocabulary rather than their problem-solving skills. The
same thing can be said for problems that are simply worded but contain extremely
large numbers. The test would simply call for skills in arithmetic calculations. This
would not be fair to the students concerned.
So as not to be unfair, the teacher must identify early on the prerequisite
skills necessary for completing an assessment. The teacher can analyse the
assessment items and procedures and determine the pieces of knowledge and skills
required to answer them. Afterwards, the teacher can administer a prior knowledge
assessment, the results of which can lead to additional or supplemental teacher- or
student-managed activities like peer-assisted study sessions, compensatory groups,
note swapping and active review. The teacher may also provide clinics or reinforced
tutorials to address gaps in students’ knowledge and skills. He/she may also
recommend reading materials or advise students to attend supplemental instruction
sessions when possible. These are forms of remediation. Depending on the case, if
warranted, the teacher may advise students to drop the course until they are ready
or retake a prerequisite course. At the undergraduate level, prerequisites are imposed
to ensure that students possess background knowledge and skills necessary to advance
and become successful in subsequent courses.
Avoiding Stereotyping
A stereotype is a generalization about a group of people based on inconclusive
observations of a small sample of this group. Common stereotypes are racial, sexual
and gender remarks. Stereotyping is caused by preconceived judgements of the
people one comes in contact with, which are sometimes unintended. It is different
from discrimination, which involves acting out one’s prejudicial opinions. For instance,
a teacher employed in a city school may have low expectations of students coming
from provincial schools or those belonging to ethnic minority groups. Another may
harbour uncomfortable feelings towards students from impoverished communities.
These teachers carry the idea that students from such groups are cognitively or
affectively inferior. These typecasts are based on socio-economic status.
There are also those on gender, race or culture. A professional education
teacher may believe that since the education program is dominated by females, they
are better off as teachers than males. A teacher may also have an opinion that other
Asian students are better in Mathematics than Filipino students, and thus the latter
will require more instruction. Stereotypes may either be positive or negative. For
instance, foreigners would regard Filipinos as hospitable and hardworking individuals,
but Filipinos are also being stereotyped as domestic helpers and caregivers.
Teachers should avoid terms and examples that may be offensive to students
of different gender, race, culture or nationality. Stereotypes can affect students’
performance in examinations. In 1995, Steele & Aronson developed the theory of
stereotype threat claiming that for people who are challenged in areas they deem
important like intellectual ability, their fear of confirming negative stereotypes can
cause them to falter in their actual test performance. For instance, a female student
who was told that females are academically inferior in Mathematics may feel a
certain level of anxiety, and the negative expectations may cause her to
underperform in the assessment. Jordan & Lovett’s (2006) paper provided a review of
literature on stereotype threat. The paper cited research on the detrimental
effects of stereotype threat such as reduced working memory capacity; abandonment
of valued social identities; disengagement from threatening domains among
stereotyped individuals; and lowered self-worth.
To reduce the negative effects of stereotype threat, simple changes in classroom
instruction and assessment can be implemented, such as encouraging diverse students
that they can excel at difficult tasks, assuring them that any responsible student can
achieve high standards, and ensuring gender-free and culturally unbiased test items. A
school environment that fosters positive practices and supports collaboration instead
of competition can be beneficial especially for students in diverse classrooms where
ethnic, gender and cultural diversity thrive.
Jordan & Lovett (2006) recommended five concrete changes to psycho-educational
assessments to alleviate stereotype threat:
1. Be careful in asking questions about topics related to a student’s demographic
group. These may inadvertently induce stereotype threat even if the information
presented in the test is accurate.
2. Place measures of maximal performance like ability and achievement tests at the
beginning of assessments, before giving less formal self-report activities that contain
topics or information about family background, current home environment, preferred
extracurricular activities and self-perceptions of academic functioning.
3. Do not describe tests as diagnostic of intellectual capacity.
4. Determine if there are mediators of stereotype threat that affect test performance.
This can be done using informal interviews or through standardized measures of
cognitive interference and test anxiety.
5. Consider the possibility of stereotype threat when interpreting the test scores of
individuals susceptible to typecasting.

Avoiding Bias in Assessment Tasks and Procedures


Assessment must be free from bias. Fairness demands that all learners are
given equal chances to do well (from the task) and get a good assessment (from the
rater). Teachers should not be affected by factors that are not part of the assessment
criteria. In correcting an essay for instance, a student’s gender, academic status,
socio-economic background or handwriting should not influence the teacher’s
judgement or scoring decision. This aspect of fairness also includes removal of bias
towards students with limited English or with different cultural experiences when
providing instruction and constructing assessments (Russell & Airasian, 2012). This
should not be ignored especially with the advent of the ASEAN (Association of
Southeast Asian Nations) Economic Integration in 2015, when there is greater mobility
of students among educational institutions across ASEAN countries.
There are two forms of assessment bias: offensiveness and unfair penalization
(Popham, 2011). These forms distort the test performance of individuals in a group.
Offensiveness happens if test-takers get distressed, upset or distracted about how an
individual or a particular group is portrayed in the test. The content of the
assessment may contain slurs or negative stereotypes of a particular ethnic, religious
or other group, causing undue resentment toward the items, and the test-takers’
concentration in answering subsequent items suffers. Ultimately, they end up not
performing as well as they could have, reducing the validity of inferences. An essay
about traffic congestion sweepingly portraying traffic enforcers of the Metropolitan
Manila Development Authority (MMDA) as corrupt is an example of bias. This
assessment may affect students whose parents are working with the MMDA.
Unfair penalization harms student performance due to test content, not because
items are offensive but because the content caters to particular groups from the
same economic class, race, gender, etc., leaving other groups at a loss or a
disadvantage. For example, suppose a reading comprehension test was given, with
questions based on a passage about the effects of K to 12. Will students who are not
familiar with the K to 12 basic education framework answer the questions accurately?
Similarly, will male and female students be able to perform equally well in a
statistics test that contains several problems and data about sports? What if a teacher
in Computer or Educational Technology gives a test on how wearable technology can
impact various professions: will students coming from low-income families answer as
elaborately as those from the upper class who actually possess wearable gadgets?
Consider another situation. Suppose the subject is Filipino or Araling Panlipunan
(Social Studies), and the class has foreign students. Should they be mixed with native
speakers in class? Should test items be constructed containing deep or heavy Filipino
words? The aforementioned are situations that illustrate undue penalization resulting
from group membership. Unfair penalization causes distortion and greater variation in
scores which is not due to differences in ability. Substantial variation or disparity in
assessment scores between student groups is called disparate impact. Popham (2011)
pointed out that disparate impact is not tantamount to assessment bias. Differences
may yet exist, but they may be due to inadequate prior instructional experience. Take
for instance the 2013 National Achievement Test where the National Capital Region
topped other regions under Cluster 1 composed of Eastern Visayas, Western Visayas,
Central Luzon, Bicol and Calabarzon. If the test showed no sign of bias, then it is
implied that the disparate impact is due to prior instructional inadequacies or lack of
preparation.
To avoid bias during the instruction phase, teachers should heighten their
sensitivity towards bias and generate multiple examples, analogies, metaphors and
problems that cut across boundaries. To eradicate or significantly reduce bias in
assessments, teachers can consider a judgmental approach or an empirical approach
(Popham, 2011).
Teachers can have their tests reviewed by colleagues to remove offensive words or
items. Content-knowledgeable reviewers can scrutinize the assessment procedure or
each item of the test. In developing high-stakes tests, a review panel is usually formed
- a mix of male and female members from various subgroups who might be adversely
impacted by the test. For each item, the panelists are asked to determine if it might
offend or unfairly penalize any group of students on the basis of personal
characteristics. Each panel member responds and gives their comments. The mean
per-item absence-of-bias index is calculated by getting the average of the “no”
responses. If an item is found biased, the item is discarded. Qualitative comments are
also considered in the decision to retain, modify or reject items. In the empirical
approach, items are examined for differential item functioning (DIF): an item is
labelled with DIF when people with comparable abilities but from different groups
have unequal chances of item success. Item response theory (IRT), Mantel-Haenszel
and logistic regression are common procedures for assessing DIF.
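
To illustrate the judgmental review described above, here is a minimal sketch (panel
responses invented) of computing a per-item absence-of-bias index as the proportion
of panelists answering “no” to the question of whether an item might offend or
unfairly penalize any group; the retention cut-off is hypothetical:

# Hypothetical bias-review panel: for each item, each panelist answers
# "Might this item offend or unfairly penalize any group of students?"
panel_responses = {
    "item 1": ["no", "no", "no", "no", "no"],
    "item 2": ["no", "yes", "no", "no", "no"],
    "item 3": ["yes", "yes", "no", "yes", "no"],
}

THRESHOLD = 0.80  # hypothetical cut-off for retaining an item

for item, answers in panel_responses.items():
    index = answers.count("no") / len(answers)  # absence-of-bias index
    verdict = "retain" if index >= THRESHOLD else "review or discard"
    print(f"{item}: absence-of-bias index = {index:.2f} -> {verdict}")
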
Accommodating Special Needs
Teachers need to be sensitive to the needs of students. Certain
accommodations must be given especially for those who are physically or mentally
challenged. The legal basis for accommodation is contained in Sec 12 of Republic Act
7277 entitled “An Act Providing for the Rehabilitation, Self-Development and Self-
Reliance of Disabled Persons and their Integration into the Mainstream of Society and
for Other Purposes”. The provision talks about access to quality education - that
learning institutions should consider the special needs of learners with disabilities in
terms of facilities, class schedules, physical education requirements and other related
matters. Another is Sec 32 of CHED Memorandum 09, s. 2013 on “Enhanced Policies
and Guidelines on Student Affairs and Services” which states that higher education
institutions should ensure that academic accommodation is made available to persons
with disabilities and learners with special needs.
Accommodation does not mean giving an advantage to students with learning
disabilities, but rather allowing them to demonstrate their knowledge on assessments
without hindrance from their disabilities. It is distinct from assessment modification,
as accommodation does not involve altering the construct of the assessment - what
the assessment was intended to measure in the first place.
Let us consider some situations that require accommodation. For students with
documented learning disabilities who are slow in reading, analysing and responding to
test questions, the teacher can offer extended time to complete the test. For
students who are easily distracted by noise, the teacher can make arrangements for
these students to accomplish the assessment in another room free from distractions or
carry out simple or innovative ways to reduce unnecessary noise from entering the
classroom. For students who do not have perfect vision, the teacher can adjust and
print the written assessment with a larger font.
The situations above have straightforward solutions. But there are also
challenging situations that require much thought. For instance, should a student who
is recovering from an accident, unable to write with his/her hand be allowed to have
a scribe? Should foreign students who do not possess an extensive English vocabulary
be permitted to use a dictionary? Are there policies and processes for such cases?
Accommodations can be placed in one of six categories (Thurlow, McGrew,
Tindal, Thompson & Ysseldyke, 2000):
1. Presentation (repeat directions, read aloud, use large print, Braille)
2. Response (mark answers in test booklet, permit responses via digital recorder or
computer, use reference materials like a dictionary)
3. Setting (study carrel, separate room, preferential seating, individualized or small
group, special lighting)
4. Timing (extended time, frequent breaks, and unlimited time)
5. Scheduling (specific time of day, subtests in different order, administer test in
several timed sessions)
6. Others (special test preparation techniques and out-of-level tests)
Fundamentally, an assessment accommodation should attend to the particular need of
the student concerned. For instance, presentation and setting are important
considerations for a learner who is visually impaired. For a learner diagnosed with
attention deficit hyperactivity disorder (ADHD), frequent breaks are needed during a
test because of the child’s short attention span. To ensure the appropriateness of the
accommodation supplied, it should take into account three important elements:
Nature and extent of the learner’s disability. Accommodation is dictated by the type
and degree of disability possessed by the learner. A learner with moderate visual
impairment would need a larger print edition of the assessment or special lighting
conditions. Of course, a different type of accommodation is needed if the child has
severe visual loss.
Type and format of assessment. Accommodation is matched to the type and
format of assessment given. Accommodations vary depending on the length of the
assessment, the time allotted, mode of response, etc. A partially deaf child would not
require assistance in a written test. However, his/her hearing impairment would
affect his/her performance should the test be dictated. He/she would also have
difficulty in assessment tasks characterized by group of discussions like round table
sessions.
Competency and content being assessed. Accommodation does not alter the level of
performance or content the assessment measures. In Science, permitting students to
have a list of scientific formulae during a test is acceptable if the teacher is assessing
how students are able to apply the formulae, not simply recall them. In Mathematics,
if the objective is to add and subtract counting numbers quickly, extended time would
not be a reasonable accommodation.

Relevance

Relevance can also be thought of as an aspect of fairness. Irrelevant
assessment would mean short-changing students of worthwhile assessment
experiences. Assessment should be set in a context that students will find purposeful.
Killen (2000) gave additional criteria for achieving quality assessments.
“Assessment should reflect the knowledge and skills that are most important
for students to learn.” Assessment should not include irrelevant and trivial content.
Instead, it should measure learners’ higher-order abilities such as critical thinking,
problem solving and creativity, which students need to remain competitive in today’s
global society. Teachers are reminded that when aiming for high levels of performance,
assessment should not curtail students’ sense of creativity and personality in their work.
Rather, it should foster initiative and innovation among students.
“Assessment should support every student’s opportunity to learn things that
are important.” Assessment must provide genuine opportunities for students to show
what they have learned and encourage reflective thinking. It should prompt them to
explore what they think is important. This can be done, for example, using Ogle’s KWL
(Know-Want-Learn) chart as a learning and assessment tool. It activates students’
prior knowledge and personal curiosity and encourages inquiry and research.
“Assessment should tell teachers and individual students something that they do not
already know.” Assessment should stretch students’ ability and understanding.
Assessment tasks should allow them to apply their knowledge in new situations. In a
constructivist classroom, assessment can generate new knowledge by scaffolding prior
knowledge.

Ethical Issues
There are times when assessment is not called for. Asking pupils to answer
sensitive questions, such as those about their sexuality or problems in the family, is
unwarranted, especially without the consent of the parents. Grades and reports
generated by teachers using invalid and unreliable test instruments are unjust; the
resulting interpretations are inaccurate and misleading.
Other ethical issues in testing (and research) that may arise include possible harm to
the participants; confidentiality of results; deception in regard to the purpose and use
of the assessment; and temptation to assist students in answering tests or responding
to surveys.
THINK!

Activity 1: ASSESSMENT SCENARIOS


Suppose you are the principal of a public high school. You received complaints from
students concerning their tests. Based on their complaints, you decided to talk to the
teachers concerned and offered advice based on ethical standards. Write down your
recommendations citing specific aspects of ethics or fairness discussed in this
chapter.

Scenario 1: Eighth-grade students complained that their music teacher uses
written tests as the sole method of assessment. They were not assessed on their skills
in singing or creating musical melodies.

Scenario 2: Grade 7 students complained that they were not informed that there is a
summative test in Algebra.

Scenario 3: Grade 7 students complained that they were not told what to study for
the mastery test. They were simply told to study and prepare for the test.

Scenario 4: Grade 9 students complained that there were questions in their Science
test on the last unit, which was not discussed in class.

Scenario 5: Grade 9 students studied for a long test covering the following topics:
Motion in Two Dimensions; Mechanical Energy; Heat, Work and Efficiency; and
Electricity and Magnetism. The teacher, however, prepared questions mostly on the
first topic. Hence, students complained that most of what they studied did not turn up
in the test.
Scenario 6: Students complained that they had difficulty with the test because they
prepared by memorizing the content.

Scenario 7: Students were tested on an article they read about Filipino ethnic groups.
The article depicted Ilocanos as stingy. Students who hail from Ilocos were not
comfortable answering the test.

Activity 2: DO’S AND DON’TS


Suppose you are a firm teacher, especially when it comes to ethics in assessment.
You would like to observe correct conduct prior to, during and after a test. Write down
five do’s and don’ts for each phase of testing.

Before the test


During the test
After the test

CASE ANALYSIS

Article VIII of the Code of Ethics for Teachers (Resolution No. 435, s. 1997) contains
ethical standards in relation to students. These standards are benchmarks to ensure
that teachers observe fairness, justice and equality in all phases of the teaching-
learning cycle.

Article VIII: The Teacher and Learners

Section 1. A teacher has a right and duty to determine the academic marks and
the promotions of learners in the subject or grades he handles, provided that such
determination shall be in accordance with generally accepted procedures of
evaluation and measurement. In case of any complaint, teachers concerned shall
immediately take appropriate actions, observing due process.

Section 2. A teacher shall recognize that the interest and welfare of learners are of
first and foremost concern, and shall deal justifiably and impartially with each of
them.

Section 3. Under no circumstances shall a teacher be prejudiced or discriminate
against a learner.
Section 4. A teacher shall not accept favors or gifts from learners, their parents or
others in their behalf in exchange for requested concessions, especially if undeserved.

Section 5. A teacher shall not accept, directly or indirectly, any remuneration from
tutorials other than what is authorized for such service.

Section 6. A teacher shall base the evaluation of the learner’s work only on merit
and quality of academic performance.

Section 7. In a situation where mutual attraction and subsequent love develop
between teacher and learner, the teacher shall exercise utmost professional
discretion to avoid scandal, gossip and preferential treatment of the learner.

Section 8. A teacher shall not inflict corporal punishment on offending learners
nor make deductions from their scholastic ratings as a punishment for acts which are
clearly not manifestations of poor scholarship.

Section 9. A teacher shall ensure that conditions contributing to the maximum
development of learners are adequate, and shall extend needed assistance in
preventing or solving learners’ problems and difficulties.
For each of the following, explain why the teacher’s action is deemed unethical.
Moreover, cite a section of the Code of Ethics that was violated to support your
answer:
A Social Studies teacher gave an essay asking students to suggest ways to improve the
quality of education in the country. The teacher simply scanned students’ answers
and many of the students received the same score.
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________

A Grade 4 Technology and Livelihood Education teacher did not allot sufficient time
to teaching animal care but instead focused on gardening, a topic which he liked the
most.
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________

The teacher uses stigmatizing descriptions for students who are not able to answer
the teacher’s questions during initiation-response-evaluation/recitation.
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________
A grade school teacher deducted five points from a student’s test score for his
misdemeanor in class.
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________

The teacher gave additional points to students who bought tickets for the school play,
which was declared to be optional.
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________

Some students received an “incomplete” in the performance task due to absences or
other reasons. The teacher told them to acquire books in lieu of the performance
assessment.
__________________________________________________________________________
__________________________________________________________________________
_________________________________________________

A student approached the teacher during the summative test in Mathematics to ask
about a test item. The teacher reworded the test item and clarified the test question.
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________

The teacher excused foreign students from participating in the Linggo ng Wika program.
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________
Task 2: WORD CLOUD
A word cloud is an image composed of words selected from a text source.
Words that appear more frequently are given more prominence.
Explain how ethics comes into play in regard to assessment. Use at least three
words/expressions found in the word cloud. Cite an example or situation of how ethics
(in assessment) can be translated from theory to practice.
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________

GROUP ASSIGNMENT:
Ask a group of elementary, high school or non-Education college students (20-25)
about their concept of a fair assessment. Write down their responses in bullet form.
Highlight key words. Create your own word cloud. You may use any online word cloud
generators.
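
If your group prefers a script to an online generator, the sketch below shows one way to build the word cloud in Python. This is a minimal example, not a required tool: it assumes the third-party wordcloud package is installed (pip install wordcloud), and the sample responses listed in it are hypothetical placeholders for the answers you actually gather from your respondents.

# A minimal sketch for turning students' responses about "fair
# assessment" into a word cloud image.
from wordcloud import WordCloud

# Hypothetical sample responses; replace with the bulleted answers
# collected from your 20-25 respondents.
responses = [
    "A fair test covers only what was taught in class",
    "Clear instructions and enough time to answer",
    "Everyone is graded with the same rubric",
    "No trick questions and no favoritism",
]

# Join all responses into one text source; words that occur more
# frequently will be rendered more prominently in the cloud.
text = " ".join(responses)

cloud = WordCloud(width=800, height=400, background_color="white").generate(text)
cloud.to_file("fair_assessment_cloud.png")  # saves the word cloud image

Note that generate() handles the word counting itself, so the raw joined text is all it needs; the more often a word appears across responses, the larger it is drawn in the saved image.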
