Align
Loom Video:
https://www.loom.com/share/78e134b42e684882b0829358db822b99?sid=e95fb90c-85e8-4888-8094-6385804f4636
In this task, you will be rating and comparing two responses that were generated from the same coding-related prompt by an AI chatbot.
1. You will be shown two different responses to the prompt and asked whether you are able to directly attempt to run and test the code blocks present.
2. You will be asked to rate the correctness of both responses individually.
3. You will be asked to rate how well both responses follow the instructions given in the prompt.
4. You will then provide a rating on the relative quality of the two responses.
Important notes:
● Questions come with detailed instructions. Please read these carefully. Many questions
require explanations depending on the response.
● Some questions require explanations of issues that were found, if any. Aim to write reasonably concise, but not overly generic, explanations that would allow a skilled programmer to identify and remedy the problem. Expect to write 1-2 sentences on average.
Skipping items:
You should only rate items where you are familiar with the subject material and very confident
that the assessments you’re making are accurate. If you are completely unfamiliar with the
programming language, unable to assess the correctness of responses, or unable to confidently
compare the responses in a reasonable amount of time (~30 minutes), skip the item.
Assessing Correctness:
It is very important to be confident that the ratings you provide are accurate. Many aspects of the rating task are subtle and nuanced: relatively small differences between responses can be crucially important, and very reasonable-sounding explanations can turn out not to be factual.
Even if at first glance a response seems perfectly appropriate, it is still important to:
● Explicitly check each claim, and
● Run code blocks yourself (when appropriate and possible) to verify that the code runs and works as expected; see the sketch below.
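For example, a response might plausibly claim that Python 3's round() always rounds halves up, and a few seconds in an interpreter disproves this. The snippet below is a minimal sketch of this kind of spot check; the claim being tested is hypothetical, not taken from any particular response:

    # Hypothetical claim from a response: "round(2.5) returns 3 in Python 3."
    # Running it directly shows Python 3 uses banker's rounding
    # (round-half-to-even), so the claim is false.
    print(round(2.5))  # prints 2, not 3
    print(round(3.5))  # prints 4 (both 2.5 and 3.5 round to the nearest even number)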
Rating Task
The primary focus of this rating task is on the correctness of the model responses, and how
well the responses follow instructions. Here, correctness refers both to the
truthfulness/factuality of any claim made in the response and the functionality of any code
provided. Instruction following refers to how well the response addresses the goals, questions, and requirements of the prompt. It is your job to:
(a) Explicitly research and check these textual claims, and
(b) Run and inspect the output of any code provided (where applicable) to confirm its
executability and functionality.
You will likely need to use Google Search, Stack Overflow, documentation, or online code executors/compilers (e.g., JSFiddle, Colab, Programiz) to confirm that a response is reasonable and that your rating is accurate.
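When a response provides a function, a few assert-based checks pasted into one of these executors are often enough to confirm its basic functionality. A minimal sketch follows; the function and its claimed behavior are invented for illustration:

    # Hypothetical function copied from a response that claims to
    # deduplicate a list while preserving the original order.
    def dedupe(items):
        seen = set()
        return [x for x in items if not (x in seen or seen.add(x))]

    # Quick asserts confirm the claimed behavior before rating correctness.
    assert dedupe([3, 1, 3, 2, 1]) == [3, 1, 2]
    assert dedupe([]) == []
    print("All checks passed.")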
Task Details:
Did the response follow the instructions it was given in the prompt (both explicit and
implicit)?
● Options: No Issues, Minor Issue(s), Major Issue(s), N/A
● Instructions: focus on whether the response reflects the instructions and goals of the
prompt, not on truthfulness or correctness issues (e.g., bad code, poor explanation) –
those are rated below. Use the following rubric:
○ No Issues: All prompt instructions were followed; response delivered fully on the
tasks of the prompt.
○ Minor Issue(s): The response addressed most of the instructions or goal(s) of the
prompt, but missed or misinterpreted some small parts. A user would still be
reasonably satisfied.
■ Example: a response that describes the right API but assumes a slightly different use-case than what the user articulates (a code sketch of this appears after this rubric).
○ Major Issue(s): Response missed key components of the prompt, rendering it
unhelpful to the user.
■ Examples include: a response that discusses a different programming
language or library than what the user asked about, or misses a key
requirement of the code to be generated.
○ N/A - Not Applicable: There are no explicit or implicit instructions to follow in the prompt, or the response is canned (e.g., the model states it cannot do it).
● Explanation: required if issues are found. Describe what aspects of the prompt the
response missed or misinterpreted.
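To make the Minor Issue(s) example above concrete, here is one invented illustration. Suppose the user asked for a function that reads a CSV file into a list of dictionaries keyed by column name, and the response uses the right module but returns plain rows instead; the response is close and easily adapted, so a user would still be reasonably satisfied:

    import csv

    # What the (hypothetical) user asked for: each row as a dict keyed by
    # column name, i.e. the behavior of csv.DictReader.
    # What the (hypothetical) response provided: the right module, but each
    # row as a plain list of cell values, a slightly different use-case.
    def read_csv(path):
        with open(path, newline="") as f:
            return list(csv.reader(f))  # rows as lists, not dicts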
SxS Score
● Options: Rate your preference between the two responses on a scale from 1 to 7, where
1 means response A is much better than B, 7 means response B is much better than A,
and 4 is neutral.
● Instructions: You should prefer the response that would be more helpful to the user. This
is mainly a function of how correct the response is and how well it followed instructions.
In general, correctness should be the primary consideration and instruction following
secondary, but there may be scenarios where the less correct response is the better
one. Use your best judgment.
○ If the two responses are equal in terms of correctness and instruction following,
you may want to consider other factors such as verbosity or style:
■ Verbosity: Is the length of the response (both the code and non-code
portions) appropriate? The response should include all essential
information, while avoiding excessive additional details. Generally, a
succinct response should be preferred over a verbose one (all else being
equal), but this can depend on preference and context.
■ Style: Does the response use high-quality prose that is well-organized and easy to read? Is the included code, if any, reasonably formatted, with sufficient and accurate documentation?
● Explanation: Always required. Briefly explain the most important considerations behind your indicated preference. Relate your motivation to the answers you provided above.