
Blackhat Instructions

Loom Video:
https://www.loom.com/share/78e134b42e684882b0829358db822b99?sid=e95fb90c-85e8-4888-8094-6385804f4636

In this task, you will be rating and comparing two responses that were generated from the same coding-related prompt by an AI chatbot.

The form is structured in four sections:

1. You will be shown two different responses to the prompt, and asked whether you are able to directly run and test the code blocks present.
2. You will be asked to rate how well both responses follow the instructions given in the prompt.
3. You will be asked to rate the correctness of both responses individually.
4. You will then provide a rating on the relative quality of the two responses.

Important notes:
● Questions come with detailed instructions. Please read these carefully. Many questions require explanations depending on the response.
● Some questions require explanations of any issues that were found. Aim for explanations that are reasonably concise yet specific enough that a skilled programmer could identify and remedy the problem. Expect to write 1-2 sentences on average.

Out of scope items:
If a prompt is out of scope (e.g., not code-related), please select "Cannot Assess" in section 3 and explain why it is out of scope. You can select the other answers arbitrarily; they will be filtered out of the metric calculation.

Skipping items:
You should only rate items where you are familiar with the subject material and very confident
that the assessments you’re making are accurate. If you are completely unfamiliar with the
programming language, unable to assess the correctness of responses, or unable to confidently
compare the responses in a reasonable amount of time (~30 minutes), skip the item.

Assessing Correctness:
It is very important to be confident that the ratings you're providing are accurate. Many aspects of the rating task are subtle and nuanced: relatively small differences between responses can be crucially important, and very reasonable-sounding explanations may not be factual. Even if at first glance a response seems perfectly appropriate, it is still important to:
● Explicitly check each claim, and
● Run code blocks yourself (when appropriate and possible) to verify that the code runs and works as expected.

Rating Task
The primary focus of this rating task is on the correctness of the model responses, and how well the responses follow instructions. Here, correctness refers both to the truthfulness/factuality of any claim made in the response and to the functionality of any code provided. Instruction following refers to how well the response addresses the goals, questions, and requirements of the prompt. It is your job to:
(a) Explicitly research and check these textual claims, and
(b) Run and inspect the output of any code provided (where applicable) to confirm its executability and functionality.

You will likely need to use Google Search, Stack Overflow, documentation, or online code executors/compilers (e.g., JSFiddle, Colab, Programiz) to confirm that a response is reasonable and that your rating is accurate.
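For example (a hypothetical claim, not taken from an actual task): suppose a response asserts that Python's built-in sort is stable, i.e., that elements with equal keys keep their original order. Rather than taking this on faith, a few lines pasted into any online interpreter settle it directly:

    # Hypothetical verification snippet. If sort is stable, ties on the
    # letter must preserve the original input order of the pairs.
    pairs = [("b", 2), ("a", 1), ("b", 1), ("a", 2)]
    result = sorted(pairs, key=lambda p: p[0])
    print(result)  # [('a', 1), ('a', 2), ('b', 2), ('b', 1)] -> stable

Quick, direct checks like this are usually faster and more reliable than searching for a matching discussion online.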

The rating task consists of three parts:

1. Rate how well each of the two responses follows the instructions given in the prompt (single-sided).
2. Rate the correctness of each of the two responses individually (single-sided).
3. Rate the relative quality of the two responses in a side-by-side score.

Task Details:

(1) Code Response Executability:

Are you able to run code provided in the responses?


● Options: Yes-Fully, Yes-Partially, No, No Code Present
● Instructions: Run all code (snippets, functions, programs, etc.) provided in either of the responses to check both its executability and its correctness.
○ This may not always be possible. For instance, the code may only make sense embedded inside a larger program, or it may require some external file/API dependency for which no execution sandbox is readily available (see the sketch below). If none of the code can be executed, select "No".
○ Some responses contain multiple code blocks. If you can run some, but not all, of these, answer "Yes-Partially". If you can run all of them, answer "Yes-Fully".
○ Note: This question is not asking about the correctness of the code, only whether there is enough contextual information to test it yourself.
● Explanation: If you answer "Yes-Partially" or "No", an explanation is required describing what code you were not able to run, and why.
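As a minimal sketch of the distinction (both snippets are hypothetical, written in Python for illustration): the first block below is self-contained and would support a "Yes" answer, while the second depends on a local file and an external API that only the prompter can reach, so it could not be run directly:

    # Self-contained: runnable as-is in any sandbox.
    def celsius_to_fahrenheit(c):
        return c * 9 / 5 + 32

    print(celsius_to_fahrenheit(100))  # 212.0

    # Not directly runnable: depends on the prompter's local key file and a
    # private API endpoint (both hypothetical), so no sandbox can execute it.
    import requests

    key = open("/home/user/secret.key").read().strip()
    resp = requests.get("https://api.example.com/data", headers={"X-Key": key})

If a single response contained both kinds of blocks, "Yes-Partially" would be the appropriate answer, with an explanation noting the unrunnable dependency.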

(2) Single-Sided Instruction Following:

Did the response follow the instructions it was given in the prompt (both explicit and
implicit)?
● Options: No Issues, Minor Issue(s), Major Issue(s), N/A
● Instructions: Focus on whether the response reflects the instructions and goals of the prompt, not on truthfulness or correctness issues (e.g., bad code, poor explanation) – those are rated below. Use the following rubric:
○ No Issues: All prompt instructions were followed; response delivered fully on the
tasks of the prompt.
○ Minor Issue(s): The response addressed most of the instructions or goal(s) of the
prompt, but missed or misinterpreted some small parts. A user would still be
reasonably satisfied.
■ Example: a response that describes the right API but assumes a slightly different use case than the one the user articulates.
○ Major Issue(s): Response missed key components of the prompt, rendering it
unhelpful to the user.
■ Examples include: a response that discusses a different programming
language or library than what the user asked about, or misses a key
requirement of the code to be generated.
○ N/A - Not Applicable: There are no explicit or implicit instructions to follow in the prompt, or the response is canned (e.g., the model states it cannot do it).
● Explanation: Required if issues are found. Describe what aspects of the prompt the response missed or misinterpreted.

(3) Single-Sided Correctness:

Is the response truthful and correct?


● Options: No Issues, Minor Issue(s), Major Issue(s), Cannot Assess, N/A
● Instructions: Identify the correctness of any claims in the explanation and whether the code (if any) is correct, executable, functional, and useful. Please take up to 30 minutes to research information across both responses, and explicitly run code snippets as needed and where appropriate. Use the following rubric:
○ No Issues: All claims in both the explanation and any code comments are factual
and accurate; the code (if any) is functional, safe, and useful.
○ Minor Issue(s): either or both of the following are true:
■ Text: primary claims (central to addressing the prompt) are factual/accurate; secondary claims contain meaningful inaccuracies (or unfounded claims).
● Examples include: an otherwise correct explanation of a library that uses an incorrect link, or a description of a system that misconstrues a small detail of its design.
■ Code: has minor problems, but the main functionality of the code is correct; e.g., it fails to handle an edge case, or is correct but has misleading comments.
○ Major Issue(s): either or both of the following are true:
■ Text: primary claims contain meaningful inaccuracies (or unfounded
claims), such that the response is not helpful to the user.
● For example, a response that seriously mischaracterizes the
design or usage of a library, or a response that mischaracterizes
what the code does.
■ Code: has one or more of the following problems:
● Executability: the program does not compile or run and would
require substantial effort to repair.
● Functionality: The code does not, or will not, produce the proper
intended output or is broken in a logical/functional fashion.
● Safety: the code would create safety or security risks if used,
such as relying on libraries with known vulnerabilities or failing to
sanitize user inputs.
○ Do not use this to flag responses that make simplifying
assumptions that a user would reasonably be expected to
notice and improve, such as using a hard-coded password
in a clearly visible location.
● Performance: the code is unnecessarily slow, for instance, due to using a quadratic algorithm where a (log-)linear option exists, or repeatedly concatenating long strings instead of using a string builder (see the sketch after this rubric).
● Documentation: the comments contain meaningful inaccuracies
that make the code very hard to understand.
● Keep in mind that the code may be functional for the prompter,
even if it does not compile or run on your setup. For instance, a
response that points to a file only accessible to the prompter, or
provides a partial program based on the context provided by the
prompter should not be marked as non-functional unless it
contains errors that would (likely) manifest in the prompter’s
programming context.
○ Cannot Assess: Cannot determine validity of claims made in the response.
Select this option if properly researching the claims in the response would take
>30 minutes.
○ N/A - Not Applicable: No explicit or implicit claims are made in the response and it does not include code. Use this for punts (e.g., "As an AI model I am not capable of responding to this type of question").
● Explanation: Required if issues are found. Describe all issues. Where possible,
categorize code-related issues based on the type of issue (functionality, safety,
performance, documentation).
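To make the performance category concrete, here is a minimal hypothetical sketch (not from an actual task) of the string-concatenation issue mentioned in the rubric, together with the linear alternative a reviewer would expect:

    # Hypothetical sketch. Repeated concatenation copies the accumulated
    # string on each iteration, which can degrade to quadratic total work
    # (CPython sometimes optimizes this in place, but other implementations
    # make no such guarantee).
    def join_slow(words):
        out = ""
        for w in words:
            out += w + " "
        return out.rstrip()

    # Linear alternative: build the result in a single pass.
    def join_fast(words):
        return " ".join(words)

    words = ["only", "flag", "this", "when", "a", "clearly", "better", "option", "exists"]
    assert join_slow(words) == join_fast(words)

As the rubric says, flag this only when the code is unnecessarily slow, i.e., when a clearly better option exists for the use case in the prompt.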

(4) Side-by-Side (SxS) Comparison

SxS Score
● Options: Rate your preference between the two responses on a scale from 1 to 7, where
1 means response A is much better than B, 7 means response B is much better than A,
and 4 is neutral.
● Instructions: You should prefer the response that would be more helpful to the user. This
is mainly a function of how correct the response is and how well it followed instructions.
In general, correctness should be the primary consideration and instruction following
secondary, but there may be scenarios where the less correct response is the better
one. Use your best judgment.
○ If the two responses are equal in terms of correctness and instruction following,
you may want to consider other factors such as verbosity or style:
■ Verbosity: Is the length of the response (both the code and non-code
portions) appropriate? The response should include all essential
information, while avoiding excessive additional details. Generally, a
succinct response should be preferred over a verbose one (all else being
equal), but this can depend on preference and context.
■ Style: Does the response use high-quality prose that's well-organized and easy to read? Is the included code, if any, reasonably formatted, with sufficient and accurate documentation?
● Explanation: Always required. Briefly explain the most important considerations in your
indicated preference. Relate your motivation to the answers provided above.
