shared-task-eval-script

📚 LLMs4Subjects -- Evaluation

This README provides instructions and information regarding the evaluation process for the LLMs4Subjects shared task. The aim of this task is the development of advanced semantic subject comprehension systems, focusing on the GND taxonomy. Participants are required to submit ranked lists of relevant subjects, which will be evaluated based on several quantitative metrics.

📂 Test Dataset

A portion of the TIBKAT collection has been designated as the blind test dataset. In this dataset, subject heading annotations are hidden. Two folders are provided: one contains the test dataset for all-subjects, and the other contains the test dataset for tib-core-subject.

Participants must submit a ranked list of the top 50 relevant subjects for each record, ordered by descending relevance.

📊 Quantitative Evaluation

The performance of submitted systems will be assessed using the following metrics:

Average Precision@k for k = 5, 10, 15, 20, 25, 30, 35, 40, 45, 50
Average Recall@k for k = 5, 10, 15, 20, 25, 30, 35, 40, 45, 50
Average F1-score@k for k = 5, 10, 15, 20, 25, 30, 35, 40, 45, 50
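
The three metrics above can be sketched as follows. This is a minimal illustration that assumes the common definitions (precision divides by k, recall divides by the number of gold subjects, F1 is their harmonic mean) and averages over records; the provided evaluation script may differ in details such as how duplicates or empty gold sets are handled. The function names and GND IDs below are hypothetical.

```python
from statistics import mean
from typing import Dict, List, Sequence, Set

K_VALUES = [5, 10, 15, 20, 25, 30, 35, 40, 45, 50]


def _hits_at_k(predicted: Sequence[str], gold: Set[str], k: int) -> int:
    """Number of distinct gold subjects found among the top-k predictions."""
    return len(set(predicted[:k]) & gold)


def precision_at_k(predicted: Sequence[str], gold: Set[str], k: int) -> float:
    """Assumed definition: hits in the top k, divided by k."""
    return _hits_at_k(predicted, gold, k) / k if k > 0 else 0.0


def recall_at_k(predicted: Sequence[str], gold: Set[str], k: int) -> float:
    """Assumed definition: hits in the top k, divided by the number of gold subjects."""
    return _hits_at_k(predicted, gold, k) / len(gold) if gold else 0.0


def f1_at_k(predicted: Sequence[str], gold: Set[str], k: int) -> float:
    """Harmonic mean of precision@k and recall@k (0.0 when both are zero)."""
    p = precision_at_k(predicted, gold, k)
    r = recall_at_k(predicted, gold, k)
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0


def average_metrics(
    predictions: List[Sequence[str]], golds: List[Set[str]]
) -> Dict[int, Dict[str, float]]:
    """Average each metric over all records, for every k in K_VALUES."""
    return {
        k: {
            "precision": mean(precision_at_k(p, g, k) for p, g in zip(predictions, golds)),
            "recall": mean(recall_at_k(p, g, k) for p, g in zip(predictions, golds)),
            "f1": mean(f1_at_k(p, g, k) for p, g in zip(predictions, golds)),
        }
        for k in K_VALUES
    }


# Hypothetical record: five ranked predictions, three gold subjects.
ranked = ["gnd:A", "gnd:B", "gnd:C", "gnd:D", "gnd:E"]
gold_subjects = {"gnd:A", "gnd:C", "gnd:X"}
print(precision_at_k(ranked, gold_subjects, 5))  # 2 hits out of 5
print(recall_at_k(ranked, gold_subjects, 5))     # 2 hits out of 3 gold subjects
```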

Evaluation results will be presented at varying levels of granularity to provide comprehensive insights:

Language-level: Separate evaluations for English and German.
Record-level: Evaluations for each of the five types of technical records.
Combined Language and Record-levels: Detailed evaluations combining both language and record type.
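
The three granularity levels amount to grouping per-record scores by different keys before averaging. The sketch below shows the idea with plain dictionaries; the field names (`language`, `record_type`, `score`) and the record-type labels are placeholders chosen for illustration, not the names used by the evaluation script.

```python
from collections import defaultdict
from statistics import mean
from typing import Callable, Dict, Hashable, List


def aggregate(rows: List[dict], key_fn: Callable[[dict], Hashable]) -> Dict[Hashable, float]:
    """Group per-record scores by key_fn and return the mean score per group."""
    groups = defaultdict(list)
    for row in rows:
        groups[key_fn(row)].append(row["score"])
    return {key: mean(values) for key, values in groups.items()}


# Hypothetical per-record scores (e.g. F1@5); values are made up.
rows = [
    {"language": "en", "record_type": "type_a", "score": 0.5},
    {"language": "en", "record_type": "type_b", "score": 0.25},
    {"language": "de", "record_type": "type_a", "score": 0.75},
    {"language": "de", "record_type": "type_a", "score": 0.25},
]

by_language = aggregate(rows, lambda r: r["language"])
by_record_type = aggregate(rows, lambda r: r["record_type"])
by_both = aggregate(rows, lambda r: (r["record_type"], r["language"]))

print(by_language)
print(by_record_type)
print(by_both)
```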

🛠️ Evaluation Script

Participants are provided with an evaluation script to test their model performance on the train and dev sets. The same script will be used during the evaluation phase of the shared task.

Execution Instructions

🗂️ Step 1: Preparing the Folder Structure

The script requires a specific folder structure, identical to the train and dev sets, which are organized by record type and language.

Predictions should be stored in a JSON file, named identically to the corresponding record file, containing the predicted subject tags as a list of GND IDs.
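
One way to produce prediction files that mirror the gold-standard layout is sketched below. It rests on several assumptions that should be checked against the evaluation script: that a prediction file keeps the record file's name with a `.json` extension, that the payload is a bare JSON list of GND IDs, and that the list is capped at 50 entries. The helper name `write_prediction` and its parameters are hypothetical.

```python
import json
from pathlib import Path
from typing import List


def write_prediction(
    gold_root: Path,
    pred_root: Path,
    record_file: Path,
    gnd_ids: List[str],
    top_n: int = 50,
) -> Path:
    """Write ranked GND IDs to pred_root, mirroring record_file's location under gold_root."""
    # Keep the record-type/language subfolders of the gold dataset.
    relative = record_file.relative_to(gold_root)
    # Assumption: same file name as the record, stored with a .json extension.
    target = (pred_root / relative).with_suffix(".json")
    target.parent.mkdir(parents=True, exist_ok=True)
    # Assumption: the payload is a plain list of GND IDs, most relevant first.
    target.write_text(json.dumps(gnd_ids[:top_n], indent=2), encoding="utf-8")
    return target
```

Calling the helper once per record file in the gold-standard folder yields a predictions folder with the same record-type and language subfolders.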

▶️ Step 2: Running the Script

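The script prompts for a team name, which is used in the name of the output file, followed by three paths:

The directory containing the gold-standard dataset with the annotations.
The directory containing the model's predictions.
The directory in which to save the results as an Excel file.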

The script will generate an Excel file containing the evaluation metrics scores, organized into three different sheets, each corresponding to a different level of granularity.

🧑‍💻 Code Execution Sample

$ python llms4subjects-evaluation.py

LLMs4Subjects Shared Task -- Evaluations

Please enter your Team Name
Team Name> test

Please specify the directory containing the true GND labels
Directory path> evaluation/all_subjects

Please specify the directory containing the predicted GND labels
Directory path> evaluation/all_subjects/run1

Please specify the directory to save the evaluation metrics
Directory path> evaluation/results

Reading the True GND labels...
Reading the Predicted GND labels...

Evaluating the predicted GND labels...

File containing the evaluation metrics score is saved at location: evaluation/results/test_evaluation_metrics.xlsx

🎯 Conclusion

By following these instructions, participants can effectively evaluate their models using the provided script.
