CC-News benchmark by MaxDall · Pull Request #600 · flairNLP/fundus · GitHub


MaxDall
Collaborator
@MaxDall MaxDall commented Aug 30, 2024

This PR introduces functionality to benchmark publishers using the CC-NEWS dataset.

The benchmark retrieves HTML and the corresponding articles from the CC-NEWS dataset at specified intervals (daily, weekly, monthly, etc.), assesses how complete each article extraction is, and offers utility and statistical functions for working with the results. The goal is to detect layout changes that occurred before a publisher's parser was first implemented and to provide the relevant HTML so the parser can be adapted to those changes.
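To make the idea of "completeness per interval" concrete, here is a minimal, self-contained sketch. It is not the code in this PR and does not use the fundus API: the attribute list, the function names, and the 0.75 threshold are all hypothetical and chosen only for illustration. It scores each extracted article by the share of expected attributes that are non-empty, averages the scores per month, and flags months where the average drops, which may hint at a layout change.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

# Hypothetical attribute set; the PR does not state which fields are checked.
EXPECTED_ATTRIBUTES = ("title", "body", "authors", "publishing_date")


def completeness(article: dict) -> float:
    """Fraction of expected attributes extracted with a non-empty value."""
    filled = sum(1 for name in EXPECTED_ATTRIBUTES if article.get(name))
    return filled / len(EXPECTED_ATTRIBUTES)


def month_bucket(day: date) -> str:
    """Map a crawl date to a 'YYYY-MM' bucket (monthly interval)."""
    return f"{day.year:04d}-{day.month:02d}"


def completeness_by_month(records) -> dict:
    """Average completeness per month for (crawl_date, article) pairs."""
    buckets = defaultdict(list)
    for crawl_date, article in records:
        buckets[month_bucket(crawl_date)].append(completeness(article))
    return {month: mean(scores) for month, scores in sorted(buckets.items())}


def suspicious_months(scores: dict, threshold: float = 0.75) -> list:
    """Months whose average completeness falls below the (assumed) threshold."""
    return [month for month, score in scores.items() if score < threshold]


# Illustrative, made-up records; real input would come from CC-NEWS crawls.
records = [
    (date(2023, 1, 5), {"title": "a", "body": "b", "authors": ["x"],
                        "publishing_date": "2023-01-05"}),
    (date(2023, 2, 5), {"title": "a", "body": "", "authors": [],
                        "publishing_date": None}),
    (date(2023, 2, 20), {"title": "a", "body": "b"}),
]
scores = completeness_by_month(records)
flagged = suspicious_months(scores)
```

With these sample records, January is fully extracted while February is mostly empty, so only February is flagged. The HTML stored for a flagged interval is then the natural starting point for adapting the parser.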

@MaxDall MaxDall marked this pull request as draft August 30, 2024 14:30