Published September 18, 2023 | Version 1.2
Dataset Open

SCI-3000: A Novel Dataset for the Task of Figure, Table and Caption Extraction from Scientific PDFs

  • 1. TU Vienna

Contributors

  • 1. TU Vienna

Description

This dataset contains bounding boxes of figures, tables, and captions on 34,791 pages extracted from 3,000 open-access scientific publications in the fields of medicine, chemistry, physics, computer science, and technology. The underlying publications are also included in PDF form.

For more details, refer to the README file.

Notes

Version 1.2 adds a training/test split for models trained on SCI-3000.

Files

SCI-3000-full.zip

Files (20.4 GB)

Checksum Size
md5:e2abc448cc6c529eed243324bc184cb5 6.8 GB
md5:ed7b26bbc59872432d051562f48d89dc 6.8 GB
md5:5b4aa2bb73ac9c4eff43bc3d9595e6f0 6.8 GB
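
Because each archive is several gigabytes, it can be worth checking a download against the MD5 checksums listed above before unpacking it. The following is a minimal sketch, not part of the dataset's tooling: the file names of the three archives do not appear in this listing, so the keys in `EXPECTED_MD5` (`archive-1.zip` and so on) and the helper names are hypothetical placeholders to be replaced with the actual downloaded file names.

```python
import hashlib
from pathlib import Path

# Checksums copied from the file listing on this page.
# ASSUMPTION: the keys are placeholder file names; the real archive
# names are not shown in the listing and must be filled in by the user.
EXPECTED_MD5 = {
    "archive-1.zip": "md5:e2abc448cc6c529eed243324bc184cb5",
    "archive-2.zip": "md5:ed7b26bbc59872432d051562f48d89dc",
    "archive-3.zip": "md5:5b4aa2bb73ac9c4eff43bc3d9595e6f0",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so multi-gigabyte archives never sit fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, listed):
    """Compare a file's MD5 with a listed value such as 'md5:<hex>'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected


def check_downloads(directory, expected=EXPECTED_MD5):
    """Map each expected file name to True, False, or None if it is missing."""
    results = {}
    for name, listed in expected.items():
        path = Path(directory) / name
        results[name] = matches_listing(path, listed) if path.is_file() else None
    return results
```

Calling `check_downloads("downloads")` on a hypothetical `downloads` directory returns a dictionary in which `True` marks a matching archive, `False` a corrupted or incomplete one, and `None` a file that was not found.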

Additional details

Related works

Is supplement to
Conference paper: 10.1007/978-3-031-41676-7_14 (DOI)