dspy-arxiv

Explore the use of DSPy for extracting features from PDFs. This repository provides a simple example of how to use this framework to predict the sub-category of a Computer Science paper from arXiv.

Suggested Installation

Clone this repository.
Create a virtual environment.
Install dependencies from requirements.txt.
Install the virtual environment as a Jupyter kernel.

Build Dataset & Database

The dataset is a selection of 150 arXiv papers (metadata + PDF) from the computer science category.

To build the database:

Download the JSON file from Kaggle into the dspy-arxiv directory.
Rename the file to arxiv.json.
Run the notebook dataset.ipynb from top to bottom.

At the end, you should have two directories:

dspy-arxiv/database
- arxiv.json - the original JSON file, filtered to the computer science category only
dspy-arxiv/dataset
- trainset - 50 JSON files with metadata + text used for "training"
- valset - 50 JSON files with metadata + text used for "validation"
- testset - 50 JSON files with metadata + text used for "testing"
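
The filtering and three-way split described above can be sketched roughly as follows. This is not the notebook's code, only an illustration under stated assumptions: the helper names (`iter_papers`, `is_cs`, `split_papers`, `write_splits`) are hypothetical, the input is assumed to hold one JSON object per line, each record is assumed to carry an `id` and a space-separated `categories` string such as `cs.AI cs.LG`, and the fixed shuffle seed and one-file-per-paper naming are arbitrary choices.

```python
import json
import random
from pathlib import Path


def iter_papers(path: Path):
    """Yield one metadata record per non-empty line (assumed JSON Lines layout)."""
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                yield json.loads(line)


def is_cs(paper: dict) -> bool:
    """True if any listed category belongs to the cs.* archive (assumed field layout)."""
    return any(c.startswith("cs.") for c in paper.get("categories", "").split())


def split_papers(papers, size: int = 50, seed: int = 0) -> dict:
    """Keep CS papers, shuffle them, and cut three disjoint splits of `size` each."""
    cs_papers = [p for p in papers if is_cs(p)]
    if len(cs_papers) < 3 * size:
        raise ValueError(f"need {3 * size} CS papers, found {len(cs_papers)}")
    random.Random(seed).shuffle(cs_papers)  # fixed seed keeps the split reproducible
    names = ("trainset", "valset", "testset")
    return {name: cs_papers[i * size:(i + 1) * size] for i, name in enumerate(names)}


def write_splits(splits: dict, root: Path) -> None:
    """Write each paper to <root>/<split>/<id>.json (hypothetical naming scheme)."""
    for name, items in splits.items():
        folder = root / name
        folder.mkdir(parents=True, exist_ok=True)
        for paper in items:
            filename = paper["id"].replace("/", "_") + ".json"  # old-style ids contain "/"
            (folder / filename).write_text(json.dumps(paper, indent=2), encoding="utf-8")
```

Under these assumptions, `write_splits(split_papers(iter_papers(Path("arxiv.json"))), Path("dataset"))` would produce the three directories listed above. The sketch covers metadata only; downloading the PDFs and extracting their text is left to the notebook.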

If you want to add RAG to the pipeline, it's handy to have the data in a vector database for fast retrieval. Check out database.py for an example script to set up chromadb and populate it with arXiv metadata.
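
As a rough idea of what populating chromadb with arXiv metadata can look like, here is a minimal sketch. It is not the contents of database.py: the helper names (`to_records`, `populate`, `open_collection`), the collection name `arxiv`, the storage path, the batch size, and the choice to index the `abstract` field with `title` and `categories` as metadata are all assumptions made for illustration.

```python
def to_records(papers):
    """Turn paper metadata dicts into the parallel lists a Chroma collection accepts."""
    ids, documents, metadatas = [], [], []
    for paper in papers:
        ids.append(paper["id"])
        documents.append(paper["abstract"].strip())  # text that gets embedded
        metadatas.append({"title": paper["title"], "categories": paper["categories"]})
    return ids, documents, metadatas


def populate(collection, papers, batch_size=100):
    """Add papers to the collection in batches and return how many were added."""
    ids, documents, metadatas = to_records(papers)
    for start in range(0, len(ids), batch_size):
        end = start + batch_size
        collection.add(
            ids=ids[start:end],
            documents=documents[start:end],
            metadatas=metadatas[start:end],
        )
    return len(ids)


def open_collection(path="database/chroma", name="arxiv"):
    """Open (or create) a persistent collection; path and name are placeholders."""
    import chromadb  # third-party; imported here so the helpers above work without it

    client = chromadb.PersistentClient(path=path)
    # Without an explicit embedding function, Chroma falls back to its default one.
    return client.get_or_create_collection(name=name)
```

With chromadb installed, `populate(open_collection(), papers)` fills the collection, and `open_collection().query(query_texts=["graph neural networks"], n_results=3)` retrieves the closest abstracts for a RAG step.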

Feature Extraction

The notebook features.ipynb is a simple tutorial on how to use DSPy to programmatically prompt an LLM for feature extraction (in this case, predicting the sub-category of a Computer Science paper from arXiv).
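
To give a flavour of the approach, the sketch below declares the task as a DSPy signature and pairs it with a simple evaluation metric. It is not taken from the notebook: the names `PaperCategory`, `build_classifier`, and `category_match`, the field names `text` and `category`, and the model string are assumptions, and `dspy.LM` requires a recent DSPy release.

```python
def category_match(example, prediction, trace=None) -> bool:
    """DSPy-style metric: case-insensitive exact match on the category code."""
    return example.category.strip().lower() == prediction.category.strip().lower()


def build_classifier(model: str = "openai/gpt-4o-mini"):
    """Return a DSPy module mapping paper text to a cs.* sub-category.

    The model name is a placeholder; any provider supported by dspy.LM should do.
    """
    import dspy  # third-party; imported lazily so the metric works without it

    class PaperCategory(dspy.Signature):
        """Predict the arXiv Computer Science sub-category of a paper."""

        text = dspy.InputField(desc="text extracted from the paper's PDF")
        category = dspy.OutputField(desc="a single arXiv code, e.g. cs.CL")

    dspy.configure(lm=dspy.LM(model))
    return dspy.Predict(PaperCategory)
```

With DSPy installed and an API key set for the chosen provider, `build_classifier()(text=paper_text).category` returns the predicted code, and `category_match` can serve as the metric when evaluating on the validation set or running a DSPy optimizer on the training set.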

You can also take a look at the slides generated from this notebook (features.slides.html).

