Explore the use of DSPy for extracting features from PDFs. This repository provides a simple example of how to use this framework to predict the sub-category of a Computer Science paper from arXiv.
- Clone this repository.
- Create a virtual environment.
- Install dependencies from requirements.txt.
- Install the virtual environment as a Jupyter kernel.
The dataset is a selection of 150 arXiv papers (metadata + PDF) from the Computer Science category.
To build the database:
- Download the JSON file from Kaggle into the dspy-arxiv directory.
- Rename the file to arxiv.json.
- Run the notebook data.ipynb from top to bottom.
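As a rough illustration of the kind of filtering this step involves (data.ipynb remains the reference), the sketch below keeps only records that carry at least one cs.* category. It assumes the Kaggle file is in JSON Lines form (one record per line) with a space-separated categories field. The function name filter_cs_records is hypothetical and does not come from the repository.

```python
import json
from pathlib import Path


def filter_cs_records(src: Path, dst: Path) -> int:
    """Copy records with at least one cs.* category from src to dst.

    Assumes JSON Lines input with a space-separated "categories" field.
    Returns the number of records kept.
    """
    dst.parent.mkdir(parents=True, exist_ok=True)
    kept = 0
    with src.open(encoding="utf-8") as fin, dst.open("w", encoding="utf-8") as fout:
        for line in fin:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            categories = record.get("categories", "").split()
            if any(label.startswith("cs.") for label in categories):
                fout.write(json.dumps(record) + "\n")
                kept += 1
    return kept
```

A call could look like filter_cs_records(Path("dspy-arxiv/arxiv.json"), Path("dspy-arxiv/database/arxiv.json")), with the paths adjusted to wherever the files actually live.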
At the end, you should have two directories:
- dspy-arxiv/database
  - arxiv.json - the original JSON file, filtered to the computer science category
- dspy-arxiv/dataset
  - trainset - 50 JSON files with metadata + text used for "training"
  - valset - 50 JSON files with metadata + text used for "validation"
  - testset - 50 JSON files with metadata + text used for "testing"
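To confirm that the layout above was produced, a small check such as the following may help. It is a sketch that assumes the files in each split use a .json extension and that the script runs from the directory containing dspy-arxiv. The helper name count_split_files is hypothetical.

```python
from pathlib import Path


def count_split_files(dataset_dir: Path, splits=("trainset", "valset", "testset")) -> dict:
    """Return the number of *.json files found in each split directory."""
    return {
        split: len(list((dataset_dir / split).glob("*.json")))
        for split in splits
    }


if __name__ == "__main__":
    counts = count_split_files(Path("dspy-arxiv/dataset"))
    for split, n in counts.items():
        # 50 files per split is the count stated in this README
        status = "ok" if n == 50 else "unexpected"
        print(f"{split}: {n} files ({status})")
```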
If you want to add RAG to the pipeline, it's handy to have the data in a vector database for fast retrieval. Check out database.py for an example script to set up chromadb and populate it with arXiv metadata.
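For orientation, here is a minimal sketch of populating a chromadb collection. It is not a copy of database.py. It assumes each metadata record has id, title, abstract and categories keys, and the storage path "chroma_db" and collection name "arxiv" are placeholders chosen for illustration.

```python
def to_chroma_batch(records: list[dict]) -> dict:
    """Turn metadata records into keyword arguments for collection.add().

    Assumes each record has "id", "title", "abstract" and "categories" keys.
    """
    return {
        "ids": [r["id"] for r in records],
        "documents": [f'{r["title"]}\n\n{r["abstract"]}' for r in records],
        "metadatas": [
            {"title": r["title"], "categories": r["categories"]} for r in records
        ],
    }


def populate(records: list[dict], path: str = "chroma_db", name: str = "arxiv"):
    """Store records in a persistent collection (path and name are placeholders)."""
    import chromadb  # imported lazily so the helper above works without chromadb

    client = chromadb.PersistentClient(path=path)
    collection = client.get_or_create_collection(name=name)
    collection.add(**to_chroma_batch(records))
    return collection
```

Once populated, the collection can be searched with something like collection.query(query_texts=["graph neural networks"], n_results=5). With no embedding function specified, chromadb falls back to its default embedding model, which may need to be downloaded on first use.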
The notebook features.ipynb can be seen as a simple tutorial on how to use DSPy to programmatically prompt an LLM for feature extraction (in this case, predicting the sub-category of a Computer Science paper from arXiv).
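To give a flavour of the approach without opening the notebook, the sketch below defines a DSPy signature for the task and a simple exact-match metric. It is an illustration under assumptions, not the notebook's code: it presumes a recent DSPy release that provides dspy.LM, the model name is a placeholder, and the names PaperCategory, primary_cs_category, exact_match and build_classifier are hypothetical.

```python
def primary_cs_category(categories: str):
    """Return the first cs.* label from a space-separated string, or None."""
    for label in categories.split():
        if label.startswith("cs."):
            return label
    return None


def exact_match(example, prediction, trace=None) -> bool:
    """DSPy-style metric: True when the predicted label equals the gold label."""
    return example.category == prediction.category.strip()


def build_classifier(model: str = "openai/gpt-4o-mini"):
    """Configure an LM and return a predictor (the model name is a placeholder)."""
    import dspy  # imported lazily; requires the dspy package

    class PaperCategory(dspy.Signature):
        """Predict the arXiv Computer Science sub-category of a paper."""

        text: str = dspy.InputField(desc="text extracted from the paper's PDF")
        category: str = dspy.OutputField(desc="an arXiv label such as cs.LG")

    dspy.configure(lm=dspy.LM(model))
    return dspy.ChainOfThought(PaperCategory)
```

With an API key configured for the chosen provider, classify = build_classifier() followed by classify(text=paper_text).category would return the predicted label, and exact_match can be passed as the metric when evaluating on the validation or test split.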
You can also take a look at the slides generated from this notebook.