A fork of NLPrinceton / SARC.
- Download the dataset. Create a folder named `dataset` that is structured like this (don't forget to extract the files):

  ```
  dataset/
  ├─ main/
  |  ├─ comments.json
  |  ├─ test-balanced.csv
  |  └─ train-balanced.csv
  └─ pol/
     ├─ comments.json
     ├─ test-balanced.csv
     └─ train-balanced.csv
  ```

- Put the `dataset` folder at this repo's root directory.
- Still at the repo's root directory, run `git submodule update --init`. This pulls in one of the dependencies needed to create bags of n-grams (bong).
- If you want to use word embeddings instead of bong, download the 1600-dimensional Amazon GloVe embeddings (NOTE: 2.6 GB compressed, 8.7 GB uncompressed), then put the extracted .txt file inside the `dataset` folder.
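Before running the evaluation, a quick sanity check can confirm the layout matches the tree above (a sketch; it only tests that the six expected files exist):

```python
import os

# The six files expected under dataset/, per the tree above.
REQUIRED = [
    os.path.join("dataset", sub, name)
    for sub in ("main", "pol")
    for name in ("comments.json", "test-balanced.csv", "train-balanced.csv")
]

missing = [p for p in REQUIRED if not os.path.isfile(p)]
if missing:
    print("Missing files:")
    for p in missing:
        print(" ", p)
else:
    print("Dataset layout looks good.")
```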
Run one of the following commands (the embedding commands assume the extracted GloVe file is at `dataset/amazon_glove1600.txt`):
### 'all' dataset

```shell
# Bag-of-Words on all
python eval.py main -l --min_count 5

# Bag-of-Bigrams on all
python eval.py main -n 2 -l --min_count 5

# Embedding on all
python eval.py main -e -l --embedding dataset/amazon_glove1600.txt
```
### 'pol' dataset

```shell
# Bag-of-Words on pol
python eval.py pol -l

# Bag-of-Bigrams on pol
python eval.py pol -n 2 -l

# Embedding on pol
python eval.py pol -e -l --embedding dataset/amazon_glove1600.txt
```
### 'pol' dataset

VADER sentiment analysis scores:

```shell
python turn-level-sentiment.py pol
```
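VADER is a lexicon- and rule-based sentiment analyzer: at its core it looks up word valences in a curated lexicon and combines them. The script above is not reproduced here, but the lexicon-scoring idea can be illustrated with a toy sketch (the scores below are invented, and real VADER also applies heuristics for negation, punctuation, and capitalization):

```python
# Toy lexicon-based scorer illustrating the idea behind VADER.
# Invented valences -- NOT the real VADER lexicon.
LEXICON = {"great": 3.1, "good": 1.9, "bad": -2.5, "terrible": -2.1}

def valence(text):
    """Mean valence of known words; 0.0 if no word is in the lexicon."""
    vals = [LEXICON[w] for w in text.lower().split() if w in LEXICON]
    return sum(vals) / len(vals) if vals else 0.0

print(valence("what a great movie"))    # positive score
print(valence("terrible acting overall"))  # negative score
```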
Evaluation code for the Self-Annotated Reddit Corpus (SARC).
Dependencies: NLTK, scikit-learn, text_embedding.
To recreate the all-balanced and pol-balanced results in Table 2 of the paper:
- download 1600-dimensional Amazon GloVe embeddings (NOTE: 2.4 GB compressed)
- set the root directory of the SARC dataset at the top of utils.py
- run the following ($EMBEDDING is the file of downloaded GloVe embeddings):
- Bag-of-Words on all: `python SARC/eval.py main -l --min_count 5`
- Bag-of-Bigrams on all: `python SARC/eval.py main -n 2 -l --min_count 5`
- Embedding on all: `python SARC/eval.py main -e -l --embedding $EMBEDDING`
- Bag-of-Words on pol: `python SARC/eval.py pol -l`
- Bag-of-Bigrams on pol: `python SARC/eval.py pol -n 2 -l`
- Embedding on pol: `python SARC/eval.py pol -e -l --embedding $EMBEDDING`
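The eval script itself is not shown here, but the general recipe these commands run (bag-of-words or bag-of-bigram features fed to a linear classifier, as in the paper) can be sketched with scikit-learn on toy stand-in data:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy stand-in data; the real script reads SARC comment pairs from the CSVs.
train_texts = ["yeah right, great idea", "that was a great idea",
               "oh sure, totally believable", "the results look solid"]
train_labels = [1, 0, 1, 0]  # 1 = sarcastic

# ngram_range=(1, 2) yields bag-of-words plus bag-of-bigrams features;
# min_df plays the role of the --min_count cutoff above.
vec = CountVectorizer(ngram_range=(1, 2), min_df=1, lowercase=True)
X = vec.fit_transform(train_texts)
clf = LogisticRegression().fit(X, train_labels)

print(clf.predict(vec.transform(["oh sure, great idea"])))
```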
If you find this code useful please cite the following:

```
@inproceedings{khodak2018corpus,
  title={A Large Self-Annotated Corpus for Sarcasm},
  author={Khodak, Mikhail and Saunshi, Nikunj and Vodrahalli, Kiran},
  booktitle={Proceedings of the Linguistic Resource and Evaluation Conference (LREC)},
  year={2018}
}
```