Recommender Systems and Personalization Datasets
Julian McAuley, UCSD
Description
This page contains a collection of datasets that have been collected for research by our lab. Datasets contain the following features:
- user/item interactions
- star ratings
- timestamps
- product reviews
- social networks
- item-to-item relationships (e.g. copurchases, compatibility)
- product images
- price, brand, and category information
- GPS data
- heart-rate sequences
- other metadata
Please cite the appropriate reference if you use any of the datasets below.
Datasets are in (loose) json format unless specified otherwise, meaning they can be treated as python dictionary objects. A simple script to read json-formatted data is as follows:
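A minimal sketch of such a reader, assuming one record per line in a gzip-compressed file. The names `parse` and `parse_line`, and the fallback to `ast.literal_eval` for lines that are Python dictionary literals rather than strict JSON, are assumptions made for illustration and may differ from the lab's own script:

```python
import ast
import gzip
import json


def parse_line(line):
    """Decode one record; fall back to a Python literal for 'loose' JSON."""
    try:
        return json.loads(line)
    except json.JSONDecodeError:
        # Loose records may use single quotes, which strict JSON rejects.
        return ast.literal_eval(line)


def parse(path):
    """Yield one dictionary per non-empty line of a gzip-compressed file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield parse_line(line)
```

Using `ast.literal_eval` rather than `eval` avoids executing arbitrary code from a downloaded file, at the cost of rejecting anything that is not a plain literal.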
Directory by Dataset
Twitch live-streaming interactions
NPR interview dialog data
This American Life podcast transcripts
Recipes and interactions from food.com
Paired Recipes from food.com
EndoMondo fitness tracking data
Amazon product reviews and metadata
Amazon question/answer data
Amazon marketing bias data
Google Local business reviews and metadata
Steam video game reviews and bundles
Goodreads book reviews
Goodreads spoilers
Pinterest fashion compatibility data
ModCloth clothing fit feedback
ModCloth marketing bias data
RentTheRunway clothing fit feedback
Tradesy bartering data
RateBeer bartering data
Gameswap bartering data
Behance community art reviews and image features
Librarything reviews and social data
Epinions reviews and social data
Dance Dance Revolution step charts
NES song data
FUTGA music caption data
PDMX Public Domain MusicXML
BeerAdvocate multi-aspect beer reviews
RateBeer multi-aspect beer reviews
Facebook social circles data
Twitter social circles data
Google+ social circles data
Reddit submission popularity and metadata
Directory by Metadata Type
The datasets below can be roughly organized in terms of the types of metadata they contain:
Review text: see Amazon, BeerAdvocate, RateBeer, Google Local, Google Restaurants
Image data: Amazon, Behance, Pinterest, Google Restaurants
Item-to-item relationships: Amazon
Q/A data: Amazon Q/A
Geographical data: Google Local, Google Restaurants, EndoMondo
Heart-Rate data: EndoMondo
Bundle data: Steam
Peer-to-peer trades: Tradesy, RateBeer, Gameswap
Social connections: Librarything, Epinions
Fit feedback: Modcloth, Renttherunway
Multiple aspects: BeerAdvocate, RateBeer
Twitch
Description
This is a dataset of users consuming streaming content on Twitch. We retrieved all streamers, and all users connected to their respective chats, every 10 minutes over a period of 43 days.
Basic statistics
| 100k | full |
Users: | 100k | 15.5M |
Streamers (items): | 162.6k | 465k |
Interactions: | 3M | 124M |
Time steps: | 6148 | 6148 |
Metadata
Start and stop times are provided as integers and represent periods of 10 minutes. Stream ID could be used to retrieve a single broadcast segment from a streamer (not used in our work).
- User ID (anonymized)
- Stream ID
- Streamer username
- Time start
- Time stop
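Because start and stop times count 10-minute periods, durations follow by multiplication. The sketch below assumes comma-separated rows in the field order listed above, with no header row, and treats the stop step as exclusive; the column order, the sample values, and the helper name `watch_minutes` are all assumptions for illustration:

```python
import csv
import io

STEP_MINUTES = 10  # each time step is one 10-minute period

# Hypothetical rows: user ID, stream ID, streamer username, time start, time stop
sample = io.StringIO(
    "1,34347669376,streamer_a,154,156\n"
    "1,34391109664,streamer_b,166,169\n"
)


def watch_minutes(rows):
    """Return (user, streamer, minutes) for each interaction row."""
    out = []
    for user, _stream, streamer, start, stop in rows:
        out.append((user, streamer, (int(stop) - int(start)) * STEP_MINUTES))
    return out


durations = watch_minutes(csv.reader(sample))

# The 6148 time steps in the statistics table span a little under 43 days.
total_days = 6148 * STEP_MINUTES / (60 * 24)
```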
Example
Download link
See our data folder containing all Twitch files. The file full_a.csv.gz contains the full dataset while 100k.csv is a subset of 100k users for benchmark purposes. The code is available in our Github repository.
Citation
Please cite the following if you use the data:
Recommendation on Live-Streaming Platforms: Dynamic Availability and Repeat Consumption
Jérémie Rappaz, Julian McAuley and Karl Aberer
RecSys, 2021
Interview: NPR Media Dialog Data
Description
This dataset contains interview transcripts from National Public Radio (NPR). Data includes full interview transcripts and news article headlines.
Basic statistics
| NPR |
Speakers: | 185K |
Episodes (Interviews): | 106K |
Utterances: | 3.20M |
Words: | 126.7M |
Metadata
- Episode Date and Title
- Speaker Names
- Speaker Utterances
- News Article Headlines
Example
Download link
See the Interview Dataset Page for download information.
Citation
Please cite the following if you use the data:
Interview: Large-scale Modeling of Media Dialog with Discourse Patterns and Knowledge Grounding
Bodhisattwa Prasad Majumder*, Shuyang Li*, Jianmo Ni, Julian McAuley
EMNLP, 2020
pdf
This American Life Podcast Transcripts
Description
This dataset contains program transcripts from This American Life. Data includes full program transcripts and associated audio.
Basic statistics
| This American Life |
Speakers: | 6,608 |
Episodes: | 663 |
Utterances: | 163,808 |
Words: | 7,390,793 |
Metadata
- Episode Act
- Speaker Names
- Speaker Utterances
- Utterance Lengths
- Episode Audio
Example
Download link
See the This American Life Dataset Page for download information.
Citation
Please cite the following if you use the data:
Speech Recognition and Multi-Speaker Diarization of Long Conversations
Huanru Henry Mao, Shuyang Li, Julian McAuley, Garrison W. Cottrell
INTERSPEECH, 2020
pdf
Food.com Recipe & Review Data
Description
These datasets contain recipe details and reviews from Food.com (formerly GeniusKitchen). Data includes cooking recipes and review texts.
Basic statistics
| Food.com |
Number of recipes: | 231,637 |
Number of users: | 226,570 |
Number of reviews: | 1,132,367 |
Metadata
- Ratings and Reviews
- Recipe Name, Description, Ingredients, and Directions
- Recipe Categories (Tags)
- Recipe Nutrition Information
Example
Recipe:
Review:
Download link
See the Food.com Dataset Page for download information.
Citation
Please cite the following if you use the data:
Generating Personalized Recipes from Historical User Preferences
Bodhisattwa Prasad Majumder*, Shuyang Li*, Jianmo Ni, Julian McAuley
EMNLP, 2019
pdf
Recipe Pairs data
Description
This is a collection of recipes paired with variants, e.g. a recipe matched with a vegan version of the same recipe.
Basic statistics
| Food.com |
Number of recipes: | 83,000 |
Number of base recipes: | 36,000 |
Number of target recipes: | 60,000 |
Metadata
- Ratings and Reviews
- Recipe Name, Description, Ingredients, and Directions
- Recipe Categories (Tags)
- Recipe Nutrition Information
Download link
See the Recipe Pairs Dataset Page for download information.
Citation
Please cite the following if you use the data:
SHARE: a System for Hierarchical Assistive Recipe Editing
Shuyang Li, Yufei Li, Jianmo Ni, Julian McAuley
EMNLP, 2022
pdf
EndoMondo Fitness Tracking Data
Description
This is a collection of workout logs from users of EndoMondo. Data includes multiple sources of sequential sensor data such as heart rate logs, speed, GPS, as well as sport type, gender and weather conditions.
Basic statistics
Users: | 1,104 |
Workouts: | 253,020 |
Metadata
- User Identifier
- Gender
- Sport type
- Latitude/Longitude/Altitude sequences (with timestamps)
- Heart rates
- Various derived sequences
Example
Download link
See the FitRec Dataset Page for download information.
Citation
Please cite the following if you use the data:
Modeling heart rate and activity data for personalized fitness recommendation
Jianmo Ni, Larry Muhlstein, Julian McAuley
WWW, 2019
pdf
Amazon Product Reviews
Description
This is a large-scale Amazon Reviews dataset collected in 2023. This dataset contains 48.19 million items, and 571.54 million reviews from 54.51 million users.
Basic statistics
Ratings: | 571.54 million |
Users: | 54.51 million |
Items: | 48.19 million |
Timespan: | May 1996 - September 2023 |
Metadata
- User Reviews (ratings, text, helpfulness votes, etc.);
- Item Metadata (descriptions, price, raw image, etc.);
- Links (user-item / bought together graphs).
Example
Download link
See the Amazon Reviews 2023 page for download information.
You can also download data from previous versions of these datasets:
Citation
Please cite the following if you use the data:
2023 version
Bridging Language and Items for Retrieval and Recommendation
Yupeng Hou, Jiacheng Li, Zhankui He, An Yan, Xiusi Chen, Julian McAuley
arXiv
pdf
2018 version
Justifying recommendations using distantly-labeled reviews and fine-grained aspects
Jianmo Ni, Jiacheng Li, Julian McAuley
EMNLP, 2019
pdf
2014 version
Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering
Ruining He, Julian McAuley
WWW, 2016
pdf
Image-based recommendations on styles and substitutes
Julian McAuley, Christopher Targett, Javen Shi, Anton van den Hengel
SIGIR, 2015
pdf
Amazon Question and Answer Data
Description
These datasets contain questions and answers about products from the Amazon dataset above.
Basic statistics
Questions: | 1.48 million |
Answers: | 4,019,744 |
Labeled yes/no questions: | 309,419 |
Number of unique products with questions: | 191,185 |
Metadata
- question and answer text
- is the question binary (yes/no), and if so does it have a yes/no answer?
- timestamps
- product ID (to reference the review dataset)
Example
Download link
See the Amazon Q/A Page for download information.
Citation
Please cite the following if you use the data:
Modeling ambiguity, subjectivity, and diverging viewpoints in opinion question answering systems
Mengting Wan, Julian McAuley
International Conference on Data Mining (ICDM), 2016
pdf
Addressing complex and subjective product-related queries with customer reviews
Julian McAuley, Alex Yang
World Wide Web (WWW), 2016
pdf
Marketing Bias data
Description
These datasets contain attributes about products sold on ModCloth and Amazon which may be sources of bias in recommendations (in particular, attributes about how the products are marketed). Data also includes user/item interactions for recommendation.
Basic statistics
| ModCloth | Amazon Electronics |
Reviews: | 99,893 | 1,292,954 |
Items: | 1,020 | 9,560 |
Users: | 44,783 | 1,157,633 |
Bias type: | body shape | gender |
Metadata
- ratings
- product images
- user identities
- item sizes, user genders
Example (ModCloth)
Download links
See our project page for download links.
Citation
Please cite the following if you use the data:
Addressing Marketing Bias in Product Recommendations
Mengting Wan, Jianmo Ni, Rishabh Misra, Julian McAuley
WSDM, 2020
pdf
Google Local Reviews (2021)
Description
This dataset contains review information from Google Maps (ratings, text, images, etc.), business metadata (address, geographic info, descriptions, category information, price, open hours, etc.), and links (related businesses) up to Sep 2021 in the United States.
See also two variants of this dataset below: a 2018 version, and a version containing item images.
Basic statistics
Reviews: | 666,324,103 |
Users: | 113,643,107 |
Businesses: | 4,963,111 |
Review
- user_id - ID of the reviewer
- name - name of the reviewer
- time - time of the review (unix time)
- rating - rating of the business
- text - text of the review
- pics - pictures of the review
- resp - business response to the review including unix time and text of the response
- gmap_id - ID of the business
Metadata
- name - name of the business
- address - address of the business
- gmap_id - ID of the business
- description - description of the business
- latitude - latitude of the business
- longitude - longitude of the business
- category - category of the business
- avg_rating - average rating of the business
- num_of_reviews - number of reviews
- price - price of the business
- hours - open hours
- MISC - MISC information
- state - the current status of the business (e.g., permanently closed)
- relative_results - related businesses recommended by Google
- url - URL of the business
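Since `gmap_id` appears in both the review and the metadata records, it can serve as the join key between the two. A small sketch under that assumption; the sample records and the helper name `mean_rating_by_business` are hypothetical, and only the field names come from the lists above:

```python
from collections import defaultdict

# Hypothetical records using the field names listed above.
reviews = [
    {"user_id": "u1", "rating": 5, "gmap_id": "g1"},
    {"user_id": "u2", "rating": 3, "gmap_id": "g1"},
    {"user_id": "u1", "rating": 4, "gmap_id": "g2"},
]
businesses = [
    {"gmap_id": "g1", "name": "Cafe A"},
    {"gmap_id": "g2", "name": "Bakery B"},
]


def mean_rating_by_business(reviews, businesses):
    """Map business name -> mean rating of its reviews, joined on gmap_id."""
    names = {b["gmap_id"]: b["name"] for b in businesses}
    ratings = defaultdict(list)
    for r in reviews:
        ratings[r["gmap_id"]].append(r["rating"])
    # Reviews whose gmap_id has no metadata record are skipped.
    return {names[g]: sum(v) / len(v) for g, v in ratings.items() if g in names}


result = mean_rating_by_business(reviews, businesses)
```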
Download links
See the Google Local Dataset Page for download information.
Citation
Please cite the following if you use the data:
UCTopic: Unsupervised Contrastive Learning for Phrase Representations and Topic Mining
Jiacheng Li, Jingbo Shang, Julian McAuley
Annual Meeting of the Association for Computational Linguistics (ACL), 2022
pdf
Personalized Showcases: Generating Multi-Modal Explanations for Recommendations
An Yan, Zhankui He, Jiacheng Li, Tianyang Zhang, Julian McAuley
arXiv:2207.00422, 2022
pdf
Google Local Reviews (2018)
Description
These datasets contain reviews about businesses from Google Local (Google Maps). Data includes geographic information for each business as well as reviews.
Basic statistics
Reviews: | 11,453,845 |
Users: | 4,567,431 |
Businesses: | 3,116,785 |
Metadata
- reviews and ratings
- GPS coordinates and address
- User information (places lived, jobs)
- timestamps
- business category, opening hours, etc.
Example (review)
Example (business)
Download links
Places Data (276mb)
User Data (178mb)
Review Data (1.4gb)
Citation
Please cite the following if you use the data:
Translation-based factorization machines for sequential recommendation
Rajiv Pasricha, Julian McAuley
RecSys, 2018
pdf
Translation-based recommendation
Ruining He, Wang-Cheng Kang, Julian McAuley
RecSys, 2017
pdf
Google Restaurants
Description
This is a multi-modal dataset of restaurants from Google Local (Google Maps). Data includes images and reviews posted by users, as well as other metadata for each restaurant.
Basic statistics
| subset | full |
Restaurants: | 30K | 65K |
Users: | 37K | 1.01M |
Reviews: | 108K | 1.77M |
Images: | 203K | 4.43M |
Metadata
- Geographical location and address
- Reviews, ratings and images
- Timestamps
- Business category, opening status, price, etc.
Example
Download link
See our data folder containing all related files. The file image_review_all.json contains the full dataset, while filter_all_t.json is a subset with filtered review sentences that have higher correlation with images. Code is available in our Github repository.
Citation
Please cite the following if you use the data:
Personalized Showcases: Generating Multi-Modal Explanations for Recommendations
An Yan, Zhankui He, Jiacheng Li, Tianyang Zhang, Julian McAuley
arXiv:2207.00422, 2022
pdf
Steam Video Game and Bundle Data
Description
These datasets contain reviews from the Steam video game platform, and information about which games were bundled together.
Basic statistics
Reviews: | 7,793,069 |
Users: | 2,567,538 |
Items: | 15,474 |
Bundles: | 615 |
Metadata
- reviews
- purchases, plays, recommends ("likes")
- product bundles
- pricing information
Example (bundle)
Download links
Version 1: Review Data (6.7mb)
Version 1: User and Item Data (71mb)
Version 2: Review Data (1.3gb)
Version 2: Item metadata (2.7mb)
Bundle Data (92kb)
Citation
Please cite the following if you use the data:
Self-attentive sequential recommendation
Wang-Cheng Kang, Julian McAuley
ICDM, 2018
pdf
Item recommendation on monotonic behavior chains
Mengting Wan, Julian McAuley
RecSys, 2018
pdf
Generating and personalizing bundle recommendations on Steam
Apurva Pathak, Kshitiz Gupta, Julian McAuley
SIGIR, 2017
pdf
Goodreads Book Reviews
These datasets contain reviews from the Goodreads book review website, and a variety of attributes describing the items. Critically, these datasets have multiple levels of user interaction, ranging from adding a book to a "shelf" to rating and reading it.
Basic statistics
Items: | 1,561,465 |
Users: | 808,749 |
Interactions: | 225,394,930 |
Metadata
- reviews
- add-to-shelf, read, review actions
- book attributes: title, isbn
- graph of similar books
Example (interaction data)
Download links
See our dataset page for download links.
Citation
Please cite the following if you use the data:
Item recommendation on monotonic behavior chains
Mengting Wan, Julian McAuley
RecSys, 2018
pdf
Goodreads Spoilers
These datasets contain reviews from the Goodreads book review website, along with annotated "spoiler" information from each review.
Basic statistics
Books: | 25,475 |
Users: | 18,892 |
Reviews: | 1,378,033 |
Metadata
- reviews
- ratings
- spoilers
- see also metadata from the complete Goodreads dataset
Example (spoiler data)
Sentences are annotated as "1" if the sentence contains a spoiler, "0" otherwise.
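The sentence-level labels can be used to pull out spoiler sentences or to flag whole reviews. A minimal sketch, assuming each review stores its sentences as `[label, text]` pairs under a key named `review_sentences`; the key name, the pair layout, and the sample text are assumptions, so check them against the downloaded files:

```python
# Hypothetical review: each sentence is stored as a [label, text] pair.
review = {
    "review_sentences": [
        [0, "I enjoyed the first half."],
        [1, "The narrator turns out to be the villain."],
        [0, "Recommended."],
    ]
}


def spoiler_sentences(review, key="review_sentences"):
    """Return the text of sentences labeled 1 (spoiler)."""
    return [text for label, text in review[key] if int(label) == 1]


def has_spoiler(review, key="review_sentences"):
    """True if any sentence in the review is labeled as a spoiler."""
    return len(spoiler_sentences(review, key)) > 0
```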
Download links
See our dataset page for download links.
Citation
Please cite the following if you use the data:
Fine-grained spoiler detection from large-scale review corpora
Mengting Wan, Rishabh Misra, Ndapa Nakashole, Julian McAuley
ACL, 2019
pdf
Pairwise Fashion Explanations
Description
The Pair Fashion Explanation (PFE) dataset contains 6407 instances, with each instance including items, features and the reason why these items are a good match.
Mentioned Items and the Percentages:
Metadata
- Items (dress, top, skirt, etc.);
- Features (kilt, studded, etc.);
- Explanations (The outfit looks cohesive because the oversized layers are cinched with a studded belt, which complements the little strip from a kilt skirt that is also affixed to the belt, creating a visually pleasing balance in the outfit.);
Example
Download link
See our project page for download information.
Citation
Please cite the following if you use the data:
Deciphering Compatibility Relationships with Textual Descriptions via Extraction and Explanation.
Yu Wang, Zexue He, Zhankui He, Hao Xu, Julian McAuley.
AAAI 2024
pdf
Pinterest Fashion Compatibility
This dataset contains images (scenes) containing fashion products, which are labeled with bounding boxes and links to the corresponding products.
Basic statistics
Scenes: | 47,739 |
Products: | 38,111 |
Scene-Product Pairs: | 93,274 |
Metadata
- product IDs
- bounding boxes
Example (fashion.json)
Download links
See our project page for download links, and for instructions as to how the product images can be collected from Pinterest.
Citation
Please cite the following if you use the data:
Complete the Look: Scene-based complementary product recommendation
Wang-Cheng Kang, Eric Kim, Jure Leskovec, Charles Rosenberg, Julian McAuley
CVPR, 2019
pdf
Clothing Fit Data
Description
These datasets contain measurements of clothing fit from ModCloth and RentTheRunway.
Basic statistics
| Modcloth | Renttherunway |
Number of users: | 47,958 | 105,508 |
Number of items: | 1,378 | 5,850 |
Number of transactions: | 82,790 | 192,544 |
Metadata
- ratings and reviews
- fit feedback (small/fit/large etc.)
- user/item measurements
- category information
Example (RentTheRunway)
Download links
Modcloth (8.5mb)
Renttherunway (31mb)
Citation
Please cite the following if you use the data:
Decomposing fit semantics for product size recommendation in metric spaces
Rishabh Misra, Mengting Wan, Julian McAuley
RecSys, 2018
pdf
Product Exchange/Bartering Data
Description
These datasets contain peer-to-peer trades from various recommendation platforms.
Basic statistics
| Tradesy | Ratebeer | Gameswap |
Number of users: | 128,152 | 2,215 | 9,888 |
Number of transactions: | 68,543 | 125,665 | 3,470 |
Metadata
- peer-to-peer trades
- "have" and "want" lists
- image data (tradesy)
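One way to work with "have" and "want" lists is to look for reciprocal matches, where each of two users holds an item the other wants. The sketch below is only an illustration of that idea, not the method of the cited papers; the dictionary layout, the sample users, and the helper name `reciprocal_pairs` are hypothetical:

```python
from itertools import combinations

# Hypothetical users with "have" and "want" item sets.
users = {
    "a": {"have": {"x", "y"}, "want": {"z"}},
    "b": {"have": {"z"}, "want": {"x"}},
    "c": {"have": {"w"}, "want": {"y"}},
}


def reciprocal_pairs(users):
    """Return sorted (u, v) pairs where each holds an item the other wants."""
    pairs = []
    for u, v in combinations(sorted(users), 2):
        u_gives = users[u]["have"] & users[v]["want"]
        v_gives = users[v]["have"] & users[u]["want"]
        if u_gives and v_gives:
            pairs.append((u, v))
    return pairs


pairs = reciprocal_pairs(users)
```

The pairwise scan is quadratic in the number of users, which is fine for a quick look at a small sample but would need an index from item to owners for larger data.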
Example (tradesy)
Download links
Tradesy (3.8mb)
See the project page for ratebeer, gameswap (and other) datasets
Citation
Please cite the following if you use the data:
Bartering books to beers: A recommender system for exchange platforms
Jérémie Rappaz, Maria-Luiza Vladarean, Julian McAuley, Michele Catasta
WSDM, 2017
pdf
VBPR: Visual bayesian personalized ranking from implicit feedback
Ruining He, Julian McAuley
AAAI, 2016
pdf
Behance Community Art Data
Description
Likes and image data from the community art website Behance. This is a small, anonymized, version of a larger proprietary dataset.
Basic statistics
Users: | 63,497 |
Items: | 178,788 |
Appreciates ("likes"): | 1,000,000 |
Metadata
- appreciates (likes)
- timestamps
- extracted image features
Example ("appreciate" data)
Each entry is a user, item, timestamp triple:
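A short sketch of parsing one such triple, assuming the three fields are whitespace-separated and the timestamp is Unix time in seconds. Both are assumptions, as are the helper name `parse_appreciate` and the made-up sample line:

```python
from datetime import datetime, timezone


def parse_appreciate(line):
    """Split one 'user item timestamp' line into typed fields."""
    user, item, ts = line.split()
    # IDs are kept as strings so that any leading zeros survive.
    return user, item, datetime.fromtimestamp(int(ts), tz=timezone.utc)


# Hypothetical line; the IDs and timestamp are made up for illustration.
user, item, when = parse_appreciate("276633 01588231 1307583271")
```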
Code to read image features
Download links
See our data folder containing all Behance files. The folder also contains additional documentation.
Citation
Please cite the following if you use the data:
Vista: A visually, socially, and temporally-aware model for artistic recommendation
Ruining He, Chen Fang, Zhaowen Wang, Julian McAuley
RecSys, 2016
pdf
Social Recommendation Data
Description
These datasets include ratings as well as social (or trust) relationships between users. Data are from LibraryThing (a book review website) and Epinions (general consumer reviews).
Basic statistics
| Librarything | Epinions |
Number of users: | 73,882 | 116,260 |
Number of items: | 337,561 | 41,269 |
Number of ratings/feedback: | 979,053 | 181,394 |
Number of social relations: | 120,536 | 181,304 |
Metadata
- reviews
- price paid (epinions)
- helpfulness votes (librarything)
- flags (librarything)
Example (LibraryThing reviews)
Example (LibraryThing social network)
Download links
LibraryThing (594mb)
epinions (66mb)
Citation
Please cite the following if you use the data:
SPMC: Socially-aware personalized Markov chains for sparse sequential recommendation
Chenwei Cai, Ruining He, Julian McAuley
IJCAI, 2017
pdf
Improving latent factor models via personalized feature projection for one-class recommendation
Tong Zhao, Julian McAuley, Irwin King
Conference on Information and Knowledge Management (CIKM), 2015
pdf
Other Non-Recommender-Systems Datasets
Description
Below are various datasets collected by my lab that are not related to recommender systems specifically. Formats of these datasets vary, so their respective project pages should be consulted for further details.
Script Grounded Role-play
Description
The dataset contains script-related knowledge and task-specific interviews for fictional character role-play. This dataset is specially designed for the evaluation of various types of hallucinations, such as cross-universe and temporal hallucinations, in role-playing scenarios.
Basic statistics
Stories: | 1,100 |
Characters: | 2,000 |
Interviews/Tasks: | 72,000 |
Knowledge events: | 2,400,000 |
Speech events: | 1,100,000 |
Non-speech events: | 1,300,000 |
Metadata
- Story (title, characters, etc.);
- Character (description, utterance_count, etc.);
- Knowledge Event (character, content_type, text, timestep);
- Interview (story_id, task_id, task_type, character, question, start_time, end_time).
Example Knowledge Event
Example Interview/Task
Download link
See the project page for download information.
Citation
Please cite the following if you use the data:
Mitigating Hallucination in Fictional Character Role-Play
Nafis Sadeq, Zhouhang Xie, Byungkyu Kang, Prarit Lamba, Xiang Gao, Julian McAuley
EMNLP Findings, 2024
pdf
DogWhistle: Cant Understanding Data
DogWhistle is a Chinese dataset collected from the historical records of an online game. It provides hidden words and the cant for them, with human answers. The dataset is suitable for semantic similarity evaluation for large language models.
Basic statistics
| train | dev | test |
---|---|---|---|
Games: | 9,817 | 1,161 | 1,143 |
Rounds: | 76,740 | 9,593 | 9,592 |
Word Combinations: | 18,832 | 2,243 | 2,220 |
Unique words: | 1,878 | 1,809 | 1,820 |
Cant: | 230,220 | 28,779 | 28,776 |
Metadata
- cant and the hidden words
- cant history
- human answers
Example (insider subtask)
Download links
Please refer to our leaderboard page for download instructions.
Citation
Please cite the following if you use the data:
Blow the Dog Whistle: A Chinese Dataset for Cant Understanding with Common Sense and World Knowledge
Canwen Xu, Wangchunshu Zhou, Tao Ge, Ke Xu, Julian McAuley, Furu Wei
NAACL, 2021
pdf
Video Game Data
Description
Step charts from the video game Dance Dance Revolution, and audio files from the NES platform.
Basic statistics
Num songs (DDR): | 223 (7 hours) |
Num charts (DDR): | 1,102 |
Num games (NES): | 397 |
Num songs (NES): | 5,278 (46 hours) |
Num notes (NES): | 2,325,636 |
Download links
See the project pages for Dance Dance Convolution and NES MDB for further details and links to the data
Citation
Please cite the following if you use the data:
Dance Dance Convolution
Chris Donahue, Zachary Lipton, Julian McAuley
ICML, 2017
pdf
The NES Music Database: A symbolic music dataset with expressive performance attributes
Chris Donahue, Henry Mao, Julian McAuley
International Society for Music Information Retrieval Conference (ISMIR), 2018
pdf
FUTGA
Description
FUTGA is a fine-grained, dense music captioning dataset for full-length songs (up to 5 minutes). The captions are generated by a Music-LLM that we developed, which learns from generative augmentation with temporal compositions. By leveraging existing music caption datasets and large language models (LLMs), we synthesize detailed music captions with structural descriptions and time boundaries for full-length songs. This synthetic dataset enables FUTGA to identify temporal changes at key transition points and their musical functions, and to generate dense captions for full-length songs.
Basic statistics
| Num songs | Ave duration (sec) | Ave caption length (tokens) |
---|---|---|---|
MusicCaps: | 5.4K | 223.9 | 472.4 |
SongDescriber: | 706 | 225.4 | 482.5 |
AudioSet: | 51.8 | 273.9 | 473.8 |
HarmonixSet: | 842 | 224.9 | 404.3 |
Example
Gwyn Ashton - Jumping Jack Flash
Download links
See our Huggingface dataset page for a better view. The pre-trained FUTGA Music-LLM is also available.
Citation
Please cite the following if you use the data:
Futga: Towards Fine-grained Music Understanding through Temporally-enhanced Generative Augmentation
Junda Wu, Zachary Novack, Amit Namburi, Jiaheng Dai, Hao-Wen Dong, Zhouhang Xie, Carol Chen, Julian McAuley
arXiv, 2024
pdf
PDMX
Description
We introduce PDMX: a Public Domain MusicXML dataset for symbolic music processing. To our knowledge, PDMX is the largest publicly available, copyright-free MusicXML dataset in existence. PDMX includes genre, tag, description, and popularity metadata for every file.
Basic statistics
Format: | MusicXML |
Hours: | 6,250 |
Size: | 254,077 |
Metadata
See the Zenodo project page for a more in-depth description of the metadata. However, at an abstract level, the metadata includes:
- Genre
- Tag
- Description (title, subtitle, composer, etc.)
- Popularity (i.e. rating score)
Example
Download links
See the Zenodo project page to download PDMX and find more information about the project.
Citation
Please cite the following if you use the data:
PDMX: A Large-Scale Public Domain MusicXML Dataset for Symbolic Music Processing
Phillip Long, Zachary Novack, Taylor Berg-Kirkpatrick, Julian McAuley
arXiv, 2024
pdf
Multi-aspect Reviews
Description
These datasets include reviews with multiple rated dimensions. The most comprehensive of these are beer review datasets from RateBeer and BeerAdvocate, which include sensory aspects such as taste, look, feel, and smell.
Basic statistics
| Ratebeer | BeerAdvocate |
Number of users: | 40,213 | 33,387 |
Number of items: | 110,419 | 66,051 |
Number of ratings/reviews: | 2,855,232 | 1,586,259 |
Timespan: | Apr 2000 - Nov 2011 | Jan 1998 - Nov 2011 |
Metadata
- reviews
- aspect-specific ratings (taste, look, feel, smell, overall impression)
- product category
- ABV
Example (ratebeer)
Download links
BeerAdvocate (433mb)
RateBeer (388mb)
Sentences with aspect labels (annotator 1) (758kb)
Sentences with aspect labels (annotator 2) (759kb)
Citation
Please cite the following if you use the data:
Learning attitudes and attributes from multi-aspect reviews
Julian McAuley, Jure Leskovec, Dan Jurafsky
International Conference on Data Mining (ICDM), 2012
pdf
From amateurs to connoisseurs: modeling the evolution of user expertise through online reviews
Julian McAuley, Jure Leskovec
WWW, 2013
pdf
Social Circles
Description
These datasets contain social connections and "circles" from Facebook, Twitter, and Google Plus.
Basic statistics
Google Plus | |||
Number of networks: | 10 | 133 | 1,000 |
Number of nodes: | 4,039 | 106,674 | 192,075 |
Number of circles: | 193 | 479 | 5,541 |
Metadata
- social connections
- circles (sets of friends sharing a common property)
- user metadata
Example (Kaggle egonet data)
Download links
See SNAP facebook, twitter, and Google Plus data, as well as the Kaggle competition based on the same data.
Citation
Please cite the following if you use the data:
Learning to Discover Social Circles in Ego Networks
Julian McAuley, Jure Leskovec
Neural Information Processing Systems (NIPS), 2012
pdf
Reddit Submissions
Description
Submissions of reddit posts (and in particular resubmissions of the same content) along with metadata.
Basic statistics
Num of submissions (images): | 132,308 |
Num of unique images: | 16,736 |
Timespan: | July 2008 - January 2013 |
Metadata
- timestamps
- upvotes/downvotes
- post title, subreddit, etc.
Example
Download links
resubmissions data (7.3mb)
raw html of resubmissions (1.8gb)
See also the SNAP project page.
Citation
Please cite the following if you use the data:
Understanding the interplay between titles, content, and communities in social media
Himabindu Lakkaraju, Julian McAuley, Jure Leskovec
ICWSM, 2013
pdf
Questions and comments to Julian McAuley