GitHub - amazon-science/redset

Redset

Redset is a dataset containing three months' worth of user query metadata from queries that ran on a selected sample of instances in the Amazon Redshift fleet. We provide query metadata for 200 provisioned instances and 200 serverless instances.

Security

See CONTRIBUTING for more information.

License

Download

Folder structure:

s3://redshift-downloads/redset
- README
- LICENSE
- provisioned/
  - full.parquet
  - sample_0.01.parquet (1% uniform random data sample)
  - sample_0.001.parquet (0.1% uniform random data sample)
  - parts/
    - One individual <id>.parquet file per cluster
- serverless/
  - full.parquet
  - sample_0.01.parquet (1% uniform random data sample)
  - sample_0.001.parquet (0.1% uniform random data sample)
  - parts/
    - One individual <id>.parquet file per cluster

You can either download files using their HTTP link, e.g., https://s3.amazonaws.com/redshift-downloads/redset/LICENSE, or interact with the S3 bucket using the AWS CLI. For example, to download the full serverless dataset, run:

aws s3 cp --no-sign-request s3://redshift-downloads/redset/serverless/full.parquet .
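
The HTTP links appear to mirror the bucket layout, so a link for any file can be assembled from the folder structure above. The following is a minimal Python sketch, assuming the URL prefix from the LICENSE example applies to every object in the bucket. `redset_url` is a hypothetical helper written for illustration, not part of any dataset tooling.

```python
from urllib.parse import quote

# Assumption: every object is reachable under the same HTTPS prefix
# as the LICENSE example in this README.
BASE_URL = "https://s3.amazonaws.com/redshift-downloads/redset"
KINDS = ("provisioned", "serverless")


def redset_url(kind: str, filename: str = "full.parquet") -> str:
    """Build the HTTPS URL for a file in the provisioned/ or serverless/ folder."""
    if kind not in KINDS:
        raise ValueError(f"kind must be one of {KINDS}, got {kind!r}")
    # quote() leaves "/" untouched, so nested paths under parts/ survive.
    return f"{BASE_URL}/{kind}/{quote(filename)}"


if __name__ == "__main__":
    # Start with the 0.1% sample before fetching full.parquet.
    print(redset_url("serverless", "sample_0.001.parquet"))
```

The resulting URL can be handed to a download tool or to a Parquet reader that accepts HTTP paths. Starting with the 0.1% sample is a cheap way to inspect the schema before downloading `full.parquet`.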

Schema

Column Name	Description
instance_id	Uniquely identifies a Redshift cluster
cluster_size	Size of the cluster (only available for provisioned)
user_id	Identifies the user that issued the query
database_id	Identifies the database that was queried
query_id	Identifies the query; unique per instance
arrival_timestamp	Timestamp when the query arrived on the system
compile_duration_ms	Time the query spent compiling in milliseconds
queue_duration_ms	Time the query spent queueing in milliseconds
execution_duration_ms	Time the query spent executing in milliseconds
feature_fingerprint	Hash value of the query fingerprint. A proxy for query similarity that is not based on the query text; it will overestimate repetition.
was_aborted	Whether the query was aborted during its lifetime
was_cached	Whether the query was answered from result cache
cache_source_query_id	If query was answered from result cache, this is the query id for the query which populated the cache
query_type	Type of query, e.g., `select`, `copy`, ...
num_permanent_tables_accessed	Number of permanent tables (regular database tables) accessed by the query
num_external_tables_accessed	Number of external tables accessed by the query
num_system_tables_accessed	Number of system tables accessed by the query
read_table_ids	Comma-separated list of unique permanent table ids read by the query
write_table_ids	Comma-separated list of unique table ids written to by the query
mbytes_scanned	Total number of megabytes scanned by the query
mbytes_spilled	Total number of megabytes spilled by the query
num_joins	Number of joins in the query plan
num_scans	Number of scans in the query plan
num_aggregations	Number of aggregations in the query plan
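
To make the column semantics concrete, here is a small Python sketch that works on rows represented as plain dictionaries keyed by the column names above. It rests on two assumptions the schema does not state: that end-to-end latency can be approximated by summing the three duration columns, and that an empty or missing `read_table_ids` value means no permanent tables were read. The helper names and the sample rows are made up for illustration.

```python
from collections import Counter


def total_duration_ms(row: dict) -> int:
    """Sum compile, queue, and execution time.

    Assumption: the three phases do not overlap, so their sum
    approximates the total latency of the query.
    """
    keys = ("compile_duration_ms", "queue_duration_ms", "execution_duration_ms")
    return sum(row.get(key) or 0 for key in keys)


def parse_table_ids(value) -> list:
    """Split a comma-separated id list; None or '' yields an empty list."""
    if not value:
        return []
    return [part.strip() for part in value.split(",") if part.strip()]


def repeated_fingerprints(rows) -> dict:
    """Count fingerprints that occur more than once within an instance.

    The schema notes that the fingerprint overestimates repetition,
    so treat these counts as an upper bound.
    """
    counts = Counter((r["instance_id"], r["feature_fingerprint"]) for r in rows)
    return {key: n for key, n in counts.items() if n > 1}


# Made-up rows, not taken from the dataset.
rows = [
    {"instance_id": 1, "query_id": 10, "feature_fingerprint": "abc",
     "compile_duration_ms": 5, "queue_duration_ms": 0,
     "execution_duration_ms": 120, "read_table_ids": "7,9"},
    {"instance_id": 1, "query_id": 11, "feature_fingerprint": "abc",
     "compile_duration_ms": 0, "queue_duration_ms": 30,
     "execution_duration_ms": 80, "read_table_ids": None},
    {"instance_id": 2, "query_id": 10, "feature_fingerprint": "abc",
     "compile_duration_ms": 2, "queue_duration_ms": 1,
     "execution_duration_ms": 3, "read_table_ids": "4"},
]

if __name__ == "__main__":
    for row in rows:
        print(row["instance_id"], row["query_id"],
              total_duration_ms(row), parse_table_ids(row["read_table_ids"]))
    print(repeated_fingerprints(rows))
```

Because `query_id` is unique only per instance, the grouping key always includes `instance_id`; the same applies when matching `cache_source_query_id` back to the query that populated the cache.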

Citation

TODO: bibtex citation will be available once paper is published (VLDB 2024).
