The graphman dump command exports all entity data and metadata for a
single subgraph deployment into a self-contained directory of Parquet files
and JSON metadata. The resulting dump can be used to restore the deployment
into a different graph-node instance via graphman restore. Dumps are
consistent snapshots of the deployment's state at a specific point in time.
WARNING: The dump and restore commands are experimental and cannot replace proper database backups at this point. In particular, there is no guarantee that a dump will be restorable. That said, we encourage users to try out the dump and restore commands in non-production environments and report any issues they encounter.
WARNING: Dumping happens in a single transaction and can put significant load on the database for large subgraphs. Use with caution on production instances.
WARNING: Restoring a dump will currently create all the default indexes that a new deployment gets, and ignores the indexes that might have been carefully curated for the original deployment, even though they are recorded in the dump. This can lead to very long restore times for large subgraphs. The restore process will be optimized in the future to only create indexes that are present in the dump's metadata.
```
graphman dump <deployment> <directory>
```

<deployment> identifies the subgraph deployment to dump. It can be a
subgraph name, a deployment hash (Qm...), or a database namespace
(sgdNNN). <directory> is the path where the dump will be written; it
will be created if it does not exist.
Running graphman dump against an existing dump directory performs an
incremental dump: only rows added since the last dump are exported, and
new chunk files are appended rather than rewriting existing ones.
```
# Full dump
graphman dump my-subgraph /backups/my-subgraph

# Incremental update of the same dump
graphman dump my-subgraph /backups/my-subgraph
```

```
graphman restore <directory> [options]
```

<directory> is the path to a dump previously created with graphman dump.
| Option | Description |
|---|---|
| --shard | Target database shard. Uses deployment rules (or the primary shard) when omitted. Required with --add |
| --name | Subgraph name for deployment rule matching and node assignment. Falls back to an existing name |
| --replace | Drop and recreate if the deployment already exists in the target shard |
| --add | Create a copy in a shard that doesn't already have this deployment (requires --shard) |
| --force | Replace if the deployment exists in the target shard, add if it doesn't |
--replace, --add, and --force are mutually exclusive. When none is
given, restore fails if the deployment already exists in the target shard.
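The exclusivity rule above can be expressed compactly. The following sketch is illustrative, not graphman's actual implementation; the function name, return values, and error messages are invented for this example:

```python
from typing import Optional

def restore_mode(replace: bool, add: bool, force: bool,
                 shard: Optional[str]) -> str:
    """Pick a restore mode from the flags, enforcing mutual exclusivity."""
    if sum([replace, add, force]) > 1:
        raise ValueError("--replace, --add, and --force are mutually exclusive")
    if add and shard is None:
        raise ValueError("--add requires --shard")
    if replace:
        return "replace"
    if add:
        return "add"
    if force:
        return "force"
    # Default: plain create, which fails later if the deployment
    # already exists in the target shard.
    return "create"
```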
```
# Restore into the default shard
graphman restore /backups/my-subgraph

# Restore into a specific shard, replacing if it already exists
graphman restore /backups/my-subgraph --shard shard1 --replace

# Force-restore (replace or add as needed)
graphman restore /backups/my-subgraph --force
```

A dump directory has the following structure:
```
<dump-dir>/
  metadata.json            -- deployment metadata + per-table state
  schema.graphql           -- raw GraphQL schema text
  subgraph.yaml            -- raw subgraph manifest YAML (optional)
  <EntityType>/
    chunk_000000.parquet   -- rows ordered by vid
    chunk_000001.parquet   -- incremental append (future chunks)
    ...
  data_sources$/
    chunk_000000.parquet   -- dynamic data sources
```
Each entity type defined in the GraphQL schema gets its own subdirectory,
named after the entity type exactly as it appears in the schema (e.g.
Token/, Pool/). The Proof of Indexing is stored like a regular entity
type, in a directory named Poi$. The special data_sources$ directory
holds dynamic data sources created at runtime.
Within each directory, data is stored in numbered chunk files
(chunk_000000.parquet, chunk_000001.parquet, ...). A fresh dump
produces a single chunk_000000.parquet per table. Incremental dumps
append new chunks rather than rewriting existing ones.
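The chunk numbering scheme can be sketched as follows; this helper is illustrative only (graphman's actual implementation is in Rust and may differ), but it shows how the next chunk index follows from the files already present:

```python
def next_chunk_name(existing: list) -> str:
    """Given the existing chunk file names of a table, return the name
    of the next chunk an incremental dump would write."""
    if not existing:
        return "chunk_000000.parquet"
    # File names follow chunk_NNNNNN.parquet with a 6-digit index.
    indices = [int(name[len("chunk_"):len("chunk_") + 6]) for name in existing]
    return f"chunk_{max(indices) + 1:06d}.parquet"
```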
The GraphQL schema and subgraph manifest are stored as separate plain-text
files schema.graphql and subgraph.yaml.
The top-level metadata.json contains everything needed to reconstruct the
deployment's table structure, plus diagnostic information captured at dump
time. Its structure is:
```json
{
  "version": 1,
  "deployment": "Qm...",
  "network": "mainnet",
  "manifest": {
    "spec_version": "1.0.0",
    "description": "Optional subgraph description",
    "repository": "https://github.com/...",
    "features": ["..."],
    "entities_with_causality_region": ["EntityType1"],
    "history_blocks": 2147483647
  },
  "earliest_block_number": 12345,
  "start_block": { "number": 12345, "hash": "0xabc..." },
  "head_block": { "number": 99999, "hash": "0xdef..." },
  "entity_count": 150000,
  "graft_base": null,
  "graft_block": null,
  "debug_fork": null,
  "health": {
    "failed": false,
    "health": "healthy",
    "fatal_error": null,
    "non_fatal_errors": []
  },
  "indexes": {
    "token": [
      "CREATE INDEX CONCURRENTLY IF NOT EXISTS attr_0_0_id ON sgd.token USING btree (id)"
    ]
  },
  "tables": {
    "Token": {
      "immutable": true,
      "has_causality_region": false,
      "chunks": [
        {
          "file": "Token/chunk_000000.parquet",
          "min_vid": 0,
          "max_vid": 50000,
          "row_count": 50000
        }
      ],
      "max_vid": 50000
    },
    "data_sources$": {
      "immutable": false,
      "has_causality_region": true,
      "chunks": [
        {
          "file": "data_sources$/chunk_000000.parquet",
          "min_vid": 0,
          "max_vid": 100,
          "row_count": 100
        }
      ],
      "max_vid": 100
    }
  }
}
```

Field descriptions:
| Field | Description |
|---|---|
| version | Format version. Must be 1. |
| deployment | Deployment hash (Qm...). |
| network | The blockchain network (e.g. mainnet, goerli). |
| manifest | Manifest metadata extracted from subgraphs.subgraph_manifest. |
| manifest.spec_version | Subgraph API version. Required to parse schema.graphql. |
| manifest.entities_with_causality_region | Entity types that have a causality_region column. |
| manifest.history_blocks | How many blocks of entity version history are retained. |
| earliest_block_number | Earliest block for which data exists (accounts for pruning). |
| start_block | The block where indexing started. Null if not set. |
| head_block | The latest indexed block at dump time. |
| entity_count | Total entity count across all tables. |
| graft_base | Deployment hash of the graft base, if any. |
| graft_block | Block pointer of the graft point, if any. |
| debug_fork | Debug fork deployment hash, if any. |
| health | Point-in-time health snapshot. Not used during restore. |
| indexes | Point-in-time index definitions as SQL. Not used during restore (indexes are auto-created by Layout::create_relational_schema()). |
| tables | Per-table metadata keyed by entity type name (or data_sources$). |
Each entry in tables contains:
| Field | Description |
|---|---|
| immutable | Whether the entity type is immutable (uses block$ instead of block_range). |
| has_causality_region | Whether rows have a causality_region column. |
| chunks | Ordered list of Parquet chunk files for this table. |
| chunks[].file | Relative path from the dump directory. |
| chunks[].min_vid | Minimum vid value in this chunk. |
| chunks[].max_vid | Maximum vid value in this chunk. |
| chunks[].row_count | Number of rows in this chunk. |
| max_vid | Maximum vid across all chunks. -1 if the table is empty. |
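The per-table chunk metadata is what a restore walks to read data back in order. A minimal sketch of that walk, assuming the metadata.json structure shown above (the function itself is illustrative, not part of graphman):

```python
import json

def chunks_for_table(metadata: dict, table: str) -> list:
    """Return a table's chunk files in restore (ascending vid) order.

    Chunks are recorded in ascending vid order already; sorting on
    min_vid here is just a safety net."""
    entries = metadata["tables"][table]["chunks"]
    return [c["file"] for c in sorted(entries, key=lambda c: c["min_vid"])]

# A trimmed-down metadata.json with the chunks deliberately out of order:
metadata = json.loads("""
{
  "tables": {
    "Token": {
      "chunks": [
        {"file": "Token/chunk_000001.parquet", "min_vid": 50001,
         "max_vid": 60000, "row_count": 10000},
        {"file": "Token/chunk_000000.parquet", "min_vid": 0,
         "max_vid": 50000, "row_count": 50000}
      ],
      "max_vid": 60000
    }
  }
}
""")
print(chunks_for_table(metadata, "Token"))
# ['Token/chunk_000000.parquet', 'Token/chunk_000001.parquet']
```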
Each entity table's Parquet file uses an Arrow schema derived from the entity's GraphQL definition. Columns are ordered as follows:

- System columns (always present, in this order):
  - vid (Int64, non-nullable) -- row version ID
  - Block tracking (one of):
    - Immutable entities: block$ (Int32, non-nullable)
    - Mutable entities: block_range_start (Int32, non-nullable), block_range_end (Int32, nullable -- null means unbounded/current)
  - causality_region (Int32, non-nullable) -- only if the entity has one
- Data columns in GraphQL declaration order, skipping fulltext (TSVector) columns, which are generated and rebuilt on restore.
The PostgreSQL int4range type used for block_range is decomposed into
two scalar columns (block_range_start, block_range_end) in the Parquet
representation. This avoids the need for a custom range type in Arrow.
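The decomposition is straightforward because PostgreSQL canonicalizes discrete ranges such as int4range to the half-open text form '[lower,upper)'. A sketch of the conversion (the helper name is illustrative):

```python
from typing import Optional, Tuple

def decompose_block_range(text: str) -> Tuple[int, Optional[int]]:
    """Split an int4range text value like '[12345,67890)' into the two
    scalar Parquet columns. An unbounded upper end ('[12345,)') maps to
    a null block_range_end, meaning the row version is current."""
    inner = text.strip("[)")          # '[12345,67890)' -> '12345,67890'
    lower, upper = inner.split(",")
    return int(lower), int(upper) if upper else None
```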
GraphQL/PostgreSQL column types map to Arrow data types as follows:
| ColumnType | Arrow DataType | Notes |
|---|---|---|
| Boolean | Boolean | |
| Int | Int32 | |
| Int8 | Int64 | |
| Bytes | Binary | Raw bytes, no hex encoding |
| BigInt | Utf8 | Stored as decimal string for arbitrary precision |
| BigDecimal | Utf8 | Stored as decimal string for arbitrary precision |
| Timestamp | Timestamp(Microsecond, None) | Microseconds since epoch, no timezone |
| String | Utf8 | |
| Enum(...) | Utf8 | Enum variant as string (cast from PG enum to text during dump) |
| TSVector(...) | skipped | Fulltext index columns are generated; rebuilt on restore |
Array columns: A GraphQL list field (e.g. tags: [String!]!) is
stored as List<T> where T is the base Arrow type from the table
above. Whether a column is a list is determined by the GraphQL field type,
not by ColumnType. For example, [String!]! becomes List<Utf8> and
[Int!] becomes List<Int32>.
Nullability follows the GraphQL schema: non-null fields produce non-nullable Arrow columns; optional fields produce nullable columns. List elements within list columns are always marked nullable in the Arrow schema.
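The mapping and the list rule can be summarized in a few lines. This sketch uses plain strings for the Arrow type names to stay dependency-free; the dictionary and function are illustrative, not graph-node's actual code:

```python
from typing import Optional

# Scalar ColumnType -> Arrow DataType, per the table above.
SCALAR_ARROW_TYPES = {
    "Boolean": "Boolean",
    "Int": "Int32",
    "Int8": "Int64",
    "Bytes": "Binary",
    "BigInt": "Utf8",
    "BigDecimal": "Utf8",
    "Timestamp": "Timestamp(Microsecond, None)",
    "String": "Utf8",
    "Enum": "Utf8",
}

def arrow_type(column_type: str, is_list: bool) -> Optional[str]:
    """Map a column to its Arrow type; None means the column is skipped."""
    if column_type == "TSVector":
        return None  # fulltext columns are generated and rebuilt on restore
    base = SCALAR_ARROW_TYPES[column_type]
    # Whether the column is a list comes from the GraphQL field type.
    return f"List<{base}>" if is_list else base
```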
The data_sources$ table has a fixed schema independent of the GraphQL
definition:
| Column | Arrow DataType | Nullable | Description |
|---|---|---|---|
| vid | Int64 | no | Row version ID |
| block_range_start | Int32 | no | Lower bound of block_range |
| block_range_end | Int32 | yes | Upper bound (null = unbounded) |
| causality_region | Int32 | no | Causality region |
| manifest_idx | Int32 | no | Index into the manifest's data source list |
| parent | Int32 | yes | Self-referencing parent data source |
| id | Binary | yes | Data source identifier |
| param | Binary | yes | Data source parameter |
| context | Utf8 | yes | JSON context |
| done_at | Int32 | yes | Block number where the data source was marked done |
All Parquet files use ZSTD compression (default level).
Within each Parquet chunk file, rows are ordered by vid (ascending).
This matches the primary key ordering in PostgreSQL and enables efficient
sequential reads during restore.
An incremental dump reads the existing metadata.json, determines the
max_vid for each table, and queries only rows with vid > max_vid. New
rows are written to new chunk files (e.g. chunk_000001.parquet) and the
metadata is updated atomically (write to a temp file, then rename).
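Conceptually, the per-table query of an incremental dump looks like the following. The SQL shape and helper are a sketch under the assumption of the vid-ordered layout described above; graph-node's actual SQL generation differs:

```python
def incremental_query(namespace: str, table: str, max_vid: int) -> str:
    """Build the conceptual query for one table of an incremental dump:
    only rows past the max_vid recorded in metadata.json, in vid order."""
    return (
        f'SELECT * FROM "{namespace}"."{table}" '
        f"WHERE vid > {max_vid} ORDER BY vid"
    )
```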
The metadata.json file is always written atomically: the dump writes to
metadata.json.tmp first, then renames it to metadata.json. This
ensures that a reader never sees a partially-written metadata file. If the
dump process crashes mid-write, the previous metadata.json remains
intact. The Parquet chunk files are written before metadata.json is
updated, so chunk files referenced by metadata.json are always complete.
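The write-then-rename pattern is a standard technique for atomic file replacement. A minimal sketch (the function name is illustrative; graphman's implementation is in Rust):

```python
import json
import os

def write_metadata_atomically(dump_dir: str, metadata: dict) -> None:
    """Write metadata.json so that readers never observe a partial file:
    write to metadata.json.tmp, flush to disk, then rename into place.
    os.replace is an atomic rename on POSIX filesystems."""
    final_path = os.path.join(dump_dir, "metadata.json")
    tmp_path = os.path.join(dump_dir, "metadata.json.tmp")
    with open(tmp_path, "w") as f:
        json.dump(metadata, f, indent=2)
        f.flush()
        os.fsync(f.fileno())  # ensure bytes hit disk before the rename
    os.replace(tmp_path, final_path)
```

If the process crashes before the rename, only the .tmp file is affected and the previous metadata.json survives intact.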