
# Dump Format

The `graphman dump` command exports all entity data and metadata for a single subgraph deployment into a self-contained directory of Parquet files and JSON metadata. The resulting dump can be used to restore the deployment into a different `graph-node` instance via `graphman restore`. Dumps are consistent snapshots of the deployment's state at a specific point in time.

WARNING: The dump and restore commands are experimental and cannot replace proper database backups at this point. In particular, there is no guarantee that a dump will be restorable. Having said that, we encourage users to try out the dump and restore commands in non-production environments and report any issues they encounter.

WARNING: Dumping happens in a single transaction and can put significant load on the database for large subgraphs. Use with caution on production instances.

WARNING: Restoring a dump currently creates all the default indexes that a new deployment gets and ignores the indexes that might have been carefully curated for the original deployment, even though they are recorded in the dump. This can lead to very long restore times for large subgraphs. The restore process will be optimized in the future to create only the indexes present in the dump's metadata.

## Usage

### Dumping a deployment

```sh
graphman dump <deployment> <directory>
```

`<deployment>` identifies the subgraph deployment to dump. It can be a subgraph name, a deployment hash (`Qm...`), or a database namespace (`sgdNNN`). `<directory>` is the path where the dump will be written; it will be created if it does not exist.

Running `graphman dump` against an existing dump directory performs an incremental dump: only rows added since the last dump are exported, and new chunk files are appended rather than rewriting existing ones.

```sh
# Full dump
graphman dump my-subgraph /backups/my-subgraph

# Incremental update of the same dump
graphman dump my-subgraph /backups/my-subgraph
```

### Restoring a deployment

```sh
graphman restore <directory> [options]
```

`<directory>` is the path to a dump previously created with `graphman dump`.

| Option | Description |
|--------|-------------|
| `--shard` | Target database shard. Uses deployment rules (or the primary shard) when omitted. Required with `--add`. |
| `--name` | Subgraph name for deployment rule matching and node assignment. Falls back to an existing name. |
| `--replace` | Drop and recreate if the deployment already exists in the target shard. |
| `--add` | Create a copy in a shard that doesn't already have this deployment (requires `--shard`). |
| `--force` | Replace if the deployment exists in the target shard, add if it doesn't. |

`--replace`, `--add`, and `--force` are mutually exclusive. When none is given, restore fails if the deployment already exists in the target shard.

```sh
# Restore into the default shard
graphman restore /backups/my-subgraph

# Restore into a specific shard, replacing if it already exists
graphman restore /backups/my-subgraph --shard shard1 --replace

# Force-restore (replace or add as needed)
graphman restore /backups/my-subgraph --force
```

## Directory layout

A dump directory has the following structure:

```
<dump-dir>/
  metadata.json                  -- deployment metadata + per-table state
  schema.graphql                 -- raw GraphQL schema text
  subgraph.yaml                  -- raw subgraph manifest YAML (optional)
  <EntityType>/
    chunk_000000.parquet         -- rows ordered by vid
    chunk_000001.parquet         -- incremental append (future chunks)
    ...
  data_sources$/
    chunk_000000.parquet         -- dynamic data sources
```

Each entity type defined in the GraphQL schema gets its own subdirectory, named after the entity type exactly as it appears in the schema (e.g. `Token/`, `Pool/`). The Proof of Indexing appears as a regular entity directory named `Poi$`. The special `data_sources$` directory holds dynamic data sources created at runtime.

Within each directory, data is stored in numbered chunk files (`chunk_000000.parquet`, `chunk_000001.parquet`, ...). A fresh dump produces a single `chunk_000000.parquet` per table. Incremental dumps append new chunks rather than rewriting existing ones.

The GraphQL schema and subgraph manifest are stored as separate plain-text files `schema.graphql` and `subgraph.yaml`.
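For orientation, here is a minimal Python sketch that walks a dump directory and lists the chunk files per table; the dump path is hypothetical.

```python
from pathlib import Path

# List every table directory and its chunk files, assuming the layout above.
dump = Path("/backups/my-subgraph")  # hypothetical dump location
for table_dir in sorted(p for p in dump.iterdir() if p.is_dir()):
    chunks = sorted(table_dir.glob("chunk_*.parquet"))
    print(f"{table_dir.name}: {len(chunks)} chunk(s)")
```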

## `metadata.json`

The top-level `metadata.json` contains everything needed to reconstruct the deployment's table structure, plus diagnostic information captured at dump time. Its structure is:

```json
{
  "version": 1,
  "deployment": "Qm...",
  "network": "mainnet",

  "manifest": {
    "spec_version": "1.0.0",
    "description": "Optional subgraph description",
    "repository": "https://github.com/...",
    "features": ["..."],
    "entities_with_causality_region": ["EntityType1"],
    "history_blocks": 2147483647
  },

  "earliest_block_number": 12345,
  "start_block": { "number": 12345, "hash": "0xabc..." },
  "head_block": { "number": 99999, "hash": "0xdef..." },
  "entity_count": 150000,

  "graft_base": null,
  "graft_block": null,
  "debug_fork": null,

  "health": {
    "failed": false,
    "health": "healthy",
    "fatal_error": null,
    "non_fatal_errors": []
  },

  "indexes": {
    "token": [
      "CREATE INDEX CONCURRENTLY IF NOT EXISTS attr_0_0_id ON sgd.token USING btree (id)"
    ]
  },

  "tables": {
    "Token": {
      "immutable": true,
      "has_causality_region": false,
      "chunks": [
        {
          "file": "Token/chunk_000000.parquet",
          "min_vid": 0,
          "max_vid": 50000,
          "row_count": 50000
        }
      ],
      "max_vid": 50000
    },
    "data_sources$": {
      "immutable": false,
      "has_causality_region": true,
      "chunks": [
        {
          "file": "data_sources$/chunk_000000.parquet",
          "min_vid": 0,
          "max_vid": 100,
          "row_count": 100
        }
      ],
      "max_vid": 100
    }
  }
}
```

Field descriptions:

| Field | Description |
|-------|-------------|
| `version` | Format version. Must be `1`. |
| `deployment` | Deployment hash (`Qm...`). |
| `network` | The blockchain network (e.g. `mainnet`, `goerli`). |
| `manifest` | Manifest metadata extracted from `subgraphs.subgraph_manifest`. |
| `manifest.spec_version` | Subgraph API version. Required to parse `schema.graphql`. |
| `manifest.entities_with_causality_region` | Entity types that have a `causality_region` column. |
| `manifest.history_blocks` | How many blocks of entity version history are retained. |
| `earliest_block_number` | Earliest block for which data exists (accounts for pruning). |
| `start_block` | The block where indexing started. `null` if not set. |
| `head_block` | The latest indexed block at dump time. |
| `entity_count` | Total entity count across all tables. |
| `graft_base` | Deployment hash of the graft base, if any. |
| `graft_block` | Block pointer of the graft point, if any. |
| `debug_fork` | Debug fork deployment hash, if any. |
| `health` | Point-in-time health snapshot. Not used during restore. |
| `indexes` | Point-in-time index definitions as SQL. Not used during restore (indexes are auto-created by `Layout::create_relational_schema()`). |
| `tables` | Per-table metadata keyed by entity type name (or `data_sources$`). |

Each entry in `tables` contains:

| Field | Description |
|-------|-------------|
| `immutable` | Whether the entity type is immutable (uses `block$` instead of `block_range`). |
| `has_causality_region` | Whether rows have a `causality_region` column. |
| `chunks` | Ordered list of Parquet chunk files for this table. |
| `chunks[].file` | Relative path from the dump directory. |
| `chunks[].min_vid` | Minimum `vid` value in this chunk. |
| `chunks[].max_vid` | Maximum `vid` value in this chunk. |
| `chunks[].row_count` | Number of rows in this chunk. |
| `max_vid` | Maximum `vid` across all chunks. `-1` if the table is empty. |
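As a sanity check, the per-table metadata can be cross-checked against the files on disk. The following sketch (using `pyarrow`; the dump path is hypothetical) verifies that every chunk listed in `metadata.json` exists and matches its recorded row count:

```python
import json
from pathlib import Path

import pyarrow.parquet as pq

dump = Path("/backups/my-subgraph")  # hypothetical dump location
meta = json.loads((dump / "metadata.json").read_text())
assert meta["version"] == 1, "unsupported dump format version"

for table, info in meta["tables"].items():
    for chunk in info["chunks"]:
        # Opening the file fails if the chunk is missing; the Parquet footer
        # carries the row count without reading any data pages.
        pf = pq.ParquetFile(dump / chunk["file"])
        assert pf.metadata.num_rows == chunk["row_count"], chunk["file"]
```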

## Parquet schema: entity tables

Each entity table's Parquet files use an Arrow schema derived from the entity's GraphQL definition. Columns are ordered as follows:

1. **System columns** (always present, in this order):
   - `vid` (`Int64`, non-nullable): row version ID
   - Block tracking (one of):
     - Immutable entities: `block$` (`Int32`, non-nullable)
     - Mutable entities: `block_range_start` (`Int32`, non-nullable) and `block_range_end` (`Int32`, nullable; `null` means unbounded/current)
   - `causality_region` (`Int32`, non-nullable): present only if the entity has one
2. **Data columns** in GraphQL declaration order, skipping fulltext (`TSVector`) columns, which are generated and rebuilt on restore.

The PostgreSQL `int4range` type used for `block_range` is decomposed into two scalar columns (`block_range_start`, `block_range_end`) in the Parquet representation. This avoids the need for a custom range type in Arrow.
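As a sketch of how a reader can recover the range semantics, assuming PostgreSQL's canonical half-open `[start, end)` form for `int4range` and the convention above that a null end means the version is still current:

```python
from typing import Optional

def visible_at(block: int, start: int, end: Optional[int]) -> bool:
    # Half-open range: a version is visible from block_range_start (inclusive)
    # up to block_range_end (exclusive); a null end means "still current".
    return start <= block and (end is None or block < end)
```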

### Type mapping

GraphQL/PostgreSQL column types map to Arrow data types as follows:

| ColumnType | Arrow DataType | Notes |
|------------|----------------|-------|
| `Boolean` | `Boolean` | |
| `Int` | `Int32` | |
| `Int8` | `Int64` | |
| `Bytes` | `Binary` | Raw bytes, no hex encoding |
| `BigInt` | `Utf8` | Stored as a decimal string for arbitrary precision |
| `BigDecimal` | `Utf8` | Stored as a decimal string for arbitrary precision |
| `Timestamp` | `Timestamp(Microsecond, None)` | Microseconds since epoch, no timezone |
| `String` | `Utf8` | |
| `Enum(...)` | `Utf8` | Enum variant as a string (cast from the PG enum to text during dump) |
| `TSVector(...)` | skipped | Fulltext index columns are generated; rebuilt on restore |

**Array columns:** A GraphQL list field (e.g. `tags: [String!]!`) is stored as `List<T>`, where `T` is the base Arrow type from the table above. Whether a column is a list is determined by the GraphQL field type, not by `ColumnType`. For example, `[String!]!` becomes `List<Utf8>` and `[Int!]` becomes `List<Int32>`.

Nullability follows the GraphQL schema: non-null fields produce non-nullable Arrow columns; optional fields produce nullable columns. List elements within list columns are always marked nullable in the Arrow schema.
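Putting the mapping together, here is a `pyarrow` sketch of the Arrow schema one would expect under the rules above for a hypothetical mutable `Token` entity (no causality region) declared as `id: Bytes!`, `name: String`, `tags: [String!]!`:

```python
import pyarrow as pa

token_schema = pa.schema([
    # System columns for a mutable entity without a causality region
    pa.field("vid", pa.int64(), nullable=False),
    pa.field("block_range_start", pa.int32(), nullable=False),
    pa.field("block_range_end", pa.int32(), nullable=True),
    # Data columns in GraphQL declaration order
    pa.field("id", pa.binary(), nullable=False),   # Bytes!  -> Binary, non-nullable
    pa.field("name", pa.utf8(), nullable=True),    # String  -> Utf8, nullable
    # [String!]! -> List<Utf8>; list elements are always nullable in Arrow
    pa.field("tags", pa.list_(pa.utf8()), nullable=False),
])
```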

## Parquet schema: `data_sources$`

The `data_sources$` table has a fixed schema independent of the GraphQL definition:

| Column | Arrow DataType | Nullable | Description |
|--------|----------------|----------|-------------|
| `vid` | `Int64` | no | Row version ID |
| `block_range_start` | `Int32` | no | Lower bound of `block_range` |
| `block_range_end` | `Int32` | yes | Upper bound (`null` = unbounded) |
| `causality_region` | `Int32` | no | Causality region |
| `manifest_idx` | `Int32` | no | Index into the manifest's data source list |
| `parent` | `Int32` | yes | Self-referencing parent data source |
| `id` | `Binary` | yes | Data source identifier |
| `param` | `Binary` | yes | Data source parameter |
| `context` | `Utf8` | yes | JSON context |
| `done_at` | `Int32` | yes | Block number where the data source was marked done |
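Translated one-to-one into `pyarrow`, the fixed schema from the table above looks like this sketch:

```python
import pyarrow as pa

data_sources_schema = pa.schema([
    pa.field("vid", pa.int64(), nullable=False),
    pa.field("block_range_start", pa.int32(), nullable=False),
    pa.field("block_range_end", pa.int32(), nullable=True),
    pa.field("causality_region", pa.int32(), nullable=False),
    pa.field("manifest_idx", pa.int32(), nullable=False),
    pa.field("parent", pa.int32(), nullable=True),
    pa.field("id", pa.binary(), nullable=True),
    pa.field("param", pa.binary(), nullable=True),
    pa.field("context", pa.utf8(), nullable=True),
    pa.field("done_at", pa.int32(), nullable=True),
])
```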

## Compression

All Parquet files use ZSTD compression (default level).

## Row ordering

Within each Parquet chunk file, rows are ordered by `vid` (ascending). This matches the primary key ordering in PostgreSQL and enables efficient sequential reads during restore.
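A quick way to verify the ordering invariant for a single chunk (file path hypothetical):

```python
import pyarrow.parquet as pq

# Read only the vid column and check it is non-decreasing.
vids = pq.read_table("Token/chunk_000000.parquet", columns=["vid"]).column("vid").to_pylist()
assert vids == sorted(vids), "rows must be ordered by vid ascending"
```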

## Incremental dumps

An incremental dump reads the existing `metadata.json`, determines the `max_vid` for each table, and queries only rows with `vid > max_vid`. New rows are written to new chunk files (e.g. `chunk_000001.parquet`) and the metadata is updated atomically (write to a temp file, then rename).
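In pseudocode terms, the per-table incremental logic amounts to the following sketch; the field names come from the metadata format above, and the SQL comment is an illustration rather than the exact query:

```python
def incremental_plan(table_meta: dict):
    # max_vid is -1 for an empty table, so every row qualifies on a first dump.
    since_vid = table_meta["max_vid"]
    # Chunk files are numbered sequentially, so the next chunk index is simply
    # the number of chunks already recorded.
    next_chunk = f"chunk_{len(table_meta['chunks']):06d}.parquet"
    # Export would then run something like:
    #   SELECT ... WHERE vid > since_vid ORDER BY vid
    return since_vid, next_chunk
```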

## Atomicity

The `metadata.json` file is always written atomically: the dump writes to `metadata.json.tmp` first, then renames it to `metadata.json`. This ensures that a reader never sees a partially written metadata file. If the dump process crashes mid-write, the previous `metadata.json` remains intact. The Parquet chunk files are written before `metadata.json` is updated, so chunk files referenced by `metadata.json` are always complete.
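The write-then-rename pattern is the standard POSIX idiom for this; a minimal Python sketch:

```python
import json
import os

def write_metadata_atomically(path: str, metadata: dict) -> None:
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(metadata, f, indent=2)
        f.flush()
        os.fsync(f.fileno())  # ensure the bytes hit disk before the rename
    os.replace(tmp, path)     # atomic rename: readers see old or new, never partial
```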
