feat: sampling API and strategies · Issue #726 · graphframes/graphframes

feat: sampling API and strategies #726

@SemyonSinchenko

Description

Is your feature request related to a problem? Please describe.
To apply Graph ML or Graph NN algorithms to real-world power-law graphs, we first need an implementation of sampling.

Describe the solution you would like

Top level API:

graph.sampleEdges(strategy: EdgesSamplingStrategy, seed: Long): DataFrame
graph.sampleVertices(strategy: VerticesSamplingStrategy, seed: Long): DataFrame

EdgesSamplingStrategy and VerticesSamplingStrategy are traits that are part of the public API.
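
One possible shape for these traits, as a minimal sketch and not the proposed implementation. The trait method name `sample`, the `UniformEdges` and `UniformVertices` classes, and the `SamplingSyntax` object are hypothetical names introduced for illustration. Only `GraphFrame.edges`, `GraphFrame.vertices` and `Dataset.sample(fraction, seed)` are existing APIs.

```scala
import org.apache.spark.sql.DataFrame
import org.graphframes.GraphFrame

// Hypothetical public traits; the method name `sample` is an assumption.
trait EdgesSamplingStrategy {
  def sample(graph: GraphFrame, seed: Long): DataFrame
}

trait VerticesSamplingStrategy {
  def sample(graph: GraphFrame, seed: Long): DataFrame
}

// Simple random sampling: each edge is kept independently with probability `fraction`.
final case class UniformEdges(fraction: Double) extends EdgesSamplingStrategy {
  require(fraction >= 0.0 && fraction <= 1.0, "fraction must be in [0, 1]")

  override def sample(graph: GraphFrame, seed: Long): DataFrame =
    graph.edges.sample(fraction, seed)
}

// Same idea for vertices.
final case class UniformVertices(fraction: Double) extends VerticesSamplingStrategy {
  require(fraction >= 0.0 && fraction <= 1.0, "fraction must be in [0, 1]")

  override def sample(graph: GraphFrame, seed: Long): DataFrame =
    graph.vertices.sample(fraction, seed)
}

// Hypothetical extension methods mirroring the top-level API above,
// so the sketch works without modifying GraphFrame itself.
object SamplingSyntax {
  implicit class GraphFrameSamplingOps(graph: GraphFrame) {
    def sampleEdges(strategy: EdgesSamplingStrategy, seed: Long): DataFrame =
      strategy.sample(graph, seed)

    def sampleVertices(strategy: VerticesSamplingStrategy, seed: Long): DataFrame =
      strategy.sample(graph, seed)
  }
}
```

With `import SamplingSyntax._` in scope, a call would look like `graph.sampleEdges(UniformEdges(0.1), seed = 42L)`. Keeping the strategy as a trait lets users plug in their own sampling logic without changes to the core.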

Batteries:

  • simple random sampling
  • weight-based sampling
  • fixed-size sampling (like GraphSAGE)
  • degree-based sampling
  • context sampling (user provides a function (src, dst, edge) -> prob)
  • ???
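
Two of the strategies listed above can be sketched with plain Spark SQL functions. This is an illustration under stated assumptions, not the proposed implementation: the object `EdgeSamplingSketch`, both function names and the temporary column names `_rnd` and `_rank` are hypothetical, and the edges DataFrame is assumed to have a `src` column as in GraphFrames. The fixed-size variant keeps at most `k` outgoing edges per source vertex. The probability variant covers weight-based and context sampling when the probability can be expressed as a Column over the edge's own columns.

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, rand, row_number}

object EdgeSamplingSketch {

  // Fixed-size sampling: keep at most `k` outgoing edges per source vertex.
  // Ranking edges by an i.i.d. uniform random key and taking the first `k`
  // yields a uniform random subset of each vertex's out-edges.
  def sampleFixedSize(edges: DataFrame, k: Int, seed: Long): DataFrame = {
    require(k > 0, "k must be positive")
    val randomOrderPerSource = Window.partitionBy(col("src")).orderBy(col("_rnd"))
    edges
      .withColumn("_rnd", rand(seed))
      .withColumn("_rank", row_number().over(randomOrderPerSource))
      .where(col("_rank") <= k)
      .drop("_rnd", "_rank")
  }

  // Probability-based sampling: keep each edge independently with the
  // probability given by `prob`, a Column expression expected to be in [0, 1].
  def sampleByProbability(edges: DataFrame, prob: Column, seed: Long): DataFrame =
    edges.where(rand(seed) < prob)
}
```

For example, `EdgeSamplingSketch.sampleFixedSize(graph.edges, k = 10, seed = 42L)` caps the out-degree at 10. Note that `rand(seed)` is only reproducible for a fixed partitioning and row order, so a real implementation would likely need a more robust source of randomness.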

Component

  • Scala Core Internal
  • Scala API
  • Spark Connect Plugin
  • Infrastructure
  • PySpark Classic
  • PySpark Connect

Additional context
Without sampling it will be hard to implement any of:

  • good approximate algorithms
  • random walks on power-law graphs
  • GCNN and graph convolutions in general

Are you planning on creating a PR?

  • I'm willing to make a pull-request
