[DRAFT]: Adding JsonIndexDistinctOperator and InvertedIndexDistinctOperator by raghavyadav01 · Pull Request #17820 · apache/pinot

raghavyadav01 · 2026-03-05T05:20:50Z

the commit contains following changes:

Adding support for JsonIndexDistinctOperator and InvertedIndexDistinctOperator using the filteroiperator to avoid scan in DistinctOperator
Integration test to validate the changes.
Unit Tests
operators are disabled by default and can be enabled by query Options.

codecov-commenter · 2026-03-05T06:10:59Z

Codecov Report

❌ Patch coverage is 0.81744% with 364 lines in your changes missing coverage. Please review.
✅ Project coverage is 63.14%. Comparing base (097a89f) to head (a69679e).
⚠️ Report is 1 commits behind head on master.

Files with missing lines	Patch %	Lines
...core/operator/query/JsonIndexDistinctOperator.java	0.00%	194 Missing ⚠️
.../operator/query/InvertedIndexDistinctOperator.java	0.00%	157 Missing ⚠️
...a/org/apache/pinot/core/plan/DistinctPlanNode.java	0.00%	10 Missing and 2 partials ⚠️
...inot/segment/spi/index/reader/JsonIndexReader.java	0.00%	1 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##             master   #17820      +/-   ##
============================================
- Coverage     63.27%   63.14%   -0.14%     
+ Complexity     1466     1460       -6     
============================================
  Files          3190     3192       +2     
  Lines        192101   192468     +367     
  Branches      29433    29505      +72     
============================================
- Hits         121547   121526      -21     
- Misses        61040    61417     +377     
- Partials       9514     9525      +11

Flag	Coverage Δ
custom-integration1	`?`
integration	`?`
integration1	`?`
integration2	`?`
java-11	`63.11% <0.81%> (-0.11%)`	⬇️
java-21	`63.11% <0.81%> (-0.14%)`	⬇️
temurin	`63.14% <0.81%> (-0.14%)`	⬇️
unittests	`63.14% <0.81%> (-0.13%)`	⬇️
unittests1	`55.43% <0.81%> (-0.15%)`	⬇️
unittests2	`34.20% <0.00%> (-0.07%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

...st/java/org/apache/pinot/integration/tests/InvertedIndexDistinctOperatorIntegrationTest.java

pinot-spi/src/main/java/org/apache/pinot/spi/utils/CommonConstants.java

...st/java/org/apache/pinot/integration/tests/InvertedIndexDistinctOperatorIntegrationTest.java

Copilot

Pull request overview

This PR adds two new index-based distinct operators (JsonIndexDistinctOperator and InvertedIndexDistinctOperator) that avoid the scan-based projection pipeline for SELECT DISTINCT queries by reading values directly from JSON or inverted indexes. Both operators are disabled by default and opt-in via query options (useJsonIndexDistinct, useInvertedIndexDistinct, or the umbrella useIndexBasedDistinctOperator).

Changes:

Two new operators (JsonIndexDistinctOperator, InvertedIndexDistinctOperator) and their integration into DistinctPlanNode's operator selection logic, plus query option plumbing in CommonConstants and QueryOptionsUtils.
Integration tests in JsonPathTest and OfflineClusterIntegrationTest validating correctness and index-only execution stats for both operators.
A new isPathIndexed default method on JsonIndexReader SPI interface, and an unrelated change to MultiStageWithoutStatsIntegrationTest.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 7 comments.

Show a summary per file

File	Description
`CommonConstants.java`	Adds three new query option keys for index-based distinct
`QueryOptionsUtils.java`	Adds parsing methods for the new query options with umbrella fallback
`JsonIndexReader.java`	Adds `isPathIndexed` default method for path indexing check
`DistinctPlanNode.java`	Integrates new operators into the plan selection logic
`JsonIndexDistinctOperator.java`	New operator using JSON index value→docId map for DISTINCT
`InvertedIndexDistinctOperator.java`	New operator using inverted index dictId→docIds for DISTINCT
`JsonPathTest.java`	Integration tests for `JsonIndexDistinctOperator`
`OfflineClusterIntegrationTest.java`	Integration tests for `InvertedIndexDistinctOperator`
`MultiStageWithoutStatsIntegrationTest.java`	Unrelated change replacing enum reference with string literal

You can also share your feedback on Copilot code review. Take the survey.

Copilot · 2026-03-11T23:03:02Z

...t-core/src/main/java/org/apache/pinot/core/operator/query/InvertedIndexDistinctOperator.java

+  private static boolean addToTable(IntDistinctTable table, int value,
+      @Nullable OrderByExpressionContext orderByExpression) {
+    if (table.hasLimit()) {
+      if (orderByExpression != null) {
+        table.addWithOrderBy(value);
+        return false;
+      } else {
+        return table.addWithoutOrderBy(value);
+      }
+    } else {
+      table.addUnbounded(value);
+      return false;
+    }
+  }
+
+  private static boolean addToTable(LongDistinctTable table, long value,
+      @Nullable OrderByExpressionContext orderByExpression) {
+    if (table.hasLimit()) {
+      if (orderByExpression != null) {
+        table.addWithOrderBy(value);
+        return false;
+      } else {
+        return table.addWithoutOrderBy(value);
+      }
+    } else {
+      table.addUnbounded(value);
+      return false;
+    }
+  }
+
+  private static boolean addToTable(FloatDistinctTable table, float value,
+      @Nullable OrderByExpressionContext orderByExpression) {
+    if (table.hasLimit()) {
+      if (orderByExpression != null) {
+        table.addWithOrderBy(value);
+        return false;
+      } else {
+        return table.addWithoutOrderBy(value);
+      }
+    } else {
+      table.addUnbounded(value);
+      return false;
+    }
+  }
+
+  private static boolean addToTable(DoubleDistinctTable table, double value,
+      @Nullable OrderByExpressionContext orderByExpression) {
+    if (table.hasLimit()) {
+      if (orderByExpression != null) {
+        table.addWithOrderBy(value);
+        return false;
+      } else {
+        return table.addWithoutOrderBy(value);
+      }
+    } else {
+      table.addUnbounded(value);
+      return false;
+    }
+  }
+
+  private static boolean addToTable(BigDecimalDistinctTable table, java.math.BigDecimal value,
+      @Nullable OrderByExpressionContext orderByExpression) {
+    if (table.hasLimit()) {
+      if (orderByExpression != null) {
+        table.addWithOrderBy(value);
+        return false;
+      } else {
+        return table.addWithoutOrderBy(value);
+      }
+    } else {
+      table.addUnbounded(value);
+      return false;
+    }
+  }
+
+  private static boolean addToTable(StringDistinctTable table, String value,
+      @Nullable OrderByExpressionContext orderByExpression) {
+    if (table.hasLimit()) {
+      if (orderByExpression != null) {
+        table.addWithOrderBy(value);
+        return false;
+      } else {
+        return table.addWithoutOrderBy(value);
+      }
+    } else {
+      table.addUnbounded(value);
+      return false;
+    }
+  }
+
+  private static boolean addToTable(BytesDistinctTable table, ByteArray value,
+      @Nullable OrderByExpressionContext orderByExpression) {
+    if (table.hasLimit()) {
+      if (orderByExpression != null) {
+        table.addWithOrderBy(value);
+        return false;
+      } else {
+        return table.addWithoutOrderBy(value);
+      }
+    } else {
+      table.addUnbounded(value);
+      return false;
+    }
+  }


The seven addToTable overloads (lines 194-297) are nearly identical, differing only in the value type. This is a significant amount of duplicated logic. Consider using a generic helper or consolidating via the DistinctTable base class (or a templated approach) to reduce boilerplate.

Copilot · 2026-03-11T23:03:03Z

.../src/test/java/org/apache/pinot/integration/tests/MultiStageWithoutStatsIntegrationTest.java

  protected void overrideServerConf(PinotConfiguration serverConf) {
    serverConf.setProperty(CommonConstants.MultiStageQueryRunner.KEY_OF_SEND_STATS_MODE,
-        SendStatsPredicate.Mode.NEVER.name());
+        "NEVER");


This change from SendStatsPredicate.Mode.NEVER.name() to a hardcoded string "NEVER" appears unrelated to the PR's purpose (adding index-based distinct operators). If it's needed (e.g. to break a dependency), it should be mentioned in the PR description. Using the enum constant is safer against typos and refactoring.

Copilot · 2026-03-11T23:03:03Z

pinot-core/src/main/java/org/apache/pinot/core/operator/query/JsonIndexDistinctOperator.java

+  private JsonIndexReader getJsonIndexReader(String columnName) {
+    DataSource dataSource = _indexSegment.getDataSource(columnName, _queryContext.getSchema());
+    JsonIndexReader reader = dataSource.getJsonIndex();
+    if (reader == null) {
+      Optional<IndexType<?, ?, ?>> compositeIndex =
+          IndexService.getInstance().getOptional("composite_json_index");
+      if (compositeIndex.isPresent()) {
+        reader = (JsonIndexReader) dataSource.getIndex(compositeIndex.get());
+      }
+    }
+    return reader;


The logic to locate a JsonIndexReader (checking getJsonIndex() then falling back to composite_json_index) is duplicated between canUseJsonIndexDistinct() (lines 297-306) and getJsonIndexReader() (lines 148-158). Consider extracting this into a single shared static method to avoid the duplication diverging over time.

Copilot · 2026-03-11T23:03:03Z

pinot-core/src/main/java/org/apache/pinot/core/operator/query/JsonIndexDistinctOperator.java

+ * POC: Distinct operator that uses JSON index value→docId map directly instead of scanning docs.
+ * Avoids projection/transform pipeline for SELECT DISTINCT jsonExtractIndex(...).


The class Javadoc says "POC" (Proof of Concept). If this is intended for production use (as suggested by the integration tests and query options), the "POC" label should be removed. If it truly is a POC, it may not be ready to merge.

Suggested change

* POC: Distinct operator that uses JSON index value→docId map directly instead of scanning docs.

* Avoids projection/transform pipeline for SELECT DISTINCT jsonExtractIndex(...).

* Distinct operator that uses the JSON index value→docId map directly instead of scanning documents.

* Avoids the projection/transform pipeline for SELECT DISTINCT jsonExtractIndex(...).

Copilot · 2026-03-11T23:03:03Z

...on-tests/src/test/java/org/apache/pinot/integration/tests/OfflineClusterIntegrationTest.java

 import java.util.List;
 import java.util.Map;
 import java.util.Properties;
+import java.util.HashSet;


Import HashSet is not in alphabetical order relative to the surrounding imports. It should be placed before LinkedHashMap and after Collections to maintain alphabetical ordering consistent with the rest of the file.

Copilot · 2026-03-11T23:03:03Z

...t-core/src/main/java/org/apache/pinot/core/operator/query/InvertedIndexDistinctOperator.java

+        RoaringBitmap docIdsRoaring = docIds.toMutableRoaringBitmap().toRoaringBitmap();
+        RoaringBitmap intersection = RoaringBitmap.and(docIdsRoaring, filteredDocIds);
+        includeValue = !intersection.isEmpty();


Unnecessary double conversion: docIds.toMutableRoaringBitmap().toRoaringBitmap() creates two intermediate bitmap copies. RoaringBitmap.and() accepts ImmutableRoaringBitmap parameters directly (since RoaringBitmap extends ImmutableRoaringBitmap), so you can simply check ImmutableRoaringBitmap.andCardinality(docIds, filteredDocIds) > 0 or use RoaringBitmap.intersects(docIds.toMutableRoaringBitmap(), filteredDocIds) to avoid materializing the intersection entirely. This is called for every dictionary entry so the extra allocations could be significant.

Suggested change

RoaringBitmap docIdsRoaring = docIds.toMutableRoaringBitmap().toRoaringBitmap();

RoaringBitmap intersection = RoaringBitmap.and(docIdsRoaring, filteredDocIds);

includeValue = !intersection.isEmpty();

includeValue = ImmutableRoaringBitmap.andCardinality(docIds, filteredDocIds) > 0;

Copilot · 2026-03-11T23:03:04Z

pinot-core/src/main/java/org/apache/pinot/core/operator/query/JsonIndexDistinctOperator.java

+  private RoaringBitmap buildFilteredDocIds() {
+    if (_filterOperator.isResultMatchingAll()) {
+      return null;
+    }
+
+    if (_filterOperator.canProduceBitmaps()) {
+      return _filterOperator.getBitmaps().reduce().toRoaringBitmap();
+    }
+
+    if (_filterOperator.isResultEmpty()) {
+      return new RoaringBitmap();
+    }
+
+    RoaringBitmap bitmap = new RoaringBitmap();
+    DocIdSetPlanNode docIdSetPlanNode = new DocIdSetPlanNode(
+        _segmentContext, _queryContext, DocIdSetPlanNode.MAX_DOC_PER_CALL, _filterOperator);
+    var docIdSetOperator = docIdSetPlanNode.run();
+    DocIdSetBlock block;
+    while ((block = docIdSetOperator.nextBlock()) != null) {
+      int[] docIds = block.getDocIds();
+      int length = block.getLength();
+      bitmap.addN(docIds, 0, length);
+    }
+    return bitmap;
+  }


The buildFilteredDocIds() method is duplicated verbatim between JsonIndexDistinctOperator and InvertedIndexDistinctOperator. Consider extracting this into a shared utility method (e.g., in a common base class or a static helper) to avoid code duplication and make future maintenance easier.

…anges.

Adding InvertedIndexDistinctOperator and Integration tests for the changes.

…perators The index-based distinct operators had two bugs: 1. Both InvertedIndexDistinctOperator and JsonIndexDistinctOperator broke out of the iteration loop once distinctTable.size() >= limit, even when ORDER BY was present. With ORDER BY the heap needs to see all candidates to determine the true top-K, so early break is only safe without ORDER BY. 2. JsonIndexDistinctOperator hardcoded STRING schema and StringDistinctTable for all result types. This caused incorrect DISTINCT/ORDER BY semantics for numeric types (lexicographic vs numeric comparison). Now parses the resultsType argument to create the correctly typed table and convert JSON index string values to the proper native type. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

raghavyadav01 force-pushed the json-index-distinct-operator branch from 9fe3f76 to a986d91 Compare March 5, 2026 05:22

raghavyadav01 changed the title ~~Adding JsonIndexDistinctOperator and InvertedIndexDistinctOperator~~ [DRAFT]: Adding JsonIndexDistinctOperator and InvertedIndexDistinctOperator Mar 5, 2026

Jackie-Jiang reviewed Mar 6, 2026

View reviewed changes

raghavyadav01 force-pushed the json-index-distinct-operator branch 2 times, most recently from 5e9cc7e to bc0d213 Compare March 11, 2026 21:56

xiangfu0 requested review from Jackie-Jiang, Copilot, xiangfu0 and yashmayya March 11, 2026 22:57

xiangfu0 added the query label Mar 11, 2026

Copilot started reviewing on behalf of xiangfu0 March 11, 2026 22:58 View session

xiangfu0 added the enhancement label Mar 11, 2026

Copilot AI reviewed Mar 11, 2026

View reviewed changes

raghavyadav01 force-pushed the json-index-distinct-operator branch from bc0d213 to 57dfc5b Compare March 12, 2026 00:58

raghavyadav01 and others added 5 commits March 12, 2026 12:40

Add support for JsonIndexDistinctOperator

89a0b47

changes for JsonIndexDistinctOperator

291f143

Adding InvertedIndexDistinctOperator and Integration tests for the ch…

4b02fd0

…anges.

incorporating th 4D1C e review comments.

5172309

Adding InvertedIndexDistinctOperator and Integration tests for the changes.

xiangfu0 force-pushed the json-index-distinct-operator branch from 57dfc5b to a69679e Compare March 12, 2026 19:41

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[DRAFT]: Adding JsonIndexDistinctOperator and InvertedIndexDistinctOperator #17820

[DRAFT]: Adding JsonIndexDistinctOperator and InvertedIndexDistinctOperator #17820
raghavyadav01 wants to merge 5 commits intoapache:masterfrom
raghavyadav01:json-index-distinct-operator

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

		* POC: Distinct operator that uses JSON index value→docId map directly instead of scanning docs.
		* Avoids projection/transform pipeline for SELECT DISTINCT jsonExtractIndex(...).

Conversation

Uh oh!

Uh oh!

Codecov Report

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants