GH-46676: [C++][Python][Parquet] Allow reading Parquet LIST data as LargeList directly #46678

pitrou · 2025-06-02T16:00:16Z

Rationale for this change

When reading a Parquet LIST logical type (or a repeated field without a logical type), Parquet C++ automatically reads it as an Arrow List array.

However, this can in some cases run into the 32-bit offsets limit. We'd like to be able to choose to read as LargeList instead, even if there is no serialized Arrow schema in the Parquet file.
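
To make the 32-bit offsets limit concrete: an Arrow List array stores signed 32-bit offsets into its child values, so the total number of child elements cannot exceed 2**31 - 1, while LargeList uses 64-bit offsets. The following is a minimal sketch in plain Python with no Arrow dependency; `total_child_elements` and `pick_list_type` are hypothetical helper names used only for illustration, not part of any Arrow API.

```python
INT32_MAX = 2**31 - 1  # largest offset representable in an Arrow List array
INT64_MAX = 2**63 - 1  # largest offset representable in an Arrow LargeList array


def total_child_elements(list_lengths):
    """Final offset a list array would need: the sum of all list lengths."""
    return sum(list_lengths)


def pick_list_type(list_lengths):
    """Hypothetical helper: name the narrowest list type whose offsets fit."""
    last_offset = total_child_elements(list_lengths)
    if last_offset <= INT32_MAX:
        return "list"
    if last_offset <= INT64_MAX:
        return "large_list"
    raise OverflowError("too many child elements even for 64-bit offsets")


# Three rows of 10**9 elements each need a final offset of 3 * 10**9,
# which is larger than 2**31 - 1.
print(pick_list_type([10, 20, 30]))
print(pick_list_type([1_000_000_000] * 3))
```

The second call reports `large_list` because the cumulative child count, not the length of any single row, is what has to fit in the offset type.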

What changes are included in this PR?

- Add a Parquet read option list_type to select which Arrow type to read LIST / repeated Parquet columns into
- Fix an index truncation bug when writing a huge single chunk of data to Parquet
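
The description does not spell out the truncation bug, so the following only illustrates the general failure mode, assuming a 64-bit element count is narrowed to a signed 32-bit integer somewhere on the write path. `narrow_to_int32` is a hypothetical helper, and this is plain Python rather than the Parquet C++ code.

```python
import ctypes


def narrow_to_int32(value):
    """Mimic a C-style narrowing conversion to a signed 32-bit integer."""
    return ctypes.c_int32(value).value


# A single chunk holding more than 2**31 - 1 values (assumed scenario).
chunk_length = 2**31 + 5
truncated = narrow_to_int32(chunk_length)

print(chunk_length)  # 2147483653
print(truncated)     # -2147483643
```

Keeping such counts in a 64-bit integer all the way through avoids the wraparound.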

Are these changes tested?

Yes, the functionality is tested. However, I wasn't able to write a unit test that wouldn't consume a horrendous amount of time or memory writing/reading a list with offsets larger than 2**32.

Are there any user-facing changes?

No, only an API improvement.

GitHub Issue: [C++][Parquet] Allow reading Parquet LIST data as LargeList directly #46676

pitrou · 2025-06-02T16:11:35Z

@github-actions crossbow submit -g cpp

github-actions · 2025-06-02T16:14:04Z

Revision: 9fb50a9

Submitted crossbow builds: ursacomputing/crossbow @ actions-b3bd66f38b

Task	Status
example-cpp-minimal-build-static
example-cpp-minimal-build-static-system-dependency
example-cpp-tutorial
test-alpine-linux-cpp
test-build-cpp-fuzz
test-conda-cpp
test-conda-cpp-valgrind
test-cuda-cpp-ubuntu-22.04-cuda-11.7.1
test-debian-12-cpp-amd64
test-debian-12-cpp-i386
test-fedora-39-cpp
test-ubuntu-22.04-cpp
test-ubuntu-22.04-cpp-20
test-ubuntu-22.04-cpp-bundled
test-ubuntu-22.04-cpp-emscripten
test-ubuntu-22.04-cpp-no-threading
test-ubuntu-24.04-cpp
test-ubuntu-24.04-cpp-bundled-offline
test-ubuntu-24.04-cpp-gcc-13-bundled
test-ubuntu-24.04-cpp-gcc-14
test-ubuntu-24.04-cpp-minimal-with-formats
test-ubuntu-24.04-cpp-thread-sanitizer

pitrou · 2025-06-03T11:38:56Z

cc @wgtmac @mapleFU

mapleFU · 2025-06-03T14:50:14Z

Maybe off topic, but I just found out that Arrow doesn't have a LargeMap type, lol...

mapleFU

C++ part generally LGTM

mapleFU · 2025-06-03T15:24:20Z

cpp/src/parquet/arrow/schema.cc

-  out->field = ::arrow::field(group.name(), ::arrow::list(child_field->field),
-                              group.is_optional(), FieldIdMetadata(group.field_id()));
+  ARROW_ASSIGN_OR_RAISE(auto list_type,
+                        MakeArrowList(child_field->field, ctx->properties));


So this just affects the inferred type, and might be overridden by the original storage type?

Yes, the same semantics as binary_type.
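
The semantics confirmed above can be sketched roughly as follows: the list_type read option only chooses the type inferred from the Parquet schema, and a type recorded in a serialized Arrow schema, when present, may take precedence. This is a plain-Python illustration with hypothetical names (`resolve_list_type` and the string type tags), not the actual logic in schema.cc.

```python
def resolve_list_type(list_type_option="list", stored_arrow_type=None):
    """Hypothetical sketch of which list type a column ends up with.

    list_type_option: the reader option, either "list" or "large_list".
    stored_arrow_type: the type from a serialized Arrow schema in the file,
        or None when the file carries no such schema.
    """
    if list_type_option not in ("list", "large_list"):
        raise ValueError(f"unsupported list type: {list_type_option!r}")
    if stored_arrow_type is not None:
        # Assumption: the stored Arrow type wins over the inferred one.
        return stored_arrow_type
    return list_type_option


# No stored schema: the reader option decides.
print(resolve_list_type("large_list"))
# Stored schema present: it overrides the option in this sketch.
print(resolve_list_type("large_list", stored_arrow_type="list"))
```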

mapleFU · 2025-06-03T15:43:17Z

cpp/src/parquet/arrow/arrow_schema_test.cc

+    //     required binary str (UTF8);
+    //   };
+    // }
+    // Special case: group is named array


Hmm, this is a special case in the Parquet list spec, why do we rename this? (Rule 4 in https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#backward-compatibility-rules )

This code is just reindented, but the diff view messes things up unfortunately.

Here is a diff with whitespace ignored, it will make things clearer: https://gist.github.com/pitrou/9957f54fa79a4323973f86158cb7e6fe

ohhh, my bad

wgtmac · 2025-06-04T01:44:47Z

cpp/src/parquet/properties.h

@@ -1051,6 +1053,18 @@ class PARQUET_EXPORT ArrowReaderProperties {
  /// Return the Arrow binary type to read BYTE_ARRAY columns as.
  ::arrow::Type::type binary_type() const { return binary_type_; }

+  /// \brief Set the Arrow list type to read Parquet list columns as.
+  ///
+  /// Allowed values are Type::LIST and Type::LARGE_LIST.


Do we need to support list view?

We don't need to, as list view is probably less useful than binary view.
Moreover, this would need a specific implementation as the current ListReader constructs offsets explicitly.
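
For context on why list view would need a separate implementation: in the Arrow columnar format a List array carries n + 1 monotonically increasing offsets, whereas a ListView array carries n offsets plus n sizes. A minimal sketch in plain Python follows; `build_list_offsets` and `build_list_view` are hypothetical names, and the real ListReader works from Parquet repetition levels rather than from a list of lengths.

```python
from itertools import accumulate


def build_list_offsets(lengths):
    """List / LargeList layout: n + 1 offsets, starting at 0."""
    return [0, *accumulate(lengths)]


def build_list_view(lengths):
    """ListView layout: one offset and one size per list."""
    offsets = [0, *accumulate(lengths)][:-1]  # start position of each list
    sizes = list(lengths)
    return offsets, sizes


lengths = [2, 0, 3]
print(build_list_offsets(lengths))  # [0, 2, 2, 5]
print(build_list_view(lengths))     # ([0, 2, 2], [2, 0, 3])
```

The two layouts describe the same lists here, but a reader that emits only an offsets buffer would have to produce the extra sizes buffer to support list view.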

conbench-apache-arrow · 2025-06-05T07:23:17Z

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 03a1867.

There were 70 benchmark results with an error:

Commit Run on arm64-t4g-2xlarge-linux at 2025-06-04 19:28:30Z
- tpch (R) with engine=arrow, format=parquet, language=R, memory_map=False, query_id=TPCH-07, scale_factor=1
- tpch (R) with engine=arrow, format=native, language=R, memory_map=False, query_id=TPCH-09, scale_factor=1
and 68 more (see the report linked below)

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 52 possible false positives for unstable benchmarks that are known to sometimes produce them.

github-actions bot added Component: Parquet, Component: C++, Component: Python and awaiting review labels Jun 2, 2025

pitrou marked this pull request as ready for review June 2, 2025 16:27

pitrou requested review from AlenkaF, raulcd, rok and wgtmac as code owners June 2, 2025 16:27

pitrou force-pushed the pq-list-type branch from 9fb50a9 to 5a79c03 Compare June 2, 2025 16:46

apacheGH-46676: [C++][Parquet] Allow reading Parquet LIST data as Lar…

e8f2f8d

…geList directly

pitrou force-pushed the pq-list-type branch from 5a79c03 to e8f2f8d Compare June 3, 2025 11:38

mapleFU reviewed Jun 3, 2025

github-actions bot added awaiting committer review and removed awaiting review labels Jun 3, 2025

mapleFU reviewed Jun 3, 2025

wgtmac approved these changes Jun 4, 2025

mapleFU approved these changes Jun 4, 2025

pitrou merged commit 03a1867 into apache:main Jun 4, 2025
45 of 47 checks passed

pitrou removed the awaiting committer review label Jun 4, 2025

pitrou mentioned this pull request Jun 4, 2025

[C++][Parquet] Allow reading Parquet LIST data as LargeList directly #46676

Closed

pitrou deleted the pq-list-type branch June 4, 2025 16:30
