8000 GH-46676: [C++][Python][Parquet] Allow reading Parquet LIST data as LargeList directly by pitrou · Pull Request #46678 · apache/arrow · GitHub
[go: up one dir, main page]

Skip to content

GH-46676: [C++][Python][Parquet] Allow reading Parquet LIST data as LargeList directly #46678

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jun 4, 2025

Conversation

pitrou
Copy link
Member
@pitrou pitrou commented Jun 2, 2025

Rationale for this change

When reading a Parquet LIST logical type (or a repeated field without a logical type), Parquet C++ automatically reads it as a Arrow List array.

However, this can in some cases run into the 32-bit offsets limit. We'd like to be able to choose to read as LargeList instead, even if there is no serialized Arrow schema in the Parquet file.

What changes are included in this PR?

  • Add a Parquet read option list_type to select which Arrow type to read LIST / repeated Parquet columns into
  • Fix an index truncation bug when writing a huge single chunk of data to Parquet

Are these changes tested?

Yes, the functionality is tested. However, I wasn't able to write a unit test that wouldn't consume a horrendous amount of time or memory writing/reading a list with offsets larger than 2**32.

Are there any user-facing changes?

No, only an API improvement.

@pitrou
Copy link
Member Author
pitrou commented Jun 2, 2025

@github-actions crossbow submit -g cpp

Copy link
github-actions bot commented Jun 2, 2025

Revision: 9fb50a9

Submitted crossbow builds: ursacomputing/crossbow @ actions-b3bd66f38b

Task Status
example-cpp-minimal-build-static GitHub Actions
example-cpp-minimal-build-static-system-dependency GitHub Actions
example-cpp-tutorial GitHub Actions
test-alpine-linux-cpp GitHub Actions
test-build-cpp-fuzz GitHub Actions
test-conda-cpp GitHub Actions
test-conda-cpp-valgrind GitHub Actions
test-cuda-cpp-ubuntu-22.04-cuda-11.7.1 GitHub Actions
test-debian-12-cpp-amd64 GitHub Actions
test-debian-12-cpp-i386 GitHub Actions
test-fedora-39-cpp GitHub Actions
test-ubuntu-22.04-cpp GitHub Actions
test-ubuntu-22.04-cpp-20 GitHub Actions
test-ubuntu-22.04-cpp-bundled GitHub Actions
test-ubuntu-22.04-cpp-emscripten GitHub Actions
test-ubuntu-22.04-cpp-no-threading GitHub Actions
test-ubuntu-24.04-cpp GitHub Actions
test-ubuntu-24.04-cpp-bundled-offline GitHub Actions
test-ubuntu-24.04-cpp-gcc-13-bundled GitHub Actions
test-ubuntu-24.04-cpp-gcc-14 GitHub Actions
test-ubuntu-24.04-cpp-minimal-with-formats GitHub Actions
test-ubuntu-24.04-cpp-thread-sanitizer GitHub Actions

@pitrou pitrou marked this pull request as ready for review June 2, 2025 16:27
@pitrou pitrou requested review from AlenkaF, raulcd, rok and wgtmac as code owners June 2, 2025 16:27
@pitrou
Copy link
Member Author
pitrou commented Jun 3, 2025

cc @wgtmac @mapleFU

@mapleFU
Copy link
Member
mapleFU commented Jun 3, 2025

Maybe out of topic, I just found out that arrow doesn't have a LargeMap type, lol...

Copy link
Member
@mapleFU mapleFU left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

C++ part general LGTM

out->field = ::arrow::field(group.name(), ::arrow::list(child_field->field),
group.is_optional(), FieldIdMetadata(group.field_id()));
ARROW_ASSIGN_OR_RAISE(auto list_type,
MakeArrowList(child_field->field, ctx->properties));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this just affect inferred type, and (might) be overwritten by origin storage type?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the same semantics as binary_type.

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Jun 3, 2025
// required binary str (UTF8);
// };
// }
// Special case: group is named array
Copy link
Member
@mapleFU mapleFU Jun 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Emmm this is special case in parquet list spec, why do we rename this? (4 in https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#backward-compatibility-rules )

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code is just reindented, but the diff view messes things up unfortunately.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is a diff with whitespace ignored, it will make things clearer: https://gist.github.com/pitrou/9957f54fa79a4323973f86158cb7e6fe

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ohhh, my bad

@@ -1051,6 +1053,18 @@ class PARQUET_EXPORT ArrowReaderProperties {
/// Return the Arrow binary type to read BYTE_ARRAY columns as.
::arrow::Type::type binary_type() const { return binary_type_; }

/// \brief Set the Arrow list type to read Parquet list columns as.
///
/// Allowed values are Type::LIST and Type::LARGE_LIST.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to support list view?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need to, as list view is probably less useful than binary view.
Moreover, this would need a specific implementation as the current ListReader constructs offsets explicitly.

@pitrou pitrou merged commit 03a1867 into apache:main Jun 4, 2025
45 of 47 checks passed
@pitrou pitrou removed the awaiting committer review Awaiting committer review label Jun 4, 2025
@pitrou pitrou deleted the pq-list-type branch June 4, 2025 16:30
Copy link

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 03a1867.

There were 70 benchmark results with an error:

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 52 possible false positives for unstable benchmarks that are known to sometimes produce them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants
0