-
Notifications
You must be signed in to change notification settings - Fork 3.8k
GH-46676: [C++][Python][Parquet] Allow reading Parquet LIST data as LargeList directly #46678
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@github-actions crossbow submit -g cpp |
Revision: 9fb50a9 Submitted crossbow builds: ursacomputing/crossbow @ actions-b3bd66f38b |
Maybe out of topic, I just found out that arrow doesn't have a LargeMap type, lol... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
C++ part general LGTM
out->field = ::arrow::field(group.name(), ::arrow::list(child_field->field), | ||
group.is_optional(), FieldIdMetadata(group.field_id())); | ||
ARROW_ASSIGN_OR_RAISE(auto list_type, | ||
MakeArrowList(child_field->field, ctx->properties)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this just affect inferred type, and (might) be overwritten by origin storage type?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the same semantics as binary_type
.
// required binary str (UTF8); | ||
// }; | ||
// } | ||
// Special case: group is named array |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Emmm this is special case in parquet list spec, why do we rename this? (4 in https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#backward-compatibility-rules )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code is just reindented, but the diff view messes things up unfortunately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is a diff with whitespace ignored, it will make things clearer: https://gist.github.com/pitrou/9957f54fa79a4323973f86158cb7e6fe
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ohhh, my bad
@@ -1051,6 +1053,18 @@ class PARQUET_EXPORT ArrowReaderProperties { | |||
/// Return the Arrow binary type to read BYTE_ARRAY columns as. | |||
::arrow::Type::type binary_type() const { return binary_type_; } | |||
|
|||
/// \brief Set the Arrow list type to read Parquet list columns as. | |||
/// | |||
/// Allowed values are Type::LIST and Type::LARGE_LIST. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to support list view?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need to, as list view is probably less useful than binary view.
Moreover, this would need a specific implementation as the current ListReader
constructs offsets explicitly.
After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 03a1867. There were 70 benchmark results with an error:
There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 52 possible false positives for unstable benchmarks that are known to sometimes produce them. |
Rationale for this change
When reading a Parquet LIST logical type (or a repeated field without a logical type), Parquet C++ automatically reads it as a Arrow List array.
However, this can in some cases run into the 32-bit offsets limit. We'd like to be able to choose to read as LargeList instead, even if there is no serialized Arrow schema in the Parquet file.
What changes are included in this PR?
list_type
to select which Arrow type to read LIST / repeated Parquet columns intoAre these changes tested?
Yes, the functionality is tested. However, I wasn't able to write a unit test that wouldn't consume a horrendous amount of time or memory writing/reading a list with offsets larger than 2**32.
Are there any user-facing changes?
No, only an API improvement.