HADOOP-19354. S3A: S3AInputStream to be created by factory under S3AStore #7214
Conversation
Test failure from me pushing the disk allocator down into the store and the test case not setting the store up.
Overall I like the design and refactoring.
One thought: can we keep the prefetching changes in this PR minimal, focus on the interface and ClassicInputStream, and create a separate PR for all the prefetching work?
```diff
@@ -993,7 +983,7 @@ private void initThreadPools(Configuration conf) {
     unboundedThreadPool.allowCoreThreadTimeOut(true);
     executorCapacity = intOption(conf,
         EXECUTOR_CAPACITY, DEFAULT_EXECUTOR_CAPACITY, 1);
-    if (prefetchEnabled) {
+    if (requirements.createFuturePool()) {
```
Change the name to prefetchRequirements.
There are more requirements than just prefetching, e.g. if vector IO support is needed then some extra threads are added to the pool passed down.
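A minimal sketch of that flag-driven pool sizing, under stated assumptions: every name here is illustrative, inferred from the diff and comments above rather than taken from the actual Hadoop classes.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Illustrative only: names are assumptions based on the review discussion. */
final class PoolSizing {
  /** Size the shared pool from what the chosen stream factory declares. */
  static ExecutorService createSharedPool(
      int sharedThreads, boolean vectorSupported, int vectorThreads) {
    int poolSize = sharedThreads;
    if (vectorSupported) {
      // vectored reads issue ranged GETs in parallel: add capacity for them
      poolSize += vectorThreads;
    }
    return Executors.newFixedThreadPool(poolSize);
  }
}
```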
I'm just setting this up so it is ready for the analytics stream work... making sure that prefetch is also covered is my way to validate the factory model, and that the options need to include things like asking for a shared thread pool and a stream thread pool, with the intent that analytics will use that too. And once I do that, they all need a single base stream class. For my vector IO resilience PR: once this PR is in, I'm going to go back to #7105 and make it something which works with all object input streams.
> read failure

The read failure stuff is essentially in my PR, so maybe we can rebase onto this, merge it in and then pull it up. Goal: the analytics stream gets vector IO.
```java
this.ioStatistics = streamStatistics.getIOStatistics();
this.inputPolicy = context.getInputPolicy();
streamStatistics.inputPolicySet(inputPolicy.ordinal());
this.boundedThreadPool = parameters.getBoundedThreadPool();
```
I see boundedThreadPool is used in S3AInputStream but not in S3APrefetchingInputStream; can we keep boundedThreadPool local to S3AInputStream?
Each stream can declare what it wants thread-pool wise and we will allocate those to them. If they don't want it, they don't get it.
That bounded thread pool passed down is the semaphore pool we also use in uploads. It takes a subset of the shared pool, has its own pending queue and blocks the caller thread when that pending queue is full.
If the analytics stream doesn't currently need it, don't ask for any.
But I do want the vector IO code moved out of S3AInputStream so it can work with the superclass and all streams get it. Those reads also want a bounded number of threads.
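For reference, a minimal sketch of the blocking semaphore-pool behaviour described here; the real mechanism in hadoop-common (SemaphoredDelegatingExecutor) is more elaborate, and this class and its names are illustrative only.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Semaphore;

/**
 * Illustrative sketch: take a slice of a shared executor and block the
 * submitting thread once this client's quota of pending work is used up.
 */
final class BoundedSubmitter {
  private final ExecutorService shared;
  private final Semaphore permits;

  BoundedSubmitter(ExecutorService shared, int maxPending) {
    this.shared = shared;
    this.permits = new Semaphore(maxPending);
  }

  /** Blocks the caller while maxPending tasks are already queued or running. */
  void submit(Runnable task) throws InterruptedException {
    permits.acquire();
    try {
      shared.execute(() -> {
        try {
          task.run();
        } finally {
          permits.release();
        }
      });
    } catch (RuntimeException e) {
      permits.release(); // submission failed, so hand the permit back
      throw e;
    }
  }
}
```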
```java
/**
 * A stream of data from an S3 object.
 * The blase class includes common methods, stores
```
Nit: spelling base
```java
 * This must be re-invoked after replacing the S3Client during test
 * runs.
 * <p>
 * It requires the S3Store to have been instantiated.
 * @param conf configuration.
```
@param conf is no longer required
```java
 * @param sharedThreads Number of shared threads to include in the bounded pool.
 * @param streamThreads How many threads per stream, ignoring vector IO requirements.
 * @param createFuturePool Flag to enable creation of a future pool around the bounded thread pool.
 */
```
@param vectorSupported missing
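A sketch of the missing tag, with wording inferred from the discussion above rather than taken from the patch:

```java
 * @param vectorSupported Is vectored IO supported? If so, extra threads are
 *        added to the pool passed down to the stream.
```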
```diff
@@ -845,7 +826,7 @@ private S3AFileSystemOperations createFileSystemHandler() {
   @VisibleForTesting
   protected S3AStore createS3AStore(final ClientManager clientManager,
       final int rateLimitCapacity) {
-    return new S3AStoreBuilder()
+    final S3AStore st = new S3AStoreBuilder()
```
Nit: rename the variable to a meaningful name.
@rajdchak thanks for the comments, will address. I do want to pull up the vector IO support, with integration with prefetch and caching. For the prefetch/caching stream we'd ask for the requested ranges to be split up into
It'd be good to collect stats on cache hit/miss here, to assess integration of vector reads with ranges. When a list of ranges comes down, there is less need to infer the next range and prefetch, and I'm not actually sure how important caching becomes. This is why setting parquet up to use vector IO already appears to give speedups comparable to the published analytics stream benchmarks. What I want is the best of both worlds: prefetch of rowgroups from stream inference; and when vector reads come in, satisfy those by returning current/active prefetches, or retrieve new ranges through ranged GET requests. #7105 is where that will go; I've halted that until this is in. And I'll only worry about that integration with prefetched/cached blocks for the analytics stream.
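A hedged sketch of that "best of both worlds" flow: serve each incoming range from an active prefetch when one covers it, else fall back to a ranged GET. FileRange is the real Hadoop interface; BlockCache, the hit/miss counters and asyncRangedGet() are illustrative assumptions.

```java
import java.nio.ByteBuffer;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicLong;
import org.apache.hadoop.fs.FileRange;

/** Illustrative only: BlockCache and the GET path are assumptions. */
abstract class VectoredReadSketch {
  interface BlockCache {
    /** @return cached bytes fully covering [offset, offset+length), or null. */
    ByteBuffer lookup(long offset, int length);
  }

  private final AtomicLong cacheHits = new AtomicLong();
  private final AtomicLong cacheMisses = new AtomicLong();

  /** Issue an asynchronous ranged GET for a range not in the cache. */
  protected abstract CompletableFuture<ByteBuffer> asyncRangedGet(FileRange range);

  void readVectored(List<FileRange> ranges, BlockCache cache) {
    for (FileRange range : ranges) {
      ByteBuffer cached = cache.lookup(range.getOffset(), range.getLength());
      if (cached != null) {
        cacheHits.incrementAndGet();   // satisfied from an active prefetch
        range.setData(CompletableFuture.completedFuture(cached));
      } else {
        cacheMisses.incrementAndGet(); // fall back to a ranged GET
        range.setData(asyncRangedGet(range));
      }
    }
  }
}
```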
Thanks @steveloughran, looks good to me overall. Just need to allow for the ClientManager to be passed into the factory.
InputStreamFactory can return a set of flags after initialization; these are used by S3AFileSystem to tune its own behaviour (thread pool options) *and* to disable auditor rejection of out-of-span operations.

The out-of-span change is quite complicated as there's a loop in the build: auditor -> request factory -> store -> stream factory -> requirements. To address this there's an Auditor.setAuditFlags() option now. This is not tested, though it will be once the analytics stream is wired up.

Build: it is nominally possible to select a stream factory through maven with -Dstream=prefetch. However, this isn't being picked up, as can be seen with runs of -Dstream=custom and -Dstream=unknown. These MUST fail; they currently don't, except for a few test cases. More work needed there.

Change-Id: I76dc4782fdd1850f220368e4a394e1cfbc65adb9
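A self-contained sketch of the late-binding fix that commit message describes: apart from setAuditFlags() and the PermitOutOfBandOperations flag (both named in this PR), every type and method here is an illustrative assumption.

```java
import java.util.EnumSet;

/** Illustrative sketch; only setAuditFlags() and the flag name come from the PR. */
final class AuditorWiring {
  enum AuditorFlags { PermitOutOfBandOperations }

  interface AuditManager {
    void setAuditFlags(EnumSet<AuditorFlags> flags);
  }

  interface StreamFactoryRequirements {
    boolean permitsOutOfBandOperations();
  }

  /** Push the factory's requirements back into the already-built auditor. */
  static void wireUp(AuditManager auditManager, StreamFactoryRequirements requirements) {
    EnumSet<AuditorFlags> flags = EnumSet.noneOf(AuditorFlags.class);
    if (requirements.permitsOutOfBandOperations()) {
      // e.g. a stream library issuing its own async GETs outside any audit span
      flags.add(AuditorFlags.PermitOutOfBandOperations);
    }
    // applied after construction, breaking the cycle:
    // auditor -> request factory -> store -> stream factory -> requirements
    auditManager.setAuditFlags(flags);
  }
}
```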
```java
// do not validate() the parameters as the store
// completes this.
ObjectReadParameters parameters = new ObjectReadParameters()
```
@steveloughran just realised: in our internal integration, we used to call s3SeekableInputStreamFactory.createStream() before the extractOrFetchSimpleFileStatus() call in this executeOpen() method.
AAL has a metadata cache, so this ensured we didn't make repeated HEADs for the same key. That matters (though I'm not sure what the perf impact is) because Spark opens the same file multiple times in a task: once to read the footer, and then to read the column data. So S3A default currently does at least 2 HEADs per file.
Now that the stream initialisation happens after extractOrFetchSimpleFileStatus(), S3A does the HEAD even though it's not required, as the metadata is already in the AAL cache.
We should discuss what we can do here (maybe wire up S3A to AAL's metadata cache regardless of the stream it's using?), and do it as a follow up.
Ooh, wiring up to history is good. But does it have an expiry? Can we turn it off? I ask as caches can be their own source of pain, and for other use cases they do cause problems.
If you look at how parquet and iceberg open files, they do have the file status first, so we just need to wire up passing down that FileStatus, along with file type and, if known, footer location.
Parquet does now pass down its status, so the HEAD is skipped.
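For illustration, this is the standard FileSystem.openFile() builder pattern being referred to; the read-policy option string is an assumption about what a Parquet-style caller might set.

```java
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.util.functional.FutureIO;

final class OpenWithStatus {
  /** A caller already holding a FileStatus passes it down; S3A can then skip its HEAD. */
  static FSDataInputStream open(FileSystem fs, Path path, FileStatus status)
      throws Exception {
    return FutureIO.awaitFuture(
        fs.openFile(path)
            .withFileStatus(status)   // the HEAD probe is skipped
            .opt("fs.option.openfile.read.policy", "parquet, vector, random")
            .build());
  }
}
```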
Most of this PR is trying to improve debugging of the auditor invocation plane, on the assumption that those flags being passed down from the factory were causing problems. None of those changes did any good, though they did marginally improve debugging.

The actual problem was the ordering of component startup during FS init: the Auditor must be live before the AWS client is initialized. Moved back to the right place and improved documentation.

Also: added a test to verify that setting flags would disable the span checks, which is what we now require.

Change-Id: I108116f0775b71b1cf1c9a2bd5c95727f24f37bb
* Review and expand docs.
* Add javadocs on getters/setters where they were missing.

Change-Id: I6f2dbb6326f79ed9187418a89ca9d6a8d2f76a2a
Change-Id: I6f2b74e0e79e03d03af9cd33076ea6b782a84e4c
This is ready for merge.
Thanks @steveloughran. +1, LGTM
(pending yetus javadoc fixes)
```java
  flags.add(AuditorFlags.PermitOutOfBandOperations);
}
getAuditManager().setAuditFlags(flags);
// get the vector IO context from the factory.o
```
nit: typo "factory.o"
```xml
<property>
  <name>fs.s3a.input.stream.type</name>
  <value>default</value>
```
typo: should be "analytics"
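The corrected snippet, per this comment and the supported values listed in the final commit message:

```xml
<property>
  <name>fs.s3a.input.stream.type</name>
  <value>analytics</value>
</property>
```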
```diff
@@ -68,7 +69,7 @@ public static AuditManagerS3A createAndStartAuditManager(
       auditManager = new ActiveAuditManagerS3A(
           requireNonNull(iostatistics));
     } else {
-      LOG.debug("auditing is disabled");
```
why remove the word auditing?
good q. don't remember. will revert.
I have been looking at this in the HADOOP-19348 (Integrate analytics accelerator) PR. I don't have any concerns about merging once yetus is successful.
looks good, +1
Change-Id: I71e27d699ace9e63ad13245913816e4f071cd657
Change-Id: I37f175a716859e2d5ab53b7ff9ea60232280cc9a
HADOOP-19354. S3A: S3AInputStream to be created by factory under S3AStore (#7214)

S3 InputStreams are created by a factory class, with the choice of factory dynamically chosen by the option fs.s3a.input.stream.type

Supported values: classic, prefetching, analytics, custom.

Contributed by Steve Loughran

Change-Id: I85a039e798e24a72ee7b4902e4ff08a5d53ffd10
HADOOP-19354

How was this patch tested?
S3 London.

For code changes: any updates needed to the LICENSE, LICENSE-binary, NOTICE-binary files? TODO
VectoredIOContext
VectoredIOContext.build() to freeze setters; add a copy() method to copy it, which is then used to create the copy passed down to streams (via a private constructor which returns a mutable version).
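A minimal sketch of that freeze/copy pattern with a single illustrative field; the real VectoredIOContext has more options, and the constructor arrangement here is simplified.

```java
/** Illustrative sketch of build()-to-freeze plus copy()-returns-mutable. */
final class VectoredIOContextSketch {
  private int minSeekForVectorReads;
  private boolean built;

  VectoredIOContextSketch setMinSeekForVectorReads(int value) {
    if (built) {
      throw new IllegalStateException("already built");
    }
    this.minSeekForVectorReads = value;
    return this;
  }

  int getMinSeekForVectorReads() {
    return minSeekForVectorReads;
  }

  /** Freeze: after this, all setters fail. */
  VectoredIOContextSketch build() {
    built = true;
    return this;
  }

  /** Return a mutable copy to pass down to a stream. */
  VectoredIOContextSketch copy() {
    return new VectoredIOContextSketch()
        .setMinSeekForVectorReads(minSeekForVectorReads);
  }
}
```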
Stream capabilities
[ ] doc
[ ] add unit and ITests through the FS.
[ ] storediag
[ ] bucket-info
IOStats
[ ] thread stats context to be saved in ObjectInputStream
Testing
[ ] The huge file tests should be tuned so each of the different ones always uses a different stream.
[ ] Use -Dstream="factory name" to choose the factory, rather than -Dprefetch.
[ ] If not set, whatever is in auth-keys gets picked up.
[ ] ConfigurationHelper.resolveEnum() tests.
[ ] Vector IO context unit tests for the prefetch stream type.
Docs
[ ] stream leaks
[ ] thread IOStats/context resetting
open issues
ITestS3AOpenCost#prefetching probe