feat(integration-tests): package integration test jobs (do not review) by quinntaylormitchell · Pull Request #1766 · y-scope/clp · GitHub

Conversation

@quinntaylormitchell
Collaborator
@quinntaylormitchell quinntaylormitchell commented Dec 10, 2025

Description

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change that has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

Summary by CodeRabbit

  • New Features

    • Added support for Presto-based query operations in integration tests
    • Introduced expanded test data including multiple compression job configurations and post-compression workflows
    • Enhanced test infrastructure with job filtering, port assignment, and cluster management capabilities
  • Tests

    • Expanded test data fixtures across multiple mission scenarios and log formats
    • Added comprehensive ground-truth validation for filtering and query operations
    • Implemented test job dispatch and execution framework

✏️ Tip: You can customize this high-level summary in your review settings.

@quinntaylormitchell quinntaylormitchell requested a review from a team as a code owner December 10, 2025 20:00
@quinntaylormitchell quinntaylormitchell marked this pull request as draft December 10, 2025 20:00
@coderabbitai
Contributor
coderabbitai bot commented Dec 10, 2025

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

This pull request introduces comprehensive integration test infrastructure for CLP package testing, including test data directories with ground-truth datasets, job configuration definitions for compression and post-compression workflows, Presto cluster support, fixture enhancements for dynamic port assignment and configuration management, and utility functions for test job orchestration and validation.
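The option-registration portion of this infrastructure (the --job-name-contains and --no-jobs flags and the BASE_PORT ini value mentioned in the changes below) might look roughly like the following conftest.py sketch. This is an illustration inferred from the summary, not the PR's actual code; the real hook may register different defaults or help text.

```python
def pytest_addoption(parser) -> None:
    """Hypothetical sketch of the conftest.py hook described in this PR."""
    # Ini value consumed by the port-assignment fixture.
    parser.addini("BASE_PORT", help="First port to try when assigning component ports.")
    # Keyword filter: only dispatch jobs whose names contain this substring.
    parser.addoption(
        "--job-name-contains",
        action="store",
        default=None,
        help="Only run test jobs whose names contain this substring.",
    )
    # Escape hatch: start the package but skip job dispatch entirely.
    parser.addoption(
        "--no-jobs",
        action="store_true",
        default=False,
        help="Skip dispatching test jobs after the package starts.",
    )
```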

Changes

Cohort / File(s) Summary
Pytest Configuration
.pytest.ini, integration-tests/tests/conftest.py
Adds environment variables for test paths (CLP_REPO_DIR, TEST_DATA_DIR, TEST_JOBS_DIR, TEST_LOGS_DIR), defines BASE_PORT marker, and introduces pytest_addoption hook registering BASE_PORT ini value and command-line options --job-name-contains and --no-jobs.
Test Data Module
integration-tests/tests/data/__init__.py
Introduces data package module with docstring for static CLP integration test data storage.
Ground-Truth Test Data
integration-tests/tests/data/ground-truths/filter-*.jsonl
Adds ground-truth JSONL files for PostgreSQL describe and select-distinct filter validation.
Compression Job Definitions
integration-tests/tests/data/jobs/compression/__init__.py, integration-tests/tests/data/jobs/compression/compression_jobs.py
Defines compression package module and PACKAGE_COMPRESSION_JOBS dictionary with three compression job configurations (json-singlefile-1, text-singlefile, json-singlefile-2).
Post-Compression Job Definitions
integration-tests/tests/data/jobs/post_compression/__init__.py, integration-tests/tests/data/jobs/post_compression/admin_tools_jobs.py, integration-tests/tests/data/jobs/post_compression/search_jobs.py
Introduces post-compression jobs package with empty PACKAGE_ADMIN_TOOLS_JOBS and PACKAGE_SEARCH_JOBS registries.
Presto Filter Job Definitions
integration-tests/tests/data/jobs/presto/__init__.py, integration-tests/tests/data/jobs/presto/presto_filters.py
Establishes Presto jobs package with empty PRESTO_FILTER_JOBS registry.
JSON Log Test Data
integration-tests/tests/data/logs/json-multifile/..., integration-tests/tests/data/logs/json-singlefile-1/..., integration-tests/tests/data/logs/json-singlefile-2/...
Adds JSON log datasets with READMEs and JSONL files: Artemis II (150 events), New Horizons (150 events), and STS-135 multi-file logs (5 files × 8 events).
Text Log Test Data
integration-tests/tests/data/logs/text-multifile/..., integration-tests/tests/data/logs/text-singlefile/...
Introduces text log datasets including Apollo-17 mission days 1–13 (13 files × 15 events each) and Voyager-1 mission log.
Test Fixtures
integration-tests/tests/fixtures/integration_test_logs.py, integration-tests/tests/fixtures/package_config.py, integration-tests/tests/fixtures/package_instance.py
Adds spark_event_logs fixture; extends fixt_package_config with port assignment, job list construction via build_package_job_list, and directory cleanup; modifies fixt_package_instance to accept request fixture and conditionally start Presto cluster.
Test Module
integration-tests/tests/test_package_start.py
Enhances test with validation calls (validate_running_mode_correct, validate_presto_running) and conditional job dispatch via dispatch_test_jobs.
Validation Utilities
integration-tests/tests/utils/asserting_utils.py
Adds validate_running_mode_correct() and validate_presto_running() functions to verify CLP mode configuration and Presto container status.
Job Dispatch Logic
integration-tests/tests/utils/clp_job_utils.py
Introduces build_package_job_list() to construct job lists by mode and keyword filter, and dispatch_test_jobs() to orchestrate compression → post-compression → presto-filter execution with archive cleanup.
Mode Utilities
integration-tests/tests/utils/clp_mode_utils.py
Adds compute_mode_signature() function, introduces CLP_REDUCER_COMPONENT and CLP_PRESTO_COMPONENTS constants, defines "clp-json-presto" mode configuration with Presto settings, and refactors component list initialization.
Configuration Models
integration-tests/tests/utils/config.py
Introduces four new dataclasses (PackageCompressionJob, PackagePostCompressionJob, PrestoFilterJob, PackageJobList); extends PackagePathConfig with package_decompression_dir and script path properties; extends PackageConfig and PackageInstance with job list and shared config path fields.
Docker Utilities
integration-tests/tests/utils/docker_utils.py
Adds list_running_containers_with_prefix() function to filter running containers by name prefix using regex.
Package Utilities
integration-tests/tests/utils/package_utils.py
Adds cluster management (start_presto_cluster, stop_presto_cluster), compression/post-compression execution (run_package_compression_script, run_package_script), Presto filtering (run_presto_filter), and ground-truth comparison helpers.
Port Assignment
integration-tests/tests/utils/port_utils.py
Introduces PortAssignment dataclass and assign_ports_from_base() function to discover port requirements, validate availability, and assign sequential ports to CLP components.
Utility Helpers
integration-tests/tests/utils/utils.py
Adds load_yaml_to_dict() function for YAML file parsing with validation.
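The port-assignment approach that the port_utils.py summary describes (discover requirements, validate availability, assign sequential ports from BASE_PORT) can be sketched as follows. The dataclass fields and function signature here are placeholders inferred from the summary, and the socket-binding probe is one common way to check availability; the PR's actual implementation may differ.

```python
import socket
from dataclasses import dataclass


@dataclass
class PortAssignment:
    """Maps a component name to its assigned port (illustrative fields)."""
    component: str
    port: int


def _port_is_free(port: int) -> bool:
    # Binding succeeds only if nothing else is already listening on the port.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            sock.bind(("127.0.0.1", port))
            return True
        except OSError:
            return False


def assign_ports_from_base(base_port: int, components: list[str]) -> list[PortAssignment]:
    """Assign the first free ports at or above base_port, one per component."""
    assignments = []
    candidate = base_port
    for component in components:
        while not _port_is_free(candidate):
            candidate += 1  # skip occupied ports
        assignments.append(PortAssignment(component, candidate))
        candidate += 1
    return assignments
```

One review concern called out below (platform compatibility of the socket-binding approach) applies directly to the probe shown here: binding to 127.0.0.1 does not guarantee the port is free on all interfaces.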

Sequence Diagram(s)

sequenceDiagram
    actor Test
    participant JobDispatch as Job Dispatch Logic
    participant CompressionJob as Compression Job
    participant Archive as CLP Archive
    participant PostCompJob as Post-Compression Job
    participant GroundTruth as Ground Truth Validator
    participant PrestoFilter as Presto Filter Job

    Test->>JobDispatch: dispatch_test_jobs(package_instance)
    JobDispatch->>JobDispatch: Clean initial archive directory
    
    loop For each compression job
        JobDispatch->>CompressionJob: run_package_compression_script()
        CompressionJob->>Archive: Compress logs into archive
        Archive-->>CompressionJob: Archive created
        
        loop For each dependent post-compression job
            JobDispatch->>PostCompJob: run_package_script()
            PostCompJob->>Archive: Extract/query archive
            Archive-->>PostCompJob: Output generated
            PostCompJob->>GroundTruth: _compare_output_with_ground_truth()
            GroundTruth-->>PostCompJob: Validation result
        end
        
        loop For each dependent presto-filter job
            JobDispatch->>PrestoFilter: run_presto_filter()
            PrestoFilter->>Archive: Execute Presto filter query
            Archive-->>PrestoFilter: Query results
            PrestoFilter->>GroundTruth: Validate against ground truth
            GroundTruth-->>PrestoFilter: Validation result
        end
        
        JobDispatch->>Archive: Clean up archive directory
    end
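The three-stage loop in the diagram above can be condensed into the following sketch. The function and field names are placeholders inferred from the summary (the real dispatch_test_jobs in clp_job_utils.py takes a package instance, not callables), but the control flow matches: clean the archive, then for each compression job run its dependent post-compression and presto-filter jobs before cleaning up again.

```python
def dispatch_test_jobs(compression_jobs, run_compression, run_post_compression,
                       run_presto_filter, clean_archive_dir):
    """Illustrative three-stage dispatch: compression -> post-compression -> presto-filter."""
    clean_archive_dir()  # start from an empty archive directory
    for job in compression_jobs:
        run_compression(job)  # compress logs into a fresh archive
        for post_job in job.get("post_compression_jobs", []):
            # extract/query the archive; output is compared against ground truth
            run_post_compression(post_job)
        for filter_job in job.get("presto_filter_jobs", []):
            run_presto_filter(filter_job)
        clean_archive_dir()  # reset before the next compression job
```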
sequenceDiagram
    actor Test
    participant Fixture as Package Fixture
    participant PortUtils as Port Assignment
    participant PackageConfig as PackageConfig
    participant PrestoCluster as Presto Cluster
    participant CLP as CLP Package

    Test->>Fixture: fixt_package_config (setup)
    Fixture->>PortUtils: assign_ports_from_base(BASE_PORT, clp_config)
    PortUtils->>PortUtils: Discover port requirements
    PortUtils->>PortUtils: Validate ports are free
    PortUtils->>PackageConfig: Assign ports to components
    PackageConfig-->>Fixture: Configuration ready
    
    Fixture->>Fixture: build_package_job_list(mode, filter)
    Fixture->>PackageConfig: Set package_job_list
    PackageConfig-->>Fixture: Config with jobs
    
    Fixture->>Fixture: yield config (fixture ready)
    
    Test->>Fixture: fixt_package_instance (setup)
    alt mode is "clp-json-presto"
        Fixture->>PrestoCluster: start_presto_cluster(package_config)
        PrestoCluster->>PrestoCluster: Construct split-filter mappings
        PrestoCluster->>PrestoCluster: Generate split-filter.json
        PrestoCluster->>CLP: Start Presto services
        CLP-->>PrestoCluster: Services running
    end
    
    Fixture->>Fixture: yield instance (fixture ready)
    
    Test->>Test: Execute test_clp_package()
    
    Fixture->>Fixture: fixt_package_instance (teardown)
    alt mode is "clp-json-presto"
        Fixture->>PrestoCluster: stop_presto_cluster()
        PrestoCluster->>CLP: Stop Presto services
        PrestoCluster->>PrestoCluster: Reset split-filter.json
    end
    Fixture-->>Test: Cleanup complete

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Areas requiring extra attention during review:

  • integration-tests/tests/utils/clp_job_utils.py — Core orchestration logic with three-stage job dispatch (compression → post-compression → presto-filter); cross-linking of job dependencies and archive cleanup sequencing requires careful validation
  • integration-tests/tests/utils/package_utils.py — Multiple public functions managing Presto cluster lifecycle, split-filter JSON generation, and ground-truth comparison; ensure error handling and cleanup paths are robust
  • integration-tests/tests/utils/port_utils.py — Port discovery, availability checking, and validation logic; verify range bounds and socket-binding approach are platform-compatible
  • integration-tests/tests/utils/config.py — Four new dataclasses and extensions to existing ones; validate field names, type hints, and __post_init__ logic in PackageInstance for shared config path resolution
  • integration-tests/tests/fixtures/package_config.py and integration-tests/tests/fixtures/package_instance.py — Fixture interdependencies, configuration passing, and conditional Presto setup; ensure NO_JOBS and JOB_NAME_CONTAINS filtering works correctly
  • integration-tests/tests/utils/clp_mode_utils.py — Addition of "clp-json-presto" mode and CLP_PRESTO_COMPONENTS; verify compatibility with existing mode logic and component list construction
  • integration-tests/tests/test_package_start.py — New validation function calls and conditional job dispatch; ensure test expectations align with actual mode and Presto availability

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly describes the main change as packaging integration test jobs, which aligns with the extensive test infrastructure additions throughout the PR.
Docstring Coverage ✅ Passed Docstring coverage is 90.74% which is sufficient. The required threshold is 80.00%.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor
@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 14

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 39398e6 and f737e83.

📒 Files selected for processing (50)
  • integration-tests/.pytest.ini (2 hunks)
  • integration-tests/tests/conftest.py (1 hunks)
  • integration-tests/tests/data/__init__.py (1 hunks)
  • integration-tests/tests/data/ground-truths/filter-describe-postgresql.jsonl (1 hunks)
  • integration-tests/tests/data/ground-truths/filter-select-distinct-postgresql.jsonl (1 hunks)
  • integration-tests/tests/data/jobs/compression/__init__.py (1 hunks)
  • integration-tests/tests/data/jobs/compression/compression_jobs.py (1 hunks)
  • integration-tests/tests/data/jobs/post_compression/__init__.py (1 hunks)
  • integration-tests/tests/data/jobs/post_compression/admin_tools_jobs.py (1 hunks)
  • integration-tests/tests/data/jobs/post_compression/search_jobs.py (1 hunks)
  • integration-tests/tests/data/jobs/presto/__init__.py (1 hunks)
  • integration-tests/tests/data/jobs/presto/presto_filters.py (1 hunks)
  • integration-tests/tests/data/logs/json-multifile/README.md (1 hunks)
  • integration-tests/tests/data/logs/json-multifile/logs/sts-135-2011-07-08.jsonl (1 hunks)
  • integration-tests/tests/data/logs/json-multifile/logs/sts-135-2011-07-09.jsonl (1 hunks)
  • integration-tests/tests/data/logs/json-multifile/logs/sts-135-2011-07-11.jsonl (1 hunks)
  • integration-tests/tests/data/logs/json-multifile/logs/sts-135-2011-07-19.jsonl (1 hunks)
  • integration-tests/tests/data/logs/json-multifile/logs/sts-135-2011-07-21.jsonl (1 hunks)
  • integration-tests/tests/data/logs/json-singlefile-1/README.md (1 hunks)
  • integration-tests/tests/data/logs/json-singlefile-1/logs/artemis_ii.jsonl (1 hunks)
  • integration-tests/tests/data/logs/json-singlefile-2/README.md (1 hunks)
  • integration-tests/tests/data/logs/json-singlefile-2/logs/new-horizons.jsonl (1 hunks)
  • integration-tests/tests/data/logs/text-multifile/README.md (1 hunks)
  • integration-tests/tests/data/logs/text-multifile/logs/apollo-17_day01.txt (1 hunks)
  • integration-tests/tests/data/logs/text-multifile/logs/apollo-17_day02.txt (1 hunks)
  • integration-tests/tests/data/logs/text-multifile/logs/apollo-17_day03.txt (1 hunks)
  • integration-tests/tests/data/logs/text-multifile/logs/apollo-17_day04.txt (1 hunks)
  • integration-tests/tests/data/logs/text-multifile/logs/apollo-17_day05.txt (1 hunks)
  • integration-tests/tests/data/logs/text-multifile/logs/apollo-17_day06.txt (1 hunks)
  • integration-tests/tests/data/logs/text-multifile/logs/apollo-17_day07.txt (1 hunks)
  • integration-tests/tests/data/logs/text-multifile/logs/apollo-17_day08.txt (1 hunks)
  • integration-tests/tests/data/logs/text-multifile/logs/apollo-17_day09.txt (1 hunks)
  • integration-tests/tests/data/logs/text-multifile/logs/apollo-17_day10.txt (1 hunks)
  • integration-tests/tests/data/logs/text-multifile/logs/apollo-17_day11.txt (1 hunks)
  • integration-tests/tests/data/logs/text-multifile/logs/apollo-17_day12.txt (1 hunks)
  • integration-tests/tests/data/logs/text-multifile/logs/apollo-17_day13.txt (1 hunks)
  • integration-tests/tests/data/logs/text-singlefile/README.md (1 hunks)
  • integration-tests/tests/data/logs/text-singlefile/logs/voyager-1.txt (1 hunks)
  • integration-tests/tests/fixtures/integration_test_logs.py (1 hunks)
  • integration-tests/tests/fixtures/package_config.py (2 hunks)
  • integration-tests/tests/fixtures/package_instance.py (2 hunks)
  • integration-tests/tests/test_package_start.py (2 hunks)
  • integration-tests/tests/utils/asserting_utils.py (2 hunks)
  • integration-tests/tests/utils/clp_job_utils.py (1 hunks)
  • integration-tests/tests/utils/clp_mode_utils.py (5 hunks)
  • integration-tests/tests/utils/config.py (7 hunks)
  • integration-tests/tests/utils/docker_utils.py (2 hunks)
  • integration-tests/tests/utils/package_utils.py (3 hunks)
  • integration-tests/tests/utils/port_utils.py (1 hunks)
  • integration-tests/tests/utils/utils.py (2 hunks)
🧰 Additional context used
🧠 Learnings (13)
📓 Common learnings
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1100
File: integration-tests/tests/fixtures/integration_test_logs.py:54-56
Timestamp: 2025-08-17T16:10:38.722Z
Learning: For PR #1100 (feat(integration-tests): Add CLP package integration tests boilerplate), do not raise cache weakness problems related to the pytest cache implementation in the integration test logs fixtures.
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1100
File: integration-tests/pyproject.toml:11-13
Timestamp: 2025-08-16T10:24:29.316Z
Learning: In the CLP project's integration-tests directory, the integration tests package is not intended to be shipped or distributed - only the tests are leveraged for local testing purposes, so console script entry points should not be included in pyproject.toml.
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1156
File: components/core/CMakeLists.txt:772-772
Timestamp: 2025-08-09T04:07:27.083Z
Learning: In the CLP project's CMakeLists.txt, when reviewing changes related to the ${zstd_TARGET} variable usage in test linking, the team is planning a refactoring PR to improve this mechanism. Guards for undefined target variables should be deferred to that separate PR rather than being added in focused dependency migration PRs.
Learnt from: quinntaylormitchell
Repo: y-scope/clp PR: 1125
File: components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py:267-291
Timestamp: 2025-09-15T22:20:40.750Z
Learning: For CLP compression jobs, the team has decided to fail the entire job immediately upon encountering any invalid input path, rather than continuing to process valid paths. This decision was made during PR #1125 development.
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.
📚 Learning: 2025-09-28T15:00:22.170Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 1340
File: components/job-orchestration/job_orchestration/executor/compress/compression_task.py:528-528
Timestamp: 2025-09-28T15:00:22.170Z
Learning: In components/job-orchestration/job_orchestration/executor/compress/compression_task.py, there is a suggestion to refactor from passing logger as a parameter through multiple functions to creating a ClpCompressor class that takes the logger as a class member, with current helper functions becoming private member functions.

Applied to files:

  • integration-tests/tests/data/jobs/compression/__init__.py
  • integration-tests/tests/data/jobs/compression/compression_jobs.py
  • integration-tests/tests/utils/clp_job_utils.py
📚 Learning: 2025-01-16T16:58:43.190Z
Learnt from: haiqi96
Repo: y-scope/clp PR: 651
File: components/clp-package-utils/clp_package_utils/scripts/compress.py:0-0
Timestamp: 2025-01-16T16:58:43.190Z
Learning: In the clp-package compression flow, path validation and error handling is performed at the scheduler level rather than in the compress.py script to maintain simplicity and avoid code duplication.

Applied to files:

  • integration-tests/tests/data/jobs/compression/compression_jobs.py
📚 Learning: 2025-08-20T08:36:38.391Z
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1100
File: integration-tests/tests/__init__.py:1-1
Timestamp: 2025-08-20T08:36:38.391Z
Learning: In the CLP project, __init__.py files should include short module-level docstrings describing the package purpose, as required by their ruff linting configuration, rather than being left empty.

Applied to files:

  • integration-tests/tests/data/__init__.py
  • integration-tests/tests/data/jobs/presto/__init__.py
📚 Learning: 2025-08-16T10:24:29.316Z
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1100
File: integration-tests/pyproject.toml:11-13
Timestamp: 2025-08-16T10:24:29.316Z
Learning: In the CLP project's integration-tests directory, the integration tests package is not intended to be shipped or distributed - only the tests are leveraged for local testing purposes, so console script entry points should not be included in pyproject.toml.

Applied to files:

  • integration-tests/tests/data/__init__.py
  • integration-tests/.pytest.ini
  • integration-tests/tests/utils/clp_job_utils.py
  • integration-tests/tests/utils/package_utils.py
  • integration-tests/tests/utils/config.py
📚 Learning: 2025-08-17T16:10:38.722Z
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1100
File: integration-tests/tests/fixtures/integration_test_logs.py:54-56
Timestamp: 2025-08-17T16:10:38.722Z
Learning: For PR #1100 (feat(integration-tests): Add CLP package integration tests boilerplate), do not raise cache weakness problems related to the pytest cache implementation in the integration test logs fixtures.

Applied to files:

  • integration-tests/tests/data/__init__.py
  • integration-tests/.pytest.ini
  • integration-tests/tests/fixtures/integration_test_logs.py
  • integration-tests/tests/utils/asserting_utils.py
  • integration-tests/tests/fixtures/package_instance.py
  • integration-tests/tests/fixtures/package_config.py
  • integration-tests/tests/test_package_start.py
  • integration-tests/tests/utils/package_utils.py
📚 Learning: 2025-08-08T21:15:10.905Z
Learnt from: haiqi96
Repo: y-scope/clp PR: 1100
File: integration-tests/tests/test_identity_transformation.py:87-95
Timestamp: 2025-08-08T21:15:10.905Z
Learning: In the CLP project's integration tests (Python code), variable names should use "logs" (plural) rather than "log" (singular) when referring to test logs or log-related entities, as this aligns with the naming conventions used throughout the codebase.

Applied to files:

  • integration-tests/tests/data/__init__.py
📚 Learning: 2025-07-28T08:33:57.487Z
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1100
File: integration-tests/tests/test_identity_transformation.py:133-137
Timestamp: 2025-07-28T08:33:57.487Z
Learning: In CLP integration tests, the `run_and_assert` utility function should only be used for commands that test CLP package functionality (like clp/clp-s compression/decompression operations). For processing or helper commands (like jq, sort, etc.), direct `subprocess.run` calls should be used instead, as they serve different purposes and don't need the same error handling patterns.

Applied to files:

  • integration-tests/tests/data/__init__.py
  • integration-tests/tests/utils/clp_job_utils.py
  • integration-tests/tests/utils/package_utils.py
📚 Learning: 2025-08-18T05:43:07.868Z
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1184
File: components/core/cmake/Modules/FindLibLZMA.cmake:21-24
Timestamp: 2025-08-18T05:43:07.868Z
Learning: In the CLP project, all supplied `<lib>_ROOT` variables will live within the `CLP_CORE_DEPS_DIR` as part of their architectural design. This means that using CLP_CORE_DEPS_DIR for library discovery in custom Find modules is the intended approach, and prioritizing individual `<lib>_ROOT` variables over CLP_CORE_DEPS_DIR is not necessary.

Applied to files:

  • integration-tests/.pytest.ini
📚 Learning: 2025-05-06T10:07:04.654Z
Learnt from: kirkrodrigues
Repo: y-scope/clp PR: 881
File: components/core/tools/scripts/lib_install/macos/install-all.sh:11-12
Timestamp: 2025-05-06T10:07:04.654Z
Learning: In CLP installation scripts, temporary directories with downloaded files should not be automatically cleaned up on failure (e.g., with EXIT traps) to preserve artifacts for debugging purposes.

Applied to files:

  • integration-tests/tests/fixtures/package_config.py
📚 Learning: 2025-10-27T07:07:37.901Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1501
File: tools/deployment/presto-clp/scripts/init.py:10-13
Timestamp: 2025-10-27T07:07:37.901Z
Learning: In `tools/deployment/presto-clp/scripts/init.py`, the `DATABASE_COMPONENT_NAME` and `DATABASE_DEFAULT_PORT` constants are intentionally duplicated from `clp_py_utils.clp_config` because `clp_py_utils` is not installed in the Presto init script's runtime environment. The two flows are separate and this duplication is documented. There are plans to merge these flows after a future release.

Applied to files:

  • integration-tests/tests/utils/clp_mode_utils.py
📚 Learning: 2025-11-10T05:19:56.600Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1575
File: components/clp-py-utils/clp_py_utils/clp_config.py:602-607
Timestamp: 2025-11-10T05:19:56.600Z
Learning: In the y-scope/clp repository, the `ApiServer` class in `components/clp-py-utils/clp_py_utils/clp_config.py` does not need a `transform_for_container()` method because no other containerized service depends on the API server - it's only accessed from the host, so no docker-network communication is expected.

Applied to files:

  • integration-tests/tests/utils/clp_mode_utils.py
📚 Learning: 2025-08-25T16:27:50.549Z
Learnt from: davemarco
Repo: y-scope/clp PR: 1198
File: components/webui/server/src/plugins/app/Presto.ts:38-43
Timestamp: 2025-08-25T16:27:50.549Z
Learning: In the CLP webui Presto configuration, host and port are set via package settings (configurable), while user, catalog, and schema are set via environment variables (environment-specific). This mixed approach is intentional - settings are typically set by package and some values don't need to be package-configurable.

Applied to files:

  • integration-tests/tests/utils/clp_mode_utils.py
🧬 Code graph analysis (8)
integration-tests/tests/data/jobs/post_compression/search_jobs.py (1)
integration-tests/tests/utils/config.py (1)
  • PackagePostCompressionJob (181-203)
integration-tests/tests/data/jobs/compression/compression_jobs.py (1)
integration-tests/tests/utils/config.py (1)
  • PackageCompressionJob (157-177)
integration-tests/tests/utils/docker_utils.py (1)
integration-tests/tests/utils/utils.py (1)
  • get_binary_path (13-23)
integration-tests/tests/fixtures/integration_test_logs.py (2)
integration-tests/tests/fixtures/path_configs.py (1)
  • integration_test_path_config (20-24)
integration-tests/tests/utils/config.py (2)
  • IntegrationTestPathConfig (330-350)
  • IntegrationTestLogs (354-378)
integration-tests/tests/data/jobs/presto/presto_filters.py (1)
integration-tests/tests/utils/config.py (1)
  • PrestoFilterJob (208-221)
integration-tests/tests/fixtures/package_instance.py (2)
integration-tests/tests/utils/package_utils.py (2)
  • start_presto_cluster (91-167)
  • stop_presto_cluster (190-217)
integration-tests/tests/fixtures/package_config.py (1)
  • fixt_package_config (30-82)
integration-tests/tests/data/jobs/post_compression/admin_tools_jobs.py (1)
integration-tests/tests/utils/config.py (1)
  • PackagePostCompressionJob (181-203)
integration-tests/tests/test_package_start.py (4)
integration-tests/tests/utils/asserting_utils.py (3)
  • validate_presto_running (100-115)
  • validate_running_mode_correct (77-97)
  • validate_package_running (46-74)
integration-tests/tests/utils/clp_job_utils.py (1)
  • dispatch_test_jobs (141-181)
integration-tests/tests/utils/config.py (1)
  • PackageInstance (274-326)
integration-tests/tests/fixtures/package_instance.py (1)
  • fixt_package_instance (23-61)
🪛 LanguageTool
integration-tests/tests/data/logs/text-multifile/logs/apollo-17_day10.txt

[uncategorized] ~4-~4: This expression is usually spelled with a hyphen.
Context: ... navigation sightings compare well with ground based tracking solutions; all primary systems...

(BASED_HYPHEN)

integration-tests/tests/data/logs/text-multifile/logs/apollo-17_day03.txt

[style] ~10-~10: ‘in brief’ might be wordy. Consider a shorter alternative.
Context: ...-17 d03 e009 attitude control jets fire in brief pulses to maintain thermal attitudes; a...

(EN_WORDINESS_PREMIUM_IN_BRIEF)

integration-tests/tests/data/logs/text-singlefile/logs/voyager-1.txt

[uncategorized] ~18-~18: If this is a compound adjective that modifies the following noun, use a hyphen.
Context: ...ay detector records gradual increase in high energy particles as solar influence weakens wi...

(EN_COMPOUND_ADJECTIVE_INTERNAL)


[uncategorized] ~22-~22: If this is a compound adjective that modifies the following noun, use a hyphen.
Context: ...ay detector records gradual increase in high energy particles as solar influence weakens wi...

(EN_COMPOUND_ADJECTIVE_INTERNAL)


[uncategorized] ~26-~26: If this is a compound adjective that modifies the following noun, use a hyphen.
Context: ...ay detector records gradual increase in high energy particles as solar influence weakens wi...

(EN_COMPOUND_ADJECTIVE_INTERNAL)


[uncategorized] ~30-~30: If this is a compound adjective that modifies the following noun, use a hyphen.
Context: ...ay detector records gradual increase in high energy particles as solar influence weakens wi...

(EN_COMPOUND_ADJECTIVE_INTERNAL)


[uncategorized] ~34-~34: If this is a compound adjective that modifies the following noun, use a hyphen.
Context: ...ay detector records gradual increase in high energy particles as solar influence weakens wi...

(EN_COMPOUND_ADJECTIVE_INTERNAL)


[uncategorized] ~38-~38: If this is a compound adjective that modifies the following noun, use a hyphen.
Context: ...ay detector records gradual increase in high energy particles as solar influence weakens wi...

(EN_COMPOUND_ADJECTIVE_INTERNAL)


[uncategorized] ~40-~40: If this is a compound adjective that modifies the following noun, use a hyphen.
Context: ...pdated to Voyager Interstellar Mission; long term science focus shifts to heliospheric bo...

(EN_COMPOUND_ADJECTIVE_INTERNAL)


[uncategorized] ~44-~44: If this is a compound adjective that modifies the following noun, use a hyphen.
Context: ...pdated to Voyager Interstellar Mission; long term science focus shifts to heliospheric bo...

(EN_COMPOUND_ADJECTIVE_INTERNAL)


[uncategorized] ~48-~48: If this is a compound adjective that modifies the following noun, use a hyphen.
Context: ...pdated to Voyager Interstellar Mission; long term science focus shifts to heliospheric bo...

(EN_COMPOUND_ADJECTIVE_INTERNAL)


[uncategorized] ~52-~52: If this is a compound adjective that modifies the following noun, use a hyphen.
Context: ...pdated to Voyager Interstellar Mission; long term science focus shifts to heliospheric bo...

(EN_COMPOUND_ADJECTIVE_INTERNAL)


[misspelling] ~112-~112: This expression is normally spelled as one or with a hyphen.
Context: ...ary crossing signatures continues using multi year combined instrument observations 2013-1...

(EN_COMPOUNDS_MULTI_YEAR)


[misspelling] ~116-~116: This expression is normally spelled as one or with a hyphen.
Context: ...ary crossing signatures continues using multi year combined instrument observations 2015-0...

(EN_COMPOUNDS_MULTI_YEAR)


[misspelling] ~120-~120: This expression is normally spelled as one or with a hyphen.
Context: ...ary crossing signatures continues using multi year combined instrument observations 2016-0...

(EN_COMPOUNDS_MULTI_YEAR)


[misspelling] ~124-~124: This expression is normally spelled as one or with a hyphen.
Context: ...ary crossing signatures continues using multi year combined instrument observations 2017-1...

(EN_COMPOUNDS_MULTI_YEAR)


[misspelling] ~129-~129: This word is normally spelled with a hyphen.
Context: ...e forces selective instrument shutdown; power sharing strategy revised to extend mission life...

(EN_COMPOUNDS_POWER_SHARING)


[misspelling] ~133-~133: This word is normally spelled with a hyphen.
Context: ...e forces selective instrument shutdown; power sharing strategy revised to extend mission life...

(EN_COMPOUNDS_POWER_SHARING)


[misspelling] ~137-~137: This word is normally spelled with a hyphen.
Context: ...e forces selective instrument shutdown; power sharing strategy revised to extend mission life...

(EN_COMPOUNDS_POWER_SHARING)

integration-tests/tests/data/logs/text-multifile/logs/apollo-17_day08.txt

[uncategorized] ~7-~7: If this is a compound adjective that modifies the following noun, use a hyphen.
Context: ...and commemorative plaques documented in high resolution imagery; all primary systems nominal; c...

(EN_COMPOUND_ADJECTIVE_INTERNAL)


[style] ~14-~14: ‘prior to’ might be wordy. Consider a shorter alternative.
Context: ...7 apollo-17 d08 e013 crew rests briefly prior to ascent timeline for launch from Moon; a...

(EN_WORDINESS_PREMIUM_PRIOR_TO)

integration-tests/tests/data/logs/text-multifile/logs/apollo-17_day12.txt

[uncategorized] ~15-~15: If this is a compound adjective that modifies the following noun, use a hyphen.
Context: ...-17 d12 e014 crew reports readiness for high speed entry and ocean splashdown; all primary...

(EN_COMPOUND_ADJECTIVE_INTERNAL)

🪛 markdownlint-cli2 (0.18.1)
integration-tests/tests/data/logs/text-multifile/README.md

4-4: Trailing spaces
Expected: 0 or 2; Actual: 1

(MD009, no-trailing-spaces)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: package-image
  • GitHub Check: build (ubuntu-24.04)
  • GitHub Check: lint-check (ubuntu-24.04)
  • GitHub Check: lint-check (macos-15)
  • GitHub Check: build (macos-15)
🔇 Additional comments (52)
integration-tests/tests/data/__init__.py (1)

1-1: Module-level docstring is appropriate. The file correctly includes a short, descriptive docstring identifying the package purpose, which satisfies the ruff linting requirements per project standards.

integration-tests/tests/data/logs/json-singlefile-2/logs/new-horizons.jsonl (1)

1-150: Acknowledging test data file as requested.

The PR title indicates this should not be reviewed. The JSONL format is valid, timestamps are monotonically increasing, and the schema is consistent across all 150 records. The test data appears suitable for integration testing purposes.

integration-tests/tests/data/ground-truths/filter-select-distinct-postgresql.jsonl (1)

1-3: LGTM!

The ground-truth data is valid JSON with appropriate PostgreSQL error severity values for testing filter/select/distinct operations.

integration-tests/tests/data/ground-truths/filter-describe-postgresql.jsonl (1)

1-22: LGTM!

The ground-truth data is valid JSON with appropriate PostgreSQL column definitions for testing describe operations.

integration-tests/tests/data/jobs/presto/__init__.py (1)

1-1: LGTM!

The module docstring follows the project's linting requirements for package initializers.

integration-tests/tests/data/logs/json-singlefile-2/README.md (1)

1-16: LGTM!

The README metadata is consistent and clearly documents the test dataset characteristics.

integration-tests/tests/data/logs/json-singlefile-1/README.md (1)

1-16: LGTM!

The README metadata accurately reflects the artemis_ii.jsonl data file, with timestamps matching the first and last entries.

integration-tests/.pytest.ini (1)

14-17: LGTM!

The new environment variables provide clear test path configuration for repository, data, jobs, and logs directories.

integration-tests/tests/data/jobs/compression/__init__.py (1)

1-1: LGTM!

The module docstring follows the project's linting requirements for package initializers.

integration-tests/tests/data/logs/json-multifile/logs/sts-135-2011-07-09.jsonl (1)

1-8: LGTM!

The JSONL data has consistent structure with properly ordered timestamps and sequential line indices.

integration-tests/tests/data/logs/text-singlefile/README.md (2)

3-4: Verify the log type - inconsistency between README documentation and directory naming.

The README states the type is "JSON", but the directory is named "text-singlefile", which suggests the logs should be text-based rather than JSON format. Inspect the actual log files in this directory to determine the correct type and update the README or directory name accordingly.


6-7: Verify the file count in the text-singlefile directory.

The README states "Number of files: 5", but this needs verification against the actual file count in integration-tests/tests/data/logs/text-singlefile/. Check the directory to confirm the actual number of files matches the documented value.

integration-tests/tests/data/logs/json-singlefile-1/logs/artemis_ii.jsonl (1)

1-150: Verify suffix inconsistency pattern and testing intent.

The analysis confirms 35 of 150 entries contain the suffix "This log documents Artemis II integrated systems performance during the crewed lunar test mission." while 115 do not, creating an irregular mixed-format pattern. However, line 128 was incorrectly listed in the review as containing the suffix when it does not. Determine whether this mixed formatting is intentional for testing log parsing robustness or requires standardization to a consistent format.

integration-tests/tests/data/logs/json-multifile/logs/sts-135-2011-07-08.jsonl (1)

1-8: Test data is well-structured and properly formatted.

The JSONL file contains properly formatted JSON objects with consistent fields and sequential timestamps appropriate for testing CLP package functionality.

integration-tests/tests/data/logs/json-multifile/logs/sts-135-2011-07-19.jsonl (1)

1-8: Test data is well-structured and properly formatted.

The JSONL file contains properly formatted JSON objects representing STS-135 mission day 11 events with consistent structure and sequential timestamps.

integration-tests/tests/data/logs/json-multifile/logs/sts-135-2011-07-11.jsonl (1)

1-8: Test data is well-structured and properly formatted.

The JSONL file contains properly formatted JSON objects representing STS-135 mission day 3 events with consistent structure and sequential timestamps.

integration-tests/tests/fixtures/integration_test_logs.py (1)

46-58: Fixture implementation follows established patterns.

The spark_event_logs fixture consistently follows the same pattern as existing fixtures (hive_24hr, postgresql), using the shared _download_and_extract_dataset helper function with appropriate parameters.

integration-tests/tests/data/logs/text-singlefile/logs/voyager-1.txt (1)

1-150: Test data is properly formatted for integration testing.

The text file contains well-structured Voyager-1 mission logs with consistent formatting (ISO timestamps, spacecraft identifier, event type, and descriptions). The static analysis hints regarding hyphenation are stylistic suggestions for the log message content and are not applicable to test data.

integration-tests/tests/data/jobs/post_compression/__init__.py (1)

1-1: Package initializer is appropriate.

The file properly establishes the post_compression package namespace with a descriptive docstring.

integration-tests/tests/data/jobs/presto/presto_filters.py (1)

1-5: Scaffolding is properly structured.

The module provides a well-typed, empty dictionary for Presto filter job definitions, following good patterns for test job configuration infrastructure.

integration-tests/tests/data/logs/json-multifile/README.md (1)

1-16: Documentation clearly describes the test dataset.

The README provides accurate metadata about the json-multifile logs, including type, file count, events per file, and timestamp range, which aligns with the actual JSONL files added in this PR.

integration-tests/tests/conftest.py (1)

3-35: Pytest options and BASE_PORT ini wiring look coherent

The added ini key and CLI options match downstream usage (NO_JOBS, JOB_NAME_CONTAINS, BASE_PORT) and provide clear help text; configuration and typing look sound.

integration-tests/tests/utils/docker_utils.py (1)

3-3: New Docker helper correctly narrows containers by name prefix

The combination of docker ps --filter name=... with a re.fullmatch(re.escape(prefix) + r"\d+") check gives a precise prefix + digits match while avoiding regex pitfalls; error handling is consistent with the existing compose helper.

Also applies to: 39-64
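To illustrate, the prefix-plus-digits check the review describes can be sketched as follows (the helper name is hypothetical; the real utility additionally shells out to docker ps with a name filter before applying this match):

```python
import re


def matches_prefix_with_digits(name: str, prefix: str) -> bool:
    # re.escape() neutralizes regex metacharacters in the prefix (e.g. '-'),
    # so only a literal prefix followed by one or more digits matches.
    # re.fullmatch() anchors both ends, rejecting names with trailing text.
    return re.fullmatch(re.escape(prefix) + r"\d+", name) is not None
```

For example, "presto-clp-worker-1" matches the prefix "presto-clp-worker-", while "presto-clp-worker-1-extra" and the bare prefix do not.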

integration-tests/tests/test_package_start.py (1)

7-9: End-to-end package, mode, and job validation flow reads solid

Validating package components, then mode correctness, then conditionally checking Presto for clp-json-presto, and finally dispatching jobs only when package_job_list is present aligns cleanly with the fixture/job list design and avoids unnecessary work when --no-jobs is used.

Also applies to: 10-12, 14-16, 25-44

integration-tests/tests/data/logs/json-multifile/logs/sts-135-2011-07-21.jsonl (1)

1-8: JSONL log entries are well-formed and consistent

All records share the expected schema, timestamps are strictly increasing, and line_index values are unique and ordered, which should work well for multifile log test scenarios.

integration-tests/tests/data/jobs/post_compression/search_jobs.py (1)

1-5: Search job registry skeleton matches existing job-modelling pattern

Defining PACKAGE_SEARCH_JOBS: dict[str, PackagePostCompressionJob] = {} mirrors the other job registries and gives a clear, type-safe place to add search jobs later without affecting current test behaviour.

integration-tests/tests/utils/utils.py (1)

8-11: YAML loading helper is robust and consistent with its contract

load_yaml_to_dict cleanly distinguishes parse/IO failures (ValueError) from structural issues (TypeError on non-mapping), and the typing/return contract (dict[str, Any]) matches the enforced top-level mapping requirement.

Also applies to: 69-93
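The error-mapping contract described above (parse/IO failures surface as ValueError, a non-mapping top level as TypeError) can be sketched as follows; for a self-contained illustration this uses the stdlib json module in place of a YAML parser, and the function name is hypothetical:

```python
import json
from pathlib import Path
from typing import Any


def load_json_to_dict(path: Path) -> dict[str, Any]:
    # Parse and I/O failures both become ValueError, so callers see one
    # exception type for "could not load this file".
    try:
        data = json.loads(path.read_text())
    except (OSError, json.JSONDecodeError) as err:
        raise ValueError(f"Failed to load {path}: {err}") from err
    # A structurally valid file whose top level is not a mapping is a
    # different kind of error, reported as TypeError.
    if not isinstance(data, dict):
        raise TypeError(f"Top level of {path} must be a mapping.")
    return data
```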

integration-tests/tests/data/jobs/post_compression/admin_tools_jobs.py (1)

1-5: Admin-tools job registry skeleton is aligned with the job model

The PACKAGE_ADMIN_TOOLS_JOBS: dict[str, PackagePostCompressionJob] = {} definition follows the same pattern as other job registries and safely acts as a placeholder until specific admin-tools jobs are added.

integration-tests/tests/utils/asserting_utils.py (2)

9-21: LGTM!

The new imports are well-organized and correctly support the validation utilities for mode verification and Presto container checks.


77-97: LGTM!

The validate_running_mode_correct function correctly validates the shared config against the intended configuration by comparing mode signatures. The error handling for ValidationError is appropriate.

integration-tests/tests/data/jobs/compression/compression_jobs.py (1)

1-38: LGTM!

The compression job definitions are well-structured and follow the PackageCompressionJob dataclass schema. The dictionary keys appropriately match the job_name fields, which aids maintainability and lookup consistency.

integration-tests/tests/fixtures/package_config.py (4)

3-26: LGTM!

The imports are well-organized and the logger setup follows the standard pattern for the test infrastructure.


45-54: LGTM!

Good defensive error handling for the BASE_PORT configuration value. The explicit error message clearly indicates the expected type.


59-70: LGTM!

The job list construction correctly respects the NO_JOBS and JOB_NAME_CONTAINS options, allowing flexible test filtering.


75-82: LGTM!

The enhanced teardown properly cleans up both the temporary config file and the generated data/tmp directories, ensuring test isolation.

integration-tests/tests/utils/clp_mode_utils.py (4)

44-56: LGTM!

The new clp-json-presto mode configuration correctly sets up the Presto query engine with CLP_S storage, disables results cache retention (appropriate for Presto-based queries), and configures the Presto coordinator.


97-114: LGTM!

The compute_mode_signature function provides a clean way to compare mode-defining aspects of configurations. The tuple captures the essential fields that determine operating mode behaviour.
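A minimal sketch of the signature idea, with hypothetical field names standing in for the actual mode-defining fields compared by the real function:

```python
from typing import Any


def compute_mode_signature(config: dict[str, Any]) -> tuple:
    # Project only the fields that determine operating mode; incidental
    # settings (ports, paths, etc.) are deliberately excluded so that two
    # configs with the same mode compare equal.
    return (
        config.get("query_engine"),
        config.get("storage_engine"),
        config.get("deployment_type"),
    )
```

Two configurations then agree on mode exactly when their signature tuples are equal, regardless of other differences.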


139-145: LGTM!

The refactored initialization correctly builds the component list by extending from CLP_BASE_COMPONENTS and conditionally adding query components and the reducer for FULL deployment type.


91-94: Consider using a constant for the Presto worker component name.

PRESTO_COORDINATOR_COMPONENT_NAME is imported from clp_py_utils.clp_config, but "presto-worker" is hardcoded. Verify whether a PRESTO_WORKER_COMPONENT_NAME constant exists in clp_py_utils.clp_config and use it for consistency with the coordinator component name.

integration-tests/tests/utils/port_utils.py (6)

9-18: LGTM!

Constants are well-defined with clear naming. The valid port range correctly excludes privileged ports (0-1023) and includes the full range up to 65535.


21-33: LGTM!

The PortAssignment dataclass is well-structured with clear documentation. Using Any for component is appropriate given the dynamic nature of CLP config components.


36-58: LGTM!

The assign_ports_from_base function correctly discovers port requirements, validates the range, checks availability, and assigns ports sequentially. The logic is sound and well-organized.


61-76: LGTM!

Port availability checking logic is correct. The error message is informative and actionable.


128-141: Socket resource handling is correct.

The context manager properly ensures the socket is closed. Catching OSError covers both address-in-use and other binding failures appropriately.
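A sketch of this bind-probe pattern (helper name hypothetical; the real utility raises an informative error rather than returning a bool):

```python
import socket


def is_port_available(port: int) -> bool:
    # Probe by binding to the port; the context manager guarantees the
    # probe socket is closed whether or not the bind succeeds.
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            # SO_REUSEADDR lets the probe succeed on ports lingering in
            # TIME_WAIT, while a live listener still causes EADDRINUSE.
            sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
            sock.bind(("127.0.0.1", port))
        return True
    except OSError:
        # Covers address-in-use as well as other binding failures.
        return False
```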


144-163: LGTM!

Bounds validation logic is correct. The error message clearly indicates the issue and provides both the required and valid ranges.

integration-tests/tests/utils/package_utils.py (2)

258-263: Placeholder assertions should be tracked.

The assert True statements with TODO comments are placeholders that effectively skip validation. While the TODOs reference specific PRs/issues, these could easily be forgotten.

Consider adding pytest.skip() with a reason instead, which makes the skip visible in test reports:

     if compression_job.mode in ("clp-json", "clp-json-presto"):
-        # TODO: Waiting for PR 1299 to be merged.
-        assert True
+        pytest.skip("Decompression validation pending PR #1299")
     elif compression_job.mode == "clp-text":
-        # TODO: As of the morning of Dec. 10, clp-text decompression is broken.
-        assert True
+        pytest.skip("clp-text decompression is currently broken")

91-167: Good orchestration of Presto cluster startup.

The function correctly validates paths, generates split filters from compression jobs, writes configuration, and starts the Docker Compose cluster.

integration-tests/tests/utils/config.py (5)

156-178: LGTM!

PackageCompressionJob dataclass is well-structured with clear field documentation. The frozen=True ensures immutability which is appropriate for job definitions.


180-204: LGTM!

PackagePostCompressionJob correctly models post-compression dependencies with a reference to the parent PackageCompressionJob. The requires_archive_id default is sensible.


206-231: LGTM!

PrestoFilterJob and PackageJobList dataclasses are well-defined. The comment explaining why Presto filters have their own class is helpful.


130-153: LGTM!

New script path properties follow the established pattern and provide convenient access to package scripts. The paths are consistent with the CLP package structure.


283-302: Good addition of shared_config_file_path validation.

The new field is properly initialized and validated during __post_init__, ensuring the shared config file exists before the instance is considered valid.

# Description of logs

**Type**
Unstructured text

⚠️ Potential issue | 🟡 Minor

Remove trailing whitespace from line 4.

The line ends with a single trailing space, which violates markdown formatting standards.

Apply this diff to fix the trailing whitespace:

 **Type**
-Unstructured text 
+Unstructured text
🧰 Tools
🪛 markdownlint-cli2 (0.18.1)

4-4: Trailing spaces
Expected: 0 or 2; Actual: 1

(MD009, no-trailing-spaces)

🤖 Prompt for AI Agents
integration-tests/tests/data/logs/text-multifile/README.md around line 4: the
line "Unstructured text " contains a single trailing space; remove the trailing
whitespace at the end of line 4 so the line ends with "Unstructured text" (no
extra spaces) and save the file.

@@ -1,5 +1,6 @@
"""Fixtures that start and stop CLP package instances for integration tests."""

import logging

🧹 Nitpick | 🔵 Trivial

Fixture behaviour for NO_JOBS and Presto looks correct, with a small opportunity to preserve error detail

The skip condition when package_job_list is None and not no_jobs matches how the job list is built and avoids spinning up instances for modes with no matching jobs, and the Presto start/stop tied to mode_name == "clp-json-presto" in try/finally ensures cleanup. One optional improvement would be to catch RuntimeError as err and include err in the pytest.fail message (or log it) so callers keep the original failure context alongside the helpful BASE_PORT hint.

Also applies to: 13-17, 19-19, 23-26, 31-31, 35-40, 46-60

🤖 Prompt for AI Agents
In integration-tests/tests/fixtures/package_instance.py around lines 3, 13-17,
19, 23-26, 31, 35-40 and 46-60, the tests currently call pytest.fail (or
otherwise re-raise) on RuntimeError without preserving the original exception
details; update each try/except that catches RuntimeError to use "except
RuntimeError as err" and include the err in the failure message (e.g.,
pytest.fail(f"...: {err}") or log the err before failing) so the original
exception context is preserved alongside the existing BASE_PORT hint and cleanup
logic.

Comment on lines 100 to 115
def validate_presto_running() -> None:
    """
    Validate that a Presto cluster is running.

    :param package_instance:
    """
    required_components = CLP_PRESTO_COMPONENTS

    for component in required_components:
        prefix = f"presto-clp-{component}-"
        running_matches = list_running_containers_with_prefix(prefix)
        if len(running_matches) == 0:
            pytest.fail(
                f"No running container found for component '{component}' "
                f"(expected name prefix '{prefix}')."
            )

⚠️ Potential issue | 🟡 Minor

Remove stale docstring parameter.

The docstring references :param package_instance: but the function takes no parameters.

 def validate_presto_running() -> None:
     """
     Validate that a Presto cluster is running.
-
-    :param package_instance:
     """
     required_components = CLP_PRESTO_COMPONENTS
🤖 Prompt for AI Agents
In integration-tests/tests/utils/asserting_utils.py around lines 100 to 115, the
function validate_presto_running() has a stale docstring parameter reference
":param package_instance:" that does not match the function signature; remove
that incorrect param line from the docstring and, if desired, replace it with a
brief description of the function or relevant params (none in this case) so the
docstring accurately reflects the function signature.

        str(package_config.path_config.clp_package_dir),
    ]
    # fmt: on
    subprocess.run(setup_presto_cmd, check=True)

🧹 Nitpick | 🔵 Trivial

Consider adding timeouts to subprocess calls.

The subprocess.run calls for Presto setup, start, and stop don't have timeouts. If the Docker daemon is unresponsive or the commands hang, the test will block indefinitely.

-    subprocess.run(setup_presto_cmd, check=True)
+    subprocess.run(setup_presto_cmd, check=True, timeout=DEFAULT_CMD_TIMEOUT_SECONDS)

Apply similarly to lines 167 and 202.

Also applies to: 167-167, 202-202

🤖 Prompt for AI Agents
In integration-tests/tests/utils/package_utils.py around lines 112, 167, and
202, the subprocess.run invocations for Presto setup/start/stop lack timeouts
and can hang indefinitely; add a reasonable timeout argument (e.g., timeout=30
or configurable via a constant/ENV var) to each subprocess.run call and wrap
them in try/except to catch subprocess.TimeoutExpired to perform cleanup or
raise a clear exception; ensure any resources (docker containers/processes) are
cleaned up or the test fails fast when a timeout occurs.
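A minimal sketch of the suggested pattern, assuming a DEFAULT_CMD_TIMEOUT_SECONDS constant and a hypothetical wrapper name:

```python
import subprocess

# Assumed constant; tune for the slowest expected Docker operation.
DEFAULT_CMD_TIMEOUT_SECONDS = 30


def run_with_timeout(cmd: list[str], timeout: float = DEFAULT_CMD_TIMEOUT_SECONDS) -> None:
    # check=True surfaces non-zero exit codes as CalledProcessError;
    # timeout= keeps an unresponsive daemon from blocking the test run
    # indefinitely (subprocess.run kills the child on expiry).
    try:
        subprocess.run(cmd, check=True, timeout=timeout)
    except subprocess.TimeoutExpired as err:
        raise RuntimeError(f"Command timed out after {timeout}s: {err.cmd}") from err
```

Any container cleanup would still need to happen in the caller's finally block, since a killed command may leave resources behind.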

Comment on lines +122 to +133
    if package_compression_jobs is not None:
        for package_compression_job in package_compression_jobs:
            flags = package_compression_job.flags
            if flags is not None:
                dataset_name = flags["--dataset"]
                timestamp_key = flags["--timestamp-key"]

            # Skip jobs that do not specify both dataset_name and timestamp_key.
            if dataset_name is None or timestamp_key is None:
                continue

            dataset_timestamp_dict[dataset_name] = timestamp_key

⚠️ Potential issue | 🔴 Critical

Potential UnboundLocalError if flags is None.

When flags is None, the variables dataset_name and timestamp_key are never assigned, but the code at lines 130-131 still checks and uses them. This will raise an UnboundLocalError.

Apply this diff to fix the scoping issue:

     if package_compression_jobs is not None:
         for package_compression_job in package_compression_jobs:
             flags = package_compression_job.flags
-            if flags is not None:
-                dataset_name = flags["--dataset"]
-                timestamp_key = flags["--timestamp-key"]
-
-            # Skip jobs that do not specify both dataset_name and timestamp_key.
-            if dataset_name is None or timestamp_key is None:
+            if flags is None:
+                continue
+
+            dataset_name = flags.get("--dataset")
+            timestamp_key = flags.get("--timestamp-key")
+
+            # Skip jobs that do not specify both dataset_name and timestamp_key.
+            if dataset_name is None or timestamp_key is None:
                 continue

             dataset_timestamp_dict[dataset_name] = timestamp_key

This also uses .get() to avoid KeyError if the flags exist but don't contain the expected keys.

🤖 Prompt for AI Agents
In integration-tests/tests/utils/package_utils.py around lines 122 to 133, the
code reads flags = package_compression_job.flags but only assigns dataset_name
and timestamp_key when flags is not None, yet later checks those variables
unconditionally which can raise UnboundLocalError; fix by initializing
dataset_name and timestamp_key to None before the flags check, then if flags is
not None assign dataset_name = flags.get("--dataset") and timestamp_key =
flags.get("--timestamp-key") and keep the existing skip/continue logic so jobs
missing flags or keys are ignored.

Comment on lines 344 to 351
def _compare_output_with_ground_truth(output: str, ground_truth_file_path: Path) -> bool:
    with tempfile.TemporaryDirectory() as temporary_output_directory_name:
        # Write output to tempfile.
        output_file_path = Path(temporary_output_directory_name) / "output.tmp"
        output_file_path.write_text(output)

        # Compare tempfile with ground truth.
        return is_dir_tree_content_equal(output_file_path, ground_truth_file_path)

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Search for the is_dir_tree_content_equal function definition
rg -n "def is_dir_tree_content_equal" --type py -A 20

Repository: y-scope/clp

Length of output: 166


🌐 Web query:

is_dir_tree_content_equal function Python file directory comparison

💡 Result:

Here’s a concise, robust implementation to compare two directory trees by names and file contents (recursively), using Python’s filecmp utilities:

Code:

import os
import filecmp

def is_dir_tree_content_equal(dir1: str, dir2: str) -> bool:
    """Return True if dir trees rooted at dir1 and dir2 have the same
    filenames, subdirectories and file contents (recursively)."""
    dcmp = filecmp.dircmp(dir1, dir2)
    if dcmp.left_only or dcmp.right_only or dcmp.funny_files:
        return False

    # compare common files by content
    match, mismatch, errors = filecmp.cmpfiles(dir1, dir2, dcmp.common_files, shallow=False)
    if mismatch or errors:
        return False

    # recurse into common subdirectories
    for sub in dcmp.common_dirs:
        if not is_dir_tree_content_equal(os.path.join(dir1, sub), os.path.join(dir2, sub)):
            return False

    return True

Notes:

  • Uses shallow=False so file contents are compared (not just stat). See Python’s filecmp docs. [1]
  • A similar recursive approach is shown and discussed on Stack Overflow. [2]

Sources:
[1] Python filecmp documentation.
[2] Stack Overflow — recursive directory comparison example.


Function compares single file, but uses directory comparison utility.

_compare_output_with_ground_truth writes output to a single file but uses is_dir_tree_content_equal for comparison. The underlying utility is designed for directory tree comparison using filecmp.dircmp(), which expects directory paths. Comparing a single file path with this function is semantically incorrect and may fail at runtime. Consider either wrapping the file comparison in a directory structure or using a dedicated file comparison function.

🤖 Prompt for AI Agents
In integration-tests/tests/utils/package_utils.py around lines 344 to 351, the
helper writes the output to a single temp file but then calls
is_dir_tree_content_equal (a directory tree comparator), which is semantically
wrong and can fail; change the comparison to use a proper file comparison
instead (e.g., filecmp.cmp or pathlib.read_text() equality) or, if you want to
keep using the directory comparator, create a temporary directory mirroring the
ground-truth layout (place the temp file under the same relative name) before
calling is_dir_tree_content_equal; update the function to perform one of these
fixes and return the boolean result of the correct file-or-directory comparison.
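A sketch of the single-file fix using filecmp.cmp; the names follow the snippet above but are illustrative:

```python
import filecmp
import tempfile
from pathlib import Path


def compare_output_with_ground_truth(output: str, ground_truth_file_path: Path) -> bool:
    with tempfile.TemporaryDirectory() as temporary_output_directory_name:
        # Write output to a tempfile.
        output_file_path = Path(temporary_output_directory_name) / "output.tmp"
        output_file_path.write_text(output)

        # shallow=False compares file contents rather than os.stat()
        # signatures, which is the right tool for two single files.
        return filecmp.cmp(output_file_path, ground_truth_file_path, shallow=False)
```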

Comment on lines 354 to 362
def run_presto_filter(
    presto_filter_job: PrestoFilterJob,
) -> None:
    """
    Run a Presto filter.

    :param request:
    :param presto_filter:
    :param package_instance:

⚠️ Potential issue | 🟡 Minor

Docstring parameters don't match function signature.

The docstring references request, presto_filter, and package_instance parameters, but the function only accepts presto_filter_job.

 def run_presto_filter(
     presto_filter_job: PrestoFilterJob,
 ) -> None:
     """
     Run a Presto filter.

-    :param request:
-    :param presto_filter:
-    :param package_instance:
+    :param presto_filter_job: The Presto filter job to execute.
     """
🤖 Prompt for AI Agents
In integration-tests/tests/utils/package_utils.py around lines 354 to 362, the
function run_presto_filter(presto_filter_job) has a docstring listing parameters
request, presto_filter, and package_instance that do not match the actual
signature; update the docstring to document the single parameter
presto_filter_job (and its type/purpose) and remove or rename the incorrect
parameter entries so the docstring accurately reflects the function signature
and expected argument.
