feat(integration-tests): package integration test jobs (do not review) by quinntaylormitchell · Pull Request #1766 · y-scope/clp · GitHub

Conversation

@quinntaylormitchell
Collaborator
@quinntaylormitchell quinntaylormitchell commented Dec 10, 2025

Description

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change that has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

Summary by CodeRabbit

  • New Features

    • Added support for Presto-based query operations in integration tests
    • Introduced expanded test data including multiple compression job configurations and post-compression workflows
    • Enhanced test infrastructure with job filtering, port assignment, and cluster management capabilities
  • Tests

    • Expanded test data fixtures across multiple mission scenarios and log formats
    • Added comprehensive ground-truth validation for filtering and query operations
    • Implemented test job dispatch and execution framework

✏️ Tip: You can customize this high-level summary in your review settings.

@quinntaylormitchell quinntaylormitchell requested a review from a team as a code owner December 10, 2025 20:00
@quinntaylormitchell quinntaylormitchell marked this pull request as draft December 10, 2025 20:00
@coderabbitai
Contributor
coderabbitai bot commented Dec 10, 2025

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

This pull request introduces comprehensive integration test infrastructure for CLP package testing, including test data directories with ground-truth datasets, job configuration definitions for compression and post-compression workflows, Presto cluster support, fixture enhancements for dynamic port assignment and configuration management, and utility functions for test job orchestration and validation.
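The option-registration portion of this infrastructure (the --job-name-contains and --no-jobs flags and the BASE_PORT ini value mentioned in the changes below) might look roughly like the following conftest.py sketch. This is an illustration inferred from the summary, not the PR's actual code; the real hook may register different defaults or help text.

```python
def pytest_addoption(parser) -> None:
    """Hypothetical sketch of the conftest.py hook described in this PR."""
    # Ini value consumed by the port-assignment fixture.
    parser.addini("BASE_PORT", help="First port to try when assigning component ports.")
    # Keyword filter: only dispatch jobs whose names contain this substring.
    parser.addoption(
        "--job-name-contains",
        action="store",
        default=None,
        help="Only run test jobs whose names contain this substring.",
    )
    # Escape hatch: start the package but skip job dispatch entirely.
    parser.addoption(
        "--no-jobs",
        action="store_true",
        default=False,
        help="Skip dispatching test jobs after the package starts.",
    )
```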

Changes

Cohort / File(s) Summary
Pytest Configuration
.pytest.ini, integration-tests/tests/conftest.py
Adds environment variables for test paths (CLP_REPO_DIR, TEST_DATA_DIR, TEST_JOBS_DIR, TEST_LOGS_DIR), defines BASE_PORT marker, and introduces pytest_addoption hook registering BASE_PORT ini value and command-line options --job-name-contains and --no-jobs.
Test Data Module
integration-tests/tests/data/__init__.py
Introduces data package module with docstring for static CLP integration test data storage.
Ground-Truth Test Data
integration-tests/tests/data/ground-truths/filter-*.jsonl
Adds ground-truth JSONL files for PostgreSQL describe and select-distinct filter validation.
Compression Job Definitions
integration-tests/tests/data/jobs/compression/__init__.py, integration-tests/tests/data/jobs/compression/compression_jobs.py
Defines compression package module and PACKAGE_COMPRESSION_JOBS dictionary with three compression job configurations (json-singlefile-1, text-singlefile, json-singlefile-2).
Post-Compression Job Definitions
integration-tests/tests/data/jobs/post_compression/__init__.py, integration-tests/tests/data/jobs/post_compression/admin_tools_jobs.py, integration-tests/tests/data/jobs/post_compression/search_jobs.py
Introduces post-compression jobs package with empty PACKAGE_ADMIN_TOOLS_JOBS and PACKAGE_SEARCH_JOBS registries.
Presto Filter Job Definitions
integration-tests/tests/data/jobs/presto/__init__.py, integration-tests/tests/data/jobs/presto/presto_filters.py
Establishes Presto jobs package with empty PRESTO_FILTER_JOBS registry.
JSON Log Test Data
integration-tests/tests/data/logs/json-multifile/..., integration-tests/tests/data/logs/json-singlefile-1/..., integration-tests/tests/data/logs/json-singlefile-2/...
Adds JSON log datasets with READMEs and JSONL files: Artemis II (150 events), New Horizons (150 events), and STS-135 multi-file logs (5 files × 8 events).
Text Log Test Data
integration-tests/tests/data/logs/text-multifile/..., integration-tests/tests/data/logs/text-singlefile/...
Introduces text log datasets including Apollo-17 mission days 1–13 (13 files × 15 events each) and Voyager-1 mission log.
Test Fixtures
integration-tests/tests/fixtures/integration_test_logs.py, integration-tests/tests/fixtures/package_config.py, integration-tests/tests/fixtures/package_instance.py
Adds spark_event_logs fixture; extends fixt_package_config with port assignment, job list construction via build_package_job_list, and directory cleanup; modifies fixt_package_instance to accept request fixture and conditionally start Presto cluster.
Test Module
integration-tests/tests/test_package_start.py
Enhances test with validation calls (validate_running_mode_correct, validate_presto_running) and conditional job dispatch via dispatch_test_jobs.
Validation Utilities
integration-tests/tests/utils/asserting_utils.py
Adds validate_running_mode_correct() and validate_presto_running() functions to verify CLP mode configuration and Presto container status.
Job Dispatch Logic
integration-tests/tests/utils/clp_job_utils.py
Introduces build_package_job_list() to construct job lists by mode and keyword filter, and dispatch_test_jobs() to orchestrate compression → post-compression → presto-filter execution with archive cleanup.
Mode Utilities
integration-tests/tests/utils/clp_mode_utils.py
Adds compute_mode_signature() function, introduces CLP_REDUCER_COMPONENT and CLP_PRESTO_COMPONENTS constants, defines "clp-json-presto" mode configuration with Presto settings, and refactors component list initialization.
Configuration Models
integration-tests/tests/utils/config.py
Introduces four new dataclasses (PackageCompressionJob, PackagePostCompressionJob, PrestoFilterJob, PackageJobList); extends PackagePathConfig with package_decompression_dir and script path properties; extends PackageConfig and PackageInstance with job list and shared config path fields.
Docker Utilities
integration-tests/tests/utils/docker_utils.py
Adds list_running_containers_with_prefix() function to filter running containers by name prefix using regex.
Package Utilities
integration-tests/tests/utils/package_utils.py
Adds cluster management (start_presto_cluster, stop_presto_cluster), compression/post-compression execution (run_package_compression_script, run_package_script), Presto filtering (run_presto_filter), and ground-truth comparison helpers.
Port Assignment
integration-tests/tests/utils/port_utils.py
Introduces PortAssignment dataclass and assign_ports_from_base() function to discover port requirements, validate availability, and assign sequential ports to CLP components.
Utility Helpers
integration-tests/tests/utils/utils.py
Adds load_yaml_to_dict() function for YAML file parsing with validation.
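The port-assignment approach that the port_utils.py summary describes (discover requirements, validate availability, assign sequential ports from BASE_PORT) can be sketched as follows. The dataclass fields and function signature here are placeholders inferred from the summary, and the socket-binding probe is one common way to check availability; the PR's actual implementation may differ.

```python
import socket
from dataclasses import dataclass


@dataclass
class PortAssignment:
    """Maps a component name to its assigned port (illustrative fields)."""
    component: str
    port: int


def _port_is_free(port: int) -> bool:
    # Binding succeeds only if nothing else is already listening on the port.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            sock.bind(("127.0.0.1", port))
            return True
        except OSError:
            return False


def assign_ports_from_base(base_port: int, components: list[str]) -> list[PortAssignment]:
    """Assign the first free ports at or above base_port, one per component."""
    assignments = []
    candidate = base_port
    for component in components:
        while not _port_is_free(candidate):
            candidate += 1  # skip occupied ports
        assignments.append(PortAssignment(component, candidate))
        candidate += 1
    return assignments
```

One review concern called out below (platform compatibility of the socket-binding approach) applies directly to the probe shown here: binding to 127.0.0.1 does not guarantee the port is free on all interfaces.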

Sequence Diagram(s)

sequenceDiagram
    actor Test
    participant JobDispatch as Job Dispatch Logic
    participant CompressionJob as Compression Job
    participant Archive as CLP Archive
    participant PostCompJob as Post-Compression Job
    participant GroundTruth as Ground Truth Validator
    participant PrestoFilter as Presto Filter Job

    Test->>JobDispatch: dispatch_test_jobs(package_instance)
    JobDispatch->>JobDispatch: Clean initial archive directory
    
    loop For each compression job
        JobDispatch->>CompressionJob: run_package_compression_script()
        CompressionJob->>Archive: Compress logs into archive
        Archive-->>CompressionJob: Archive created
        
        loop For each dependent post-compression job
            JobDispatch->>PostCompJob: run_package_script()
            PostCompJob->>Archive: Extract/query archive
            Archive-->>PostCompJob: Output generated
            PostCompJob->>GroundTruth: _compare_output_with_ground_truth()
            GroundTruth-->>PostCompJob: Validation result
        end
        
        loop For each dependent presto-filter job
            JobDispatch->>PrestoFilter: run_presto_filter()
            PrestoFilter->>Archive: Execute Presto filter query
            Archive-->>PrestoFilter: Query results
            PrestoFilter->>GroundTruth: Validate against ground truth
            GroundTruth-->>PrestoFilter: Validation result
        end
        
        JobDispatch->>Archive: Clean up archive directory
    end
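The three-stage loop in the diagram above can be condensed into the following sketch. The function and field names are placeholders inferred from the summary (the real dispatch_test_jobs in clp_job_utils.py takes a package instance, not callables), but the control flow matches: clean the archive, then for each compression job run its dependent post-compression and presto-filter jobs before cleaning up again.

```python
def dispatch_test_jobs(compression_jobs, run_compression, run_post_compression,
                       run_presto_filter, clean_archive_dir):
    """Illustrative three-stage dispatch: compression -> post-compression -> presto-filter."""
    clean_archive_dir()  # start from an empty archive directory
    for job in compression_jobs:
        run_compression(job)  # compress logs into a fresh archive
        for post_job in job.get("post_compression_jobs", []):
            # extract/query the archive; output is compared against ground truth
            run_post_compression(post_job)
        for filter_job in job.get("presto_filter_jobs", []):
            run_presto_filter(filter_job)
        clean_archive_dir()  # reset before the next compression job
```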
sequenceDiagram
    actor Test
    participant Fixture as Package Fixture
    participant PortUtils as Port Assignment
    participant PackageConfig as PackageConfig
    participant PrestoCluster as Presto Cluster
    participant CLP as CLP Package

    Test->>Fixture: fixt_package_config (setup)
    Fixture->>PortUtils: assign_ports_from_base(BASE_PORT, clp_config)
    PortUtils->>PortUtils: Discover port requirements
    PortUtils->>PortUtils: Validate ports are free
    PortUtils->>PackageConfig: Assign ports to components
    PackageConfig-->>Fixture: Configuration ready
    
    Fixture->>Fixture: build_package_job_list(mode, filter)
    Fixture->>PackageConfig: Set package_job_list
    PackageConfig-->>Fixture: Config with jobs
    
    Fixture->>Fixture: yield config (fixture ready)
    
    Test->>Fixture: fixt_package_instance (setup)
    alt mode is "clp-json-presto"
        Fixture->>PrestoCluster: start_presto_cluster(package_config)
        PrestoCluster->>PrestoCluster: Construct split-filter mappings
        PrestoCluster->>PrestoCluster: Generate split-filter.json
        PrestoCluster->>CLP: Start Presto services
        CLP-->>PrestoCluster: Services running
    end
    
    Fixture->>Fixture: yield instance (fixture ready)
    
    Test->>Test: Execute test_clp_package()
    
    Fixture->>Fixture: fixt_package_instance (teardown)
    alt mode is "clp-json-presto"
        Fixture->>PrestoCluster: stop_presto_cluster()
        PrestoCluster->>CLP: Stop Presto services
        PrestoCluster->>PrestoCluster: Reset split-filter.json
    end
    Fixture-->>Test: Cleanup complete

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Areas requiring extra attention during review:

  • integration-tests/tests/utils/clp_job_utils.py — Core orchestration logic with three-stage job dispatch (compression → post-compression → presto-filter); cross-linking of job dependencies and archive cleanup sequencing requires careful validation
  • integration-tests/tests/utils/package_utils.py — Multiple public functions managing Presto cluster lifecycle, split-filter JSON generation, and ground-truth comparison; ensure error handling and cleanup paths are robust
  • integration-tests/tests/utils/port_utils.py — Port discovery, availability checking, and validation logic; verify range bounds and socket-binding approach are platform-compatible
  • integration-tests/tests/utils/config.py — Four new dataclasses and extensions to existing ones; validate field names, type hints, and __post_init__ logic in PackageInstance for shared config path resolution
  • integration-tests/tests/fixtures/package_config.py and integration-tests/tests/fixtures/package_instance.py — Fixture interdependencies, configuration passing, and conditional Presto setup; ensure NO_JOBS and JOB_NAME_CONTAINS filtering works correctly
  • integration-tests/tests/utils/clp_mode_utils.py — Addition of "clp-json-presto" mode and CLP_PRESTO_COMPONENTS; verify compatibility with existing mode logic and component list construction
  • integration-tests/tests/test_package_start.py — New validation function calls and conditional job dispatch; ensure test expectations align with actual mode and Presto availability

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly describes the main change as packaging integration test jobs, which aligns with the extensive test infrastructure additions throughout the PR.
Docstring Coverage ✅ Passed Docstring coverage is 90.74% which is sufficient. The required threshold is 80.00%.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor
@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 14

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 39398e6 and f737e83.

📒 Files selected for processing (50)
  • integration-tests/.pytest.ini (2 hunks)
  • integration-tests/tests/conftest.py (1 hunks)
  • integration-tests/tests/data/__init__.py (1 hunks)
  • integration-tests/tests/data/ground-truths/filter-describe-postgresql.jsonl (1 hunks)
  • integration-tests/tests/data/ground-truths/filter-select-distinct-postgresql.jsonl (1 hunks)
  • integration-tests/tests/data/jobs/compression/__init__.py (1 hunks)
  • integration-tests/tests/data/jobs/compression/compression_jobs.py (1 hunks)
  • integration-tests/tests/data/jobs/post_compression/__init__.py (1 hunks)
  • integration-tests/tests/data/jobs/post_compression/admin_tools_jobs.py (1 hunks)
  • integration-tests/tests/data/jobs/post_compression/search_jobs.py (1 hunks)
  • integration-tests/tests/data/jobs/presto/__init__.py (1 hunks)
  • integration-tests/tests/data/jobs/presto/presto_filters.py (1 hunks)
  • integration-tests/tests/data/logs/json-multifile/README.md (1 hunks)
  • integration-tests/tests/data/logs/json-multifile/logs/sts-135-2011-07-08.jsonl (1 hunks)
  • integration-tests/tests/data/logs/json-multifile/logs/sts-135-2011-07-09.jsonl (1 hunks)
  • integration-tests/tests/data/logs/json-multifile/logs/sts-135-2011-07-11.jsonl (1 hunks)
  • integration-tests/tests/data/logs/json-multifile/logs/sts-135-2011-07-19.jsonl (1 hunks)
  • integration-tests/tests/data/logs/json-multifile/logs/sts-135-2011-07-21.jsonl (1 hunks)
  • integration-tests/tests/data/logs/json-singlefile-1/README.md (1 hunks)
  • integration-tests/tests/data/logs/json-singlefile-1/logs/artemis_ii.jsonl (1 hunks)
  • integration-tests/tests/data/logs/json-singlefile-2/README.md (1 hunks)
  • integration-tests/tests/data/logs/json-singlefile-2/logs/new-horizons.jsonl (1 hunks)
  • integration-tests/tests/data/logs/text-multifile/README.md (1 hunks)
  • integration-tests/tests/data/logs/text-multifile/logs/apollo-17_day01.txt (1 hunks)
  • integration-tests/tests/data/logs/text-multifile/logs/apollo-17_day02.txt (1 hunks)
  • integration-tests/tests/data/logs/text-multifile/logs/apollo-17_day03.txt (1 hunks)
  • integration-tests/tests/data/logs/text-multifile/logs/apollo-17_day04.txt (1 hunks)
  • integration-tests/tests/data/logs/text-multifile/logs/apollo-17_day05.txt (1 hunks)
  • integration-tests/tests/data/logs/text-multifile/logs/apollo-17_day06.txt (1 hunks)
  • integration-tests/tests/data/logs/text-multifile/logs/apollo-17_day07.txt (1 hunks)
  • integration-tests/tests/data/logs/text-multifile/logs/apollo-17_day08.txt (1 hunks)
  • integration-tests/tests/data/logs/text-multifile/logs/apollo-17_day09.txt (1 hunks)
  • integration-tests/tests/data/logs/text-multifile/logs/apollo-17_day10.txt (1 hunks)
  • integration-tests/tests/data/logs/text-multifile/logs/apollo-17_day11.txt (1 hunks)
  • integration-tests/tests/data/logs/text-multifile/logs/apollo-17_day12.txt (1 hunks)
  • integration-tests/tests/data/logs/text-multifile/logs/apollo-17_day13.txt (1 hunks)
  • integration-tests/tests/data/logs/text-singlefile/README.md (1 hunks)
  • integration-tests/tests/data/logs/text-singlefile/logs/voyager-1.txt (1 hunks)
  • integration-tests/tests/fixtures/integration_test_logs.py (1 hunks)
  • integration-tests/tests/fixtures/package_config.py (2 hunks)
  • integration-tests/tests/fixtures/package_instance.py (2 hunks)
  • integration-tests/tests/test_package_start.py (2 hunks)
  • integration-tests/tests/utils/asserting_utils.py (2 hunks)
  • integration-tests/tests/utils/clp_job_utils.py (1 hunks)
  • integration-tests/tests/utils/clp_mode_utils.py (5 hunks)
  • integration-tests/tests/utils/config.py (7 hunks)
  • integration-tests/tests/utils/docker_utils.py (2 hunks)
  • integration-tests/tests/utils/package_utils.py (3 hunks)
  • integration-tests/tests/utils/port_utils.py (1 hunks)
  • integration-tests/tests/utils/utils.py (2 hunks)
🧰 Additional context used
🧠 Learnings (13)
📓 Common learnings
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1100
File: integration-tests/tests/fixtures/integration_test_logs.py:54-56
Timestamp: 2025-08-17T16:10:38.722Z
Learning: For PR #1100 (feat(integration-tests): Add CLP package integration tests boilerplate), do not raise cache weakness problems related to the pytest cache implementation in the integration test logs fixtures.
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1100
File: integration-tests/pyproject.toml:11-13
Timestamp: 2025-08-16T10:24:29.316Z
Learning: In the CLP project's integration-tests directory, the integration tests package is not intended to be shipped or distributed - only the tests are leveraged for local testing purposes, so console script entry points should not be included in pyproject.toml.
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1156
File: components/core/CMakeLists.txt:772-772
Timestamp: 2025-08-09T04:07:27.083Z
Learning: In the CLP project's CMakeLists.txt, when reviewing changes related to the ${zstd_TARGET} variable usage in test linking, the team is planning a refactoring PR to improve this mechanism. Guards for undefined target variables should be deferred to that separate PR rather than being added in focused dependency migration PRs.
Learnt from: quinntaylormitchell
Repo: y-scope/clp PR: 1125
File: components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py:267-291
Timestamp: 2025-09-15T22:20:40.750Z
Learning: For CLP compression jobs, the team has decided to fail the entire job immediately upon encountering any invalid input path, rather than continuing to process valid paths. This decision was made during PR #1125 development.
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.
📚 Learning: 2025-09-28T15:00:22.170Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 1340
File: components/job-orchestration/job_orchestration/executor/compress/compression_task.py:528-528
Timestamp: 2025-09-28T15:00:22.170Z
Learning: In components/job-orchestration/job_orchestration/executor/compress/compression_task.py, there is a suggestion to refactor from passing logger as a parameter through multiple functions to creating a ClpCompressor class that takes the logger as a class member, with current helper functions becoming private member functions.

Applied to files:

  • integration-tests/tests/data/jobs/compression/__init__.py
  • integration-tests/tests/data/jobs/compression/compression_jobs.py
  • integration-tests/tests/utils/clp_job_utils.py
📚 Learning: 2025-01-16T16:58:43.190Z
Learnt from: haiqi96
Repo: y-scope/clp PR: 651
File: components/clp-package-utils/clp_package_utils/scripts/compress.py:0-0
Timestamp: 2025-01-16T16:58:43.190Z
Learning: In the clp-package compression flow, path validation and error handling is performed at the scheduler level rather than in the compress.py script to maintain simplicity and avoid code duplication.

Applied to files:

  • integration-tests/tests/data/jobs/compression/compression_jobs.py
📚 Learning: 2025-08-20T08:36:38.391Z
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1100
File: integration-tests/tests/__init__.py:1-1
Timestamp: 2025-08-20T08:36:38.391Z
Learning: In the CLP project, __init__.py files should include short module-level docstrings describing the package purpose, as required by their ruff linting configuration, rather than being left empty.

Applied to files:

  • integration-tests/tests/data/__init__.py
  • integration-tests/tests/data/jobs/presto/__init__.py
📚 Learning: 2025-08-16T10:24:29.316Z
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1100
File: integration-tests/pyproject.toml:11-13
Timestamp: 2025-08-16T10:24:29.316Z
Learning: In the CLP project's integration-tests directory, the integration tests package is not intended to be shipped or distributed - only the tests are leveraged for local testing purposes, so console script entry points should not be included in pyproject.toml.

Applied to files:

  • integration-tests/tests/data/__init__.py
  • integration-tests/.pytest.ini
  • integration-tests/tests/utils/clp_job_utils.py
  • integration-tests/tests/utils/package_utils.py
  • integration-tests/tests/utils/config.py
📚 Learning: 2025-08-17T16:10:38.722Z
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1100
File: integration-tests/tests/fixtures/integration_test_logs.py:54-56
Timestamp: 2025-08-17T16:10:38.722Z
Learning: For PR #1100 (feat(integration-tests): Add CLP package integration tests boilerplate), do not raise cache weakness problems related to the pytest cache implementation in the integration test logs fixtures.

Applied to files:

  • integration-tests/tests/data/__init__.py
  • integration-tests/.pytest.ini
  • integration-tests/tests/fixtures/integration_test_logs.py
  • integration-tests/tests/utils/asserting_utils.py
  • integration-tests/tests/fixtures/package_instance.py
  • integration-tests/tests/fixtures/package_config.py
  • integration-tests/tests/test_package_start.py
  • integration-tests/tests/utils/package_utils.py
📚 Learning: 2025-08-08T21:15:10.905Z
Learnt from: haiqi96
Repo: y-scope/clp PR: 1100
File: integration-tests/tests/test_identity_transformation.py:87-95
Timestamp: 2025-08-08T21:15:10.905Z
Learning: In the CLP project's integration tests (Python code), variable names should use "logs" (plural) rather than "log" (singular) when referring to test logs or log-related entities, as this aligns with the naming conventions used throughout the codebase.

Applied to files:

  • integration-tests/tests/data/__init__.py
📚 Learning: 2025-07-28T08:33:57.487Z
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1100
File: integration-tests/tests/test_identity_transformation.py:133-137
Timestamp: 2025-07-28T08:33:57.487Z
Learning: In CLP integration tests, the `run_and_assert` utility function should only be used for commands that test CLP package functionality (like clp/clp-s compression/decompression operations). For processing or helper commands (like jq, sort, etc.), direct `subprocess.run` calls should be used instead, as they serve different purposes and don't need the same error handling patterns.

Applied to files:

  • integration-tests/tests/data/__init__.py
  • integration-tests/tests/utils/clp_job_utils.py
  • integration-tests/tests/utils/package_utils.py
📚 Learning: 2025-08-18T05:43:07.868Z
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1184
File: components/core/cmake/Modules/FindLibLZMA.cmake:21-24
Timestamp: 2025-08-18T05:43:07.868Z
Learning: In the CLP project, all supplied `<lib>_ROOT` variables will live within the `CLP_CORE_DEPS_DIR` as part of their architectural design. This means that using CLP_CORE_DEPS_DIR for library discovery in custom Find modules is the intended approach, and prioritizing individual `<lib>_ROOT` variables over CLP_CORE_DEPS_DIR is not necessary.

Applied to files:

  • integration-tests/.pytest.ini
📚 Learning: 2025-05-06T10:07:04.654Z
Learnt from: kirkrodrigues
Repo: y-scope/clp PR: 881
File: components/core/tools/scripts/lib_install/macos/install-all.sh:11-12
Timestamp: 2025-05-06T10:07:04.654Z
Learning: In CLP installation scripts, temporary directories with downloaded files should not be automatically cleaned up on failure (e.g., with EXIT traps) to preserve artifacts for debugging purposes.

Applied to files:

  • integration-tests/tests/fixtures/package_config.py
📚 Learning: 2025-10-27T07:07:37.901Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1501
File: tools/deployment/presto-clp/scripts/init.py:10-13
Timestamp: 2025-10-27T07:07:37.901Z
Learning: In `tools/deployment/presto-clp/scripts/init.py`, the `DATABASE_COMPONENT_NAME` and `DATABASE_DEFAULT_PORT` constants are intentionally duplicated from `clp_py_utils.clp_config` because `clp_py_utils` is not installed in the Presto init script's runtime environment. The two flows are separate and this duplication is documented. There are plans to merge these flows after a future release.

Applied to files:

  • integration-tests/tests/utils/clp_mode_utils.py
📚 Learning: 2025-11-10T05:19:56.600Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1575
File: components/clp-py-utils/clp_py_utils/clp_config.py:602-607
Timestamp: 2025-11-10T05:19:56.600Z
Learning: In the y-scope/clp repository, the `ApiServer` class in `components/clp-py-utils/clp_py_utils/clp_config.py` does not need a `transform_for_container()` method because no other containerized service depends on the API server - it's only accessed from the host, so no docker-network communication is expected.

Applied to files:

  • integration-tests/tests/utils/clp_mode_utils.py
📚 Learning: 2025-08-25T16:27:50.549Z
Learnt from: davemarco
Repo: y-scope/clp PR: 1198
File: components/webui/server/src/plugins/app/Presto.ts:38-43
Timestamp: 2025-08-25T16:27:50.549Z
Learning: In the CLP webui Presto configuration, host and port are set via package settings (configurable), while user, catalog, and schema are set via environment variables (environment-specific). This mixed approach is intentional - settings are typically set by package and some values don't need to be package-configurable.

Applied to files:

  • integration-tests/tests/utils/clp_mode_utils.py
🧬 Code graph analysis (8)
integration-tests/tests/data/jobs/post_compression/search_jobs.py (1)
integration-tests/tests/utils/config.py (1)
  • PackagePostCompressionJob (181-203)
integration-tests/tests/data/jobs/compression/compression_jobs.py (1)
integration-tests/tests/utils/config.py (1)
  • PackageCompressionJob (157-177)
integration-tests/tests/utils/docker_utils.py (1)
integration-tests/tests/utils/utils.py (1)
  • get_binary_path (13-23)
integration-tests/tests/fixtures/integration_test_logs.py (2)
integration-tests/tests/fixtures/path_configs.py (1)
  • integration_test_path_config (20-24)
integration-tests/tests/utils/config.py (2)
  • IntegrationTestPathConfig (330-350)
  • IntegrationTestLogs (354-378)
integration-tests/tests/data/jobs/presto/presto_filters.py (1)
integration-tests/tests/utils/config.py (1)
  • PrestoFilterJob (208-221)
integration-tests/tests/fixtures/package_instance.py (2)
integration-tests/tests/utils/package_utils.py (2)
  • start_presto_cluster (91-167)
  • stop_presto_cluster (190-217)
integration-tests/tests/fixtures/package_config.py (1)
  • fixt_package_config (30-82)
integration-tests/tests/data/jobs/post_compression/admin_tools_jobs.py (1)
integration-tests/tests/utils/config.py (1)
  • PackagePostCompressionJob (181-203)
integration-tests/tests/test_package_start.py (4)
integration-tests/tests/utils/asserting_utils.py (3)
  • validate_presto_running (100-115)
  • validate_running_mode_correct (77-97)
  • validate_package_running (46-74)
integration-tests/tests/utils/clp_job_utils.py (1)
  • dispatch_test_jobs (141-181)
integration-tests/tests/utils/config.py (1)
  • PackageInstance (274-326)
integration-tests/tests/fixtures/package_instance.py (1)
  • fixt_package_instance (23-61)
🪛 LanguageTool
integration-tests/tests/data/logs/text-multifile/logs/apollo-17_day10.txt

[uncategorized] ~4-~4: This expression is usually spelled with a hyphen.
Context: ... navigation sightings compare well with ground based tracking solutions; all primary systems...

(BASED_HYPHEN)

integration-tests/tests/data/logs/text-multifile/logs/apollo-17_day03.txt

[style] ~10-~10: ‘in brief’ might be wordy. Consider a shorter alternative.
Context: ...-17 d03 e009 attitude control jets fire in brief pulses to maintain thermal attitudes; a...

(EN_WORDINESS_PREMIUM_IN_BRIEF)

integration-tests/tests/data/logs/text-singlefile/logs/voyager-1.txt

[uncategorized] ~18-~18: If this is a compound adjective that modifies the following noun, use a hyphen.
Context: ...ay detector records gradual increase in high energy particles as solar influence weakens wi...

(EN_COMPOUND_ADJECTIVE_INTERNAL)


[uncategorized] ~22-~22: If this is a compound adjective that modifies the following noun, use a hyphen.
Context: ...ay detector records gradual increase in high energy particles as solar influence weakens wi...

(EN_COMPOUND_ADJECTIVE_INTERNAL)


[uncategorized] ~26-~26: If this is a compound adjective that modifies the following noun, use a hyphen.
Context: ...ay detector records gradual increase in high energy particles as solar influence weakens wi...

(EN_COMPOUND_ADJECTIVE_INTERNAL)


[uncategorized] ~30-~30: If this is a compound adjective that modifies the following noun, use a hyphen.
Context: ...ay detector records gradual increase in high energy particles as solar influence weakens wi...

(EN_COMPOUND_ADJECTIVE_INTERNAL)


[uncategorized] ~34-~34: If this is a compound adjective that modifies the following noun, use a hyphen.
Context: ...ay detector records gradual increase in high energy particles as solar influence weakens wi...

(EN_COMPOUND_ADJECTIVE_INTERNAL)


[uncategorized] ~38-~38: If this is a compound adjective that modifies the following noun, use a hyphen.
Context: ...ay detector records gradual increase in high energy particles as solar influence weakens wi...

(EN_COMPOUND_ADJECTIVE_INTERNAL)


[uncategorized] ~40-~40: If this is a compound adjective that modifies the following noun, use a hyphen.
Context: ...pdated to Voyager Interstellar Mission; long term science focus shifts to heliospheric bo...

(EN_COMPOUND_ADJECTIVE_INTERNAL)


[uncategorized] ~44-~44: If this is a compound adjective that modifies the following noun, use a hyphen.
Context: ...pdated to Voyager Interstellar Mission; long term science focus shifts to heliospheric bo...

(EN_COMPOUND_ADJECTIVE_INTERNAL)


[uncategorized] ~48-~48: If this is a compound adjective that modifies the following noun, use a hyphen.
Context: ...pdated to Voyager Interstellar Mission; long term science focus shifts to heliospheric bo...

(EN_COMPOUND_ADJECTIVE_INTERNAL)


[uncategorized] ~52-~52: If this is a compound adjective that modifies the following noun, use a hyphen.
Context: ...pdated to Voyager Interstellar Mission; long term science focus shifts to heliospheric bo...

(EN_COMPOUND_ADJECTIVE_INTERNAL)


[misspelling] ~112-~112: This expression is normally spelled as one or with a hyphen.
Context: ...ary crossing signatures continues using multi year combined instrument observations 2013-1...

(EN_COMPOUNDS_MULTI_YEAR)


[misspelling] ~116-~116: This expression is normally spelled as one or with a hyphen.
Context: ...ary crossing signatures continues using multi year combined instrument observations 2015-0...

(EN_COMPOUNDS_MULTI_YEAR)


[misspelling] ~120-~120: This expression is normally spelled as one or with a hyphen.
Context: ...ary crossing signatures continues using multi year combined instrument observations 2016-0...

(EN_COMPOUNDS_MULTI_YEAR)


[misspelling] ~124-~124: This expression is normally spelled as one or with a hyphen.
Context: ...ary crossing signatures continues using multi year combined instrument observations 2017-1...

(EN_COMPOUNDS_MULTI_YEAR)


[misspelling] ~129-~129: This word is normally spelled with a hyphen.
Context: ...e forces selective instrument shutdown; power sharing strategy revised to extend mission life...

(EN_COMPOUNDS_POWER_SHARING)


[misspelling] ~133-~133: This word is normally spelled with a hyphen.
Context: ...e forces selective instrument shutdown; power sharing strategy revised to extend mission life...

(EN_COMPOUNDS_POWER_SHARING)


[misspelling] ~137-~137: This word is normally spelled with a hyphen.
Context: ...e forces selective instrument shutdown; power sharing strategy revised to extend mission life...

(EN_COMPOUNDS_POWER_SHARING)

integration-tests/tests/data/logs/text-multifile/logs/apollo-17_day08.txt

[uncategorized] ~7-~7: If this is a compound adjective that modifies the following noun, use a hyphen.
Context: ...and commemorative plaques documented in high resolution imagery; all primary systems nominal; c...

(EN_COMPOUND_ADJECTIVE_INTERNAL)


[style] ~14-~14: ‘prior to’ might be wordy. Consider a shorter alternative.
Context: ...7 apollo-17 d08 e013 crew rests briefly prior to ascent timeline for launch from Moon; a...

(EN_WORDINESS_PREMIUM_PRIOR_TO)

integration-tests/tests/data/logs/text-multifile/logs/apollo-17_day12.txt

[uncategorized] ~15-~15: If this is a compound adjective that modifies the following noun, use a hyphen.
Context: ...-17 d12 e014 crew reports readiness for high speed entry and ocean splashdown; all primary...

(EN_COMPOUND_ADJECTIVE_INTERNAL)

🪛 markdownlint-cli2 (0.18.1)
integration-tests/tests/data/logs/text-multifile/README.md

4-4: Trailing spaces
Expected: 0 or 2; Actual: 1

(MD009, no-trailing-spaces)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: package-image
  • GitHub Check: build (ubuntu-24.04)
  • GitHub Check: lint-check (ubuntu-24.04)
  • GitHub Check: lint-check (macos-15)
  • GitHub Check: build (macos-15)
🔇 Additional comments (52)
integration-tests/tests/data/__init__.py (1)

1-1: Module-level docstring is appropriate. The file correctly includes a short, descriptive docstring identifying the package purpose, which satisfies the ruff linting requirements per project standards.

integration-tests/tests/data/logs/json-singlefile-2/logs/new-horizons.jsonl (1)

1-150: Acknowledging test data file as requested.

The PR title indicates this should not be reviewed. The JSONL format is valid, timestamps are monotonically increasing, and the schema is consistent across all 150 records. The test data appears suitable for integration testing purposes.

integration-tests/tests/data/ground-truths/filter-select-distinct-postgresql.jsonl (1)

1-3: LGTM!

The ground-truth data is valid JSON with appropriate PostgreSQL error severity values for testing filter/select/distinct operations.

integration-tests/tests/data/ground-truths/filter-describe-postgresql.jsonl (1)

1-22: LGTM!

The ground-truth data is valid JSON with appropriate PostgreSQL column definitions for testing describe operations.

integration-tests/tests/data/jobs/presto/__init__.py (1)

1-1: LGTM!

The module docstring follows the project's linting requirements for package initializers.

integration-tests/tests/data/logs/json-singlefile-2/README.md (1)

1-16: LGTM!

The README metadata is consistent and clearly documents the test dataset characteristics.

integration-tests/tests/data/logs/json-singlefile-1/README.md (1)

1-16: LGTM!

The README metadata accurately reflects the artemis_ii.jsonl data file, with timestamps matching the first and last entries.

integration-tests/.pytest.ini (1)

14-17: LGTM!

The new environment variables provide clear test path configuration for repository, data, jobs, and logs directories.

integration-tests/tests/data/jobs/compression/__init__.py (1)

1-1: LGTM!

The module docstring follows the project's linting requirements for package initializers.

integration-tests/tests/data/logs/json-multifile/logs/sts-135-2011-07-09.jsonl (1)

1-8: LGTM!

The JSONL data has consistent structure with properly ordered timestamps and sequential line indices.

integration-tests/tests/data/logs/text-singlefile/README.md (2)

3-4: Verify the log type - inconsistency between README documentation and directory naming.

The README states the type is "JSON", but the directory is named "text-singlefile", which suggests the logs should be text-based rather than JSON format. Inspect the actual log files in this directory to determine the correct type and update the README or directory name accordingly.


6-7: Verify the file count in the text-singlefile directory.

The README states "Number of files: 5", but this needs verification against the actual file count in integration-tests/tests/data/logs/text-singlefile/. Check the directory to confirm the actual number of files matches the documented value.

integration-tests/tests/data/logs/json-singlefile-1/logs/artemis_ii.jsonl (1)

1-150: Verify suffix inconsistency pattern and testing intent.

The analysis confirms 35 of 150 entries contain the suffix "This log documents Artemis II integrated systems performance during the crewed lunar test mission." while 115 do not, creating an irregular mixed-format pattern. However, line 128 was incorrectly listed in the review as containing the suffix when it does not. Determine whether this mixed formatting is intentional for testing log parsing robustness or requires standardization to a consistent format.

integration-tests/tests/data/logs/json-multifile/logs/sts-135-2011-07-08.jsonl (1)

1-8: Test data is well-structured and properly formatted.

The JSONL file contains properly formatted JSON objects with consistent fields and sequential timestamps appropriate for testing CLP package functionality.

integration-tests/tests/data/logs/json-multifile/logs/sts-135-2011-07-19.jsonl (1)

1-8: Test data is well-structured and properly formatted.

The JSONL file contains properly formatted JSON objects representing STS-135 mission day 11 events with consistent structure and sequential timestamps.

integration-tests/tests/data/logs/json-multifile/logs/sts-135-2011-07-11.jsonl (1)

1-8: Test data is well-structured and properly formatted.

The JSONL file contains properly formatted JSON objects representing STS-135 mission day 3 events with consistent structure and sequential timestamps.

integration-tests/tests/fixtures/integration_test_logs.py (1)

46-58: Fixture implementation follows established patterns.

The spark_event_logs fixture consistently follows the same pattern as existing fixtures (hive_24hr, postgresql), using the shared _download_and_extract_dataset helper function with appropriate parameters.

integration-tests/tests/data/logs/text-singlefile/logs/voyager-1.txt (1)

1-150: Test data is properly formatted for integration testing.

The text file contains well-structured Voyager-1 mission logs with consistent formatting (ISO timestamps, spacecraft identifier, event type, and descriptions). The static analysis hints regarding hyphenation are stylistic suggestions for the log message content and are not applicable to test data.

integration-tests/tests/data/jobs/post_compression/__init__.py (1)

1-1: Package initializer is appropriate.

The file properly establishes the post_compression package namespace with a descriptive docstring.

integration-tests/tests/data/jobs/presto/presto_filters.py (1)

1-5: Scaffolding is properly structured.

The module provides a well-typed, empty dictionary for Presto filter job definitions, following good patterns for test job configuration infrastructure.

integration-tests/tests/data/logs/json-multifile/README.md (1)

1-16: Documentation clearly describes the test dataset.

The README provides accurate metadata about the json-multifile logs, including type, file count, events per file, and timestamp range, which aligns with the actual JSONL files added in this PR.

integration-tests/tests/conftest.py (1)

3-35: Pytest options and BASE_PORT ini wiring look coherent

The added ini key and CLI options match downstream usage (NO_JOBS, JOB_NAME_CONTAINS, BASE_PORT) and provide clear help text; configuration and typing look sound.

integration-tests/tests/utils/docker_utils.py (1)

3-3: New Docker helper correctly narrows containers by name prefix

The combination of docker ps --filter name=... with a re.fullmatch(re.escape(prefix) + r"\d+") check gives a precise prefix + digits match while avoiding regex pitfalls; error handling is consistent with the existing compose helper.

Also applies to: 39-64
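To illustrate, the prefix-plus-digits check the review describes can be sketched as follows (the helper name is hypothetical; the real utility additionally shells out to docker ps with a name filter before applying this match):

```python
import re


def matches_prefix_with_digits(name: str, prefix: str) -> bool:
    # re.escape() neutralizes regex metacharacters in the prefix (e.g. '-'),
    # so only a literal prefix followed by one or more digits matches.
    # re.fullmatch() anchors both ends, rejecting names with trailing text.
    return re.fullmatch(re.escape(prefix) + r"\d+", name) is not None
```

For example, "presto-clp-worker-1" matches the prefix "presto-clp-worker-", while "presto-clp-worker-1-extra" and the bare prefix do not.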

integration-tests/tests/test_package_start.py (1)

7-9: End-to-end package, mode, and job validation flow reads solid

Validating package components, then mode correctness, then conditionally checking Presto for clp-json-presto, and finally dispatching jobs only when package_job_list is present aligns cleanly with the fixture/job list design and avoids unnecessary work when --no-jobs is used.

Also applies to: 10-12, 14-16, 25-44

integration-tests/tests/data/logs/json-multifile/logs/sts-135-2011-07-21.jsonl (1)

1-8: JSONL log entries are well-formed and consistent

All records share the expected schema, timestamps are strictly increasing, and line_index values are unique and ordered, which should work well for multifile log test scenarios.

integration-tests/tests/data/jobs/post_compression/search_jobs.py (1)

1-5: Search job registry skeleton matches existing job-modelling pattern

Defining PACKAGE_SEARCH_JOBS: dict[str, PackagePostCompressionJob] = {} mirrors the other job registries and gives a clear, type-safe place to add search jobs later without affecting current test behaviour.

integration-tests/tests/utils/utils.py (1)

8-11: YAML loading helper is robust and consistent with its contract

load_yaml_to_dict cleanly distinguishes parse/IO failures (ValueError) from structural issues (TypeError on non-mapping), and the typing/return contract (dict[str, Any]) matches the enforced top-level mapping requirement.

Also applies to: 69-93
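The error-mapping contract described above (parse/IO failures surface as ValueError, a non-mapping top level as TypeError) can be sketched as follows; for a self-contained illustration this uses the stdlib json module in place of a YAML parser, and the function name is hypothetical:

```python
import json
from pathlib import Path
from typing import Any


def load_json_to_dict(path: Path) -> dict[str, Any]:
    # Parse and I/O failures both become ValueError, so callers see one
    # exception type for "could not load this file".
    try:
        data = json.loads(path.read_text())
    except (OSError, json.JSONDecodeError) as err:
        raise ValueError(f"Failed to load {path}: {err}") from err
    # A structurally valid file whose top level is not a mapping is a
    # different kind of error, reported as TypeError.
    if not isinstance(data, dict):
        raise TypeError(f"Top level of {path} must be a mapping.")
    return data
```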

integration-tests/tests/data/jobs/post_compression/admin_tools_jobs.py (1)

1-5: Admin-tools job registry skeleton is aligned with the job model

The PACKAGE_ADMIN_TOOLS_JOBS: dict[str, PackagePostCompressionJob] = {} definition follows the same pattern as other job registries and safely acts as a placeholder until specific admin-tools jobs are added.

integration-tests/tests/utils/asserting_utils.py (2)

9-21: LGTM!

The new imports are well-organized and correctly support the validation utilities for mode verification and Presto container checks.


77-97: LGTM!

The validate_running_mode_correct function correctly validates the shared config against the intended configuration by comparing mode signatures. The error handling for ValidationError is appropriate.

integration-tests/tests/data/jobs/compression/compression_jobs.py (1)

1-38: LGTM!

The compression job definitions are well-structured and follow the PackageCompressionJob dataclass schema. The dictionary keys appropriately match the job_name fields, which aids maintainability and lookup consistency.

integration-tests/tests/fixtures/package_config.py (4)

3-26: LGTM!

The imports are well-organized and the logger setup follows the standard pattern for the test infrastructure.


45-54: LGTM!

Good defensive error handling for the BASE_PORT configuration value. The explicit error message clearly indicates the expected type.


59-70: LGTM!

The job list construction correctly respects the NO_JOBS and JOB_NAME_CONTAINS options, allowing flexible test filtering.


75-82: LGTM!

The enhanced teardown properly cleans up both the temporary config file and the generated data/tmp directories, ensuring test isolation.

integration-tests/tests/utils/clp_mode_utils.py (4)

44-56: LGTM!

The new clp-json-presto mode configuration correctly sets up the Presto query engine with CLP_S storage, disables results cache retention (appropriate for Presto-based queries), and configures the Presto coordinator.


97-114: LGTM!

The compute_mode_signature function provides a clean way to compare mode-defining aspects of configurations. The tuple captures the essential fields that determine operating mode behaviour.
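A minimal sketch of the signature idea, with hypothetical field names standing in for the actual mode-defining fields compared by the real function:

```python
from typing import Any


def compute_mode_signature(config: dict[str, Any]) -> tuple:
    # Project only the fields that determine operating mode; incidental
    # settings (ports, paths, etc.) are deliberately excluded so that two
    # configs with the same mode compare equal.
    return (
        config.get("query_engine"),
        config.get("storage_engine"),
        config.get("deployment_type"),
    )
```

Two configurations then agree on mode exactly when their signature tuples are equal, regardless of other differences.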


139-145: LGTM!

The refactored initialization correctly builds the component list by extending from CLP_BASE_COMPONENTS and conditionally adding query components and the reducer for FULL deployment type.


91-94: Consider using a constant for the Presto worker component name.

PRESTO_COORDINATOR_COMPONENT_NAME is imported from clp_py_utils.clp_config, but "presto-worker" is hardcoded. Verify whether a PRESTO_WORKER_COMPONENT_NAME constant exists in clp_py_utils.clp_config and use it for consistency with the coordinator component name.

integration-tests/tests/utils/port_utils.py (6)

9-18: LGTM!

Constants are well-defined with clear naming. The valid port range correctly excludes privileged ports (0-1023) and includes the full range up to 65535.


21-33: LGTM!

The PortAssignment dataclass is well-structured with clear documentation. Using Any for component is appropriate given the dynamic nature of CLP config components.


36-58: LGTM!

The assign_ports_from_base function correctly discovers port requirements, validates the range, checks availability, and assigns ports sequentially. The logic is sound and well-organized.


61-76: LGTM!

Port availability checking logic is correct. The error message is informative and actionable.


128-141: Socket resource handling is correct.

The context manager properly ensures the socket is closed. Catching OSError covers both address-in-use and other binding failures appropriately.
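A sketch of this bind-probe pattern (helper name hypothetical; the real utility raises an informative error rather than returning a bool):

```python
import socket


def is_port_available(port: int) -> bool:
    # Probe by binding to the port; the context manager guarantees the
    # probe socket is closed whether or not the bind succeeds.
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            # SO_REUSEADDR lets the probe succeed on ports lingering in
            # TIME_WAIT, while a live listener still causes EADDRINUSE.
            sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
            sock.bind(("127.0.0.1", port))
        return True
    except OSError:
        # Covers address-in-use as well as other binding failures.
        return False
```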


144-163: LGTM!

Bounds validation logic is correct. The error message clearly indicates the issue and provides both the required and valid ranges.

integration-tests/tests/utils/package_utils.py (2)

258-263: Placeholder assertions should be tracked.

The assert True statements with TODO comments are placeholders that effectively skip validation. While the TODOs reference specific PRs/issues, these could easily be forgotten.

Consider adding pytest.skip() with a reason instead, which makes the skip visible in test reports:

     if compression_job.mode in ("clp-json", "clp-json-presto"):
-        # TODO: Waiting for PR 1299 to be merged.
-        assert True
+        pytest.skip("Decompression validation pending PR #1299")
     elif compression_job.mode == "clp-text":
-        # TODO: As of the morning of Dec. 10, clp-text decompression is broken.
-        assert True
+        pytest.skip("clp-text decompression is currently broken")

91-167: Good orchestration of Presto cluster startup.

The function correctly validates paths, generates split filters from compression jobs, writes configuration, and starts the Docker Compose cluster.

integration-tests/tests/utils/config.py (5)

156-178: LGTM!

PackageCompressionJob dataclass is well-structured with clear field documentation. The frozen=True ensures immutability which is appropriate for job definitions.


180-204: LGTM!

PackagePostCompressionJob correctly models post-compression dependencies with a reference to the parent PackageCompressionJob. The requires_archive_id default is sensible.


206-231: LGTM!

PrestoFilterJob and PackageJobList dataclasses are well-defined. The comment explaining why Presto filters have their own class is helpful.


130-153: LGTM!

New script path properties follow the established pattern and provide convenient access to package scripts. The paths are consistent with the CLP package structure.


283-302: Good addition of shared_config_file_path validation.

The new field is properly initialized and validated during __post_init__, ensuring the shared config file exists before the instance is considered valid.

# Description of logs

**Type**
Unstructured text

⚠️ Potential issue | 🟡 Minor

Remove trailing whitespace from line 4.

The line ends with a single trailing space, which violates markdown formatting standards.

Apply this diff to fix the trailing whitespace:

 **Type**
-Unstructured text 
+Unstructured text
🧰 Tools
🪛 markdownlint-cli2 (0.18.1)

4-4: Trailing spaces
Expected: 0 or 2; Actual: 1

(MD009, no-trailing-spaces)

🤖 Prompt for AI Agents
integration-tests/tests/data/logs/text-multifile/README.md around line 4: the
line "Unstructured text " contains a single trailing space; remove the trailing
whitespace at the end of line 4 so the line ends with "Unstructured text" (no
extra spaces) and save the file.

@@ -1,5 +1,6 @@
"""Fixtures that start and stop CLP package instances for integration tests."""

import logging

🧹 Nitpick | 🔵 Trivial

Fixture behaviour for NO_JOBS and Presto looks correct, with a small opportunity to preserve error detail

The skip condition when package_job_list is None and not no_jobs matches how the job list is built and avoids spinning up instances for modes with no matching jobs, and the Presto start/stop tied to mode_name == "clp-json-presto" in try/finally ensures cleanup. One optional improvement would be to catch RuntimeError as err and include err in the pytest.fail message (or log it) so callers keep the original failure context alongside the helpful BASE_PORT hint.

Also applies to: 13-17, 19-19, 23-26, 31-31, 35-40, 46-60

🤖 Prompt for AI Agents
In integration-tests/tests/fixtures/package_instance.py around lines 3, 13-17,
19, 23-26, 31, 35-40 and 46-60, the tests currently call pytest.fail (or
otherwise re-raise) on RuntimeError without preserving the original exception
details; update each try/except that catches RuntimeError to use "except
RuntimeError as err" and include the err in the failure message (e.g.,
pytest.fail(f"...: {err}") or log the err before failing) so the original
exception context is preserved alongside the existing BASE_PORT hint and cleanup
logic.

Comment on lines 100 to 115
def validate_presto_running() -> None:
    """
    Validate that a Presto cluster is running.

    :param package_instance:
    """
    required_components = CLP_PRESTO_COMPONENTS

    for component in required_components:
        prefix = f"presto-clp-{component}-"
        running_matches = list_running_containers_with_prefix(prefix)
        if len(running_matches) == 0:
            pytest.fail(
                f"No running container found for component '{component}' "
                f"(expected name prefix '{prefix}')."
            )

⚠️ Potential issue | 🟡 Minor

Remove stale docstring parameter.

The docstring references :param package_instance: but the function takes no parameters.

 def validate_presto_running() -> None:
     """
     Validate that a Presto cluster is running.
-
-    :param package_instance:
     """
     required_components = CLP_PRESTO_COMPONENTS
🤖 Prompt for AI Agents
In integration-tests/tests/utils/asserting_utils.py around lines 100 to 115, the
function validate_presto_running() has a stale docstring parameter reference
":param package_instance:" that does not match the function signature; remove
that incorrect param line from the docstring and, if desired, replace it with a
brief description of the function or relevant params (none in this case) so the
docstring accurately reflects the function signature.

        str(package_config.path_config.clp_package_dir),
    ]
    # fmt: on
    subprocess.run(setup_presto_cmd, check=True)

🧹 Nitpick | 🔵 Trivial

Consider adding timeouts to subprocess calls.

The subprocess.run calls for Presto setup, start, and stop don't have timeouts. If the Docker daemon is unresponsive or the commands hang, the test will block indefinitely.

-    subprocess.run(setup_presto_cmd, check=True)
+    subprocess.run(setup_presto_cmd, check=True, timeout=DEFAULT_CMD_TIMEOUT_SECONDS)

Apply similarly to lines 167 and 202.

Also applies to: 167-167, 202-202

🤖 Prompt for AI Agents
In integration-tests/tests/utils/package_utils.py around lines 112, 167, and
202, the subprocess.run invocations for Presto setup/start/stop lack timeouts
and can hang indefinitely; add a reasonable timeout argument (e.g., timeout=30
or configurable via a constant/ENV var) to each subprocess.run call and wrap
them in try/except to catch subprocess.TimeoutExpired to perform cleanup or
raise a clear exception; ensure any resources (docker containers/processes) are
cleaned up or the test fails fast when a timeout occurs.
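A minimal sketch of the suggested pattern, assuming a DEFAULT_CMD_TIMEOUT_SECONDS constant and a hypothetical wrapper name:

```python
import subprocess

# Assumed constant; tune for the slowest expected Docker operation.
DEFAULT_CMD_TIMEOUT_SECONDS = 30


def run_with_timeout(cmd: list[str], timeout: float = DEFAULT_CMD_TIMEOUT_SECONDS) -> None:
    # check=True surfaces non-zero exit codes as CalledProcessError;
    # timeout= keeps an unresponsive daemon from blocking the test run
    # indefinitely (subprocess.run kills the child on expiry).
    try:
        subprocess.run(cmd, check=True, timeout=timeout)
    except subprocess.TimeoutExpired as err:
        raise RuntimeError(f"Command timed out after {timeout}s: {err.cmd}") from err
```

Any container cleanup would still need to happen in the caller's finally block, since a killed command may leave resources behind.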

Comment on lines +122 to +133
    if package_compression_jobs is not None:
        for package_compression_job in package_compression_jobs:
            flags = package_compression_job.flags
            if flags is not None:
                dataset_name = flags["--dataset"]
                timestamp_key = flags["--timestamp-key"]

            # Skip jobs that do not specify both dataset_name and timestamp_key.
            if dataset_name is None or timestamp_key is None:
                continue

            dataset_timestamp_dict[dataset_name] = timestamp_key

⚠️ Potential issue | 🔴 Critical

Potential UnboundLocalError if flags is None.

When flags is None, the variables dataset_name and timestamp_key are never assigned, but the code at lines 130-131 still checks and uses them. This will raise an UnboundLocalError.

Apply this diff to fix the scoping issue:

     if package_compression_jobs is not None:
         for package_compression_job in package_compression_jobs:
             flags = package_compression_job.flags
-            if flags is not None:
-                dataset_name = flags["--dataset"]
-                timestamp_key = flags["--timestamp-key"]
-
-            # Skip jobs that do not specify both dataset_name and timestamp_key.
-            if dataset_name is None or timestamp_key is None:
+            if flags is None:
+                continue
+
+            dataset_name = flags.get("--dataset")
+            timestamp_key = flags.get("--timestamp-key")
+
+            # Skip jobs that do not specify both dataset_name and timestamp_key.
+            if dataset_name is None or timestamp_key is None:
                 continue

             dataset_timestamp_dict[dataset_name] = timestamp_key

This also uses .get() to avoid KeyError if the flags exist but don't contain the expected keys.

🤖 Prompt for AI Agents
In integration-tests/tests/utils/package_utils.py around lines 122 to 133, the
code reads flags = package_compression_job.flags but only assigns dataset_name
and timestamp_key when flags is not None, yet later checks those variables
unconditionally which can raise UnboundLocalError; fix by initializing
dataset_name and timestamp_key to None before the flags check, then if flags is
not None assign dataset_name = flags.get("--dataset") and timestamp_key =
flags.get("--timestamp-key") and keep the existing skip/continue logic so jobs
missing flags or keys are ignored.

Comment on lines 344 to 351
def _compare_output_with_ground_truth(output: str, ground_truth_file_path: Path) -> bool:
    with tempfile.TemporaryDirectory() as temporary_output_directory_name:
        # Write output to tempfile.
        output_file_path = Path(temporary_output_directory_name) / "output.tmp"
        output_file_path.write_text(output)

        # Compare tempfile with ground truth.
        return is_dir_tree_content_equal(output_file_path, ground_truth_file_path)

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Search for the is_dir_tree_content_equal function definition
rg -n "def is_dir_tree_content_equal" --type py -A 20

Repository: y-scope/clp

Length of output: 166


🌐 Web query:

is_dir_tree_content_equal function Python file directory comparison

💡 Result:

Here’s a concise, robust implementation to compare two directory trees by names and file contents (recursively), using Python’s filecmp utilities:

Code:

import os
import filecmp

def is_dir_tree_content_equal(dir1: str, dir2: str) -> bool:
    """Return True if dir trees rooted at dir1 and dir2 have the same
    filenames, subdirectories and file contents (recursively)."""
    dcmp = filecmp.dircmp(dir1, dir2)
    if dcmp.left_only or dcmp.right_only or dcmp.funny_files:
        return False

    # compare common files by content
    match, mismatch, errors = filecmp.cmpfiles(dir1, dir2, dcmp.common_files, shallow=False)
    if mismatch or errors:
        return False

    # recurse into common subdirectories
    for sub in dcmp.common_dirs:
        if not is_dir_tree_content_equal(os.path.join(dir1, sub), os.path.join(dir2, sub)):
            return False

    return True

Notes:

  • Uses shallow=False so file contents are compared (not just stat). See Python’s filecmp docs. [1]
  • A similar recursive approach is shown and discussed on Stack Overflow. [2]

Sources:
[1] Python filecmp documentation.
[2] Stack Overflow — recursive directory comparison example.


Function compares single file, but uses directory comparison utility.

_compare_output_with_ground_truth writes output to a single file but uses is_dir_tree_content_equal for comparison. The underlying utility is designed for directory tree comparison using filecmp.dircmp(), which expects directory paths. Comparing a single file path with this function is semantically incorrect and may fail at runtime. Consider either wrapping the file comparison in a directory structure or using a dedicated file comparison function.

🤖 Prompt for AI Agents
In integration-tests/tests/utils/package_utils.py around lines 344 to 351, the
helper writes the output to a single temp file but then calls
is_dir_tree_content_equal (a directory tree comparator), which is semantically
wrong and can fail; change the comparison to use a proper file comparison
instead (e.g., filecmp.cmp or pathlib.read_text() equality) or, if you want to
keep using the directory comparator, create a temporary directory mirroring the
ground-truth layout (place the temp file under the same relative name) before
calling is_dir_tree_content_equal; update the function to perform one of these
fixes and return the boolean result of the correct file-or-directory comparison.
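A sketch of the single-file fix using filecmp.cmp; the names follow the snippet above but are illustrative:

```python
import filecmp
import tempfile
from pathlib import Path


def compare_output_with_ground_truth(output: str, ground_truth_file_path: Path) -> bool:
    with tempfile.TemporaryDirectory() as temporary_output_directory_name:
        # Write output to a tempfile.
        output_file_path = Path(temporary_output_directory_name) / "output.tmp"
        output_file_path.write_text(output)

        # shallow=False compares file contents rather than os.stat()
        # signatures, which is the right tool for two single files.
        return filecmp.cmp(output_file_path, ground_truth_file_path, shallow=False)
```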

Comment on lines 354 to 362
def run_presto_filter(
    presto_filter_job: PrestoFilterJob,
) -> None:
    """
    Run a Presto filter.

    :param request:
    :param presto_filter:
    :param package_instance:

⚠️ Potential issue | 🟡 Minor

Docstring parameters don't match function signature.

The docstring references request, presto_filter, and package_instance parameters, but the function only accepts presto_filter_job.

 def run_presto_filter(
     presto_filter_job: PrestoFilterJob,
 ) -> None:
     """
     Run a Presto filter.

-    :param request:
-    :param presto_filter:
-    :param package_instance:
+    :param presto_filter_job: The Presto filter job to execute.
     """
🤖 Prompt for AI Agents
In integration-tests/tests/utils/package_utils.py around lines 354 to 362, the
function run_presto_filter(presto_filter_job) has a docstring listing parameters
request, presto_filter, and package_instance that do not match the actual
signature; update the docstring to document the single parameter
presto_filter_job (and its type/purpose) and remove or rename the incorrect
parameter entries so the docstring accurately reflects the function signature
and expected argument.
