8000 ORM: Optimize eager loading 1:N associations using CTEs to replace WHERE IN (...) · Issue #19107 · cakephp/cakephp · GitHub
[go: up one dir, main page]

Skip to content

ORM: Optimize eager loading 1:N associations using CTEs to replace WHERE IN (...) #19107

@celsowm

Description

@celsowm

Description

Problem Description
Currently, when eager loading HasMany or BelongsToMany associations using contain(), CakePHP typically employs a "two-query" strategy (default):

Query 1: Fetches the main records (e.g., Articles).
Query 2: Fetches the associated records (e.g., Comments) using the IDs collected from the first result set, typically via a WHERE foreign_key IN (1, 2, 3...) clause.

While this effectively prevents the N+1 problem, the WHERE IN approach has limitations when working with large datasets:

  1. Packet Size Limits: If the main query returns thousands of rows, the list of IDs passed to the IN clause can hit the max_allowed_packet limit of the database.
  2. Parsing Overhead: The database engine has to parse a massive SQL string containing thousands of integers.
  3. Round-trip Data: We are fetching IDs to PHP memory just to send them immediately back to the database driver.

Proposed Solution:

I propose introducing a loading strategy that utilizes Common Table Expressions (CTEs) (available in MySQL 8.0, PostgreSQL, SQLite, and SQL Server) to handle the association filtering entirely within the database engine.

Instead of passing a PHP array of IDs to the second query, the ORM could generate a CTE based on the constraints of the primary query.

Example Scenario
Context: Articles hasMany Comments.

Current Behavior:

SQL

-- Query 1

SELECT id, title FROM articles WHERE published = 1;

-- PHP extracts IDs: [100, 101, 102 ... 5000]

-- Query 2

SELECT * FROM comments WHERE article_id IN (100, 101, 102 ... 5000);

Proposed Behavior (Using CTE):

The second query could look like this:

WITH TargetArticles AS (
    SELECT id FROM articles WHERE published = 1
)
SELECT comments.* 
FROM comments
INNER JOIN TargetArticles ON comments.article_id = TargetArticles.id;

Benefits

  1. Performance: significantly faster for large result sets as it avoids parsing massive IN lists.
  2. Memory Efficiency: Reduces PHP memory usage by not having to build giant arrays of IDs for query binding.
  3. Scalability: Removes the risk of hitting query length/packet size limits.

Implementation Considerations:

This would likely need to be an opt-in strategy (e.g., 'strategy' => 'cte') or automatically detected based on DB driver capabilities, as older database versions (like MySQL 5.7) do not support CTEs.

Since CakePHP 4/5 requires modern database versions that generally support CTEs, this could eventually become a standard standard approach for HasMany loading.

I would love to hear the core team's thoughts on the feasibility of adding this to the Query Builder/ORM.

CakePHP Version

6.0

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions

      0