8000 Add `last_deletion_at` property to datasets by brimoor · Pull Request #5853 · voxel51/fiftyone · GitHub
[go: up one dir, main page]

Skip to content

Add last_deletion_at property to datasets #5853

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
May 10, 2025
Merged

Conversation

brimoor
Copy link
Contributor
@brimoor brimoor commented May 5, 2025

Change log

  • Adds a Dataset.last_deletion_at property that is automatically updated when samples are deleted
  • Dataset.last_modified_at is no longer automatically updated when samples are deleted
  • Adds documentation for the behavior of FiftyOne's builtin datetime fields to the user guide
  • Updates _is_full_collection() to include full views, eg dataset.view() and full generated views like to_patches()

Motivation

#5723 added automatic incrementing of Dataset.last_modified_at when samples are deleted. This was released in fiftyone==1.5.0.

After reflecting on this implementation in the context of use cases like voxel51/fiftyone-plugins#122, I strongly believe that deletion times should be tracked separately from other dataset metadata changes.

For example, changing a dataset's description and creating a new saved view both increment Dataset.last_modified_at, but these have no effect on whether a saved to_patches() view needs to be reloaded.

As penance for my decision to change course from how deletion is tracked in fiftyone==1.5.0, I've added previously non-existent documentation on automatic datetime fields to the user guide 🙇

Summary by CodeRabbit

  • New Features

    • Added a new property to datasets that records the datetime when a sample was last deleted.
    • Enhanced documentation with a comprehensive guide on builtin datetime fields for datasets and samples, including new examples and usage notes.
  • Bug Fixes

    • Improved accuracy of dataset modification and deletion timestamps for better tracking.
  • Tests

    • Updated and expanded tests to verify correct behavior of the new deletion timestamp and collection status features.
  • Documentation

    • Expanded user guide with detailed explanations and examples for all builtin datetime fields and their update semantics.

@brimoor brimoor requested a review from swheaton May 5, 2025 00:57
@brimoor brimoor requested review from a team as code owners May 5, 2025 00:57
Copy link
Contributor
coderabbitai bot commented May 5, 2025

Walkthrough

This update introduces a new last_deletion_at property to the Dataset class, which tracks the datetime of the most recent sample deletion. The documentation is expanded to explain the semantics of all builtin datetime fields, including last_loaded_at, last_modified_at, and the new last_deletion_at, both at the dataset and sample/frame levels. The SampleCollection class receives optimizations for aggregation methods (min, max) on datetime fields and improved logic to detect when a collection represents the full dataset. Internal ODM and document classes are updated to support the new deletion timestamp, and related tests are added and updated.

Changes

File(s) Change Summary
docs/source/user_guide/using_datasets.rst Added comprehensive documentation for builtin datetime fields, including the new last_deletion_at property, with usage examples and behavioral clarifications.
fiftyone/core/collections.py Enhanced SampleCollection with optimized min/max for datetime fields, improved full collection detection, updated docstrings, and added _none_max utility.
fiftyone/core/dataset.py Introduced last_deletion_at property, updated docstrings for last_modified_at, changed deletion logic to update deletion timestamp, and handled cloning initialization.
fiftyone/core/odm/dataset.py Added last_deletion_at field to DatasetDocument.
fiftyone/core/odm/document.py Added _update_last_deletion_at method to update and persist deletion timestamps.
tests/unittests/dataset_tests.py Renamed and revised test to focus on last_deletion_at behavior and assertions.
tests/unittests/video_tests.py Renamed and updated frame deletion tests to check last_deletion_at and timestamp ordering.
tests/unittests/view_tests.py Added FullCollectionTests class to verify _is_full_collection() logic for datasets, views, and grouped datasets.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant Dataset
    participant SampleCollection
    participant Sample
    participant ODM/Document

    User->>Dataset: Delete sample(s)
    Dataset->>ODM/Document: _update_last_deletion_at()
    ODM/Document->>Dataset: Update last_deletion_at timestamp
    Dataset->>SampleCollection: Reflect updated last_deletion_at
    User->>Dataset: Query last_deletion_at
    Dataset-->>User: Return timestamp of last sample deletion
Loading

Possibly related PRs

  • voxel51/fiftyone#5029: Introduced general min() and max() aggregation methods, which are now optimized for datetime fields in this PR.
  • voxel51/fiftyone#5723: Updated last_modified_at on deletions; this PR instead introduces a separate last_deletion_at property for clearer semantics.
  • voxel51/fiftyone#4787: Improved efficiency of sample deletion batching; related to deletion handling but focuses on performance rather than timestamp tracking.

Suggested labels

enhancement, documentation

Suggested reviewers

  • benjaminpkane
  • swheaton

Poem

In the garden of datasets, a new field now blooms,
last_deletion_at tells when samples met their dooms.
With timestamps for loading, for edits, for fate,
Now deletion's own moment is easy to state!
The docs are much clearer, the tests hop along—
This bunny records every change in a song.
🐰🕰️

Tip

⚡️ Faster reviews with caching
  • CodeRabbit now supports caching for code and dependencies, helping speed up reviews. This means quicker feedback, reduced wait times, and a smoother review experience overall. Cached data is encrypted and stored securely. This feature will be automatically enabled for all accounts on May 16th. To opt out, configure Review - Disable Cache at either the organization or repository level. If you prefer to disable all data retention across your organization, simply turn off the Data Retention setting under your Organization Settings.

Enjoy the performance boost—your workflow just got faster.


📜 Recent review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d577ccf and 6b0582a.

📒 Files selected for processing (2)
  • docs/source/user_guide/using_datasets.rst (2 hunks)
  • fiftyone/core/dataset.py (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • docs/source/user_guide/using_datasets.rst
  • fiftyone/core/dataset.py
⏰ Context from checks skipped due to timeout of 90000ms (12)
  • GitHub Check: test-windows / test-python (windows-latest, 3.12)
  • GitHub Check: test-windows / test-python (windows-latest, 3.11)
  • GitHub Check: test-windows / test-python (windows-latest, 3.10)
  • GitHub Check: test-windows / test-python (windows-latest, 3.9)
  • GitHub Check: test / test-python (ubuntu-latest-m, 3.12)
  • GitHub Check: test / test-python (ubuntu-latest-m, 3.11)
  • GitHub Check: test / test-python (ubuntu-latest-m, 3.10)
  • GitHub Check: test / test-python (ubuntu-latest-m, 3.9)
  • GitHub Check: build / build
  • GitHub Check: test / test-app
  • GitHub Check: e2e / test-e2e
  • GitHub Check: build
✨ Finishing Touches
  • 📝 Generate Docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@brimoor brimoor requested a review from benjaminpkane May 5, 2025 00:58
Copy link
Contributor
@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (3)
docs/source/user_guide/using_datasets.rst (2)

993-996: Add documentation for last_deletion_at property
The new lines clearly describe the Dataset.last_deletion_at property. It might help to add a brief note reminding users that last_modified_at is intentionally unchanged on sample deletions to emphasize the separation of concerns.


2123-2130: Introduce “Builtin datetime fields” section
This new subsection nicely consolidates all automatic timestamp fields. Please verify that this heading is included in the page’s navigation or table of contents so readers can easily locate it. You may also consider adding an explicit cross-reference from the “Dates and datetimes” section to this subsection for better discoverability.

fiftyone/core/dataset.py (1)

737-742: Consider expanding the last_deletion_at docstring

While the current docstring is clear, it might be helpful to provide a bit more context about which operations cause this property to update (similar to the detailed explanation in the last_modified_at docstring).

 @property
 def last_deletion_at(self):
-    """The datetime that a sample was last deleted from the dataset, or
-    ``None`` if no samples have been deleted.
+    """The datetime that a sample was last deleted from the dataset, or
+    ``None`` if no samples have been deleted.
+
+    This property is updated whenever samples or frames are deleted from the
+    dataset via methods like :meth:`delete_samples`, :meth:`delete_frames`,
+    :meth:`clear`, or :meth:`clear_frames`.
     """
     return self._doc.last_deletion_at
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 50420b7 and 3781abd.

📒 Files selected for processing (8)
  • docs/source/user_guide/using_datasets.rst (2 hunks)
  • fiftyone/core/collections.py (6 hunks)
  • fiftyone/core/dataset.py (4 hunks)
  • fiftyone/core/odm/dataset.py (1 hunks)
  • fiftyone/core/odm/document.py (1 hunks)
  • tests/unittests/dataset_tests.py (2 hunks)
  • tests/unittests/video_tests.py (3 hunks)
  • tests/unittests/view_tests.py (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (4)
fiftyone/core/odm/dataset.py (2)
fiftyone/core/dataset.py (1)
  • last_deletion_at (738-742)
fiftyone/core/fields.py (1)
  • DateTimeField (915-944)
fiftyone/core/odm/document.py (5)
fiftyone/core/dataset.py (3)
  • _update_last_deletion_at (8299-8300)
  • last_deletion_at (738-742)
  • save (4293-4299)
fiftyone/core/odm/mixins.py (1)
  • save (1870-1871)
fiftyone/core/collections.py (1)
  • save (144-197)
fiftyone/core/sample.py (2)
  • save (544-546)
  • save (735-743)
fiftyone/core/frame.py (3)
  • save (485-487)
  • save (1042-1044)
  • save (1092-1099)
tests/unittests/dataset_tests.py (2)
fiftyone/core/dataset.py (4)
  • last_deletion_at (738-742)
  • created_at (713-715)
  • clear (5004-5010)
  • delete_samples (3945-3964)
fiftyone/core/view.py (2)
  • keep (1320-1330)
  • clear (1297-1306)
fiftyone/core/dataset.py (1)
fiftyone/core/odm/document.py (1)
  • _update_last_deletion_at (810-815)
⏰ Context from checks skipped due to timeout of 90000ms (10)
  • GitHub Check: test-windows / test-python (windows-latest, 3.12)
  • GitHub Check: test-windows / test-python (windows-latest, 3.11)
  • GitHub Check: test-windows / test-python (windows-latest, 3.10)
  • GitHub Check: test-windows / test-python (windows-latest, 3.9)
  • GitHub Check: test / test-python (ubuntu-latest-m, 3.10)
  • GitHub Check: e2e / test-e2e
  • GitHub Check: test / test-app
  • GitHub Check: build / build
  • GitHub Check: lint / eslint
  • GitHub Check: build
🔇 Additional comments (17)
fiftyone/core/odm/dataset.py (1)

901-901: Added new DateTimeField for tracking sample deletions.

Adding last_deletion_at to the dataset document creates a dedicated timestamp for tracking when samples were deleted, separate from general modifications. This is a good improvement that allows for more precise tracking of dataset state changes.

fiftyone/core/odm/document.py (1)

810-816: Added method to update the deletion timestamp.

This method maintains consistency with the existing _update_last_modified_at implementation pattern. It properly defaults to the current UTC time when no timestamp is provided and saves the document with the virtual=True flag to persist the change.

tests/unittests/dataset_tests.py (1)

620-659: The new test_last_deletion_at test comprehensively validates deletion timestamp tracking.

This test method thoroughly verifies the last_deletion_at property behavior by:

  1. Confirming it's initially None after adding samples
  2. Verifying it's set and strictly increases after each deletion operation
  3. Testing different deletion paths (keep, clear, delete_samples)
  4. Validating dataset size after each operation

The test aligns well with the PR objectives of tracking deletion timestamps separately from other dataset modifications.

tests/unittests/video_tests.py (2)

746-796: Tests correctly verify that frame deletions update last_modified_at but not last_deletion_at

The test has been appropriately renamed from test_last_modified_at_deletions to test_last_modified_at_frame_deletions1 for clarity. The test now correctly verifies that:

  1. The last_deletion_at property is initially None
  2. Frame deletions update the maximum last_modified_at timestamp
  3. After reloading the dataset, last_deletion_at remains None

This properly validates that deletions at the frame level only affect the last_modified_at property and not the last_deletion_at property.


798-850: Tests sample-level frame deletions correctly

The test has been appropriately renamed from test_last_modified_at_deletions_frames to test_last_modified_at_frame_deletions2 for clarity. This test verifies the same behavior as the previous test but at the sample level:

  1. The last_deletion_at property is initially None
  2. Deleting frames from a sample updates the sample's last_modified_at property
  3. After reloading the dataset, last_deletion_at remains None

This correctly validates that frame deletions performed on individual samples only affect the last_modified_at property and not the new last_deletion_at property.

tests/unittests/view_tests.py (1)

5057-5143: Tests for _is_full_collection() enhancements look comprehensive and well-structured.

The added test class FullCollectionTests properly validates the enhanced behavior of the _is_full_collection() method that was updated as part of this PR. The tests cover important scenarios including:

  1. Regular dataset views with and without filtering
  2. Generated views like patches
  3. Grouped datasets with different slice selections

These tests directly align with the PR objective to update the _is_full_collection() method to recognize full views created by dataset.view() and full generated views like to_patches().

fiftyone/core/dataset.py (5)

719-734: Clear explanation of last_modified_at behavior

The updated docstring provides excellent clarity on when this property is incremented and when it is not. The distinction between dataset-level metadata changes and sample-level changes is well-explained, and the guidance on how to determine when samples were last added, edited, or deleted is helpful.


737-742: Good addition of last_deletion_at property

This new property provides a cleaner way to track sample deletions separately from other dataset modifications, which aligns with the PR's objective of improving timestamp semantics.


5038-5038: Appropriate update to use _update_last_deletion_at in _clear method

This change properly updates the new last_deletion_at timestamp rather than last_modified_at when samples are deleted, ensuring consistent behavior with the updated property semantics.


8299-8301: Well-implemented _update_last_deletion_at method

This method correctly updates the document's last_deletion_at timestamp and saves the changes. It follows the same pattern as the existing timestamp update methods.


8770-8770: Proper initialization of last_deletion_at during cloning

Setting last_deletion_at to None when cloning a dataset is appropriate as the new dataset hasn't had any samples deleted yet.

fiftyone/core/collections.py (6)

604-609: Documentation updated to reflect both modification and deletion time tracking.

The docstring has been updated to clarify that the sync_last_modified_at method now takes into account both modification and deletion timestamps of samples, which aligns with the introduction of the new last_deletion_at property.


668-677: Implementation updated to consider both modification and deletion timestamps.

The _sync_dataset_last_modified_at method now uses the new _none_max helper function to calculate the maximum between the dataset's last_deletion_at and the maximum sample last_modified_at, ensuring that both kinds of timestamps are properly considered when updating the dataset's last_modified_at.


8509-8531: Performance optimization added for the min method on datetime fields.

This is a nice optimization that allows the method to use a more efficient _min() implementation when operating on common datetime fields with no expression and when the collection represents the full dataset. This should improve performance for these common operations.

🧰 Tools
🪛 Ruff (0.8.2)

8529-8531: Do not assign a lambda expression, use a def

Rewrite make as a def

(E731)


8610-8632: Performance optimization added for the max method on datetime fields.

Similar to the optimization for min, this change improves performance when calculating the maximum value of datetime fields on full collections. The implementation correctly mirrors the min method's optimization.

🧰 Tools
🪛 Ruff (0.8.2)

8630-8632: Do not assign a lambda expression, use a def

Rewrite make as a def

(E731)


11070-11089: Enhanced _is_full_collection detection with support for full views.

The _is_full_collection method has been extended to recognize additional cases where a collection represents the full dataset, including empty-stage views. This improvement supports the optimizations in the min and max methods and aligns with the PR objective of better handling full views like those created by dataset.view().


12523-12529: Added utility function _none_max for handling None values in max operations.

This helper function safely computes the maximum of multiple arguments while ignoring None values, which is useful for the updated logic in _sync_dataset_last_modified_at. The implementation is concise and correctly handles the case when all arguments are None by returning the specified default.

@swheaton
Copy link
Contributor
swheaton commented May 6, 2025

For example, changing a dataset's description and creating a new saved view both increment Dataset.last_modified_at, but these have no effect on whether a saved to_patches() view needs to be reloaded.

Couldn't the same be said for insertion? @brimoor

@swheaton
Copy link
Contributor
swheaton commented May 6, 2025

Looks alright to me but wondering if we should also have last_insertion_at or last_addition_at

benjaminpkane
benjaminpkane previously approved these changes May 6, 2025
Copy link
Contributor
@swheaton swheaton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

requesting changes merely to get an answer to insertion question

@brimoor
Copy link
Contributor Author
brimoor commented May 6, 2025

Looks alright to me but wondering if we should also have last_insertion_at or last_addition_at

@swheaton good question. Here's my thinking:

  • Insertion sets both created_at and last_modified_at and both are indexed by default. Therefore:
  • max("created_at") is a performant way to check for last addition specifically
  • max("last_modified_at") is a performant way to check for last edit OR addition

Deletion is special because it does not reliably change the min/max statistic of created_at/last_modified_at.

swheaton
swheaton previously approved these changes May 6, 2025
Copy link
Contributor
@swheaton swheaton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got it thanks. looks good to me then

@brimoor brimoor dismissed stale reviews from swheaton and benjaminpkane via 6b0582a May 10, 2025 02:36
@brimoor brimoor merged commit 3f093da into release/v1.6.0 May 10, 2025
19 checks passed
@brimoor brimoor deleted the last-deletion-at branch May 10, 2025 04:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants
0