8000 Diff: Add rename detection by ben · Pull Request #573 · libgit2/libgit2sharp · GitHub
[go: up one dir, main page]

Skip to content

Diff: Add rename detection #573

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Dec 5, 2013
Merged

Conversation

ben
Copy link
Member
@ben ben commented Nov 22, 2013

Currently, if a file is renamed, Diff.Compare<TreeChanges>() will return two TreeEntryChanges structures; one with the original file deleted, and one with the new file added. That just ain't right.

  • Marshalling structures
  • Sane API
  • Tests
  • Add IncludeUnmodified functionality

@ben
Copy link
Member Author
ben commented Nov 22, 2013

What kind of a sane API should we be using here? Should we always do the mostest possible rename detection? It looks like there's a test that expects that. If so, there's no external surface for this; if not, should it just be public bool CompareOptions.DetectRenames?

@ben
Copy link
Member Author
ben commented Nov 22, 2013

Just pushed a version that always does basic rename detection, and it makes that old skipped test pass. Is this what we want?

@arrbee
Copy link
Member
arrbee commented Nov 22, 2013

I have sense of the aesthetic here, so I can't advise on what will feel good in this API, but I would advise against defaulting to "mostest possible rename detection" (i.e. with GIT_DIFF_FIND_ALL) since it can be extremely expensive to calculate as it considers a very broad range of possible rename / copy sources and targets.

The more modest GIT_DIFF_FIND_RENAMES that you've put in is pretty safe (albeit with some slow degenerate cases). You could reduce the impact even further by adding GIT_DIFF_FIND_EXACT_MATCH_ONLY which only notices renames where no changes were made to the file content at all, although that is of pretty limited usefulness.

@dahlbyk
Copy link
Member
dahlbyk commented Nov 22, 2013

Have a look at #278 for discussion on rename/copies API. Also, https://github.com/dahlbyk/libgit2sharp/compare/gh278

@ben
Copy link
Member Author
ben commented Nov 25, 2013

Just ported in the work from #278, including all the great tests.

Looks like there was a bit of debate as to how to expose all the rename-detection options. Right now there are just three booleans. It sort of seemed like a flags enum was the favored solution, with possibly a fluent API layered on top of it? Would something like this be okay?

[Flags]
public enum RenameDetectionFlags
{
    Renames = 1 << 0,
    Copies  = 1 << 1,
    CopiesHarder = 1 << 3,
    RenamesAndCopiesForUntrackedItems = 1 << 6,
    IgnoreWhitespace = 1 << 12,
    ConsiderWhitespace = 1 << 13,
    ExactMatchOnly = 1 << 14,
}

public RenameDetectionFlags Flags { get; set; }

@ben
Copy link
Member Author
ben commented Nov 25, 2013

Also: do we want to worry about all the break-rewrites-into-separate-adds-and-deletes flags? That adds a bit of complexity and quite a few tests, and I'm not sure of the use case for it.

@nulltoken
Copy link
Member

@dahlbyk was right. Again.
And I was wrong. Again 😝

Flags look like an inappropriate solution. We'd have to guard against invalid combinations.

@dahlbyk
Copy link
Member
dahlbyk commented Nov 25, 2013

I think @nulltoken was thinking more a non-flags enum:

public enum DetectionFlags
{
    None,
    Renames,
    RenamesAndCopies,
    RenamesAndCopiesHarder,
}

IgnoreWhitespace seems like it could be a separate bool (is it respected even if rename detection is off?). Same for ExactMatchOnly.

@dahlbyk
Copy link
Member
dahlbyk commented Nov 25, 2013

is it respected even if rename detection is off?

No:

/** measure similarity ignoring all whitespace */

@dahlbyk
Copy link
Member
dahlbyk commented Nov 25, 2013

In that case, it seems like maybe we should split out SimilarityWhitespaceHandling (better name?) as a separate enum.

git_diff_find_options has other thresholds that could reasonably be added to CompareOptions, or maybe we need a SimilarityOptions child of some sort?

@arrbee
Copy link
Member
arrbee commented Nov 25, 2013

In case there is any confusion regarding ignoring whitespace, there are three supported settings, not two: the default value (GIT_DIFF_FIND_IGNORE_LEADING_WHITESPACE) will ignore all CR and will also ignore any whitespace found immediately after an LF (so changing the indentation of a line will have no effect), the GIT_DIFF_FIND_IGNORE_WHITESPACE value will ignore all whitespace completely (even inside a string inside a line), and both GIT_DIFF_FIND_DONT_IGNORE_WHITESPACE and GIT_DIFF_FIND_EXACT_MATCH_ONLY will pay attention to any whitespace changes in the file.

Of course, GIT_DIFF_FIND_EXACT_MATCH_ONLY will only consider files to match if there are no changes at all, whereas the other values judge based on the portion of lines that match between the files.

@ben
Copy link
Member Author
ben 8000 commented Nov 25, 2013

Something like this, that Diff.Compare<> takes as an optional argument?

namespace LibGit2Sharp
{
    public sealed class SimilarityOptions
    {
        public enum RenameDetectionMode
        {
            // ...
        }

        public enum WhitespaceMode
        {
            // ...
        }

        public int RenameThreshold { get; set; }
        public int RenameFromRewriteThreshold { get; set; }
        public int CopyThreshold { get; set; }
        public int BreakRewriteThreshold { get; set; }
        public int RenameLimit { get; set; }

        // TODO: similarity metric
    }
}

@dahlbyk
Copy link
Member
dahlbyk commented Nov 26, 2013

Rather than an additional argument to Compare, I was thinking a type like that set as a property on CompareOptions.

@dahlbyk
Copy link
Member
dahlbyk commented Nov 26, 2013

For reference: git_diff_find_t

@arrbee I'm a little confused which of those options imply which others. For example, does GIT_DIFF_FIND_COPIES imply GIT_DIFF_FIND_RENAMES as --find-copies implies --find-renames?

For this first pass, I'd suggest we pass on including options related to finding/breaking rewrites.

@arrbee
Copy link
Member
arrbee commented Nov 26, 2013

I'm a little confused which of those options imply which others. For example, does GIT_DIFF_FIND_COPIES imply GIT_DIFF_FIND_RENAMES as --find-copies implies --find-renames?

@dahlbyk That's a good question and it should be answered more clearly in the docs. I

So, just reading from the C code, the current implemented rules are:

  • If you don't pass in an options structure:
    • If diff.renames is set to "copies" (or "copy"), we use GIT_DIFF_FIND_RENAMES | GIT_DIFF_FIND_COPIES
    • Otherwise we default to GIT_DIFF_FIND_RENAMES
  • If you pass in a options structure:
    • GIT_DIFF_FIND_RENAMES_FROM_REWRITES turns on GIT_DIFF_FIND_RENAMES
    • GIT_DIFF_FIND_COPIES_FROM_UNMODIFIED turns on GIT_DIFF_FIND_COPIES
    • GIT_DIFF_BREAK_REWRITES turns on GIT_DIFF_FIND_REWRITES
    • GIT_DIFF_FIND_EXACT_MATCH_ONLY turns off all REWRITE detection flags

Currently, that's it in the code. GIT_DIFF_FIND_COPIES does not turn on the GIT_DIFF_FIND_RENAMES flag right now. And there are no other rules to speak of.

However, reading over the code, it also appears to me that the GIT_DIFF_FIND_RENAMES flag is not actually checked and that calling git_diff_find_similar() will always find renames when it is invoked, regardless of whether that flag is set or not. That is obviously a bug, although it doesn't seem too urgent. If we fix that bug, however, we should probably implement that turning on any other flags should turn on GIT_DIFF_FIND_RENAMES I suspect.

Anyhow, feedback is welcome - this corner of functionality has not had much scrutiny, so this is really the first external consideration of whether these rules actually make sense or not.

@dahlbyk
Copy link
Member
dahlbyk commented Nov 26, 2013

If we fix that bug, however, we should probably implement that turning on any other flags should turn on GIT_DIFF_FIND_RENAMES I suspect.

That seems reasonable to me, in which case GIT_DIFF_FIND_COPIES would indeed imply GIT_DIFF_FIND_RENAMES.

So it looks to me like we're good with four rename modes:

  • None
  • Renames
  • Copies
  • CopiesHarder (document what "harder" means)

And to support None, until the libgit2 bug is fixed, we'll need to skip the git_diff_find_similar call altogether.

I had forgotten about the diff.renames setting - do we want to support that too? Maybe if SimilarityOptions is left as null we just call git_diff_find_similar without options and let it do the right thing?

GitDiffFindFlags.GIT_DIFF_FIND_COPIES |
GitDiffFindFlags.GIT_DIFF_FIND_COPIES_FROM_UNMODIFIED;
break;
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dahlbyk: I think this logic is correct? (There's an early return above.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displaye 8000 d to describe this comment to others. Learn more.

👍

@dahlbyk
Copy link
Member
dahlbyk commented Dec 1, 2013

❤️ this API

I wonder to what extent we should exercise the other options (whitespace, thresholds)? Perhaps just open an up-for-grabs issue that more extensive testing would not be a bad thing?

By my reading, there are now only a handful of settings not supported:

  1. Rewrites
    • GIT_DIFF_FIND_RENAMES_FROM_REWRITES
    • GIT_DIFF_FIND_REWRITES
    • GIT_DIFF_FIND_AND_BREAK_REWRITES (according to @arrbee, GIT_DIFF_BREAK_REWRITES implies GIT_DIFF_FIND_REWRITES)
    • GIT_DIFF_BREAK_REWRITES_FOR_RENAMES_ONLY
  2. GIT_DIFF_FIND_FOR_UNTRACKED
  3. GIT_DIFF_FIND_EXACT_MATCH_ONLY
  4. Custom similarity metric

string originalPath = Path.Combine(repo.Info.WorkingDirectory, "original.txt");
string renamedPath = Path.Combine(repo.Info.WorkingDirectory, "renamed.txt");

File.WriteAllText(originalPath, "a\nb\nc\nd\n");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we replace all the File.Writexxx() calls with Touch()?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@nulltoken
Copy link
Member

❤️ this API

I do agree!

However, once again Travis is playing the PITA game.

/home/travis/build/libgit2/libgit2sharp/CI-build.msbuild (Deploy) ->
(Test target) ->
    /home/travis/build/libgit2/libgit2sharp/CI-build.msbuild: error : LibGit2Sharp.Tests.DiffTreeToTreeFixture.CanDetectTheExactRenamingExactCopyingOfNonModifiedAndModifiedFilesWhenEnabled: Assert.Equal() Failure
Expected: 2
Actual:   1
    /home/travis/build/libgit2/libgit2sharp/CI-build.msbuild: error :   at LibGit2Sharp.Tests.DiffTreeToTreeFixture.CanDetectTheExactRenamingExactCopyingOfNonModifiedAndModifiedFilesWhenEnabled () [0x00000] in <filename unknown>:0 
  at (wrapper managed-to-native) System.Reflection.MonoMethod:InternalInvoke (System.Reflection.MonoMethod,object,object[],System.Exception&)
  at System.Reflection.MonoMethod.Invoke (System.Object obj, BindingFlags invokeAttr, System.Reflection.Binder binder, System.Object[] parameters, System.Globalization.CultureInfo culture) [0x00000] in <filename unknown>:0 
    /home/travis/build/libgit2/libgit2sharp/CI-build.msbuild: error : LibGit2Sharp.Tests.DiffTreeToTreeFixture.CanDetectTheExactCopyingOfNonModifiedFilesWhenEnabled: Assert.Equal() Failure
Expected: 1
Actual:   0
    /home/travis/build/libgit2/libgit2sharp/CI-build.msbuild: error :   at LibGit2Sharp.Tests.DiffTreeToTreeFixture.CanDetectTheExactCopyingOfNonModifiedFilesWhenEnabled () [0x00000] in <filename unknown>:0 
  at (wrapper managed-to-native) System.Reflection.MonoMethod:InternalInvoke (System.Reflection.MonoMethod,object,object[],System.Exception&)
  at System.Reflection.MonoMethod.Invoke (System.Object obj, BindingFlags invokeAttr, System.Reflection.Binder binder, System.Object[] parameters, System.Globalization.CultureInfo culture) [0x00000] in <filename unknown>:0 

I wonder if relying on a Scottish post-britpop band as a CI server was such a good idea... They seem very picky 😉

travis

@ben
Copy link
Member Author
ben commented Dec 4, 2013

One might expect Default would be the user's preference (diff.renames), in which case SimilarityOptions.Default should actually return null.

Good point. I guess the question here is how diff.renames will be handled by libgit2. My feeling is that, if/when we do end up supporting it, it'll act as an override. So if diff.renames is set to copy, passing GIT_DIFF_FIND_RENAMES will actually be like passing GIT_DIFF_FIND_COPIES.

Or we'll just ignore it. diff.renames can be thought of as a tweak to the command-line behavior of git, and it may not be desirable for a script running through libgit2.

I think you're right about the naming, though; I'll roll that change back for clarity.

should the default tests you added have variations based on setting diff.renames values?

This is basically the case now. CanNotDetectTheExactRenamingFilesWhenNotEnabled and DetectsTheExactRenamingOfFilesByDefault are the same test with different SimilarityOptions settings, and different assertions. Would you want to see different naming for these?

@dahlbyk
Copy link
Member
dahlbyk commented Dec 4, 2013

Or we'll just ignore it. diff.renames can be thought of as a tweak to the command-line behavior of git, and it may not be desirable for a script running through libgit2.

I'd tend to go the other direction - if a consumer of this library cares how renames/copies are handled, they can specify SimilarityOptions. Otherwise (by leaving SimilarityOptions == null), it will respect diff.renames. Or we could reintroduce RenameDetectionMode.Default (in addition to Renames!) that results in the same behavior as not passing SimilarityOptions at all, to support the scenario that thresholds need to be adjusted but within the user's rename preference (e.g. git diff -B70%).

At least, I thought that was the direction we were headed - as @arrbee describes the current implementation, the only missing piece is switching libgit2 to not detect renames if options is null and diff.renames is false (or unset?).

So that's what was rattling around in my head when I asked about the default tests - I was thinking of them as tests to confirm diff.renames handling (via unspecified compareOptions), apart from tests for the non-default settings.

@ben
Copy link
Member Author
ben commented Dec 4, 2013

Or we could reintroduce RenameDetectionMode.Default (in addition to Renames!) that results in the same behavior as not passing SimilarityOptions at all, to support the scenario that thresholds need to be adjusted but within the user's rename preference (e.g. git diff -B70%).

This would take a change at the libgit2 level; you can't pass NULL and a structure that defines the thresholds. I'll add it to my todo list, and leave a placeholder test in this code that we can come back to later. This last push introduces RenameDetectionMode.Default and SimilarityOptions.Default to support that case.

@dahlbyk
Copy link
Member
dahlbyk commented Dec 4, 2013

This would take a change at the libgit2 level; you can't pass NULL and a structure that defines the thresholds.

Good point. Certainly not a reason to hold this up, just an edge case I wanted to mention. Thanks for your patience with me.

Since the "no similarity detection" case is handled simply enough by not calling git_diff_find_similar, may I propose that passing flags without any of GIT_DIFF_FIND_RENAMES | GIT_DIFF_FIND_COPIES | GIT_DIFF_FIND_COPIES_FROM_UNMODIFIED set would defer to the config setting?

/// <summary>
/// Obey the user's `diff.renames` configuration setting
/// </summary>
Default,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is also literally our default behavior, can we list this option first so it's default(RenameDetectionMode)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@dahlbyk
Copy link
Member
dahlbyk commented Dec 4, 2013

I believe @nulltoken had commented on replacing this with a no-args ctor that initializes the values.

Bump

Does Exact mode also detect exact copies (uncommon, but possible)? If so, it seems to make more sense as a separate bool rather than a mode. Wherever it ends up, you might note that using that flag is equivalent to setting the corresponding threshold(s) to 100.

Bump

@arrbee
Copy link
Member
arrbee commented Dec 4, 2013

Good point. I guess the question here is how diff.renames will be handled by libgit2. My feeling is that, if/when we do end up supporting it, it'll act as an override. So if diff.renames is set to copy, passing GIT_DIFF_FIND_RENAMES will actually be like passing GIT_DIFF_FIND_COPIES.

That is not how it is implemented currently.

I'd tend to go the other direction - if a consumer of this library cares how renames/copies are handled, they can specify SimilarityOptions. Otherwise (by leaving SimilarityOptions == null), it will respect diff.renames.

That is how it is implemented currently.

If you pass NULL for the options structure to git_diff_find_similar then it uses diff.renames to decide what to do. If you do not pass NULL, then you are providing an explicit setting for what to do.

I feel pretty strongly that an explicitly chosen value in the options structure should take precedence over any setting in the config file, otherwise it is too difficult to override the config (just as an explicit command-line option should override a config value).

Reading over this discussion, I see that the current implementation of diff.renames is flawed. In particular, only the values copy / copies are checked for, not the boolean baseline setting. It should be a three-state check (unset or false, set and true, set and "copy"/"copies") but it is only unset vs. the last.

Additionally, if you pass in git_diff_find_options with zero options set for finding things (i.e. (flags & GIT_DIFF_FIND_ALL) == 0), I think it probably also makes sense to fall back to the diff.renames config setting in that case as well. If we were to make that change, then I'm tempted to make the GIT_DIFF_FIND_OPTIONS_INIT structure initializer include GIT_DIFF_FIND_RENAMES by default and require you to F438 explicitly clear it if you want to fall back on diff.renames, but I'd love to hear your opinions on this!

@ben
Copy link
Member Author
ben commented Dec 4, 2013

I believe @nulltoken had commented on replacing this with a no-args ctor that initializes the values.

Bump

This is done.

Does Exact mode also detect exact copies (uncommon, but possible)? If so, it seems to make more sense as a separate bool rather than a mode. Wherever it ends up, you might note that using that flag is equivalent to setting the corresponding threshold(s) to 100.

Bump

I just spiked out a test to check: no, it doesn't. I think it works by looking at add/remove pairs and seeing if their SHAs match. Do you think adding an ExactModeDoesntDetectExactCopies test would be valuable?

@arrbee
Copy link
Member
arrbee commented Dec 4, 2013

Does Exact mode also detect exact copies (uncommon, but possible)? If so, it seems to make more sense as a separate bool rather than a mode. Wherever it ends up, you might note that using that flag is equivalent to setting the corresponding threshold(s) to 100.

Bump

I just spiked out a test to check: no, it doesn't. I think it works by looking at add/remove pairs and seeing if their SHAs match. Do you think adding an ExactModeDoesntDetectExactCopies test would be valuable?

Are you sure? I just modified the existing test_diff_rename__match_oid test from the libgit2 suite to add in the GIT_DIFF_FIND_EXACT_MATCH_ONLY flag to the final call to git_diff_find_similar and it does find a COPIED file as well as a RENAMED file. The copy detection logic should still run if you request it, it just requires the that target file have the exact same SHA as one of the valid source files (i.e. you may need to do the whole UNMODIFIED dance to make sure you have the right sources available). Is there a bug that I'm not finding?

@ben
Copy link
Member Author
ben commented Dec 4, 2013

@arrbee: Let me see if I've got the design scenarios right:

  1. Don't do any rename detection.
  2. Specify exactly what should be done.
  3. Do what the user has configured with diff.renames, with default thresholds.
  4. Same as 3, but with custom thresholds.

(Aside: am I missing anything from that list?)

(1) is accomplished with RenameDetectionMode.None, and results in not calling git_diff_find_similar at all. (2) is available through SimilarityOptions. (3) we do with RenameDetectionMode.Default by by passing NULL to git_diff_find_similar. (4) is I think impossible right now.

I'm proposing adding a GIT_DIFF_FIND_DEFAULT value that causes git_diff_find_similar to use the diff.renames, but override the default thresholds with the values passed with the structure (unless they're zero). Rereading your proposal, it sounds a lot like this, with #define GIT_DIFF_FIND_DEFAULT 0. Does that sound right?

@dahlbyk
Copy link
Member
dahlbyk commented Dec 4, 2013

This is done.

So it is. Reading are hard.

Do you think adding an ExactModeDoesntDetectExactCopies test would be valuable?

Wouldn't hurt.

@dahlbyk
Copy link
Member
dahlbyk commented Dec 4, 2013

Rereading your proposal, it sounds a lot like this, with #define GIT_DIFF_FIND_DEFAULT 0. Does that sound right?

That's my reading, and I think it makes sense.

@arrbee
Copy link
Member
arrbee commented Dec 4, 2013

@ben Your four cases sound right to me.

I'm proposing adding a GIT_DIFF_FIND_DEFAULT value that causes git_diff_find_similar to use the diff.renames, but override the default thresholds with the values passed with the structure (unless they're zero). Rereading your proposal, it sounds a lot like this, with #define GIT_DIFF_FIND_DEFAULT 0. Does that sound right?

Yes, I agree. I also intended to ask whether GIT_DIFF_FIND_DEFAULT should be the default (when an options struct is passed in) or whether GIT_DIFF_FIND_RENAMES should be the default if given other options. My thinking there is that if you are calling git_diff_find_similar() and setting a bunch of threshold values, it will be a little confusing for it to default to doing nothing if diff.renames is set to false. We could try GIT_DIFF_FIND_BY_CONFIG if we went down this route to avoid confusion (and document that passing a NULL options struct fell back to GIT_DIFF_FIND_BY_CONFIG).

Regardless of what names we end up with for base libgit2, I certainly think the libgit2sharp names (where rename detection is more naturally integrated with the diff creation process) should be independent and natural (as I think you've achieved here).

@ben
Copy link
Member Author
ben commented Dec 4, 2013

The copy detection logic should still run if you request it, it just requires the that target file have the exact same SHA as one of the valid source files (i.e. you may need to do the whole UNMODIFIED dance to make sure you have the right sources available).

Hwhoops, you're right! I should be setting all the rename and copy flags along with the exact flag. This last commit does exactly that.

@dahlbyk
Copy link
Member
dahlbyk commented Dec 4, 2013

My thinking there is that if you are calling git_diff_find_similar() and setting a bunch of threshold values, it will be a little confusing for it to default to doing nothing if diff.renames is set to false.

Isn't it unlikely that one would set thresholds without also noticing flags and checking which ones are desired?

Allow exact matching to find copies as well

👍 Doesn't seem like it would be much more expensive to check for exact renames and copies, so might as well simplify the API and always check both. Good call.

@arrbee
Copy link
Member
arrbee commented Dec 4, 2013

My thinking there is that if you are calling git_diff_find_similar() and setting a bunch of threshold values, it will be a little confusing for it to default to doing nothing if diff.renames is set to false.

Isn't it unlikely that one would set thresholds without also noticing flags and checking which ones are desired?

That makes sense to me.

@ben
Copy link
Member Author
ben commented Dec 4, 2013

Alright, let's move the native-level discussion over to libgit2/libgit2; I'll spin up a PR. Is there anything more I should do with the C# layer here?

@dahlbyk
Copy link
Member
dahlbyk commented Dec 4, 2013

Is there anything more I should do with the C# layer here?

Looks great to me! :shipit:


// TODO: custom metric
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:trollface: ^^

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😒

@nulltoken
Copy link
Member

🔨 it 🙏?

@ben
Copy link
Member Author
ben commented Dec 5, 2013

🔨'd.

@nulltoken nulltoken merged commit f5d9d0e into libgit2:vNext Dec 5, 2013
@nulltoken
Copy link
Member

A. W. E. S. O. M. E.

@ben ben deleted the diff-rename-detection branch December 6, 2013 03:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants
0