Canonizing diff-of-the-diffs

My question, in short, is the following. (I will explain the rationale later; it's not very relevant to the question itself.) Sometimes git-diff produces different diffs for the same changeset. It happens, for instance, when a large chunk of code was replaced with a wholly different large chunk of code. Such a change could be expressed in diff terms as "add 2 lines, remove 4 lines, add 5 lines, etc." or as "remove 4 lines, add 7 lines, etc." I would like the result to be consistent.
The question is how to make the diff canonical, i.e. how to make the same change always yield the same diff. Off the top of my head: make the diff "greedy" or something.
The reason it annoys me so much is the following. I make heavy use of a technique we call "diff-of-the-diffs". (Bear with me if I'm explaining trivial things.) After I resolve git rebase conflicts, I produce the unified diff of the old change (i.e. this specific commit in the branch before the rebase) and of the new change (i.e. the current index against HEAD), and then diff these diffs against each other. Such a diff-of-the-diffs is an excellent sanity check for conflict resolution: you can see right away that a specific conflict arose only because of a context change, etc.
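For readers unfamiliar with the technique, here is a minimal sketch of the diff-of-the-diffs step in Python. It does not make git's choice of hunks canonical; it only compares two unified diffs after dropping noise that always changes across a rebase. The function names and the choice of what counts as noise (the "index" lines and the line numbers in hunk headers) are assumptions made for illustration.

```python
import difflib
import re
from typing import List

# Matches the line-number part of a unified-diff hunk header, e.g. "@@ -10,7 +12,9 @@".
HUNK_HEADER = re.compile(r"^@@ -\d+(?:,\d+)? \+\d+(?:,\d+)? @@")


def normalize(diff_text: str) -> List[str]:
    """Drop parts of a unified diff that differ after a rebase even when the change is the same."""
    lines = []
    for line in diff_text.splitlines():
        if line.startswith("index "):
            continue  # blob hashes differ once the base has moved
        lines.append(HUNK_HEADER.sub("@@", line))  # keep the marker, drop the offsets
    return lines


def diff_of_diffs(old_diff: str, new_diff: str) -> List[str]:
    """Return a unified diff between two (normalized) unified diffs."""
    return list(
        difflib.unified_diff(
            normalize(old_diff), normalize(new_diff),
            "old.diff", "new.diff", lineterm="",
        )
    )
```

An empty result means the two changes are identical apart from position; any remaining lines show where the context or the change itself differs.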

Related

Why/How can a revision graph like this be produced in perforce?

I have this revision graph from perforce and I am baffled as to how we managed to produce it. The part I find most puzzling is the part with the red square around it, where revision 6 of row 2 is merged twice. The first time, revision 6 is merged into revision 8 together with revision 11; I have never seen two revisions from separate branches merged at the same time. The second time, revision 6 is merged like a normal merge, but this file was renamed some time ago, so I don't know why the delete would suddenly get merged again. Can anyone explain the scenarios under which this can happen?
The top two rows are our trunk: the first row is our current trunk, and the second row is the file before it was renamed. The bottom row is a branch; we have many branches with the same strange merges as this one.

A lot depends on how revision #6 was created. If you select revision 6, and open the "integrations" tab in the details pane, you can click on the source(s) of the integration(s) to reveal them in the graph.
A delete on top of a delete suggests that this file was re-added (under the old name) in some other branch, and then something unusual happened with that file that entailed the creation of an extra revision on the trunk for record-keeping purposes. The relationship between that revision and revision #12 of the "new" file may be significant -- it looks like there are effectively 5 different file variants sharing 2 different names in this history graph, and maybe there are additional variants that we can't see here.
You may also be able to get some information by clicking on the specific revision you're curious about (#10 on the bottom?) and reading the changelist description.
In general, a merge operation will propagate whatever has happened in the source that has not yet happened in the target. It does this by setting up a resolve, and in some cases the choices made during a resolve operation may negate some part of the changes made in the source and target (although the default option, where possible, will be to preserve and combine both). So the final outcome of a merge depends on its inputs, and what happened during the resolve, and the inputs in turn may be influenced by merges from elsewhere (and resolve choices made as a consequence of those merges), and so on.
Note also that not every arrow you see in a revision graph is the result of a merge -- copy has its own semantics, and a forced integrate via p4 integ -f has its own as well. (This is where reading the changelist description is useful -- hopefully the author of the change will have left some record there as to what their intent was and anything unusual they might have done, which can shortcut a lot of tedious forensic work.)

Measuring distance of values / strings

Martin Fowler was discussing event sourcing
https://martinfowler.com/eaaDev/EventSourcing.html
i.e. storing data as a series of events.
Now an example would be an account. You create an account with balance 0.
Then you put 10$. You withdraw 5$. You put another 100$. Now the balance is 105$, but you don't store 105$. What you do store is
+10
-5
+100
as a series of events in the database.
Now if I want I can say "undo the last 2 steps." then I just remove the 2 last changes in the database -> account is 10
Now: how can you do that with strings?
Say the account name is first the empty string. Then it becomes
"dirk dietmeier", then "hans hansenmann", then "foo bar". How can you capture this data as a set of changes while keeping it reversible, i.e. so that each event is able to reverse itself? You could just say "delete everything and then put foo bar", but is there no better solution?
Is there something like an svn or git algorithm for this? Some encoding (hex, binary?)?

"Now if I want I can say 'undo the last 2 steps.' then I just remove the 2 last changes in the database -> account is 10"
Not if you want to preserve the history. In production event-sourced applications, I would issue a compensating event, e.g. a new event Y that undoes what event X did. The git analogue of this would be git revert.
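As a rough sketch of what a compensating event looks like for the account example from the question (the function names are made up for illustration, and a real event store would carry more than bare amounts):

```python
from typing import List


def balance(events: List[int]) -> int:
    """The current state is derived by replaying every event."""
    return sum(events)


def compensate_last(events: List[int], n: int) -> List[int]:
    """Undo the last n events by appending their inverses instead of deleting them."""
    if not 0 <= n <= len(events):
        raise ValueError("n must be between 0 and the number of events")
    tail = events[len(events) - n:]
    # Undo the newest event first, as a stack of reverts would.
    return events + [-amount for amount in reversed(tail)]


log = [+10, -5, +100]                # balance is 105
undone = compensate_last(log, 2)     # [10, -5, 100, -100, 5], balance is 10
```

The balance ends up at 10 as in the question, but the log still records that the two later events happened and were then reversed.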
"Now: how can you do that with strings?"
It depends on your application.
1. If you are tracking changes to code, it makes sense to do some research on how to express differences between two files, such that you can revert at a later time. In this sense, your event would be similar to a git commit. I suggest you look at the Linux diff command (http://linuxcommand.org/man_pages/diff1.html) and at its source code, or at how you could implement it yourself.
2. If your event is something like CustomerFirstNameChanged, doing a diff makes very little sense. You would always want to revert to a previous state such as John or Rick.
Number 2 would also make sense with an event such as ArticleRedrafted, where you can go back to a previous version. Content editors don't see revisions the way we see git commits when we use git revert. They see them as points in time that can be returned to.

There are many ways you could represent the change from one string to another string as a reversible operation. It really would depend on requirements.
I would not expect the deltification in a source control system to necessarily meet the general needs you've outlined above. For example, git's purpose in deltifying files for storage (which it only does when generating a pack) is to save space. Older versions of an object may be stored as deltas from newer versions, but there is never a need to reconstruct the newer version via delta from the older version - it's stored in its entirety specifically so that it can be accessed quickly (without need to combine deltas). So storing enough information for the delta to be used as a two-way transformation would be wasteful from git's point of view.
Instead of looking at the pack deltas, you could think about how diff represents a change. This is more line-oriented than you seem to want. If you have an edited file, you might represent the edits as
d37
- Line 37 used to contain this text, but it was removed
a42
+ We added this text on line 42
c99
- Line 99 used to have this text
+ Line 99 now has this text
You could do a character-oriented version of this. Say you define operators for "add characters at offset", "remove characters at offset", and "change characters at offset". You either have to be careful with escaping delimiters, or use explicit lengths instead of simply delimiting strings. And you also should consider that each edit might change the offset of the subsequent edits. Technically pre-image offsets have all the information you need, but reversal of the patch is more intuitive if you also store the post-image offsets.
So given
well, here is the original string
you might have a reversible delta like
0/0d6:well, 14/8c12:the original1:a33/16a11: with edits
to yield
here is a string with edits
Of course, calculating the "most correct" patch may not be easy.
0/0c33:well, here is the original string27:here is a string with edits
would be equally valid but degenerates to the line-oriented approach.
But that's just one example; like I said, you could define any number of ways, depending on requirements.
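To make the character-oriented delta above concrete, here is a minimal Python sketch. It keeps the same information as the notation in the example (pre-image offset, post-image offset, removed text, added text) but holds it in a tuple instead of parsing the text format; the Edit type and the function names are assumptions for illustration only.

```python
from typing import List, NamedTuple


class Edit(NamedTuple):
    pre: int      # offset in the original (pre-image) string
    post: int     # offset in the edited (post-image) string
    removed: str  # text removed at `pre` ("" for a pure add)
    added: str    # text inserted at `post` ("" for a pure delete)


def apply_delta(text: str, edits: List[Edit]) -> str:
    """Apply edits, which must be sorted by pre-image offset and must not overlap."""
    out = []
    pos = 0
    for e in edits:
        if text[e.pre:e.pre + len(e.removed)] != e.removed:
            raise ValueError(f"delta does not match text at offset {e.pre}")
        out.append(text[pos:e.pre])  # copy the unchanged run before this edit
        out.append(e.added)
        pos = e.pre + len(e.removed)
    out.append(text[pos:])
    return "".join(out)


def invert(edits: List[Edit]) -> List[Edit]:
    """Swap pre/post offsets and removed/added text to get the reverse delta."""
    return [Edit(e.post, e.pre, e.added, e.removed) for e in edits]


# The three edits from the example: 0/0d6, 14/8c12...1, 33/16a11.
delta = [
    Edit(0, 0, "well, ", ""),
    Edit(14, 8, "the original", "a"),
    Edit(33, 16, "", " with edits"),
]
```

Storing the post-image offsets is what makes invert a simple field swap; with only pre-image offsets, the reverse delta would have to recompute each offset from the lengths of the preceding edits.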

Why does "p4 integrate" operate on more files than "p4 interchanges" lists?

I maintain a family of branches, .../base/..., .../base-staging/..., .../base-production/.... Changes are usually made in base, reviewed, and then integrated out to base-staging and eventually base-production. Before an integration, I typically do a p4 interchanges to confirm that only expected changes will be carried forward.
Usually this works. But sometimes p4 integrate pulls over files and changes not listed by p4 interchanges. By my understanding, that shouldn't happen! What am I misunderstanding?
Details:
Ancient history:
create .../base/...
p4 integrate base/... base-staging/...
p4 integrate base/... base-production/...
make changes in .../base/... and submit
p4 interchanges base/... base-staging/...
review and approve
p4 integrate base/... base-staging/...
whoah! what are all these other listed files???
review: real changes, made well before my interchanges command, but not mentioned therein

I'm going to assume we're talking about a relatively recent server release -- on older releases "interchanges" will indeed match exactly what "integrate" lists, but in newer releases they diverge a bit in order to address the subtly different common use cases of the two commands.
One of the functions of "integrate" is to record history that ensures that future integrates behave in a predictable fashion. In some edge cases that means opening a file for integrate even though no actual changes need to be propagated to the target, so that it can be recorded that the file is up to date. A very simple example: make a change in A, ignore from A to B, merge from B to C, now merge A to C. Your expectation will typically be that integration is "transitive" such that if A->B reports nothing to do and B->C reports nothing to do then A->C will report nothing to do; if the B->C merge skips the change because it was an "ignore", though, then the A->C merge has no record that the B->C merge was ever even attempted, and it will propagate the change, violating the expectation of transitivity.
"Interchanges", however, does not record any history, and it is usually desirable for it to limit its scope to report only source changes that are not present in the target. It's therefore tuned to produce a smaller set of changes than you might get by running "integrate". The changes that are excluded from "interchanges" but included by "integrate" are typically going to be ones that do not strictly require merging, but that will probably make it easier to merge the "required" changes if they're included as well.
Using the "-Rs" flag on "integrate" will cause it to mimic the "interchanges" behavior and select the absolute minimum set of source revisions. It's briefly described in this old blog post from when it was introduced:
http://www.perforce.com/blog/101206/p4-integ-3-exciting-things-afoot

How to merge two checked out branches of Subversion ( Offline Merge )?

I am using Subversion with rabbit-vcs on Linux.
Under merge, it only shows the option to browse my branches at an online svn URL.
There is no option to give an offline svn folder as a branch.
Since I am pretty new to Subversion: is it actually possible to merge 2 branches offline in svn?
I have two branches already checked out:
/home/user/branch1
/home/user/trunk

First of all, read this. Better yet, read this as well. Arguably, understanding merging is the most important part of knowing how to use SVN correctly (for one, you'll think thousands of times before creating a new branch :) ).
Note that you merge two committed sources into a working copy. That is, even if you specify one of the sources as a working copy it will still take its URL for merge purposes. So this is sort of syntactic sugar that a client may or may not support. The reason for it is that the merge operation needs to identify the common ancestor of the sources and merge them change by change. That information is not present in a working copy.
Note a source of possible confusion here: in many (most?) instances the working copy argument may specify both a source to be merged and the working copy to merge into.
Here's an example of what I mean: suppose you merge S1 and S2 into W. S1 and W contain file F; S2 does not. Now, there are at least two possibilities: (1) the common ancestor S of S1 and S2 contained the file and it was deleted in S2, in which case the merge should delete it from W; (2) S did not contain F and it was added in S1, in which case F should remain in W. The information about S is simply not present locally, so the repository has to be contacted.
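The two cases can be written down as a tiny three-way decision in Python. This is only an illustration of the reasoning above, not how Subversion implements merging, and the function name is made up:

```python
def file_present_after_merge(in_ancestor: bool, in_s1: bool, in_s2: bool) -> bool:
    """Three-way decision on whether file F should exist after merging S1 and S2."""
    if in_s1 == in_s2:
        return in_s1  # both sources agree, nothing to decide
    # The sources disagree: the side that differs from the ancestor made the change.
    return in_s1 if in_s1 != in_ancestor else in_s2


# Case (1): S had F and S2 deleted it, so F is deleted from W.
case_deleted = file_present_after_merge(in_ancestor=True, in_s1=True, in_s2=False)
# Case (2): S did not have F and S1 added it, so F stays in W.
case_added = file_present_after_merge(in_ancestor=False, in_s1=True, in_s2=False)
```

The inputs in_s1 and in_s2 are identical in both calls; only in_ancestor differs, which is exactly the piece of information a working copy does not have.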
To find out exactly which branch URLs your offline working copies come from, run svn info in branch1 and trunk.

Does perforce track deltas unique to a changeset or does it just store the whole file?

I tried to merge some work that a developer did in a working branch into a stable branch. The files a, b, and c had been changed by at least a dozen changesets since the STABLE and HEAD branches were separated from their common ancestor.
I expected that, since this developer changed five lines in each of files a, b, and c, when I integrated from the HEAD to the STABLE branch I would get his changes in my pending changeset, which I could then review and commit.
Instead, it seems to have taken every change that happened to file a since the two were branched, and applied all of those changes that also existed in my colleague's working copy.
In other words, there seems to be no record in a perforce changeset, of what my colleague actually changed, versus what the file before contained.
If I browse the submitted changesets, I can see the difference between my colleague's version of the file, and the immediately preceding version. But then, that does not, it seems, determine what goes into the merge.
Doesn't a changeset mean, "a set of changes made between rev X and revision X+1 of a file"?
Can anybody help me understand what it means to "integrate a changeset" when in fact it seems that Perforce doesn't track changes, it tracks files?
It is entirely possible that I am doing everything wrong, and I would appreciate any pointers on how you can merge accurately and safely between Perforce working branches and stable branches, without stuff that you don't want in the stable branch getting integrated. It seems that no matter how simple the changes actually made in the product are, the merge does not work for me.

Perforce does save changes to text files as deltas (binary files get saved in their entirety every time a change is submitted). It sounds like you're not properly restricting the revision range during your integration.
You say the working branch has "...been changed by at least a dozen changesets since the ...branches were separated." Let's call them changelists 1-12. If I understand you correctly you are trying to integrate the modifications made in just one of those changelists, not all of them.
During a simple integration operation Perforce will assume you want to integrate all of the changes that have been submitted since the branch was made. If you only want a subset of these changes, you have to specify a revision range. So, if you just want to integrate the changes that occurred between changelist 11 and 12, you would specify that revision range as shown in the screen capture. (Note: the revision range is inclusive, so specifying a range of 11-12, as I do in this screen shot will actually include changes in changelists 11 and 12. If you just want to integrate the changes made in changelist 12, enter 12 in both fields of the revision range.)
Just be aware that the inevitable conflicts that arise may be difficult to resolve, depending upon how far the branches have diverged and the nature of the changes.

Could you be more specific on how you did the integration? My guess is that you probably have integrated all the changes up to that changelist instead of just that changelist only. If so all you need to do is to specify the same changelist as both the upper and lower limit of the integration.
It's very easy to do in the visual client, but I'm not sure of the exact command line switch you need to use.
