How to "GC" or "strip" hidden evolution changesets in Hg?

Hg has a newish Changeset Evolution feature and a related Evolve extension.
This is pretty cool because many 'rewrite' operations are now moved into the DAG (like Git) - no more patch/linearization with MQ or shelving required! It also avoids painfully-slow-on-large-repository individual strips previously required for amend, rebase, histedit, etc.
However, after a period of time and many local rewrites there can accumulate a "significant number" of changesets that become hidden / tagged with obsolescence.
It is that time, and there are many changesets it would be nice to no longer have (at all).
Is there a good/approved method to strip hidden/obsolesced changesets from the local Hg repository?
The 'comparable' operation in Git would be a GC which prunes orphaned commits.
I'd prefer not to re-clone the repository. In addition the hidden commits have (thankfully) not been pushed/published.

A simple way to safely get rid of obsolete changesets (well, as safe as hg strip can be) is to use the extinct() revset, i.e.:
hg strip --hidden -r "extinct()"
Extinct changesets are those that are obsolete and also only have obsolete descendants (i.e. no live changesets that still depend on them).
Note that unless disk space becomes scarce, there should be no need to get rid of those changesets.
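To make the definition of "extinct" concrete, here is a minimal Python sketch on a toy graph. It only models the rule quoted above (obsolete, and no live descendants); the changeset names and the graph are made up, and this is not how Mercurial implements the revset.

```python
# Toy model of the extinct() revset: obsolete changesets whose
# descendants are all obsolete too. The graph and names are made up.

def descendants(children, node):
    """Return every changeset reachable from `node` via child links."""
    seen = set()
    stack = list(children.get(node, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(children.get(current, ()))
    return seen


def extinct(children, obsolete):
    """Obsolete changesets with no live (non-obsolete) descendant."""
    return {
        node
        for node in obsolete
        if descendants(children, node) <= obsolete
    }


# a -> b -> c   (b and c were rewritten, so both are obsolete)
# a -> d -> e   (d is obsolete, but live changeset e still sits on it)
children = {"a": ["b", "d"], "b": ["c"], "d": ["e"]}
obsolete = {"b", "c", "d"}

print(sorted(extinct(children, obsolete)))  # ['b', 'c']
```

In this toy graph d is obsolete but not extinct, which is why stripping "extinct()" would leave it (and the live e on top of it) alone.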

Related

Is it possible to add the SHA of my current commit to the core file pattern?

I'm looking to add the git sha to the core file pattern so I know exactly which commit was used to generate the core file.
Is there a way to do this?
It's not clear to me what you mean by "the core file pattern". (In particular, when a process crashes and the Linux kernel generates a core dump, it uses kernel.core_pattern. This setting is system-wide, not per-process. There is a way to run an auxiliary program—see How to change core pattern only for a particular application?—but that only gets you so far; you still have to write that program. See also https://wiki.ubuntu.com/Apport.) But there is a general problem here, which has some hacky solutions, all of which are variants on a pretty obvious method that is still a little bit clever.
The general problem
The hash of the commit you are about to make is not known until after you have made it. Worse, even if you can compute the hash of the commit you are about to make—which you can, it's just difficult—if you then change the content of some committed file that will go into the commit, so as to include this hash, you change the content of the commit you do make which means that you get a different actual commit hash.
In short, it is impossible to commit the commit hash of the commit inside the commit.
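A small Python sketch of why this cannot work, assuming a SHA-1 repository. It hashes a blob rather than a whole commit to keep things short, but the commit hash depends on the tree, which depends on the blobs, so the same thing happens one level up. The file content here is a placeholder.

```python
import hashlib


def git_object_id(kind, body):
    """SHA-1 over '<kind> <size>\\0<body>', the way Git names objects."""
    header = f"{kind} {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


# Stand-in for the data being committed (placeholder content).
original = b"version = UNKNOWN\n"
first_id = git_object_id("blob", original)

# Write that ID into the content, then hash again.
stamped = original.replace(b"UNKNOWN", first_id.encode())
second_id = git_object_id("blob", stamped)

print(first_id)
print(second_id)
print(first_id == second_id)  # False: the embedded ID is already stale
```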
The hacky solution
The general idea is to write an untracked file that you use in your build process, so that the binary contains the commit hash somewhere easily found. For projects built with Make, see how to include git commit-number into a c++ executable? for some methods.
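As a rough sketch of that idea in Python: a pre-build step asks Git for the current commit and writes it into a generated header. The file name build_commit.h and the macro BUILD_COMMIT are hypothetical, and the fallback to "unknown" is my own assumption about what you would want outside a repository.

```python
import subprocess


def current_commit(repo_dir="."):
    """Ask Git to describe HEAD; fall back to 'unknown' outside a repo."""
    try:
        result = subprocess.run(
            ["git", "describe", "--always", "--dirty"],
            cwd=repo_dir,
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return "unknown"
    return result.stdout.strip()


def render_header(commit):
    """Build the text of a C header that bakes the commit into the binary."""
    return (
        "/* Generated at build time; keep this file untracked. */\n"
        f'#define BUILD_COMMIT "{commit}"\n'
    )


if __name__ == "__main__":
    # build_commit.h is a hypothetical file name; list it in .gitignore.
    with open("build_commit.h", "w") as handle:
        handle.write(render_header(current_commit()))
```

Because the header is generated and untracked, writing it never changes the commit it describes. A crash handler or a strings search on the binary can then recover the value.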
The same kind of approach can be used when building tarballs. Git has the ability to embed the hash ID of a file (blob object) inside a work-tree file, using the ident filter, but this is the ID of the file, which is usually not useful. So, instead, if you use git archive to produce tar or zip files, you can use export-subst, as described in the gitattributes documentation and referred-to in the git archive documentation. Note that the tar or zip archive also holds the commit hash ID directly.
Last, you can write your own custom smudge filter that embeds a commit hash ID into a work-tree file. This might be useful in languages where there is no equivalent of an external make process run to produce the binary. The problem here is that when the smudge filter reads HEAD, it's set to the value before the git checkout finishes, rather than the value after it finishes. This makes it much too difficult to extract the correct commit hash ID (if there is even a correct one—note that git describe will append -dirty if directed, to indicate that the work-tree does not match the HEAD commit, when appropriate).

Obliterating already integrated feature branch

I work in a team that creates plenty of feature task streams.
Even if we do delete the stream task after integration, the associated branch still exists in the depot and is somewhat cluttering various user interfaces.
I am tempted to ask the admin to obliterate them as we go along.
I have already carefully read http://answers.perforce.com/articles/KB/2565
However, obliterate always comes with the scary warning "please contact Perforce Support first". So before going down that path I would like to know what the risks are, apart from erasing the wrong branch.
1. What will happen to files that were initially created in feature branches? Will obliterating the original turn the lazy copy into a full-fledged file? Since the lazy copy is in the mainline, will the oldest revision now point to the one in the mainline?
2. Will it interfere with the "interchange" command? If I have two "dev" branches moving in parallel, I believe it will still work, because I will actually be comparing the "merge changelists", which won't be affected by the removal of the task branch?
3. What happens if a file is renamed in the feature branch? Will I lose the full range of history, so that the two files look "disconnected"?
4. Is there any other risk I have not taken into account?
Issue 3 is particularly dangerous, and could be a good reason to not go on with the plan.
I currently believe it is "safe" to obliterate an already integrated feature branch if both of the following are true:
1. No move/add/delete has been done in the branch (this can be checked with the fstat headAction property).
2. No sub-branch has been created from the branch (since we are using task streams, this is enforced by default).
Please correct me if I am wrong.
In general, if a file has been integrated elsewhere, obliterating only the file in the task stream is safe, and you will still have the file by its other name.
But, the record of the changes (add/edit/delete, rename, further branching, etc.) that occurred to the file in the task stream will indeed be removed if you obliterate the task stream's history of the file, and so the overall history can end up being confusing and harder to read.
Myself, I prefer to maintain the entire history of those files, but I understand the view that, in the abstract, more history is not always better history.
When you are done with your task stream, are you deleting the stream spec? This will cause the unmodified files from the task stream to disappear, leaving you with only the history of the files that were actually modified in the task stream, which is typically a much smaller set of files.

Can git use patch/diff based storage?

As I understand it, git stores full files of each revision committed. Even though it's compressed there's no way that can compete with, say, storing compressed patches against one original revision full file. It's especially an issue with poorly compressible binary files like images, etc.
Is there a way to make git use a patch/diff based backend for storing revisions?
I get why the main use case of git does it the way it does but I have a particular use case where I would like to use git if I could but it would take up too much space.
Thanks
Git does use diff based storage, silently and automatically, under the name "delta compression". It applies only to files that are "packed", and packs don't happen after every operation.
git-repack docs:
A pack is a collection of objects, individually compressed, with delta compression applied, stored in a single file, with an associated index file.
Git Internals - Packfiles:
You have two nearly identical 22K objects on your disk. Wouldn’t it be nice if Git could store one of them in full but then the second object only as the delta between it and the first?
It turns out that it can. The initial format in which Git saves objects on disk is called a “loose” object format. However, occasionally Git packs up several of these objects into a single binary file called a “packfile” in order to save space and be more efficient. Git does this if you have too many loose objects around, if you run the git gc command manually, or if you push to a remote server.
Later:
The really nice thing about this is that it can be repacked at any time. Git will occasionally repack your database automatically, always trying to save more space, but you can also manually repack at any time by running git gc by hand.
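The "two nearly identical 22K objects" case from the quote can be imitated in a few lines of Python. This is only an analogy: it uses zlib's preset-dictionary feature as a stand-in for a delta and is not Git's actual packfile format. The data is random bytes, chosen to behave like a poorly compressible binary file.

```python
import random
import zlib

# Two nearly identical ~22K objects. Random bytes stand in for poorly
# compressible content such as images.
rng = random.Random(0)
first = bytes(rng.getrandbits(8) for _ in range(22_000))
second = first + b"one small change\n"

# Loose-object style: each version is compressed on its own.
full_size = len(zlib.compress(second))

# Delta style: compress the second version using the first as a preset
# dictionary, so only what differs from `first` costs space.
packer = zlib.compressobj(zdict=first)
delta = packer.compress(second) + packer.flush()

# The second version is still fully recoverable from first + delta.
unpacker = zlib.decompressobj(zdict=first)
assert unpacker.decompress(delta) + unpacker.flush() == second

print(full_size, len(delta))
```

The second number should come out far smaller than the first, which is the effect packing has on repeated revisions of a file, even one that does not compress well on its own.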
"The woes of git gc --aggressive" (Dan Farina), which describes that delta compression is a byproduct of object storage and not revision history:
Git does not use your standard per-file/per-commit forward and/or backward delta chains to derive files. Instead, it is legal to use any other stored version to derive another version. Contrast this to most version control systems where the only option is simply to compute the delta against the last version. The latter approach is so common probably because of a systematic tendency to couple the deltas to the revision history. In Git the development history is not in any way tied to these deltas (which are arranged to minimize space usage) and the history is instead imposed at a higher level of abstraction.
Later, quoting Linus, about the tendency of git gc --aggressive to throw out old good deltas and replace them with worse ones:
So the equivalent of "git gc --aggressive" - but done properly - is to
do (overnight) something like
git repack -a -d --depth=250 --window=250

Perforce: How do files get stored with branching?

A very basic question about branching and duplicating resources. I have had discussions like this because of the size of our main branch, but that aside, it would be great to know how this really works.
Consider the problem of branching dozens of GB.
What happens when you create a branch of this massive amount of information?
I am reading the official docs here and here, but I am still confused about how the files are stored for each branch on the server.
Say a file A.txt exists in the main branch.
When creating the branch (Xbranch), and assuming A.txt won't have changes, will the Perforce server duplicate A.txt (one copy keeping the main changes and another for Xbranch)?
For a massive amount of data this matters, because it would mean duplicating the dozens of GB. So how does this really work?
Some notes in addition to Bryan Pendleton's answer (and the questions from it)
To really check your understanding of what is going on, it is good to try with a test repository with a small number of files and to create checkpoints after each major action and then compare the checkpoints to see what actual database rows were written (as well as having a look at the archive files that the server maintains). This is very quick and easy to setup. You will notice that every branched file generates records in db.integed, db.rev, db.revcx and db.revhx - let alone any in db.have.
You also need to be aware of which server version you are using as the behavior has been enhanced over time. Check the output of "p4 help obliterate":
Obliterate is aware of lazy copies made when 'p4 integrate' creates
a branch, and does not remove copies that are still in use. Because
of this, obliterating files does not guarantee that the corresponding
files in the archive will be removed.
Some other points:
The default flags for "p4 integrate" to create branches copied the files down to the client workspace and then copied them back to the server with the submit. This took time depending on how many and how big the files were. It has long been possible to avoid this using the -v (virtual) flag, which just creates the appropriate rows on the server and avoids updating the client workspace - usually hugely faster. The possible slight downside is you have to sync the files afterwards to work on them.
Newer releases of Perforce have the "p4 populate" command which does the same as an "integrate -v" but also does not actually require the target files to be mapped into the current client workspace - this avoids the dreaded "no target file(s) in client view" error which many beginners have struggled with! [In P4V this is the "Branch files..." command on right click menu, rather than "Merge/Integrate..."]
Streams has made branching a lot slicker and easier in many ways - well worth reading up on and playing with (the only potential fly in the ointment is a flat 2 level naming hierarchy, and also potential challenges in migrating existing branches with existing relationships into streams)
Task streams are pretty nifty and save lots of space on the server
Obliterate has had an interesting flag -b for a few releases which is like being able to quickly and easily remove unchanged branch files - so like retro-creating a task stream. Can potentially save millions of database rows in larger installations with lots of branching
In general, branching a file does not create a copy of the file's contents; instead, the Perforce server just writes an additional database record describing the new revision, but shares the single copy of the file's contents.
Perforce refers to these as "lazy copies"; you can learn more about them here: http://answers.perforce.com/articles/KB_Article/How-to-Identify-a-Lazy-Copy-of-a-File
One exception is if you use the "+S" filetype modifier, as in this case each branch will have its own copy of the content, so that the +S semantics can be performed properly on each branch independently.
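If it helps to picture it, here is a toy Python model of the lazy-copy idea. The class and method names are hypothetical and it does not mirror the real server tables (db.rev and friends). It only shows a branch adding a record without adding content, and an obliterate that keeps content another record still uses.

```python
# Toy model of lazy copies. Class and method names are hypothetical;
# this does not mirror the real Perforce server schema.

class ToyDepot:
    def __init__(self):
        self.archive = {}    # archive key -> file contents (stored once)
        self.revisions = {}  # depot path -> archive key

    def add(self, path, contents):
        """Submit a new file: one archive entry plus one revision record."""
        self.archive[path] = contents
        self.revisions[path] = path

    def branch(self, source, target):
        """Branch lazily: a new record that reuses the source's archive entry."""
        self.revisions[target] = self.revisions[source]

    def obliterate(self, path):
        """Drop a record; keep the contents while anything else uses them."""
        key = self.revisions.pop(path)
        if key not in self.revisions.values():
            del self.archive[key]


depot = ToyDepot()
depot.add("//depot/main/A.txt", b"dozens of GB, in spirit")
depot.branch("//depot/main/A.txt", "//depot/Xbranch/A.txt")
print(len(depot.revisions), len(depot.archive))  # 2 1

# Removing the original leaves the branch readable from the shared entry.
depot.obliterate("//depot/main/A.txt")
print(len(depot.revisions), len(depot.archive))  # 1 1
```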

Git/Linux: What is a good strategy for maintaining a Linux kernel with patches from multiple Git repositories?

I am maintaining a custom Linux kernel which is comprised of merged changes from a variety of sources. This is for an embedded system.
The chip vendor we are working with releases a board support package as a set of changes against a mainline kernel (2.6.31). I have since made changes to support our custom hardware, and also merged with the stable (2.6.31.y) kernel releases. I've also merged in bug fixes for a specific file system driver that we use, sometimes before the changes make it to the mainline kernel.
I haven't been very systematic about how I have managed the various contributing sources and my own changes. If the change was large I tended to merge; if it was small I tended to rebase the third party changeset on to my own. Generally speaking merge conflicts are rare, since most of my work affects drivers that are not in the mainline kernel anyway.
I'm wondering if there is a better way to manage all of this. One concern is that my changes are mixed in with merges. The history might look something like:
2.6.31 + board support package + my changes (1) + 2.6.31.12 changes + my changes (2) + file system driver update + my changes (3) + 2.6.31.14 changes + my changes(4) + ....
It worries me a bit that my changes are mixed in, sometimes on the other side of merges. Is there a better way to do this? In particular, is there a way to do this that will make life easier when I switch to a newer kernel?
I don't think it will be easy to clean up your current set-up, unless your history is reasonably short, but I would suggest this:
Set up a Master repository which has remotes set up for each of the other places that your code comes from -- the mainline kernel, patches, ...
Keep a separate branch specifically for the updates from your driver supplier.
When you fetch, the updates will not mess with your branches.
When you are ready to merge, merge into some kind of "release" branch. The idea is to keep each source separate from the others, except when it needs to be merged in. Base your changes off of this branch, merging/rebasing as necessary.
Here's a quick diagram which I hope is helpful:
mainline-\----------\-------------------------------\
\ \ /you---\---/-/ \
\release----\-/---/----/-/--------\-/ / --\-----
patches-----------------/---/ / /
/ /
driver-------------------------/--------------/
With so many branches, it is difficult to diagram effectively, but I hope that gives you an idea of what I mean. release holds the official code for your system, you holds your own modifications, driver holds patches from the driver supplier, patches holds patches from some other repo, and mainline is the mainline kernel. Everything gets merged into release, and you base your changes off of release but only interact by merging in each direction, not making changes directly to release.
I think the generally accepted best policy is
(a) patch sets
(b) topic branches
Topic branches are essentially the same as patch sets, but are regularly rebased onto mainline. TopGit is a known tool that makes handling topic branches 'easier' if you have a lot of them. Otherwise: plan ahead and limit the number of branches.
