A very basic question about branching and duplicating resources: I have had discussions like this because of the size of our main branch, but that aside, it is good to know how this really works.
Consider the problem of branching dozens of GB.
What happens when you create a branch of this massive amount of information?
I am reading the official docs here and here, but I am still confused about how the files are stored for each branch on the server.
Say a file A.txt exists in main branch.
When creating the branch (Xbranch), and considering A.txt won't have changes, will the Perforce server duplicate A.txt (one copy keeping the main history and another for Xbranch)?
For a massive amount of data this becomes a real concern, because it would mean duplicating dozens of GB. So how does this really work?
Some notes in addition to Bryan Pendleton's answer (and the questions arising from it):
To really check your understanding of what is going on, it is worth setting up a test repository with a small number of files, creating checkpoints after each major action, and then comparing the checkpoints to see which database rows were actually written (as well as having a look at the archive files that the server maintains). This is very quick and easy to set up. You will notice that every branched file generates records in db.integed, db.rev, db.revcx and db.revhx - not to mention db.have.
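For instance, a rough sketch of that workflow on a throwaway test server (depot paths are placeholders, "p4 admin checkpoint" needs super access, and the checkpoint files land in the server's P4ROOT):
p4 admin checkpoint     # writes checkpoint.1 into P4ROOT
p4 integrate //depot/main/... //depot/Xbranch/...
p4 submit -d "Branch main to Xbranch"
p4 admin checkpoint     # writes checkpoint.2
diff checkpoint.1 checkpoint.2 | grep -c 'db.integed'   # count the new integration records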
You also need to be aware of which server version you are using as the behavior has been enhanced over time. Check the output of "p4 help obliterate":
Obliterate is aware of lazy copies made when 'p4 integrate' creates
a branch, and does not remove copies that are still in use. Because
of this, obliterating files does not guarantee that the corresponding
files in the archive will be removed.
Some other points:
By default, "p4 integrate" creates branches by copying the files down to the client workspace and then copying them back to the server with the submit. This takes time, depending on how many files there are and how big they are. It has long been possible to avoid this using the -v (virtual) flag, which just creates the appropriate rows on the server and avoids updating the client workspace - usually hugely faster. The possible slight downside is that you have to sync the files afterwards to work on them.
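A minimal sketch of the virtual flavour, with placeholder depot paths:
p4 integrate -v //depot/main/... //depot/Xbranch/...
p4 submit -d "Create Xbranch (no files copied through the workspace)"
p4 sync //depot/Xbranch/...   # only if/when you actually need the files locally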
Newer releases of Perforce have the "p4 populate" command, which does the same as an "integrate -v" but also does not actually require the target files to be mapped into the current client workspace - this avoids the dreaded "no target file(s) in client view" error which many beginners have struggled with! [In P4V this is the "Branch files..." command on the right-click menu, rather than "Merge/Integrate..."]
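On a server that has the command, the equivalent would look something like this (paths again are placeholders; populate submits the branched revisions directly, so no separate submit is needed):
p4 populate -d "Create Xbranch" //depot/main/... //depot/Xbranch/...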
Streams have made branching a lot slicker and easier in many ways - well worth reading up on and experimenting with (the only potential flies in the ointment are the flat two-level naming hierarchy, and the potential challenges in migrating existing branches, with their existing relationships, into streams)
Task streams are pretty nifty and save lots of space on the server
For a few releases now, obliterate has had an interesting -b flag which lets you quickly and easily remove unchanged branched files - a bit like retroactively turning a branch into a task stream. This can potentially save millions of database rows in larger installations with lots of branching
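A hedged sketch (the branch path is illustrative; obliterate only previews until you add -y):
p4 obliterate -b //depot/Xbranch/...      # preview which unchanged branched revisions would go
p4 obliterate -y -b //depot/Xbranch/...   # actually remove them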
In general, branching a file does not create a copy of the file's contents; instead, the Perforce server just writes an additional database record describing the new revision, but shares the single copy of the file's contents.
Perforce refers to these as "lazy copies"; you can learn more about them here: http://answers.perforce.com/articles/KB_Article/How-to-Identify-a-Lazy-Copy-of-a-File
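For instance, "p4 filelog" on a branched file shows the branch record behind the lazy copy (output abbreviated and purely illustrative); whether the archive file is actually shared can be confirmed with the fstat options described in the KB article above:
p4 filelog //depot/Xbranch/A.txt
# ... #1 change 1234 branch on 2016/01/01 by user@ws (text) 'Create Xbranch'
# ... ... branch from //depot/main/A.txt#1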
One exception is if you use the "+S" filetype modifier, as in this case each branch will have its own copy of the content, so that the +S semantics can be performed properly on each branch independently.
Related
We have a large repository of files that we want to keep in sync between one central location and multiple remote locations. Currently, this is being done using rsync, but it's a slow process mainly because of how long it takes to determine the changes.
My current thought is to find a VCS-like solution where instead of having to check all of the files, we can check the diffs between revisions to determine what gets sent over the wire. My biggest concern, however, is that we'd have to re-sync all of the files that are currently in sync, which is a significant effort. I've been told that the current repository is about 0.5 TB and consists of a variety of files of different sizes. I understand that an initial commit will most likely take a significant amount of time, but I'd rather avoid the syncing between clusters if possible.
One thing I did look at briefly is git-annex, but my first concern is that it may not like dealing with thousands of files. Also, one thing I didn't see is what would happen if the file already exists on both systems. If I create a repo using git-annex on the central system and then set up repos on the remote clusters, will pushing from central to a remote repo cause it to sync all of the files?
If anyone has alternative solutions/ideas, I'd love to see them.
Thanks.
I wonder if there is any way I could retrieve all the changes I have made to my various configuration files since install (residing in /etc and so on) in one shot?
I imagine some kind of loop that uses 'diff' to compare all those files to a 'standard installation' of Ubuntu. The output should be a single file with information about the changes that were made, plus a timestamp.
Perhaps there is even a way to put all that in a script and let it run regularly to automatically keep track of future config file changes.
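Something along those lines, as a rough sketch (the pristine /etc location is an assumption - point it at a mounted fresh install or an extracted base image):
#!/bin/sh
# Compare the live /etc with a pristine copy of /etc from a fresh install.
PRISTINE=/mnt/fresh-install/etc            # assumed location of the untouched copy
OUT=/var/log/etc-changes-$(date +%Y%m%d-%H%M%S).diff
diff -ruN "$PRISTINE" /etc > "$OUT"
echo "Changes written to $OUT"
Run from cron, it would keep a dated record of future changes as well.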
If the files are already modified, I guess your only option is to diff your files against a fresh install. Keep in mind some files might be specific to your computer; I'm thinking of files that can hold device-specific values, like your MAC address in /etc/udev/rules.d/70-persistent-net.rules, your drives' UUIDs in /etc/fstab, etc.
If you're planning this ahead, there are at least two options you can consider:
use a VCS such as git (see the sketch after this list).
use a filesystem that keeps a complete history of the changes made.
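For the git option, a minimal sketch of putting /etc under version control (tools such as etckeeper automate exactly this workflow):
cd /etc
sudo git init
sudo git add -A
sudo git commit -m "baseline: configuration as installed"
# after editing configs later:
sudo git diff     # uncommitted changes since the last commit
sudo git log -p   # full history of committed changes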
I work in a team that creates plenty of feature task streams.
Even if we delete the task stream after integration, the associated branch still exists in the depot and somewhat clutters various user interfaces.
I am tempted to ask the admin to obliterate them as we go along.
I have already read carefully : http://answers.perforce.com/articles/KB/2565
However, obliterate is always accompanied by the scary warning "please contact Perforce Support first". So before going down that path I would like to know what the risks are, apart from erasing the wrong branch.
What will happen to files that were initially created in feature branches? Will obliterating the original transform the lazy copy into a full-fledged file? Since the lazy copy is in the mainline, will the oldest revision now point to the one in the mainline?
Will it interfere with the "interchanges" command? If I have two "dev" branches moving in parallel, I believe it will still work, because I will actually be comparing the "merge changelists", which won't be affected by the removal of the task branch?
What happens if a file is renamed in the feature branch? Will I lose the full range of history, so that the two files look "disconnected"?
Is there any other risk I have not taken into account ?
Issue 3 is particularly dangerous, and could be a good reason to not go on with the plan.
I currently believe it is "safe" to obliterate an already integrated feature branch if 1 & 2 are true:
No move/add/delete has been done in the branch (this can be checked via the fstat headAction property; see the sketch after this list)
No sub-branch has been created from the branch (since we are using task streams this is enforced by default)
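A sketch of check 1 using fstat (the stream path is a placeholder):
p4 fstat -T "depotFile headAction" //depot/tasks/my-task/... | grep -B1 "headAction move/"
# any output means a move/add or move/delete exists in the branch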
Please correct me if I am wrong.
In general, if a file has been integrated elsewhere, obliterating only the file in the task stream is safe, and you will still have the file by its other name.
But, the record of the changes (add/edit/delete, rename, further branching, etc.) that occurred to the file in the task stream will indeed be removed if you obliterate the task stream's history of the file, and so the overall history can end up being confusing and harder to read.
Myself, I prefer to maintain the entire history of those files, but I understand the view that, in the abstract, more history is not always better history.
When you are done with your task stream, are you deleting the stream spec? This will cause the unmodified files from the task stream to disappear, leaving you with only the history of the files that were actually modified in the task stream, which is typically a much smaller set of files.
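For reference, deleting the spec is a single command (the stream name is illustrative):
p4 stream -d //Streams/my-finished-task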
As I understand it, git stores the full file for each revision committed. Even though it's compressed, there's no way that can compete with, say, storing compressed patches against one full original revision. It's especially an issue with poorly compressible binary files like images, etc.
Is there a way to make git use a patch/diff based backend for storing revisions?
I get why the main use case of git does it the way it does but I have a particular use case where I would like to use git if I could but it would take up too much space.
Thanks
Git does use diff based storage, silently and automatically, under the name "delta compression". It applies only to files that are "packed", and packs don't happen after every operation.
git-repack docs:
A pack is a collection of objects, individually compressed, with delta compression applied, stored in a single file, with an associated index file.
Git Internals - Packfiles:
You have two nearly identical 22K objects on your disk. Wouldn’t it be nice if Git could store one of them in full but then the second object only as the delta between it and the first?
It turns out that it can. The initial format in which Git saves objects on disk is called a “loose” object format. However, occasionally Git packs up several of these objects into a single binary file called a “packfile” in order to save space and be more efficient. Git does this if you have too many loose objects around, if you run the git gc command manually, or if you push to a remote server.
Later:
The really nice thing about this is that it can be repacked at any time. Git will occasionally repack your database automatically, always trying to save more space, but you can also manually repack at any time by running git gc by hand.
"The woes of git gc --aggressive" (Dan Farina), which describes that delta compression is a byproduct of object storage and not revision history:
Git does not use your standard per-file/per-commit forward and/or backward delta chains to derive files. Instead, it is legal to use any other stored version to derive another version. Contrast this to most version control systems where the only option is simply to compute the delta against the last version. The latter approach is so common probably because of a systematic tendency to couple the deltas to the revision history. In Git the development history is not in any way tied to these deltas (which are arranged to minimize space usage) and the history is instead imposed at a higher level of abstraction.
Later, quoting Linus, about the tendency of git gc --aggressive to throw out old good deltas and replace them with worse ones:
So the equivalent of "git gc --aggressive" - but done properly - is to do (overnight) something like
git repack -a -d --depth=250 --window=250
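If you want to see the packing and delta behaviour on your own repository, something like the following works (output details vary by git version):
git gc                      # pack loose objects; delta compression happens during packing
git count-objects -v        # compare 'count' (loose) vs 'in-pack' object totals
git verify-pack -v .git/objects/pack/pack-*.idx | head
# for deltified objects, verify-pack lists the chain depth and the base object they delta against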
Is it possible to map the same part of a depot to two (or more) different places?
//depot/branches/foo/... //my_client/foo/...
//depot/branches/foo/... //my_client/foo1/...
The reason I want this is to be able to make unrelated and non-overlapping changes to the same file(s) simultaneously.
(If they were different files, I could simply use different change-lists in a single mapping, of course.)
A given client view can only contain one copy of a given depot file at a time. That said, here are three possible ways to make two different changes to the same file at the same time:
1) Do your two changes need to both exist on your client machine simultaneously? If not, when you want to pause work on your first change, "shelve" it, revert your local file, and then make your second change. You can have any number of "shelved" versions of a file (in different changelists) associated with a single client, but only the "open" file is actually present in the workspace.
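A sketch of that shelve workflow, with a placeholder file path and changelist number:
p4 edit //depot/branches/foo/file.c
# ...make the first change...
p4 shelve             # no arguments: moves the default changelist's open files into a new
                      # numbered pending changelist (say 1234) and shelves them on the server
p4 revert //depot/branches/foo/file.c
p4 edit //depot/branches/foo/file.c
# ...make the second change; later, to resume the first one:
p4 unshelve -s 1234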
2) Do you in fact need both files on your machine, but not necessarily need to run Perforce commands on them simultaneously (like merge changes between them, diff them vs each other, submit them both as a single change, etc)? If so, having multiple client specs is a good option. Make sure they have different roots (hence different local filesystem locations), and use P4CONFIG files so that you'll automatically use the client spec that matches your working directory.
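As an illustration of the P4CONFIG part (workspace roots and client names here are assumptions):
export P4CONFIG=.p4config
echo P4CLIENT=my_client_foo  > ~/work/foo/.p4config
echo P4CLIENT=my_client_foo1 > ~/work/foo1/.p4config
Any p4 command run under ~/work/foo then uses my_client_foo, and under ~/work/foo1 uses my_client_foo1.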
3) Do you need both files and want the ability to version different sets of changes to them simultaneously, diff the two variants, and merge changes between them? If so, you want to make a new branch. Do:
p4 integ //depot/branches/foo/... //depot/branches/foo1/...
p4 submit
Now there are two sets of files in the depot and in your workspace; you can make independent changes to them, and use "p4 integ" later to merge those changes between them (in either direction) as desired.
Bryan's suggestion is a good option for what you would like to accomplish.
In terms of overlay mappings in a client workspace, Perforce allows you to map multiple depots to the same workspace location as documented here:
http://www.perforce.com/perforce/doc.current/manuals/p4guide/chapter.configuration.html#configuration.refine_workspace.map_diff_depot_locations
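For example, the '+' prefix below marks the second line as an overlay mapping (depot paths and client name are illustrative):
//depot/main/...       //my_client/project/...
+//depot/extras/...    //my_client/project/...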