What is the fastest way to keep a subversion repository in sync for migrating to a new environment? - linux

I'm attempting to migrate large Subversion repositories from a system with Subversion 1.6 to one with Subversion 1.7. For the final sync I could take downtime and simply use rsync, but I didn't choose that because I want the new repositories to be created/upgraded with version 1.7.
So I chose to use svnsync, because it can pick up where it left off and because the repositories are not mounted on the same device. The only problem is that for large repositories svnsync copy-revprops takes forever, as it goes through every revision all over again.
So my question is: can I do one of the following?
Reduce the time it takes to do svnsync copy-revprops
Skip copy-revprops entirely
Use a faster method of incrementally syncing repositories from one version of Subversion to another

According to http://svnbook.red-bean.com/en/1.7/svn.ref.svnsync.c.copy-revprops.html:
Because Subversion revision properties can be changed at any time,
it's possible that the properties for some revision might be changed
after that revision has already been synchronized to another
repository. Because the svnsync synchronize command operates only on
the range of revisions that have not yet been synchronized, it won't
notice a revision property change outside that range. Left as is, this
causes a deviation in the values of that revision's properties between
the source and mirror repositories. svnsync copy-revprops is the
answer to this problem. Use it to resynchronize the revision
properties for a particular revision or range of revisions.
This means that you can choose to skip copy-revprops if the revision properties have not been changing. Even if you're not sure, there is no point in running copy-revprops multiple times; once at the very end of all syncs will suffice.
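For reference, here is a minimal sketch of the incremental workflow with placeholder URLs and paths (the mirror repository must already exist and allow revision property changes via its pre-revprop-change hook):
# one-time setup of the mirror
svnsync initialize file:///srv/svn17/repo http://old-server/svn/repo
# resumable incremental sync; run as often as you like before the cutover
svnsync synchronize file:///srv/svn17/repo
# pick up any revision property edits once, right before the final cutover
svnsync copy-revprops file:///srv/svn17/repo 0:HEAD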

Related

git forces refresh index after switching between Windows and Linux

I have a disk partition (format: NTFS) shared by Windows and Linux. It contains a git repository (about 6.7 GB).
If I only use Windows or only use Linux to manipulate the git repository everything is okay.
But every time I switch systems, the git status command refreshes the index, which takes about a minute. If I run git status again on the same system, it takes less than a second. Here is the result:
# Just after switch from windows
[#5#wangx#manjaro:duishang_design] git status # this command takes more than 60s
Refresh index: 100% (2751/2751), done.
On branch master
nothing to commit, working tree clean
[#10#wangx#manjaro:duishang_design] git status # this time the command takes less than 1s
On branch master
nothing to commit, working tree clean
[#11#wangx#manjaro:duishang_design] git status # this time the command takes less than 1s
On branch master
nothing to commit, working tree clean
I guess there is some problem with the Git cache. For example: Windows and Linux both use the .git/index file as the cache file, but Git on Linux can't use the .git/index written by Windows. So it has to refresh the index and replace the .git/index file, which makes the next git status on Linux super fast and the next git status on Windows very slow (because Windows will refresh the index file again).
Is my guess correct? If so, how can I set a separate index file for each system? How can I solve the problem?
You are completely correct here:
The thing you're using here, which Git variously calls the index, the staging area, or the cache, does in fact contain cache data.
The cache data that it contains is the result of system calls.
The system call data returned by a Linux system is different from the system call data returned by a Windows system.
Hence, an OS switch completely invalidates all the cache data.
... how can I set a separate index file for each system?
Your best bet here is not to do this at all. Make two different work-trees, or perhaps even two different repositories. But if that's more painful than the alternatives below, try out these ideas:
The actual index file that Git uses merely defaults to .git/index. You can specify a different file by setting GIT_INDEX_FILE to some other (relative or absolute) path. So you could have .git/index-linux and .git/index-windows, and set GIT_INDEX_FILE based on whichever OS you're using.
Some Git commands use a temporary index. They do this by setting GIT_INDEX_FILE themselves. If they un-set it afterward, they may accidentally use .git/index at this point. So another option is to rename .git/index out of the way when switching OSes. Keep a .git/index-windows and .git/index-linux as before, but rename whichever one is in use to .git/index while it's in use, then rename it back to its OS-specific name before switching to the other system.
Again, I don't recommend attempting either of these methods, but they are likely to work, more or less.
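As a rough sketch of the first idea (the index-linux / index-windows file names are made up, but GIT_INDEX_FILE itself is a standard Git environment variable; a relative path is interpreted from where you invoke Git, so run these from the top of the work-tree or use an absolute path):
# Linux shell, at the top of the work-tree
export GIT_INDEX_FILE=.git/index-linux
git status                       # now reads and writes .git/index-linux
# Windows (PowerShell), same work-tree
$env:GIT_INDEX_FILE = '.git\index-windows'
git status                       # now reads and writes .git\index-windows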
As torek mentioned, you probably don't want to do this. It's not generally a good idea to share a repo between operating systems.
However, it is possible, much like it's possible to share a repo between Windows and Windows Subsystem for Linux. You may want to try setting core.checkStat to minimal, and if that isn't sufficient, core.trustctime to false. That leads to the minimal amount of information being stored in the index, which means that the data is going to be as portable as possible.
Note, however, that if your repository has symlinks, it's likely that nothing you do will prevent refreshes. Linux typically considers the length of a symlink to be its length in bytes, and Windows considers it to take one or more disk blocks, so there will be a mismatch in size between the operating systems. This isn't avoidable, since size is one of the attributes used in the index that can't be disabled.
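If you want to try that, both settings can be applied per repository with plain git config:
git config core.checkStat minimal
git config core.trustctime false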
This might not apply to the original poster, but if Linux is being used under the Windows Subsystem for Linux (WSL), then a quick fix is to use git.exe even on the Linux side. Use an alias or something to make it seamless. For example:
alias git=git.exe
Setting automatic line-ending conversion solved my issue, as in this discussion. I'm referring to Windows, WSL2, a portable Linux OS, and regular Linux, all of which I have set up and working as my job requires. I will update this answer if I run into any issue with this approach when updating the code base from different filesystems (NTFS or a Linux filesystem).
git config --global core.autocrlf true

Prevent change propagation for single files

I work with Perforce streams, following the suggested mainline model (release, mainline and development streams). In addition, we use an odd/even release version numbering scheme (similar to the Linux kernel), with odd minor version numbers for development versions and even minor numbers for release versions.
After fixing a bug in a release stream, I need to update several files with version information to create a new release version/installer. These version changes must not be merged to mainline (only the bugfix itself), because the version of the mainline has already been increased to the next development version.
Now, when merging from the release stream to main, I get conflicts for all files containing version information. Currently, I need to manually resolve all conflicts, undoing the version number change (keeping the development version).
Example:
Release stream started with version 2.4.0 (stable/release version number)
Increase mainline version to 2.5 (next development version)
Fix bug in release stream, increase version number to 2.4.1
Merge changes to mainline: accept bugfix, manually undo conflicts in version files
Is there a way to exclude single files / a set of files from integration so that I don't have to go through this tedious (and potentially error-prone) manual process? (NB: The version information is separate from the code.)
Three possible options:
1) Specify these files as "isolate" in the stream spec so that changes to them are isolated, i.e. not included in copy/merge operations (see the sketch after this list).
2) Merge the version changes but use "resolve -ay" (ignore) to specify that these changes are to be ignored.
3) Revert these files after opening them for a merge operation so that you will not need to resolve or submit them.
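As a sketch of options 1) and 2), with a made-up path (version/version.h) standing in for your version files:
# option 1: in the release stream's spec ("p4 stream"), isolate the version files
Paths:
    share ...
    isolate version/version.h
# option 2: after opening the files for merge, accept "yours" to ignore the incoming change
p4 resolve -ay version/version.h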

Can git use patch/diff based storage?

As I understand it, git stores the full file contents of each revision committed. Even though they're compressed, there's no way that can compete with, say, storing compressed patches against one original full revision of the file. It's especially an issue with poorly compressible binary files like images, etc.
Is there a way to make git use a patch/diff based backend for storing revisions?
I get why the main use case of git does it the way it does but I have a particular use case where I would like to use git if I could but it would take up too much space.
Thanks
Git does use diff based storage, silently and automatically, under the name "delta compression". It applies only to files that are "packed", and packs don't happen after every operation.
git-repack docs:
A pack is a collection of objects, individually compressed, with delta compression applied, stored in a single file, with an associated index file.
Git Internals - Packfiles:
You have two nearly identical 22K objects on your disk. Wouldn’t it be nice if Git could store one of them in full but then the second object only as the delta between it and the first?
It turns out that it can. The initial format in which Git saves objects on disk is called a “loose” object format. However, occasionally Git packs up several of these objects into a single binary file called a “packfile” in order to save space and be more efficient. Git does this if you have too many loose objects around, if you run the git gc command manually, or if you push to a remote server.
Later:
The really nice thing about this is that it can be repacked at any time. Git will occasionally repack your database automatically, always trying to save more space, but you can also manually repack at any time by running git gc by hand.
"The woes of git gc --aggressive" (Dan Farina), which describes that delta compression is a byproduct of object storage and not revision history:
Git does not use your standard per-file/per-commit forward and/or backward delta chains to derive files. Instead, it is legal to use any other stored version to derive another version. Contrast this to most version control systems where the only option is simply to compute the delta against the last version. The latter approach is so common probably because of a systematic tendency to couple the deltas to the revision history. In Git the development history is not in any way tied to these deltas (which are arranged to minimize space usage) and the history is instead imposed at a higher level of abstraction.
Later, quoting Linus, about the tendency of git gc --aggressive to throw out old good deltas and replace them with worse ones:
So the equivalent of "git gc --aggressive" - but done properly - is to do (overnight) something like
git repack -a -d --depth=250 --window=250
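To see the effect on your own repository, you can compare loose versus packed storage before and after a repack (standard Git commands):
git count-objects -v      # "count"/"size" are loose objects, "in-pack"/"size-pack" are packed
git gc                    # packs loose objects, applying delta compression
git count-objects -v      # compare sizes after packing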

Implementing an update/upgrade system for embedded Linux devices

I have an application that runs on an embedded Linux device and every now and then changes are made to the software and occasionally also to the root file system or even the installed kernel.
In the current update system the contents of the old application directory are simply deleted and the new files are copied over it. When changes to the root file system have been made the new files are delivered as part of the update and simply copied over the old ones.
Now, there are several problems with the current approach and I am looking for ways to improve the situation:
The root file system of the target that is used to create file system images is not versioned (I don't think we even have the original rootfs).
The rootfs files that go into the update are manually selected (instead of a diff)
The update continually grows, and that has become a pain. There is now a split between update and upgrade, where the upgrade contains larger rootfs changes.
I have the impression that the consistency checks in an update are rather fragile, if implemented at all.
Requirements are:
The application update package should not be too large, and it must also be able to change the root file system in case modifications have been made.
An upgrade can be much larger and only contains the stuff that goes into the root file system (like new libraries, kernel, etc.). An update can require an upgrade to have been installed.
Could the upgrade contain the whole root file system and simply do a dd on the flash drive of the target?
Creating the update/upgrade packages should be as automatic as possible.
I absolutely need some way to do versioning of the root file system. This has to be done in a way that lets me compute some sort of diff which can be used to update the rootfs of the target device.
I already looked into Subversion since we use that for our source code but that is inappropriate for the Linux root file system (file permissions, special files, etc.).
I've now created some shell scripts that can give me something similar to an svn diff but I would really like to know if there already exists a working and tested solution for this.
Using such diffs, I guess an upgrade would then simply become a package that contains incremental updates based on a known root file system state.
What are your thoughts and ideas on this? How would you implement such a system? I prefer a simple solution that can be implemented in not too much time.
I believe you are looking at the problem the wrong way - any update which is not atomic (e.g. dd'ing a file system image, or replacing files in a directory) is broken by design: if the power goes off in the middle of an update, the system is a brick, and for embedded systems power can go off in the middle of an upgrade.
I have written a white paper on how to correctly do upgrade/update on embedded Linux systems [1]. It was presented at OLS. You can find the paper here: https://www.kernel.org/doc/ols/2005/ols2005v1-pages-21-36.pdf
[1] Ben-Yossef, Gilad. "Building Murphy-compatible embedded Linux systems." Linux Symposium. 2005.
I absolutely agree that an update must be atomic - I recently started an open source project whose goal is to provide a safe and flexible way to do software management, with both local and remote updates. I know my answer comes very late, but maybe it can help you on your next projects.
You can find sources for "swupdate" (the name of the project) at github.com/sbabic/swupdate.
Stefano
Currently there are quite a few open source embedded Linux update tools, each with a different focus.
Another one that is worth being mentioned is RAUC, which focuses on handling safe and atomic installations of signed update bundles on your target while being really flexible in the way you adapt it to your application and environment. The sources are on GitHub: https://github.com/rauc/rauc
In general, a good overview and comparison of current update solutions you might find on the Yocto Project Wiki page about system updates:
https://wiki.yoctoproject.org/wiki/System_Update
Atomicity is critical for embedded devices; power loss is one of the reasons highlighted here, but there could be others, like hardware or network issues.
Atomicity is perhaps a bit misunderstood; this is a definition I use in the context of updaters:
An update is always either completed fully, or not at all
No software component besides the updater ever sees a half installed update
Full image update with a dual A/B partition layout is the simplest and most proven way to achieve this.
For Embedded Linux there are several software components that you might want to update and different designs to choose from; there is a newer paper on this available here: https://mender.io/resources/Software%20Updates.pdf
File moved to: https://mender.io/resources/guides-and-whitepapers/_resources/Software%2520Updates.pdf
If you are working with the Yocto Project you might be interested in Mender.io - the open source project I am working on. It consists of a client and a server, and the goal is to make it much faster and easier to integrate an updater into an existing environment, without needing to redesign too much or spend time on custom/homegrown coding. It will also allow you to manage updates centrally with the server.
You can journal an update and divide your update flash into two slots. A power failure always returns you to the currently executing slot. The last step is to modify the journal value. Non-atomic, yet there is no way to brick it, even if it fails at the moment of writing the journal flags. There is no such thing as an atomic update. Ever. I've never seen one in my life. iPhone, Android, my network switch -- none of them are atomic. If you don't have enough room to do that kind of design, then fix the design.
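A rough sketch of that two-slot scheme, with made-up device names and a U-Boot-style environment variable acting as the journal flag (fw_setenv is part of u-boot-tools; your bootloader integration may differ):
# write the new image to the currently inactive slot
dd if=rootfs-new.img of=/dev/mmcblk0p3 bs=4M conv=fsync
sync
# only after the image is fully written and verified, flip the boot flag;
# until then the bootloader keeps starting the old, known-good slot
fw_setenv boot_slot 3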

Git/Linux: What is a good strategy for maintaining a Linux kernel with patches from multiple Git repositories?

I am maintaining a custom Linux kernel which comprises merged changes from a variety of sources. This is for an embedded system.
The chip vendor we are working with releases a board support package as a set of changes against a mainline kernel (2.6.31). I have since made changes to support our custom hardware, and also merged in the stable (2.6.31.y) kernel releases. I've also merged in bug fixes for a specific file system driver that we use, sometimes before the changes make it into the mainline kernel.
I haven't been very systematic about how I have managed the various contributing sources and my own changes. If the change was large I tended to merge; if it was small I tended to rebase the third party changeset on to my own. Generally speaking merge conflicts are rare, since most of my work affects drivers that are not in the mainline kernel anyway.
I'm wondering if there is a better way to manage all of this. One concern is that my changes are mixed in with merges. The history might look something like:
2.6.31 + board support package + my changes (1) + 2.6.31.12 changes + my changes (2) + file system driver update + my changes (3) + 2.6.31.14 changes + my changes(4) + ....
It worries me a bit that my changes are mixed in, sometimes on the other side of merges. Is there a better way to do this? In particular, is there a way to do this that will make life easier when I switch to a newer kernel?
I don't think it will be easy to clean up your current set-up, unless your history is reasonably short, but I would suggest this:
Set up a Master repository which has remotes set up for each of the other places that your code comes from -- the mainline kernel, patches, ...
Keep a separate branch specifically for the updates from your driver supplier.
When you fetch, the updates will not mess with your branches.
When you are ready to merge, merge into some kind of "release" branch. The idea is to keep each source separate from the others, except when it needs to be merged in. Base your changes off of this branch, merging/rebasing as necessary.
Here's a quick diagram which I hope is helpful:
mainline-\----------\-------------------------------\
\ \ /you---\---/-/ \
\release----\-/---/----/-/--------\-/ / --\-----
patches-----------------/---/ / /
/ /
driver-------------------------/--------------/
With so many branches, it is difficult to diagram effectively, but I hope that gives you an idea of what I mean. release holds the official code for your system, you holds your own modifications, driver holds patches from the driver supplier, patches holds patches from some other repo, and mainline is the mainline kernel. Everything gets merged into release, and you base your changes off of release but only interact by merging in each direction, not making changes directly to release.
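In commands, that layout might look roughly like this (URLs and branch names are placeholders, not a prescription):
git remote add mainline https://example.org/linux-stable.git   # placeholder URL
git remote add vendor https://example.org/vendor-bsp.git        # placeholder URL
git fetch --all
git checkout -b release mainline/linux-2.6.31.y                 # illustrative branch name
git merge vendor/bsp                                            # vendor BSP goes into release
git checkout -b you release                                     # your own changes live here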
I think the generally accepted best policy is
(a) patch sets
(b) topic branches
Topic branches are essentially the same thing, but are regularly rebased onto mainline. Topgit is a well-known tool that makes handling topic branches "easier" if you have a lot of them. Otherwise: plan ahead and limit the number of branches.
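For example, a single topic branch kept rebased onto the moving stable base (tag and branch names here are only illustrative):
git checkout -b topic/fs-driver-fixes v2.6.31                    # start the topic from the old base
# ...commit the driver fixes on this branch...
git rebase --onto v2.6.31.14 v2.6.31 topic/fs-driver-fixes       # move the topic onto the newer stable release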

Resources