Number of threads for git gc depending on repo size - multithreading

Can I use single-threaded compression in Git for large repositories and the usual parallelized one for small ones? Something like "pack.threads=1" if the repository doesn't easily fit in memory and "pack.threads=4" otherwise.
As I heard somewhere, multithreaded "git gc" requires a lot of memory and thrashes (or simply fails) for longer than the single-threaded version.
I want it to work fast for small repos and not fail on big repos.

You can configure pack.threads per repository, but I doubt that there is a setting to do this automatically depending on the size of the repository.
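For example, you could set it by hand per repository (pack.threads is a real setting; a value of 0 means auto-detect the number of CPUs):
git config pack.threads 1    # in the repository that doesn't fit in memory
git config pack.threads 4    # in the smaller repositories
If you want it to depend on repository size automatically, one rough idea is a small wrapper script around git gc; the 1024 MB threshold below is purely an illustration, not a recommendation:
if [ "$(du -sm .git/objects | cut -f1)" -gt 1024 ]; then
    git -c pack.threads=1 gc
else
    git -c pack.threads=4 gc
fi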

Related

Managing large quantity of files between two systems

We have a large repository of files that we want to keep in sync between one central location and multiple remote locations. Currently, this is being done using rsync, but it's a slow process mainly because of how long it takes to determine the changes.
My current thought is to find a VCS-like solution where, instead of having to check all of the files, we can check the diffs between revisions to determine what gets sent over the wire. My biggest concern, however, is that we'd have to re-sync all of the files that are currently in sync, which is a significant effort. I've been told that the current repository is about 0.5 TB and consists of a variety of files of different sizes. I understand that an initial commit will most likely take a significant amount of time, but I'd rather avoid re-syncing everything between clusters if possible.
One thing I did look at briefly is git-annex, but my first concern is that it may not like dealing with thousands of files. Also, one thing I didn't see is what would happen if the file already exists on both systems. If I create a repo using git-annex on the central system and then set up repos on the remote clusters, will pushing from central to a remote repo cause it to sync all of the files?
If anyone has alternative solutions/ideas, I'd love to see them.
Thanks.

Should I use git-lfs for packages info files?

As a developer working with several languages, I notice that in most modern languages, dependencies metadata files can change a lot.
For instance, in NodeJS (which in my opinion is the worst when it comes to package management), a change of dependencies or of the NPM (respectively yarn) version can lead to huge changes in package-lock.json (respectively yarn.lock), sometimes with tens of thousands of modified lines.
In Golang, for instance, this would be go.sum, which can see significant changes (of smaller magnitude than Node, of course) when modifying dependencies or running go mod tidy.
Would it be more efficient to track these dependencies files with git-lfs? Is there a reason not to do it?
Even if they are text files, I know that it is advised to push SVG files with git-lfs, because they are mostly generated files and their diff has no reason to be small when regenerating them after a change.
Are there studies about which languages and what project size/age make git-lfs worthwhile?
Would it be more efficient to track these dependencies files with git-lfs?
Git does a pretty good job at compressing text files, so initially you probably wouldn't see much gain. If the file gets heavily modified often, then over time the total cloneable repo size would grow more slowly with Git LFS, but the savings may be negligible as a percentage of the total repo size growth. The primary use case for LFS is largish binary files that change often.
Is there a reason not to do it?
If you aren't already using Git LFS, I wouldn't recommend starting just for this. Also, AFAIK there isn't native built-in support for diffing versions of files stored in LFS, though workarounds exist. If you often find yourself diffing the files you are considering moving into LFS, the nominal storage gain may not be worth it.
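For reference, if you did decide to try it, moving such a file into LFS is just a tracking rule plus a commit (the file name here is only an example; rewriting already-committed history would additionally require git lfs migrate):
git lfs track "package-lock.json"
git add .gitattributes package-lock.json
git commit -m "Track package-lock.json with Git LFS"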

git forces refresh index after switching between Windows and Linux

I have a disk partition (format: NTFS) shared by Windows and Linux. It contains a git repository (about 6.7 GB).
If I only use Windows or only use Linux to manipulate the git repository everything is okay.
But every time I switch systems, the git status command will refresh the index, which takes about 1 minute. If I run git status on the same system again, it takes less than 1 second. Here is the result:
# Just after switch from windows
[#5#wangx#manjaro:duishang_design] git status # this command takes more than 60s
Refresh index: 100% (2751/2751), done.
On branch master
nothing to commit, working tree clean
[#10#wangx#manjaro:duishang_design] git status # this time the command takes less than 1s
On branch master
nothing to commit, working tree clean
[#11#wangx#manjaro:duishang_design] git status # this time the command takes less than 1s
On branch master
nothing to commit, working tree clean
I guess there is some problem with the git cache. For example: Windows and Linux both use the .git/index file as the cache file, but Git on Linux can't use the .git/index written by Windows. So it can only refresh the index and rewrite .git/index, which makes the next git status on Linux super fast and git status on Windows very slow (because Windows will refresh the index file again).
Is my guess correct? If so, how can I use a separate index file for each system? How can I solve the problem?
You are completely correct here:
The thing you're using here, which Git variously calls the index, the staging area, or the cache, does in fact contain cache data.
The cache data that it contains is the result of system calls.
The system call data returned by a Linux system is different from the system call data returned by a Windows system.
Hence, an OS switch completely invalidates all the cache data.
... how can I use a separate index file for each system?
Your best bet here is not to do this at all. Make two different work-trees, or perhaps even two different repositories. But if that's more painful than the alternatives, try out these ideas:
The actual index file that Git uses merely defaults to .git/index. You can specify a different file by setting GIT_INDEX_FILE to some other (relative or absolute) path. So you could have .git/index-linux and .git/index-windows, and set GIT_INDEX_FILE based on whichever OS you're using.
Some Git commands use a temporary index. They do this by setting GIT_INDEX_FILE themselves. If they un-set it afterward, they may accidentally use .git/index at that point. So another option is to move .git/index out of the way when switching OSes: keep a .git/index-windows and a .git/index-linux as before, but rename whichever one is in use to .git/index while it's in use, then rename it back to its OS-specific name before switching to the other system.
Again, I don't recommend attempting either of these methods, but they are likely to work, more or less.
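For illustration, here is a minimal sketch of the first idea (a separate index per OS via GIT_INDEX_FILE), assuming a POSIX shell on the Linux side; the repository path is a placeholder:
cd /path/to/shared/repo
export GIT_INDEX_FILE="$PWD/.git/index-linux"   # e.g. from your shell profile
git status                                      # now reads and writes index-linux
On the Windows side you would point GIT_INDEX_FILE at .git/index-windows in the same way.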
As torek mentioned, you probably don't want to do this. It's not generally a good idea to share a repo between operating systems.
However, it is possible, much like it's possible to share a repo between Windows and Windows Subsystem for Linux. You may want to try setting core.checkStat to minimal, and if that isn't sufficient, core.trustctime to false. That leads to the minimal amount of information being stored in the index, which means that the data is going to be as portable as possible.
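For example, a minimal sketch of that configuration, run once inside the shared repository:
git config core.checkStat minimal
git config core.trustctime false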
Note, however, that if your repository has symlinks, it's likely that nothing you do is going to prevent refreshes. Linux typically considers the length of a symlink to be its length in bytes, and Windows considers it to take one or more disk blocks, so there will be a mismatch in size between the operating systems. This isn't avoidable, since size is one of the attributes used in the index that can't be disabled.
This might not apply to the original poster, but if Linux is being used under the Windows Subsystem for Linux (WSL), then a quick fix is to use git.exe even on the Linux side. Use an alias or something to make it seamless. For example:
alias git=git.exe
The auto line-ending setting solved my issue, as in this discussion. I am referring to Windows, WSL2, a portable Linux OS, and Linux, all of which I have set up and working as my job requires. I will update this answer if I run into any issue with this approach when updating the code base from different filesystems (NTFS or a Linux filesystem).
git config --global core.autocrlf true

Can git use patch/diff based storage?

As I understand it, git stores the full file for each revision committed. Even though it's compressed, there's no way that can compete with, say, storing compressed patches against a single full copy of the original revision. It's especially an issue with poorly compressible binary files like images, etc.
Is there a way to make git use a patch/diff based backend for storing revisions?
I get why the main use case of git does it the way it does but I have a particular use case where I would like to use git if I could but it would take up too much space.
Thanks
Git does use diff-based storage, silently and automatically, under the name "delta compression". It applies only to files that are "packed", and packing doesn't happen after every operation.
git-repack docs:
A pack is a collection of objects, individually compressed, with delta compression applied, stored in a single file, with an associated index file.
Git Internals - Packfiles:
You have two nearly identical 22K objects on your disk. Wouldn’t it be nice if Git could store one of them in full but then the second object only as the delta between it and the first?
It turns out that it can. The initial format in which Git saves objects on disk is called a “loose” object format. However, occasionally Git packs up several of these objects into a single binary file called a “packfile” in order to save space and be more efficient. Git does this if you have too many loose objects around, if you run the git gc command manually, or if you push to a remote server.
Later:
The really nice thing about this is that it can be repacked at any time. Git will occasionally repack your database automatically, always trying to save more space, but you can also manually repack at any time by running git gc by hand.
"The woes of git gc --aggressive" (Dan Farina), which describes that delta compression is a byproduct of object storage and not revision history:
Git does not use your standard per-file/per-commit forward and/or backward delta chains to derive files. Instead, it is legal to use any other stored version to derive another version. Contrast this to most version control systems where the only option is simply to compute the delta against the last version. The latter approach is so common probably because of a systematic tendency to couple the deltas to the revision history. In Git the development history is not in any way tied to these deltas (which are arranged to minimize space usage) and the history is instead imposed at a higher level of abstraction.
Later, quoting Linus, about the tendency of git gc --aggressive to throw out old good deltas and replace them with worse ones:
So the equivalent of "git gc --aggressive" - but done properly - is to do (overnight) something like
git repack -a -d --depth=250 --window=250
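If you would rather not remember those flags, the same knobs also exist as configuration, and git count-objects shows how much of the repository is still loose versus delta-compressed in packs (the 250 values simply mirror the quote above, not a recommendation):
git config pack.window 250
git config pack.depth 250
git gc
git count-objects -v    # "count" = loose objects, "in-pack" = objects in packfiles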

Git/Linux: What is a good strategy for maintaining a Linux kernel with patches from multiple Git repositories?

I am maintaining a custom Linux kernel which is comprised of merged changes from a variety of sources. This is for an embedded system.
The chip vendor we are working with releases a board support package as a set of changes against a mainline kernel (2.6.31). I have since made changes to support our custom hardware, and have also merged in the stable (2.6.31.y) kernel releases. I've also merged in bug fixes for a specific file system driver that we use, sometimes before the changes make it into the mainline kernel.
I haven't been very systematic about how I have managed the various contributing sources and my own changes. If the change was large I tended to merge; if it was small I tended to rebase the third party changeset on to my own. Generally speaking merge conflicts are rare, since most of my work affects drivers that are not in the mainline kernel anyway.
I'm wondering if there is a better way to manage all of this. One concern is that my changes are mixed in with merges. The history might look something like:
2.6.31 + board support package + my changes (1) + 2.6.31.12 changes + my changes (2) + file system driver update + my changes (3) + 2.6.31.14 changes + my changes(4) + ....
It worries me a bit that my changes are mixed in, sometimes on the other side of merges. Is there a better way to do this? In particular, is there a way to do this that will make life easier when I switch to a newer kernel?
I don't think it will be easy to clean up your current set-up, unless your history is reasonably short, but I would suggest this:
Set up a Master repository which has remotes set up for each of the other places that your code comes from -- the mainline kernel, patches, ...
Keep a separate branch specifically for the updates from your driver supplier.
When you fetch, the updates will not mess with your branches.
When you are ready to merge, merge into some kind of "release" branch. The idea is to keep each source separate from the others, except when it needs to be merged in. Base your changes off of this branch, merging/rebasing as necessary.
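For example, the set-up described above might look something like this (the remote names and URLs are placeholders, not taken from the question):
git remote add mainline <mainline-kernel-url>
git remote add vendor <vendor-bsp-url>
git fetch --all
git branch driver vendor/master       # vendor/driver updates live here
git branch release mainline/master    # the integration ("release") branch
git checkout -b you release           # your own changes, based on release
# when it's time to integrate:
git checkout release
git merge mainline/master             # pull in a new stable release
git merge driver                      # pull in vendor driver updates
git merge you                         # pull in your own work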
Here's a quick diagram which I hope is helpful:
mainline ---\
patches -----+---> release <---> you
driver -----/
With so many branches, it is difficult to diagram effectively, but I hope that gives you an idea of what I mean. release holds the official code for your system, you holds your own modifications, driver holds patches from the driver supplier, patches holds patches from some other repo, and mainline is the mainline kernel. Everything gets merged into release, and you base your changes off of release but only interact by merging in each direction, not making changes directly to release.
I think the generally accepted best policy is
(a) patch sets
(b) topic branches
Topic branches are essentially the same idea, but they are regularly rebased onto mainline. TopGit is a well-known tool that makes handling topic branches 'easier' if you have a lot of them. Otherwise: plan ahead and limit the number of branches.
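A minimal sketch of a topic branch kept rebased onto mainline (branch and remote names are only examples):
git checkout -b topic/board-gpio mainline/master   # one topic per logical change
# ... commit the changes for this topic ...
git fetch mainline
git rebase mainline/master topic/board-gpio        # re-base the topic onto current mainline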

Resources