Create git history, blob and tree objects, in parallel - multithreading

our company have decided to migrate our source code from clearcase to git, that's great :-)
I know that clearcase and git are completely different source code management systems.
But we developer, would have only one SCM that containing the complete history.
My colleague found the following tool, which importing our clearcase history into git: https://github.com/charleso/git-cc
Unfortunately our code has more than 46000 source code files and the history to import is more than 10 years.
I analyzed this tool and in my opinion there are two bottlenecks.
The first is the import of files from clearcase server. This is easy to solve by doing this in multiple threads.
The second is the workflow of git-cc itself.
Get history of master-branch via cleartool lshistory
Create changesets of files and group them to comit's
Get specified version of file(s) from cc server and copy to working directory
git add .
git commit
pick next group and start with 3. again
I think I could improve it by using low level git commands and using multiple threads.
Each commit-group queries its changes from server and creating a blob object within git database, so this could run for multiple groups in multiple threads.
Additional I have one thread which create the history in git from just now created blob objects.
My question is now, does this make sense to you or do you think I'm naive?
Have I forget any git locking mechanism?
Have you any other ideas?

Using multiple thread for importing commits in the same branch of a Git repo is risky (unless, as you put it, you create "blob object", that is patches that you can replay).
But using multiple thread for commits on different branches is possible: you create different repo, each one for a branch import, and then you can fetch those repos into one common repo and reattach them with git replace or grafts.
But remember: each Git repo is a component, so if your giant ClearCase Vob includes several components (group of files), it would be best to separate them in multiple Git repo rather than attempting to create a giant Git one.
I detail that in "ClearCase to Git migration".

Related

Gitlab Project - Merge the Merge request data of 2 separate projects that shared the same source

I have 2 projects that once shared git histories I would like to merge together.
One is "Active" which was just a push of our old existing Git repository and not a formal import. People have been working on this repository for several months now and have processed about 900+ merge requests. Unfortunately this "new" repo is missing past merge requests that happened in the old repo host.
I now have a full proper import "historical" of our old repository from bitbucket. This one has the git history up until we switched to a self-hosted gitlab and has 10k+ merge requests that happened over a few years.
Is it possible with just using the gitlab interface to merge these two repositories or at least their merge request lists together?
The key part for me is just getting the merge requests uh, merged together so there is just 1 repo with 11k+ merge requests and all their comments / summaries in it. The branches, tags and other data is largely irrelevant to me.
My backup plan is to just dump the historical somewhere and just tell people to go search that one if they need to look something older up.

Clearcase multisite vs Gitlab.... anything?

The company I work for uses Clearcase, even though it was EOL'd on the platforms in which we run it years ago. It is ancient and fragile tech, but one thing it does have is a multisite support that allows for the synchronization of air-gapped repos. Because of security issues, we use secure USB sticks to copy packets and take them to the other side, then apply them with scripts.
Developers and DevOps people want to make a business case to migrate to GitLab, but I cannot find any mention of a feature in GitLab that would allow me to do easily do this. There's something about bundles, but the info I have found is years old and it doesn't seem like too many people are using it.
Does GitLab not support this? Simple synchronization of one repo to another over an air gap using some sort of secure media? If so, it's no wonder so many teams are still using ClearCase.
While not exactly easy, air-gap updates of Git repository is possible through the git bundle command.
It produces one file (with all the history, or only the latest commits for an incremental update), that you can:
copy and distribute easily (it is just one file after all)
clone or pull from(!)
This is not tied to GitLab, and can be applied to any Git repository.
From there, I have written before on migration from ClearCase to Git, and I usually:
do not import the full history, only major labels or UCM baselines
split VObs per project, each project being one Git repository
revisit what was versioned in Vobs: some large files/binaries might need to be .gitignore'd in the new Git repository.
You would not "migrate views": they are just workspace (be it static or dynamic). A simple clone of a repository is enough to recreate such a workspace (static here).

Share one file across multiple git repo to be updated by multiple users

I am working on automating the markdown spell check for all the documents on my website which involves multiple git repo. I have a .spelling file that contains all the word to be excluded from the documents. I would like to keep it one file and updated across the entire website. I can get it to work for one repo. I looked into the npm package method. Is there a way to configure package.json to share this file to many repo? Or is there a better way to do it without npm? Thanks!
Make a separate spell-check repository with the .spelling file and script in it, then include it as a submodule in each of your docs repos. You can then reference it from each repository separately, and pull its latest updates into each one.
This could be cumbersome if you have a large number of docs repos, so another alternative is to centralize the spelling check script by making a separate repository for it and adding a configuration file to tell your script which Github repositories to spellcheck. This way, you can selectively apply the spell check process to any number of repositories in your organization.

How to use git namespace to hide branches

Background
I'm working with a large team using git for version control. The normal flow is:
People selecting a ticket from the "backlog queue".
Working on the issue via a local branch (i.e. git checkout -b my_feature_branch).
Making several commits as they go (i.e. git commit).
Pushing local changes to a remote branch in order to "backup" their work so it lives on more than one machine, in case the laptop is damaged or stolen (i.e. git push -u origin my_feature_branch).
Eventually creating a code review on our private github page, and doing a squashed merge from the feature branch to master.
In addition to the remote feature branches created by employees on an as-needed basis, we have several dozen release branches that are used to create the "gold builds" we ship to customers, i.e. 1.00, 1.01, 2.00, 2.01, 2.02, etc.
Problem
Some developers have begun to complain that there are too many branches, and I tend to agree. Some developers haven't been diligent about cleaning up old branches when they are no longer needed (even though github provides a one-button delete feature for this once the code review is complete).
Question
Is there a way to configure our company github deployment so that, when people use git branch via the CLI:
Only our "important/release/gold" branches appear.
The one-off developer (temporary) branches only appear via git branch -a?
The main goal of this is to reduce clutter.
Edit: I found a similar question, but the only answer is not at all applicable (don't use remote branches), which violates my key constraint of allowing people to push to remote branches as a form of data backup. The concept of private namespaces, as hinted by #Mort, seems to be exactly what I'm looking for. Now, how do I accomplish that?
Long story short: you can - but it may be a bit tricky.
You should use the namespace concept (give a look here: gitnamespaces)
Quoting from the docs:
Git supports dividing the refs of a single repository into multiple namespaces, each of which has its own branches, tags, and HEAD. Git can expose each namespace as an independent repository to pull from and push to, while sharing the object store
and
Storing multiple repositories as namespaces of a single repository avoids storing duplicate copies of the same objects, such as when storing multiple branches of the same source.
To activate a namespace you can simply:
export GIT_NAMESPACE=foo
or
git --namespace=foo clone/pull/push
When a namespace is active, through git remote show origin you can see only the remote branches created in the current namespace. If you deactivate it (unset GIT_NAMESPACE), you will see again the main remote branches.
A possible workflow in your situation may be:
Create a feature branch and work on it
export GIT_NAMESPACE=foo
git checkout -b feature_branch
# ... do the work ...
git commit -a -m "Fixed my ticket from backlog"
git push origin feature_branch # (will push into the namespace and create the branch there)
Merging upstream
unset GIT_NAMESPACE
git checkout master
git pull (just to have the latest version)
git merge --squash --allow-unrelated-histories feature_branch
git commit -a -m "Merged feature from backlog"
git push # (will push into the main refs)
The tricky part
Namespace provides a complete isolation of branches, but you need to activate and to deactivate namespace each time
Pay attention
Pay attention when pushing. Git will push in the current namespace. If you are working in the feature branch and you forgot to activate the namespace, when pushing, you will create the feature branch in the main refs.
It seems as if the simplest solution here, since you're using GitHub and a pull-request workflow, is that developers should be pushing to their own fork of the repository rather than to a shared repository. This way their remote feature branches aren't visible to anybody else, so any "clutter" they see will be entirely their own responsibility.
If everything lives in a single repository, another option would be to set up a simple web service that receives notifications from github when you close a pull request (responding to the PullRequest event). You could then have the service delete the source branch corresponding to the pull request.
This is substantially less simple than the previous solution, because it involves (a) writing code and (b) running a persistent service that is (c) accessible to github webooks and that (d) has appropriate permissions on the remote repository.
The first answers are good. If you can fork repositories and use pull-requests or just keep the branches for yourself, do it.
I will however put my 2 cents in case you are in my situation : a lot of WIP branches that you have to push to a single repository since you work on multiple workstations, don't have fork possibilities, and don't want to annoy your fellow developers.
Make branches starting with a specific prefix, i.e. wip/myuser/, fetch from / push to a custom refspec, i.e. refs/x-wip/myuser/*.
Here is a standard remote configuration after a clone:
[remote "origin"]
url = file:///c/temp/remote.git
fetch = +refs/heads/*:refs/remotes/origin/*
To push branches starting with wip/myuser/ to refs/x-wip/myuser/, you will add:
push = refs/heads/wip/myuser/*:refs/x-wip/myuser/*
This will however override the default push rule for the normal branches. To restore it, you will add:
push = refs/heads/*:refs/heads/*
Finally, to fetch you WIP branches that are now outside the conventional refs/heads/* refspec, you will add:
fetch = +refs/x-wip/myuser/*:refs/remotes/origin/wip/myuser/*
You will end up with this 2nd remote configuration:
[remote "origin"]
url = file:///c/temp/remote.git
fetch = +refs/x-wip/myuser/*:refs/remotes/origin/wip/myuser/*
fetch = +refs/heads/*:refs/remotes/origin/*
push = refs/heads/wip/myuser/*:refs/x-wip/myuser/*
push = refs/heads/*:refs/heads/*
(Git evaluates fetch / push rules from the top to the bottom, and stops as soon as one matches; this means you want to order your rules from the most to the less specific rule.)
People using the standard remote configuration will only fetch branches from refs/heads/*, while you will fetch branches from both refs/heads/* and refs/x-wip/myuser/* with the 2nd configuration.
When your branch is ready to be "public", remove the wip/myuser/ prefix.
The refspec internal documentation was useful to make it.
Please note that once you have push rules in your remote configuration, running the command...
git push
... with no arguments will no longer only push your current branch, nor use any strategy defined with the push.default configuration. It will push everything according to your remote push rules.
You will either need to always specify the remote and the branch you want to push, or use an alias as suggested in this answer.

How to set up a git repository where different users can only see certain parts?

How do you set up a git repository where some users can see certain parts of the source code and other users can see all of it? I've seen lots of guides for only giving certain users commit access, but these assume everyone should have read access. I've also heard of gitosis, but I'm not sure it supports this and it hasn't had any commits in over a year so I think it's dead.
In short: you can't. Git is snapshot based (at conceptual level at least) version control system, not changeset based one. It treats project (repository) as a whole. The history is a history of a project, not a union of single-file histories (it is more than joining of per-file histories).
Using hooks like update-paranoid hook in contrib, or VREFs mechanism of gitolite, you can allow or forbid access to repository, you can allow or forbid access to individual branches. You can even forbid any commits that change things in specified subdirectory. But the project is always treated as a whole.
Well, there is one thing you can do: make a directory you want to restrict access to into submodule, and restrict access to this submodule repository.
The native git protocol doesn't support this; git assumes in many places that everybody has a complete copy of all of the history.
That said, one option may be to use git-subtree to split off part of the repository into its own subset repository, and periodically merge back.
Git doesn't support access control on the repository. You can however, implement access control on the repository yourself, by using hooks, more specifically the update hook.
Jörg has already pointed out that you can use hooks to do this. Exactly which hook(s) you need depends on your setup. If you want the permissions on a repo that gets pushed to, you'll need the update hook like he said. However, if it's on a repo that you're actually working in (committing and merging), you'll also need the pre-commit and post-merge hooks. The githooks manpage (Jörg linked to this too) notes that there's in fact a script in the contrib section demonstrating a way to do this. You can get this by grabbing a git tarball, or pull it out of git's gitweb repo: setgitperms.perl. Even if you're only using the update hook, that might be a useful model.
In general, Git is not intended for this. By now it seems to have out-of-the-box access control only up to the repository level.
But if you need just to hide some part of secret information in your Git repository (which is often the case) you can use git-crypt (https://github.com/AGWA/git-crypt) with encryption keys shared based on users GPG keys (https://gnupg.org/).
Alternatively you can use git submodules (https://git-scm.com/book/en/v2/Git-Tools-Submodules) if you can break your codebase to logical parts. Then all users receive access only to certain repositories which you then integrate into 'large' codebase through sub-modules where you add other code and allow it for only 'privileged' users.

Resources