Should I use git-lfs for packages info files? - node.js

As a developer working with several languages, I notice that in most modern languages, dependency metadata files can change a lot.
For instance, in Node.js (which in my opinion is the worst when it comes to package management), a change in dependencies or in the npm (or Yarn) version can lead to huge changes in package-lock.json (or yarn.lock), sometimes with tens of thousands of modified lines.
In Go, the equivalent would be go.sum, which can see significant changes (smaller in magnitude than Node, of course) when modifying dependencies or, at times, when running go mod tidy.
Would it be more efficient to track these dependencies files with git-lfs? Is there a reason not to do it?
Even though they are text files, I know that it is advised to push SVG files with git-lfs, because they are mostly generated files and their diff has no reason to be small when they are regenerated after a change.
Are there studies on what language and what size/age of project make git-lfs worthwhile?

Would it be more efficient to track these dependencies files with git-lfs?
Git does a pretty good job of compressing text files, so initially you probably wouldn't see much gain. If the file is heavily modified often, then over time the total clonable repo size would grow more slowly with Git LFS, but the savings may be negligible as a percentage of the total repo size. The primary use case for LFS is large binary files that change often.
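To get a feel for the first point, here is a rough, self-contained experiment (the file contents and sizes are made up for illustration): commit several revisions of a large generated text file, pack the repository, and see how small the result is compared to the raw content.

```shell
# Simulate package-lock.json churn: one big generated file, small edits
# per revision. Assumes git is installed; everything happens in a temp dir.
repo=$(mktemp -d) && cd "$repo" && git init -q .
seq 1 20000 | sed 's/^/"dep-/; s/$/": "1.0.0",/' > package-lock.json  # ~500 KB
git add package-lock.json
git -c user.email=a@b.c -c user.name=tester commit -qm "rev 0"
for i in 1 2 3; do
  echo "\"extra-$i\": \"1.0.$i\"," >> package-lock.json   # small change per revision
  git add package-lock.json
  git -c user.email=a@b.c -c user.name=tester commit -qm "rev $i"
done
git gc --quiet
# four revisions (~2 MB of raw content) pack down to little more than
# one zlib-compressed copy plus tiny deltas
du -sh .git/objects/pack
```

The exact numbers depend on the file, but the gap between raw history size and packed size is the reason LFS buys you little here.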
Is there a reason not to do it?
If you aren't already using Git LFS, I wouldn't recommend starting for this reason. Also, as far as I know, there is no native built-in support for diffing versions of files stored in LFS, though workarounds exist. If you often find yourself diffing the files you are considering moving into LFS, the nominal storage gain may not be worth it.
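For reference, if you did decide to try it anyway, running `git lfs track "package-lock.json"` writes a rule like the following to .gitattributes; note the `-text` at the end, which treats the pointer file as binary and is precisely why normal in-repo diffs stop working:

```
package-lock.json filter=lfs diff=lfs merge=lfs -text
```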

Related

Managing large quantity of files between two systems

We have a large repository of files that we want to keep in sync between one central location and multiple remote locations. Currently, this is being done using rsync, but it's a slow process mainly because of how long it takes to determine the changes.
My current thought is to find a VCS-like solution where, instead of having to check all of the files, we can check the diffs between revisions to determine what gets sent over the wire. My biggest concern, however, is that we'd have to re-sync all of the files that are currently in sync, which is a significant effort. I've been told that the current repository is about 0.5 TB and consists of a variety of files of different sizes. I understand that an initial commit will most likely take a significant amount of time, but I'd rather avoid the syncing between clusters if possible.
One thing I did look at briefly is git-annex, but my first concern is that it may not like dealing with thousands of files. Also, one thing I didn't see is what would happen if the file already exists on both systems. If I create a repo using git-annex on the central system and then set up repos on the remote clusters, will pushing from central to a remote repo cause it to sync all of the files?
If anyone has alternative solutions/ideas, I'd love to see them.
Thanks.

Why is a fresh install of Haskell-Stack and GHC so large/big?

When doing a fresh install of Haskell Stack through the install script from here:
wget -qO- https://get.haskellstack.org/ | sh
Followed by:
stack setup
you will end up with a $HOME/.stack/ directory about 1.5 GB in size (from just a 120+ MB download). Further, if you run:
stack update
the size increases to 2.5 GB.
I am used to Java, which is usually considered large (it covers pretty much everything and keeps deprecated alternatives for backwards compatibility), but as a comparison: an IDE including a JDK, a standalone JDK, and the JDK source is probably around 1.5 GB in size.
On the other hand, it seems strange to me that Haskell, a "small, beautiful" language (from what I have heard and read, this probably refers mostly to the syntax and semantics, but still), is that large.
Why is it so big (is it related to this question?)?
Is this size normal or have I installed something extra?
If there are several (4?, 5?) flavors of everything, then can I remove all but one?
Are some of the data cache/temporary that can be removed?
The largest directories are: .stack/programs/x86_64-linux/ghc-tinfo6-nopie-8.2.2/lib/ghc-8.2.2 (1.3 GB) and .stack/indices/Hackage (980 MB). I assume the first one are installed packages (and related to stack setup) and the latter is some index over the Hackage package archive (and related to stack update)? Can these be reduced (as above in 3 or grabbing needed Hackage information online)?
As you can probably see by inspection, it is a combination of:
three flavors (static, dynamic, and profiled) of the GHC runtime (about 400 megs total) and the core GHC libraries (another 700 megs total) plus 100 megs of interface files and another 200 megs of documentation and 120 megs of compressed source (1.5 gigs total, all under programs/x86_64-linux/ghc-8.2.2* or similar)
two identical copies of the uncompressed Hackage index 00-index.tar and 01-index.tar, each containing the .cabal file for every version of every package ever published in the Hackage database, each about 457 megs, plus a few other files to bring the total up to 1.0 gigs
The first of these is installed when you run stack setup; the second when you run stack update.
To answer your questions:
It's so big because clearly no one has made any effort to make it smaller, as evidenced by the whole 00-index.tar, 00-index.tar.gz, and 01-index.tar situation.
That's a normal size for a minimum install.
You can remove the profiled versions (the *_p.a files) if you never want to compile a program with profiling. I haven't tested this extensively, but it seems to work; I guess this will save you around 800 megs. You can also remove the static versions (all *.a files) if you only want to dynamically link programs (i.e., using ghc -dynamic). Again, I haven't tested this extensively, but it seems to work. Removing the dynamic versions would be very difficult -- you'd have to find a way to remove only those *.so files that GHC itself doesn't need, and anything you did remove would no longer be loadable in the interpreter.
Several things are cached and you can remove them. For example, you can remove 00-index.tar and 00-index.tar.gz (saving about half a gigabyte), and Stack seems to run fine. It'll recreate them the next time you run stack update, though. I don't think this is documented anywhere, so it'll be a lot of trial and error determining what can be safely removed.
I think this question has already been covered above.
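The removals from points 3 and 4 can be scripted. The sketch below assumes the default ~/.stack layout and, like the advice above, is not tested against every Stack version; it deletes files, so read it before running it.

```shell
# Hedged cleanup sketch for the two caches discussed above.
# STACK_DIR defaults to ~/.stack; pass another path to operate elsewhere.
trim_stack() {
  dir="${1:-$HOME/.stack}"
  # 1) profiled libraries (*_p.a) -- only needed when building with profiling
  if [ -d "$dir/programs" ]; then
    find "$dir/programs" -name '*_p.a' -delete
  fi
  # 2) redundant uncompressed Hackage index copies (~0.5 GB);
  #    'stack update' recreates them when needed
  rm -f "$dir"/indices/Hackage/00-index.tar \
        "$dir"/indices/Hackage/00-index.tar.gz
}
```

Usage would simply be `trim_stack` (or `trim_stack /some/other/.stack` to point it elsewhere).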
Apropos of nothing: the other day I saw a good deal on some 3-terabyte drives, and in my excitement I ordered two before realizing I didn't really have anything to put on them. It kind of puts a few gigabytes in perspective, doesn't it?
I guess I wouldn't expend a lot of effort trying to trim down your .stack directory, at least on a beefy desktop machine. If you're working on a laptop with a relatively small SSD, think about maybe putting your .stack directory on a filesystem that supports transparent compression (e.g., Btrfs), if you think it's likely to get out of hand.

Can git use patch/diff based storage?

As I understand it, git stores a full copy of each file for every revision committed. Even though it's compressed, there's no way that can compete with, say, storing compressed patches against one original full revision. It's especially an issue with poorly compressible binary files like images.
Is there a way to make git use a patch/diff based backend for storing revisions?
I get why the main use case of git does it the way it does but I have a particular use case where I would like to use git if I could but it would take up too much space.
Thanks
Git does use diff based storage, silently and automatically, under the name "delta compression". It applies only to files that are "packed", and packs don't happen after every operation.
git-repack docs:
A pack is a collection of objects, individually compressed, with delta compression applied, stored in a single file, with an associated index file.
Git Internals - Packfiles:
You have two nearly identical 22K objects on your disk. Wouldn’t it be nice if Git could store one of them in full but then the second object only as the delta between it and the first?
It turns out that it can. The initial format in which Git saves objects on disk is called a “loose” object format. However, occasionally Git packs up several of these objects into a single binary file called a “packfile” in order to save space and be more efficient. Git does this if you have too many loose objects around, if you run the git gc command manually, or if you push to a remote server.
Later:
The really nice thing about this is that it can be repacked at any time. Git will occasionally repack your database automatically, always trying to save more space, but you can also manually repack at any time by running git gc by hand.
"The woes of git gc --aggressive" (Dan Farina), which describes that delta compression is a byproduct of object storage and not revision history:
Git does not use your standard per-file/per-commit forward and/or backward delta chains to derive files. Instead, it is legal to use any other stored version to derive another version. Contrast this to most version control systems where the only option is simply to compute the delta against the last version. The latter approach is so common probably because of a systematic tendency to couple the deltas to the revision history. In Git the development history is not in any way tied to these deltas (which are arranged to minimize space usage) and the history is instead imposed at a higher level of abstraction.
Later, quoting Linus, about the tendency of git gc --aggressive to throw out old good deltas and replace them with worse ones:
So the equivalent of "git gc --aggressive" - but done properly - is to do (overnight) something like
git repack -a -d --depth=250 --window=250
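A quick way to watch delta compression happen yourself (assuming git is installed; the file contents are arbitrary): commit two revisions of a file, pack the repository, and inspect the pack. In the verify-pack output, deltified objects carry two extra columns, a delta depth and the SHA of the base object they are stored against.

```shell
# Commit two near-identical revisions, pack them, and inspect the pack.
repo=$(mktemp -d) && cd "$repo" && git init -q .
seq 1 5000 > data.txt                               # ~29 KB of text
git add data.txt
git -c user.email=a@b.c -c user.name=tester commit -qm v1
echo "one more line" >> data.txt                    # tiny change
git add data.txt
git -c user.email=a@b.c -c user.name=tester commit -qm v2
git gc --quiet
# deltified blobs show "depth base-SHA" appended to the usual five columns
git verify-pack -v .git/objects/pack/pack-*.idx
```

One of the two data.txt blobs should be stored in full and the other as a small delta against it, regardless of which revision came first.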

Perforce: how do files get stored with branching?

A very basic question about branching and duplicating resources. I have had discussions like this due to the size of our main branch, but that aside, it is great to know how this really works.
Consider the problem of branching dozens of GB.
What happens when you create a branch of this massive amount of information?
I am reading the official docs here and here, but I am still confused about how the files are stored for each branch on the server.
Say a file A.txt exists in main branch.
When creating the branch (Xbranch), and considering that A.txt won't have changes, will the Perforce server duplicate A.txt (one copy keeping the main changes and another for Xbranch)?
For a massive amount of data this matters, because it would mean duplicating dozens of GB. So how does this really work?
Some notes in addition to Bryan Pendleton's answer (and the questions from it)
To really check your understanding of what is going on, it is good to try with a test repository containing a small number of files, create checkpoints after each major action, and then compare the checkpoints to see which database rows were actually written (as well as having a look at the archive files that the server maintains). This is very quick and easy to set up. You will notice that every branched file generates records in db.integed, db.rev, db.revcx, and db.revhx - let alone any in db.have.
You also need to be aware of which server version you are using as the behavior has been enhanced over time. Check the output of "p4 help obliterate":
Obliterate is aware of lazy copies made when 'p4 integrate' creates
a branch, and does not remove copies that are still in use. Because
of this, obliterating files does not guarantee that the corresponding
files in the archive will be removed.
Some other points:
The default flags for "p4 integrate" to create branches copied the files down to the client workspace and then copied them back to the server with the submit. This took time depending on how many and how big the files were. It has long been possible to avoid this using the -v (virtual) flag, which just creates the appropriate rows on the server and avoids updating the client workspace - usually hugely faster. The possible slight downside is you have to sync the files afterwards to work on them.
Newer releases of Perforce have the "p4 populate" command which does the same as an "integrate -v" but also does not actually require the target files to be mapped into the current client workspace - this avoids the dreaded "no target file(s) in client view" error which many beginners have struggled with! [In P4V this is the "Branch files..." command on right click menu, rather than "Merge/Integrate..."]
Streams has made branching a lot slicker and easier in many ways - well worth reading up on and playing with (the only potential fly in the ointment is a flat 2 level naming hierarchy, and also potential challenges in migrating existing branches with existing relationships into streams)
Task streams are pretty nifty and save lots of space on the server
Obliterate has had an interesting -b flag for a few releases, which makes it possible to quickly and easily remove unchanged branch files - like retro-creating a task stream. It can potentially save millions of database rows in larger installations with lots of branching.
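As a sketch of the two branching styles described in the points above (the depot paths are invented for illustration, and both commands of course need a running Perforce server and a valid client):

```shell
# Classic branch via integrate: -v keeps everything server-side,
# avoiding the copy down to the workspace and back up on submit.
branch_virtually() {
  p4 integrate -v //depot/main/... //depot/Xbranch/... &&
  p4 submit -d "Branch main -> Xbranch"
}

# Newer one-step equivalent: no submit needed, and the target files
# do not have to be mapped into the current client view.
branch_populate() {
  p4 populate -d "Branch main -> Xbranch" //depot/main/... //depot/Xbranch/...
}
```

In both cases the server records lazy copies rather than duplicating the archive content, which is why even huge branches are cheap.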
In general, branching a file does not create a copy of the file's contents; instead, the Perforce server just writes an additional database record describing the new revision, but shares the single copy of the file's contents.
Perforce refers to these as "lazy copies"; you can learn more about them here: http://answers.perforce.com/articles/KB_Article/How-to-Identify-a-Lazy-Copy-of-a-File
One exception is if you use the "+S" filetype modifier, as in this case each branch will have its own copy of the content, so that the +S semantics can be performed properly on each branch independently.

Where is the best place to store your Smarty template cache files?

I'm considering either
/tmp
or
/var/cache
or
some folder in your code
I like /tmp more, because if it grows too much the system will usually take care of it, and it's universally writable, so the code is probably more portable.
On the other hand, I will have to store files in a subfolder of whichever location I choose, so creating that folder and checking whether it exists has to be done for /tmp but not for /var/cache, since /var/cache is not likely to be cleared by Linux or any other common software.
What do you think? What is the best practice?
There are many approaches to storing the Smarty cache and, apparently, no single best one; it is largely a matter of preference.
I can only say that I have witnessed hundreds of projects where Smarty cache was stored in the project's relative folders (for example /projects/cache/compiled/) for a number of reasons:
Full control of the application's cache
Ability to share the same cache amongst several servers
No need to re-create the cache after the system has tidied the /tmp folder
Moreover, we see compiled templates residing inside memcache more and more each day.
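A minimal sketch of the project-relative layout (the directory names follow common Smarty conventions but are otherwise arbitrary; adjust ownership and permissions to your web server's user rather than making the directories world-writable):

```shell
# Create project-relative Smarty working directories.
# templates_c holds compiled templates, cache holds the page cache.
make_smarty_dirs() {
  base="${1:-.}"
  mkdir -p "$base/templates_c" "$base/cache"
  chmod 770 "$base/templates_c" "$base/cache"  # owner + group only
}
```

For example, `make_smarty_dirs /var/www/myproject` creates both directories under the project root, keeping the cache under the application's own control.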