Managing a large quantity of files between two systems - git-annex

We have a large repository of files that we want to keep in sync between one central location and multiple remote locations. Currently, this is being done using rsync, but it's a slow process mainly because of how long it takes to determine the changes.
My current thought is to find a VCS-like solution where, instead of having to check all of the files, we can check the diffs between revisions to determine what gets sent over the wire. My biggest concern, however, is that we'd have to re-sync all of the files that are currently in sync, which would be a significant effort. I've been told that the current repository is about 0.5 TB and consists of a variety of files of different sizes. I understand that an initial commit will most likely take a significant amount of time, but I'd rather avoid re-syncing everything between clusters if possible.
One thing I did look at briefly is git-annex, but my first concern is that it may not like dealing with thousands of files. Also, one thing I didn't see is what would happen if the file already exists on both systems. If I create a repo using git-annex on the central system and then set up repos on the remote clusters, will pushing from central to a remote repo cause it to sync all of the files?
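For reference, a minimal sketch of the setup I have in mind (hostnames, paths, and repository descriptions are just placeholders; I haven't verified how this behaves at our scale):
# on the central system
cd /data/repo
git init
git annex init "central"
git annex add .
git commit -m "initial import"

# on each remote cluster
git clone ssh://central/data/repo /data/repo
cd /data/repo
git annex init "cluster-a"
git annex sync --content     # or 'git annex get .' to fetch the actual file contents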
If anyone has alternative solutions/ideas, I'd love to see them.
Thanks.

Related

Mirroring files from one partition to another on the second disk without RAID1

I am looking for a program for Linux that would allow me to mirror one partition to another disk (something like RAID 1). It doesn't have to be a windowed application; a console application is fine. I just want whatever is in one place to be mirrored to the other.
It would be nice if it were possible to mirror only a specific folder that I care about instead of copying everything from the given partition.
I have looked around on the internet, but it's hard to find something that offers these capabilities, hence this question.
I do not want to use fake RAID on Linux or hardware RAID, because I have read that if the motherboard fails, it is best to have an identical second one to recover the data.
I will be grateful for every suggestion :)
You can check out my "CopyDirFile" script, written in bash, which is available on GitHub.
It can perform a replication (mirroring) task from any source folder to a destination folder (deleting a file in the source folder means deleting it in the destination folder).
The script also allows you to create copy tasks (files deleted in the source folder will not be deleted in the target folder).
Tasks are executed in the background at a specified time, not continuously; the frequency is set by the user when creating the task.
You can also set a task to start automatically when the user logs on.
All the necessary information can be found in the README file in the repository.
If I understood you correctly, I think it meets your requirements.
Linux has standard support for software RAID: mdraid.
It allows you to bundle two disk devices into a RAID 1 device (among other things); you then create a filesystem on top of that device.
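For example, a minimal sketch of creating a RAID 1 array with mdadm (device names are placeholders - double-check them, as mdadm --create is destructive):
# combine two partitions into a RAID 1 array
sudo mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb1 /dev/sdc1

# create a filesystem on the array and mount it
sudo mkfs.ext4 /dev/md0
sudo mount /dev/md0 /mnt/mirror

# record the array so it is assembled on boot (Debian/Ubuntu layout)
sudo mdadm --detail --scan | sudo tee -a /etc/mdadm/mdadm.conf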
LVM offers another way to do software RAID; it doesn't seem to be very popular, but it's certainly supported.
(If your system supports hardware RAID, on the motherboard or with a separate RAID controller, Linux can use that, too, but that doesn't seem to be what you're asking here.)

How to get all changes you made to your config files (since system install) in one shot?

I wonder if there is any way I could retrieve all the changes I have made to my various configuration files since install (residing in /etc and so on) in one shot.
I imagine some kind of loop that uses 'diff' to compare all those files with a 'standard installation' of Ubuntu. The output should be a single file with information about the changes that were made, along with a timestamp.
Perhaps there is even a way to put all that in a script and let it run regularly to automatically keep track of future config file changes.
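Roughly this kind of loop is what I'm imagining; a sketch, assuming a pristine copy of the default config files has been placed at /srv/etc-pristine (that path is just an example):
#!/bin/bash
# compare the live /etc against a pristine reference copy and append the result to a log
REFERENCE=/srv/etc-pristine
LOG=/var/log/etc-changes.diff

{
  echo "=== diff run: $(date -Is) ==="
  diff -ruN "$REFERENCE" /etc
} >> "$LOG"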
If the files are already modified, I guess your only option is to diff your files against a fresh install. Keep in mind that some files might be specific to your computer; I'm thinking of files that can hold device-specific values, like your MAC address (/etc/udev/rules.d/70-persistent-net.rules), your drives' UUIDs (/etc/fstab), etc.
If you're planning this ahead, there are at least two options you can consider:
use a VCS such as git.
use a filesystem that keeps a complete history of the changes made.
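For the VCS option, a minimal sketch of tracking /etc with git (run as root; the etckeeper package wraps this same idea more carefully):
cd /etc
sudo git init
sudo git add .
sudo git commit -m "baseline configuration"

# ...later, after editing config files...
sudo git diff                  # uncommitted changes since the last commit
sudo git log -p --date=iso     # committed history, with timestamps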

Node fs: Working with a HUGE Directory

Picture a directory with a ton of files. As a rough gauge of magnitude I think the most that we've seen so far is a couple of million but it could technically go another order higher. Using node, I would like to read files from this directory, process them (upload them, basically), and then move them out of the directory. Pretty simple. New files are constantly being added while the application is running, and my job (like a man on a sinking ship holding a bucket) is to empty this directory as fast as it's being filled.
So what are my options? fs.readdir is not ideal: it loads all of the filenames into memory, which becomes a problem at this kind of scale. Especially as new files are being added all the time, it would require repeated calls. (As an aside for anybody referring to this in the future, there is something being proposed to address this whole issue which may or may not have been realised within your timeline.)
I've looked at the myriad of fs drop-ins (graceful-fs, chokidar, readdirp, etc.), none of which has this particular use case within its remit.
I've also come across a couple of people suggesting that this can be handled with child_process, and there's a wrapper called inotifywait which tasks itself with exactly what I am asking, but I really don't understand how this addresses the underlying problem, especially at this scale.
I'm wondering if what I really need to do is find a way to just get the first file (or, realistically, batch of files) from the directory without having the overhead of reading the entire directory structure into memory. Some sort of stream that could be terminated after a certain number of files had been read? I know Go has a parameter for reading the first n files from a directory but I can't find a node equivalent, has anybody here come across one or have any interesting ideas? Left-field solutions more than welcome at this point!
You can use your operating system's file-listing command and stream the result into Node.js.
For example in Linux:
// spawn 'ls' and consume its stdout as a stream, rather than cp.exec,
// which buffers the whole listing in memory (and hits maxBuffer at this scale)
var cp = require('child_process');
var ls = cp.spawn('ls');
ls.stdout.on('data', function (chunk) {
  // each chunk is a Buffer holding a batch of newline-separated names
  console.log(chunk.toString());
});
RunKit: https://runkit.com/aminanadav/57da243180f3bb140059a31d
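If you only want a bounded batch rather than the whole listing, the same idea can be combined with head: once head has printed its quota it exits, the pipe closes, and the lister stops early instead of enumerating the entire directory (a sketch - not tested at millions of files):
# first 1000 entries, unsorted
ls -U | head -n 1000

# or restricted to plain files
find . -maxdepth 1 -type f | head -n 1000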

Perforce: How do files get stored with branching?

A very basic question about branching and duplicating resources. I have had discussions like this because of the size of our main branch, but setting that aside, it would be great to know how this really works.
Consider the problem of branching dozens of GB.
What happens when you create a branch of this massive amount of information?
I am reading the official docs here and here, but I am still confused about how the files are stored for each branch on the server.
Say a file A.txt exists in the main branch.
When creating the branch (Xbranch), and considering A.txt won't have any changes, will the Perforce server duplicate A.txt (one copy keeping the main changes and another for Xbranch)?
For a massive amount of data this matters, because it would mean duplicating dozens of GB. So how does this really work?
Some notes in addition to Bryan Pendleton's answer (and the questions from it)
To really check your understanding of what is going on, it is good to try a test repository with a small number of files, create checkpoints after each major action, and then compare the checkpoints to see which database rows were actually written (as well as having a look at the archive files that the server maintains). This is very quick and easy to set up. You will notice that every branched file generates records in db.integed, db.rev, db.revcx and db.revhx - let alone any in db.have.
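A rough sketch of that experiment (assuming the server binaries are installed locally; the port and paths are arbitrary, and the exact checkpoint numbering may differ):
# throwaway server in an empty directory
mkdir p4test && cd p4test
p4d -r . -p 1666 -d                  # start a local test server in the background
export P4PORT=localhost:1666

# ...create a client, add and submit a few files, then branch them...

p4 admin checkpoint                  # writes checkpoint.1, checkpoint.2, ... into the server root
# after the next action, checkpoint again and compare:
diff checkpoint.1 checkpoint.2 | grep -E '@db\.(rev|revcx|revhx|integed|have)@'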
You also need to be aware of which server version you are using, as the behavior has been enhanced over time. Check the output of "p4 help obliterate":
Obliterate is aware of lazy copies made when 'p4 integrate' creates
a branch, and does not remove copies that are still in use. Because
of this, obliterating files does not guarantee that the corresponding
files in the archive will be removed.
Some other points:
The default flags for "p4 integrate" to create branches copied the files down to the client workspace and then copied them back to the server with the submit. This took time depending on how many and how big the files were. It has long been possible to avoid this using the -v (virtual) flag, which just creates the appropriate rows on the server and avoids updating the client workspace - usually hugely faster. The possible slight downside is you have to sync the files afterwards to work on them.
Newer releases of Perforce have the "p4 populate" command which does the same as an "integrate -v" but also does not actually require the target files to be mapped into the current client workspace - this avoids the dreaded "no target file(s) in client view" error which many beginners have struggled with! [In P4V this is the "Branch files..." command on right click menu, rather than "Merge/Integrate..."]
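A rough sketch of the three variants mentioned above (the depot paths are just examples):
# classic branch: files come down to the workspace and go back up on submit
p4 integrate //depot/main/... //depot/Xbranch/...
p4 submit -d "Branch main to Xbranch"

# virtual integrate: server-side records only, no workspace churn
p4 integrate -v //depot/main/... //depot/Xbranch/...
p4 submit -d "Branch main to Xbranch (virtual)"

# newer servers: a single step, and the target need not be in the client view
p4 populate -d "Branch main to Xbranch" //depot/main/... //depot/Xbranch/...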
Streams have made branching a lot slicker and easier in many ways - well worth reading up on and playing with (the only potential fly in the ointment is the flat two-level naming hierarchy, plus potential challenges in migrating existing branches with existing relationships into streams).
Task streams are pretty nifty and save lots of space on the server
Obliterate has had an interesting -b flag for a few releases, which makes it possible to quickly and easily remove unchanged branch files - somewhat like retroactively creating a task stream. It can potentially save millions of database rows in larger installations with lots of branching.
In general, branching a file does not create a copy of the file's contents; instead, the Perforce server just writes an additional database record describing the new revision, but shares the single copy of the file's contents.
Perforce refers to these as "lazy copies"; you can learn more about them here: http://answers.perforce.com/articles/KB_Article/How-to-Identify-a-Lazy-Copy-of-a-File
One exception is if you use the "+S" filetype modifier, as in this case each branch will have its own copy of the content, so that the +S semantics can be performed properly on each branch independently.

Is it OK (performance-wise) to have hundreds or thousands of files in the same Linux directory?

It's well known that in Windows a directory with too many files will have terrible performance when you try to open one of them. I have a program that is to run only on Linux (currently it's on Debian Lenny, but I don't want to be specific about the distro) and that writes many files to the same directory (which acts somewhat as a repository). By "many" I mean tens each day, meaning that after one year I expect to have something like 5000-10000 files. They are meant to be kept (once a file is created, it's never deleted) and it is assumed that the hard disk has the required capacity (if not, it should be upgraded). Those files have a wide range of sizes, from a few KB to tens of MB (but not much more than that). The names are always numeric values, incrementally generated.
I'm worried about long-term performance degradation, so I'd ask:
Is it OK to write all to the same directory? Or should I think about creating a set of subdirectories for every X files?
Should I require a specific filesystem to be used for such directory?
What would be the more robust alternative? Specialized filesystem? Which?
Any other considerations/recommendations?
It depends very much on the file system.
ext2 and ext3 have a hard limit of 32,000 files per directory. This is somewhat more than you are asking about, but close enough that I would not risk it. Also, ext2 and ext3 will perform a linear scan every time you access a file by name in the directory.
ext4 supposedly fixes these problems, but I cannot vouch for it personally.
XFS was designed for this sort of thing from the beginning and will work well even if you put millions of files in the directory.
So if you really need a huge number of files, I would use XFS or maybe ext4.
Note that no file system will make "ls" run fast if you have an enormous number of files (unless you use "ls -f"), since "ls" will read the entire directory and then sort the names. A few tens of thousands is probably not a big deal, but a good design should scale beyond what you think you need at first glance...
For the application you describe, I would probably create a hierarchy instead, since it is hardly any additional coding or mental effort for someone looking at it. Specifically, you can name your first file "00/00/01" instead of "000001".
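A sketch of that naming scheme in shell (the zero-padding width and split points are arbitrary choices):
id=123456                             # the incrementally generated numeric name
padded=$(printf '%06d' "$id")         # -> 123456
path="${padded:0:2}/${padded:2:2}/${padded:4:2}"   # -> 12/34/56
mkdir -p "$(dirname "$path")"
echo "store the data as $path instead of $id"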
If you use a filesystem without directory-indexing, then it is a very bad idea to have lots of files in one directory (say, > 5000).
However, if you've got directory indexing (which is enabled by default on more recent distros in ext3), then it's not such a problem.
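To see whether an existing ext3/ext4 filesystem has directory indexing, something like this should work (the device name is a placeholder):
# look for 'dir_index' in the feature list
sudo tune2fs -l /dev/sdXN | grep -i features

# enable it on an existing filesystem; new directories get hashed indexes,
# and e2fsck -D on the unmounted filesystem rebuilds existing ones
sudo tune2fs -O dir_index /dev/sdXN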
However, having many files in one directory does break quite a few tools (for example, "ls" will stat() all the files, which takes a long time). You can probably easily split it into subdirectories.
But don't overdo it. Don't use many levels of nested subdirectory unnecessarily, this just uses lots of inodes and makes metadata operations slower.
I've seen more cases of "too many levels of nested directories" than I've seen of "too many files per directory".
The best solution I have for you (rather than quoting some values from a micro-filesystem-benchmark) is to test it yourself.
Just use the file system of your choice. Create some random test data for 100, 1000 and 10000 entries. Then, measure the time it takes your system to perform the action you are concerned about time-wise (opening a file, reading 100 random files, etc).
Then, you compare the times and use the best solution (put them all into one directory; put each year into a new directory; put each month of each year into a new directory).
I do not know in detail what you are using, but creating a directory is a one time (and probably quite easy) operation, so why not do it instead of changing filesystems or trying some other more time-consuming stuff?
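A rough sketch of such a test in bash (the counts and the access pattern are arbitrary - adapt them to what your application actually does):
for n in 100 1000 10000; do
  dir="bench_$n"
  mkdir -p "$dir"
  for i in $(seq 1 "$n"); do : > "$dir/file_$i"; done    # create empty test files

  echo "=== $n files ==="
  # time opening 100 files chosen at random from the directory
  time for i in $(shuf -i "1-$n" -n 100); do cat "$dir/file_$i" > /dev/null; done
done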
In addition to the other answers, if the huge directory is managed by a known application or library, you could consider replacing it with something else, e.g.:
a GDBM index file; GDBM is a very common library providing indexed files, which associate an arbitrary value (a sequence of bytes) with an arbitrary key (another sequence of bytes).
perhaps a table inside a database like MySQL or PostgreSQL. Be careful about indexing.
some other way to index data
The advantages of the above approaches include:
space performance for a large collection of small items (less than a kilobyte each): a filesystem needs an inode for each item, whereas indexed systems have much less per-item overhead
time performance: you don't access the filesystem for every item
scalability: indexed approaches are designed to fit large needs; either a GDBM index file or a database can handle many millions of items. I'm not sure your directory approach will scale as easily.
The disadvantage of such approaches is that the items don't show up as files. But as MarkR's answer reminds you, "ls" behaves quite poorly on huge directories.
If you stick with a filesystem approach, note that much software using large numbers of files organizes them in subdirectories like aa/ ab/ ac/ ... ay/ az/ ba/ ... bz/ ...
Is it OK to write all to the same directory? Or should I think about creating a set of subdirectories for every X files?
In my experience, the only slowdown caused by a directory with many files is when you do things such as getting a listing with ls. But that is mostly the fault of ls; there are faster ways of listing the contents of a directory using tools such as echo and find (see below).
Should I require a specific filesystem to be used for such directory?
I don't think so with regard to the number of files in one directory. I am sure some filesystems perform better with many small files in one directory whilst others do a better job with huge files. It's also a matter of personal taste, akin to vi vs. emacs. I prefer to use the XFS filesystem, so that'd be my advice. :-)
What would be the more robust alternative? Specialized filesystem? Which?
XFS is definitely robust and fast; I use it in many places: as a boot partition, for Oracle tablespaces, as space for source control, you name it. It lacks a bit on delete performance, but otherwise it's a safe bet. Plus it supports growing the filesystem while it is still mounted (that's actually a requirement): you just delete the partition, recreate it at the same starting block with an ending block larger than the original partition's, then run xfs_growfs on it with the filesystem mounted.
Any other considerations/recommendations?
See above. With the addition that having 5000 to 10000 files in one directory should not be a problem. In practice it doesn't arbitrarily slow down the filesystem as far as I know, except for utilities such as "ls" and "rm". But you could do:
find . -maxdepth 1 -type f | xargs echo
find . -maxdepth 1 -type f | xargs rm
The benefit of a directory tree with files, such as a directory "a" for file names starting with "a" and so on, is mostly cosmetic: it looks more organised. But then you have less of an overview... So what you're trying to do should be fine. :-)
I neglected to say you could consider using something called "sparse files" http://en.wikipedia.org/wiki/Sparse_file
It is bad for performance to have a huge number of files in one directory. Checking for the existence of a file will typically require an O(n) scan of the directory. Creating a new file will require that same scan with the directory locked to prevent the directory state changing before the new file is created. Some file systems may be smarter about this (using B-trees or whatever), but the fewer ties your implementation has to the filesystem's strengths and weaknesses the better for long term maintenance. Assume someone might decide to run the app on a network filesystem (storage appliance or even cloud storage) someday. Huge directories are a terrible idea when using network storage.

Resources