What's the best way to keep multiple Linux servers synced? - linux

I have several different locations in a fairly wide area, each with a Linux server storing company data. This data changes every day in different ways at each different location. I need a way to keep this data up-to-date and synced between all these locations.
For example:
In one location someone places a set of images on their local server. In another location, someone else places a group of documents on their local server. A third location adds a handful of both images and documents to their server. In two other locations, no changes are made to their local servers at all. By the next morning, I need the servers at all five locations to have all those images and documents.
My first instinct is to use rsync and a cron job to do the syncing overnight (1 a.m. to 6 a.m. or so), when none of the bandwidth at our locations is being used. It seems to me that it would work best to have one server be the "central" server, pulling in all the files from the other servers first and then pushing those changes back out to each remote server. Or is there another, better way to perform this function?

The way I do it (on Debian/Ubuntu boxes):
Use dpkg --get-selections to get your installed packages
Use dpkg --set-selections to load that list on the other machine, then apt-get dselect-upgrade to install those packages
Use a source control solution to manage the configuration files. I use git in a centralized fashion, but subversion could be used just as easily.

An alternative if rsync isn't the best solution for you is Unison. Unison works under Windows and it has some features for handling when there are changes on both sides (not necessarily needing to pick one server as the primary, as you've suggested).
Depending on how complex the task is, either may work.

One thing you could (theoretically) do is create a script using Python or something and the inotify kernel feature (through the pyinotify package, for example).
You can run the script, which registers to receive events on certain trees. Your script could then watch directories, and then update all the other servers as things change on each one.
For example, if someone uploads spreadsheet.doc to the server, the script sees it instantly; if the document doesn't get modified or deleted within, say, 5 minutes, the script could copy it to the other servers (e.g. through rsync).
A system like this could theoretically implement a sort of limited 'filesystem replication' from one machine to another. Kind of a neat idea, but you'd probably have to code it yourself.
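To make the "wait 5 minutes, then copy" part concrete, here is a minimal sketch of just the bookkeeping, not a full replication tool. SettleTracker and its method names are made up for illustration; hooking modified() and deleted() up to real inotify events (e.g. via pyinotify) and calling rsync on the result are left out.

```python
import time


class SettleTracker:
    """Remember when each path last changed and report paths that have gone quiet."""

    def __init__(self, quiet_seconds=300.0, clock=time.monotonic):
        self.quiet_seconds = quiet_seconds
        self.clock = clock  # injectable so the logic can be tested without waiting
        self._last_change = {}  # path -> time of the most recent event

    def modified(self, path):
        # Call this from the event handler whenever a file is created or written.
        self._last_change[path] = self.clock()

    def deleted(self, path):
        # A deleted file no longer needs to be copied anywhere.
        self._last_change.pop(path, None)

    def pop_settled(self):
        """Return (and forget) every path that has been quiet long enough."""
        now = self.clock()
        ready = sorted(
            path for path, stamp in self._last_change.items()
            if now - stamp >= self.quiet_seconds
        )
        for path in ready:
            del self._last_change[path]
        return ready
```

A loop would call pop_settled() every so often and hand the returned paths to whatever does the copying.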

AFAIK, rsync is your best choice. It supports partial file updates, among a variety of other features. Once set up, it is very reliable. You can even set up the cron job with timestamped log files to track what is updated in each run.
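As a rough sketch of the timestamped-log idea, assuming the cron job calls a small Python wrapper rather than rsync directly: build_rsync_command, the paths and the log naming scheme are all made up for illustration; the rsync options themselves (-a, -z, --partial, --log-file) are standard.

```python
from datetime import datetime


def build_rsync_command(source, destination, log_dir, now=None):
    """Return an rsync argument list that writes to a timestamped log file."""
    now = now or datetime.now()
    log_file = f"{log_dir}/rsync-{now:%Y%m%d-%H%M%S}.log"
    return [
        "rsync",
        "-a",          # archive mode: recurse and preserve permissions, times, etc.
        "-z",          # compress data in transit
        "--partial",   # keep partially transferred files if the run is interrupted
        f"--log-file={log_file}",
        source,
        destination,
    ]
```

The returned list can be passed to subprocess.run(..., check=True), so each nightly run leaves its own log behind.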

I don't know how practical this is, but a source control system might work here. At some point (perhaps each hour?) during the day, a cron job runs a commit, and overnight, each machine runs a checkout. You could run into issues with a long commit not being done when a checkout needs to run, and essentially the same thing could be done with rsync.
I guess what I'm thinking is that a central server would make your sync operation easier - conflicts can be handled once on central, then pushed out to the other machines.

rsync would be your best choice. But you need to carefully consider how you are going to resolve conflicts between updates to the same data on different sites. If site-1 has updated 'customers.doc' and site-2 has a different update to the same file, how are you going to resolve it?
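One way to at least detect that situation before anything gets overwritten is to compare each site against the state from the last sync. This is only a sketch under assumptions: find_conflicts is a made-up helper, and it expects you to already have a mapping of file name to content hash for the last synced state and for each site.

```python
def find_conflicts(baseline, site_a, site_b):
    """Return names of files that both sites changed, in different ways.

    Each argument maps a file name to a content hash (any stable digest).
    A file missing from a mapping is treated as having the hash None.
    """
    conflicts = []
    for name in sorted(set(site_a) | set(site_b)):
        base = baseline.get(name)
        a = site_a.get(name)
        b = site_b.get(name)
        changed_on_a = a != base
        changed_on_b = b != base
        # Only a problem when both sides moved away from the baseline
        # and did not end up with identical content.
        if changed_on_a and changed_on_b and a != b:
            conflicts.append(name)
    return conflicts
```

Files that come back from this need a human (or a policy such as "keep both copies") to decide; everything else can be synced automatically.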

I have to agree with Matt McMinn. Especially since it's company data, I'd use source control and, depending on the rate of change, run it more often.
I think the central clearinghouse is a good idea.

It depends on the following:
* How many servers/computers need to be synced?
** If there are too many servers, using rsync becomes a problem.
** Either you use threads and sync to multiple servers at the same time, or you sync them one after the other. So you are looking at either high load on the source machine in the former case, or inconsistent data across the servers (in a cluster) at any given point in time in the latter case.
* The size of the folders that need to be synced and how often they change
** If the data is huge, rsync will take time.
* The number of files
** If the number of files is large, and especially if they are small files, rsync will again take a lot of time.
So it all depends on the scenario whether to use rsync, NFS, or version control.
* If there are few servers and just a small amount of data, it makes sense to run rsync every hour.
* You can also package the content into an RPM if the data changes only occasionally.
With the information provided, IMO version control will suit you best.
* Rsync/scp might cause problems if two people upload different files with the same name.
* NFS over multiple locations needs to be architected very carefully.
Why not have one or more repositories that everyone just commits to?
All you need to do is keep the repositories in sync.
If the data is huge and updates are frequent, your repository server will need a good amount of RAM and a good I/O subsystem.
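On the "threads versus one after the other" trade-off mentioned above, a middle ground is to cap how many pushes run at once. A minimal sketch, assuming a hypothetical push(server) callable that does the actual transfer (for example by running rsync) and raises on failure:

```python
from concurrent.futures import ThreadPoolExecutor


def push_to_all(servers, push, max_parallel=2):
    """Call push(server) for every server, at most max_parallel at a time.

    Returns a dict mapping each server to True on success, or to the
    exception that push raised for it.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        futures = {server: pool.submit(push, server) for server in servers}
        for server, future in futures.items():
            error = future.exception()  # blocks until this push has finished
            results[server] = True if error is None else error
    return results
```

A small max_parallel keeps the load on the source machine down, at the cost of a longer window during which the servers disagree.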

Related

Best practices for maintaining configuration of embedded linux system for production

I have a small single board computer which will be running a linux distribution and some programs and has specific user configuration, directory structure, permissions settings etc.
My question is, what is the best way to maintain the system configuration for release? In my time thinking about this problem I've thought of a few ideas but each has its downsides.
Configure the system and burn the image to an iso file for distribution
This one has the advantage that the system will be configured precisely the way I want it, but committing an iso file to a repository is less than desirable since it is quite large and checking out a new revision means reflashing the system.
Install a base OS (which is version locked) and write a shell script to configure the settings from scratch.
This one has the advantage that I could maintain the script in a repository and apply any config changes by pulling changes to the script and running it again. However, now I have to maintain a shell script to configure a system, and it's another place where something can go wrong.
I'm wondering what the best practices are in embedded in general so that I can maybe implement a good deployment and maintenance strategy.
Embedded systems tend to have a long lifetime. Do not be surprised if you need to refer to something released today in ten years' time. Make an ISO of the whole setup, source code, diagrams, everything... and store it away redundantly. Someone will be glad you did a decade from now. Just pretend it's going to last forever and that you'll have to answer a question or research a defect in ten years.

What solutions are there to backup millions of image files and sub-directories on a webserver efficiently?

I have a website that I host on a Linux VPS which has been growing over the years. One of its primary functions is to store images/photos and these image files are typically around 20-40kB each. The way the site is organised at the moment is that all images are stored in a root folder ‘photos’ and under that root folder are many subfolders determined by a random filename. For example, one image could have a file name abcdef1234.jpg and that would be stored in the folder photos/ab/cd/ef/. The advantage of this is that there are no directories with excessive numbers of images in them and accessing files is quick.
However, the entire photos directory is huge and is set to grow. I currently have almost half a million photos in tens of thousands of sub-folders and whilst the system works fine, it is fairly cumbersome to back up. I need advice on what I could do to make life easier for back-ups. At the moment, I am backing up the entire photos directory each time and I do that by compressing the folder and downloading it. It takes a while and puts some strain on the server. I do this because every FTP client I use takes ages to sift through all the files and find the most recent ones by date.
Also, I would like to be able to restore the entire photo set quickly in the event of a catastrophic webserver failure, so even if I could back up the data incrementally, how cumbersome would it be to have to upload each backup stage by stage?
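For readers following along, the directory layout described above can be sketched like this. shard_path and its levels/width parameters are made-up names, assuming three levels of two characters each taken from the start of the file name, as in the abcdef1234.jpg example.

```python
import posixpath


def shard_path(filename, root="photos", levels=3, width=2):
    """Map 'abcdef1234.jpg' to 'photos/ab/cd/ef/abcdef1234.jpg'."""
    stem = posixpath.splitext(filename)[0]
    if len(stem) < levels * width:
        raise ValueError("file name too short to shard")
    # Take consecutive fixed-width slices of the name as directory levels.
    parts = [stem[i * width:(i + 1) * width] for i in range(levels)]
    return posixpath.join(root, *parts, filename)
```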
Does anyone have any suggestions, perhaps from experience? I am not a webserver administrator and my experience of Linux is very limited. I have also looked into CDNs and Amazon S3, but this would require a great deal of change to my site in order to make these systems work. Perhaps I’ll use something like this in the future.
Since you indicated that you run a VPS, I assume you have shell access which gives you substantially more flexibility (as opposed to a shared webhosting plan where you can only interact with a web frontend and an FTP client). I'm pretty sure that rsync is specifically designed to do what you need to do (sync large numbers of files between machines, and do so efficiently).
This gets into Superuser territory, so you might get more advice over on that forum.

using torrents to back up vhd's

Hi it's a question and it may be redundant but I have a hunch there is a tool for this - or there should be and if there isn't I might just make it - or maybe I am barking up the wrong tree in which case correct my thinking:
But my problem is this: I am looking for some way to migrate large virtual disk drives off a server once a week via an internet connection of only moderate speed, in a solution that must be able to be throttled for bandwidth because the internet connection is always in use.
I thought about it and the problem is familiar: large files that can be moved, that can also be throttled, and that can easily survive disconnection and reconnection, etc. The only solution I am familiar with that does all this perfectly is torrents.
Is there a way to automatically make torrents and automatically "send" them to a client download list remotely? I am working on a Windows Hyper-V host but I use only Linux for the guests and I could easily cook up a guest to do the copying, so consider it a Windows or Linux problem.
PS: the vhds are "offline" copies of guest servers by the time I am moving them - consider them merely 20-30 GB dumb files.
PPS: I'd rather avoid spending money
BitTorrent is an excellent choice, as it handles both incremental updates and automatic resume after connection loss very well.
To create a .torrent file automatically, use the btmakemetafile script found in the original BitTorrent package, or one from the numerous rewrites (BitTornado, ...) -- all that matters is that it's scriptable. You should take care to set the "disable DHT" flag in the .torrent file.
You will need to find a tracker that allows you to track files with arbitrary hashes (because you do not know these in advance); you can either use an existing open tracker, or set up your own, but you should take care to limit the client IP ranges appropriately.
This reduces the problem to transferring the .torrent files -- I usually use rsync via ssh from a cronjob for that.
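If you'd rather see what such a script writes, here is a minimal sketch of building a single-file .torrent by hand. It is not the code of any existing tool; bencode and make_metainfo are made-up names, the tracker URL is whatever you run, and I'm assuming the "disable DHT" flag corresponds to the private key in the info dictionary (BEP 27).

```python
import hashlib


def bencode(value):
    """Encode ints, bytes/str, lists and dicts in BitTorrent's bencoding."""
    if isinstance(value, int):
        return b"i%de" % value
    if isinstance(value, str):
        value = value.encode("utf-8")
    if isinstance(value, bytes):
        return b"%d:%s" % (len(value), value)
    if isinstance(value, list):
        return b"l" + b"".join(bencode(item) for item in value) + b"e"
    if isinstance(value, dict):
        # Keys are byte strings and must appear in sorted order.
        items = sorted(
            ((k.encode("utf-8") if isinstance(k, str) else k, v) for k, v in value.items()),
            key=lambda pair: pair[0],
        )
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in items) + b"e"
    raise TypeError(f"cannot bencode {type(value).__name__}")


def make_metainfo(path, name, tracker_url, piece_length=262144):
    """Build the bytes of a single-file .torrent with the private flag set."""
    pieces = []
    length = 0
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(piece_length)
            if not chunk:
                break
            length += len(chunk)
            pieces.append(hashlib.sha1(chunk).digest())  # one SHA-1 per piece
    info = {
        "name": name,
        "length": length,
        "piece length": piece_length,
        "pieces": b"".join(pieces),
        "private": 1,  # asks clients not to use DHT or peer exchange
    }
    return bencode({"announce": tracker_url, "info": info})
```

Writing the returned bytes to a file gives you the .torrent that then gets shipped to the receiving side.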
For point to point transfers, torrent is an expensive use of bandwidth. For 1:n transfers it is great as the distribution of load allows the client's upload bandwidth to be shared by other clients, so the bandwidth cost is amortised and everyone gains...
It sounds like you have only one client in which case I would look at a different solution...
wget allows for throttling and can resume transfers where it left off if the FTP/http server supports resuming transfers... That is what I would use
You can use rsync for that (http://linux.die.net/man/1/rsync). Search for the --partial option in man and that should do the trick. When a transfer is interrupted the unfinished result (file or directory) is kept. I am not 100% sure if it works with telnet/ssh transport when you send from local to a remote location (never checked that) but it should work with rsync daemon on the remote side.
You can also use that for sync in two local storage locations.
rsync --partial [-r for directories] source destination
edit: Just confirmed that it also works over ssh, so the doubt stated above no longer applies.
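To illustrate what "keep the unfinished result and carry on from there" means in principle, here is a toy sketch for a local copy. It is not how rsync or wget implement resuming: resume_copy is a made-up helper, and it simply assumes the existing destination file is an intact prefix of the source, without verifying it.

```python
import os


def resume_copy(source, destination, chunk_size=1024 * 1024):
    """Append the missing tail of source to destination; return bytes copied."""
    already = os.path.getsize(destination) if os.path.exists(destination) else 0
    copied = 0
    with open(source, "rb") as src, open(destination, "ab") as dst:
        src.seek(already)  # skip what an earlier, interrupted run already wrote
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            dst.write(chunk)
            copied += len(chunk)
    return copied
```

For a 20-30 GB file over a flaky link, the point is that an interruption costs you only the current chunk, not the whole transfer.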

Linux file synchronization between computers

I'm looking for software which will allow me to synchronize files in specific folders between my Linux boxes. I have searched a lot of topics and what I've found is Unison. It looks pretty good, but it is not under development anymore and does not allow me to see file change history.
So the question is - what is the best linux file synchronizer, that:
(required) will synchronize only selected folders
(required) will synchronize computers at given time (for example each hour)
(required) will be intelligent - will remember what was deleted and when and will ask me if I want to delete it on remote machine too.
(optionally) will keep track of changes and allow to see history of changes
(optionally) will be multiplatform
Rsync is probably the de facto standard.
I see Unison uses an rsync-like algorithm for its transfers -- not sure if Rsync alone can achieve number 3 above.
Also, see this article with detailed information about rsync, including available GUIs for it.
While I agree Rsync is the de facto Swiss Army knife for Linux users, I found 2 other projects more interesting, especially for my use case: I have 2 workstations in different locations and a laptop, all 3 machines used for work, so I felt the pain here. I found a really nice project called:
https://syncthing.net/
I run it on a public server with VPN access, to which my machines are always connected, and it simply works. It has a GUI for monitoring purposes (basic, but enough info is available).
The second is a paid product with similar functionality built in:
https://www.resilio.com/
Osync is probably what you're looking for (see http://www.netpower.fr/osync )
Osync is actually rsync based but will handle number 3 above without trouble.
Number 4, keeping track of modified files can be more or less achieved by adding --verbose parameter which will log file updates.
Actually, only number 5 won't work. Osync runs on most Unix flavors but not on Windows.
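On requirement 3 in general: a tool can only tell "deleted here" apart from "new over there" if it remembers what existed after the previous run. A minimal sketch of that idea follows; it is not how Osync or Unison actually implement it, and plan_sync and the plan keys are made-up names.

```python
def plan_sync(previous, local, remote):
    """Classify each path by comparing both sides with the last synced state.

    All three arguments are sets of relative paths. `previous` is the set of
    paths that existed on both machines after the last successful sync.
    """
    return {
        # Never seen before and only on one side: a new file to copy across.
        "copy_to_remote": sorted(local - remote - previous),
        "copy_to_local": sorted(remote - local - previous),
        # Present last time and still on the other side: it was deleted here,
        # so ask the user before deleting it over there too.
        "confirm_delete_on_remote": sorted((previous & remote) - local),
        "confirm_delete_on_local": sorted((previous & local) - remote),
    }
```

After a successful run, the new common set of paths is saved and becomes `previous` for the next hourly sync.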

NodeJS: How would one watch a large amount of files/folders on the server side for updates?

I am working on a small NodeJS application that essentially serves as a browser-based desktop search for a LAN-based server that multiple users can query. The users on the LAN all have access to a shared folder on that server and are used to just placing files within that folder to share them with everyone, and I want to keep that process the same.
The first solution I came across was fs.watchFile, which has been touched on in other Stack Overflow questions. In the first question, user Ivo Wetzel noted that on a Linux system fs.watchFile uses inotify, but was of the opinion that fs.watchFile should not be used for large numbers of files/folders.
In another question about fs.watchFile, user tjameson first reiterated that on Linux inotify would be used by fs.watchFile and recommended just using a combination of node-inotify-plusplus and node-walk, but again stated this method should not be used for a large number of files. In a comment and response he suggested only watching the modified times of directories and then rescanning the relevant directory for file changes.
My biggest hurdles seem to be that even with tjameson's suggestion there is still a hard limit to the number of folders monitored (of which there are many and growing). Also it would have to be done recursively because the directory tree is somewhat deep and can also be subject to change at the lower branches so I would have to monitor the following at every folder level (or alternatively monitor the modified time of the folders and then scan to find out what happened):
creation of file or subfolder
deletion of file or subfolder
move of file or subfolder
deletion of self
move of self
Assuming inotify has limits in line with what was said above, this alone seems to me like it may be too many monitors when I have a significant number of nested subfolders. The really awesome way looks like it would involve kqueue, which I subsequently found as a topic of discussion on a better fs.watchFile in a Google group.
It seems clear to me that keeping a database of the relevant file and folder information is the appropriate course of action on the query side of things, but keeping that database synchronized with the actual state of the file system under the directories of concern will be the challenge.
So what does the community think? Is there a better or well known solution for attacking this problem that I am just unaware of? Is it best just to watch all directories of interest for a single change e.g. modified time and then scan to find out what happened? Is it better to watch all the relevant inotify alerts and modify the database appropriately? Is this not a problem which is solvable by a peasant like me?
Have a look at monit. I use it to monitor files for changes in my dev environment and restart my node processes when relevant project files change.
I recommend taking a look at the Dropbox API.
I implemented something similar with ruby on the client side and nodejs on the server side.
The best approach is to keep hashes to check if the files or folders changed.
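A minimal sketch of the hash idea, written in Python rather than the Ruby/NodeJS I mentioned: snapshot and diff_snapshots are made-up names, and SHA-256 is just one reasonable choice of digest.

```python
import hashlib
import os


def snapshot(root):
    """Return {relative_path: sha256 hex digest} for every file under root."""
    hashes = {}
    for folder, _subdirs, files in os.walk(root):
        for name in files:
            full = os.path.join(folder, name)
            digest = hashlib.sha256()
            with open(full, "rb") as handle:
                # Read in blocks so large files do not have to fit in memory.
                for block in iter(lambda: handle.read(65536), b""):
                    digest.update(block)
            hashes[os.path.relpath(full, root)] = digest.hexdigest()
    return hashes


def diff_snapshots(old, new):
    """Return (added, removed, changed) lists of relative paths."""
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    changed = sorted(p for p in set(old) & set(new) if old[p] != new[p])
    return added, removed, changed
```

Store the last snapshot in your database, take a new one on a schedule, and apply the diff to the index. Hashing every file is expensive on a big tree, so in practice you would probably only rehash files whose size or modified time changed.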
