Linux file synchronization between computers

I'm looking for software that will allow me to synchronize files in specific folders between my Linux boxes. I have searched a lot of topics and what I've found is Unison. It looks pretty good, but it is no longer under development and does not let me see file change history.
So the question is: what is the best Linux file synchronizer that:
1. (required) will synchronize only selected folders
2. (required) will synchronize computers at a given time (for example, every hour)
3. (required) will be intelligent: it will remember what was deleted and when, and will ask me if I want to delete it on the remote machine too
4. (optional) will keep track of changes and allow me to see the history of changes
5. (optional) will be multiplatform

Rsync is probably the de facto standard.
I see Unison is based on the rsync algorithm -- I'm not sure whether rsync alone can achieve number 3 above.
Also, see this article with detailed information about rsync, including available GUIs for it.
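For example, a minimal rsync invocation covering requirement 1 (only selected folders) might look like the sketch below; the folder names and host are placeholders, and note that plain rsync will not ask before deleting (requirement 3):

    # Mirror two selected folders to a remote box (hypothetical paths/host).
    # -a archive mode, -v verbose, -z compress; --delete mirrors deletions
    # silently, so keep --dry-run until the output looks right.
    rsync -avz --delete --dry-run \
        ~/Documents/projects ~/Pictures/work \
        user@otherbox:/home/user/sync/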

While I agree that rsync is the de facto Swiss Army knife for Linux users, I found two other projects more interesting, especially for my use case: I have two workstations in different locations plus a laptop, all three machines used for work, so I felt the pain here. The first is a really nice project called Syncthing:
https://syncthing.net/
I run it on a public server with VPN access, to which my machines are always connected, and it simply works. It has a GUI for monitoring purposes (basic, but enough info is available).
The second is paid, but offers similar functionality with more built in on top:
https://www.resilio.com/

Osync is probably what you're looking for (see http://www.netpower.fr/osync ).
Osync is actually rsync based, but it will handle number 3 above without trouble.
Number 4, keeping track of modified files, can be more or less achieved by adding the --verbose parameter, which will log file updates.
Actually, only number 5 won't work: Osync runs on most Unix flavors, but not on Windows.
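For requirement 2 (running at a given time), any of these rsync-based tools can simply be driven from cron; a sketch of an hourly crontab entry, with made-up paths:

    # crontab -e  -- run the sync script at the top of every hour,
    # appending output to a log so there is at least a crude change history.
    0 * * * * /home/user/bin/sync-folders.sh >> /home/user/sync.log 2>&1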

Related

Best practices for maintaining configuration of embedded linux system for production

I have a small single board computer which will be running a linux distribution and some programs and has specific user configuration, directory structure, permissions settings etc.
My question is, what is the best way to maintain the system configuration for release? In my time thinking about this problem I've thought of a few ideas but each has its downsides.
Configure the system and burn the image to an iso file for distribution
This one has the advantage that the system will be configured precisely the way I want it, but committing an iso file to a repository is less than desirable since it is quite large and checking out a new revision means reflashing the system.
Install a base OS (which is version locked) and write a shell script to configure the settings from scratch.
This one has the advantage that I could maintain the script in a repository and roll out config changes by pulling the updated script and running it again; however, now I have to maintain a shell script that configures a system, and it's another place where something can go wrong.
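As a rough illustration of the second approach, the script can be kept small and idempotent so re-running it after pulling changes is safe; the user name, paths, and config file below are invented for the example:

    #!/bin/sh
    # configure-target.sh -- hypothetical post-install configuration script
    set -eu   # stop on the first error or on an unset variable

    # Create the application user only if it does not exist yet
    id appuser >/dev/null 2>&1 || useradd --system --create-home appuser

    # Directory layout and permissions
    mkdir -p /opt/app /var/log/app
    chown appuser:appuser /opt/app /var/log/app
    chmod 750 /opt/app

    # Drop in config files kept in the same repository as this script
    cp ./etc/app.conf /etc/app.conf
    chmod 644 /etc/app.conf

    echo "configuration applied"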
I'm wondering what the best practices are in embedded in general so that I can maybe implement a good deployment and maintenance strategy.
Embedded systems tend to have a long lifetime. Do not be surprised if you need to refer to something released today in ten years' time. Make an ISO of the whole setup, source code, diagrams, everything... and store it away redundantly. Someone will be glad you did a decade from now. Just pretend it's going to last forever and that you'll have to answer a question or research a defect in ten years.
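A sketch of that archiving step, assuming genisoimage (or the equivalent mkisofs) is available; the directory and file names are just examples:

    # Bundle source, diagrams, toolchain notes and the built image into one
    # ISO, then record a checksum so corruption is detectable years later.
    genisoimage -o release-2024-01.iso -R -J -V "PROJECT_RELEASE" ./release-bundle/
    sha256sum release-2024-01.iso > release-2024-01.iso.sha256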

How to determine that the shell script is safe

I downloaded this shell script from this site.
It's suspiciously large for a bash script, so I opened it in a text editor and noticed
that after the code there are a lot of nonsense characters.
I'm afraid to give the script execute permission with chmod +x jd.sh. Can you advise me how to recognize whether it's safe, or how to give it only limited rights on the system?
Thank you.
The "non-sense characters" indicate binary files that are included directly into the SH file. The script will use the file itself as a file archive and copy/extract files as needed. That's nothing unusual for an SH installer. (edit: for example, makeself)
As with other software, it's virtually impossible to decide wether or not running the script is "safe".
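A few read-only commands can help you look at such a script before deciding anything; none of the first three execute it (jd.sh is the file name from the question):

    file jd.sh            # confirms it is a script with binary data appended
    head -n 60 jd.sh      # the human-readable installer logic sits at the top
    strings jd.sh | less  # skim the embedded payload for readable hints
    # If it really is a makeself archive it should also answer
    #   sh jd.sh --info
    #   sh jd.sh --list
    # without installing anything, but that does run the wrapper script
    # itself, so only try it once you already half-trust the file.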
Don't run it! That site is blocked where I work, because it's known to serve malware.
Now, as to verifying code, it's not really possible without isolating it completely (technically difficult, but a VM might serve if it has no known vulnerabilities) and running it to observe what it actually does. A healthy dose of mistrust is always useful when using third-party software, but of course nobody has time to verify all the software they run, or even a tiny fraction of it. It would take thousands (more likely millions) of work years, and would find enough bugs to keep developers busy for another thousand years. The best you can usually do is run only software which has been created or at least recommended by someone you trust at least somewhat. Trust has to be determined according to your own criteria, but here are some which would count in the software's favor for me:
Part of a major operating system/distribution. That means some larger organization has decided to trust it.
Source code is publicly available. At least any malware caused by company policy (see Sony CD debacle) would have a bigger chance of being discovered.
Source code is distributed on an appropriate platform. Sites like GitHub enable you to gauge the popularity of software and keep track of what's happening to it, while a random web site without any commenting features, version control, or bug database is an awful place to keep useful code.
While the source of the script does not seem trustworthy (IP address?), this might still be legit. With shell scripts it is possible to append binary content at the end and thus build a type of installer. Years ago, Sun would ship the JDK for Solaris in exactly that form. I don't know if that's still the case, though.
If you want to test it without risk, I'd install Linux in VirtualBox (free virtual-machine software), run the script there, and see what it does.
Addendum on seeing what it does: there are a variety of tools on UNIX that you can use to analyze a binary program, like strace, ptrace, ltrace. What might also be interesting is running the script using chroot. That way you can easily find all the files that are installed.
But at the end of the day this will probably yield more binary files which are not easy to examine (as probably any developer of anti-virus software will tell you). Therefore, if you don't trust the source at all, don't run it. Or if you must run it, do it in a VM where at least it won't be able to do too much damage or access any of your data.
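Inside such a throwaway VM, a rough way to watch what the script touches (the log path is just an example):

    # Log every file-related syscall of the script and its children,
    # then review which paths it tried to read, write, or remove.
    strace -f -o /tmp/jd-trace.log -e trace=file sh ./jd.sh
    grep -E 'open|unlink|rename|chmod' /tmp/jd-trace.log | less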

NodeJS: How would one watch a large amount of files/folders on the server side for updates?

I am working on a small NodeJS application that essentially serves as a browser-based desktop search for a LAN-based server that multiple users can query. The users on the LAN all have access to a shared folder on that server and are used to just placing files in that folder to share with everyone, and I want to keep that process the same.
The first solution I came across was fs.watchFile, which has been touched on in other Stack Overflow questions. In the first question, user Ivo Wetzel noted that on a Linux system fs.watchFile uses inotify, but was of the opinion that fs.watchFile should not be used for large numbers of files/folders.
In another question about fs.watchFile, user tjameson first reiterated that on Linux inotify would be used by fs.watchFile and recommended just using a combination of node-inotify-plusplus and node-walk, but again stated this method should not be used for a large number of files. In a comment and response he suggested only watching the modified times of directories and then rescanning the relevant directory for file changes.
My biggest hurdle seems to be that even with tjameson's suggestion there is still a hard limit to the number of folders monitored (of which there are many, and growing). It would also have to be done recursively, because the directory tree is somewhat deep and can also change at the lower branches, so I would have to monitor the following at every folder level (or, alternatively, monitor the modified times of the folders and then scan to find out what happened):
creation of file or subfolder
deletion of file or subfolder
move of file or subfolder
deletion of self
move of self
Assuming inotify has limits in line with what was said above, this alone seems like it may be too many watches when I have a significant number of nested subfolders. The really awesome way looks like it would involve kqueue, which I subsequently found as a topic of discussion about a better fs.watchFile in a Google group.
It seems clear to me that keeping a database of the relevant file and folder information is the appropriate course of action on the query side of things, but keeping that database synchronized with the actual state of the file system under the directories of concern will be the challenge.
So what does the community think? Is there a better or well known solution for attacking this problem that I am just unaware of? Is it best just to watch all directories of interest for a single change e.g. modified time and then scan to find out what happened? Is it better to watch all the relevant inotify alerts and modify the database appropriately? Is this not a problem which is solvable by a peasant like me?
Have a look at monit. I use it to monitor files for changes in my dev environment and restart my node processes when relevant project files change.
I recommend you take a look at the Dropbox API.
I implemented something similar with Ruby on the client side and Node.js on the server side.
The best approach is to keep hashes to check whether the files or folders changed.
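A minimal sketch of that hashing idea with plain shell tools (the shared folder path is hypothetical); a Ruby or Node implementation would follow the same compare-two-snapshots pattern:

    # Snapshot every file's hash under the share, diff against the previous
    # snapshot to see what was added, removed, or changed, then rotate.
    cd /srv/shared
    touch /tmp/state.old
    find . -type f -print0 | xargs -0 sha256sum | sort -k2 > /tmp/state.new
    diff /tmp/state.old /tmp/state.new || true   # differences = changes
    mv /tmp/state.new /tmp/state.old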

Good Secure Backups Developers at Home [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Questions asking us to recommend or find a tool, library or favorite off-site resource are off-topic for Stack Overflow as they tend to attract opinionated answers and spam. Instead, describe the problem and what has been done so far to solve it.
Closed 8 years ago.
What is a good, secure, method to do backups, for programmers who do research & development at home and cannot afford to lose any work?
Conditions:
The backups must ALWAYS be within reasonably easy reach.
Internet connection cannot be guaranteed to be always available.
The solution must be either FREE or priced within reason, and subject to 2 above.
Status Report
This is for now only considering free options.
The following open-source projects are suggested in the answers (here & elsewhere):
BackupPC is a high-performance, enterprise-grade system for backing up Linux, WinXX and MacOSX PCs and laptops to a server's disk.
Storebackup is a backup utility that stores files on other disks.
mybackware: These scripts were developed to create SQL dump files for basic disaster recovery of small MySQL installations.
Bacula is [...] to manage backup, recovery, and verification of computer data across a network of computers of different kinds. In technical terms, it is a network based backup program.
AutoDL 2 and Sec-Bk: AutoDL 2 is a scalable, transport-independent automated file transfer system. It is suitable for uploading files from a staging server to every server on a production server farm [...] Sec-Bk is a set of simple utilities to securely back up files to a remote location, even a public storage location.
rsnapshot is a filesystem snapshot utility for making backups of local and remote systems.
rbme: Using rsync for backups [...] you get perpetual incremental backups that appear as full backups (for each day) and thus allow easy restore or further copying to tape etc.
Duplicity backs up directories by producing encrypted tar-format volumes and uploading them to a remote or local file server. [...] uses librsync [for] incremental archives.
simplebup, to do real-time backup of files under active development, as they are modified. This tool can also be used for monitoring other directories. It is intended as on-the-fly automated backup, not as version control. It is very easy to use.
Other Possibilities:
Using a Distributed Version Control System (DVCS) such as Git(/Easy Git), Bazaar, Mercurial answers the need to have the backup available locally.
Use free online storage space as a remote backup, e.g.: compress your work/backup directory and mail it to your gmail account.
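If you go the free-online-storage route, it is worth encrypting before anything leaves the machine; a sketch with standard tools (the directory name and output file are up to you):

    # Compress the work directory and encrypt it symmetrically before
    # mailing or uploading it anywhere; gpg prompts for a passphrase.
    tar czf - ~/work | gpg --symmetric --cipher-algo AES256 \
        -o work-$(date +%F).tar.gz.gpg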
Strategies
See crazyscot's answer
I prefer http://www.jungledisk.com/ .
It's based on Amazon S3, cheap, multiplatform, multiple machines with a single license.
usb hard disk + rsync works for me
(see here for a Win32 build)
Scott Hanselman recommends Windows Home Server in his aptly titled post
The Case of the Failing Disk Drive or Windows Home Server Saved My Marriage.
First of all: keeping backups off-site is as important for individuals as it is for businesses. If your house burns down, you don't want to lose everything.
This is especially true because it is so easy to accomplish. Personally, I have an external USB hard disk I keep at my father's house. Normally it is hooked up to his internet connection and I back up over the net (using rsync), but when I need to back up really big things, I collect it and copy things over USB. Ideally, I should get another disk to spread the risk.
Other options are free online storage facilities (use encryption!).
For security, just use TrueCrypt. It has a good name in the IT world, and seems to work very well.
Depends on which platform you are running on (Windows/Linux/Mac/...?)
As a platform independent way, I use a personal subversion server. All the valuables are there, so if I lose one of the machines, a simple 'svn checkout' will take things back. This takes some initial work, though, and requires discipline. It might not be for you?
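For reference, the moving parts of such a personal Subversion setup are small; the host name and paths below are invented:

    # On the server: create the repository once.
    svnadmin create /srv/svn/work

    # On each machine: check out, then commit as part of the daily routine.
    svn checkout svn+ssh://myserver/srv/svn/work ~/work
    cd ~/work
    svn add --force .             # pick up any new files
    svn commit -m "daily snapshot"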
As a second backup for the non-svn stuff, I use Time Machine, which is built-in to OS X. Simply great. :)
I highly recommend www.mozy.com. Their software is easy and works great, and since it's stored on their servers you implicitly get off-site backups. No worrying about running a backup server and making sure it's working. Also, the company is backed by EMC (a leading data storage company), which gives me enough confidence to trust them.
I'm a big fan of Acronis True Image. Make sure you rotate through a few backup HDDs so you have a few generations to go back to, or in case one of the backups goes bang. If it's a major milestone, I snail-mail a set of DVDs to Mum and she files 'em for me. She lives in a different state, so it should cover most disasters of less-than-biblical proportions.
EDIT: Acronis has encryption via a password. I also find the bandwidth of snail mail to be somewhat infinite: 10 GB overnight works out to roughly 115 KB/s, give or take. Never been throttled by Australia Post.
My vote goes for cloud storage of some kind. The problem with nearly all 'home' backups is that they stay in the home, which means any catastrophic damage to the system being backed up will probably damage the backups as well (fire, flood, etc.). My requirements would be:
1) automated - manual backups get forgotten, usually just when most needed
2) off-site - see above
3) multiple versions - that is backup to more than one thing, in case that one thing fails.
As a developer, usually data sizes for backup are relatively small so a couple of free cloud backup accounts might do. They also often fulfil part 1 as they can usually be automated. I've heard good things about www.getdropbox.com/.
The other advantage of more than 1 account is you could have one on 'daily sync' and another on 'weekly sync' to give you some history. This is nowhere near as good as true incremental backups.
Personally I prefer a scripted backup to local hard drives, which I rotate to work as 'off-sites'. This is in large part due to my hobby (photography) and thus my relatively lame internet upstream bandwidth not coping with the data volume.
Take home message - don't rely on one solution and don't assume that your data is not important enough to think about the issues as deeply as the 'Enterprise' does.
Buy a fire-safe.
This is not just a good idea for storing backups, but a good idea period.
Exactly what media you put in it is the subject of other answers here.
But, from the perspective of recovering from a fire, having a washable medium is good. As long as the temperature doesn't get too high CDs and DVDs seem reasonably resilient, although I'd be concerned about smoke damage.
Ditto for hard-drives.
A flash drive does have the benefit that there are no moving parts to be damaged and you don't need to be concerned about the optical properties.
mozy.com is king. I started using it just to backup code and then ponied up the 5 bux a month to backup my personal pictures and other stuff that I'd rather not lose if the house burns down. The initial backup can take a little while but after that you can pretty much forget about it until you need to restore something.
Get an external hard drive with a network port so you can keep your backups in another room which provides a little security against fire in addition to being a simple solution you can do yourself at home.
The next step is to get storage space in some remote location (there are very cheap monthly prices for servers for example) or to have several external hard drives and periodically switch between the one at home and a remote location. If you use encryption, this can be anywhere such as a friend's or parents' place or work.
Bacula is good software: it's open source, and it gives good performance comparable to commercial software. It's a bit difficult to configure the first time, but not too hard, and it has good documentation.
I second the vote for JungleDisk. I use it to push my documents and project folders to S3. My average monthly bill from amazon is about 20c.
All my projects are in Subversion on an external host.
As well as this, I am on a Mac, so I use SuperDuper to take a nightly image of my drive. I am sure there are good options in the Windows/Linux world.
I have two external drives that I rotate on a weekly basis, and I store one of the drives off-site during its week off.
This means that I am only ever 24 hours away from an image in case of failure, and only 7 days from an image in case of catastrophic failure (fire, theft). The ability to plug the drive into a machine and be running instantly from the image has saved me immensely. My boot partition was corrupted during a power failure (not a hardware failure, luckily). I plugged the backup in, restored, and was working again in the time it took to transfer the files off the external drive.
Another vote for mozy.com
You get 2 GB for free, or $5/month gives you unlimited backup space. Backups can occur on a timed basis, or when your PC/Mac is not busy. It's encrypted during transit and storage.
You can retrieve files via the built-in software, through the web, or pay for a DVD to be burned and posted back.
If you feel like syncing to the cloud and don't mind the initial, beta, 2GB cap, I've fallen in love with Dropbox.
It has versions for Windows, OSX, and Linux, works effortlessly, keeps files versioned, and works entirely in the background based on when the files changed (not a daily schedule or manual activations).
Ars Technica and Joel Spolsky have both fallen in love with the service (though the love seems strong with Spolsky, but let's pretend!), if the word of a random internet geek is not enough.
These are interesting times for "the personal backup question".
There are several schools of thought now:
Frequent Automated Local Backup + Periodic Local Manual Backup
Automated: scheduled nightly backup to an external drive.
Manual: copy to a second external drive once per week / month / year / oops-forgot and drop it off at "Mom's house".
Lots of software in this field, but here are a few: there's rsync and Time Machine on Mac, and DeltaCopy (www.aboutmyip.com/AboutMyXApp/DeltaCopy.jsp) for Windows.
Frequent Remote Backup
There are a pile of services that enable you to back up across your internet connection to a remote data centre. Amazon's S3 service + JungleDisk's client software is a strong choice these days; not the cheapest option, but you pay for what you use, and Amazon's track record suggests that as a company it will be in business as long as or longer than any other storage provider hanging out a shingle today.
Did I mention it should be encrypted? Props to JungleDisk for handling the "encryption issue" and future-proofing (an open source library to interoperate with Jungle Disk) pretty well.
All of the above.
Some people call it being paranoid ... others think to themselves "Ahhh, I can sleep at night now".
Also, it's more fault-tolerance than backup, but you should check out Drobo - basically it's dead simple RAID that seems to work quite well.
Here are the features I'd look out for:
As near to fully automatic as possible. If it relies on you to press a button or run a program regularly, you will get bored and eventually stop bothering. An hourly cron job takes care of this for me; I rsync to the 24x7 server I run on my home net.
Multiple removable backup media so you can keep some off site (and/or take one with you when you travel). I do this with a warm-pluggable SATA drive bay and a cron job which emails me every week to remind me to change drives.
Strongly encrypted media, in case you lose one. The linux encrypted device support (cryptsetup et al) does this for me.
Some sort of point-in-time recovery, but consider carefully what resolution you want. Daily might be enough - having multiple backup media probably gets you this - or you might want something more comprehensive like Apple's Time Machine. I've used some careful rsync options with my removable drives: every day creates a fresh snapshot directory, but files which are unchanged from the previous day are hard linked instead of copied, to save space.
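That hard-linked snapshot trick can be done with rsync's --link-dest; a bare-bones sketch with made-up paths (on the very first run the "latest" link will not exist yet, which rsync only warns about):

    #!/usr/bin/env bash
    # Daily snapshot: unchanged files become hard links into yesterday's
    # snapshot, so each day costs only the space of what actually changed.
    today=$(date +%F)
    dest=/mnt/backup/$today
    latest=/mnt/backup/latest
    rsync -a --delete --link-dest="$latest" ~/ "$dest"
    ln -sfn "$dest" "$latest"     # point "latest" at the new snapshot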
Or simply set up a Gmail account and mail it to yourself :) Unless you're a bit paranoid about Google knowing about your stuff, since you said research. It doesn't help you much with structure and such, but it's free, offers big storage, and is off-site, so it's quite safe.
If you use OS X 10.5 or above then the cost of Time Machine is the cost of an external hard drive. Not only that, but the interface is dead simple to use. Open the folder you wish to recover, click on the time machine icon, and browse the directory as if it was 1999 all over again!
I haven't tried to encrypt it, but I imagine you could use truecrypt.
Yes this answer was posted quite some time after the question was asked, however I believe it should help those who stumble across this posting in the future (like I did).
Set up a Linux or xBSD server:
- Set up a source control system of your choice on it.
- Mirror RAID (RAID 1) at minimum.
- Daily (or even hourly) backups to external drive[s].
From the server you could also set up an automatic off-site backup. If the internet is out, you'd still have your external drive, and it would just auto-sync once the connection comes back.
Once it's set up, it should be about zero work.
You don't need anything "fancy" for off-site backup. Get a web host that allows storing non-web data, and sync via SFTP or rsync over SSH. Store the data on the other end in a TrueCrypt container if you're paranoid.
If you work for an employer/contractor, also ask them. Most places already have something in place, or will let you work with their IT.
My vote goes to dirvish (for Linux). It uses rsync as a backend but is very easy to configure.
It makes automatic, periodic, differential backups of directories. The big benefit is that it creates hard links to all files not changed since the last backup, so restore is easy: just copy the last created directory back, instead of restoring all diffs one after another like other differential backup tools need to do.
I have the following backup scenarios and use rsync scripts to store on USB and network shares.
(weekly) Windows backup for "bare metal" recovery
Content of System drive C:\ using Windows Backup for quick recovery after physical disk failure, as I don't want to reinstall Windows and applications from scratch. This is configured to run automatically using Windows Backup schedule.
(daily and conditional) Active content backup using rsync
Rsync takes care of all changed files from the laptop, phone, and other devices. I back up the laptop every night and after significant changes in content, like importing recent photo RAWs from an SD card to the laptop.
I've created a bash script that I run from Cygwin on Windows to start rsync: https://github.com/paravz/windows-rsync-backup
If you're using deduplication, STAY AWAY from JungleDisk. Their restore client makes a mess of the reparse point and makes the file unusable. You can hopefully fix it in safe mode with:
fsutil reparsepoint delete <filename>

What's the best way to keep multiple Linux servers synced?

I have several different locations in a fairly wide area, each with a Linux server storing company data. This data changes every day in different ways at each different location. I need a way to keep this data up-to-date and synced between all these locations.
For example:
In one location someone places a set of images on their local server. In another location, someone else places a group of documents on their local server. A third location adds a handful of both images and documents to their server. In two other locations, no changes are made to their local servers at all. By the next morning, I need the servers at all five locations to have all those images and documents.
My first instinct is to use rsync and a cron job to do the syncing over night (1 a.m. to 6 a.m. or so), when none of the bandwidth at our locations is being used. It seems to me that it would work best to have one server be the "central" server, pulling in all the files from the other servers first. Then it would push those changes back out to each remote server? Or is there another, better way to perform this function?
The way I do it (on Debian/Ubuntu boxes):
Use dpkg --get-selections to get your installed packages
Use dpkg --set-selections to install those packages from the list created
Use a source control solution to manage the configuration files. I use git in a centralized fashion, but subversion could be used just as easily.
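Roughly, those steps look like this (repository location and paths are placeholders):

    # Capture the package list on the configured machine...
    dpkg --get-selections > packages.list

    # ...and replay it on a fresh machine.
    sudo dpkg --set-selections < packages.list
    sudo apt-get dselect-upgrade

    # Keep the configuration files (or the parts of /etc you care about)
    # in git alongside that list.
    cd /etc && sudo git init && sudo git add . && sudo git commit -m "baseline"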
An alternative if rsync isn't the best solution for you is Unison. Unison works under Windows and it has some features for handling when there are changes on both sides (not necessarily needing to pick one server as the primary, as you've suggested).
Depending on how complex the task is, either may work.
One thing you could (theoretically) do is create a script using Python or something and the inotify kernel feature (through the pyinotify package, for example).
You can run the script, which registers to receive events on certain trees. Your script could then watch directories, and then update all the other servers as things change on each one.
For example, if someone uploads spreadsheet.doc to the server, the script sees it instantly; if the document doesn't get modified or deleted within, say, 5 minutes, the script could copy it to the other servers (e.g. through rsync)
A system like this could theoretically implement a sort of limited 'filesystem replication' from one machine to another. Kind of a neat idea, but you'd probably have to code it yourself.
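The same idea can be sketched with inotifywait from inotify-tools instead of pyinotify; the watched path and target host are hypothetical, and syncing on every single event is deliberately naive:

    #!/usr/bin/env bash
    # Watch a tree recursively and mirror it to another server on change.
    # -m streams events, -r recurses, -e limits the event types.
    inotifywait -m -r -e close_write -e create -e delete -e move /srv/shared |
    while read -r dir event name; do
        echo "$(date +%T) $event $dir$name"
        rsync -az --delete /srv/shared/ user@other-server:/srv/shared/
    done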
AFAIK, rsync is your best choice; it supports partial file updates among a variety of other features. Once set up, it is very reliable. You can even set up the cron job with timestamped log files to track what is updated in each run.
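A rough sketch of the cron-driven, central-server approach with a timestamped log per run; host names and paths are placeholders, and note it only merges additions (deletions and same-name conflicts still need a policy, as other answers point out):

    #!/usr/bin/env bash
    # Runs on the "central" server from cron (e.g. at 1 a.m.): pull from
    # every location first, then push the merged result back out.
    set -eu
    mkdir -p /var/log/sync
    log=/var/log/sync/run-$(date +%F-%H%M).log
    sites="site1 site2 site3 site4 site5"

    for host in $sites; do                      # pull phase
        rsync -az "$host":/srv/data/ /srv/data/ >> "$log" 2>&1
    done
    for host in $sites; do                      # push phase
        rsync -az /srv/data/ "$host":/srv/data/ >> "$log" 2>&1
    done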
I don't know how practical this is, but a source control system might work here. At some point (perhaps each hour?) during the day, a cron job runs a commit, and overnight, each machine runs a checkout. You could run into issues with a long commit not being done when a checkout needs to run, and essentially the same thing could be done with rsync.
I guess what I'm thinking is that a central server would make your sync operation easier - conflicts can be handled once on central, then pushed out to the other machines.
rsync would be your best choice. But you need to carefully consider how you are going to resolve conflicts between updates to the same data on different sites. If site-1 has updated 'customers.doc' and site-2 has a different update to the same file, how are you going to resolve it?
I have to agree with Matt McMinn, especially since it's company data, I'd use source control, and depending on the rate of change, run it more often.
I think the central clearinghouse is a good idea.
It depends on the following:
* How many servers/computers need to be synced?
** If there are too many servers, using rsync becomes a problem.
** Either you use threads and sync to multiple servers at the same time, or you sync one after the other. In the first case you are looking at high load on the source machine; in the latter case, inconsistent data on the servers (in a cluster) at a given point in time.
* Size of the folders that need to be synced and how often they change
** If the data is huge, then rsync will take time.
* Number of files
** If the number of files is large, and especially if they are small files, rsync will again take a lot of time.
So it all depends on the scenario: whether to use rsync, NFS, or version control.
If there are fewer servers and just a small amount of data, then it makes sense to run rsync every hour. You can also package content into an RPM if the data changes only occasionally.
With the information provided, IMO version control will suit you best. Rsync/scp might give problems if two people upload different files with the same name. NFS over multiple locations needs to be architected with perfection.
Why not have a single repository (or multiple repositories) and have everyone just commit to those? All you need to do is keep the repository in sync. If the data is huge and updates are frequent, then your repository server will need a good amount of RAM and a good I/O subsystem.
