s3cmd sync "remote copy" style operation in rsync or alternatives? - linux

s3cmd sync has a very useful "remote copy" feature: any duplicate copies of a file in the source directory are transferred to the S3 bucket only once, and a remote copy is made on the S3 side for the rest, reducing the bandwidth used for the transfer.
I have been searching for a similar way to transfer files between two Linux servers. I've used rsync many times in the past; it doesn't look like it has an option for this, but perhaps I have missed something.
Simple example:
/sourcedir/dir1/filea
/sourcedir/dir1/fileb
/sourcedir/dir1/filec
/sourcedir/dir2/filea
/sourcedir/dir2/filed
/sourcedir/dir2/filee
/sourcedir/dir3/filea
/sourcedir/dir3/filef
/sourcedir/dir3/fileg
With a typical transfer, filea would be transferred across the network 3 times.
I'd like to transfer this file only once and have the remote server copy the file twice to restore it to the correct directories on the other side.
I need to perform a sync on a large directory with many duplicates in the fastest time possible.
I know it would be possible to script a solution to this, but if anyone knows an application with this native functionality then that would be great!
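In case there is no ready-made tool, this is roughly the kind of script I imagine (a rough sketch only: the host, paths and the one-rsync/ssh-call-per-file structure are just for illustration, and it assumes passwordless SSH):

    #!/bin/bash
    # Sketch: send each unique file once, recreate duplicates remotely.
    # SRC, DEST and REMOTE are placeholders.
    SRC=/sourcedir
    DEST=/destdir
    REMOTE=user@remotehost

    declare -A first_copy   # checksum -> relative path of first file with that content

    while IFS= read -r -d '' f; do
        rel=${f#"$SRC"/}
        sum=$(md5sum "$f" | awk '{print $1}')
        if [[ -z ${first_copy[$sum]} ]]; then
            first_copy[$sum]=$rel
            # First occurrence: transfer it over the network, preserving its path.
            rsync -aR "$SRC/./$rel" "$REMOTE:$DEST/"
        else
            # Duplicate: copy the already-transferred file into place on the remote side.
            ssh "$REMOTE" "mkdir -p '$DEST/$(dirname "$rel")' && cp '$DEST/${first_copy[$sum]}' '$DEST/$rel'"
        fi
    done < <(find "$SRC" -type f -print0)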
Thanks

Related

Managing large quantity of files between two systems

We have a large repository of files that we want to keep in sync between one central location and multiple remote locations. Currently, this is being done using rsync, but it's a slow process mainly because of how long it takes to determine the changes.
My current thought is to find a VCS-like solution where instead of having to check all of the files, we can check the diffs between revisions to determine what gets sent over the wire. My biggest concern, however, is that we'd have to re-sync all of the files that are currently in sync, which is a significant effort. I've been told that the current repository is about 0.5 TB and consists of a variety of files of different sizes. I understand that an initial commit will most likely take a significant amount of time, but I'd rather avoid a full re-sync between the clusters if possible.
One thing I did look at briefly is git-annex, but my first concern is that it may not like dealing with thousands of files. Also, one thing I didn't see is what would happen if the file already exists on both systems. If I create a repo using git-annex on the central system and then set up repos on the remote clusters, will pushing from central to a remote repo cause it to sync all of the files?
If anyone has alternative solutions/ideas, I'd love to see them.
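For reference, this is roughly the git-annex setup I'm imagining (host names and paths here are made up), in case it helps frame the question:

    # On the central system: create the repo and annex the existing files.
    cd /data/repo
    git init
    git annex init "central"
    git annex add .
    git commit -m "initial import"

    # On a remote cluster: clone from central and pull the file contents.
    git clone central:/data/repo /data/repo
    cd /data/repo
    git annex init "cluster1"
    git annex sync --content origin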
Thanks.

Mirroring files from one partition to another on the second disk without RAID1

I am looking for a program that would let me mirror one partition to another disk (something like RAID 1) on Linux. It doesn't have to be a graphical application; a console application is fine. I just want whatever is in one place to be mirrored to the other.
It would also be nice if I could mirror only a specific folder I care about instead of copying everything from the partition.
I've been searching online, but it's hard to find anything that offers this, hence the question.
I do not want to use fake RAID on Linux or hardware RAID, because I've read that if the motherboard fails, it's best to have an identical second one to recover the data.
I'd be grateful for any suggestions :)
You can check out my script "CopyDirFile", written in bash and available on GitHub.
It can perform a replication (mirroring) task from any source folder to a destination folder (deleting a file in the source folder also deletes it in the destination folder).
The script also lets you create copy tasks, where files deleted in the source folder are not deleted in the target folder.
Tasks run in the background at a specified time rather than continuously; the frequency is set by the user when creating the task.
You can also set a task to start automatically when the user logs on.
All the necessary information is in the README file in the repository.
If I understood you correctly, I think it meets your requirements.
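(Not the CopyDirFile script itself, just for comparison: the basic replicate-and-delete behaviour described above can be sketched with a single rsync call run from cron; the paths and schedule below are placeholders.)

    # crontab entry: mirror /data/source to /mnt/backup every hour at :00,
    # removing files from the destination that were deleted in the source.
    0 * * * *  rsync -a --delete /data/source/ /mnt/backup/source/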
Linux has standard support for software RAID: mdraid.
It allows you to bundle two disk devices into a RAID 1 device (among other things); you then create a filesystem on top of that device.
LVM offers another way to do software RAID; it doesn't seem to be very popular, but it's certainly supported.
(If your system supports hardware RAID, on the motherboard or with a separate RAID controller, Linux can use that, too, but that doesn't seem to be what you're asking here.)
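A minimal sketch of what the mdraid route looks like, assuming /dev/sdb1 and /dev/sdc1 are the two partitions you want to mirror (adjust the device names and mount point to your system):

    # Build a RAID 1 array from two partitions, then put a filesystem on it.
    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb1 /dev/sdc1
    mkfs.ext4 /dev/md0
    mount /dev/md0 /mnt/mirror

    # Record the array so it is assembled automatically at boot.
    mdadm --detail --scan >> /etc/mdadm/mdadm.conf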

I need to copy images from Linux server to my windows desktop, should i use threads? and how many?

I need to copy between 400 and 5000 images; the number changes on every run.
How can I calculate how many threads will give me the fastest result?
Should I open a new SSH connection for each thread?
I use paramiko to open the SSH connection and SFTP to copy the images.
Thanks
I'd suggest adding the images to a single archive before copying. Checking each transferred file and creating each new file on the destination is an expensive per-file operation, so copying one archive in a single stream will be much faster than copying thousands of small images one by one.
So the faster approach is:
pack into an archive
copy the archive
unpack it on the other side
You can verify this without any network connection at all: copy about 1 GB of small files from one hard drive to another, then pack the same files into an archive and copy that instead. You'll notice the second way is much faster.
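As a rough illustration of the three steps (the host and paths are made up, and it's shown with tar/scp for brevity; you could run the server-side command through paramiko's exec_command and fetch the single archive over your existing SFTP connection):

    # 1. On the Linux server: pack all the images into one archive.
    tar czf /tmp/images.tar.gz -C /srv/images .

    # 2. Copy the single archive instead of thousands of small files.
    scp user@linuxserver:/tmp/images.tar.gz .

    # 3. Locally: unpack.
    tar xzf images.tar.gz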

labelsync help required

Does labelsync actually sync a file from the depot?
Or does it only require a file name?
If it does sync, is there an alternative command, like flush, that would work similarly to labelsync but faster (i.e. without syncing the files)?
Please help
Labelsync modifies the set of files associated with a label; it does not sync a file from the depot. It is quite different from sync. There is a flush command (http://www.perforce.com/perforce/doc.current/manuals/cmdref/flush.html#1040665); flush is similar to sync, but it does not actually transfer the files.
I have no idea what sort of command you are trying to run, since flush and labelsync are used for two very different purposes, and both run very fast.
Perhaps you are looking for: http://www.perforce.com/perforce/doc.current/manuals/intro/01_intro.html#1067317
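To make the difference concrete, here are rough examples of both; the label name and depot path are placeholders:

    # Record the workspace's current file revisions against a label
    # (no file content is transferred).
    p4 labelsync -l mylabel //depot/project/...

    # Tell the server the client already has these revisions,
    # again without transferring any files (equivalent to sync -k).
    p4 flush //depot/project/...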

Alternative to creating multipart .tar.gz files?

I have a folder with >20 GB of images on a Linux server. I need to make a backup and download it, so I was thinking about using "split" to create 1 GB files. My question is: instead of splitting a .tar.gz and then having to join it again on my computer, is there a way I could create 20 x 1 GB valid .tar.gz files, so I can then view/extract them separately?
Edit: I forgot to add that I need to do it without ssh access. I'm using mostly PHP.
You could try rsnapshot instead, which backs up using rsync and hard links. It not only avoids the file-size issue but also gives you high storage and bandwidth efficiency when existing images don't change often.
Why not just use rsync?
FYI, rsync is a command-line tool that synchronises directories between two machines across the network. If you have Linux at both ends and ssh access properly configured, it's as simple as rsync -av server:/path/to/images/ images/ (make sure the trailing slashes are there). It also optimises subsequent synchronisations so that only changes are transmitted. You can even tell it to compress data in transit, but that usually doesn't help with images.
First, I would give rsnapshot a miss if you don't have SSH access (though I do have it and love it).
I would assume you're likely backing up JPEGs, which are already compressed, so zipping them up doesn't make them much smaller. Also, you don't need exactly 1 GB files; it sounds like they can be a bit bigger or smaller.
So you could just write a script which bundles JPEGs into a .tar.gz (or whatever) until it has added about 1 GB worth, then starts a new archive.
You could do all this in PHP easily enough.
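Something along these lines, sketched here in shell rather than PHP (the directory, archive names and size cut-off are placeholders; the same loop is straightforward to translate to PHP):

    #!/bin/bash
    # Bundle images into roughly 1 GB .tar.gz archives that can each be
    # extracted on their own.
    SRC=/path/to/images
    LIMIT=$((1024 * 1024 * 1024))    # ~1 GB of input per archive, before compression

    part=1; size=0; files=()
    while IFS= read -r -d '' f; do
        fsize=$(stat -c %s "$f")
        if (( ${#files[@]} > 0 && size + fsize > LIMIT )); then
            tar czf "backup-part$part.tar.gz" -C "$SRC" "${files[@]}"
            part=$((part + 1)); size=0; files=()
        fi
        files+=("${f#"$SRC"/}")
        size=$((size + fsize))
    done < <(find "$SRC" -type f -print0)

    # Write out whatever is left in the final batch.
    (( ${#files[@]} > 0 )) && tar czf "backup-part$part.tar.gz" -C "$SRC" "${files[@]}"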
