Collecting Files From Many Machines? - linux

I have many machines (20+) connected in a network. Each machine accesses a central database, queries it, processes the information, and then writes the results to files on its local hard drive.
Following the processing, I'd like to be able to 'grab' all these files (from all the remote machines) back to the main machine for storage.
I thought of three possible ways to do so:
(1) rsync to each remote machine from the main machine, and 'ask' for the files
(2) rsync from every remote machine to the main machine, and 'send' the files
(3) create an NFS share on each remote machine, which the main machine can access to read the files (no 'rsync' is needed in such a case)
Is one of these ways better than the others? Are there better ways I am not aware of?
All machines use Ubuntu 10.04 LTS. Thanks in advance for any suggestions.

You could create one NFS share on the master machine and have each remote machine mount that. Seems like less work.
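As a rough sketch (the export path, network range and hostnames below are made up; on Ubuntu the nfs-kernel-server package on the master and nfs-common on the clients provide the tooling), the master would export a directory and each remote machine would mount it and drop its results there:

# on the master: export a directory (append to /etc/exports, then reload)
echo '/srv/results 192.168.1.0/24(rw,sync,no_subtree_check)' | sudo tee -a /etc/exports
sudo exportfs -ra

# on each remote machine: mount the share and write results into a per-host folder
sudo mkdir -p /mnt/results
sudo mount -t nfs master:/srv/results /mnt/results
mkdir -p /mnt/results/$(hostname)
cp /var/local/output/*.dat /mnt/results/$(hostname)/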

Performance-wise, it's practically the same. You are still sending files over a (relatively) slow network connection.
Now, I'd say which approach you take depends on where you want to handle errors or irregularities. If you want the responsibility to lie with your processing computers, use rsync back to the main one; or the other way round if you want the main one to do the work of assembling the data and ensuring everything is in order.
As for the shared space approach, I would create a share on the main machine, and have the others write to it. They can start as soon as the processing finishes, ensure the file is transferred correctly, and then verify checksums or whatever.
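For example (paths and filenames below are made up), a client could copy its result onto the mounted share and verify it right away:

# copy the result onto the mounted share, then check it arrived intact
src=/var/local/output/result.dat
dst=/mnt/central/$(hostname)-result.dat

cp "$src" "$dst"
if [ "$(md5sum < "$src" | cut -d' ' -f1)" = "$(md5sum < "$dst" | cut -d' ' -f1)" ]; then
    echo "transfer OK"
else
    echo "checksum mismatch for $dst" >&2
fi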

I would prefer option (2), since you know when the processing is finished on the client machine. You could use the same SSH key on all client machines or collect the different keys in the authorized_keys file on the main machine. It's also more reliable: if the main machine is unavailable for some reason, you can still sync the results later, whereas in the NFS setup the clients are blocked.
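A minimal sketch of that push setup (hostname, user and paths are assumptions, and /srv/results/ is taken to already exist on the main machine): generate a key on each client once, then have the processing job push its results when it finishes:

# one-time setup on each client: create a key and install it on the main machine
ssh-keygen -t rsa -N '' -f ~/.ssh/id_rsa
ssh-copy-id collector@main-host

# at the end of the processing job: push the results into a per-host directory
rsync -az --remove-source-files /var/local/output/ \
      collector@main-host:/srv/results/$(hostname)/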

Related

Syncing between a Windows-based server (host) and a Linux server (client) using SFTP

My task is to sync folders between two computers. One acts as a Windows server, which is the host, and the other is a Linux-based server. The file transfer has to be secure and encrypted. Are there any free software tools that will help me do this task?
Additionally, the syncing should happen automatically at a predetermined interval.
I have a recollection that WinSCP can be invoked from the command line. There, you have the option to synchronize folders (and the whole hierarchy therein). It may be worth trying.
Total Commander also has FTP/SFTP capabilities, but I'm not sure you can invoke it from the command line.
One point to consider: if the process is to run automatically, you need to hard-code the username and password for the connection, which compromises your security.

Processing speed over mounted path

I have two scenarios.
Scenario 1: Machine A contains 1000 documents organised as folders. This folder on machine A is mounted on machine B. I process the documents within these folders on machine B and store the output in the mounted path on machine B.
Scenario 2: The documents on machine A are copied directly to machine B and processed there.
Scenario 2 is much faster than Scenario 1. My guess is that this is because there is no data transfer happening over the network between the two machines. Is there a way I can use mounting and still achieve better performance?
Did you try enabling a cache? For NFS, see FS-Cache: https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Storage_Administration_Guide/fscachenfs.html. CIFS should have caching enabled by default (unless you disabled it).
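As a sketch for the NFS case (package name, config path and share names assume a Debian/Ubuntu-style client; treat the details as assumptions rather than a recipe):

# install and enable the local cache daemon on the NFS client
sudo apt-get install cachefilesd
# on Debian/Ubuntu the daemon is switched on in /etc/default/cachefilesd (RUN=yes)
sudo sed -i 's/^#RUN=yes/RUN=yes/' /etc/default/cachefilesd
sudo service cachefilesd start

# remount the share with the 'fsc' option so reads go through FS-Cache
sudo mount -t nfs -o fsc machineA:/export/documents /mnt/documents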
The other option would be to use something like Windows’ offline files, which copies files and folders between client and server in the background, so you don’t need to deal with it. The only thing I’ve found for Linux is OFS.
But the performance depends on the size of the files and on whether you read them randomly or sequentially. For instance, when I am encoding videos, I access the file directly over the network from my NFS share, because that takes about as much time as reading and writing (copying) the file locally would. This way no additional time is “wasted” on top of the encoding, as the application can encode the stream coming in from the network.
So for large files you might want to change the algorithm to a sequential read. On the other hand, small files, which are copied within seconds, could also be synced between server and client using rsync, BitTorrent Sync, Dropbox or one of the hundreds of other tools, and this is actually quite commonly done.

How to download a lot of remote files with a few Linux computers?

Suppose I need to download a lot of small files from a remote host by HTTP and I have a list of the URLs to download. Suppose also that the remote host allows only K connections to my local network. My local network has M computers and I would like to distribute the files across them evenly. All my computers run Linux. Now I wonder how to organize the download.
Now I assume that one computer is enough to handle all K connections and store all those files in its local file system. Thus I would allocate one computer to "download" the files into M folders named after the M local hosts. The local hosts then copy (move) the files from those folders to their own file systems. Does it make sense? What is the simplest way to implement it?
Your approach is fine, but it assumes that all files are of the same size and that all computers have equal performance.
What happens if one computer is done with its files while another is still halfway through? In that case you have a processor available, but it will be sitting idle.
Balancing this would require distributed computing, which is a lot more complicated, so I would say that if this is a one-time task, or if the total time taken is not large, your approach should be fine; otherwise you need to evaluate a distributed approach.
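A rough sketch of the single-downloader approach described in the question (assuming GNU coreutils and wget, a urls.txt list, M=4 hosts and K=8 allowed connections; all names are made up):

# split the URL list into 4 chunks, one per local host
# (split -n l/4 needs GNU coreutils >= 8.8; otherwise use split -l with a computed line count)
split -n l/4 urls.txt chunk_

# download each chunk into a per-host folder, 2 parallel wgets per chunk
# (4 chunks x 2 = 8 connections in total, matching the K limit)
n=0
for chunk in chunk_*; do
    n=$((n+1))
    mkdir -p "host$n"
    xargs -a "$chunk" -n 1 -P 2 wget -q -P "host$n/" &
done
wait

# each local host then pulls (and removes) its own folder
rsync -a --remove-source-files downloader:/data/host1/ /local/storage/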

Migrate data from one server to another

I bought a new server and I want to move all the data (directories, subdirectories, users, passwords, etc.) from my old server to it.
Is there a way to do that?
Thanks,
Do you have physical access to both servers? If so, you can use the dd command to make a clone of the disk from the old server onto the disk that is going into the new server.
In order to do this though, both hard drives have to be installed in one of the servers.
You can also use netcat and dd to clone a disk over a network.
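As a sketch of the network clone (device names, port and hostname are examples, and the exact listen syntax differs between netcat variants):

# on the new server: listen and write the incoming stream to its disk
nc -l -p 1234 | dd of=/dev/sdb bs=64K

# on the old server: read the disk and stream it across the network
dd if=/dev/sda bs=64K | nc new-server 1234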
For the directories and files, use an FTP client from your server if it allows you to; if not, just download all the content to your computer and upload it to the new server.
For the users and passwords, I guess they are in a database. Connect to the database using SSH, Telnet, MySQL Admin or any RDBMS client and export a dump file, then log in to the new server's SQL system and import that dump file.
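If it is MySQL, for example, the dump and restore could look like this (credentials are placeholders):

# on the old server: dump every database to a single file
mysqldump -u root -p --all-databases > dump.sql

# copy dump.sql across (scp, FTP, ...) and load it on the new server
mysql -u root -p < dump.sql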
Anyway, you should give more details about both servers so we can help you. For example, are they shared hosting or dedicated machines? What kind of access do you have to them? Also, knowing their operating system would help people answer you accurately.
In principle, yes.
If the hardware is similar (= just more RAM and disk space, but the same CPU architecture and no special graphics card drivers), you might be able to copy every file and then install the boot loader once more (the boot loader config usually changes when the hard disk size changes).
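As a sketch of the copy-everything route (the hostname is made up, the exclude list is the usual set of pseudo-filesystems, and the boot loader commands assume a GRUB/Debian-style system, e.g. run after chrooting into the copied system from a rescue environment):

# run from the old server: copy the whole filesystem, preserving permissions,
# ACLs, extended attributes and hard links, skipping pseudo-filesystems
rsync -aAXHv \
      --exclude='/dev/*' --exclude='/proc/*' --exclude='/sys/*' \
      --exclude='/tmp/*' --exclude='/run/*' --exclude='/mnt/*' \
      --exclude='/media/*' --exclude='/lost+found' \
      / root@new-server:/

# then, on the new server: reinstall the boot loader and regenerate its config
grub-install /dev/sda
update-grub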
Or you can create a list of all services that you use, determine which config files each one uses and then just copy those. Ideally, you shouldn't copy them but compare the old and the new versions and merge them.
The most work-intensive way is to use a tool like Puppet. In a nutshell, Puppet allows you to create install scripts for services (along with all the configuration that you need). So if you need to install a service again (new hardware, second server), you just tell Puppet to do it. On the plus side, your whole installation will be documented, too. If you ever wonder why something is the way it is, you can look into the Puppet files.
Of course, this approach takes a lot of time and discipline, so it might not be worth it in your case. Apply common sense.

Using torrents to back up VHDs

Hi, this may be a redundant question, but I have a hunch there is a tool for this - or there should be, and if there isn't I might just make it - or maybe I am barking up the wrong tree, in which case correct my thinking:
My problem is this: I am looking for some way to migrate large virtual disk drives off a server once a week via an internet connection of only moderate speed. The solution must allow bandwidth throttling, because the internet connection is always in use.
I thought about it and the problem is familiar: large files that can be moved, that can be throttled, and that can easily survive disconnection/reconnection, etc. The only solution I am familiar with that does this perfectly is torrents.
Is there a way to automatically make torrents and automatically "send" them to a client's download list remotely? I am working with a Windows Hyper-V host, but I use only Linux for the guests, and I could easily cook up a guest to do the copying, so consider it a Windows or Linux problem.
PS: the VHDs are "offline" copies of guest servers by the time I am moving them - consider them merely 20-30 GB dumb files.
PPS: I'd rather avoid spending money
Bittorrent is an excellent choice, as it handles both incremental updates and automatic resume after connection loss very well.
To create a .torrent file automatically, use the btmakemetainfo script found in the original bittorrent package, or one from the numerous rewrites (bittornado, ...) -- all that matters is that it's scriptable. You should take care to set the "disable DHT" flag in the .torrent file.
You will need to find a tracker that allows you to track files with arbitrary hashes (because you do not know these in advance); you can either use an existing open tracker, or set up your own, but you should take care to limit the client IP ranges appropriately.
This reduces the problem to transferring the .torrent files -- I usually use rsync via ssh from a cronjob for that.
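A sketch of the scripted side, using mktorrent in place of btmakemetainfo (the tracker URL, paths and the watch-directory convention are assumptions; many clients, e.g. rtorrent or Transmission, can auto-add torrents dropped into a watched folder):

# create a private torrent for this week's VHD (-p sets the private flag, i.e. no DHT/PEX)
mktorrent -a http://tracker.example.com:6969/announce -p \
          -o "guest1-$(date +%F).torrent" /vhds/guest1.vhd

# ship the tiny .torrent file to the receiving side's watch directory
rsync -az guest1-*.torrent backup-user@remote-host:/home/backup-user/watch/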
For point-to-point transfers, torrent is an expensive use of bandwidth. For 1:n transfers it is great, as the distribution of load allows each client's upload bandwidth to be shared with other clients, so the bandwidth cost is amortised and everyone gains...
It sounds like you have only one client in which case I would look at a different solution...
wget allows for throttling and can resume transfers where it left off if the FTP/HTTP server supports resuming transfers... That is what I would use.
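For example (the URL and rate are placeholders):

# cap the download at ~500 KB/s and resume a partial file if the connection drops
wget --limit-rate=500k -c http://server.example.com/backups/guest1.vhd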
You can use rsync for that (http://linux.die.net/man/1/rsync). Search for the --partial option in man and that should do the trick. When a transfer is interrupted the unfinished result (file or directory) is kept. I am not 100% sure if it works with telnet/ssh transport when you send from local to a remote location (never checked that) but it should work with rsync daemon on the remote side.
You can also use that for sync in two local storage locations.
rsync --partial [-r for directories] source destination
Edit: I have since confirmed that it does work over SSH as well.
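Putting that together with throttling (paths and the rate are placeholders; --bwlimit takes KB/s on older rsync versions):

# keep partial files on interruption, limit bandwidth, and transfer over SSH
rsync --partial --bwlimit=500 -e ssh /vhds/guest1.vhd backup-user@remote-host:/backups/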
