I have two scenarios.
Scenario 1: Machine A contains 1000 documents stored as folders. This folder on machine A is mounted on machine B. I process the documents within these folders on machine B and store the output in the mounted path on machine B.
Scenario 2: The documents on machine A are copied directly onto machine B and processed there.
Scenario 2 is much faster than Scenario 1. I guess it's because no data is transferred over the network between the two machines during processing. Is there a way I can use mounting and still achieve better performance?
Did you try enabling a cache? For NFS, see https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Storage_Administration_Guide/fscachenfs.html. CIFS should have caching enabled by default (unless you disabled it).
The other option would be to use something like Windows’ offline files, which copies files and folders between client and server in the background, so you don’t need to deal with it. The only equivalent I’ve found for Linux is OFS.
But the performance depends on the size of the files and if you read them randomly or sequentially. For instance when I am encoding videos, I access the file right away via the network from my NFS, because it takes as much time as it would take to read and write the file. This way no additional time is “wasted” on the encoding, as the application can encode the stream which is coming from the network.
So for large files you might want to change the algorithms to a sequential read. Small files, on the other hand, are copied within seconds and could simply be synced between server and client using rsync, BitTorrent Sync, Dropbox or one of the many other tools. This is actually quite commonly done.
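As a rough illustration of the "sync first, then work locally" idea, here is a minimal Python sketch. It assumes the documents are ordinary folders of files on the mount; the function name process_via_local_copy, the process callback and the ".out" suffix are hypothetical and only there for the example.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable


def process_via_local_copy(src_dir: Path, out_dir: Path,
                           process: Callable[[Path], str]) -> None:
    """Copy one document folder to local scratch space, process it there,
    then write the results to the (possibly mounted) output path."""
    with tempfile.TemporaryDirectory() as scratch:
        local_dir = Path(scratch) / src_dir.name
        # One bulk copy up front instead of many small reads over the mount.
        shutil.copytree(src_dir, local_dir)
        out_dir.mkdir(parents=True, exist_ok=True)
        for doc in sorted(p for p in local_dir.rglob("*") if p.is_file()):
            result = process(doc)  # runs against the local copy only
            rel = doc.relative_to(local_dir)
            target = out_dir / rel.parent / (rel.name + ".out")
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_text(result)
```

Whether this beats working directly on the mount depends on file sizes and access pattern, as noted above, so it is worth timing both on a sample of the real documents.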
Related
I have a website that I host on a Linux VPS which has been growing over the years. One of its primary functions is to store images/photos and these image files are typically around 20-40kB each. The way the site is organised at the moment is all images are stored in a root folder ‘photos’ and under that root folder are many subfolders determined by a random filename. For example, one image could have a file name abcdef1234.jpg and that would be stored in the folder photos/ab/cd/ef/. The advantage of this is that there are no directories with excessive numbers of images in them and accessing files is quick.
However, the entire photos directory is huge and is set to grow. I currently have almost half a million photos in tens of thousands of sub-folders and whilst the system works fine, it is fairly cumbersome to back up. I need advice on what I could do to make life easier for back-ups. At the moment, I am backing up the entire photos directory each time and I do that by compressing the folder and downloading it. It takes a while and puts some strain on the server. I do this because every FTP client I use takes ages to sift through all the files and find the most recent ones by date. Also, I would like to be able to restore the entire photo set quickly in the event of a catastrophic webserver failure, so even if I could back up the data recursively, how cumbersome would it be to have to upload each backup stage by stage?
Does anyone have any suggestions, perhaps from experience? I am not a webserver administrator and my experience of Linux is very limited. I have also looked into CDNs and Amazon S3 but this would require a great deal of change to my site in order to make these systems work – perhaps I’ll use something like this in the future.
Since you indicated that you run a VPS, I assume you have shell access which gives you substantially more flexibility (as opposed to a shared webhosting plan where you can only interact with a web frontend and an FTP client). I'm pretty sure that rsync is specifically designed to do what you need to do (sync large numbers of files between machines, and do so efficiently).
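A minimal sketch of what that could look like, wrapped in Python so it can be scheduled. The host name, user and paths in the usage note are placeholders, and the function names are hypothetical; the only rsync option used is -a (archive mode).

```python
import subprocess
from typing import List


def build_backup_command(remote: str, remote_dir: str, local_dir: str) -> List[str]:
    """Build an rsync command that pulls only new or changed files."""
    return [
        "rsync",
        "-a",  # archive mode: recurse and preserve times, permissions, links
        f"{remote}:{remote_dir.rstrip('/')}/",  # trailing slash: copy the contents
        f"{local_dir.rstrip('/')}/",
    ]


def run_backup(remote: str, remote_dir: str, local_dir: str) -> None:
    """Run the transfer; raises CalledProcessError if rsync exits non-zero."""
    subprocess.run(build_backup_command(remote, remote_dir, local_dir), check=True)
```

For example, run_backup("user@vps.example.com", "/var/www/photos", "/backup/photos") would pull the photos tree over ssh. After the first full copy, later runs transfer only files that are new or changed, and a restore would be the same command with source and destination swapped.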
This gets into Superuser territory, so you might get more advice over on that forum.
Suppose I need to download a lot of small files from a remote host by HTTP and I have a list of the URLs to download. Suppose also that the remote host allows only K connections to my local network. My local network has M computers and I would like to distribute the files across them evenly. All my computers run Linux. Now I wonder how to organize the download.
I assume that one computer is enough to handle all K connections and store all those files in its local file system. Thus I would allocate one computer to download the files into M folders named after the M local hosts. The local hosts then copy (move) the files from those folders to their own file systems. Does this make sense? What is the simplest way to implement it?
Your approach is fine, but it assumes that all files are the same size and that all computers have equal performance.
What happens if one computer is done with its files while another is still halfway through? In that case you have a processor available, but it sits idle.
Avoiding this means handing out work dynamically, which requires distributed computing and is a lot more complicated. So I would say that if this is a one-time task, or if the total time taken is not large, your approach should be fine; otherwise you need to evaluate a distributed approach.
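The difference between the two approaches can be sketched in a few lines of Python. This is only an illustration: threads in one process stand in for the K connections or M machines, and the names split_round_robin, run_shared_queue and fetch are hypothetical.

```python
import queue
import threading
from typing import Callable, Dict, List


def split_round_robin(urls: List[str], hosts: List[str]) -> Dict[str, List[str]]:
    """Static split: URL i goes to host i mod M, so counts differ by at most one."""
    plan: Dict[str, List[str]] = {h: [] for h in hosts}
    for i, url in enumerate(urls):
        plan[hosts[i % len(hosts)]].append(url)
    return plan


def run_shared_queue(urls: List[str], workers: int,
                     fetch: Callable[[str], None]) -> None:
    """Dynamic split: each worker pulls the next URL as soon as it is free."""
    todo: "queue.Queue[str]" = queue.Queue()
    for url in urls:
        todo.put(url)

    def worker() -> None:
        while True:
            try:
                url = todo.get_nowait()
            except queue.Empty:
                return  # nothing left, so this worker stops
            fetch(url)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

The static split balances the number of files but not the time each one takes. With the shared queue, a fast worker simply takes more items, at the cost of needing a coordination point that every worker can reach.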
Hi it's a question and it may be redundant but I have a hunch there is a tool for this - or there should be and if there isn't I might just make it - or maybe I am barking up the wrong tree in which case correct my thinking:
But my problem is this: I am looking for some way to migrate large virtual disk drives off a server once a week via an internet connection of only moderate speed, in a solution that must be able to be throttled for bandwidth because the internet connection is always in use.
I thought about it and the problem is familiar: large files that must be moved, with throttling, and that must easily survive disconnection/reconnection etc. The only solution I am familiar with that does this perfectly is torrents.
Is there a way to automatically strategically make torrents and automatically "send" them to a client download list remotely? I am working in Windows Hyper-V Host but I use only Linux for the guests and I could easily cook up a guest to do the copying so consider it a windows or linux problem.
PS: the VHDs are "offline" copies of guest servers by the time I am moving them, so consider them merely 20-30 GB dumb files.
PPS: I'd rather avoid spending money
Bittorrent is an excellent choice, as it handles both incremental updates and automatic resume after connection loss very well.
To create a .torrent file automatically, use the btmakemetainfo script found in the original bittorrent package, or one from the numerous rewrites (bittornado, ...) -- all that matters is that it's scriptable. You should take care to set the "disable DHT" flag in the .torrent file.
You will need to find a tracker that allows you to track files with arbitrary hashes (because you do not know these in advance); you can either use an existing open tracker, or set up your own, but you should take care to limit the client IP ranges appropriately.
This reduces the problem to transferring the .torrent files -- I usually use rsync via ssh from a cronjob for that.
For point to point transfers, torrent is an expensive use of bandwidth. For 1:n transfers it is great as the distribution of load allows the client's upload bandwidth to be shared by other clients, so the bandwidth cost is amortised and everyone gains...
It sounds like you have only one client in which case I would look at a different solution...
wget allows for throttling and can resume a transfer where it left off, provided the FTP/HTTP server supports resuming. That is what I would use.
You can use rsync for that (http://linux.die.net/man/1/rsync). Search for the --partial option in man and that should do the trick. When a transfer is interrupted the unfinished result (file or directory) is kept. I am not 100% sure if it works with telnet/ssh transport when you send from local to a remote location (never checked that) but it should work with rsync daemon on the remote side.
You can also use that for sync in two local storage locations.
rsync --partial [-r for directories] source destination
Edit: I have since confirmed that --partial also works over the ssh transport, so the doubt expressed above no longer applies.
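Since the question also asks for throttling, here is a hedged Python sketch that combines --partial with rsync's --bwlimit option and simply retries after a dropped connection. The function names, the retry count and the wait time are assumptions made for the example, not part of the answer above.

```python
import subprocess
import time
from typing import Callable, List


def build_rsync_command(source: str, destination: str, limit_kbps: int) -> List[str]:
    """rsync over ssh, keeping partial files and capping the bandwidth used."""
    return [
        "rsync",
        "--partial",                # keep an interrupted file for the next attempt
        f"--bwlimit={limit_kbps}",  # cap the transfer rate (kilobytes per second)
        "-e", "ssh",
        source,
        destination,
    ]


def transfer_with_retries(cmd: List[str], attempts: int = 5,
                          wait_seconds: float = 60.0,
                          run: Callable[[List[str]], int] = subprocess.call) -> bool:
    """Re-run the command until it exits with status 0 or attempts run out."""
    for attempt in range(attempts):
        if run(cmd) == 0:
            return True
        if attempt < attempts - 1:
            time.sleep(wait_seconds)  # give the connection time to come back
    return False
```

The run parameter exists only so the retry logic can be exercised without a network; by default the command is executed with subprocess.call.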
If I want to transfer data using RPC or component technology, but the data can be very large, how do I deal with this situation?
For example, I want to transfer a file to the remote side as a parameter, but I don't want to put the whole file into memory for the transfer. What should I do?
I think you should consider a file transfer solution: something like establishing an FTP connection in the background and making the operations that work on the file data wait until the transfer completes. You should also take care of the correctness of the transferred data, for instance by checksumming. Another solution is probably to mount the remote directory containing the files as a local volume, or even to set up a distributed file system, if you have all the files in one place and you are running Linux.
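Independent of the transport chosen, the "don't load the whole file" and "checksum it" parts can be sketched in Python as follows. The send callback stands in for whatever actually ships bytes to the remote side; it and the function names are hypothetical.

```python
import hashlib
from typing import BinaryIO, Callable, Iterator


def iter_chunks(stream: BinaryIO, chunk_size: int = 64 * 1024) -> Iterator[bytes]:
    """Yield the stream piece by piece so only one chunk is in memory at a time."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return
        yield chunk


def send_in_chunks(stream: BinaryIO, send: Callable[[bytes], None],
                   chunk_size: int = 64 * 1024) -> str:
    """Send every chunk and return the SHA-256 hex digest of everything sent."""
    digest = hashlib.sha256()
    for chunk in iter_chunks(stream, chunk_size):
        digest.update(chunk)  # checksum is built incrementally
        send(chunk)
    return digest.hexdigest()
```

The receiver can compute the same digest over the chunks it gets and compare it with the returned value to confirm the transfer was correct.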
Let me answer my own question.
The answer is MTOM (Message Transmission Optimization Mechanism); make sure the framework you are using supports it.
I have many machines (20+) connected in a network. Each machine accesses a central database, queries it, processes the information queried, and then writes the results to files on its local hard drive.
Following the processing, I'd like to be able to 'grab' all these files (from all the remote machines) back to the main machine for storage.
I thought of three possible ways to do so:
(1) rsync to each remote machine from the main machine, and 'ask' for the files
(2) rsync from every remote machine to the main machine, and 'send' the files
(3) create an NFS share on each remote machine, which the main machine can access to read the files (no 'rsync' is needed in such a case)
Is one of these ways better than the others? Are there better ways I am not aware of?
All machines use Ubuntu 10.04LTS. Thanks in advance for any suggestions.
You could create one NFS share on the master machine and have each remote machine mount that. Seems like less work.
Performance-wise, it's practically the same. You are still sending files over a (relatively) slow network connection.
Now, I'd say which approach you take depends on where you want to handle errors or irregularities. If you want the responsibility to lie on your processing computers, use rsync back to the main one; or the other way round if you want the main one to work on assembling the data and assuring everything is in order.
As for the shared space approach, I would create a share on the main machine, and have the others write to it. They can start as soon as the processing finishes, ensure the file is transferred correctly, and then verify checksums or whatever.
I would prefer option (2), since the client machine knows when its processing is finished. You could use the same SSH key on all client machines, or collect the different keys in the authorized_keys file on the main machine. It's also more reliable: if the main machine is unavailable for some reason, you can still sync the results later, whereas in the NFS setup the clients are blocked.