Sending files to a remote server in the quickest manner - Linux

I have a separate server that processes the media uploaded to my main, web-facing server. For now I upload files to it using FTP, but to make sure a file has finished uploading I rely on a timeout, which adds a delay to the overall processing time. I can't get the wait below 5 seconds and still reliably pick up the media, and this delay is no longer acceptable. So:
Is there a better way to implement this cleanly? I've considered sticking with FTP and sending a second file after the initial upload to indicate it's done, but then there are two uploads for every upload = expensive. Another option I've considered is implementing a custom server that just reads a content-length header, does some authentication, and then receives the file and kicks off the processing as soon as it's ready. Socket programming doesn't seem too intimidating, but I have some worries about sending binary files and different formats; is this a valid concern? Also, are there any other protocols out there I could implement to do this rather than reinvent the wheel? Something like FTP, but with a little verification.
I'd be glad for any pointers or tips you can share, thanks!

I suggest you use rsync. It runs over ssh, will move entire directories / hierarchies of files, and does incremental copies; in short, everything you could possibly want.
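For example, a single invocation along these lines (the paths, user and host are placeholders) incrementally copies an upload directory to the processing server over ssh:
rsync -az --partial /var/media/uploads/ user@processing-server:/var/media/incoming/
Because rsync only returns once the files have been fully written on the remote side, the processing job can be kicked off right after the command exits, with no guessing about whether an upload has finished.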

Related

Get whole domain content with wget or other commands in linux?

I would like to copy a client project, but I only have FTP access. Normally I'd do it with SSH access, but in this case that's not possible. The problem is the size of the project (nearly 3 GB).
Is there a solution for copying the project to my server with FTP access only?
The size isn't the problem here. Because of the encryption, an SSH upload produces more overhead than an FTP upload, so the answer is: of course you can use FTP for file uploads, even large ones. FTP was meant for this.
The more important concern is security. If you normally use SSH for file uploads, you presumably have security in mind (otherwise FTP would have been the faster choice). If your provider supports SFTP, you could use it as an alternative.

Using torrents to back up VHDs

Hi, this may be a redundant question, but I have a hunch there is a tool for this - or there should be, and if there isn't I might just make it - or maybe I am barking up the wrong tree, in which case correct my thinking:
My problem is this: I am looking for a way to migrate large virtual disk drives off a server once a week over an internet connection of only moderate speed, with a solution that can be throttled for bandwidth because the connection is always in use.
I thought about it and the problem is familiar: large files that can be moved, throttled, and easily survive disconnection/reconnection, etc. - the only solution I am familiar with that just does it all perfectly is torrents.
Is there a way to automatically create torrents and automatically "send" them to a client's download list remotely? I am working with a Windows Hyper-V host, but I use only Linux for the guests and I could easily cook up a guest to do the copying, so consider it a Windows or Linux problem.
PS: the VHDs are "offline" copies of guest servers by the time I am moving them - consider them merely 20-30 GB dumb files.
PPS: I'd rather avoid spending money
Bittorrent is an excellent choice, as it handles both incremental updates and automatic resume after connection loss very well.
To create a .torrent file automatically, use the btmakemetainfo script found in the original bittorrent package, or one from the numerous rewrites (bittornado, ...) -- all that matters is that it's scriptable. You should take care to set the "disable DHT" flag in the .torrent file.
You will need to find a tracker that allows you to track files with arbitrary hashes (because you do not know these in advance); you can either use an existing open tracker, or set up your own, but you should take care to limit the client IP ranges appropriately.
This reduces the problem to transferring the .torrent files -- I usually use rsync via ssh from a cronjob for that.
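As a rough sketch of what the weekly job could look like: transmission-create stands in here for whichever scriptable torrent builder you prefer (its -p flag marks the torrent private, which is the usual way to keep DHT out of the picture; check your tool's own options), and the tracker URL, paths and host are placeholders:
# build a private .torrent for this week's offline VHD copy
transmission-create -p -t http://tracker.example.com/announce \
    -o /backups/torrents/vm1-$(date +%F).torrent /backups/vhd/vm1.vhd
# push only the small .torrent file to the receiving side over ssh
rsync -e ssh /backups/torrents/ user@remote-site:/srv/torrent-watch/
If the client at the other end is pointed at a watch directory, it will pick up the new .torrent automatically and start pulling the VHD at whatever rate limit you have configured in the client.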
For point-to-point transfers, torrent is an expensive use of bandwidth. For 1:n transfers it is great, as distributing the load allows each client's upload bandwidth to be shared with the other clients, so the bandwidth cost is amortised and everyone gains...
It sounds like you have only one client in which case I would look at a different solution...
wget allows for throttling and can resume transfers where they left off if the FTP/HTTP server supports resuming transfers... That is what I would use.
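For instance, something along these lines (the URL and rate are placeholders) caps the transfer at roughly 500 KB/s and resumes a partially downloaded file if the server allows it:
wget --limit-rate=500k -c ftp://backup-host/vhds/vm1.vhd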
You can use rsync for that (http://linux.die.net/man/1/rsync). Search for the --partial option in man and that should do the trick. When a transfer is interrupted the unfinished result (file or directory) is kept. I am not 100% sure if it works with telnet/ssh transport when you send from local to a remote location (never checked that) but it should work with rsync daemon on the remote side.
You can also use it to sync two local storage locations.
rsync --partial [-r for directories] source destination
edit: I have since confirmed that it does work with ssh transport as well, so the uncertainty above no longer applies.

How does RPC transfer big binary data?

If I want to transfer data using RPC or component technology, but the data can be very big, how do I deal with this situation?
For example, I want to transfer a file to the remote side as a parameter, but I don't want to put the whole file into memory for the transfer. How should I do this?
I think you should consider a file transfer solution: something like establishing an FTP connection in the background and making the operations that are supposed to work on this file data wait until the transfer completes. You should also take care of the correctness of the transferred data, with checksumming for instance. Another solution is mounting the remote directory containing the files as a local volume, or even setting up a distributed file system if you have all the files in one place and you are running Linux.
Let me answer my own question.
The answer is MTOM; make sure the framework you are using supports it.

NodeJS: How would one watch a large amount of files/folders on the server side for updates?

I am working on a small NodeJS application that essentially serves as a browser-based desktop search for a LAN server that multiple users can query. The users on the LAN all have access to a shared folder on that server and are used to just placing files in that folder to share them with everyone, and I want to keep that process the same.
The first solution I came across was fs.watchFile, which has been touched on in other Stack Overflow questions. In the first question, user Ivo Wetzel noted that on a Linux system fs.watchFile uses inotify, but was of the opinion that fs.watchFile should not be used for large numbers of files/folders.
In another question about fs.watchFile, user tjameson first reiterated that on Linux inotify would be used by fs.watchFile and recommended just using a combination of node-inotify-plusplus and node-walk, but again stated that this method should not be used for a large number of files. In a comment and response he suggested only watching the modified times of directories and then rescanning the relevant directory for file changes.
My biggest hurdle seems to be that even with tjameson's suggestion there is still a hard limit on the number of folders monitored (of which there are many, and growing). It would also have to be done recursively because the directory tree is somewhat deep and can change at the lower branches, so I would have to monitor the following at every folder level (or alternatively monitor the modified time of the folders and then scan to find out what happened):
creation of file or subfolder
deletion of file or subfolder
move of file or subfolder
deletion of self
move of self
Assuming inotify has limits in line with what was said above, this alone seems like it may be too many monitors when I have a significant number of nested subfolders. The really awesome way looks like it would involve kqueue, which I subsequently found as a topic of discussion on a better fs.watchFile in a Google group.
It seems clear to me that keeping a database of the relevant file and folder information is the appropriate course of action on the query side of things, but keeping that database synchronized with the actual state of the file system under the directories of concern will be the challenge.
So what does the community think? Is there a better or well known solution for attacking this problem that I am just unaware of? Is it best just to watch all directories of interest for a single change e.g. modified time and then scan to find out what happened? Is it better to watch all the relevant inotify alerts and modify the database appropriately? Is this not a problem which is solvable by a peasant like me?
Have a look at monit. I use it to monitor files for changes in my dev environment and restart my node processes when relevant project files change.
I recommend you take a look at the Dropbox API.
I implemented something similar with Ruby on the client side and Node.js on the server side.
The best approach is to keep hashes to check if the files or folders changed.
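The answer doesn't spell that idea out in code; as a language-agnostic illustration (the shared-folder path and the choice of MD5 are assumptions, and although the application itself is NodeJS the technique is the same in any language), a minimal sketch of the hashing approach could look like this in Python:
import hashlib
import os

SHARED_ROOT = "/srv/shared"   # placeholder path to the shared folder

def file_hash(path):
    """Hash a file's contents in chunks so large files never sit fully in memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(64 * 1024), b""):
            h.update(chunk)
    return h.hexdigest()

def snapshot(root):
    """Map every file under root to its content hash."""
    hashes = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            hashes[full] = file_hash(full)
    return hashes

# Take snapshots at different times and diff them to keep the search database in sync.
before = snapshot(SHARED_ROOT)
# ... some time later ...
after = snapshot(SHARED_ROOT)
added_or_changed = [p for p in after if before.get(p) != after[p]]
removed = [p for p in before if p not in after]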

Process text files FTP'ed into a set of directories on a hosted server

The situation is as follows:
A series of remote workstations collect field data and send it to a server via FTP. The data is sent as a CSV file, which is stored in a unique directory for each workstation on the FTP server.
Each workstation sends a new update every 10 minutes, causing the previous data to be overwritten. We would like to somehow concatenate or store this data automatically. The workstation's processing is limited and cannot be extended as it's an embedded system.
One suggestion was to run a cron job on the FTP server; however, the terms of service only allow cron jobs at 30-minute intervals, as it's shared hosting. Given the number of workstations uploading and the 10-minute interval between uploads, the 30-minute limit between cron runs looks like it might be a problem.
Is there any other approach that might be suggested? The available server-side scripting languages are Perl, PHP and Python.
Upgrading to a dedicated server might be necessary, but I'd still like to get input on how to solve this problem in the most elegant manner.
Most modern Linux systems support inotify to let your process know when the contents of a directory have changed, so you don't even need to poll.
Edit: With regard to the comment below from Mark Baker:
"Be careful though, as you'll be notified as soon as the file is created, not when it's closed. So you'll need some way to make sure you don't pick up partial files."
That will happen with the inotify watch you set on the directory level - the way to make sure you then don't pick up the partial file is to set a further inotify watch on the new file and look for the IN_CLOSE event so that you know the file has been written to completely.
Once your process has seen this, you can delete the inotify watch on this new file, and process it at your leisure.
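As a rough illustration with pyinotify (mentioned further down in this thread): a watch set on the directory already reports close events for files written inside it, so in practice listening for IN_CLOSE_WRITE on the upload directory is usually enough. The watched path and the processing step here are placeholders:
import pyinotify

WATCH_DIR = "/srv/ftp/incoming"  # placeholder upload directory

class UploadHandler(pyinotify.ProcessEvent):
    def process_IN_CLOSE_WRITE(self, event):
        # The writer has closed the file, so it is safe to process it now.
        print("finished upload:", event.pathname)
        # process_file(event.pathname)  # hypothetical processing step

wm = pyinotify.WatchManager()
wm.add_watch(WATCH_DIR, pyinotify.IN_CLOSE_WRITE, rec=True)
pyinotify.Notifier(wm, UploadHandler()).loop()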
You might consider a persistent daemon that keeps polling the target directories:
grab_lockfile() or exit();
while (1) {
    if (new_files()) {
        process_new_files();
    }
    sleep(60);
}
Then your cron job can just try to start the daemon every 30 minutes. If the daemon can't grab the lockfile, it just dies, so there's no worry about multiple daemons running.
Another approach to consider would be to submit the files via HTTP POST and then process them via a CGI. This way, you guarantee that they've been dealt with properly at the time of submission.
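A minimal sketch of that idea as a Python CGI (the form field name, the destination directory, and the processing hook are all assumptions) might look like:
#!/usr/bin/env python3
import cgi
import os

UPLOAD_DIR = "/home/account/incoming"   # placeholder destination directory

form = cgi.FieldStorage()
item = form["file"]                      # assumes the workstation POSTs a form field named "file"

# Stream the upload to disk in chunks; by the time we send a response,
# the file is known to be complete, so it can be processed immediately.
dest = os.path.join(UPLOAD_DIR, os.path.basename(item.filename))
with open(dest, "wb") as out:
    while True:
        chunk = item.file.read(64 * 1024)
        if not chunk:
            break
        out.write(chunk)

# append_to_archive(dest)  # hypothetical processing step, e.g. concatenate the CSV

print("Content-Type: text/plain")
print()
print("OK")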
The 30-minute limitation is pretty silly really. Starting processes in Linux is not an expensive operation, so if all you're doing is checking for new files there's no good reason not to do it more often than that. We have cron jobs that run every minute and they don't have any noticeable effect on performance. However, I realise it's not your rule, and if you're going to stick with that hosting provider you don't have a choice.
You'll need a long-running daemon of some kind. The easy way is to just poll regularly, and that's probably what I'd do. Inotify, so you get notified as soon as a file is created, is a better option.
You can use inotify from perl with Linux::Inotify, or from python with pyinotify.
Be careful though, as you'll be notified as soon as the file is created, not when it's closed. So you'll need some way to make sure you don't pick up partial files.
With polling it's less likely you'll see partial files, but it will happen eventually and will be a nasty hard-to-reproduce bug when it does happen, so better to deal with the problem now.
If you're looking to stay with your existing FTP server setup then I'd advise using something like inotify or a daemonized process to watch the upload directories. If you're OK with moving to a different FTP server, you might take a look at pyftpdlib, which is a Python FTP server library.
I've been a part of the dev team for pyftpdlib for a while, and one of the more common requests was for a way to "process" files once they've finished uploading. Because of that we created an on_file_received() callback method that's triggered on completion of an upload (see issue #79 on our issue tracker for details).
If you're comfortable in Python then it might work out well for you to run pyftpdlib as your FTP server and run your processing code from the callback method. Note that pyftpdlib is asynchronous and not multi-threaded, so your callback method can't be blocking. If you need to run long-running tasks I would recommend a separate Python process or thread be used for the actual processing work.
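A minimal sketch of that setup (the credentials, port, home directory, and the hand-off to a separate process are all placeholders) might look like:
from pyftpdlib.authorizers import DummyAuthorizer
from pyftpdlib.handlers import FTPHandler
from pyftpdlib.servers import FTPServer

class ProcessingHandler(FTPHandler):
    def on_file_received(self, file):
        # Called by pyftpdlib once an upload has fully completed.
        # Hand the work off to another process/thread here rather than
        # blocking the (asynchronous) server.
        print("upload finished:", file)
        # subprocess.Popen(["/usr/local/bin/process-media", file])  # hypothetical hand-off

authorizer = DummyAuthorizer()
authorizer.add_user("uploader", "secret", "/srv/ftp/incoming", perm="elradfmw")

handler = ProcessingHandler
handler.authorizer = authorizer

FTPServer(("0.0.0.0", 2121), handler).serve_forever()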
