I need to concurrently process a large number of files (thousands of different files, with an average size of about 2MB per file).
All the information is stored on one (1.5TB) network hard drive, and it will be processed by about 30 different machines. For efficiency, each machine will read (and process) a different subset of the files.
After reading a file from the 'incoming' folder on the 1.5TB drive, each machine will process the information and write the result back to the 'processed' folder on the same drive. The processed output for each file is roughly the same size as the input (about 2MB per file).
Which is the better approach:
(1) For every processing machine M, copy all files that will be processed by M onto its local hard drive, and then read and process the files locally on machine M.
(2) Instead of copying the files to each machine, have every machine access the 'incoming' folder directly (over NFS), read the files from there, and then process them locally.
Which idea is better? Are there any dos and don'ts when doing something like this?
I am mostly curious whether it is a problem to have 30 or so machines read from (or write to) the same network drive at the same time.
(Note: existing files will only be read, never appended to or overwritten; new files will be created from scratch, so there are no issues of concurrent access to the same file.) Are there any bottlenecks I should expect?
(I am using Linux, Ubuntu 10.04 LTS, on all machines, if it matters at all.)
I would definitely do #2 - and I would do it as follows:
Run Apache on your main server with all the files (or some other HTTP server, if you really want). There are several reasons I'd do it this way:
HTTP is basically pure TCP with some headers on it. Once the request is sent, it's a very "one-way" protocol: low overhead, not chatty, and highly efficient.
If you (for whatever reason) decided you needed to move or scale it out (using a cloud service, for example), HTTP would be a much better way to move the data around over the open Internet than NFS. You could use SSL if needed. You could get through firewalls if needed. And so on.
Depending on the access pattern of your files, and assuming the whole file needs to be read, it's easier and faster to do one network operation and pull the whole file in one go, rather than constantly issuing I/O requests over the network every time you read a smaller piece of the file.
It's easy to distribute and run an application that does all this without relying on the existence of network mounts or specific file paths. If you have the URL to the files, the client can do its job. It doesn't need established mounts or hard-coded directory paths, and it doesn't need to become root to set up such mounts.
If you have NFS connectivity problems, the whole system can get flaky when you try to access the mounts and they hang. With HTTP running in a user-space context, you just get a timeout error, and your application can take whatever action it chooses (page you, log errors, etc.).
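To make that concrete, here is a minimal sketch of what a worker could look like under this scheme (Node 18+/TypeScript). The server host, the PUT upload endpoint, and transform() are assumptions for illustration only, not part of the question:

```typescript
// Hypothetical worker: pull one whole file over HTTP, process it locally,
// and upload the result. Server host, the PUT upload endpoint and transform()
// are assumptions for illustration.
const SERVER = "http://fileserver.example";

function transform(input: Buffer): Buffer {
  return input; // placeholder for the real per-file computation
}

async function processOne(name: string): Promise<void> {
  // One request pulls the whole ~2MB file in a single round trip.
  const res = await fetch(`${SERVER}/incoming/${name}`);
  if (!res.ok) throw new Error(`GET ${name} failed: ${res.status}`);
  const input = Buffer.from(await res.arrayBuffer());

  const output = transform(input);

  // Write the result back to the 'processed' area (assumes the server accepts PUT).
  const put = await fetch(`${SERVER}/processed/${name}`, { method: "PUT", body: output });
  if (!put.ok) throw new Error(`PUT ${name} failed: ${put.status}`);
}

processOne("sample-0001.dat").catch(console.error);
```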
I need some recommendations or a better suggestion. I have been building a platform and have started thinking about what kind of server architecture I need. I am not an expert in server architecture, but when I launch, I need at least a stable production environment until we find a system architect.
I will have about 500GB (or even more) of product images and some PDF files, and more as we get more clients later.
I would like to have a minimal set of files (HTML and JavaScript) on the web servers (2 or 3 in the beginning) and a shared directory where all the product images will reside. I will have a standalone backend Java process which will download images and store them in the shared directory, so when a request goes to any web server, the client should be able to see the images and PDF files.
I will have Spring MVC in the backend, and sessions will be handled by a Redis cluster, so I am not worried about distributed session handling.
Basically, I need a solution to centralize all the static files (images and PDF files), which will keep growing over time, while keeping those files accessible from the web servers at all times.
I have read about NFS, which can be accessed from the web servers.
I am wondering whether NFS is a good solution for this use case. I am sure this is a common problem.
Is there a better option than NFS?
Thanks.
Many variables will influence the options you could use, but one of the main criteria is budget.
On the cheap:
1) You have 2 or 3 servers, so purchase one large disk per server to store your static files and use rsync to ensure they are all the same (see the sketch after this list).
Large disks are cheap; you could even get SSDs! This can grow for a while.
2) Same disks as above, but use something a bit more evolved to keep them in sync: GlusterFS or inotify-based mechanisms would do. There is plenty of other software you could use.
3) NFS is OK, but it does not work very well with heavily loaded web servers. Your files are there and available, but if you get a lot of hits, you will run into performance and/or network issues. We had that once and cut out NFS; it was slowing down the site.
4) The NFS shortcomings can be minimized by caching frequently requested images on the web servers.
More expensive:
5) NAS. There is dedicated NAS software you could set up on a dedicated file server.
6) NAS on dedicated hardware: super fast, and it can grow, but $$$.
7) A distributed static-file service, e.g. Akamai. They store the files and distribute them to your clients, so their infrastructure takes the hits, but it comes at a cost. The setup is not overly complicated if you can afford it; you pay by volume. FYI, this is not an endorsement; we used it at my last company, and there are probably other vendors that do something similar.
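For option 1, the sync step can be as simple as shelling out to rsync from whatever process drops new files onto the first server (cron would do just as well). A rough sketch, shown in Node/TypeScript purely for illustration; the host names and paths are assumptions:

```typescript
// Hypothetical push-sync for option 1: after new images land in LOCAL_DIR on
// this server, mirror the directory to the other web servers with rsync.
import { execFileSync } from "node:child_process";

const LOCAL_DIR = "/var/www/static/";                   // assumed path
const PEERS = ["web2.example.com", "web3.example.com"]; // assumed hosts

for (const host of PEERS) {
  // -a preserves attributes, -z compresses, --delete keeps the peers identical
  execFileSync("rsync", ["-az", "--delete", LOCAL_DIR, `${host}:${LOCAL_DIR}`], {
    stdio: "inherit",
  });
}
```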
This is a large subject, I hope I got you started with some ideas.
I'm planning to store HTML, PDF, and image files in my Node application's public folder instead of an S3 bucket, because I want cleaner URLs on my own domain rather than S3 URLs. Over time my application will grow to contain more than 50k HTML, PDF, and image files.
Will this slow down the application in the future, since the application footprint will be huge, or will it still work fine?
What are the potential downsides of storing a huge amount of static content within the app?
The size of the application has a small impact on its performance. There are many other factors that have a larger impact.
One downside of storing static content within the app is that it isn’t distributed and doesn’t scale well.
Will this slow down the application in the future, since the application footprint will be huge, or will it still work fine?
It does not matter whether you have 100KB or 100GB stored locally. The amount of data stored on the local hard drive has essentially nothing to do with your application's performance.
If you put a zillion files all in one directory, that could slightly impact the OS performance of file operations in that directory, but if you spread your files out with no more than a couple thousand in a directory, it should not affect things at all.
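If you do end up with tens of thousands of files, a simple hash-based layout keeps any single directory small. A sketch of one way to do it; the two-level scheme and base path are just an example, not anything prescribed by the question:

```typescript
// Hypothetical layout helper: spread files across subdirectories so no single
// directory holds more than a few thousand entries.
import { createHash } from "node:crypto";
import * as path from "node:path";

function shardedPath(baseDir: string, fileName: string): string {
  const h = createHash("md5").update(fileName).digest("hex");
  // e.g. "/data/static/ab/cd/report.pdf" -- 256*256 buckets keep directories small
  return path.join(baseDir, h.slice(0, 2), h.slice(2, 4), fileName);
}

console.log(shardedPath("/data/static", "report.pdf"));
```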
The amount of data your app actually reads from and writes to the hard disk has a lot to do with the app's performance. So, if these are static files that your server is asked to serve at high volume, and you're comparing that situation to one where the files are hosted elsewhere and served by other infrastructure (like S3), then that does make a difference. Serving static files is a pretty simple operation, though, so you could always put something like NGINX in front of your web server to handle the serving of static files very efficiently if needed.
What are the potential downsides of storing a huge amount of static content within the app?
Presumably, you don't really mean "within the app", but rather "on the local hard drive". As long as the only server process that needs access to these files is on the local machine, there is really no downside. You will want to make sure there is some sort of backup/redundancy solution for the local hard drive, since it contains a lot of important data. Storing the data on a service like S3 will often take care of backup and redundancy for you (or lets you easily enable such features).
I'm curious whether there are any advantages to loading my website into a huge global object (containing file contents, file names, and so on) at startup.
Is there a speed advantage (considering such a massive Object)?
Is there a maximum size of a string or an object?
Do the files need to be encoded?
How will this affect my server RAM?
I'm aware that all files will be cached and I will need to reload parts of the object whenever a file is edited.
1) Yes, there is an obvious benefit: reading from RAM is faster than reading from disk (http://norvig.com/21-days.html#answers).
2) Every time you read a file from the filesystem with Node, you get back a Buffer object. Buffer objects are stored outside of the JS heap, so you're not limited by the total V8 heap size. However, each Buffer has a size limit of 1GB (this is changing: https://twitter.com/trevnorris/status/603345087028793345). Obviously, the overall limit is the limit of your process (see ulimit) and of your system as a whole.
3) That's up to you. If you just read the files as Buffers, you don't need to specify an encoding. It's just raw memory.
Other thoughts:
You should be aware that file caching already happens in the kernel by way of the page cache. Every time you read a file from the filesystem, you're not necessarily incurring a disk seek/read.
You should benchmark your idea vs just reading from the filesystem and see what the gains are. If you're saving 10ms but it still takes > 150ms for a user to retrieve the web page over the network, it's probably a waste of time.
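For reference, the idea being discussed boils down to something like the sketch below: preload the public directory into a Map of Buffers at startup and serve from that. The directory name is an assumption, and as noted above you should benchmark this against plain filesystem reads before committing to it.

```typescript
// Sketch of the preload-into-memory idea: read every file in ./public once at
// startup and keep it as a Buffer (raw bytes, no encoding needed) keyed by name.
import { readFileSync, readdirSync } from "node:fs";
import * as path from "node:path";

const PUBLIC_DIR = "./public"; // assumed location of the static files
const cache = new Map<string, Buffer>();

for (const entry of readdirSync(PUBLIC_DIR, { withFileTypes: true })) {
  if (!entry.isFile()) continue; // skip subdirectories in this flat sketch
  // Buffers live outside the V8 heap, so the practical limit is process/system memory.
  cache.set(entry.name, readFileSync(path.join(PUBLIC_DIR, entry.name)));
}

const totalBytes = [...cache.values()].reduce((n, b) => n + b.length, 0);
console.log(`cached ${cache.size} files, ${totalBytes} bytes`);
```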
It's going to be a lot of programming work to load all of your static assets into some sort of in-memory object and then serve them from Node. I don't know of any web frameworks that have built-in facilities for this, and you're probably going to reinvent a whole bunch of wheels, poorly. So no; there's really no advantage in doing this yourself.
Web servers like Apache handle caching files really well, if set up to do so. You can use one as a proxy in front of Node. They also access the file system much more quickly than Node does. Using a proxy essentially gives you most of the in-memory solution you're interested in.
Proper use of expiration headers will ensure that clients don't request unchanging assets unnecessarily. You can also use a content delivery network, like Akamai, to serve static assets from servers closer to your users. Both of these approaches mean that clients never even hit your server, though a CDN will cost you.
Serving files isn't terribly expensive as compared to sending them down the wire or doing things like querying a database.
Use a web server to proxy your static content. Then make sure client-side caching policies are set up correctly. Finally, consider a content delivery network. Don't reinvent the wheel!
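As one concrete example of the "don't reinvent the wheel" route, this is roughly what letting middleware do the work looks like in an Express app. Express itself and the one-year max-age are assumptions for illustration, not something from the question:

```typescript
// Hypothetical Express setup: serve ./public with long-lived cache headers and
// ETags, instead of hand-rolling an in-memory asset store.
import express from "express";

const app = express();

app.use(express.static("public", {
  maxAge: "365d",   // clients cache unchanged assets for a year
  etag: true,       // conditional requests for anything that does change
  immutable: true,  // tell browsers versioned assets never change in place
}));

app.listen(3000, () => console.log("serving ./public on :3000"));
```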
I am facing a problem with slow execution of an exe on the Azure platform.
The steps are as follows:
1. Read data from SQL Azure and CSV files, and display it on HTML5 pages.
2. Write data to CSV files.
3. Execute an external Fortran exe, which reads data from the CSV files generated in step 2.
4. After its calculations, the Fortran exe writes data to a .txt file.
5. Read the text file generated in step 4 and display it on HTML5 pages.
Issue:
In step 3, when we invoke the Fortran exe using the process start method:
On local machines it usually takes 17-18 seconds.
On the cloud server it takes 34-35 seconds.
All the other activities take the same time on the local machines as on the cloud server.
Regarding step 3: what size local machine are you using (e.g., how many cores), since you're running an exe that may be doing some number-crunching? Now compare that to the machine size allocated in Windows Azure: are you using an Extra Small (shared core) or a Small (single core)? And what CPU does your local machine have? If you're not comparing like-for-like configurations, you'll certainly see performance differences. The same goes for RAM (an Extra Small offers 768MB, while Small through XL offer 1.75GB per core) and bandwidth (XS has 5Mbps, while Small through XL have 100Mbps per core).
Azure systems have slower IO than a local server, which is likely the reason you see the performance impact. You are also on a shared system, so your IO may vary depending on your neighbours and the overall server load. If your task is IO-intensive, the best bet is to run a VM, and if you need to persist the data, attach multiple disks to the VM and then use striping across the disks.
http://www.windowsazure.com/en-us/manage/windows/how-to-guides/attach-a-disk/
Striped IO disk performance stats:
http://blinditandnetworkadmin.blogspot.co.uk/2012/08/vm-io-performance-on-windows-azure.html
You will need to warm up the disks to get true performance figures.
Also, I found the temp storage on the VM (normally the D: drive) to have very good IO, so if you are going to use a VM, it may be worth trying there first.
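If you want a quick way to compare the temp drive against the attached/striped disks before committing, a crude write-throughput probe is often enough. A sketch, shown in Node/TypeScript purely for illustration; the paths and file size are assumptions, and the numbers are only indicative because of OS write caching:

```typescript
// Crude sequential-write probe: time how long writing a test file takes and
// report MB/s. Run it against each candidate path and compare.
import { writeFileSync, unlinkSync } from "node:fs";

function probeWrite(filePath: string, megabytes = 256): void {
  const chunk = Buffer.alloc(megabytes * 1024 * 1024, 0xab);
  const start = Date.now();
  writeFileSync(filePath, chunk);
  const seconds = (Date.now() - start) / 1000;
  console.log(`${filePath}: ${(megabytes / seconds).toFixed(1)} MB/s write`);
  unlinkSync(filePath);
}

probeWrite("D:\\temp\\io-probe.bin");     // assumed VM temp-disk path
probeWrite("E:\\striped\\io-probe.bin");  // assumed striped data-disk path
```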
The SPOJ is a website that lists programming puzzles, then allows users to write code to solve those puzzles and upload their source code to the server. The server then compiles that source code (or interprets it if it's an interpreted language), runs a battery of unit tests against the code, and verifies that it correctly solves the problem.
What's the best way to implement something like this - how do you sandbox the user input so that it can not compromise the server? Should you use SELinux, chroot, or virtualization? All three plus something else I haven't thought of?
How does the application reliably communicate results outside of the jail while also ensuring that the results are not compromised? How would you prevent, for instance, an application from writing huge chunks of nonsense data to disk, or from performing other malicious activities?
I'm genuinely curious, as this just seems like a very risky sort of application to run.
A chroot jail run under a limited user account sounds like the best starting point (i.e., NOT root or the same user that runs your web server).
To prevent huge chunks of nonsense data from being written to disk, you could use disk quotas or a separate volume that you don't mind filling up (assuming you're not testing in parallel under the same user, or you'll end up dealing with annoying race conditions).
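As a starting-point illustration (not a complete sandbox), this is roughly what launching a submission as a dedicated unprivileged user with a wall-clock timeout and an output cap looks like from Node/TypeScript; the uid/gid, paths, and limits are assumptions, and a real judge would add chroot/namespaces, CPU and memory limits, and quotas on top:

```typescript
// Hypothetical launcher: run the compiled submission as an unprivileged user,
// kill it after 5 seconds of wall-clock time, and refuse more than 1MB of output.
import { execFile } from "node:child_process";

execFile("/jail/submissions/a.out", [], {
  uid: 1500,                // dedicated unprivileged account, NOT root/www-data
  gid: 1500,
  cwd: "/jail/scratch",     // separate volume you don't mind filling up
  timeout: 5_000,           // kill runaway programs after 5 seconds
  maxBuffer: 1024 * 1024,   // cap captured stdout/stderr at 1MB
}, (err, stdout) => {
  if (err) {
    console.error("submission failed or was killed:", err.message);
    return;
  }
  console.log("verdict input:", stdout.trim());
});
```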
If you wanted something more scalable and secure, you could use dynamically provisioned virtualized hosts with your own server/client solution for communication: you have a pool of 'agents' that receive instructions to copy and compile from X repository or share, then execute a battery of tests and log the output back via the same server/client protocol. The host process can watch for excessive disk usage and report warnings if required, the agents may or may not execute the code under a chroot jail, and if you're super paranoid you would destroy the agent after each run and spin up a new VM when the next sample is ready for testing. If you're doing this at large scale in the cloud (e.g., 100+ agents running on EC2), you only ever spin up enough to accommodate demand, which reduces your costs. Again, if you're going for scale you can use something like Amazon SQS to buffer requests, while for an experimental sample project you could do something much simpler (think distributed parallel processing systems, e.g. SETI@home).