Does application size matter when storing static content?

I'm planning to store HTML, PDF and image files in the public folder of my Node application instead of in an S3 bucket, because I want cleaner URLs on my own domain rather than S3 URLs. Over time, my application will grow to contain more than 50k HTML, PDF and image files.
Will this slow down the application in the future, since the application footprint will be huge, or will it still work fine?
What are the potential downsides of storing a huge amount of static content within the app?

The size of the application has a small impact on its performance. There are many other factors that have a larger impact.
One downside of storing static content within the app is that it isn’t distributed and doesn’t scale well.

Will this slow down the application in the future, since the application footprint will be huge, or will it still work fine?
It does not matter if you have 100 KB or 100 GB stored locally. The amount of data stored on the local hard drive has nothing to do with your application's performance.
If you put a zillion files all in one directory, that could slightly impact the OS performance of file operations in that directory, but if you spread your files out with no more than a couple thousand in a directory, it should not affect things at all.
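One way to keep each directory small is to derive a subdirectory from a hash of the file name. The following is a minimal sketch in Python, not something taken from the answer: the function name `sharded_path`, the `public` root and the choice of SHA-256 are all assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def sharded_path(root: Path, filename: str, levels: int = 1) -> Path:
    """Map a file name to a subdirectory derived from its hash (hypothetical helper)."""
    digest = hashlib.sha256(filename.encode("utf-8")).hexdigest()
    # Two hex characters per level gives 256 buckets at each directory level.
    parts = [digest[i * 2:(i + 1) * 2] for i in range(levels)]
    return root.joinpath(*parts, filename)


if __name__ == "__main__":
    # The same name always maps to the same bucket, so lookups need no index.
    print(sharded_path(Path("public"), "report.pdf"))
```

With one level, 50k files spread over 256 buckets average roughly 200 files per directory. The trade-off is that the bucket name becomes part of the path unless the server maps clean URLs onto it.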
The amount of data your app actually reads and writes to the hard disk has a lot to do with the app's performance. So, if these are static files that your server is asked to serve at high volume and you're comparing that situation to one where the files are hosted elsewhere and served by some other infrastructure (like S3), then that does make a difference. Serving static files is a pretty simple operation so you could always just put something like NGINX in front of your web server to handle the serving of static files very efficiently if needed.
What are the potential downsides of storing a huge amount of static content within the app?
Presumably, you don't really mean "within the app", but rather "on the local hard drive". As long as the only server process that needs access to these files is on the local machine, there is really no downside. You will want to make sure that there is some sort of backup/redundancy solution for the local hard drive, since it contains a lot of important data. Storing the data on a service like S3 will often take care of the backup and redundancy for you (or you can easily enable such features).

Related

Scaling out scenario with multiple web server and shared files

I need a recommendation or a better suggestion. I have been building a platform and have started thinking about what kind of server architecture I need. I am not an expert in server architecture, but when I launch, I need at least a stable production environment until we find a system architect.
I will have about 500 GB (or even more) of product images and some PDF files, and more as we add clients later.
I would like to have a minimal set of files (HTML and JavaScript files) on the web servers (2 or 3 in the beginning) and a shared directory where all the product images will reside. I will have a standalone backend Java process which will download images and store them in the shared directory, so when a request goes to any web server, a client should be able to see the images and PDF files.
I will have Spring MVC in the backend and sessions will be handled by a Redis cluster, so I am not worried about distributed session handling.
Basically, I need a solution to centralize all the static files (images and PDF files), which will grow exponentially over time, so that those files are accessible at all times from the web servers.
I have read about NFS, which can be accessed from the web servers.
I am wondering if NFS is a good solution for this use case. I am sure this use case is a common one.
Is there a better option than NFS?
Thanks.

Many variables will influence the options you could use, but one of the main criteria is budget.
On the cheap:
1) You have 2 or 3 servers, so purchase one large disk per system to store your static files, and use rsync to keep them all identical.
Large disks are cheap; you could even get SSDs. This can grow for a while.
2) Same disks, but use something a bit more evolved to keep them in sync. Gluster or the inotify mechanisms would do. There are many other tools you could use.
3) NFS is OK, but it does not work very well with high-traffic web servers. Your file is there and available, but if you get a lot of hits, you will have performance and/or network issues. We had that once and we dropped NFS because it was slowing down the site.
4) The NFS shortcomings can be minimized by caching frequently requested images on the web servers.
More expensive:
5) NAS. There is dedicated NAS software you could set up on a dedicated file server.
6) NAS, dedicated hardware, super fast, expensive. Can grow but $$$.
7) Distributed static file services, e.g. Akamai. They store the files and distribute them to your clients for you, so their infrastructure gets the hits. But it comes at a cost. The setup is not super complicated, if you can afford it. You pay by volume. FYI, this is not an endorsement; we used it at my last company, and there are probably other vendors that do something similar.
This is a large subject, I hope I got you started with some ideas.
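Option 4 (caching frequent images on the web servers in front of a slower shared store) could look roughly like the sketch below. It is written in Python purely for illustration; the class name `LRUFileCache`, the `loader` callable and the byte budget are assumptions, not part of any tool mentioned above.

```python
from collections import OrderedDict
from typing import Callable


class LRUFileCache:
    """Keep recently used files in local memory, up to a byte budget."""

    def __init__(self, loader: Callable[[str], bytes], max_bytes: int) -> None:
        self._loader = loader          # e.g. a function that reads from the NFS mount
        self._max_bytes = max_bytes
        self._size = 0
        self._items: "OrderedDict[str, bytes]" = OrderedDict()
        self.misses = 0                # how often the slow store was hit

    def get(self, path: str) -> bytes:
        if path in self._items:
            self._items.move_to_end(path)   # mark as most recently used
            return self._items[path]
        self.misses += 1
        data = self._loader(path)
        if len(data) <= self._max_bytes:    # never cache a file larger than the budget
            self._items[path] = data
            self._size += len(data)
            while self._size > self._max_bytes:
                _, evicted = self._items.popitem(last=False)  # drop least recently used
                self._size -= len(evicted)
        return data
```

Only cache misses reach the shared directory, so popular images stop generating NFS traffic. A real deployment would more likely rely on a caching reverse proxy than on application code like this.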

Are there any advantages in serving files from a massive object instead of from a hard drive in Node.js?

I'm curious whether there are any advantages to loading my website into a huge global object (containing file contents, file names and so on) at startup.
Is there a speed advantage (considering such a massive object)?
Is there a maximum size of a string or an object?
Do the files need to be encoded?
How will this affect my server RAM?
I'm aware that all files will be cached and I will need to reload parts of the object whenever a file is edited.

1) Yes, there is an obvious benefit: reading from RAM is faster than reading from disk (http://norvig.com/21-days.html#answers).
2) Every time you read a file from the filesystem with Node, you get back a Buffer object. Buffer objects are stored outside of the JS heap, so you're not limited by the total V8 heap size. However, each Buffer has a limit of 1 GB in size (this is changing: https://twitter.com/trevnorris/status/603345087028793345). Obviously, the overall limit is that of your process (see ulimit) and of your system as a whole.
3) That's up to you. If you just read the files as Buffers, you don't need to specify an encoding. It's just raw memory.
Other thoughts:
You should be aware that file caching is already happening in the kernel by way of the page cache. Every time you read a file from the filesystem, you're not necessarily incurring a disk seek/read.
You should benchmark your idea vs just reading from the filesystem and see what the gains are. If you're saving 10ms but it still takes > 150ms for a user to retrieve the web page over the network, it's probably a waste of time.
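A benchmark along those lines can be very small. The sketch below uses Python rather than Node purely for illustration (the same comparison can be written around `fs.readFileSync` and a plain object); the file size, repeat count and function names are arbitrary assumptions.

```python
import os
import tempfile
import time


def time_reads(read, repeats: int) -> float:
    """Return the average seconds per call of read()."""
    start = time.perf_counter()
    for _ in range(repeats):
        read()
    return (time.perf_counter() - start) / repeats


def read_from_disk(path: str) -> bytes:
    with open(path, "rb") as handle:
        return handle.read()


def run_benchmark(size_bytes: int = 64 * 1024, repeats: int = 1000):
    """Compare re-reading a file with looking it up in an in-memory dict."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "page.html")
        with open(path, "wb") as handle:
            handle.write(b"x" * size_bytes)
        in_memory = {path: read_from_disk(path)}   # the "massive object"
        disk = time_reads(lambda: read_from_disk(path), repeats)
        memory = time_reads(lambda: in_memory[path], repeats)
    return disk, memory


if __name__ == "__main__":
    disk, memory = run_benchmark()
    print(f"filesystem: {disk * 1e6:.1f} us/read, in-memory: {memory * 1e6:.1f} us/read")
```

Note that the repeated filesystem reads here are almost certainly served from the page cache, which is exactly the effect described above. Compare the difference with your measured network latency before deciding it matters.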

It's going to be a lot of programming work to load all of your static assets onto some sort of in-memory object and then to serve them from node. I don't know of any web frameworks that have built in facilities for this and you're probably going to poorly reinvent a whole bunch of wheels... So no; there's absolutely no advantage in you doing this.
Web servers like Apache handle caching files really well, if set up to do so. You can use one as a proxy for Node. They also access the file system much more quickly than Node does. Using a proxy essentially implements most of the in-memory solution you're interested in.
Proper use of expiration headers will ensure that clients won't request unchanging assets unnecessarily. You can also use a content delivery network, like Akamai, to serve static assets from servers closer to your users. Both of these approaches mean that clients often never even hit your server, though a CDN will cost you.
Serving files isn't terribly expensive as compared to sending them down the wire or doing things like querying a database.
Use a web server to proxy your static content. Then make sure client-side caching policies are set up correctly. Finally, consider a content delivery network. Don't reinvent the wheel!
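To make the expiration-header advice concrete, here is a rough Python sketch of the response headers a server might attach to a static asset. The helper names `static_headers` and `not_modified` and the truncated SHA-256 ETag are illustrative assumptions; in practice a web server such as Apache or NGINX generates these headers from configuration.

```python
import hashlib
from email.utils import formatdate


def static_headers(body: bytes, max_age_seconds: int, now: float) -> dict:
    """Build caching headers for an asset that rarely changes."""
    return {
        # Lets browsers and intermediate caches reuse the asset without asking again.
        "Cache-Control": f"public, max-age={max_age_seconds}",
        # Older-style absolute expiry, formatted as an HTTP date in GMT.
        "Expires": formatdate(now + max_age_seconds, usegmt=True),
        # Content fingerprint used for cheap revalidation once the asset expires.
        "ETag": '"' + hashlib.sha256(body).hexdigest()[:16] + '"',
    }


def not_modified(if_none_match: str, etag: str) -> bool:
    """True when the client's cached copy still matches (simplified comparison)."""
    return if_none_match == etag
```

When `not_modified` is true, the server can answer with a 304 status and no body, so even a revalidation request costs very little bandwidth.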

How does a service like Put.io work?

Just got invited to put.io ... it's a service that takes a torrent file (or a magnet link) as input and gives you a static file available for download from its own server. I've been trying to understand how a service like this works.
It can't be as simple as torrenting the file and serving it via a CDN... can it? Because the speeds it offers seem insanely fast to me.
Any idea about the bandwidth implications (or the amount used) by the service?

I believe services like this are typically just running one or more BitTorrent clients on beefy machines with a fast link. You only have to download the torrent the first time someone asks for it; then you can cache it for the next person who requests it.
The bandwidth usage is not unreasonable. Since you're caching the files, you actually end up using less bandwidth than if you simply proxied downloads for people.
I would imagine that using a CDN would not be very common. There's a certain overhead involved in that. You could possibly promote files out of your cache to a CDN once you're certain that they are and will stay popular.
The service I was involved with simply ran 14 instances of libtorrent, each on a separate drive, serving completed files straight off those drives with nginx. Torrents were requested from the web front end and prioritized before being handed over to the downloader. Each instance would download around 70 or so torrents in parallel.

With Azure, should I store my shared static JS and image files in the package release or in a blob?

Speed and cost are what I have in mind.
Say I have a few JS and image files shared across multiple websites. These are not huge image files, only a few static files like PNG sprites and common JS files.
I'm kind of lost on the choice:
- Should I keep them in the web package I release to Azure?
- Or should I put them in blobs?
What I don't know is whether, if I get a lot of hits on the blob solution, it might cost more than the same hits at the IIS level of the package.
Right or wrong?
Edit: I realize that JS files stored in a blob won't be delivered gzipped?

No need for the blobs that I can see. The database round trip isn't adding value. I'd just put the static content on the web server and let it serve it up. Let the web server handle compressing the bytes on the wire for those cases where the client indicates that they can handle GZIP compression.

Will your JS and image files be modified often? If so, putting them into the service package would mean that every time you want to update those files, you will have to recompile the service package and redeploy your instance. If you find yourself needing to update often, this will become cumbersome. From a speed perspective, you're not going to see much of a difference between serving the files from blobs and serving them from the web role (assuming the files are in fact not huge). Last but not least, from a cost perspective, if you look at the cost of blob storage ($0.15 per GB stored per month, $0.01 per 10,000 storage transactions), it's really not much. Your site would have to have a lot of traffic for the cost to become significant.
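The arithmetic behind that estimate can be sketched as follows, using the prices quoted in the answer as defaults. The function name is hypothetical, and the sketch assumes that each file request counts as exactly one storage transaction and ignores bandwidth charges.

```python
def monthly_blob_cost(
    stored_gb: float,
    requests_per_month: int,
    price_per_gb: float = 0.15,            # price quoted in the answer
    price_per_10k_requests: float = 0.01,  # price quoted in the answer
) -> float:
    """Estimate the monthly blob storage bill in dollars (bandwidth not included)."""
    storage = stored_gb * price_per_gb
    transactions = requests_per_month / 10_000 * price_per_10k_requests
    return storage + transactions


if __name__ == "__main__":
    # Hypothetical load: 1 GB of sprites and scripts, 10 million requests a month.
    print(f"${monthly_blob_cost(1, 10_000_000):.2f}")
```

For that hypothetical load the storage part is $0.15 and the transaction part is $10.00, so transactions dominate the bill once traffic is high, which is the point the answer makes about needing a lot of traffic.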

Architecture recommendation for load-balanced ASP.NET site

UPDATE 2009-05-21
I've been testing the #2 method of using a single network share. It is resulting in some issues with Windows Server 2003 under load:
http://support.microsoft.com/kb/810886
end update
I've received a proposal for an ASP.NET website that works as follows:
Hardware load-balancer -> 4 IIS6 web servers -> SQL Server DB with failover cluster
Here's the problem...
We are choosing where to store the web files (aspx, html, css, images). Two options have been proposed:
1) Create identical copies of the web files on each of the 4 IIS servers.
2) Put a single copy of the web files on a network share accessible by the 4 web servers. The webroots on the 4 IIS servers will be mapped to the single network share.
Which is the better solution?
Option 2 obviously is simpler for deployments since it requires copying files to only a single location. However, I wonder if there will be scalability issues since four web servers are all accessing a single set of files. Will IIS cache these files locally? Would it hit the network share on every client request?
Also, will access to a network share always be slower than getting a file on a local hard drive?
Does the load on the network share become substantially worse if more IIS servers are added?
To give perspective, this is for a web site that currently receives ~20 million hits per month. At recent peak, it was receiving about 200 hits per second.
Please let me know if you have particular experience with such a setup. Thanks for the input.
UPDATE 2009-03-05
To clarify my situation: the "deployments" in this system are far more frequent than in a typical web application. The web site is the front end for a back office CMS. Each time content is published in the CMS, new pages (aspx, html, etc.) are automatically pushed to the live site. The deployments are basically "on demand". Theoretically, this push could happen several times within a minute or more. So I'm not sure it would be practical to deploy to one web server at a time. Thoughts?

I'd share the load between the 4 servers. It's not that many.
You want neither that single point of contention when deploying nor that single point of failure in production.
When deploying, you can do them one at a time. Your deployment tools should automate this by notifying the load balancer that the server shouldn't be used, deploying the code, doing any pre-compilation work needed, and finally notifying the load balancer that the server is ready.
We used this strategy in a 200+ web server farm and it worked nicely for deploying without service interruption.
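The one-at-a-time procedure described above can be sketched in a few lines. This is a Python illustration only: `LoadBalancer` is a hypothetical in-memory stand-in for whatever API the real hardware load balancer exposes, and `deploy` is whatever copies and pre-compiles the site on one server.

```python
from typing import Callable, Iterable, List


class LoadBalancer:
    """Hypothetical in-memory stand-in for a real load balancer API."""

    def __init__(self, servers: Iterable[str]) -> None:
        self.active = set(servers)

    def drain(self, server: str) -> None:
        self.active.discard(server)   # stop sending traffic to this server

    def enable(self, server: str) -> None:
        self.active.add(server)       # put the server back in rotation


def rolling_deploy(
    balancer: LoadBalancer,
    servers: List[str],
    deploy: Callable[[str], None],
) -> List[str]:
    """Deploy to one server at a time so the rest keep serving traffic."""
    deployed = []
    for server in servers:
        balancer.drain(server)
        # If deploy() raises, the server stays out of rotation and the rollout stops.
        deploy(server)
        balancer.enable(server)
        deployed.append(server)
    return deployed
```

At any moment at most one server is out of rotation, so a four-server farm keeps three quarters of its capacity throughout the rollout.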

If your main concern is performance, which I assume it is since you're spending all this money on hardware, then it doesn't really make sense to share a network filesystem just for convenience's sake. Even if the network drives are extremely high performing, they won't perform as well as native drives.
Deploying your web assets is automated anyway (right?), so doing it in multiples isn't really much of an inconvenience.
If it is more complicated than you're letting on, then maybe something like DeltaCopy would be useful to keep those disks in sync.

One reason the central share is bad is that it makes the NIC on the share server the bottleneck for the whole farm and creates a single point of failure.

With IIS6 and 7, the scenario of using a single network share across N attached web/app server machines is explicitly supported. MS did a ton of perf testing to make sure this scenario works well. Yes, caching is used. With a dual-NIC server, one for the public internet and one for the private network, you'll get really good performance. The deployment is bulletproof.
It's worth taking the time to benchmark it.
You can also evaluate an ASP.NET Virtual Path Provider, which would allow you to deploy a single ZIP file for the entire app. Or, with a CMS, you could serve content right out of a content database, rather than a filesystem. This presents some really nice options for versioning.
VPP For ZIP via #ZipLib.
VPP for ZIP via DotNetZip.

In an ideal high-availability situation, there should be no single point of failure.
That means a single box with the web pages on it is a no-no. Having done HA work for a major Telco, I would initially propose the following:
Each of the four servers has its own copy of the data.
At a quiet time, bring two of the servers off-line (i.e., modify the HA balancer to remove them).
Update the two off-line servers.
Modify the HA balancer to start using the two new servers and not the two old servers.
Test that to ensure correctness.
Update the two other servers then bring them online.
That's how you can do it without extra hardware. In the anal-retentive world of the Telco I worked for, here's what we would have done:
We would have had eight servers (at the time, we had more money than you could poke a stick at). When the time came for transition, the four offline servers would be set up with the new data.
Then the HA balancer would be modified to use the four new servers and stop using the old servers. This made switchover (and, more importantly, switchback if we stuffed up) a very fast and painless process.
Only when the new servers had been running for a while would we consider the next switchover. Up until that point, the four old servers were kept off-line but ready, just in case.
To get the same effect with less financial outlay, you could have extra disks rather than whole extra servers. Recovery wouldn't be quite as quick since you'd have to power down a server to put the old disk back in, but it would still be faster than a restore operation.
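The eight-server variant (update the offline set, switch the balancer over, keep the old set ready for switchback) amounts to swapping two pools. Here is a minimal Python sketch under that reading; `BlueGreenPool` and its method names are invented for illustration and do not correspond to any particular balancer product.

```python
from typing import Callable, List


class BlueGreenPool:
    """Two equal sets of servers: one live, one offline but ready."""

    def __init__(self, live: List[str], standby: List[str]) -> None:
        self.live = list(live)
        self.standby = list(standby)

    def release(self, update: Callable[[str], None]) -> None:
        """Load the new data onto the offline servers, then switch over."""
        for server in self.standby:
            update(server)
        self.swap()

    def swap(self) -> None:
        """Switchover and switchback are the same operation."""
        self.live, self.standby = self.standby, self.live
```

Because the previous live set is left untouched after `release`, calling `swap` again restores the old version immediately, which is the fast switchback the answer values.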

Use a deployment tool, with a process that deploys to one server at a time while the rest of the system keeps working (as Mufaka said). This is a tried and tested process that will work with both content files and any compiled piece of the application (whose deployment causes a recycle of the ASP.NET process).
Regarding the rate of updates, this is something you can control. Have the updates go through a queue, and have a single deployment process that controls when to deploy each item. Notice this doesn't mean you process each update separately, as you can grab the current updates in the queue and deploy them together. Further updates will arrive in the queue, and will be picked up once the current set of updates is done.
Update: About the questions in the comment. This is a custom solution based on my experience with heavy/long processes which need their rate of updates controlled. I haven't had the need to use this approach for deployment scenarios, as for such dynamic content I usually go with a combination of DB and cache at different levels.
The queue doesn't need to hold the full information; it just needs to have the appropriate info (ids/paths) that will let your process pass the info to start the publishing process with an external tool. As it is custom code, you can have it join the information to be published, so you don't have to deal with that in the publishing process/tool.
The DB changes would be done during the publishing process; again, you just need to know where the info for the required changes is and let the publishing process/tool handle it. Regarding what to use for the queue, the main ones I have used are MSMQ and a custom implementation with info in SQL Server. The queue is just there to control the rate of the updates, so you don't need anything specially targeted at deployments.
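The batching idea can be sketched with an in-process queue. This is only an illustration in Python: the answer used MSMQ or a SQL Server table, whereas `queue.Queue`, `drain_batch` and `deploy_pending` here are stand-ins chosen for brevity.

```python
import queue
from typing import Callable, List


def drain_batch(pending: "queue.Queue[str]") -> List[str]:
    """Take everything currently queued, dropping duplicate paths."""
    batch: List[str] = []
    while True:
        try:
            item = pending.get_nowait()
        except queue.Empty:
            break
        if item not in batch:   # the same page published twice needs one deploy
            batch.append(item)
    return batch


def deploy_pending(pending: "queue.Queue[str]", deploy: Callable[[List[str]], None]) -> int:
    """Run one deployment cycle for whatever has accumulated so far."""
    batch = drain_batch(pending)
    if batch:
        deploy(batch)           # one deployment for the whole batch
    return len(batch)
```

A single worker would call `deploy_pending` in a loop. Items published while a deployment is running simply wait in the queue and form the next batch.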
Update 2: make sure your DB changes are backwards compatible. This is really important, when you are pushing changes live to different servers.

I was in charge of development for a game website that had 60 million hits a month. The way we did it was option #1. Users did have the ability to upload images and such, and those were put on a NAS that was shared between the servers. It worked out pretty well. I'm assuming that you are also doing page caching and so on, on the application side of the house. I would also deploy the new pages on demand to all servers simultaneously.

What you gain from NLB with the 4 IIS servers, you lose to the bottleneck at the app server.
For scalability, I'd recommend putting the applications on the front-end web servers.
Here in my company we are implementing that solution: the .NET app on the front ends, plus an app server for SharePoint and a SQL 2008 cluster.
Hope it helps!
regards!

We have a similar situation to you and our solution is to use a publisher/subscriber model. Our CMS app stores the actual files in a database and notifies a publishing service when a file has been created or updated. This publisher then notifies all the subscribing web applications and they then go and get the file from the database and place it on their file systems.
We have the subscribers set in a config file on the publisher but you could go the whole hog and have the web app do the subscription itself on app startup to make it even easier to manage.
You could use a UNC path for the storage; we chose a DB for convenience and portability between our production and test environments (we simply copy the DB back and we have all the live site files as well as the data).
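The publisher/subscriber flow described in this answer can be sketched as below. It is a single-process Python illustration with invented names: a plain dict stands in for the CMS database, and direct method calls stand in for whatever network notification the real publishing service uses.

```python
from typing import Callable, Dict, List


class Publisher:
    """Notifies every subscribed web application when a file changes."""

    def __init__(self) -> None:
        self._subscribers: List[Callable[[str], None]] = []

    def subscribe(self, callback: Callable[[str], None]) -> None:
        self._subscribers.append(callback)

    def publish(self, file_id: str) -> None:
        for callback in self._subscribers:
            callback(file_id)


class WebServer:
    """Keeps a local copy of each published file, fetched from the shared database."""

    def __init__(self, name: str, database: Dict[str, bytes]) -> None:
        self.name = name
        self._database = database                   # stand-in for the CMS database
        self.local_files: Dict[str, bytes] = {}     # stand-in for the local file system

    def on_file_changed(self, file_id: str) -> None:
        # Pull the current content and place it on the "local file system".
        self.local_files[file_id] = self._database[file_id]
```

Having each web application call `subscribe` at startup corresponds to the self-registration variant mentioned above, as opposed to listing subscribers in the publisher's config file.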

A very simple method of deploying to multiple servers (once the nodes are set up correctly) is to use robocopy.
Preferably you'd have a small staging server for testing and then you'd 'robocopy' to all deployment servers (instead of using a network share).
robocopy is included in the MS ResourceKit - use it with the /MIR switch.
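Conceptually, a mirror copy works out which files to copy to the target and which to delete from it. The Python sketch below illustrates that idea only; it is not how robocopy is implemented, and it compares content hashes for simplicity where sync tools commonly compare sizes and timestamps instead. The function names are hypothetical.

```python
import hashlib
from pathlib import Path
from typing import Dict, List, Tuple


def manifest(root: Path) -> Dict[str, str]:
    """Map each file's relative path to a hash of its content."""
    result: Dict[str, str] = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            relative = path.relative_to(root).as_posix()
            result[relative] = hashlib.sha256(path.read_bytes()).hexdigest()
    return result


def mirror_plan(source: Dict[str, str], target: Dict[str, str]) -> Tuple[List[str], List[str]]:
    """Return (files to copy, files to delete) so the target ends up matching the source."""
    to_copy = sorted(p for p, digest in source.items() if target.get(p) != digest)
    to_delete = sorted(p for p in target if p not in source)
    return to_copy, to_delete
```

Running `mirror_plan(manifest(staging), manifest(server))` for each deployment server shows what a mirror run would change before anything is copied.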

To give you some food for thought, you could look at something like Microsoft's Live Mesh. I'm not saying it's the answer for you, but the storage model it uses may be.
With the Mesh you download a small Windows Service onto each Windows machine you want in your Mesh and then nominate folders on your system that are part of the mesh. When you copy a file into a Live Mesh folder, which is the exact same operation as copying to any other folder on your system, the service takes care of syncing that file to all your other participating devices.
As an example, I keep all my code source files in a Mesh folder and have them synced between work and home. I don't have to do anything at all to keep them in sync; the action of saving a file in VS.Net, Notepad or any other app initiates the update.
If you have a web site with frequently changing files that need to go to multiple servers, and presumably multiple authors for those changes, then you could put the Mesh service on each web server, and as authors added, changed or removed files, the updates would be pushed automatically. As far as the authors go, they would just be saving their files to a normal old folder on their computer.

Assuming your IIS servers are running Windows Server 2003 R2 or better, definitely look into DFS Replication. Each server has its own copy of the files, which eliminates the shared network bottleneck that many others have warned against. Deployment is as simple as copying your changes to any one of the servers in the replication group (assuming a full mesh topology). Replication takes care of the rest automatically, including using remote differential compression to send only the deltas of files that have changed.

We're pretty happy using 4 web servers each with a local copy of the pages and a SQL Server with a fail over cluster.
