G-Wan / Nginx / Squid / Varnish as HTTP Live Streaming Reverse Proxy - http-live-streaming

I'm planning to build a caching reverse proxy for HTTP Live Streaming (Apple HLS).
For my situation, each segment file will be about 500-700 KB. I've read a lot of articles reviewing the performance of popular web server software, but all of them test caching of small files. Does anybody have experience building a cache server for larger files (honestly, 700 KB isn't that large, I think)? Or is there a review article I missed that you could point me to?
I could probably get the answers from a good review article, but let me list my questions below anyway.
If I increase the total number of segments, will performance decrease (since lookups take longer), and how serious is the impact?
If I want to maximize throughput (let's say I have 1 Gbps), which server software and CPU should I choose? (This is the same as asking which server software can deliver the highest throughput.)
As jeremy reminded me, the caching time will really affect the hit rate and performance. For cached segments, should I set the caching time to the rotation time? (e.g. with segments 00-99.ts at 10 s each, every .ts file is replaced 990 s after it was last updated, so the rotation time is 990 s.) Or is there a better approach?
Thank you.

500-700 KB files are still very small. I've had great success with both Nginx and Varnish for this exact task.
You will want to make sure you have a fairly long expiration time on your .ts files (you want cache hits on these), and you want to set the expiration on your .m3u8 files to less than half of your segment length.
This is especially true if you are going to be using a CDN, since the CDN will honor the cache control headers (typically) and you will want to limit the number of requests back to the origin.
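To make that concrete, here is a minimal sketch of that header policy as a Node/TypeScript origin; the port, the 10-second segment length, and the one-hour max-age are assumptions for illustration, and in practice you would express the same policy in your Nginx or Varnish configuration:

    // Minimal sketch of the cache-header policy described above (assumed values, not a drop-in config).
    import { createServer } from "node:http";

    const SEGMENT_LENGTH_SECONDS = 10; // assumed HLS target duration

    createServer((req, res) => {
      const url = req.url ?? "/";
      if (url.endsWith(".ts")) {
        // Segments never change once written, so let proxies and CDNs cache them for a long time.
        res.setHeader("Cache-Control", "public, max-age=3600");
      } else if (url.endsWith(".m3u8")) {
        // The playlist is rewritten every segment, so keep its TTL under half the segment length.
        res.setHeader("Cache-Control", `public, max-age=${Math.floor(SEGMENT_LENGTH_SECONDS / 2) - 1}`);
      }
      res.statusCode = 200;
      res.end(); // body omitted; a real origin would stream the file from disk
    }).listen(8080);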

Related

Should I add caching or not to my web app?

I'm working on a node.js web app that's mainly an image-based bulletin board. I read somewhere that you should always cache because it improves performance, so I added my own caching, which basically works like this: when a user visits the bulletin board, the number of uploaded posts is cached; on the next visit, that cached number is compared with the current number of posts, and if they match the cached query kicks in, otherwise the new posts are fetched. This isn't a matter of code or anything. My code works perfectly and does what I want it to do, even if I might have explained it poorly or given the wrong impression somehow. The problem is that after doing all the work I read that caching frequently updated data is wrong, and this is a bulletin board I'm talking about, where users always upload images. So is it best to remove the cache or keep it?
"There are two hard things in computer science: cache invalidation,
naming things, and off-by-one errors."
— Phil Karlton
Caching depends heavily on the kind of application. Just because you've heard "caching is always better" doesn't mean you should always cache. A good rule of thumb:
Lean toward caching the data that is read-heavy but seldom updated.
Have a strategy for cache invalidation on write-heavy operations.
If you believe you will be doing crazy amounts of reads and writes against the same entity, consider microcaching, if you're comfortable with a bit of stale data for, say, less than a minute. Just set the expiry on your cache to a minute.
If you feel you will benefit from serving from the cache (and, by the way, what kind of cache are you talking about? In-memory? HTTP caching? Some kind of backend, such as Redis/Memcached?), have a plan to always check the cache first, and if the data isn't there, serve from your data store and then cache it. Then, on POST/PATCH/DELETE calls, invalidate by that key (a rough sketch of this pattern follows below).
In the case of your bulletin board, you might want to consider invalidating the cache any time someone posts or uploads anything, then caching the subsequent call. But, as I mentioned, be sure to have a strategy for invalidating.
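For the bulletin board, the whole thing can be as small as a cache-aside helper like the sketch below; the in-memory Map, the one-minute TTL, and loadPostsFromDb are all just illustrative assumptions:

    // Hedged sketch of the cache-aside pattern described above, using a plain in-memory Map.
    type Entry<T> = { value: T; expiresAt: number };
    const cache = new Map<string, Entry<unknown>>();
    const TTL_MS = 60_000; // "microcaching": tolerate up to a minute of staleness

    async function getOrLoad<T>(key: string, load: () => Promise<T>): Promise<T> {
      const hit = cache.get(key);
      if (hit && hit.expiresAt > Date.now()) return hit.value as T; // serve from cache
      const value = await load();                                   // miss: hit the data store
      cache.set(key, { value, expiresAt: Date.now() + TTL_MS });
      return value;
    }

    function invalidate(key: string): void {
      cache.delete(key); // call this on POST/PATCH/DELETE for the affected board
    }

    // Usage (loadPostsFromDb is a hypothetical data-store call): reads cache, writes invalidate.
    // const posts = await getOrLoad("board:42", () => loadPostsFromDb(42));
    // invalidate("board:42"); // after a new image is uploaded to board 42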
Adding caching makes your application more complex. You shouldn't add it just because some random people on the internet say you should. There are tons of situations where caching makes things better, and others where it makes things much worse.
Caching is a solution to a problem. So what problem are you seeing? Slow client experience because of repeated downloads? Is your server load high? Bandwidth high? Then perhaps you should add in some kind of caching.
Too often people hear of solutions and then go looking for a problem. Identify your problems first, then look into solutions.

simulate user load on hardware router

I am trying to simulate user load on a hardware router. I am specifically trying to emulate the average load of a home router.
What I need to do is load it up over a week-long period at different times and perform the following:
Data Transfer
Torrent Downloads
HTTP/HTTPS Pages requests to different pages. Static content, dynamic content. etc.
I would need to repeat this at specific intervals and be able to test multiple routers at once.
Does anyone know of any software or scripts that will achieve this?
Cheers
Sure. You might be surprised to learn that the load on an average home router is probably pretty low most of the time. Do the math: even downloading at maximum DSL or cable router speed (even if it were small packet sizes, which in higher loads is not usually the case) is just not a significant load on a modern CPU these days.
Scripting loads is easy. I have a script that I bang against Comcast sometimes when I doubt their last mile link to my home. It simply uses wget (or try curl) to download a file of reasonable size repeatedly and records the download statistics (time and/or data rate) of the transfers. Just find a .pdf or other file of the size you need from around the net somewhere, or use a busy website with lots of content. Just avoid the little guys who might have to pay for that bandwidth you are consuming in your test. Better yet, Amazon S3 storage (and transfer bandwidth) is very cheap these days and easy to use. You could put some files of your own choosing up there, and download those repeatedly for your test environment instead of stealing bandwidth from someone else! ;)
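If you'd rather script it in Node than shell, here is a rough TypeScript equivalent of that wget loop; the URL and iteration count are placeholders, and Node 18+ is assumed for the global fetch:

    // Repeatedly download a file you control and record time and data rate per run.
    const url = "https://example.com/test-file.pdf"; // hypothetical test file, substitute your own
    const iterations = 10;

    async function run(): Promise<void> {
      for (let i = 0; i < iterations; i++) {
        const start = performance.now();
        const res = await fetch(url);
        const body = await res.arrayBuffer(); // force the full download
        const seconds = (performance.now() - start) / 1000;
        const mbit = (body.byteLength * 8) / 1_000_000 / seconds;
        console.log(`run ${i + 1}: ${body.byteLength} bytes in ${seconds.toFixed(2)} s (${mbit.toFixed(2)} Mbit/s)`);
      }
    }

    run().catch(console.error);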
Never played with any torrent clients, so I can't help you there, but I bet there are some you can script.
Also, you might check out netperf. I don't know the status of that project, but I've used it in the past to generate very high network loads. Google for it.
Have fun and good luck!
-Chris

How much does a single request to the server cost

I was wondering how much you gain by putting all of your CSS, scripts, and other downloadable assets into one file.
I know you gain a lot by using sprites, but at some point it might actually hurt to do that.
For example, my website uses a lot of small icons and most pages use different icons. After combining all those icons together I might end up with over 500 KB in total, but if I make one sprite per page it is reduced to almost 50 KB per page, so that's cool.
But what about JS/CSS scripts? How much would I gain by making a separate script for each page, each just over ~100 lines? Or maybe I wouldn't gain anything at all?
Basically, I want to know how much a single request to download a file costs, and whether it is really bad to have many script/image files with today's modern browsers and high-speed connections.
EDIT
Thank you all for your answers; it was hard to choose just one because every answer addressed my question. I chose to reward the one that, in my opinion, answered the question about request cost most directly, but I won't accept any single answer as correct because all of them were.
Multiple requests means more latency, so that will often make a difference. Exactly how costly that is will depend on the size of the response, the performance of the server, where in the world it's hosted, whether it's been cached, etc... To get real measurements you should experiment with your real world examples.
I often use PageSpeed, and generally follow the documented best practices: https://developers.google.com/speed/docs/insights/about.
To try answering your final question directly: additional requests will cost more. It's not necessarily "really bad" to have many files, but it's generally a good idea to combine content into a single file when you can.
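If you want to get a feel for the latency cost yourself, a rough sketch like the following (URLs are placeholders, Node 18+ assumed for the global fetch) times many small downloads against one combined one:

    // Illustrative only: time N separate downloads vs one combined file.
    async function timeAll(label: string, urls: string[]): Promise<void> {
      const start = performance.now();
      for (const u of urls) {
        await (await fetch(u)).arrayBuffer(); // sequential on purpose: worst-case latency stacking
      }
      console.log(`${label}: ${(performance.now() - start).toFixed(0)} ms`);
    }

    async function main(): Promise<void> {
      // Hypothetical asset URLs; browsers fetch in parallel, so treat this as an upper bound.
      const icons = Array.from({ length: 20 }, (_, i) => `https://example.com/icons/icon-${i}.png`);
      await timeAll("20 separate icons", icons);
      await timeAll("one combined sprite", ["https://example.com/icons/sprite.png"]);
    }

    main().catch(console.error);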
Your question isn't really answerable in a generic way.
There are a few reasons to combine scripts and stylesheets.
Browsers using HTTP/1.1 will open multiple connections, typically 2-4 for every host. Because almost every site has the actual HTML file and at least one other resource like a stylesheet, script or image, these connections are created right when you load the initial URL like index.html.
TCP connections are costly to set up. That's why browsers open multiple connections ahead of time.
Connections are usually limited to a small number and each connection can only transfer one file at a time.
That said, you could split your files across multiple hosts (e.g. an additional static.example.com), which increases the number of hosts / connections and can speed up the download. On the other hand, this brings additional overhead, because of more connections and additional DNS lookups.
On the other hand, there are valid reasons to leave your files split.
The most important one is HTTP/2. HTTP/2 uses only a single connection and multiplexes all file downloads over that connection. There are multiple demos online that demonstrate this, e.g. http://www.http2demo.io/
If you leave your files split, they can also be cached separately. If you have just small parts changing, the browser could just reload the changed file and all others would be answered using 304 Not Modified. You should have appropriate caching headers in place of course.
That said, if you have the resources, you could serve all your files separately over HTTP/2 for clients that support it, and fall back to combined files for older clients that make requests over HTTP/1.1.
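To make the 304 Not Modified behaviour concrete, here is a minimal sketch of a Node/TypeScript origin; deriving the ETag from a content hash is just one reasonable choice, and the stylesheet body is a stand-in:

    // If the client's ETag still matches, answer 304 with no body; otherwise send the full file.
    import { createServer } from "node:http";
    import { createHash } from "node:crypto";

    const body = "body { color: #333; }";                       // stand-in for a small stylesheet
    const etag = `"${createHash("sha1").update(body).digest("hex")}"`;

    createServer((req, res) => {
      if (req.headers["if-none-match"] === etag) {
        res.writeHead(304, { ETag: etag });                     // nothing changed: tiny response
        res.end();
        return;
      }
      res.writeHead(200, { ETag: etag, "Content-Type": "text/css" });
      res.end(body);                                            // first visit: full download
    }).listen(8080);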
Tricky question :)
Of course, the trivial answer is that more requests take more time, but it is not necessarily that simple.
Browsers open multiple HTTP connections to the same host; see http://sgdev-blog.blogspot.hu/2014/01/maximum-concurrent-connection-to-same.html. Because of that, downloading one huge file instead of using parallel downloads is considered a performance bottleneck by http://www.sitepoint.com/seven-mistakes-that-make-websites-slow/
Web servers should use gzip content-encoding whenever possible, so text resources such as HTML, JS, and CSS end up well compressed.
Most of those assets are static content, so a standard web server should use ETag caching on them. That means the next download will be something like 26 bytes, since the server answers "not changed" instead of sending the 32 KB of JavaScript over again.
Because of the ETag cache, the whole web site should be cacheable (I assume you're programming a game or something like that, not some old-school J2EE servlet page).
If you really want to go for big files, I would suggest making 2-4 big files and downloading those.
So to put it together:
If you have only static content, then it is all the same, because ETag caching will short-circuit any real download from the server; the server returns a 304 Not Modified answer.
If you have some generated dynamic content (such as servlet pages), keep the JS and CSS separate, as they can be ETag-cached separately, and only the servlet page needs to be downloaded.
Check that your server supports gzip content encoding for compression; this helps a lot :) (a quick sketch follows at the end of this answer)
If you have multiple pieces of dynamic content (such as multiple dynamically changing images), it makes sense to serve them as 2-4 separate images to make use of the parallel HTTP connections for download (although I can hardly imagine this use case in real life).
Please ensure that you're not serving static content dynamically: load the image in a web browser, open the network traffic view, reload with F5, and check that you get 304 Not Modified from the server instead of 200 OK and real traffic.
The biggest performance optimization is that you don't pull anything from the server, and it comes out of the box if used properly :)
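To illustrate the gzip point above, here is a minimal sketch in TypeScript (Node); real servers such as Apache or nginx do this with a configuration switch, and the script content here is just a stand-in:

    // Compress text responses when the client advertises gzip support.
    import { createServer } from "node:http";
    import { gzipSync } from "node:zlib";

    const js = "function hello() { console.log('hello'); }\n".repeat(1000); // stand-in for a ~40 KB script

    createServer((req, res) => {
      const acceptsGzip = String(req.headers["accept-encoding"] ?? "").includes("gzip");
      if (acceptsGzip) {
        res.writeHead(200, { "Content-Type": "application/javascript", "Content-Encoding": "gzip" });
        res.end(gzipSync(js)); // typically a small fraction of the original size for text assets
      } else {
        res.writeHead(200, { "Content-Type": "application/javascript" });
        res.end(js);
      }
    }).listen(8080);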
I think @DigitalDan has the best answer.
But the question hides the real one: how do I make my page load faster? Or at least, APPEAR to load faster...
I would add something about "above the fold": basically you want to inline as much as will allow your page to render the main visible content on the first round trip, as that is what is perceived as the fastest by the user, and make sure nothing else on the page blocks that...
Archibald explains it well:
https://www.youtube.com/watch?v=EVEiIlJSx_Y
How much you gain from any of these approaches may vary based on your specific needs, but I will talk about my case: in my web application we don't combine all files. Instead, we have two types of files: common files that are needed globally by the application, and per-page files that are used only for their specific case. Here is why.
Based on a request-analysis chart for my web application, here is what you need to consider:
DNS lookup happens only once, as it is cached after that (although the DNS name might already be cached anyway).
On each request we have:
request start + initial connection + SSL negotiation + time to first byte + content download
The main factor that takes the majority of the request time in most cases is the content download size. So if I have multiple files that are all needed on all pages, I combine them into one file to save the TCP overhead; on the other hand, if some files are only needed on specific pages, I keep them separate to save content download time on the other pages.
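A rough way to see those phases for yourself is the sketch below, in TypeScript on Node, which timestamps DNS lookup, TCP connect, TLS negotiation, time to first byte, and content download for a single request; the URL is a placeholder:

    // Print elapsed time at each phase of one HTTPS request, using Node's socket events.
    import { request } from "node:https";

    const start = Date.now();
    const mark = (label: string) => console.log(`${label}: +${Date.now() - start} ms`);

    let firstByte = false;
    const req = request("https://example.com/", (res) => {
      res.on("data", () => {
        if (!firstByte) { firstByte = true; mark("time to first byte"); }
        // keep draining the body
      });
      res.on("end", () => mark("content downloaded"));
    });

    req.on("socket", (socket) => {
      socket.on("lookup", () => mark("DNS lookup"));
      socket.on("connect", () => mark("TCP connect"));
      socket.on("secureConnect", () => mark("TLS negotiated"));
    });

    req.end();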
This is actually a very relevant question (topic) that many web developers face.
I will add my answer alongside the other contributors to this question.
Introduction before getting to the answer
High-performance web sites depend on different factors; here are some considerations:
Website size
Content type of the website (is the primary content text, images, video, or a mixture?)
Traffic on your website (how many people visit your website on average)
Web host location vs. your primary visitors' location (within your country, region, or worldwide); it matters a lot if your website targets Europe and your host is in the US.
Web host server (hardware) technology; I prefer SSD disks.
How the web server (software) is set up and optimized
Whether it is a dynamic or static web site
If dynamic, how your code and database are structured and designed
By defining your need you might be able to find the proper strategy.
Regarding your question in general
Regarding your website, I recommend you look at Steve Souders' 14 recommendations in his book High Performance Web Sites.
Steve Souders' 14 rules:
Make fewer HTTP requests
Use a Content Delivery Network (CDN)
Add an Expires Header
Gzip Components
Put Style-sheets at the Top
Put Scripts at the Bottom
Avoid CSS Expressions
Make JavaScript and CSS External if possible
Reduce DNS Lookups
Minify JavaScript
Avoid Redirects
Remove Duplicate Scripts
Configure ETags
Make Ajax Cacheable
Regarding your question
So if we take JS/CSS into consideration, the following will help a lot:
It is better to split different code into different files.
Example: you might have page1, page2, page3 and page4.
Page1 and page2 use js1 and js2
Page3 uses only js3
Page4 uses js1, js2 and js3
So it will be a good idea to split the JavaScript into 3 files. You are not interested in including everything you have when you do not use it.
CSS Sprites
CSS at the top and JS at the end
Minifying JavaScript
Put your JavaScript and CSS in external files
Use a CDN; if you use jQuery, for example, do not host it on your own website, just use the recommended CDN address.
Conclusion
I am pretty sure there are more details to write, and not all of the advice is necessary to implement, but it is important to be aware of it. As I mentioned before, I suggest reading this small book; it gives you more details. And finally, there is no perfect, final solution. You need to start somewhere, do your best, and improve it. Nothing is permanent.
Good luck.
The answer to your question is: it really depends.
The ultimate goal of page load optimization is to make your users feel that your page loads fast.
Some suggestions:
Do not merge common library JS/CSS files like jQuery, because they might already be cached by the browser from visits to other sites, so you don't even need to download them;
Merge resources, but at least separate the resources required for the first screen from the others, because the earlier the user sees something meaningful, the faster your page feels;
If several of your pages share some resources, split the merged files into shared resources and page-specific resources, so that when the second page is visited the shared ones might already be cached by the browser and the page loads faster;
Users might be on a phone with a slow or inconsistent 3G/4G network, so even 50 KB of data or 2 extra requests can make a very noticeable difference.
It's really bad to have a lot of 100-line files, and it's also really bad to have just one or two big files, for each type (CSS/JS/markup).
Desktops mostly have high-speed connections, while mobile also has high latency.
Given all the theory on this topic, I think the best approach should be more practical and less exact, based on actual connection speeds and device types from a statistical point of view.
For example, I think this is the best way to go today:
1) Put everything needed to show the first page/functionality to the user in one file, which should be under 100 KB - this is an absolute requirement.
2) After that, split or group the remaining files into sizes where the latency is no longer noticeable relative to the download time.
To make it simple and concrete: if we assume the time to first byte is around ~200 ms, each file should be between ~120 KB and ~200 KB, which works well for most of today's connections on average.
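As a rough sanity check on that figure (a back-of-the-envelope sketch only, with assumed effective connection speeds of 5-8 Mbit/s), the idea is to size files so the transfer time is comparable to the fixed ~200 ms per-request cost:

    // How many KB can be transferred in ~200 ms at a few assumed connection speeds?
    const rttSeconds = 0.2;                            // ~200 ms to first byte, as above
    for (const mbps of [5, 8]) {                       // assumed effective speeds, not measured data
      const bytesPerSecond = (mbps * 1_000_000) / 8;
      const kb = (bytesPerSecond * rttSeconds) / 1000;
      console.log(`${mbps} Mbit/s -> ~${Math.round(kb)} KB per 200 ms`);
    }
    // Prints roughly 125 KB and 200 KB, which is where the ~120-200 KB guideline comes from.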

Is there such a thing as a reverse CDN? (content 'retrieval' network)

Our clients upload a serious amount of data from all over the world, and we'd like to do our best to make that as painless as possible. Our clients upload 2 GB worth of files over their sometimes very 'retail' broadband packages (with capped upload speeds), which draws out upload times to 24-48 hours. At any given time we have 10 or more concurrent uploads, and in peak periods we can have 100 concurrent uploads. So we decided to consider ways to reduce latency and keep our clients' traffic local... so just as a CDN has download servers in various locations, we'd like upload servers.
Any experience or thoughts?
We're not a huge company but this is a problem worth solving so we'll consider all options.
What about putting some servers physically closer to your clients?
Same ISP, or at the very least in the same countries. Then you just collect it on a schedule. I don't imagine they're getting top speeds when there are 100 of them uploading to you either, so the sooner you can get the uploads completed the better.
Also, do they need to upload this stuff immediately? Can some of them post a DVD for whatever isn't time-sensitive? I know it sucks dealing with media in the post... so it's hardly ideal.
A reverse CDN sort of situation would only really happen if you had multiple clients using torrents and seeding their uploads (somehow) to one of your servers.
You haven't really said if this is a problem for you, or your clients. So, some more info is going to get you a better answer here.
2GB per what time period? Hour? Day?
If your operation is huge, I wouldn't be too surprised if Akamai or one of the other usual CDN suspects can provide this service to you for the right price. You might get your bizdev folks (or purchasing) in touch with them.

Tux, Varnish or Squid? [closed]

We need a web content accelerator for static images to sit in front of our Apache web front-end servers.
Our previous hosting partner used Tux with great success and I like the fact it's part of Red Hat Linux which we're using, but its last update was in 2006 and there seems little chance of future development. Our ISP recommends we use Squid in reverse caching proxy role.
Any thoughts between Tux and Squid? Compatibility, reliability and future support are as important to us as performance.
Also, I read in other threads here about Varnish; anyone have any real-world experience of Varnish compared with Squid, and/or Tux, gained in high-traffic environments?
Cheers
Ian
UPDATE: We're testing Squid now. Using ab to pull the same image 10,000 times with a concurrency of 100, both Apache on its own and Squid/Apache burned through the requests very quickly. But Squid made only a single request to Apache for the image then served them all from RAM, whereas Apache alone had to fork a large number of workers in order to serve the images. It looks like Squid will work well in freeing up the Apache workers to handle dynamic pages.
In my experience Varnish is much faster than Squid, but just as importantly it's much less of a black box than Squid is. Varnish gives you access to very detailed logs that are useful when debugging problems. Its configuration language is also much simpler and much more powerful than Squid's.
@Daniel, @MKUltra, to elaborate on Varnish's supposed problems with cookies: there aren't really any. It is completely normal not to cache a response if it sets a cookie. Cookies are mostly meant to distinguish different users' preferences, so I don't think one would want to cache these (especially if they include secret information like a session ID or a password!).
If your server sends cookies with your .js files and images, that's a problem on your backend side, not on Varnish's side. As referenced by @Daniel (link provided), you can force caching of these files anyway, thanks to the really cool language/DSL (VCL) integrated into Varnish...
If you're looking to push static images and a lot of them, you may want to look at some basics first.
Your application should ensure that all the correct headers are being passed, Cache-Control and Expires for example. That should result in the clients' browsers caching those images locally and cutting down on your request count.
Use a CDN (if it's in your budget); this brings the images closer to your clients (generally) and will result in a better user experience for them. For the CDN to be a productive investment you'll again need to make sure all your necessary caching headers are properly set, as per the previous point.
After all that, if you are still going to use a reverse proxy, I recommend using nginx in proxy mode over Varnish and Squid. Yes, Varnish is fast, and as fast as nginx, but what you want to do is really quite simple; Varnish comes into its own when you want to do complex caching and ESI. So Keep It Simple, Stupid: nginx will do your job very nicely indeed.
I have no experience with Tux, so I can't comment on it, sorry.
For what it's worth, I recently set up nginx as a reverse proxy in front of Apache on a 6-year-old low-power webserver (running Fedora Core 2) which was under a mild DDoS attack (10K req/sec). Page loads were snappy (<100 ms), system load stayed low at around 20% CPU utilization, and memory consumption was very small. The attack lasted a week, and visitors saw no ill effects.
Not bad for over half a million hits per minute sustained. Just be sure to log to /dev/null.
It's interesting that no one has mentioned Apache Traffic Server (formerly Yahoo! Traffic Server): http://trafficserver.apache.org/
Please have a look at it, it works beautifully.
We use Varnish on http://www.mangahigh.com and have been able to scale from around 100 concurrent users pre-Varnish to over 560 concurrent post-Varnish (server load remained at 0 at this point, so there's plenty of room to grow!). The documentation for Varnish could be better, but it is quite flexible once you get used to it.
Varnish is meant to be a lot faster than Squid (having never used Squid, I can't say for certain) - and http://users.linpro.no/ingvar/varnish/stats-2009-05-19 shows Twitter, Wikia, Hulu, perezhilton.com and quite a number of other big names also using it.
Both Squid and nginx are specifically designed for this. nginx is particularly easy to configure for a server farm, and can also act as a frontend to FastCGI.
I've only used squid and can't compare. We use squid to cache an entire site on a server in the USA (all data gets pulled from a machine in Germany). It was pretty easy to set up and works nicely. I've found the documentation to be kind of lacking unless you already know what to look for.
Since you already have Apache serving the static and dynamic content, I would recommend going with Varnish.
That way you can use Apache to deliver the content and Varnish to cache it for you. Varnish is very flexible, giving you both caching and load-balancing features for growing your website in the best way.
We are about to roll out a Varnish 2.01 server in front of an IIS 6 installation. The only caveat we've had was with SSL (as Varnish can't handle SSL), so we've also installed Nginx to handle those requests.
In all our testing we've seen a 66% increase in the amount of traffic the site can handle.
My only gripe is that Varnish doesn't handle cookies well, and the documentation is still a bit scattered.
Nobody has mentioned that Squid follows the HTTP specification to the letter (or at least tries to), whereas Varnish does not. In my opinion, this means Varnish is better suited to caching content for individual sites (by extensively tuning Varnish), and Squid is better for caching content for many sites (each of which will have to make its content "cacheable" according to the spec).
