We have a dedicated GoDaddy server, and it seemed to grind to a halt when users were downloading only 3MB every 2 seconds (spread over about 20 HTTP requests).
I want to look into database locking etc. to see if that is the problem, but first I'm curious what a dedicated server ought to be able to serve.
To help diagnose the problem, host a large file on the server and download it. That will show you the transfer rate the server and your web server can actually sustain. If the transfer rate is poor, then you know the problem is the network, the server or the web server.
If it's acceptable or good, then you know the problem is in the means you have of generating those 3MB responses.
Check, measure and calculate!
PS: download the file over a fast link; you don't want the bottleneck to be your 64kbps modem :)
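If you want to measure it without eyeballing a browser download, here is a minimal sketch (the URL is a placeholder; drop a large static file, say 100MB, somewhere under your document root first and run this from a machine with a fast link):

    # Quick-and-dirty throughput check: download a large test file and print
    # the effective transfer rate. The URL below is a placeholder.
    import time, urllib.request

    URL = "http://your-server.example.com/testfile.bin"

    start = time.time()
    total = 0
    with urllib.request.urlopen(URL) as resp:
        while True:
            chunk = resp.read(1024 * 1024)
            if not chunk:
                break
            total += len(chunk)
    elapsed = time.time() - start
    print(f"{total / 1e6:.1f} MB in {elapsed:.1f}s = {total / 1e6 / elapsed:.2f} MB/s")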
A lot depends on what the 3MB is. Serving up 1.5MB/s of static data is way, way, way within the bounds of even the weakest server.
Perhaps GoDaddy does bandwidth throttling? 60MB of downloads every 2 seconds might trip some sort of bandwidth protection (either to protect their service, or to protect you from being overcharged, or both).
Check netspeed.stanford.edu from the dedicated server and see what your inbound and outbound traffic is like.
Also make sure your ISP is not limiting you to 10Mbps (GoDaddy limits dedicated servers to 10Mbps by default and will raise it to 100Mbps on request).
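For a quick sanity check on the numbers in the question (this treats the 3MB every 2 seconds as the aggregate rate, which is an assumption):

    # 3 MB every 2 seconds, converted to megabits per second
    mbytes_per_second = 3 / 2              # 1.5 MB/s
    mbits_per_second = mbytes_per_second * 8
    print(mbits_per_second)                # 12.0 -- already above a 10 Mbps cap

If that figure is per user rather than aggregate, a 10Mbps cap is blown even faster.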
I'm working on several different webapps written in Node. Each webapp gets very little traffic (maybe a few HTTP requests per day), so I run them all on a single machine with HAProxy as a reverse proxy. Each webapp consumes almost 100MB of RAM, which adds up to a lot when you have many webapps. Because each webapp receives so little traffic, I was wondering whether there is a way to have all the webapps shut down by default, but set up so that they start automatically when an incoming HTTP request arrives (and shut down again if there haven't been any HTTP requests within some fixed time period).
Yes. There are a dozen different ways to handle this; without more details I'm not sure which is best. One option is the Node VM module: https://nodejs.org/api/vm.html Another would be some kind of serverless setup. See: https://www.serverless.com/ Honestly, 100MB is a drop in the bucket at today's RAM prices. A quick Google search shows 16GB of RAM for $32, or, to put that differently, room for 160 Node apps. I'm guessing you could find better prices on eBay or somewhere similar.
Outside of learning, this would be a total waste of time. Your time is worth more than the effort it would take to set this up; even at US minimum wage it would take you less than 4 hours to earn back the cost of the RAM. Better yet, go learn Docker/Kubernetes and containerize each of those apps. That said, learning serverless would be a good use of time.
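That said, if you want to experiment with the on-demand idea anyway, here is a rough sketch of the shape of it (the ports, the node command and the idle timeout are made-up placeholders, and this is nowhere near production quality): it spawns the app on the first request and stops it after a period with no traffic.

    # Lazy-start proxy sketch: spawn the webapp on the first request, forward
    # GETs to it, and stop it after IDLE_SECONDS without traffic. Point one
    # haproxy backend at PROXY_PORT. All values below are assumptions.
    import subprocess, threading, time, urllib.request
    from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

    APP_CMD = ["node", "app.js"]      # hypothetical command for one webapp
    APP_PORT = 3001                   # port the webapp listens on
    PROXY_PORT = 8001                 # port haproxy forwards to
    IDLE_SECONDS = 300                # stop the app after 5 minutes of no traffic

    proc = None
    last_hit = 0.0
    lock = threading.Lock()

    def ensure_running():
        global proc, last_hit
        with lock:
            last_hit = time.time()
            if proc is None or proc.poll() is not None:
                proc = subprocess.Popen(APP_CMD)
                time.sleep(1.0)       # crude wait for the app to bind its port

    def reaper():
        global proc
        while True:
            time.sleep(30)
            with lock:
                if proc is not None and time.time() - last_hit > IDLE_SECONDS:
                    proc.terminate()
                    proc = None

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            ensure_running()
            with urllib.request.urlopen(f"http://127.0.0.1:{APP_PORT}{self.path}") as r:
                body = r.read()
                self.send_response(r.status)
                for k, v in r.getheaders():
                    if k.lower() not in ("transfer-encoding", "connection"):
                        self.send_header(k, v)
                self.end_headers()
                self.wfile.write(body)

    threading.Thread(target=reaper, daemon=True).start()
    ThreadingHTTPServer(("", PROXY_PORT), Handler).serve_forever()

In practice, systemd socket activation gives you the spawn-on-connect half of this for free, which is another direction worth reading about.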
Just got invited to put.io ... it's a service that takes a torrent file (or a magnet link) as input and gives you a static file available for download from its own servers. I've been trying to understand how a service like this works.
It can't be as simple as just torrenting the file and serving it via a CDN... can it? The speeds it offers seem insanely fast to me.
Any idea about the bandwidth implications (or the amount of bandwidth used) for a service like this?
I believe services like this typically just run one or more BitTorrent clients on beefy machines with a fast link. You only have to download the torrent the first time someone asks for it; after that you can serve it from cache for the next person who requests it.
The bandwidth usage is not unreasonable: since you're caching the files, you actually end up using less bandwidth than if you were to, say, simply proxy downloads for people.
I would imagine that using a CDN is not very common; there's a certain overhead involved in that. You could possibly promote files out of your cache to a CDN once you're certain that they are, and will stay, popular.
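The caching logic itself can be tiny. A rough sketch (not put.io's actual code; start_torrent_download() is a hypothetical placeholder for whatever bittorrent client you drive, and CACHE_DIR is likewise an assumption):

    # Files are keyed by the torrent's info-hash: a cache hit is served straight
    # from disk, a miss triggers one download, later requests reuse the result.
    import os
    from typing import Optional

    CACHE_DIR = "/var/cache/torrents"    # assumed cache location
    in_progress = set()

    def fetch(info_hash: str) -> Optional[str]:
        """Return a local path to serve, or None while the download is running."""
        path = os.path.join(CACHE_DIR, info_hash)
        if os.path.exists(path):
            return path                  # cache hit: no upstream bandwidth used
        if info_hash not in in_progress:
            in_progress.add(info_hash)
            start_torrent_download(info_hash, save_path=path)  # hypothetical helper
        return None                      # caller shows a "still preparing" page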
The service I was involved with simply ran 14 instances of libtorrent, each on a separate drive, serving completed files straight off those drives with nginx. Torrents were requested from the web front end and prioritized before being handed over to the downloader. Each instance would download around 70 or so torrents in parallel.
Suppose that a data source sets a tight IP-based throttle. Would a web scraper have any way to download the data if the throttle starts rejecting its requests after as little as 1% of the data has been downloaded?
The only technique I could think of a hacker using here would be some sort of proxy system. But it seems like the proxies (even fast ones) would all eventually hit the throttle.
Update: Some people below have mentioned big proxy networks like Yahoo Pipes and Tor, but couldn't these IP ranges or known exit nodes be blacklisted as well?
A list of thousands of proxies can be compiled for FREE. IPv6 addresses can be rented for pennies. Hell, an attacker could boot up an Amazon EC2 micro instance for 2-7 cents an hour.
And you want to stop people from scraping your site? The internet doesn't work that way, and hopefully it never will.
(I have seen IRC servers do a port scan on clients to see if the following ports are open: 8080, 3128, 1080. However, there are proxy servers that use different ports, and there are also legitimate reasons to run a proxy server or to have these ports open, such as running Apache Tomcat. You could bump it up a notch by using YAPH to see if a client is running a proxy server. In effect you'd be using an attacker's tool against them ;)
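To illustrate how little effort the scraping side takes, here is a minimal proxy-rotation sketch (the proxy addresses are documentation placeholders, the 403/429 rotation rule is an assumption, and requests is a third-party library):

    # Walk a proxy list and move to the next proxy as soon as the target starts
    # rejecting requests (or the proxy itself dies).
    import itertools, requests

    PROXIES = ["http://203.0.113.10:3128", "http://203.0.113.11:8080"]  # placeholders
    proxy_pool = itertools.cycle(PROXIES)
    proxy = next(proxy_pool)

    def fetch(url):
        global proxy
        for _ in range(len(PROXIES)):
            try:
                r = requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)
                if r.status_code in (403, 429):      # throttled: rotate and retry
                    proxy = next(proxy_pool)
                    continue
                r.raise_for_status()
                return r.text
            except requests.RequestException:
                proxy = next(proxy_pool)             # dead proxy: rotate too
        raise RuntimeError("all proxies throttled or unreachable")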
Someone using Tor would be hopping IP addresses every few minutes. I used to run a website where this was a problem, and resorted to blocking the IP addresses of known Tor exit nodes whenever excessive scraping was detected. You can implement this if you can find a regularly updated list of Tor exit nodes, for example, https://www.dan.me.uk/tornodes
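If you go that route, the blocking side can be a small job that refreshes the list periodically. A hedged sketch (the list URL is a placeholder and the one-IP-per-line format is an assumption, so check whichever source you actually use):

    # Pull a regularly updated list of Tor exit IPs, keep it in a set, and
    # check client addresses against it before deciding how to respond.
    import urllib.request

    TOR_EXIT_LIST_URL = "https://example.com/tor-exit-nodes.txt"   # placeholder URL

    def load_exit_nodes(url=TOR_EXIT_LIST_URL):
        with urllib.request.urlopen(url, timeout=30) as resp:
            text = resp.read().decode("utf-8", errors="replace")
        return {line.strip() for line in text.splitlines()
                if line.strip() and not line.startswith("#")}

    EXIT_NODES = load_exit_nodes()       # refresh this periodically, e.g. hourly

    def is_tor_exit(client_ip):
        return client_ip in EXIT_NODES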
You could use a P2P crawling network to accomplish this task. There would be a lot of IPs available, and it would be no problem if one of them got throttled. You could also combine a lot of client instances with a proxy configuration, as suggested in other answers.
I think you can use YaCy, an open-source P2P crawling network.
A scraper that wants the information will get the information. Timeouts, changing user-agent names, proxies, and of course EC2/Rackspace or any other cloud service that can start and stop servers with new IP addresses for pennies.
I've heard of people using Yahoo Pipes to do such things, essentially using Yahoo as a proxy to pull the data.
Maybe try running your scraper on Amazon EC2 instances. Every time you get throttled, start up a new instance (with a new IP) and kill the old one.
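A hedged sketch of that rotation with boto3 (the AMI ID, instance type and region are placeholders; key pairs, security groups and error handling are left out):

    # Launch a fresh instance (fresh public IP) and terminate the previous one.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")      # assumed region

    def rotate(old_instance_id=None, ami="ami-12345678"):   # placeholder AMI
        resp = ec2.run_instances(ImageId=ami, InstanceType="t3.micro",
                                 MinCount=1, MaxCount=1)
        new_id = resp["Instances"][0]["InstanceId"]
        if old_instance_id:
            ec2.terminate_instances(InstanceIds=[old_instance_id])
        return new_id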
It depends on how much time the attacker has to obtain the data. If most of the data is static, it might be worthwhile for an attacker to run his scraper for, say, 50 days. If he is on a DSL line where he can request a "new" IP address twice a day, a 1% limit would not hurt him that much.
Of course, if you need the data more quickly (because it goes stale quickly), there are better ways (use EC2 instances, set up a BOINC project if there is public interest in the collected data, etc.).
Or run a pyramid scheme a la "get 10 people to run my crawler and you get PORN, or get 100 people to crawl it and you get LOTS OF PORN", as was quite common a few years ago with ad-filled websites. Because of the competition involved (who gets the most referrals), you can quickly get a lot of nodes running your crawler for very little money.
What solutions do you have in place for handling bandwidth billing for your vhosts in a shared Apache environment? If you are using log parsing, does your solution scale well when the logs become very, very large? Is anyone using any sort of module out there for this?
There are certain modules for Apache 1.x and 2.x that allow you to set a maximum on the transfer amount; most of them keep track using the scoreboard file that Apache generates (when mod_status is enabled with ExtendedStatus on). The one I still have bookmarked from when I was looking is mod_curb; however, it is not complete and at the moment appears to work only server-wide, not for individual virtual hosts.
Apache modules can be set up as output filters, so you could write a custom module that sits at the end of the chain and adds up all the outgoing packets; using the data that APR provides, you can then add the total to a counter for that specific domain/sub-domain. After that you have a choice of what to do with the data.
For a concrete example, take a look at mod_deflate, which ships with Apache, to see how it sits at the end of the chain and compresses everything the server sends out except the headers. This should give you a good start.
As for log-based processing, it becomes slower the more logs there are; that's just the nature of the beast. When we were using a log-based solution, we had a custom Perl script that ran every 15 minutes. Eventually it took longer than 15 minutes to parse, and since we had proper locking, after a while multiple of these log-processing scripts were running at once, all waiting on each other. We ended up rewriting it around a simple call to tail -F, which let Perl parse each request as it came in; not entirely efficient, but it worked. The upside was that we could now update traffic statistics in near real time, so clients were notified sooner rather than later when they went over their limits.
You could go the poor man's route and use Webalizer or AWStats. Both of these will give you an idea of traffic based on the access logs, and can be run on a per-virtual-host basis. In the case of AWStats, I know that once you start doing 10GB+ of traffic daily it starts to consume serious resources. You can always nice it, but then you'll get your data next week rather than when you actually need it. In the past with Webalizer I've had to use some hackery to get it to handle large access logs, chunking the logs into smaller pieces it could manage. It didn't provide as many useful metrics for what I've done with it, but then I've also never needed it to save a server :)
If the virtual host does not have its own IP, there is no easier way than logfile parsing. Just use mod_logio to count the actual bytes transferred; mod_logio handles broken connections, compressed data and so on correctly. You should be able to parse the logs in real time using piped logs. Use BufferedLogs to scale further (just check that the parser handles buffered, partially written lines correctly). The parser should save its data periodically (say, every minute) somewhere, and must avoid locking issues, since parsing must not slow down httpd; if httpd connections are spending time in the L state in server-status, you are too slow. Once you have the numbers, you can aggregate them further and feed the data into the billing system.
If you also save the billing logs to a file, you can correct and double-check the real-time traffic calculations. If you restart httpd you can end up missing some lines, but generally losing a couple of hundred requests is acceptable, as that is less than a second's worth on a high-volume site.
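As a sketch of what the piped-log parser can look like (the two-field log format is an assumption; it expects something along the lines of CustomLog "|/usr/local/bin/bwsum.py" "%v %O", with %O being the bytes-sent figure from mod_logio):

    # Read one log line per request from stdin, sum bytes per vhost, and flush
    # the totals roughly once a minute. Malformed or partial lines are skipped.
    import sys, time, collections

    totals = collections.Counter()
    last_flush = time.time()

    def flush():
        # Replace with whatever your billing backend expects (DB insert, file append, ...)
        for vhost, bytes_out in totals.items():
            print(f"{int(time.time())} {vhost} {bytes_out}", file=sys.stderr)
        totals.clear()

    for line in sys.stdin:
        try:
            vhost, sent = line.split()
            totals[vhost] += int(sent)
        except ValueError:
            continue
        if time.time() - last_flush >= 60:
            flush()
            last_flush = time.time()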
There are modules that try to handle and limit bandwidth, such as mod_cband and mod_bw, but they don't work when you have the same vhost on multiple machines. I guess they would work OK on a smaller scale.
If you have an IP per vhost, you could try IP-based methods, such as feeding firewall logs into a traffic calculator. A simple way is to use iptables.
Although we use IIS rather than Apache, we do use log file analysis for bandwidth billing (and bandwidth profiling/analysis). We use a custom application to load the data collected in the log files in one-hour increments, and to act upon any required notifications or bandwidth overuse.
The log file loader runs as a low-priority process, so as not to interrupt operation of the server. Even on high-usage servers with a large number of sites, processing takes less than 15 minutes, so we don't see scalability as a problem with this methodology.
There may be better ways of doing this, but this is perfectly adequate for what we need. I look forward to viewing the other responses.
It can be achieved easily with mod_cband. We've rewritten the module to fix a few bugs, provide true redundancy across restarts, and incorporate FTP and mail statistics.
http://www.howtoforge.com/mod_cband_apache2_bandwidth_quota_throttling
Well, mod_cband would be great, except that when I'm using it, max_connections (the overall total across all clients) crawls upwards until it hits the maximum value I've set. Once it reaches that value it just stays there, and all my clients get a constant "503 Service Temporarily Unavailable" error.
For example, I set "CbandSpeed 1000Mbps 500 1200", and the server connection count crawls up to 1200 in about 8 hours, then stays there. At that point I count the total number of connections under Remote Clients in the mod_cband status window and see around 50. I've also used ps aux and I see about the same number (~50) of open httpd processes, which is normal, except that nobody can access the site at all because of the 503 errors.
Any ideas what could be wrong, or can this be fixed?
When running any kind of server under load there are several resources that one would like to monitor to make sure that the server is healthy. This is specifically true when testing the system under load.
Some examples for this would be CPU utilization, memory usage, and perhaps disk space.
What other resource should I be monitoring, and what tools are available to do so?
As many as you can afford to, and can then graph, understand and actually look at. Monitoring resources is useful not only for capacity planning but also for anomaly detection, and anomaly detection significantly helps your ability to detect security events.
You have a decent start with your basic graphs. I'd also monitor the number of threads, number of connections, network I/O, disk I/O, page faults (arguably related to memory usage) and context switches.
I really like munin for graphing things related to hosts.
I use Zabbix extensively in production, which comes with a stack of useful defaults. Some examples of the sorts of things we've configured it to monitor:
Network usage
CPU usage (% user,system,nice times)
Load averages (1m, 5m, 15m)
RAM usage (real, swap, shm)
Disc throughput
Active connections (by port number)
Number of processes (by process type)
Ping time from remote location
Time to SSL certificate expiry
MySQL internals (query cache usage, num temporary tables in RAM and on disc, etc)
Anything you can monitor with Zabbix, you can also attach triggers to - so it can restart failed services; or page you to alert about problems.
Collect the data now, before performance becomes an issue. When it does, you'll be glad of the historical baselines, and the fact you'll be able to show what date and time problems started happening for when you need to hunt down and punish exactly which developer made bad changes :)
I ended up using dstat, which is vmstat's nicer-looking cousin. It will show most everything you need to know about a machine's health, including the following (a script-friendly sketch of collecting the same numbers follows the list):
CPU
Disk
Memory
Network
Swap
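If you'd rather pull the same numbers from your own scripts, a minimal sketch with the third-party psutil library reads the equivalent counters:

    # Sample the basic health numbers once; in practice you'd loop and ship
    # these into whatever graphing/alerting system you use.
    import psutil

    cpu = psutil.cpu_percent(interval=1)          # % CPU over a 1-second sample
    mem = psutil.virtual_memory()                 # total/used/available RAM
    swap = psutil.swap_memory()
    disk = psutil.disk_usage("/")                 # like df for the root partition
    net = psutil.net_io_counters()                # cumulative bytes sent/received

    print(f"cpu={cpu}% mem={mem.percent}% swap={swap.percent}% "
          f"disk={disk.percent}% net_tx={net.bytes_sent} net_rx={net.bytes_recv}")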
"df -h" to make sure that no partition runs full which can lead to all kinds of funky problems, watching the syslog is of course also useful, for that I recommend installing "logwatch" (Logwatch Website) on your server which sends you an email if weird things start showing up in your syslog.
Cacti is a good web-based monitoring/graphing solution. Very complete, very easy to use, with a large userbase including many large Enterprise-level installations.
If you want more 'alerting' and less 'graphing', check out nagios.
As for 'what to monitor': you want to monitor systems at both the system and the application level. So yes, network/memory/disk I/O, interrupts and such at the system level. The application level is more specific, so a web server might measure hits/second and errors/second (non-200 responses), while a database might measure queries/second, average query fulfillment time, and so on.
Beware the aforementioned slow query log in MySQL. It should only be used when trying to figure out why some queries are slow; it has the side effect of making ALL your queries slower while it's enabled. :P It's intended for debugging, not logging.
Think 'passive monitoring' whenever possible. For instance, sniff the network traffic rather than monitoring it from your server: have another machine watch the packets fly back and forth and record statistics about them.
(By the way, that's one of my favorites -- if you watch connections being established and note when they end, you can find a lot of data about slow queries or slow anything else, without putting any load on the server you care about.)
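A hedged sketch of that passive approach with scapy (a third-party capture library; the interface name and port are assumptions, and it needs to run with capture privileges on a machine that can actually see the traffic, e.g. via a mirror port):

    # Count packets and bytes per remote IP for web traffic, then print the
    # top talkers. No load is placed on the server being observed.
    from collections import Counter
    from scapy.all import sniff, IP

    packets = Counter()
    traffic_bytes = Counter()

    def record(pkt):
        if IP in pkt:
            src = pkt[IP].src
            packets[src] += 1
            traffic_bytes[src] += len(pkt)

    sniff(iface="eth0", filter="tcp port 80", prn=record, store=False, count=1000)
    for ip, n in packets.most_common(10):
        print(ip, n, traffic_bytes[ip])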
In addition to top and auth.log, I often look at mtop, and I enable MySQL's slow query log and watch mysqldumpslow.
I also use Nagios to monitor CPU, memory, and logged-in users (on a VPS or dedicated server). The last one lets me know when someone other than me has logged in.
The network, of course :) Use MRTG to get some nice bandwidth graphs; they're just pretty most of the time... until a spammer finds a hole in your security and your traffic suddenly increases.
Nagios is good for alerting, as mentioned, and is easy to get set up. You can then use the MRTG plugin to get alerts for your network traffic too.
I also recommend ntop as it shows where your network traffic is going.
A good link to get you going with Munin and Monit: link text
I typically watch top and tail -f /var/log/auth.log.