Could a web-scraper get around a good throttle protection?


Suppose that a data source sets a tight IP-based throttle. Would a web scraper have any way to download the data if the throttle starts rejecting their requests as early as 1% of the data being downloaded?
The only technique I could think of a hacker using here would be some sort of proxy system. But, it seems like the proxies (even if fast) would eventually all reach the throttle.
Update: Some people below have mentioned big proxy networks like Yahoo Pipes and Tor, but couldn't these IP ranges or known exit nodes be blacklisted as well?
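For concreteness, an IP-based throttle of the kind the question describes might look roughly like the sketch below. The class name, the window length and the request limit are assumptions made up for illustration; they do not come from any particular product.

```python
import time
from collections import defaultdict, deque


class IpThrottle:
    """Sliding-window request limiter keyed by client IP (illustrative only)."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def allow(self, ip, now=None):
        """Return True if the request is allowed, False if it is throttled."""
        now = time.monotonic() if now is None else now
        hits = self._hits[ip]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True
```

Because the state is kept per address, a request arriving from a different IP starts with a fresh budget, which is why the answers concentrate on ways of obtaining many addresses.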

A list of thousands of proxies can be compiled for FREE. IPv6 addresses can be rented for pennies. Hell, an attacker could boot up an Amazon EC2 micro instance for 2-7 cents an hour.
And you want to stop people from scraping your site? The internet doesn't work that way, and hopefully it never will.
(I have seen IRC servers do a port scan on clients to see if the following ports are open: 8080, 3128, 1080. However, there are proxy servers that use different ports, and there are also legitimate reasons to run a proxy server or to have these ports open, for example if you are running Apache Tomcat. You could bump it up a notch by using YAPH to see if a client is running a proxy server. In effect you'd be using an attacker's tool against them ;)
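The port check described in this answer could be sketched as below. The function name and the timeout are assumptions; the three ports are the ones listed in the answer. As the answer itself points out, an open port is only a weak hint, since legitimate services use these ports too.

```python
import socket

COMMON_PROXY_PORTS = (8080, 3128, 1080)  # ports named in the answer above


def open_ports(host, ports=COMMON_PROXY_PORTS, timeout=1.0):
    """Return the subset of `ports` that accept a TCP connection on `host`."""
    found = []
    for port in ports:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            sock.settimeout(timeout)
            # connect_ex returns 0 on success instead of raising.
            if sock.connect_ex((host, port)) == 0:
                found.append(port)
    return found
```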

Someone using Tor would be hopping IP addresses every few minutes. I used to run a website where this was a problem, and resorted to blocking the IP addresses of known Tor exit nodes whenever excessive scraping was detected. You can implement this if you can find a regularly updated list of Tor exit nodes, for example, https://www.dan.me.uk/tornodes
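A minimal sketch of the lookup, assuming you have already fetched an exit-node list and that it contains one IP address per line (the real format of any given list may differ, so check it first). The function names are made up for the example.

```python
import ipaddress


def parse_exit_list(text):
    """Parse one IP address per line, skipping blanks, comments and junk."""
    nodes = set()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        try:
            nodes.add(ipaddress.ip_address(line))
        except ValueError:
            continue  # ignore lines that are not plain IP addresses
    return nodes


def is_exit_node(client_ip, exit_nodes):
    """Return True if client_ip appears in the set of known exit nodes."""
    return ipaddress.ip_address(client_ip) in exit_nodes
```

Parsing into `ipaddress` objects rather than comparing strings means that different spellings of the same IPv6 address still match. The list has to be refreshed regularly, as the answer notes.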

You could use a P2P crawling network to accomplish this task. There will be a lot of IPs available, and there will be no problem if one of them becomes throttled. Also, you may combine a lot of client instances using some proxy configuration, as suggested in previous answers.
I think you can use YaCy, a P2P open-source crawling network.

A scraper that wants the information will get the information. Timeouts, changing agent names, proxies, and of course EC2/RackSpace or any other cloud services that have the ability to start and stop servers with new IP addresses for pennies.

I've heard of people using Yahoo Pipes to do such things, essentially using Yahoo as a proxy to pull the data.

Maybe try running your scraper on Amazon EC2 instances. Every time you get throttled, start up a new instance (with a new IP), and kill the old one.

It depends on the time the attacker has for obtaining the data. If most of the data is static, it might be interesting for an attacker to run his scraper for, say, 50 days. If he is on a DSL line where he can request a "new" IP address twice a day, that is 100 addresses in 50 days, so a 1% limit per address would not harm him that much.
Of course, if you need the data more quickly (because it is outdated quickly), there are better ways (use EC2 instances, set up a BOINC project if there is public interest in the collected data, etc.).
Or have a Pyramid scheme a la "get 10 people to run my crawler and you get PORN, or get 100 people to crawl it and you get LOTS OF PORN", as it was quite common a few years ago with ad-filled websites. Because of the competition involved (who gets the most referrals) you might quickly get a lot of nodes running your crawler for very little money.
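The DSL arithmetic at the start of this answer can be checked with a few lines. This is only a back-of-the-envelope sketch; it assumes every new address gets the full per-IP allowance and that the allowance does not reset for an address while it is in use.

```python
import math


def days_to_scrape(percent_per_ip, new_ips_per_day):
    """Days needed to fetch 100% of the data when each IP yields a fixed percentage."""
    ips_needed = math.ceil(100 / percent_per_ip)
    return math.ceil(ips_needed / new_ips_per_day)


# 1% per address and two fresh addresses a day, as in the answer above.
print(days_to_scrape(percent_per_ip=1, new_ips_per_day=2))  # 50
```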

Related

How to catch/record a "Burst" in HAProxy and/or NodeJS traffic

We have a real-time service, which gets binary messages from different sources (internal and external), then using a couple of NodeJS instances and one HAProxy instance, configured to route TCP traffic, we deliver them to our end-users and different services who consume the messages. HAProxy version is 1.8.14, NodeJS is 6.14.3, both hosted on a CentOS 7 machine.
Now we've got a complex problem with some "bursts" in the outbound interface of the HAProxy instance. We are not sure whether the burst is real (e.g. some messages got stuck in Node and then the network gets flooded with messages) or the problem is some kind of misconfiguration or an indirect effect of some other service (the latter two reasons are more likely, as sometimes we get these bursts around midnight, when we have minimal to zero load).
The issue is annoying right now, but it might get critical as it floods our outbound traffic so our real-time services experience a lag or a small downtime during working hours.
My question is, how can we track and record the nature or the content of these messages with minimum overhead? I've been reading through HAProxy docs to find a way to monitor this, which can be achieved by using a Unix socket, but we are worried about a couple of things:
How much is the overhead of using this socket?
Can we track what is going on in the servers using this socket? Or it only gives us stats?
Is there a way to "catch/echo" the contents of these messages, or find out some information about them? with minimum overhead?
Please let me know if you have any questions regarding this problem.
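As a rough illustration of what the Unix socket offers: HAProxy's stats socket accepts text commands such as "show stat" and replies with CSV counters (sessions, bytes in, bytes out and so on), so polling it can timestamp a burst but, as far as I know, does not expose message contents. The sketch below assumes a stats socket has been enabled in the HAProxy configuration; the socket path is whatever that configuration names, and the helper names are hypothetical.

```python
import csv
import io
import socket


def query_haproxy(command, socket_path):
    """Send one command over the stats Unix socket and return the reply text."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(socket_path)
        sock.sendall(command.encode() + b"\n")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # HAProxy closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode()


def parse_show_stat(text):
    """Turn the CSV reply of 'show stat' into a list of dicts."""
    text = text.lstrip("# ")  # the header line starts with "# "
    return list(csv.DictReader(io.StringIO(text)))
```

Polling `parse_show_stat(query_haproxy("show stat", path))` every second or so and logging the change in the `bout` column would at least show when a burst starts and which frontend or backend it belongs to.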

I'm not sure how to correctly configure my server setup

This is kind of a multi-tiered question in which my end goal is to establish the best way to setup my server which will be hosting a website as well as a service (using Socket.io) for an iOS (and eventually an Android) app. Both the app service and the website are going to be written in node.js as I need high concurrency and scaling for the app server and I figured whilst I'm at it may as well do the website in node because it wouldn't be that much different in terms of performance than something different like Apache (from my understanding).
Also the website has a lower priority than the app service, the app service should receive significantly higher traffic than the website (but in the long run this may change). Money isn't my greatest priority here, but it is a limiting factor, I feel that having a service that has 99.9% uptime (as 100% uptime appears to be virtually impossible in the long run) is more important than saving money at the compromise of having more down time.
Firstly I understand that having one node process per CPU core is the best way to fully utilise a multi-core CPU. I now understand after researching that running more than one per core is inefficient due to the fact that the CPU has to do context switching between the multiple processes. How come then, whenever I see code posted on how to use the built-in cluster module in node.js, the master process creates a number of workers equal to the number of cores? That would mean you would have 9 processes on an 8 core machine (1 master process and 8 worker processes). Is this because the master process usually is there just to restart worker processes if they crash or end, and therefore does so little that it doesn't matter that it shares a CPU core with another node process?
If this is the case, then I am planning to have the workers handle providing the app service and have the master process handle the workers but also host a webpage which would provide statistical information on the server's state and all other relevant information (like number of clients connected, worker restart count, error logs etc). Is this a bad idea? Would it be better to have this webpage running on a separate worker and just leave the master process to handle the workers?
So overall I wanted to have the following elements: a service to handle the requests from the app (my main point of traffic), a website (fairly simple, a couple of pages and a registration form), an SQL database to store user information, a webpage (probably locally hosted on the server machine) which only I can access that hosts information about the server (users connected, worker restarts, server logs, other useful information etc) and apparently nginx would be a good idea where I'm handling multiple node processes accepting connections from the app. After doing research I've also found that it would probably be best to host on a VPS initially. I was thinking that at first, when the amount of traffic the app service receives will most likely be fairly low, I could run all of those elements on one VPS. Or would it be best to have them running on separate VPSes, except for the website and the server status webpage, which I could run on the same one? I guess this way if there is a hardware failure and something goes down, not everything does, and I could run 2 instances of the app service on 2 different VPSes so if one goes down the other one is still functioning. Would this just be overkill? I doubt for a while I would need multiple app service instances to support the traffic load, but it would help reduce the apparent downtime for users.
Maybe this all depends on what I value more and have the time to do? A more complex server setup that costs more and maybe a little unnecessary but guarantees a consistent and reliable service, or a cheaper and simpler setup that may succumb to downtime due to coding errors and server hardware issues.
Also it's worth noting I've never had any real experience with production level servers so in some ways I've jumped in the deep end a little with this. I feel like I've come a long way in the past half a year and feel like I'm getting a fairly good grasp on what I need to do, I could just do with some advice from someone with experience that has an idea with what roadblocks I may come across along the way and whether I'm causing myself unnecessary problems with this kind of setup.
Any advice is greatly appreciated, thanks for taking the time to read my question.

How does a service like Put.io work?

Just got invited to put.io ... it's a service that takes a torrent file (or a magnet link) as input and gives a static file available for download from its own server. I've been trying to understand how a service like this works.
It can't be as simple as torrenting the file and serving it via a CDN... can it? Because the speeds it offers seem insanely fast to me.
Any idea about the bandwidth implications (or the amount used) by the service?

I believe services like this typically just are running one or more bittorrent clients on beefy machines with a fast link. You only have to download the torrent the first time someone asks for it, then you can cache it for the next person to request it.
The bandwidth usage is not unreasonable, since you're caching the files, you actually end up using less bandwidth than if you would, say, simply proxy downloads for people.
I would imagine that using a CDN would not be very common. There's a certain overhead involved in that. You could possibly promote files out of your cache to a CDN once you're certain that they are and will stay popular.
The service I was involved with simply ran 14 instances of libtorrent, each on a separate drive, serving completed files straight off of those drives with nginx. Torrents were requested from the web front end and prioritized before being handed over to the downloader. Each instance would download around 70 or so torrents in parallel.
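The "download once, then serve from cache" idea from this answer can be sketched as follows. The class and the `download` callable are hypothetical stand-ins; a real service would hand the request to a bittorrent client and also deal with eviction, partial downloads and concurrent requests for the same torrent.

```python
class TorrentCache:
    """Fetch each torrent once and serve later requests from local storage."""

    def __init__(self, download):
        self._download = download  # callable: info_hash -> local file path
        self._paths = {}           # info_hash -> path of the completed file
        self.downloads = 0         # how many times we actually hit the swarm

    def get(self, info_hash):
        """Return the local path for a torrent, downloading it on first request."""
        if info_hash not in self._paths:
            self._paths[info_hash] = self._download(info_hash)
            self.downloads += 1
        return self._paths[info_hash]
```

With this shape, the second user asking for a popular torrent costs only outbound bandwidth, which is the point the answer makes about caching versus plain proxying.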

How much sustained data should a dedicated server be able to serve?

We have a dedicated GoDaddy server and it seemed to grind to a halt when we had users downloading only 3MB every 2 seconds (this was over about 20 HTTP requests).
I want to look into database locking etc. to see if that is a problem - but first I'm curious as to what a dedicated server ought to be able to serve.

To help diagnose the problem, host a large file and download it. That will give you the transfer rate that the server and your web server can cope with. If the transfer rate is poor, then you know it's the network, server or web server.
If it's acceptable or good, then you know it's the means you have of generating those 3MB files.
Check, measure and calculate!
PS. Download the file over a fast link, you don't want the bottleneck to be your 64kbps modem :)
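One way to run the measurement suggested here, as a sketch using only the Python standard library. The function names are made up, and the URL you pass in would be the large test file you host yourself.

```python
import time
import urllib.request


def measure_throughput(stream, chunk_size=64 * 1024):
    """Read `stream` to the end; return (bytes_read, seconds, megabytes_per_second)."""
    start = time.perf_counter()
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        total += len(chunk)
    elapsed = time.perf_counter() - start
    rate = total / elapsed / 1_000_000 if elapsed > 0 else float("inf")
    return total, elapsed, rate


def measure_url(url):
    """Download `url` and report its transfer rate."""
    with urllib.request.urlopen(url) as response:
        return measure_throughput(response)
```

Run it from a machine with a fast connection, as the PS says, and repeat it a few times, since a single download can be skewed by caching or a momentary network hiccup.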

A lot depends on what the 3MB is. Serving up 1.5MBps of static data is way, way, way, within the bounds of even the weakest server.

Perhaps GoDaddy does bandwidth throttling? 60MB downloads every 2 seconds might trigger some sort of bandwidth protection (either to protect their service or you from being overcharged, or both).

Check netspeed.stanford.edu from the dedicated server and see what your inbound and outbound traffic is like.
Also make sure your ISP is not limiting you to 10Mbps (GoDaddy by default limits to 10Mbps and will set it to 100Mbps on request).
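To compare the numbers in this thread, it helps to convert megabytes into megabits. A quick sketch, assuming 1 MB is 8 megabits and taking the 10Mbps default cap from the answer above at face value:

```python
def megabits_per_second(megabytes, seconds):
    """Convert a transfer of `megabytes` over `seconds` into Mbit/s (1 byte = 8 bits)."""
    return megabytes * 8 / seconds


print(megabits_per_second(3, 2))   # 12.0, the question's 3MB every 2 seconds
print(megabits_per_second(60, 2))  # 240.0, if it were 3MB for each of 20 requests
```

Either reading of the question comes out above a 10Mbps cap, so if that default limit applies to this server it would be enough to explain the slowdown.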

Which resources should one monitor on a Linux server running a web-server or database

When running any kind of server under load there are several resources that one would like to monitor to make sure that the server is healthy. This is specifically true when testing the system under load.
Some examples for this would be CPU utilization, memory usage, and perhaps disk space.
What other resource should I be monitoring, and what tools are available to do so?

As many as you can afford to, and can then graph/understand/look at the results. Monitoring resources is useful for not only capacity planning, but anomaly detection, and anomaly detection significantly helps your ability to detect security events.
You have a decent start with your basic graphs. I'd want to also monitor the number of threads, number of connections, network I/O, disk I/O, page faults (arguably this is related to memory usage), context switches.
I really like munin for graphing things related to hosts.

I use Zabbix extensively in production, which comes with a stack of useful defaults. Some examples of the sorts of things we've configured it to monitor:
Network usage
CPU usage (% user,system,nice times)
Load averages (1m, 5m, 15m)
RAM usage (real, swap, shm)
Disc throughput
Active connections (by port number)
Number of processes (by process type)
Ping time from remote location
Time to SSL certificate expiry
MySQL internals (query cache usage, num temporary tables in RAM and on disc, etc)
Anything you can monitor with Zabbix, you can also attach triggers to - so it can restart failed services; or page you to alert about problems.
Collect the data now, before performance becomes an issue. When it does, you'll be glad of the historical baselines, and the fact you'll be able to show what date and time problems started happening for when you need to hunt down and punish exactly which developer made bad changes :)
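If you want a quick baseline before setting up a full monitoring system, a few of the metrics above can be sampled with the Python standard library alone. This is only a sketch: the function name and the choice of metrics are mine, and `os.getloadavg` is only available on Unix-like systems.

```python
import os
import shutil


def health_snapshot(path="/"):
    """Collect a few basic host metrics using only the standard library."""
    load1, load5, load15 = os.getloadavg()  # Unix only
    disk = shutil.disk_usage(path)
    return {
        "load_1m": load1,
        "load_5m": load5,
        "load_15m": load15,
        "cpu_count": os.cpu_count(),
        "disk_used_percent": 100 * disk.used / disk.total,
    }
```

Writing one such snapshot per minute to a log file already gives you the historical baseline this answer recommends collecting.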

I ended up using dstat which is vmstat's nicer looking cousin.
This will show most everything you need to know about a machine's health,
including:
CPU
Disk
Memory
Network
Swap

Use "df -h" to make sure that no partition runs full, which can lead to all kinds of funky problems. Watching the syslog is of course also useful; for that I recommend installing "logwatch" on your server, which sends you an email if weird things start showing up in your syslog.

Cacti is a good web-based monitoring/graphing solution. Very complete, very easy to use, with a large userbase including many large Enterprise-level installations.
If you want more 'alerting' and less 'graphing', check out nagios.
As for 'what to monitor', you want to monitor systems at both the system and application level, so yes: network/memory/disk i/o, interrupts and such over the system level. The application level gets more specific, so a webserver might measure hits/second, errors/second (non-200 responses), etc and a database might measure queries/second, average query fulfillment time, etc.
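The web server metrics mentioned here (hits/second and errors/second, counting every non-200 response as an error) can be derived from access-log records along these lines. The input format, a list of (unix_second, http_status) pairs, is an assumption; you would need to parse your own log format into it first.

```python
from collections import Counter


def per_second_rates(records):
    """Given (unix_second, http_status) pairs, return {second: (hits, errors)}."""
    hits = Counter()
    errors = Counter()
    for second, status in records:
        hits[second] += 1
        if status != 200:  # same definition of "error" as in the answer above
            errors[second] += 1
    return {s: (hits[s], errors[s]) for s in sorted(hits)}
```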

Beware the afore-mentioned slowquerylog in mysql. It should only be used when trying to figure out why some queries are slow. It has the side-effect of making ALL your queries slow while it's enabled. :P It's intended for debugging, not logging.
Think 'passive monitoring' whenever possible. For instance, sniff the network traffic rather than monitor it from your server -- have another machine watch the packets fly back and forth and record statistics about them.
(By the way, that's one of my favorites -- if you watch connections being established and note when they end, you can find a lot of data about slow queries or slow anything else, without putting any load on the server you care about.)
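The connection-timing trick can be sketched like this, assuming the sniffing machine has already turned the captured packets into (timestamp, connection id, "open" or "close") events. The capture step and the event format are both assumptions and are not shown here.

```python
def slow_connections(events, threshold_seconds):
    """events: iterable of (timestamp, conn_id, kind), kind being 'open' or 'close'."""
    opened = {}
    slow = []
    for timestamp, conn_id, kind in events:
        if kind == "open":
            opened[conn_id] = timestamp
        elif kind == "close" and conn_id in opened:
            duration = timestamp - opened.pop(conn_id)
            if duration >= threshold_seconds:
                slow.append((conn_id, duration))
    return slow
```

Since this runs on the observing machine, none of the bookkeeping costs anything on the server being watched.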

In addition to top and auth.log, I often look at mtop, and enable mysql's slowquerylog and watch mysqldumpslow.
I also use Nagios to monitor CPU, Memory, and logged in users (on a VPS or dedicated server). That last lets me know when someone other than me has logged in.

Network, of course :) Use MRTG to get some nice bandwidth graphs. They're just pretty most of the time... until a spammer finds a hole in your security and traffic suddenly increases.
Nagios is good for alerting as mentioned, and is easy to get set up. You can then use the mrtg plugin to get alerts for your network traffic too.
I also recommend ntop as it shows where your network traffic is going.
A good link to get you going with Munin and Monit: link text

I typically watch top and tail -f /var/log/auth.log.
