So i have created an automation bot to do some stuff for me on the internet .. Using Selenium Python..After long and grooling coding sessions ..days and nights of working on this project i have finally completed it ...Only to be randomly greeted with a Error 1015 "You are being rate limited".
I understand this is to prevent DDOS attacks. But it is a major blow.
I have contacted the website to resolve the matter but to no avail ..But the third party security software they use says that they the website can grant my ip exclusion of rate limiting.
So i was wondering is there any other way to bypass this ..maybe from a coding perspective ...
I don't think stuff like clearing cookies will resolve anything ..or will it as it is my specific ip address that they are blocking
Note:
The TofC of the website i am running my bot on doesn't say you cant use automation software on it ..but it doesn't say you cant either.
I don't mind coding some more to prevent random access denials ..that i think last for 24 hours which can be detrimental as the final stage of this build is to have my program run daily for long periods of times.
Do you think i could communicate with the third party security to ask them to ask the website to grant me access ..I have already tried resolving the matter with the website. All they said was that A. On there side it says i am fine
B. The problem is most likely on my side .."Maybe some malicious software is trying to access our website" which .. malicious no but a bot yes. That's what made me think maybe it would be better if i resolved the matter myself.
Do you think i may have to implement wait times between processes or something. Im stuck.
Thanks for any help. And its a single bot!
If you are randomly greeted with...
...implies that the site owner implemented Rate Limiting that affects your visitor traffic.
rate-limiting reason
Cloudflare can rate-limit the the visitor traffic trying to counter a possible Dictionary attack.
rate-limit thresholds
In generic cases Cloudflare rate-limits the visitor when the visitor traffic crosses the rate-limit thresholds which is calculated by, dividing 24 hours of uncached website requests by the unique visitors for the same 24 hours. Then, divide by the estimated average minutes of a visit. Finally, multiply by 4 (or larger) to establish an estimated threshold per minute for your website. A value higher than 4 is fine since most attacks are an order of magnitude above typical traffic rates.
Solution
In these cases the a potential solution would be to use the undetected-chromedriver to initialize the Chrome Browsing Context.
undetected-chromedriver is an optimized Selenium Chromedriver patch which does not trigger anti-bot services like Distill Network / Imperva / DataDome / Botprotect.io. It automatically downloads the driver binary and patches it.
Code Block:
import undetected_chromedriver as uc
from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
driver = uc.Chrome(options=options)
driver.get('https://bet365.com')
References
You can find a couple of relevant detailed discussions in:
Selenium app redirect to Cloudflare page when hosted on Heroku
Linkedin API throttle limit
I see some possibilities for you here:
Introduce wait time between requests to the site
Reduce the requests you make
Extend your bot to detect when it hits the limit and change your ip address (e.g. by restarting you router)
The last one is the least preferable I would assume and also the most time consuming one.
First: Read to Terms of Use of the website, for example, look at the robots.txt, usually this is at the root of the website like www.google.com/robots.txt . Note that going against the website owner's explicit terms may be illegal depending on jurisdiction and may result in the owner blocking your tool and/or ip.
https://www.robotstxt.org/robotstxt.html
This will let you know what the website owner explicitly allows for automation and scraping.
After you've reviewed the website's terms and understand what they allow, and they do not respond to you, and you've determined you are not breaking the websites terms of use, the only real other option would be utilize proxies and/or VPSs that will give the system running the scripts different IPs.
Related
How to find out my site is being scraped?
I've some points...
Network Bandwidth occupation, causing throughput problems (matches if proxy used).
When querting search engine for key words the new referrences appear to other similar resources with the same content (matches if proxy used).
Multiple requesting from the same IP.
High requests rate from a single IP. (by the way: What is a normal rate?)
Headless or weird user agent (matches if proxy used).
Requesting with predictable (equal) intervals from the same IP.
Certain support files are never requested, ex. favicon.ico, various CSS and javascript files (matches if proxy used).
The client's requests sequence. Ex. client access not directly accessible pages (matches if proxy used).
Would you add more to this list?
What points might fit/match if a scraper uses proxying?
As a first note; consider if its worthwhile to provide an API for bots for the future. If you are being crawled by another company/etc, if it is information you want to provide to them anyways it makes your website valuable to them. Creating an API would reduce your server load substantially and give you 100% clarity on people crawling you.
Second, coming from personal experience (I created web-crawls for quite a while), generally you can tell immediately by tracking what the browser was that accessed your website. If they are using one of the automated ones or one out of a development language it will be uniquely different from your average user. Not to mention tracking the log file and updating your .htaccess with banning them (if that's what you are looking to do).
Its usually other then that fairly easy to spot. Repeated, very consistent opening of pages.
Check out this other post for more information on how you might want to deal with them, also for some thoughts on how to identify them.
How to block bad unidentified bots crawling my website?
I would also add analysis of when the requests by the same people are made. For example if the same IP address requests the same data at the same time every day, it's likely the process is on an automated schedule. Hence is likely to be scraping...
Possible add analysis of how many pages each user session has impacted. For example if a particular user on a particular day has browsed to every page in your site and you deem this unusual, then perhaps its another indicator.
It feels like you need a range of indicators and need to score them and combine the score to show who is most likely scraping.
I recently tried to use the default settings, this is:
5 - max number of concurrent occurances
-20 max number of requests in 200 milliseconds.
However, this started cutting of my personal connections to the website (loading javascript, css etc.). I need something that will never fire for users using the site honestly, but I do want to prevent denial of service attacks.
What are good limits to set?
I don't think that there is a good generic limit that will fit for all websites, it is personal for each website. It depends on RPS, requests execution time etc.
I suggest you to modify IIS logger and log IP of each request. Then view IIS logs to see what is the pattern of the traffic for users, how many requests they do within a normal flow. It should let you approximate average amount of requests coming from user in a selected time frame.
However in my experience 20 requests without 200 milliseconds usually looks like an attack. In this way default settings provided by IIS seem reasonable.
My website uses obscure, random URLs to provide some security for sensitive documents. E.g. a URL might be http://example.com/<random 20-char string>. The URLs are not linked to by any other pages, have META tags to opt out of search engine crawling, and have short expiration periods. For top-tier security some of the URLs are also protected by a login prompt, but many are simply protected by the obscure URL. We have decided that this is an acceptable level of security.
We have a lockout mechanism implemented where an IP address will be blocked for some period of time following several invalid URL attempts, to discourage brute-force guessing of URLs.
However, Google Chrome has a feature called "Instant" (enabled in Options -> Basic -> Search), that will prefetch URLs as they are typed into the address bar. This is quickly triggering a lockout, since it attempts to fetch a bunch of invalid URLs, and by the time the user has finished, they are not allowed any more attempts.
Is there any way to opt out of this feature, or ignore HTTP requests that come from it?
Or is this lockout mechanism just stupid and annoying for users without providing any significant protection?
(Truthfully, I don't really understand how this is a helpful feature for Chrome. For search results it can be interesting to see what Google suggests as you type, but what are the odds that a subset of your intended URL will produce a meaningful page? When I have this feature turned on, all I get is a bunch of 404 errors until I've finished typing.)
Without commenting on the objective, I ran into a similar problem (unwanted page loads from Chrome Instant), and discovered that Google does provide a way to avoid this problem:
When Google Chrome makes the request to your website server, it will send the following header:
X-Purpose: preview
Detect this, and return an HTTP 403 ("Forbidden") status code.
Or is this lockout mechanism just stupid and annoying for users without providing any significant protection?
You've potentially hit the nail on the head there: Security through obscurity is not security.
Instead of trying to "discourage brute-force guessing", use URLs that are actually hard to guess: the obvious example is using a cryptographically secure RNG to generate the "random 20 character string". If you use base64url (or a similar URL-safe base64) you get 64^20 = 2^6^20 = 2^120 bits. Not quite 128 (or 160 or 256) bits, so you can make it longer if you want, but also note that the expected bandwidth cost of a correct guess is going to be enormous, so you don't really have to worry until your bandwidth bill becomes huge.
There are some additional ways you might want to protect the links:
Use HTTPS to reduce the potential for eavesdropping (they'll still be unencrypted between SMTP servers, if you e-mail the links)
Don't link to anything, or if you do, link through a HTTPS redirect (last I checked, many web browsers will still send a Referer:, leaking the "secure" URL you were looking at previously). An alternative is to have the initial load set an unguessable secure session cookie and redirect to a new URL which is only valid for that session cookie.
Alternatively, you can alter the "lockout" to still work without compromising usability:
Serve only after a delay. Every time you serve a document or HTTP 404, increase the delay for that IP. There's probably an easy algorithm to asymptotically approach a rate limit but be more "forgiving" for the first few requests.
For each IP address, only allow one request at a time. When you receive a request, return HTTP 5xx on any existing requests (I forget which one is "server load too high").
Even if the initial delays are exponentially increasing (i.e. 1s, 2s, 4s), the "current" delay isn't going to be much greater than the time taken to type in the whole URL. If it takes you 10 seconds to type in a random URL, then another 16 seconds to wait for it to load isn't too bad.
Keep in mind that someone who wants to get around your IP-based rate limiting can just rent a (fraction of the bandwidth of a) botnet.
Incidentally, I'm (only slightly) surprised by the view from an Unnamed Australian Software Company that low-entropy randomly-generated passwords are not a problem, because there should be a CAPTCHA in your login system. Funny, since some of those passwords are server-to-server.
My boss asked me if Weblog expert (http://www.weblogexpert.com/lite.htm) is reliable in calculating the average time of the incoming visitors in a web site. Since HTTP is a stateless protocol, I think that the average time might be something left to personal interpretation. Does any one uses Weblog Expert? Is the visitor's average time reliable? Does anyone understand its criteria about how it process Apache logs to understand the average time?
From the WebLog Expert Lite help, the following definition:
Visitor - The program determines number of visitors by the IP addresses. If a request from an IP address came after 30 minutes since the last request from this IP, it is considered to belong to a different visitor. Requests from spiders aren't used to determine visitors.
That's a fairly useful heuristic to determine a visitor's visit, if all you have to go on is a timestamp and a requesting IP address. (I'm not sure how Web Log Expert determines a visitor is a spider, but it was irrelevant to my purpose.)
However, on closer inspection, I found the visitor average time to be very variable for our web app; some users request only a page or two, others are on for hours. So a single metric of "Average visit duration" might not give you a perfect understanding of your site's traffic.
I can't comment on that site in particular, but average time is usually calculated using some very clever bits of javascript.
You can set events on various parts of the page in javascript which fire off requests to servers. For example, when the user navigates away from a page or clicks on a link or closes the window the browser can send off a javascript request to their servers letting them know that the user has left. While this isn't 100% reliable, I think it provides a reasonable estimate for how long people spend there.
I get entirely different results if I change "Visitor session timeout".
Our internal network people (the majority of our visitors) all go to our website (external host) from the same IP (through our ISP), so the only way to determine a new visitor is by this Timeout. Choosing 1, 5 or 10 minutes creates very different results. HIGHLY UNRELIABLE. The only thing to do is be consistent and use the same parameters for comparative results, i.e., increased/decreased traffic. By the way, the update to WebLog Expert (version 7 -> 8) through that all out the window with entirely different counting mechanisms.
Our team have built a web application using Ruby on Rails. It currently doesn't restrict users from making excessive login requests. We want to ignore a user's login requests for a while after she made several failed attempts mainly for the purpose of defending automated robots.
Here are my questions:
How to write a program or script that can make excessive requests to our website? I need it because it will help me to test our web application.
How to restrict a user who made some unsuccessful login attempts within a period? Does Ruby on Rails have built-in solutions for identifying a requester and tracking whether she made any recent requests? If not, is there a general way to identify a requester (not specific to Ruby on Rails) and keep track of the requester's activities? Can I identify a user by ip address or cookies or some other information I can gather from her machine? We also hope that we can distinguish normal users (who make infrequent requests) from automatic robots (who make requests frequently).
Thanks!
One trick I've seen is having form fields included on the login form that through css hacks make them invisible to the user.
Automated systems/bots will still see these fields and may attempt to fill them with data. If you see any data in that field you immediately know its not a legit user and ignore the request.
This is not a complete security solution but one trick that you can add to the arsenal.
In regards to #1, there are many automation tools out there that can simulate large-volume posting to a given url. Depending on your platform, something as simple as wget might suffice; or something as complex (relatively speaking) a script that asks a UserAgent to post a given request multiple times in succession (again, depending on platform, this can be simple; also depending on language of choice for task 1).
In regards to #2, considering first the lesser issue of someone just firing multiple attempts manually. Such instances usually share a session (that being the actual webserver session); you should be able to track failed logins based on these session IDs ang force an early failure if the volume of failed attempts breaks some threshold. I don't know of any plugins or gems that do this specifically, but even if there is not one, it should be simple enough to create a solution.
If session ID does not work, then a combination of IP and UserAgent is also a pretty safe means, although individuals who use a proxy may find themselves blocked unfairly by such a practice (whether that is an issue or not depends largely on your business needs).
If the attacker is malicious, you may need to look at using firewall rules to block their access, as they are likely going to: a) use a proxy (so IP rotation occurs), b) not use cookies during probing, and c) not play nice with UserAgent strings.
RoR provides means for testing your applications as described in A Guide to Testing Rails Applications. Simple solution is to write such a test containing a loop sending 10 (or whatever value you define as excessive) login request. The framework provides means for sending HTTP requests or fake them
Not many people will abuse your login system, so just remembering IP addresses of failed logins (for an hour or any period your think is sufficient) would be sufficient and not too much data to store. Unless some hacker has access to a great many amount of IP addresses... But in such situations you'd need more/serious security measurements I guess.