How can I find out whether my site is being scraped?
I have some points...
Network bandwidth occupation, causing throughput problems (matches if a proxy is used).
When querying a search engine for keywords, new references appear pointing to other, similar resources with the same content (matches if a proxy is used).
Multiple requests from the same IP.
High request rate from a single IP. (By the way: what is a normal rate?)
Headless or unusual user agent (matches if a proxy is used).
Requests at predictable (equal) intervals from the same IP.
Certain support files are never requested, e.g. favicon.ico or various CSS and JavaScript files (matches if a proxy is used).
The client's request sequence, e.g. the client accesses pages that are not directly reachable (matches if a proxy is used).
Would you add more to this list?
Which of these points would still match if the scraper uses a proxy?
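One of the points above, requests at predictable intervals, can be checked directly from access-log timestamps. A minimal sketch (the timestamps below are invented for illustration):

```python
from statistics import pstdev

def interval_regularity(timestamps):
    """Return the standard deviation of gaps between requests.
    A near-zero value suggests a scripted, fixed-interval client."""
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    if len(gaps) < 2:
        return None
    return pstdev(gaps)

# Requests arriving exactly every 30 seconds: almost certainly a bot.
bot_times = [0, 30, 60, 90, 120]
# A human browsing: irregular gaps.
human_times = [0, 4, 51, 63, 140]

print(interval_regularity(bot_times))    # 0.0
print(interval_regularity(human_times))  # noticeably larger
```

A real implementation would run this per IP over a sliding window, since a scraper with added random jitter would need a looser threshold.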
As a first note, consider whether it's worthwhile to provide an API for bots in the future. If you are being crawled by another company, and it is information you want to provide to them anyway, then your website is clearly valuable to them. Creating an API would reduce your server load substantially and give you 100% clarity on who is crawling you.
Second, from personal experience (I wrote web crawlers for quite a while): generally you can tell immediately by tracking which browser accessed your website. A crawler using one of the automated tools, or one written in a development language, sends a user agent distinctly different from your average user's. You can also track the log file and update your .htaccess to ban offenders (if that's what you are looking to do).
Other than that, it's usually fairly easy to spot: repeated, very consistent opening of pages.
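For example, a quick pass over the user agents in an access log (the marker substrings below are common library and tool defaults, not an exhaustive list; extend them from your own logs):

```python
# Substrings that appear in the default user agents of common HTTP
# libraries and headless tools; this list is illustrative, not complete.
SUSPECT_UA_MARKERS = ("python-requests", "curl", "wget",
                      "scrapy", "httpclient", "headless")

def looks_automated(user_agent):
    ua = user_agent.lower()
    return any(marker in ua for marker in SUSPECT_UA_MARKERS)

print(looks_automated("python-requests/2.31.0"))        # True
print(looks_automated("Mozilla/5.0 (Windows NT 10.0)")) # False
```

Note that a careful scraper will spoof a browser user agent, so this only catches the naive ones.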
Check out this other post for more information on how you might want to deal with them, also for some thoughts on how to identify them.
How to block bad unidentified bots crawling my website?
I would also add analysis of when the requests by the same people are made. For example, if the same IP address requests the same data at the same time every day, the process is likely on an automated schedule, and hence is likely to be scraping.
Possibly also analyse how many pages each user session has touched. For example, if a particular user on a particular day has browsed to every page on your site and you deem this unusual, that is perhaps another indicator.
It feels like you need a range of indicators, and you need to score them and combine the scores to show who is most likely scraping.
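That combined score could be sketched like this (the indicator names, weights, and threshold are invented; tune them against your own traffic):

```python
# Hypothetical indicator weights; a higher score is more scraper-like.
WEIGHTS = {
    "regular_intervals": 3,
    "no_asset_requests": 2,   # never fetches favicon/CSS/JS
    "odd_user_agent": 2,
    "high_request_rate": 3,
    "full_site_walk": 1,      # touched nearly every page in one session
}

def scraper_score(indicators):
    """indicators: set of triggered indicator names for one session."""
    return sum(WEIGHTS.get(name, 0) for name in indicators)

session = {"regular_intervals", "no_asset_requests", "odd_user_agent"}
print(scraper_score(session))  # 7, above an example threshold of 5
```

The advantage of scoring over hard rules is that a proxy-rotating scraper, which defeats the IP-based indicators, can still trip enough of the others to cross the threshold.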
Related
So I have created an automation bot, using Selenium with Python, to do some stuff for me on the internet. After long and grueling coding sessions, days and nights of working on this project, I have finally completed it, only to be randomly greeted with Error 1015, "You are being rate limited".
I understand this is meant to prevent DDoS attacks, but it is a major blow.
I have contacted the website to resolve the matter, but to no avail. However, the third-party security software they use says that the website can grant my IP an exclusion from rate limiting.
So I was wondering: is there any other way to bypass this, maybe from a coding perspective?
I don't think things like clearing cookies will resolve anything, or will it? It is my specific IP address that they are blocking.
Note:
The ToS of the website I am running my bot on doesn't say you can use automation software on it, but it doesn't say you can't either.
I don't mind coding some more to prevent these random access denials, which I think last for 24 hours. That can be detrimental, as the final stage of this build is to have my program run daily for long periods of time.
Do you think I could contact the third-party security provider and ask them to ask the website to grant me access? I have already tried resolving the matter with the website. All they said was that (a) on their side it says I am fine, and (b) the problem is most likely on my side: "Maybe some malicious software is trying to access our website". Malicious, no, but a bot, yes. That's what made me think it might be better to resolve the matter myself.
Do you think I may have to implement wait times between processes or something? I'm stuck.
Thanks for any help. And it's a single bot!
If you are randomly greeted with...
...this implies that the site owner has implemented rate limiting that affects your visitor traffic.
Rate-limiting reason
Cloudflare can rate-limit visitor traffic to counter a possible dictionary attack.
Rate-limit thresholds
In generic cases Cloudflare rate-limits a visitor when their traffic crosses the rate-limit threshold, which is calculated as follows: divide 24 hours of uncached website requests by the unique visitors for the same 24 hours, then divide by the estimated average minutes of a visit, and finally multiply by 4 (or larger) to establish an estimated per-minute threshold for your website. A multiplier higher than 4 is fine, since most attacks are an order of magnitude above typical traffic rates.
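Worked through with invented figures, the calculation looks like this:

```python
# Hypothetical 24-hour figures for a site.
uncached_requests = 100_000
unique_visitors = 5_000
avg_visit_minutes = 10

requests_per_visitor = uncached_requests / unique_visitors   # 20.0
requests_per_minute = requests_per_visitor / avg_visit_minutes  # 2.0
threshold = requests_per_minute * 4  # estimated limit: 8.0 requests/minute

print(threshold)  # 8.0
```

A bot polling even once every few seconds would blow well past such a threshold, which is why scripted clients trip the limit while humans rarely do.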
Solution
In these cases a potential solution would be to use undetected-chromedriver to initialize the Chrome browsing context.
undetected-chromedriver is an optimized Selenium Chromedriver patch which does not trigger anti-bot services such as Distil Networks, Imperva, DataDome or Botprotect.io. It automatically downloads the driver binary and patches it.
Code Block:
import undetected_chromedriver as uc
from selenium import webdriver

# Configure Chrome options as usual; uc.Chrome() accepts them.
options = webdriver.ChromeOptions()
options.add_argument("start-maximized")

# uc.Chrome() downloads a matching chromedriver binary and patches it
# before starting the browsing context.
driver = uc.Chrome(options=options)
driver.get('https://bet365.com')
References
You can find a couple of relevant detailed discussions in:
Selenium app redirect to Cloudflare page when hosted on Heroku
Linkedin API throttle limit
I see some possibilities for you here:
Introduce a wait time between requests to the site.
Reduce the number of requests you make.
Extend your bot to detect when it hits the limit and change your IP address (e.g. by restarting your router).
The last one is, I would assume, the least preferable and also the most time-consuming.
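The first option can be as simple as sleeping between requests, with an exponentially growing delay once the limit is hit. A sketch (the delays and the rate-limit check are placeholders; in practice you would test for an HTTP 429 or the Error 1015 page):

```python
import time

def fetch_with_backoff(fetch, url, base_delay=2.0, max_retries=5):
    """Call fetch(url); on a rate-limit signal, wait and retry
    with an exponentially growing delay."""
    delay = base_delay
    for attempt in range(max_retries):
        response = fetch(url)
        if response != "rate_limited":  # stand-in for a real 429/1015 check
            return response
        time.sleep(delay)
        delay *= 2
    raise RuntimeError("still rate-limited after retries")

# Fake fetcher for illustration: rate-limited twice, then succeeds.
calls = {"n": 0}
def fake_fetch(url):
    calls["n"] += 1
    return "ok" if calls["n"] > 2 else "rate_limited"

print(fetch_with_backoff(fake_fetch, "https://example.com", base_delay=0.01))  # ok
```

Since your bot runs daily for long stretches, a steady base delay between all requests (not only after failures) is the more polite and reliable fix.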
First: read the Terms of Use of the website and look at its robots.txt, which usually sits at the root of the website, e.g. www.google.com/robots.txt. Note that going against the website owner's explicit terms may be illegal depending on jurisdiction, and may result in the owner blocking your tool and/or IP.
https://www.robotstxt.org/robotstxt.html
This will let you know what the website owner explicitly allows for automation and scraping.
Once you've reviewed the website's terms, determined that you are not breaking them, and the owner still does not respond to you, the only real remaining option is to use proxies and/or VPSes that give the system running the scripts different IPs.
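Python's standard library can do the robots.txt check for you. A sketch (the rules below are an inline example rather than a live fetch; with a real site you would use set_url and read, as noted in the comment):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# Against a live site you would do:
#   rp.set_url("https://example.com/robots.txt"); rp.read()
# Here we parse example rules inline to avoid a network call.
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

print(rp.can_fetch("MyBot/1.0", "https://example.com/public/page"))   # True
print(rp.can_fetch("MyBot/1.0", "https://example.com/private/data"))  # False
```

Checking can_fetch before every request keeps the bot inside whatever the owner has published, which also strengthens your position if you later ask for a rate-limit exclusion.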
One of our advertising networks for a site I administer and develop is requesting the following:
We have been working on increasing performance on XXXX.com and our team feels that if we can set up the following CNAME on that domain it will help increase rates:
srv.XXXX.com d2xf3n3fltc6dl.XXXX.net
Could you create this record with your domain registrar? The reason we need you to create this CNAME is to preserve domain transparency within our RTB. Once we get this set up I will make some modifications in your account that should have some great results.
Would this not open up our site to cross-site scripting vulnerabilities? Wouldn't malicious code be able to masquerade as coming from our site to bypass same-origin policy protection in browsers? I questioned him on this and this was his response:
First off let me address the benefits. The reason we would like you to create this CNAME is to increase domain transparency within our RTB. Many times when ads are fired, JS is used to scrape the URL and pass it to the buyer. We have found this method to be inefficient because sometimes the domain information does not reach the marketplace. This causes an impression (or hit) to show up as “uncategorized” rather than as “XXXX.com”, and this results in lower rates because buyers pay up to 80% less for uncategorized inventory. By creating the CNAME we are ensuring that your domain shows up 100% of the time, and we usually see CPM and revenue increases of 15-40% as a result.
I am sure you are asking yourself why other ad networks don’t do this. The reason is that this is not a very scalable solution, because as you can see, we have to work with each publisher to get this setup. Unlike big box providers like Adsense and Lijit, OURCOMPANY is focused on maximizing revenue for a smaller amount of quality publishers, rather than just getting our tags live on as many sites as possible. We take the time and effort to offer these kinds of solutions to maximize revenue for all parties.
In terms of security risks, they are minimal to none. You will simply be pointing a subdomain of XXXX.com to our ad creative server. We can’t use this to run scripts on your site, or access your site in any way.
Adding the CNAME is entirely up to you. We will still work our hardest to get the best rates possible, with or without that. We have just seen great results with this for other publishers, so I thought that I would reach out and see if it was something you were interested in.
This whole situation raised red flags with me, but it is really outside my knowledge of security. Can anyone offer any insight on this, please?
This would enable cookies set at the XXXX.com level to be read by each site, but it would not allow other same-origin policy actions unless both sites opt in: both would have to set document.domain = 'XXXX.com'; in client-side script to allow access between the two domains.
From MDN:
Mozilla distinguishes a document.domain property that has never been set from one explicitly set to the same domain as the document's URL, even though the property returns the same value in both cases. One document is allowed to access another if they have both set document.domain to the same value, indicating their intent to cooperate, or neither has set document.domain and the domains in the URLs are the same (implementation). Were it not for this special policy, every site would be subject to XSS from its subdomains (for example, https://bugzilla.mozilla.org could be attacked by bug attachments on https://bug*.bugzilla.mozilla.org).
I have a web application with some pretty intuitive URLs, so people have written Chrome extensions that use these URLs to make requests to our servers. Unfortunately, these extensions cause problems for us, hammering our servers, issuing malformed requests, etc., so we are trying to figure out how to block them, or at least make it difficult enough to craft requests to our servers that these extensions are dissuaded from being used (we provide an API they should use instead).
We've tried adding some custom headers to requests and junk-json-preamble to responses, but the extension authors have updated their code to match.
I'm not familiar with chrome extensions, so what sort of access to the host page do they have? Can they call JavaScript functions on the host page? Is there a special header the browser includes to distinguish between host-page requests and extension requests? Can the host page inspect the list of extensions and deny certain ones?
Some options we've considered are:
Rate-limiting QPS by user, but the problem is not all queries are equal, and extensions typically kick off several expensive queries that look like user entered queries.
Restricting the amount of server time a user can use, but the problem is that users might hit this limit by just navigating around or running expensive queries several times.
Adding static custom headers/response text, but they've updated their code to mimic our code.
Figuring out some sort of token (probably cryptographic in some way) that we include in our requests and that the extension can't easily guess. We minify/obfuscate our JS, so we are OK with embedding it in the JS source code (since the variable name it would have would be hard to guess).
I realize this may not be a 100% solvable problem, but we hope to either gain an upper hand in combating it, or make it sufficiently hard to scrape our UI that fewer people do it.
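The token idea in the last bullet is usually built on an HMAC: the server embeds a signed value in the served page, and verifies it on each incoming request. A minimal server-side sketch (the key and the choice of session ID as the signed payload are invented for illustration):

```python
import hmac
import hashlib

SECRET_KEY = b"server-side-secret"  # hypothetical; never shipped to clients

def make_token(session_id):
    """Token the server embeds in the page's JS for this session."""
    return hmac.new(SECRET_KEY, session_id.encode(), hashlib.sha256).hexdigest()

def verify_token(session_id, token):
    """Server-side check on each incoming request."""
    expected = make_token(session_id)
    # compare_digest avoids leaking information via timing differences.
    return hmac.compare_digest(expected, token)

token = make_token("session-123")
print(verify_token("session-123", token))     # True
print(verify_token("session-123", "forged"))  # False
```

Since the extension runs in the user's browser it can ultimately extract the token from the page, so this raises the cost of scraping rather than eliminating it, which matches the "upper hand" goal above.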
Welp, guess nobody knows. In the end we just sent a custom header and started tracking who wasn't sending it.
My boss asked me whether WebLog Expert (http://www.weblogexpert.com/lite.htm) is reliable at calculating the average time visitors spend on a web site. Since HTTP is a stateless protocol, I think the average time might be open to interpretation. Does anyone use WebLog Expert? Is the average visit time reliable? Does anyone understand the criteria it uses when processing Apache logs to derive the average time?
From the WebLog Expert Lite help, the following definition:
Visitor - The program determines number of visitors by the IP addresses. If a request from an IP address came after 30 minutes since the last request from this IP, it is considered to belong to a different visitor. Requests from spiders aren't used to determine visitors.
That's a fairly useful heuristic to determine a visitor's visit, if all you have to go on is a timestamp and a requesting IP address. (I'm not sure how Web Log Expert determines a visitor is a spider, but it was irrelevant to my purpose.)
However, on closer inspection, I found the average visit time to be highly variable for our web app; some users request only a page or two, others are on for hours. So a single "average visit duration" metric might not give you a complete picture of your site's traffic.
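The 30-minute rule quoted above is easy to reproduce from a raw log. A sketch with made-up timestamps (in seconds) for a single IP:

```python
TIMEOUT = 30 * 60  # 30 minutes, in seconds, matching the quoted heuristic

def split_visits(timestamps, timeout=TIMEOUT):
    """Group one IP's request times into visits: a gap longer
    than `timeout` starts a new visit."""
    visits = []
    for t in sorted(timestamps):
        if visits and t - visits[-1][-1] <= timeout:
            visits[-1].append(t)
        else:
            visits.append([t])
    return visits

# Two requests close together, then one 40 minutes later.
times = [0, 300, 300 + 40 * 60]
visits = split_visits(times)
print(len(visits))  # 2
durations = [v[-1] - v[0] for v in visits]
print(durations)    # [300, 0]
```

Note the second visit has a duration of zero: a single-request visit contributes no measurable time, which is one reason log-based "average visit duration" figures are so sensitive to the timeout chosen.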
I can't comment on that site in particular, but average time is usually calculated using some very clever bits of javascript.
You can set events on various parts of the page in javascript which fire off requests to servers. For example, when the user navigates away from a page or clicks on a link or closes the window the browser can send off a javascript request to their servers letting them know that the user has left. While this isn't 100% reliable, I think it provides a reasonable estimate for how long people spend there.
I get entirely different results if I change the "Visitor session timeout".
Our internal network users (the majority of our visitors) all reach our website (an external host) from the same IP (through our ISP), so the only way to detect a new visitor is via this timeout. Choosing 1, 5 or 10 minutes produces very different results: highly unreliable. The only thing to do is be consistent and use the same parameters for comparative results, i.e., increased/decreased traffic. By the way, the update to WebLog Expert (version 7 -> 8) threw all that out the window with entirely different counting mechanisms.
Our team has built a web application using Ruby on Rails. It currently doesn't restrict users from making excessive login requests. We want to ignore a user's login requests for a while after several failed attempts, mainly to defend against automated robots.
Here are my questions:
How do I write a program or script that can make excessive requests to our website? I need it to help test our web application.
How do I restrict a user who has made several unsuccessful login attempts within some period? Does Ruby on Rails have built-in solutions for identifying a requester and tracking whether she has made any recent requests? If not, is there a general (not Rails-specific) way to identify a requester and keep track of her activity? Can I identify a user by IP address, cookies, or some other information gathered from her machine? We also hope to distinguish normal users (who make infrequent requests) from automated robots (which make requests frequently).
Thanks!
One trick I've seen is to include form fields on the login form that CSS tricks make invisible to the user.
Automated systems/bots will still see these fields and may attempt to fill them with data. If you see any data in such a field, you immediately know it's not a legitimate user and can ignore the request.
This is not a complete security solution, but it's one trick you can add to the arsenal.
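The server-side half of that honeypot trick looks roughly like this (the field name "website" and the dict-shaped form data are invented; the same check is a one-liner in a Rails controller):

```python
# The login form includes an extra field, hidden from humans via CSS:
#   <input type="text" name="website" style="display:none">
# A real browser user leaves it empty; naive bots fill every field they see.

def is_honeypot_triggered(form_data):
    """True if the hidden field came back non-empty, i.e. likely a bot."""
    return bool(form_data.get("website", "").strip())

print(is_honeypot_triggered(
    {"login": "alice", "password": "pw", "website": ""}))            # False
print(is_honeypot_triggered(
    {"login": "bot", "password": "pw", "website": "spam.example"}))  # True
```

Avoid naming the field anything like "hidden" or "trap"; an innocuous name such as "website" or "url" is what makes naive form-fillers take the bait.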
Regarding #1, there are many automation tools out there that can simulate high-volume posting to a given URL. Depending on your platform, something as simple as wget might suffice, or something as complex (relatively speaking) as a script that asks a UserAgent to post a given request multiple times in succession (again, depending on platform and your language of choice, this can be simple).
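A script of that kind can be as small as a loop building form bodies and posting them. A sketch (the URL, field names, and the fake server are placeholders; only run such a loop against your own application):

```python
from urllib import parse

def build_login_attempts(attempts=10):
    """Generate form bodies for repeated failed login attempts."""
    return [parse.urlencode({"login": "testuser",
                             "password": f"wrong-{i}"}).encode()
            for i in range(attempts)]

def hammer_login(post, url, attempts=10):
    """POST each attempt via the supplied `post` callable; collect results."""
    return [post(url, body) for body in build_login_attempts(attempts)]

# With a real client you would pass something like
#   lambda url, body: requests.post(url, data=body).status_code
# Here, a fake server that starts rejecting after 5 failures:
def fake_post(url, body, state={"fails": 0}):
    state["fails"] += 1
    return 429 if state["fails"] > 5 else 401

statuses = hammer_login(fake_post, "http://localhost:3000/login", attempts=8)
print(statuses)  # [401, 401, 401, 401, 401, 429, 429, 429]
```

Injecting the `post` callable keeps the loop testable without a network; swapping in a real HTTP client turns it into the load test asked for in question 1.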
Regarding #2, consider first the lesser issue of someone just firing multiple attempts manually. Such instances usually share a session (the actual web server session); you should be able to track failed logins by session ID and force an early failure if the volume of failed attempts breaks some threshold. I don't know of any plugins or gems that do this specifically, but even if there isn't one, it should be simple enough to create a solution.
If the session ID does not work, a combination of IP and UserAgent is also a fairly safe means, although individuals who use a proxy may find themselves blocked unfairly by such a practice (whether that is an issue depends largely on your business needs).
If the attacker is malicious, you may need to look at using firewall rules to block their access, as they are likely going to: a) use a proxy (so IP rotation occurs), b) not use cookies during probing, and c) not play nice with UserAgent strings.
RoR provides means for testing your applications, as described in A Guide to Testing Rails Applications. A simple solution is to write a test containing a loop that sends 10 (or whatever value you define as excessive) login requests. The framework provides means for sending HTTP requests or faking them.
Not many people will abuse your login system, so just remembering the IP addresses of failed logins (for an hour, or whatever period you think is sufficient) would be sufficient and not too much data to store. Unless some hacker has access to a great many IP addresses... but in such situations you'd need more serious security measures, I guess.
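A minimal in-memory version of that idea, framework-agnostic (the window and threshold are illustrative; in Rails you would likely keep this state in the database or cache rather than a process-local dict):

```python
import time

WINDOW = 3600      # remember failures for one hour
MAX_FAILURES = 5   # lockout threshold; both values are illustrative

failures = {}  # ip -> list of failure timestamps

def record_failure(ip, now=None):
    now = now if now is not None else time.time()
    # Keep only failures inside the window, then add the new one.
    recent = [t for t in failures.get(ip, []) if now - t < WINDOW]
    recent.append(now)
    failures[ip] = recent

def is_blocked(ip, now=None):
    now = now if now is not None else time.time()
    recent = [t for t in failures.get(ip, []) if now - t < WINDOW]
    return len(recent) >= MAX_FAILURES

for _ in range(5):
    record_failure("203.0.113.7", now=1000.0)
print(is_blocked("203.0.113.7", now=1000.0))               # True
print(is_blocked("203.0.113.7", now=1000.0 + WINDOW + 1))  # False (window expired)
```

Because old timestamps fall out of the window on their own, the stored data stays small, matching the point above that this is not much data to keep.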