My boss asked me whether WebLog Expert (http://www.weblogexpert.com/lite.htm) is reliable in calculating the average time that incoming visitors spend on a web site. Since HTTP is a stateless protocol, I think that the average time might be something left to personal interpretation. Does anyone use WebLog Expert? Is the visitor's average time reliable? Does anyone understand the criteria it uses when processing Apache logs to work out the average time?
From the WebLog Expert Lite help, the following definition:
Visitor - The program determines number of visitors by the IP addresses. If a request from an IP address came after 30 minutes since the last request from this IP, it is considered to belong to a different visitor. Requests from spiders aren't used to determine visitors.
That's a fairly useful heuristic to determine a visitor's visit, if all you have to go on is a timestamp and a requesting IP address. (I'm not sure how WebLog Expert determines that a visitor is a spider, but it was irrelevant to my purpose.)
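To make the heuristic concrete, here is a minimal sketch of IP-plus-timeout sessionization in Python. It is not WebLog Expert's actual code; the function name and the input format (pairs of IP address and timestamp, already parsed from the log) are assumptions for illustration, and spider filtering is left out.

```python
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)  # the timeout quoted in the help text

def average_visit_seconds(requests, timeout=SESSION_TIMEOUT):
    """requests: iterable of (ip, datetime) pairs, in any order.

    Returns (number_of_visits, average_visit_duration_in_seconds).
    """
    visits = []       # one [first_seen, last_seen] pair per visit
    open_visit = {}   # ip -> index in visits of that IP's current visit
    for ip, ts in sorted(requests, key=lambda r: r[1]):
        idx = open_visit.get(ip)
        if idx is None or ts - visits[idx][1] > timeout:
            # First request from this IP, or the gap exceeded the timeout:
            # count it as a new visitor.
            visits.append([ts, ts])
            open_visit[ip] = len(visits) - 1
        else:
            visits[idx][1] = ts  # extend the current visit
    if not visits:
        return 0, 0.0
    total = sum((last - first).total_seconds() for first, last in visits)
    return len(visits), total / len(visits)
```

Note that a visit consisting of a single request gets a duration of zero, because the log cannot show how long the last page was read. That alone makes the average a lower-bound estimate rather than a measurement.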
However, on closer inspection, I found the visitor average time to be very variable for our web app; some users request only a page or two, others are on for hours. So a single metric of "Average visit duration" might not give you a perfect understanding of your site's traffic.
I can't comment on that site in particular, but average time is usually calculated using some very clever bits of JavaScript.
You can set events on various parts of the page in JavaScript which fire off requests to servers. For example, when the user navigates away from a page, clicks on a link, or closes the window, the browser can send off a JavaScript request to their servers letting them know that the user has left. While this isn't 100% reliable, I think it provides a reasonable estimate of how long people spend there.
I get entirely different results if I change "Visitor session timeout".
Our internal network people (the majority of our visitors) all go to our website (external host) from the same IP (through our ISP), so the only way to determine a new visitor is by this timeout. Choosing 1, 5 or 10 minutes creates very different results. HIGHLY UNRELIABLE. The only thing to do is be consistent and use the same parameters for comparative results, i.e., increased/decreased traffic. By the way, the update to WebLog Expert (version 7 -> 8) threw that all out the window with entirely different counting mechanisms.
So I have created an automation bot to do some stuff for me on the internet, using Selenium with Python. After long and grueling coding sessions, days and nights of working on this project, I have finally completed it, only to be randomly greeted with an Error 1015 "You are being rate limited".
I understand this is to prevent DDoS attacks. But it is a major blow.
I have contacted the website to resolve the matter, but to no avail. However, the third-party security software they use says that the website can grant my IP an exclusion from rate limiting.
So I was wondering whether there is any other way to bypass this, maybe from a coding perspective.
I don't think stuff like clearing cookies will resolve anything. Or will it, given that it is my specific IP address that they are blocking?
Note:
The terms and conditions of the website I am running my bot on don't say you can use automation software on it, but they don't say you can't either.
I don't mind coding some more to prevent random access denials, which I think last for 24 hours. That can be detrimental, as the final stage of this build is to have my program run daily for long periods of time.
Do you think I could communicate with the third-party security provider to ask them to ask the website to grant me access? I have already tried resolving the matter with the website. All they said was:
A. On their side it says I am fine.
B. The problem is most likely on my side: "Maybe some malicious software is trying to access our website". Malicious, no; a bot, yes. That's what made me think maybe it would be better if I resolved the matter myself.
Do you think I may have to implement wait times between processes or something? I'm stuck.
Thanks for any help. And it's a single bot!
Being randomly greeted with Error 1015 "You are being rate limited" implies that the site owner has implemented rate limiting that affects your visitor traffic.
rate-limiting reason
Cloudflare can rate-limit the visitor traffic when trying to counter a possible dictionary attack.
rate-limit thresholds
In generic cases, Cloudflare rate-limits a visitor when the visitor's traffic crosses the rate-limit threshold. A threshold is typically estimated as follows: divide 24 hours of uncached website requests by the unique visitors for the same 24 hours. Then, divide by the estimated average minutes of a visit. Finally, multiply by 4 (or larger) to establish an estimated threshold per minute for your website. A value higher than 4 is fine since most attacks are an order of magnitude above typical traffic rates.
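As a worked example of that arithmetic, here is a small sketch. The function name and the sample traffic figures are made up for illustration; only the formula comes from the description above.

```python
def rate_limit_threshold(uncached_requests_24h, unique_visitors_24h,
                         avg_visit_minutes, safety_factor=4):
    """Estimated allowed requests per visitor per minute."""
    requests_per_visitor = uncached_requests_24h / unique_visitors_24h
    requests_per_visitor_minute = requests_per_visitor / avg_visit_minutes
    return requests_per_visitor_minute * safety_factor

# Hypothetical site: 120,000 uncached requests and 2,000 unique visitors
# in 24 hours, with visits lasting about 5 minutes on average.
print(rate_limit_threshold(120_000, 2_000, 5))  # 48.0
```

With these assumed numbers a visitor averages 60 requests per visit, or 12 per minute, so the threshold comes out at 48 requests per minute. A bot that clicks through pages faster than that for a sustained period would be expected to trip the limit.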
Solution
In these cases, a potential solution would be to use undetected-chromedriver to initialize the Chrome browsing context.
undetected-chromedriver is an optimized Selenium Chromedriver patch which does not trigger anti-bot services like Distill Network / Imperva / DataDome / Botprotect.io. It automatically downloads the driver binary and patches it.
Code Block:
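import undetected_chromedriver as uc

# Use the ChromeOptions class that ships with undetected-chromedriver
options = uc.ChromeOptions()
options.add_argument("--start-maximized")
# uc.Chrome downloads and patches the driver binary before starting Chrome
driver = uc.Chrome(options=options)
driver.get('https://bet365.com')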
References
You can find a couple of relevant detailed discussions in:
Selenium app redirect to Cloudflare page when hosted on Heroku
Linkedin API throttle limit
I see some possibilities for you here:
Introduce wait time between requests to the site
Reduce the requests you make
Extend your bot to detect when it hits the limit and change your IP address (e.g. by restarting your router)
The last one is the least preferable I would assume and also the most time consuming one.
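The first two options can be sketched roughly as below. This is only an outline under assumptions: `is_rate_limited` is a hypothetical check you would write yourself (for instance, looking for "1015" in the page your bot just loaded), and the delay values are guesses that would need tuning against the site's real limits.

```python
import random
import time

def polite_delay(base_seconds=2.0, jitter_seconds=1.0):
    """Pause length between two actions, with jitter so intervals are not equal."""
    return base_seconds + random.uniform(0, jitter_seconds)

def backoff_delay(attempt, base_seconds=30.0, cap_seconds=3600.0):
    """Exponential backoff: 30 s, 60 s, 120 s, ... capped at one hour."""
    return min(cap_seconds, base_seconds * (2 ** attempt))

def run_with_throttle(actions, is_rate_limited, sleep=time.sleep, max_retries=5):
    """Run each callable in actions, waiting between them.

    If is_rate_limited(result) is true, wait with growing delays and retry
    the same action, up to max_retries times.
    """
    results = []
    for action in actions:
        for attempt in range(max_retries + 1):
            result = action()
            if not is_rate_limited(result):
                break
            sleep(backoff_delay(attempt))
        results.append(result)
        sleep(polite_delay())
    return results
```

Each action would be a small function wrapping one step of the bot, such as a `driver.get(...)` call that returns the page source. Passing `sleep` as a parameter is just there so the logic can be tried out without actually waiting.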
First: read the Terms of Use of the website and, for example, look at the robots.txt, which is usually at the root of the website, like www.google.com/robots.txt. Note that going against the website owner's explicit terms may be illegal depending on jurisdiction and may result in the owner blocking your tool and/or IP.
https://www.robotstxt.org/robotstxt.html
This will let you know what the website owner explicitly allows for automation and scraping.
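If you want the bot itself to honor those rules, Python's standard library can read them. A minimal sketch, assuming you have already downloaded the robots.txt content as a list of lines; the bot name and the example rules are made up:

```python
from urllib.robotparser import RobotFileParser

def load_rules(robots_txt_lines):
    """Build a parser from the lines of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt_lines)
    return parser

# Hypothetical robots.txt content, for illustration only.
rules = load_rules([
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 10",
])

print(rules.can_fetch("mybot", "https://example.com/private/report"))  # False
print(rules.can_fetch("mybot", "https://example.com/public/page"))     # True
print(rules.crawl_delay("mybot"))                                      # 10
```

Keep in mind that robots.txt only covers crawling rules. It does not replace reading the actual Terms of Use, which may say more (or less) about automation.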
After you've reviewed the website's terms and understand what they allow, and they do not respond to you, and you've determined you are not breaking the website's terms of use, the only other real option would be to utilize proxies and/or VPSs that will give the system running the scripts different IPs.
How to find out my site is being scraped?
I've some points...
Network Bandwidth occupation, causing throughput problems (matches if proxy used).
When querying a search engine for keywords, new references appear to other, similar resources with the same content (matches if proxy used).
Multiple requesting from the same IP.
High requests rate from a single IP. (by the way: What is a normal rate?)
Headless or weird user agent (matches if proxy used).
Requesting with predictable (equal) intervals from the same IP.
Certain support files are never requested, ex. favicon.ico, various CSS and javascript files (matches if proxy used).
The client's requests sequence. Ex. client access not directly accessible pages (matches if proxy used).
Would you add more to this list?
What points might fit/match if a scraper uses proxying?
As a first note: consider whether it's worthwhile to provide an API for bots in the future. If you are being crawled by another company and it is information you want to provide to them anyway, that makes your website valuable to them. Creating an API would reduce your server load substantially and give you 100% clarity on who is crawling you.
Second, coming from personal experience (I created web crawlers for quite a while), generally you can tell immediately by tracking which browser accessed your website. If they are using one of the automated ones, or one built into a programming language, it will be distinctly different from your average user. Not to mention tracking the log file and updating your .htaccess to ban them (if that's what you are looking to do).
Other than that, it's usually fairly easy to spot: repeated, very consistent opening of pages.
Check out this other post for more information on how you might want to deal with them, also for some thoughts on how to identify them.
How to block bad unidentified bots crawling my website?
I would also add analysis of when the requests by the same people are made. For example if the same IP address requests the same data at the same time every day, it's likely the process is on an automated schedule. Hence is likely to be scraping...
Possibly add analysis of how many pages each user session has touched. For example, if a particular user on a particular day has browsed to every page on your site and you deem this unusual, then perhaps it's another indicator.
It feels like you need a range of indicators and need to score them and combine the score to show who is most likely scraping.
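A rough sketch of that scoring idea, using a few of the indicators listed in this thread. All thresholds, weights, and user-agent hints are arbitrary assumptions for illustration, and the input is assumed to be already grouped per client (IP or session) from your logs.

```python
from statistics import mean, pstdev

SUPPORT_SUFFIXES = (".css", ".js", "favicon.ico")
AUTOMATION_HINTS = ("headless", "python", "curl", "wget", "scrapy")
WEIGHTS = {"high_rate": 3, "regular_intervals": 2,
           "no_support_files": 1, "odd_user_agent": 2}

def scraper_indicators(timestamps, paths, user_agent, max_per_minute=60):
    """Boolean indicators for one client.

    timestamps: request times in seconds, sorted ascending.
    paths: the requested URL paths for the same client.
    """
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    span_minutes = (timestamps[-1] - timestamps[0]) / 60 if len(timestamps) > 1 else 0
    rate = len(timestamps) / span_minutes if span_minutes > 0 else 0
    # Near-constant gaps between requests suggest a timer, not a person.
    regular = len(gaps) >= 5 and mean(gaps) > 0 and pstdev(gaps) / mean(gaps) < 0.1
    no_assets = bool(paths) and not any(p.lower().endswith(SUPPORT_SUFFIXES) for p in paths)
    odd_agent = (not user_agent) or any(h in user_agent.lower() for h in AUTOMATION_HINTS)
    return {"high_rate": rate > max_per_minute, "regular_intervals": regular,
            "no_support_files": no_assets, "odd_user_agent": odd_agent}

def scrape_score(indicators, weights=WEIGHTS):
    """Combine the indicators that fired into a single weighted score."""
    return sum(weights[name] for name, hit in indicators.items() if hit)
```

Sorting clients by this score gives a shortlist to inspect by hand. No single indicator is conclusive, and a scraper behind rotating proxies will spread its requests over many IPs, so per-IP indicators like rate and interval regularity will be weaker there.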
Assume http://chaseonline.chase.com is a real URL with a web server sitting behind it, i.e., this URL resolves to an IP address, or probably several, so that there can be a lot of identical servers that allow load balancing of client requests.
I guess that Chase probably buys up URLs that are "close" in the URL namespace. (How to define the term "namespace"? Lexicographically? I think that is not trivial, because it depends on an ordering that one defines on top of URL strings ... never mind this comment.)
Suppose that one of the URLs (http://mychaseonline.chase.com, http://chaseonline.chase.ua, http://chaseonline.chase.ru, etc.) is "free" (not bought). I buy one of these free URLs and write my phishing/spoofing server, which sits behind my URL and renders the following screen => https://chaseonline.chase.com/
I work to get my URL indexed (hopefully) at least as high as or higher than the real one (http://chaseonline.chase.com). Chances are (hopefully) that most bank clients/users won't notice my bogus URL, and I start collecting credentials. I then use my server as a client of the real bank server http://chaseonline.chase.com and use my collection/list of <user id, password> tuples to log in as each user and create mischief.
Is this a cross-site request forgery? How would one prevent this from occurring?
What I'm hearing in your description is a phishing attack, albeit with slightly more complexity. Let's address some of these points:
2) Really hard to get all the urls, especially when you take into consideration different variations such as unicode, or even just simple kerning hacks. For example the R and N in kerning looks a lot like an m when you look quickly. Welcome to chаse.rnobile.com! So with that said, I'd guess that most companies just buy the obvious domains.
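On the defensive side, the Unicode variations are at least easy to surface programmatically. A minimal sketch, with a made-up hostname where one letter is assumed to be a Cyrillic look-alike (written as an escape so it is visible in the code):

```python
import unicodedata

def non_ascii_characters(hostname):
    """Return (character, Unicode name) pairs for every non-ASCII character."""
    return [(ch, unicodedata.name(ch, "UNKNOWN"))
            for ch in hostname if ord(ch) > 127]

# "\u0430" is CYRILLIC SMALL LETTER A, which renders like a Latin "a".
lookalike = "ch\u0430se.example"
print(non_ascii_characters(lookalike))
print(non_ascii_characters("chase.example"))  # []
```

This catches homoglyphs but not pure ASCII tricks such as "rn" standing in for "m", which need a separate visual-similarity check.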
4) Getting your url indexed higher than the real one, I'll posit is impossible. Google et al. are likely sophisticated enough to catch that type of thing from happening. One approach to getting above chase in SERP would be to buy adwords for something like "Bank Online With Chase." But there again, I'd assume that the search engines have a decent filtering/fraud prevention mechanism to catch this type of thing.
Mostly you'd be better off to keep your server from being indexed since that would simply attract attention. Because this type of thing will be shut down, I presume most phishing attacks go for large numbers of small 'fish' (larger ROI) or small numbers of large 'fish' (think targeted phishing attacks of execs, bank employees, etc.)
I think you offer up an interesting idea in point 4, that there's nothing to stop a man-in-the-middle attack from occurring wherein your site delegates out to the target site for each request. The difficulty in that approach is that you'd spend a ton of resources on creating a replica website. When you think of most hacking as being a business, trying to maximize your ROI, then a lot of the "this is what I'd do if I were a hacker" ideas go away.
If I were to do this type of thing, I'd provide a login facade, have the user provide me their credentials, and then redirect to the main site on POST to my server. This way I get your credentials and you think there's just been an error on the form. I'm then free to pull all the information off of your banking site at my leisure.
There's nothing cross-site about this. It's a simple forgery.
It fails for a number of reasons: lack of security (your site isn't HTTPS), malware protection vendors explicitly check against this kind of abuse, Google won't rank your forgery above highly popular sites, and finally banks with a real sense of security use two-factor authentication. The login token you'd get for my bank account is valid for a few seconds, literally, and can't be used for anything but logging in.
I recently tried to use the default settings, that is:
5 - max number of concurrent requests
20 - max number of requests in 200 milliseconds
However, this started cutting off my personal connections to the website (loading JavaScript, CSS, etc.). I need something that will never fire for users using the site honestly, but I do want to prevent denial of service attacks.
What are good limits to set?
I don't think that there is a good generic limit that will fit all websites; it is specific to each website. It depends on RPS, request execution time, etc.
I suggest you modify the IIS logging to record the IP of each request. Then view the IIS logs to see what the traffic pattern is for users and how many requests they make within a normal flow. That should let you approximate the average number of requests coming from a user in a selected time frame.
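Once the IPs are in the log, something like the following sketch can show how close normal users get to the limit. It assumes you have already parsed the log into (IP, timestamp in milliseconds) pairs; the parsing itself depends on your log format and is left out, and sub-second timestamps may not be available in every log configuration.

```python
from collections import defaultdict, deque

def peak_requests_per_window(entries, window_ms=200):
    """entries: iterable of (ip, timestamp_ms) pairs.

    Returns {ip: largest number of requests seen within any window_ms span}.
    """
    by_ip = defaultdict(list)
    for ip, ts in entries:
        by_ip[ip].append(ts)
    peaks = {}
    for ip, times in by_ip.items():
        times.sort()
        window = deque()  # timestamps inside the current sliding window
        peak = 0
        for ts in times:
            window.append(ts)
            # Drop requests that are window_ms or more older than this one.
            while ts - window[0] >= window_ms:
                window.popleft()
            peak = max(peak, len(window))
        peaks[ip] = peak
    return peaks
```

If honest users regularly peak at, say, 25 requests per 200 ms because a page pulls in many scripts and stylesheets, a limit of 20 will cut them off, and the limit should sit comfortably above the peaks you observe.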
However, in my experience 20 requests within 200 milliseconds usually looks like an attack. In that sense the default settings provided by IIS seem reasonable.
Our team have built a web application using Ruby on Rails. It currently doesn't restrict users from making excessive login requests. We want to ignore a user's login requests for a while after she has made several failed attempts, mainly for the purpose of defending against automated robots.
Here are my questions:
How to write a program or script that can make excessive requests to our website? I need it because it will help me to test our web application.
How to restrict a user who made some unsuccessful login attempts within a period? Does Ruby on Rails have built-in solutions for identifying a requester and tracking whether she made any recent requests? If not, is there a general way to identify a requester (not specific to Ruby on Rails) and keep track of the requester's activities? Can I identify a user by ip address or cookies or some other information I can gather from her machine? We also hope that we can distinguish normal users (who make infrequent requests) from automatic robots (who make requests frequently).
Thanks!
One trick I've seen is including form fields on the login form that are made invisible to the user through CSS hacks.
Automated systems/bots will still see these fields and may attempt to fill them with data. If you see any data in such a field, you immediately know it's not a legit user and can ignore the request.
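The server-side check for this is tiny. A language-neutral sketch in Python (the same logic carries over to a Rails controller); the field name "website" is a made-up example of a hidden field:

```python
HONEYPOT_FIELD = "website"  # hypothetical name of the CSS-hidden form field

def is_probable_bot(form_data, honeypot_field=HONEYPOT_FIELD):
    """True if the hidden field, which real users never see, was filled in."""
    return bool(form_data.get(honeypot_field, "").strip())

print(is_probable_bot({"username": "alice", "password": "s3cret", "website": ""}))  # False
print(is_probable_bot({"username": "bot", "password": "x", "website": "http://spam.example"}))  # True
```

Giving the hidden field a plausible name makes it more likely that a naive bot fills it in.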
This is not a complete security solution but one trick that you can add to the arsenal.
In regards to #1, there are many automation tools out there that can simulate large-volume posting to a given URL. Depending on your platform, something as simple as wget might suffice, or something as complex (relatively speaking) as a script that asks a UserAgent to post a given request multiple times in succession (again, depending on platform, this can be simple; it also depends on the language of choice for task 1).
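For illustration, a script of that kind could look like the sketch below, using only the Python standard library. The URL and the form field names are assumptions and need to match your own login form; run it only against your own application.

```python
from urllib.error import HTTPError
from urllib.parse import urlencode
from urllib.request import Request, urlopen

def build_login_request(url, username, password):
    """Build a form-encoded POST. Field names are hypothetical."""
    body = urlencode({"username": username, "password": password}).encode("ascii")
    return Request(url, data=body)  # supplying data makes this a POST

def send_over_http(request):
    """Send the request and return the HTTP status code."""
    try:
        with urlopen(request, timeout=10) as response:
            return response.status
    except HTTPError as err:
        return err.code  # 4xx/5xx responses arrive as exceptions

def repeat_login(url, username, password, attempts=10, send=send_over_http):
    """Fire the same login attempt several times and collect the statuses."""
    return [send(build_login_request(url, username, password))
            for _ in range(attempts)]

# Example against a local development server (assumed address):
# print(repeat_login("http://localhost:3000/login", "alice", "wrong-password"))
```

One caveat for Rails: if CSRF protection is enabled on the login form, a bare POST like this will be rejected before it reaches the login logic, so the script would first need to fetch the form and include its authenticity token and session cookie.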
In regards to #2, consider first the lesser issue of someone just firing multiple attempts manually. Such instances usually share a session (that being the actual webserver session); you should be able to track failed logins based on these session IDs and force an early failure if the volume of failed attempts breaks some threshold. I don't know of any plugins or gems that do this specifically, but even if there is not one, it should be simple enough to create a solution.
If session ID does not work, then a combination of IP and UserAgent is also a pretty safe means, although individuals who use a proxy may find themselves blocked unfairly by such a practice (whether that is an issue or not depends largely on your business needs).
If the attacker is malicious, you may need to look at using firewall rules to block their access, as they are likely going to: a) use a proxy (so IP rotation occurs), b) not use cookies during probing, and c) not play nice with UserAgent strings.
RoR provides means for testing your applications, as described in A Guide to Testing Rails Applications. A simple solution is to write a test containing a loop that sends 10 (or whatever number you define as excessive) login requests. The framework provides means for sending HTTP requests or faking them.
Not many people will abuse your login system, so just remembering the IP addresses of failed logins (for an hour, or any period you think is sufficient) would be enough and not too much data to store. Unless some hacker has access to a great many IP addresses... But in such situations you'd need more serious security measures, I guess.
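A minimal in-memory sketch of that idea, written in Python to keep it framework-neutral. The class name and the default limits are assumptions; in a real Rails app the counters would more likely live in the cache or the database so they survive restarts and work across several server processes.

```python
import time
from collections import defaultdict, deque

class FailedLoginLimiter:
    """Remember recent failed logins per IP and block after too many."""

    def __init__(self, max_failures=5, window_seconds=3600, clock=time.time):
        self.max_failures = max_failures
        self.window_seconds = window_seconds
        self.clock = clock
        self._failures = defaultdict(deque)  # ip -> times of recent failures

    def _prune(self, ip):
        # Forget failures that are older than the window.
        cutoff = self.clock() - self.window_seconds
        failures = self._failures[ip]
        while failures and failures[0] <= cutoff:
            failures.popleft()

    def record_failure(self, ip):
        self._prune(ip)
        self._failures[ip].append(self.clock())

    def is_blocked(self, ip):
        self._prune(ip)
        return len(self._failures[ip]) >= self.max_failures
```

The login action would call `is_blocked(ip)` before checking the password and `record_failure(ip)` after a wrong one. The `clock` parameter is only there so the time window can be exercised in a test without waiting an hour.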