Programmatic Bot Detection

I need to write some code to analyze whether or not a given user on our site is a bot. If it's a bot, we'll take some specific action. Looking at the User-Agent only works for friendly bots, since a bot can send whatever User-Agent string it wants. I'm after the behaviors of unfriendly bots. Various ideas I've had so far are:
If you don't have a browser ID
If you don't have a session ID
Unable to write a cookie
Obviously, there are some cases where a legitimate user will look like a bot, but that's OK. Are there other programmatic ways to detect a bot, or at least to detect something that looks like a bot?

User agents can be faked. CAPTCHAs have been cracked. Valid cookies can be sent back to your server with page requests. Legitimate programs, such as Adobe Acrobat Pro, can go in and download your web site in one session. Users can disable JavaScript. Since there is no standard measure of "normal" user behaviour, it cannot be differentiated from a bot.
In other words: it can't be done, short of pulling the user into some form of interactive chat and hoping they pass the Turing Test; then again, they could be a really good bot too.

Clarify why you want to exclude bots, and how tolerant you are of mis-classification.
That is, do you have to exclude every single bot at the expense of treating real users like bots? Or is it okay if bots crawl your site as long as they don't have a performance impact?
The only way to exclude all bots is to shut down your web site. A malicious user can distribute their bot to enough machines that you would not be able to distinguish their traffic from real users. Tricks like JavaScript and CSS will not stop a determined attacker.
If a "happy medium" is satisfactory, one trick that might be helpful is to hide links with CSS so that they are not visible to users in a browser, but are still in the HTML. Any agent that follows one of these "poison" links is a bot.

A simple test is JavaScript:
<script type="text/javascript">
document.write('<img src="/not-a-bot.' + 'php" style="display: none;">');
</script>
The not-a-bot.php can add something into the session to flag that the user is not a bot, then return a single pixel gif.
The URL is broken up to disguise it from the bot.
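The answer above assumes PHP; purely as an illustration, the same idea sketched as an Express route (express-session assumed, pixel.gif being any 1x1 transparent GIF shipped with the app):
const fs = require('fs');
const onePixel = fs.readFileSync('pixel.gif');   // a 1x1 transparent GIF

app.get('/not-a-bot.php', (req, res) => {
  req.session.notABot = true;                    // this client executed the JavaScript
  res.type('image/gif').send(onePixel);
});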

Here's an idea:
Most bots don't download CSS, JavaScript, and images; they just parse the HTML.
If you keep track in a user's session of whether they download all of the above, e.g. by routing the download requests through a script that logs the attempts, then you can quickly identify users that only download the raw HTML (very few normal users will do this).
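A rough sketch of that idea as logging middleware (the paths, extensions, and the looksLikeBot() helper are made up for the example, and a session middleware is assumed):
// Record which kinds of resources each session actually fetches.
app.use((req, res, next) => {
  const seen = req.session.seen || (req.session.seen = {});
  if (/\.css$/.test(req.path)) seen.css = true;
  else if (/\.js$/.test(req.path)) seen.js = true;
  else if (/\.(png|jpe?g|gif|svg)$/.test(req.path)) seen.img = true;
  else seen.html = true;
  next();
});

// Sessions that only ever requested HTML are candidates for the bot flag.
function looksLikeBot(session) {
  const seen = session.seen || {};
  return seen.html && !seen.css && !seen.js && !seen.img;
}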

You say that it is okay that some users appear as bots, so:
Most bots don't run JavaScript. Use JavaScript to make an Ajax-like call to the server that identifies this IP address as NonBot. Store that for a set period of time to identify future connections from this IP as good clients and to prevent further wasteful JavaScript calls.
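Sketching that (the endpoint name and the in-memory map are assumptions; a real deployment would use a store with expiry, such as Redis):
// Client side: a tiny "I can run JavaScript" beacon.
fetch('/mark-non-bot', { method: 'POST', credentials: 'same-origin' });

// Server side: remember the IP for a while so later requests skip further checks.
const goodClients = new Map();                    // ip -> timestamp of last beacon
const TTL = 24 * 60 * 60 * 1000;                  // remember for one day

app.post('/mark-non-bot', (req, res) => {
  goodClients.set(req.ip, Date.now());
  res.status(204).end();
});

function isKnownGoodClient(ip) {
  const seen = goodClients.get(ip);
  return seen !== undefined && Date.now() - seen < TTL;
}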

For each session on the server you can determine if the user was at any point clicking or typing too fast. After a given number of repeats, set the "isRobot" flag to true and conserve resources within that session. Normally you don't tell the user that he's been robot-detected, since he'd just start a new session in that case.
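One way to sketch that check (the thresholds and session fields are arbitrary, and a session middleware such as express-session is assumed):
// Flag sessions that repeatedly fire requests faster than a human plausibly could.
const MIN_GAP_MS = 300;
const MAX_FAST_HITS = 5;

app.use((req, res, next) => {
  const now = Date.now();
  if (req.session.lastHit && now - req.session.lastHit < MIN_GAP_MS) {
    req.session.fastHits = (req.session.fastHits || 0) + 1;
    if (req.session.fastHits > MAX_FAST_HITS) req.session.isRobot = true;
  }
  req.session.lastHit = now;
  next();
});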

Well, this is really for a particular page of the site. We don't want a bot submitting the form because it messes up tracking. Honestly, the friendly bots (Google, Yahoo, etc.) aren't a problem, as they don't typically fill out the form to begin with. If we suspected someone of being a bot, we might show them a CAPTCHA image or something like that... If they passed, they're not a bot and the form submits...
I've heard of things like putting the form in Flash, or making the submit depend on JavaScript, but I'd prefer not to prevent real users from using the site unless I suspect they're a bot...

I think your idea with checking the session id will already be quite useful.
Another idea: You could check whether embedded resources are downloaded as well.
A bot which does not load images (e.g. to save time and bandwidth) should be distinguishable from a browser which typically will load images embedded into a page.
Such a check, however, might not be suited as a real-time check because you would have to analyze some sort of server log, which might be time consuming.

Hey, thanks for all the responses. I think that a combination of a few suggestions will work well: mainly the hidden form element that times how fast the form was filled out, and possibly the "poison link" idea. I think that will cover most bases. When you're talking about bots, you're not going to find them all, so there's no point thinking that you will... Silly bots.
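For instance, the timing check can be as simple as remembering when the form was rendered and measuring the gap on submit (the route names and the 3-second threshold are just for illustration; keeping the timestamp in the session rather than in a hidden field makes it harder to fake):
// When rendering the form, remember when it was served.
app.get('/form', (req, res) => {
  req.session.formRenderedAt = Date.now();
  res.send(formHtml);                    // formHtml: placeholder for however the form is normally rendered
});

// On submit: a human needs at least a few seconds to fill in a real form.
app.post('/form', (req, res) => {
  const elapsed = Date.now() - (req.session.formRenderedAt || 0);
  if (elapsed < 3000) {
    req.session.isRobot = true;          // suspiciously fast, likely a bot
    return res.status(200).end();        // accept quietly but don't record the submission
  }
  // ...normal form processing...
  res.redirect('/thanks');
});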

Related

hide fetched user data from backend

I am looking for a way to get user data and expose it in the UI without it showing up elsewhere in devtools - in other words, I would like the data not to appear in any request response.
I considered different possibilities, such as cookies or sessions, but none of them allow hiding the data before it is displayed in the UI.
So I wonder what the usual practice is, and whether using socket.io would be considered a hack.
The idea is:
The user is logged in and visits some page; regular API requests are made to serve the UI, and user data is required for UI purposes.
As an example:
Elements are displayed that the user can subscribe to, so depending on the user and their subscriptions, the style differs between followed and unfollowed elements.
Thank you in advance for your help.
I don't get the "why" you would want to do that. The normal user doesn't open devtools. The "hacker" user will most certainly not be prevented from getting that data. In the end, there are more tools than just the browser's devtools to sniff incoming and outgoing data, and since that is something you cannot prevent, there's no reason to do it in the browser in the first place.
What you can do, though, is encrypt the response in your backend and then decrypt it in your frontend. Since you need to send the decryption password as well, this will still not prevent anyone from decrypting the response messages, but obfuscating the decryption part somewhere in your code can at least make it a little more difficult (emphasis on "little").
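A minimal sketch of the backend side of that approach, assuming Node's built-in crypto module and AES-256-GCM (the key still has to reach the client somehow, so this only obfuscates; it does not protect the data):
const crypto = require('crypto');

function encryptPayload(obj, key) {                      // key: a 32-byte Buffer
  const iv = crypto.randomBytes(12);
  const cipher = crypto.createCipheriv('aes-256-gcm', key, iv);
  const ciphertext = Buffer.concat([cipher.update(JSON.stringify(obj), 'utf8'), cipher.final()]);
  const tag = cipher.getAuthTag();
  // Web Crypto's AES-GCM expects the auth tag appended to the ciphertext,
  // so send iv + ciphertext+tag and decrypt with crypto.subtle in the frontend.
  return {
    iv: iv.toString('base64'),
    data: Buffer.concat([ciphertext, tag]).toString('base64')
  };
}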

How to make proper and simple authentication for a Node.js website?

I am learning to make a website with Node.js, express, socket.io and MongoDB. I am pretty much self-taught, but when it comes to authentication, I can't find a tutorial that explains how it works in simple terms.
I have a login form and a signup form, and the user data is stored in the database on registration. When I log in, the page greets me with my username, but when I refresh or close the tab and come back, I have to log in again.
All I want is to make users able to come back without having to log in every time.
All I can find are explanations like this one: http://mherman.org/blog/2015/01/31/local-authentication-with-passport-and-express-4
And I don't really get it.
Can someone explain what I am missing here?
Session management is something that Jekrb highlighted, and it is also a great question when it comes to identifying users, whether they are anonymous visitors or registered users of your application.
Before I go into any depth, though, I'll point out that cookies have a slight problem if your application is going to work at a larger scale, namely this scenario: "What happens if you have N servers where N > 1?" So, to some degree, if you're unsure of your user base, cookies may not be the correct approach.
I'm going to presume that you don't have this issue, so cookies are an appropriate means of identifying users, though not the only method available.
This article outlines a few ways in which the industry tackles this:
https://www.kompyte.com/5-ways-to-identify-your-users-without-using-cookies/
My favorite method here would be canvas fingerprinting using https://github.com/Valve/fingerprintjs2, which will create a hash that you can store and use to verify new connections, probably with something like socket.io, which you've listed as using. A major upside of this is scalability, as we can store these unique hashes centrally in the database without the fear of always being stuck with one server.
Finally, I haven't posted any code, which I dislike, but the topic is hard to pin down to specifics; hopefully I have at least offered some alternatives to just cookies.
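For the original question (staying logged in across refreshes), the usual cookie-backed approach is a session middleware; here is a minimal sketch with express-session (the secret and maxAge are placeholders, and credential checking is left out):
const express = require('express');
const session = require('express-session');
const app = express();

app.use(express.urlencoded({ extended: false }));
app.use(session({
  secret: 'replace-with-a-long-random-string',
  resave: false,
  saveUninitialized: false,
  cookie: { maxAge: 7 * 24 * 60 * 60 * 1000 }    // stay logged in for a week
}));

app.post('/login', (req, res) => {
  // ...verify req.body.username / req.body.password against MongoDB here...
  req.session.username = req.body.username;      // remember the user in the session
  res.redirect('/');
});

app.get('/', (req, res) => {
  // On later visits the session cookie restores req.session, so no re-login
  // is needed until the cookie expires or the session is destroyed.
  res.send(req.session.username ? 'Hello ' + req.session.username : 'Please log in');
});
The default in-memory store only suits a single process; pointing express-session at a shared store (for example one backed by MongoDB) is what addresses the "N servers" concern above.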

Launching Custom Applications from the browser

I have been looking around SO and other online resources but can't seem to locate how this is done. I was wondering how things like magnet links work on torrent websites: they automatically open an application and pass it the appropriate params. I was wondering how I could create one to send a custom program params from the net?
Thanks, s654m
I wouldn't say this is an answer, but it is actually too long to fit in a comment.
Apps tend to register as authorities that can open a specific scheme. I don't know how it's done in desktop apps (especially because it will vary depending on the OS), but on Android you can catch schemes or base URLs with Intent Filters.
The way it works (and I'm pretty sure the functionality is cross-OS) is:
Your app tells the system it can "read" a specific scheme or base url (it could be magnet:// or even http://www.twitter.com/).
When you try to open a URI (Uniform resource identifier, a supergroup that can contain URLs), the system searches for any application that was registered for that kind of URI. I guess it runs from more specific and complete formats to the base. So for instance, this tweet: https://twitter.com/korcholis/status/491724155176222720 may be traced in this order:
https://twitter.com/korcholis/status/491724155176222720 Oh, no registrar? Moving on
https://twitter.com/korcholis/status Nothing yet? Ok
https://twitter.com/korcholis Nnnnnnope?
https://twitter.com Anybody? Ah, you, Totally random name for a Twitter Client know how to handle these links? Then it's yours
This random twitter client gets the full URI and does something accordingly.
As you see, nobody else had a chance to handle https://, since another application caught the URI before they could. In this case, that "nobody" would be your browsers.
It also defines, somehow, a default value. This is the real reason browsers battle to be your default browser of choice: they want to be the default applications that catch http://, https:// and probably some more.
The true wonder here is that, as long as there's an app that catches a scheme, you can use whichever scheme you want. For instance, it's common practice for apps from the same developer to register the same schemes, in case the developer wants to share tasks between them; this effectively ties the user to that group of apps. So, one app can just offer data such as:
my-own-scheme://user/12
While another app is registered to get links that start with
my-own-scheme://
So, if you want to make your own schemes, that's OK, as long as they don't collide with others'. And if you want to read others' schemes, well, that's up to you to research. See? This is not a real answer, but I hope it removes almost all doubt.
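On the desktop side, the registration step described above can be sketched like this in an Electron app (the scheme name is made up; native apps do the equivalent through OS-specific registration such as the Windows registry or an Info.plist entry):
const { app } = require('electron');

// Tell the OS this application handles my-own-scheme:// links.
app.setAsDefaultProtocolClient('my-own-scheme');

// macOS delivers the link through the 'open-url' event; on Windows/Linux it
// usually arrives in the command-line arguments of a second instance.
app.on('open-url', (event, url) => {
  event.preventDefault();
  handleLink(url);                                 // e.g. my-own-scheme://user/12
});

function handleLink(url) {
  const parsed = new URL(url);
  console.log('asked to open', parsed.hostname, parsed.pathname);   // "user", "/12"
}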

Identifying users without cookies etc

I want to write a news aggregator site where users can submit news items and vote them up or down (pretty basic stuff, similar to a tiny reddit).
My problem is this:
Someone can only up or downvote a news article once a day
I don't want users to sign up
Cookies for voting could be deleted
How do I identify a user over the course of the day, and how do I make sure that this user didn't vote on some article a few minutes before?
Is this even possible?
You could use the browser fingerprint.
The browser fingerprint is an identifier generated from the information that every browser sends on every connection (HTTP headers) and additional information available through basic JavaScript.
Information like:
User agent
Language
Installed plugins
Screen resolution
... and more.
A browser fingerprint identification isn't bulletproof because there are self-defense tactics but it can spice up your recipe. Despite its controversy, it's widely used.
Mozilla has a great wiki article about the subject.
And you can check your own browser fingerprint at https://panopticlick.eff.org/
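A very stripped-down illustration of the idea in the browser (real libraries combine many more signals; the /vote-id endpoint is made up):
// Combine a few browser traits and hash them into an identifier.
// Deliberately simplistic: collisions and evasion are easy.
async function simpleFingerprint() {
  const traits = [
    navigator.userAgent,
    navigator.language,
    screen.width + 'x' + screen.height,
    screen.colorDepth,
    new Date().getTimezoneOffset()
  ].join('||');
  const digest = await crypto.subtle.digest('SHA-256', new TextEncoder().encode(traits));
  return Array.from(new Uint8Array(digest))
    .map(b => b.toString(16).padStart(2, '0'))
    .join('');
}

simpleFingerprint().then(id => {
  // Send the hash to the server and use it as the "voter id" for the day.
  fetch('/vote-id', { method: 'POST', body: id });
});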
Short answer: No, it is not possible to reliably identify a user without login and without using cookies or a similar technique.
I hate to post this, but the evercookie project is a good collection of techniques for making something like a cookie that is somewhat more persistent than your standard cookie. It uses some neat tricks, but one could also argue that it has some privacy issues. I would not recommend you implement it. Even if you did (or borrowed some of its ideas), then:
Any remotely tech-savvy user would still be able to clear the cookie.
You can't guard against users using multiple devices and browsers.
You can't (reliably) guard against users not posting via a browser, thus circumventing cookies and other tricks.
Etc, etc.

Stopping a bot attack: a server-side solution (without a CAPTCHA or JavaScript)

I inherited some code that was recently attacked by repeated remote form submissions.
Initially I implemented some protection by setting a unique session auth token (not the session id). While I realize this specific attack is not CSRF, I adapted my solution from these posts (albeit dated).
https://www.owasp.org/index.php/Cross-Site_Request_Forgery_%28CSRF%29
http://tyleregeto.com/a-guide-to-nonce
http://shiflett.org/articles/cross-site-request-forgeries
I've also read existing posts on SO, such as Practical non-image based CAPTCHA approaches?
However, the attacker now requests the form page first, starting a valid session, and then passes the session cookie in the following POST request, thereby obtaining a valid session token. So, fail on my part.
I need to put some additional preventative measures in place. I'd like to avoid CAPTCHA (due to poor user experience) and JavaScript solutions if possible. I've also considered referrer checks (which can be faked), honeypots (hidden fields), and rate limiting (which the attacker can get around by throttling their requests). This attacker is persistent.
With that said, what would be a more robust solution?
If a human is attacking your page specifically, then you need to find what makes this attacker different from a regular user.
If they spam you with certain URLs, text, or the like, block those after they are submitted.
You can also quarantine submissions - don't let them go for say 5 minutes. Within those 5 minutes if you receive another submission to the same form from the same IP - discard both posts and block the IP.
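A rough sketch of that quarantine idea (in-memory structures for brevity; across several servers this would need a shared store):
// Hold each submission for 5 minutes. A second post to the same form from the
// same IP inside that window discards both and blocks the IP.
const QUARANTINE_MS = 5 * 60 * 1000;
const pending = new Map();                  // ip:form -> { body, timer }
const blockedIps = new Set();

function handleSubmission(ip, formId, body, publish) {
  if (blockedIps.has(ip)) return;           // already blocked, drop silently
  const key = ip + ':' + formId;
  if (pending.has(key)) {                   // repeat within the quarantine window
    clearTimeout(pending.get(key).timer);
    pending.delete(key);
    blockedIps.add(ip);
    return;
  }
  const timer = setTimeout(() => {          // quiet for 5 minutes: accept it
    pending.delete(key);
    publish(body);
  }, QUARANTINE_MS);
  pending.set(key, { body, timer });
}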
CAPTCHA is good if you use a good CAPTCHA, because many custom home-made CAPTCHAs are now recognized automatically by specially crafted software.
To summarize: your problem needs not just technical but also social solutions, aimed at neutralizing the botmaster rather than preventing the bot from posting.
CAPTCHAs were invented for this exact reason. Because there is NO WAY to differentiate 100% between human and bot.
You can throttle your users by incrementing a server-side counter, and when it reaches X, consider it a bot attack and lock the site down. Then, when some time has elapsed (save the time of the attack as well), allow entry again.
I've thought a little about this myself.
I had an idea to extend the session auth token to also store a set of randomized form variable names. So instead of
<input name="title" ... >
you'd get
<input name="aZ5KlMsle2" ... >
and then additionally add a bunch of trap fields, which are hidden via CSS.
If any of the traps are filled out, then it was not a normal user, but a bot examining your HTML source...
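Sketching both pieces (the field and trap names are arbitrary; express-session and a body parser are assumed):
const crypto = require('crypto');

// When rendering the form: give the real field a per-session random name.
function renderForm(req) {
  const titleField = 'f' + crypto.randomBytes(6).toString('hex');
  req.session.fieldMap = { title: titleField };
  return '<form method="post" action="/submit">' +
         '<input name="' + titleField + '">' +
         // trap field: hidden via CSS, so humans never see or fill it
         '<input name="website" style="display:none">' +
         '<button>Send</button></form>';
}

// When handling the submit:
function isBotSubmission(req) {
  const realName = req.session.fieldMap && req.session.fieldMap.title;
  if (!realName || !(realName in req.body)) return true;   // didn't use the rendered form
  if (req.body.website) return true;                       // filled the hidden trap
  return false;
}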
How about a hidden form field? If it gets filled in automatically by the bot, you accept the request but silently discard it.
