Limiting number of connections from bottlepy's `run`? - wsgi

Is it possible to limit the number of connections, or restrict access to a certain number of unique IP addresses, with Bottle's run command?
All I could find was this: http://bottlepy.org/docs/dev/deployment.html

In a word, no.
There's no built-in way to do this. (You didn't include a lot of detail, so I'm making some assumptions about what I think you might have in mind.) In theory (and only in theory--please don't go this route), you could write a Bottle plug-in that tracks connections and denies requests depending on certain access patterns. But there are many reasons you shouldn't.
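Purely for illustration (and with the same caveat that you shouldn't do this in production), here is a rough sketch of what such a plug-in could look like; the per-minute limit and the reliance on request.remote_addr are assumptions of mine, not anything from Bottle's docs or your question:
import time
from collections import defaultdict
import bottle

MAX_REQUESTS_PER_MINUTE = 60          # illustrative threshold
_hits = defaultdict(list)             # ip -> timestamps of recent requests

def rate_limit_plugin(callback):
    # Bottle accepts a plain decorator-style callable as a plugin.
    def wrapper(*args, **kwargs):
        ip = bottle.request.remote_addr or "unknown"
        now = time.time()
        _hits[ip] = [t for t in _hits[ip] if now - t < 60]   # keep the last minute
        if len(_hits[ip]) >= MAX_REQUESTS_PER_MINUTE:
            bottle.abort(429, "Too many requests")
        _hits[ip].append(now)
        return callback(*args, **kwargs)
    return wrapper

app = bottle.Bottle()
app.install(rate_limit_plugin)

@app.route("/")
def index():
    return "hello"

# app.run(host="0.0.0.0", port=8080)
Note that remote_addr can be spoofed or hidden behind proxies, and the in-process dict disappears on restart and doesn't work across multiple workers - more reasons to leave this job to the front-end server.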
Instead, you should rely on your web server (the one in which Bottle will run in production--e.g., Apache, nginx) to handle this. For example, here's an SO discussion on rate limiting in Apache.
Sorry there's no "yes" answer; but hope this helps! Cheers.

Related

How to deal with user input files (images / video)?

In our company, we have to deal with a lot of user uploads, for example images and videos. Now I was wondering: how do you guys "deal with that" in terms of safety? Is it possible for an image to contain malicious content? Of course, there are the "unwanted" pixels, like porn or something. But that's not what I mean now. I mean images which "break" machines while they are being decoded, etc. I already saw this: How can a virus exist in an image.
Basically I was planning to do this:
Create a DMZ
Store the assets in a bucket (we use GCP here) which lives inside the DMZ
Then apply "malicious code"-detection on the file
If it turns out to be fine... then move the asset into the "real" landscape (the non-DMZ)
Now, the third step... what can I do here?
Applying a virus scanner
No problem with this; there are a lot of options here. It's a simple approach, with a good chance that viruses will be caught.
Do mime-type detection
Based on the first few bytes, I do MIME type detection. For example, if someone sends us an "image.jpg" that is in fact an executable, we would detect this. Right? Is this safe enough? I was thinking about this package.
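As an illustration of the kind of check described here (done by hand rather than with any particular package; the byte prefixes below cover only a few common image formats and are purely illustrative):
# Minimal signature ("magic byte") sniffing for a handful of image formats.
JPEG = b"\xff\xd8\xff"
PNG = b"\x89PNG\r\n\x1a\n"
GIFS = (b"GIF87a", b"GIF89a")

def sniff_image_type(path):
    with open(path, "rb") as f:
        head = f.read(16)                 # the first bytes are enough for these formats
    if head.startswith(JPEG):
        return "image/jpeg"
    if head.startswith(PNG):
        return "image/png"
    if head.startswith(GIFS):
        return "image/gif"
    return None                           # unknown: treat as suspicious

# Example: flag an "image.jpg" whose content is not actually a JPEG.
# if sniff_image_type("image.jpg") != "image/jpeg":
#     reject_upload()
Keep in mind this only checks what the file claims to be structurally; it says nothing about whether a decoder can process it safely, which is what the scanning steps are for.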
What else???
Now... what else can I do? How do other big companies do this? I'm not really looking for answers in terms of orchestration, etc. I know how to use a DMZ, link it all together with a few pubsub topics, etc. I'm purely interested in what techniques to apply to really find out that an incoming asset is "safe".
What I would suggest is not to do it outside the DMZ; let this live within your DMZ, and it should have all the regular security controls that any other system within your data center has.
Besides the things (virus scan, MIME-type detection) that you have outlined, I would suggest a few additional checks to perform.
Size Limitation - You would not want anyone to just bloat out all the space and choke your server (a minimal sketch follows this list).
Throttling - Again, you may want to control the throughput, or at least have the ability to limit it to some maximum value.
Heuristic Scan - Perhaps add a layer to the antivirus to do heuristic scans as well, rather than simple signature scans.
File System Access Control - Make sure that the file system access control is foolproof; even if something malicious comes in, it should not be able to propagate to other folders/paths.
Network Control - Make sure all outbound connections are firewalled as well, just in case anything tries to make outward connections.
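A minimal sketch of the size-limitation point from the list above; the 25 MB cap and chunk size are arbitrary illustrative values:
import os

MAX_UPLOAD_BYTES = 25 * 1024 * 1024   # illustrative cap
CHUNK_SIZE = 64 * 1024

def save_with_limit(src_fileobj, dest_path, limit=MAX_UPLOAD_BYTES):
    """Stream an upload to disk, aborting once it exceeds the limit."""
    written = 0
    try:
        with open(dest_path, "wb") as dest:
            while True:
                chunk = src_fileobj.read(CHUNK_SIZE)
                if not chunk:
                    return written
                written += len(chunk)
                if written > limit:
                    raise ValueError("upload exceeds size limit")
                dest.write(chunk)
    except ValueError:
        os.remove(dest_path)              # discard the partial file
        raise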

Is a DNS query with the authoritative bit set (or other bits used for responses) considered valid?

From RFC 1035:
Authoritative Answer - this bit is valid in responses,
and specifies that the responding name server is an
authority for the domain name in question section.
So, what happens if this bit is set in a DNS query (QR = 0)? Do most DNS implementations treat the packet as invalid, or would the bit simply be ignored?
The same question applies to other bits that are specific to either queries or responses, such as setting the RD bit in a response.
My guess is that these bits are simply ignored if they aren't applicable to the packet in question, but I don't know for sure or how I would find out.
I'm asking because I'm writing my own DNS packet handler and want to know whether such packets should still be parsed or treated as invalid.
You either apply Postel's law ("Be conservative in what you do, be liberal in what you accept from others") - which is often touted as one reason for the success and interoperability of so many different things on top of the Internet - or, if you strictly apply the RFC, you deem the packet invalid and can reply immediately with FORMERR, for example.
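To make the two options concrete, here is a small sketch that looks only at the 12-byte DNS header; the flag masks follow RFC 1035, while the STRICT switch and the bare-bones FORMERR reply are my own illustrative choices:
import struct

QR = 0x8000            # query (0) / response (1)
AA = 0x0400            # authoritative answer (response-only bit)
RA = 0x0080            # recursion available (response-only bit)
RCODE_FORMERR = 1

STRICT = False         # True: reject per the RFC; False: Postel-style, ignore the bits

def handle_header(packet: bytes):
    ident, flags, qdcount, ancount, nscount, arcount = struct.unpack("!6H", packet[:12])
    is_query = not (flags & QR)
    if is_query and (flags & (AA | RA)):        # response-only bits set in a query
        if STRICT:
            # Minimal FORMERR: echo the ID, set QR and the RCODE, zero the counts.
            return struct.pack("!6H", ident, QR | RCODE_FORMERR, 0, 0, 0, 0)
        flags &= ~(AA | RA)                     # liberal: just clear and ignore them
    # ... continue parsing the question section as usual ...
    return None
A real server would typically also copy the opcode and RD bit and echo the question section in a FORMERR reply, but the skeleton shows where the policy decision lives.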
In the second case, since you will get deviating clients (not necessarily for your specific case; in the DNS world there are a lot of non-conforming implementations on various points), you will need to decide whether to create specific rules (like ACLs) to accept some of them nevertheless, because you deem them to be "important".
Note that at this stage your question is not really programming related (no code), so it is kind of off-topic here. But the answer also depends on what kind of "packet handler" you are building. If it is for some kind of IDS/monitoring/etc., you need to parse "as much as possible" of the DNS traffic in order to report on it. If it is to mimic a real-world DNS resolver and just make sure it behaves like a resolver, then you probably do not need to deal with every strange deviating case.
Also remember that all of this can be changed in transit, so if you receive something erroneous it is not necessarily an error coming from the sender; it could be introduced by some intermediary, willingly or not.
To finish, it is impossible to predict everything you will get, and in any wide enough experiment you will be surprised by the amount of traffic you cannot understand how it came to exist. So instead of trying to define everything before starting, you should iterate over versions, keeping a clear view of your target (parsing as much as possible for some kind of monitoring system, OR staying as lean/simple/secure/close to real-world DNS resolution features as possible).
And as for "how I would find out": you can study the source of various existing resolvers (BIND, NSD, Unbound, etc.) and see how they react. Or just launch them, throw at them erroneous packets like the ones you envision, and see their replies. Some cases probably exist as unit/regression tests, and tools like ZoneMaster could probably be extended (if they don't do those specific tests already) to cover your cases.

Sort domains by number of public web pages?

I'd like a list of the top 100,000 domain names sorted by the number of distinct, public web pages.
The list could look something like this
Domain Name 100,000,000 pages
Domain Name 99,000,000 pages
Domain Name 98,000,000 pages
...
I don't want to know which domains are the most popular. I want to know which domains have the highest number of distinct, publicly accessible web pages.
I wasn't able to find such a list in Google. I assume Quantcast, Google or Alexa would know, but have they published such a list?
For a given domain, e.g. yahoo.com, you can Google-search site:yahoo.com; at the top of the results it says "About 141,000,000 results (0.41 seconds)". This includes subdomains like www.yahoo.com and it.yahoo.com.
Note also that some websites generate pages on the fly, so they might, in fact, have an infinite number of "pages". A given page is computed when asked for and forgotten as soon as it is sent, and each one can contain a link to the next page. Since many websites compose their pages on the fly this way, there is no real way to tell the difference (except that there are infinitely many pages, which you can't find out unless you ask for them all).
Keep in mind a few things:
Many websites generate pages dynamically, leaving a potentially infinite number of pages.
Pages are often behind security barriers.
Very few companies are interested in announcing how much information they maintain.
Indexes go out of date as they're created.
What I would be inclined to do for specific answers is mirror the sites of interest using wget and count the pages.
wget -m --wait=9 --limit-rate=10K http://domain.test
Keep it slow, so that the company doesn't mistake you for a denial-of-service attack.
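Once the mirror finishes, a quick way to count what wget saved (assuming the default layout, where the mirror lands in a directory named after the host):
from pathlib import Path

mirror_root = Path("domain.test")          # matches the host in the wget command above
html_pages = [p for p in mirror_root.rglob("*")
              if p.suffix.lower() in (".html", ".htm")]
print(f"{len(html_pages)} pages mirrored under {mirror_root}")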
Most search engines will allow you to search their index by site as well, though the information on the result pages is good for little more than a rough order of magnitude, and there's no way to know how much they've actually indexed.
I don't see at a glance where they keep, or whether they expose, their databases, but down the search-engine path you might also be interested in the Seeks and YaCy search engine projects.
The only organization I can think of that might (a) have the information easily available and (b) be friendly and transparent enough to want to share it would be the folks at The Internet Archive. Since they've been archiving the web with their Wayback Machine for a long time and are big on transparency, they might be a reasonable starting point.

When is it OK to intentionally obfuscate URLs?

Having friendly URLs is generally a good thing. However, there are times when it seems like a bad idea. What is your rule of thumb?
For instance, consider a situation where I want to show a Registration Success page. I want all of the underlying logic to be the same. However, depending on how they registered, I may want to display a different message for someone who registered under a certain type of role.
Here are a few off-the-cuff examples of "hackable" (as described in the link) URLs:
http://www.example.com/RegistrationSuccess.aspx?IsCertainRole=true
http://www.example.com/RegistrationSuccess.aspx?role=CertainRole
http://www.example.com/RegistrationSuccess.aspx?r=2876
All of these seem bad since I don't want the URLs to be discoverable. On the other hand, I hate to do something more complex just to modify the success message slightly.
How would you handle this?
Bear in mind that obfuscating URLs is NOT a security measure. You should never trust outside input - filter, sanitize and implement restrictive logic. No matter how clever you believe your obfuscation scheme to be, people have cracked much more complicated security schemes with relative ease.
As a general rule of thumb - there is no good reason to obfuscate URLs intentionally. Use URLs to communicate read operations (a path to a resource). Use POST requests to communicate write operations (adding/modifying data). If a user isn't supposed to be able to do something through the URL, it should be regulated server side and through the request method.
You can either POST the data, or, if that's not an option, set the value in a Session variable and then read the value in the success page. The actual complexity of code using the Session is about the same as using the query string.
OK, if you don't think this is a security issue, since you are only displaying a different message, then why do you care whether it's hackable or not?
Most users wouldn't notice the URL is editable, so why obfuscate? The "elite hackers" will get a slightly different message; big deal.
The general answer to "Should I obfuscate...?" is no. If it's for security, hell no; otherwise, why are you obfuscating? Most likely, you are wasting time.
URLs are for uniquely referencing content. When the content is the result of a process that involves several steps of dialogue, it can't really have a URL, because the URL does not reproduce the process.
I would forward them to RegistrationSuccess.aspx and present contents there based on the state of the session.
If somebody comes to that URL without the suitable session state, I would forward them to the front page after 5 seconds of looking at a friendly message explaining that there is nothing to see.
A better choice yet may be to forward them to MyRegistration.aspx, which is something they would perhaps like to bookmark as a favourite. Coming from the Registration process, it may have a box explaining that they have successfully registered. If they are not coming from the Registration process, this box is not presented. The rest of the page is a summary of all earlier Registration processes for that user.
With a POST submission?
If you don't want information in the URL, don't put it in the URL.
It's not always easy to do...
I would say that for pages you want to be easily indexed by the search engines, use URL routing. This includes high-traffic pages.
For other pages, which users will only visit a few times a month or year, you can leave the normal URLs.
If you must absolutely use the URL for private/personalized data, you'd probably be better off generating a random unique identifier on the server and using that in your URL. Kind of like confirmation e-mails where you have to click a link.
Otherwise, if there's any other way to not include data in the URL, you shouldn't. In the case of a successful registration, either the person just registered and you should be in a current session, or you should require them to login before they see the customized page.
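A tiny sketch of that random-identifier idea; the in-memory dict is a stand-in for whatever real datastore (with expiry) you would use, and the names are made up:
import secrets

_pending = {}                              # token -> registration info (hypothetical store)

def make_success_url(role: str) -> str:
    token = secrets.token_urlsafe(16)      # unguessable, non-sequential identifier
    _pending[token] = {"role": role}
    return f"/RegistrationSuccess.aspx?t={token}"

def lookup(token: str):
    return _pending.pop(token, None)       # one-time use; None means invalid or expired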
Why not make the "registration success" message the last step, without changing pages?
You can use Ajax or Server.Transfer() to do that.
I could check against a whitelist of referring URLs so that they can't just type in a different URL. That might eliminate an obvious "hack" from a passerby.
(Obviously, you can get around this if you're a nerd.)
You could compute some sort of checksum or hash over the query-string items, so that if they mess around with the URL, the checksum fails and they get kicked out to the main page.
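A sketch of that checksum idea using an HMAC, so the value can't be recomputed without a server-side secret; the secret and parameter names are placeholders:
import hmac
import hashlib

SECRET_KEY = b"change-me"                  # keep this on the server only

def sign(value: str) -> str:
    return hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()

def build_url(role_id: int) -> str:
    value = str(role_id)
    return f"/RegistrationSuccess.aspx?r={value}&sig={sign(value)}"

def verify(value: str, sig: str) -> bool:
    # constant-time comparison avoids leaking the signature byte by byte
    return hmac.compare_digest(sign(value), sig)
If verify() fails, kick the request out to the main page as described.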

When the bots attack! [closed]

What are some popular spam prevention methods besides CAPTCHA?
I have tried doing 'honeypots', where you add a field and then hide it with CSS (marking it as 'leave blank' for anyone with stylesheets disabled), but I have found that a lot of bots are able to get past it very quickly. There are also techniques like setting fields to a certain value and changing them with JS, calculating the time between page load and submit, checking the referer URL, and a million other things. They all have their pitfalls, and pretty much all you can hope for is to filter as much as you can with them while not alienating the people you're here for: the users.
At the end of the day, though, if you really, really don't want bots to be sending things through your form, you're going to want to put a CAPTCHA on it - the best one I've seen that takes care of mostly everything is reCAPTCHA - but thanks to India's CAPTCHA-solving market and the ingenuity of spammers everywhere, that's not even successful all of the time. I would beware of using something that is 'ingenious' but kind of 'out there', as it would be more of a 'wtf' for users who are at least somewhat used to the usual CAPTCHAs.
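For reference, a minimal framework-agnostic sketch of the honeypot and timing checks mentioned above; the field names and the three-second threshold are invented for illustration:
import time

MIN_SECONDS_TO_SUBMIT = 3                  # humans rarely submit a form this fast

def looks_like_bot(form: dict) -> bool:
    # Honeypot: the "website" field is hidden with CSS; humans leave it empty.
    if form.get("website", "").strip():
        return True
    # Timing: the page embedded its render time in a hidden "form_ts" field.
    try:
        rendered_at = float(form.get("form_ts", "0"))
    except ValueError:
        return True
    return time.time() - rendered_at < MIN_SECONDS_TO_SUBMIT
In practice the timestamp should be signed or kept server-side, since a bot can forge a hidden field just as easily as it can fill one in.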
Shocking, but almost every response here included some form of CAPTCHA. The OP wanted something different; I guess maybe he wanted something that actually works, and maybe even solves the real problem.
CAPTCHA doesn't work, and even if it did, it's the wrong problem - humans can still flood your system, and by definition CAPTCHA won't stop that (because it's designed only to tell whether you're a human or not - not that it does that well...).
So, what other solutions are there? Well, it depends... on your system and your needs.
For instance, if all you're trying to do is limit how many times a user can fill out a "Contact Me" form, you can simply throttle how many requests each user can submit per hour/day/whatever. If your users are anonymous, maybe you need to throttle according to IP addresses, and occasionally blacklist an IP (though this too can be circumvented, and causes other problems).
If you're referring to a forum or blog comments (such as this one), well, the more I use this site the more I like its solution. A mix of authenticated users, authorization (based on reputation, which is not likely to be accumulated through flooding), throttling (how many you can do a day), the occasional CAPTCHA, and finally community moderation to clean up the few that get through - all combine to provide a decent solution. (I wonder if Jeff can provide some info on how much spam and other malposts actually get through...?)
Another control to consider (don't know if they have it here) is some form of IDS/IPS - if you can detect and recognize spam, you can block THAT pattern. Moderation fills that need manually here...
Note that no one of these prevents the spam outright, but each incrementally lowers the probability, and thus the profitability. This changes the economic equation, and leaves CAPTCHA to actually provide enough value to be worth it - since it's no longer worth it for the spammers to bother breaking it or going around it (thanks to the other controls).
Give the user something simple to calculate:
What is the sum of 3 and 8?
By the way, I just came across an interesting approach from Microsoft Research: Asirra.
http://research.microsoft.com/asirra/
It shows you several pictures and you have to identify the pictures with a given motif.
Try Akismet
Captchas or any form of human-only questions are horrible from a usability perspective. Sometimes they're necessary, but I prefer to kill spam using filters like Akismet.
Akismet was originally built to thwart spam comments on WordPress blogs, but the API is capable of being adapted for other uses.
Update: We've started using the ruby library Rakismet on our Rails app, Yarp.com. So far, it's been working great to thwart the spam bots.
A very simple method, which puts no load on the user, is just to disable the submit button for a second after the page has been loaded. I used it on a public forum which had continuous spam posts, and it has stopped them ever since.
Ned Batchelder wrote up a technique that combines hashes with honeypots for some wickedly effective bot-prevention. No captchas, just code.
It's up at Stopping spambots with hashes and honeypots:
Rather than stopping bots by having people identify themselves, we can stop the bots by making it difficult for them to make a successful post, or by having them inadvertently identify themselves as bots. This removes the burden from people, and leaves the comment form free of visible anti-spam measures.
This technique is how I prevent spambots on this site. It works. The method described here doesn't look at the content at all. It can be augmented with content-based prevention such as Akismet, but I find it works very well all by itself.
http://chongqed.org/ maintains blacklists of active spam sources and the URLs being advertised in the spams. I have found filtering posts for the latter to be very effective in forums.
The most common ones I've observed revolve around having the user solve simple puzzles, e.g. "which of the following is a picture of a cat?" (displaying thumbnails of dogs surrounding a cat), or simple math problems.
While interesting, I'm sure the arms race will eventually overwhelm those systems too.
You can use reCAPTCHA to at least make a CAPTCHA useful. Then you can ask questions with simple verbal math problems or similar. Microsoft's Asirra makes you find pictures of cats and dogs. Requiring a valid email address to activate an account stops spammers when they wouldn't get enough benefit from the service, but might deter normal users as well.
The following is unfeasible with today's technology, but I don't think it's too far off. It's also probably overkill for dealing with forum spam, but could be useful for account sign-ups, or any situation where you wanted to be really sure you were dealing with humans and they would be prepared for it to take a few minutes to complete the process.
Have 2 users who are trying to prove themselves human connect to each other via their webcams and ask them if the person they are seeing is human and live (i.e. not a recording), by getting them to, for example, mirror each other's movements, or write something on a piece of paper. Get everyone to do this a few times with different users, and throw a few recordings into the mix which they also have to identify correctly as such.
A popular method on forums is to simply queue the threads of members with less than 10 posts in a moderation queue. Of course, this doesn't help if you don't have moderators, or it's not a forum. A more general method is the calculation of hyperlink to text ratios. Often, spam posts contain a ton of hyperlinks, and you can catch a lot this way. In the same vein is comparing the content of consecutive posts. Simply do not allow consecutive posts that are extremely similar.
Of course, anyone with knowledge of the measures you take is going to be able to get around them. To be honest, there is little you can do if you are the target of a specific attack. Rather, you should focus on preventing more general, unskilled attacks.
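A rough sketch of the hyperlink-to-text ratio check from the answer above; the regex and the 0.5 threshold are illustrative, not tuned values:
import re

LINK_RE = re.compile(r"https?://\S+", re.IGNORECASE)

def link_density(post: str) -> float:
    link_chars = sum(len(link) for link in LINK_RE.findall(post))
    return link_chars / max(len(post), 1)

def probably_spam(post: str, threshold: float = 0.5) -> bool:
    return link_density(post) > threshold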
For human moderators it surely helps to be able to easily find and delete all posts from some IP, or all posts from some user if the bot is smart enough to use a registered account. Likewise the option to easily block IP addresses or accounts for some time, without further administration, will lessen the administrative burden for human moderators.
Using cookies to make bots and human spammers believe that their post is actually visible (while only they themselves see it) prevents them (or trolls) from changing techniques. Let the spammers and trolls see the other spam and troll messages.
Javascript evaluation techniques like this Invisible Captcha system require the browser to evaluate Javascript before the page submission will be accepted. It falls back nicely when the user doesn't have Javascript enabled by just displaying a conventional CAPTCHA test.
Animated CAPTCHAs - scrolling text - are still easy for humans to recognize, provided you make sure that no single frame offers something complete to recognize.
Multiple-choice question - "All it takes is a ______ and a smile." The idea here is that the user will have to choose/understand.
Session variable - checking that a variable you put into the session is part of the request. This will foil the dumb bots that simply generate requests, but probably not the bots that are modeled on a browser.
Math question - "2 + 5 =" - again, ask a question that is easy for a human to solve but blocks a bot's ability to generate a response (a sketch combining this with the session-variable idea follows this list).
Image grid - you create a grid of images, such as a 3x3 grid of animal pictures, and the user has to pick out, say, all the birds on the grid.
Hope this gives you some ideas for your new solution.
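A sketch combining the math-question and session-variable ideas from the list above; the session dict stands in for whatever session mechanism your framework provides:
import random

def new_challenge(session: dict) -> str:
    a, b = random.randint(1, 9), random.randint(1, 9)
    session["expected_answer"] = a + b     # kept server-side only, never in the form
    return f"What is {a} + {b}?"

def check_challenge(session: dict, submitted: str) -> bool:
    try:
        return int(submitted.strip()) == session.pop("expected_answer")
    except (ValueError, KeyError):
        return False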
A friend has the simplest anti-spam method, and it works.
He has a custom text box which says "please type in the number 4".
His blog is rather popular, but still not popular enough for bots to figure it out (yet).
Please remember to make your solution accessible to those not using conventional browsers. The iPhone crowd are not to be ignored, and those with vision and cognitive problems should not be excluded either.
Honeypots are one effective method. Phil Haack gives one good honeypot method that could be used, in principle, for any forum/blog/etc.
You could also write a crawler that follows spam links and analyzes their page to see if it's a genuine link or not. The most obvious would be pages with an exact copy of your content, but you could pick out other indicators.
Moderation and blacklisting, especially with plugins like these ones for WordPress (or whatever you're using, similar software is available for most platforms), will work in a low-volume environment. If your environment is a low volume one, don't underestimate the advantage this gives you. Personally deciding what is reasonable content and what isn't gives you ultimate flexibility in spam control, if you have the time.
Don't forget, as others have pointed out, that CAPTCHAs are not limited to text recognition from an image. Visual association, math problems, and other non-subjective questions relayed through an image also qualify.
Sblam is an interesting project.
Invisible form fields. Make a form field that doesn't appear on the screen to the user, using display: none as a CSS style so that it doesn't show up. For accessibility's sake, you could even put hidden text there so that people using screen readers would know not to fill it in. Bots almost always fill in all fields, so you could block any post that filled in the invisible field.
Block access based on a blacklist of spammers IP addresses.
Honeypot techniques put an invisible decoy form at the top of the page. Users don't see it and submit the correct form; bots submit the wrong form, which does nothing or gets their IP banned.
I've seen a few neat ideas along the lines of Asirra which ask you to identify which pictures are cats. I believe the idea originated from KittenAuth a while ago.
Use something like the Google Image Labeler with appropriately chosen images, such that a computer wouldn't be able to recognise the dominant features of them that a human could.
The user would be shown an image and would have to type words associated with it. They would keep being shown images until they have typed enough words that agreed with what previous users had typed for the same image. Some images would be new ones that they weren't being tested against, but were included to record what words are associated with them. Depending on your audience you could also possibly choose images that only they would recognise.
Mollom is supposedly good at stopping spam. Both personal (free) and professional versions are available.
I know some people mentioned ASIRRA, but if you go to all the "adopt me" links for the images, the linked page will say whether it's a cat or a dog. So it should be relatively easy for a bot to just follow all the "adopt me" links, and it's just a matter of time for that project.
Just verify the email address and let Google/Yahoo etc. worry about it.
You could get some device-ID software; the41 has some fraud-prevention software that can detect the hardware being used to access your site. I believe they use it to catch fraudsters, but it could be used to stop bots. Once you have identified a device being used by a bot, you can just block that device. Last time I checked, it could even trace your route through the phone network (not your geo-IP!), so you can even block a postcode if you want.
It's expensive though, so probably a better, cheaper solution that is a little less Big Brother would be preferable.
