How to Protect Against Unicode Security Vulnerabilities - security

"Five things everyone should know about Unicode" is a blog post showing how Unicode characters can be used as an attack vector for websites.
The main example given of such a real-world attack is a fake WhatsApp app submitted to the Google Play store using a Unicode non-printable space in the developer name, which made the name unique and allowed it to get past Google's filters. The Mongolian Vowel Separator (U+180E) is one such non-printable space character.
Another vulnerability is to use alternative Unicode characters that look similar. The Mimic tool shows how this can work.
An example I can think of is to protect usernames when registering a new user. You don't want two usernames to be the same or for them to look the same either.
How do you protect against this? Is there a list of these characters out there? Should it be common practice to strip all of these types of characters from all form inputs?

What you are talking about is called a homoglyph attack.
There is a "confusables" list by Unicode here (confusables.txt, part of Unicode Technical Standard #39), and also have a look at this. There should be libraries based on these or potentially other databases. One such library is this one that you can use in Java or JavaScript. The same must exist for other languages as well, or you can write one.
The important thing I think is to not have your own database - the library or service is easy to do on top of good data.
As for whether you should filter out similar looking usernames - I think it depends. If there is an interest for users to try and fake each other's usernames, maybe yes. For many other types of data, maybe there is no point in doing so. There is no generic best practice I think other than you should assess the risk in your application, with your datapoints.
Also, a different approach for a different problem: what may often work for Unicode input validation is the \w word character in a regular expression, if your regex engine is Unicode-ready. In such an engine, \w should match all Unicode classes of word characters, i.e. letters, modifiers and connectors in any language, but nothing else (no special characters). This does not protect against homoglyph attacks, but may protect against some injections while keeping your application Unicode-friendly.
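As a minimal sketch of that idea, assuming Python's built-in `re` module (where `\w` on a `str` pattern is Unicode-aware by default and also matches digits and the underscore), a whole-value check could look like this. The function name `is_word_only` is made up for illustration.

```python
import re

# In Python 3, \w in a str pattern is Unicode-aware by default:
# it matches letters, digits and the underscore in any script.
WORD_ONLY = re.compile(r"\w+")


def is_word_only(value: str) -> bool:
    """Return True if the whole value consists of Unicode word characters."""
    return WORD_ONLY.fullmatch(value) is not None


if __name__ == "__main__":
    for sample in ["élan", "элч", "a b", "Robert'); DROP TABLE users;--"]:
        print(repr(sample), is_word_only(sample))
```

Spaces, punctuation and invisible format characters such as U+180E are rejected, so a display name like "Whatsapp, Inc" would need a looser pattern than a login name.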

All sanitization works best when you have a whitelist of known safe values, and exclude all others.
ASCII is one such set of characters.
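A whitelist check along those lines might be sketched as follows. The exact set of allowed characters (ASCII letters, digits and three punctuation marks) is an assumption for illustration, not something this answer specifies.

```python
import string

# Assumed allowlist: ASCII letters, digits and a few separators.
ALLOWED = frozenset(string.ascii_letters + string.digits + "_-.")


def is_allowed_username(name: str) -> bool:
    """Accept a name only if it is non-empty and every character is whitelisted."""
    return bool(name) and all(ch in ALLOWED for ch in name)


if __name__ == "__main__":
    print(is_allowed_username("whatsapp_inc"))    # plain ASCII name
    print(is_allowed_username("WhatsApp\u180e"))  # trailing Mongolian Vowel Separator
```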

This could be approached in various ways; however, each one might increase the number of false positives and annoy legitimate users. Also, none of them will work in 100% of cases (even if combined). They just add an extra layer.
One approach would be to keep tables of characters that look similar and check whether a duplicate name already exists. What 'looks similar' means is subjective in many cases, so building such a list might be tricky. This method might produce false positives on certain occasions.
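The look-alike table idea could be sketched roughly like this. The `CONFUSABLES` mapping below is a tiny hand-made table purely for illustration; a real system would load a maintained data set instead. The helper names `skeleton` and `looks_taken` are hypothetical.

```python
import unicodedata
from typing import Iterable

# Hypothetical, deliberately tiny look-alike table (illustration only).
CONFUSABLES = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u0131": "i",  # LATIN SMALL LETTER DOTLESS I
    "0": "o",
    "1": "l",
}


def skeleton(name: str) -> str:
    """Reduce a name to a comparison key; never show this key to users."""
    # NFKC folds compatibility forms (e.g. fullwidth letters) first.
    folded = unicodedata.normalize("NFKC", name).casefold()
    # Drop invisible format characters such as U+180E or zero-width joiners.
    visible = (ch for ch in folded if unicodedata.category(ch) != "Cf")
    return "".join(CONFUSABLES.get(ch, ch) for ch in visible)


def looks_taken(candidate: str, existing: Iterable[str]) -> bool:
    """True if the candidate collides with an existing name after folding."""
    taken = {skeleton(name) for name in existing}
    return skeleton(candidate) in taken
```

Mapping "0" to "o" and "1" to "l" shows where the false positives come from: two honest names that differ only in those characters would collide too.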
Also, reversing the order of certain letters might trick many users. Checking for anagrams or very similar names can be achieved using algorithms like Jaro-Winkler and Levenshtein distance (i.e., checking if a similar username/company name already exists). Sometimes however, this might be due to a different spelling of some word in some region (e.g., 'centre' vs 'center'), or the name of some company might deliberately contain an anagram. This approach might further increase the number of false positives.
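For the edit-distance part, a plain Levenshtein implementation is short enough to sketch here. The `too_similar` helper and its threshold of 1 are assumptions for illustration; the right threshold depends on how many false positives you can tolerate.

```python
from typing import Iterable


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def too_similar(candidate: str, existing: Iterable[str], max_distance: int = 1) -> bool:
    """Flag a candidate that is within max_distance edits of an existing name."""
    return any(
        levenshtein(candidate.lower(), name.lower()) <= max_distance
        for name in existing
    )
```

In plain Levenshtein a swap of two adjacent letters costs 2, so 'centre' and 'center' are at distance 2 and would not be flagged with a threshold of 1.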
Furthermore, as Jonathan mentioned, sanitisation is also a good approach, however it might not protect against anagrams and cause issues to legitimate users who want to use some special character.
As the OP also mentioned, special characters can also be stripped. Other parts of the name might also need to be stripped, for example common names like 'Inc.', '.com' etc.
Finally, the name can be restricted to characters from a single language rather than a mixture of several (a more relaxed version would forbid mixing within a single word but allow it between words separated by a space). Requiring a capital first letter and lower case for the rest can further improve this approach, as certain lower-case letters (like 'l') may look like upper-case ones (like 'I') in certain fonts. Excluding certain symbols (like '|') will enhance this approach further. This solution will annoy the users who can no longer use certain names.
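A rough single-script check can be sketched with the standard library alone. Python's `unicodedata` module does not expose the Unicode Script property, so this sketch approximates it by taking the first word of each character's Unicode name (for example "CYRILLIC"); that is a heuristic and an assumption of this example, not a proper script lookup.

```python
import unicodedata


def scripts_used(name: str) -> set:
    """Approximate the set of scripts in a name from Unicode character names."""
    scripts = set()
    for ch in name:
        if ch.isalpha():
            # e.g. "CYRILLIC SMALL LETTER A" -> "CYRILLIC"
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def is_single_script(name: str) -> bool:
    """True if all letters in the name appear to come from one script."""
    return len(scripts_used(name)) <= 1


def is_single_script_per_word(name: str) -> bool:
    """Relaxed variant: scripts may differ between words, not inside one."""
    return all(is_single_script(word) for word in name.split())
```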
A combination of some/all aforementioned approaches can also be used. The selection of the methods and how exactly they will be applied (e.g., you may choose to forbid similar names, or to require moderator approval in case a name is similar, or to not take any action, but simply warn a moderator/administrator) depends on the scenario that you are trying to solve.

I may have an innovative solution to this problem regarding usernames. Obviously, you want to allow ASCII characters, but in some special cases, other characters will be used (different language, as you said).
I think an intuitive way to allow both ASCII and other characters in a username, while being protected against "Unicode vulnerabilities", would be something like this:
Allow all ASCII characters and disallow other characters, except when there are x or more of these special characters in the username (i.e. the username is in another language).
Take for example this:
Whatsapp, Inc + (U+180E) - Not allowed, only has 1 special character.
элч + (U+180E) - Allowed! It has more than x special characters (for example, 3). It can use the Mongolian separator since it's Mongolian.
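The rule and the two examples above can be sketched as follows, treating every non-ASCII code point as a "special character" and using x = 3. Both choices are assumptions of this sketch.

```python
def passes_threshold_rule(name: str, min_non_ascii: int = 3) -> bool:
    """Allow pure-ASCII names, or names with at least min_non_ascii non-ASCII characters."""
    non_ascii = sum(1 for ch in name if ord(ch) > 127)
    return non_ascii == 0 or non_ascii >= min_non_ascii


if __name__ == "__main__":
    print(passes_threshold_rule("Whatsapp, Inc\u180e"))  # one stray special character
    print(passes_threshold_rule("элч\u180e"))            # mostly non-ASCII name
```

Note that a name padded with three invisible characters would also pass, so in practice the count would probably have to consider only visible letters.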
Obviously, this does not protect you 100% from these types of vulnerabilities, but it is a method I have used effectively, ESPECIALLY if you do not mention the existence of this algorithm on the "login" or "register" page. Attackers might still work out that something is protecting the website from these types of attacks, but without a description they cannot easily reverse engineer it and find a way to bypass it.
Sorry if this is not an answer you are looking for, just sharing my ideas.
Edit: Or you can use an RNN (Recurrent Neural Network) to detect the language and allow only the characters of that language.

Related

Should whitespace characters be allowed in a password?

I've tried different sites/products and this seems to be split fairly evenly. Windows 7 and Gmail allow you to insert spaces in your password. Hotmail and Twitter do not.
While allowing spaces in a password increases the complexity of a password, it seems like many sites/programs do not allow them. Is there a good reason to allow/disallow spaces?
This SuperUser question might be relevant.
I think that your observation is accurate: many web-based systems accept only alphanumerics and a subset of symbolic characters (say, 0-9A-Za-z/_-!), but I think that this is simply historical convention. It may also be that programmers are used to the <space> character delimiting fields, rather than being found inside them.
There's also the issue of visibility: if you allow multiple consecutive spaces in a password, can the user easily count them? Might a system even collapse them into one (as unaided HTML would)? Can even a single space character be easily and quickly identified?
However, plenty of other types of systems do allow spaces in passwords. I'd probably still stray from them simply to help prevent user confusion (if people are indeed used to spaces in passwords being invalid, a password with a space in may be confusing to many), but there doesn't seem to be any technical reason not to allow them.
The main problem I see would be usability, e.g. trailing spaces. Also, if you start allowing non-visible characters like the space, you might also start allowing all sorts of other non-visible characters like tabs and so on. IMHO the disadvantages outweigh the benefits. To make a password really secure, just increase the length, allow some special characters, numbers and letters, and make it case-sensitive. With, e.g., more than 20 characters, that's practically unbreakable at this stage (at least in terms of being worth the effort).
Here is a quick way to test password strength--use google's own account password API:
https://www.google.com/accounts/RatePassword?Passwd=mypwd
Per your question about whitespace, I have entered a simple password with two characters and one whitespace "t t" . Google gave the password a rating of 3 out of 4. If I do the same password, but remove the whitespace "tt" the rating received is 1 out of 4. By Google's rating standard, including whitespace improves the quality/strength of a password.

I am building a search engine. How do I remove duplicates from search results?

When I search for something, I get content that has the same text and title.
Of course, there is always an original (where others copy/leech from)
If you have expertise in search and crawling... how do you recommend that I remove these duplicates (in a feasible and efficient manner)?
Sounds like a programming question to me.
If you have a clear idea about what the stolen and original components of these pages are, and those differences are general enough that you can write a filter to separate them, then do that, hash the 'stolen' content, and then you should be able to compare hashes to determine if two pages are the same.
I guess web-page thieves might go to some further code-obfuscation to mess you up, including changing whitespace, so you might want to normalise the html before hashing, for instance removing any redundant whitespace, making all attributes use " quotes etc.
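A minimal sketch of normalise-then-hash, assuming that it is acceptable to throw the markup away entirely and compare only the visible text. That is a cruder normalisation than the attribute-by-attribute clean-up suggested above, and the regex-based tag stripping is not a real HTML parser.

```python
import hashlib
import re


def content_fingerprint(html: str) -> str:
    """Hash the visible text of a page after crude normalisation."""
    text = re.sub(r"<[^>]+>", " ", html)              # crude tag removal
    text = re.sub(r"\s+", " ", text).strip().lower()  # collapse whitespace, ignore case
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    original = "<p>Hello   world</p>"
    copy = "<div>hello\nworld</div>"
    print(content_fingerprint(original) == content_fingerprint(copy))
```

Exact hashes only catch copies that are identical after normalisation; a single changed word yields a completely different digest.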
Here's a technique based on simhash.
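For readers who have not met simhash, here is a minimal sketch of the general idea, not necessarily the exact technique from the link: each token votes on every bit of a fixed-width fingerprint, and near-duplicate documents end up with fingerprints that differ in only a few bits. Whitespace tokenisation, equal token weights and MD5 as the per-token hash are all assumptions of this sketch.

```python
import hashlib


def simhash(text: str, bits: int = 64) -> int:
    """Compute a simple simhash fingerprint over whitespace-separated tokens."""
    weights = [0] * bits
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[:8], "big")
        for i in range(bits):
            # Each token votes +1 or -1 on every bit position.
            weights[i] += 1 if (token_hash >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if weights[i] > 0)


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")
```

Two pages would then be treated as duplicates when `hamming(simhash(a), simhash(b))` falls under a small threshold that you tune on your own data.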
Here's one that uses stopwords to work around ads.
Have you tried looking at the origin date of the site? After comparing a value of word strings to verify duplication, whitelist the one that is earlier.

what is the best UX for non-programmer users? comma-separated tags or space-separated tags?

I'm creating a social site for teachers (non-programmers) on which teachers can add events, links, exercises, tips, lesson plans, books, etc.
Each of these items I want them to be able to add tags to as we do at StackOverflow.
However, because they are non-programming users, I thought that space-separated, nonspace tags and camelCase tags would lead to too much confusion, e.g.:
grammar teachingtips universityOfMinnesota phrasalverbs
and indeed on this similar stackoverflow question most of the answers suggested commas like this:
grammar, teaching tips, university of minnesota, phrasal verbs
but then I just signed up for a delicious.com account (which I don't think has a very programmer-centric audience) and saw that they use spaces as well:
separate tags with spaces: e.g. hotels bargains newyork (not new york)
What has been your experience on this point in terms of the current UX trend for tags? Is the average Internet user accustomed to space-separated tags by now? I have to admit, I have never seen comma-separated tags on any major site I have used. Have you come upon a good way to combine them so it doesn't even matter, e.g.:
grammar book reviews teaching tips
and e.g. have a quick algorithm which checks the number of current tags for:
grammar
grammar book
grammar book reviews
book
book reviews
book reviews teaching
...
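The candidate list above is the set of contiguous word runs in the input. A small sketch that generates them, assuming a cap of three words per tag as in the list (the function name and the cap are illustrative):

```python
def candidate_phrases(text: str, max_words: int = 3) -> list:
    """List every run of 1..max_words consecutive words, in reading order."""
    words = text.split()
    return [
        " ".join(words[start:start + size])
        for start in range(len(words))
        for size in range(1, max_words + 1)
        if start + size <= len(words)
    ]


if __name__ == "__main__":
    for phrase in candidate_phrases("grammar book reviews teaching tips"):
        print(phrase)
```

Each candidate would then be looked up against the existing tag counts.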
I'd go comma separated personally. You'll note that Stackoverflow doesn't but the tags are clearly delineated into their own boxes. Plus hyphens are often used for "spacing". I'd say spaces are more natural to non-programmers than hyphens are however.
Comma separated seems the most natural - it's what English uses to punctuate lists. It also allows you to have spaces in tags if you want. People will try to enter
this, that, the other
and expect it to work.
I can't think of a good reason to use spaces.
Notice that delicious has to give an example to demonstrate how to do it their way. That's not a good sign.
If you do go with commas, take care to see how easy it is for a "space user" to see that they made a mistake, and to fix it.
I would go with comma-separated tags, if only to save your users the pain of having to use quotes to indicate a tag has a space in it, i.e. website "stack overflow" tips, or website, stack overflow, tips. I know which I'd prefer.
Comma-separated is the way to go for your educational audience. It's simply intuitive.
Most teachers should have no trouble understanding a system where tags are comma separated, and there is no need to come up with an awkward workaround for phrases.
It depends a little on how the tags are entered. If the user gets suggestions for tags as they type like SO provides (shades of intellisense), space separated is probably fine. However, if you are going to force the user to enter each tag without a reference list it may be easier to accept case-insensitive comma (or semicolon) delimited tags.
You don't want to check all those possibilities unless you are going to severely limit the number of possible tags: n words can be split into tags in 2^(n-1) different ways, so the work grows exponentially, and you most likely don't want to have that extra load on your server.
Your best bet is probably just to stick with one option - the users will (should!) get used to it fairly quickly. Spaces as separators are probably the most common, so I would go with that, since it is the one the users are most likely to have had prior exposure to.
As long as what the software accepts/demands is clear, I think users will be happy with either. Confusion comes when they don't know whether to use commas, semicolons, spaces or...
If you use a number of e-mail clients you'll know how useful a simple tool-tip reminder of whether it's commas or spaces would be when entering multiple recipients.
When tagging, how you set it up depends on what kinds of things you will tag. Media that is hard to index, like pictures, audio, or video, should encourage many and varied tags, because the tags are how you will search the content.
Easily indexed content (text!) should use a very rigid tagging structure, because you don't need to rely on tags for search indexing. Instead, the purpose of tags is to sort the content into well-defined categories. Tags should be more like labels or folders.
I'm gonna take a guess here that this content will be mostly text-based, with the occasional picture or video file thrown in. So you don't want either comma or space separated tag entry, but rather some mechanism that forces users to pick from an existing set of tags.
I would assume space separated tags unless there are one or more commas, in which case you should split on commas instead. In other words, support both but in a limited way. You can probably guess right 90+ percent of the time.
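That "commas win if present" rule is easy to sketch. Lower-casing and de-duplicating the tags are extra assumptions here, in line with the case-insensitive suggestion made earlier in this thread.

```python
def parse_tags(raw: str) -> list:
    """Split on commas if the input contains any, otherwise on whitespace."""
    separator = "," if "," in raw else None  # None means "any run of whitespace"
    seen = set()
    tags = []
    for part in raw.split(separator):
        tag = " ".join(part.split()).lower()  # trim and collapse inner spaces
        if tag and tag not in seen:
            seen.add(tag)
            tags.append(tag)
    return tags


if __name__ == "__main__":
    print(parse_tags("grammar, teaching tips, university of minnesota"))
    print(parse_tags("hotels bargains newyork"))
```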

Are user names ever case sensitive?

I'm looking at some code that converts user names to lower case, before storing them. I'm 90% sure this is ok, but are there systems out there that actually require case sensitivity on the user names (specifically in the health industry)?
Note: my particular code is not at the point of entry. We are taking user names from other systems. The worry I have is depending on those systems (which may or may not be under our control) to consistently pass us usernames in the same case as each other (when describing the same user).
Also of note - the code is:
userName.toLowerCase(Locale.ENGLISH)
Are all user names in English? Is this just so it matches collation in the database? Note that (in Java at least) String.toLowerCase() is defined as String.toLowerCase(Locale.getDefault())
Unix logins are case sensitive...
Are there any other systems that do this?
toLowerCase has only one reason for it to accept a locale:
In most languages the small letter i has a dot, so the capital letter I is lowercased to an i with a dot.
But Turkish also has a capital letter İ with a dot above, and that is what gets lowercased to the dotted small i.
The "regular" capital I is lowercased in Turkish to a small ı, without a dot.
So, unless your Turkish usernames are all called IiI1I1iiII, I would hardly worry about this.
Apart from Turkish (and Azerbaijani, which shares the dotted/dotless i rule), practically every locale gives the same toLowerCase result, so you could choose Locale.ENGLISH or Locale.GERMAN or whatever; just make sure you do not pick Turkish.
See the Javadoc for more detailed information.
Edit: thanks to utku karatas I could copy/paste the correct glyphs into this post.
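To make the difference concrete outside Java: Python's `str.lower()` takes no locale at all, so the sketch below emulates the Turkish rule with an explicit translation table. The helper names are hypothetical and the table covers only the two letters discussed above.

```python
# Emulate the Turkish rule explicitly: I -> dotless ı, dotted İ -> i.
TURKISH_LOWER = str.maketrans({"I": "\u0131", "\u0130": "i"})


def lower_turkish(text: str) -> str:
    """Lowercase the way a Turkish locale would treat I and İ."""
    return text.translate(TURKISH_LOWER).lower()


def lower_invariant(text: str) -> str:
    """Locale-independent lowercasing, as Python's str.lower() does."""
    return text.lower()


if __name__ == "__main__":
    print(lower_invariant("TITLE"), lower_turkish("TITLE"))
```

The same stored user name therefore lowercases to two different strings depending on the rule, which is exactly why the locale should be fixed explicitly rather than left to the platform default.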
Using case sensitive username/passwords is an easy way to increase security, so the question is, how much do you care about security vs usability. Just keep in mind that the way you're looking at solving the case insensitivity may have some localization problems, but if you don't care then don't worry about it.
Lowercasing the user name using the English locale is bound to cause you problems. I would suggest lowercasing using the invariant culture.
It depends on context, but in the Informix dialect of SQL, there are 'owners' (basically equivalent to a schema in standard SQL), and how you write the owner name matters.
SELECT *
FROM someone.sometable, "someone".sometable,
SOMEONE.sometable, "SOMEONE".sometable
The two quoted names are definitely different; the two unquoted names are mapped to the same name, which (depending on database mode) could be either of the other two. There is some code around which does case-conversion on the (unquoted) names. Fortunately, most of the time you don't need to specify the name, and when you do you write the name without quotes and it all works; or you write the name with quotes and are consistent and it all works. Occasionally, though, people like me have to really understand the details to get programs to work sanely despite all the hoops.
Also, (as Stephen noted) Unix logins are case-sensitive, and always have been. I believe Windows logins are mostly case-insensitive - but I don't experiment with that (there are too many ways to get screwed up on Windows without adding that sort of trickery to the game).
If you really want to confuse someone on Unix, give them a numeric user name (e.g. 123) but give them a different UID (e.g. 234).
Kerberos, which can be used in Windows environments too, has case sensitivity problems. You can configure it in a certain way to ensure that case sensitivity issues will not arise, but it can go the other way too.
If your only goal is differentiating one user from another, it seems logical that you would want more than case to be a factor.
I have never encountered a system that enforced case-sensitivity on usernames (nor would I want to).
Most likely the code forces them lowercase at the point of entry as an attempt to prevent case-sensitivity problems later.

Password complexity strategies - any evidence for them?

On more than one occasion I've been asked to implement rules for password selection for software I'm developing. Typical suggestions include things like:
Passwords must be at least N characters long;
Passwords must include lowercase, uppercase and numbers;
No reuse of the last M passwords (or passwords used within P days).
And so on.
Something has always bugged me about putting any restrictions on passwords though - by restricting the available passwords, you reduce the size of the space of all allowable passwords. Doesn't this make passwords easier to guess?
Equally, by making users create complex, frequently-changing passwords, the temptation to write them down increases, also reducing security.
Is there any quantitative evidence that password restriction rules make systems more secure?
If there is, what are the 'most secure' password restriction strategies to use?
Edit Ólafur Waage has kindly pointed out a Coding Horror article on dictionary attacks which has a lot of useful analysis in it, but it strikes me that dictionary attacks can be massively reduced (as Jeff suggests) by simply adding a delay following a failed authentication attempt.
With this in mind, what evidence is there that forced-complex passwords are more secure?
"Something has always bugged me about putting any restrictions on passwords though - by restricting the available passwords, you reduce the size of the space of all allowable passwords. Doesn't this make passwords easier to guess?"
In theory, yes. In practice, the "weak" passwords you disallow represent a tiny subset of all possible passwords that is disproportionately often chosen when there are no restrictions, and which attackers know to attack first.
"Equally, by making users create complex, frequently-changing passwords, the temptation to write them down increases, also reducing security."
Correct. Forcing users to change passwords every month is a very, very bad idea, except perhaps in extreme high-security environments where everyone really understands the need for security.
Those kinds of rules definitely help because they stop careless users from choosing passwords like "mypassword", which unfortunately happens quite often.
So actually, you are forcing the users into an extremely large set of potential passwords. It doesn't matter that you are excluding the set of all passwords with only lowercase letters, because the remaining set is still orders of magnitude larger.
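The size argument can be checked with a little arithmetic. This sketch assumes 8-character passwords drawn from the 62 ASCII letters and digits, and counts by inclusion-exclusion how many of them contain at least one lowercase letter, one uppercase letter and one digit.

```python
def count_compliant(length: int) -> int:
    """Count letter/digit passwords containing a lowercase, an uppercase and a digit."""
    total = 62 ** length
    # Subtract passwords missing one class (lowercase, uppercase, digit)...
    missing_one = 36 ** length + 36 ** length + 52 ** length
    # ...and add back those missing two classes, which were subtracted twice.
    missing_two = 10 ** length + 26 ** length + 26 ** length
    return total - missing_one + missing_two


if __name__ == "__main__":
    lowercase_only = 26 ** 8
    compliant = count_compliant(8)
    print(f"lowercase only: {lowercase_only:,}")
    print(f"compliant:      {compliant:,}")
    print(f"ratio:          {compliant / lowercase_only:,.0f}x")
```

Under these assumptions the rule-compliant set is several hundred times larger than the all-lowercase set it excludes.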
BUT my big pet peeves are password restrictions I've encountered on major sites, like
No special characters
Maximum length
Why would anyone do this? W.H.Y.????
A nice read up on this is Jeff's article on Dictionary Attacks.
1. Never prevent the user from doing what they really want, unless there is a technical limitation from doing so.
2. You may nag the hell out of the user for doing stupid things like using a dictionary word or a 3-character password, or only using numbers, but see #1 above.
3. There is no good technical reason to require only alphanumerics, or at least one capital letter, or at least one number; see #1 above.
I forget which website had this advice regarding passwords: "Pick a password that is very easy for you to remember, but very hard for someone else to guess." But then they proceeded to require at least one capital letter and one number.
The problem with passwords is that they are so ubiquitous that it is essentially impossible for any person without a photographic memory to actually remember them without writing them down, and therefore leaving a serious security hole should someone gain access to this list of written-down passwords.
The only way I am able to manage this for myself is to split most of my passwords -- and I just checked my list, I'm up to 130 so far! -- into two parts, one which is the same in all cases, and the other which is unique but simple. (I break this rule for sites requiring high-security like bank accounts.)
The trouble with requiring "complexity", defined as multiple types of characters all being present, is that it forces people into a disparate set of conventions for different sites, which makes it harder to remember the password in question.
The only reason I will acknowledge for sites limiting the set of allowable password characters, is that it needs to be typeable on a keyboard. If you have to assume the account needs to be accessed from multiple countries, then keyboards may not always support the same characters on the user's home keyboard.
One of these days I'll have to make a blog posting on the subject. :(
My old limit theorem:
As the security of the password approaches adequate, the probability that it will be on a sticky note attached to the computer or monitor approaches one.
One also might point out the recent fiasco over at twitter where one of their admin's password turned out to be "happiness", which fell to a dictionary attack.
For questions like this, I ask myself what Bruce Schneier would do - the linked article is about how to choose passwords which are hard to guess with typical attacks.
Also note that if you add a delay after a failed attempt, you might also want to add a delay after a successful attempt; otherwise the delay is simply a signal that the attack has failed and another attempt should be launched.
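One way to sketch such a delay is to compute it from the number of recent failures for the account and apply it before answering the next attempt, whatever its outcome. The exponential growth, the one-second base and the 60-second cap are assumptions of this sketch, not something the discussion prescribes.

```python
def delay_before_response(recent_failures: int,
                          base_seconds: float = 1.0,
                          cap_seconds: float = 60.0) -> float:
    """Seconds to wait before answering the next login attempt."""
    if recent_failures <= 0:
        return 0.0
    # Clamp the exponent so very large failure counts cannot overflow.
    exponent = min(recent_failures - 1, 30)
    return min(base_seconds * 2 ** exponent, cap_seconds)


if __name__ == "__main__":
    for failures in range(0, 8):
        print(failures, delay_before_response(failures))
```

A login handler would sleep for this long before checking the credentials, so a successful and a failed response take the same time.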
Whilst this does not directly answer your question, I personally find that the most aggravating rule I have encountered is one whereby you could not reuse any password previously used. After working at the same place for a number of years, and having to change my password every 2-3 months, being able to use a password I chose over a year ago would not seem particularly unsafe or insecure. If I have used "safe" passwords in the past (alphanumeric with changes in case), surely reusing them after a period of, say, a year or two (depending on how regularly you have to change your password) would be acceptable. It also means I am less likely to use "easier" passwords, which might happen if I can't think of anything easy to remember and difficult to guess!
First let me say that details such as minimum length, case sensitivity and required special characters should depend on who has access and what the password allows them to do. If it's a code to launch a nuclear missile, it should be more strict than a password to log in to play your paid online edition of Angry Birds.
But I've got a SPECIFIC beef with case sensitivity.
For starters, users hate it. The human brain thinks "A=a". Of course, developers' brains aren't usually typical. ;-) But developers are also inconvenienced by case sensitivity.
Second, the CapsLock key is too easy to hit by mistake. It's right between Tab and Shift keys, but it SHOULD be up above the Esc key. Its location was established long ago in the days of typewriters, which had no alternate font available. In those days it was useful to have it there.
All passwords have risk... You're balancing risk with ease-of-use, and yes, usability matters.
MY ARGUMENT:
Yes, case sensitivity is more secure for a given password length. But unless someone is making me do otherwise, I opt for a longer minimum password length. Even if we assume only letters and digits are allowed, each added character multiplies the number of possible passwords by 36.
Someone who's less lazy than me with math could tell you the difference in number of combinations between, say a minimum 8-character case-sensitive password, and a 12-character case-insensitive password. I think most users would prefer the latter.
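For the less lazy, the comparison is a one-liner each way. This sketch assumes exactly 8 and exactly 12 characters, drawn from 62 symbols (case-sensitive letters and digits) and 36 symbols (case-insensitive letters and digits) respectively.

```python
import math

case_sensitive_8 = 62 ** 8      # 8 characters, upper + lower + digits
case_insensitive_12 = 36 ** 12  # 12 characters, letters + digits, case ignored

if __name__ == "__main__":
    print(f"8 chars, case-sensitive:    {case_sensitive_8:,}")
    print(f"12 chars, case-insensitive: {case_insensitive_12:,}")
    print(f"ratio: {case_insensitive_12 / case_sensitive_8:,.0f}x")
    print(f"bits:  {math.log2(case_sensitive_8):.1f} vs {math.log2(case_insensitive_12):.1f}")
```

Under these assumptions the longer case-insensitive password has the larger search space by a factor in the tens of thousands.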
Also, not all apps expose usernames to others, so there are potentially two fields the hacker may have to find.
I also prefer to allow spaces in passwords as long as the majority of the password isn't spaces.
In the project I'm developing now, my management screen allows the administrator to change password requirements, which apply to all future passwords. He can also force all users to update passwords (to new requirements) at any time after next logon. I do this because I feel my stuff doesn't need case-sensitivity, but the administrator (who probably paid me for the software) may disagree so I let that person decide.
The PIN for my bank card is only four digits. Since it's only numbers it's not case sensitive. And heck, it's my MONEY! If you consider nothing else, this sounds pretty insecure, were it not for the fact that the hacker has to steal my card to get my money. (And have his photo taken.)
One other beef: Developers who come onto StackOverflow and regurgitate hard-and-fast rules that they read in an article somewhere. "Never hard code anything." (As if that's possible.) "All queries must be parameterized" (not if the user doesn't contribute to the query), etc.
Please excuse the rant. ;-) I promise I respect disagreement.
Personally, for this particular problem I tend to give passwords a 'score' based on characteristics of the entered text, and refuse passwords that don't meet the score.
For example:
Contains Lower Case Letter +1
Contains different Lower Case Letter +1
Contains Upper Case Letter +1
Contains different Upper Case Letter +1
Contains Non-Alphanumeric character: +1
Contains different Non-Alphanumeric character: +1
Contains Number: +1
Contains Non Consecutive or repeated Second Number: +1
Length less than 8: -10
Length Greater than 12: +1
Contains Dictionary word: -4
Then I only allow passwords with a score greater than 4, and give the user feedback as they create their password via JavaScript.
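A server-side sketch of that scoring scheme might look like the following. Two things are assumptions of this sketch: the "second number" rule is simplified to "a second, different digit", and the `DICTIONARY` set is a tiny placeholder for a real word list.

```python
# Placeholder word list for illustration; a real check would use a large dictionary.
DICTIONARY = {"password", "happiness", "letmein"}


def password_score(password: str) -> int:
    """Score a password using the point table described above (simplified)."""
    lowers = {c for c in password if c.islower()}
    uppers = {c for c in password if c.isupper()}
    digits = {c for c in password if c.isdigit()}
    symbols = {c for c in password if not c.isalnum()}

    score = 0
    for group in (lowers, uppers, symbols, digits):
        # +1 for the first character of a class, +1 for a second distinct one.
        score += min(len(group), 2)
    if len(password) < 8:
        score -= 10
    if len(password) > 12:
        score += 1
    if any(word in password.lower() for word in DICTIONARY):
        score -= 4
    return score


def is_acceptable(password: str) -> bool:
    """Accept only passwords scoring above 4."""
    return password_score(password) > 4
```

The same function can back the live feedback in the browser if the rules are mirrored in JavaScript, with the server-side check remaining the authoritative one.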
