Defense against approximate string matching

How can I alter a string so that variations of approximate string matching can't match it with the original?
I made an IRC bot which runs a game based on the logfile of the channel. It prints quotes from the logs, and players collect points by guessing "who said it". The channel is rather geeky, and it took no more than 30 minutes for one of the players to build a bot which wins the game every time. I realize manual cheating is also easy and impossible to defend against, but consider this a competition between automated bots. I want to update my bot so that no fully automated bot will be able to play the game :)
I've considered randomly deleting a character from the quotes, but agrep would still be able to match the string. I've considered replacing some of the characters by similar-looking alternate characters, but that would be trivial to reverse-engineer. I'm looking for ideas that will be harder to break.
Example line:
[14:15] <baobot> [QUOTE 13/15] Who famously declared "minulla ainakin paperin tekemisessä 1% ajasta menee algon suunnitteluun ja 99% menee paperin kirjoittamiseen" (roughly: "for me at least, 1% of the time spent on a paper goes into designing the algo and 99% into writing the paper")?

Print your quote as ASCII art.
Use something similar to the command-line tools figlet or toilet (explanation).
Here is a quick example: something like string2ascii-generator.
To get you started, you might want to copy the source code from figlet.
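If your bot is in Python, a minimal sketch of the idea, assuming the third-party pyfiglet package (a Python port of figlet):

    # Render a quote as ASCII art before sending it to the channel.
    # Assumes the third-party pyfiglet package: pip install pyfiglet
    import pyfiglet

    def to_ascii_art(quote):
        # figlet_format returns one multi-line banner string
        return pyfiglet.figlet_format(quote, font="standard")

    for line in to_ascii_art("who said it?").splitlines():
        print(line)  # an IRC bot would send each line as a separate message

An OCR-equipped bot could still undo this, but it raises the bar considerably.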

Anything that can be scrambled can most likely be unscrambled. Still, here are some suggestions for your experiment:
Humans can read words if the first and last letter are in place and the inner portion is scrambled.
You can also do substitution, such as leet speak, to replace some characters with numbers.
You might be able to find characters in other languages that look similar to the letters used, which means you can randomly substitute them as well.
You can also randomize the positions of the spaces: remove them from their original positions and move them around, or remove them completely.
Reverse some words.
Find ways to phoneticize words... in English, "ph" sounds like "f", so you can find and replace some of them.
Try a combination of the things above: remove all spaces, CaMEl CaSE words, then do character substitutions, etc.
Overall, there are lots of ways to make it harder, but if you follow a set pattern every time, it will be easier to program something to undo it. If you randomly do different things, so one input can yield several different outputs, it will be harder for someone to write a program to reverse the process.
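A sketch combining several of these suggestions; since every step is randomized, one input yields many different outputs:

    # Randomized scrambling: shuffle inner letters, apply leet substitutions
    # with 50% probability per character, and occasionally glue words together.
    # The substitution table is illustrative, not exhaustive.
    import random

    LEET = {"a": "4", "e": "3", "i": "1", "o": "0", "s": "5"}

    def scramble_word(word):
        if len(word) > 3:
            inner = list(word[1:-1])
            random.shuffle(inner)  # first and last letters stay readable
            word = word[0] + "".join(inner) + word[-1]
        return "".join(LEET[c] if c in LEET and random.random() < 0.5 else c
                       for c in word)

    def scramble(quote):
        words = [scramble_word(w) for w in quote.split()]
        out = words[0]
        for w in words[1:]:
            out += ("" if random.random() < 0.2 else " ") + w  # jitter spaces
        return out

    print(scramble("99% menee paperin kirjoittamiseen"))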

Use Google Translate.
For example, I ran your quote through Russian, then English, and back to Finnish, and got
Minulla on ainakin 1% ajasta kirjassa otetaan suunnittelussa Algon ja 99% menee kirjoituspaperia
I have no idea whether it is correct Finnish; as far as I can tell it is still somewhat recognizable. If you think it is too recognizable for an approximate search, do more intermediate translations.
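For instance, a sketch of the round trip, assuming the unofficial googletrans package; its API has changed between releases, so treat the calls as illustrative rather than definitive:

    # Round-trip a quote through intermediate languages to blur its wording.
    # Assumes the unofficial googletrans package: pip install googletrans
    from googletrans import Translator

    def round_trip(text, hops=("ru", "en"), final="fi"):
        translator = Translator()
        for lang in hops:
            text = translator.translate(text, dest=lang).text
        return translator.translate(text, dest=final).text

    print(round_trip("1% menee algon suunnitteluun ja 99% paperin kirjoittamiseen"))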

Related

How to Protect Against Unicode Security Vulnerabilities

"Five things everyone should know about Unicode" is a blog post showing how Unicode characters can be used as an attack vector for websites.
The main example given of such a real-world attack is a fake WhatsApp app submitted to the Google Play store using a Unicode non-printable space in the developer name, which made the name unique and allowed it to get past Google's filters. The Mongolian Vowel Separator (U+180E) is one such non-printable space character.
Another vulnerability is to use alternative Unicode characters that look similar. The Mimic tool shows how this can work.
An example I can think of is to protect usernames when registering a new user. You don't want two usernames to be the same or for them to look the same either.
How do you protect against this? Is there a list of these characters out there? Should it be common practice to strip all of these types of characters from all form inputs?
What you are talking about is called a homoglyph attack.
There is a "confusables" list by Unicode here, and also have a look at this. There should be libraries based on these or pontentially other databases. One such library is this one that you can use in Java or Javascript. The same must exist for other languages as well, or you can write one.
The important thing, I think, is not to maintain your own database: the library or service part is easy to build on top of good data.
As for whether you should filter out similar-looking usernames, I think it depends. If users have an incentive to impersonate each other's usernames, maybe yes. For many other types of data there may be no point. There is no generic best practice, I think, other than assessing the risk in your application, with your data points.
Also a different approach for a different problem, but what often works for Unicode input validation is the \w word-character class in a regular expression, if your regex engine is Unicode-ready. In such an engine, \w should match all Unicode classes of word characters, i.e. letters, modifiers, and connectors in any language, but nothing else (no special characters). This does not protect against homoglyph attacks, but may protect against some injections while keeping your application Unicode-friendly.
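For example, in Python 3 the re module is Unicode-aware by default, so \w accepts letters from any script while rejecting punctuation. A quick illustration; note that \w also allows digits and the underscore:

    import re

    def word_chars_only(s):
        # \w matches Unicode letters, digits, and the underscore in Python 3
        return re.fullmatch(r"\w+", s) is not None

    print(word_chars_only("élan"))   # True: accented Latin letters
    print(word_chars_only("мир"))    # True: Cyrillic letters
    print(word_chars_only("a|b"))    # False: '|' is not a word character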
All sanitization works best when you have a whitelist of known safe values, and exclude all others.
ASCII is one such set of characters.
This could be approached in various ways, but each one may increase the number of false positives and annoy legitimate users. None of them will work in 100% of cases (even combined); they just add an extra layer.
One approach would be to keep tables of characters that look similar and check whether duplicate names exist. What 'looks similar' is subjective in many cases, so building such a list might be tricky. This method may produce false positives on occasion.
Also, reversing the order of certain letters might trick many users. Checking for very similar names can be done with algorithms like Jaro-Winkler or Levenshtein distance (i.e., checking whether a similar username/company name already exists). Sometimes, however, a near-match is due to a regional spelling difference (e.g., 'centre' vs 'center'), or a company name may deliberately contain an anagram. This approach might further increase the number of false positives.
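A minimal sketch of the distance check; the edit-distance function is the textbook dynamic program, and the threshold of 1 is an illustrative choice, not a recommendation:

    def levenshtein(a, b):
        # classic dynamic-programming edit distance
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                 # delete ca
                               cur[j - 1] + 1,              # insert cb
                               prev[j - 1] + (ca != cb)))   # substitute
            prev = cur
        return prev[-1]

    def too_similar(name, existing, threshold=1):
        return any(levenshtein(name.lower(), e.lower()) <= threshold
                   for e in existing)

    print(too_similar("paypa1", ["paypal", "google"]))  # True: one edit away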
Furthermore, as Jonathan mentioned, sanitisation is also a good approach, but it may not protect against anagrams and can cause problems for legitimate users who want to use a special character.
As the OP also mentioned, special characters can be stripped. Other parts of the name might need stripping too, for example common suffixes like 'Inc.', '.com', etc.
Finally, the name can be restricted to characters from a single language rather than a mixture of scripts (a more relaxed version might forbid mixing scripts within one word while allowing it across words separated by spaces). Requiring a capital first letter and lower case for the rest can improve this further, since certain lower-case letters (like 'l') can look like upper-case ones (like 'I') in some fonts. Excluding certain symbols (like '|') strengthens it further. This solution will annoy users who can no longer register certain names.
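A rough sketch of the mixed-script check, using the first word of unicodedata.name() as a crude stand-in for each letter's script; a real check should use Unicode's Scripts.txt data instead:

    # Crude script detection: the first word of unicodedata.name() ("LATIN",
    # "CYRILLIC", ...) stands in for the script of each alphabetic character.
    import unicodedata

    def scripts_used(name):
        return {unicodedata.name(c).split()[0] for c in name if c.isalpha()}

    print(scripts_used("paypal"))       # {'LATIN'}
    print(scripts_used("p\u0430ypal"))  # {'CYRILLIC', 'LATIN'}; U+0430 is Cyrillic 'а'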
A combination of some/all aforementioned approaches can also be used. The selection of the methods and how exactly they will be applied (e.g., you may choose to forbid similar names, or to require moderator approval in case a name is similar, or to not take any action, but simply warn a moderator/administrator) depends on the scenario that you are trying to solve.
I may have an innovative solution to this problem regarding usernames. Obviously, you want to allow ASCII characters, but in some special cases, other characters will be used (different language, as you said).
I think an intuitive way to allow both ASCII and other characters in a username, while being protected against these Unicode vulnerabilities, would be something like this:
Allow all ASCII characters, and disallow other characters except when there are x or more of these special characters in the username, in which case the username is presumably in another language (a sketch follows the examples).
Take for example this:
Whatsapp, Inc + (U+180E) - Not allowed, only has 1 special character.
элч + (U+180E) - Allowed! It has x or more special characters (for example, with x = 3), and it can use the Mongolian separator since it is Mongolian.
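A minimal sketch of that threshold rule, with x = 3 as the example value from above:

    # Pure ASCII is fine, and so is a name written almost entirely in another
    # script; a lone invisible "special" character is not.
    def allowed(name, x=3):
        special = sum(1 for c in name if ord(c) > 127)
        return special == 0 or special >= x

    print(allowed("Whatsapp, Inc" + "\u180e"))  # False: exactly one special character
    print(allowed("элч" + "\u180e"))            # True: four special characters (>= x)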
Obviously this does not protect you 100% from these vulnerabilities, but it is an efficient method I have been using, especially if you do not mention the existence of the check on the login or register page: attackers may figure out that something is filtering their input, but if you do not advertise it, it is harder to reverse engineer and bypass.
Sorry if this is not an answer you are looking for, just sharing my ideas.
Edit: or you could use an RNN (recurrent neural network) to detect the language and allow only characters specific to that language.

Converting words to their roots

Is there an efficient way to convert all variations of words in a corpus (in a language you're not familiar with) to their roots?
In English, for example, this would mean converting plays, played, and playing into play; did, does, done, and doing into do; birds into bird; and so on.
The idea I have is to iterate through the less frequent words and test whether a substring of each is one of the more frequent words. I don't think this is good because, first, it would not handle irregular verbs and, second, I'm not sure the "root" of a word is always more frequent than its other variations. This method might also erroneously change words that are totally different from the frequent word contained in them.
The reason I want to do this is that I'm working on a classification problem and figured I'd get better results if I worked better on my preprocessing step. If you've done anything similar or have an idea, please do share.
Thank you!
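For reference, the usual tool for this preprocessing step is a stemmer or lemmatizer. A sketch assuming NLTK (pip install nltk) and its Snowball stemmer, which exists for a number of languages; stems are crude roots, and irregular forms like did/done still need a lemmatizer with part-of-speech tags:

    # A baseline using the Snowball stemmer instead of the substring heuristic.
    from nltk.stem.snowball import SnowballStemmer

    stemmer = SnowballStemmer("english")
    for word in ["plays", "played", "playing", "birds"]:
        print(word, "->", stemmer.stem(word))
    # plays, played, and playing all map to "play"; birds maps to "bird"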

Data-structure to use for sentence comparison

I'll try to be as succinct as possible as this will require some explanation. I am in a situation where I have to match strings, and extract values from those strings based on template strings that we define.
For example, the template string would be:
I want to go to the $websiteurl homepage
And the other string could be
I want to go to the google.com homepage
We've had initial success checking that these strings match by measuring Levenshtein Distances and by using "creative" regular expressions, but we're trying to grow this to be more fault tolerant and accurate and the amount of complicated code is growing more than we would like.
In some scenarios we need to handle compound words, or to ignore extra adjectives/descriptive words, etc.
Adjective/Descriptive-word Example:
I want to immediately go to the google.com homepage
Compound Word Example (homepage is broken into two words):
I want to go to the google.com home page
And you can probably imagine even more complicated real-world sentences that should match this one, but won't match or work without some extra cases or additional checks for the sentence.
Obviously our current setup isn't ideal: we need a separate pass over the string for every individual case we check, which not only slows things down but also makes maintenance and debugging more complicated than they should be.
Is there a data structure that would be ideal for holding and comparing sentences in this manner? Something fairly fault-tolerant to extra or even missing words (within reason, obviously)? I imagine some kind of tree, but I don't know what type of tree or data structure in general would be best for this situation.
Thanks to any/all in advance!
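For what it's worth, a token-level sketch of the matching described above, using difflib from the standard library; input tokens that replace a $placeholder become its value, small insertions such as extra adjectives are tolerated, and the 0.8 ratio is an illustrative choice:

    import difflib

    def match(template, sentence):
        t_tokens = template.split()
        s_tokens = sentence.split()
        fixed = sum(1 for t in t_tokens if not t.startswith("$"))
        values, matched = {}, 0
        for op, i1, i2, j1, j2 in difflib.SequenceMatcher(
                None, t_tokens, s_tokens).get_opcodes():
            if op == "equal":
                matched += i2 - i1
            elif op == "replace":
                for t in t_tokens[i1:i2]:
                    if t.startswith("$"):
                        # the placeholder absorbs the differing input tokens
                        values[t] = " ".join(s_tokens[j1:j2])
        # require most of the template's fixed words to be present
        return values if matched >= 0.8 * fixed else None

    print(match("I want to go to the $websiteurl homepage",
                "I want to immediately go to the google.com homepage"))
    # {'$websiteurl': 'google.com'}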

Large free block of English non-pronoun text

As part of teaching myself python I've written a script which allows a user to play hangman. At the moment, the hangman word to be guessed is simply entered manually at the start of the script's code.
I want instead for the script to choose randomly from a large list of English words. This I know how to do - my problem is finding that list of words to work from in the first place.
Does anyone know of a source on the net for, say, 1000 common English words that can be downloaded as a block of text or something similar I can work with?
(My initial thought was grabbing a chunk of a novel from Project Gutenberg [this project is only for my own amusement and won't be available anywhere else, so copyright etc. doesn't matter hugely to me, btw], but anything like that is likely to contain too many names or non-standard words that wouldn't be suitable for hangman. Basically, I need text that only has words legal for use in Scrabble.)
It's a slightly odd question for here I suppose, but actually I thought the answer might be of use not just to me but anyone else working on a project for a wordgame or similar that needs a large seed list of words to work from.
Many thanks for any links or suggestions :)
Would this be useful?
Have you tried /usr/share/dict/words?
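A sketch of using it for hangman; the file exists on most Linux and macOS systems, and the filters skip proper nouns, possessives, and accented entries:

    import random

    words = []
    with open("/usr/share/dict/words") as f:
        for line in f:
            w = line.strip()
            # keep plain lower-case ASCII words: drops "Paris", "cat's", etc.
            if w.isalpha() and w.islower() and w.isascii():
                words.append(w)

    print(random.choice(words))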
Create text list manually
Grab text from Project Gutenberg, Wikipedia or some other source. Go through the text and count how many times each word is found. The words that are found most frequently will be pronouns, conjunctions, etc... Just throw them out.
Proper nouns will likely be among the least frequent words, unless of course your text is a story, in which case the character names will appear quite often. Probably the best way to handle proper nouns is to use many sources and count how many sources each word appears in. Words that are common across a lot of different sources are unlikely to be proper nouns; words specific to one source you can throw out. This idea is related to tf-idf.
Once you have calculated these word frequencies, it's also easy to just look over the words, and tweak your list as necessary.
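A sketch of that pipeline; the cut-offs are arbitrary illustrative values:

    # Count words per source; drop the top-ranked function words and anything
    # that appears in fewer than two sources (likely proper nouns or typos).
    import re
    from collections import Counter

    def word_counts(text):
        return Counter(re.findall(r"[a-z]+", text.lower()))

    def candidate_words(texts, max_rank=100, min_sources=2):
        per_source = [word_counts(t) for t in texts]
        total = Counter()
        for counts in per_source:
            total.update(counts)
        stopwords = {w for w, _ in total.most_common(max_rank)}
        return sorted(w for w in total
                      if w not in stopwords
                      and sum(w in c for c in per_source) >= min_sources)

    texts = ["The cat sat on the mat.", "A cat and a dog played."]
    print(candidate_words(texts, max_rank=3, min_sources=2))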
Use WordNet
Another idea is to download words from WordNet. WordNet lists the part of speech for a lot of words, so you could stick to nouns and verbs for your purpose.
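A sketch via NLTK's WordNet corpus reader; requires a one-time nltk.download("wordnet"):

    from nltk.corpus import wordnet as wn

    # single-word, lower-case noun lemmas make reasonable hangman words
    nouns = {lemma.name() for synset in wn.all_synsets(pos="n")
             for lemma in synset.lemmas()
             if lemma.name().isalpha() and lemma.name().islower()}
    print(len(nouns), sorted(nouns)[:10])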

What is the best UX for non-programmer users: comma-separated tags or space-separated tags?

I'm creating a social site for teachers (non-programmers) on which teachers can add events, links, exercises, tips, lesson plans, books, etc.
I want them to be able to add tags to each of these items, as we do at Stack Overflow.
However, because they are non-programming users, I thought that space-separated single-word tags and camelCase tags would lead to too much confusion, e.g.:
grammar teachingtips universityOfMinnesota phrasalverbs
and indeed on this similar stackoverflow question most of the answers suggested commas like this:
grammar, teaching tips, university of minnesota, phrasal verbs
but then I just signed up for a delicious.com account (which I don't think has a very programmer-centric audience) and saw that they use spaces as well:
separate tags with spaces: e.g. hotels bargains newyork (not new york)
What has been your experience on this point in terms of the current UX trend for tags? Is the average Internet user accustomed to space-separated tags by now? I have to admit I have never seen comma-separated tags on any major site I have used. Have you come upon a good way to combine them so it doesn't even matter, e.g.:
grammar book reviews teaching tips
and, e.g., have a quick algorithm which checks the current tags for combinations such as the following (a sketch follows the list):
grammar
grammar book
grammar book reviews
book
book reviews
book reviews teaching
...
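For illustration, a sketch of that lookup: enumerate every contiguous sub-phrase of the entered line and keep the ones that already exist as tags. An n-word line has O(n^2) sub-phrases, so this only suits short tag lines:

    def subphrases(words):
        return [" ".join(words[i:j])
                for i in range(len(words))
                for j in range(i + 1, len(words) + 1)]

    existing = {"grammar", "book reviews", "teaching tips"}
    line = "grammar book reviews teaching tips"
    print([p for p in subphrases(line.split()) if p in existing])
    # ['grammar', 'book reviews', 'teaching tips']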
I'd go comma-separated personally. You'll note that Stack Overflow doesn't, but its tags are clearly delineated into their own boxes, and hyphens are often used for "spacing". I'd say spaces are more natural to non-programmers than hyphens, however.
Comma separated seems the most natural - it's what English uses to punctuate lists. It also allows you to have spaces in tags if you want. People will try to enter
this, that, the other
and expect it to work.
I can't think of a good reason to use spaces.
Notice that delicious has to give an example to demonstrate how to do it their way. That's not a good sign.
If you do go with commas, take care to see how easy it is for a "space user" to see that they made a mistake, and to fix it.
I would go with comma-separated tags, if only to save your users the pain of having to use quotes to indicate that a tag contains a space, i.e. website "stack overflow" tips versus website, stack overflow, tips. I know which I'd prefer.
Comma-separated is the way to go for your educational audience. It's simply intuitive.
Most teachers should have no trouble understanding a system where tags are comma separated, and there is no need to come up with an awkward workaround for phrases.
It depends a little on how the tags are entered. If the user gets suggestions for tags as they type like SO provides (shades of intellisense), space separated is probably fine. However, if you are going to force the user to enter each tag without a reference list it may be easier to accept case-insensitive comma (or semicolon) delimited tags.
You don't want to check all those possibilities unless you are going to severely limit the number of possible tags: the number of ways to split an n-word entry into tags grows exponentially (every gap between words can either be a break or not), and you most likely don't want that extra load on your server.
Your best bet is probably just to stick with one option - the users will (should!) get used to it fairly quickly. Spaces as separators are probably the most common, so I would go with that, since it is the one the users are most likely to have had prior exposure to.
As long as what the software accepts/demands is clear, I think users will be happy with either. Confusion comes when they don't know whether to use commas, semicolons, spaces or...
If you use a number of e-mail clients you'll know how useful a simple tool-tip reminder of whether it's commas or spaces would be when entering multiple recipients.
When tagging, how you set it up depends on what kinds of things you will tag. Media that is hard to index, like pictures, audio, or video, should encourage many and varied tags, because the tags are how you will search the content.
Easily indexed content (text!) should use a very rigid tagging structure, because you don't need to rely on tags for search indexing. Instead, the purpose of tags is sort the content into well-defined categories. Tags should be more like labels or folders.
I'm gonna take a guess here that this content will be mostly text-based, with the occasional picture or video file thrown in. So you don't want either comma or space separated tag entry, but rather some mechanism that forces users to pick from an existing set of tags.
I would assume space separated tags unless there are one or more commas, in which case you should split on commas instead. In other words, support both but in a limited way. You can probably guess right 90+ percent of the time.
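A sketch of that heuristic:

    # Split on commas when any are present, otherwise fall back to whitespace.
    def parse_tags(raw):
        parts = raw.split(",") if "," in raw else raw.split()
        return [t.strip() for t in parts if t.strip()]

    print(parse_tags("grammar, teaching tips"))   # ['grammar', 'teaching tips']
    print(parse_tags("hotels bargains newyork"))  # ['hotels', 'bargains', 'newyork']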
