What is the best UX for non-programmer users: comma-separated tags or space-separated tags?

I'm creating a social site for teachers (non-programmers) on which teachers can add events, links, exercises, tips, lesson plans, books, etc.
I want them to be able to add tags to each of these items, as we do on Stack Overflow.
However, because they are non-programming users, I thought that space-separated tags, with run-together or camelCase words inside each tag, would lead to too much confusion, e.g.:
grammar teachingtips universityOfMinnesota phrasalverbs
and indeed on this similar stackoverflow question most of the answers suggested commas like this:
grammar, teaching tips, university of minnesota, phrasal verbs
but then I just signed up for a delicious.com account (which I don't think has a very programmer-centric audience) and saw that they use spaces as well:
separate tags with spaces: e.g. hotels bargains newyork (not new york)
What has been your experience on this point in terms of the current UX trend for tags? Is the average Internet user accustomed to space-separated tags by now? I have to admit, I have never seen comma-separated tags on any major site I have used. Have you come upon a good way to combine them so it doesn't even matter, e.g.:
grammar book reviews teaching tips
and e.g. have a quick algorithm which checks the number of current tags for:
grammar
grammar book
grammar book reviews
book
book reviews
book reviews teaching
...
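The list above can be generated mechanically. Here is a minimal sketch, assuming a hypothetical `tag_counts` dictionary that maps each tag already in use on the site to how often it has been used; the function names and the three-word limit are made up for illustration.

```python
def candidate_phrases(text, max_words=3):
    """Return every run of 1..max_words consecutive words in the input."""
    words = text.split()
    phrases = []
    for start in range(len(words)):
        for length in range(1, max_words + 1):
            if start + length <= len(words):
                phrases.append(" ".join(words[start:start + length]))
    return phrases


def known_candidates(text, tag_counts, max_words=3):
    """Keep only the candidate phrases that already exist as tags.

    tag_counts is a hypothetical {tag: usage count} mapping.
    """
    return {
        phrase: tag_counts[phrase]
        for phrase in candidate_phrases(text, max_words)
        if phrase in tag_counts
    }
```

Listing the candidates is cheap (at most `max_words` phrases per starting word). Deciding which combination of them the user actually meant is the hard part, and this sketch does not attempt it.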

I'd go comma separated personally. You'll note that Stack Overflow doesn't, but there the tags are clearly delineated into their own boxes. Plus hyphens are often used for "spacing". I'd say spaces are more natural to non-programmers than hyphens are, however.

Comma separated seems the most natural - it's what English uses to punctuate lists. It also allows you to have spaces in tags if you want. People will try to enter
this, that, the other
and expect it to work.
I can't think of a good reason to use spaces.

Notice that delicious has to give an example to demonstrate how to do it their way. That's not a good sign.
If you do go with commas, take care to see how easy it is for a "space user" to see that they made a mistake, and to fix it.

I would go with comma separated tags, if only to save your users the pain of having to use quotes to indicate a tag has a space in it, i.e. website "stack overflow" tips, or website, stack overflow, tips. I know which I'd prefer.

Comma-separated is the way to go for your educational audience. It's simply intuitive.
Most teachers should have no trouble understanding a system where tags are comma separated, and there is no need to come up with an awkward workaround for phrases.

It depends a little on how the tags are entered. If the user gets suggestions for tags as they type like SO provides (shades of intellisense), space separated is probably fine. However, if you are going to force the user to enter each tag without a reference list it may be easier to accept case-insensitive comma (or semicolon) delimited tags.

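You don't want to check all those possibilities unless you are going to severely limit the number of possible tags: the number of ways to group n words into multi-word tags grows exponentially (there are 2^(n-1) ways to split them into contiguous phrases), and you most likely don't want to have that extra load on your server.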
Your best bet is probably just to stick with one option - the users will (should!) get used to it fairly quickly. Spaces as separators are probably the most common, so I would go with that, since it is the one the users are most likely to have had prior exposure to.

As long as what the software accepts/demands is clear, I think users will be happy with either. Confusion comes when they don't know whether to use commas, semicolons, spaces or...
If you use a number of e-mail clients you'll know how useful a simple tool-tip reminder of whether it's commas or spaces would be when entering multiple recipients.

When tagging, how you set it up depends on what kinds of things you will tag. Media that is hard to index, like pictures, audio, or video, should encourage many and varied tags, because the tags are how you will search the content.
Easily indexed content (text!) should use a very rigid tagging structure, because you don't need to rely on tags for search indexing. Instead, the purpose of tags is to sort the content into well-defined categories. Tags should be more like labels or folders.
I'm gonna take a guess here that this content will be mostly text-based, with the occasional picture or video file thrown in. So you don't want either comma or space separated tag entry, but rather some mechanism that forces users to pick from an existing set of tags.

I would assume space separated tags unless there are one or more commas, in which case you should split on commas instead. In other words, support both but in a limited way. You can probably guess right 90+ percent of the time.
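A minimal sketch of that rule, in Python. The function name is made up, and the lower-casing and de-duplication are my own assumptions rather than part of the suggestion above.

```python
def parse_tags(raw):
    """Split on commas if the input contains any, otherwise on whitespace."""
    separator = "," if "," in raw else None  # None means "any run of whitespace"
    tags = []
    for part in raw.split(separator):
        # Collapse inner whitespace and lower-case (assumption: tags are
        # case-insensitive).
        tag = " ".join(part.split()).lower()
        if tag and tag not in tags:
            tags.append(tag)
    return tags
```

With this, `grammar, teaching tips` yields two tags and `hotels bargains newyork` yields three. The case it guesses wrong is a single multi-word tag typed without any comma.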

Related

How to Protect Against Unicode Security Vulnerabilities

"Five things everyone should know about Unicode" is a blog post showing how Unicode characters can be used as an attack vector for websites.
The main example given of such a real-world attack is a fake WhatsApp app submitted to the Google Play store using a Unicode non-printable space in the developer name, which made the name unique and allowed it to get past Google's filters. The Mongolian Vowel Separator (U+180E) is one such non-printable space character.
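For illustration, here is a rough sketch of how such characters could be flagged using Python's standard `unicodedata` module. The choice of categories and the function name are assumptions, not an authoritative list: the set below covers format characters, control characters and all separators other than the plain ASCII space.

```python
import unicodedata

# Assumption: these general categories are treated as "invisible or unusual
# spacing". Cf = format, Cc = control, Zs/Zl/Zp = space, line and paragraph
# separators.
SUSPICIOUS_CATEGORIES = {"Cf", "Cc", "Zs", "Zl", "Zp"}


def invisible_characters(text):
    """Return (index, code point) pairs for characters that render as nothing
    or as unusual whitespace. The ordinary ASCII space is allowed."""
    found = []
    for index, ch in enumerate(text):
        if ch == " ":
            continue
        if unicodedata.category(ch) in SUSPICIOUS_CATEGORIES:
            found.append((index, f"U+{ord(ch):04X}"))
    return found
```

Note that the Cf category also contains joiners that some scripts and emoji sequences legitimately need, so rejecting everything this returns may be too strict for some applications.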
Another vulnerability is to use alternative Unicode characters that look similar. The Mimic tool shows how this can work.
An example I can think of is to protect usernames when registering a new user. You don't want two usernames to be the same or for them to look the same either.
How do you protect against this? Is there a list of these characters out there? Should it be common practice to strip all of these types of characters from all form inputs?
What you are talking about is called a homoglyph attack.
There is a "confusables" list by Unicode here, and also have a look at this. There should be libraries based on these or potentially other databases. One such library is this one that you can use in Java or Javascript. The same must exist for other languages as well, or you can write one.
The important thing I think is to not have your own database - the library or service is easy to do on top of good data.
As for whether you should filter out similar looking usernames - I think it depends. If there is an interest for users to try and fake each other's usernames, maybe yes. For many other types of data, maybe there is no point in doing so. There is no generic best practice I think other than you should assess the risk in your application, with your datapoints.
Also, a different approach for a different problem: what may often work for Unicode input validation is the \w word character in a regular expression, if your regex engine is Unicode-ready. In such an engine, \w should match all Unicode classes of word characters, i.e. letters, modifiers and connectors in any language, but nothing else (no special characters). This does not protect against homoglyph attacks, but may protect against some injections while keeping your application Unicode-friendly.
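As a small sketch of the \w idea in Python, where `re` patterns on `str` are Unicode-aware by default (the function name is made up):

```python
import re

WORD_ONLY = re.compile(r"\w+")


def is_word_characters_only(value):
    """True if the whole value consists of Unicode word characters
    (letters, digits and underscore in Python's definition of \\w)."""
    return WORD_ONLY.fullmatch(value) is not None
```

Exactly which characters \w covers differs between regex engines. In Python it may not match combining marks on their own, so normalizing input to NFC first could be needed. A name that swaps a Latin letter for a look-alike Cyrillic one still passes this check, so it does nothing against homoglyphs.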
All sanitization works best when you have a whitelist of known safe values, and exclude all others.
ASCII is one such set of characters.
This could be approached in various ways, however each one might increase the number of false positives, causing legitimate users' annoyance. Also, none of them will work for 100% of the cases (even if combined). They will just add an extra layer.
One approach would be to have tables of characters that look similar and check whether duplicate names exist. What 'look similar' means is subjective in many cases, so building such a list might be tricky. This method might produce false positives on certain occasions.
Also, reversing the order of certain letters might trick many users. Checking for anagrams or very similar names can be achieved using algorithms like Jaro-Winkler and Levenshtein distance (i.e., checking if a similar username/company name already exists). Sometimes however, this might be due to a different spelling of some word in some region (e.g., 'centre' vs 'center'), or the name of some company might deliberately contain an anagram. This approach might further increase the number of false positives.
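To make the distance check concrete, here is a minimal sketch of Levenshtein distance in plain Python, plus a helper that lists existing names within a threshold. The helper name and the threshold of 1 are assumptions for illustration only.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions and substitutions
    needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def similar_names(candidate, existing, max_distance=1):
    """Existing names whose case-folded form is within max_distance edits."""
    folded = candidate.casefold()
    return [
        name for name in existing
        if levenshtein(folded, name.casefold()) <= max_distance
    ]
```

Plain Levenshtein counts two swapped adjacent letters as two edits, so 'centre' and 'center' are at distance 2. A threshold loose enough to catch swapped letters will therefore also catch more legitimate names.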
Furthermore, as Jonathan mentioned, sanitisation is also a good approach, however it might not protect against anagrams and cause issues to legitimate users who want to use some special character.
As the OP also mentioned, special characters can also be stripped. Other parts of the name might also need to be stripped, for example common names like 'Inc.', '.com' etc.
Finally, the name can be restricted to characters from a single language rather than a mixture of several (a more relaxed version forbids mixing within a single word but allows it across words separated by spaces). Restricting names to a capital first letter followed by lower-case letters can further improve this approach, as certain lower-case letters (like 'l') may look like upper-case ones (like 'I') in certain fonts. Excluding certain symbols (like '|') will enhance this approach further. This solution will annoy users who are unable to use certain names.
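The relaxed per-word variant could be sketched roughly as follows. Python's standard library has no script property, so this uses the first word of each character's Unicode name (e.g. LATIN, CYRILLIC) as a stand-in for its script. That is a crude heuristic of my own, not a proper script lookup, and the function names are made up.

```python
import unicodedata


def scripts_in_word(word):
    """Approximate the set of scripts used by the letters of a word.

    Heuristic: the first token of the character name, e.g.
    'LATIN SMALL LETTER A' -> 'LATIN'.
    """
    scripts = set()
    for ch in word:
        if not ch.isalpha():
            continue  # ignore digits, punctuation and symbols
        name = unicodedata.name(ch, "")
        if name:
            scripts.add(name.split()[0])
    return scripts


def has_mixed_script_word(name):
    """True if any whitespace-separated word mixes letters of several scripts."""
    return any(len(scripts_in_word(word)) > 1 for word in name.split())
```

This would wrongly flag languages that normally mix scripts inside a word, Japanese being the obvious example, so in practice it would need an allow-list of acceptable combinations.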
A combination of some/all aforementioned approaches can also be used. The selection of the methods and how exactly they will be applied (e.g., you may choose to forbid similar names, or to require moderator approval in case a name is similar, or to not take any action, but simply warn a moderator/administrator) depends on the scenario that you are trying to solve.
I may have an innovative solution to this problem regarding usernames. Obviously, you want to allow ASCII characters, but in some special cases, other characters will be used (different language, as you said).
I think an intuitive way to allow both ASCII and other characters to be used in a username, while being protected against "Unicode Vulnerabilities", would be something like this:
Allow all ASCII characters and disallow other characters, except when there are x or more of these special characters in the username (the username is in another language).
Take for example this:
Whatsapp, Inc + (U+180E) - Not allowed, only has 1 special character.
элч + (U+180E) - Allowed! It has more than x special characters (for example, 3). It can use the Mongolian separator since it's Mongolian.
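As a sketch, the rule boils down to a count of non-ASCII characters. The function name and the default of x = 3 are assumptions taken from the example above.

```python
def username_allowed(username, min_non_ascii=3):
    """Allow pure-ASCII names, and names with at least min_non_ascii
    non-ASCII characters (assumed to be written in another language)."""
    non_ascii = sum(1 for ch in username if ord(ch) > 127)
    return non_ascii == 0 or non_ascii >= min_non_ascii
```

A name padded with x or more look-alike characters passes this check, which is part of why it can only be one layer among several.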
Obviously, this does not protect you 100% from these types of vulnerabilities, but it is a method I have been using and have found very efficient, especially if you do not mention the existence of this algorithm on the "login" or "register" page. Attackers might still figure out that some algorithm is protecting the website from these types of attacks, but if you do not describe it, it is harder for them to reverse engineer it and find a way to bypass it.
Sorry if this is not an answer you are looking for, just sharing my ideas.
Edit: Or you can use an RNN (Recurrent Neural Network) AI to detect the language and allow specific characters from that language.

Smart search for acronyms in Salesforce

In Salesforce's Service Cloud one can enable the out-of-the-box search function, where the user enters a term and the system searches all parts of the database for a match. I would like to enable smart searching of acronyms, so that if I spell out an organization's name the search functionality will also search for associated acronyms in the database. For example, if I type in American Automobile Association, I would get results that contain either "American Automobile Association" or "AAA".
I imagine such a script would involve declaring that if the term being searched contains one or more spaces or periods, take the first letter of the first word and concatenate it with the letters that follow subsequent spaces or periods.
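The logic described above could look roughly like this. It is plain Python to illustrate the string handling only, not Salesforce or Apex code, and the function names are hypothetical.

```python
import re


def acronym(term):
    """Build an acronym from the first letter of each word, where words are
    separated by spaces or periods. Returns None for single-word terms."""
    words = [word for word in re.split(r"[\s.]+", term) if word]
    if len(words) < 2:
        return None
    return "".join(word[0].upper() for word in words)


def expand_search_terms(term):
    """Return the original term plus its acronym, if it has one."""
    terms = [term]
    short = acronym(term)
    if short:
        terms.append(short)
    return terms
```

How the expanded list would be fed into the search itself depends on what the platform exposes, which this sketch does not cover.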
I have unsuccessfully tried to find scripts for this or articles on enabling this functionality in Salesforce. Any guidance would be appreciated.
Interesting question! I don't think there's a straightforward answer but as it's standard search functionality, not 100% programming related - you might want to cross-post it to salesforce.stackexchange.com
Let's start with searchable fields list: https://help.salesforce.com/articleView?id=search_fields_business_accounts.htm&type=0
In Setup there's standard functionality for Synonyms, quite easy to use. It's not a silver bullet though, applies only to certain objects like Knowledge Base (if you use it). Still - it claims to work on Cases too so if there's "AAA" in Case description it should still be good enough?
You could also check out the trick with marking a text field as indexed and/or external ID and adding there all your variations / acronyms: https://success.salesforce.com/ideaView?id=08730000000H6m2 This is more work, to prepare / sanitize your data upfront but it's not a bad idea.
Similar idea would be to use Tags although that could explode in size very quickly. It's ridiculous to create a tag for every single company.
You can do some really smart things in data deduplication rules. Too much to write it all here, check out the trailhead: https://trailhead.salesforce.com/en/modules/sales_admin_duplicate_management/units/sales_admin_duplicate_management_unit_2 No idea if it impacts search though.
If you suffer from bad address data there are State & Country picklists, no more mess with CA / California / SoCal... https://resources.docs.salesforce.com/204/latest/en-us/sfdc/pdf/state_country_picklists_impl_guide.pdf Might not help with Name problem...
Data.com cleanup might help. Paid service I think, no idea if it affects search too. But if enabling it can bring these common abbreviations into your org - might be better than reinventing the wheel.

List of uninteresting words

[Caveat] This is not directly a programming question, but it is something that comes up so often in language processing that I'm sure it's of some use to the community.
Does anyone have a good list of uninteresting (English) words that have been tested by more than a casual look? This would include all prepositions, conjunctions, etc... words that may have semantic meaning, but are frequent in every sentence, regardless of the subject. I've built my own lists from time to time for personal projects but they've been ad hoc; I continuously add words that I've forgotten as they come in.
These words are usually called stop words. The Wikipedia article contains much more information about them, including where to find some lists.
I think you mean stop words.
There are a few links to lists of stop words on Wikipedia, including this one.

Large free block of english non-pronoun text

As part of teaching myself Python I've written a script which allows a user to play hangman. At the moment, the hangman word to be guessed is simply entered manually at the start of the script's code.
I want instead for the script to choose randomly from a large list of English words. This I know how to do - my problem is finding that list of words to work from in the first place.
Does anyone know of a source on the net for, say, 1000 common English words where they can be downloaded as a block of text or something similar that I can work with?
(My initial thought was grabbing a chunk of a novel from Project Gutenberg [this project is only for my own amusement and won't be available anywhere else so copyright etc doesn't matter hugely to me btw], but anything like that is likely to contain too many names or non-standard words that wouldn't be suitable for hangman. I need text that only has words legal for use in Scrabble, basically).
It's a slightly odd question for here I suppose, but actually I thought the answer might be of use not just to me but anyone else working on a project for a wordgame or similar that needs a large seed list of words to work from.
Many thanks for any links or suggestions :)
Would this be useful?
Have you tried /usr/share/dict/words?
Create text list manually
Grab text from Project Gutenberg, Wikipedia or some other source. Go through the text and count how many times each word is found. The words that are found most frequently will be pronouns, conjunctions, etc... Just throw them out.
Proper nouns will likely be the least frequently found words, unless of course your text is a story, in which case the character names will likely be found quite often. Probably the best way to handle proper nouns is to use many sources and count how many sources each word is found in. Essentially, words that are common among a lot of different sources will likely not be proper nouns. Words that are specific to one text source, you can throw out. This idea is related to tf-idf.
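A minimal sketch of that counting approach, using only the standard library. The function names and thresholds are made up for illustration, and the tokenizer only handles plain a-z words.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and return its runs of ASCII letters."""
    return re.findall(r"[a-z]+", text.lower())


def candidate_words(sources, min_sources=2, drop_most_common=2):
    """Keep words that appear in at least min_sources texts, after throwing
    out the drop_most_common most frequent words overall."""
    total = Counter()     # how often each word occurs across all texts
    doc_freq = Counter()  # in how many texts each word occurs
    for text in sources:
        words = tokenize(text)
        total.update(words)
        doc_freq.update(set(words))
    too_common = {word for word, _ in total.most_common(drop_most_common)}
    return sorted(
        word for word in total
        if doc_freq[word] >= min_sources and word not in too_common
    )
```

With real texts you would want far more sources and a much larger `drop_most_common`, and you would still look over the result by hand as suggested.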
Once you have calculated these word frequencies, it's also easy to just look over the words, and tweak your list as necessary.
Use WordNet
Another idea is to download words from WordNet. WordNet gives the parts of speech for a lot of words. You could just stick to nouns and verbs for your purpose.

I am building a search engine. How do I remove duplicates from search results?

When I search for something, I get content that has the same text and title.
Of course, there is always an original (which the others copy/leech from).
If you have expertise in search and crawling... how do you recommend that I remove these duplicates (in a feasible and efficient manner)?
Sounds like a programming question to me.
If you have a clear idea about what the stolen and original components of these pages are, and those differences are general enough that you can write a filter to separate them, then do that, hash the 'stolen' content, and then you should be able to compare hashes to determine if two pages are the same.
I guess web-page thieves might go to some further code-obfuscation to mess you up, including changing whitespace, so you might want to normalise the html before hashing, for instance removing any redundant whitespace, making all attributes use " quotes etc.
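A rough sketch of the normalise-then-hash idea. The normalisation rules here are deliberately crude assumptions (for example, it rewrites every single quote, including apostrophes in the text), and the function names are made up.

```python
import hashlib
import re


def normalize_html(html):
    """Crude normalisation: unify quote style and collapse whitespace."""
    text = html.replace("'", '"')        # make all attributes use " quotes
    text = re.sub(r"\s+", " ", text)     # collapse runs of whitespace
    text = re.sub(r">\s+<", "><", text)  # drop whitespace between tags
    return text.strip()


def fingerprint(html):
    """SHA-256 hex digest of the normalised markup."""
    return hashlib.sha256(normalize_html(html).encode("utf-8")).hexdigest()


def unique_pages(pages):
    """Keep the first (url, html) pair seen for each fingerprint."""
    seen = set()
    kept = []
    for url, html in pages:
        digest = fingerprint(html)
        if digest not in seen:
            seen.add(digest)
            kept.append((url, html))
    return kept
```

An exact hash only catches copies that are identical after normalisation. Pages that differ by an ad or a changed sentence need a similarity measure instead.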
Here's a technique based on simhash.
Here's one that uses stopwords to work around ads.
Have you tried looking at the origin date of the site? After comparing a value of word strings to verify duplication, whitelist the one that is earlier.
