A good word splitter

A good word splitter - nlp

I have a set of short strings (average length < 12).
The strings are mostly sequence of English words (names, dict words etc).
However there is no delimiter between the words. I want to split each string into individual words. I tried google but didn't find anything.
Is there any standard way to do that? Also where can I get dictionary which also includes name of person, along with other English words.
Please note: The strings might not adhere to grammatical rules of English.
Examples of Strings are given below:
dontdisturb
ilovejane
iamagoodperson

It is a known problem for Twitter content/hashtags, though there is no standard/universally accepted way to solve it. (I would also suggest changing the topic to "hashtag splitter" if it is your problem, then more people would be able to find it.)
The algorithm I would suggest is the one typically used for segmentation of Chinese (which has a very similar issue as you can imagine). Here is the idea:
1.Try finding all substrings that can be found in a dictionary, give them the highest score.
2.Then add sequences accepted by some English heuristic with a lower score.
3.And finally throw in individual letters or syllables found in the remainder, with the lowest score.
4.Use Viterbi algorithm (or here) to find the best non-overlapping coverage of the string with the highest score.

Related

Find the most similar text in a list of strings

First of all, I want to acknowledge that this is perhaps a very sophisticated problem; however, I have not been able to find a definitive answer for it online so I'm looking for suggestions.
Suppose I want to collect a list containing over hundred thousand strings, values of these strings are sentences that a user has typed. The values are added to the list as soon as a user types a new message. For example:
["Hello world!", "Good morning, my name is John", "Good morning, everyone"]
But I also want to have a timeout for each string so if they are not repeated within 5 min, they should be removed, so I change it to following format:
[{message:"Hello world!", timeout: NodeJS.Timeout, count: 1}, {message:"Good morning, my name is John", timeout: NodeJS.Timeout, count: 1}, {message:"Good morning, everyone", timeout: NodeJS.Timeout, count: 1}]
Now suppose a user types the following message:
Good morning, everyBODY
I want to compare this string to all the messages in list and if one is 70% or more similar, update the count of that message, otherwise insert it as a new message. For this message for example, the application should update the count for Good morning, everyone to be equal to 2.
Since users can type a lot of messages in a short amount of time, the algorithm must also support fast insertion, searching, and deleting after the timeout.
What is the best way to implement this? or are there any libraries to help me with this?
NOTE: The strings do not need to be in an array, any data structure would work.
The main purpose of this algorithm is to detect similar messages when the count reaches a predefined value. For example warning: Over 5 users typed messages similar to "Hello everybody" within 5 minutes
I have looked at B-Trees, Nearest Neighbor, etc but I can't figure out what would be the best solution.
Update:
I plan on using Levenshtein distance for string similarity, however the main problem is how to apply that to a list of strings in most time efficient way, without having to check every single string every time a new message is added.

Levenshtein distance
Unlike the other answer I think Levenshtein distance is perfectly capable of dealing with spelling mistakes. Indeed, Levenshtein and LevXenshtein only have Levenshtein distance 1, and thus can be concluded to likely be the same message.
However, if you want to use this distance, you will have to compute the distance between the new message and every message stored, every time a new message comes in. There is likely no way around this.
Unfortunately there is no real useful pre-processing you can do for this.
Other possibilities
If you can find a way to map every message to a fixed-size vector, you can use essentially any nearest neighbor search technique. I suggest doing so.
This leaves us with two problems to solve. Generating the fixed-length vector, and doing the search.
Fixed-size vector representation
There are multiple ways of doing this, all with their own set of drawbacks. I'll specifically mention two, but it will depend on your architecture and data which method is best for you.
First, you could go the machine-learning way. You could map every word to a pre-trained vector with fastText, average the words in the message, and use that as your vector. The drawbacks of this method are that it will ignore word order, and it will work less well if the words used tend to be very informal. If your messages have their own culture to them (such as for example Twitch chat) you would have to retrain these vectors instead of using pre-trained ones.
Alternatively, you could use the structure of the text directly, and make an occurrence vector of bigrams. That is, jot down how often every 2-character combination occurs in a message. This is fairly robust, but has the drawback that the vectors will become relatively large.
Regardless, these are just two options, and it's impossible to tell what method is ideal for you. Unless of course someone has a brilliant idea.
Nearest neighbor search
Given that we have fixed length vectors, we can now do nearest neighbor search. As you've probably found, there are once again many different methods for this, all with their own drawbacks. Exhausting, I know.
I'll choose to discuss three categories.
Approximate search: This method may seem a little silly, but it could be what you want. Specifically, Locality-sensitive hashing is essentially just making some hashing function where "similar" vectors are likely to end up in the same bucket. You could then do anything you want, such as Levenshtein, with all of the other members of the bucket, because there should not be too many of them. The advantage of such an approximate algorithm is that it can be fast, and with some smart hashing you don't even need fixed-length vectors. A downside, of course, is that it is not guaranteed to work.
Exact search: We can also choose to instead solve the problem of Fixed-radius near neighbors. That is, find the points within some distance of the target point. You could do this by mapping vectors to integers (if they aren't already) and simply checking every lattice point within the distance you want to search. The primary drawback here is that the search time grows very fast not with the number of points, but with the number of dimensions of the vector. This method would necessitate small vectors.
Fancy datastructures: This seems to me most likely to be the right solution. Unfortunately you have a lot of letter-trees. You mention B-trees, but there's also R-trees, R+-Trees, R*-Trees, X-Trees, and that's just the direct descendants of the R-tree. With the risk of missing the trees for the forest, I'd suggest taking a look at the k-d tree. It can do nearest neighbor search in logarithmic time, as well as insertion and deletion.

You want to covert all of the words to their Soundex value.
Then you need a database for the soundex values that ranks the importance of the word in the sentence, e.g. the should probably get 0. The more information the word carries the higher its value.
Then sort the words in the sentence into a list of integers.
Use the list of integers as the key to find similar sentences.
Since the key is a list of integers a Rose tree should work as data structure.
While some may suggest measuring using something like Levenshtein distance that presupposes that the sentences have no spelling mistakes or such. You need something that is flexible enough to deal with human error.

I would suggest you to use Algolia. Which has their own ranking algorithm rates each matching record on several criteria (such as the number of typos or the geo-distance), to which they individually assign a integer value score.
I would totally take a loook on it, since they have Search-as-you-type and different Ranking algorithm criterias.
https://blog.algolia.com/search-ranking-algorithm-unveiled/

I think Search Engine like SOLR or Elastic Search are best fit for your problem.
You have to create single collection in which you can store data as you have mention in the question after that you just have to add data to solr and search it in the solr search with your time limit.

How to automate finding sentences similar to the ones from a given list?

I have a list of let's say "forbidden sentences" (1000 of them, each with around 40 words). I want to create a tool that will find and mark them in a given document.
The problem is that in such document this forbidden sentence can be expressed differently than it is on this list keeping the same meaning but changed by using synonyms, a few words more or less, different word order, punctuation, grammar etc. The fact that this is all in Polish is not making things easier with each noun, pronoun, and adjective having 14 cases in total plus modifiers and gender that changes the words further. I was also thinking about making it so that the found sentences are ranked by the probability of them being forbidden with some displaying less resemblance.
I studied IT for two years but I don't have much knowledge in NLP. Do you think this is possible to be done by an amateur? Could you give me some advice on where to start, what tools to use best to put it all together? No need to be fancy, just practical. I was hoping to find some ready to use code cause i imagine this is sth that was made before. Any ideas where to find such resources or what keywords to use while searching? I'd really appreciate some help cause I'm very new to this and need to start with the basics.
Thanks in advance,
Kamila

Probably the easiest first try will be to use polish SpaCy, which is an extension of popular production-ready NLP library to support polish language.
http://spacypl.sigmoidal.io/#home
You can try to do it like this:
Split document into sentences.
Clean these sentences with spacy (deleting stopwords, punctuation, doing lemmatization - it will help you with many differnet versions of the same word)
Clean "forbidden sentences" as well
Prepare vector representation of each sentence - you can use spaCy methods
Calculate similarity between sentences - cosine similarity
You can set threshold, from which if sentences of document is similar to any of "forbidden sentences" it will be treated as forbidden
If anything is not clear let me know.
Good luck!

Comparing strings for their similarities?

I want to count the number of times there is an ocurrence of certain college course on a list of thousands of entries. The problem is the course is not always spelled the same. For example, Computer Engineering can be spelled Computers Engineering. What is a proper, elegant way to test if 2 strings are very similar?

I would try to canonize the strings using stemming. The idea is - give each string its canonized form, and two different strings, that represent the same word are very likely to have the same canon form (for example, Computer and Computers will have the same cannon form, and you will get a match).
Porter stemming algorithm is often used for canonization.
An alternative - is grading the strings with a distance between each other, the suggested Levenshtein Distance can help you with it, but personally - I'd prefer canonization.

Filtering out meaningless phrases

I have an algorithm (that I can't change) that outputs a list of phrases. These phrases are intended to be "topics". However, some of them are meaningless on their own. Take this list:
is the fear
freesat
are more likely to
first sight
an hour of
sue apple
depression and
itunes
How can I filter out those phrases that don't make sense on their own, to leave a list like the following?
freesat
first sight
sue apple
itunes
This will be applied to sets of phrases in many languages, but English is the priority.

It's got to be grammatically acceptable in that it can't rely on other words in the original sentence that it was extracted from; e.g. it can't end in 'and'.
Although this is still an underspecified question, it sounds like you want some kind of grammar checker. I suggest you try applying a part-of-speech tagger to each phrase, compile a list of patterns of POS tags that are acceptable (e.g. anything that ends in a preposition would be unacceptable) and use that to filter your input.

At a high level, it seems that phrases which were only nouns or adjective-noun combos would give much better results.
Examples:
"Blue Shirt"
"Happy People"
"Book"
First of all, this problem can be as complex as you want it to be. For third-party reading/solutions, I came across:
http://en.wikipedia.org/wiki/List_of_natural_language_processing_toolkits
http://research.microsoft.com/en-us/groups/nlp/
http://sharpnlp.codeplex.com/ (note the part of speech tagger)
If you need 100% accuracy, then I wouldn't write such a tool myself.
However, if the problem domain is limited...
I would start by throwing out conjunctions, prepositions, contractions, state-of-being verbs, etc. This is a fairly short list in English (and looks very similar to the stopwords which #HappyTimeGopher suggested).
After that, you could create a dictionary (as an indexed structure, of course) of all acceptable nouns and adjectives and compare each word in the raw phrases to that. Anything which didn't occur in the dictionary and occur in the correct sequence could be thrown out or ranked lower.
This could be useful if you were given 100 input values and wanted to select the best 5. Finding the values in the dictionary would mean that it's likely the word/phrase was good.
I've auto-generated such a dictionary before by building a raw index from thousands of documents pertaining to a vertical industry. I then spent a few hours with SQL and Excel stripping out problems easily spotted by a human. The resulting list wasn't perfect but it eliminated most of the blatantly dumb/pointless terminology.
As you may have guessed, none of this is foolproof, although checking adjective-to-noun sequence would help somewhat. Consider the case of "Greatest Hits" versus "Car Hits [Wall]".
Proper nouns (e.g. person names) don't work well with the dictionary approach, since it's probably not feasible to build a dictionary of all variations of given/surnames.
To summarize:
use a list of stopwords
generate a dictionary of words, classifying them with a part of speech(s)
run raw phrases through dictionary and stopwords
(optional) rank on how confident you are on a match
if needed, accept phrases which didn't violate known patterns (this would handle many proper nouns)

If you've access to the text these phrases were generated from, it may be easier to just create your own topic tags.
Failing that, I'd probably just remove anything that contained a stop word. See this list, for example:
http://www.ranks.nl/resources/stopwords.html
I wouldn't break out POS tagging or anything stronger for this.

It seems you could create a list that filters out three things:
Prepositions: https://en.wikipedia.org/wiki/List_of_English_prepositions
Conjunctions: https://en.wikipedia.org/wiki/Conjunction_(grammar)
Verb forms of to-be: http://www.englishplus.com/grammar/00000040.htm
If you filter on these things you'd get pretty far. Are you more concerned with false negatives or positives? If false negatives aren't a huge problem, this is how I would approach it.

A string searching algorithm to quickly match an abbreviation in a large list of unabbreviated strings?

I am having a lot of trouble finding a string matching algorithm that fits my requirements.
I have a very large database of strings in an unabbreviated form that need to be matched to an arbitrary abbreviation. A string that is an actual substring with no letters between its characters should also match, and with a higher score.
Example: if the word to be matched within was "download" and I searched "down", "ownl", and then "dl", I would get the highest matching score for "down", followed by "ownl" and then "dl".
The algorithm would have to be optimized for speed and a large number of strings to be searched through, and should allow me to pull back a list of matching items strings (if I had added both "download" and "upload" to the database, searching "load" should return both). Memory is still important, but not as important as speed.
Any ideas? I've done a bunch of research on some of these algorithms but I haven't found any that even touch abbreviations, let alone with all these conditions!

I'd wonder if Peter Norvig's spell checker could be adapted in some way for this problem.
It's a stretch that I haven't begun to work out, but it's such an elegant solution that it's worth knowing about.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string