Item description keyword extraction - nlp

I'm playing around with a recommendation system that takes key descriptive words and phrases and matches them against others. Specifically, I'm focusing on flavors in beer, with an algorithm searching for things like malty or medium bitterness, pulling those out, and then comparing against other beers to come up with flavor recommendations.
Currently, I'm struggling with the extraction. What are some techniques for identifying words and standardizing them for later processing?
How do I pull out hoppy and hops and treat them as the same word, but also keeping in mind that very hoppy and not enough hops have different meanings that are modified by the preceding word(s)? I believe I can use stemming for things like plurals and suffixed/prefixed words, but what about pairs or more complicated patterns? What techniques exist for this?

I would first ignore the finer-grained distinctions and compile a list of lexico-semantic patterns, which can be used to extract some information structure, for example:
<foodstuff> has a <taste-description> taste
<foodstuff> tastes <taste-description>
very <taste-description>
not enough <taste-description>
You can use instances of such patterns in your text to infer useful concepts (such as different taste descriptions) which can then be used again in order to bootstrap the extraction of new patterns and thus new concepts.
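As a rough illustration of the first extraction pass, the patterns above can be written as regular expressions. This is only a sketch: the `PATTERNS` list, the `extract_tastes` helper and the sample review text are made up for the example, and each pattern captures a single word rather than a full phrase.

```python
import re
from collections import Counter

# Hypothetical seed patterns modelled on the templates above; each one
# captures a single taste word in the named group "taste".
PATTERNS = [
    re.compile(r"\bhas an? (?P<taste>\w+) taste\b"),
    re.compile(r"\btastes (?P<taste>\w+)\b"),
    re.compile(r"\bvery (?P<taste>\w+)\b"),
    re.compile(r"\bnot enough (?P<taste>\w+)\b"),
]


def extract_tastes(text):
    """Count every word captured by any of the seed patterns."""
    found = Counter()
    lowered = text.lower()
    for pattern in PATTERNS:
        for match in pattern.finditer(lowered):
            found[match.group("taste")] += 1
    return found


# Made-up review text, for illustration only.
reviews = (
    "This stout has a malty taste. The IPA tastes hoppy, very hoppy. "
    "Not enough hops in the lager."
)
tastes = extract_tastes(reviews)
```

The words collected this way (here "malty", "hoppy", "hops") are the concepts you would then search for in new contexts to discover further patterns.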


semantic similarity for mix of languages

I have a database of several thousand utterances. Each record (utterance) is a text representing a problem description, which a user has submitted to a service desk. Sometimes the service desk agent's response is also included. The language is highly technical, and it contains three types of tokens:
words and phrases in Language 1 (e.g. English)
words and phrases in Language 2 (e.g. French, Norwegian, or Italian)
machine-generated output (e.g. listing of files using unix command ls -la)
These languages are densely mixed. I often see that in one conversation, a sentence in Language 1 is followed by Language 2. So it is impossible to divide the data into two separate sets, corresponding to utterances in two languages.
The task is to find similarities between the records (problem descriptions). The purpose of this exercise is to understand whether some bugs submitted by users are similar to each other.
Q: What is the standard way to proceed in such a situation?
In particular, the problem lies in the fact that the words come from two different corpora (corpuses), while in addition, some technical words (like filenames, OS paths, or application names) will not be found in any.
I don't think there's a "standard way" - just things you could try.
You could look into word embeddings that are aligned between languages, so that similar words across multiple languages have similar vectors. Then ways of building a summary vector for a text based on word vectors (like a simple average of all a text's words' vectors), or pairwise comparisons based on word vectors (like "Word Mover's Distance"), may still work with mixed-language texts (even mixes of languages within one text).
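To make the "simple average" idea concrete, here is a minimal sketch. The three-dimensional `EMBEDDINGS` table is invented for the example and merely stands in for real aligned embeddings; tokens that have no vector (file names, paths and so on) are simply skipped.

```python
import math

# Hypothetical aligned embeddings: translations of the same word get
# nearby vectors. Real aligned embeddings have hundreds of dimensions.
EMBEDDINGS = {
    "error": [0.90, 0.10, 0.00],
    "erreur": [0.88, 0.12, 0.02],
    "disk": [0.10, 0.90, 0.10],
    "disque": [0.12, 0.88, 0.08],
    "login": [0.00, 0.10, 0.90],
    "connexion": [0.05, 0.10, 0.85],
}


def text_vector(text):
    """Average the vectors of all known tokens; None if no token is known."""
    vectors = [EMBEDDINGS[t] for t in text.lower().split() if t in EMBEDDINGS]
    if not vectors:
        return None
    dims = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dims)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# An English and a French description of the same problem score high,
# an unrelated description scores low.
sim_same = cosine(text_vector("disk error"), text_vector("erreur disque"))
sim_diff = cosine(text_vector("disk error"), text_vector("login"))
```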
That a single text, presumably about a single (or closely related) set of issues, has mixed language may be a blessing rather than a curse: some classifiers/embeddings you train from such texts might then be able to learn the cross-language correlations of words with shared topics. But also, you could consider enhancing your texts with extra synthetic auto-translated text, for any monolingual ranges, to ensure downstream embeddings/comparisons get closer to your ideal of language-obliviousness.
Thank you for the suggestions. After several experiments I developed a method which is simple and works pretty well. Rather than using existing corpora, I created my own corpus based on all the utterances available in my multilingual database, without translating them. The database has 130,000 utterances, comprising 3.5 million words (in three languages: English, French and Norwegian) and 150,000 unique words. The phrase similarity based on the meaning space constructed this way works surprisingly well. I have tested this method in production and the results are good. I also see a lot of room for improvement, and will continue to polish it. I also wrote this article, An approach to categorize multi-lingual phrases, describing all the steps in more detail. Criticism or improvements welcome.

How to determine if a piece of text mentions a product

I'm new to natural language processing so I apologize if my question is unclear. I have read a book or two on the subject and done general research of various libraries to figure out how I should be doing this, but I'm not confident yet that I know what to do.
I'm playing with an idea for an application and part of it is trying to find product mentions in unstructured text (e.g. tweets, Facebook posts, emails, websites, etc.) in real-time. I won't go into what the products are but it can be assumed that they are known (stored in a file or database). Some examples:
"starting tomorrow, we have 5 boxes of #hersheys snickers available for $5 each - limit 1 pp" (snickers is the product from the hershey company [mentioned as "#hersheys"])
"Big news: 12-oz. bottles of Coke and Pepsi on sale starting Fri." (coca-cola is the product [aliased as "coke"] from coca-cola company and Pepsi is the product from the PepsiCo company)
"#OMG, i just bought my dream car. a mustang!!!!" (mustang is the product from Ford)
So basically, given a piece of text, query the text to see if it mentions a product and receive some indication (boolean or confidence number) that it does mention the product.
Some concerns I have are:
Missing products because of misspellings. I thought maybe I could use a string similarity check to catch these.
Product names that are also English words or things would get caught, like mustang the horse versus mustang the car.
Needing to keep a list of alternative names for products (e.g. "coke" for "coca-cola", etc.)
I don't really know where to start with this but any help would be appreciated. I've already looked at NLTK and SciKit and didn't really glean how to do this from there. If you know of examples or papers that explain this, links would be helpful. I'm not tied to any language at this point. Java preferably but Python and Scala are acceptable.
The answer that you chose is not really answering your question.
The best approach you can take is using a Named Entity Recognizer (NER) and a POS tagger (grab NNP/NNPS, i.e. proper nouns). The database there might be missing some new brands like Lyft (Uber's rival) but without developing your own prop database, Stanford tagger will solve half of your immediate needs.
If you have time, I would build a dictionary that has every brand name and simply extract matches from the tweet strings.
If you know how to crawl, it's not a hard problem to solve.
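The dictionary approach can be sketched in a few lines. The `ALIASES` table below is a hand-written stand-in for a real (crawled) brand dictionary and `find_products` is a hypothetical helper; note that plain lookup cannot tell mustang the car from mustang the horse.

```python
import re

# Hypothetical alias table: every known surface form maps to one
# canonical product name.
ALIASES = {
    "coke": "coca-cola",
    "coca-cola": "coca-cola",
    "pepsi": "pepsi",
    "snickers": "snickers",
    "mustang": "mustang",
}


def find_products(text):
    """Return canonical product names mentioned in text, in order, deduplicated."""
    # Lower-case and keep alphanumeric tokens (hyphens allowed inside a
    # token), which also strips the leading '#' of hashtags.
    tokens = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())
    found = []
    for token in tokens:
        product = ALIASES.get(token)
        if product and product not in found:
            found.append(product)
    return found


mentions = find_products(
    "Big news: 12-oz. bottles of Coke and Pepsi on sale starting Fri."
)
```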
It looks like your goal is to classify linguistic forms in a given text as references to semantic entities (which can be referred to by many different linguistic forms). You describe a number of subtasks which should be done in order to get good results, but they nevertheless are still independent tasks.
In order to deal with potential misspellings of words, you need to associate these possible misspellings to their canonical (i.e. correct) form.
Phonetic similarity: A common cause of "misspellings" is opacity in the relationship between a word's phonetic form (i.e. how it sounds) and its orthographic form (i.e. how it's spelled). Therefore, a good way to address this is to index terms phonetically so that e.g. innovashun is associated with innovation.
Form similarity: Additionally, you could do a string similarity check, but you may introduce a lot of noise into your results which you would have to address because many distinct words are in fact very similar (e.g. chic vs. chick). You could make this a bit smarter by first morphologically analyzing the word and then using a tree kernel instead.
Hand-made mappings: You can also simply make a list of common misspelling → canonical_form mappings. This would work well for "exceptions" not handled by the above methods.
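The three strategies above can be chained into one lookup. The sketch below is only illustrative: `REWRITES` is a deliberately tiny, made-up set of sound-alike rules (not Soundex or Metaphone), and `CANONICAL`, `MANUAL` and the 0.8 cutoff are assumptions chosen for the example. The string-similarity step uses `difflib` from the Python standard library.

```python
import re
from difflib import get_close_matches

# Hypothetical vocabulary and hand-made exception list.
CANONICAL = ["innovation", "snickers", "mustang", "coca-cola"]
MANUAL = {"coke": "coca-cola"}

# Toy phonetic rewrite rules; a real system would use a proper
# phonetic algorithm instead.
REWRITES = [
    (r"tion\b", "shun"),   # "-tion" sounds like "-shun"
    (r"ph", "f"),
    (r"ck", "k"),
    (r"(.)\1+", r"\1"),    # collapse doubled letters
]


def phonetic_key(word):
    """Reduce a word to a crude sound-alike key."""
    key = word.lower()
    for pattern, replacement in REWRITES:
        key = re.sub(pattern, replacement, key)
    return key


PHONETIC_INDEX = {phonetic_key(name): name for name in CANONICAL}


def normalize(word):
    """Map a possibly misspelled word to its canonical form, or None."""
    word = word.lower()
    if word in MANUAL:                       # hand-made mappings
        return MANUAL[word]
    by_sound = PHONETIC_INDEX.get(phonetic_key(word))
    if by_sound:                             # phonetic similarity
        return by_sound
    close = get_close_matches(word, CANONICAL, n=1, cutoff=0.8)
    return close[0] if close else None       # form similarity
```

A lower cutoff catches more misspellings but also lets in more of the noise mentioned above (chic vs. chick), so it needs tuning on real data.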
Word-sense disambiguation
Mustang the car and Mustang the horse are the same form but refer to entirely different entities (or rather classes of entities, if you want to be pedantic). In fact, we ourselves as humans can't tell which one is meant unless we also know the word's context. One widely-used way of modelling this context is distributional lexical semantics: Defining a word's semantic similarity to another as the similarity of their lexical contexts, i.e. the words preceding and succeeding them in text.
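A very small sketch of using lexical context to pick a sense: each sense gets a set of words that typically surround it, and the sense whose set overlaps most with the words around the target wins. The `SENSE_CONTEXTS` sets are hand-written assumptions here; in a distributional approach they would be derived from co-occurrence counts in a corpus.

```python
import re

# Hypothetical context words per sense; in practice these would be
# learned from the words that co-occur with each sense in a corpus.
SENSE_CONTEXTS = {
    "mustang (car)": {"ford", "car", "drive", "bought", "engine", "dealer"},
    "mustang (horse)": {"horse", "wild", "herd", "ride", "ranch", "gallop"},
}


def disambiguate(text, target="mustang", window=5):
    """Pick the sense whose context words overlap most with the text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if target not in tokens:
        return None
    i = tokens.index(target)
    # Words preceding and succeeding the target, within the window.
    context = set(tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window])
    scores = {
        sense: len(context & words) for sense, words in SENSE_CONTEXTS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

For example, `disambiguate("#OMG, i just bought my dream car. a mustang!!!!")` picks the car sense because "bought" and "car" appear in the window.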
Linguistic aliases (synonyms)
As stated above, any given semantic entity can be referred to in a number of different ways: bathroom, washroom, restroom, toilet, water closet, WC, loo, little boys'/girls' room, throne room etc. For simple meanings referring to generic entities like this, they can often be considered to be variant spellings in the same way that "common misspellings" are and can be mapped to a "canonical" form with a list. For ambiguous references such as throne room, other metrics (such as lexical-distributional methods) can also be included in order to disambiguate the meaning, so that you don't relate e.g. I'm in the throne room just now! to The throne room of the Buckingham Palace is beautiful.
You have a lot of work to do in order to get where you want to go, but it's all interesting stuff and there are already good libraries available for doing most of these tasks.

Stemmers vs Lemmatizers

Natural Language Processing (NLP), especially for English, has evolved to the stage where stemming would become an archaic technology if "perfect" lemmatizers existed. That is because stemmers change the surface form of a word/token into some meaningless stem.
Then again, the definition of the "perfect" lemmatizer is questionable because different NLP tasks require different levels of lemmatization, e.g. converting words between verb/noun/adjective forms.
[in]: having
[out]: hav
[in]: having
[out]: have
So the question is, are English stemmers any useful at all today? Since we have a plethora of lemmatization tools for English
If not, then how should we move on to build robust lemmatizers that can take on nounify, verbify, adjectify and adverbify preprocesses?
How could the lemmatization task be easily scaled to other languages that have similar morphological structures as English?
Q1: "[..] are English stemmers any useful at all today? Since we have a plethora of lemmatization tools for English"
Yes. Stemmers are much simpler, smaller, and usually faster than lemmatizers, and for many applications, their results are good enough. Using a lemmatizer for that is a waste of resources. Consider, for example, dimensionality reduction in Information Retrieval. You replace all drive/driving with driv in both the searched documents and the query. You do not care if it is drive or driv or x17a$ as long as it clusters inflectionally related words together.
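To illustrate the Information Retrieval point, here is a toy inverted index whose keys are stems. The `stem` function is a crude suffix stripper written for this example (not Porter or any published algorithm); the point is only that the same function is applied to documents and query, so the stem itself never needs to be a real word.

```python
from collections import defaultdict

# Toy suffix list, checked in order.
SUFFIXES = ("ing", "es", "ed", "s", "e")


def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_index(docs):
    """Map each stem to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[stem(token)].add(doc_id)
    return index


def search(index, query):
    """Look up a single-word query by its stem."""
    return index.get(stem(query), set())


docs = {1: "she drives to work", 2: "driving lessons", 3: "a long walk"}
index = build_index(docs)
hits = search(index, "drive")  # matches documents 1 and 2 via "driv"
```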
Q2: "[..]how should we move on to build robust lemmatizers that can take on nounify, verbify, adjectify, and adverbify preprocesses?"
What is your definition of a lemma, does it include derivation (drive - driver) or only inflection (drive - drives - drove)? Does it take into account semantics?
If you want to include derivation (which most people would say includes verbing nouns etc.) then keep in mind that derivation is far more irregular than inflection. There are many idiosyncrasies, gaps, etc. Do you really want to change (as in change trains) and change (as in coins) to have the same lemma? If not, where do you draw the boundary? How about nerve - unnerve, earth - unearth - earthling, ... It really depends on the application.
If you take into account semantics (bank would be labeled as bank-money or bank-river depending on context), how deep do you go (do you distinguish bank-institution from bank-building)? Some apps may not care about this at all, some might want to distinguish basic semantics, and some might want it fine-grained.
Q3: "How could the lemmatization task be easily scaled to other languages that have similar morphological structures as English?"
What do you mean by "similar morphological structures as English"? English has very little inflectional morphology. There are good lemmatizers for languages of other morphological types (truly inflectional, agglutinative, template, ...).
With the possible exception of agglutinative languages, I would argue that a lookup table (say a compressed trie) is the best solution, possibly with some backup rules for unknown words such as proper names. The lookup is followed by some kind of disambiguation, ranging from the trivial (take the first candidate, or take the first one consistent with the word's POS tag) to the much more sophisticated. The more sophisticated disambiguations are usually supervised stochastic algorithms (e.g. TreeTagger or Faster), although a combination of machine learning and manually created rules has been done too (see e.g. this).
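A minimal sketch of the lookup-plus-trivial-disambiguation idea. A plain dict stands in for the compressed trie, the `LOOKUP` entries and tag names are invented for the example, and the backup rule for unknown words is the simplest one possible (return the word unchanged).

```python
# Hypothetical full-form table: surface form -> candidate (lemma, POS) pairs.
LOOKUP = {
    "having": [("have", "VERB")],
    "has": [("have", "VERB")],
    "drove": [("drive", "VERB"), ("drove", "NOUN")],
    "stripes": [("stripe", "NOUN"), ("strip", "VERB")],
}


def lemmatize(word, pos=None):
    """Look the word up; disambiguate by POS tag if one is given."""
    candidates = LOOKUP.get(word.lower())
    if not candidates:
        # Backup rule for unknown words (e.g. proper names): leave as-is.
        return word
    if pos is not None:
        for lemma, tag in candidates:
            if tag == pos:
                return lemma  # first candidate consistent with the POS tag
    return candidates[0][0]   # trivial fallback: take the first candidate
```

For example, `lemmatize("stripes", "VERB")` gives "strip", while `lemmatize("stripes")` falls back to the first candidate, "stripe".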
Obviously, for most languages, you do not want to create the lookup table by hand, but instead generate it from a description of the morphology of that language. For inflectional languages, you can go the engineering way of Hajic for Czech or Mikheev for Russian, or, if you are daring, you use two-level morphology. Or you can do something in between, such as Hana (myself). (Note that these are all full morphological analyzers that include lemmatization as one of their features.) Or you can learn the lemmatizer in an unsupervised manner a la Yarowsky and Wicentowski, possibly with manual post-processing, correcting the most frequent words.
There are way too many options and it really all depends on what you want to do with the results.
One classical application of either stemming or lemmatization is the improvement of search engine results: By applying stemming (or lemmatization) to the query as well as (prior to indexing) to all tokens indexed, users searching for, say, "having" are able to find results containing "has".
(Arguably, verbs are somewhat uncommon in most search queries, but the same principle applies to nouns, especially in languages with a rich noun morphology.)
For the purpose of search result improvement, it is not actually important whether the stem (or lemma) is meaningful ("have") or not ("hav"). It only needs to be able to represent the word in question, and all its inflectional forms. In fact, some systems use numbers or other kinds of id-strings instead of either stem or lemma (or base form or whatever it may be called).
Hence, this is an example of an application where stemmers (by your definition) are as good as lemmatizers.
However, I am not quite convinced that your (implied) definitions of "stemmer" and "lemmatizer" are generally accepted. I am not sure if there is any generally accepted definition of these terms, but the way I define them is as follows:
Stemmer: A function that reduces inflectional forms to stems or base forms, using rules and lists of known suffixes.
Lemmatizer: A function that performs the same reduction, but using a comprehensive full-form dictionary to be able to deal with irregular forms.
Based on these definitions, a lemmatizer is essentially a higher-quality (and more expensive) version of a stemmer.
The answer is highly dependent on the task or specific field of study within the Natural Language Processing (NLP) that we are talking about.
It is worth pointing out that it has been shown that in some specific tasks, like Sentiment Analysis (a popular sub-field of NLP), using a Stemmer or Lemmatizer as a feature in the development of a system (training a machine learning model) does not have a noticeable effect on the accuracy of the model, no matter how good the tool is. Even though it makes the performance a little bit better, there are more important features, like Dependency parsing, that have considerable potential to be worked on in such systems.
It is important to mention that the characteristics of the language we are working on should also be taken into consideration.
Stemming just removes or stems the last few characters of a word, often leading to incorrect meanings and spelling. Lemmatization considers the context and converts the word to its meaningful base form, which is called Lemma. Sometimes, the same word can have multiple different Lemmas. We should identify the Part of Speech (POS) tag for the word in that specific context. Here are the examples to illustrate all the differences and use cases:
If you lemmatize the word 'Caring', it would return 'Care'. If you stem it with a crude suffix-stripping stemmer, it would return 'Car', which is erroneous.
If you lemmatize the word 'Stripes' in verb context, it would return 'Strip'. If you lemmatize it in noun context, it would return 'Stripe'. If you just stem it, it would just return 'Strip'.
You would get the same results whether you lemmatize or stem words such as walking, running, swimming... to walk, run, swim etc.
Lemmatization is computationally expensive since it involves look-up tables and the like. If you have a large dataset and performance is an issue, go with Stemming. Remember you can also add your own rules to Stemming. If accuracy is paramount and the dataset isn't humongous, go with Lemmatization.

Given a list of dozens of words, how do I find the best matching sections from a corpus of hundreds of texts?

Let’s say I have a list of 250 words, which may consist of unique entries throughout, or a bunch of words in all their grammatical forms, or all sorts of words in a particular grammatical form (e.g. all in the past tense). I also have a corpus of text that has conveniently been split up into a database of sections, perhaps 150 words each (maybe I would like to determine these sections dynamically in the future, but I shall leave it for now).
My question is this: What is a useful way to get those sections out of the corpus that contain most of my 250 words?
I have looked at a few full text search engines like Lucene, but am not sure they are built to handle long query lists. Bloom filters seem interesting as well. I feel most comfortable in Perl, but if there is something fancy in Ruby or Python, I am happy to learn. Performance is not an issue at this point.
The use case of such a program is in language teaching, where it would be nice to have a variety of word lists that mirror the different extents of learner knowledge, and to quickly find fitting bits of text or examples from original sources. Also, I am just curious to know how to do this.
Effectively what I am looking for is document comparison. I have found a way to rank texts by similarity to a given document, in PostgreSQL.
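For the question as originally posed, a simple coverage ranking may already be enough at this scale (a few hundred sections, performance not an issue). The sketch below counts how many distinct list words occur in each section; `rank_sections` and the sample data are made up for the example, and words are matched exactly, so grammatical variants would need stemming or lemmatizing first.

```python
import re


def rank_sections(word_list, sections, top_n=3):
    """Rank sections by how many distinct list words they contain."""
    wanted = {w.lower() for w in word_list}
    ranked = []
    for section_id, text in sections.items():
        present = wanted & set(re.findall(r"\w+", text.lower()))
        ranked.append((len(present), section_id))
    # Highest coverage first; ties broken by section id for stable output.
    ranked.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(section_id, count) for count, section_id in ranked[:top_n]]


# Made-up word list and sections, for illustration only.
words = ["walked", "talked", "jumped"]
sections = {
    "a": "He walked and talked.",
    "b": "She jumped.",
    "c": "Nothing here.",
}
best = rank_sections(words, sections)
```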

Filtering out meaningless phrases

I have an algorithm (that I can't change) that outputs a list of phrases. These phrases are intended to be "topics". However, some of them are meaningless on their own. Take this list:
is the fear
are more likely to
first sight
an hour of
sue apple
depression and
How can I filter out those phrases that don't make sense on their own, to leave a list like the following?
first sight
sue apple
This will be applied to sets of phrases in many languages, but English is the priority.
It's got to be grammatically acceptable in that it can't rely on other words in the original sentence that it was extracted from; e.g. it can't end in 'and'.
Although this is still an underspecified question, it sounds like you want some kind of grammar checker. I suggest you try applying a part-of-speech tagger to each phrase, compile a list of patterns of POS tags that are acceptable (e.g. anything that ends in a preposition would be unacceptable) and use that to filter your input.
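A minimal sketch of that filter, using the example phrases from the question. To keep it self-contained, `TOY_TAGS` is a hand-written lexicon standing in for a real part-of-speech tagger, and the tag names and the `BAD_FIRST`/`BAD_LAST` sets are assumptions chosen for the example.

```python
# Hypothetical mini-lexicon standing in for a real POS tagger.
TOY_TAGS = {
    "is": "BE", "are": "BE",
    "the": "DET", "an": "DET",
    "of": "ADP", "to": "PART", "and": "CONJ",
    "more": "ADV", "likely": "ADJ", "first": "ADJ",
    "fear": "NOUN", "sight": "NOUN", "hour": "NOUN",
    "apple": "NOUN", "depression": "NOUN",
    "sue": "VERB",
}

# Tag patterns judged unacceptable: phrases may not start or end with these.
BAD_FIRST = {"BE", "CONJ", "ADP"}
BAD_LAST = {"ADP", "CONJ", "DET", "PART", "BE"}


def tag(phrase):
    """Tag each word; unknown words are assumed to be nouns."""
    return [TOY_TAGS.get(word, "NOUN") for word in phrase.lower().split()]


def is_standalone(phrase):
    """True if the phrase neither starts nor ends with a dangling tag."""
    tags = tag(phrase)
    return bool(tags) and tags[0] not in BAD_FIRST and tags[-1] not in BAD_LAST


phrases = [
    "is the fear", "are more likely to", "first sight",
    "an hour of", "sue apple", "depression and",
]
kept = [p for p in phrases if is_standalone(p)]
```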
At a high level, it seems that phrases which were only nouns or adjective-noun combos would give much better results.
"Blue Shirt"
"Happy People"
First of all, this problem can be as complex as you want it to be. For third-party reading/solutions, I came across: (note the part of speech tagger)
If you need 100% accuracy, then I wouldn't write such a tool myself.
However, if the problem domain is limited...
I would start by throwing out conjunctions, prepositions, contractions, state-of-being verbs, etc. This is a fairly short list in English (and looks very similar to the stopwords which #HappyTimeGopher suggested).
After that, you could create a dictionary (as an indexed structure, of course) of all acceptable nouns and adjectives and compare each word in the raw phrases to that. Anything which didn't occur in the dictionary, or didn't occur in the correct sequence, could be thrown out or ranked lower.
This could be useful if you were given 100 input values and wanted to select the best 5. Finding the values in the dictionary would mean that it's likely the word/phrase was good.
I've auto-generated such a dictionary before by building a raw index from thousands of documents pertaining to a vertical industry. I then spent a few hours with SQL and Excel stripping out problems easily spotted by a human. The resulting list wasn't perfect but it eliminated most of the blatantly dumb/pointless terminology.
As you may have guessed, none of this is foolproof, although checking adjective-to-noun sequence would help somewhat. Consider the case of "Greatest Hits" versus "Car Hits [Wall]".
Proper nouns (e.g. person names) don't work well with the dictionary approach, since it's probably not feasible to build a dictionary of all variations of given/surnames.
To summarize:
use a list of stopwords
generate a dictionary of words, classifying each by its part(s) of speech
run raw phrases through dictionary and stopwords
(optional) rank by how confident you are in a match
if needed, accept phrases which didn't violate known patterns (this would handle many proper nouns)
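The steps above could be sketched roughly as follows. The `STOPWORDS` and `DICTIONARY` contents and the scoring weights (full credit for known words, half credit for unknown ones, a penalty when the last word cannot be a noun) are assumptions made up for the example, not tuned values.

```python
# Hypothetical stopword list and part-of-speech dictionary.
STOPWORDS = {"a", "an", "and", "are", "is", "of", "the", "to"}
DICTIONARY = {
    "blue": {"ADJ"}, "shirt": {"NOUN"},
    "greatest": {"ADJ"}, "hits": {"NOUN", "VERB"},
    "first": {"ADJ"}, "sight": {"NOUN"},
}


def score(phrase):
    """Confidence in [0, 1] that the phrase is a usable topic."""
    words = phrase.lower().split()
    if not words or any(w in STOPWORDS for w in words):
        return 0.0
    # Known words count fully; unknown words (often proper nouns) count half.
    base = sum(1.0 if w in DICTIONARY else 0.5 for w in words) / len(words)
    last = words[-1]
    if last in DICTIONARY and "NOUN" not in DICTIONARY[last]:
        base *= 0.5  # known word in the wrong position, e.g. "shirt blue"
    return base


def rank(phrases, keep=5):
    """Drop zero-score phrases and return the best `keep`, highest first."""
    scored = [(score(p), p) for p in phrases]
    kept = [(s, p) for s, p in scored if s > 0]
    kept.sort(key=lambda pair: -pair[0])  # stable: ties keep input order
    return [p for _, p in kept[:keep]]
```

With these weights, "blue shirt" scores 1.0, an unknown name such as "john smith" scores 0.5 and is kept but ranked lower, and "depression and" is dropped.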
If you've access to the text these phrases were generated from, it may be easier to just create your own topic tags.
Failing that, I'd probably just remove anything that contained a stop word. See this list, for example:
I wouldn't break out POS tagging or anything stronger for this.
It seems you could create a list that filters out three things:
Verb forms of to-be:
If you filter on these things you'd get pretty far. Are you more concerned with false negatives or positives? If false negatives aren't a huge problem, this is how I would approach it.
