semantic similarity for mix of languages - nlp

I have a database of several thousands of utterances. Each record (utterance) is a text representing a problem description, which a user has submitted to a service desk. Sometimes also the service desk agent's response is included. The language is highly technical, and it contains three types of tokens:
words and phrases in Language 1 (e.g. English)
words and phrases in Language 2 (e.g. French, Norwegian, or Italian)
machine-generated output (e.g. listing of files using unix command ls -la)
These languages are densely mixed. I often see that in one conversation, a sentence in Language 1 is followed by Language 2. So it is impossible to divide the data into two separate sets, corresponding to utterances in two languages.
The task is to find similarities between the records (problem descriptions). The purpose of this exercise is to understand whether some bugs submitted by users are similar to each other.
Q: What is the standard way to proceed in such a situation?
In particular, the problem lies in the fact that the words come from two different corpora (corpuses), while in addition, some technical words (like filenames, OS paths, or application names) will not be found in any.

I don't think there's a "standard way" - just things you could try.
You could look into word-embeddings that are aligned between langauges – so that similar words across multiple languages have similar vectors. Then ways of building a summary vector for a text based on word-vectors (like a simple average of all a text's words' vectors), or pairwise comparisons based on word vectors (like "Word Mover's Distance"), may still work with mixed-language texts (even mixes of languages within one text).
That a single text, presumably about a a single (or closely related) set of issues, has mixed language may be a blessing rather than a curse: some classifiers/embeddings you train from such texts might then be able to learn the cross-language correlations of words with shared topics. But also, you could consider enhancing your texts with extra synthetic auto-translated text, for any monolingual ranges, to ensure downstream embeddings/comparisons get closer to your ideal of language-obliviousness.

Thank you for the suggestions. After several experiments I developed a method which is simple and works pretty well. Rather than using existing corpora, I created my own corpus based on all the utterances available in my multilingual database. Without translating them. The database has 130,000 utterances, including 3,5 million of words (in three languages: English, French and Norwegian) and 150,000 unique words. The phrase similarity based on the meaning space constructed this way works surprisingly well. I have tested this method on production and the results are good. I also see a lot of space for improvement, and will continue to polish it. I also wrote this article An approach to categorize multi-lingual phrases, describing all the steps in more detail. Critics or improvements welcome.

Related

Techniques other than RegEx to discover 'intent' in sentences

I'm embarking on a project for a non-profit organization to help process and classify 1000's of reports annually from their field workers / contractors the world over. I'm relatively new to NLP and as such wanted to seek the group's guidance on the approach to solve our problem.
I'll highlight the current process, and our challenges and would love your help on the best way to solve our problem.
Current process: Field officers submit reports from locally run projects in the form of best practices. These reports are then processed by a full-time team of curators who (i) ensure they adhere to a best-practice template and (ii) edit the documents to improve language/style/grammar.
Challenge: As the number of field workers increased the volume of reports being generated has grown and our editors are now becoming the bottle-neck.
Solution: We would like to automate the 1st step of our process i.e., checking the document for compliance to the organizational best practice template
Basically, we need to ensure every report has 3 components namely:
1. States its purpose: What topic / problem does this best practice address?
2. Identifies Audience: Who is this for?
3. Highlights Relevance: What can the reader do after reading it?
Here's an example of a good report submission.
"This document introduces techniques for successfully applying best practices across developing countries. This study is intended to help low-income farmers identify a set of best practices for pricing agricultural products in places where there is no price transparency. By implementing these processes, farmers will be able to get better prices for their produce and raise their household incomes."
As of now, our approach has been to use RegEx and check for keywords. i.e., to check for compliance we use the following logic:
1 To check "states purpose" = we do a regex to match 'purpose', 'intent'
2 To check "identifies audience" = we do a regex to match with 'identifies', 'is for'
3 To check "highlights relevance" = we do a regex to match with 'able to', 'allows', 'enables'
The current approach of RegEx seems very primitive and limited so I wanted to ask the community if there is a better way to solving this problem using something like NLTK, CoreNLP.
Thanks in advance.
Interesting problem, i believe its a thorough research problem! In natural language processing, there are few techniques that learn and extract template from text and then can use them as gold annotation to identify whether a document follows the template structure. Researchers used this kind of system for automatic question answering (extract templates from question and then answer them). But in your case its more difficult as you need to learn the structure from a report. In the light of Natural Language Processing, this is more hard to address your problem (no simple NLP task matches with your problem definition) and you may not need any fancy model (complex) to resolve your problem.
You can start by simple document matching and computing a similarity score. If you have large collection of positive examples (well formatted and specified reports), you can construct a dictionary based on tf-idf weights. Then you can check the presence of the dictionary tokens. You can also think of this problem as a binary classification problem. There are good machine learning classifiers such as svm, logistic regression which works good for text data. You can use python and scikit-learn to build programs quickly and they are pretty easy to use. For text pre-processing, you can use NLTK.
Since the reports will be generated by field workers and there are few questions that will be answered by the reports (you mentioned about 3 specific components), i guess simple keyword matching techniques will be a good start for your research. You can gradually move to different directions based on your observations.
This seems like a perfect scenario to apply some machine learning to your process.
First of all, the data annotation problem is covered. This is usually the most annoying problem. Thankfully, you can rely on the curators. The curators can mark the specific sentences that specify: audience, relevance, purpose.
Train some models to identify these types of clauses. If all the classifiers fire for a certain document, it means that the document is properly formatted.
If errors are encountered, make sure to retrain the models with the specific examples.
If you don't provide yourself hints about the format of the document this is an open problem.
What you can do thought, is ask people writing report to conform to some format for the document like having 3 parts each of which have a pre-defined title like so
1. Purpose
Explains the purpose of the document in several paragraph.
2. Topic / Problem
This address the foobar problem also known as lorem ipsum feeling text.
3. Take away
What can the reader do after reading it?
You parse this document from .doc format for instance and extract the three parts. Then you can go through spell checking, grammar and text complexity algorithm. And finally you can extract for instance Named Entities (cf. Named Entity Recognition) and low TF-IDF words.
I've been trying to do something very similar with clinical trials, where most of the data is again written in natural language.
If you do not care about past data, and have control over what the field officers write, maybe you can have them provide these 3 extra fields in their reports, and you would be done.
Otherwise; CoreNLP and OpenNLP, the libraries that I'm most familiar with, have some tools that can help you with part of the task. For example; if your Regex pattern matches a word that starts with the prefix "inten", the actual word could be "intention", "intended", "intent", "intentionally" etc., and you wouldn't necessarily know if the word is a verb, a noun, an adjective or an adverb. POS taggers and the parsers in these libraries would be able to tell you the type (POS) of the word and maybe you only care about the verbs that start with "inten", or more strictly, the verbs spoken by the 3rd person singular.
CoreNLP has another tool called OpenIE, which attempts to extract relations in a sentence. For example, given the following sentence
Born in a small town, she took the midnight train going anywhere
CoreNLP can extract the triple
she, took, midnight train
Combined with the POS tagger for example; you would also know that "she" is a personal pronoun and "took" is a past tense verb.
These libraries can accomplish many other tasks such as tokenization, sentence splitting, and named entity recognition and it would be up to you to combine all of these tools with your domain knowledge and creativity to come up with a solution that works for your case.

NLP: retrieve vocabulary from text

I have some texts in different languages and, potentially, with some typo or other mistake, and I want to retrieve their own vocabulary. I'm not experienced with NLP in general, so maybe I use some word improperly.
With vocabulary I mean a collection of words of a single language in which every word is unique and the inflections for gender, number, or tense are not considered (e.g. think, thinks and thought are are all consider think).
This is the master problem, so let's reduce it to the vocabulary retrieving of one language, English for example, and without mistakes.
I think there are (at least) three different approaches and maybe the solution consists of a combination of them:
search in a database of words stored in relation with each others. So, I could search for thought (considering the verb) and read the associated information that thought is an inflection of think
compute the "base form" (a word without inflections) of a word by processing the inflected form. Maybe it can be done with stemming?
use a service by any API. Yes, I accept also this approach, but I'd prefer to do it locally
For a first approximation, it's not necessary that the algorithm distinguishes between nouns and verbs. For instance, if in the text there were the word thought like both noun and verb, it could be considered already present in the vocabulary at the second match.
We have reduced the problem to retrieve a vocabulary of an English text without mistakes, and without consider the tag of the words.
Any ideas about how to do that? Or just some tips?
Of course, if you have suggestions about this problem also with the others constraints (mistakes and multi-language, not only Indo-European languages), they would be much appreciated.
You need lemmatization - it's similar to your 2nd item, but not exactly (difference).
Try nltk lemmatizer for Python or Standford NLP/Clear NLP for Java. Actually nltk uses WordNet, so it is really combination of 1st and 2nd approaches.
In order to cope with mistakes use spelling correction before lemmatization. Take a look at related questions or Google for appropriate libs.
About part of speech tag - unfortunately, nltk doesn't consider POS tag (and context in general), so you should provide it with the tag that can be found by nltk pos tagging. Again, it is already discussed here (and related/linked questions). I'm not sure about Stanford NLP here - I guess it should consider context, but I was sure that NLTK does so. As I can see from this code snippet, Stanford doesn't use POS tags, while Clear NLP does.
About other languages - google for lemmatization models, since algorithm for most languages (at least from the same family) is almost the same, differences are in training data. Take a look here for example of German; it is a wrapper for several lemmatizers, as I can see.
However, you always can use stemmer at cost of precision, and stemmer is more easily available for different languages.
Topic Word has become an integral part of the rising debate in the present world. Some people perceive that Topic Word (Synonyms) beneficial, while opponents reject this notion by saying that it leads to numerous problems. From my point of view, Topic Word (Synonyms) has more positive impacts than negative around the globe. This essay will further elaborate on both positive and negative effects of this trend and thus will lead to a plausible conclusion.
On the one hand, there is a myriad of arguments in favour of my belief. The topic has a plethora of merits. The most prominent one is that the Topic Word (Synonyms). According to the research conducted by Western Sydney University, more than 70 percentages of the users were in favour of the benefits provided by the Topic Word (Synonyms). Secondly, Advantage of Essay topic. Thus, it can say that Topic Word (Synonyms) plays a vital role in our lives.
On the flip side, critics may point out that one of the most significant disadvantages of the Topic Word (Synonyms) is that due to Demerits relates to the topic. For instance, a survey conducted in the United States reveals that demerit. Consequently, this example explicit shows that it has various negative impacts on our existence.
As a result, after inspection upon further paragraphs, I profoundly believe that its benefits hold more water instead of drawbacks. Topic Word (Synonyms) has become a crucial part of our life. Therefore, efficient use of Topic Word (Synonyms) method should promote; however, excessive and misuse should condemn.

Item description keyword extraction

I'm playing around with a recommendation system that takes key descriptive words and phrases and matches them against others. Specifically, I'm focusing on flavors in beer, with an algorithm searching for things like malty or medium bitterness, pulling those out, and then comparing against other beers to come up with flavor recommendations.
Currently, I'm struggling with the extraction. What are some techniques for identifying words and standardizing them for later processing?
How do I pull out hoppy and hops and treat them as the same word, but also keeping in mind that very hoppy and not enough hops have different meanings that are modified by the preceding word(s)? I believe I can use stemming for things like plurals and suffixed/prefixed words, but what about pairs or more complicated patterns? What techniques exist for this?
I would first ignore the finer-grained distinctions and compile a list of lexico-semantic patterns, which can be used to extract some information structure-- for example:
<foodstuff> has a <taste-description> taste
<foodstuff> tastes <taste-description>
very <taste-description>
not enough <taste-description>
You can use instances of such patterns in your text to infer useful concepts (such as different taste descriptions) which can then be used again in order to bootstrap the extraction of new patterns and thus new concepts.

What features do NLP practitioners use to pick out English names?

I am trying named entity recognition for the first time. I'm looking for features that will pick out English names. I am using the methods outlined in the coursera nlp course (week three) and the nltk book. In other words: I am defining features, identifying features of words and then running those words/features through a classifier that I train on labeled data.
What features are used to pick out English names?
I can imagine that you'd look for two capital words in a row, or a capital word and then an initial and then a capital word. (ex. John Smith or James P. Smith).
But what other features are used for NER?
Some common features:
Word lists for common names (John, Adam, etc)
casing
contains symbol or numeric characters (names generally don't)
person prefixes (Mr., Mrs., etc...)
person postfixes (Jr., Sr., etc...)
single letter abbreviation (ie, (J.) Smith).
analysis of surrounding words (you may find some words have a high probability of appearing near names).
Named Entities previously recognized (often it is easy to identify NE in some parts of the corpus based on context, but very hard in other parts. If previously identified, this is an excellent hint towards NER)
Depending what language you are working with there may be more language specific features as well. Frankly you can turn up a wealth of information with a simple Google query, I'm really not sure why you haven't turned there. Some starting points however:
Google
A survey of named entity recognition and classification
Named entity recognition without gazetteers
I had done something similar back in school using machine learning. I suppose that you will use a supervised algorithm and you will classify every single word independently and not words in combination. In that case I would choose some features for the word itself like the ones you mentioned (if the word begins with a capital letter, if the word is an abbreviation) but I would add some more features like if the previous or the next words also start from a capital letter, or if they are abbreviations. This way you can add some context and overcome the problems related to your basic independence assumption.
If you want have a look here. In the machine learning section you can find some more information and examples (the problem is slightly different but the method should be similar).
Whatever features you choose it is important that you use some measure to evaluate their relevance and possibly reduce them to the useful ones to avoid over-fitting. One of the measures you can use to evaluate them is the gain ratio but there are many more. Here you can find some basic information about feature extraction.
Hope it helps!

What methods are used for recognizing language a text is written in?

If I have a given text (both long or short), with which methods do you usually detect which language it is written in?
It is clear that:
You need a training corpus to train the models you use (e.g. neural networks, if used)
Easiest thing coming to my mind is:
Check characters used in the text (e.g. hiragana are only used in Japanese, Umlauts probably only in European languages, ç in French, Turkish, …)
Increase the check to two or three letter pairs to find specific combinations of a language
Lookup a dictionary to check which words occur in which language (probably only without stemming, as stemming depends on the language)
But I guess there are better ways to go. I am not searching for existing projects (those questions have already been answered), but for methods like Hidden-Markov-Models, Neural Networks, … whatever may be used for this task.
In product I'm working on we use dictionary-based approach.
First relative probabilities for all words in training corpus are calculated and this is stored as a model.
Then input text is processed word by word to see if particular model gives best match (much better then the other models).
In some cases all models provide quite bad match.
Few interesting points:
As we are working with social media both normalized and non-normalized matches are attempted (in this context normalization is removal of diacritics from symbols). Non-normalized matches have a higher weight
This method works rather bad on very short phrases (1-2 words) in particular when these words are there in few languages, which is the case of few European languages
Also for a better detection we are considering added per-character model as you have described (certain languages have certain unique characters)
Btw, we use ICU library to split words. Works rather good for European and Eastern languages (currently we support Chinese)
Check the Cavnar and Trenkle algorithm.

Resources