Can we compare two different strings for similarity using Google NLP APIs? - google-cloud-nl

Example:
String 1: Help me to track calls.
String 2: Assist me in call tracking.
These two strings have the same meaning but are not identical. Is there any way to find the similarity between strings like these using the Google Natural Language Processing APIs?

The Google Cloud Natural Language API doesn't provide a specific feature for finding the similarity between two strings. Instead, the service offers Content Classification, which you can use to classify each string into categories and then calculate the similarity between the strings based on their resulting classifications. The Content Classification Tutorial explains the process required to perform these tasks.
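As a rough sketch of the second half of that idea, the snippet below compares two classification results with cosine similarity. It assumes each text has already been classified and that the result is available as a plain mapping from category name to confidence; the category names and scores are made up for illustration, and the API call itself is left out. Very short strings like the two in the question may not receive any categories at all, in which case the function returns 0.0.

```python
import math


def category_similarity(cats_a, cats_b):
    """Cosine similarity between two {category: confidence} mappings."""
    dot = sum(conf * cats_b.get(cat, 0.0) for cat, conf in cats_a.items())
    norm_a = math.sqrt(sum(c * c for c in cats_a.values()))
    norm_b = math.sqrt(sum(c * c for c in cats_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # at least one text received no categories
    return dot / (norm_a * norm_b)


# Hypothetical classification results, invented for this example.
text_1 = {"/Business & Industrial": 0.8, "/Internet & Telecom": 0.6}
text_2 = {"/Business & Industrial": 0.6, "/Internet & Telecom": 0.8}

print(category_similarity(text_1, text_2))
```

A score near 1 means the two texts were assigned nearly the same categories with similar confidence; a score of 0 means they share no category.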
If this feature doesn't cover your needs, you can use the Send Feedback button, located at the lower left and upper right corners of the service's public documentation, or use the Issue Tracker tool to raise a Natural Language API feature request and notify Google about the desired functionality.

Related

suggest list of how-to articles based on text content

I have 20,000 messages (combination of email and live chat) between my customer and my support staff. I also have a knowledge base for my product.
Often times, the questions customers ask are quite simple and my support staff simply point them to the right knowledge base article.
What I would like to do, in order to save my support staff time, is to show my staff a list of articles that may likely be relevant based on the initial user's support request. This way they can just copy and paste the link to the help article instead of loading up the knowledge base and searching for the article manually.
I'm wondering what solutions I should investigate.
My current line of thinking is to run analysis on existing data and use a text classification approach:
For each message, see if there is a response with a link to a how-to article
If Yes, extract key phrases (microsoft cognitive services)
TF-IDF?
Treat each how-to as a 'classification' that belongs to sets of key phrases
Use some supervised machine learning (support vector machines, maybe) to predict which 'classification' (aka how-to article) the key phrases extracted from a new support ticket belong to.
Feed new responses back into the set to make the system smarter.
Not sure if I'm over complicating things. Any advice on how this is done would be appreciated.
PS: naive approach of just dumping 'key phrases' into search query of our knowledge base yielded poor results since the content of the help article is often different than how a person phrases their question in an email or live chat.
A simple classifier along the lines of a "spam" classifier might work, except that each FAQ would be a feature as opposed to a single feature classifier of spam, not-spam.
Most spam classifiers start off with a dictionary of words/phrases. You already have a start on this with your naive approach. However, unlike your approach, a spam classifier does much more than a text search. Essentially, in a spam classifier, each word in the customer's email is given a weight, and the sum of the weights indicates whether the message is spam or not-spam. Now, extend this to as many features as FAQs. That is, features like: FAQ1 or not-FAQ1, FAQ2 or not-FAQ2, etc.
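To make the "one spam-style scorer per FAQ" idea concrete, here is a minimal sketch. The FAQ names, the word weights, and the threshold are all hypothetical; in a real system the weights would be learned from tickets your support staff have already answered.

```python
# Hypothetical per-FAQ word weights; in practice these would be learned
# from past tickets labelled by support staff.
FAQ_WEIGHTS = {
    "reset-password": {"password": 2.0, "reset": 1.5, "login": 1.0},
    "update-billing": {"invoice": 2.0, "card": 1.5, "billing": 2.0},
}
THRESHOLD = 2.0  # assumed cut-off for "this FAQ applies"


def score_message(message, weights):
    """Sum the weights of the known words that appear in the message."""
    words = message.lower().split()
    return sum(weights.get(w.strip(".,!?"), 0.0) for w in words)


def suggest_faqs(message):
    """Return (faq, score) pairs at or above the threshold, best first."""
    scores = {faq: score_message(message, w) for faq, w in FAQ_WEIGHTS.items()}
    hits = [(faq, s) for faq, s in scores.items() if s >= THRESHOLD]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


print(suggest_faqs("I forgot my password and cannot login."))
```

Each FAQ is scored independently, so a single message can match several FAQs or none.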
Since your support people can easily identify which of the FAQs an e-mail requires, a supervised learning algorithm would be appropriate. To reduce the impact of any misclassification errors, consider having the application present the support person with the customer's email followed by the computer-generated response, so that all the support person has to do is approve the response or modify it. Modifying a response should result in a new entry in the training set.
Support Vector Machines are one method to implement machine learning. However, you are probably suggesting this solution way too early in the process of first identifying the problem and then getting a simple method to work, as well as possible, before using more sophisticated methods. After all, if a multi-feature spam classifier works why invest more time and money in something else that also works?
Finally, depending on your system, this is something I would like to work on.

How can I use Natural Language Processing to check if a paragraph contains predefined topics?

We have a system that allows users to answer a question as free text and we want to check whether their answer contains any of our predefined topics. These topics will be defined prior to answers being submitted.
We tried to use a method similar to spam detection, but this is only good for determining whether something is true/false, incorrect/correct. We need the response to say which of the predefined topics a piece of text contains. Is there an algorithm that would solve this problem?
You could try using a "bag of words" for feature extraction and a "naive Bayes classifier with a multinomial model" for classification.
This is described in more detail on this page: link.
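A minimal sketch of that combination, written from scratch so it has no dependencies: word counts per topic serve as the bag of words, and a multinomial naive Bayes model with add-one smoothing picks the most likely topic. The training sentences and topic names are invented for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase, split on whitespace, and trim simple punctuation."""
    return [w.strip(".,!?").lower() for w in text.split() if w.strip(".,!?")]


class MultinomialNB:
    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)  # topic -> word -> count
        self.doc_counts = Counter(labels)        # topic -> number of documents
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_topic, best_score = None, float("-inf")
        for topic, counts in self.word_counts.items():
            # log prior + sum of log likelihoods with add-one smoothing
            score = math.log(self.doc_counts[topic] / total_docs)
            denom = sum(counts.values()) + len(self.vocab)
            for word in tokenize(text):
                if word in self.vocab:  # words never seen in training are skipped
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_topic, best_score = topic, score
        return best_topic


# Hypothetical training answers, one predefined topic each.
texts = [
    "the pipe is leaking water",
    "water tap is dripping",
    "the fuse blew and lights are out",
    "power socket sparks",
]
labels = ["plumbing", "plumbing", "electrical", "electrical"]
model = MultinomialNB().fit(texts, labels)

print(model.predict("water is leaking from the tap"))
```

This version returns only the single most likely topic. To report several topics per answer, you could keep the per-topic scores and return every topic above some cut-off instead.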
You could also try explicit semantic analysis (ESA)[1][2]. Given a set of documents that represent concepts (in your case, your topics), you can train a model and then, given any new sentence as input, get a ranked list of the closest concepts "evoked" by that sentence. Of course, this assumes you have a document with some text describing every concept you want to identify (that's why the most common thing to do is to use Wikipedia pages as concepts), but if this is the case you could give it a try.
[1] https://en.wikipedia.org/wiki/Explicit_semantic_analysis
[2] http://www.cs.technion.ac.il/~gabr/papers/ijcai-2007-sim.pdf

Semantic search with NLP and elasticsearch

I am experimenting with elasticsearch as a search server and my task is to build a "semantic" search functionality. From a short text phrase like "I have a burst pipe" the system should infer that the user is searching for a plumber and return all plumbers indexed in elasticsearch.
Can that be done directly in a search server like elasticsearch, or do I have to use a natural language processing (NLP) tool such as Maui Indexer? What is the exact terminology for my task at hand: text classification? The given text is very short, though, as it is a search phrase.
There may be several approaches with different implementation complexity.
The easiest one is to create a list of topics (like plumbing), attach a bag of words to each (like "pipe"), identify the topic of a search request by the majority of its keywords, and search only within that topic (you can add a topic field to your Elasticsearch documents and make it mandatory with + during search).
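A small sketch of this first approach: pick the topic whose keyword set overlaps the query the most, then prefix the query with a mandatory topic clause. The keyword sets are hypothetical, and the output assumes the documents carry a field named topic and that the string is sent as a query-string style search.

```python
# Hypothetical topic -> bag-of-words mapping, maintained by hand.
TOPIC_KEYWORDS = {
    "plumbing": {"pipe", "leak", "tap", "drain", "burst"},
    "electrical": {"fuse", "socket", "wiring", "power"},
}


def detect_topic(query):
    """Return the topic with the most keyword hits, or None if nothing matches."""
    words = {w.strip(".,!?").lower() for w in query.split()}
    hits = {topic: len(words & kws) for topic, kws in TOPIC_KEYWORDS.items()}
    topic, count = max(hits.items(), key=lambda item: item[1])
    return topic if count > 0 else None


def build_query_string(query):
    """Prefix the query with a mandatory (+) topic clause when a topic is found."""
    topic = detect_topic(query)
    if topic is None:
        return query
    return f"+topic:{topic} {query}"


print(build_query_string("I have a burst pipe"))
```

When no keyword matches, the query is passed through unchanged so the search still returns something.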
Of course, if you have lots of documents, manually creating the topic list and the bags of words is very time-consuming. You can use machine learning to automate some of the tasks. Basically, it is enough to have a distance measure between words and/or documents to automatically discover topics (e.g. by data clustering) and classify a query into one of these topics. A mix of these techniques may also be a good choice (for example, you can manually create topics and assign initial documents to them, but use classification for query assignment). Take a look at Wikipedia's article on latent semantic analysis to better understand the idea. Also pay attention to the two linked articles on data clustering and document classification. And yes, Maui Indexer may be a good helper tool for this.
Finally, you can try to build an engine that "understands" the meaning of the phrase (rather than just using term frequencies) and searches the appropriate topics. Most probably, this will involve natural language processing and ontology-based knowledge bases. But in fact, this field is still an area of active research, and without previous experience it will be very hard for you to implement something like this.
You may want to explore https://blog.conceptnet.io/2016/11/03/conceptnet-5-5-and-conceptnet-io/.
It combines semantic networks and distributional semantics.
When most developers need word embeddings, the first and possibly only place they look is word2vec, a neural net algorithm from Google that computes word embeddings from distributional semantics. That is, it learns to predict words in a sentence from the other words around them, and the embeddings are the representation of words that make the best predictions. But even after terabytes of text, there are aspects of word meanings that you just won’t learn from distributional semantics alone.
Some results
The ConceptNet Numberbatch word embeddings, built into ConceptNet 5.5, solve these SAT analogies better than any previous system. It gets 56.4% of the questions correct. The best comparable previous system, Turney’s SuperSim (2013), got 54.8%. And we’re getting ever closer to “human-level” performance on SAT analogies — while particularly smart humans can of course get a lot more questions right, the average college applicant gets 57.0%.
Semantic search is basically search with meaning. Elasticsearch uses JSON serialization by default; to apply search with meaning to JSON, you would need to extend it to support edge relations via JSON-LD. You can then apply your semantic analysis over the JSON-LD schema to disambiguate the plumber entity and the burst-pipe context as subject, predicate, object relationships. Elasticsearch has very weak semantic search support, but you can work around it using faceted search and a bag of words. You can index a thesaurus schema for plumbing terms, then do semantic matching over the text phrases in your sentences.
"Elasticsearch 7.3 introduced text similarity search with vector fields".
They describe the application of using text embeddings (e.g., word embeddings and sentence embeddings) to implement this sort of semantic similarity measure.
A bit late to the party, but part II of this blog seems to address this through "contextual searches". It basically makes a two-part query to Elasticsearch in order to build a list of "seed" documents and then an expanded query via the more-like-this API. The result is a set of documents most contextually similar to the search query.
It's possible. This GitHub repo shows how to integrate Elasticsearch with the current state of the art in NLP for semantic representation of language: BERT (Bidirectional Encoder Representations from Transformers) https://github.com/Hironsan/bertsearch
Good luck.
My suggestion is to use BERT embeddings for your sentences and add an embedding field to your Elasticsearch index, as described in https://www.elastic.co/blog/text-similarity-search-with-vectors-in-elasticsearch
For the BERT embeddings, I suggest using the sentence-transformers library from Hugging Face. You can find sample code in https://towardsdatascience.com/how-to-build-a-semantic-search-engine-with-transformers-and-faiss-dcbea307a0e8
There are several options for that:
You can perform it in Elasticsearch itself. Elasticsearch supports indexing dense embeddings of documents. From there, you can write your own search pipeline and use your preferred relevancy score formula, e.g. cosine similarity or something else.
Use Haystack pipeline, refer to my blog which describes setting up a semantic search pipeline (end-to-end).
You can use Meta's Faiss

associated words

I am developing a program but am stuck on a particular hurdle. I need to find words associated with other words. E.g., "green" might be associated with "environment", "leaf", "earth", "wind", "electric", "hybrid", etc. All I can find is Google Sets. Is there any other resource that is better?
If you have a large text collection (say Wikipedia, Project Gutenberg) you can use co-occurrence scores to extract this kind of data. See e.g. Padó and Lapata and the references therein.
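As a simple illustration of a co-occurrence score (not the dependency-based method from the cited paper), the sketch below computes pointwise mutual information (PMI) for word pairs that appear in the same sentence. The four-sentence corpus is a toy stand-in for a real collection, and plain PMI is known to favour rare words, so treat the ranking as a starting point.

```python
import math
from collections import Counter
from itertools import combinations


def cooccurrence_pmi(sentences):
    """PMI for every pair of words that share at least one sentence."""
    word_counts = Counter()
    pair_counts = Counter()
    for sentence in sentences:
        words = set(sentence.lower().split())
        word_counts.update(words)
        pair_counts.update(frozenset(p) for p in combinations(sorted(words), 2))
    n = len(sentences)
    pmi = {}
    for pair, count in pair_counts.items():
        a, b = tuple(pair)
        # log( P(a, b) / (P(a) * P(b)) ) with probabilities estimated per sentence
        pmi[pair] = math.log((count * n) / (word_counts[a] * word_counts[b]))
    return pmi


def associated_words(word, pmi, top_n=5):
    """Words that co-occur with `word`, highest PMI first."""
    scored = []
    for pair, score in pmi.items():
        if word in pair:
            (other,) = pair - {word}
            scored.append((other, score))
    return sorted(scored, key=lambda item: (-item[1], item[0]))[:top_n]


# Toy corpus, invented for illustration.
corpus = [
    "green leaf tree",
    "green energy wind",
    "red car fast",
    "green leaf plant",
]
pmi = cooccurrence_pmi(corpus)
print(associated_words("green", pmi))
```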
I recently built a tool that mines this kind of associations from Wikipedia database dumps by another method. It requires a lot of memory though; other folks have tried to do the same using randomized methods.
If you're still looking for a resource of semantically related words, I've just recently developed an API that takes a query and returns semantically related words. It offers parts of speech, relationships to the query word, and a word similarity measurement.
https://kiingo.co/rapid-associations-api
Disclaimer: I'm the developer of this API.

How to group / compare similar news articles

In an app that I'm creating, I want to add functionality that groups news stories together. I want to group news stories about the same topic from different sources into the same group. For example, an article on XYZ from CNN and MSNBC would be in the same group. I am guessing it's some sort of fuzzy logic comparison. How would I go about doing this from a technical standpoint? What are my options? We haven't even started the app yet, so we aren't limited in the technologies we can use.
Thanks, in advance for the help!
This problem breaks down into a few subproblems from a machine learning standpoint.
First, you are going to want to figure out which properties of the news stories you want to group on. A common technique is to use 'word bags': just a list of the words that appear in the body of the story or in the title. You can do some additional processing, such as removing common English "stop words" that provide no meaning, such as "the" and "because". You can even do Porter stemming to remove redundancies caused by plurals and word endings such as "-ion". This list of words is the feature vector of each document and will be used to measure similarity. You may have to do some preprocessing to remove HTML markup.
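Here is a minimal sketch of that preprocessing step. The stop-word list is deliberately tiny, and the suffix stripper is a crude stand-in for illustration only, not the Porter algorithm; a real pipeline would use a proper stemmer and HTML parser.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "because", "a", "an", "and", "of", "in", "to", "is"}
SUFFIXES = ("ions", "ion", "ing", "es", "s")  # crude; NOT the Porter algorithm


def strip_html(text):
    """Replace anything that looks like a tag with a space."""
    return re.sub(r"<[^>]+>", " ", text)


def crude_stem(word):
    """Drop the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def bag_of_words(html):
    """Word counts after tag removal, stop-word filtering, and stemming."""
    words = re.findall(r"[a-z]+", strip_html(html).lower())
    return Counter(crude_stem(w) for w in words if w not in STOP_WORDS)


print(bag_of_words("<p>The elections: voters and the election results</p>"))
```

Here "elections" and "election" collapse to the same stem, so the two stories that use either form will look more alike.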
Second, you have to define a similarity metric: similar stories score high in similarity. Going along with the bag of words approach, two stories are similar if they have similar words in them (I'm being vague here, because there are tons of things you can try, and you'll have to see which works best).
Finally, you can use a classic clustering algorithm, such as k-means clustering, which groups the stories together, based on the similarity metric.
In summary: convert news story into a feature vector -> define a similarity metric based on this feature vector -> unsupervised clustering.
Check out Google Scholar; there have probably been some papers on this specific topic in the recent literature. A lot of the things I just discussed are implemented in natural language processing and machine learning modules for most major languages.
The problem can be broken down to:
How to represent articles (features, usually a bag of words with TF-IDF)
How to calculate similarity between two articles (cosine similarity is the most popular)
How to cluster articles together based on the above
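The first two steps can be sketched in a few lines of plain Python. This uses one common TF-IDF variant (term frequency times log of inverse document frequency) and whitespace tokenization; the three headlines are invented.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """One sparse {word: weight} TF-IDF vector per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    doc_freq = Counter(word for words in tokenized for word in set(words))
    vectors = []
    for words in tokenized:
        tf = Counter(words)
        vectors.append({
            w: (count / len(words)) * math.log(n_docs / doc_freq[w])
            for w, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Invented headlines for illustration.
articles = [
    "senate passes budget bill",
    "budget bill passes senate vote",
    "local team wins championship game",
]
vectors = tfidf_vectors(articles)
print(cosine(vectors[0], vectors[1]))  # two stories on the same topic
print(cosine(vectors[0], vectors[2]))  # unrelated stories
```

Words that appear in every document get a weight of zero under this variant, which is the intended effect: they carry no information for telling stories apart.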
There are two broad groups of clustering algorithms: batch and incremental. Batch is great if you've got all your articles ahead of time. Since you're clustering news, you've probably got your articles coming in incrementally, so you can't cluster them all at once. You'll need an incremental (aka sequential) algorithm, and these tend to be complicated.
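One of the simpler incremental schemes is single-pass ("leader") clustering: each new article joins the most similar existing cluster if it is similar enough, and otherwise starts a new one. The sketch below uses Jaccard overlap of word sets in place of TF-IDF with cosine similarity to stay short, and the threshold of 0.3 is an assumed value that would need tuning on real data.

```python
def jaccard(a, b):
    """Share of words the two sets have in common."""
    return len(a & b) / len(a | b) if a | b else 0.0


class IncrementalClusterer:
    def __init__(self, threshold=0.3):
        self.threshold = threshold  # assumed value; tune on real data
        self.clusters = []          # each cluster: {"words": set, "articles": list}

    def add(self, article):
        """Assign the article to a cluster and return that cluster's index."""
        words = set(article.lower().split())
        best_index, best_sim = None, 0.0
        for index, cluster in enumerate(self.clusters):
            sim = jaccard(words, cluster["words"])
            if sim > best_sim:
                best_index, best_sim = index, sim
        if best_index is not None and best_sim >= self.threshold:
            self.clusters[best_index]["articles"].append(article)
            self.clusters[best_index]["words"] |= words
            return best_index
        self.clusters.append({"words": words, "articles": [article]})
        return len(self.clusters) - 1


clusterer = IncrementalClusterer()
for story in [
    "senate passes budget bill",
    "senate budget bill passes after vote",
    "local team wins championship",
]:
    print(clusterer.add(story), story)
```

The result depends on the order in which articles arrive, which is one reason more careful incremental algorithms get complicated.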
You can also try http://www.similetrix.com. A quick Google search turned them up, and they claim to offer this service via an API.
One approach would be to add tags to the articles when they are listed. One tag would be XYZ. Other tags might describe the article subject.
You can do that in a database. You can have an unlimited number of tags for each article. Then, the "groups" could be identified by one or more tags.
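A minimal sketch of that layout using SQLite from the Python standard library. The table and column names are made up for this example; the separate article_tags table is what allows any number of tags per article.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE articles (id INTEGER PRIMARY KEY, source TEXT, title TEXT);
    CREATE TABLE article_tags (
        article_id INTEGER REFERENCES articles(id),
        tag TEXT,
        PRIMARY KEY (article_id, tag)
    );
""")

# Invented sample rows.
conn.executemany(
    "INSERT INTO articles (id, source, title) VALUES (?, ?, ?)",
    [
        (1, "CNN", "XYZ announces merger"),
        (2, "MSNBC", "Merger news from XYZ"),
        (3, "CNN", "Local weather update"),
    ],
)
conn.executemany(
    "INSERT INTO article_tags (article_id, tag) VALUES (?, ?)",
    [(1, "XYZ"), (1, "business"), (2, "XYZ"), (3, "weather")],
)


def articles_with_tag(conn, tag):
    """All (source, title) pairs carrying the given tag, i.e. one 'group'."""
    rows = conn.execute(
        "SELECT a.source, a.title FROM articles a "
        "JOIN article_tags t ON t.article_id = a.id "
        "WHERE t.tag = ? ORDER BY a.id",
        (tag,),
    )
    return rows.fetchall()


print(articles_with_tag(conn, "XYZ"))
```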
This approach is heavily dependent upon human beings assigning appropriate tags, so that the right articles are returned from the search, but not too many articles. It isn't easy to do really well.