Searching for known phrases in text using Azure Cognitive Services

Searching for known phrases in text using Azure Cognitive Services - nlp

I'm trying to ascertain the "right tool for the job" here, and I believe Cognitive Services can do this but without disappearing down an R&D rabbit-hole I thought I'd make sure I was tunnelling in the right direction first.
So, here is the brief:
I have a collection of known existing phrases which I want to look for, but these might be written in slightly different ways, be that grammar or language.
I want to be able to parse a (potentially large) volume of text to scan and look for those phrases so that I can identify them.
For example, my phrase could be "the event will be in person" but that also needs to identify different uses of language; for example "in-person event", "face to face event", or "on-site event" - as well as the various synonyms and variations you can get with such things.
LUIS initially appeared to be the go-to tool for this kind of thing, and includes the ability to write your own Features (aka Phrase Lists) to augment the model, but it isn't clear whether that would hit the brief - LUIS appears to be much more about "intent" and user interaction (for example building a chat Bot, or understanding intent from emails).
Text Analytics also seems a likely candidate, but again seems more focused about identifying "entities" (such as people / places / organisations) rather than a natural language "phrase" - would this tool work if I was defining my own "Topics" or is that really just barking up the wrong tree?
.. or ... is there actually something else I should be looking at completely different?
At this point - I'm really looking for a "which tool should I spend lots of time learning about".
Thanks all in advance - I appreciate this is a fairly open-ended requirement.

It seems your scenario aligns more with our text analytics service. I was going to recommend Key Phrase Extraction API which evaluates unstructured text and returns a list of key phrases. However, since you require to use known (custom) phrase list, it may not be the solution you're looking for. We currently don't support custom key phrase extraction today, however it's on our roadmap. If interested, we can connect you with the product team to learn more about your scenario.
Updated:
Please try custom NER capability.

Related

Python sentiment / text analysis advice

I don't know if this is the right place to ask this but, i am trying to build a bot in Python that will read incoming messages on a Slack channel where customer post their issues such as 'unable to connect to VPN', 'can someone reply to my ticket' etc…
The bot will analyze the message, determine if the customer is angry or not, and then propose a solution until an agent is free to actually check the issue.
Now, I was experimenting with TextBlob for the sentiment analysis part, but I don't know which technologies to actually use to determine the issue based on specific keywords and provide a solution to the user. Can someone propose me some python libraries/technologies that I could use to achieve this ?

To be honest your question is to generic to answer in one go.
Nontheless, you first have to clearly define the scope of your project. In doing so, you might want to first do a quick literaty review (Google Scholar) to familiarize with the state of the art technologies and methods.
From my little experience, a common (maybe simple) technique (lexicon-based approach) used to determine the sentiment of a word, is to use a pre-compiled dictionary (you can create your own though) that contains words - sentiment mappings. For example:
word:tired, sentiment:negative, score:5
So each time the bot finds the keyword "tired" in a sentence it will assign its corresponding negative value (polarity) to the sentence.
You might want to consider applying POS tags in the input text, as sometimes nouns or ``verbs carry significant meaning, compared to adjectives for example.
Keep in mind though, that negative comments can be written in the form of sarcasm. Sarcasm detectioin is a more difficult task though.
Alternatively, you could try using a pre-trained model such as bert-base-multilingual-uncased-sentiment that can be found here in Hugging Face.
For more information on the matter you have a look at this post.
Again as I mentioned, you have to clearly define your goals. This will enable you to specify the libraries or methodology available to solve your problem. Hope my answer helps.

Smart search for acronyms in Salesforce

In Salesforce's Service Cloud one can enable the out of the box search function where the user enters a term and the system searches all parts of the database for a match. I would like to enable smart searching of acronyms so that if I spell an organizations name the search functionality will also search for associated acronyms in the database. For example, if I search type in American Automobile Association, I would also get results that contain both "American Automobile Association" and "AAA".
I imagine such a script would involve declaring that if the term being searched contains one or more spaces or periods, take the first letter of the first word and concatenate it with the letters that follow subsequent spaces or periods.
I have unsuccessfully tried to find scripts for this or articles on enabling this functionality in Salesforce. Any guidance would be appreciated.

Interesting question! I don't think there's a straightforward answer but as it's standard search functionality, not 100% programming related - you might want to cross-post it to salesforce.stackexchange.com
Let's start with searchable fields list: https://help.salesforce.com/articleView?id=search_fields_business_accounts.htm&type=0
In Setup there's standard functionality for Synonyms, quite easy to use. It's not a silver bullet though, applies only to certain objects like Knowledge Base (if you use it). Still - it claims to work on Cases too so if there's "AAA" in Case description it should still be good enough?
You could also check out the trick with marking a text field as indexed and/or external ID and adding there all your variations / acronyms: https://success.salesforce.com/ideaView?id=08730000000H6m2 This is more work, to prepare / sanitize your data upfront but it's not a bad idea.
Similar idea would be to use Tags although that could explode in size very quickly. It's ridiculous to create a tag for every single company.
You can do some really smart things in data deduplication rules. Too much to write it all here, check out the trailhead: https://trailhead.salesforce.com/en/modules/sales_admin_duplicate_management/units/sales_admin_duplicate_management_unit_2 No idea if it impacts search though.
If you suffer from bad address data there are State & Country picklists, no more mess with CA / California / SoCal... https://resources.docs.salesforce.com/204/latest/en-us/sfdc/pdf/state_country_picklists_impl_guide.pdf Might not help with Name problem...
Data.com cleanup might help. Paid service I think, no idea if it affects search too. But if enabling it can bring these common abbreviations into your org - might be better than reinventing the wheel.

suggest list of how-to articles based on text content

I have 20,000 messages (combination of email and live chat) between my customer and my support staff. I also have a knowledge base for my product.
Often times, the questions customers ask are quite simple and my support staff simply point them to the right knowledge base article.
What I would like to do, in order to save my support staff time, is to show my staff a list of articles that may likely be relevant based on the initial user's support request. This way they can just copy and paste the link to the help article instead of loading up the knowledge base and searching for the article manually.
I'm wondering what solutions I should investigate.
My current line of thinking is to run analysis on existing data and use a text classification approach:
For each message, see if there is a response with a link to a how-to article
If Yes, extract key phrases (microsoft cognitive services)
TF-IDF?
Treat each how-to as a 'classification' that belongs to sets of key phrases
Use some supervised machine learning, support vector machines maybe to predict which 'classification, aka how-to article' belongs to key phrase determined from a new support ticket.
Feed new responses back into the set to make the system smarter.
Not sure if I'm over complicating things. Any advice on how this is done would be appreciated.
PS: naive approach of just dumping 'key phrases' into search query of our knowledge base yielded poor results since the content of the help article is often different than how a person phrases their question in an email or live chat.

A simple classifier along the lines of a "spam" classifier might work, except that each FAQ would be a feature as opposed to a single feature classifier of spam, not-spam.
Most spam-classifiers start-off with a dictionary of words/phrases. You already have a start on this with your naive approach. However, unlike your approach a spam classifier does much more than a text search. Essentially, in a spam classifier, each word in the customer's email is given a weight and the sum of weights indicates if the message is spam or not-spam. Now, extend this to as many features as FAQs. That is, features like: FAQ1 or not-FAQ1, FAQ2 or not-FAQ2, etc.
Since your support people can easily identify which of the FAQs an e-mail requires then using a supervised learning algorithm would be appropriate. To reduce the impact of any miss-classification errors, then consider the application presenting a support person with the customer's email followed by the computer generated response and all the support person would have to-do is approve the response or modify it. Modifying a response should result in a new entry in the training set.
Support Vector Machines are one method to implement machine learning. However, you are probably suggesting this solution way too early in the process of first identifying the problem and then getting a simple method to work, as well as possible, before using more sophisticated methods. After all, if a multi-feature spam classifier works why invest more time and money in something else that also works?
Finally, depending on your system this is something I would like to work-on.

Extracting relationship from NER parse

I'm working on a problem that at the very least seems to require named entity recognition, but I'm not sure how to go farther than the NER parse. What I'm trying to do is parse information (likely from tweets) regarding scheduling of events. So, for example, I'd like to be able to automatically resolve the yes/no answer to the question of "Are The Beatles playing tomorrow?" from short messages like:
"The Beatles cancelled their show tomorrow" or
"The Beatles' show is still on tomorrow"
I know NER will get me close as it will identify the band of interest and the time (if it's indicated), but there are many ways to express the concepts I'm interested in, for example:
"The Beatles are on for tomorrow" or
"The Beatles won't be playing tomorrow."
How can I go from an NER parsed representation to extracting the information of interest? Any suggestions would be much appreciated.

I guess you should search by event detection (optionally - in Twitter); maybe, also by question answering systems, if your example with yes/no questions wasn't just an illustration: if you know user needs in advance, this information may increase the quality of the system.
For start, there are some papers about event detection in Twitter: here and here.
As a baseline, you can create a list with positive verbs for your domain (to be, to schedule) and negative verbs (to cancel, to delay) - just start from manual list and expand it by synonyms from some dictionary, e.g. WordNet. Also check for negations - again, by presence of pre-specified words ('not' in different forms) in a tweet. Then, if there is a negation, you just invert the meaning.
Since you work with Twitter and most likely there would be just one event mentioned in a tweet, it can work pretty well.

Open source projects for email scrubbing generating structured data from unstructured source?

Don't know where to start on this one so hopefully you guys can clear up my question. I have project where email will be searched for specific words/patterns and stored in a structured manner. Something that is done with Trip it.
The article states that they developed a DataMapper
The DataMapper is responsible for taking inbound email messages
addressed to plans [at] tripit.com and transforming them from the
semi-structured format you see in your mail reader into a highly
structured XML document.
There is a comment that also states
If you're looking to build this yourself, reading a little bit about
Wrappers and Wrapper Induction might be helpful
I Googled and read about wrapper induction but it was just too broad of a definition and didn't help me understand how one would go about solving such problem.
Is there some open source project out there that does similar things?

There are a couple of different ways and things you can do to accomplish this.
The first part, which involves getting access to the email content I'll not answer here. Basically, I'll assume that you have access to the text of emails, and if you don't there are some libraries that allow you to connect java to an email box like camel (http://camel.apache.org/mail.html).
So now you've got the email so then what?
A handy thing that could help is that lingpipe (http://alias-i.com/lingpipe/) has an entity recognizer that you can populate with your own terms. Specifically, look at some of their extraction tutorials and their dictionary extractor (http://alias-i.com/lingpipe/demos/tutorial/ne/read-me.html) So inside of the lingpipe dictionary extractor (http://alias-i.com/lingpipe/docs/api/com/aliasi/dict/ExactDictionaryChunker.html) you'd simply import the terms you're interested in and use that to associate labels with an email.
You might also find the following question helpful: Dictionary-Based Named Entity Recognition with zero edit distance: LingPipe, Lucene or what?

Really a very broad question, but I can try to give you some general ideas, which might be enough to get started. Basically, it sounds like you're talking about an elaborate parsing problem - scanning through the text and looking to apply meaning to specific chunks. Depending on what exactly you're looking for, you might get some good mileage out of a few regular expressions to start - things like phone numbers, email addresses, and dates have fairly standard structures that should be matchable. Other data points might benefit from some indicator words - the phrase "departing from" might indicate that what follows is an address. The natural language processing community also has a large tool set available for text processing - check out things like parts of speech taggers and semantic analyzers if they're appropriate to what you're trying to do.
Armed with those techniques, you can follow a basic iterative development process: For each data point in your expected output structure, define some simple rules for how to capture it. Then, run the application over a batch of test data and see which samples didn't capture that datum. Look at the samples and revise your rules to catch those samples. Repeat until the extractor reaches an acceptable level of accuracy.
Depending on the specifics of your problem, there may be machine learning techniques that can automate much of that process for you.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string