Outline detection from patterns in a list of textual articles - nlp

Are there NLP algorithms that detect repeating
patterns in a list of texts, from which topic keywords
and other associated keywords can be derived?
I will show it as an example:
You have a search query "vegan food for something health"
(where something is a part of body you need an advice about).
The search engine will return a list of articles.
The algorithm will search for patterns in these articles.
E.g. it notices that 80% of them have a paragraph with
at least 4 instances of the word "orange", and similarly for
"carrot", "apples", and "cucumbers".
So it will give you an outline (a textual mind map):
orange
carrot -->
vitamin A
apple
banana -->
vitamin B
run a lot
I once watched a video about the Semantic Web on YouTube in which Tim Berners-Lee talked about something similar, but I have lost the link. Could you
give me keywords that point me in that direction again?

You are probably looking for word2vec: the patterns you describe can be expressed in terms of the distance between words.
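Independently of word2vec, the counting heuristic from the question (a word repeated several times within one paragraph, in most of the returned articles) can be sketched directly. This is only a minimal sketch, assuming each article is a plain string with paragraphs separated by blank lines; the function name `frequent_paragraph_words` and its thresholds are illustrative, not taken from any library.

```python
import re
from collections import Counter

def frequent_paragraph_words(articles, min_count=4, min_share=0.8):
    """Return words that occur at least `min_count` times inside a single
    paragraph in at least `min_share` of the articles (hypothetical helper)."""
    doc_hits = Counter()
    for article in articles:
        hits = set()
        # Assumption: paragraphs are separated by blank lines.
        for paragraph in re.split(r"\n\s*\n", article):
            counts = Counter(re.findall(r"[a-z]+", paragraph.lower()))
            hits.update(w for w, c in counts.items() if c >= min_count)
        doc_hits.update(hits)  # each article counts at most once per word
    threshold = min_share * len(articles)
    return sorted(w for w, n in doc_hits.items() if n >= threshold)
```

In practice, very common words ("the", "and") would pass the same thresholds, so a stop-word filter would probably be needed before building the outline.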

Related

How to find similar noun phrases in NLP?

Is there a way to identify similar noun phrases? Some suggest using pattern-based approaches, for example "X as Y" expressions:
Usain Bolt as Sprint King
Liverpool as Reds
There are many techniques to find alternative names for a given entity,
using patterns such as:
X also known as Y
X also titled as Y
and scanning large collections of documents (e.g., Wikipedia or newspaper articles) is one way to do it.
Other alternatives exist as well. One I remember is using Wikipedia's inter-link structure, for instance by exploring the redirect links between articles. You can download a file with a list of redirects from https://wiki.dbpedia.org/Downloads2015-04; exploring that file, you can find alternative names/synonyms for entities, e.g.:
Kennedy_Centre -> John_F._Kennedy_Center_for_the_Performing_Arts
Lord_Alton_of_Liverpool -> David_Alton,_Baron_Alton_of_Liverpool
Indiana_jones_2 -> Indiana_Jones_and_the_Temple_of_Doom
You can also combine these two techniques: look for text segments where both Indiana Jones and Indiana_Jones_and_the_Temple_of_Doom occur no more than, say, 4 or 5 tokens apart. You might find patterns like "also titled as", which you can then use to find more synonyms/alternative names.
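The combined technique could look roughly like the sketch below. It assumes the redirect file has already been parsed into (alias, canonical name) pairs with underscores replaced by spaces, and that the text is available as a list of sentences; the function name `connecting_patterns` is hypothetical.

```python
import re
from collections import Counter

def connecting_patterns(sentences, alias_pairs, max_gap=5):
    """Count the token sequences found between a known alias and its
    canonical name when at most `max_gap` tokens separate them."""
    # Lazily match 1..max_gap whitespace-delimited tokens between the names.
    gap = r"((?:\S+\s+){1,%d}?)" % max_gap
    patterns = Counter()
    for alias, canonical in alias_pairs:
        # Try both orders: "alias ... canonical" and "canonical ... alias".
        for left, right in ((alias, canonical), (canonical, alias)):
            regex = re.compile(
                r"\b" + re.escape(left) + r"\b,?\s+" + gap
                + re.escape(right) + r"\b",
                re.IGNORECASE,
            )
            for sentence in sentences:
                for match in regex.finditer(sentence):
                    patterns[match.group(1).strip().lower()] += 1
    return patterns
```

The most frequent patterns returned here (for example "also titled as") are candidates for the second pass, where each pattern is searched for on its own to propose new alias pairs.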

LUIS List entity

I am using "list" entity. However, I do not achieve my expected result.
Here is what I have for LUIS intent:
getAnimal
I want to get a cat**[animal]**.
Here is what I have with LUIS entities:
List Entities [animal]
cat: russian blue, persian cat, british shorthair
dog: bulldog, german shepard, beagle
rabbit: holland lop, american fuzzy lop, florida white
Here is what I have with LUIS Phrase lists:
Phrase lists [animal_phrase]
cat, russian blue, persian cat, british shorthair, dog, bulldog, german shepard, beagle, etc
Desired:
When the user enters "I want to get a beagle.", it should match the "getAnimal" intent.
Actual:
When the user enters "I want to get a beagle.", it matches the "None" intent.
Please help. Your help will be appreciated.
Using a phrase list is a good way to start. However, you need to provide enough data for LUIS to learn the intents and the entities separately from the phrase list. Most likely you need to add more utterances.
Additionally, if your end goal is to have LUIS recognize the getAnimal intent, I would do away with the list entity and instead use a simple entity to take advantage of LUIS's machine learning, in combination with a phrase list to boost the signal for what an animal may look like.
As the documentation on phrase lists states,
Features help LUIS recognize both intents and entities, but features
are not intents or entities themselves. Instead, features might
provide examples of related terms.
(A feature, in machine learning, is a distinguishing trait or attribute of the data that your system observes; it is what you add to a group/class when you use a phrase list.)
Start with these steps:
1. Create a simple entity called Animal.
2. Add more utterances to your getAnimal intent.
Following the best practices outlined here, you should include at least 15 utterances per intent. Make sure to include plenty of examples of the Animal entity.
3. Be mindful to include variation in your utterances that is valuable to LUIS's learning (different word order, tense, grammatical correctness, length of utterance, and the entities themselves). I highly recommend reading this StackOverflow answer I wrote on how to build your app properly to get accurate entity detection if you want more elaboration.
(In the original screenshot, the blue highlighted words are tokens labeled with the simple Animal entity.)
4. Use a phrase list.
Be sure to include values that are not just 1 word long but also 2, 3, and 4 words long, as animal names can be that long (e.g. cavalier king charles spaniel, irish setter, english springer spaniel). I also included 40 animal breed names. Don't be shy about adding the Related Values suggested to you into your phrase list.
After training your app to update it with your changes, prosper!
With these changes, "I want a beagle" reaches the proper intent. LUIS will even be able to detect animals that were not entered in the app during entity extraction.

Where can I find a list of English part of speech constraints?

I'm looking for a list of English part of speech sequencing rules (e.g. "a determiner cannot be followed by a verb").
I thought it would be easy, but I couldn't find an actual list with more than a few examples.
Any ideas?
Thanks.
The problem with making a "list of POS constraints" is that those constraints mainly depend on the discourse domain.
I think you can approach it with n-grams. Run POS tagging over a specific corpus (Wikipedia articles on a certain topic, for example), then generate 2-grams or 3-grams of the tags (one tag per word) and calculate their frequencies, so that you get the most and least frequent POS combinations. Finally, consider the POS combinations that never appear in the frequency list; such sequences may be called "part of speech constraints".
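A minimal sketch of the n-gram idea follows. It assumes the corpus has already been POS-tagged by some tagger and reduced to one list of tags per sentence; the function names and the toy tag sequences are made up for illustration.

```python
from collections import Counter
from itertools import product

def pos_ngram_counts(tagged_sentences, n=2):
    """Count every run of n consecutive POS tags across the sentences."""
    counts = Counter()
    for tags in tagged_sentences:
        for i in range(len(tags) - n + 1):
            counts[tuple(tags[i:i + n])] += 1
    return counts

def unseen_ngrams(counts, n=2):
    """List tag combinations that never occur: the candidate 'constraints'."""
    tagset = sorted({tag for gram in counts for tag in gram})
    return [gram for gram in product(tagset, repeat=n) if gram not in counts]

# Toy input standing in for a tagged corpus (hypothetical data).
corpus_tags = [
    ["DET", "NOUN", "VERB"],
    ["DET", "ADJ", "NOUN", "VERB", "DET", "NOUN"],
]
bigram_counts = pos_ngram_counts(corpus_tags)
candidates = unseen_ngrams(bigram_counts)
```

On a small corpus most unseen combinations are simply unobserved rather than ungrammatical, so the candidate list is only as trustworthy as the corpus is large and representative of the domain.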

Possible approach to sentiment analysis (I apologize, I'm very new to NLP)

So I have an idea for classifying sentiments of sentences talking about a given brand product (in this case, pepsi). Basically, let's say I wanted to figure out how people feel about the taste of pepsi. Given this problem, I want to construct abstract sentence templates, basically possible sentence structures that would indicate an opinion about the taste of pepsi. Here's one example for a three word sentence:
[Pepsi] [tastes] [good, bad, great, horrible, etc.]
I then look through my database of sentences, and try to find ones that match this particular structure. Once I have this, I can simply extract the third component and get a sentiment regarding this particular aspect (taste) of this particular entity (pepsi).
The application for this would be looking at tweets, so this might yield a few tweets from the past year or so, but it wouldn't be enough to get an accurate read on the general sentiment, so I would create other possible structures, like:
[I] [love, hate, dislike, like, etc.] [the taste of pepsi]
[I] [love, hate, dislike, like, etc.] [the way pepsi tastes]
[I] [love, hate, dislike, like, etc.] [how pepsi tastes]
And so on and so forth.
Of course, most tweets won't be this simple: there would be other words that mean the same as pepsi, or words in between the major components, etc. These are deviations that it would not be practical to account for.
What I'm looking for is just a general direction, or a subfield of sentiment analysis that discusses this particular problem. I have no problem coming up with a large list of possible structures, it's just the deviations from the structures that I'm worried about. I know this is something like a syntax tree, but most of what I've read about them has just been about generating text - in this case I'm trying to match a sentence to a structure, and pull out the entity, sentiment, and aspect components to get a basic three word answer.
This template approach is the core idea behind my own sentiment mining work. You might find the study of EBMT (example-based machine translation) interesting, as a similar (but under-studied) approach in the realm of machine translation.
Get familiar with WordNet for automatically generating rephrasings (there are hundreds of papers that build on WordNet, some of which will be useful to you). (The WordNet book is getting old now, but it is worth at least a skim read if you can find it in a library.)
I found Bing Liu's book a very useful overview of all the different aspects of and approaches to sentiment mining, and a good introduction to further reading. (The Amazon UK reviews are so negative I wondered if it was a different book! The Amazon US reviews are more positive, though.)
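To make the template idea from the question concrete, here is a minimal regex-based sketch. The word lists, the allowance for one optional intervening word, and the function name `extract_opinion` are all assumptions for illustration; this is not the method used in any of the works mentioned above.

```python
import re

# Hypothetical, deliberately tiny vocabularies taken from the examples.
OPINION_WORDS = r"good|bad|great|horrible"
ATTITUDE_VERBS = r"love|hate|dislike|like"

# Each template allows one optional extra word (e.g. "really") as a
# small tolerance for deviations from the exact structure.
TEMPLATES = [
    # [Pepsi] [tastes] [good, bad, ...]
    re.compile(
        rf"\b(?P<entity>pepsi)\s+tastes\s+(?:\w+\s+)?"
        rf"(?P<sentiment>{OPINION_WORDS})\b",
        re.IGNORECASE,
    ),
    # [I] [love, hate, ...] [the taste of pepsi / the way pepsi / how pepsi]
    re.compile(
        rf"\bI\s+(?:\w+\s+)?(?P<sentiment>{ATTITUDE_VERBS})\s+"
        rf"(?:the\s+taste\s+of|the\s+way|how)\s+(?P<entity>pepsi)\b",
        re.IGNORECASE,
    ),
]

def extract_opinion(sentence, aspect="taste"):
    """Return (entity, aspect, sentiment) for the first matching template,
    or None when no template matches."""
    for template in TEMPLATES:
        match = template.search(sentence)
        if match:
            return (match["entity"].lower(), aspect, match["sentiment"].lower())
    return None
```

Note that the optional-word slot also lets "Pepsi tastes not good" through as "good", so negation would need separate handling; synonyms for the entity and sentiment words could be expanded from a resource such as WordNet.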

Getting semantically related keywords for a given word

Is there any open source/free software available that gives you semantically related keywords for a given word? For example, for the word dog it should give keywords like: animal, mammal, ...
or for the word France it should give you keywords like: country, Europe, ...
Basically, a set of keywords related to the given word.
If there is not, does anybody have an idea of how this could be implemented and how complex it would be?
best regards
WordNet might be what you need. WordNet groups English words into sets of synonyms, provides general definitions, and records the various semantic relations between these groups.
There are tons of projects out there using Wordnet, here you have a list:
http://wordnet.princeton.edu/wordnet/related-projects/
Look at this one, you might find it particularly useful (http://kylescholz.com)
you can see the live demo here :
http://kylescholz.com/projects/wordnet/?text=dog
I hope this helps.
Yes. A company named Saplo in Sweden specializes in this. I believe you can use their API for this, and if you ask nicely you might be able to use it for free (if it's not for commercial purposes, of course).
Yes. What you are looking for is something similar to the vector space model used in search, and it is an efficient way of doing this. There are some open source libraries available for latent semantic indexing/searching (a special case of the vector space model). Apache Lucene is one of the most popular. Or something from Google Code.
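As a rough illustration of the vector space idea (not Lucene's API, and not latent semantic indexing itself, which would add a dimensionality-reduction step), each word can be represented by the words it co-occurs with, and related words found by cosine similarity. All function names below are hypothetical.

```python
import math
from collections import Counter, defaultdict

def cooccurrence_vectors(documents):
    """Map each word to a Counter of the words it shares a document with."""
    vectors = defaultdict(Counter)
    for doc in documents:
        words = set(doc.lower().split())
        for word in words:
            for other in words - {word}:
                vectors[word][other] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0

def related_words(word, vectors, top_n=3):
    """Rank the other words by similarity of their co-occurrence vectors."""
    target = vectors.get(word, Counter())
    scores = {w: cosine(target, vec) for w, vec in vectors.items() if w != word}
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_n]
```

With this representation two words count as related when they appear in similar contexts, even if they never appear together, which is why "dog" and "puppy" can score highly in a corpus where both co-occur with "barks".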
If you are looking for online resources, there are several to consider (at least in 2017; the OP is dated 2010).
Semantic Link (http://www.semantic-link.com): The creator of Semantic Link offers an interface to the results of a computation of a metric called "mutual information" on pairs of words over all of English Wikipedia. Only words occurring more than 1000 times in Wikipedia are available.
"Dog" gets you, for example: purebred, breeds, canine, pet, puppies.
It seems, however, you are really looking for an online tool that gives hyponyms and hypernyms. From the Wikipedia page for "Hyponymy and hypernymy":
In linguistics, a hyponym (from Greek hupó, "under" and ónoma, "name") is a word or phrase whose semantic field is included within that of another word, its hyperonym or hypernym (from Greek hupér, "over" and ónoma, "name"). In simpler terms, a hyponym shares a type-of relationship with its hypernym. For example, pigeon, crow, eagle and seagull are all hyponyms of bird (their hyperonym); which, in turn, is a hyponym of animal.
WordNet (https://wordnet.princeton.edu) has this information and has an online search tool. With this tool, if you enter a word, you'll get one or more entries with an "S" beside them. If you click the "S", you can browse the "Synset (semantic) relations" of the word with that meaning or usage, and this includes direct hyper- and hyponyms. It's incredibly rich!
For example: "dog" (as in "domestic dog") --> "canine" --> "carnivore" --> "placental mammal" --> "vertebrate" --> "chordate" --> etc. or "dog" --> "domestic animal" --> "animal" --> "organism" --> "living thing" --> etc.
There is also WordNik, which lists hypernyms and reverse dictionary words (words with the given word in their definition). Hypernyms for "France" include "european country/nation", and the reverse dictionary includes regions and cities in France, names of certain rulers, etc. "Dog" gets the hypernym "domesticated animal" (and others).
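For readers who want to compute association scores themselves, the "mutual information" metric mentioned for Semantic Link can be approximated with pointwise mutual information (PMI) over document co-occurrence. The exact formula Semantic Link uses is not stated here, so the definition below is an assumption, and `pmi_scores` is a hypothetical helper.

```python
import math
from collections import Counter
from itertools import combinations

def pmi_scores(documents):
    """Return {(word_a, word_b): PMI} for every pair of words that share
    at least one document, with probabilities estimated per document."""
    n_docs = len(documents)
    word_df = Counter()  # number of documents containing each word
    pair_df = Counter()  # number of documents containing both words
    for doc in documents:
        words = sorted(set(doc.lower().split()))
        word_df.update(words)
        pair_df.update(combinations(words, 2))
    return {
        (a, b): math.log2(
            (joint / n_docs) / ((word_df[a] / n_docs) * (word_df[b] / n_docs))
        )
        for (a, b), joint in pair_df.items()
    }
```

Pairs are keyed in alphabetical order. PMI is known to overrate rare words, which is presumably one reason a minimum frequency cutoff (such as the 1000-occurrence limit mentioned above) is useful.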
