Algorithm For Determining Sentence Subject Similarity - python-3.x

I'm looking to generate an algorithm that can determine the similarity of a series of sentences. Specifically, given a starter sentence, I want to determine if the following sentence is a suitable addition.
For example, take the following:
My dog loves to drink water.
All is good, this is just the first sentence.
The dog hates cats.
All is good, both sentences reference dogs.
It enjoys walks on the beach.
All is good, "it" is neutral enough to be an appropriate communication.
Pizza is great with pineapple on top.
This would not be a suitable addition, as the sentence does not build on to the "narrative" created by the first three sentences.
To outline the project a bit, I've created a library that generates Markov text chains based on the input text. That text is then corrected grammatically to produce viable sentences. I now want to string these sentences together to create coherent paragraphs.

Related

How to identify words with the same meaning in order to reduce the number of tags/categories/classes in a dataset

So here is an example of a column in my data-set:
"industries": ["Gaming", "fitness and wellness"]
The industries column has hundreds of different tags, some of which can have the same meaning, for example, some rows have: "Gaming" and some have "video games" and others "Games & consoles".
I'd like to "lemmatize" these tags so I could query the data and not worry about minute differences in the presentation (if they are basically the same).
What is the standard solution in this case?
I don't know that there is a "standard" solution, but I can suggest a couple of approaches, ranked by increasing depth of knowledge, or going from the surface form to the meaning.
String matching
Lemmatisation/stemming
Word embedding vector distance
String matching is based on calculating the difference between strings, as a measure of how many characters they share or how many editing steps it takes to transform one into the other. Levenshtein distance is one of the most common ones. However, depending on the size of your data, it might be a bit inefficient to use. This is a really cool approach to finding the most similar strings in a large data set.
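As a minimal sketch of the string-matching idea, here is a plain dynamic-programming Levenshtein distance (in practice you would probably reach for an optimised library instead):

def levenshtein(a, b):
    # Classic dynamic-programming edit distance:
    # prev[j] holds the distance between the current prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein('Gaming', 'Games'))  # 3 editing steps apart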
However, it might not be the most suitable one for your particular data set, as your similarities seem more semantic and less bound to the surface form of the words.
Lemmatisation/stemming goes beyond the surface by analysing the words based on their morphology. In your example, gaming and games both have the same stem game, so you could base your similarity measure on matching stems. This can be better than pure string matching, as you can see that go and went are related.
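For instance, a quick sketch with NLTK's Porter stemmer (the tag list is just an illustration):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

tags = ['Gaming', 'video games', 'Games & consoles']
for tag in tags:
    print(tag, '->', [stemmer.stem(w) for w in tag.lower().split()])
# 'gaming' and 'games' both reduce to the stem 'game',
# so the tags can be matched on shared stems.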
Word embeddings go beyond the surface form by encoding meaning as the context in which words appear, and as such might find a semantic similarity between health and fitness that is not apparent from the surface at all! The similarity is measured as the cosine distance/similarity between two word vectors, which is basically the angle between the two vectors.
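For reference, cosine similarity itself is a one-liner once you have the vectors (the two vectors below are made-up stand-ins for real embeddings):

import numpy as np

def cosine_similarity(u, v):
    # cos(theta) = (u . v) / (|u| * |v|)
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

health = np.array([0.2, 0.8, 0.1])    # toy 3-dimensional "embeddings"
fitness = np.array([0.25, 0.7, 0.2])
print(cosine_similarity(health, fitness))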
It seems to me that the third approach might be most suitable for your data.
This is a tough NLU question! Basically: which terms are synonyms or near-synonyms of each other, even when there is no exact string overlap?
1. Use GloVe word embeddings to judge synonymous words
It might be interesting to use spaCy's pre-trained GloVe model (en_vectors_web_lg) for word embeddings, to get the pairwise distances between tokens, and use that as a metric for judging 'closeness'.
import spacy

nlp = spacy.load('en_vectors_web_lg')

doc1 = nlp('board games')
doc2 = nlp('Games & Recreation')
doc3 = nlp('video games')

for doc in [doc1, doc2, doc3]:
    for comp in [doc1, doc2, doc3]:
        if doc != comp:
            print(f'{doc} | {comp} | similarity: {round(doc.similarity(comp), 4)}')
board games | Games & Recreation | similarity: 0.6958
board games | video games | similarity: 0.7732
Games & Recreation | board games | similarity: 0.6958
Games & Recreation | video games | similarity: 0.675
video games | board games | similarity: 0.7732
video games | Games & Recreation | similarity: 0.675
(GloVe is cool - really nice mathematical intuition for word embeddings.)
PROS: GloVe is robust, spaCy has it built in, vector space comparisons are easy in spaCy
CONS: It doesn't handle out of vocabulary words well, spaCy's just taking the average of all the token vectors here (so it's sensitive to document length)
2. Try using different distance metrics/fuzzy string matching
You might also look at different kinds of distance metrics -- cosine distance isn't the only one.
FuzzyWuzzy is a good implementation of Levenshtein distance for fuzzy string matching (no vectors required).
This library implements a whole slew of string-matching algorithms.
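A minimal sketch with FuzzyWuzzy (the scorers shown are part of its standard API; treat the exact scores as illustrative):

from fuzzywuzzy import fuzz

print(fuzz.ratio('gaming', 'video games'))            # plain Levenshtein-based ratio
print(fuzz.partial_ratio('gaming', 'video games'))    # best-matching substring
print(fuzz.token_set_ratio('Games & consoles', 'video games'))  # ignores word order and extra tokens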
PROS: Using a preconfigured library saves you some coding, other distance metrics might help you find new correlations, don't need to train a vector model
CONS: More dependencies, some kinds of distance aren't appropriate and will miss synonymous words without literal string overlap
3. Use WordNet to get synonym sets
You could also get a sort of dictionary of synonym sets ('synsets') from WordNet, which was put together by linguists as a kind of semantic knowledge graph.
The nice thing about this is it gets you some textual entailment -- that is, given sentence A, would a reader think sentence B is most likely true?
Because it was handmade by linguists and grad students, WordNet isn't as dependent on string overlap and can give you nice semantic enrichment. It also provides things like hyponyms/meronyms and hypernyms/holonyms -- so you could, e.g., say 'video game' is a subtype of 'game', which is a subset of 'recreation' -- just based off of WordNet.
You can access WordNet in python through the textblob library.
from textblob import Word
from textblob.wordnet import NOUN

game = Word('game').get_synsets(pos=NOUN)
for synset in game:
    print(synset.definition())
a contest with rules to determine a winner
a single play of a sport or other contest
an amusement or pastime
animal hunted for food or sport
(tennis) a division of play during which one player serves
(games) the score at a particular point or the score needed to win
the flesh of wild animals that is used for food
a secret scheme to do something (especially something underhand or illegal)
the game equipment needed in order to play a particular game
your occupation or line of work
frivolous or trifling behavior
print(game[0].hyponyms())
[Synset('athletic_game.n.01'),
Synset('bowling.n.01'),
Synset('card_game.n.01'),
Synset('child's_game.n.01'),
Synset('curling.n.01'),
Synset('game_of_chance.n.01'),
Synset('pall-mall.n.01'),
Synset('parlor_game.n.01'),
Synset('table_game.n.01'),
Synset('zero-sum_game.n.01')]
Even cooler, you can get the similarity based on these semantic features between any words you like.
print((Word('card_game').synsets[0]).shortest_path_distance(Word('video_game').synsets[0]))
5
PROS: Lets you use semantic information like textual entailment to get at your objective, which is hard to get in other ways
CONS: WordNet is limited to what is in WordNet, so again out-of-vocabulary words may be a problem for you.
I suggest using the word2vec approach or the lemmatisation approach:
With the first one you can compute vectors from words, and so you have a projection into a vector space. With this projection you can compute the similarity between words (with cosine similarity, as @Shnipp said) and then set a threshold to decide whether two words belong to the same topic or to different ones.
Using lemmatisation you can compare bare words/lemmas using SequenceMatcher. In this case your condition of equality could be based on the presence of very similar lemmas (similarity above 95%).
It's up to you to choose the best one for your purpose. If you want something solid and structured, use word2vec. Otherwise, if you want something simple and fast to implement, use the lemmatisation approach.
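For the lemmatisation route, a minimal sketch with difflib.SequenceMatcher (the 0.95 threshold is just the value suggested above, not a tuned number):

from difflib import SequenceMatcher

def similar(a, b, threshold=0.95):
    # Ratio of matching characters between the two (lowercased) strings.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

print(similar('gaming', 'Gaming'))   # True: identical lemmas up to case
print(similar('gaming', 'games'))    # False at such a strict threshold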

NLP: Curating definitional summaries for a specific term from textbook

I would like to be able to curate definitional summaries for a specific term from a textbook.
For example, from a Biology textbook, I would like to be able to form a concise summary for the word "mitochondria". I have tried this by first parsing through the textbook for all sentences that contain the word "mitochondria", and feeding those sentences through summarization algorithms such as TextRank and LexRank, but those algorithms were not able to determine "definitional" sentences that well.
By definitional summaries, I mean useful sentences as far as a definition goes. For example, the sentence "The mitochondria is the powerhouse of the cell" would be a definitional sentence while the sentence "Fungal cells also contain mitochondria and a complex system of internal membranes, including the endoplasmic reticulum and Golgi apparatus" is not really pertinent to the definition of the mitochondria.
Any help or leads would be very much appreciated
There isn't really a straightforward way to do this, but you do have some options:
Just use a regex for "mitochondria is". It is the stupidest possible thing, but given a textbook it might prove satisfactory. It's simple enough that testing should be easy, and at worst it provides a baseline to compare alternatives against (see the sketch after this list).
Run a parser (example: Stanford Parser) on each sentence with the word "mitochondria", and extract sentences where mitochondria is the subject. This would eliminate the negative example you gave. You would have to tune this, perhaps restricting main verbs, accounting for coordinators, and so on.
Use Information Extraction (example: Stanford OpenIE) to get a list of facts about mitochondria (like is-in(mitochondria, cell)) and do something with that.
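As a sketch of the first (regex baseline) option, assuming the textbook has already been split into sentences (the sentence list below is a made-up placeholder):

import re

sentences = [
    'The mitochondria is the powerhouse of the cell.',
    'Fungal cells also contain mitochondria and a complex system of internal membranes.',
]

pattern = re.compile(r'\bmitochondria\s+(?:is|are)\b', re.IGNORECASE)
definitional = [s for s in sentences if pattern.search(s)]
print(definitional)  # keeps only the 'mitochondria is/are ...' style sentences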
This is a very open-ended question. I can try to point out how I would approach it...
One way would be to use some kind of vector representation for text (word2vec or sent2vec come to mind).
Then, by encoding each sentence as the average of its word vectors and checking the cosine similarity between that vector and the vector for the term you seek, you could get something close to the definitional sentences you are after.
Even testing the cosine similarity between the averaged sentence vectors you get out of the summarization algorithm and the term vector might help you judge how close you are.
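A rough sketch of that idea, reusing spaCy vectors as in the earlier answer (spaCy's Doc.vector is already the average of its token vectors, and similarity() is cosine similarity; the sentences are placeholders):

import spacy

nlp = spacy.load('en_vectors_web_lg')  # any model that ships with word vectors

term = nlp('mitochondria')
sentences = [
    'The mitochondria is the powerhouse of the cell.',
    'Fungal cells also contain mitochondria and internal membranes.',
]

# Rank sentences by cosine similarity between the averaged sentence vector
# and the term vector; the hope is that definitional sentences rank higher.
for sent in sorted(sentences, key=lambda s: nlp(s).similarity(term), reverse=True):
    print(round(nlp(sent).similarity(term), 3), sent)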

Creating emotional text-based artwork using Sent2Vec and POSTagging?

I want to create computer-generated abstract artwork based on the following criteria:
Nouns and verbs will correspond to round or jagged shapes. Let's say 0 is very jagged and 10 is very round. The rounder something is, the more calm or serene it is. The more jagged, the angrier or more excited it is. Each word can get assigned a "weight" from 0-10 based on its perceived emotional content.
Adjectives and adverbs will correspond to warm or cool colors. Colors like blue and purple will correspond to calmness or serenity, red and orange to anger, yellow to happiness, etc. Same weight rules apply.
I'm not very experienced with Artificial Neural Networks or NLP and I want to make something like this based on any text input. How should I approach this? Could it simply do POS tagging on the entire document and parse it through Sent2Vec?
This is quite an unusual NLP use case; however, you could try the following:
Define a set of words and assign them values.
Use NLP to automatically extract POS tags from the text.
Using the previous steps, write an algorithm that builds your art from the extracted features (a sketch follows below).
There is no need for vectorization (that is only useful for machine learning algorithms) or for any sentiment analysis approach (that applies when you have a machine learning problem).
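A bare-bones sketch of the first two steps with spaCy's POS tagger (the lexicon, weights, and example sentence are all made-up placeholders):

import spacy

nlp = spacy.load('en_core_web_sm')  # small English model with a POS tagger

# Hypothetical hand-built lexicon: word -> weight from 0 (jagged/angry) to 10 (round/calm).
weights = {'storm': 2, 'rage': 1, 'ocean': 8, 'blue': 8, 'red': 2}

doc = nlp('The red storm raged over the calm blue ocean.')
for token in doc:
    if token.pos_ in ('NOUN', 'VERB'):    # drives shape roundness/jaggedness
        print('shape', token.text, weights.get(token.lemma_.lower(), 5))
    elif token.pos_ in ('ADJ', 'ADV'):    # drives colour warmth/coolness
        print('color', token.text, weights.get(token.lemma_.lower(), 5))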

String-matching algorithm for noisy text

I have used OCR (optical character recognition) to get text from images. The images contain book covers. Because the images are so noisy, some characters are misrecognised, or some noise is recognised as a character.
Examples:
"w COMPUTER Nnwonxs i I "(Compuer Networks)
"s.ll NEURAL NETWORKS C "(Neural Networks)
"1llllll INFRODUCIION ro PROBABILITY ti iitiiili My "(Introduction of Probability)
I built a dictionary of words, but I want to somehow match the recognised text against the dictionary. I tried LCS (longest common subsequence), but it's not very effective.
What is the best string-matching algorithm for this kind of problem? (So part of the string is just noise, but the important part of the string can also have some misrecognised characters.)
That's really a big question. The following is what I know about it; for more details, you can read some related papers.
For a single word, use Hamming distance to calculate the similarity between the word recognized by OCR and the words in your dictionary; this step is used to correct words that have been recognized by OCR but do not exist in the dictionary.
E.g.:
If the OCR result is INFRODUCIION, which doesn't exist in your dictionary, you can find that the Hamming distance to the word 'INTRODUCTION' is 2. So 'INTRODUCTION' may have been mis-recognized as 'INFRODUCIION'.
However, several dictionary words may sit at the same Hamming distance from the recognized word.
E.g.: if the OCR result is CAY, you may find that CAR and CAT both have a Hamming distance of 1, which is ambiguous.
In this case, there are several things that can be used for the analysis:
Still at the single-word level, the visual difference between CAT and CAY is smaller than between CAR and CAY, so CAT seems to be the right word with a greater probability.
Then let's use the context to calculate another probability. If the whole sentence is 'I drove my new CAY this morning', then since people usually drive a CAR and not a CAT, we have a better chance of interpreting CAY as CAR rather than CAT.
For the frequency of the words used in similar articles, use TF-IDF.
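A minimal sketch of the Hamming-distance step (restricted here to dictionary words of the same length; the dictionary is a placeholder):

def hamming(a, b):
    # Number of positions at which two equal-length strings differ.
    return sum(c1 != c2 for c1, c2 in zip(a, b))

dictionary = ['INTRODUCTION', 'PROBABILITY', 'NETWORKS']

word = 'INFRODUCIION'
candidates = [w for w in dictionary if len(w) == len(word)]
best = min(candidates, key=lambda w: hamming(word, w))
print(best, hamming(word, best))  # INTRODUCTION 2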
Are you saying you have a dictionary that defines all words that are acceptable?
If so, it should be fairly straightforward to take each word and find the closest match in your dictionary. Set a match threshold and discard the word if it does not reach the threshold.
I would experiment with the Soundex and Metaphone algorithms or the Levenshtein Distance algorithm.
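As a simple sketch of the closest-match-plus-threshold idea using only the standard library (the dictionary and the 0.7 cutoff are arbitrary):

import difflib

dictionary = ['computer', 'networks', 'neural', 'introduction', 'probability']

for token in 'w compuer nnwonxs i i'.split():
    matches = difflib.get_close_matches(token, dictionary, n=1, cutoff=0.7)
    if matches:
        print(token, '->', matches[0])
    else:
        print(token, '-> discarded (below threshold)')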

Lexicon-based text analysis. Any algorithm out there that does probabilistic category assignment?

I'm using a lexicon-based approach to text analysis. Basically I have a long list of words marked with whether they are positive/negative/angry/sad/happy etc. I match the words in the text I want to analyze to the words in the lexicon in order to help me determine if my text is positive/negative/angry/sad/happy etc.
But the length of the texts I want to analyze vary. Most of them are under 100 words, but consider the following example:
John is happy. (1 word in the category 'happy' giving a score of 33% for happy)
John told Mary yesterday that he was happy. (12.5% happy)
So comparing across different sentences, it seems that my first sentence is more 'happy' than my second sentence, simply because the sentence is shorter, and gives a disproportionate % to the word 'happy'.
Is there an algorithm or way of calculation you can think of that would allow me to make a fairer comparison, perhaps by taking into account the length of the sentence?
As many have pointed out, you have to go down to the syntactic tree, something similar to this work.
Also, consider this:
John told Mary yesterday that he was happy.
John told Mary yesterday that she was happy.
The second one says nothing about John's happiness, but a naive algorithm would quickly be confused. So in addition to syntactic parsing, pronouns have to be linked to their subjects (coreference resolution). In particular, that means the algorithm should know that John is 'he' and Mary is 'she'.
Ignoring the issue of negation raised by HappyTimeGopher, you can simply divide the number of happy words in the sentence by the length of the sentence. You get:
John is happy. (1 word in the category 'happy' / 3 words in sentence = score of 33% for happy)
John told Mary yesterday that he was happy. (1/8 = 12.5% happy)
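A tiny sketch of that length normalisation (the happy-word set is a stand-in for your lexicon):

happy_words = {'happy', 'glad', 'joyful'}

def happy_score(sentence):
    # Fraction of tokens that appear in the 'happy' lexicon.
    tokens = [w.strip('.,!?').lower() for w in sentence.split()]
    return sum(t in happy_words for t in tokens) / len(tokens)

print(happy_score('John is happy.'))                                # 1/3 ≈ 0.33
print(happy_score('John told Mary yesterday that he was happy.'))   # 1/8 = 0.125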
Keep in mind word-list based approaches will only go so far. What should be the score for "I was happy with the food, but the waiter was horrible"? Consider using a more sophisticated system -- the papers below are a good place to start your research:
Choi, Y., & Cardie, C. (2008). Learning with compositional semantics as structural inference for subsentential sentiment analysis.
Moilanen, K., & Pulman, S. (2009). Multi-entity sentiment scoring.
Pang, B., & Lee, L. (2008). Opinion Mining and Sentiment Analysis.
Pang, B., Lee, L., & Vaithyanathan, S. (2002). Thumbs up?: sentiment classification using machine learning techniques.
Turney, P. D., & Littman, M. L. (2003). Measuring praise and criticism: Inference of semantic orientation from association.
