NLP: How to get an exact number of sentences for a text summary using Gensim

I am trying to summarise some text using Gensim in python and want exactly 3 sentences in my summary. There doesn't seem to be an option to do this so I have done the following workaround:
with open('speeches//' + speech, "r") as myfile:
    speech = myfile.read()
sentences = speech.count('.')
x = gensim.summarization.summarize(speech, ratio=3.0/sentences)
However, this code only gives me two sentences. Furthermore, when I incrementally increase the 3 towards 5, nothing changes.
Any help would be most appreciated.

You may not be able to use ratio for this. If you give ratio=0.3 and you have 10 sentences (assuming each sentence has the same number of words), your output will have 3 sentences, 6 for 20 sentences, and so on.
As per the gensim docs:
ratio (float, optional) – Number between 0 and 1 that determines the proportion of the number of sentences of the original text to be chosen for the summary.
Instead, you might want to try using word_count: summarize(speech, word_count=60)
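If you need exactly three sentences, one workaround (a sketch, not a built-in gensim option) is to over-summarize with a generous ratio and trim the result yourself. gensim's summarize also accepts split=True to return a list of sentences, which makes the trimming trivial; the helper below shows the same idea with a naive '.'-based split, assuming reasonably clean prose:

```python
# Sketch: rather than tuning `ratio`, over-summarize and then trim.
# `summary` stands in for the (hypothetical) output of something like
# gensim.summarization.summarize(speech, ratio=0.5).

def first_n_sentences(summary, n=3):
    # Naive sentence split on '.'; good enough for clean prose,
    # but will mis-split on abbreviations like "Dr.".
    parts = [s.strip() for s in summary.split('.') if s.strip()]
    return '. '.join(parts[:n]) + ('.' if parts else '')

summary = "First point. Second point. Third point. Fourth point."
print(first_n_sentences(summary))  # → "First point. Second point. Third point."
```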
This question is a bit old; in case you found a better solution, please share.

Related

Use the polarity distribution of word to detect the sentiment of new words

I have just started a project in NLP. Suppose I have a graph for each word that shows the polarity distribution of sentiments for that word in different sentences. I want to know what I can use to recognize the sentiment of new words. I will be happy to hear any other uses you have in mind.
I apologize for any possible errors in my writing. Thanks a lot
Assuming you've got some words that have been hand-labeled with positive/negative sentiments, but then you encounter some new words that aren't labeled:
If you encounter the new words totally alone, outside of contexts, there's not much you can do. (Maybe, you could go out to try to find extra texts with those new words, such as via dictionaries or the web, then use those larger texts in the next approach.)
If you encounter the new words inside texts that also include some of your hand-labeled words, you could try guessing that the new words are most like the words you already know that are closest-to, or used-in-the-same-places. This would leverage what's called "the distributional hypothesis" – words with similar distributions have similar meanings – that underlies a lot of computer natural-language analysis, including word2vec.
One simple thing to try along these lines: across all your texts, for every unknown word U, tally up the counts of all neighboring words within N positions. (N could be 1, or larger.) From that, pick the top 5 words occurring most often near the unknown word, look up your prior labels, and average them together (perhaps weighted by the number of occurrences).
You'll then have a number for the new word.
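A minimal sketch of this tally-and-average idea, assuming a hypothetical labels dict of hand-labeled scores (e.g. +1.0/-1.0) and a list of pre-tokenized texts:

```python
# Sketch of the neighbor-count idea above.
from collections import Counter

def guess_sentiment(unknown, texts, labels, window=2, top_k=5):
    # Count labeled words appearing within `window` positions of `unknown`.
    neighbors = Counter()
    for tokens in texts:
        for i, tok in enumerate(tokens):
            if tok == unknown:
                lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
                for j in range(lo, hi):
                    if j != i and tokens[j] in labels:
                        neighbors[tokens[j]] += 1
    top = neighbors.most_common(top_k)
    if not top:
        return None
    # Average the known labels, weighted by co-occurrence count.
    total = sum(count for _, count in top)
    return sum(labels[w] * c for w, c in top) / total
```

For example, with labels = {"good": 1.0, "bad": -1.0}, an unknown word that only ever appears next to "good" comes out at 1.0.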
Alternatively, you could train a word2vec set-of-word-vectors for all of your texts, including the unknown & known words. Then, ask that model for the N most-similar neighbors to your unknown word. (Again, N could be small or large.) Then, from among those neighbors with known labels, average them together (again perhaps weighted by similarity), to get a number for the previously unknown word.
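A sketch of this vector-based variant, assuming you already have word vectors (the toy 2-d vectors in the test stand in for a trained word2vec model's output):

```python
import numpy as np

def sentiment_from_vectors(unknown, vectors, labels, topn=5):
    # `vectors`: dict of word -> vector, e.g. taken from a trained
    # word2vec model's keyed vectors (hypothetical input here).
    u = np.asarray(vectors[unknown], dtype=float)
    scored = []
    for word, vec in vectors.items():
        if word == unknown or word not in labels:
            continue
        v = np.asarray(vec, dtype=float)
        cos = float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
        scored.append((cos, labels[word]))
    # Keep the topn most-similar known words with positive similarity.
    top = [(c, l) for c, l in sorted(scored, reverse=True)[:topn] if c > 0]
    if not top:
        return None
    # Similarity-weighted average of the known labels.
    total = sum(c for c, _ in top)
    return sum(c * l for c, l in top) / total
```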
I wouldn't particularly expect either of these techniques to work very well. The idea that individual words can have specific sentiment is somewhat weak given the way that in actual language, their meaning is heavily modified, or even reversed, by the surrounding grammar/context. But in each case these simple calculate-from-neighbors techniques are probably better than random guesses.
If your real aim is to calculate the overall sentiment of longer texts, like sentences, paragraphs, reviews, etc, then you should discard your labels of individual words and acquire/create labels for full texts, and apply real text-classification techniques to those larger texts. A simple word-by-word approach won't do very well compared to other techniques – as long as those techniques have plenty of labeled training data.

OpenAI - Limit TL/DR Summarization to X characters & complete sentences

I'm currently learning how to use OpenAI API for text summarization for a project. Overall, it's pretty amazing but there is one thing I'm struggling with.
I need a tl/dr summary that is 1-2 complete sentences with a max of 250 characters. I can play around with the MaximumLength option, but if I make it too short, the summary often ends up with a sentence that is just cut off in the middle.
Another problem is - if there is a bullet list in the main text, the summary will be a few of those bullets. Again, I need 1-2 complete sentences, not bullets.
Lastly, if the main text is quite short, often my summary will be 2 sentences that say the exact same thing with a slight variation.
I've tried this using the various engines (text-davinci, davinci-instruct-beta). Any suggestions on how I can instruct/guide OpenAI to give me the output that I'm looking for? Or do I need to start using the fine-tuning option? If I feed it 1,000+ examples of 1-2 sentences with < 250 characters & no bullets, will it understand what I need?
Many thanks in advance.
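One thing that may help regardless of the engine: post-process the completion so it ends on a sentence boundary and respects the character budget. A sketch (the bullet-stripping and sentence regexes are naive assumptions, not an OpenAI feature):

```python
import re

def trim_summary(text, max_chars=250, max_sentences=2):
    # Strip leading bullet markers, then keep whole sentences while they fit.
    text = re.sub(r'^\s*[-*•]\s*', '', text, flags=re.MULTILINE)
    sentences = re.findall(r'[^.!?]+[.!?]', text.replace('\n', ' '))
    out = ""
    for s in sentences[:max_sentences]:
        if len(out) + len(s.strip()) + 1 > max_chars:
            break
        out = (out + " " + s.strip()).strip()
    return out
```

An incomplete trailing fragment never matches the sentence regex, so a mid-sentence cutoff is simply dropped rather than shown.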

Extract Acronyms and Māori (non-English) words in a dataframe, and put them in adjacent columns within the dataframe

Regular expressions seem a steep learning curve for me. I have a dataframe that contains texts (up to 300,000 rows). The text, contained in the outcome column of a dummy file named foo_df.csv, has a mixture of English words, acronyms and Māori words. foo_df.csv is as follows:
outcome
0 I want to go to DHB
1 Self Determination and Self-Management Rangatiratanga
2 mental health wellness and AOD counselling
3 Kai on my table
4 Fishing
5 Support with Oranga Tamariki Advocacy
6 Housing pathway with WINZ
7 Deal with personal matters
8 Referral to Owaraika Health services
The result I desire is in the form of the table below, which has Abbreviation and Māori_word columns:
outcome Abbreviation Māori_word
0 I want to go to DHB DHB
1 Self Determination and Self-Management Rangatiratanga Rangatiratanga
2 mental health wellness and AOD counselling AOD
3 Kai on my table Kai
4 Fishing
5 Support with Oranga Tamariki Advocacy Oranga Tamariki
6 Housing pathway with WINZ WINZ
7 Deal with personal matters
8 Referral to Owaraika Health services Owaraika
The approach I am using is to extract the acronyms using a regular expression and to extract the Māori words using the nltk module.
I have been able to extract the acronyms with this code:
pattern = r'(\b[A-Z](?:[\.&]?[A-Z]){1,7}\b)'
foo_df['Abbreviation'] = foo_df.outcome.str.extract(pattern)
I have been able to extract non-English words from a single sentence using the code below:
import nltk
nltk.download('words')
from nltk.corpus import words
words = set(nltk.corpus.words.words())

sent = "Self Determination and Self-Management Rangatiratanga"
" ".join(w for w in nltk.wordpunct_tokenize(sent)
         if not w.lower() in words or not w.isalpha())
However, I got an error (TypeError: expected string or bytes-like object) when I tried to apply the above code over the dataframe. The iteration I tried is below:
def no_english(text):
    words = set(nltk.corpus.words.words())
    " ".join(w for w in nltk.wordpunct_tokenize(text['outcome'])
             if not w.lower() in words or not w.isalpha())

foo_df['Māori_word'] = foo_df.apply(no_english, axis=1)
print(foo_df)
Any help in python3 will be appreciated. Thanks.
You can't magically tell whether a word is English/Māori/an abbreviation with a simple short regex. In fact, it is quite likely that some words can be found in multiple categories, so the task itself is not binary (or ternary in this case).
What you want to do is natural language processing; there are several libraries for language detection in Python. What you'll get is a probability that the input is in a given language. This is usually run on full texts, but you could apply it to single words.
Another approach is to use Māori and abbreviation dictionaries (= exhaustive/curated lists of words) and craft a function to tell whether a word is in one of them, assuming English otherwise.
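A minimal sketch of this dictionary approach (the Māori word list here is a tiny illustration drawn from the sample rows; a real solution would need a proper lexicon):

```python
import re

# Illustrative word list only; a real one would be an exhaustive lexicon.
MAORI_WORDS = {"rangatiratanga", "kai", "oranga", "tamariki", "owaraika"}
# Same acronym pattern as in the question, as a raw string.
ABBREV_RE = re.compile(r'\b[A-Z](?:[\.&]?[A-Z]){1,7}\b')

def extract(text):
    # Returns (abbreviations, Māori words) found in one outcome string.
    abbrevs = ABBREV_RE.findall(text)
    maori = [w for w in text.split() if w.lower() in MAORI_WORDS]
    return " ".join(abbrevs), " ".join(maori)
```

With pandas you could then fill both columns at once, e.g. foo_df[['Abbreviation', 'Māori_word']] = foo_df['outcome'].apply(lambda t: pd.Series(extract(t))).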

how to find the similarity between two documents

I have tried using the similarity function of spaCy to get the best-matching sentence in a document. However, it fails for bullet points because it considers each bullet a sentence, and the bullets are incomplete sentences (e.g. sentence 1 is "password should be min 8 characters long", sentence 2 is a bullet: "8 characters"). It does not know the bullet is referring to the password, so my similarity comes out very low.
Sounds to me like you need to do more text processing before attempting to use similarity. If you want bullet points to be considered part of a sentence, you need to modify your spaCy pipeline to do so.
Bullets are considered, but the thing is it doesn't understand what "8 characters" is referring to, so I thought of finding the heading of the paragraph and replacing the bullets with it.
I found the headings using python-docx, but it doesn't read bullets while reading the document. Is there a way I can read them using python-docx?
Is there any way I can find the headings of a paragraph in spacy?
Is there a better approach for it
You can actually modify spaCy's sentencizer to recognize bullet points as sentence boundaries, but an easier way would be to use the sentence-transformers library instead. It doesn't matter if you have bullet points in your sentence in that case.
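For the sentencizer route, a sketch assuming spaCy 3.x (the "•" bullet character is just an example of an extra boundary; adapt the list to whatever markers your documents use):

```python
import spacy

# Blank English pipeline with a rule-based sentencizer that also treats
# the bullet character "•" as a sentence boundary, so bulleted fragments
# become their own sentences instead of merging with neighbors.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer", config={"punct_chars": [".", "!", "?", "•"]})

doc = nlp("Password rules: • at least 8 characters • one digit")
sentences = [sent.text.strip() for sent in doc.sents]
```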

Trying to detect products from text while using a dictionary

I have a list of product names and a collection of text generated from random users. I am trying to detect products mentioned in the text while taking into account spelling variation. For example, the text
Text = i am interested in galxy s8
Mentions the product samsung galaxy s8
But note the difference in spellings.
I've implemented the following approaches:
1- Max-tokenized product names and user text (I split words by punctuation and digits, so s8 is tokenized into 's' and '8'). Then I checked each token in the user's text to see if it is in my vocabulary with Damerau-Levenshtein distance <= 1, to allow for variation in spelling. Once I had detected a sequence of tokens that exist in the vocabulary, I searched for the product matching the query, checking the Damerau-Levenshtein distance on each token. This gave poor results, mainly because a sequence of tokens that exists in the vocabulary does not necessarily represent a product. For example, since the text is max-tokenized, numbers can be found in the vocabulary, and as such dates are detected as products.
2- I constructed bigram and trigram indices from the list of products and converted each user text into a query, but the results weren't great either, given the spelling variation.
3- I manually labeled 270 sentences and trained a named entity recognizer with labels ('O' and 'Product'). I split the data into 80% training and 20% test. Note that I didn't use the list of products as part of the features. Results were okay, but not great.
None of the above achieved reliable performance. I tried regular expressions, but since there are so many different combinations to consider, it became too complicated. Are there better ways to tackle this problem? I suppose NER could give better results if I trained on more data, but supposing there isn't enough training data, what do you think a better solution would be?
If I come up with a better alternative to the ones I've already mentioned, I'll add it to this post. In the meantime I'm open to suggestions.
Consider splitting your problem into two parts.
1) Conduct a spelling check using a dictionary of known product names (this is not an NLP task, and there should be guides on how to implement spell check).
2) Once you have done the pre-processing (spell checking), use your NER algorithm.
It should improve your accuracy.
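A sketch of step 1, with a hand-rolled Levenshtein distance and a toy product vocabulary (real code might use a Damerau-Levenshtein library instead, and a vocabulary built from the full product list):

```python
# Snap each token to the closest known product token (edit distance <= 1)
# before running NER, so "galxy" becomes "galaxy".

def edit_distance(a, b):
    # Classic Levenshtein distance via a rolling 1-D DP table.
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

VOCAB = {"samsung", "galaxy", "s8"}  # illustrative product tokens

def spell_correct(text):
    corrected = []
    for tok in text.lower().split():
        match = min(VOCAB, key=lambda v: edit_distance(tok, v))
        corrected.append(match if edit_distance(tok, match) <= 1 else tok)
    return " ".join(corrected)
```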
