NLP summarization using textacy/spacy - nlp

I want to generate a summary, ideally a single sentence, from this text. I am using textacy.
Here is my code:
import textacy
import textacy.keyterms
import textacy.extract
import spacy
nlp = spacy.load('en_core_web_sm')
text = '''Sauti said, 'O thou that art blest with longevity, I shall narrate the history of Astika as I heard it from my father.
O Brahmana, in the golden age, Prajapati had two daughters.
O sinless one, the sisters were endowed with wonderful beauty.
Named Kadru and Vinata, they became the wives of Kasyapa.
Kasyapa derived great pleasure from his two wedded wives and being gratified he, resembling Prajapati himself, offered to give each of them a boon.
Hearing that their lord was willing to confer on them their choice blessings, those excellent ladies felt transports of joy.
Kadru wished to have for sons a thousand snakes all of equal splendour.
And Vinata wished to bring forth two sons surpassing the thousand offsprings of Kadru in strength, energy, size of body, and prowess.
Unto Kadru her lord gave that boon about a multitude of offspring.
And unto Vinata also, Kasyapa said, 'Be it so!' Then Vinata, having; obtained her prayer, rejoiced greatly.
Obtaining two sons of superior prowess, she regarded her boon fulfilled.
Kadru also obtained her thousand sons of equal splendour.
'Bear the embryos carefully,' said Kasyapa, and then he went into the forest, leaving his two wives pleased with his blessings.'''
doc = textacy.make_spacy_doc(text, 'en_core_web_sm')
sentobj = nlp(text)
sentences = textacy.extract.subject_verb_object_triples(sentobj)
summary = ''
for i, x in enumerate(sentences):
    subject, verb, fact = x
    print('Fact ' + str(i+1) + ': ' + str(subject) + ' : ' + str(verb) + ' : ' + str(fact))
    summary += 'Fact ' + str(i+1) + ': ' + str(fact)
Results are as follows:
Fact 1: I : shall narrate : history
Fact 2: I : heard : it
Fact 3: they : became : wives
Fact 4: Kasyapa : derived : pleasure
Fact 5: ladies : felt : transports
Fact 6: Kadru : wished : have
Fact 7: Vinata : wished : to bring
Fact 8: lord : gave : boon
Fact 9: Kasyapa : said : Be
Fact 10: Vinata : obtained : prayer
Fact 11: she : regarded : boon
Fact 12: Kadru : obtained : sons
I tried:
textacy.extract.words
textacy.extract.entities
textacy.extract.ngrams
textacy.extract.noun_chunks
textacy.ke.textrank
Everything works as described in the book, but the results are not perfect.
I want something like "Kasyapa married the sisters Kadru and Vinata" or "Kasyapa gave embryos to Kadru and Vinata".
Can you suggest how to do this, or recommend some alternative packages to use?
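For a quick extractive baseline, one alternative package is gensim: versions before 4.0 shipped a TextRank-based summarizer. It extracts the most central existing sentence rather than composing a new one like "Kasyapa married the sisters Kadru and Vinata", but it may be close enough; word_count=30 is just an illustrative setting.
# A minimal sketch; requires gensim < 4.0, where gensim.summarization still exists.
from gensim.summarization import summarize

# Extracts the highest-ranked sentence(s), up to roughly 30 words; it does not paraphrase.
print(summarize(text, word_count=30))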

Just an update. I have been able to pagerank the "Sauti" sentences. Here are the results in descending order of pagerank:
(0.0869526908422304, ['O', 'Brahmana', ',', 'in', 'the', 'golden', 'age', ',', 'Prajapati', 'had', 'two', 'daughters', '.']),
(0.08675152795526771, ['Named', 'Kadru', 'and', 'Vinata', ',', 'they', 'became', 'the', 'wives', 'of', 'Kasyapa', '.']),
(0.08607926397402169, ['And', 'Vinata', 'wished', 'to', 'bring', 'forth', 'two', 'sons', 'surpassing', 'the', 'thousand', 'offsprings', 'of', 'Kadru', 'in', 'strength', ',', 'energy', ',', 'size', 'of', 'body', ',', 'and', 'prowess', '.']),
(0.08096858541855065, ['Kasyapa', 'derived', 'great', 'pleasure', 'from', 'his', 'two', 'wedded', 'wives', 'and', 'being', 'gratified', 'he', ',', 'resembling', 'Prajapati', 'himself', ',', 'offered', 'to', 'give', 'each', 'of', 'them', 'a', 'boon', '.']),
(0.08025844559654187, ['And', 'unto', 'Vinata', 'also', ',', 'Kasyapa', 'said', ',', '("\'Be",', "'VBD", 'it', 'so', '!', '("\'",', '"\'\'"),', 'Then', 'Vinata', ',', 'having', ';', 'obtained', 'her', 'prayer', ',', 'rejoiced', 'greatly', '.']),
(0.07764697882919071, ['Obtaining', 'two', 'sons', 'of', 'superior', 'prowess', ',', 'she', 'regarded', 'her', 'boon', 'fulfilled', '.']),
(0.07717129674341844, ['("\'Bear",', "'IN", 'the', 'embryos', 'carefully', ',', '("\'",', '"\'\'"),', 'said', 'Kasyapa', ',', 'and', 'then', 'he', 'went', 'into', 'the', 'forest', ',', 'leaving', 'his', 'two', 'wives', 'pleased', 'with', 'his', 'blessings', '.']),
(0.0768816552210493, ['Kadru', 'also', 'obtained', 'her', 'thousand', 'sons', 'of', 'equal', 'splendour', '.']),
(0.07172005226142254, ['Kadru', 'wished', 'to', 'have', 'for', 'sons', 'a', 'thousand', 'snakes', 'all', 'of', 'equal', 'splendour', '.']),
(0.06953411123175395, ['Unto', 'Kadru', 'her', 'lord', 'gave', 'that', 'boon', 'about', 'a', 'multitude', 'of', 'offspring', '.']),
(0.06943939082844, ['Sauti\\', 'said', ',', '("\'",', '"\'\'"),', 'O', 'thou', 'that', 'art', 'blest', 'with', 'longevity', ',', 'I', 'shall', 'narrate', 'the', 'history', 'of', 'Astika', 'as', 'I', 'heard', 'it', 'from', 'my', 'father', '.']),
(0.06888390365265022, ['O', 'sinless', 'one', ',', 'the', 'sisters', 'were', 'endowed', 'with', 'wonderful', 'beauty', '.']),
(0.0677120974454628, ['Hearing', 'that', 'their', 'lord', 'was', 'willing', 'to', 'confer', 'on', 'them', 'their', 'choice', 'blessings', ',', 'those', 'excellent', 'ladies', 'felt', 'transports', 'of', 'joy', '.'])]
The results are not exactly what I was looking for, but they are impressive.
I used the following libraries (a sketch of how they fit together follows the list):
import nltk.tokenize as tk
from nltk import sent_tokenize, word_tokenize
from nltk.cluster.util import cosine_distance
from nltk.corpus import brown, stopwords
import networkx as nx
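For anyone curious how these pieces fit together, here is a minimal sketch of the sentence-PageRank idea (my reconstruction, not the exact code that produced the output above; it assumes the NLTK punkt and stopwords data are downloaded):
import numpy as np
import networkx as nx
from nltk import sent_tokenize, word_tokenize
from nltk.cluster.util import cosine_distance
from nltk.corpus import stopwords

def sentence_vectors(sentences):
    # Bag-of-words vectors over a shared vocabulary, stopwords removed.
    stops = set(stopwords.words('english'))
    tokenized = [[w.lower() for w in word_tokenize(s)
                  if w.isalpha() and w.lower() not in stops]
                 for s in sentences]
    vocab = sorted({w for toks in tokenized for w in toks})
    return [[toks.count(w) for w in vocab] for toks in tokenized]

sentences = sent_tokenize(text)  # `text` is the Sauti passage above
vectors = sentence_vectors(sentences)

# Similarity = 1 - cosine distance; build a weighted graph and rank it.
n = len(sentences)
sim = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        if i != j and any(vectors[i]) and any(vectors[j]):
            sim[i][j] = 1 - cosine_distance(vectors[i], vectors[j])

scores = nx.pagerank(nx.from_numpy_array(sim))
for i, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(score, sentences[i])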
Just wanted to share this with you all.
thanks

Related

Is there a better, more efficient way to write the code below for extracting text that starts with '#'?

import numpy as np
hash_list = [['obi', 'is', '#alive'], ['oge', 'is', 'beautiful'],
             ['Ade', 'the', '#comedian', 'de', '#rich'], ['Jesus', 'wept']]
print(hash_list)
Output: [['obi', 'is', '#alive'], ['oge', 'is', 'beautiful'], ['Ade', 'the', '#comedian', 'de', '#rich'], ['Jesus', 'wept']]
Below is the code I need to optimize for performance. It takes too long (seemingly growing exponentially) as the list increases in size:
a = []
for x in hash_list:
    b = []
    a.append(b)
    for i in x:
        if i[0] == "#":
            b.append(i)
for x in range(len(a)):
    if a[x] == []:
        a[x] = np.nan
print(a)
Output: [['#alive'], nan, ['#comedian', '#rich'], nan]
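For what it's worth, the loop is already linear in the number of words; a list comprehension does the same thing more compactly (a sketch with the same behavior, reusing hash_list from above):
# One pass: keep the '#' words per sublist, or np.nan when none are found.
a = [[w for w in x if w.startswith('#')] or np.nan for x in hash_list]
print(a)  # [['#alive'], nan, ['#comedian', '#rich'], nan]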

combining thousands of list strings in python

I have a .txt file of "Alice in Wonderland" and need to strip all the punctuation and make all of the words lowercase, so I can find the number of unique words in the file. The wordlist referred to below is one list of all the individual words, as strings, from the book, so wordlist looks like this:
["Alice's", 'Adventures', 'in', 'Wonderland', "ALICE'S",
'ADVENTURES', 'IN', 'WONDERLAND', 'Lewis', 'Carroll', 'THE',
'MILLENNIUM', 'FULCRUM', 'EDITION', '3.0', 'CHAPTER', 'I',
'Down', 'the', 'Rabbit-Hole', 'Alice', 'was', 'beginning',
'to', 'get', 'very', 'tired', 'of', 'sitting', 'by', 'her',
'sister', 'on', 'the', 'bank,'
The code I have for the solution so far is:
from string import punctuation
def wordcount(book):
    for word in wordlist:
        no_punc = word.strip(punctuation)
        lower_case = no_punc.lower()
        newlist = lower_case.split()
        print(newlist)
This works for stripping punctuation and making all words lowercase; however, newlist = lower_case.split() makes an individual list of every word, so I cannot iterate over one big list to find the number of unique words. The reason I did the .split() is so that, when iterated over, Python does not count every letter as a word; each word is kept intact as its own list item. Any ideas on how I can improve this, or a more efficient approach? Here is a sample of the output:
['down']
['the']
['rabbit-hole']
['alice']
['was']
['beginning']
['to']
['get']
['very']
['tired']
['of']
['sitting']
['by']
['her']
Here is a modification of your code, with outputs:
from string import punctuation
wordlist = "Alice fell down down down!.. down into, the hole."
single_list = []
for word in wordlist.split(" "):
    no_punc = word.strip(punctuation)
    lower_case = no_punc.lower()
    newlist = lower_case.split()
    #print(newlist)
    single_list.append(newlist[0])
print(single_list)
#to get the unique
single_list_unique = set(single_list)
print(single_list_unique)
print(len(single_list_unique))
and that produces:
['alice', 'fell', 'down', 'down', 'down', 'down', 'into', 'the', 'hole']
and the unique set:
{'fell', 'alice', 'down', 'into', 'the', 'hole'}
and the length of the unique:
6
(This may not be the most efficient approach, but it stays close to your current code and will suffice for a book of thousands of words. If this were a backend process serving multiple requests, you would want to optimize it further.)
EDIT----------
You may be importing from the file using a library that passes in a list, in which case the code above raises AttributeError: 'list' object has no attribute 'split'; you might also see IndexError: list index out of range because of an empty string. In that case, use this modification:
from string import punctuation
wordlist2 = ["", "Alice fell down down down!.. down into, the hole.",
             "There was only one hole for Alice to fall down into"]
single_list = []
for wordlist in wordlist2:
    for word in wordlist.split(" "):
        no_punc = word.strip(punctuation)
        lower_case = no_punc.lower()
        newlist = lower_case.split()
        #print(newlist)
        if(len(newlist) > 0):
            single_list.append(newlist[0])
print(single_list)
#to get the unique
single_list_unique = set(single_list)
print(single_list_unique)
print(len(single_list_unique))
producing:
['alice', 'fell', 'down', 'down', 'down', 'down', 'into', 'the', 'hole', 'there', 'was', 'only', 'one', 'hole', 'for', 'alice', 'to', 'fall', 'down', 'into']
{'there', 'fall', 'fell', 'alice', 'for', 'down', 'was', 'into', 'the', 'to', 'only', 'hole', 'one'}
13
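As an aside, once the file itself is read, the whole pipeline collapses to a few lines; a sketch, assuming the book lives in a file named alice.txt (a hypothetical path):
from string import punctuation

# Read the whole book, normalize each word, and count the distinct ones.
with open('alice.txt') as f:
    words = {w.strip(punctuation).lower() for w in f.read().split()}
words.discard('')  # drop empty strings left by bare punctuation tokens
print(len(words))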

Python program that extracts most frequently found names from .csv file

I have created a program that generates 5,000 random names, SSNs, cities, addresses, and emails, and stores them in a fakeprofile.csv file. I am trying to extract the most common names from the file. The program runs without errors, but it fails to extract the frequent names.
Here's the code:
import re
import statistics
file_open = open('fakeprofile.csv').read()
frequent_names = re.findall('[A-Z][a-z]*', file_open)
print(frequent_names)
Sample in the file:
Alicia Walters 419-52-4141 Yorkstad 66616 Schultz Extensions Suite 225
Reynoldsmouth, VA 72465 stevenserin#stein.biz
Nicole Duffy 212-38-9009 West Timothy 51077 Phillips Ports Apt. 314
Hubbardville, IN 06723 kaitlinthomas#bennett-carter.com
Stephanie Lewis 442-20-1279 Jacquelineshire 650 Gutierrez Forge Apt. 839
West Christianbury, TN 13654 ukelley#gmail.com
Michael Harris 108-81-3733 East Toddberg 14387 Douglas Mission Suite 038
Garciaview, WI 58624 kshields#yahoo.com
Aaron Moreno 171-30-7715 Port Taraburgh 56672 Wagner Path
Lake Christopher, VA 37884 lucasscott#nguyen.info
Alicia Zimmerman 286-88-9507 Barberstad 5365 Heath Extensions Apt. 731
South Randyburgh, NJ 79367 daniellewebb#yahoo.com
Brittney Mcmillan 334-44-0321 Lisahaven PSC 3856, Box 2428
APO AE 03215 kevin95#hotmail.com
Amanda Perkins 327-31-6610 Perryville 8750 Hurst Harbor Apt. 929
Sample output:
', 'Lake', 'Brianna', 'P', 'A', 'Michael', 'Smith', 'Harveymouth', 'Patricia', 'Tunnel', 'West', 'William', 'G', 'A', 'Charles', 'Perkins', 'Lake', 'Marie', 'Lisa', 'Overpass', 'Suite', 'Kennedymouth', 'C', 'A', 'Barbara', 'Perez', 'Billyshire', 'Joshua', 'Village', 'Cindymouth', 'W', 'I', 'Curtis', 'Simmons', 'North', 'Mitchellport', 'Gordon', 'Crest', 'Suite', 'Jacksonburgh', 'C', 'O', 'Cameron', 'Berg', 'South', 'Dean', 'Christina', 'Coves', 'Williamton', 'T', 'N', 'Maria', 'Williams', 'North', 'Judith', 'Carson', 'Overpass', 'Apt', 'West', 'Amandastad', 'N', 'M', 'Hannah', 'Dennis', 'Rodriguezmouth', 'P', 'S', 'C', 'Box', 'A', 'P', 'O', 'A', 'E', 'Laura', 'Richardson', 'Lake', 'Kayla', 'Johnson', 'Place', 'Suite', 'Port', 'Jennifermouth', 'N', 'H', 'John', 'Lawson', 'Hintonhaven', 'Thomas', 'Via', 'Mossport', 'N', 'J', 'Jennifer', 'Hill', 'East', 'Phillip', 'P', 'S', 'C', 'Box', 'A', 'P', 'O', 'A', 'E', 'Cody', 'Jackson', 'Lake', 'Jessicamouth', 'Snyder', 'Ways', 'Apt', 'New', 'Stacey', 'M', 'E', 'Ryan', 'Friedman', 'Shahburgh', 'Jerry', 'Pike', 'Suite', 'Toddfort', 'N', 'V', 'Kathleen', 'Fox', 'Ferrellmouth', 'P', 'S', 'C', 'Box', 'A', 'P', 'O', 'A', 'P', 'Michael', 'Thompson', 'Port', 'Jessica', 'Boone', 'Spurs', 'Suite', 'Port', 'Ashleyland', 'C', 'O', 'Christopher', 'Marsh', 'North', 'Catherine', 'Scott', 'Trail', 'Apt', 'Baileyburgh', 'F', 'L', 'Richard', 'Rangel', 'New', 'Anna', 'Ray', 'Drive', 'Apt', 'Nunezland', 'I', 'A', 'Connor', 'Stanton', 'Troyshire', 'Rodgers', 'Hill', 'West', 'Annmouth', 'N', 'H', 'James', 'Medina',
My issue is that I am unable to extract the most frequently found first names while avoiding those stray capital letters. Instead, I have extracted all capitalized tokens (including the unnecessary single letters); the output above is a small sample of everything extracted. I noticed that the first names are always on the odd rows in the file, and I am trying to capture the most frequent first names from those odd rows.
The fakeprofile.csv file was created by this program:
import csv
import faker
from faker import Faker

fake = Faker()
name = fake.name(); print(name)
ssn = fake.ssn(); print(ssn)
city = fake.city(); print(city)
address = fake.address(); print(address)
email = fake.email(); print(email)
profile = fake.simple_profile()
for i, j in profile.items():
    print('{}: {}'.format(i, j))
print('Name: {}, SSN: {}, City: {}, Address: {}, Email: {}'.format(name, ssn, city, address, email))
with open('fakeprofile.csv', 'w') as f:
    for i in range(0, 5001):  # note: this actually writes 5,001 rows
        print(f'{fake.name()} {fake.ssn()} {fake.city()} {fake.address()} {fake.email()}', file=f)
Does this achieve what you want?
import collections, re

# Read in all lines into a list
with open('fakeprofile.csv') as f:
    lines = f.readlines()
# Throw out every other line
lines = [line for i, line in enumerate(lines) if i % 2 == 0]
# Keep only the first word of each line
names = [line.split()[0] for line in lines]
# Find the most common names
n = 3
frequent_names = collections.Counter(names).most_common(n)
# Display the most common names
for name, count in frequent_names:
    print(name, count)
To do the counting, it uses collections.Counter together with its most_common() method.
I think it would be better to use the pandas library for the CSV manipulation (collecting the desired information) and then apply a Python collection such as Counter(df['name']) to it. Otherwise, could you give us more information about the CSV file?
Thank you.
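A minimal sketch of that suggestion, assuming the data were regenerated as a proper CSV with a header and a name column (the file above is not one, so this is illustrative only):
import pandas as pd

# Hypothetical: assumes fakeprofile.csv has a header row with a 'name' column.
df = pd.read_csv('fakeprofile.csv')

# First token of each full name, then the three most common first names.
first_names = df['name'].str.split().str[0]
print(first_names.value_counts().head(3))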
So the main problem you have is that you use a regexp that captures every capitalized token.
You are interested in the first word on the odd lines.
You can do something along these lines:
# Either use a dict to count, or a list to transform into a Counter.
dico_count = {}
with open('fakeprofile.csv') as file_open:  # use of a context manager
    line_number = 1
    for line in file_open:  # iterate over all the lines
        if line_number % 2 != 0:  # odd line
            spt = line.strip().split()
            dico_count[spt[0]] = dico_count.get(spt[0], 0) + 1
        line_number += 1  # without this, the parity check never changes
frequent_name_counter = [(k, v) for k, v in sorted(dico_count.items(), key=lambda x: x[1], reverse=True)]
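The dict-plus-.get() pattern above is exactly what collections.Counter encapsulates; an equivalent sketch:
from collections import Counter

with open('fakeprofile.csv') as f:
    # 0-based even indices are the 1-based odd lines holding the names.
    counts = Counter(line.split()[0] for i, line in enumerate(f)
                     if i % 2 == 0 and line.strip())
print(counts.most_common(3))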

filtering a txt file in python on three letter words only

I need to write a program in Python 3, run from the Linux shell, that filters a txt file down to three-letter words only.
This is what I've got so far:
def main():
    string = open("verhaaltje.txt", "r")
    words = [word for word in string.split() if len(word)==3]
    file.close()
    print (str(words))
main()
Is there anyone that can help?
Upload your txt file contents and error logs.
string = open("verhaaltje.txt", "r")
words = [word for word in string.read().split() if len(word)==3]
string.close()
print (str(words))
With my code above (note string.read().split() instead of string.split(), and string.close() instead of file.close()) and some text from the internet, I got:
['the', 'the', 'the', 'm).', 'the', 'the', 'The', 'its', 'ago', 'was', 'm),', 'but', 'now', 'm).', 'and', 'its', 'big', 'but', 'the', 'are', 'and', 'its', 'the', 'and', 'for']
To print one word per line, modify the print statement a bit:
print ('\n'.join(words))
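Note that the output above includes tokens such as 'm).' and 'm),' because split() keeps punctuation attached; a word-boundary regex is one way to exclude those (a sketch):
import re

with open("verhaaltje.txt") as f:
    text = f.read()

# \b[A-Za-z]{3}\b matches exactly-three-letter words, ignoring punctuation.
words = re.findall(r'\b[A-Za-z]{3}\b', text)
print('\n'.join(words))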

Filter rows based on whether a column value is present in a List of Strings or not

I have a dataframe:
var input1 = spark.createDataFrame(Seq(
  (10L, "Joe Doe", 34),
  (11L, "Jane Doe", 31),
  (12L, "Alice Jones", 25)
)).toDF("id", "name", "age")
I am trying to filter out rows whose values are not present in a list.
I can filter based on age and id easily:
input1.filter("age not in (31,56,81)").show()
But the same is not working when I try to filter based on name:
input1.filter("name not in ("joe Doe","Pappu cam","Log")").show()
There must be some special way to represent strings while filtering.
I am getting this exception:
org.apache.spark.sql.catalyst.parser.ParseException:
extraneous input 'Doe' expecting {')', ',', '.', '[', 'OR', 'AND', 'IN', NOT, 'BETWEEN', 'LIKE', RLIKE, 'IS', EQ, '<=>', '<>', '!=', '<', LTE, '>', GTE, '+', '-', '*', '/', '%', 'DIV', '&', '|', '^'}(line 1, pos 16)
== SQL ==
name not in (Joe Doe,abc dej)
----------------^^^
Seems like a syntax error: the names inside the SQL expression lost their quotes. Try single quotes:
input1.filter("name not in ('joe Doe','Pappu cam','Log')").show()
Try to escape the quotes in the SQL query:
input1.filter(s"""name not in ("joe Doe","Pappu cam","Log")""").show()
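Quoting aside, the Column API avoids SQL-string parsing entirely; a sketch of the same filter in PySpark (in Scala the equivalent pattern is !col("name").isin(...)):
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
input1 = spark.createDataFrame(
    [(10, "Joe Doe", 34), (11, "Jane Doe", 31), (12, "Alice Jones", 25)],
    ["id", "name", "age"],
)

# No SQL string to parse, so the names need no special quoting.
input1.filter(~col("name").isin("joe Doe", "Pappu cam", "Log")).show()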
