Combining thousands of list strings in Python

I have a .txt file of "Alice's Adventures in Wonderland" and need to strip all the punctuation and make all of the words lowercase, so I can find the number of unique words in the file. The wordlist referred to below is a single list of all the individual words from the book as strings, so wordlist looks like this:
["Alice's", 'Adventures', 'in', 'Wonderland', "ALICE'S",
'ADVENTURES', 'IN', 'WONDERLAND', 'Lewis', 'Carroll', 'THE',
'MILLENNIUM', 'FULCRUM', 'EDITION', '3.0', 'CHAPTER', 'I',
'Down', 'the', 'Rabbit-Hole', 'Alice', 'was', 'beginning',
'to', 'get', 'very', 'tired', 'of', 'sitting', 'by', 'her',
'sister', 'on', 'the', 'bank,'
The code I have so far is:
from string import punctuation

def wordcount(book):
    for word in wordlist:
        no_punc = word.strip(punctuation)
        lower_case = no_punc.lower()
        newlist = lower_case.split()
        print(newlist)
This works for stripping punctuation and making all the words lowercase. However, the newlist = lower_case.split() line makes an individual list out of every word, so I cannot iterate over one big list to find the number of unique words. The reason I did the .split() is so that, when iterated over, Python does not count every letter as a word; each word is kept intact since it is its own list item. Any ideas on how I can improve this, or a more efficient approach? Here is a sample of the output:
['down']
['the']
['rabbit-hole']
['alice']
['was']
['beginning']
['to']
['get']
['very']
['tired']
['of']
['sitting']
['by']
['her']

Here is a modification of your code, with outputs:
from string import punctuation

wordlist = "Alice fell down down down!.. down into, the hole."
single_list = []
for word in wordlist.split(" "):
    no_punc = word.strip(punctuation)
    lower_case = no_punc.lower()
    newlist = lower_case.split()
    # print(newlist)
    single_list.append(newlist[0])
print(single_list)

# to get the unique words
single_list_unique = set(single_list)
print(single_list_unique)
print(len(single_list_unique))
and that produces:
['alice', 'fell', 'down', 'down', 'down', 'down', 'into', 'the', 'hole']
and the unique set:
{'fell', 'alice', 'down', 'into', 'the', 'hole'}
and the length of the unique:
6
(This may not be the most efficient approach, but it is close to your current code and will suffice for that book of thousands of elements. If this were a backend process serving multiple requests, you would want to optimize it further.)
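For example, one such tightening (a sketch of my own, not required for your use case) collapses the loop into a single set comprehension:

from string import punctuation

# A compact variant: strip punctuation and lowercase each word inside
# one set comprehension; the set gives unique words directly.
text = "Alice fell down down down!.. down into, the hole."
unique_words = {word.strip(punctuation).lower()
                for word in text.split()
                if word.strip(punctuation)}
print(len(unique_words))  # 6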
EDIT----------
You may be importing from the file using a library that passes in a list, in which case you get AttributeError: 'list' object has no attribute 'split'; or you might see IndexError: list index out of range because of an empty string. In that case, use this modification:
from string import punctuation

wordlist2 = ["", "Alice fell down down down!.. down into, the hole.",
             "There was only one hole for Alice to fall down into"]
single_list = []
for wordlist in wordlist2:
    for word in wordlist.split(" "):
        no_punc = word.strip(punctuation)
        lower_case = no_punc.lower()
        newlist = lower_case.split()
        # print(newlist)
        if len(newlist) > 0:
            single_list.append(newlist[0])
print(single_list)

# to get the unique words
single_list_unique = set(single_list)
print(single_list_unique)
print(len(single_list_unique))
producing:
['alice', 'fell', 'down', 'down', 'down', 'down', 'into', 'the', 'hole', 'there', 'was', 'only', 'one', 'hole', 'for', 'alice', 'to', 'fall', 'down', 'into']
{'there', 'fall', 'fell', 'alice', 'for', 'down', 'was', 'into', 'the', 'to', 'only', 'hole', 'one'}
13

Related

Creating nested list for elements with same counts

Here's an example list:
['hello', 'hell', 'hel', 'he', 'h', 'he', 'hell', 'hello', 'hel', 'hello', 'hell']
How would I go about making a nested list for the elements with the same counts? To be clear: I want to nest together the elements that appear the same number of times in the list. The output would look like this:
[['hello','hell'], ['hel', 'he'], ['h']]
Because the count of ['hello', 'hell'] is 3, they are grouped together, like the rest of the elements in the list.
With some imports it could be done like this:
from collections import Counter
from itertools import groupby
words = ['hello', 'hell', 'hel', 'he', 'h', 'he', 'hell', 'hello', 'hel', 'hello', 'hell']
counts = Counter(words)
res = [list(group) for _, group in groupby(counts, key=lambda k: counts[k])]
res will be:
[['hello', 'hell'], ['hel', 'he'], ['h']]
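One caveat: groupby only groups consecutive keys, so the snippet above relies on the Counter preserving insertion order (Python 3.7+). A slightly more robust sketch sorts the keys by count first:

from collections import Counter
from itertools import groupby

words = ['hello', 'hell', 'hel', 'he', 'h', 'he', 'hell', 'hello', 'hel', 'hello', 'hell']
counts = Counter(words)

# Sort keys by descending count so equal counts are adjacent before grouping.
ordered = sorted(counts, key=counts.get, reverse=True)
res = [list(group) for _, group in groupby(ordered, key=counts.get)]
print(res)  # [['hello', 'hell'], ['hel', 'he'], ['h']]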

NLP summarization using textacy/spacy

I want to generate a summary, maybe in one sentence, from this text. I am using textacy.
Here is my code:
import textacy
import textacy.keyterms
import textacy.extract
import spacy
nlp = spacy.load('en_core_web_sm')
text = '''Sauti said, 'O thou that art blest with longevity, I shall narrate the history of Astika as I heard it from my father.
O Brahmana, in the golden age, Prajapati had two daughters.
O sinless one, the sisters were endowed with wonderful beauty.
Named Kadru and Vinata, they became the wives of Kasyapa.
Kasyapa derived great pleasure from his two wedded wives and being gratified he, resembling Prajapati himself, offered to give each of them a boon.
Hearing that their lord was willing to confer on them their choice blessings, those excellent ladies felt transports of joy.
Kadru wished to have for sons a thousand snakes all of equal splendour.
And Vinata wished to bring forth two sons surpassing the thousand offsprings of Kadru in strength, energy, size of body, and prowess.
Unto Kadru her lord gave that boon about a multitude of offspring.
And unto Vinata also, Kasyapa said, 'Be it so!' Then Vinata, having; obtained her prayer, rejoiced greatly.
Obtaining two sons of superior prowess, she regarded her boon fulfilled.
Kadru also obtained her thousand sons of equal splendour.
'Bear the embryos carefully,' said Kasyapa, and then he went into the forest, leaving his two wives pleased with his blessings.'''
doc = textacy.make_spacy_doc(text, 'en_core_web_sm')
sentobj = nlp(text)
sentences = textacy.extract.subject_verb_object_triples(sentobj)
summary = ''
for i, x in enumerate(sentences):
    subject, verb, fact = x
    print('Fact ' + str(i+1) + ': ' + str(subject) + ' : ' + str(verb) + ' : ' + str(fact))
    summary += 'Fact ' + str(i+1) + ': ' + str(fact)
Results are as follows:
Fact 1: I : shall narrate : history
Fact 2: I : heard : it
Fact 3: they : became : wives
Fact 4: Kasyapa : derived : pleasure
Fact 5: ladies : felt : transports
Fact 6: Kadru : wished : have
Fact 7: Vinata : wished : to bring
Fact 8: lord : gave : boon
Fact 9: Kasyapa : said : Be
Fact 10: Vinata : obtained : prayer
Fact 11: she : regarded : boon
Fact 12: Kadru : obtained : sons
I tried
textacy.extract.words
textacy.extract.entities
textacy.extract.ngrams
textacy.extract.noun_chunks
textacy.ke.textrank
Everything is working as per the book, but the results are not perfect.
I want something like "Kasyapa married the sisters Kadru and Vinata" or "Kasyapa gave embryos to Kadru and Vinata".
Can you suggest how to do this, or recommend some alternative packages?
Just an update: I have been able to PageRank the "Sauti" sentences. Here are the results, in descending order of PageRank score:
(0.0869526908422304, ['O', 'Brahmana', ',', 'in', 'the', 'golden', 'age', ',', 'Prajapati', 'had', 'two', 'daughters', '.']),
(0.08675152795526771, ['Named', 'Kadru', 'and', 'Vinata', ',', 'they', 'became', 'the', 'wives', 'of', 'Kasyapa', '.']),
(0.08607926397402169, ['And', 'Vinata', 'wished', 'to', 'bring', 'forth', 'two', 'sons', 'surpassing', 'the', 'thousand', 'offsprings', 'of', 'Kadru', 'in', 'strength', ',', 'energy', ',', 'size', 'of', 'body', ',', 'and', 'prowess', '.']),
(0.08096858541855065, ['Kasyapa', 'derived', 'great', 'pleasure', 'from', 'his', 'two', 'wedded', 'wives', 'and', 'being', 'gratified', 'he', ',', 'resembling', 'Prajapati', 'himself', ',', 'offered', 'to', 'give', 'each', 'of', 'them', 'a', 'boon', '.']),
(0.08025844559654187, ['And', 'unto', 'Vinata', 'also', ',', 'Kasyapa', 'said', ',', '("\'Be",', "'VBD", 'it', 'so', '!', '("\'",', '"\'\'"),', 'Then', 'Vinata', ',', 'having', ';', 'obtained', 'her', 'prayer', ',', 'rejoiced', 'greatly', '.']),
(0.07764697882919071, ['Obtaining', 'two', 'sons', 'of', 'superior', 'prowess', ',', 'she', 'regarded', 'her', 'boon', 'fulfilled', '.']),
(0.07717129674341844, ['("\'Bear",', "'IN", 'the', 'embryos', 'carefully', ',', '("\'",', '"\'\'"),', 'said', 'Kasyapa', ',', 'and', 'then', 'he', 'went', 'into', 'the', 'forest', ',', 'leaving', 'his', 'two', 'wives', 'pleased', 'with', 'his', 'blessings', '.']),
(0.0768816552210493, ['Kadru', 'also', 'obtained', 'her', 'thousand', 'sons', 'of', 'equal', 'splendour', '.']),
(0.07172005226142254, ['Kadru', 'wished', 'to', 'have', 'for', 'sons', 'a', 'thousand', 'snakes', 'all', 'of', 'equal', 'splendour', '.']),
(0.06953411123175395, ['Unto', 'Kadru', 'her', 'lord', 'gave', 'that', 'boon', 'about', 'a', 'multitude', 'of', 'offspring', '.']),
(0.06943939082844, ['Sauti\\', 'said', ',', '("\'",', '"\'\'"),', 'O', 'thou', 'that', 'art', 'blest', 'with', 'longevity', ',', 'I', 'shall', 'narrate', 'the', 'history', 'of', 'Astika', 'as', 'I', 'heard', 'it', 'from', 'my', 'father', '.']),
(0.06888390365265022, ['O', 'sinless', 'one', ',', 'the', 'sisters', 'were', 'endowed', 'with', 'wonderful', 'beauty', '.']),
(0.0677120974454628, ['Hearing', 'that', 'their', 'lord', 'was', 'willing', 'to', 'confer', 'on', 'them', 'their', 'choice', 'blessings', ',', 'those', 'excellent', 'ladies', 'felt', 'transports', 'of', 'joy', '.'])]
The results are not what I was looking for, but they are impressive.
I used the following libraries:
import nltk.tokenize as tk
from nltk import sent_tokenize, word_tokenize
from nltk.cluster.util import cosine_distance
from nltk.corpus import brown, stopwords
import networkx as nx
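For anyone curious, here is a minimal sketch of how those pieces can fit together (my own illustration, with a hypothetical pagerank_summary helper; it assumes the nltk punkt and stopwords data have been downloaded):

import networkx as nx
from nltk import sent_tokenize, word_tokenize
from nltk.corpus import stopwords

def pagerank_summary(text, n=1):
    # Represent each sentence as a bag of lowercased content words.
    stops = set(stopwords.words('english'))
    sents = sent_tokenize(text)
    bags = [{w.lower() for w in word_tokenize(s)
             if w.isalpha() and w.lower() not in stops} for s in sents]
    # Build a sentence graph weighted by word overlap, then rank it.
    g = nx.Graph()
    g.add_nodes_from(range(len(sents)))
    for i in range(len(sents)):
        for j in range(i + 1, len(sents)):
            overlap = len(bags[i] & bags[j])
            if overlap:
                g.add_edge(i, j, weight=overlap)
    ranks = nx.pagerank(g)
    top = sorted(ranks, key=ranks.get, reverse=True)[:n]
    return ' '.join(sents[i] for i in sorted(top))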
Just wanted to share this with you all. Thanks.

Tokenize in Python

I am trying to build a function in Python that allows me to tokenize a string of characters. I have written the following function:
import nltk

def tokenize(string):
    words = nltk.word_tokenize(string)
    return words
Calling this function produces the following:
tokenize("Hello. What’s your name?")
['Hello', '.', 'What', '’', 's', 'your', 'name', '?']
But I need it to give me the following:
['Hello', '.', 'What’s', 'your', 'name', '?']
How could I implement this? Thank you.
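One possible approach (a sketch of my own, not from the original thread): match contractions before plain words with a small regular expression, handling both straight and curly apostrophes:

import re

def tokenize(text):
    # Try word+apostrophe+word first so contractions stay intact,
    # then plain words, then single punctuation characters.
    return re.findall(r"\w+[’']\w+|\w+|[^\w\s]", text)

print(tokenize("Hello. What’s your name?"))
# ['Hello', '.', 'What’s', 'your', 'name', '?']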

filtering a txt file in python on three letter words only

I need to write a program in Python 3 that can filter a txt file in the Linux shell for three-letter words only.
This is what I've got so far:
def main():
    string = open("verhaaltje.txt", "r")
    words = [word for word in string.split() if len(word) == 3]
    file.close()
    print(str(words))

main()
Is there anyone that can help?
Upload your txt file contents and error logs. Meanwhile, here is a corrected version:
string = open("verhaaltje.txt", "r")
words = [word for word in string.read().split() if len(word) == 3]
string.close()
print(str(words))
With my above code and some text from the internet, I got
['the', 'the', 'the', 'm).', 'the', 'the', 'The', 'its', 'ago', 'was', 'm),', 'but', 'now', 'm).', 'and', 'its', 'big', 'but', 'the', 'are', 'and', 'its', 'the', 'and', 'for']
To print one word per line, modify the print statement a bit:
print('\n'.join(words))
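As a further tidy-up (my own sketch, assuming the same file name), a context manager closes the file automatically, and stripping punctuation keeps tokens like 'm).' from slipping through:

from string import punctuation

# Read the file, strip surrounding punctuation from each token,
# then keep only the three-letter words.
with open("verhaaltje.txt", "r") as f:
    words = [w.strip(punctuation) for w in f.read().split()]

print('\n'.join(w for w in words if len(w) == 3))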

Comparing like words between two dictionaries

I am using Python 3.x. I have two dictionaries (both very large, but I will substitute smaller ones here). The values of the dictionaries contain more than one word:
dict_a = {'key1': 'Large left panel', 'key2': 'Orange bear rug', 'key3': 'Luxo jr. lamp'}
dict_a
{'key1': 'Large left panel',
'key2': 'Orange bear rug',
'key3': 'Luxo jr. lamp'}
dict_b = {'keyX': 'titanium panel', 'keyY': 'orange Ball and chain', 'keyZ': 'large bear musket'}
dict_b
{'keyX': 'titanium panel',
'keyY': 'orange Ball and chain',
'keyZ': 'large bear musket'}
I am looking for a way to compare the individual words contained in the values of dict_a to the words contained in the values of dict_b, and return a dictionary or data frame that contains each word and the keys from dict_a and dict_b it is associated with.
My desired output (not formatted any particular way):
bear: key2 (from dict_a), keyZ(from dict_b)
Luxo: key3
orange: key2 (from dict_a), keyY (from dict_b)
I've got code that works for looking up a specific word in a single dictionary, but it's not sufficient for what I need to accomplish here:
def search(myDict, lookup):
    aDict = {}
    for key, value in myDict.items():
        for v in value:
            if lookup in v:
                aDict[key] = value
    return aDict

print(key, value)
from collections import defaultdict

dicts = {'a': {'key1': 'Large left panel', 'key2': 'Orange bear rug',
               'key3': 'Luxo jr. lamp'},
         'b': {'keyX': 'titanium panel', 'keyY': 'orange Ball and chain',
               'keyZ': 'large bear musket'}}

index = defaultdict(list)
for dname, d in dicts.items():
    for key, words in d.items():
        for word in words.lower().split():  # lower() to make Orange/orange match
            index[word].append((dname, key))
index now contains:
{'and' : [('b', 'keyY')],
'ball' : [('b', 'keyY')],
'bear' : [('a', 'key2'), ('b', 'keyZ')],
'chain' : [('b', 'keyY')],
'jr.' : [('a', 'key3')],
'lamp' : [('a', 'key3')],
'large' : [('a', 'key1'), ('b', 'keyZ')],
'left' : [('a', 'key1')],
'luxo' : [('a', 'key3')],
'musket' : [('b', 'keyZ')],
'orange' : [('a', 'key2'), ('b', 'keyY')],
'panel' : [('a', 'key1'), ('b', 'keyX')],
'rug' : [('a', 'key2')],
'titanium': [('b', 'keyX')] }
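To answer the original question for a given word, just look it up (a usage sketch based on the index above):

print(index['bear'])    # [('a', 'key2'), ('b', 'keyZ')]
print(index['orange'])  # [('a', 'key2'), ('b', 'keyY')]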
Update (from the comments):
Since your actual dictionary is a mapping from string to list (and not string to string), change your loops to:
for dname, d in dicts.items():
    for key, wordlist in d.items():     # changed "words" to "wordlist"
        for words in wordlist:          # added extra loop to iterate over wordlist
            for word in words.split():  # removed .lower() since text is always uppercase
                index[word].append((dname, key))
Since your lists have only one item, you could just do:
for dname, d in dicts.items():
    for key, wordlist in d.items():
        for word in wordlist[0].split():  # assumes a single-item list
            index[word].append((dname, key))
If there are words that you don't want added to your index, you can skip them:
words_to_skip = {'-', ';', '/', 'AND', 'TO', 'UP', 'WITH', ''}
Then filter them out with:
if word in words_to_skip:
    continue
I noticed that you have some words surrounded by parentheses (such as (342) and (221)). If you want to get rid of the parentheses, do:
if word[0] == '(' and word[-1] == ')':
    word = word[1:-1]
Putting this all together, we get:
words_to_skip = {'-', ';', '/', 'AND', 'TO', 'UP', 'WITH', ''}

for dname, d in dicts.items():
    for key, wordlist in d.items():
        for word in wordlist[0].split():  # assumes a single-item list
            if word[0] == '(' and word[-1] == ')':
                word = word[1:-1]  # remove outer parentheses
            if word in words_to_skip:  # skip unwanted words
                continue
            index[word].append((dname, key))
I think you can do what you want pretty easily. This code produces output in the format {word: {key: name_of_dict_the_key_is_in}}:
def search(**dicts):
    result = {}
    for name, dct in dicts.items():
        for key, value in dct.items():
            for word in value.split():
                result.setdefault(word, {})[key] = name
    return result
You call it with the input dictionaries as keyword arguments. The keyword you use for each dictionary will be the string used to describe it in the output dictionary, so use something like search(dict_a=dict_a, dict_b=dict_b).
If your dictionaries might share some keys, this code might not work right, since entries could collide when those keys have the same words in their values. You could make the outer dict contain a list of (key, name) tuples instead of an inner dictionary, I suppose; just change the assignment line to result.setdefault(word, []).append((key, name)). That would be less handy to search in, though.
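For example (a usage sketch, with the sample dictionaries from the question):

dict_a = {'key1': 'Large left panel', 'key2': 'Orange bear rug', 'key3': 'Luxo jr. lamp'}
dict_b = {'keyX': 'titanium panel', 'keyY': 'orange Ball and chain', 'keyZ': 'large bear musket'}

result = search(dict_a=dict_a, dict_b=dict_b)
print(result['bear'])   # {'key2': 'dict_a', 'keyZ': 'dict_b'}
print(result['panel'])  # {'key1': 'dict_a', 'keyX': 'dict_b'}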
