Newbie here. I'm trying to extract full names of people and organisations using the following code.
from nltk import ne_chunk, pos_tag, word_tokenize
from nltk.tree import Tree

def get_continuous_chunks(text):
    chunked = ne_chunk(pos_tag(word_tokenize(text)))
    continuous_chunk = []
    current_chunk = []
    for i in chunked:
        if type(i) == Tree:
            current_chunk.append(' '.join([token for token, pos in i.leaves()]))
        if current_chunk:
            named_entity = ' '.join(current_chunk)
            if named_entity not in continuous_chunk:
                continuous_chunk.append(named_entity)
                current_chunk = []
        else:
            continue
        return continuous_chunk
>>> my_sent = "Toni Morrison was the first black female editor in fiction at Random House in New York City."
>>> get_continuous_chunks(my_sent)
['Toni']
As you can see, it is returning only the first proper noun: not the full name, and not any of the other proper nouns in the string.
What am I doing wrong?
Here is some working code.
The best thing to do is to step through your code and put print statements in a few different places. Below you can see where I printed the type() and the str() value of the items you are iterating over. I find it helps me to visualize and think more about the loops and conditionals I am writing if I can see their values listed.
Also, oops, I inadvertently named all of the variables, "contiguous" instead of "continuous" ... not sure why ... contiguous might be more accurate
Code:
from nltk import ne_chunk, pos_tag, word_tokenize
from nltk.tree import Tree

def get_continuous_chunks(text):
    chunked = ne_chunk(pos_tag(word_tokenize(text)))

    current_chunk = []
    contiguous_chunk = []
    contiguous_chunks = []

    for i in chunked:
        print(f"{type(i)}: {i}")
        if type(i) == Tree:
            current_chunk = ' '.join([token for token, pos in i.leaves()])
            # Apparently, "Toni" and "Morrison" are two separate items,
            # but "Random House" and "New York City" are single items.
            contiguous_chunk.append(current_chunk)
        else:
            # Discontiguous: append any pending chunk to the known contiguous chunks.
            if len(contiguous_chunk) > 0:
                contiguous_chunks.append(' '.join(contiguous_chunk))
                contiguous_chunk = []
            current_chunk = []

    # Flush a trailing chunk in case the text ends with an entity.
    if contiguous_chunk:
        contiguous_chunks.append(' '.join(contiguous_chunk))

    return contiguous_chunks

my_sent = "Toni Morrison was the first black female editor in fiction at Random House in New York City."

print()
contig_chunks = get_continuous_chunks(my_sent)
print(f"INPUT: My sentence: '{my_sent}'")
print(f"ANSWER: My contiguous chunks: {contig_chunks}")
Execution:

(venv) [ttucker@zim stackoverflow]$ python contig.py
<class 'nltk.tree.Tree'>: (PERSON Toni/NNP)
<class 'nltk.tree.Tree'>: (PERSON Morrison/NNP)
<class 'tuple'>: ('was', 'VBD')
<class 'tuple'>: ('the', 'DT')
<class 'tuple'>: ('first', 'JJ')
<class 'tuple'>: ('black', 'JJ')
<class 'tuple'>: ('female', 'NN')
<class 'tuple'>: ('editor', 'NN')
<class 'tuple'>: ('in', 'IN')
<class 'tuple'>: ('fiction', 'NN')
<class 'tuple'>: ('at', 'IN')
<class 'nltk.tree.Tree'>: (ORGANIZATION Random/NNP House/NNP)
<class 'tuple'>: ('in', 'IN')
<class 'nltk.tree.Tree'>: (GPE New/NNP York/NNP City/NNP)
<class 'tuple'>: ('.', '.')
INPUT: My sentence: 'Toni Morrison was the first black female editor in fiction at Random House in New York City.'
ANSWER: My contiguous chunks: ['Toni Morrison', 'Random House', 'New York City']
I am also a little unclear as to exactly what you were looking for, but from the description, this seems like it.
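As a side note, in case it is useful: each Tree subtree also carries the entity label (PERSON, ORGANIZATION, GPE). Here is a minimal sketch of that variation; it is not part of the answer above, and it keeps adjacent entities separate rather than merging them:

from nltk import ne_chunk, pos_tag, word_tokenize
from nltk.tree import Tree

def get_labeled_chunks(text):
    # Pair each chunk with its entity label instead of merging neighbors.
    chunked = ne_chunk(pos_tag(word_tokenize(text)))
    chunks = []
    for subtree in chunked:
        if isinstance(subtree, Tree):
            phrase = ' '.join(token for token, pos in subtree.leaves())
            chunks.append((phrase, subtree.label()))  # e.g. ('Random House', 'ORGANIZATION')
    return chunks

# get_labeled_chunks(my_sent)
# [('Toni', 'PERSON'), ('Morrison', 'PERSON'),
#  ('Random House', 'ORGANIZATION'), ('New York City', 'GPE')]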
I am ultimately trying to save variables that my browser finds to a file, so they can be recalled later to check whether it has already gone through those values before. Before I reach that step, I am testing my code and have been running into issues:
First part of my code with no error:
import shelve
shelfFile = shelve.open('mydata')
cats = ['Zophie', 'Pooka', 'Simon']
shelfFile['cats'] = cats
shelfFile.close()
This does what it is intended to do: it saves cats to a file. The second part, which reads it back, is where the problem appears:
import shelve
shelfFile = shelve.open('mydata')
cats = shelfFile['cats']
shelfFile.close()
new = 'Zophie', 'Pooka', 'Simon'
if new in cats:
    print('Found it!')
else:
    print("There is an error")
When I run the code it tells me there is an error rather than saying that it found it. Since the list variables are the same, why are they not matching?
I haven't seen variables declared as a bare comma-separated list the way you did it: new = 'Zophie', 'Pooka', 'Simon'.
I'm pretty sure that was just a typo and you should use a list for the names:
new = ['Zophie', 'Pooka', 'Simon']
for item in new:
    if item in cats:
        print('Found it!')
    else:
        print("Not found")
or:
new = ['Zophie', 'Pooka', 'Simon']
for item in new:
    if item in cats:
        print(f'Found {item}!')
    else:
        print(f'{item} is not found')
You are checking whether a tuple is an element of a list.
new = 'Zophie', 'Pooka', 'Simon' creates a tuple object,
while cats = ['Zophie', 'Pooka', 'Simon'] is a list.
If cats = shelfFile['cats'] returns the list of cats, then what you need to do is:
for n in new:
    if n in cats:
        print('Found it!')
    else:
        print('Not found')
You can demonstrate the logic by running this in the interactive interpreter:
>>> x = "a", "b", "c"
>>> print(x)
('a', 'b', 'c')
>>> y = ["a", "b", "c"]
>>> x in y
False
>>> for n in x:
...     if n in y:
...         print("found")
...
found
found
found
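And if the goal is to compare the whole collection at once rather than item by item, you can convert one side so the types match; a small sketch using the question's names:

new = 'Zophie', 'Pooka', 'Simon'     # a tuple, because of the bare commas
cats = ['Zophie', 'Pooka', 'Simon']  # a list

# list(new) == cats compares the elements, so the types no longer differ
if list(new) == cats:
    print('Found it!')  # prints, since the items and their order match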
I would like to write a function that builds a bytes string using an f-string with different values. The only way I can think of is the following code; does anyone have a better suggestion? In the code below the string is short ("I have level ..."), but in my actual code the string is about 600 characters.
def get_level_string(x):
    size = dict(v1=1, v2=200, v3=30000)
    s = size.get('v1')
    name = lambda x: f"I have level value as {x} in the house"
    return {
        'level1': b'%a' % (name(size['v1'])),
        'level2': b'%a' % (name(size['v2'])),
        'level3': b'%a' % (name(size['v3'])),
    }[x]
a = get_level_string('level1')
b = get_level_string('level2')
c = get_level_string('level3')
print(a, type(a))
print(b, type(b))
print(c, type(c))
=> #b"'I have level value as 1 in the house'" <class 'bytes'>
=> #b"'I have level value as 200 in the house'" <class 'bytes'>
=> #b"'I have level value as 30000 in the house'" <class 'bytes'>
You can make this a good deal simpler by generating the strings and then calling their encode method to make them bytes objects. Note that your function really just builds a dictionary and then looks things up in it. It's much simpler to build the dictionary only once and then supply its bound __getitem__ method under a different name.
template = "I have level value as {} in the house"
size_list = (1, 200, 30000)
sizes = {f"level{i}": template.format(x).encode() for i, x in enumerate(size_list, start=1)}
get_level_string = sizes.__getitem__
# tests
a = get_level_string('level1')
b = get_level_string('level2')
c = get_level_string('level3')
print(a, type(a))
print(b, type(b))
print(c, type(c))
prints
b'I have level value as 1 in the house' <class 'bytes'>
b'I have level value as 200 in the house' <class 'bytes'>
b'I have level value as 30000 in the house' <class 'bytes'>
These match your tests, except without the stray inner quotes.
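Those inner quotes in your original output come from the %a conversion, which applies ascii() (essentially repr) to its argument before encoding it; str.encode does not add them. A quick interpreter check, not part of the original answer:

>>> b'%a' % 'hello'
b"'hello'"
>>> 'hello'.encode()
b'hello'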
I have this list in a file:
Alabama
4802982
9
Alaska
721523
3
Arizona
6412700
11
Arkansas
2926229
6
California
37341989
55
Colorado
5044930
9
(Except it continues for every state.) I need to create a dictionary with the state names as keys and both the population and electoral votes (the first and second numbers) as a list of values.
This is my function so far:
def make_elector_dictionary(file):
    dic = {}
    try:
        infile = open(file,'r')
    except IOError:
        print('file not found')
    else:
        for line in infile:
            line = line.strip()
            dic[line] = ()
        print(dic)
Try this:
s = "Alabama 4802982 9 Alaska 721523 3 Arizona 6412700 11 Arkansas 2926229 6 California 37341989 55 Colorado 5044930 9"
l = s.split()
dictionaryYouWant = {l[index]: [l[index+1], l[index+2]] for index in range(0, len(l), 3)}
Split the string on whitespace to break it into words, then loop through them three at a time, making each dictionary item "first one: list of the last two" with a dictionary comprehension.
This gives:
{'Alabama': ['4802982', '9'], 'Alaska': ['721523', '3'], 'Arizona': ['6412700', '11'], 'Arkansas': ['2926229', '6'], 'California': ['37341989', '55'], 'Colorado': ['5044930', '9']}
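The same idea works applied directly to your file, because split() with no argument splits on any whitespace, including newlines. A minimal sketch (the file name here is assumed):

with open('states.txt') as infile:
    l = infile.read().split()  # whitespace split handles one-item-per-line files too

dictionaryYouWant = {l[index]: [l[index+1], l[index+2]] for index in range(0, len(l), 3)}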
The following should give you roughly what you want:
def make_elector_dictionary(file):
    # Open and read the entire file
    try:
        with open(file,'r') as infile:
            raw_data = infile.read()
    except IOError:
        print('file not found')
        return
    # Split the text into a list, using whitespace (spaces or newlines)
    # as the separator between elements
    raw_data = raw_data.split()
    # Rearrange the data into a dictionary of dictionaries
    processed_data = {raw_data[i]: {'pop': int(raw_data[i+1]), 'electoral_votes': int(raw_data[i+2])}
                      for i in range(0, len(raw_data), 3)}
    return processed_data
print(make_elector_dictionary('data.txt'))
This gives:
{'Arizona': {'pop': 6412700, 'electoral_votes': 11}, 'Arkansas': {'pop': 2926229, 'electoral_votes': 6}, 'California': {'pop': 37341989, 'electoral_votes': 55}, 'Colorado': {'pop': 5044930, 'electoral_votes': 9}, 'Alabama': {'pop': 4802982, 'electoral_votes': 9}, 'Alaska': {'pop': 721523, 'electoral_votes': 3}}
Or you can use

processed_data = {raw_data[i]: [int(raw_data[i+1]), int(raw_data[i+2])]
                  for i in range(0, len(raw_data), 3)}

if you want the dictionary values to be lists rather than dictionaries. Whether this approach works depends somewhat on the details of your data file. For instance, if "New Hampshire" is written in your data file with a space between "New" and "Hampshire", then "New" will be interpreted by the function as the state name, and you'll get a ValueError when "Hampshire" is passed to int as the population. In that case you'd have to resort to some more sophisticated parsing; regular expressions are probably the best option. You could do:
processed_data = {match[1]: [match[2], match[3]]
                  for match in re.findall(r'(\W|^)([a-zA-Z ]+)\s+(\d+)\s+(\d+)', raw_data)}
Remember to import re. This is probably the most robust approach. It will handle the New Hampshire-type case and, in the form above, is not dependent on the type of whitespace that separates the data elements.
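For example, run against a small sample that contains two-word state names (the sample data here is illustrative, not from the question's file):

import re

raw_data = "New Hampshire\n1316470\n4\nNew Jersey\n8791894\n14"
processed_data = {match[1]: [match[2], match[3]]
                  for match in re.findall(r'(\W|^)([a-zA-Z ]+)\s+(\d+)\s+(\d+)', raw_data)}
print(processed_data)
# {'New Hampshire': ['1316470', '4'], 'New Jersey': ['8791894', '14']}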
I tried modifying this example code on Python 3.x.
import csv

def cmp(a, b):
    return (a > b) - (a < b)

# write stocks data as comma-separated values
f = open('stocks.csv', 'w')
writer = csv.writer(f)
writer.writerows([
    ('GOOG', 'Google, Inc.', 505.24, 0.47, 0.09),
    ('YHOO', 'Yahoo!, Inc.', 27.38, 0.33, 1.22),
    ('CNET', 'CNET Networks, Inc.', 8.62, -0.13, -1.49)
])
f.close()

# read stocks data, print status messages
f = open('stocks.csv', 'r')
stocks = csv.reader(f)
status_labels = {-1: 'down', 0: 'unchanged', 1: 'up'}
for ticker, name, price, change, pct in stocks:
    status = status_labels[cmp(float(change), 0.0)]
    print('%s is %s (%s%%)' % (name, status, pct))
f.close()
With suggestions from @glibdud and @bernie, I have updated my code.
I am getting the error below:
ValueError: not enough values to unpack (expected 5, got 0)
What am I missing?
Note: I removed my question about double quotes around strings in the CSV file. Double quotes appear only when a string itself contains a comma, otherwise not.
The problem occurs while writing the file.
It is the newline handling of the csv module; see the note on newline='' in the csv module documentation (footnote 1).
If you add print(*stocks, sep='\n') right after stocks = csv.reader(f), you will get the following output:
['GOOG', 'Google, Inc.', '505.24', '0.47', '0.09']
[]
['YHOO', 'Yahoo!, Inc.', '27.38', '0.33', '1.22']
[]
['CNET', 'CNET Networks, Inc.', '8.62', '-0.13', '-1.49']
[]
You see... an empty list cannot provide 5 values to unpack.
@bernie already gave you the solution in his comment. Change the line that opens the file for writing to:

f = open('stocks.csv', 'w', newline='')
                            ^^^^^^^^^^

and you're fine.
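With that change applied (the csv documentation recommends newline='' when opening for reading as well), the round trip works end to end; a minimal sketch:

import csv

# newline='' lets the csv module control line endings itself,
# so no extra blank rows are produced on Windows
with open('stocks.csv', 'w', newline='') as f:
    csv.writer(f).writerows([
        ('GOOG', 'Google, Inc.', 505.24, 0.47, 0.09),
        ('YHOO', 'Yahoo!, Inc.', 27.38, 0.33, 1.22),
    ])

with open('stocks.csv', 'r', newline='') as f:
    for ticker, name, price, change, pct in csv.reader(f):
        print(name, change)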
Are there any solutions how to compare short strings not by characters, but by meaning? I've tried to google it, but all search results are about comparing characters, length and so on.
I'm not asking for ready-to-use solutions; just show me the way, where I need "to dig".
Thank you in advance.
Your question is not specific enough. When you compare strings by meaning, you need to define what level of equality counts. For example, take "I have 10 dollars" and "there are 10 dollars in my pocket": are they equal by your definition? Sometimes there is implied meaning in a string.
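One concrete place to dig is semantic similarity via word or sentence embeddings, where each text is mapped to a vector and the vectors are compared by cosine similarity. A minimal sketch with spaCy, not from the original answers; it assumes spacy and its en_core_web_md model are installed, and the relative scores are illustrative:

import spacy

nlp = spacy.load("en_core_web_md")  # a model that ships with word vectors

a = nlp("I have 10 dollars")
b = nlp("There are 10 dollars in my pocket")
c = nlp("The weather is nice today")

# Doc.similarity() returns the cosine similarity of the averaged word vectors
print(a.similarity(b))  # comparatively high: similar meaning
print(a.similarity(c))  # lower: unrelated meaning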
Here is an answer to a very similar (now closed) question that wanted to compare the context between two lists, ['apple', 'spinach', 'clove'] and ['fruit', 'vegetable', 'spice'], using the Google Knowledge Graph Search API:
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def get_descriptions_set(query: str) -> set[str]:
    descriptions = set()
    kg_response = get_kg_response(query)
    for element in kg_response['itemListElement']:
        if 'description' in element['result']:
            descriptions.add(element['result']['description'].lower())
    return descriptions


def get_kg_response(query: str) -> dict:
    api_key = open('.api_key').read()
    service_url = 'https://kgsearch.googleapis.com/v1/entities:search'
    params = {
        'query': query,
        'limit': 10,
        'indent': True,
        'key': api_key,
    }
    url = f'{service_url}?{urlencode(params)}'
    response = json.loads(urlopen(url).read())
    return response


def main() -> None:
    list_1 = ['apple', 'spinach', 'clove']
    list_2 = ['fruit', 'vegetable', 'spice']
    list_1_kg_descriptions = [get_descriptions_set(q) for q in list_1]
    print('\n'.join(f'{q} {descriptions}'
                    for q, descriptions in zip(list_1, list_1_kg_descriptions)))
    list_2_matches_context = [
        d in descriptions
        for d, descriptions in zip(list_2, list_1_kg_descriptions)
    ]
    print(list_2_matches_context)


if __name__ == '__main__':
    main()
Output:
apple {'watch', 'technology company', 'fruit', 'american singer-songwriter', 'digital media player', 'mobile phone', 'tablet computer', 'restaurant company', 'plant'}
spinach {'video game', 'plant', 'vegetable', 'dish'}
clove {'village in england', 'spice', 'manga series', 'production company', '2018 film', 'american singer-songwriter', '2008 film', 'plant'}
[True, True, True]