Compare two strings by meaning - string

Are there any solutions how to compare short strings not by characters, but by meaning? I've tried to google it, but all search results are about comparing characters, length and so on.
I'm not asking you about ready-to-use solutions, just show me the way, where I need "to dig".
Thank you in advance.

Your question is not clear enough. To compare strings by meaning, you first need to define what counts as equal. For example, are "I have 10 dollars" and "there are 10 dollars in my pocket" equal under your definition? Sometimes a string also carries implied meaning.

This is an answer to a very similar closed question that wants to compare the context between two lists, ['apple', 'spinach', 'clove'] and ['fruit', 'vegetable', 'spice']. It uses the Google Knowledge Graph Search API:
import json
from urllib.parse import urlencode
from urllib.request import urlopen
def get_descriptions_set(query: str) -> set[str]:
    # Collect the lower-cased description of every entity returned for the query.
    descriptions = set()
    kg_response = get_kg_response(query)
    for element in kg_response['itemListElement']:
        if 'description' in element['result']:
            descriptions.add(element['result']['description'].lower())
    return descriptions
def get_kg_response(query: str) -> dict:
    # The API key is read from a local file; strip any trailing newline.
    with open('.api_key') as key_file:
        api_key = key_file.read().strip()
    service_url = 'https://kgsearch.googleapis.com/v1/entities:search'
    params = {
        'query': query,
        'limit': 10,
        'indent': True,
        'key': api_key,
    }
    url = f'{service_url}?{urlencode(params)}'
    # json.loads returns the parsed response as a dict, not a str.
    response = json.loads(urlopen(url).read())
    return response
def main() -> None:
    list_1 = ['apple', 'spinach', 'clove']
    list_2 = ['fruit', 'vegetable', 'spice']
    list_1_kg_descriptions = [get_descriptions_set(q) for q in list_1]
    print('\n'.join(f'{q} {descriptions}'
                    for q, descriptions in zip(list_1, list_1_kg_descriptions)))
    # For each pair, check whether the word from list_2 is among the descriptions.
    list_2_matches_context = [
        d in descriptions
        for d, descriptions in zip(list_2, list_1_kg_descriptions)
    ]
    print(list_2_matches_context)
if __name__ == '__main__':
    main()
Output:
apple {'watch', 'technology company', 'fruit', 'american singer-songwriter', 'digital media player', 'mobile phone', 'tablet computer', 'restaurant company', 'plant'}
spinach {'video game', 'plant', 'vegetable', 'dish'}
clove {'village in england', 'spice', 'manga series', 'production company', '2018 film', 'american singer-songwriter', '2008 film', 'plant'}
[True, True, True]
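The membership test above gives a yes/no answer per pair. As a rough sketch of how the same description sets could also give a graded score, the snippet below computes the Jaccard overlap between two sets. The sets are hard-coded copies of the output shown above (no API call is made), and `jaccard` is a hypothetical helper name, not part of the Knowledge Graph API.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Share of descriptions the two sets have in common (0.0 to 1.0)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Description sets copied from the output above.
apple = {'watch', 'technology company', 'fruit', 'american singer-songwriter',
         'digital media player', 'mobile phone', 'tablet computer',
         'restaurant company', 'plant'}
spinach = {'video game', 'plant', 'vegetable', 'dish'}
clove = {'village in england', 'spice', 'manga series', 'production company',
         '2018 film', 'american singer-songwriter', '2008 film', 'plant'}

print(sorted(apple & clove))               # descriptions shared by both words
print(round(jaccard(apple, spinach), 3))   # overlap score for apple/spinach
print(round(jaccard(apple, clove), 3))     # overlap score for apple/clove
```

A score like this only reflects shared descriptions, so it is a starting point rather than a real measure of meaning.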

Related

Extracting full names with ne_chunks

Newbie here. I'm trying to extract full names of people and organisations using the following code.
def get_continuous_chunks(text):
    chunked = ne_chunk(pos_tag(word_tokenize(text)))
    continuous_chunk = []
    current_chunk = []
    for i in chunked:
        if type(i) == Tree:
            current_chunk.append(' '.join([token for token, pos in i.leaves()]))
        if current_chunk:
            named_entity = ' '.join(current_chunk)
            if named_entity not in continuous_chunk:
                continuous_chunk.append(named_entity)
                current_chunk = []
        else:
            continue
        return continuous_chunk
>>> my_sent = "Toni Morrison was the first black female editor in fiction at Random House in New York City."
>>> get_continuous_chunks(my_sent)
['Toni']
As you can see it is returning only the first proper noun. Not the full name, and not any other proper nouns in the string.
What am I doing wrong?
Here is some working code.
The best thing to do is to step through your code and put a lot of print statements at different places. You will see where I printed the type() and the str() value of the items you are iterating on. I find this helps me to visualize and think more about the loops and conditionals I am writing if I can see them listed.
Also, oops, I inadvertently named all of the variables, "contiguous" instead of "continuous" ... not sure why ... contiguous might be more accurate
Code:
from nltk import ne_chunk, pos_tag, word_tokenize
from nltk.tree import Tree
def get_continuous_chunks(text):
    chunked = ne_chunk(pos_tag(word_tokenize(text)))
    current_chunk = []
    contiguous_chunk = []
    contiguous_chunks = []
    for i in chunked:
        # Debugging trace: show the type and value of every item.
        print(f"{type(i)}: {i}")
        if type(i) == Tree:
            current_chunk = ' '.join([token for token, pos in i.leaves()])
            # Apparently, Toni and Morrison are two separate items,
            # but "Random House" and "New York City" are single items.
            contiguous_chunk.append(current_chunk)
        else:
            # discontiguous, append to known contiguous chunks.
            if len(contiguous_chunk) > 0:
                contiguous_chunks.append(' '.join(contiguous_chunk))
                contiguous_chunk = []
                current_chunk = []
    # Flush a trailing chunk when the text ends with a named entity.
    if contiguous_chunk:
        contiguous_chunks.append(' '.join(contiguous_chunk))
    return contiguous_chunks
my_sent = "Toni Morrison was the first black female editor in fiction at Random House in New York City."
print()
contig_chunks = get_continuous_chunks(my_sent)
print(f"INPUT: My sentence: '{my_sent}'")
print(f"ANSWER: My contiguous chunks: {contig_chunks}")
Execution:
(venv) [ttucker@zim stackoverflow]$ python contig.py
<class 'nltk.tree.Tree'>: (PERSON Toni/NNP)
<class 'nltk.tree.Tree'>: (PERSON Morrison/NNP)
<class 'tuple'>: ('was', 'VBD')
<class 'tuple'>: ('the', 'DT')
<class 'tuple'>: ('first', 'JJ')
<class 'tuple'>: ('black', 'JJ')
<class 'tuple'>: ('female', 'NN')
<class 'tuple'>: ('editor', 'NN')
<class 'tuple'>: ('in', 'IN')
<class 'tuple'>: ('fiction', 'NN')
<class 'tuple'>: ('at', 'IN')
<class 'nltk.tree.Tree'>: (ORGANIZATION Random/NNP House/NNP)
<class 'tuple'>: ('in', 'IN')
<class 'nltk.tree.Tree'>: (GPE New/NNP York/NNP City/NNP)
<class 'tuple'>: ('.', '.')
INPUT: My sentence: 'Toni Morrison was the first black female editor in fiction at Random House in New York City.'
ANSWER: My contiguous chunks: ['Toni Morrison', 'Random House', 'New York City']
I am also a little unclear as to exactly what you were looking for, but from the description, this seems like it.
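The core of the fix is merging runs of adjacent entity items. As a minimal sketch of that step alone, the snippet below uses itertools.groupby on stand-in data: plain lists of tokens play the role of NLTK Tree nodes and tuples play the role of tagged non-entity words. `merge_adjacent_entities` is a hypothetical name, and the sample list is abbreviated by hand rather than produced by ne_chunk.

```python
from itertools import groupby

def merge_adjacent_entities(items):
    """Join each run of adjacent entity items (lists of tokens) into one string."""
    merged = []
    for is_entity, run in groupby(items, key=lambda item: isinstance(item, list)):
        if is_entity:
            merged.append(' '.join(' '.join(tokens) for tokens in run))
    return merged

# Stand-in for the chunker output: lists are entities, tuples are tagged words.
chunked = [
    ['Toni'], ['Morrison'], ('was', 'VBD'), ('editor', 'NN'), ('at', 'IN'),
    ['Random', 'House'], ('in', 'IN'), ['New', 'York', 'City'],
]
print(merge_adjacent_entities(chunked))
```

Because groupby closes the last run automatically, a sentence that ends with an entity needs no special handling.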

How can I set up a new column in python with a value based on the return of a function?

I am doing some text mining in python and want to set up a new column with the value 1 if the return of my search function is true and 0 if it's false.
I have tried various if statements, but cannot get anything to work.
A simplified version of what I'm doing is below:
import pandas as pd
import nltk
nltk.download('punkt')
df = pd.DataFrame(
    {
        'student number': [1, 2, 3, 4, 5],
        'answer': ['Yes, she is correct.', 'Yes', 'no', 'north east', 'No its North East']
        # I know there's an apostrophe missing
    }
)
print(df)
# change all text to lower case
df['answer'] = df['answer'].str.lower()
# split the answer into individual words
df['text'] = df['answer'].apply(nltk.word_tokenize)
# Check if given words appear together in a list of sentence
def check(sentence, words):
    res = []
    for substring in sentence:
        k = [w for w in words if w in substring]
        if len(k) == len(words):
            res.append(substring)
    return res
# Driver code
sentence = df['text']
words = ['no', 'north', 'east']
print(check(sentence, words))
This is what you want, I think (note that isin only matches answers that are exactly one of the words):
df['New'] = df['answer'].isin(words)*1
This one works for me:
df['NEW'] = 0  # create the column first, defaulting to 0
for i in range(len(df)):
    # 1 if every word in words appears among the row's tokens
    if set(words) <= set(df['text'][i]):
        df.loc[i, 'NEW'] = 1
You don't need the function if you use this method.
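The subset test can also be wrapped in a small function that returns 1 or 0 directly, which avoids the explicit loop. The sketch below is dependency-free: `contains_all` is a hypothetical helper name, and str.split() stands in for nltk.word_tokenize, so tokens may keep punctuation attached.

```python
def contains_all(tokens, words):
    """Return 1 if every word in `words` occurs in `tokens`, else 0."""
    return int(set(words) <= set(tokens))

answers = ['Yes, she is correct.', 'Yes', 'no', 'north east', 'No its North East']
words = ['no', 'north', 'east']

# Lower-case and split each answer, then flag the ones containing all words.
flags = [contains_all(answer.lower().split(), words) for answer in answers]
print(flags)
```

With the DataFrame from the question, the same helper could presumably be applied as df['NEW'] = df['text'].apply(lambda tokens: contains_all(tokens, words)).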

Plotly Dash Graph With Multiple Dropdown Inputs Not Working

I’m trying to create a time-series Dash line graph that has multiple interactive dropdown user input variables. I would ideally like each of the dropdown inputs to allow for multiple selections.
While I’m able to create the drop down menus successfully, the chart isn’t updating like I’d like. When I allow the dropdowns to have multiple selections, I get an error that arrays are different lengths. And when I limit the dropdowns to one selection, I get an error that [‘Vendor_Name’] is not in index. So this may be two separate problems.
Graph that doesn’t work:
Snippet of Excel data imported into DF
import dash
import dash_core_components as dcc
import dash_html_components as html
from dash.dependencies import Input, Output
import pandas as pd
#import plotly.graph_objs as go
df = pd.read_csv("Data.csv", sep = "\t")
df['YearMonth'] = pd.to_datetime(df['YearMonth'], format = '%Y-%m')
cols = ['Product_1', 'Product_2', 'Product_3']
vendor = df['Vendor'].unique()
app = dash.Dash('Data')
app.layout = html.Div([
    html.Div([
        html.Div([
            html.Label('Product'),
            dcc.Dropdown(
                id = 'product',
                options = [{
                    'label' : i,
                    'value' : i
                } for i in cols],
                multi = True,
                value = 'Product_1'
            ),
        ]),
        html.Div([
            html.Label('Vendor'),
            dcc.Dropdown(
                id = 'vendor',
                options = [{
                    'label' : i,
                    'value' : i
                } for i in vendor],
                multi = True,
                value = 'ABC'
            ),
        ]),
    ]),
    dcc.Graph(id = 'feature-graphic')
])
@app.callback(Output('feature-graphic', 'figure'),
    [Input('product', 'value'),
     Input('vendor', 'value')])
def update_graph(input_vendor, input_column):
    df_filtered = df[df['Vendor'] == input_vendor]
    ##also tried setting an index because of the error I was getting. Not sure if necessary
    df_filtered = df_filtered.set_index(['Vendor'])
    traces = []
    df_by_col = df_filtered[[input_column, 'YearMonth']]
    traces.append({
        'x' :pd.Series(df_by_col['YearMonth']),
        'y' : df_by_col[input_column],
        'mode' : 'lines',
        'type' : 'scatter',
        'name' :'XYZ'}
    )
    fig = {
        'data': traces,
        'layout': {'title': 'Title of Chart'}
    }
    return fig
if __name__ == '__main__':
    app.run_server(debug=False)
Thanks in advance for helping! Still new-ish to Python, but very excited about Dash’s capabilities. I’ve been able to create other graphs with single inputs, and have read through documentation.
Here is the approach I followed (adapting a common example that is easy to find online):
import dash
from dash.dependencies import Input, Output
import dash_core_components as dcc
import dash_html_components as html
app = dash.Dash(__name__)
all_options = {
    'America': ['New York City', 'San Francisco', 'Cincinnati'],
    'Canada': [u'Montréal', 'Toronto', 'Ottawa']
}
app.layout = html.Div([
    dcc.Dropdown(
        id='countries-dropdown',
        options=[{'label': k, 'value': k} for k in all_options.keys()],
        value='America', #default value to show
        multi=True,
        searchable=False
    ),
    dcc.Dropdown(id='cities-dropdown', multi=True, searchable=False, placeholder="Select a city"),
    html.Div(id='display-selected-values')
])
@app.callback(
    dash.dependencies.Output('cities-dropdown', 'options'),
    [dash.dependencies.Input('countries-dropdown', 'value')])
def set_cities_options(selected_country):
    # A single selection arrives as a str, several selections as a list.
    # Test with isinstance: type(x) == 'str' compares a type to a string and is never true.
    if isinstance(selected_country, str):
        return [{'label': i, 'value': i} for i in all_options[selected_country]]
    else:
        # "or []" also covers the case where nothing is selected (None).
        return [{'label': i, 'value': i} for country in (selected_country or []) for i in all_options[country]]
if __name__ == '__main__':
    app.run_server(debug=True)
The workaround is this: when a single value is selected in the parent dropdown, it arrives as a string, but multiple values arrive as a list.
This code also works and updates automatically, even when you click the cross to remove a selected option.
Note: I used the 'placeholder' attribute instead of defining a default value for the cities dropdown, as a default made no sense in this case. But you can also update the value dynamically in a similar way.
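Since the callback value can be a string, a list, or nothing at all, one option is to normalise it once at the top of the callback. The sketch below shows that idea outside Dash so it can be run on its own; `as_list` and `cities_for` are hypothetical helper names, and all_options is copied from the example above.

```python
def as_list(value):
    """Normalise a dropdown value: None -> [], 'x' -> ['x'], list -> list."""
    if value is None:
        return []
    if isinstance(value, str):
        return [value]
    return list(value)

all_options = {
    'America': ['New York City', 'San Francisco', 'Cincinnati'],
    'Canada': ['Montréal', 'Toronto', 'Ottawa'],
}

def cities_for(selected):
    """Build the options list for every selected country."""
    return [{'label': city, 'value': city}
            for country in as_list(selected)
            for city in all_options[country]]

print(cities_for('Canada'))               # single selection (str)
print(cities_for(['America', 'Canada']))  # multiple selections (list)
print(cities_for(None))                   # nothing selected
```

Inside a callback, the body would then reduce to a single return of cities_for(selected_country).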
1. Input data
The data as it is in the CSV is hard to loop over, and I would argue that this is the main reason your code does not work, because you seem to understand the fundamental code structure.
Having put on my SQL glasses, I think you should try to get it into something like:
Date, Vendor, ProductName, Value
2. Callback input types change
multi is tricky because the value switches between a str if only one item is selected and a list if more than one is selected.
3. Callback return type
Your code returns a dict, but the callback declares figure as the return type.
Here is the code, with debugging traces using print() and sleep():
import pandas as pd
import dash
import dash_core_components as dcc
import dash_html_components as html
from dash.dependencies import Input, Output
import plotly.graph_objs as go
import time
df = pd.read_csv("Data.csv", sep="\t")
df['YearMonth'] = pd.to_datetime(df['YearMonth'], format='%Y-%m')
products = ['Product_1', 'Product_2', 'Product_3']
vendors = df['Vendor'].unique()
app = dash.Dash('Data')
app.layout = html.Div([
    html.Div([
        html.Div([
            html.Label('Product'),
            dcc.Dropdown(
                id='product',
                options=[{'label' : p, 'value' : p} for p in products],
                multi=True,
                value='Product_1'
            ),
        ]),
        html.Div([
            html.Label('Vendor'),
            dcc.Dropdown(
                id='vendor',
                options=[{'label': v, 'value': v} for v in vendors],
                multi=True,
                value='ABC'
            ),
        ]),
    ]),
    dcc.Graph(id='feature-graphic', figure=go.Figure())
])
@app.callback(
    Output('feature-graphic', 'figure'),
    [Input('product', 'value'),
     Input('vendor', 'value')])
def update_graph(input_product, input_vendor):
    # df_filtered[['Product_1', 'YearMonth']]
    if type(input_product) == str:
        input_product = [input_product]
    if type(input_vendor) == str:
        input_vendor= [input_vendor]
    datasets = ['']
    i = 1
    for vendor in input_vendor:
        df_filtered = df[df['Vendor'] == vendor]
        for product in input_product:
            datasets.append((df_filtered[['YearMonth', 'Vendor', product]]).copy())
            datasets[i]['ProductName'] = product
            datasets[i].rename(columns={product: 'Value'}, inplace=True)
            i += 1
    datasets.pop(0)
    print(datasets)
    traces = ['']
    for dataset in datasets:
        print(dataset)
        time.sleep(1)
        traces.append(
            go.Scatter({
                'x': dataset['YearMonth'],
                'y': dataset['Value'],
                'mode': 'lines',
                'name': f"Vendor: {dataset['Vendor'].iloc[0]} Product: {dataset['ProductName'].iloc[0]}"
            }))
    traces.pop(0)
    layout = {'title': 'Title of Chart'}
    fig = {'data': traces, 'layout': go.Layout(layout)}
    return go.Figure(fig)
if __name__ == '__main__':
    app.run_server()
Quick and dirty disclosure:
If you handle issue 1, it will dramatically simplify everything.
So I'd try to move the pd.DataFrame() juggling out of the callback and into the upper I/O part.
1) Don't use counters in for loops.
2) My variable names aren't the best either.
3) The following style is caveman's Python, and there must be a better way:
traces = ['']
traces.append(this_and_that)
traces.pop(0)
Generally, using print(input_variable) and print(type(input_variable)) gets my wheels out of the mud most of the time.
After all this, you should notice that each trace gets its own name, which shows up in the legend. Clicking on the name in the legend adds or removes the trace without the need for @app.callback().
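The long layout suggested in point 1 (Date, Vendor, ProductName, Value) can be sketched without any library, which may make the reshaping step easier to follow. The rows below are made-up sample data, since the real Data.csv is not shown, and `wide_to_long` is a hypothetical helper name.

```python
def wide_to_long(rows, id_fields, value_fields):
    """Turn one row per (date, vendor) into one row per (date, vendor, product)."""
    long_rows = []
    for row in rows:
        for field in value_fields:
            record = {name: row[name] for name in id_fields}
            record['ProductName'] = field
            record['Value'] = row[field]
            long_rows.append(record)
    return long_rows

# Made-up sample rows in the wide layout (one column per product).
wide = [
    {'YearMonth': '2018-01', 'Vendor': 'ABC', 'Product_1': 10, 'Product_2': 4},
    {'YearMonth': '2018-02', 'Vendor': 'ABC', 'Product_1': 12, 'Product_2': 5},
]

long_rows = wide_to_long(wide, ['YearMonth', 'Vendor'], ['Product_1', 'Product_2'])
for record in long_rows:
    print(record)
```

In pandas, DataFrame.melt with id_vars, value_vars, var_name and value_name performs this kind of reshape, so it could be done once after reading the file instead of inside the callback.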

Unable to read value for variable outside loop in python

I am trying to create a list of dictionaries that contain lists of words at 'body' and 'summ' keys using spacy. I am also using BeautifulSoup since the actual data is raw html.
This is what I have so far:
from pymongo import MongoClient
from bs4 import BeautifulSoup as bs
import spacy
import string
clt = MongoClient('localhost')
db1 = clt['mchack']
db2 = clt['clean_data']
nlp = spacy.load('en')
valid_shapes = ['X.X','X.X.','X.x','X.x.','x.x','x.x.','x.X','x.X.']
cake = list()
sent_x = list()
temp_b = list()
temp_s = list()
sent_y = list()
table = str.maketrans(dict.fromkeys(string.punctuation))
for item in db1.article.find().limit(1):
    finale_doc = {}
    x = bs(item['news']['article']['Body'], 'lxml')
    y = bs(item['news']['article']['Summary'], 'lxml')
    for content in x.find_all('p'):
        v = content.text
        v = v.translate(table)
        sent_x.append(v)
    body = ' '.join(sent_x)
    for content in y.find_all('p'):
        v = content.text
        v = v.translate(table)
        sent_y.append(v)
    summ = ' '.join(sent_y)
    b_nlp = nlp(body)
    s_nlp = nlp(summ)
    for token in b_nlp:
        if token.is_alpha:
            temp_b.append(token.text.lower())
        elif token.shape_ in valid_shapes:
            temp_b.append(token.text.lower())
        elif token.pos_=='NUM':
            temp_b.append('<NUM>')
        elif token.pos_=="<SYM>":
            temp_b.append('<SYM>')
    for token in s_nlp:
        if token.is_alpha:
            temp_s.append(token.text.lower())
        elif token.shape_ in valid_shapes:
            temp_s.append(token.text.lower())
        elif token.pos_=='NUM':
            temp_s.append('<NUM>')
        elif token.pos_=="<SYM>":
            temp_s.append('<SYM>')
    finale_doc.update({'body':temp_b,'summ':temp_s})
    cake.append(finale_doc)
    print(cake)
    del sent_x[:]
    del sent_y[:]
    del temp_b[:]
    del temp_s[:]
    del finale_doc
print(cake)
The first print statement gives proper output
[{'summ': ['as', 'per', 'the', 'budget', 'estimates', 'we', 'are', 'going', 'to', 'spend', 'rs', '<NUM>', 'crore', 'in', 'the', 'next', 'year'],
'body': ['central', 'government', 'has', 'proposed', 'spendings', 'worth', 'over', 'rs', '<NUM>', 'crore', 'on', 'medical', 'and', 'cash', 'benefits', 'for', 'workers', 'and', 'family', 'members']}]
However, after emptying the lists sent_x, sent_y, temp_b and temp_s, the output comes:
[{'summ': [], 'body': []}]
You keep passing the references to temp_b and temp_s. That's why after emptying these lists cake's content also changes (values of the dictionary are the same objects as temp_b and temp_s)!
You simply need to make a copy before appending the finale_doc dict to cake list.
finale_doc.update({'body': list(temp_b), 'summ': list(temp_s)})
You should try creating a minimal reproducible example of this; it would meet the Stack Overflow guidelines, and you would likely end up answering your own question.
I think what you are asking is this:
How can I empty a list without changing other instances of that list?
I made some code and I think it should work:
items = []
contents = []
for value in (1, 2):
    contents.append(value)
    items.append(contents)
    print(contents)
    del contents[:]
print(items)
This prints [1], [2] like I want, but then it prints [[], []] instead of [[1], [2]].
Then I could answer your question:
Lists are mutable objects, and items stores references to the same list rather than copies, so this won't work.
Instead of modifying (adding to and then deleting from) the same list, you probably want to create a new list inside the loop. You can verify this by looking at id(contents), id(items[0]), etc., and seeing that they are all the same list. You can even do contents.append(None); print(items) and see that you now have [[None], [None]].
Try doing
for value in (1, 2):
    contents = []
    contents.append(value)
    items.append(contents)
instead of
contents = []
for value in (1, 2):
    contents.append(value)
    items.append(contents)
    del contents[:]
Edit: Another answer suggests making a copy of the values as you add them. This will work, but in your case I feel that making a copy and then nulling is unnecessarily complicated. This might be appropriate if you continued to add to the list.
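To make the difference concrete, here is a small self-contained sketch that runs the aliasing version next to both suggested fixes (a fresh list per iteration, and a copy taken before clearing). All variable names are illustrative only.

```python
# 1) Aliasing: every entry points at the same list, so clearing it empties all.
shared = []
aliased = []
for value in (1, 2):
    shared.append(value)
    aliased.append(shared)      # stores a reference, not a copy
    del shared[:]               # also empties what `aliased` refers to

# 2) Fix A: create a new list object on every iteration.
fresh = []
for value in (1, 2):
    contents = [value]
    fresh.append(contents)

# 3) Fix B: keep one buffer, but store a copy before clearing it.
copied = []
buffer = []
for value in (1, 2):
    buffer.append(value)
    copied.append(list(buffer))  # snapshot of the current contents
    del buffer[:]

print(aliased, fresh, copied)
print(aliased[0] is aliased[1], fresh[0] is fresh[1])
```

The identity check on the last line shows why the first version fails: both entries of aliased are the very same object.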

Assigning multiple values to dictionary keys from a file in Python 3

I'm fairly new to Python but I haven't found the answer to this particular problem.
I am writing a simple recommendation program and I need to have a dictionary where cuisine is a key and name of a restaurant is a value. There are a few instances where I have to split a string of a few cuisine names and make sure all other restaurants (values) which have the same cuisine get assigned to the same cuisine (key). Here's a part of a file:
Georgie Porgie
87%
$$$
Canadian, Pub Food
Queen St. Cafe
82%
$
Malaysian, Thai
Mexican Grill
85%
$$
Mexican
Deep Fried Everything
52%
$
Pub Food
so it's just the first and the last one with the same cuisine but there are more later in the file.
And here is my code:
def new(file):
    file = "/.../Restaurants.txt"
    d = {}
    key = []
    with open(file) as file:
        lines = file.readlines()
    for i in range(len(lines)):
        if i % 5 == 0:
            if "," not in lines[i + 3]:
                d[lines[i + 3].strip()] = [lines[i].strip()]
            else:
                key += (lines[i + 3].strip().split(', '))
                for j in key:
                    if j not in d:
                        d[j] = [lines[i].strip()]
                    else:
                        d[j].append(lines[i].strip())
    return d
It gets all the keys and values printed but it doesn't assign two values to the same key where it should. Also, with this last 'else' statement, the second restaurant is assigned to the wrong key as a second value. This should not happen. I would appreciate any comments or help.
In the case when there is only one category you don't check if the key is in the dictionary. You should do this analogously as in the case of multiple categories and then it works fine.
I don't know why you take file as an argument when you then overwrite it.
Additionally, you should build a fresh 'key' for each record rather than using += (which adds to the existing 'key').
When you check whether j is in the dictionary, a clean way is to check whether j is in the keys (d.keys()).
def new(file):
    file = "/.../Restaurants.txt"
    d = {}
    with open(file) as file:
        lines = file.readlines()
    for i in range(len(lines)):
        if i % 5 == 0:
            name = lines[i].strip()
            # Strip before any lookup so the key matches what is stored in d.
            cuisine = lines[i + 3].strip()
            if "," not in cuisine:
                if cuisine not in d.keys():
                    d[cuisine] = [name]
                else:
                    d[cuisine].append(name)
            else:
                # Build a fresh key list for every record.
                key = cuisine.split(', ')
                for j in key:
                    if j not in d.keys():
                        d[j] = [name]
                    else:
                        d[j].append(name)
    return d
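As a sketch of how the two branches could be folded into one, the version below splits every cuisine line on commas (a line without a comma simply yields one item) and uses dict.setdefault to create the list on first use. It assumes five lines per record (name, rating, price, cuisines, blank line), as the i % 5 test above implies; `group_by_cuisine` is a hypothetical name and the sample text is copied from the question.

```python
def group_by_cuisine(lines):
    """Map each cuisine to the list of restaurant names that serve it."""
    by_cuisine = {}
    # Assumed layout: name, rating, price, cuisines, then a blank separator line.
    for start in range(0, len(lines), 5):
        record = lines[start:start + 4]
        if len(record) < 4:
            break  # incomplete trailing record
        name = record[0].strip()
        for cuisine in record[3].split(','):
            by_cuisine.setdefault(cuisine.strip(), []).append(name)
    return by_cuisine

sample = """Georgie Porgie
87%
$$$
Canadian, Pub Food

Queen St. Cafe
82%
$
Malaysian, Thai

Mexican Grill
85%
$$
Mexican

Deep Fried Everything
52%
$
Pub Food
"""
result = group_by_cuisine(sample.splitlines())
print(result)
```

Here 'Pub Food' ends up with both Georgie Porgie and Deep Fried Everything, which is the behaviour the question asks for.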
Normally, I find that if you use names for the dictionary keys, you may have an easier time handling them later.
In the example below, I return a series of dictionaries, one for each restaurant. I also wrap the functionality of processing the values in a method called add_value(), to keep the code more readable.
In my example, I'm using codecs to decode the value. Although not necessary, depending on the characters you are dealing with it may be useful. I'm also using itertools to read the file lines with an iterator. Again, not necessary depending on the case, but might be useful if you are dealing with really big files.
# Note: this example uses Python 2 syntax (basestring, print statement).
import copy, itertools, codecs
class RestaurantListParser(object):
    file_name = "restaurants.txt"
    base_item = {
        "_type": "undefined",
        "_fields": {
            "name": "undefined",
            "nationality": "undefined",
            "rating": "undefined",
            "pricing": "undefined",
        }
    }
    def add_value(self, formatted_item, field_name, field_value):
        if isinstance(field_value, basestring):
            # handle encoding, strip, process the values as you need.
            field_value = codecs.encode(field_value, 'utf-8').strip()
            formatted_item["_fields"][field_name] = field_value
        else:
            print 'Error parsing field "%s", with value: %s' % (field_name, field_value)
    def generator(self, file_name):
        with open(file_name) as file:
            while True:
                # Read the file five lines (one record) at a time.
                lines = tuple(itertools.islice(file, 5))
                if not lines: break
                # Initialize our dictionary for this item
                formatted_item = copy.deepcopy(self.base_item)
                if "," not in lines[3]:
                    formatted_item['_type'] = lines[3].strip()
                else:
                    formatted_item['_type'] = lines[3].split(',')[1].strip()
                    self.add_value(formatted_item, 'nationality', lines[3].split(',')[0])
                self.add_value(formatted_item, 'name', lines[0])
                self.add_value(formatted_item, 'rating', lines[1])
                self.add_value(formatted_item, 'pricing', lines[2])
                yield formatted_item
    def split_by_type(self):
        d = {}
        for restaurant in self.generator(self.file_name):
            if restaurant['_type'] not in d:
                d[restaurant['_type']] = [restaurant['_fields']]
            else:
                d[restaurant['_type']] += [restaurant['_fields']]
        return d
Then, if you run:
p = RestaurantListParser()
print p.split_by_type()
You should get:
{
'Mexican': [{
'name': 'Mexican Grill',
'nationality': 'undefined',
'pricing': '$$',
'rating': '85%'
}],
'Pub Food': [{
'name': 'Georgie Porgie',
'nationality': 'Canadian',
'pricing': '$$$',
'rating': '87%'
}, {
'name': 'Deep Fried Everything',
'nationality': 'undefined',
'pricing': '$',
'rating': '52%'
}],
'Thai': [{
'name': 'Queen St. Cafe',
'nationality': 'Malaysian',
'pricing': '$',
'rating': '82%'
}]
}
Your solution is simple, so it's ok. I'd just like to mention a couple of ideas that come to mind when I think about this kind of problem.
Here's another take, using defaultdict and split to simplify things.
from collections import defaultdict
record_keys = ['name', 'rating', 'price', 'cuisine']
def load(file):
    with open(file) as file:
        data = file.read()
    restaurants = []
    # chop up input on each blank line (2 newlines in a row);
    # strip() first so a trailing newline does not produce an empty record
    for record in data.strip().split("\n\n"):
        fields = record.split("\n")
        # build a dictionary by zipping together the fixed set
        # of field names and the values from this particular record
        restaurant = dict(zip(record_keys, fields))
        # split chops apart the type cuisine on comma, then _.strip()
        # removes any leading/trailing whitespace on each type of cuisine
        restaurant['cuisine'] = [_.strip() for _ in restaurant['cuisine'].split(",")]
        restaurants.append(restaurant)
    return restaurants
def build_index(database, key, value):
    index = defaultdict(set)
    for record in database:
        for v in record.get(key, []):
            # defaultdict will create a set if one is not present or add to it if one does
            index[v].add(record[value])
    return index
restaurant_db = load('/var/tmp/r')
print(restaurant_db)
by_type = build_index(restaurant_db, 'cuisine', 'name')
print(by_type)

Resources