How to say if a word tree is similar to another? - python-3.x

I want to know when a tree is similar to part of another, for instance when
When did Beyonce start becoming popular?
is partly included in the following sentences :
She started becoming popular in the late 1990s (1st case)
She rose to fame in the late 1990s (2nd case)
I am able to tell when one text is strictly included in another: I created a class that transforms a spaCy dependency parse into a tree, and I show later how to turn text into a spaCy parse.
from nltk.tree import Tree  # assumed: the Tree type produced by nltk_spacy_tree
class WordTree:
    '''Tree for spaCy dependency parsing array'''
    def __init__(self, tree, is_subtree=False):
        """
        Construct a new 'WordTree' object.
        :param tree: The tree containing the dependency parse
        :param is_subtree: True when the tree is a subtree of another tree
        :return: returns nothing
        """
        self.parent = []
        self.children = []
        self.data = tree.label().split('_')[0] # the first element of the tree # We are going to add the synonyms as well.
        for subtree in tree:
            if type(subtree) == Tree:
                # Iterate through the depth of the subtree.
                t = WordTree(subtree, True)
                t.parent=tree.label().split('_')[0]
            elif type(subtree) == str:
                surface_form = subtree.split('_')[0]
                self.children.append(surface_form)
And I can tell when one sentence is included in another thanks to the following functions:
def isSubtree(T,S):
    if S is None:
        return True
    if T is None:
        return False
    if areIdentical(T, S):
        return True
    return any(isSubtree(c, S) for c in T.children)
def areIdentical(root1, root2):
    '''
    function to say if two roots are identical
    '''
    # Base Case
    if root1 is None and root2 is None:
        return True
    if root1 is None or root2 is None:
        return False
    # Check if the data of both roots and their children are the same
    return (root1.data == root2.data and
            ((areIdentical(child1 , child2))
             for child1, child2 in zip(root1.children, root2.children)))
Indeed, for instance :
# first tree creation
text = "start becoming popular"
textSpacy = spacy_nlp(text)
treeText = nltk_spacy_tree(textSpacy)
t = WordTree(treeText[0])
# second tree creation
question = "When did Beyonce start becoming popular?"
questionSpacy = spacy_nlp(question)
treeQuestion = nltk_spacy_tree(questionSpacy)
q = WordTree(treeQuestion[0])
# tree comparison
isSubtree(t,q)
Returns True. Therefore, when can I say that a tree is partly included in another (1st case) or similar to another (2nd case)?
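One detail worth checking in areIdentical as posted: the second operand of `and` is a generator expression, and a generator object is always truthy, so the function effectively compares only the two root labels. A minimal sketch of a strict comparison that wraps the recursion in all() is shown below. The Node class is a hypothetical stand-in, not the WordTree class from the question, and the example tree is hand-built rather than produced by spaCy.

```python
class Node:
    """Hypothetical minimal tree node: a label plus a list of child nodes."""
    def __init__(self, data, children=None):
        self.data = data
        self.children = children if children is not None else []

def are_identical(a, b):
    """True when both trees have the same labels in the same shape."""
    if a.data != b.data or len(a.children) != len(b.children):
        return False
    # all() consumes the generator, so every pair of children is really compared.
    return all(are_identical(x, y) for x, y in zip(a.children, b.children))

def is_subtree(tree, pattern):
    """True when `pattern` equals the subtree rooted at some node of `tree`."""
    if are_identical(tree, pattern):
        return True
    return any(is_subtree(child, pattern) for child in tree.children)

# Hand-built example tree (an assumption, not real parser output).
question = Node("start", [Node("When"), Node("did"), Node("Beyonce"),
                          Node("becoming", [Node("popular")])])
fragment = Node("becoming", [Node("popular")])
other = Node("becoming", [Node("famous")])

print(is_subtree(question, fragment))  # True
print(is_subtree(question, other))     # False
```

This only covers strict inclusion; partial inclusion or similarity would need a scoring rule on top of it.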

Related

AI50 Pset0 Degrees: What seems to be the problem in my code's shortest_path function?

This function is supposed to return a list containing tuples of (movie_id, person_id), in order to figure out the number of degrees of separation by movies between two different actors (whose person_id are denoted by source and target).
But it takes very long to compute the output, even with small datasets, and sometimes the program just crashes, even if the degree of separation is only 2 or 3. How can I make this algorithm more efficient?
See here for more background: https://cs50.harvard.edu/ai/2020/projects/0/degrees/
def shortest_path(source, target):
    # create node for the start. initialise action and parent to None, and state to start id parameter
    source_node = Node(source, None, None)
    # create empty set of tuples called explored
    explored = set()
    frontier = QueueFrontier()
    # add start node to frontier list
    frontier.add(source_node)
    # execute search for target id (while True loop):
    while True:
        # remove a node from frontier (name it removed_node)
        removed_node = frontier.remove()
        # check if removed_node.state is target id
        if removed_node.state != target:
            neighbours = neighbors_for_person(removed_node.state)
            for neighbour in neighbours:
                neighbour_node = Node(neighbour[1], removed_node.state, neighbour[0])
                if neighbour != target:
                    frontier.add(neighbour_node)
                else:
                    frontier.add_infront(neighbour_node)
            explored.add(removed_node)
            # if removed_node.state is not target id:
            # get set of neighbours from get_neighbours function with its input as removed_node
            # generate node for each neighbour in the set, and add all the neighbours to frontier
            # add removed_node to explored set
        else:
            path = []
            path.append((removed_node.action, removed_node.state))
            while removed_node.parent != None:
                find_node = removed_node.parent
                for a_node in explored:
                    if find_node == a_node.state:
                        removed_node = a_node
                        path.append((removed_node.action, removed_node.state))
                        break
            path.reverse()
            path.pop(0)
            print(path)
            return path
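Two things in the posted loop are likely to hurt. Neighbours are added to the frontier without checking whether that person was already explored or queued, so the same people are expanded again and again. And `neighbour != target` compares a (movie_id, person_id) tuple with a person id, so it is always true. A minimal breadth-first sketch that records each person once and rebuilds the path from a dictionary is shown below. The NEIGHBOURS data and this stand-in `neighbors_for_person` are hypothetical; the real project supplies its own helper and Node/QueueFrontier classes, which this sketch does not use.

```python
from collections import deque

# Hypothetical toy data: person_id -> set of (movie_id, person_id) pairs.
NEIGHBOURS = {
    "a": {("m1", "b")},
    "b": {("m1", "a"), ("m2", "c")},
    "c": {("m2", "b"), ("m3", "d")},
    "d": {("m3", "c")},
    "e": set(),
}

def neighbors_for_person(person_id):
    """Stand-in for the project's helper of the same name."""
    return NEIGHBOURS[person_id]

def shortest_path(source, target):
    """Breadth-first search; returns a list of (movie_id, person_id) or None."""
    if source == target:
        return []
    # came_from maps each discovered person to (previous person, movie used).
    came_from = {source: None}
    frontier = deque([source])
    while frontier:
        person = frontier.popleft()
        for movie_id, neighbour in neighbors_for_person(person):
            if neighbour in came_from:
                continue  # already discovered, do not enqueue again
            came_from[neighbour] = (person, movie_id)
            if neighbour == target:
                # Walk back through came_from to rebuild the path.
                path = []
                current = target
                while came_from[current] is not None:
                    previous, movie = came_from[current]
                    path.append((movie, current))
                    current = previous
                path.reverse()
                return path
            frontier.append(neighbour)
    return None  # frontier exhausted: no connection

print(shortest_path("a", "d"))  # [('m1', 'b'), ('m2', 'c'), ('m3', 'd')]
print(shortest_path("a", "e"))  # None
```

Checking for the target when a neighbour is discovered, rather than when it is removed from the frontier, avoids expanding a whole extra level of the search.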

Making vars() defined variables global in a function

I am running multiple scenarios for my experiment, which requires me to dynamically change the variable names depending on the Scenario and Class. For that, I have a few lines of working code, where changing the simulation (i.e., Scenario and Class) changes the variable names. However, this code needs to be called every time after I define my experiment. Code below:
# Function
def Moisture_transport(Scenario, Class, delta_crop):
    """ (unrelated to this question) """
    return Class_direct, Class_sum_cmr
""" Define the Scenario and Class """
Scenario = 2; Class = 1; delta_crop = True # Assign the Scenario, Class and delta_crop
## Few lines of code that needs to run every time without any change
if delta_crop == False:
    vars()['Moisture_direct_Scenario_'+str(Scenario)+'_Class_'+str(Class)], vars()['Moisture_with_CMR_Scenario_'+str(Scenario)+'_Class_'+str(Class)] = Moisture_transport(Scenario, Class, delta_crop)
else:
    vars()['Moisture_direct_Scenario_'+str(Scenario)+'_Class_'+str(Class)+'_deltacrop'], vars()['Moisture_with_CMR_Scenario_'+str(Scenario)+'_Class_'+str(Class)+'_deltacrop'] = Moisture_transport(Scenario, Class, delta_crop)
Does anyone know how to make vars()['variable_name'] global in the function Moisture_transport?
I think this can be simpler still. There is some cost to handling a key so I'd not make them excessively long. Please note the global, where it is and is not used.
Moisture_variables = {}
def Moisture_transport(Scenario, Class, delta_crop):
    global Moisture_variables
    """ (unrelated to this question) """
    #return Class_direct, Class_sum_cmr
    Moisture_variables[f"{Scenario} {Class} {delta_crop}"] = (Class_direct, Class_sum_cmr)
You can also nest the results in sub-dictionaries, although this adds a bit of overhead for checking whether each sub-dictionary exists. Note I've deliberately changed (shortened) the variables in the called function to make it clear these are in a different scope.
Moisture_variables = {}
def Moisture_transport(Scenario, Class, delta_crop):
    """ (unrelated to this question) """
    #return Class_direct, Class_sum_cmr
    add_Moisture_variables(Scenario, Class, delta_crop, Class_direct, Class_sum_cmr)
def add_Moisture_variables(s, c, d, cd, cs):
    global Moisture_variables
    if s not in Moisture_variables:
        Moisture_variables[s] = {}
    if c not in Moisture_variables[s]:
        Moisture_variables[s][c] = {}
    Moisture_variables[s][c][d] = (cd, cs)
Yet another approach, if a list works for you. The double brackets are important: they append a single tuple.
Moisture_variables = []
def Moisture_transport(Scenario, Class, delta_crop):
    global Moisture_variables
    """ (unrelated to this question) """
    #return Class_direct, Class_sum_cmr
    Moisture_variables.append((Scenario, Class, delta_crop, Class_direct, Class_sum_cmr))
The choice of which approach works best depends on how you wish to recover the data.
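To make the point about recovering the data concrete, here is a small sketch that keys one dictionary by a (scenario, class, delta_crop) tuple. The tuple key is a variation introduced here, not something from the answers above. The body of `moisture_transport` is a placeholder with made-up arithmetic, since the real computation is not shown in the question.

```python
results = {}

def moisture_transport(scenario, cls, delta_crop):
    """Placeholder computation; the real function body is not shown in the question."""
    class_direct = scenario * 10 + cls                        # hypothetical stand-in value
    class_sum_cmr = class_direct + (1 if delta_crop else 0)   # hypothetical stand-in value
    # Mutating the module-level dict needs no `global` statement,
    # because the name `results` is never rebound inside the function.
    results[(scenario, cls, delta_crop)] = (class_direct, class_sum_cmr)

for scenario in (1, 2):
    for cls in (0, 1):
        moisture_transport(scenario, cls, True)

# Recover one run directly by its key...
direct, with_cmr = results[(2, 1, True)]
# ...or select every run of one scenario.
scenario_2 = {key: value for key, value in results.items() if key[0] == 2}

print(direct, with_cmr)    # 21 22
print(sorted(scenario_2))  # [(2, 0, True), (2, 1, True)]
```

A tuple key keeps the parts separately addressable, which a formatted string key does not, at the cost of remembering the order of the parts.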
A dictionary fits this case well: it holds all the results under string keys, which can then be looked up by condition, i.e., by Scenario or Class.
#Add a last line to the original function
def Moisture_transport(Scenario, Class, delta_crop):
    """ (unrelated to this question) """
    #return Class_direct, Class_sum_cmr
    variables_dict(Class_direct, Class_sum_cmr,delta_crop)
#Add a normal dictionary and a variable name defining function
Moisture_variables = {}
def variables_dict(Class_direct, Class_sum_cmr, delta_crop):
    if delta_crop == False:
        Moisture_variables['Moisture_direct_Scenario_{0}_Class_{1}'.format(Scenario,Class)] = Class_direct
        Moisture_variables['Moisture_with_CMR_Scenario_{0}_Class_{1}'.format(Scenario,Class)] = Class_sum_cmr
    else:
        Moisture_variables['Moisture_direct_Scenario_{0}_Class_{1}_deltacrop'.format(Scenario,Class)] = Class_direct
        Moisture_variables['Moisture_with_CMR_Scenario_{0}_Class_{1}_deltacrop'.format(Scenario,Class)] = Class_sum_cmr
After that, you can run the function Moisture_transport() as it is, and not worry about defining the variables outside the function, i.e., code after ## Few lines of code that needs to run every time without any change from the original question is not needed. E.g.:
""" Define the Scenario and Class """
Scenario = 1; Class = 0; delta_crop = False
Moisture_transport(Scenario, Class, delta_crop)

How to implement python dictionaries into code to do the same job as list functions

I need to be able to implement dictionaries in this code. Not all of it needs to be changed, just the parts where I can change it and it still does the same job.
In a test file I have a list of three strings (1, once),(2,twice).(2, twice).
I'm guessing the number will represent the value.
This code passes the tests but I am struggling to understand how I can use dictionaries to make it do the same job.
If anyone can help, I'd be grateful.
The current code is:
The list items are in a test file elsewhere.
class Bag:
    def __init__(self):
        """Create a new empty bag."""
        self.items = []
    def add(self, item):
        """Add one copy of item to the bag. Multiple copies are allowed."""
        self.items.append(item)
    def count(self, item):
        """Return the number of copies of item in the bag.
        Return zero if the item doesn't occur in the bag.
        """
        counter = 0
        for an_item in self.items:
            if an_item == item:
                counter += 1
        return counter
    def clear(self, item):
        """Remove all copies of item from the bag.
        Do nothing if the item doesn't occur in the bag.
        """
        index = 0
        while index < len(self.items):
            if self.items[index] == item:
                self.items.pop(index)
            else:
                index += 1
    def size(self):
        """Return the total number of copies of all items in the bag."""
        return len(self.items)
    def ordered(self):
        """Return the items by decreasing number of copies.
        Return a list of (count, item) pairs.
        """
        result = set()
        for item in self.items:
            result.add((self.count(item), item))
        return sorted(result, reverse=True)
I have been scratching my head over it for a while now. For dictionaries, I can only use these operations:
items[key] = value
len(items)
dict()
items[key]
key in items
del items[key]
Thank you
Start with the simplest possible problem. You have an empty bag:
self.items = {}
and now a caller is trying to add an item, with bag.add('twice').
Where shall we put the item?
Well, we're going to need some unique index.
Hmmm, different every time, different every time, what changes with each .add()?
Right, that's it, use the length!
n = len(self.items)
self.items[n] = new_item
So items[0] = 'twice'.
Now, does this still work after a 2nd call?
Yes. items[1] = 'twice'.
Following this approach you should be able to refactor the other methods to use the new scheme.
Use unit tests, or debug statements like print('after clear() items is: ', self.items), to help you figure out if the Right Thing happened.
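One caveat with keys taken from len(): once clear() deletes an entry, the length shrinks and the next add() can reuse a key that is still occupied. A different scheme, offered only as a sketch and not as what the answer above describes, is to map each item to its number of copies. It uses only the allowed dictionary operations. The `total` attribute is an assumption added here so that size() needs no iteration, and ordered() is left out.

```python
class Bag:
    def __init__(self):
        """Create a new empty bag."""
        self.items = dict()   # item -> number of copies
        self.total = 0        # hypothetical helper attribute: total number of copies

    def add(self, item):
        """Add one copy of item to the bag."""
        if item in self.items:
            self.items[item] = self.items[item] + 1
        else:
            self.items[item] = 1
        self.total = self.total + 1

    def count(self, item):
        """Return the number of copies of item, or zero if it is absent."""
        if item in self.items:
            return self.items[item]
        return 0

    def clear(self, item):
        """Remove all copies of item; do nothing if it is absent."""
        if item in self.items:
            self.total = self.total - self.items[item]
            del self.items[item]

    def size(self):
        """Return the total number of copies of all items."""
        return self.total

bag = Bag()
for word in ["once", "twice", "twice"]:
    bag.add(word)
print(bag.count("twice"), bag.size())  # 2 3
bag.clear("twice")
print(bag.count("twice"), bag.size())  # 0 1
```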

How to null out exceptions in an htmlChecker

While this is a project assignment for class I am trying to understand how to do a specific part of the project.
I need to go through an HTML file and check that every opening tag is matched by a closing tag. Further, they must be in the correct order, and this must be checked using a stack I've implemented. As of right now I am working on extracting each tag from the file. The tough part seems to be the two exceptions that I am working on here: the <br/> and the <meta> tags. I need these tags to be removed so the program doesn't read them as an opening or closing tag.
class Stack(object):
    def __init__(self):
        self.items = []
    def isEmpty(self):
        return self.items == []
    def push(self, item):
        self.items.append(item)
    def pop(self):
        return self.items.pop()
def getTag(file):
    EXCEPTIONS = ['br/', 'meta']
    s = Stack()
    balanced = True
    i = 0
    isCopying = False
    currentTag = ''
    isClosing = False
    while i < len(file) and balanced:
        symbol = file[i]
        if symbol == "<":
            if i < (len(file) - 1) and file[i + 1] == "/":
                i = i + 1
                isClosing = True
            isCopying = True
        if symbol == ">":
            if isClosing == True:
                top = s.pop()
                if not matches(top, symbol):
                    balanced = False
            else:
                s.push(currentTag)
            currentTag = ''
            isCopying = False
        if isCopying == True:
            currentTag += symbol
        i = i + 1
The code reads in the file and goes letter by letter to search for <string>. If it exists, it pushes it onto the stack. The matches function checks whether the closing tag equals the opening tag. The exceptions list holds the tags I have to check for, because they will screw up the placing of the strings on the stack. I am having a tough time trying to incorporate them into my code. Any ideas? Before I push onto the stack I should go through a filter to see whether that tag is valid or not. A basic if statement should suffice.
If I read your requirements correctly, you're going about this very awkwardly. What you're really looking to do is tokenize your file, and so the first thing you should do is get all the tokens in your file, and then check to see if it is a valid ordering of tokens.
Tokenization means you parse through your file and find all valid tokens and put them in an ordered list. A valid token in your case is any string length that starts with a < and ends with a >. You can safely discard the rest of the information I think? It would be easiest if you had a Token class to contain your token types.
Once you have that ordered list of tokens it is much easier to determine if they are a 'correct ordering' using your stack:
is_correct_ordering algorithm:
    For each element in the list
        if the element is an open-token, put it on the stack
        if the element is a close-token
            if the stack is empty return false
            if the top element of the stack is a matching open token
                pop the top element of the stack
            else return false
        discard any other token
    If the stack is NOT empty, return false
    Else return true
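The steps above can be sketched in Python roughly as follows. This assumes the tokens have already been classified into (kind, name) pairs, where kind is 'open', 'close' or 'other'. That tuple representation is a simplification chosen for this sketch, not the Token classes discussed in this answer.

```python
def is_correct_ordering(tokens):
    """tokens: list of (kind, name) pairs; kind is 'open', 'close' or 'other'."""
    stack = []
    for kind, name in tokens:
        if kind == "open":
            stack.append(name)
        elif kind == "close":
            # A close token must match the most recent unmatched open token.
            if not stack or stack[-1] != name:
                return False
            stack.pop()
        # Any other token is ignored.
    # Leftover open tokens mean something was never closed.
    return not stack

good = [("open", "html"), ("other", "meta"), ("open", "body"),
        ("other", "br"), ("close", "body"), ("close", "html")]
bad = [("open", "html"), ("open", "body"), ("close", "html")]

print(is_correct_ordering(good))  # True
print(is_correct_ordering(bad))   # False
```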
Naturally, having a reasonable Token class structure makes things easy:
class Token:
    def matches(self, t: 'Token') -> bool:
        pass # TODO Implement
    @classmethod
    def tokenize(cls, token_string: str) -> 'Token':
        pass # TODO Implement to return the proper subclass instantiation of the given string
class OpenToken(Token):
    pass
class CloseToken(Token):
    pass
class OtherToken(Token):
    pass
This breaks the challenge into two parts: first parsing the file for all valid tokens (easy to validate because you can hand-compare your ordered list with what you see in the file) and then validating that the ordered list is correct. Note that here, too, you can simplify what you're working on by delegating work to a sub-routine:
def tokenize_file(file) -> list:
    token_list = []
    i = 0
    while i < len(file):
        token_string, token_end = get_token(file[i:])
        token_list.append(Token.tokenize(token_string))
        i = i + token_end # Skip past the end of this token
    return token_list
def get_token(file) -> tuple:
    # Note this is a naive implementation. Consider the edge case:
    # <img src="Valid string with >">
    token_string = ""
    for x in range(len(file)):
        token_string += file[x]
        if file[x] == '>':
            return token_string, x + 1
    # Note that this function will fail if the file terminates before you find a closing tag!
The above should turn something like this:
<html>Blah<meta src="lala"/><body><br/></body></html>
Into:
[OpenToken('<html>'),
OtherToken('<meta src="lala"/>'),
OpenToken('<body>'),
OtherToken('<br/>'),
CloseToken('</body>'),
CloseToken('</html>')]
Which can be much more easily handled to determine correctness.
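For the tokenizing half, one possible shortcut is a regular expression instead of a character-by-character scan. The sketch below is an assumption-laden illustration: it returns plain (kind, name) pairs rather than Token objects, it treats the two tags named in the question (br and meta) plus any self-closing tag as 'other', and it shares the naive edge case of a '>' inside an attribute value.

```python
import re

# Tags the question treats as exceptions: they never get a closing tag.
EXCEPTIONS = {"br", "meta"}

# Naive pattern: '<', optional '/', a tag name, then anything up to the next '>'.
TAG_PATTERN = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>")

def tokenize(html):
    """Return a list of (kind, name) pairs in document order."""
    tokens = []
    for match in TAG_PATTERN.finditer(html):
        slash, name = match.group(1), match.group(2).lower()
        if name in EXCEPTIONS or match.group(0).endswith("/>"):
            kind = "other"
        elif slash:
            kind = "close"
        else:
            kind = "open"
        tokens.append((kind, name))
    return tokens

print(tokenize('<html>Blah<meta src="lala"/><body><br/></body></html>'))
```

Text between tags, such as "Blah", is skipped by the pattern, so no separate discard step is needed.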
Obviously this isn't a complete implementation of your problem, but hopefully it will help straighten out the awkwardness you've chosen with your current direction.

recursively editing member variable: All instances have same value

I want to create a Tree data structure that consists of TreeNode objects. The root is a TreeNode. Each TreeNode has one parent TreeNode and a list of children TreeNodes.
The Tree is built up recursively. I simplified the code to make the example not too difficult. The function get_list_of_values_from_somewhere works correctly. The recursion ends when there are no child_values for a TreeNode and get_list_of_values_from_somewhere returns an empty list. That works perfectly well.
The children member of each TreeNode is not correct. The script collects all the TreeNodes in a list (node_list). There I can check that each TreeNode has a parent node and this parent node is correct.
But for some reason they all have the same list of children. I don't understand why. Everything else is correct. The recursion works, the TreeNodes are created correctly, their parent is correct. Why is their children list not filled correctly, and how would you edit the member variables of the instances after creating the instance?
class Tree(object):
    def __init__(self, root_value):
        print ("Creating tree")
        self.node_list = []
        self.root = self.createTree(root_value)
    def createTree(self, value, parent=None):
        node = TreeNode(value, parent)
        children_values = get_list_of_values_from_somewhere()
        for child_value in children_values:
            child_node = self.createTree(child_value, node)
            self.node_list.append(child_node)
            node.children.append(child_node)
            # I also tried alternatives:
            #node.insertChildren(self.createTree(child_value, node))
            #node.insertChild(child_node)
        return node
class TreeNode(object):
    def __init__(self, value, parent=None, children=[]):
        self.value = value
        self.parent = parent
        self.children = children
    def insertChildren(self, children=[]):
        self.children += children
    def insertChild(self, child):
        self.children.append(child)
if __name__ == '__main__':
    tree = Tree(1)
    #tree.node_list contains a list of nodes, their parent is correct
    #tree.root.children contains all children
    #tree.node_list[x] contains the same children - although many of them should not even have a single child. Otherwise the recursion would not end.
Be very, very cautious of this:
def __init__(self, value, parent=None, children=[]):
and this:
def insertChildren(self, children=[]):
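The default value -- the list object created by [] -- is evaluated once, when the function is defined, so it is a single object shared by every call that omits the argument.
Every TreeNode you create without passing children ends up holding this same shared list.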
You may want to use this instead.
def __init__( self, value, parent= None, children= None ):
    if children is None: children= []
This technique will create a fresh, empty list object. No sharing.
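A small demonstration of the difference, using two hypothetical classes (SharedNode and FreshNode are names made up for this sketch, not the TreeNode from the question):

```python
class SharedNode:
    def __init__(self, value, children=[]):   # one list, created at definition time
        self.value = value
        self.children = children

class FreshNode:
    def __init__(self, value, children=None):
        self.value = value
        # A new list is built on every call that omits `children`.
        self.children = [] if children is None else children

a, b = SharedNode(1), SharedNode(2)
a.children.append("child of a")
print(b.children)                # ['child of a']
print(a.children is b.children)  # True

c, d = FreshNode(1), FreshNode(2)
c.children.append("child of c")
print(d.children)                # []
```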
