how to evaluate an AutoModelForQuestionAnswering? - nlp

I'm using this AutoModelForQuestionAnswering from the transformers for semantic search. Therefore i could not find a way to evaluate it knowing that i'm do not have a predicted values and i don't have a train and test data.Here's the code.
def load_model():
model_name = "camembert-base"
tokenizer = CamembertTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
return model, tokenizer
model,tokenizer= load_model()
#input_ids are the indices corresponding to each token in the sentence.
#input_ids = tokenizer.encode(search, text)
encoded_output = tokenizer.encode_plus(text=search, text_pair=text, padding=True)
input_ids = torch.tensor(encoded_output['input_ids'])
attention_mask = torch.tensor(encoded_output['attention_mask'])
token_type_ids = torch.tensor(0)
tokens = tokenizer.convert_ids_to_tokens(input_ids)
# Batch size of 1
input_ids = input_ids.unsqueeze(0)
token_type_ids = token_type_ids.unsqueeze(0)
attention_mask = attention_mask.unsqueeze(0)
print("this is the encode output", encoded_output)
print("this is token_types_ids",token_type_ids)
print("this is attention mask",attention_mask)
print("this is input_ids",input_ids)
print("this is tokens", tokens)
output = model(input_ids=input_ids, token_type_ids=token_type_ids, attention_mask=attention_mask)
print( model.eval() )
# 3. GET THE ANSWER SPAN
# once we have the most likely start and end tokens, we grab all the tokens between them
# and convert tokens back to words!
# tokens with highest start and end scores
answer_start = torch.argmax(output.start_logits) # get the most likely beginning of answer with the argmax of the score
answer_end = torch.argmax(output.end_logits)
if answer_end >= answer_start:
answer = " ".join(tokens[answer_start:answer_end + 1])
else:
answer = "No Answer"
return answer
'''

Related

What does num_labels actually do?

When training a BERT based model one can set num_labels
AutoConfig.from_pretrained(BERT_MODEL_NAME, num_labels=num_labels)
So for example if we want to have a prediction of 3 values we may use num_labels=3.
My question is what does it do internally? Is it just connecting a nn.Linear to the last embedding layer?
Thanks
I suppose if there is a num label then the model is used for classification then simply you can go to the documentation of BERT on hugging face then search for the classification class and take a look into the code, then you will find the following:
https://github.com/huggingface/transformers/blob/bd469c40659ce76c81f69c7726759d249b4aef49/src/transformers/models/bert/modeling_bert.py#L1572
if labels is not None:
if self.config.problem_type is None:
if self.num_labels == 1:
self.config.problem_type = "regression"
elif self.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int):
self.config.problem_type = "single_label_classification"
else:
self.config.problem_type = "multi_label_classification"
if self.config.problem_type == "regression":
loss_fct = MSELoss()
if self.num_labels == 1:
loss = loss_fct(logits.squeeze(), labels.squeeze())
else:
loss = loss_fct(logits, labels)
elif self.config.problem_type == "single_label_classification":
loss_fct = CrossEntropyLoss()
loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
elif self.config.problem_type == "multi_label_classification":
loss_fct = BCEWithLogitsLoss()
loss = loss_fct(logits, labels)
so the number of labels as we see affects using the loss function
hope this answers your question

PyTorchLightning returning CUDA error: printing somehow fixes this. Why?

I'm using PyTorch Lightning to write a simple trainer, but when I try to run the trainer, for some reason, 9 out of 10 times it returns "CUDA error: device-side assert." Simply printing a newline before it somehow seems to make it work. Any ideas?
My code:
class Elementwise(nn.ModuleList):
"""
A simple network container.
Parameters are a list of modules.
Inputs are a 3d Tensor whose last dimension is the same length
as the list.
Outputs are the result of applying modules to inputs elementwise.
An optional merge parameter allows the outputs to be reduced to a
single Tensor.
"""
def __init__(self, merge=None, *args):
assert merge in [None, 'first', 'concat', 'sum', 'mlp']
self.merge = merge
super(Elementwise, self).__init__(*args)
def forward(self, inputs):
inputs_ = [feat.squeeze(1) for feat in inputs.split(1, dim=1)]
for i, j in enumerate(inputs_):
inp = torch.tensor(j).to(device).long()
inputs_[i] = inp
# this does not work
outputs = [f(x) for i, (f, x) in enumerate(zip(self, inputs_))]
if self.merge == 'first':
return outputs[0]
elif self.merge == 'concat' or self.merge == 'mlp':
return torch.cat(outputs, 1)
elif self.merge == 'sum':
return sum(outputs)
else:
return outputs
but somehow magically this works:
class Elementwise(nn.ModuleList):
"""
A simple network container.
Parameters are a list of modules.
Inputs are a 3d Tensor whose last dimension is the same length
as the list.
Outputs are the result of applying modules to inputs elementwise.
An optional merge parameter allows the outputs to be reduced to a
single Tensor.
"""
def __init__(self, merge=None, *args):
assert merge in [None, 'first', 'concat', 'sum', 'mlp']
self.merge = merge
super(Elementwise, self).__init__(*args)
def forward(self, inputs):
inputs_ = [feat.squeeze(1) for feat in inputs.split(1, dim=1)]
for i, j in enumerate(inputs_):
inp = torch.tensor(j).to(device).long()
inputs_[i] = inp
print("")
outputs = [f(x) for i, (f, x) in enumerate(zip(self, inputs_))]
if self.merge == 'first':
return outputs[0]
elif self.merge == 'concat' or self.merge == 'mlp':
return torch.cat(outputs, 1)
elif self.merge == 'sum':
return sum(outputs)
else:
return outputs
Any idea as to how this error gets fixed by simply printing to output?
Edit: This error is only raised when using PyTorch Lightning for training abstraction, using plain PyTorch makes it work fine.

How to calculate semantic similarity of short text corpora?

What is the proper approach to unsupervised comparison of semantic similarity between two short text corpora? Comparing LDA topic distributions for the two doesn't seem to be a solution, as for short documents the generated topics do not really grasp the semantics very well. Chunking didn't help, because following tweets don't have to be on the same topic. Is e.g. creating a matrix of cosine similarities between document TF-IDFs in these corpora a good way to go?
Here is one approach found here. The higher the similarity score, the closer the sentences are(semantically).
#Invoke libraries
from nltk import pos_tag, word_tokenize
from nltk.corpus import wordnet as wn
#Build functions to compute similarity
def ptb_to_wn(tag):
if tag.startswith('N'):
return 'n'
if tag.startswith('V'):
return 'v'
if tag.startswith('J'):
return 'a'
if tag.startswith('R'):
return 'r'
return None
def tagged_to_synset(word, tag):
wn_tag = ptb_to_wn(tag)
if wn_tag is None:
return None
try:
return wn.synsets(word, wn_tag)[0]
except:
return None
def sentence_similarity(s1, s2):
s1 = pos_tag(word_tokenize(s1))
s2 = pos_tag(word_tokenize(s2))
synsets1 = [tagged_to_synset(*tagged_word) for tagged_word in s1]
synsets2 = [tagged_to_synset(*tagged_word) for tagged_word in s2]
#suppress "none"
synsets1 = [ss for ss in synsets1 if ss]
synsets2 = [ss for ss in synsets2 if ss]
score, count = 0.0, 0
for synset in synsets1:
best_score = max([synset.path_similarity(ss) for ss in synsets2])
if best_score is not None:
score += best_score
count += 1
# Average the values
score /= count
return score
#compute the symmetric sentence similarity
def symSentSim(s1, s2):
sss_score = (sentence_similarity(s1, s2) + sentence_similarity(s2,s1)) / 2
return (sss_score)
s1 = 'We rented a vehicle to drive to New York'
s2 = 'The car broke down on our jouney'
s1tos2 = symSentSim(s1, s2)
print(s1tos2)
#0.142509920635

Document Similarity - Multiple documents ended with same similarity score

I have been working in a business problem where i need to find a similarity of new document with existing one.
I have used various approach as below
1.Bag of words + Cosine similarity
2.TFIDF + Cosine similarity
3.Word2Vec + Cosine similarity
None of them worked as expected.
But finally i found an approach which works better its
Word2vec + Soft cosine similarity
But the new challenge is i ended up with multiple documents with same similarity score.
Most of them are relevant but few of them even though having some semantically similar words they are different
Please suggest how to over come this issue
If the objective is to identify semantic similarity, the following code sourced from here helps.
#invoke libraries
from nltk import pos_tag, word_tokenize
from nltk.corpus import wordnet as wn
#Build functions
def ptb_to_wn(tag):
if tag.startswith('N'):
return 'n'
if tag.startswith('V'):
return 'v'
if tag.startswith('J'):
return 'a'
if tag.startswith('R'):
return 'r'
return None
def tagged_to_synset(word, tag):
wn_tag = ptb_to_wn(tag)
if wn_tag is None:
return None
try:
return wn.synsets(word, wn_tag)[0]
except:
return None
def sentence_similarity(s1, s2):
s1 = pos_tag(word_tokenize(s1))
s2 = pos_tag(word_tokenize(s2))
synsets1 = [tagged_to_synset(*tagged_word) for tagged_word in s1]
synsets2 = [tagged_to_synset(*tagged_word) for tagged_word in s2]
#suppress "none"
synsets1 = [ss for ss in synsets1 if ss]
synsets2 = [ss for ss in synsets2 if ss]
score, count = 0.0, 0
for synset in synsets1:
best_score = max([synset.path_similarity(ss) for ss in synsets2])
if best_score is not None:
score += best_score
count += 1
# Average the values
score /= count
return score
#Build function to compute the symmetric sentence similarity
def symSentSim(s1, s2):
sss_score = (sentence_similarity(s1, s2) + sentence_similarity(s2,s1)) / 2
return (sss_score)
#Example
s1 = 'We rented a vehicle to drive to Goa'
s2 = 'The car broke down on our jouney'
s1tos2 = symSentSim(s1, s2)
print(s1tos2)
#0.155753968254

How to search in an id tree in the python program

I'm a student and working on a small assignment where I need to collect inputs from the student on factors like kind of books they like to issue from the library. I've been provided id_tree class which I need to search using. As you can see I'm getting inputs from the console and I like to use that as the search criteria and get the recommendation from the id tree.
Just for testing purpose, I'm using out.py, but that needs to be replaced with id_tree search logic for which I'm struggling.
# k-Nearest Neighbors and Identification Trees
#api.py
import os
from copy import deepcopy
from functools import reduce
################################################################################
############################# IDENTIFICATION TREES #############################
################################################################################
class Classifier :
def __init__(self, name, classify_fn) :
self.name = str(name)
self._classify_fn = classify_fn
def classify(self, point):
try:
return self._classify_fn(point)
except KeyError as key:
raise ClassifierError("point has no attribute " + str(key) + ": " + str(point))
def copy(self):
return deepcopy(self)
def __eq__(self, other):
try:
return (self.name == other.name
and self._classify_fn.__code__.co_code == other._classify_fn.__code__.co_code)
except:
return False
def __str__(self):
return "Classifier<" + str(self.name) + ">"
__repr__ = __str__
## HELPER FUNCTIONS FOR CREATING CLASSIFIERS
def maybe_number(x) :
try :
return float(x)
except (ValueError, TypeError) :
return x
def feature_test(key) :
return Classifier(key, lambda pt : maybe_number(pt[key]))
def threshold_test(feature, threshold) :
return Classifier(feature + " > " + str(threshold),
lambda pt: "Yes" if (maybe_number(pt.get(feature)) > threshold) else "No")
## CUSTOM ERROR CLASSES
class NoGoodClassifiersError(ValueError):
def __init__(self, value=""):
self.value = value
def __str__(self):
return repr(self.value)
class ClassifierError(RuntimeError):
def __init__(self, value=""):
self.value = value
def __str__(self):
return repr(self.value)
class IdentificationTreeNode:
def __init__(self, target_classifier, parent_branch_name=None):
self.target_classifier = target_classifier
self._parent_branch_name = parent_branch_name
self._classification = None #value, if leaf node
self._classifier = None #Classifier, if tree continues
self._children = {} #dict mapping feature to node, if tree continues
self._data = [] #only used temporarily for printing with data
def get_parent_branch_name(self):
return self._parent_branch_name if self._parent_branch_name else "(Root node: no parent branch)"
def is_leaf(self):
return not self._classifier
def set_node_classification(self, classification):
self._classification = classification
if self._classifier:
print("Warning: Setting the classification", classification, "converts this node from a subtree to a leaf, overwriting its previous classifier:", self._classifier)
self._classifier = None
self._children = {}
return self
def get_node_classification(self):
return self._classification
def set_classifier_and_expand(self, classifier, features):
if classifier is None:
raise TypeError("Cannot set classifier to None")
if not isinstance_Classifier(classifier):
raise TypeError("classifier must be Classifier-type object: " + str(classifier))
self._classifier = classifier
try:
self._children = {feature:IdentificationTreeNode(self.target_classifier, parent_branch_name=str(feature))
for feature in features}
except TypeError:
raise TypeError("Expected list of feature names, got: " + str(features))
if len(self._children) == 1:
print("Warning: The classifier", classifier.name, "has only one relevant feature, which means it's not a useful test!")
if self._classification:
print("Warning: Setting the classifier", classifier.name, "converts this node from a leaf to a subtree, overwriting its previous classification:", self._classification)
self._classification = None
return self
def get_classifier(self):
return self._classifier
def apply_classifier(self, point):
if self._classifier is None:
raise ClassifierError("Cannot apply classifier at leaf node")
return self._children[self._classifier.classify(point)]
def get_branches(self):
return self._children
def copy(self):
return deepcopy(self)
def print_with_data(self, data):
tree = self.copy()
tree._assign_data(data)
print(tree.__str__(with_data=True))
def _assign_data(self, data):
if not self._classifier:
self._data = deepcopy(data)
return self
try:
pairs = list(self._soc(data, self._classifier).items())
except KeyError: #one of the points is missing a feature
raise ClassifierError("One or more points cannot be classified by " + str(self._classifier))
for (feature, branch_data) in pairs:
if feature in self._children:
self._children[feature]._assign_data(branch_data)
else: #feature branch doesn't exist
self._data.extend(branch_data)
return self
_ssc=lambda self,c,d:self.set_classifier_and_expand(c,self._soc(d,c))
_soc=lambda self,d,c:reduce(lambda b,p:b.__setitem__(c.classify(p),b.get(c.classify(p),[])+[p]) or b,d,{})
def __eq__(self, other):
try:
return (self.target_classifier == other.target_classifier
and self._parent_branch_name == other._parent_branch_name
and self._classification == other._classification
and self._classifier == other._classifier
and self._children == other._children
and self._data == other._data)
except:
return False
def __str__(self, indent=0, with_data=False):
newline = os.linesep
ret = ''
if indent == 0:
ret += (newline + "IdentificationTreeNode classifying by "
+ self.target_classifier.name + ":" + newline)
ret += " "*indent + (self._parent_branch_name + ": " if self._parent_branch_name else '')
if self._classifier:
ret += self._classifier.name
if with_data and self._data:
ret += self._render_points()
for (feature, node) in sorted(self._children.items()):
ret += newline + node.__str__(indent+1, with_data)
else: #leaf
ret += str(self._classification)
if with_data and self._data:
ret += self._render_points()
return ret
def _render_points(self):
ret = ' ('
first_point = True
for point in self._data:
if first_point:
first_point = False
else:
ret += ', '
ret += str(point.get("name","datapoint")) + ": "
try:
ret += str(self.target_classifier.classify(point))
except ClassifierError:
ret += '(unknown)'
ret += ')'
return ret
################################################################################
############################# k-NEAREST NEIGHBORS ##############################
################################################################################
class Point(object):
"""A Point has a name and a list or tuple of coordinates, and optionally a
classification, and/or alpha value."""
def __init__(self, coords, classification=None, name=None):
self.name = name
self.coords = coords
self.classification = classification
def copy(self):
return deepcopy(self)
def __getitem__(self, i): # make Point iterable
return self.coords[i]
def __eq__(self, other):
try:
return (self.coords == other.coords
and self.classification == other.classification)
except:
return False
def __str__(self):
ret = "Point(" + str(self.coords)
if self.classification:
ret += ", " + str(self.classification)
if self.name:
ret += ", name=" + str(self.name)
ret += ")"
return ret
__repr__ = __str__
################################################################################
############################### OTHER FUNCTIONS ################################
################################################################################
def is_class_instance(obj, class_name):
return hasattr(obj, '__class__') and obj.__class__.__name__ == class_name
def isinstance_Classifier(obj):
return is_class_instance(obj, 'Classifier')
def isinstance_IdentificationTreeNode(obj):
return is_class_instance(obj, 'IdentificationTreeNode')
def isinstance_Point(obj):
return is_class_instance(obj, 'Point')
#id_tree
from api import *
import math
log2 = lambda x: math.log(x, 2)
INF = float('inf')
import pandas as pd
def id_tree_classify_point(point, id_tree):
if id_tree.is_leaf():
return id_tree.get_node_classification()
else:
new_tree = id_tree.apply_classifier(point)
get_point = id_tree_classify_point(point, new_tree)
return get_point
def split_on_classifier(data, classifier):
"""Given a set of data (as a list of points) and a Classifier object, uses
the classifier to partition the data. Returns a dict mapping each feature
values to a list of points that have that value."""
#Dictionary which will contain the data after classification.
class_dict = {}
#Iterating through all the points in data
for i in range(len(data)):
get_value = classifier.classify(data[i])
if get_value not in class_dict:
class_dict[get_value] = [data[i]]
else:
class_dict[get_value].append(data[i])
return class_dict
def branch_disorder(data, target_classifier):
"""Given a list of points representing a single branch and a Classifier
for determining the true classification of each point, computes and returns
the disorder of the branch."""
#Getting data after classification based on the target_classifier
class_dict = split_on_classifier(data, target_classifier)
if (len(class_dict) == 1):
#Homogenous condition
return 0
else:
disorder = 0
for i in class_dict:
get_len = len(class_dict[i])
p_term = get_len/ float(len(data))
disorder += (-1) * p_term * log2(p_term)
return disorder
def average_test_disorder(data, test_classifier, target_classifier):
"""Given a list of points, a feature-test Classifier, and a Classifier
for determining the true classification of each point, computes and returns
the disorder of the feature-test stump."""
average_disorder = 0.0
#Getting all the branches after applying test_classifer
get_branches = split_on_classifier(data, test_classifier)
#Iterating through the branches
for i in get_branches:
disorder = branch_disorder(get_branches[i], target_classifier)
average_disorder += disorder * (len(get_branches[i])/ float(len(data)))
return average_disorder
#### CONSTRUCTING AN ID TREE
def find_best_classifier(data, possible_classifiers, target_classifier):
"""Given a list of points, a list of possible Classifiers to use as tests,
and a Classifier for determining the true classification of each point,
finds and returns the classifier with the lowest disorder. Breaks ties by
preferring classifiers that appear earlier in the list. If the best
classifier has only one branch, raises NoGoodClassifiersError."""
#Base values to start with
best_classifier = average_test_disorder(data, possible_classifiers[0], target_classifier)
store_classifier = possible_classifiers[0]
#Iterating over the list of possible classifiers
for i in range(len(possible_classifiers)):
avg_disorder = average_test_disorder(data, possible_classifiers[i], target_classifier)
if avg_disorder < best_classifier:
best_classifier = avg_disorder
store_classifier = possible_classifiers[i]
get_branches = split_on_classifier(data, store_classifier)
if len(get_branches)==1:
#Only 1 branch present
raise NoGoodClassifiersError
else:
return store_classifier
def construct_greedy_id_tree(data, possible_classifiers, target_classifier, id_tree_node=None):
"""Given a list of points, a list of possible Classifiers to use as tests,
a Classifier for determining the true classification of each point, and
optionally a partially completed ID tree, returns a completed ID tree by
adding classifiers and classifications until either perfect classification
has been achieved, or there are no good classifiers left."""
#print data
#print "possible", possible_classifiers
#print "target", target_classifier
if id_tree_node == None:
#Creating a new tree
id_tree_node = IdentificationTreeNode(target_classifier)
if branch_disorder(data, target_classifier) == 0:
id_tree_node.set_node_classification(target_classifier.classify(data[0]))
else:
try:
#Getting the best classifier from the options available
best_classifier = find_best_classifier(data, possible_classifiers, target_classifier)
get_branches = split_on_classifier(data, best_classifier)
id_tree_node = id_tree_node.set_classifier_and_expand(best_classifier, get_branches)
#possible_classifiers.remove(best_classifier)
branches = id_tree_node.get_branches()
for i in branches:
construct_greedy_id_tree(get_branches[i], possible_classifiers, target_classifier, branches[i])
except NoGoodClassifiersError:
pass
return id_tree_node
possible_classifiers = [feature_test('age'),
feature_test('gender'),
feature_test('duration'),
feature_test('Mood')
]
df1 = pd.read_csv("data_form.csv")
#df1 = df1.drop("age", axis=1)
print(df1)
a = []
with open("data_form.csv") as myfile:
firstline = True
for line in myfile:
if firstline:
mykeys = "".join(line.split()).split(',')
firstline = False
else:
values = "".join(line.split()).split(',')
a.append({mykeys[n]:values[n] for n in range(0,len(mykeys))})
keys = a[0].keys()
print(keys)
with open('data_clean.csv', 'w') as output_file:
dict_writer = csv.DictWriter(output_file, keys)
dict_writer.writeheader()
dict_writer.writerows(a)
print(a)
tar = feature_test('genre')
print(construct_greedy_id_tree(a, possible_classifiers, tar))
#book_suggestion
import random
#from out import *
def genre(Mood, age, gender, duration):
print("Hi")
res_0= input("What's your name?")
res_1 = input("How are you, "+str(res_0)+"?")
if res_1 in ("good","fine","ok","nice"):
print ("Oh nice")
else:
print("Oh! It's alright")
Mood = input("What is your current mood?")
age = input("What is your age range : 10-12, 12-15,13-14,15-18,18+?")
gender = input("What is your gender?")
duration = input("How long do you want to read : 1week, 2weeks, 3weeks, 3+weeks, 2hours")
def get_book(genre):
suggestions = []
genre_to_book = {"Fantasy":["Just me and my babysitter - Mercer Mayer","Just Grandpa and me - Mercer Mayer","Just me and my babysitter - Mercer Mayer",
"The new Potty - Mercer Mayer","I was so mad - Mercer Mayer","Just me and my puppy" ,"Just a mess" ,"Me too"
,"The new Baby","Just shopping with mom"],
"Encyclopedias":["Brain Power - Paul Mcevoy", "My best books of snakes Gunzi Chrisitian","MY best books of MOON Grahame,Ian",
"The book of Planets Twist,Clint", "Do stars have points? Melvin", "Young discover series:cells Discovery Channel"]
,
"Action" : ["The Kane Chronicle:The Throne of Fire s Book 2 Riordan,Rick",
"Zane : ninja of ice Farshtey, Greg",
"Escape from Sentai Mountain Farshtey, Greg",
"Percy jackson Rick Riordan",
"The Kane Chronicle:The Throne of Fire s Book 2 Rick Riordan"],
"Comic" : ["Double Dork Diaries Russell Rachel Renée",
"Dork Dairies Russell Rachel Renee",
"Dork Dairies Russell Rachel Renée"],
"Mystery" : ["Sparkling Cyanide Christie Agatha",
"Poirot's Early Cases: Agatha Christie",
"The Name of this Book is Secret Bosch,Pseudonyuous"],
"Biographies" :["All by myself Mercer Mayer", "D Days prett bryan",
"Snake Bite Lane Andrew"] }
if (genre == "Fantasy"):
suggestions = [random.sample(genre_to_book["Fantasy"], 3)]
elif (genre == "Action"):
suggestions = [random.sample(genre_to_book["Action"], 3)]
elif (genre == "Comic"):
suggestions = [random.sample(genre_to_book["Comic"], 3)]
elif (genre == "Mystery"):
suggestions = [random.sample(genre_to_book["Mystery"], 3)]
elif (genre == "Encyclopedias"):
suggestions = random.sample(genre_to_book["Encyclopedias"], 3)
elif (genre == "Biographies"):
suggestions = random.sample(genre_to_book["Biographies"], 3)
return suggestions
print(get_book(genre(Mood, age, gender, duration)))
I want the program to not depend on out.py and and run on the information of id tree
The current implementation of the suggestions works by asking the user for a genre, then looking up a list of book titles in a dictionary using that genre as the key, then randomly selecting one of the titles and printing it. The current implementation also (presumably) constructs a IdentificationTreeNode containing recommendations, but then does nothing with it except printing it to the standard output.
The next step would be to not discard the tree, but save it in a variable and use in the recommendation process. Since the class structure is not given, it is not clear how this could be done, but it seems a reasonable assumption that it is possible to provide a keyword (the genre) and receive some collection of objects where each one contains data on a recommendation.
If constructing the IdentificationTreeNode is too costly to run on each recommendation request, it is possible to split the construction into its own script file and using python's pickle package to save the object in a file that can then be unpickled more quickly in the script performing the recommendations.

Resources