Document Similarity - Multiple documents ended up with the same similarity score - NLP

I have been working on a business problem where I need to find the similarity of a new document to existing ones.
I have tried several approaches:
1. Bag of words + cosine similarity
2. TF-IDF + cosine similarity
3. Word2Vec + cosine similarity
None of them worked as expected, but I finally found an approach that works better:
Word2Vec + soft cosine similarity
The new challenge is that I end up with multiple documents sharing the same similarity score. Most of them are relevant, but a few, even though they contain some semantically similar words, are actually different.
Please suggest how to overcome this issue.
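For context, the soft cosine setup can be reproduced with gensim roughly as below, and tied scores can then be broken deterministically by re-ranking with a secondary signal such as plain TF-IDF cosine. This is only a sketch under assumptions (placeholder corpus, a small GloVe model standing in for the poster's Word2Vec vectors), not the poster's exact pipeline:

import gensim.downloader as api
from gensim.corpora import Dictionary
from gensim.models import TfidfModel
from gensim.similarities import (MatrixSimilarity, SoftCosineSimilarity,
                                 SparseTermSimilarityMatrix,
                                 WordEmbeddingSimilarityIndex)

# Placeholder corpus and query -- substitute your own documents.
existing_documents = ["the car broke down on the journey",
                      "we rented a vehicle to drive to Goa",
                      "quarterly sales figures improved"]
new_document = "our automobile failed during the trip"

docs = [d.lower().split() for d in existing_documents]
dictionary = Dictionary(docs)
bow_corpus = [dictionary.doc2bow(d) for d in docs]
tfidf = TfidfModel(dictionary=dictionary)

wv = api.load("glove-wiki-gigaword-50")  # any word-embedding KeyedVectors works here
termsim_index = WordEmbeddingSimilarityIndex(wv)
termsim_matrix = SparseTermSimilarityMatrix(termsim_index, dictionary, tfidf=tfidf)

query_bow = dictionary.doc2bow(new_document.lower().split())
soft_index = SoftCosineSimilarity(tfidf[bow_corpus], termsim_matrix)
soft_scores = soft_index[tfidf[query_bow]]  # soft cosine vs. every document

# Tie-break: re-rank tied documents by plain TF-IDF cosine as a secondary key.
plain_index = MatrixSimilarity(tfidf[bow_corpus], num_features=len(dictionary))
plain_scores = plain_index[tfidf[query_bow]]
ranked = sorted(range(len(docs)),
                key=lambda i: (soft_scores[i], plain_scores[i]), reverse=True)
print(ranked)

Any secondary key works (plain cosine, document length, recency); the point is that a deterministic second score separates documents that soft cosine alone cannot distinguish.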

If the objective is to identify semantic similarity, the following code, sourced from here, helps.
#invoke libraries
from nltk import pos_tag, word_tokenize
from nltk.corpus import wordnet as wn

#Build functions
def ptb_to_wn(tag):
    """Map a Penn Treebank POS tag to a WordNet POS tag."""
    if tag.startswith('N'):
        return 'n'
    if tag.startswith('V'):
        return 'v'
    if tag.startswith('J'):
        return 'a'
    if tag.startswith('R'):
        return 'r'
    return None

def tagged_to_synset(word, tag):
    wn_tag = ptb_to_wn(tag)
    if wn_tag is None:
        return None
    try:
        return wn.synsets(word, wn_tag)[0]
    except IndexError:  # no synset for this word/POS combination
        return None

def sentence_similarity(s1, s2):
    s1 = pos_tag(word_tokenize(s1))
    s2 = pos_tag(word_tokenize(s2))
    synsets1 = [tagged_to_synset(*tagged_word) for tagged_word in s1]
    synsets2 = [tagged_to_synset(*tagged_word) for tagged_word in s2]
    #suppress "None"
    synsets1 = [ss for ss in synsets1 if ss]
    synsets2 = [ss for ss in synsets2 if ss]
    score, count = 0.0, 0
    for synset in synsets1:
        # path_similarity returns None for unrelated synsets, so filter those
        # out before taking the max (max() over Nones fails in Python 3)
        sims = [synset.path_similarity(ss) for ss in synsets2]
        sims = [sim for sim in sims if sim is not None]
        if sims:
            score += max(sims)
            count += 1
    # Average the values, guarding against an empty synset list
    return score / count if count else 0.0

#Build function to compute the symmetric sentence similarity
def symSentSim(s1, s2):
    sss_score = (sentence_similarity(s1, s2) + sentence_similarity(s2, s1)) / 2
    return sss_score

#Example
s1 = 'We rented a vehicle to drive to Goa'
s2 = 'The car broke down on our jouney'
s1tos2 = symSentSim(s1, s2)
print(s1tos2)
#0.155753968254 in the original post (exact value may vary with the NLTK/WordNet version)

Related

How to evaluate an AutoModelForQuestionAnswering?

I'm using AutoModelForQuestionAnswering from transformers for semantic search. I could not find a way to evaluate it, since I have neither predicted values nor train and test data. Here's the code.
import torch
from transformers import AutoModelForQuestionAnswering, CamembertTokenizer

def load_model():
    model_name = "camembert-base"
    tokenizer = CamembertTokenizer.from_pretrained(model_name)
    model = AutoModelForQuestionAnswering.from_pretrained(model_name)
    return model, tokenizer

model, tokenizer = load_model()

def answer_question(search, text):  # search: the question, text: the passage
    #input_ids are the indices corresponding to each token in the sentence.
    #input_ids = tokenizer.encode(search, text)
    encoded_output = tokenizer.encode_plus(text=search, text_pair=text, padding=True)
    input_ids = torch.tensor(encoded_output['input_ids'])
    attention_mask = torch.tensor(encoded_output['attention_mask'])
    # note: CamemBERT (RoBERTa-based) does not really use token_type_ids
    token_type_ids = torch.tensor(0)
    tokens = tokenizer.convert_ids_to_tokens(input_ids)
    # Batch size of 1
    input_ids = input_ids.unsqueeze(0)
    token_type_ids = token_type_ids.unsqueeze(0)
    attention_mask = attention_mask.unsqueeze(0)
    print("this is the encoded output", encoded_output)
    print("this is token_type_ids", token_type_ids)
    print("this is attention mask", attention_mask)
    print("this is input_ids", input_ids)
    print("this is tokens", tokens)
    output = model(input_ids=input_ids, token_type_ids=token_type_ids, attention_mask=attention_mask)
    print(model.eval())
    # 3. GET THE ANSWER SPAN
    # once we have the most likely start and end tokens, we grab all the tokens
    # between them and convert tokens back to words!
    answer_start = torch.argmax(output.start_logits)  # most likely beginning of the answer
    answer_end = torch.argmax(output.end_logits)      # most likely end of the answer
    if answer_end >= answer_start:
        answer = " ".join(tokens[answer_start:answer_end + 1])
    else:
        answer = "No Answer"
    return answer
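Without labels there is nothing to score against, so one pragmatic option is to hand-label a small evaluation set and score the predicted spans with the standard SQuAD exact-match/F1 metrics. A minimal sketch using the Hugging Face evaluate library; answer_question is the function above, and the labelled example is a hypothetical placeholder:

# pip install evaluate
import evaluate

squad_metric = evaluate.load("squad")

# Hypothetical hand-labelled examples: gold answers you annotate yourself.
examples = [
    {"id": "1", "question": "Qui a écrit Les Misérables ?",
     "context": "Les Misérables est un roman de Victor Hugo paru en 1862.",
     "gold": "Victor Hugo"},
]

predictions, references = [], []
for ex in examples:
    pred = answer_question(ex["question"], ex["context"])
    predictions.append({"id": ex["id"], "prediction_text": pred})
    references.append({"id": ex["id"],
                       "answers": {"text": [ex["gold"]], "answer_start": [0]}})

print(squad_metric.compute(predictions=predictions, references=references))
# e.g. {'exact_match': ..., 'f1': ...}

Note that camembert-base is not fine-tuned for question answering (the QA head is randomly initialized), so its spans will be essentially random; a CamemBERT checkpoint fine-tuned on a French QA dataset would be the thing to evaluate.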

How to calculate semantic similarity of short text corpora?

What is the proper approach to unsupervised comparison of semantic similarity between two short-text corpora? Comparing LDA topic distributions for the two doesn't seem to be a solution, as for short documents the generated topics do not really capture the semantics very well. Chunking didn't help either, because consecutive tweets don't have to be on the same topic. Is, e.g., creating a matrix of cosine similarities between document TF-IDFs in these corpora a good way to go?
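For the TF-IDF idea specifically, a cross-corpus cosine similarity matrix is easy to build with scikit-learn; a minimal sketch with two placeholder toy corpora:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus_a = ["the cat sat on the mat", "dogs are loyal pets"]            # placeholder
corpus_b = ["a kitten rested on the rug", "canines make faithful companions"]

# Fit one vocabulary over both corpora so the vectors share a feature space.
vectorizer = TfidfVectorizer().fit(corpus_a + corpus_b)
A = vectorizer.transform(corpus_a)
B = vectorizer.transform(corpus_b)

sim_matrix = cosine_similarity(A, B)   # shape (len(corpus_a), len(corpus_b))
print(sim_matrix)
print("corpus-level similarity:", sim_matrix.mean())  # one crude aggregate

The caveat is that TF-IDF only captures lexical overlap, so "kitten" and "cat" score zero against each other; that limitation is what motivates the WordNet-based approach below.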
Here is one approach, found here. The higher the similarity score, the closer the sentences are (semantically).
The supporting functions (ptb_to_wn, tagged_to_synset, sentence_similarity, symSentSim) are identical to those in the first answer above, so they are not repeated here; only the example differs:
s1 = 'We rented a vehicle to drive to New York'
s2 = 'The car broke down on our jouney'
s1tos2 = symSentSim(s1, s2)
print(s1tos2)
#0.142509920635 in the original post (exact value may vary with the NLTK/WordNet version)

How to search in an id tree in the python program

I'm a student working on a small assignment where I need to collect inputs from students on factors like what kind of books they like to borrow from the library. I've been provided an id_tree class which I need to search with. As you can see, I'm getting inputs from the console, and I'd like to use those as the search criteria and get a recommendation from the ID tree.
Just for testing purposes I'm using out.py, but that needs to be replaced with the id_tree search logic, which is what I'm struggling with.
# k-Nearest Neighbors and Identification Trees
#api.py
import os
from copy import deepcopy
from functools import reduce

################################################################################
############################# IDENTIFICATION TREES #############################
################################################################################

class Classifier:
    def __init__(self, name, classify_fn):
        self.name = str(name)
        self._classify_fn = classify_fn

    def classify(self, point):
        try:
            return self._classify_fn(point)
        except KeyError as key:
            raise ClassifierError("point has no attribute " + str(key) + ": " + str(point))

    def copy(self):
        return deepcopy(self)

    def __eq__(self, other):
        try:
            return (self.name == other.name
                    and self._classify_fn.__code__.co_code == other._classify_fn.__code__.co_code)
        except:
            return False

    def __str__(self):
        return "Classifier<" + str(self.name) + ">"

    __repr__ = __str__

## HELPER FUNCTIONS FOR CREATING CLASSIFIERS
def maybe_number(x):
    try:
        return float(x)
    except (ValueError, TypeError):
        return x

def feature_test(key):
    return Classifier(key, lambda pt: maybe_number(pt[key]))

def threshold_test(feature, threshold):
    return Classifier(feature + " > " + str(threshold),
                      lambda pt: "Yes" if (maybe_number(pt.get(feature)) > threshold) else "No")

## CUSTOM ERROR CLASSES
class NoGoodClassifiersError(ValueError):
    def __init__(self, value=""):
        self.value = value
    def __str__(self):
        return repr(self.value)

class ClassifierError(RuntimeError):
    def __init__(self, value=""):
        self.value = value
    def __str__(self):
        return repr(self.value)

class IdentificationTreeNode:
    def __init__(self, target_classifier, parent_branch_name=None):
        self.target_classifier = target_classifier
        self._parent_branch_name = parent_branch_name
        self._classification = None  #value, if leaf node
        self._classifier = None      #Classifier, if tree continues
        self._children = {}          #dict mapping feature to node, if tree continues
        self._data = []              #only used temporarily for printing with data

    def get_parent_branch_name(self):
        return self._parent_branch_name if self._parent_branch_name else "(Root node: no parent branch)"

    def is_leaf(self):
        return not self._classifier

    def set_node_classification(self, classification):
        self._classification = classification
        if self._classifier:
            print("Warning: Setting the classification", classification,
                  "converts this node from a subtree to a leaf, overwriting its previous classifier:",
                  self._classifier)
            self._classifier = None
            self._children = {}
        return self

    def get_node_classification(self):
        return self._classification

    def set_classifier_and_expand(self, classifier, features):
        if classifier is None:
            raise TypeError("Cannot set classifier to None")
        if not isinstance_Classifier(classifier):
            raise TypeError("classifier must be Classifier-type object: " + str(classifier))
        self._classifier = classifier
        try:
            self._children = {feature: IdentificationTreeNode(self.target_classifier,
                                                              parent_branch_name=str(feature))
                              for feature in features}
        except TypeError:
            raise TypeError("Expected list of feature names, got: " + str(features))
        if len(self._children) == 1:
            print("Warning: The classifier", classifier.name,
                  "has only one relevant feature, which means it's not a useful test!")
        if self._classification:
            print("Warning: Setting the classifier", classifier.name,
                  "converts this node from a leaf to a subtree, overwriting its previous classification:",
                  self._classification)
            self._classification = None
        return self

    def get_classifier(self):
        return self._classifier

    def apply_classifier(self, point):
        if self._classifier is None:
            raise ClassifierError("Cannot apply classifier at leaf node")
        return self._children[self._classifier.classify(point)]

    def get_branches(self):
        return self._children

    def copy(self):
        return deepcopy(self)

    def print_with_data(self, data):
        tree = self.copy()
        tree._assign_data(data)
        print(tree.__str__(with_data=True))

    def _assign_data(self, data):
        if not self._classifier:
            self._data = deepcopy(data)
            return self
        try:
            pairs = list(self._soc(data, self._classifier).items())
        except KeyError:  #one of the points is missing a feature
            raise ClassifierError("One or more points cannot be classified by " + str(self._classifier))
        for (feature, branch_data) in pairs:
            if feature in self._children:
                self._children[feature]._assign_data(branch_data)
            else:  #feature branch doesn't exist
                self._data.extend(branch_data)
        return self

    _ssc = lambda self, c, d: self.set_classifier_and_expand(c, self._soc(d, c))
    _soc = lambda self, d, c: reduce(
        lambda b, p: b.__setitem__(c.classify(p), b.get(c.classify(p), []) + [p]) or b, d, {})

    def __eq__(self, other):
        try:
            return (self.target_classifier == other.target_classifier
                    and self._parent_branch_name == other._parent_branch_name
                    and self._classification == other._classification
                    and self._classifier == other._classifier
                    and self._children == other._children
                    and self._data == other._data)
        except:
            return False

    def __str__(self, indent=0, with_data=False):
        newline = os.linesep
        ret = ''
        if indent == 0:
            ret += (newline + "IdentificationTreeNode classifying by "
                    + self.target_classifier.name + ":" + newline)
        ret += " " * indent + (self._parent_branch_name + ": " if self._parent_branch_name else '')
        if self._classifier:
            ret += self._classifier.name
            if with_data and self._data:
                ret += self._render_points()
            for (feature, node) in sorted(self._children.items()):
                ret += newline + node.__str__(indent + 1, with_data)
        else:  #leaf
            ret += str(self._classification)
            if with_data and self._data:
                ret += self._render_points()
        return ret

    def _render_points(self):
        ret = ' ('
        first_point = True
        for point in self._data:
            if first_point:
                first_point = False
            else:
                ret += ', '
            ret += str(point.get("name", "datapoint")) + ": "
            try:
                ret += str(self.target_classifier.classify(point))
            except ClassifierError:
                ret += '(unknown)'
        ret += ')'
        return ret

################################################################################
############################# k-NEAREST NEIGHBORS ##############################
################################################################################

class Point(object):
    """A Point has a name and a list or tuple of coordinates, and optionally a
    classification, and/or alpha value."""
    def __init__(self, coords, classification=None, name=None):
        self.name = name
        self.coords = coords
        self.classification = classification
    def copy(self):
        return deepcopy(self)
    def __getitem__(self, i):  # make Point iterable
        return self.coords[i]
    def __eq__(self, other):
        try:
            return (self.coords == other.coords
                    and self.classification == other.classification)
        except:
            return False
    def __str__(self):
        ret = "Point(" + str(self.coords)
        if self.classification:
            ret += ", " + str(self.classification)
        if self.name:
            ret += ", name=" + str(self.name)
        ret += ")"
        return ret
    __repr__ = __str__

################################################################################
############################### OTHER FUNCTIONS ################################
################################################################################

def is_class_instance(obj, class_name):
    return hasattr(obj, '__class__') and obj.__class__.__name__ == class_name

def isinstance_Classifier(obj):
    return is_class_instance(obj, 'Classifier')

def isinstance_IdentificationTreeNode(obj):
    return is_class_instance(obj, 'IdentificationTreeNode')

def isinstance_Point(obj):
    return is_class_instance(obj, 'Point')
#id_tree
from api import *
import csv
import math
import pandas as pd

log2 = lambda x: math.log(x, 2)
INF = float('inf')

def id_tree_classify_point(point, id_tree):
    if id_tree.is_leaf():
        return id_tree.get_node_classification()
    else:
        new_tree = id_tree.apply_classifier(point)
        get_point = id_tree_classify_point(point, new_tree)
        return get_point

def split_on_classifier(data, classifier):
    """Given a set of data (as a list of points) and a Classifier object, uses
    the classifier to partition the data. Returns a dict mapping each feature
    value to a list of points that have that value."""
    #Dictionary which will contain the data after classification.
    class_dict = {}
    #Iterating through all the points in data
    for i in range(len(data)):
        get_value = classifier.classify(data[i])
        if get_value not in class_dict:
            class_dict[get_value] = [data[i]]
        else:
            class_dict[get_value].append(data[i])
    return class_dict

def branch_disorder(data, target_classifier):
    """Given a list of points representing a single branch and a Classifier
    for determining the true classification of each point, computes and returns
    the disorder of the branch."""
    #Getting data after classification based on the target_classifier
    class_dict = split_on_classifier(data, target_classifier)
    if len(class_dict) == 1:
        #Homogeneous condition
        return 0
    else:
        disorder = 0
        for i in class_dict:
            get_len = len(class_dict[i])
            p_term = get_len / float(len(data))
            disorder += (-1) * p_term * log2(p_term)
        return disorder

def average_test_disorder(data, test_classifier, target_classifier):
    """Given a list of points, a feature-test Classifier, and a Classifier
    for determining the true classification of each point, computes and returns
    the disorder of the feature-test stump."""
    average_disorder = 0.0
    #Getting all the branches after applying test_classifier
    get_branches = split_on_classifier(data, test_classifier)
    #Iterating through the branches
    for i in get_branches:
        disorder = branch_disorder(get_branches[i], target_classifier)
        average_disorder += disorder * (len(get_branches[i]) / float(len(data)))
    return average_disorder

#### CONSTRUCTING AN ID TREE

def find_best_classifier(data, possible_classifiers, target_classifier):
    """Given a list of points, a list of possible Classifiers to use as tests,
    and a Classifier for determining the true classification of each point,
    finds and returns the classifier with the lowest disorder. Breaks ties by
    preferring classifiers that appear earlier in the list. If the best
    classifier has only one branch, raises NoGoodClassifiersError."""
    #Base values to start with
    best_classifier = average_test_disorder(data, possible_classifiers[0], target_classifier)
    store_classifier = possible_classifiers[0]
    #Iterating over the list of possible classifiers
    for i in range(len(possible_classifiers)):
        avg_disorder = average_test_disorder(data, possible_classifiers[i], target_classifier)
        if avg_disorder < best_classifier:
            best_classifier = avg_disorder
            store_classifier = possible_classifiers[i]
    get_branches = split_on_classifier(data, store_classifier)
    if len(get_branches) == 1:
        #Only 1 branch present
        raise NoGoodClassifiersError
    else:
        return store_classifier

def construct_greedy_id_tree(data, possible_classifiers, target_classifier, id_tree_node=None):
    """Given a list of points, a list of possible Classifiers to use as tests,
    a Classifier for determining the true classification of each point, and
    optionally a partially completed ID tree, returns a completed ID tree by
    adding classifiers and classifications until either perfect classification
    has been achieved, or there are no good classifiers left."""
    #print data
    #print "possible", possible_classifiers
    #print "target", target_classifier
    if id_tree_node is None:
        #Creating a new tree
        id_tree_node = IdentificationTreeNode(target_classifier)
    if branch_disorder(data, target_classifier) == 0:
        id_tree_node.set_node_classification(target_classifier.classify(data[0]))
    else:
        try:
            #Getting the best classifier from the options available
            best_classifier = find_best_classifier(data, possible_classifiers, target_classifier)
            get_branches = split_on_classifier(data, best_classifier)
            id_tree_node = id_tree_node.set_classifier_and_expand(best_classifier, get_branches)
            #possible_classifiers.remove(best_classifier)
            branches = id_tree_node.get_branches()
            for i in branches:
                construct_greedy_id_tree(get_branches[i], possible_classifiers, target_classifier, branches[i])
        except NoGoodClassifiersError:
            pass
    return id_tree_node

possible_classifiers = [feature_test('age'),
                        feature_test('gender'),
                        feature_test('duration'),
                        feature_test('Mood')]

df1 = pd.read_csv("data_form.csv")
#df1 = df1.drop("age", axis=1)
print(df1)

a = []
with open("data_form.csv") as myfile:
    firstline = True
    for line in myfile:
        if firstline:
            mykeys = "".join(line.split()).split(',')
            firstline = False
        else:
            values = "".join(line.split()).split(',')
            a.append({mykeys[n]: values[n] for n in range(0, len(mykeys))})

keys = a[0].keys()
print(keys)
with open('data_clean.csv', 'w') as output_file:
    dict_writer = csv.DictWriter(output_file, keys)
    dict_writer.writeheader()
    dict_writer.writerows(a)
print(a)

tar = feature_test('genre')
print(construct_greedy_id_tree(a, possible_classifiers, tar))
#book_suggestion
import random
#from out import *

def genre(Mood, age, gender, duration):
    print("Hi")
    res_0 = input("What's your name?")
    res_1 = input("How are you, " + str(res_0) + "?")
    if res_1 in ("good", "fine", "ok", "nice"):
        print("Oh nice")
    else:
        print("Oh! It's alright")

Mood = input("What is your current mood?")
age = input("What is your age range : 10-12, 12-15, 13-14, 15-18, 18+?")
gender = input("What is your gender?")
duration = input("How long do you want to read : 1week, 2weeks, 3weeks, 3+weeks, 2hours")

def get_book(genre):
    suggestions = []
    genre_to_book = {
        "Fantasy": ["Just me and my babysitter - Mercer Mayer", "Just Grandpa and me - Mercer Mayer",
                    "Just me and my babysitter - Mercer Mayer", "The new Potty - Mercer Mayer",
                    "I was so mad - Mercer Mayer", "Just me and my puppy", "Just a mess", "Me too",
                    "The new Baby", "Just shopping with mom"],
        "Encyclopedias": ["Brain Power - Paul Mcevoy", "My best books of snakes Gunzi Chrisitian",
                          "MY best books of MOON Grahame,Ian", "The book of Planets Twist,Clint",
                          "Do stars have points? Melvin", "Young discover series:cells Discovery Channel"],
        "Action": ["The Kane Chronicle:The Throne of Fire s Book 2 Riordan,Rick",
                   "Zane : ninja of ice Farshtey, Greg",
                   "Escape from Sentai Mountain Farshtey, Greg",
                   "Percy jackson Rick Riordan",
                   "The Kane Chronicle:The Throne of Fire s Book 2 Rick Riordan"],
        "Comic": ["Double Dork Diaries Russell Rachel Renée",
                  "Dork Dairies Russell Rachel Renee",
                  "Dork Dairies Russell Rachel Renée"],
        "Mystery": ["Sparkling Cyanide Christie Agatha",
                    "Poirot's Early Cases: Agatha Christie",
                    "The Name of this Book is Secret Bosch,Pseudonyuous"],
        "Biographies": ["All by myself Mercer Mayer", "D Days prett bryan",
                        "Snake Bite Lane Andrew"]}
    if genre == "Fantasy":
        suggestions = [random.sample(genre_to_book["Fantasy"], 3)]
    elif genre == "Action":
        suggestions = [random.sample(genre_to_book["Action"], 3)]
    elif genre == "Comic":
        suggestions = [random.sample(genre_to_book["Comic"], 3)]
    elif genre == "Mystery":
        suggestions = [random.sample(genre_to_book["Mystery"], 3)]
    elif genre == "Encyclopedias":
        suggestions = random.sample(genre_to_book["Encyclopedias"], 3)
    elif genre == "Biographies":
        suggestions = random.sample(genre_to_book["Biographies"], 3)
    return suggestions

# NOTE: genre() above collects the user's answers but never returns a genre, so
# get_book() below receives None and returns [] -- this is the gap the ID-tree
# classification is meant to fill.
print(get_book(genre(Mood, age, gender, duration)))
I want the program to not depend on out.py and instead run on the information in the ID tree.
The current implementation of the suggestions works by asking the user for a genre, then looking up a list of book titles in a dictionary using that genre as the key, and then randomly selecting some of the titles and printing them. The current implementation also (presumably) constructs an IdentificationTreeNode containing recommendations, but then does nothing with it except print it to standard output.
The next step would be to not discard the tree, but save it in a variable and use it in the recommendation process. Since the class structure is not given, it is not clear exactly how this could be done, but it seems a reasonable assumption that it is possible to provide a keyword (the genre) and receive some collection of objects where each one contains data on a recommendation.
If constructing the IdentificationTreeNode is too costly to run on each recommendation request, it is possible to split the construction into its own script file and use Python's pickle package to save the object in a file that can then be unpickled more quickly in the script performing the recommendations.
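A minimal sketch of that split (the file names and the example point are hypothetical, and it assumes construct_greedy_id_tree, id_tree_classify_point, get_book, and the data list a are importable from the code above):

# build_tree.py -- run once: construct the ID tree and persist it.
import pickle

tree = construct_greedy_id_tree(a, possible_classifiers, tar)
with open("id_tree.pkl", "wb") as f:
    pickle.dump(tree, f)

# recommend.py -- run per request: load the saved tree and classify the answers.
import pickle

with open("id_tree.pkl", "rb") as f:
    tree = pickle.load(f)

# A hypothetical user profile collected from the console prompts:
point = {"Mood": "happy", "age": "12-15", "gender": "F", "duration": "1week"}
predicted_genre = id_tree_classify_point(point, tree)  # e.g. "Fantasy"
print(get_book(predicted_genre))

This way genre() only needs to collect the answers into a dict, and the unpickled tree, rather than out.py, supplies the genre passed to get_book().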

I made a function to find a word in a list of words by splitting it in half each time -- how does each if statement contribute to the time complexity?

My program is below. It is assumed that lst is a list of words in alphabetical order.
My question is whether the time complexity is changed by each if statement.
Compared to just using x in s (which has a time complexity of O(n)), it is faster, right? From what I know the time complexity of this program is O(N/8), unless each if statement counts, in which case it should be O(N/4)?
Thanks for the help.
def is_in(val, lst):
    """
    is_in is designed to check if a word is in a list of words in alphabetical order.
    :param val: the word to look for
    :param lst: the list of words to check
    """
    curlen = len(lst)  # saving processing power
    if ((curlen == 1) and (lst[0] != val)) or (curlen == 0):  # base case
        return False
    else:
        center = curlen // 2
        pivot = lst[center]
        # print("pivot: " + str(pivot))
        # print("val: " + str(val))
        if pivot == val:  # base case 2
            return True
        elif pivot > val:
            return is_in(val, lst[:center])
        else:
            return is_in(val, lst[center:])
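For reference: halving the search space on each call gives O(log n) comparisons, and constant divisors like N/8 or N/4 disappear in big-O notation, so the extra if statements change only the constant factor, never the complexity class. The slices lst[:center] and lst[center:] do copy their halves, though, which costs O(n) work per level; a common index-based variant (a sketch, not the poster's code) avoids the copies:

def is_in_indexed(val, lst, lo=0, hi=None):
    """Binary search over lst[lo:hi) without copying sublists."""
    if hi is None:
        hi = len(lst)
    if lo >= hi:  # empty range: not found
        return False
    center = (lo + hi) // 2
    pivot = lst[center]
    if pivot == val:
        return True
    elif pivot > val:
        return is_in_indexed(val, lst, lo, center)       # search left half
    else:
        return is_in_indexed(val, lst, center + 1, hi)   # search right half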

Greedy Motif Search in Python

I am studying the Bioinformatics course at Coursera, and have been stuck on the following problem for 5 days:
Implement GreedyMotifSearch.
Input: Integers k and t, followed by a collection of strings Dna.
Output: A collection of strings BestMotifs resulting from applying GreedyMotifSearch(Dna, k, t).
If at any step you find more than one Profile-most probable k-mer in a given string, use the
one occurring first.
Here's my attempt to solve this (I just copied it from my IDE, so pardon any print statements):
def GreedyMotifSearch(DNA, k, t):
    """
    Documentation here
    """
    import math
    bestMotifs = []
    bestScore = math.inf
    for string in DNA:
        bestMotifs.append(string[:k])
    base = DNA[0]
    for i in window(base, k):
        newMotifs = []
        for j in range(t):
            profile = ProfileMatrix([i])
            probable = ProfileMostProbable(DNA[j], k, profile)
            newMotifs.append(probable)
        if Score(newMotifs) <= bestScore:
            bestScore = Score(newMotifs)
            bestMotifs = newMotifs
    return bestMotifs
The helper functions are these:
def SymbolToNumber(Symbol):
    """
    Converts base to number (in lexicographical order)
    Symbol: the letter to be converted (str)
    Returns: the number corresponding to that base (int)
    """
    if Symbol == "A":
        return 0
    elif Symbol == "C":
        return 1
    elif Symbol == "G":
        return 2
    elif Symbol == "T":
        return 3

def NumberToSymbol(index):
    """
    Finds base from number (in lexicographical order)
    index: the number to be converted (int)
    Returns: the base corresponding to index (str)
    """
    if index == 0:
        return str("A")
    elif index == 1:
        return str("C")
    elif index == 2:
        return str("G")
    elif index == 3:
        return str("T")

def HammingDistance(p, q):
    """
    Finds the number of mismatches between 2 DNA segments of equal lengths
    p: first DNA segment (str)
    q: second DNA segment (str)
    Returns: number of mismatches (int)
    """
    return sum(s1 != s2 for s1, s2 in zip(p, q))

def window(s, k):
    for i in range(1 + len(s) - k):
        yield s[i:i+k]

def ProfileMostProbable(Text, k, Profile):
    """
    Finds a k-mer that was most likely to be generated by profile among
    all k-mers in Text
    Text: given DNA segment (str)
    k: length of pattern (int)
    Profile: a 4 x k profile matrix (dict of lists)
    Returns: profile-most probable k-mer (str)
    """
    letter = [[] for key in range(k)]
    probable = ""
    hamdict = {}
    index = 1
    for a in range(k):
        for j in "ACGT":
            letter[a].append(Profile[j][a])
    for b in range(len(letter)):
        number = max(letter[b])
        probable += str(NumberToSymbol(letter[b].index(number)))
    for c in window(Text, k):
        for x in range(len(c)):
            y = SymbolToNumber(c[x])
            index *= float(letter[x][y])
        hamdict[c] = index
        index = 1
    for pat, ham in hamdict.items():
        if ham == max(hamdict.values()):
            final = pat
            break
    return final

def Count(Motifs):
    """
    Documentation here
    """
    count = {}
    k = len(Motifs[0])
    for symbol in "ACGT":
        count[symbol] = []
        for i in range(k):
            count[symbol].append(0)
    t = len(Motifs)
    for i in range(t):
        for j in range(k):
            symbol = Motifs[i][j]
            count[symbol][j] += 1
    return count

def FindConsensus(motifs):
    """
    Finds a consensus sequence for given list of motifs
    motifs: a list of motif sequences (list)
    Returns: consensus sequence of motifs (str)
    """
    consensus = ""
    for i in range(len(motifs[0])):
        countA, countC, countG, countT = 0, 0, 0, 0
        for motif in motifs:
            if motif[i] == "A":
                countA += 1
            elif motif[i] == "C":
                countC += 1
            elif motif[i] == "G":
                countG += 1
            elif motif[i] == "T":
                countT += 1
        if countA >= max(countC, countG, countT):
            consensus += "A"
        elif countC >= max(countA, countG, countT):
            consensus += "C"
        elif countG >= max(countC, countA, countT):
            consensus += "G"
        elif countT >= max(countC, countG, countA):
            consensus += "T"
    return consensus

def ProfileMatrix(motifs):
    """
    Finds the profile matrix for given list of motifs
    motifs: list of motif sequences (list)
    Returns: the profile matrix for motifs (dict of lists)
    """
    Profile = {}
    A, C, G, T = [], [], [], []
    for j in range(len(motifs[0])):
        countA, countC, countG, countT = 0, 0, 0, 0
        for motif in motifs:
            if motif[j] == "A":
                countA += 1
            elif motif[j] == "C":
                countC += 1
            elif motif[j] == "G":
                countG += 1
            elif motif[j] == "T":
                countT += 1
        A.append(countA)
        C.append(countC)
        G.append(countG)
        T.append(countT)
    Profile["A"] = A
    Profile["C"] = C
    Profile["G"] = G
    Profile["T"] = T
    return Profile

def Score(motifs):
    """
    Finds score of motifs relative to the consensus sequence
    motifs: a list of given motifs (list)
    Returns: score of given motifs (int)
    """
    consensus = FindConsensus(motifs)
    score = 0.0000
    for motif in motifs:
        score += HammingDistance(consensus, motif)
    #print(score)
    return round(score, 4)
It seems fine to me. However, when I run this code on their quiz problems, it gives an incorrect answer. Their grading system shows this error:
Failed test #3. Your indexing may be off by one at the beginning of each string in Dna.
I have tried everything I can think of and have run this code on all their sample and debug data, but I simply can't figure out how to make it work. Please help me with any possible solutions.
You have a few problems; I think the version below addresses them all. I've included comments explaining each change, alongside your original lines and a reference to the relevant pseudocode from the debug-data page you linked to.
def GreedyMotifSearch(DNA, k, t):
    """
    Documentation here
    """
    import math
    bestMotifs = []
    bestScore = math.inf
    for string in DNA:
        bestMotifs.append(string[:k])
    base = DNA[0]
    for i in window(base, k):
        # Change here. Should start with one element in motifs and build up,
        # as in the line "Motifs <- list with only Dna[0](i,k)".
        # newMotifs = []
        newMotifs = [i]
        # Change here to iterate over len(DNA).
        # Should go through "for j from 1 to |Dna| - 1".
        # for j in range(t):
        for j in range(1, len(DNA)):
            # Change here. Should build the profile from the motifs accumulated so far.
            # profile = ProfileMatrix([i])
            profile = ProfileMatrix(newMotifs)
            probable = ProfileMostProbable(DNA[j], k, profile)
            newMotifs.append(probable)
        # Change to < rather than <= so that ties keep the earlier candidate. As referenced in the instructions:
        # If at any step you find more than one Profile-most probable k-mer in a given string, use the one occurring **first**.
        # if Score(newMotifs) <= bestScore:
        if Score(newMotifs) < bestScore:
            bestScore = Score(newMotifs)
            bestMotifs = newMotifs
    return bestMotifs
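As a quick smoke test, a small example input (placeholder data, not the grader's dataset) can be run through the corrected version; the exact motifs returned depend on the helper functions above:

# Toy input: t = 5 strings, looking for k = 3 motifs.
DNA = ["GGCGTTCAGGCA",
       "AAGAATCAGTCA",
       "CAAGGAGTTCGC",
       "CACGTCAATCAC",
       "CAATAATATTCG"]
print(GreedyMotifSearch(DNA, k=3, t=5))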
