I am trying to implement the word ladder problem, where I have to convert one word to another along the shortest possible path. Breadth-first search (BFS) is the obvious way to solve it, but first I have to build the graph. I have implemented the bucket concept, where words fall under a bucket if they match the bucket's pattern. But my graph is not being built correctly.
The given word list is ["CAT", "BAT", "COT", "COG", "COW", "RAT", "BUT", "CUT", "DOG", "WED"]
For each word I can create buckets. For example, for the word 'CAT' I can have three buckets: _AT, C_T and CA_. Similarly, I can create buckets for the rest of the words, and whichever words match a bucket's pattern fall under that bucket.
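As a small illustration of the bucket idea, the sketch below generates the patterns for one word by replacing each position in turn with an underscore. The helper name bucket_patterns is hypothetical and is not part of the code in this post.

```python
def bucket_patterns(word):
    """Return one pattern per position, with that position replaced by '_'."""
    return [word[:i] + "_" + word[i + 1:] for i in range(len(word))]


print(bucket_patterns("CAT"))  # ['_AT', 'C_T', 'CA_']
```

Two words that differ in exactly one letter share at least one such pattern, which is why grouping words by pattern finds the edges of the graph.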
Working it out by hand should give me a graph like this
Since the graph is undirected, the neighbouring vertices of COG should be DOG, COW and COT (the relationship works both ways), but instead I get COG connected to nothing. Here is my code:
class Vertex:
    def __init__(self,key):
        self.id = key
        self.connectedTo = {}
    def addNeighbour(self,nbr,weight=0):
        self.connectedTo[nbr] = weight
    #string representation of the object
    def __str__(self):
        return str(self.id) + " is connected to " + str([x.id for x in self.connectedTo])
    def getConnections(self):
        return self.connectedTo.keys()
    def getId(self):
        return self.id
    def getWeight(self,nbr):
        return self.connectedTo[nbr]
class Graph:
    def __init__(self):
        self.vertList = {}
        self.numVertices = 0
    def addVertex(self,key):
        self.numVertices += 1
        newVertex = Vertex(key)
        self.vertList[key] = newVertex
        return newVertex
    def getVertex(self,n):
        if n in self.vertList:
            return self.vertList[n]
        return None
    def addEdge(self,f,t,cost=0):
        if f not in self.vertList:
            nv = self.addVertex(f)
        if t not in self.vertList:
            nv = self.addVertex(t)
    def getVertices(self):
        return self.vertList.keys()
    def __iter__(self):
        return iter(self.vertList.values())
wordList = ["CAT", "BAT", "COT", "COG", "COW", "RAT", "BUT", "CUT", "DOG", "WED"]
def buildGraph(wordList):
    d = {} #in this dictionary the buckets will be the keys and the words will be their values
    g = Graph()
    for i in wordList:
        for j in range(len(i)):
            bucket = i[:j] + "_" + i[j+1:]
            if bucket in d:
                #we are storing the words that fall under the same bucket in a list
                d[bucket].append(i)
            else:
                d[bucket] = [i]
    # create vertices for the words under the buckets and join them
    for bucket in d.keys():
        for word1 in d[bucket]:
            for word2 in d[bucket]:
                #we ensure same words are not treated as two different vertices
                if word1 != word2:
                    g.addEdge(word1, word2)
    return g
# get the graph object
gobj = buildGraph(wordList)
for v in gobj: #the graph contains a set of vertices
    print(v)
The result I get is
BUT is connected to ['BAT']
CUT is connected to ['COT']
COW is connected to ['COG']
COG is connected to []
CAT is connected to []
DOG is connected to ['COG']
RAT is connected to ['BAT']
COT is connected to []
BAT is connected to []
I was hoping for results like
BUT is connected to ['BAT', 'CUT']
CUT is connected to ['CAT', 'COT', 'BUT']
and so on....
What am I doing wrong?

The problem is in your addEdge method.
You check whether the vertices are already present in the graph, which is fine. But even when they are present, you create new vertices anyway and add the edge to those new vertices, throwing away the previous ones. That is why each vertex ends up with at most one edge.
Just change the last line of addEdge to:
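A minimal sketch of an addEdge that is consistent with this explanation: it looks up the Vertex objects already stored in vertList instead of calling addVertex again. The classes are trimmed to the parts that matter here, and the exact replacement line is an assumption based on the description above.

```python
class Vertex:
    def __init__(self, key):
        self.id = key
        self.connectedTo = {}

    def addNeighbour(self, nbr, weight=0):
        self.connectedTo[nbr] = weight


class Graph:
    def __init__(self):
        self.vertList = {}

    def addVertex(self, key):
        newVertex = Vertex(key)
        self.vertList[key] = newVertex
        return newVertex

    def addEdge(self, f, t, cost=0):
        if f not in self.vertList:
            self.addVertex(f)
        if t not in self.vertList:
            self.addVertex(t)
        # Assumed fix: reuse the stored vertices rather than creating new ones.
        self.vertList[f].addNeighbour(self.vertList[t], cost)


g = Graph()
g.addEdge("COG", "DOG")
g.addEdge("COG", "COW")
g.addEdge("DOG", "COG")
print(sorted(v.id for v in g.vertList["COG"].connectedTo))  # ['COW', 'DOG']
```

Because addVertex overwrites vertList[key], calling it for a word that already exists replaces the old Vertex together with all the neighbours it had collected.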


How to simplify text comparison for big data-set where text meaning is same but not exact - deduplicate text data

I have a text data set of menu items (chocolate, cake, coke, etc.) with around 1.8 million records belonging to 6 categories (A, B, C, D, E, F). One of the categories has around 700k records. Many menu items appear in categories they don't belong to; for example, cake belongs to category 'A' but is also found in categories 'B' and 'C'.
I want to identify these misclassified items and report them to a staff member, but the challenge is that item names are not always spelled correctly, because they are typed in by hand. For example, chocolate might be entered as hot chclt, sweet choklate, chocolat, etc. There can also be items like chocolate cake ;)
To handle this, I tried a simple method that uses cosine similarity to compare items category by category and identify the anomalies, but it takes a lot of time because each item is compared against up to 1.8 million records (sample code below). Can anyone suggest a better way to deal with this problem?
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
def cos_similarity(a,b):
    X =a
    Y =b
    # tokenization
    X_list = word_tokenize(X)
    Y_list = word_tokenize(Y)
    # sw contains the list of stopwords
    sw = stopwords.words('english')
    l1 =[];l2 =[]
    # remove stop words from the string
    X_set = {w for w in X_list if not w in sw}
    Y_set = {w for w in Y_list if not w in sw}
    # form a set containing keywords of both strings
    rvector = X_set.union(Y_set)
    for w in rvector:
        if w in X_set: l1.append(1) # create a vector
        else: l1.append(0)
        if w in Y_set: l2.append(1)
        else: l2.append(0)
    c = 0
    # cosine formula
    for i in range(len(rvector)):
        c+= l1[i]*l2[i]
    if float((sum(l1)*sum(l2))**0.5)>0:
        cosine = c / float((sum(l1)*sum(l2))**0.5)
    else:
        cosine = 0
    return cosine
#Base code
cos_sim_list = []
for i in category_B.index:
    ln_cosdegree = 0
    ln_degsem = []
    for j in category_A.index:
        ln_j = str(category_A['item_name'][j])
        ln_i = str(category_B['item_name'][i])
        degreeOfSimilarity = cos_similarity(ln_j,ln_i)
        if degreeOfSimilarity>0.5:
Assume the text is already cleaned.
I used nearest-neighbour search and cosine similarity to solve this. I still have to run the code several times to compare category by category, but it remains effective because there are only a few categories. Please suggest a better solution if one is available.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
# ngrams is a user-defined analyzer function (not shown)
cat_A_clean = category_A['item_name'].unique()
print('Vectorizing the data - this could take a few minutes for large datasets...')
vectorizer = TfidfVectorizer(min_df=1, analyzer=ngrams, lowercase=False)
tfidf = vectorizer.fit_transform(cat_A_clean)
print('Vectorizing completed...')
from sklearn.neighbors import NearestNeighbors
nbrs = NearestNeighbors(n_neighbors=1, n_jobs=-1).fit(tfidf)
unique_B = set(category_B['item_name'].values)
def getNearestN(query):
    queryTFIDF_ = vectorizer.transform(query)
    distances, indices = nbrs.kneighbors(queryTFIDF_)
    return distances, indices
import time
t1 = time.time()
print('getting nearest n...')
distances, indices = getNearestN(unique_B)
t = time.time()-t1
print("COMPLETED IN:", t)
unique_B = list(unique_B)
print('finding matches...')
matches = []
for i,j in enumerate(indices):
    # cat_A_clean is a NumPy array (the result of .unique()), so index it directly
    temp = [round(distances[i][0],2), cat_A_clean[j[0]], unique_B[i]]
    matches.append(temp)
print('Building data frame...')
matches = pd.DataFrame(matches, columns=['Match confidence (lower is better)','ITEM_A','ITEM_B'])
def clean_string(text):
    text = str(text)
    text = text.lower()
    return text
def cosine_sim_vectors(vec1,vec2):
    vec1 = vec1.reshape(1,-1)
    vec2 = vec2.reshape(1,-1)
    return cosine_similarity(vec1,vec2)[0][0]
def cos_similarity(sentences):
    cleaned = list(map(clean_string,sentences))
    vectorizer = CountVectorizer().fit_transform(cleaned)
    vectors = vectorizer.toarray()
    return cosine_sim_vectors(vectors[0], vectors[1])
cos_sim_list =[]
for ind in matches.index:
    a = matches['Match confidence (lower is better)'][ind]
    b = matches['ITEM_A'][ind]
    c = matches['ITEM_B'][ind]
    degreeOfSimilarity = cos_similarity([b,c])
    cos_sim_list.append(degreeOfSimilarity)
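The ngrams analyzer passed to TfidfVectorizer above is not defined in the snippet. Below is a minimal sketch of what such an analyzer could look like, assuming character trigrams over lowercased text; the function body is an assumption, not the original author's code. It also shows why this representation tolerates typos: a misspelling still shares several trigrams with the intended word.

```python
import re


def ngrams(text, n=3):
    """Split a string into overlapping character n-grams (hypothetical analyzer)."""
    text = re.sub(r"\s+", " ", str(text).lower().strip())
    return [text[i:i + n] for i in range(len(text) - n + 1)]


print(ngrams("chocolat"))  # ['cho', 'hoc', 'oco', 'col', 'ola', 'lat']

# Trigrams shared by a correct spelling and a misspelling
shared = set(ngrams("chocolate")) & set(ngrams("choklate"))
print(sorted(shared))  # ['ate', 'cho', 'lat']
```

A function like this can be passed as the analyzer argument of TfidfVectorizer, which then calls it on each raw string instead of applying its own tokenization.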

Pos Tag Lemmatize giving only one row in output

After POS tagging the tokenized data, each entry has the form (word, pos_tag).
When I pass this on for lemmatization, only the first value gets lemmatized.
The dataframe has two columns:
ID Text
1 Lemmatization is an interesting part
After tokenizing and removing stop words:
ID Tokenize_data
1 'Lemmatization', 'interesting', 'part'
#Lemmatization with postag
import nltk
#Part of Speech Tagging
df2['tag_words'] = df2.tokenize_data.apply(nltk.pos_tag)
#Treebank to Wordnet
from nltk.corpus import wordnet
def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    return None
from nltk.stem.wordnet import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
def tagging(text):
    #tagged = nltk.pos_tag(tokens)
    for (word, tag) in text:
        wntag = get_wordnet_pos(tag)
        if wntag is None:# not supply tag in case of None
            lemma = lemmatizer.lemmatize(word)
        else:
            lemma = lemmatizer.lemmatize(word, pos=wntag)
        return lemma
tag1 = lambda x: tagging(x)
df2['lemma_tag'] = df2.tag_words.apply(tag1)
Output is coming as -
ID Lemma_words
1 'Lemmatize'
Expected -
ID Lemma_words
1 'Lemmatize', 'interest', 'part'
The function below works.
My code was not retaining the values of all the tuples in the POS-tag list, so only one value appeared in the output.
def lemmatize_sentence(text):
    #tokenize the sentence and find the POS tag for each token
    nltk_tagged = nltk.pos_tag(nltk.word_tokenize(text))
    #tuple of (token, wordnet_tag)
    #nltk_tag_to_wordnet_tag is assumed to do the same Treebank-to-WordNet mapping as get_wordnet_pos above
    wordnet_tagged = map(lambda x: (x[0], nltk_tag_to_wordnet_tag(x[1])), nltk_tagged)
    lemmatized_sentence = []
    for word, tag in wordnet_tagged:
        if tag is None:
            #if there is no available tag, append the token as is
            lemmatized_sentence.append(word)
        else:
            #else use the tag to lemmatize the token
            lemmatized_sentence.append(lemmatizer.lemmatize(word, tag))
    return lemmatized_sentence
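The difference between the two versions comes down to where the result is returned: a return inside the loop exits on the first token, while collecting results in a list keeps all of them. A minimal sketch of that contrast follows, using a hypothetical toy_lemmas lookup in place of WordNetLemmatizer so that it runs without any NLTK data.

```python
# Hypothetical stand-in for WordNetLemmatizer.lemmatize, so no NLTK data is needed.
toy_lemmas = {("running", "v"): "run", ("cats", "n"): "cat"}


def lemmatize(word, pos):
    return toy_lemmas.get((word, pos), word)


def first_only(tagged):
    for word, pos in tagged:
        return lemmatize(word, pos)  # returns during the first iteration


def all_tokens(tagged):
    return [lemmatize(word, pos) for word, pos in tagged]


tagged = [("cats", "n"), ("running", "v"), ("fast", "r")]
print(first_only(tagged))  # cat
print(all_tokens(tagged))  # ['cat', 'run', 'fast']
```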

NetworkX - allow duplicate nodes

import networkx as nx
import matplotlib.pyplot as plt
G = nx.DiGraph()
def calculate_lists(user_input):
    """ Calculates the number of occurrences of each character in a string."""
    input_list = []
    for i in user_input:
        input_list.append(i)
    occurence_list = []
    for i in set(input_list):
        occurence_list.append((i, user_input.count(i)))
    sorted_by_first = sorted(occurence_list, key=lambda tup: tup[1])
    sorted_list = list(reversed(sorted_by_first))
    propability_list = []
    for i in range(len(sorted_list)):
        propability_list.append(sorted_list[i][1])
    print("Input list is: ", input_list)
    print("Occurence list: ", occurence_list)
    print("Sorted list is: ", sorted_list)
    print("Probility list is: ", propability_list)
    return huffmann_algorithm(propability_list)
def huffmann_algorithm(prob_list):
    node_list = []
    while len(prob_list) != 1:
        first_minimum = min(float(s) for s in prob_list)
        print("First minimum", first_minimum)
        prob_list.remove(first_minimum)
        second_minimum = min(float(s) for s in prob_list)
        print("Second minimum", second_minimum)
        prob_list.remove(second_minimum)
        node_list.append([first_minimum, second_minimum])
        print("new value: ", first_minimum+second_minimum)
        new_value = int(first_minimum+second_minimum)
        prob_list.append(new_value)
    print("Finished: ", prob_list)
    count = 0
    for i in node_list:
        print("Nodes: ", tuple(i))
        G.add_edge(i[0], i[0]+i[1])
        G.add_edge(i[1], i[0]+i[1])
    print("Node list: ", node_list)
    nx.draw_networkx(G, with_labels=True, arrows=False)
    plt.show()
def main():
    user_input = str(input("Please enter a text: "))
    calculate_lists(user_input)
if __name__ == "__main__":
    main()
I'm trying to implement a version of the Huffman code in Python. However, I'm not able to add duplicate nodes to the graph. Is there a workaround to display values with the same text? To see what I mean, enter for example: aaaaabbbbcccdde
The graph only shows one node with the label 3.
I think you are confusing nodes with node labels. Having duplicate nodes in a graph doesn't really make sense. What I think you need here is duplicate labels.
To add the notion of labels to your graph, you can keep a dictionary that maps (unique) node identifiers to (possibly non-unique) node labels:
user_input = "aaaaabbbbcccdde"
# i is the node identifier and l is its corresponding label:
labels = {i: l for i, l in enumerate(user_input)}
nodes = labels.keys()
Using these you can construct your graph:
G = nx.DiGraph()
G.add_nodes_from(nodes)
Then you can, for example, draw it:
pos = nx.spring_layout(G)
nx.draw(G, pos)
nx.draw_networkx_labels(G, pos, labels)
And of course (probably most importantly), any time you have a node identifier, say node_id, you can retrieve its label with labels[node_id]. I suggest always working with node identifiers and only translating them to something human-readable, i.e. node labels, at the very end, when you need to print a result.
Depending on the complexity of your code, you may also find it useful to attach the labels to the nodes themselves; networkx allows that:
nx.set_node_attributes(G, labels, 'label')
You'll then have access to node attributes:
for node_id, u in G.nodes(data=True):
    print(u)
# Or if you have a node_identifier:
node_id = 1
print(G.nodes[node_id])
This would output:
{'label': 'a'}
{'label': 'a'}
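Applied to the Huffman case from the question, the same idea could look like the sketch below: every tree node gets a unique integer id, and the weight that should be displayed is kept in a separate labels dictionary, so two nodes can both show 3. The function name huffman_edges and the use of heapq are assumptions made for illustration; this is not the asker's original algorithm.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_edges(text):
    """Build Huffman tree edges using unique integer node ids.

    Returns (edges, labels): edges are (child_id, parent_id) pairs and
    labels maps each node id to its weight, which may repeat.
    """
    ids = count()
    labels = {}
    heap = []
    for _, freq in sorted(Counter(text).items()):
        node_id = next(ids)
        labels[node_id] = freq
        heapq.heappush(heap, (freq, node_id))
    edges = []
    while len(heap) > 1:
        w1, n1 = heapq.heappop(heap)
        w2, n2 = heapq.heappop(heap)
        parent = next(ids)
        labels[parent] = w1 + w2
        edges += [(n1, parent), (n2, parent)]
        heapq.heappush(heap, (w1 + w2, parent))
    return edges, labels


edges, labels = huffman_edges("aaaaabbbbcccdde")
print(len(labels), len(edges))  # 9 8
print(sorted(i for i, w in labels.items() if w == 3))  # [2, 5]
```

The result can then be handed to networkx with G.add_edges_from(edges) and drawn with nx.draw_networkx_labels(G, pos, labels), as in the answer above, so both nodes labelled 3 appear in the picture.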

Error in implementing BFS to find the shortest transformation from one word to another(Word Ladder Challenge)

I am trying to implement the word ladder problem, where I have to convert one word to another along the shortest possible path. Breadth-first search (BFS) is the obvious way to solve it, but first I have to build the graph. I have implemented the bucket concept, where words fall under a bucket if they match the bucket's pattern. But my graph is not being built correctly.
The given word list is ["CAT", "BAT", "COT", "COG", "COW", "RAT", "BUT", "CUT", "DOG", "WED"]
For each word I can create buckets. For example, for the word 'CAT' I can have three buckets: _AT, C_T and CA_. Similarly, I can create buckets for the rest of the words, and whichever words match a bucket's pattern fall under that bucket.
My code for expressing the problem as a graph works fine, and I get a graph like this (theoretical)
Now I need to find the smallest number of operations to transform 'CAT' into 'DOG', so I use a modified BFS to do it. It works fine when I make a sample graph of my own. For example
graph = {'COG': ['DOG', 'COW', 'COT'], 'CAT': ['COT', 'BAT', 'CAT', 'RAT'], 'BUT': ['CUT', 'BAT'], 'DOG': ['COG']}
The code works fine and I get the correct result. But with a huge list of words, say 1500, it's not feasible to type out a dictionary that long. So I wrote a function that takes the words from the list, applies the technique I discussed above and creates the graph for me, which works just fine up to this point. But when I try to get the shortest distance between two words, I get the following error
for neighbour in neighbours:
TypeError: 'Vertex' object is not iterable
Here is my code below
class Vertex:
    def __init__(self,key):
        self.id = key
        self.connectedTo = {}
    # add neighbouring vertices to the current vertex along with the edge weight
    def addNeighbour(self,nbr,weight=0):
        self.connectedTo[nbr] = weight
    #string representation of the object
    def __str__(self):
        return str(self.id) + " is connected to " + str([x.id for x in self.connectedTo])
    def getConnections(self):
        return self.connectedTo.keys()
    def getId(self):
        return self.id
    def getWeight(self,nbr):
        return self.connectedTo[nbr]
class Graph:
    def __init__(self):
        self.vertList = {}
        self.numVertices = 0
    def addVertex(self,key):
        self.numVertices += 1
        newVertex = Vertex(key)
        self.vertList[key] = newVertex
        return newVertex
    def getVertex(self,n):
        if n in self.vertList:
            return self.vertList[n]
        return None
    def addEdge(self,f,t,cost=0):
        if f not in self.vertList:
            nv = self.addVertex(f)
        if t not in self.vertList:
            nv = self.addVertex(t)
        self.vertList[f].addNeighbour(self.vertList[t], cost)
    def getVertices(self):
        return self.vertList.keys()
    def __iter__(self):
        return iter(self.vertList.values())
# I have only included few words in the list to focus on the implementation
wordList = ["CAT", "BAT", "COT", "COG", "COW", "RAT", "BUT", "CUT", "DOG", "WED"]
def buildGraph(wordList):
    d = {} #in this dictionary the buckets will be the keys and the words will be their values
    g = Graph()
    for i in wordList:
        for j in range(len(i)):
            bucket = i[:j] + "_" + i[j+1:]
            if bucket in d:
                #we are storing the words that fall under the same bucket in a list
                d[bucket].append(i)
            else:
                d[bucket] = [i]
    # create vertices for the words under the buckets and join them
    for bucket in d.keys():
        for word1 in d[bucket]:
            for word2 in d[bucket]:
                #we ensure same words are not treated as two different vertices
                if word1 != word2:
                    g.addEdge(word1, word2)
    return g
def bfs_shortest_path(graph, start, goal):
    explored = []
    queue = [[start]]
    if start == goal:
        return "The starting node and the destination node is same"
    while queue:
        path = queue.pop(0)
        node = path[-1]
        if node not in explored:
            neighbours = graph[node] # it shows the error here
            for neighbour in neighbours:
                new_path = list(path)
                new_path.append(neighbour)
                queue.append(new_path)
                if neighbour == goal:
                    return new_path
            explored.append(node)
    return "No connecting path between the two nodes"
# get the graph object
gobj = buildGraph(wordList)
# just to check if I am able to fetch the data properly as mentioned above where I get the error (neighbours = graph[node])
print(gobj["CAT"]) # ['COT', 'BAT', 'CUT', 'RAT']
print(bfs_shortest_path(gobj, "CAT", "DOG"))
To check the neighbouring vertices of each vertex, we can do
for v in gobj:
    print(v)
The output below correctly depicts the graph above.
CAT is connected to ['COT', 'BAT', 'CUT', 'RAT']
RAT is connected to ['BAT', 'CAT']
COT is connected to ['CUT', 'CAT', 'COG', 'COW']
CUT is connected to ['COT', 'BUT', 'CAT']
COG is connected to ['COT', 'DOG', 'COW']
DOG is connected to ['COG']
BUT is connected to ['BAT', 'CUT']
BAT is connected to ['BUT', 'CAT', 'RAT']
COW is connected to ['COT', 'COG']
CAT is connected to ['COT', 'BAT', 'CUT', 'RAT']
What could be going wrong then?
OK, so I figured out the issue. The problem was in this line of code
neighbours = graph[node]
It tries to fetch the neighbours of the given node, so it needs to access the vertList dictionary, which is an attribute of the Graph class. For an object to support this kind of lookup, its class has to implement the __getitem__ special method. So I declare it in the Graph class as follows
    # returns the value for the key which will be an object
    def __getitem__(self, key):
        return self.vertList[key]
Now graph[node] fetches the Vertex object for the node, since vertList stores the vertex name as key and the Vertex object as value. So I have to explicitly ask for the object's neighbours rather than the object itself. For that I call the getConnections() method of the Vertex class, which returns the keys of the connectedTo dictionary, i.e. the neighbouring Vertex objects (connectedTo has the Vertex object as key and the edge weight as value).
Each of those neighbour objects has its own id, which I can use in the BFS. The lines below are the modified code (in the bfs_shortest_path function) that does this.
        if node not in explored:
            neighbours = [x.id for x in graph[node].getConnections()]
Now I get the list of neighbours for the given node and use it. The rest of the code stays the same.
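For comparison, here is a compact sketch of the same approach that stores the graph as a plain dictionary of neighbour sets, so graph[node] yields the neighbours directly and no __getitem__ is needed. The function names build_graph and bfs_shortest_path, and the choice to return None when no path exists, are assumptions made for this example rather than the code from the post.

```python
from collections import defaultdict, deque


def build_graph(words):
    """Connect words that share a bucket pattern such as '_AT'."""
    buckets = defaultdict(list)
    for word in words:
        for i in range(len(word)):
            buckets[word[:i] + "_" + word[i + 1:]].append(word)
    graph = defaultdict(set)
    for members in buckets.values():
        for a in members:
            for b in members:
                if a != b:
                    graph[a].add(b)
    return graph


def bfs_shortest_path(graph, start, goal):
    """Return the shortest path as a list of words, or None if unreachable."""
    if start == goal:
        return [start]
    visited = {start}
    queue = deque([[start]])
    while queue:
        path = queue.popleft()
        for neighbour in sorted(graph.get(path[-1], ())):
            if neighbour in visited:
                continue
            if neighbour == goal:
                return path + [neighbour]
            visited.add(neighbour)
            queue.append(path + [neighbour])
    return None


words = ["CAT", "BAT", "COT", "COG", "COW", "RAT", "BUT", "CUT", "DOG", "WED"]
graph = build_graph(words)
print(bfs_shortest_path(graph, "CAT", "DOG"))  # ['CAT', 'COT', 'COG', 'DOG']
print(bfs_shortest_path(graph, "CAT", "WED"))  # None
```

Using a deque avoids the linear cost of list.pop(0), and marking words as visited when they are enqueued keeps each word from being queued more than once.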

How to separate two concatenated words

I have a review dataset and I want to process it using NLP techniques. I did all the preprocessing stages (remove stop words, stemming, etc.). My problem is that there are some words, which are connected to each other and my function doesn't understand those. Here is an example:
Great services. I had a nicemeal and I love it a lot.
How can I correct it from nicemeal to nice meal?
Peter Norvig has a nice solution to the word segmentation problem that you are encountering. Long story short, he uses a large dataset of word (and bigram) frequencies and some dynamic programming to split long strings of connected words into their most likely segmentation.
You can download the zip file with the source code and the word frequencies and adapt it to your use case. Here is the relevant bit, for completeness.
import operator
from functools import reduce
def memo(f):
    "Memoize function f."
    table = {}
    def fmemo(*args):
        if args not in table:
            table[args] = f(*args)
        return table[args]
    fmemo.memo = table
    return fmemo
@memo
def segment(text):
    "Return a list of words that is the best segmentation of text."
    if not text: return []
    candidates = ([first]+segment(rem) for first,rem in splits(text))
    return max(candidates, key=Pwords)
def splits(text, L=20):
    "Return a list of all possible (first, rem) pairs, len(first)<=L."
    return [(text[:i+1], text[i+1:])
            for i in range(min(len(text), L))]
def Pwords(words):
    "The Naive Bayes probability of a sequence of words."
    return product(Pw(w) for w in words)
#### Support functions (p. 224)
def product(nums):
    "Return the product of a sequence of numbers."
    return reduce(operator.mul, nums, 1)
class Pdist(dict):
    "A probability distribution estimated from counts in datafile."
    def __init__(self, data=[], N=None, missingfn=None):
        for key,count in data:
            self[key] = self.get(key, 0) + int(count)
        self.N = float(N or sum(self.values()))
        self.missingfn = missingfn or (lambda k, N: 1./N)
    def __call__(self, key):
        if key in self: return self[key]/self.N
        else: return self.missingfn(key, self.N)
def datafile(name, sep='\t'):
    "Read key,value pairs from file."
    for line in open(name):
        yield line.split(sep)
def avoid_long_words(key, N):
    "Estimate the probability of an unknown word."
    return 10./(N * 10**len(key))
N = 1024908267229 ## Number of tokens
Pw = Pdist(datafile('count_1w.txt'), N, avoid_long_words)
You can also use the segment2 method as it uses bigrams and is much more accurate.
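To try the idea without downloading the frequency file, here is a small self-contained sketch of the unigram version. The COUNTS dictionary is a hypothetical toy table standing in for count_1w.txt, and functools.lru_cache replaces the memo helper; with real counts the probabilities, and possibly the chosen segmentation, would differ.

```python
from functools import lru_cache
from math import prod

# Hypothetical toy counts standing in for count_1w.txt.
COUNTS = {"nice": 50, "meal": 30, "a": 200, "i": 150, "me": 40, "al": 1}
N = sum(COUNTS.values())


def Pw(word):
    """Unigram probability; unknown words are penalised by their length."""
    if word in COUNTS:
        return COUNTS[word] / N
    return 10.0 / (N * 10 ** len(word))


def splits(text, L=20):
    """All (first, rest) pairs with len(first) <= L."""
    return [(text[:i + 1], text[i + 1:]) for i in range(min(len(text), L))]


@lru_cache(maxsize=None)
def segment(text):
    """Return the most probable segmentation of text as a tuple of words."""
    if not text:
        return ()
    candidates = ((first,) + segment(rest) for first, rest in splits(text))
    return max(candidates, key=lambda words: prod(Pw(w) for w in words))


print(segment("nicemeal"))  # ('nice', 'meal')
```

Tuples are used instead of lists so that the cached results are immutable and cannot be modified by accident between calls.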
