Tips for optimizing my Python graph creation code - python-3.x

My programming task is this: given an input file words.txt, create a graph where the nodes are the words themselves and edges connect words that differ by exactly one letter (e.g. "abcdef" and "abcdeg" would share an edge).
For later stages of the problem I really need the adjacency list implementation (so no adjacency matrix). I'd prefer not to use external libraries if possible.
My code is functional, but takes about 7 seconds to run (for ~5000 words). I need help optimizing it further. The main optimization I made was to build a set of "letter+index" strings for each word and compare those sets, which was a lot faster than something like sum(x != y for x, y in zip(word1, word2)).
This is the code I have so far:
def createGraph(filename):
    """Time taken to generate graph: {} seconds."""
    wordReference = {}  # Words and their smartsets (for comparisons)
    wordEdges = {}      # Words and their edges (graph representation)
    edgeCount = 0       # Count of edges
    with open(filename) as f:
        for raw in f.readlines():  # For each word in the file (stripping trailing \n)
            word = raw.strip()
            wordReference[word] = set("{}{}".format(letter, i) for i, letter in enumerate(word))  # Create smartsets in wordReference
            wordEdges[word] = set()  # Create a set to contain the edges
            for parsed in wordEdges:  # For each of the words we've already parsed
                if len(wordReference[word] & wordReference[parsed]) == 5:  # Compare smartsets to see if there should be an edge
                    wordEdges[parsed].add(word)  # Add edge parsed -> word
                    wordEdges[word].add(parsed)  # Add edge word -> parsed
                    edgeCount += 1               # Add 1 to edgeCount
    return wordEdges, edgeCount, len(wordReference)  # Return dictionary of words and their edges, the edge count and the word count
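As a point of comparison (this is my sketch, not part of your code), the pairwise comparison loop can be avoided entirely by bucketing words under wildcard patterns: each word is filed under every variant with one position blanked out, and two words differ by one letter exactly when they share a bucket. A rough sketch of the idea, assuming one word per line and all words of equal length:

from collections import defaultdict

def create_graph_bucketed(filename):
    # Bucket every word under each of its wildcard patterns, e.g. "abcdef" lands in
    # "_bcdef", "a_cdef", ..., "abcde_". Two distinct words differ in exactly one
    # letter precisely when they share such a bucket, so no all-pairs loop is needed.
    buckets = defaultdict(list)
    wordEdges = {}
    with open(filename) as f:
        for raw in f:
            word = raw.strip()
            wordEdges[word] = set()
            for i in range(len(word)):
                buckets[word[:i] + "_" + word[i + 1:]].append(word)
    edgeCount = 0
    for bucket in buckets.values():
        for i, a in enumerate(bucket):
            for b in bucket[i + 1:]:
                if a != b and b not in wordEdges[a]:
                    wordEdges[a].add(b)
                    wordEdges[b].add(a)
                    edgeCount += 1
    return wordEdges, edgeCount, len(wordEdges)

This does len(word) dictionary inserts per word and then only touches the pairs that actually share a bucket, which is usually far less work than the roughly N^2/2 set intersections in the version above.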

Related

Generating a random string with matched brackets

I need to generate a random string of a certain length – say ten characters, for the sake of argument – composed of the characters a, b, c, (, ), with the rule that parentheses must be matched.
So for example aaaaaaaaaa, abba()abba and ((((())))) are valid strings, but )aaaabbbb( is not.
What algorithm would generate a random string, uniformly sampled from the set of all strings consistent with those rules? (And run faster than 'keep generating strings without regard to the balancing rule, discard the ones that fail it', which could end up generating very many invalid strings before finding a valid one.)
A string consisting only of balanced parentheses (for any arbitrary pair of characters representing an open and a close) is called a "Dyck string", and the number of such strings with p pairs of parentheses is the pth Catalan number, which can be computed as C(2p, p)/(p+1), a formula that would be much easier to read if only SO allowed MathJax. If you also want to allow k other non-parenthesis characters, then for each number p ≤ n of pairs of balanced parentheses in a string of total length 2n, you need the number of ways to fill the non-parenthesis positions, k^(2n−2p), and the number of ways to choose which 2n−2p of the 2n positions hold those characters, C(2n, 2n−2p) = C(2n, 2p). Summing these products over every possible value of p gives the size of the total universe of possibilities; you can then choose a random number in that range and see which value of p it corresponds to. Then you can select a random placement of random non-parenthesis characters.
Finally, you need a uniformly distributed Dyck string; a simple procedure is to decompose the Dyck string into its shortest balanced prefix and the remainder (i.e. (A)B, where A and B are balanced sequences). Select a random length for (A), weighted by the number of Dyck strings that have that decomposition, then recursively generate a random A and a random B.
Precomputing the tables of counts (or memoising the function which generates them) will produce a speedup if you expect to generate a lot of random strings.
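As an illustration of the counting step described above (my sketch, not the answerer's code), the universe size for strings of a given length over k extra letters can be computed directly from that sum, using math.comb (Python 3.8+):

from math import comb

def count_balanced_strings(length, k):
    # Sum over p = number of parenthesis pairs:
    #   comb(length, 2*p)    ways to choose which positions hold parentheses
    # * Catalan(p)           balanced arrangements of those parentheses
    # * k ** (length - 2*p)  fillings of the remaining positions with letters
    total = 0
    for p in range(length // 2 + 1):
        catalan = comb(2 * p, p) // (p + 1)
        total += comb(length, 2 * p) * catalan * k ** (length - 2 * p)
    return total

print(count_balanced_strings(10, 3))  # universe size for length-10 strings over a, b, c, (, )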
Use dynamic programming to generate a data structure that knows how many there are for each choice, recursively. Then use that data structure to find a random choice.
I seem to be the only person who uses this technique, and I always write it from scratch. But here is working code that hopefully explains it. It will take O(length_of_string * (length_of_alphabet + 2)) time and a similar amount of memory.
import random

class DPPath:
    def __init__(self):
        self.count = 0
        self.next = None

    def add_option(self, transition, tail):
        if self.next is None:
            self.next = {}
        self.next[transition] = tail
        self.count += tail.count

    def random(self):
        if 0 == self.count:
            return None
        else:
            return self.find(int(random.random() * self.count))

    def find(self, pos):
        result = self._find(pos)
        return "".join(reversed(result))

    def _find(self, pos):
        if self.next is None:
            return []
        for transition, tail in self.next.items():
            if pos < tail.count:
                result = tail._find(pos)
                result.append(transition)
                return result
            else:
                pos -= tail.count
        raise IndexError("find out of range")

def balanced_dp(n, alphabet):
    # Record that there is 1 empty string with balanced parens.
    base_dp = DPPath()
    base_dp.count = 1
    dps = [base_dp]
    for _ in range(n):
        # We are working backwards towards the start.
        prev_dps = [DPPath()]
        for i in range(len(dps)):
            # prev_dps needs to be bigger in case of closed paren.
            prev_dps.append(DPPath())
            # If there are closed parens, we can open one.
            if 0 < i:
                prev_dps[i-1].add_option('(', dps[i])
            # alphabet chars don't change paren balance.
            for char in alphabet:
                prev_dps[i].add_option(char, dps[i])
            # Add a closed paren.
            prev_dps[i+1].add_option(")", dps[i])
        # And we are done with this string position.
        dps = prev_dps
    # Return the one that wound up balanced.
    return dps[0]

# And a quick demo of several random strings.
for _ in range(10):
    print(balanced_dp(10, "abc").random())

Find all cycles with at least 3 nodes in a directed graph using dictionary data structure

The graph (drawn using LaTeX: https://www.overleaf.com/read/rxhpghzbkhby) is represented as a dictionary in Python:
graph = {
    'A' : ['B', 'D', 'C'],
    'B' : ['C'],
    'C' : [],
    'D' : ['E'],
    'E' : ['G'],
    'F' : ['A', 'I'],
    'G' : ['A', 'K'],
    'H' : ['F', 'G'],
    'I' : ['H'],
    'J' : ['A'],
    'K' : []
}
I have a large graph of about 3,378,546 nodes.
Given the directed graph above, I am trying to find cycles with at least 3 and fewer than 5 different nodes, and to output the first 3 such cycles.
I spent a day and a half on this problem. I looked on Stack Overflow and even tried to follow this Detect Cycle in a Directed Graph tutorial, but couldn't come up with a solution.
In this example, the output is a tab-delimited text file where each line contains a cycle.
0 A, D, E, G
1 F, I, H
0 and 1 are indexes.
Also, the nodes of a cycle do not have to be in alphabetical order.
I tried this approach from the How to implement depth-first search in Python tutorial:
visited = set()

def dfs(visited, graph, node):
    if node not in visited:
        print(node)
        visited.add(node)
        for neighbour in graph[node]:
            dfs(visited, graph, neighbour)

dfs(visited, graph, 'A')
But this doesn't help. I also tried this post.
Here is commented code that prints the list containing the cycles found. Not much more should be necessary, I believe, to adjust the return value to the desired format (CSV in your case, I think).
It could be that with 3M nodes this turns out to be slow. I would then suggest going the dynamic programming way and caching/memoizing the results of some recursive calls so they are not repeated.
I hope this solves your problem or at least helps.
def cycles_rec(root, current_node, graph, depth, visited, min_depth, max_depth):
    depth += 1
    # First part: our stop conditions
    if current_node in visited or current_node not in graph.keys():
        return ''
    if depth >= max_depth:
        return ''
    visited.append(current_node)
    if root in graph[current_node] and depth >= min_depth:
        return current_node
    # The recursive part:
    # for each connection we try to find recursively one that would cycle back to our root
    for connection in graph[current_node]:
        result = cycles_rec(root, connection, graph, depth, visited, min_depth, max_depth)
        # If a match was found, it "bubbles up" here; we can return it along with the
        # current node that "found" it
        if result != '':
            return current_node + ' ' + result
    # If we get here we found no cycle
    return ''

def cycles(graph, min_depth=3, max_depth=5):
    cycles = {}
    for node, connections in graph.items():
        for connection in connections:
            visited = []
            # Let the recursion begin here
            result = cycles_rec(node, connection, graph, 1, visited, min_depth, max_depth)
            if result == '':
                continue
            # Here we found a cycle.
            # The fingerprint is only needed so the same cycle is not repeated in the results;
            # it could be skipped if repeating them is not important.
            # It relies on the fact that nodes are all represented as letters here;
            # it could be its own function returning a hash, for example, if nodes have a more
            # complex representation.
            fingerprint = ''.join(sorted(list(node + ' ' + result)))
            if fingerprint not in cycles.keys():
                cycles[fingerprint] = node + ' ' + result
    return list(cycles.values())
So, assuming the graph variable you declared in your example:
print(cycles(graph, 3, 5))
Would print out
['A D E G', 'F I H']
NOTE: This solution extends the one described above. I applied it to the original graph with ~3 million nodes, looking for all cycles that have at least 3 and fewer than 40 nodes, and I store the first 3 cycles in a file.
I came up with the following solution.
# Implementation of Johnson's cycle finding algorithm.
# Original paper: Donald B. Johnson, "Finding all the elementary circuits of a directed graph",
# SIAM Journal on Computing, 1975.
from collections import defaultdict

import networkx as nx
from networkx.utils import not_implemented_for, pairwise

# @not_implemented_for("undirected")
def findCycles(G):
    """Find simple cycles of a directed graph.

    A `simple cycle` is a closed path where no node appears twice.
    Two elementary circuits are distinct if they are not cyclic permutations of each other.
    This is an iterator/generator version of Johnson's algorithm [1]_.
    There may be better algorithms for some cases [2]_ [3]_.

    Parameters
    ----------
    G : NetworkX DiGraph
        A directed graph

    Returns
    -------
    cycle_generator: generator
        A generator that produces elementary cycles of the graph.
        Each cycle is represented by a list of nodes along the cycle.

    Examples
    --------
    >>> graph = {'A' : ['B', 'D', 'C'],
    ...          'B' : ['C'],
    ...          'C' : [],
    ...          'D' : ['E'],
    ...          'E' : ['G'],
    ...          'F' : ['A', 'I'],
    ...          'G' : ['A', 'K'],
    ...          'H' : ['F', 'G'],
    ...          'I' : ['H'],
    ...          'J' : ['A'],
    ...          'K' : []}
    >>> G = nx.DiGraph()
    >>> G.add_nodes_from(graph.keys())
    >>> for keys, values in graph.items():
    ...     G.add_edges_from([(keys, node) for node in values])
    >>> list(findCycles(G))
    [['F', 'I', 'H'], ['G', 'A', 'D', 'E']]

    Notes
    -----
    The implementation follows pp. 79-80 in [1]_.
    The time complexity is $O((n+e)(c+1))$ for $n$ nodes, $e$ edges and $c$
    elementary circuits.

    References
    ----------
    .. [1] Finding all the elementary circuits of a directed graph.
       D. B. Johnson, SIAM Journal on Computing 4, no. 1, 77-84, 1975.
       https://doi.org/10.1137/0204007
    .. [2] Enumerating the cycles of a digraph: a new preprocessing strategy.
       G. Loizou and P. Thanish, Information Sciences, v. 27, 163-182, 1982.
    .. [3] A search strategy for the elementary cycles of a directed graph.
       J.L. Szwarcfiter and P.E. Lauer, BIT Numerical Mathematics,
       v. 16, no. 2, 192-204, 1976.
    """
    def _unblock(thisnode, blocked, B):
        stack = {thisnode}
        while stack:
            node = stack.pop()
            if node in blocked:
                blocked.remove(node)
                stack.update(B[node])
                B[node].clear()

    # Johnson's algorithm requires some ordering of the nodes.
    # We assign the arbitrary ordering given by the strongly connected comps.
    # There is no need to track the ordering as each node is removed as processed.
    # Also we save the actual graph so we can mutate it. We only take the
    # edges because we do not want to copy edge and node attributes here.
    subG = type(G)(G.edges())
    sccs = [scc for scc in nx.strongly_connected_components(subG) if 3 <= len(scc) <= 40]

    # Johnson's algorithm excludes self-cycle edges like (v, v).
    # To be backward compatible, we record those cycles in advance
    # and then remove them from subG.
    for v in subG:
        if subG.has_edge(v, v):
            yield [v]
            subG.remove_edge(v, v)

    while sccs:
        scc = sccs.pop()
        sccG = subG.subgraph(scc)
        # order of scc determines ordering of nodes
        startnode = scc.pop()
        # Processing node runs the "circuit" routine from the recursive version
        path = [startnode]
        blocked = set()   # vertex: blocked from search?
        closed = set()    # nodes involved in a cycle
        blocked.add(startnode)
        B = defaultdict(set)  # graph portions that yield no elementary circuit
        stack = [(startnode, list(sccG[startnode]))]  # sccG gives comp nbrs
        while stack:
            thisnode, nbrs = stack[-1]
            if nbrs:
                nextnode = nbrs.pop()
                if nextnode == startnode:
                    yield path[:]
                    closed.update(path)
                    # print("Found a cycle", path, closed)
                elif nextnode not in blocked:
                    path.append(nextnode)
                    stack.append((nextnode, list(sccG[nextnode])))
                    closed.discard(nextnode)
                    blocked.add(nextnode)
                    continue
            # done with nextnode... look for more neighbors
            if not nbrs:  # no more nbrs
                if thisnode in closed:
                    _unblock(thisnode, blocked, B)
                else:
                    for nbr in sccG[thisnode]:
                        if thisnode not in B[nbr]:
                            B[nbr].add(thisnode)
                stack.pop()
                path.pop()
        # done processing this node
        H = subG.subgraph(scc)  # make smaller to avoid work in SCC routine
        sccs.extend(scc for scc in nx.strongly_connected_components(H) if 3 <= len(scc) <= 40)
import sys, csv, json

def findAllCycles(jsonInputFile, textOutFile):
    """Find simple cycles of a directed graph (jsonInputFile).

    Parameters
    ----------
    jsonInputFile: a json file that has all concepts
    textOutFile: the desired name of the output file

    Returns
    -------
    a text file (named textOutFile) with the first 3 cycles found in jsonInputFile.
    Each cycle is represented by a list of nodes along the cycle.
    """
    with open(jsonInputFile) as infile:
        graph = json.load(infile)
    # Convert the json file to a NetworkX directed graph
    G = nx.DiGraph()
    G.add_nodes_from(graph.keys())
    for keys, values in graph.items():
        G.add_edges_from([(keys, node) for node in values])
    # Search for all simple cycles that exist in the graph
    _cycles = list(findCycles(G))
    # Start with an empty list and populate it by looping over all cycles
    # in _cycles that have at least 3 and fewer than 40 different concepts (nodes)
    cycles = []
    for cycle in _cycles:
        if 3 <= len(cycle) <= 40:
            cycles.append(cycle)
    # Store the cycles under that constraint in textOutFile
    with open(textOutFile, 'w') as outfile:
        for cycle in cycles[:3]:
            outfile.write(','.join(n for n in cycle) + '\n')
    # When the process finishes, report that we are done
    return 'Done!!'

infile = sys.argv[1]
outfile = sys.argv[2]
first_cycles = findAllCycles(infile, outfile)
To run this program, you simply use a command line as follows:
>> python3 {program file name}.py graph.json {desired output file name}.txt (or .csv)
For example, let {desired output file name}.txt be first_3_cycles_found.txt.
In my case, the graph has 3,378,546 nodes and it took ~40 minutes to find all cycles using the above code; the first three of them go into the output file.
Please contribute to this if you see it needs any improvement or something else to be added.

What is the time complexity for a nested loop in this case?

I'm trying to tokenize a text file. I created a list of the lines in the file using readlines() and plan to loop through each sentence in that list, splitting it with re.split(). I then plan to loop through the resulting list to add each word to a dictionary that counts how many times each word occurs. Would this implementation with a nested loop run in O(N^2) or O(N)? Thanks.
This code is just an example of how I plan to implement it.
for sentence in list:
    result = re.split(sentence)
    for word in result:
        dictionary[word] += 1
for sentence in list:          # n times (n = number of sentences in the list)
    result = re.split(sentence)
    for word in result:        # m times (m = number of words in the sentence)
        dictionary[word] += 1

so the runtime is proportional to n * m, the total number of words processed. That is only "n squared" if sentences get longer as the file grows; measured against the total input size it is linear, since each word is handled exactly once and each dictionary update is O(1) on average.
A better way to solve a counting problem is by using collections.Counter.
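For example (my sketch; the file name and the split pattern are placeholders, not from the question), the whole count can be done in one pass with Counter:

import re
from collections import Counter

word_counts = Counter()
with open("input.txt") as f:                 # placeholder file name
    for sentence in f:                       # one pass over the file
        word_counts.update(w for w in re.split(r"\W+", sentence) if w)

print(word_counts.most_common(10))           # the ten most frequent words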

Sliding Window and Recognizing Specific Characters in a List

Instructions: Write a script that will calculate the %GC of a dna string
based on a sliding window of adjustable size. So say the length of
the window is L = 10 bases, then you will move the window along
the dna strand from position 0 to the end (careful, not too far...)
and 'extract' the bases into a substring and analyze GC content.
Put the numbers in a list. The dna string may be very large so you
will want to read the string in from an infile, and print the results
to a comma-delimited outfile that can be ported into Excel to plot.
For the final data analysis, use a window of L = 100 and analyze the two genomes in files:
Bacillus_amyloliquefaciens_genome.txt
Deinococcus_radiodurans_R1_chromosome_1.txt
But first, to get your script functioning, use the following trainer data set. Let the window be L = 4. Example input and output follow:
INPUT:
AACGGTT
OUTPUT:
0,0.50
1,0.75
2,0.75
3,0.50
My code:
dna = ['AACGGTT']

def slidingWindow(dna, winSize, step):
    """Returns a generator that will iterate through
    the defined chunks of input sequence. Input sequence
    must be iterable."""
    # Verify the inputs
    # try: it = iter(dna)
    # except TypeError:
    #     raise Exception("**ERROR** sequence must be iterable.")
    if not ((type(winSize) == type(0)) and (type(step) == type(0))):
        raise Exception("**ERROR** type(winSize) and type(step) must be int.")
    if step > winSize:
        raise Exception("**ERROR** step must not be larger than winSize.")
    if winSize > len(dna):
        raise Exception("**ERROR** winSize must not be larger than sequence length.")
    # Pre-compute number of chunks to emit
    numOfwins = ((len(dna) - winSize) / step) + 1
    # Do the work
    for i in range(0, numOfwins * step, step):
        yield dna[i:i + winSize]

chunks = slidingWindow(dna, len(dna), step)
for y in chunks:
    total = 1
    search = dna[y]
    percentage = (total / len(dna))
    if search == "C":
        total = total + 1
        print("#", y, percentage)
    elif search == "G":
        total = total + 1
        print("#", y, percentage)
    else:
        print("#", y, "0.0")

"""
MAIN
calling the functions from here
"""
# YOUR WORK HERE
# print ("#", z, percentage)
When approaching a complex problem, it is helpful to divide it into simpler sub-problems. Here, you have at least two separate concepts: a window of bases, and statistics on such a window. Why don't you tackle them one at a time?
Here is a simple generator that produces chunks of the desired size:
def get_chunks(dna, window_size=4, stride=1):
    for i in range(0, len(dna) - window_size + 1, stride):
        chunk = dna[i:i + window_size]
        assert len(chunk) == window_size
        yield chunk

for chunk in get_chunks('AACGGTT'):
    print(chunk)
It displays this output:
AACG
ACGG
CGGT
GGTT
Now, with that in hand, could you write a simple function that accepts a four-character string and produces an appropriate statistical summary of it? [Please post it as a separate answer to your question. Yes, it might sound odd at first, but StackOverflow does encourage you to post answers to your questions, so you can share what you have learned.]
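For reference, here is a sketch of one possible shape for such a function, combined with the get_chunks generator above (the exercise still expects you to write and post your own):

def gc_fraction(chunk):
    # Fraction of G and C bases in a window, e.g. gc_fraction("AACG") == 0.5
    return sum(base in "GC" for base in chunk) / len(chunk)

for i, chunk in enumerate(get_chunks("AACGGTT")):
    print("{},{:.2f}".format(i, gc_fraction(chunk)))   # reproduces the trainer output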

Algorithm to merge similar text by keeping the same sequence (ex: deduplicate log file)

I want to group similar lines of text in a log file.
Example input:
user_id:1234 image_id:1234 seq:1: failed to upload data
user_id:12 image_id:234 seq:2: failed to upload data
user_id:34 image_id:123 seq:3: failed to upload data
fail processing data for user_id:12 image_23
fail processing data for user_id:12 image_23
expected output:
user_id:____ image_id:____ seq:_ failed to upload data -> 3
fail processing data for user_id:__ image___ -> 2
What I tried is using Python's difflib.SequenceMatcher (pseudo code):
sms = {}
for err in errors:
    for pattern in errors_map.keys():
        # SequenceMatcher caches information gathered about the second sequence:
        sms.setdefault(pattern, SequenceMatcher(b=pattern, autojunk=False))
        s = sms[pattern]
        s.set_seq1(err)
        # check some threshold
        if s.quick_ratio() <= similarity_threshold:
            continue
        matching_blocks = s.get_matching_blocks()
        # if ratio >= similarity_threshold,
        # and the first matching block is located at the beginning,
        # and the size of the first matching block > 10,
        # construct the whole string & replace non-matching words with _
        if matching_blocks[0][0] == 0 and matching_blocks[0][1] == 0 and matching_blocks[0][2] > 10:
            mblocks = []
            prev_a = prev_l = 0
            for a, b, l in matching_blocks:
                if l > 0:
                    if prev_l > 0:
                        len_non_matching = len(err[prev_a + prev_l:a])
                        mblocks.append('_' * len_non_matching)
                    mblocks.append(err[a:a + l])
                    prev_a = a
                    prev_l = l
            mblocks = ''.join(mblocks)
The result is not that good. I'm wondering if there is a better approach or library that does this already?
An approach could be to cluster the strings and then search for the Longest Common Subsequence within each cluster.
In the general case you could use Levenshtein distance for clustering (depending on your assumptions, K-Means or DBSCAN could make sense). Computing the LCS is NP-hard if you have no other assumptions about your strings.
A more approximate approach could look only at the tokens ("user_id", "1234", "image_id", ...), use a set-based distance (e.g. the Jaccard index) for clustering, and then, relaxing the LCS requirement, report the most common tokens within each cluster, as sketched below.
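A minimal sketch of that token/Jaccard idea (my own illustration, not a library call; the digit masking, the 0.5 threshold and the greedy single-pass clustering are all assumptions to tune):

import re

def tokens(line):
    # Mask digits, then split on whitespace and colons.
    return set(t for t in re.split(r"[\s:]+", re.sub(r"\d+", "_", line)) if t)

def jaccard(a, b):
    return len(a & b) / len(a | b) if (a or b) else 1.0

def cluster(lines, threshold=0.5):
    # Greedy single-pass clustering: each line joins the first cluster whose
    # representative token set is close enough, otherwise it starts a new cluster.
    clusters = []                      # list of (representative_tokens, [lines])
    for line in lines:
        t = tokens(line)
        for rep, members in clusters:
            if jaccard(t, rep) >= threshold:
                members.append(line)
                break
        else:
            clusters.append((t, [line]))
    return clusters

log_lines = [
    "user_id:1234 image_id:1234 seq:1: failed to upload data",
    "user_id:12 image_id:234 seq:2: failed to upload data",
    "fail processing data for user_id:12 image_23",
]
for rep, members in cluster(log_lines):
    print(len(members), "->", members[0])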
