Alien Dictionary
Link to the online judge -> LINK
Given a sorted dictionary of an alien language having N words and k starting alphabets of standard dictionary. Find the order of characters in the alien language.
Note: Many orders may be possible for a particular test case, thus you may return any valid order and output will be 1 if the order of string returned by the function is correct else 0 denoting incorrect string returned.
Example 1:
Input:
N = 5, K = 4
dict = {"baa","abcd","abca","cab","cad"}
Output:
1
Explanation:
Here order of characters is
'b', 'd', 'a', 'c' Note that words are sorted
and in the given language "baa" comes before
"abcd", therefore 'b' is before 'a' in output.
Similarly we can find other orders.
My working code:
from collections import defaultdict

class Solution:
    def __init__(self):
        self.vertList = defaultdict(list)

    def addEdge(self,u,v):
        self.vertList[u].append(v)

    def topologicalSortDFS(self,givenV,visited,stack):
        visited.add(givenV)
        for nbr in self.vertList[givenV]:
            if nbr not in visited:
                self.topologicalSortDFS(nbr,visited,stack)
        stack.append(givenV)

    def findOrder(self,dict, N, K):
        list1 = dict
        for i in range(len(list1)-1):
            word1 = list1[i]
            word2 = list1[i+1]
            rangej = min(len(word1),len(word2))
            for j in range(rangej):
                if word1[j] != word2[j]:
                    u = word1[j]
                    v = word2[j]
                    self.addEdge(u,v)
                    break
        stack = []
        visited = set()
        vlist = [v for v in self.vertList]
        for v in vlist:
            if v not in visited:
                self.topologicalSortDFS(v,visited,stack)
        result = " ".join(stack[::-1])
        return result
#{
# Driver Code Starts
#Initial Template for Python 3
class sort_by_order:
    def __init__(self,s):
        self.priority = {}
        for i in range(len(s)):
            self.priority[s[i]] = i
    def transform(self,word):
        new_word = ''
        for c in word:
            new_word += chr( ord('a') + self.priority[c] )
        return new_word
    def sort_this_list(self,lst):
        lst.sort(key = self.transform)
if __name__ == '__main__':
    t=int(input())
    for _ in range(t):
        line=input().strip().split()
        n=int(line[0])
        k=int(line[1])
        alien_dict = [x for x in input().strip().split()]
        duplicate_dict = alien_dict.copy()
        ob=Solution()
        order = ob.findOrder(alien_dict,n,k)
        x = sort_by_order(order)
        x.sort_this_list(duplicate_dict)
        if duplicate_dict == alien_dict:
            print(1)
        else:
            print(0)
My problem:
The code runs fine for the test cases that are given in the example but fails for ["baa", "abcd", "abca", "cab", "cad"]
It throws the following error for this input:
Runtime Error:
Traceback (most recent call last):
  File "/home/e2beefe97937f518a410813879a35789.py", line 73, in <module>
    x.sort_this_list(duplicate_dict)
  File "/home/e2beefe97937f518a410813879a35789.py", line 58, in sort_this_list
    lst.sort(key = self.transform)
  File "/home/e2beefe97937f518a410813879a35789.py", line 54, in transform
    new_word += chr( ord('a') + self.priority[c] )
KeyError: 'f'
Running in some other IDE:
If I explicitly give this input using some other IDE then the output I'm getting is b d a c
Interesting problem. Your idea is correct: the constraints form a partially ordered set, so you can build a directed acyclic graph and find an ordered list of vertices using a topological sort.
Your program fails because some letters may never be added to your vertList: a letter that is not involved in any ordering constraint never becomes a vertex, so it is missing from the order you return.
Spoiler: adding the following line somewhere in your code solves the issue
vlist = [chr(ord('a') + v) for v in range(K)]
A simple failing example
Consider the input
2 4
baa abd
This will determine the following vertList
{"b": ["a"]}
The only constraint is that b must come before a in this alphabet. Your code returns the alphabet b a. Since the letter d is not present in it, the driver code raises an error when it tries to check your solution. In my opinion it should simply output 0 in this situation.
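To make the fix concrete, here is a minimal standalone sketch of the same idea, where the DFS is started from each of the first K letters instead of only from the keys of the graph. The function name find_order and its signature are my own (not the judge's template), and the sketch assumes the input dictionary is consistent (no cycle) and that the alphabet is the first K lowercase letters:

```python
from collections import defaultdict

def find_order(words, k):
    """Return one valid alien alphabet over the first k lowercase letters.

    Assumes the word list is consistently sorted (the graph has no cycle).
    """
    graph = defaultdict(list)
    # The first differing character of two adjacent words gives one edge.
    for w1, w2 in zip(words, words[1:]):
        for c1, c2 in zip(w1, w2):
            if c1 != c2:
                graph[c1].append(c2)
                break

    visited = set()
    stack = []

    def dfs(u):
        visited.add(u)
        for v in graph[u]:
            if v not in visited:
                dfs(v)
        stack.append(u)

    # Start from every letter of the alphabet, not only from the graph keys,
    # so that unconstrained letters also end up in the result.
    for i in range(k):
        c = chr(ord('a') + i)
        if c not in visited:
            dfs(c)
    return "".join(reversed(stack))

print(find_order(["baa", "abcd", "abca", "cab", "cad"], 4))  # bdac
print(find_order(["baa", "abd"], 4))                         # dcba
```

In the second call the letters c and d are unconstrained, so any position is valid for them; the only thing that matters is that b comes before a.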
I have this code
for letters in itertools.product(charset, repeat=47):
    string = "".join(letters)
    print(string)
and the output from that is
aaaaaaaaaaaa
aaaaaaaaaaab
aaaaaaaaaaac
but I'm wondering how I can make it not generate the same three characters in a row, so that the output is
dddcccbbbaaa
dddcccbbbaab
dddcccbbbaac
and so on without using something like this
for letters in itertools.product(charset, repeat=47):
    string = "".join(letters)
    for i in range(1,len(string)-1):
        if string[i] is not string[i+1] is not string[i-1]:
            print(string)
        else:
            pass
Here's a slightly modified version of your code:
import itertools

def version1(charset, N):
    result = []
    for letters in itertools.product(charset, repeat=N):
        string = "".join(letters)
        for i in range(0, N-2):
            if string[i] == string[i+1] == string[i+2]:
                break
        else: # did not find any ZZZ sequence
            result.append(string)
    return result
>>> charset = "abc"
>>> N = 5
>>> version1(charset, N)
['aabaa', 'aabab', 'aabac', 'aabba', 'aabbc', 'aabca', 'aabcb', 'aabcc', 'aacaa', 'aacab', 'aacac', 'aacba', 'aacbb', 'aacbc', 'aacca', 'aaccb', 'abaab', 'abaac', 'ababa', 'ababb', 'ababc', 'abaca', 'abacb', 'abacc', 'abbaa', 'abbab', 'abbac', 'abbca', 'abbcb', 'abbcc', 'abcaa', 'abcab', 'abcac', 'abcba', 'abcbb', 'abcbc', 'abcca', 'abccb', 'acaab', 'acaac', 'acaba', 'acabb', 'acabc', 'acaca', 'acacb', 'acacc', 'acbaa', 'acbab', 'acbac', 'acbba', 'acbbc', 'acbca', 'acbcb', 'acbcc', 'accaa', 'accab', 'accac', 'accba', 'accbb', 'accbc', 'baaba', 'baabb', 'baabc', 'baaca', 'baacb', 'baacc', 'babaa', 'babab', 'babac', 'babba', 'babbc', 'babca', 'babcb', 'babcc', 'bacaa', 'bacab', 'bacac', 'bacba', 'bacbb', 'bacbc', 'bacca', 'baccb', 'bbaab', 'bbaac', 'bbaba', 'bbabb', 'bbabc', 'bbaca', 'bbacb', 'bbacc', 'bbcaa', 'bbcab', 'bbcac', 'bbcba', 'bbcbb', 'bbcbc', 'bbcca', 'bbccb', 'bcaab', 'bcaac', 'bcaba', 'bcabb', 'bcabc', 'bcaca', 'bcacb', 'bcacc', 'bcbaa', 'bcbab', 'bcbac', 'bcbba', 'bcbbc', 'bcbca', 'bcbcb', 'bcbcc', 'bccaa', 'bccab', 'bccac', 'bccba', 'bccbb', 'bccbc', 'caaba', 'caabb', 'caabc', 'caaca', 'caacb', 'caacc', 'cabaa', 'cabab', 'cabac', 'cabba', 'cabbc', 'cabca', 'cabcb', 'cabcc', 'cacaa', 'cacab', 'cacac', 'cacba', 'cacbb', 'cacbc', 'cacca', 'caccb', 'cbaab', 'cbaac', 'cbaba', 'cbabb', 'cbabc', 'cbaca', 'cbacb', 'cbacc', 'cbbaa', 'cbbab', 'cbbac', 'cbbca', 'cbbcb', 'cbbcc', 'cbcaa', 'cbcab', 'cbcac', 'cbcba', 'cbcbb', 'cbcbc', 'cbcca', 'cbccb', 'ccaab', 'ccaac', 'ccaba', 'ccabb', 'ccabc', 'ccaca', 'ccacb', 'ccacc', 'ccbaa', 'ccbab', 'ccbac', 'ccbba', 'ccbbc', 'ccbca', 'ccbcb', 'ccbcc']
Your algorithm is not optimal. Look at the first string:
aaaaa
You know that you need len(charset) - 1 iterations (aaaab, aaaac) to arrive at:
aaaba
And then again len(charset) - 1 iterations to arrive at:
aaaca
But you can skip all those iterations, because of the aaa beginning.
Actually, when you find the sequence aaa, you can skip len(charset)^K - 1 tuples, where
K is the number of remaining chars. This does not change the big O complexity,
but will reduce the computation time for long sequences, depending on the
size of the charset and the number of characters in the strings.
Intuitively, if the charset has few chars, you will save a lot of computation.
First, you need to find the first letter after a ZZZ sequence:
def first_after_ZZZ(string):
    for i in range(0, len(string)-2):
        if string[i] == string[i+1] == string[i+2]:
            return i+3
    return -1
>>> first_after_ZZZ("ababa")
-1
>>> first_after_ZZZ("aaaba")
3
>>> first_after_ZZZ("aaabaaabb")
3
We use this function in the previous code (intermediate step):
def version2(charset, N):
    result = []
    for letters in itertools.product(charset, repeat=N):
        string = "".join(letters)
        f = first_after_ZZZ(string)
        if f == -1:
            result.append(string)
    return result
>>> version2(charset, N) == version1(charset, N)
True
Now, we can skip some elements:
def version3(charset, N):
    result = []
    it = itertools.product(charset, repeat=N)
    for letters in it:
        string = "".join(letters)
        f = first_after_ZZZ(string)
        if f == -1:
            result.append(string)
        elif f < N:
            K = N - f # K >= 1
            to_skip = len(charset)**K-1
            next(itertools.islice(it, to_skip, to_skip), None) # this will skip to_skip tuples
    return result
>>> version3(charset, N) == version1(charset, N)
True
Benchmark:
>>> from timeit import timeit as ti
>>> ti(lambda: version1(charset, 15), number=1)
13.14919564199954
>>> ti(lambda: version3(charset, 15), number=1)
6.94705574299951
The speedup is noticeable here because the charset is small, but it may be insignificant with a whole alphabet.
Of course, if you write your own implementation of product, you can skip the
tuples without generating them, and this could be faster.
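As a rough sketch of that last idea (not benchmarked here), a recursive generator can build the strings one character at a time and never extend a prefix that would end in three equal characters, so the skipped tuples are never generated at all. The name no_triples is my own, and the sketch assumes the characters of charset are distinct:

```python
def no_triples(charset, n):
    """Yield, in product order, every string of length n over charset
    that does not contain three equal characters in a row."""
    def extend(prefix):
        if len(prefix) == n:
            yield prefix
            return
        for c in charset:
            # Prune: adding c would create a ZZZ sequence, so the whole
            # subtree below this prefix is skipped without being generated.
            if len(prefix) >= 2 and prefix[-1] == prefix[-2] == c:
                continue
            yield from extend(prefix + c)
    yield from extend("")

result = list(no_triples("abc", 5))
print(len(result), result[:3])
```

Since it is a generator, you can print the strings as they are produced instead of building the whole list, which matters for long strings. The recursion depth is n, which is fine for n = 47.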
I have a Python dictionary as follows. The dictionary might also have two comma-separated values for 'Var' (i.e. Dep1,Dep2), each with its respective SubVal entry (ABC1||A1B1||B1C1, ABC2||A2B2||B2C2).
I'm trying to extract the value A1B1 (or A1B1 and B1C1 if there are two Var values) by matching the MainValue 'ABC1' against the 'ABC1' prefix in SubVal.
ld = { 'id' : 0,
       'Var': 'Dep1',
       'SubVal': 'ABC1||A1B1,ABC2||A2B2,ABC3||A3B3',
       'MainValue': 'ABC1'}
So far I have tried splitting SubVal into a list (splitting by comma), converting each pair (|| separated) into another dictionary, and then looking up the match.
Can anyone suggest a better approach in terms of performance to do this?
Let:
>>> ld = { 'id' : 0, 'Var': 'Dep1', 'SubVal': 'ABC1||A1B1,ABC2||A2B2,ABC3||A3B3', 'MainValue': 'ABC1'}
Your split + dict solution is roughly (note the maxsplit parameter to handle ABC1||A1B1||B1C1 cases):
>>> def parse(d):
...     sub_val = dict(t.split('||', maxsplit=1) for t in d['SubVal'].split(","))
...     return sub_val[d['MainValue']]
>>> parse(ld)
'A1B1'
A benchmark gives:
>>> import timeit
>>> timeit.timeit(lambda: parse(ld))
1.002971081999931
You build a dict for a one shot lookup: that's a bit overkill. You can perform a direct lookup for the MainValue:
>>> def parse_iter(d):
...     mv = d['MainValue']
...     g = (t.split('||', maxsplit=1) for t in d['SubVal'].split(","))
...     return next(v for k, v in g if k == mv)
>>> parse_iter(ld)
'A1B1'
It is a little faster:
>>> timeit.timeit(lambda: parse_iter(ld))
0.8656512869993094
A faster approach is to look for the MainValue in the ld['SubVal'] string and extract the right SubVal. (I assume the MainValue can't be a SubVal or a substring of a SubVal.)
With a regex:
>>> import re
>>> def parse_re(d):
...     pattern = re.escape(d['MainValue']) + r"\|\|([^,]+)"
...     return re.search(pattern, d['SubVal']).group(1)
>>> parse_re(ld)
'A1B1'
This is around 25 % faster than the first version on the example:
>>> timeit.timeit(lambda: parse_re(ld))
0.7367669239997667
But why not perform the search manually?
>>> def parse_search(d):
...     s = d['SubVal']
...     mv = d['MainValue']
...     i = s.index(mv) + len(mv) + 2 # after the ||
...     j = s.find(",", i) # -1 if the match is the last entry
...     return s[i:] if j == -1 else s[i:j]
>>> parse_search(ld)
'A1B1'
This version is around 60% faster than the first version (on the given example):
>>> timeit.timeit(lambda: parse_search(ld))
0.3840863969999191
(If the MainValue can be a SubVal, you can check if there is a comma before the MainValue or SubVal starts with MainValue.)
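A possible sketch of that check, not benchmarked: the key is only accepted at the very start of SubVal or right after a comma. The name parse_search_safe and the choice to raise KeyError when nothing matches are my own assumptions, not something from the question:

```python
def parse_search_safe(d):
    s = d['SubVal']
    key = d['MainValue'] + '||'
    if s.startswith(key):
        # The key is the first entry of SubVal.
        i = len(key)
    else:
        # Otherwise the key must directly follow a comma.
        pos = s.find(',' + key)
        if pos == -1:
            raise KeyError(d['MainValue'])
        i = pos + 1 + len(key)
    j = s.find(',', i)  # -1 if the match is the last entry
    return s[i:] if j == -1 else s[i:j]

tricky = {'SubVal': 'XABC1||no,ABC1||A1B1,ABC2||A2B2', 'MainValue': 'ABC1'}
print(parse_search_safe(tricky))  # A1B1
```

Here a plain index search would stop at the ABC1 inside XABC1 and return the wrong value.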
I am working on going from o-string binary to Unicode. Part of this process requires converting raised position to binary, and I can't seem to get it done. The doctest explains what needs to be performed.
I have provided my code below but it is nowhere close to getting the correct answer.
def raisedpos_to_binary(s):
    ''' (str) -> str
    Convert a string representing a braille character in raised-position
    representation into the binary representation.
    TODO: For students to complete.
    >>> raisedpos_to_binary('')
    '00000000'
    >>> raisedpos_to_binary('142536')
    '11111100'
    >>> raisedpos_to_binary('14253678')
    '11111111'
    >>> raisedpos_to_binary('123')
    '11100000'
    >>> raisedpos_to_binary('125')
    '11001000'
    '''
    res = ''
    lowest_value = '00000000'
    for i, c in enumerate(s):
        if c == i:
            lowest_value = lowest_value.replace('0', '1')
    return lowest_value
Looks like you can create a set (converted to integers) of the single digits and then produce '1' or '0' iterating over a range of 1..8, eg:
def raisedpos_to_binary(digits):
    raised = {int(digit) for digit in digits}
    return ''.join('1' if n in raised else '0' for n in range(1, 9))
Tests:
for test in ['', '142536', '14253678', '123', '125']:
    print(test, '->', raisedpos_to_binary(test))
Gives you:
-> 00000000
142536 -> 11111100
14253678 -> 11111111
123 -> 11100000
125 -> 11001000
So in full, your module should contain:
def raisedpos_to_binary(digits):
    """
    >>> raisedpos_to_binary('')
    '00000000'
    >>> raisedpos_to_binary('142536')
    '11111100'
    >>> raisedpos_to_binary('14253678')
    '11111111'
    >>> raisedpos_to_binary('123')
    '11100000'
    >>> raisedpos_to_binary('125')
    '11001000'
    """
    raised = {int(digit) for digit in digits}
    return ''.join('1' if n in raised else '0' for n in range(1, 9))

if __name__ == '__main__':
    import doctest
    doctest.testmod()
Then run your script using:
python your_script.py -v
I need a program that gets the length of the common prefix of two words, e.g. the common prefix of "global" and "glossary" is "glo" (length 3).
a= input("Enter string: ")
b= input("Enter string: ")
count=0
c=a.startswith(b)
while count<=c:
    if c:
        count=count+1
print(count)
What I'm not sure about is how to get the length of the common prefix.
You can do:
def pre(s1, s2):
    # Compare the two strings position by position.
    for i, (c1, c2) in enumerate(zip(s1, s2)):
        if c1 != c2:
            return i
    # No mismatch found: the shorter string is the common prefix.
    return min(len(s1), len(s2))
Testing:
>>> pre("global", "glossary")
3
>>> pre("global", "global")
6
>>> pre("global", "")
0
You can "cheat" (as in - it's not the way it's meant to be used, but oh well) and use os.path.commonprefix which does a char by char comparison on all elements passed, eg:
from os.path import commonprefix
a = 'global'
b = 'glossary'
length = len(commonprefix([a, b]))
# 3