I have a list of strings (from documents in CouchDB).
I want to find the minimum prefix length so that all shortened strings (taking the first LEN characters) are unique.
For example:
aabb
aabc
abcd
should give: LEN is four (at length 3, "aabb" and "aabc" both shorten to "aab").
Is it possible to write this as a map/reduce function?
Doing it the brute force way:
MAP: For each input record "ABCDE", create one record per prefix, with the keys
- "A"
- "AB"
- "ABC"
- "ABCD"
- "ABCDE"
REDUCE:
If there is exactly one value in the iterator output: emit "length(key)" "true"
If there is more than one value in the iterator output: emit "length(key)" "false"
MAP: Identity mapper
REDUCE: Output "true" if all input values are true; else output "false" (or nothing).
That should result in a "true" for every length at which all prefixes are unique.
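Independent of CouchDB, the check itself can be sketched in a few lines of plain Python (the helper name min_unique_prefix_len is mine, just to illustrate the brute force):

```python
def min_unique_prefix_len(strings):
    # Smallest LEN such that taking the first LEN characters
    # of every string yields all-distinct values.
    for n in range(1, max(len(s) for s in strings) + 1):
        if len({s[:n] for s in strings}) == len(strings):
            return n
    return None  # exact duplicates: no prefix length can separate them

print(min_unique_prefix_len(["aabb", "aabc", "abcd"]))  # 4
```

Note that strings shorter than n are simply kept whole by the slice, which matches the "take the first LEN characters" rule.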
The problem is to find the length of the shortest unique substring, and the number of unique substrings of that length, occurring in the string. For example, "aatcc" has "t" as its shortest unique substring, of length 1, so the output is 1,1. Another example: for "aacc" the output is 2,3, since the unique substrings are "aa", "ac" and "cc".
I tried to solve it, but could only come up with a brute-force solution that loops over all possible substrings. It exceeded the time limit.
I googled it and found some references to suffix array but not quite clear about it.
So what is the optimal solution for this problem?
EDIT: I forgot to mention a key requirement of the solution: it must NOT use any library functions other than the input and output functions for reading from standard input and writing to standard output.
EDIT: I have found another solution using a trie data structure.
Pseudocode:
for i from 1 to length(string) do
    count = 0
    for j from 0 to length(string)-i do
        1. create a substring of length i starting at the jth character
        2. if checkIfSeen(substring) then count-- else count++
    close inner for loop
    if count >= 1 then break
close outer for loop
print i (the length of the unique substring), count (no. of such substrings)
checkIfSeen(substring) uses a trie data structure, so one check runs in O(l), where l is the length of the substring being looked up.
The time complexity of this algorithm would then be O(n^2 * l); if the average substring length is n/2, that is O(n^3). Please point out the mistakes if there are any, and also ways to improve this running time if possible.
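For what it's worth, here is a minimal Python sketch of the suffix-trie idea (my own illustration, not the poster's code; the name shortest_unique and the "#" count sentinel are assumptions, and the sentinel requires that "#" never appears in the input). Insert every suffix into a trie, counting how many substrings pass through each node; the shallowest level containing count-1 nodes gives both the length and the number of such substrings.

```python
def shortest_unique(string):
    # Build a trie of all suffixes; "#" at each node counts how many
    # substrings (suffix prefixes) pass through it.
    root = {}
    for i in range(len(string)):
        node = root
        for ch in string[i:]:
            node = node.setdefault(ch, {"#": 0})
            node["#"] += 1
    # Scan the trie level by level: the first depth that contains
    # count-1 nodes yields (length, number of unique substrings).
    level, depth = [root], 0
    while level:
        depth += 1
        hits = 0
        nxt = []
        for node in level:
            for ch, child in node.items():
                if ch == "#":
                    continue
                if child["#"] == 1:
                    hits += 1
                nxt.append(child)
        if hits:
            return (depth, hits)
        level = nxt
    return None  # empty input

print(shortest_unique("aatcc"))  # (1, 1)
print(shortest_unique("aacc"))   # (2, 3)
```

Building the trie of all suffixes is O(n^2) in the worst case, but the level-by-level scan stops at the first depth with a unique substring, so the counting phase never looks deeper than the answer.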
Sorry, but keep in mind that my answer is based on a program I wrote in Python, though it can be applied to any programming language :)
Now, I believe a brute-force approach is indeed what you need for this problem. But what we can do to shorten the time is:
1: Start the brute force from the smallest substring length, which is 1.
2: After looping through the string with substring length 1 (the data will look something like {"a": 2, "t": 1, "c": 2} for "aatcc"), check if any substring appeared only once. If it did, count the occurrences by looping through the dictionary (in the example you gave, "t" appeared only once, so the occurrence count is 1).
3: After the occurrences are counted, break the loop so that it does not have to waste time counting the rest of the bigger substrings.
4: In step 2, if no unique substring was found, reset the dictionary and try a bigger substring length (the data can be something like {"aa": 1, "ac": 1, "cc": 1} for "aacc"). Eventually a unique substring WILL be found no matter what (for example, in the string "aaaaa" the unique substring is "aaaaa", with the data {"aaaaa": 1}).
Here is the implementation in Python:
def countString(string):
    for i in range(1, len(string) + 1):  # start the brute force from substring length 1
        dictionary = {}
        for j in range(len(string) - i + 1):  # check every combination
            # count the substring occurrences
            try:
                dictionary[string[j:j + i]] += 1
            except KeyError:
                dictionary[string[j:j + i]] = 1
        isUnique = False  # loop stops if isUnique is True
        occurrence = 0
        for key in dictionary:  # iterate through the dictionary
            if dictionary[key] == 1:  # check if the substring is unique
                # if found, get ready to escape from the loop and count it
                isUnique = True
                occurrence += 1
        if isUnique:
            return (i, occurrence)

print(countString("aacc"))   # prints (2, 3)
print(countString("aatcc"))  # prints (1, 1)
I am pretty sure this design is fairly fast, but there should always be a better way. Anyway, I hope this helped :)
I have a map of elements:
elemA1: value
elemB1: value
elemC1: value
...
elemA99: value
elemB99: value
elemC99: value
...
elemA7823: value
elemB7823: value
elemD7823: value
I want to use groupBy to group each set of elements by number.
The number will always be at the end of the key, but my problem is that the number can be any number of characters.
Just have the groupBy closure extract the part of the key you want to group by. Here I'm using the regular expression /\d+$/ to get digits at the end of the key.
def map = [
elemA1: "1",
elemB1: "B1",
elemA99: "A99",
elemB99: "B99"
]
map.groupBy { ( it.key =~ /\d+$/ )[0] } // [1:[elemA1:1, elemB1:B1], 99:[elemA99:A99, elemB99:B99]]
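For comparison, a rough Python equivalent of this grouping (illustrative only, reusing the keys from the Groovy snippet) extracts the trailing digits with the same regular expression and buckets entries by them:

```python
import re
from collections import defaultdict

data = {"elemA1": "1", "elemB1": "B1", "elemA99": "A99", "elemB99": "B99"}

groups = defaultdict(dict)
for key, value in data.items():
    num = re.search(r"\d+$", key).group()  # digits at the end of the key
    groups[num][key] = value

print(dict(groups))
# {'1': {'elemA1': '1', 'elemB1': 'B1'}, '99': {'elemA99': 'A99', 'elemB99': 'B99'}}
```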
I have a list of strings =
['after','second','shot','take','note','of','the','temp']
I want to strip all strings after the appearance of 'note'.
It should return
['after','second','shot','take']
There are also lists that do not contain the flag word 'note'.
So in case of a list of strings =
['after','second','shot','take','of','the','temp']
it should return the list as it is.
How to do that in a fast way? I have to repeat the same thing with many lists with unequal length.
tokens = tokens[:tokens.index('note')] if 'note' in tokens else tokens
There is no need for an iteration when you can slice the list:
strings[:strings.index('note')]
where strings is your input list. The end of a slice is exclusive, so stopping at the index of 'note' drops 'note' and everything after it, matching your expected output.
In case the flag word ('note') is missing:
try:
    final_lst = strings[:strings.index('note')]
except ValueError:
    final_lst = strings
if you want to make sure the flag word is present:
if 'note' in lst:
    lst = lst[:lst.index('note')]
Pretty much the same as #Austin's answer above.
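Since the same operation has to be repeated over many lists of unequal length, it may be worth wrapping the try/except idea in a small helper (the name truncate_at is my own, purely illustrative):

```python
def truncate_at(tokens, flag="note"):
    # Return the tokens before the first occurrence of flag;
    # lists without the flag word come back unchanged.
    try:
        return tokens[:tokens.index(flag)]
    except ValueError:
        return tokens

print(truncate_at(['after', 'second', 'shot', 'take', 'note', 'of', 'the', 'temp']))
# ['after', 'second', 'shot', 'take']
```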
I have a vector with > 30000 words. I want to create a subset of this vector which contains only those words whose length is greater than 5. What is the best way to achieve this?
Basically, df contains multiple sentences.
So,
wordlist = df2;
wordlist = [strip(wordlist[i]) for i in 1:length(wordlist)];
Now, I need to subset wordlist so that it contains only those words whose length is greater than 5.
view(A, findall(x->length(x)>5, A)) # => creates a view (most efficient way to make a subset)
EDIT: getindex() returns a copy of the desired elements
getindex(A, findall(x->length(x)>5, A)) # => makes a copy
You can use filter
wordlist = filter(x->islenatleast(x,6),wordlist)
and combine it with a fast condition such as islenatleast defined as:
function islenatleast(s, l)
    if sizeof(s) < l return false end
    # assumes each char takes at least a byte
    l == 0 && return true
    p = 1
    i = 0
    while i < l
        if p > sizeof(s) return false end
        p = nextind(s, p)
        i += 1
    end
    return true
end
According to my timings, islenatleast is faster than computing the whole length (under some conditions). Additionally, this shows the strength of Julia: a user-defined primitive can be competitive with the core function length.
But doing:
wordlist = filter(x->length(x)>5,wordlist)
will also do.
Groovy split seems to be ignoring empty fields.
Here is the code:
def line = "abc,abc,,,"
println line.split(/,/)
prints only:
[abc, abc]
It seems to ignore empty fields. How do I retrieve empty fields using split?
First of all, the method split(regex) is not provided by Groovy; it is provided by Java.
Second, you can achieve what you need by using the overloaded split(regex, int limit), as below:
def line = "abc,abc,,,"
println line.split(/,/, -1) //prints [abc, abc, , , ]
println line.split(/,/, -1).size() //prints 5
Note:
The result is a String[], not a List, so asserting it directly against a list literal will fail, but you can still use it like a normal list:
line.split(/,/, -1).each{println "Hello $it"}
I would rather use limit 0 or the overloaded split to discard unwanted empty strings.
Explanation of using -1 as the limit (stress on the statements below from the javadoc):
The limit parameter controls the number of times the pattern is
applied and therefore affects the length of the resulting array. If
the limit n is greater than zero then the pattern will be applied at
most n - 1 times, the array's length will be no greater than n, and
the array's last entry will contain all input beyond the last matched
delimiter. If n is non-positive then the pattern will be applied as
many times as possible and the array can have any length. If n is zero
then the pattern will be applied as many times as possible, the array
can have any length, and trailing empty strings will be discarded.
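As a side note (not part of the Groovy answer above), Python's str.split keeps trailing empty fields by default, so it behaves like split(regex, -1) here with no extra argument needed:

```python
line = "abc,abc,,,"

# Python keeps trailing empty strings, matching Java's split(regex, -1)
fields = line.split(",")
print(fields)       # ['abc', 'abc', '', '', '']
print(len(fields))  # 5
```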
Interesting. The split method works as expected provided there's a non-empty element at the end.
def list = 'abc,abc,,,abc'.split(/,/)
println list // prints [abc, abc, , , abc]
assert list.size() == 5
assert list[0] == 'abc'
assert list[1] == 'abc'
assert list[2] == ''
assert list[3] == ''
assert list[4] == 'abc'
Maybe you could just append a bogus character to the end of the string and subtract it from the result:
def list = 'abc,abc,,,X'.split(/,/) - 'X'
println list // prints [abc, abc, , ]
Note, though, that the bogus character replaces the final empty field, so this yields 4 fields instead of 5; split(/,/, -1) avoids that.