Time complexity of checking whether a set is contained in another set - python-3.x

I am trying to implement the example of finding the shortest substring of a given string s containing all the characters of the pattern char. My code is working fine, but my goal is to attain a time complexity of O(N), where N is the length of s.
Here is my code:
from collections import defaultdict

def shortest_substring(s, char):
    # smallest substring containing char: use a sliding window
    start = 0
    d = defaultdict(int)
    minimum = float('inf')
    s1 = None
    for i in range(len(s)):
        d[s[i]] += 1
        # check whether all the characters from char have been visited
        while set(char).issubset(set(j for j in d if d[j] > 0)):
            # if yes, can we make it shorter?
            length = i - start + 1
            minimum = min(length, minimum)
            if length == minimum:
                s1 = s[start:i + 1]
            d[s[start]] -= 1
            start += 1
    return (minimum, s1)
My question is about this line:
while set(char).issubset(set([j for j in d if d[j]>0]))
Each time, I am checking whether all the characters of char are present in my dictionary, using issubset. How can I find the time complexity of this step? Is it O(1), which is true for checking whether an element exists in a set? Otherwise, the time complexity will be much greater than O(N). Help is appreciated.

Per the docs, s.issubset(t) is equivalent to s <= t, meaning that the operation tests whether every element in s is in t. Each of those membership tests on a set is O(1) on average, so:
Best case scenario:
the first element of s checked is not in t -> O(1)
Worst case scenario:
every element of s must be checked -> O(len(s))
That is for issubset. For the list comprehension:
j for j in d is O(n) for iterating over each key
if d[j]>0 is O(1) per key, so the whole comprehension (and the set built from it) is O(n)
Since that set is rebuilt on every iteration of the while loop, this check is what pushes the overall algorithm past O(N).
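A common way to reach O(N) is to avoid rebuilding the set entirely: keep a running count of how many required characters are still missing from the window, so the check is O(1) per step. Here is a sketch of that idea (the name shortest_substring and the (length, substring) return shape mirror the question's code, but this is an illustrative rewrite, not the poster's code):

```python
from collections import Counter

def shortest_substring(s, char):
    need = Counter(char)        # occurrences of each character still missing
    missing = len(char)         # total required characters still missing
    start = 0
    best = (float('inf'), None)
    for i, c in enumerate(s):
        if need[c] > 0:
            missing -= 1
        need[c] -= 1
        while missing == 0:     # window s[start:i+1] covers all of char
            if i - start + 1 < best[0]:
                best = (i - start + 1, s[start:i + 1])
            need[s[start]] += 1
            if need[s[start]] > 0:   # we just dropped a required character
                missing += 1
            start += 1
    return best
```

Each index enters and leaves the window at most once, so the whole scan is O(N).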

Related

Calculation of time and space complexity

I am working on this Leetcode problem - "Given a string containing digits from 2-9 inclusive, return all possible letter combinations that the number could represent. Return the answer in any order.
A mapping of digit to letters (just like on the telephone buttons) is given below.
Note that 1 does not map to any letters."
This is a recursive solution to the problem that I was able to understand, but I am not able to figure out the time and space complexity of the solution.
class Solution:
    def letterCombinations(self, digits):
        if not len(digits):
            return []
        res = []
        my_dict = {
            '2': 'abc',
            '3': 'def',
            '4': 'ghi',
            '5': 'jkl',
            '6': 'mno',
            '7': 'pqrs',
            '8': 'tuv',
            '9': 'wxyz'
        }
        if len(digits) == 1:
            return list(my_dict[digits[0]])
        my_list = my_dict[digits[0]]  # string - abc
        for i in range(len(my_list)):  # i = 0, 1, 2
            for item in self.letterCombinations(digits[1:]):
                print(item)
                res.append(my_list[i] + item)
        return res
Any help or explanation regarding calculating time and space complexity for this solution would be helpful. Thank you.
With certain combinatorial problems, the time and space complexity can become dominated by the size of the output. Looking at the loops and function calls, the work being done in the function is one string concatenation and one append for each element of the output. There's also up to 4 repeated recursive calls to self.letterCombinations(digits[1:]): assuming these aren't cached, we need to add in the extra repeated work being done there.
We can write a formula for the number of operations needed to solve the problem when len(digits) == n. If T(n) is the number of steps, and A(n) is the length of the answer array, we get T(n) = 4*T(n-1) + n*A(n) + O(1). We get an extra multiplicative factor of n on A(n) because string concatenation is linear time; an implementation with lists and str.join() would avoid that.
Since A(n) is upper-bounded by 4^n, and T(1) is a constant, this gives T(n) = O(n * (4^n)); the space complexity here is also O(n * (4^n)), given 4^n strings of length n.
One possibly confusing part of complexity analysis is that it's usually a worst-case analysis unless specified otherwise. That's why we use 4 instead of 3 here: if any input could give 4^n results, we use that figure, even though many digit inputs would give closer to 3^n results.
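As noted above, the extra factor from string concatenation can be avoided with lists and str.join(). A compact iterative sketch using itertools.product (an illustration of the same output, not the poster's code) pays the O(n) per-string cost exactly once per result:

```python
from itertools import product

def letter_combinations(digits):
    """Each of the up-to-4^n results is built with one ''.join of n letters."""
    if not digits:
        return []
    mapping = {'2': 'abc', '3': 'def', '4': 'ghi', '5': 'jkl',
               '6': 'mno', '7': 'pqrs', '8': 'tuv', '9': 'wxyz'}
    return [''.join(p) for p in product(*(mapping[d] for d in digits))]
```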

time complexity for check if string has only unique chars

This is an algorithm to determine if a string has all unique characters. What is the time complexity?
def unique(s):
    d = []
    for c in s:
        if c not in d:
            d.append(c)
        else:
            return False
    return True
It looks like there is only one for loop here, so it should be O(n); however, this line
if c not in d:
does this line also cost O(n) time? If so, is the time complexity of this algorithm O(n^2)?
Your intuition is correct: this algorithm is O(n^2). The documentation for list specifies that in is an O(n) operation. In the worst-case scenario, when the target element is not present in the list, every element will need to be visited.
Using a set instead of a list would improve the time complexity to O(n), because set lookups are O(1).
An easy way to take advantage of sets to get an O(n) uniqueness test is to simply convert the string to a set and see if its length is still the same:
def unique(s):
    return len(s) == len(set(s))
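If you also want the early exit of the original loop on long strings, a set keeps the O(n) bound while stopping at the first repeat. A sketch (named unique_early_exit here to distinguish it from the versions above):

```python
def unique_early_exit(s):
    seen = set()
    for c in s:
        if c in seen:   # O(1) average, vs O(n) for a list
            return False
        seen.add(c)
    return True
```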

How to efficiently find identical substrings of a specified length in a collection of strings?

I have a collection S, typically containing 10-50 long strings. For illustrative purposes, suppose the length of each string ranges between 1000 and 10000 characters.
I would like to find strings of specified length k (typically in the range of 5 to 20) that are substrings of every string in S. This can obviously be done using a naive approach - enumerating every k-length substring in S[0] and checking if they exist in every other element of S.
Are there more efficient ways of approaching the problem? As far as I can tell, there are some similarities between this and the longest common subsequence problem, but my understanding of LCS is limited and I'm not sure how it could be adapted to the situation where we bound the desired common substring length to k, or if subsequence techniques can be applied to finding substrings.
Here's one fairly simple algorithm, which should be reasonably fast.
Using a rolling hash as in the Rabin-Karp string search algorithm, construct a hash table H0 of all the |S0|-k+1 length k substrings of S0. That's roughly O(|S0|) since each hash is computed in O(1) from the previous hash, but it will take longer if there are collisions or duplicate substrings. Using a better hash will help you with collisions but if there are a lot of k-length duplicate substrings in S0 then you could end up using O(k|S0|).
Now use the same rolling hash on S1. This time, look each substring up in H0 and if you find it, remove it from H0 and insert it into a new table H1. Again, this should be around O(|S1|) unless you have some pathological case, like both S0 and S1 being just long repetitions of the same character. (It's also going to be suboptimal if S0 and S1 are the same string, or have lots of overlapping pieces.)
Repeat step 2 for each Si, each time creating a new hash table. (At the end of each iteration of step 2, you can delete the hash table from the previous step.)
At the end, the last hash table will contain all the common k-length substrings.
The total run time should be about O(Σ|Si|) but in the worst case it could be O(kΣ|Si|). Even so, with the problem size as described, it should run in acceptable time.
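The set-intersection core of this algorithm can be sketched in a few lines of Python. Note that this uses plain string slices instead of an explicit rolling hash, so each substring costs O(k) to extract, giving O(kΣ|Si|) rather than the O(Σ|Si|) a true rolling hash targets:

```python
def common_k_substrings(strings, k):
    # analogue of H0: all k-length substrings of the first string
    common = {strings[0][i:i + k] for i in range(len(strings[0]) - k + 1)}
    for s in strings[1:]:
        # analogue of building H1 from H0: keep only substrings seen again
        common &= {s[i:i + k] for i in range(len(s) - k + 1)}
        if not common:
            break   # no common substring can survive
    return common
```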
Some thoughts (N is number of strings, M is average length, K is needed substring size):
Approach 1:
Walk through all strings, computing rolling hash for k-length strings and storing these hashes in the map (store tuple {key: hash; string_num; position})
time O(NxM), space O(NxM)
Extract groups with equal hash, then check step by step:
1) that the size of the group is >= the number of strings
2) that all strings are represented in the group
3) a thorough check that the actual substrings are equal (hashes of distinct substrings can coincide)
Approach 2:
Build suffix array for every string
time O(N × M log M), space O(N × M)
Find intersection of suffix arrays for the first string pair, using merge-like approach (suffixes are sorted), considering only part of suffixes of length k, then continue with the next string and so on
I would treat each long string as a collection of overlapped short strings, so ABCDEFGHI becomes ABCDE, BCDEF, CDEFG, DEFGH, EFGHI. You can represent each short string as a pair of indexes, one specifying the long string and one the starting offset in that string (if this strikes you as naive, skip to the end).
I would then sort each collection into ascending order.
Now you can find the short strings common to the first two collections by merging the sorted lists of indexes, keeping only those from the first collection which are also present in the second. Check the survivors against the third collection, and so on; the survivors at the end correspond to the short strings present in all long strings.
(Alternatively you could maintain a set of pointers into each sorted list and repeatedly look to see if every pointer points at short strings with the same text, then advancing the pointer which points at the smallest short string).
Time is O(n log n) for the initial sort, which dominates. In the worst case - e.g. when every string is AAAAAAAA..AA - there is a factor of k on top of this, because all string compares check all characters and take time k. Hopefully, there is a clever way round this with https://en.wikipedia.org/wiki/Suffix_array which allows you to sort in time O(n) rather than O(nk log n) and the https://en.wikipedia.org/wiki/LCP_array, which should allow you to skip some characters when comparing substrings from different suffix arrays.
Thinking about this again, I think the usual suffix array trick of concatenating all of the strings in question, separated by a character not found in any of them, works here. If you look at the LCP of the resulting suffix array you can split it into sections, splitting at points where the difference between suffixes occurs less than k characters in. Now each offset in any particular section starts with the same k characters. Now look at the offsets in each section and check to see if there is at least one offset from every possible starting string. If so, this k-character sequence occurs in all starting strings, but not otherwise. (There are suffix array constructions which work with arbitrarily large alphabets so you can always expand your alphabet to produce a character not in any string, if necessary.)
I would try a simple method using HashSets:
Build a HashSet for each long string in S with all its k-strings.
Sort the sets by number of elements.
Scan the first set.
Look up each k-string from it in the other sets.
The first step takes care of repetitions in each long string.
The second ensures the minimum number of comparisons.
let getHashSet k (lstr: string) =
    let strs = System.Collections.Generic.HashSet<string>()
    for i in 0 .. lstr.Length - k do
        strs.Add lstr.[i .. i + k - 1] |> ignore
    strs

let getCommons k lstrs =
    let strss = lstrs |> Seq.map (getHashSet k) |> Seq.sortBy (fun strs -> strs.Count)
    match strss |> Seq.tryHead with
    | None -> [||]
    | Some h ->
        let rest = Seq.tail strss |> Seq.toArray
        [| for s in h do
               if rest |> Array.forall (fun strs -> strs.Contains s) then yield s |]
Test:
let random = System.Random System.DateTime.Now.Millisecond

let generateString n =
    [| for i in 1 .. n do
           yield random.Next 20 |> (+) 65 |> System.Convert.ToByte |]
    |> System.Text.Encoding.ASCII.GetString

[ for i in 1 .. 3 do yield generateString 10000 ]
|> getCommons 4
|> fun l -> printfn "found %d\n %A" l.Length l
result:
found 40
[|"PPTD"; "KLNN"; "FTSR"; "CNBM"; "SSHG"; "SHGO"; "LEHS"; "BBPD"; "LKQP"; "PFPH";
"AMMS"; "BEPC"; "HIPL"; "PGBJ"; "DDMJ"; "MQNO"; "SOBJ"; "GLAG"; "GBOC"; "NSDI";
"JDDL"; "OOJO"; "NETT"; "TAQN"; "DHME"; "AHDR"; "QHTS"; "TRQO"; "DHPM"; "HIMD";
"NHGH"; "EARK"; "ELNF"; "ADKE"; "DQCC"; "GKJA"; "ASME"; "KFGM"; "AMKE"; "JJLJ"|]
Here it is in fiddle: https://dotnetfiddle.net/ZK8DCT

Fast way to find strings in set of strings containing substring

Task
I have a set S of n = 10,000,000 strings s and need to find the set Sp containing the strings s of S that contain the substring p.
Simple solution
As I'm using C# this is quite a simple task using LINQ:
string[] S = new string[] { "Hello", "world" };
string p = "ll";
IEnumerable<string> S_p = S.Where(s => s.Contains(p));
Problem
If S contains many strings (like the mentioned 10,000,000 strings) this gets horribly slow.
Idea
Build some kind of index to retrieve Sp faster.
Question
What is the best way to index S for this task and do you have any implementation in C#?
Here is one way to do it:
1. Create a string T = S[0] + sep_0 + S[1] + sep_1 + ... + S[n - 1] + sep_n-1 (where sep_i is a unique character that never appears in S[j] for any j; it can actually be an integer number if the set of characters is not big enough).
2. Build a suffix tree for T (it can be done in linear time).
3. For each query string Q, traverse the suffix tree (it takes O(length(Q)) time). Then all possible answers will be located in the leaves of some subtree, so you can just traverse all these leaves. If Q is rather long, the number of leaves in this subtree is likely to be much smaller than n.
4. If Q is really short, the number of leaves in a subtree can be pretty large. That's why you can use another strategy for short query strings: precompute all short substrings of S[0] ... S[n - 1] and for each of them store the set of indices where it occurs. Then you can just print these indices for a given Q. It is difficult to say what 'short' exactly means here, but it can be found out experimentally.
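The short-query precomputation in step 4 is easy to sketch: index every q-gram of every string and record which strings contain it. Here build_index, find_containing, and the cutoff q are all illustrative names, not part of the answer's code:

```python
from collections import defaultdict

def build_index(strings, q):
    index = defaultdict(set)
    for idx, s in enumerate(strings):
        for i in range(len(s) - q + 1):
            index[s[i:i + q]].add(idx)   # which strings contain this q-gram
    return index

def find_containing(strings, index, q, p):
    if len(p) == q:
        return [strings[i] for i in sorted(index[p])] if p in index else []
    # longer queries would go to the suffix tree; linear scan as a fallback
    return [s for s in strings if p in s]
```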

algorithms for fast string approximate matching

Given a source string s and n equal length strings, I need to find a quick algorithm to return those strings that have at most k characters that are different from the source string s at each corresponding position.
What is a fast algorithm to do so?
PS: I should note that this is an academic question. I want to find the most efficient algorithm if possible.
Also, I missed one very important piece of information: the n equal-length strings form a dictionary, against which many source strings s will be queried. There seems to be room for some sort of preprocessing step to make the queries more efficient.
My gut instinct is just to iterate over each of the n strings, maintaining a counter of how many characters differ from s, but I'm not claiming it is the most efficient solution. However, it would be O(n), so unless this is a known performance problem or an academic question, I'd go with that.
Sedgewick in his book "Algorithms" writes that Ternary Search Tree allows "to locate all words within a given Hamming distance of a query word". Article in Dr. Dobb's
Given that the strings are fixed length, you can compute the Hamming distance between two strings to determine the similarity; this is O(n) on the length of the string. So, worst case is that your algorithm is O(nm) for comparing your string against m words.
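For reference, that Hamming-distance filter is only a few lines (a sketch; hamming and within_k are illustrative names):

```python
def hamming(a, b):
    # number of positions at which two equal-length strings differ
    return sum(c1 != c2 for c1, c2 in zip(a, b))

def within_k(source, words, k):
    # keep the words that differ from source in at most k positions
    return [w for w in words if hamming(source, w) <= k]
```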
As an alternative, a fast solution that's also a memory hog is to preprocess your dictionary into a map; keys are a tuple (p, c) where p is the position in the string and c is the character in the string at that position, values are the strings that have characters at that position (so "the" will be in the map at {(0, 't'), "the"}, {(1, 'h'), "the"}, {(2, 'e'), "the"}). To query the map, iterate through query string's characters and construct a result map with the retrieved strings; keys are strings, values are the number of times the strings have been retrieved from the primary map (so with the query string "the", the key "thx" will have a value of 2, and the key "tee" will have a value of 1). Finally, iterate through the result map and discard strings whose values are less than K.
You can save memory by discarding keys that can't possibly equal K when the result map has been completed. For example, if K is 5 and N is 8, then when you've reached the 4th-8th characters of the query string you can discard any retrieved strings that aren't already in the result map since they can't possibly have 5 matching characters. Or, when you've finished with the 6th character of the query string, you can iterate through the result map and remove all keys whose values are less than 3.
If need be you can offload the primary precomputed map to a NoSql key-value database or something along those lines in order to save on main memory (and also so that you don't have to precompute the dictionary every time the program restarts).
Rather than storing a tuple (p, c) as the key in the primary map, you can instead concatenate the position and character into a string (so (5, 't') becomes "5t", and (12, 'x') becomes "12x").
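A sketch of the precomputed (position, character) map described above, stated in terms of the question's k (allowed mismatches), so a word needs at least len(query) - k matching positions to survive; names are illustrative:

```python
from collections import defaultdict

def build_map(words):
    index = defaultdict(list)
    for w in words:
        for p, c in enumerate(w):
            index[(p, c)].append(w)   # primary map: (position, char) -> words
    return index

def within_k_mismatches(index, query, k):
    matches = defaultdict(int)        # result map: word -> matching positions
    for p, c in enumerate(query):
        for w in index.get((p, c), ()):
            matches[w] += 1
    need = len(query) - k             # minimum matches for <= k mismatches
    return [w for w, m in matches.items() if m >= need]
```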
Without knowing where in each input string the matching characters will be, you might need to check every character no matter what order you check them in. Therefore it makes sense to just iterate over each string character by character and keep a running count of mismatches. If i is the number of mismatches so far, return false as soon as i exceeds k, and return true as soon as the number of unchecked characters remaining is at most k - i (the remaining characters cannot push the total past k).
Note that depending on how long the strings are and how many mismatches you'll allow, it might be faster to iterate over the whole string rather than performing these checks, or perhaps to perform them only after every couple characters. Play around with it to see how you get the fastest performance.
My method, thinking out loud :P I can't see a way to do this without going through each of the n strings, but I'm happy to be corrected. With that in mind, it would begin with a pre-processing pass to save a second copy of each of your n strings with the characters sorted in ascending order.
The first part of the comparison would then check each string a character at a time, say n', against each character in s, say s'.
If s' is less than n', they are not equal; move to the next s'. If n' is less than s', go to the next n'. Otherwise record a matching character. Repeat until k mismatches are found, or enough matches are found, and mark the string accordingly.
For further consideration, additional pre-processing could be done on each adjacent string in n to count the total number of characters that differ. This could then be used when comparing a string n to s: if a sufficient difference exists between it and an adjacent n, there may be no need to compare that one.
