I am learning some python and in the process of it I'm doing some simple katas from codewars.
I run into https://www.codewars.com/kata/scramblies problem.
My solution went as follows:
def scramble(s1, s2):
    result = True
    for character in s2:
        if character not in s1:
            return False
        i = s1.index(character)
        s1 = s1[0:i] + s1[i+1:]
    return result
While it produced the correct result, it wasn't fast enough: my solution timed out after 12000 ms.
I looked at the solutions presented by others and one involved making a set.
def scramble(s1, s2):
    for letter in set(s2):
        if s1.count(letter) < s2.count(letter):
            return False
    return True
Why is my solution so much slower than the other one? It doesn't look like it should be, unless I am misunderstanding something about the efficiency of slicing strings. Is my approach to solving this problem flawed, or just not Pythonic?
For this kind of online programming challenge with a limit on your program's running time, the test inputs will include some quite large examples, and the time limit is usually set so that you don't have to squeeze every last millisecond of performance out of your code, but you do have to write an algorithm of a low enough computational complexity. To answer why your algorithm times out, we can analyse it to find its computational complexity using big O notation.
First we can label each individual statement with its complexity, where n is the length of s1 and m is the length of s2:
def scramble(s1, s2):
    result = True                    # O(1)
    for character in s2:             # loop runs O(m) times
        if character not in s1:      # O(n) to search characters in s1
            return False             # O(1)
        i = s1.index(character)      # O(n) to search characters in s1
        s1 = s1[0:i] + s1[i+1:]      # O(n) to build a new string
    return result                    # O(1)
Then the total complexity is O(1 + m*(n + 1 + n + n) + 1) or more simply, O(m*n). This is not efficient for this problem.
The key to why the alternate algorithm is faster lies in the fact that set(s2) contains only the distinct characters from the string s2. This is important because the alphabet that these strings are formed from has a constant, limited size; presumably 26 for the lowercase letters. Given this, the outer loop of the alternate algorithm actually runs at most 26 times:
def scramble(s1, s2):
    for letter in set(s2):                       # O(m) to build a set;
                                                 # loop runs O(1) times
        if s1.count(letter) < s2.count(letter):  # O(n) + O(m) to count
                                                 # chars from s1 and s2
            return False                         # O(1)
    return True                                  # O(1)
This means the alternate algorithm's complexity is O(m + 1*(n + m + 1) + 1) or more simply O(m + n), meaning it is asymptotically more efficient than your algorithm.
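For comparison, the same O(m + n) bound can be reached without relying on a bounded alphabet by using collections.Counter. This is a sketch of my own, not one of the kata solutions:

```python
from collections import Counter

def scramble(s1, s2):
    # Counter subtraction keeps only positive counts, so the result is
    # empty exactly when s1 can supply every character that s2 needs.
    # Building each Counter is O(n) and O(m) respectively.
    return not (Counter(s2) - Counter(s1))
```

Counter subtraction discards non-positive counts, so an empty result means no character of s2 is in deficit.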
First of all, set is fast and very good at its job. For things like in, set is faster than list.
Second of all, your solution is doing way more work than the correct solution. Note how the second solution never modifies s1 or s2, whereas your solution both takes two slices of s1 and then reassigns s1, on top of calling .index(). Slicing isn't the fastest operation, mainly because memory has to be allocated and data has to be copied. .remove() on a list would probably be faster than the combination of .index() and slicing that you're doing.
The underlying message here is that if the task can be done in fewer operations, it's obviously going to execute more quickly. Slicing is also more expensive than most other methods, because allocating space and copying memory costs more than computational methods like .count() that the correct solution uses.
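To sketch that last suggestion: converting s1 to a list once and calling list.remove avoids rebuilding a string on every iteration. It is still O(m*n) overall, so it would likely still time out on this kata, but it does less allocation per step:

```python
def scramble(s1, s2):
    pool = list(s1)                # one O(n) copy up front
    for character in s2:
        if character not in pool:  # O(n) scan
            return False
        pool.remove(character)     # O(n) shift, but no new string built
    return True
```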
Related
I am working on this Leetcode problem - "Given a string containing digits from 2-9 inclusive, return all possible letter combinations that the number could represent. Return the answer in any order.
A mapping of digit to letters (just like on the telephone buttons) is given below.
Note that 1 does not map to any letters."
This is a recursive solution to the problem that I was able to understand, but I am not able to figure out the time and space complexity of the solution.
class Solution:
    def letterCombinations(self, digits):
        if not len(digits):
            return []
        res = []
        my_dict = {
            '2': 'abc',
            '3': 'def',
            '4': 'ghi',
            '5': 'jkl',
            '6': 'mno',
            '7': 'pqrs',
            '8': 'tuv',
            '9': 'wxyz'
        }
        if len(digits) == 1:
            return list(my_dict[digits[0]])
        my_list = my_dict[digits[0]]  # string - abc
        for i in range(len(my_list)):  # i = 0, 1, 2
            for item in self.letterCombinations(digits[1:]):
                print(item)
                res.append(my_list[i] + item)
        return res
Any help or explanation regarding calculating time and space complexity for this solution would be helpful. Thank you.
With certain combinatorial problems, the time and space complexity can become dominated by the size of the output. Looking at the loops and function calls, the work being done in the function is one string concatenation and one append for each element of the output. There are also up to 4 repeated recursive calls to self.letterCombinations(digits[1:]): assuming these aren't cached, we need to add in the extra repeated work being done there.
We can write a formula for the number of operations needed to solve the problem when len(digits) == n. If T(n) is the number of steps, and A(n) is the length of the answer array, we get T(n) = 4*T(n-1) + n*A(n) + O(1). We get an extra multiplicative factor of n on A(n) because string concatenation is linear time; an implementation with lists and str.join() would avoid that.
Since A(n) is upper-bounded by 4^n, and T(1) is a constant, this gives T(n) = O(n * (4^n)); the space complexity here is also O(n * (4^n)), given 4^n strings of length n.
One possibly confusing part of complexity analysis is that it's usually a worst-case analysis unless specified otherwise. That's why we use 4 instead of 3 here: if any input could give 4^n results, we use that figure, even though many digit inputs would give closer to 3^n results.
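One way to sanity-check the A(n) figures is to enumerate the same output with itertools.product, whose result size is exactly the product of the per-digit alphabet sizes (3 or 4 per digit). The names below are mine, not from the question:

```python
from itertools import product

mapping = {'2': 'abc', '3': 'def', '4': 'ghi', '5': 'jkl',
           '6': 'mno', '7': 'pqrs', '8': 'tuv', '9': 'wxyz'}

def letter_combinations(digits):
    if not digits:
        return []
    # product enumerates the same A(n) strings the recursion builds,
    # so len(result) is the product of the per-digit letter counts.
    return [''.join(p) for p in product(*(mapping[d] for d in digits))]
```

For example, "23" yields 3*3 = 9 combinations, while "79" (two 4-letter digits) yields 4*4 = 16, the worst-case 4^n growth.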
What is the most efficient way to obtain hash values for all substrings of a string? I tried to use:
let hash_table = Hashtbl.create 1_000_000;;
let str1 = "AHTG...";;  (* 1000000 chars *)
let tam = 2;;
for i = 0 to String.length str1 - tam do
  let st = String.sub str1 i tam in
  Hashtbl.add hash_table (Hashtbl.hash st) i
done;;
to calculate all substrings of size 2 (AC, CH, TA, ...) of a string of size 1000000 and add the values to hash_table, but I think it takes a lot of time to finish the process. I was wondering if there is a more efficient and faster approach than the one presented above?
First of all, there are a lot of substrings of a string, around n^2/2 of them I would say. This is a big number when n = 1e6. If your hash function is a black box with no known arithmetic properties, and your string also has no known extra properties, you basically have to do O(n^2) calls to your hash function, which will take a long time.
If your hash function has interesting arithmetic properties, like say hash(a ^ b) = hash(a) + hash(b) mod K, you might be able to do a little better. On the other hand, properties like this probably make a weaker hash.
As an immediate improvement, you might consider a hash function that works directly on a substring. That will save you a lot of calls to String.sub and the associated consing and GC. (Probably this won't help a lot as OCaml has a really good GC for short-lived values.)
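As an illustration of what such arithmetic properties buy you (shown in Python rather than OCaml for brevity): a polynomial rolling hash of the Rabin-Karp kind lets you hash every length-k substring in O(n) total instead of O(n*k). The base B and modulus K below are arbitrary choices of mine:

```python
def rolling_hashes(s, k, B=257, K=(1 << 61) - 1):
    # hash(s[i:i+k]) = sum of ord(s[i+j]) * B**(k-1-j), all mod K.
    # Each window hash is derived from the previous one in O(1):
    # drop the leading character's contribution, shift, add the new one.
    n = len(s)
    if n < k:
        return []
    top = pow(B, k - 1, K)       # weight of the window's first character
    h = 0
    for ch in s[:k]:             # hash of the first window, O(k)
        h = (h * B + ord(ch)) % K
    out = [h]
    for i in range(k, n):        # slide the window across s, O(n - k)
        h = ((h - ord(s[i - k]) * top) * B + ord(s[i])) % K
        out.append(h)
    return out
```

Equal substrings always get equal hashes, so in "abcab" the two occurrences of "ab" (positions 0 and 3) hash identically.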
This is an algorithm to determine if a string has all unique characters. What is the time complexity?
def unique(s):
    d = []
    for c in s:
        if c not in d:
            d.append(c)
        else:
            return False
    return True
It looks like there is only one for loop here, so it should be O(n). However, does this line
if c not in d:
also cost O(n) time? If so, is the time complexity of this algorithm O(n^2)?
Your intuition is correct: this algorithm is O(n^2). The documentation for list specifies that in is an O(n) operation. In the worst-case scenario, when the target element is not present in the list, every element needs to be visited.
Using a set instead of a list would improve time complexity to O(n) because set lookups would be O(1).
An easy way to use a set to test in O(n) time whether all characters in a string are unique is to simply convert the string to a set and see whether its length is unchanged:
def unique(s):
    return len(s) == len(set(s))
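If you want to keep the early exit of the original loop rather than always scanning the whole string, the same O(n) bound follows from tracking seen characters in a set; a minimal sketch:

```python
def unique(s):
    seen = set()
    for c in s:
        if c in seen:    # O(1) average-case membership test
            return False
        seen.add(c)
    return True
```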
Given a string A and a set of strings S, I need to find an optimal method to find a prefix of A which is not a prefix of any of the strings in S.
Example
A={apple}
S={april,apprehend,apprehension}
Output should be "appl" and not "app" since "app" is prefix of both "apple" and "apprehension" but "appl" is not.
I know the trie approach; by making a trie of set S and then traversing in the trie for string A.
But what I want to ask is can we do it without trie?
Like, can we compare every pair (A, Si), where Si is the ith string from set S, and get the largest common prefix out of them? In this case that would be "app", so the required answer would be "appl".
This would take 2 loops (one for iterating through S and another for comparing Si and A).
Can we improve upon this??
Please suggest an optimum approach.
I'm not sure exactly what you had in mind, but here's one way to do it:
Keep a variable longest, initialised to 0.
Loop over all elements S[i] of S,
setting longest = max(longest, matchingPrefixLength(S[i], A)).
Return the prefix from A of length longest+1.
This uses O(1) space and takes O(length(S)*average length of S[i]) time.
This is optimal (at least for the worst case) since you can't get around needing to look at every character of every element in S.
Example:
A={apple}
S={april,apprehend,apprehension}
longest = 0
The longest prefix for S[0] and A is 2
So longest = max(0,2) = 2
The longest prefix for S[1] and A is 3
So longest = max(2,3) = 3
The longest prefix for S[2] and A is 3
So longest = max(3,3) = 3
Now we return the prefix of length longest+1 = 4, i.e. "appl"
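The steps above can be sketched in Python (the names are mine, and this assumes A is never itself a prefix of a string in S, so that an answer of length longest+1 always exists):

```python
def matching_prefix_length(s, a):
    # Count leading positions where the two strings agree.
    n = 0
    for x, y in zip(s, a):
        if x != y:
            break
        n += 1
    return n

def shortest_unique_prefix(a, strings):
    # longest common prefix of A with any element of S, then one more char
    longest = max((matching_prefix_length(s, a) for s in strings), default=0)
    return a[:longest + 1]
```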
Note that there are actually 2 trie-based approaches:
Store only A in the trie. Iterate through the trie for each element from S to eliminate prefixes.
This uses much less memory than the second approach (but still more than the approach above), at least assuming A isn't much, much longer than the strings in S; you can optimise by stopping at the length of the longest element in S, or by constructing the trie as you go, to avoid this case.
Store all elements from S in the trie. Iterate through the trie with A to find the shortest non-matching prefix.
This approach is significantly faster if you have lots of A's to query against a constant set S (since you only have to build the trie once and do a single lookup for each A, whereas the first approach requires creating a new trie and running through each S[i] for every A).
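A minimal sketch of the second trie approach, using nested dicts as the trie; the names are my own:

```python
def build_trie(strings):
    # Nested dicts as a minimal trie: each node maps a character to a child.
    root = {}
    for s in strings:
        node = root
        for ch in s:
            node = node.setdefault(ch, {})
    return root

def unique_prefix_via_trie(a, trie):
    # Walk A through the trie; the first character with no matching edge
    # marks the shortest prefix of A that no string in S shares.
    node = trie
    for i, ch in enumerate(a):
        if ch not in node:
            return a[:i + 1]
        node = node[ch]
    return None  # A itself is a prefix of some string in S
```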
What is your input size?
Let's model your input as being of N+1 strings whose lengths are about M characters. Your total input size is about M(N+1) character, plus some proportional amount of apparatus to encode that data in a usable format (data structure overhead).
Your algorithm ...
maxlen = 0
for i = 1 to N
    for j = 1 to M
        if A[j] ≠ S[i][j] then break
        if j > maxlen then maxlen = j
print A[1...maxlen+1]
... performs up to M × N iterations of the innermost loop, reading two characters each time, for a total of 2MN characters read.
Recall that our input data size was also about M(N+1). So our question now is whether we can solve this problem, in the worst case, while looking at asymptotically less than the total input (your algorithm reads a little less than all of the input twice, which is linear in the input size). The answer is no. Consider this worst case:
length of A is M'
length of all strings in S is M'
A differs from N-1 strings in S by the last two characters
A differs from 1 string in S by only the last character
Any algorithm must look at M'-1 characters of N-1 strings, plus M' characters of 1 string, to correctly determine that the answer to this problem instance is A.
(M'-1)(N-1) + M' = M'N - M' - N + 1 + M' = M'N - N + 1
For N >= 2, the dominant term in both M'(N+1) and M'N - N + 1 is M'N, meaning that for N >= 2, both the input size and the amount of that input any correct algorithm must read are O(MN). Your algorithm is O(MN). No other algorithm can be asymptotically better.
I am a bit confused on space complexity. Is this O(1) space complexity or O(N) complexity?
Since I am creating a string of size n, my guess is the space complexity is O(N) is that correct?
## this function takes in a string and returns the string
def test(stringval):
    stringval2 = ""
    for x in stringval:
        stringval2 = stringval2 + x
    return stringval2

test("hello")
Yes, that's correct. The space complexity of storing a new string of length n is Θ(n) because each individual character must be stored somewhere. In principle you could reduce the space usage by noticing that stringval2 ends up being a copy of stringval and potentially using copy-on-write or other optimizations, but in this case there's no reason to expect that to happen.
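As an aside on the example itself: the result is Θ(n) space either way, but repeated + in a loop may re-copy the partial string on each iteration; the usual idiom is a single ''.join, a sketch:

```python
def copy_string(stringval):
    # Same Theta(n) result, built in one allocation by ''.join
    # instead of repeated concatenation inside a loop.
    return ''.join(stringval)
```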
Hope this helps!