Hash function to see if one string is scrambled form/permutation of another? - string

I want to check if string A is just a reordered version of string B. For example, "abc" = "bca" = "cab"...
There are other solutions here: https://www.geeksforgeeks.org/check-if-two-strings-are-permutation-of-each-other/
However, I was thinking a hash function would be an easy way of doing this, but the typical hash function takes order into consideration. Are there any hash functions that do not care about character order?

Are there any hash functions that do not care about character order?
I don't know of real-world hash functions that have this property, no. Because this is not a problem they are designed to solve.
However, in this specific case, you can make your own "hash" function (a very very bad one) that will indeed ignore order: just sum ASCII codes of characters. This works due to the commutative property of addition (a + b == b + a)
def isAnagram(self,a,b):
sum_a = 0
sum_b = 0
for c in a:
sum_a += ord(c)
for c in b:
sum_b += ord(c)
return sum_a == sum_b
To reiterate, this is absolutely a hack, that only happens to work because input strings are limited in content in the judge system (only have lowercase ASCII characters and do not contain spaces). It will not work (reliably) on arbitrary strings.

For a fast check you could use a kind af hash-funkction
Candidates are:
xor all characters of a String
add all characters of a String
multiply all characters of a String (be careful might lead to overflow for large Strings)
If the hash-value is equal, it could still be a collision of two not 'equal' strings. So you still need to make a dedicated compare. (e.g. sort the characters of each string before comparing them).

Related

Is there a built-in in Python 3 that checks whether a character is a "basic" algebraic symbol?

I know the string methods str.isdigit, str.isdecimal and str.isnumeric.
I'm looking for a built-in method that checks if a character is algebraic, meaning that it can be found in a declaration of a decimal number.
The above mentioned methods return False for '-1' and '1.0'.
I can use isdigit to retrieve a positive integer from a string:
string = 'number=123'
number = ''.join([d for d in string if d.isdigit()]) # returns '123'
But that doesn't work for negative integers or floats.
Imagine a method called isnumber that works like this:
def isnumber(s):
for c in s:
if c not in list('.+-0123456789'):
return False
return True
string1 = 'number=-1'
string2 = 'number=0.1'
number1 = ''.join([d for d in string1 if d.isnumber()]) # returns '-1'
number2 = ''.join([d for d in string2 if d.isnumber()]) # returns '0.1'
The idea is to test against a set of "basic" algebraic characters. The string does not have to contain a valid Python number. It could also be an IP address like 255.255.0.1.
.
Does a handy built-in that works approximately like that exist?
If not, why not? It would be much more efficient than a python function and very useful. I've seen alot of examples on stackoverflow that use str.isdigit() to retrieve a positive integer from a string. Is there a reason why there isn't a built-in like that, although there are three different methods that do almost the same thing?
No such function exists. There are a bunch of odd characters that can be part of number literals in Python, such as o, x and b in the prefix of integers of non-decimal bases, and e to introduce the exponential part of a float. I think those plus the hex digits (0-9 and A-F) and sign characters and the decimal point are all you need.
You can put together a string with the right character yourself and test against it:
from string import hex_digits
num_literal_chars = hex_digits + "oxOX.+-"
That will get a bunch of garbage though if you use it to test against mixed text and numbers:
string1 = "foo. bar. 0xDEADBEEF 10.0.0.1"
print("".join(c for c in string1 if c in num_literal_chars))
# prints "foo.ba.0xDEADBEEF10.0.0.1"
The fact that it gives you a bunch of junk is probably why no builtin function exists to do this. If you want to match a certain kind of number out of a string, write an appropriate regular expression to match that specific kind of number. Don't try to do it character-by-character, or try to match all the different kinds of Python numbers.

Find the minimal lexographical string formed by merging two strings

Suppose we are given two strings s1 and s2(both lowercase). We have two find the minimal lexographic string that can be formed by merging two strings.
At the beginning , it looks prettty simple as merge of the mergesort algorithm. But let us see what can go wrong.
s1: zyy
s2: zy
Now if we perform merge on these two we must decide which z to pick as they are equal, clearly if we pick z of s2 first then the string formed will be:
zyzyy
If we pick z of s1 first, the string formed will be:
zyyzy which is correct.
As we can see the merge of mergesort can lead to wrong answer.
Here's another example:
s1:zyy
s2:zyb
Now the correct answer will be zybzyy which will be got only if pick z of s2 first.
There are plenty of other cases in which the simple merge will fail. My question is Is there any standard algorithm out there used to perform merge for such output.
You could use dynamic programming. In f[x][y] store the minimal lexicographical string such that you've taken x charecters from the first string s1 and y characters from the second s2. You can calculate f in bottom-top manner using the update:
f[x][y] = min(f[x-1][y] + s1[x], f[x][y-1] + s2[y]) \\ the '+' here represents
\\ the concatenation of a
\\ string and a character
You start with f[0][0] = "" (empty string).
For efficiency you can store the strings in f as references. That is, you can store in f the objects
class StringRef {
StringRef prev;
char c;
}
To extract what string you have at certain f[x][y] you just follow the references. To udapate you point back to either f[x-1][y] or f[x][y-1] depending on what your update step says.
It seems that the solution can be almost the same as you described (the "mergesort"-like approach), except that with special handling of equality. So long as the first characters of both strings are equal, you look ahead at the second character, 3rd, etc. If the end is reached for some string, consider the first character of the other string as the next character in the string for which the end is reached, etc. for the 2nd character, etc. If the ends for both strings are reached, then it doesn't matter from which string to take the first character. Note that this algorithm is O(N) because after a look-ahead on equal prefixes you know the whole look-ahead sequence (i.e. string prefix) to include, not just one first character.
EDIT: you look ahead so long as the current i-th characters from both strings are equal and alphabetically not larger than the first character in the current prefix.

algorithms for fast string approximate matching

Given a source string s and n equal length strings, I need to find a quick algorithm to return those strings that have at most k characters that are different from the source string s at each corresponding position.
What is a fast algorithm to do so?
PS: I have to claim that this is a academic question. I want to find the most efficient algorithm if possible.
Also I missed one very important piece of information. The n equal length strings form a dictionary, against which many source strings s will be queried upon. There seems to be some sort of preprocessing step to make it more efficient.
My gut instinct is just to iterate over each String n, maintaining a counter of how many characters are different than s, but I'm not claiming it is the most efficient solution. However it would be O(n) so unless this is a known performance problem, or an academic question, I'd go with that.
Sedgewick in his book "Algorithms" writes that Ternary Search Tree allows "to locate all words within a given Hamming distance of a query word". Article in Dr. Dobb's
Given that the strings are fixed length, you can compute the Hamming distance between two strings to determine the similarity; this is O(n) on the length of the string. So, worst case is that your algorithm is O(nm) for comparing your string against m words.
As an alternative, a fast solution that's also a memory hog is to preprocess your dictionary into a map; keys are a tuple (p, c) where p is the position in the string and c is the character in the string at that position, values are the strings that have characters at that position (so "the" will be in the map at {(0, 't'), "the"}, {(1, 'h'), "the"}, {(2, 'e'), "the"}). To query the map, iterate through query string's characters and construct a result map with the retrieved strings; keys are strings, values are the number of times the strings have been retrieved from the primary map (so with the query string "the", the key "thx" will have a value of 2, and the key "tee" will have a value of 1). Finally, iterate through the result map and discard strings whose values are less than K.
You can save memory by discarding keys that can't possibly equal K when the result map has been completed. For example, if K is 5 and N is 8, then when you've reached the 4th-8th characters of the query string you can discard any retrieved strings that aren't already in the result map since they can't possibly have 5 matching characters. Or, when you've finished with the 6th character of the query string, you can iterate through the result map and remove all keys whose values are less than 3.
If need be you can offload the primary precomputed map to a NoSql key-value database or something along those lines in order to save on main memory (and also so that you don't have to precompute the dictionary every time the program restarts).
Rather than storing a tuple (p, c) as the key in the primary map, you can instead concatenate the position and character into a string (so (5, 't') becomes "5t", and (12, 'x') becomes "12x").
Without knowing where in each input string the match characters will be, for a particular string, you might need to check every character no matter what order you check them in. Therefore it makes sense to just iterate over each string character-by-character and keep a sum of the total number of mismatches. If i is the number of mismatches so far, return false when i == k and true when there are fewer than k-i unchecked characters remaining in the string.
Note that depending on how long the strings are and how many mismatches you'll allow, it might be faster to iterate over the whole string rather than performing these checks, or perhaps to perform them only after every couple characters. Play around with it to see how you get the fastest performance.
My method if we're thinking out loud :P I can't see a way to do this without going through each n string, but I'm happy to be corrected. On that it would begin with a pre-process to save a second set of your n strings so that the characters are in ascending order.
The first part of the comparison would then be to check each n string a character at a time say n' to each character in s say s'.
If s' is less than n' then not equal and move to the next s'. If n' is less than s' then go to next n'. Otherwise record a matching character. Repeat this until k miss matches are found or the alternate matches are found and mark n accordingly.
For further consideration, an added pre-processing could be done on each adjacent string in n to see the total number of characters that differ. This could then be used when comparing strings n to s and if sufficient difference exist between these and the adjacent n there may not be a need to compare it?

complexity of constructing an inverted index list

Given n strings S1, S2, ..., Sn, and an alphabet set A={a_1,a_2,....,a_m}. Assume that the alphabets in each string are all distinct. Now I want to create an inverted-index for each a_i (i=1,2...,m). My inverted-index has also something special: The alphabets in A are in some sequential order, if in the inverted-index a_i has included one string (say S_2), then a_j (j=i+1,i+2,...,m) don't need to include S_2 any more. In short, every string just appears in the inverted list only once. My question is how to build such list in a fast and efficient way? Any time complexity is bounded?
For example, A={a,b,e,g}, S1={abg}, S2={bg}, S3={gae}, S4={g}. Then my inverted-list should be:
a: S1,S3
b: S2 (since S1 has appeared previously, so we don't need to include it here)
e:
g: S4
If I understand your question correctly, a straightforward solution is:
for each string in n strings
find the "smallest" character in the string
put the string in the list for the character
The complexity is proportional to the total length of the strings, multiplying by a constant for the order testing.
If there is a simple way for testing, (e.g. the characters are in alphabetical order and all lower-case, a < will be enough), simply compare them; otherwise, I suggest using a hash table, each pair of which is a character and its order, later simply compare them.

Modifying a character in a string in Lua

Is there any way to replace a character at position N in a string in Lua.
This is what I've come up with so far:
function replace_char(pos, str, r)
return str:sub(pos, pos - 1) .. r .. str:sub(pos + 1, str:len())
end
str = replace_char(2, "aaaaaa", "X")
print(str)
I can't use gsub either as that would replace every capture, not just the capture at position N.
Strings in Lua are immutable. That means, that any solution that replaces text in a string must end up constructing a new string with the desired content. For the specific case of replacing a single character with some other content, you will need to split the original string into a prefix part and a postfix part, and concatenate them back together around the new content.
This variation on your code:
function replace_char(pos, str, r)
return str:sub(1, pos-1) .. r .. str:sub(pos+1)
end
is the most direct translation to straightforward Lua. It is probably fast enough for most purposes. I've fixed the bug that the prefix should be the first pos-1 chars, and taken advantage of the fact that if the last argument to string.sub is missing it is assumed to be -1 which is equivalent to the end of the string.
But do note that it creates a number of temporary strings that will hang around in the string store until garbage collection eats them. The temporaries for the prefix and postfix can't be avoided in any solution. But this also has to create a temporary for the first .. operator to be consumed by the second.
It is possible that one of two alternate approaches could be faster. The first is the solution offered by PaĆ­lo Ebermann, but with one small tweak:
function replace_char2(pos, str, r)
return ("%s%s%s"):format(str:sub(1,pos-1), r, str:sub(pos+1))
end
This uses string.format to do the assembly of the result in the hopes that it can guess the final buffer size without needing extra temporary objects.
But do beware that string.format is likely to have issues with any \0 characters in any string that it passes through its %s format. Specifically, since it is implemented in terms of standard C's sprintf() function, it would be reasonable to expect it to terminate the substituted string at the first occurrence of \0. (Noted by user Delusional Logic in a comment.)
A third alternative that comes to mind is this:
function replace_char3(pos, str, r)
return table.concat{str:sub(1,pos-1), r, str:sub(pos+1)}
end
table.concat efficiently concatenates a list of strings into a final result. It has an optional second argument which is text to insert between the strings, which defaults to "" which suits our purpose here.
My guess is that unless your strings are huge and you do this substitution frequently, you won't see any practical performance differences between these methods. However, I've been surprised before, so profile your application to verify there is a bottleneck, and benchmark potential solutions carefully.
You should use pos inside your function instead of literal 1 and 3, but apart from this it looks good. Since Lua strings are immutable you can't really do much better than this.
Maybe
"%s%s%s":format(str:sub(1,pos-1), r, str:sub(pos+1, str:len())
is more efficient than the .. operator, but I doubt it - if it turns out to be a bottleneck, measure it (and then decide to implement this replacement function in C).
With luajit, you can use the FFI library to cast the string to a list of unsigned charts:
local ffi = require 'ffi'
txt = 'test'
ptr = ffi.cast('uint8_t*', txt)
ptr[1] = string.byte('o')

Resources