Recreating JS bitwise integer handling in Python3 - python-3.x

I need to translate a hash function from JavaScript to Python.
The function is as follows:
function getIndex(string) {
var length = 27;
string = string.toLowerCase();
var hash = 0;
for (var i = 0; i < string.length; i++) {
hash = string.charCodeAt(i) + (hash << 6) + (hash << 16) - hash;
}
var index = Math.abs(hash % length);
return index;
}
console.log(getIndex(window.prompt("Enter a string to hash")));
This function is Objectively Correct™. It is perfection itself. I can't change it, I just have to recreate it. Whatever it outputs, my Python script must also output.
However - I'm having a couple of problems, and I think that it's all to do with the way that the two languages handle signed integers.
JS bitwise operators treat their operands as a sequence of 32 bits. Python, however, has no concept of bit limitation and just keeps going like an absolute madlad. I think that this is the one relevant difference between the two languages.
I can limit the length of hash in Python by masking it to 32 bits with hash & 0xFFFFFFFF.
I can also negate hash if it's above 0x7FFFFFFF with hash = hash ^ 0xFFFFFFFF (or hash = ~hash - they both seem to do the same thing). I believe that this simulates negative numbers.
I apply both of these restrictions to the hash with a function called t.
Here's my Python code so far:
def nickColor(string):
length = 27
def t(x):
x = x & 0xFFFFFFFF
if x > 0x7FFFFFFF:
x = x ^ 0xFFFFFFFF
return x
string = string.lower()
hash = t(0)
for letter in string:
hash = t(hash)
hash = t(t(ord(letter)) + t(hash << 6) + t(hash << 16) - t(hash))
index = hash % length
return index
It seems to work up until the point that a hash needs to become negative, at which point the two scripts diverge. This normally happens about 4 letters into the string.
I'm assuming that my problem lies in recreating JS negative numbers in Python. How can I say bye to this problem?

Here is a working translation:
def nickColor(string):
length = 27
def t(x):
x &= 0xFFFF_FFFF
if x > 0x7FFF_FFFF:
x -= 0x1_0000_0000
return float(x)
bytes = string.lower().encode('utf-16-le')
hash = 0.0
for i in range(0, len(bytes), 2):
char_code = bytes[i] + 256*bytes[i+1]
hash = char_code + t(int(hash) << 6) + t(int(hash) << 16) - hash
return int(hash % length if hash >= 0 else abs(hash % length - length))
The point is, only the shifts (<<) are calculated as 32-bit integer operations, their result is converted back to double before entering additions and subtractions. I'm not familiar with the rules for double-precision floating point representation in the two languages, but it's safe to assume that on all personal computing devices and web servers it is the same for both languages, namely double-precision IEEE 754. For very long strings (thousands of characters) the hash could lose some bits of precision, which of course affects the final result, but in the same way in JS as in Python (not what the author of the Objectively Correct™ function intended, but that's the way it is…). The last line corrects for the different definition of the % operator for negative operands in JavaScript and Python.
Furthermore (thanks to Mark Ransom for reminding me this), to fully emulate JavaScript, it is also necessary to consider its encoding, which is UTF-16, but with surrogate pairs handled as if they consisted of 2 characters. Encoding the string as utf-16-le you make sure that the first byte in each 16-bit “word” is the least significant one, plus, you don't get the BOM that you would get if you used utf-16 tout court (thank you Martijn Pieters).

Related

Rabin-Karp: rolling hash computation adds a large prime number to previously computed hash

I think I conceptually understand RabinKarp pattern matching algorithm using rolling hash. While going through a sample implementation here, I find that a large prime number q is added to the previously computed rolling hash.
for (int i = m; i < n; i++) {
// Remove leading digit, add trailing digit, check for match.
txtHash = (txtHash + q - RM*txt.charAt(i-m) % q) % q; //Why +q here?
txtHash = (txtHash*R + txt.charAt(i)) % q;
// match
int offset = i - m + 1;
if ((patHash == txtHash) && check(txt, offset))
return offset;
}
I am not sure why this is needed. Can I get some help with this?
In my limited testing, I get the same result whether or not q term in included.
Does this have something to do with which version of algorithm (Monte Carlo/Las Vegas) is being implemented?
The +q term is there to avoid dealing with negative numbers.
We want txtHashto always lie in the interval [0;q[, without this +q it could also be in ]-q;0[.
This could lead to missing the pattern. e.g if patHash = 0xdead but you compute txtHash = -q+0xdead. Those two values are equal mathematically mod q but different with Java taken % q.

isn't computing hash value by polynomial rolling for string O(n), where n is the string size?

Just read about string hashing and polynomial hash function to calculate it.
It looks like to me as if the time complexity of computing hash(string) is O(N) where 'N' is the string size
long long compute_hash(string const& s) {
const int p = 31;
const int m = 1e9 + 9;
long long hash_value = 0;
long long p_pow = 1;
for (char c : s) {
hash_value = (hash_value + (c - 'a' + 1) * p_pow) % m;
p_pow = (p_pow * p) % m;
}
return hash_value;
}
Where the hash value of string 'S' can be computed as S[0] + S[1].P + S[2].P.P + . . . S[N - 1].P^(N - 1)
And if the computation is O(N) then isn't hashing N strings is O(N^2)?
Short answer: given you insert n strings with length n, the reasoning is correct, but the "scenario" that the length of the strings is determined by the number of strings to hash, is a bit strange.
And if the computation is O(N) then isn't hashing N strings is O(N2)?
Well given the length of the strings scales with the number of strings, then, for the given hashing algorithm, this will indeed result in O(n2). But typically there is no correlation between the length of a string, and the number of strings to hash.
If the strings have an average length of k, and there are n strings, then this is an O(n×k) algorithm. You are thus correct that the "size" of the objects can have an impact on the performance, given of course the hashing algorithm scales with the size of the object.
Yes, hashing N strings of length N is an O(N²) process. (Though in practice, you very very rarely meet such a coincidental case.)

Generate an integer for encryption from a string and vice versa

I am trying to write an RSA code in python3. I need to turn user input strings (containing any characters, not only numbers) into integers to then encrypt them. What is the best way to turn a sting into an integer in Python 3.6 without 3-rd party modules?
how to encode a string to an integer is far from unique... there are many ways! this is one of them:
strg = 'user input'
i = int.from_bytes(strg.encode('utf-8'), byteorder='big')
the conversion in the other direction then is:
s = int.to_bytes(i, length=len(strg), byteorder='big').decode('utf-8')
and yes, you need to know the length of the resulting string before converting back. if length is too large, the string will be padded with chr(0) from the left (with byteorder='big'); if length is too small, int.to_bytes will raise an OverflowError: int too big to convert.
The #hiro protagonist's answer requires to know the length of the string. So I tried to find another solution and good answers here: Python3 convert Unicode String to int representation. I just summary here my favourite solutions:
def str2num(string):
return int(binascii.hexlify(string.encode("utf-8")), 16)
def num2str(number):
return binascii.unhexlify(format(number, "x").encode("utf-8")).decode("utf-8")
def numfy(s, max_code=0x110000):
# 0x110000 is max value of unicode character
number = 0
for e in [ord(c) for c in s]:
number = (number * max_code) + e
return number
def denumfy(number, max_code=0x110000):
l = []
while number != 0:
l.append(chr(number % max_code))
number = number // max_code
return ''.join(reversed(l))
Intersting: testing some cases shows me that
str2num(s) = numfy(s, max_code=256) if ord(s[i]) < 128
and
str2num(s) = int.from_bytes(s.encode('utf-8'), byteorder='big') (#hiro protagonist's answer)

Efficiently counting the number of substrings of a digit string that are divisible by k?

We are given a string which consists of digits 0-9. We have to count number of sub-strings divisible by a number k. One way is to generate all the sub-strings and check if it is divisible by k but this will take O(n^2) time. I want to solve this problem in O(n*k) time.
1 <= n <= 100000 and 2 <= k <= 1000.
I saw a similar question here. But k was fixed as 4 in that question. So, I used the property of divisibility by 4 to solve the problem.
Here is my solution to that problem:
int main()
{
string s;
vector<int> v[5];
int i;
int x;
long long int cnt = 0;
cin>>s;
x = 0;
for(i = 0; i < s.size(); i++) {
if((s[i]-'0') % 4 == 0) {
cnt++;
}
}
for(i = 1; i < s.size(); i++) {
int f = s[i-1]-'0';
int s1 = s[i] - '0';
if((10*f+s1)%4 == 0) {
cnt = cnt + (long long)(i);
}
}
cout<<cnt;
}
But I wanted a general algorithm for any value of k.
This is a really interesting problem. Rather than jumping into the final overall algorithm, I thought I'd start with a reasonable algorithm that doesn't quite cut it, then make a series of modifications to it to end up with the final, O(nk)-time algorithm.
This approach combines together a number of different techniques. The major technique is the idea of computing a rolling remainder over the digits. For example, let's suppose we want to find all prefixes of the string that are multiples of k. We could do this by listing off all the prefixes and checking whether each one is a multiple of k, but that would take time at least Θ(n2) since there are Θ(n2) different prefixes. However, we can do this in time Θ(n) by being a bit more clever. Suppose we know that we've read the first h characters of the string and we know the remainder of the number formed that way. We can use this to say something about the remainder of the first h+1 characters of the string as well, since by appending that digit we're taking the existing number, multiplying it by ten, and then adding in the next digit. This means that if we had a remainder of r, then our new remainder is (10r + d) mod k, where d is the digit that we uncovered.
Here's quick pseudocode to count up the number of prefixes of a string that are multiples of k. It runs in time Θ(n):
remainder = 0
numMultiples = 0
for i = 1 to n: // n is the length of the string
remainder = (10 * remainder + str[i]) % k
if remainder == 0
numMultiples++
return numMultiples
We're going to use this initial approach as a building block for the overall algorithm.
So right now we have an algorithm that can find the number of prefixes of our string that are multiples of k. How might we convert this into an algorithm that finds the number of substrings that are multiples of k? Let's start with an approach that doesn't quite work. What if we count all the prefixes of the original string that are multiples of k, then drop off the first character of the string and count the prefixes of what's left, then drop off the second character and count the prefixes of what's left, etc? This will eventually find every substring, since each substring of the original string is a prefix of some suffix of the string.
Here's some rough pseudocode:
numMultiples = 0
for i = 1 to n:
remainder = 0
for j = i to n:
remainder = (10 * remainder + str[j]) % k
if remainder == 0
numMultiples++
return numMultiples
For example, running this approach on the string 14917 looking for multiples of 7 will turn up these strings:
String 14917: Finds 14, 1491, 14917
String 4917: Finds 49,
String 917: Finds 91, 917
String 17: Finds nothing
String 7: Finds 7
The good news about this approach is that it will find all the substrings that work. The bad news is that it runs in time Θ(n2).
But let's take a look at the strings we're seeing in this example. Look, for example, at the substrings found by searching for prefixes of the entire string. We found three of them: 14, 1491, and 14917. Now, look at the "differences" between those strings:
The difference between 14 and 14917 is 917.
The difference between 14 and 1491 is 91
The difference between 1491 and 14917 is 7.
Notice that the difference of each of these strings is itself a substring of 14917 that's a multiple of 7, and indeed if you look at the other strings that we've matched later on in the run of the algorithm we'll find these other strings as well.
This isn't a coincidence. If you have two numbers with a common prefix that are multiples of the same number k, then the "difference" between them will also be a multiple of k. (It's a good exercise to check the math on this.)
So this suggests another route we can take. Suppose that we find all prefixes of the original string that are multiples of k. If we can find all of them, we can then figure out how many pairwise differences there are among those prefixes and potentially avoid rescanning things multiple times. This won't find everything, necessarily, but it will find all substrings that can be formed by computing the difference of two prefixes. Repeating this over all suffixes - and being careful not to double-count things - could really speed things up.
First, let's imagine that we find r different prefixes of the string that are multiples of k. How many total substrings did we just find if we include differences? Well, we've found k strings, plus one extra string for each (unordered) pair of elements, which works out to k + k(k-1)/2 = k(k+1)/2 total substrings discovered. We still need to make sure we don't double-count things, though.
To see whether we're double-counting something, we can use the following technique. As we compute the rolling remainders along the string, we'll store the remainders we find after each entry. If in the course of computing a rolling remainder we rediscover a remainder we've already computed at some point, we know that the work we're doing is redundant; some previous scan over the string will have already computed this remainder and anything we've discovered from this point forward will have already been found.
Putting these ideas together gives us this pseudocode:
numMultiples = 0
seenRemainders = array of n sets, all initially empty
for i = 1 to n:
remainder = 0
prefixesFound = 0
for j = i to n:
remainder = (10 * remainder + str[j]) % k
if seenRemainders[j] contains remainder:
break
add remainder to seenRemainders[j]
if remainder == 0
prefixesFound++
numMultiples += prefixesFound * (prefixesFound + 1) / 2
return numMultiples
So how efficient is this? At first glance, this looks like it runs in time O(n2) because of the outer loops, but that's not a tight bound. Notice that each element can only be passed over in the inner loop at most k times, since after that there aren't any remainders that are still free. Therefore, since each element is visited at most O(k) times and there are n total elements, the runtime is O(nk), which meets your runtime requirements.

How compiler is converting integer to string and vice versa

Many languages have functions for converting string to integer and vice versa. So what happens there? What algorithm is being executed during conversion?
I don't ask in specific language because I think it should be similar in all of them.
To convert a string to an integer, take each character in turn and if it's in the range '0' through '9', convert it to its decimal equivalent. Usually that's simply subtracting the character value of '0'. Now multiply any previous results by 10 and add the new value. Repeat until there are no digits left. If there was a leading '-' minus sign, invert the result.
To convert an integer to a string, start by inverting the number if it is negative. Divide the integer by 10 and save the remainder. Convert the remainder to a character by adding the character value of '0'. Push this to the beginning of the string; now repeat with the value that you obtained from the division. Repeat until the divided value is zero. Put out a leading '-' minus sign if the number started out negative.
Here are concrete implementations in Python, which in my opinion is the language closest to pseudo-code.
def string_to_int(s):
i = 0
sign = 1
if s[0] == '-':
sign = -1
s = s[1:]
for c in s:
if not ('0' <= c <= '9'):
raise ValueError
i = 10 * i + ord(c) - ord('0')
return sign * i
def int_to_string(i):
s = ''
sign = ''
if i < 0:
sign = '-'
i = -i
while True:
remainder = i % 10
i = i / 10
s = chr(ord('0') + remainder) + s
if i == 0:
break
return sign + s
I wouldn't call it an algorithm per se, but depending on the language it will involve the conversion of characters into their integral equivalent. Many languages will either stop on the first character that cannot be represented as an integer (e.g. the letter a), will blindly convert all characters into their ASCII value (e.g. the letter a becomes 97), or will ignore characters that cannot be represented as integers and only convert the ones that can - or return 0 / empty. You have to get more specific on the framework/language to provide more information.
String to integer:
Many (most) languages represent strings, on some level or another, as an array (or list) of characters, which are also short integers. Map the ones corresponding to number characters to their number value. For example, '0' in ascii is represented by 48. So you map 48 to 0, 49 to 1, and so on to 9.
Starting from the left, you multiply your current total by 10, add the next character's value, and move on. (You can make a larger or smaller map, change the number you multiply by at each step, and convert strings of any base you like.)
Integer to string is a longer process involving base conversion to 10. I suppose that since most integers have limited bits (32 or 64, usually), you know that it will come to a certain number of characters at most in a string (20?). So you can set up your own adder and iterate through each place for each bit after calculating its value (2^place).

Resources