I have a random binary string s of length l bits. How can I change it in-place to another random string of the same length, such that I can retrieve the original string?
A. A trivial example would be adding 1 modulo 2^l.
B. Another example could be: for each bit b in the string, replace it with (b+position(b))%2 where position(b) is the position of the bit (0, 1, 2, 3, ...).
However, with both of these methods the output is very similar to the input for every input. For example, using method A I'll get '010101' => '010110'. Is there any way to "increase the randomness" of the output somehow? In short, can I randomize a string and still retrieve the original (without adding extra bits to the original string)?
Are you trying to make your own encryption system? If so, typical advice would be to use an existing encryption system.
However, one way to do what you ask would be to generate a value from the string itself (for example, by taking the length of the string), use that as the seed of a random number generator, and then use the random number generator to alter each character in some reversible way.
That way, your string will be the same length, will not look like the original, and will be decodable. It's not very strong encryption though, just a variable cipher that could be broken by a decent decryption attempt.
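A minimal sketch of that idea, assuming the string is made of '0' and '1' characters and that XOR with a pseudo-random bit is the "reversible way" (the function name scramble_bits is made up for illustration). Because the seed is only the length, every string of the same length gets the same mask, so this is obfuscation rather than encryption:

```python
import random

def scramble_bits(bits: str) -> str:
    """XOR each bit with a pseudo-random keystream seeded from the length."""
    rng = random.Random(len(bits))  # assumption: the seed is derived from the length only
    return "".join(str(int(b) ^ rng.getrandbits(1)) for b in bits)

original = "010101"
scrambled = scramble_bits(original)
restored = scramble_bits(scrambled)  # XOR is its own inverse, so the same call decodes
```

Note that two inputs differing in one bit still produce outputs differing in exactly one bit; avoiding that needs a real block cipher or a format-preserving encryption scheme.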
I have a string like this
ODQ1OTc3MzY0MDcyNDk3MTUy.YKoz0Q.wlST3vVZ3IN8nTtVX1tz8Vvq5O8
The first part of the string is a random 18 digit number in base64 format and the second is a unix timestamp in base64 too, while the last is an hmac.
I want to make a model to recognize a string like this.
How can I do it?
While I did not necessarily think deeply about it, this would be what comes to my mind first.
You certainly don't need machine learning for this. In fact, machine learning would not only be inefficient for problems like this but may even be worse, depending on a given approach.
Here, an exact solution can be achieved, simply by understanding the problem.
One way people often go about matching strings with a certain structure is with so-called regular expressions, or RegExp.
Regular expressions allow you to match string patterns of varying complexity.
To give a simple example in Python:
import re
your_string = "ODQ1OTc3MzY0MDcyNDk3MTUy.YKoz0Q.wlST3vVZ3IN8nTtVX1tz8Vvq5O8"
regexp_pattern = r"(.+)\.(.+)\.(.+)"
print(re.findall(regexp_pattern, your_string))
# [('ODQ1OTc3MzY0MDcyNDk3MTUy', 'YKoz0Q', 'wlST3vVZ3IN8nTtVX1tz8Vvq5O8')]
Now, one problem with this is knowing where your string starts and stops. Most of the time there are certain anchors, especially in strings that were created programmatically. For instance, if we knew that each string you wanted to match is preceded by the word Token: , you could include that in your RegExp pattern: r"Token: (.+)\.(.+)\.(.+)".
Another way to avoid mismatches is to define the pattern requirements more precisely. Right now we simply match any number of characters, with two . characters separating them into three sequences.
If you know which base64 alphabet your implementation uses, you can restrict the allowed characters from . (any character) to that alphabet. As a placeholder, suppose the alphabet were abcdefgh1234; the pattern could then be refined to r"([abcdefgh1234]+)\.([abcdefgh1234]+)\.(.+)".
The same applies to the HMAC code.
Furthermore, you could specify the allowed length of each substring.
For instance, you said you have 18 random digits. This would likely mean each is encoded as 1 byte, which gives 18*8 = 144 bits, which in base64 translates to 24 characters (each encodes a sextet, thus 6 bits of information). The same can be done with the timestamp: assuming a 32-bit timestamp, it would likely need 6 base64 characters (representing 36 bits, because 32 is not a multiple of 6).
With this information, you could further refine the pattern:
r"([abcdefgh1234]{24})\.([abcdefgh1234]{6})\.(.+)"
In addition, the same could be applied to the HMAC code.
I leave it to you to read a bit about RegExp but I'd guess it is the easiest solution and certainly more appropriate than any kind of machine learning.
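As a hedged sketch of the refined pattern applied to the example string: the character class below assumes the URL-safe base64 alphabet (A-Z, a-z, 0-9, - and _), which is a guess based on the sample and not something the question states. The names TOKEN_RE and parse_token are made up for illustration.

```python
import base64
import re

# Assumption: URL-safe base64 alphabet; adjust to the alphabet your tokens really use.
B64 = r"[A-Za-z0-9_-]"
TOKEN_RE = re.compile(rf"({B64}{{24}})\.({B64}{{6}})\.({B64}+)")

def parse_token(text):
    """Return the three parts of a token, or None if the text does not match."""
    match = TOKEN_RE.fullmatch(text)
    return match.groups() if match else None

example = "ODQ1OTc3MzY0MDcyNDk3MTUy.YKoz0Q.wlST3vVZ3IN8nTtVX1tz8Vvq5O8"
parts = parse_token(example)

# 24 base64 characters decode to exactly 18 bytes, here the 18 ASCII digits.
user_id = base64.urlsafe_b64decode(parts[0])
```

Using fullmatch means the whole input must be a token; to find tokens inside longer text, search with the same pattern plus an anchor such as the "Token: " prefix mentioned above.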
I want to generate a random string of any fixed length (N) of my choice. Given the same number as a seed, the algorithm should generate the same string. With a small change to the number, like number+1, it should generate a completely different string (one that is difficult to relate to the previous seed). It's OK if more than one number results in the same string. Any approaches for doing this?
By the way, I have a set of characters that I want to appear in the string, like A-Z a-z 0-9.
For example
Algorithm(54893450,4,"ABCDEFG0") -> A0GF
Algorithm(54893451,4,"ABCDEFG0") -> BDCG
I could randomize each character one by one, but that would need N different seeds, one per character. If I do it this way, the question becomes "how to generate N numbers from one number" for the seeds.
The end goal is that I want to convert a GUID to something more readable on printed media and shorter. I don't care about conflict. (If the conflict did happen, I can still check the GUID for resolution)
OK, thanks for the guidance @Jim Mischel. I read all the related pages and came to understand more about this.
http://blog.mischel.com/2017/05/30/how-not-to-generate-unique-codes/
http://blog.mischel.com/2017/06/02/a-broken-unique-key-generator/
http://blog.mischel.com/2017/06/10/how-did-this-happen/
http://blog.mischel.com/2017/06/20/how-to-generate-random-looking-keys/
https://ericlippert.com/2013/11/12/math-from-scratch-part-thirteen-multiplicative-inverses/
https://ericlippert.com/2013/11/14/a-practical-use-of-multiplicative-inverses/
https://en.wikipedia.org/wiki/Extended_Euclidean_algorithm
In short, first I should use a sequential number, that is 1, 2, 3, 4, ... This is very predictable, but it can be turned into something that looks random and is hard to guess.
(Note that in my case this is not entirely possible, since each user will be generating their own ID locally, so I cannot run a global sequential number; hence I use a GUID. But I will make my own workaround to fit the GUID to this solution, probably with a simple modulo on the GUID to fit it into my desired range.)
With a sequential integer n I can get another, seemingly unrelated integer with a multiplication followed by a modulo. This looks like (n * x) % m with x and m of my choice. Of course m has to be larger than the largest number that I want to use, since the result wraps around at the modulus.
This alone is a good start, as close values of n do not produce similar outputs. But we cannot be sure of that for every choice. For example, if my x is 4 and m is 16, then the output can only be 0, 4, 8 or 12. To avoid this we choose x and m that are coprime (having a greatest common divisor of 1). There are many easy candidates, such as 100000 as m (which limits my output to at most 99999) and 2429 as x. If x and m are coprime, results not only collide as little as possible; every input in the range 0 to m-1 is guaranteed to produce a unique output.
We can learn from this example :
(n * 5) % 16
As 5 and 16 are coprime, we get a sequence of unique numbers of the maximum length (16) before it wraps around. If we input the numbers sequentially from 0 to 16:
Input : 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
Output : 0, 5, 10, 15, 4, 9, 14, 3, 8, 13, 2, 7, 12, 1, 6, 11, 0
We can see that the output follows a not-so-predictable sequence and that no output repeats except the last one, where the input has wrapped around. The sequence visits every available number.
Now my very predictable sequential number produces sufficiently different numbers and is also guaranteed not to collide with any other input as long as it is within the range of m. What's left is to convert this number to a string over my chosen characters via base conversion. If I have 5 characters "ABCDE" then I will use base 5.
This alone is enough for my use case. But with the concept of the multiplicative inverse I can also find one more integer y that reverses the multiply-modulo transformation back to the original number. I still haven't fully understood that part, but it uses the Extended Euclidean Algorithm to find y.
Since my application does not need reversing yet, I am fine with not understanding it for now. I will definitely try to understand that part.
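The two steps described so far (multiply then modulo, followed by base conversion) could be sketched like this. The function names scramble and to_alphabet, and the fixed output length, are assumptions made for illustration:

```python
def scramble(n, x=2429, m=100000):
    """Map n to a seemingly unrelated number; x and m must be coprime."""
    return (n * x) % m

def to_alphabet(value, alphabet, length):
    """Write value in base len(alphabet), left-padded to a fixed length."""
    base = len(alphabet)
    chars = []
    for _ in range(length):
        value, digit = divmod(value, base)
        chars.append(alphabet[digit])
    return "".join(reversed(chars))

# Reproduces the (n * 5) % 16 example from the text.
outputs = [scramble(n, x=5, m=16) for n in range(17)]

# 7 is 12 in base 5, i.e. digits 1 and 2, which map to "B" and "C".
code = to_alphabet(7, "ABCDE", 2)
```

If the value needs more digits than length allows, the high digits are silently dropped, so length should be chosen so that len(alphabet) ** length is at least m.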
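For reference, here is a small sketch of how y can be found with the Extended Euclidean Algorithm. The helper names are made up for illustration; on Python 3.8+ the built-in pow(x, -1, m) returns the same inverse.

```python
def extended_gcd(a, b):
    """Return (g, s) where g = gcd(a, b) and a * s is congruent to g modulo b."""
    old_r, r = a, b
    old_s, s = 1, 0
    while r != 0:
        q = old_r // r
        old_r, r = r, old_r - q * r
        old_s, s = s, old_s - q * s
    return old_r, old_s

def modular_inverse(x, m):
    """Find y such that (x * y) % m == 1; only exists when x and m are coprime."""
    g, s = extended_gcd(x, m)
    if g != 1:
        raise ValueError("x and m are not coprime, so no inverse exists")
    return s % m

y = modular_inverse(5, 16)   # 13, because 5 * 13 = 65 = 4 * 16 + 1
scrambled = (7 * 5) % 16     # forward transformation of n = 7
recovered = (scrambled * y) % 16
```

Multiplying the scrambled value by y modulo m undoes the original multiplication, which is why the mapping is reversible for every n below m.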
Is there any way to code a long string to a unique number (integer) and then decode this number to original string? (I mean to reduce size of long string)
The simple answer is no.
The complex answer is maybe.
What you are looking for is compression. Compression can reduce the size of the string, but there is no guarantee as to how small it can make it. In particular, you can never guarantee that it will fit into an integer of a certain size.
There are concepts like "hashing" which may help you do what you want depending on exactly what you are trying to do with this number.
Alternatively if you use the same string in a lot of different places then you can store it once and pass references/pointers to that single instance of the String around.
First you need to hash it to a string, e.g. with MD5. Then you convert the characters of the hash string into numbers according to their position in the alphabet.
Use case: a client needs to send a huge string over HTTP. The server replies whether the string contains some substring. However, the huge string is huge, so this system is really inefficient. Moreover, the huge string contains some sensitive info, so this is also really insecure.
Is there some pseudo-hashing mechanism that somehow summarizes a big string into some number, which all substrings of this big string would hash to the same number, but non-substrings will with high probability not hash to this big string?
Is there some pseudo-hashing mechanism that somehow summarizes a big string into some number, which all substrings of this big string would hash to the same number, but non-substrings will with high probability not hash to this big string?
No.
Let f be such a hash. Consider a string s and non-substring t. Note that s and t are substrings of s + t. Therefore, s and t have the same hash (i.e., f(s) = f(t) = f(s + t)). This is contrary to the requirement that f(s) != f(t) with high probability.
In particular, with s = "", we see that all strings t have f(s) = f(t), so that f is constant and equal to f("").
Is there some pseudo-hashing mechanism that somehow summarizes a big string into some number, which all substrings of this big string would hash to the same number, but non-substrings will with high probability not hash to this big string?
I guess I'll have to explain why this won't happen:
String string = "the quick brown fox jumps over the lazy dog";
That means, according to your request, that every single letter in this will hash to the same value. Hashing algorithms are deterministic. In this example, t -> 5, h -> 5, e -> 5... And so on, but if you have some string:
String string2 = "hello there";
Then now, you want h to hash to something different, and you want e to hash to something different, so given the exact same input, you want a different value. This defeats the definition of a mathematical function.
What does this mean?
Well, without any aspect of determinism in your function, your data has no repeatable mapping between a value and the letter that is being hashed, meaning your data is meaningless.
If you have a constant length for the substrings you could do what many file-sharing programs do and use a list of hashes or something like the tiger-tree hash.
List of hashes: Make a hash for every chunk of the file of some pre-set length (say 64kB), then transmit a list of these hashes so these chunks can be verified.
Tiger-Tree hash: http://en.wikipedia.org/wiki/Merkle_tree#Tiger_tree_hash
Basically build a binary tree of hashes with the leaves being hashes of chunks like in a list of hashes.
If you need to match to every possible substring instead of just pre-defined chunks this isn't going to work though.
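A rough sketch of the list-of-hashes idea, with a simple binary hash tree built on top. SHA-256 from hashlib stands in for the Tiger hash here, and the way an unpaired node is handled is a simplification, so this is not the actual Tiger-tree format; the function names are made up for illustration.

```python
import hashlib

def chunk_hashes(data: bytes, chunk_size: int = 64 * 1024) -> list:
    """Hash every fixed-size chunk of the data (the 'list of hashes')."""
    return [
        hashlib.sha256(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    ]

def tree_root(hashes: list) -> str:
    """Combine the chunk hashes pairwise until a single root hash remains."""
    if not hashes:
        return hashlib.sha256(b"").hexdigest()
    level = list(hashes)
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level), 2):
            pair = "".join(level[i:i + 2])  # an unpaired last node is hashed alone
            next_level.append(hashlib.sha256(pair.encode()).hexdigest())
        level = next_level
    return level[0]

data = b"a" * 200_000
hashes = chunk_hashes(data)
root = tree_root(hashes)
```

A receiver that knows the root (or the list) can verify any chunk by hashing it, but only chunks at the agreed boundaries, which is the limitation mentioned above.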
All substrings doesn't sound viable, but I imagine you may have some constraints on your substrings you haven't yet told us about.
If you do your substrings block-aligned or whitespace-aligned or something, you might look into using a bloom filter, EG: https://pypi.python.org/pypi/drs-bloom-filter/1.01 . Bloom Filters can store members of a set and be used for testing set membership, at times with as little as one bit per element. They do sometimes give false positives, but with a user-adjustable probability of a false positive.
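To illustrate the principle only, here is a tiny hand-rolled Bloom filter over whitespace-aligned words. It does not use the drs-bloom-filter package's API; the class, its parameters and the hashing scheme are all assumptions made for this sketch.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive several bit positions per item by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

text = "the quick brown fox jumps over the lazy dog"
bloom = BloomFilter()
for word in text.split():
    bloom.add(word)
```

The client would send only bloom.bits instead of the huge string. A lookup such as "fox" in bloom never gives a false negative, while a word that was never added is reported as present only with a small, tunable probability.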
I have a large set of names (millions in number). Each of them has a first name, an optional middle name, and a lastname. I need to encode these names into a number that uniquely represents the names. The encoding should be one-one, that is a name should be associated with only one number, and a number should be associated with only one name.
What is a smart way of encoding this? I know it is easy to tag each letter of the name according to its position in the alphabet (a -> 1, b -> 2, and so on), so a name like Deepa would become 455161, but then I cannot tell whether the '16' is really 16 or a combination of 1 and 6.
So, I am looking for a smart way of encoding the names.
Furthermore, the output number should have a fixed number of digits for any name, i.e., it should be independent of the length of the name. Is this possible?
Thanks
Abhishek S
To get the same width numbers, can't you just zero-pad on the left?
Some options:
1. Sort them. Count them. The 10th name is number 10.
2. Treat each character as a digit in a base-26 (case-insensitive, no digits), base-52 (case-sensitive, no digits), base-36 (case-insensitive with digits) or base-62 (case-sensitive with digits) number, and compute the value as an integer. E.g., for a name of "abc", you'd have 0 * 26^2 + 1 * 26^1 + 2 * 26^0. (Digits can matter: Chinese names are sometimes written with digits to indicate tonality.)
3. Use a "perfect hashing" scheme: http://en.wikipedia.org/wiki/Perfect_hash_function
4. This one's mostly suggested in fun: use Gödel numbering :). So "abc" would be 2^0 * 3^1 * 5^2, a product of powers of primes. Factoring the number gives you back the characters. The numbers could get quite large though.
5. Convert to ASCII, if you aren't already using it. Then treat the ordinal of each character as a digit in a base-256 numbering system. So "abc" is 97*256^2 + 98*256^1 + 99*256^0.
If you need to be able to update your list of names and numbers from time to time, #2, #4 and #5 should work. #1 and #3 would have problems. #5 is probably the most future-proofed, though you may find you need unicode at some point.
I believe you could do unicode as a variant of #5, using powers of 2^32 instead of 2^8 == 256.
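A small sketch of the positional-number option follows. It deviates from the example above in one way, which is an assumption of this sketch: letters are numbered from 1 and the blank takes 0 (so the base is 27), because with a = 0 a leading "a" would vanish and "abc" and "bc" would get the same number. The function names are made up for illustration.

```python
ALPHABET = " abcdefghijklmnopqrstuvwxyz"  # index 0 is the blank, a=1 ... z=26

def name_to_int(name: str) -> int:
    """Read the name as a base-27 number (case-insensitive)."""
    value = 0
    for ch in name.lower():
        value = value * len(ALPHABET) + ALPHABET.index(ch)
    return value

def int_to_name(value: int) -> str:
    """Invert name_to_int by peeling off base-27 digits."""
    chars = []
    while value > 0:
        value, digit = divmod(value, len(ALPHABET))
        chars.append(ALPHABET[digit])
    return "".join(reversed(chars))

number = name_to_int("deepa")
fixed_width = str(number).zfill(30)  # zero-pad on the left for a fixed digit count
```

The padding width of 30 is arbitrary; it has to be large enough for the longest name you expect, since the number grows by about 1.43 decimal digits per character.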
What you are trying to do there is actually hashing (at least if you have a fixed number of digits). There are some good hashing algorithms with few collisions. Try out sha1 for example, that one is well tested and available for modern languages (see http://en.wikipedia.org/wiki/Sha1) -- it seems to be good enough for git, so it might work for you.
There is of course a small possibility for identical hash values for two different names, but that's always the case with hashing and can be taken care of. With sha1 and such you won't have any obvious connection between names and IDs, which can be a good or a bad thing, depending on your problem.
If you really want unique IDs for sure, you will need to do something like NealB suggested: create the IDs yourself and connect names and IDs in a database (you could create them randomly and check for collisions, or increment them, starting at 0000000000001 or so).
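As a hedged illustration of the hashing route, the sketch below turns a name into a fixed-width decimal ID with SHA-1. The 13-digit width and the function name are assumptions; reducing the hash modulo 10**13 makes collisions far more likely than with the full 160-bit digest, so a collision check in the database is still needed.

```python
import hashlib

def name_to_id(name: str, digits: int = 13) -> str:
    """Derive a fixed-width decimal ID from the SHA-1 digest of the name."""
    digest = hashlib.sha1(name.encode("utf-8")).hexdigest()
    number = int(digest, 16) % (10 ** digits)  # truncation: not guaranteed unique
    return str(number).zfill(digits)

first = name_to_id("Deepa")
again = name_to_id("Deepa")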
(improved answer after giving it some thought and reading the first comments)
You can use the BigInteger for encoding arbitrary strings like this:
BigInteger bi = new BigInteger("some string".getBytes());
And for getting the string back use:
String str = new String(bi.toByteArray());
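The same round trip can be sketched in Python with int.from_bytes and int.to_bytes. UTF-8 is assumed as the encoding, and the function names are made up for illustration:

```python
def string_to_int(text: str) -> int:
    """Interpret the UTF-8 bytes of the string as one big-endian integer."""
    return int.from_bytes(text.encode("utf-8"), "big")

def int_to_string(value: int) -> str:
    """Convert the integer back to bytes and decode them."""
    length = (value.bit_length() + 7) // 8
    return value.to_bytes(length, "big").decode("utf-8")

number = string_to_int("some string")
text = int_to_string(number)
```

One caveat of this sketch: leading NUL characters are lost, because leading zero bytes do not change the integer's value. Also note that the number is as long as the string, so this encodes but does not shrink it.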
I've been looking for a solution to a problem very similar to the one you proposed and this is what I came up with:
def hash_string(value):
    score = 0
    depth = 1
    for char in value:
        # Earlier characters get larger weights than later ones.
        score += ord(char) * depth
        depth /= 256.
    return score
If you are unfamiliar with Python, here's what it does.
The score is initially set to 0 and the depth to 1.
For every character, add the ord value * the depth to the score.
The ord function returns the character's Unicode code point (0-255 for single-byte characters such as ASCII).
That value is multiplied by the 'depth'.
Finally, the depth is divided by 256.
Essentially, the way it works is that the initial characters add more to the score while later characters contribute less and less. If you need an integer, multiply the end score by 2**64. Otherwise you will have a decimal value between 0 and 256. This encoding scheme works for binary data as well, as there are only 256 possible values in a byte/char.
This method works well for short strings. For longer strings, however, the decimal value requires more precision than a regular 64-bit double can provide, so characters beyond the first few stop affecting the result. In Java you can use 'BigDecimal', and in Python the 'decimal' module, for added precision. A bonus of this method is that the returned values preserve the sort order of the strings, so they can be searched 'efficiently'.
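To make the precision point concrete, here is a sketch of the same scheme with exact arithmetic. It swaps in fractions.Fraction instead of the decimal module mentioned above, and it weights UTF-8 bytes instead of ord values so every "digit" stays below 256; both choices are assumptions of this sketch, not part of the original answer.

```python
from fractions import Fraction

def hash_string_exact(value: str) -> Fraction:
    """Same weighting as hash_string, but with exact rational arithmetic."""
    score = Fraction(0)
    depth = Fraction(1)
    for byte in value.encode("utf-8"):
        score += byte * depth
        depth /= 256
    return score

short = hash_string_exact("ab")  # 97 + 98/256
long_a = hash_string_exact("a" * 20 + "b")
long_b = hash_string_exact("a" * 20 + "c")
```

With floats, two long strings that differ only near the end can collapse to the same value; with Fraction they stay distinct, at the cost of numbers that grow with the string length.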
Take a look at https://en.wikipedia.org/wiki/Huffman_coding. That is the standard approach.
You can translate it directly if every character (plus the blank, at least) occupies a digit position. With 2*26 letters plus the blank, the base is 2*26+1 = 53.
Therefore ABC, which is 1,2,3, has to be translated to
1*(2*26+1)² + 2*(53) + 3 = 2809 + 106 + 3 = 2918
This way you can encode arbitrary strings, but if the length of the input isn't limited (and why would it be?), there is no guaranteed upper limit on the length of the output.