My client side code generates UUIDs and sends them to the server.
For example, '6ea140caa83b485f9c98ebaacfb536ce' would be a valid uuid4 to send back.
Is there any way to detect or prevent a user sending back a valid but "user generated" uuid4 like 'babebabebabe4abebabebabebabebabe'?
For example, one way to prevent a certain class of these would be to look at the number of occurrences of 0s and 1s in the binary representation of the number. This could work for a string like '00000000000040000000000000000000', but not for all strings.
It depends a little ...
There is no way to be entirely sure, but depending on the UUID version/subtype you are using there MIGHT be a way to detect at least some irregular values:
https://www.rfc-editor.org/rfc/rfc4122#section-4.1 defines the original version 1 of UUIDs, and a layout for the UUID fields ...
You could, for example, check whether the version and variant fields are valid ...
If your UUID generation actually uses version 1 you could, in addition to the first test of version and variant, test whether the timestamp is in a valid range ... for example, it is unlikely that the UUID in question was generated in the year 1600 ... or in the future.
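A minimal sketch of such checks with Python's uuid module (the accepted versions and the timestamp bounds are my own assumptions, not part of any spec):

import uuid
from datetime import datetime, timezone

def looks_plausible(s):
    # cheap sanity checks only; a determined user can still satisfy all of them
    try:
        u = uuid.UUID(s)
    except ValueError:
        return False
    if u.variant != uuid.RFC_4122 or u.version not in (1, 4):  # versions discussed above
        return False
    if u.version == 1:
        # u.time counts 100 ns intervals since 1582-10-15 (the RFC 4122 epoch)
        epoch = datetime(1582, 10, 15, tzinfo=timezone.utc).timestamp()
        ts = epoch + u.time / 1e7
        # assumed plausible window: after the year 2000, not in the future
        lower = datetime(2000, 1, 1, tzinfo=timezone.utc).timestamp()
        now = datetime.now(timezone.utc).timestamp()
        if not (lower <= ts <= now):
            return False
    return True

print(looks_plausible("babebabebabe4abebabebabebabebabe"))  # True: version and variant check out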
So tests like these can be applied to check whether the value actually makes sense or is complete gibberish ... but they cannot protect you against someone thinking: OK, let's analyze this and provide a manually chosen value that satisfies all conditions.
No, there is no way to reliably distinguish user-generated UUIDs from randomly generated ones.
To start with, a user-generated UUID may itself be partially random. But let's assume that it is not.
In that case you want to detect a pattern. However, although you give an example of a pattern, a pattern can be almost anything. For instance, the following byte array looks completely random, right?
40 09 21 fb 54 44 2d 18
But actually it is a nothing-up-my-sleeve number well known within the cryptographic community: it's simply an encoding of Pi (in this case as a 64-bit floating point, as I was somewhat lazy).
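You can verify this yourself in Python:

import struct

# the bytes above are exactly the IEEE 754 double-precision encoding of Pi
print(struct.pack(">d", 3.141592653589793).hex())  # 400921fb54442d18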
There are certainly randomness tests, for instance the FIPS random number tests. Those require a very large number of input bits to decide whether something passes or fails. Even then, a pass only shows that certain statistical properties expected of a random number generator have indeed been attained. The encoding of Pi might very well pass.
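To illustrate, here is a minimal sketch of the FIPS 140-2 monobit test in Python; note that it needs a 20,000-bit sample, so a single 128-bit UUID is far too short for it to say anything:

import os

def monobit_ok(sample):
    # FIPS 140-2 monobit test: in a 20,000-bit sample, the number of
    # ones must lie strictly between 9,725 and 10,275
    assert len(sample) * 8 == 20000
    ones = sum(bin(byte).count("1") for byte in sample)
    return 9725 < ones < 10275

print(monobit_ok(os.urandom(2500)))  # almost always True for a real RNG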
And annoyingly, it is perfectly possible for a random number generator to produce bit strings that do not look random at all, simply by chance. The smaller the bit string, the higher the chance that the generator produces something that doesn't look random at all. And UUIDs are not all that big.
So yes, of course you can do some tests, but you can never be sure: you will have both false positives and false negatives.
I have a string like this
ODQ1OTc3MzY0MDcyNDk3MTUy.YKoz0Q.wlST3vVZ3IN8nTtVX1tz8Vvq5O8
The first part of the string is a random 18-digit number in base64 format, the second is a Unix timestamp in base64 too, and the last is an HMAC.
I want to make a model to recognize a string like this.
How may I do it?
I haven't necessarily thought deeply about this, but here is what comes to mind first.
You certainly don't need machine learning for this. In fact, machine learning would not only be inefficient for problems like this but may even perform worse than an exact approach, depending on the method used.
Here, an exact solution can be achieved simply by understanding the problem.
One way people often go about matching strings with a certain structure is with so-called regular expressions (regexes).
Regular expressions allow you to match string patterns of varying complexity.
To give a simple example in Python:
import re
your_string = "ODQ1OTc3MzY0MDcyNDk3MTUy.YKoz0Q.wlST3vVZ3IN8nTtVX1tz8Vvq5O8"
# three capture groups separated by literal dots (note the escaped dots)
regexp_pattern = r"(.+)\.(.+)\.(.+)"
re.findall(regexp_pattern, your_string)
>>> [('ODQ1OTc3MzY0MDcyNDk3MTUy', 'YKoz0Q', 'wlST3vVZ3IN8nTtVX1tz8Vvq5O8')]
Now, one problem with this is knowing where your string starts and stops. Most of the time there are certain anchors, especially in strings that were created programmatically. For instance, if we knew that the word Token: precedes each string you want to match, you could include that in your regex pattern: r"Token: (.+)\.(.+)\.(.+)".
Another way to avoid mismatches is to define the pattern requirements more precisely. Right now we simply match any number of characters, with two . characters separating them into three sequences.
If you knew which base64 alphabet was being used, you could narrow the character class from . (which matches any character) to that alphabet. If, say, the alphabet were abcdefgh1234, the pattern could be refined to r"([abcdefgh1234]+)\.([abcdefgh1234]+)\.(.+)".
The same applies to the HMAC code.
Furthermore, you could specify the allowed length of each substring.
For instance, you said you have 18 random digits. This would likely mean each is encoded as 1 byte, which translates to 18 * 8 = 144 bits; in base64 that is 24 tokens (each token encodes a sextet, i.e. 6 bits of information). The same can be done with the timestamp: assuming a 32-bit timestamp, it would likely require 6 base64 tokens (representing 36 bits, because 32 cannot be divided evenly into sextets).
With this information, you could further refine the pattern:
r"([abcdefgh1234]{24})\.([abcdefgh1234]{6})\.(.+)"
In addition, the same length restriction could be applied to the HMAC part; see the sketch below.
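Putting it together with the real base64 alphabets (A-Za-z0-9+/ for standard, with - and _ in the URL-safe variant) and the lengths worked out above, a sketch that matches the original string could look like this (the HMAC length of 27 is simply read off the example):

import re

token = "ODQ1OTc3MzY0MDcyNDk3MTUy.YKoz0Q.wlST3vVZ3IN8nTtVX1tz8Vvq5O8"

# 24 base64 chars, a dot, 6 base64 chars, a dot, a 27-char HMAC part
pattern = r"([A-Za-z0-9+/]{24})\.([A-Za-z0-9+/]{6})\.([A-Za-z0-9+/\-_]{27})"
print(re.fullmatch(pattern, token) is not None)  # True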
I'll leave it to you to read a bit more about regular expressions, but I'd guess this is the easiest solution and certainly more appropriate than any kind of machine learning.
Let's say I have a large stream of data (for example, packets coming in from a network), and I want to determine whether this data contains a certain substring. There are multiple string-searching algorithms, but they require the algorithm to know the plain-text string it is searching for.
Let's say the string being sought is a password, and you do not want to store it in plain text in this search application. It would, however, appear in the stream as plain text. You could, for example, store the hash and length of the password; then, for every byte offset in the stream, check whether the next length bytes hash to the stored password hash. If they do, you have a probable match.
That way you can determine whether the password was in the stream without knowing the password. However, hashing once per byte is not fast or efficient.
Is there perhaps a clever algorithm that could find the plain-text password in the stream without directly knowing it (using some non-reversible equivalent instead)? Alternatively, could a low-quality version of the password be used, at the risk of false positives? For example, if the search application only knew half the password (in plain text), it could detect the full password, with some error, without knowing it.
thanks
P.S. This question comes from a hypothetical discussion I had with some friends about alerting you if your password is spotted in plain text on a network.
You could use a low-entropy rolling hash to pre-screen each byte so that, for the cost of lg k bits of entropy, you reduce the number of invocations of the cryptographic hash by a factor of k.
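Here is a sketch of that idea with a Rabin-Karp style rolling hash (the parameters and names are mine): the weak hash is updated in O(1) per byte, and the expensive SHA-256 comparison only runs on weak-hash hits. Note that storing the weak hash leaks roughly 31 bits of information about the password here, which is the entropy cost mentioned above.

import hashlib

BASE, MOD = 257, 2**31 - 1  # parameters of the weak rolling hash

def fingerprint(password):
    # what the detector stores instead of the plain-text password
    weak = 0
    for byte in password:
        weak = (weak * BASE + byte) % MOD
    return weak, hashlib.sha256(password).digest(), len(password)

def scan(stream, weak, strong, n):
    drop = pow(BASE, n - 1, MOD)  # weight of the byte leaving the window
    h, hits = 0, []
    for i, byte in enumerate(stream):
        if i >= n:
            h = (h - stream[i - n] * drop) % MOD  # slide the window
        h = (h * BASE + byte) % MOD
        if i >= n - 1 and h == weak:  # cheap pre-screen...
            window = stream[i - n + 1 : i + 1]
            if hashlib.sha256(window).digest() == strong:  # ...expensive confirm
                hits.append(i - n + 1)
    return hits

print(scan(b"xxhunter2xx", *fingerprint(b"hunter2")))  # [2]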
SAT is an NP-hard problem. Suppose your password is n characters long. If you could find a way to make a large enough SAT instance that
used a contiguous sequence of m >= n bytes from the data stream as its 8m input bits, and
produced the output 1 if and only if the bits present at its inputs contain your password starting at an offset that is some multiple of 8 bits
then by "operating" this SAT instance as a circuit, you would have a password detector that is (at least potentially) very difficult to "invert".
In some ways, what you want is the opposite of Boolean logic minimisation. You want the biggest, hairiest circuit (ideally for some theoretically justified notions of size and hairiness :) ) that computes the truth table. It's easy enough to come up with truth-table-preserving ways to grow the original CNF propositional logic formula -- e.g., if you have two clauses A and B, you can always safely add a new clause consisting of all the literals in either A or B, since that clause is implied by each of A and B (see the sketch below) -- but it's probably much harder to grow the formula in ways that will confuse a modern SAT solver, since a lot of research has gone into making these programs super-efficient at detecting and exploiting all kinds of structure in the problem.
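A toy illustration of that clause-growing trick (the representation of literals as signed integers is my own choice):

import random

def grow_cnf(clauses, extra):
    # the union of two clauses is implied by each of them, so adding it
    # preserves the truth table while making the formula larger
    clauses = [frozenset(c) for c in clauses]
    for _ in range(extra):
        a, b = random.sample(clauses, 2)
        clauses.append(a | b)
    return clauses

# literals as signed ints: (x1 OR NOT x2) AND (x2 OR x3)
print(grow_cnf([{1, -2}, {2, 3}], extra=3))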
One possible avenue for injecting "complications" is to make the circuit compute functions that are difficult for circuits to compute, like divisions or square roots, and then test the results of these for equality in addition to the raw inputs. E.g., instead of making the circuit merely test that X[1 .. 8n] = YOUR_PASSWORD, make it test that X[1 .. 8n] = YOUR_PASSWORD AND sqrt(X[1 .. 8n]) = sqrt(YOUR_PASSWORD). If a SAT solver is smart enough to "see" that the first test implies the second then it can immediately dispense with all the clauses corresponding to the second -- but since everything is represented at a very low level with propositional clauses, this relationship is (I hope; as I said, modern SAT solvers are pretty amazing) well obscured. My guess is that it's better to choose functions like sqrt() that are not one-to-one on integers: this will potentially cause a SAT solver to waste time exploring seemingly promising (but ultimately incorrect) solutions.
I have a problem that I've been trying to solve for a few days now, but it seems I'm stuck!
The problem is: given a piece of data, I need to generate an output that has to be:
Impossible for others to reproduce.
Unique.
Short (since it will be used in a URL)
So I've decided to sign the given data with a private key and then hash the result; this way I believe the first two properties are covered.
Now, to get the final result as short as possible, I started reading and came across THIS, which talks about truncating a hash, and I quote:
As far as truncating a hash goes, that's fine. It's explicitly
endorsed by the NIST, and there are hash functions in the SHA-2 family
that are simple truncated variants of their full brethren:
SHA-256/224, SHA-512/224, SHA-512/256, and SHA-512/384, where SHA-x/y
denotes a full-length SHA-x truncated to y bits.
After reading a bit more about this, I decided I would use SHA-256 and truncate it to 128 bits.
Since it has to be used in a URL, I encoded the final result in Base62 (Base64 uses characters such as +, / and =, which have special meaning in a URL).
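For reference, a minimal sketch of the truncate-then-Base62 step (the alphabet ordering is arbitrary, and the signed payload is a stand-in):

import hashlib

ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

def base62(n):
    digits = ""
    while n:
        n, r = divmod(n, 62)
        digits = ALPHABET[r] + digits
    return digits or "0"

signed_data = b"..."  # stand-in for the signed payload
truncated = hashlib.sha256(signed_data).digest()[:16]  # keep the first 128 bits
print(base62(int.from_bytes(truncated, "big")))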
The final result was a string of 24 characters, which I thought was good, but it seems it's still not good enough.
I need to get it even shorter (around 10 characters). I've been at this for a few days and I'm starting to run out of ideas.
Does anyone have any suggestions? Is there something i can try?
Thank you very much!
Here's what I am going to do to obfuscate database IDs in permalinks:
1) XOR the ID with a lengthy secret key
2) Scramble (rotate, flip, reverse) the bits of the XOR'ed integer a little, in a reversible way
3) Base62-encode the resulting integer with my own secret scrambled sequence of all alphanumeric characters (A-Za-z0-9)
How difficult would it be to convert my Base 62 encoding back to base 10?
Also, how difficult is it to reverse-engineer the whole process (obviously without taking a peek at the source or compiled code)? I know 'only XOR' is pretty susceptible to basic analysis.
EDIT: the result should be no more than 8-9 characters long; 3DES and AES seem to produce very long ciphertexts and can't practically be used in URLs.
Resulting strings look something like:
In [2]: for i in range(1, 11):
   ...:     print code(i)
   ...:
9fYgiSHq
MdKx0tZu
vjd0Dipm
6dDakK9x
Ph7DYBzp
sfRUFaRt
jkmg0hl
dBbX9nHk4
ifqBZwLW
WdaQE630
As you can see, 1 looks nothing like 2, so this seems to work great for obfuscating IDs.
If the attacker is allowed to play around with the input, it will be trivial for a skilled attacker to "decrypt" the data. A crucial property of modern cryptosystems is the "avalanche effect", which your system lacks: basically, every bit of the output depends on every bit of the input.
If an attacker of your system is allowed to see that, for example, id = 1000 produces the output "AAAAAA", id = 1001 produces "ABAAAA" and id = 1002 produces "ACAAAA", the algorithm can easily be reversed and the value of the key obtained.
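A quick sketch of why the XOR step alone has no avalanche: flipping one input bit flips exactly one output bit, whereas a cryptographic hash flips about half of them (the key value below is made up):

import hashlib

KEY = 0x1F2E3D4C5B6A7988  # stand-in for the "lengthy secret key"

def bit_diff(a, b):
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

a, b = 1000, 1001  # inputs differing in a single bit
print(bit_diff((a ^ KEY).to_bytes(8, "big"),
               (b ^ KEY).to_bytes(8, "big")))  # 1: the difference survives XOR unchanged
print(bit_diff(hashlib.sha256(a.to_bytes(8, "big")).digest(),
               hashlib.sha256(b.to_bytes(8, "big")).digest()))  # ~128 of 256 bits differ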
That said, this question is a better fit for https://security.stackexchange.com/ or https://crypto.stackexchange.com/
The standard advice for anyone trying to develop their own cryptography is, "Don't". The advanced advice is to read Bruce Schneier's Memo to the Amateur Cipher Designer and then don't.
You are not the first person to need to obfuscate IDs, so there are already methods available. #CodesInChaos suggested a good method above; you should try that first to see if it meets your needs.
I'm currently using SHA-1 to somewhat shorten a URL:
Digest::SHA1.hexdigest("salt-" + url)
How safe is it to use only the first 8 characters of the SHA-1 as a unique identifier, as GitHub apparently does for commits?
To calculate the probability of a collision for a given length and number of hashes, see the birthday problem. I don't know how many hashes you are going to have, but here are some examples: 8 hexadecimal characters is 32 bits, so for about 100 hashes the probability of a collision is about 1/1,000,000; for 10,000 hashes it's about 1/100; for 100,000 it's about 3/4, and so on.
See the table in the Birthday attack article on Wikipedia to find a hash length that satisfies your needs. For example, if you want the collision probability to stay below 1/1,000,000,000 for a set of more than 100,000 hashes, use 64 bits, or 16 hexadecimal digits.
It all depends on how many hashes you are going to have and what probability of a collision you are willing to accept (there is always some probability, even if insanely small).
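If you want to plug in your own numbers, the usual birthday approximation is p ≈ 1 - e^(-n²/2^(b+1)) for n hashes of b bits:

import math

def collision_probability(n, bits):
    # birthday approximation: p = 1 - exp(-n^2 / 2^(bits+1))
    return 1 - math.exp(-n * n / 2 ** (bits + 1))

print(collision_probability(100, 32))      # ~1e-6
print(collision_probability(10_000, 32))   # ~0.01
print(collision_probability(100_000, 32))  # ~0.69
print(collision_probability(100_000, 64))  # ~2.7e-10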
If you're talking about a SHA-1 in hexadecimal, then you're only getting 4 bits per character, for a total of 32 bits. By the birthday bound, you can expect a collision after roughly the square root of that many values, i.e. around 2^16 = 65,536 shortened URLs. If your URL shortener gets used much, it probably won't take terribly long before you start to see collisions.
As for alternatives, the most obvious is probably to just maintain a counter. Since you need to store a table of URLs anyway to translate your shortened URL back to the original, you basically just store each new URL in your table. If it is already present, you give out its existing number; otherwise, you insert it and give it a new number. Either way, you give that number to the user.
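A sketch of the counter approach (in Python for illustration; the names are mine):

class Shortener:
    def __init__(self):
        self.ids = {}   # url -> number
        self.urls = []  # number -> url

    def shorten(self, url):
        if url not in self.ids:       # new URL: insert it and assign the next number
            self.ids[url] = len(self.urls)
            self.urls.append(url)
        return self.ids[url]          # existing URL: reuse its number

    def expand(self, number):
        return self.urls[number]

s = Shortener()
print(s.shorten("https://example.com/a"))  # 0
print(s.shorten("https://example.com/b"))  # 1
print(s.shorten("https://example.com/a"))  # 0 again: already present
print(s.expand(1))                         # https://example.com/b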
It depends on what you are trying to accomplish. The output of SHA1 is effectively random with regards to the input (the output of a good hash function changes in half of its bits based on a one-bit change in the input, and SHA1, while not perfect, is pretty good), and by taking a 32-bit (assuming 8 hex digits) subset of the 160-bit output, you reduce the output space from 2^160 to 2^32 values. All things being equal, which they never are, this would significantly reduce the difficulty of finding a collision.
However, if the hash function's input must be a valid URL, that significantly reduces the number of possible inputs. #rsp points out the birthday problem, but given this, I'm not sure exactly how applicable it is, at least in its simple form. Also, it largely assumes that there are no other precautions in place.
I would be more interested in why you are doing this. Is this about URLs that the user will need to remember and type? If so, tacking on a bunch of random hexadecimal digits is probably a bad idea. Is it a URL or URL parameter that will just be passed around programmatically? Then, I wouldn't care much about length. Either way, there are probably better ways to do what you are trying to accomplish.
If you use the binary output of SHA-1 and Base64-encode the result, you will get much higher information density per character; you can have the same 8-character names, but rather than only 16^8 (2^32) possibilities, you'll have 64^8 (2^48) possibilities.
Given that the number of inputs needed to reach a 50% collision probability scales as 1.177*sqrt(N), a Base64-style encoding will require 256 times more inputs than the hex output before reaching that 50% chance of collision.
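To see the difference in character density (Python for illustration):

import base64, hashlib

digest = hashlib.sha1(b"https://example.com/some/long/path").digest()
print(digest.hex()[:8])                               # 8 hex chars carry 32 bits
print(base64.urlsafe_b64encode(digest)[:8].decode())  # 8 base64 chars carry 48 bits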