I know that key sizes in ECDH depend on the size of the elliptic curve.
If it is a 256-bit curve (secp256k1), the keys will be:
Public: 32 bytes * 2 + 1 = 65 (uncompressed)
Private: 32 bytes
384-bit curve (secp384r1):
Public: 48 bytes * 2 + 1 = 97 (uncompressed)
Private: 48 bytes
But with the 521-bit curve (secp521r1) the situation is very strange:
Public: 66 bytes * 2 + 1 = 133 (uncompressed)
Private: 66 bytes or 65 bytes.
I used the node.js crypto module to generate these keys.
Why is the private key value of the 521-bit curve variable in size?
The private keys of the other curves are variable in size as well; they are just less likely to exhibit this variance when encoded to bytes.
The public key is encoded as two statically sized integers, prefixed with the uncompressed point indicator 04. Each integer is the same size as the key size in bytes.
The private key doesn't really have a pre-established encoding. It is a single random value (or vector) within the range 1..N-1, where N is the order of the curve. If you encode this value as a variable-sized unsigned number, then it will usually be the same size as the key in bytes. However, it may by chance be one byte smaller, or two, or three, or more. Of course, the chance that it is much smaller is pretty low.
Now the 521-bit key is a bit strange, in that the most significant byte of the order doesn't have its high bit set to 1; it only has its least significant bit set to 1 (the top byte is just 01). This means that there is a much higher chance that the most significant byte of the private value (usually called s) is zero, making its minimal encoding a byte shorter.
The exact chance of course depends on the full value of the order:
01FF FFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFF
FFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFA
51868783 BF2F966B 7FCC0148 F709A5D0
3BB5C9B8 899C47AE BB6FB71E 91386409
but as you may guess it is pretty close to 1 out of 2 because there are many bits set to 1 afterwards. The chance that two bytes are missing is of course 1 out of 512, and three bytes 1 out of 131072 (etc.).
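This distribution is easy to check empirically. Here is a minimal Python sketch (standard library only, simulating key generation by drawing uniform values below the order quoted above) that tallies the minimal encoded length of the private value:

```python
import secrets

# Order N of secp521r1, as quoted above
N = int(
    "01FF" + "FF" * 31 + "FA"
    + "51868783BF2F966B7FCC0148F709A5D0"
    + "3BB5C9B8899C47AEBB6FB71E91386409",
    16,
)

counts = {}
for _ in range(10_000):
    s = secrets.randbelow(N - 1) + 1      # private value in 1..N-1
    length = (s.bit_length() + 7) // 8    # minimal unsigned encoding length
    counts[length] = counts.get(length, 0) + 1

# Roughly half of the draws need 66 bytes; almost all of the rest fit in 65
print(sorted(counts.items(), reverse=True))
```

Running this shows the roughly 50/50 split between 66- and 65-byte encodings, with 64-byte values turning up at about the predicted 1-in-512 rate.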
Note that ECDSA signature sizes may fluctuate as well. The X9.62 signature scheme uses two DER-encoded signed integers. The fact that they are signed may introduce a leading byte set to all zeros: if the most significant bit of the most significant byte is set to 1, a 00 byte must be prepended, because otherwise the value would be interpreted as negative. The fact that the signature consists of two numbers, r and s, and that the size of the DER encoding itself also depends on the size of the encoded integers, makes the size of the full encoding rather hard to predict.
Another less common (flat) encoding of an ECDSA signature uses the same statically sized integers as the public key, in which case it is just twice the size of the order N in bytes.
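As a sketch of why the DER size varies, here is a hypothetical Python encoder for the two-integer signature structure. It handles short-form lengths only, so it is suitable for the small demonstration values below, not for full-size signature bodies over 127 bytes:

```python
def der_int(value):
    # Minimal big-endian encoding of a non-negative INTEGER
    length = (value.bit_length() + 7) // 8 or 1
    body = value.to_bytes(length, "big")
    if body[0] & 0x80:          # would read as negative, so prepend 00
        body = b"\x00" + body
    return bytes([0x02, len(body)]) + body

def der_ecdsa_signature(r, s):
    body = der_int(r) + der_int(s)
    return bytes([0x30, len(body)]) + body   # SEQUENCE of r and s

# Setting the top bit of a byte costs an extra 00 byte per integer:
print(len(der_ecdsa_signature(0x7F, 0x7F)))  # 8
print(len(der_ecdsa_signature(0x80, 0x80)))  # 10
```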
ECDH doesn't have this issue. Commonly the shared secret is the statically encoded X coordinate of the point that is the result of the ECDH calculation - or at least a value that is derived from it using a Key Derivation Function (KDF).
I am implementing an RSA and AES file encryption program. So far, I have RSA and AES implemented. What I wish to understand, however, is: if my AES implementation uses a 16-byte key (obtained by os.urandom(16)), how could I get an integer value from this to encrypt with RSA?
In essence, if I have a byte string like
b',\x84\x9f\xfc\xdd\xa8A\xa7\xcb\x07v\xc9`\xefu\x81'
How could I obtain an integer from this byte string (AES key) which could subsequently be used for encryption using (RSA)?
Flow of encryption
Encrypt file (AES Key) -> Encrypt AES key (using RSA)
TL;DR: use int.from_bytes and int.to_bytes to implement OS2IP and I2OSP respectively.
For secure encryption, you don't directly turn the AES key into a number. This is because raw RSA is inherently insecure in many, many ways.
First you need to random-pad your key bytes to obtain a byte array that will represent a number close to the modulus. Then you can perform the byte array conversion to a number, and only then should you perform modular exponentiation. Modular exponentiation will also result in a number, and you need to turn that number into a statically sized byte array with the same size as the modulus (in bytes).
All this is standardized in the PKCS#1 RSA standard. In v2.2 there are two schemes specified, known as PKCS#1 v1.5 padding and OAEP padding. The first one is pretty easy to implement, but is more vulnerable to padding oracle attacks. OAEP is also vulnerable, but less so. You will however need to follow the implementation hints to the detail, especially during unpadding.
To circle back to your question, the number conversions are called the octet string to integer primitive (OS2IP) and the integer to octet string primitive (I2OSP). These are not mathematical operations that you need to perform yourself: they just describe how to encode the number as a statically sized, big-endian, unsigned integer.
Say that keysize is the key size (modulus size) in bits and em is the bytes or bytearray representing the padded key, then you'd just perform:
m = int.from_bytes(em, byteorder='big', signed=False)
for OS2IP where m will be the input for modular exponentiation and back using:
k = (keysize + 8 - 1) // 8
em = m.to_bytes(k, byteorder='big', signed=False)
for I2OSP.
And you will have to perform the same two operations for decryption...
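Putting the two snippets together, a round trip might look like the following sketch, where keysize and em are illustrative stand-ins rather than values from any real key:

```python
keysize = 2048                 # assumed modulus size in bits
em = bytes(range(1, 17))       # stand-in for the padded message bytes

m = int.from_bytes(em, byteorder='big', signed=False)    # OS2IP
k = (keysize + 8 - 1) // 8                               # modulus size in bytes
out = m.to_bytes(k, byteorder='big', signed=False)       # I2OSP

# Same value, now left-padded with zeros to exactly k bytes
assert out.lstrip(b'\x00') == em
```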
To literally interpret the byte string as an integer (which you should be able to do; Python integers can get arbitrarily large), you could just sum up the values, each shifted by the appropriate number of bits:
bytestr = b',\x84\x9f\xfc\xdd\xa8A\xa7\xcb\x07v\xc9`\xefu\x81'
le_int = sum(v << (8 * i) for i, v in enumerate(bytestr))
# = sum([44, 33792, 10420224, 4227858432, 949187772416, 184717953466368, 18295873486192640, 12033618204333965312, 3744689046963038978048, 33056565380087516495872, 142653246714526242615328768, 62206486974090358813680992256, 7605903601369376408980219232256, 4847495895272749231323393057357824, 607498732448574832538068070518751232, 171470411456254147604591110776164450304])
# = 172082765352850773589848323076011295788
That would be a little-endian interpretation; a big-endian interpretation would just start reading from the other side, which you could do with reversed():
be_int = sum(v << (8 * i) for i, v in enumerate(reversed(bytestr)))
# = sum([129, 29952, 15663104, 1610612736, 863288426496, 129742372077568, 1970324836974592, 14627691589699371008, 3080606260309495119872, 306953821386526938890240, 203099537695257701350637568, 68396187170517260188176613376, 19965496953594613073573075484672, 3224903126980615597407612954476544, 685383185326597246966025515457052672, 58486031814536298407767510652335161344])
# = 59174659937086426622213974601606591873
I have been working on string encoding schemes, and while examining how UTF-16 works, I have a question. Why use complex surrogate pairs to represent a 21-bit code point? Why not simply store some of the bits in the first code unit and the remaining bits in the second code unit? Am I missing something? Is there a problem with storing the bits directly, like we do in UTF-8?
Example of what I am thinking of:
The character '🙃'
Corresponding code point: 128579 (decimal)
The binary form: 1 1111 0110 0100 0011 (17 bits)
It's a 17-bit code point.
Based on the UTF-8 scheme, it will be represented as:
240 : 11110 000
159 : 10 011111
153 : 10 011001
131 : 10 000011
In UTF-16, why not do something that looks like this, rather than using surrogate pairs:
49159 : 110 0 0000 0000 0111
30275 : 01 11 0110 0100 0011
Proposed alternative to UTF-16
I think you're proposing an alternative format using 16-bit code units analogous to the UTF-8 coding scheme; let's designate it UTF-EMF-16.
In your UTF-EMF-16 scheme, code points from U+0000 to U+7FFF would be encoded as a single 16-bit unit with the MSB (most significant bit) always zero. Then, you'd reserve 16-bit units with the 2 most significant bits set to 10 as 'continuation units', with 14 bits of payload data. And then you'd encode code points from U+8000 to U+10FFFF (the current maximum Unicode code point) in 16-bit units with the three most significant bits set to 110 and up to 13 bits of payload data. With Unicode as currently defined (U+0000 .. U+10FFFF), you'd never need more than 7 of the 13 bits set.
U+0000 .. U+7FFF: one 16-bit unit, values 0x0000 .. 0x7FFF
U+8000 .. U+10FFFF: two 16-bit units:
1. First unit 0xC000 .. 0xC043
2. Second unit 0x8000 .. 0xBFFF
For your example code point, U+1F643 (binary: 1 1111 0110 0100 0011):
First unit: 1100 0000 0000 0111 = 0xC007
Second unit: 1011 0110 0100 0011 = 0xB643
The second unit differs from your example in reversing the two most significant bits, from 01 in your example to 10 in mine.
Why wasn't such a scheme used in UTF-16
Such a scheme could be made to work. It is unambiguous. It could accommodate many more characters than Unicode currently allows. UTF-8 could be modified to become UTF-EMF-8 so that it could handle the same extended range, with some characters needing 5 bytes instead of the current maximum of 4 bytes. UTF-EMF-8 with 5 bytes would encode up to 26 bits; UTF-EMF-16 could encode 27 bits, but should be limited to 26 bits (roughly 64 million code points, instead of just over 1 million). So, why wasn't it, or something very similar, adopted?
The answer is the very common one: history (plus backwards compatibility).
When Unicode was first defined, it was hoped or believed that a 16-bit code set would be sufficient. The UCS2 encoding was developed using 16-bit values, and many values in the range 0x8000 .. 0xFFFF were given meanings. For example, U+FEFF is the byte order mark.
When the Unicode scheme had to be extended to make Unicode into a bigger code set, there were many defined characters with the 10 and 110 bit patterns in the most significant bits, so backwards compatibility meant that the UTF-EMF-16 scheme outlined above could not be used for UTF-16 without breaking compatibility with UCS2, which would have been a serious problem.
Consequently, the standardizers chose an alternative scheme, where there are high surrogates and low surrogates.
0xD800 .. 0xDBFF High surrogates (most significant bits of 21-bit value)
0xDC00 .. 0xDFFF Low surrogates (least significant bits of 21-bit value)
The low surrogate range provides storage for 10 bits of data (the prefix 1101 11 uses 6 of the 16 bits). The high surrogate range also provides storage for 10 bits of data (the prefix 1101 10 also uses 6 of the 16 bits). But because the BMP (Basic Multilingual Plane, U+0000 .. U+FFFF) doesn't need to be encoded with two 16-bit units, the UTF-16 encoding subtracts 1 from the high-order data, and can therefore be used to encode U+10000 .. U+10FFFF. (Note that although Unicode is a 21-bit encoding, not all 21-bit (unsigned) numbers are valid Unicode code points. Values from 0x110000 .. 0x1FFFFF are 21-bit numbers but are not part of Unicode.)
From the Unicode FAQ on UTF-8, UTF-16, UTF-32 & BOM:
Q: What's the algorithm to convert from UTF-16 to character codes?
A: The Unicode Standard used to contain a short algorithm, now there is just a bit distribution table. Here are three short code snippets that translate the information from the bit distribution table into C code that will convert to and from UTF-16.
Using the following type definitions
#include <stdint.h>
typedef uint16_t UTF16;
typedef uint32_t UTF32;
the first snippet calculates the high (or leading) surrogate from a character code C.
const UTF16 HI_SURROGATE_START = 0xD800;
UTF16 X = (UTF16) C;
UTF32 U = (C >> 16) & ((1 << 5) - 1);
UTF16 W = (UTF16) U - 1;
UTF16 HiSurrogate = HI_SURROGATE_START | (W << 6) | X >> 10;
where X, U and W correspond to the labels used in Table 3-5 UTF-16 Bit Distribution. The next snippet does the same for the low surrogate.
const UTF16 LO_SURROGATE_START = 0xDC00;
UTF16 X = (UTF16) C;
UTF16 LoSurrogate = (UTF16) (LO_SURROGATE_START | X & ((1 << 10) - 1));
Finally, the reverse, where hi and lo are the high and low surrogate, and C the resulting character
UTF32 X = (hi & ((1 << 6) -1)) << 10 | lo & ((1 << 10) -1);
UTF32 W = (hi >> 6) & ((1 << 5) - 1);
UTF32 U = W + 1;
UTF32 C = U << 16 | X;
A caller would need to ensure that C, hi, and lo are in the appropriate ranges.
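The same bit distribution can be sketched in Python. This version uses the equivalent subtract-0x10000 formulation rather than the FAQ's W = U - 1 trick:

```python
def to_surrogates(cp):
    """Encode a supplementary code point (U+10000..U+10FFFF) as a surrogate pair."""
    assert 0x10000 <= cp <= 0x10FFFF
    v = cp - 0x10000                 # 20-bit value
    hi = 0xD800 | (v >> 10)          # top 10 bits
    lo = 0xDC00 | (v & 0x3FF)        # bottom 10 bits
    return hi, lo

def from_surrogates(hi, lo):
    return 0x10000 + (((hi & 0x3FF) << 10) | (lo & 0x3FF))

hi, lo = to_surrogates(0x1F643)
print(hex(hi), hex(lo))              # 0xd83d 0xde43
assert from_surrogates(hi, lo) == 0x1F643
```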
Page 11 of RFC 2898 states that for U_1 = PRF (P, S || INT (i)), INT (i) is a four-octet encoding of the integer i, most significant octet first.
Does that mean that i is a signed value, and if so, what happens on overflow?
Nothing says that it would be signed. The fact that dkLen is capped at (2^32 - 1) * hLen suggests that it's an unsigned integer, and that it cannot roll over from 0xFFFFFFFF (2^32 - 1) to 0x00000000.
Of course, with PBKDF2(MD5) the counter wouldn't hit 2^31 until you've asked for 34,359,738,368 bytes. That's an awful lot of bytes. For the other common hashes:
SHA-1: 42,949,672,960
SHA-2-256 / SHA-3-256: 68,719,476,736
SHA-2-384 / SHA-3-384: 103,079,215,104
SHA-2-512 / SHA-3-512: 137,438,953,472
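Those thresholds are simply 2^31 blocks times the hash output length hLen; a quick check:

```python
# Bytes produced by the time the block counter reaches 2^31,
# for each hash output length hLen
for name, h_len in [("MD5", 16), ("SHA-1", 20), ("SHA-256", 32),
                    ("SHA-384", 48), ("SHA-512", 64)]:
    print(f"{name}: {2**31 * h_len:,}")
```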
Since the .NET implementation (Rfc2898DeriveBytes) is an iterative stream, it could be polled past the 32 GB mark via a (long) series of calls. Most platforms expose PBKDF2 as a one-shot, so you'd need to give them a memory range of 32 GB (or more) to find out whether they have an error that far out. So even if most platforms get the sign bit wrong... it doesn't really matter.
PBKDF2 is a KDF (key derivation function), so it is used for deriving keys. An AES-256 key is 32 bytes, or 48 if you use the same PBKDF2 stream to also generate an IV (which you really shouldn't). Generating a private key for an ECC curve over a 34,093-digit prime takes (if I did my math right) 14,157 bytes. Well below the 32 GB mark.
i ranges from 1 to l = CEIL (dkLen / hLen), and dkLen and hLen are positive integers. Therefore, i is strictly positive.
You can, however, store i in a signed, 32-bit integer type without any special handling. If i overflows (increments from 0x7FFFFFFF to 0x80000000), it will continue to be encoded correctly, and will continue to increment correctly. With two's-complement encoding, the bitwise results of addition, subtraction, and multiplication are the same as long as all values are treated consistently as either signed or unsigned.
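A quick Python check of that claim, encoding the same 32-bit pattern as both unsigned and signed:

```python
import struct

i = 0x80000000                                # first value past the signed 32-bit maximum
unsigned = struct.pack(">I", i)               # INT(i): four octets, most significant first
as_signed = struct.unpack(">i", unsigned)[0]  # same bits read as a signed value
resigned = struct.pack(">i", as_signed)       # written back out as signed

# Identical octets either way
assert resigned == unsigned == b"\x80\x00\x00\x00"
```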
Going through Elisabeth Hendrickson's test heuristics cheat sheet, I see the following recommendations:
Numbers: 32768 (2^15), 32769 (2^15 + 1), 65536 (2^16), 65537 (2^16 + 1), 2147483648 (2^31), 2147483649 (2^31 + 1), 4294967296 (2^32), 4294967297 (2^32 + 1)
Does someone know the reason for testing all these cases? My gut feeling goes with the data type the developer may have used (integer, long, double...).
Similarly, with strings:
Long (255, 256, 257, 1000, 1024, 2000, 2048 or more characters)
These represent boundaries
Integers
2^15 is at the bounds of signed 16-bit integers
2^16 is at the bounds of unsigned 16-bit integers
2^31 is at the bounds of signed 32-bit integers
2^32 is at the bounds of unsigned 32-bit integers
Testing for values close to common boundaries tests whether overflow is correctly handled (either arithmetic overflow in the case of various integer types, or buffer overflow in the case of long strings that might potentially overflow a buffer).
Strings
255/256 is at the bounds of numbers that can be represented in 8 bits
1024 is at the bounds of numbers that can be represented in 10 bits
2048 is at the bounds of numbers that can be represented in 11 bits
I suspect that the recommendations such as 255, 256, 1000, 1024, 2000, 2048 are based on experience/observation that some developers may allocate a fixed-size buffer that they feel is "big enough no matter what" and fail to check input. That attitude leads to buffer overflow attacks.
These are boundary values close to the maximum signed short, the maximum unsigned short, and the same for int. The reason to test them is to find bugs that occur close to the border values of typical data types.
E.g. your code uses signed short and you have a test that exercises something just below and just above the maximum value of such type. If the first test passes and the second one fails, you can easily tell that overflow/truncation on short was the reason.
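A small Python sketch of that truncation; masking to 16 bits and reinterpreting as two's complement is effectively what storing the value into a short does:

```python
def to_int16(v):
    """Interpret the low 16 bits of v as a signed 16-bit two's-complement value."""
    v &= 0xFFFF
    return v - 0x10000 if v >= 0x8000 else v

print(to_int16(32767))   # 32767: the last value that fits
print(to_int16(32768))   # -32768: wrapped past the signed boundary
print(to_int16(65536))   # 0: wrapped past the unsigned boundary
```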
Those numbers are border cases on either side of the fence (+1, 0, and -1) for "whole and round" computer numbers, which are always powers of 2. Those powers of 2 are also not random and are representing standard choices for integer precision - being 8, 16, 32, and so on bits wide.
If a server received a Base64 string and wanted to check its length before converting (say it wanted to always permit the final byte array to be 16 KB), how big could a 16 KB byte array possibly become when converted to a Base64 string (assuming one byte per character)?
Base64 encodes each set of three bytes into four bytes. In addition, the output is padded so that its length is always a multiple of four.
This means that the size of the base-64 representation of a string of size n is:
ceil(n / 3) * 4
So, for a 16kB array, the base-64 representation will be ceil(16*1024/3)*4 = 21848 bytes long ~= 21.8kB.
A rough approximation would be that the size of the data is increased to 4/3 of the original.
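A quick check of that formula against Python's own encoder (no line breaks at this point):

```python
import base64
import math

n = 16 * 1024
encoded = base64.b64encode(bytes(n))

# ceil(n / 3) * 4 matches the encoder exactly
assert len(encoded) == math.ceil(n / 3) * 4 == 21848
```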
From Wikipedia:
Note that given an input of n bytes, the output will be (n + 2 - ((n + 2) % 3)) / 3 * 4 bytes long, so that the number of output bytes per input byte converges to 4 / 3 or 1.33333 for large n.
So 16 kB * 4 / 3 gives a little over 21.3 kB, or 21848 bytes, to be exact.
16 kB is 131,072 bits. Base64 packs each 24-bit group into four 6-bit characters, so you would have 5,462 * 4 = 21,848 bytes.
Since the question was about the worst possible increase, I must add that there are usually line breaks, roughly every 80 characters. This means that if you are saving Base64-encoded data into a text file, each line break adds 2 bytes on Windows and 1 byte on Linux.
The increase from the actual encoding has been described above.
This is a future reference for myself. Since the question is about the worst case, we should take line breaks into account. While RFC 1421 defines the maximum line length to be 64 characters, RFC 2045 (MIME) states there can be at most 76 characters in one line.
The latter is what the C# library has implemented. So in a Windows environment, where a line break is 2 chars (\r\n), we get this: Length = Floor(Ceiling(N/3) * 4 * 78 / 76)
Note: the flooring is there because, during my test with C#, if the last line ends at exactly 76 chars, no line break follows.
I can prove it by running the following code:
byte[] bytes = new byte[16 * 1024];
Console.WriteLine(Convert.ToBase64String(bytes, Base64FormattingOptions.InsertLineBreaks).Length);
The answer for 16 kB encoded to Base64 with 76-char lines: 22422 chars.
Assume that on Linux it would be Length = Floor(Ceiling(N/3) * 4 * 77 / 76), but I haven't gotten around to testing it on my .NET Core yet.
It would also depend on the actual character encoding: if we encode the result to a UTF-32 string, each Base64 character would consume 3 additional bytes (4 bytes per char).
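The Windows formula above can be sketched as a small Python helper; the no-break-after-a-final-full-line behaviour is an assumption taken from the C# test described earlier:

```python
import math

def base64_len_with_crlf(n_bytes, line_len=76):
    """Worst-case length on Windows: Base64 body plus a CRLF after each
    full line, with no break after the final (possibly exactly full) line."""
    body = math.ceil(n_bytes / 3) * 4
    breaks = (body - 1) // line_len if body else 0
    return body + 2 * breaks

print(base64_len_with_crlf(16 * 1024))   # 22422, matching the C# result
```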