If a server received a Base64 string and wanted to check its length before converting, say because it wanted to always permit the final byte array to be 16KB, how big could a 16KB byte array possibly become when converted to a Base64 string (assuming one byte per character)?
Base64 encodes each set of three bytes into four bytes. In addition the output is padded to always be a multiple of four.
This means that the size of the base-64 representation of a string of size n is:
ceil(n / 3) * 4
So, for a 16 KB array, the Base64 representation will be ceil(16*1024/3)*4 = 21848 bytes long, or about 21.3 KB (taking 1 KB = 1024 bytes).
A rough approximation would be that the size of the data is increased to 4/3 of the original.
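The formula above can be checked with a short Python sketch. The function name `base64_length` is chosen here for illustration; the comparison uses the standard library's `base64.b64encode`, which pads its output and inserts no line breaks.

```python
import base64


def base64_length(n: int) -> int:
    """Padded Base64 length for n input bytes, i.e. ceil(n / 3) * 4."""
    # (n + 2) // 3 is an integer-only way of writing ceil(n / 3).
    return (n + 2) // 3 * 4


# Compare the formula with a real encoder for a few input sizes.
for size in (0, 1, 2, 3, 4, 16 * 1024):
    assert base64_length(size) == len(base64.b64encode(bytes(size)))

print(base64_length(16 * 1024))  # 21848
```

A server that wants to accept at most 16 KB of decoded data could therefore reject any string longer than `base64_length(16 * 1024)` characters before decoding it.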
From Wikipedia:
Note that given an input of n bytes, the output will be (n + 2 - ((n + 2) % 3)) / 3 * 4 bytes long, so that the number of output bytes per input byte converges to 4 / 3 or 1.33333 for large n.
So 16 KB * 4 / 3 gives a little over 21.3 KB, or 21848 bytes to be exact.
Hope this helps
16 KB is 131,072 bits. Base64 packs each 24-bit buffer into four 6-bit characters, and 131,072 / 24 rounds up to 5,462 buffers, so you would have 5,462 * 4 = 21,848 bytes.
Since the question was about the worst possible increase, I must add that line breaks are usually inserted about every 80 characters. This means that if you save Base64-encoded data to a text file, each line costs 2 extra bytes on Windows and 1 extra byte on Linux.
The increase from the actual encoding has been described above.
This is a future reference for myself. Since the question is on worst case, we should take line breaks into account. While RFC 1421 defines maximum line length to be 64 char, RFC 2045 (MIME) states there'd be 76 char in one line at most.
The latter is what the C# library implements. So in a Windows environment, where a line break is 2 chars (\r\n), we get this: Length = Floor(Ceiling(N/3) * 4 * 78 / 76)
Note: Flooring is because during my test with C#, if the last line ends at exactly 76 chars, no line-break follows.
I can prove it by running the following code:
byte[] bytes = new byte[16 * 1024];
Console.WriteLine(Convert.ToBase64String(bytes, Base64FormattingOptions.InsertLineBreaks).Length);
The answer for 16 kBytes encoded to base64 with 76-char lines: 22422 chars
Presumably on Linux it would be Length = Floor(Ceiling(N/3) * 4 * 77 / 76), but I have not tested that on .NET Core yet.
The size in bytes also depends on the character encoding of the string: if we encode to a UTF-32 string, each Base64 character consumes 3 additional bytes (4 bytes per char).
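As a cross-check of the line-break arithmetic, here is a minimal Python sketch. It assumes, as observed in the C# test above, that a break is inserted after every full 76-character line except when that line is the last one. The function name and its parameters are made up for this illustration.

```python
import base64


def wrapped_base64_length(n: int, line_len: int = 76, newline_len: int = 2) -> int:
    """Length of Base64 text for n bytes when wrapped at line_len characters.

    Assumes no line break follows the final line (an assumption taken from
    the C# observation above, not from any specification).
    """
    encoded = (n + 2) // 3 * 4
    if encoded == 0:
        return 0
    # One break between each pair of consecutive lines.
    breaks = (encoded - 1) // line_len
    return encoded + breaks * newline_len


# Build the wrapped text by hand and compare lengths.
raw = base64.b64encode(bytes(16 * 1024)).decode("ascii")
wrapped = "\r\n".join(raw[i:i + 76] for i in range(0, len(raw), 76))
assert len(wrapped) == wrapped_base64_length(16 * 1024)

print(wrapped_base64_length(16 * 1024))                 # 22422 with \r\n
print(wrapped_base64_length(16 * 1024, newline_len=1))  # 22135 with \n
```

Under this assumption the closed-form Floor(...) expression gives the same result in most cases, but it counts one line break too many when the encoded length is an exact multiple of 76, so it is best read as an upper bound.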
What determines the size of the Galois field when using the Reed-Solomon algorithm to encode an arbitrary message of any size? Is it the symbol size, or the size of the message?
For example, if I am to encode ASCII characters, and I use GF(2^8) because ASCII's are 8 bits, I would end up with a maximum codeword length of 2^8 - 1 = 255 ASCII characters. Then I would have to split the message into sub-messages of length 255.
Or, if I use GF(2^s) such that 2^s - 1 >= the length of the message, then there's no need to split the message, but in this case, even though I am encoding ASCII characters which are 8 bits, each symbol in the codeword would be treated as s bits wide.
Which is preferred? Or are there other things that determine the selection of the Galois field?
The fixed or maximum size of the message determines the symbol size. GF(2^4) for up to 15 nibbles (7.5 bytes), GF(2^8) for up to 255 bytes, GF(2^10) for up to 1023 10-bit symbols or 1278.75 bytes (often used for HDD 512 data byte sectors), GF(2^12) for up to 4095 12-bit symbols or 6142.5 bytes (often used for HDD 4096 data byte sectors).
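The limits listed above follow from the codeword length of 2^s - 1 symbols over GF(2^s). A small Python sketch reproduces them; the function name `rs_limits` is hypothetical. Note that the limit applies to the whole codeword, so the parity symbols count toward it as well as the data.

```python
def rs_limits(symbol_bits: int) -> tuple[int, float]:
    """Return (max codeword length in symbols, same length in bytes) for GF(2^s)."""
    max_symbols = 2 ** symbol_bits - 1
    max_bytes = max_symbols * symbol_bits / 8
    return max_symbols, max_bytes


for s in (4, 8, 10, 12):
    symbols, size_bytes = rs_limits(s)
    print(f"GF(2^{s}): up to {symbols} symbols = {size_bytes} bytes")
```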
I have been working on string encoding schemes, and while examining how UTF-16 works, I came up with a question. Why use complex surrogate pairs to represent a 21-bit code point? Why not simply store some of the bits in the first code unit and the remaining bits in the second code unit? Am I missing something? Is there a problem with storing the bits directly, as is done in UTF-8?
Example of what I am thinking of:
The character '🙃'
Corresponding code point: 128579 (Decimal)
The binary form: 1 1111 0110 0100 0011 (17 bits)
It's 17-bit code point.
Based on UTF-8 schemes, it will be represented as:
240 : 11110 000
159 : 10 011111
153 : 10 011001
131 : 10 000011
In UTF-16, why not do something like this rather than using surrogate pairs:
49159 : 110 0 0000 0000 0111
30275 : 01 11 0110 0100 0011
Proposed alternative to UTF-16
I think you're proposing an alternative format using 16-bit code units analogous to the UTF-8 code scheme; let's designate it UTF-EMF-16.
In your UTF-EMF-16 scheme, code points from U+0000 to U+7FFF would be encoded as a single 16-bit unit with the MSB (most significant bit) always zero. Then, you'd reserve 16-bit units with the 2 most significant bits set to 10 as 'continuation units', with 14 bits of payload data. And then you'd encode code points from U+8000 to U+10FFFF (the current maximum Unicode code point) in 16-bit units with the three most significant bits set to 110 and up to 13 bits of payload data. With Unicode as currently defined (U+0000 .. U+10FFFF), you'd never need more than 7 of the 13 bits set.
U+0000 .. U+7FFF: One 16-bit unit: values 0x0000 .. 0x7FFF
U+8000 .. U+10FFFF: Two 16-bit units:
1. First unit 0xC000 .. 0xC043
2. Second unit 0x8000 .. 0xBFFF
For your example code point, U+1F643 (binary: 1 1111 0110 0100 0011):
First unit: 1100 0000 0000 0111 = 0xC007
Second unit: 1011 0110 0100 0011 = 0xB643
The second unit differs from your example in reversing the two most significant bits, from 01 in your example to 10 in mine.
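To make the bit layout concrete, here is a minimal Python sketch of the hypothetical UTF-EMF-16 scheme described above. It is not a real encoding, and the function names are invented for this illustration.

```python
def emf16_encode(cp: int) -> list[int]:
    """Encode a code point in the hypothetical UTF-EMF-16 scheme."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("code point out of Unicode range")
    if cp <= 0x7FFF:
        return [cp]                  # single unit, most significant bit is 0
    lead = 0xC000 | (cp >> 14)       # prefix 110 + high-order bits
    cont = 0x8000 | (cp & 0x3FFF)    # prefix 10 + low 14 bits
    return [lead, cont]


def emf16_decode(units: list[int]) -> int:
    """Decode one code point produced by emf16_encode."""
    if len(units) == 1 and units[0] <= 0x7FFF:
        return units[0]
    if len(units) == 2 and units[0] >> 13 == 0b110 and units[1] >> 14 == 0b10:
        return ((units[0] & 0x1FFF) << 14) | (units[1] & 0x3FFF)
    raise ValueError("not a valid UTF-EMF-16 sequence")


# The example code point from the question (decimal 128579).
print([hex(u) for u in emf16_encode(128579)])  # ['0xc007', '0xb643']
```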
Why wasn't such a scheme used in UTF-16
Such a scheme could be made to work. It is unambiguous. It could accommodate many more characters than Unicode currently allows. UTF-8 could be modified to become UTF-EMF-8 so that it could handle the same extended range, with some characters needing 5 bytes instead of the current maximum of 4 bytes. UTF-EMF-8 with 5 bytes would encode up to 26 bits; UTF-EMF-16 could encode 27 bits, but should be limited to 26 bits (roughly 64 million code points, instead of just over 1 million). So, why wasn't it, or something very similar, adopted?
The answer is the very common one: history (plus backwards compatibility).
When Unicode was first defined, it was hoped or believed that a 16-bit code set would be sufficient. The UCS2 encoding was developed using 16-bit values, and many values in the range 0x8000 .. 0xFFFF were given meanings. For example, U+FEFF is the byte order mark.
When the Unicode scheme had to be extended to make Unicode into a bigger code set, there were many defined characters with the 10 and 110 bit patterns in the most significant bits, so backwards compatibility meant that the UTF-EMF-16 scheme outlined above could not be used for UTF-16 without breaking compatibility with UCS2, which would have been a serious problem.
Consequently, the standardizers chose an alternative scheme, where there are high surrogates and low surrogates.
0xD800 .. 0xDBFF High surrogates (most significant bits of 21-bit value)
0xDC00 .. 0xDFFF Low surrogates (less significant bits of 21-bit value)
The low surrogates range provides storage for 10 bits of data: the prefix 1101 11 uses 6 of 16 bits. The high surrogates range also provides storage for 10 bits of data: the prefix 1101 10 also uses 6 of 16 bits. But because the BMP (Basic Multilingual Plane, U+0000 .. U+FFFF) doesn't need to be encoded with two 16-bit units, the UTF-16 encoding subtracts 1 from the high order data, and can therefore be used to encode U+10000 .. U+10FFFF. (Note that although Unicode is a 21-bit encoding, not all 21-bit (unsigned) numbers are valid Unicode code points. Values from 0x110000 .. 0x1FFFFF are 21-bit numbers but are not a part of Unicode.)
From the Unicode FAQ (UTF-8, UTF-16, UTF-32 & BOM):
Q: What's the algorithm to convert from UTF-16 to character codes?
A: The Unicode Standard used to contain a short algorithm, now there is just a bit distribution table. Here are three short code snippets that translate the information from the bit distribution table into C code that will convert to and from UTF-16.
Using the following type definitions
typedef unsigned int16 UTF16;
typedef unsigned int32 UTF32;
the first snippet calculates the high (or leading) surrogate from a character code C.
const UTF16 HI_SURROGATE_START = 0xD800
UTF16 X = (UTF16) C;
UTF32 U = (C >> 16) & ((1 << 5) - 1);
UTF16 W = (UTF16) U - 1;
UTF16 HiSurrogate = HI_SURROGATE_START | (W << 6) | X >> 10;
where X, U and W correspond to the labels used in Table 3-5 UTF-16 Bit Distribution. The next snippet does the same for the low surrogate.
const UTF16 LO_SURROGATE_START = 0xDC00
UTF16 X = (UTF16) C;
UTF16 LoSurrogate = (UTF16) (LO_SURROGATE_START | X & ((1 << 10) - 1));
Finally, the reverse, where hi and lo are the high and low surrogate, and C the resulting character
UTF32 X = (hi & ((1 << 6) -1)) << 10 | lo & ((1 << 10) -1);
UTF32 W = (hi >> 6) & ((1 << 5) - 1);
UTF32 U = W + 1;
UTF32 C = U << 16 | X;
A caller would need to ensure that C, hi, and lo are in the appropriate ranges.
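For comparison, the same conversion can be sketched in Python using the equivalent "subtract 0x10000, then split into two 10-bit halves" formulation. The function names are chosen for this example, and the range checks mentioned above are included.

```python
def to_surrogates(cp: int) -> tuple[int, int]:
    """Split a supplementary code point into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only U+10000 .. U+10FFFF use surrogate pairs")
    offset = cp - 0x10000            # 20-bit value
    hi = 0xD800 | (offset >> 10)     # top 10 bits
    lo = 0xDC00 | (offset & 0x3FF)   # bottom 10 bits
    return hi, lo


def from_surrogates(hi: int, lo: int) -> int:
    """Combine a (high, low) surrogate pair back into a code point."""
    if not (0xD800 <= hi <= 0xDBFF and 0xDC00 <= lo <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + (((hi - 0xD800) << 10) | (lo - 0xDC00))


hi, lo = to_surrogates(128579)       # the example code point from the question
print(hex(hi), hex(lo))              # 0xd83d 0xde43

# Cross-check against Python's own UTF-16 encoder.
data = chr(128579).encode("utf-16-be")
assert (hi, lo) == (int.from_bytes(data[:2], "big"), int.from_bytes(data[2:], "big"))
```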
I converted a csv file to an npy file. The csv file is 5GB, but the resulting npy file is 13GB.
I thought an npy file would be more efficient than csv.
Am I misunderstanding this? Why is the npy file bigger than the csv?
I just used this code
full = pd.read_csv('data/RGB.csv', header=None).values
np.save('data/RGB.npy', full, allow_pickle=False, fix_imports=False)
and data structure like this:
R, G, B, is_skin
2, 5, 1, 0
10, 52, 242, 1
52, 240, 42, 0
... (420,711,257 rows)
In your case an element is an integer between 0 and 255, inclusive. That means, saved as ASCII it will need at most
3 chars for the number
1 char for ,
1 char for the whitespace
which results in at most 5 bytes (somewhat less on average) per element on the disc.
Pandas reads/interprets this as an int64 array by default (see full.dtype), which means it needs 8 bytes per element. That leads to a bigger npy file (and most of those bytes are zeros!).
To save an integer between 0 and 255 we need only one byte, so the size of the npy file can be reduced by a factor of 8 without losing any information. Just tell pandas to interpret the data as unsigned 8-bit integers:
import numpy as np
import pandas as pd

full = pd.read_csv(r'e:\data.csv', dtype=np.uint8).values
# or to get rid of pandas-dependency:
# full = np.genfromtxt(r'e:\data.csv', delimiter=',', dtype=np.uint8, skip_header=1)
np.save(r'e:/RGB.npy', full, allow_pickle=False, fix_imports=False)
# an 8 times smaller npy-file
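A rough back-of-the-envelope check in plain Python, using the row count from the question and assuming 4 columns per row, shows where the 13GB comes from. The helper name is made up, and the small header that an npy file carries is ignored.

```python
ROWS = 420_711_257   # row count given in the question
COLS = 4             # R, G, B, is_skin


def npy_payload_bytes(rows: int, cols: int, itemsize: int) -> int:
    """Raw array size in bytes, ignoring the small npy header."""
    return rows * cols * itemsize


as_int64 = npy_payload_bytes(ROWS, COLS, 8)   # default dtype picked by pandas
as_uint8 = npy_payload_bytes(ROWS, COLS, 1)   # one byte per value

print(f"int64: {as_int64 / 1e9:.2f} GB")  # int64: 13.46 GB
print(f"uint8: {as_uint8 / 1e9:.2f} GB")  # uint8: 1.68 GB
```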
Most of the time npy-format needs less space, however there can be situations when the ASCII format results in smaller files.
For example, if the data consists mostly of very small one-digit numbers plus a few very big numbers that really do need 8 bytes:
in ASCII-format you pay on average 2 bytes per element (there is no need to write whitespace, , alone as delimiter is good enough).
in numpy-format you will pay 8 bytes per element.
I am working on bandwidth estimations for VoIP calls. I want to know the maximum size of RTP header. I looked on wiki but only the minimum size is available. I tried to calculate manually the number of bits used in the header but the field:
"Header extension: (optional) The first 32-bit word contains a profile-specific identifier (16 bits) and a length specifier (16 bits) that indicates the length of the extension (EHL = extension header length) in 32-bit units, excluding the 32 bits of the extension header"
is confusing me. Please help.
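The header structure in the wiki page shows that the header size depends on the value of the CC field (bits 4-7). The fixed part of the header is 96 bits (12 bytes), and each CSRC identifier adds 32 bits. These four bits can hold at most 15, so without a header extension the maximum is 96 + 32 * CC = 96 + 15 * 32 = 576 bits = 72 bytes. If the X bit is set, the extension header adds one more 32-bit word (608 bits = 76 bytes in total) followed by the extension data, whose 16-bit length field allows up to 65,535 further 32-bit words.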
For more info, see RFC 3550.
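The size calculation can be sketched in Python as follows, assuming the 12-byte fixed header and 4-byte CSRC entries defined in RFC 3550. The function and parameter names are invented for this example.

```python
from typing import Optional

FIXED_HEADER_BYTES = 12   # V/P/X/CC/M/PT, sequence number, timestamp, SSRC
WORD_BYTES = 4            # CSRC entries and extension words are 32 bits each


def rtp_header_bytes(csrc_count: int, extension_words: Optional[int] = None) -> int:
    """RTP header size in bytes.

    extension_words is None when the X bit is clear; otherwise it is the
    value of the 16-bit extension length field (in 32-bit words).
    """
    if not 0 <= csrc_count <= 15:
        raise ValueError("CC is a 4-bit field (0..15)")
    size = FIXED_HEADER_BYTES + WORD_BYTES * csrc_count
    if extension_words is not None:
        if not 0 <= extension_words <= 0xFFFF:
            raise ValueError("extension length is a 16-bit field")
        size += WORD_BYTES + WORD_BYTES * extension_words
    return size


print(rtp_header_bytes(0))           # 12, the minimum header
print(rtp_header_bytes(15))          # 72, maximum without an extension
print(rtp_header_bytes(15, 0))       # 76, with an empty extension
print(rtp_header_bytes(15, 0xFFFF))  # 262216, theoretical maximum
```

For bandwidth estimates on ordinary VoIP calls, the 12-byte minimum is usually the relevant figure, since CSRC entries only appear when a mixer has combined several sources.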
I'm setting up a function in my .vimrc (using MacVim in particular, but this should be universal to vim in general) to display file sizes (in Bytes, Kilobytes, and Megabytes) in my statusline. While the function works quite perfectly without errors, it's giving me unexpected output! In hindsight, it's certainly producing the output it should, but not the output I want.
Here's the function:
" I modified the FileSize() function shown here to suit my own preferences:
" http://got-ravings.blogspot.com/2008/08/vim-pr0n-making-statuslines-that-own.htm
function! StatuslineFileSize()
  let bytes = getfsize(expand("%:p"))
  if bytes < 1024
    return bytes . "B"
  elseif (bytes >= 1024) && (bytes < 10240)
    return string(bytes / 1024.0) . "K"
  elseif (bytes >= 10240) && (bytes < 1048576)
    return string(bytes / 1024) . "K"
  elseif (bytes >= 1048576) && (bytes < 10485760)
    return string(bytes / 1048576.0) . "M"
  elseif bytes >= 10485760
    return string(bytes / 1048576) . "M"
  endif
endfunction
Here's the way it basically works:
If filesize is less than 1KB, output size in Bytes as an integer
If filesize is between 1KB and 10KB, output size in Kilobytes as a decimal
If filesize is between 10KB and 1MB, output size in Kilobytes as an integer
If filesize is between 1MB and 10MB, output size in Megabytes as a decimal
If filesize is greater than 10MB, output size in Megabytes as an integer
The output produced for steps 2 and 4 is a decimal with six (6) places of precision. The output I would like to have is a decimal with just one (1) place of precision.
I've already searched the help documentation for the round() and trunc() functions, but they only round and truncate floats to the nearest whole number, which is not what I want. I've also searched Google and StackOverflow for solutions, but most of what I can find involves altering text in the edit buffer or completely unrelated problems such as rounding floats in Java (!!!)
I'm preferably looking for a vim built-in function that can do this, a la round({expr},{prec}) or trunc({expr},{prec}), but if a user defined function can provide a sufficiently elegant solution then I'm all for that as well. I don't mind if the output is a string, since I'm obviously returning a string from StatuslineFileSize() anyways!
Use printf() with a precision specifier, instead of string(), to convert the results to strings. Divide by a Float (1048576.0) so that the division is not truncated to an integer first:
return printf('%.1fM', bytes / 1048576.0)