Is it possible to base64-encode a file in chunks?

Is it possible to base64-encode a file in chunks? - base64

I'm trying to base64 encode a huge input file and end up with an text output file, and I'm trying to find out whether it's possible to encode the input file bit-by-bit, or whether I need to encode the entire thing at once.
This will be done on the AS/400 (iSeries), if that makes any difference. I'm using my own base64 encoding routine (written in RPG) which works excellently, and, were it not a case of size limitations, would be fine.

It's not possible bit-by-bit but 3 bytes at a time, or multiples of 3 bytes at time will do!.
In other words if you split your input file in "chunks" which size(s) is (are) multiples of 3 bytes, you can encode the chunks separately and piece together the resulting B64-encoded pieces together (in the corresponding orde, of course. Note that the last chuink needn't be exactly a multiple of 3 bytes in size, depending on the modulo 3 value of its size its corresponding B64 value will have a few of these padding characters (typically the equal sign) but that's ok, as thiswill be the only piece that has (and needs) such padding.
In the decoding direction, it is the same idea except that you need to split the B64-encoded data in multiples of 4 bytes. Decode them in parallel / individually as desired and re-piece the original data by appending the decoded parts together (again in the same order).
Example:
"File" contents =
"Never argue with the data." (Jimmy Neutron).
Straight encoding = Ik5ldmVyIGFyZ3VlIHdpdGggdGhlIGRhdGEuIiAoSmltbXkgTmV1dHJvbik=
Now, in chunks:
"Never argue --> Ik5ldmVyIGFyZ3Vl
with the --> IHdpdGggdGhl
data." (Jimmy Neutron) --> IGRhdGEuIiAoSmltbXkgTmV1dHJvbik=
As you see piece in that order the 3 encoded chunks amount the same as the code produced for the whole file.
Decoding is done similarly, with arbitrary chuncked sized provided they are multiples of 4 bytes. There is absolutely not need to have any kind of correspondance between the sizes used for encoding. (although standardizing to one single size for each direction (say 300 and 400) may makes things more uniform and easier to manage.

It is a trivial effort to split any given bytestream into chunks.
You can base64 any chunk of bytes without problem.
The problem you are faced with is that unless you place specific requirements on your chunks (multiples of 3 bytes), the sequence of base64-encoded chunks will be different than the actual output you want.
In C#, this is one (sloppy) way you could do it lazily. The execution is actually deferred until string.Concat is called, so you can do anything you want with the chunked strings. (If you plug this into LINQPad you will see the output)
void Main()
{
var data = "lorum ipsum etc lol this is an example!!";
var bytes = Encoding.ASCII.GetBytes(data);
var testFinal = Convert.ToBase64String(bytes);
var chunkedBytes = bytes.Chunk(3);
var base64chunks = chunkedBytes.Select(i => Convert.ToBase64String(i.ToArray()));
var final = string.Concat(base64chunks);
testFinal.Dump(); //output
final.Dump(); //output
}
public static class Extensions
{
public static IEnumerable<IEnumerable<T>> Chunk<T>(this IEnumerable<T> list, int chunkSize)
{
while(list.Take(1).Count() > 0)
{
yield return list.Take(chunkSize);
list = list.Skip(chunkSize);
}
}
}
Output
bG9ydW0gaXBzdW0gZXRjIGxvbCB0aGlzIGlzIGFuIGV4YW1wbGUhIQ==
bG9ydW0gaXBzdW0gZXRjIGxvbCB0aGlzIGlzIGFuIGV4YW1wbGUhIQ==

Hmmm, if you wrote the base64 conversion yourself you should have noticed the obvious thing: each sequence of 3 octets is represented by 4 characters in base64.
So you can split the base64 data at every multiple of four characters, and it will be possible to convert these chunks back to their original bits.
I don't know how character files and byte files are handled on an AS/400, but if it has both concepts, this should be very easy.
are text files limited in the length of each line?
are text files line-oriented, or are they just character streams?
how many bits does one byte have?
are byte files padded at the end, so that one can only create files that span whole disk sectors?
If you can answer all these questions, what exact difficulties do you have left?

Related

Node.JS AES decryption truncates initial result

I'm attempting to replicate some python encryption code in node.js using the built in crypto library. To test, I'm encrypting the data using the existing python script and then attempting to decrypt using node.js.
I have everything working except for one problem, doing the decryption results in a truncated initial decrypted result unless I grab extra data, which then results in a truncated final result.
I'm very new to the security side of things, so apologize in advance if my vernacular is off.
Python encryption logic:
encryptor = AES.new(key, AES.MODE_CBC, IV)
<# Header logic, like including digest, salt, and IV #>
for rec in vect:
chunk = rec.pack() # Just adds disparate pieces of data into a contiguous bytearray of length 176
encChunk = encryptor.encrypt(chunk)
outfile.write(encChunk)
Node decryption logic:
let offset = 0;
let derivedKey = crypto.pbkdf2Sync(secret, salt, iterations, 32, 'sha256');
let decryptor = crypto.createDecipheriv('aes-256-cbc', derivedKey, iv);
let chunk = data.slice(offset, (offset + RECORD_LEN))
while(chunk.length > 0) {
let clearChunk = decryptor.update(chunk);
// unpack clearChunk and do something with that data
offset += RECORD_LEN;
chunk = data.slice(offset, offset + RECORD_LEN);
}
I would expect my initial result to print something like this to hex:
54722e34d8b2bf158db6b533e315764f87a07bbfbcf1bd6df0529e56b6a6ae0f123412341234123400000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
And it gets close, expect it cuts off the final 16 bytes (in example above the final 32 "0's" would be missing). This shifts all following decryptions by those 16 bytes, meaning those 32 "0's" are added to the front of the next decrypted chunk.
If I add 16 bytes to the initial chunk size (meaning actually grab more data, not just shift the offset) then this solves everything on the front end, but results in the final chunk losing it's last 16 bytes of data.
One thing that seems weird to me: The initial chunk has a length of 176, but the decrypted results has a length of 160. All other chunks have lengths 176 before and after decryption. I'm assuming I'm doing something wrong with how I'm initializing the decryptor which is causing it to expect an extra 16 bytes of data at the beginning, but can't for the life of me figure out what.
I must be close since the decrypted data is correct, minus the mystery shifting, even when reading in large amounts of data. Just need to figure out this final step.

Short version based on your updated code: if you are absolutely certain that every block will be 176 bytes (i.e. a multiple of 16), then you can add cipher.setAutoPadding(false) to your Node code. If that's not true, or for more about why, read on.
At the end of your decryption, you need to call decryptor.final to get the final block.
If you have all the data together, you can decrypt it in one call:
let clearChunk = decryptor.update(chunk) + decryptor.final()
update() exists so that you can pass data to the decryptor in chunks. For example, if you had a very large file, you may not want a full copy of the encrypted data plus a full copy of the decrypted data in memory at the same time. You can therefore read encrypted data from the file in chunks, pass it to update(), and write out the decrypted data in chunks.
The input data using CBC mode must be a multiple of 16 bytes long. To ensure this, we typically use PKCS7 padding. That will pad out your input data to a multiple of 16. If it's already a multiple of 16 it will add an extra block of 16 bytes. The padding value is the number of padding values. So if your block is 12 bytes long, it will be padded with 04040404. If it's a multiple of 16, then the padding is 16 bytes of 0x10. This padding system lets the decryptor validate that it's removing the right amount of padding. This is likely what's causing your 176/160 issue.
This padding issue is why there's a final() call. The system needs to know which block is the last block so it can remove the padding. So the first call to update() will always return one fewer blocks than you pass in, since it's holding onto it until it knows whether it's the last block.
Looking at your Python code, I think it's not padding at all (most Python libraries I'm familiar with don't pad automatically). As long as the input is certain to be a multiple of 16, that's ok. But the default for Node is to expect padding. If you know that your size will always be a multiple of 16, then you can change the Node side with cipher.setAutoPadding(false). If you don't know for certain that the input size will always be a multiple of 16, then you need to add a pad() call on the Python side for the final block.

How to convert padbytes function to coldfusion

I have the following code in node and I am trying to convert to ColdFusion:
// a correct implementation of PKCS7. The rijndael js has a PKCS7 padding already implemented
// however, it incorrectly pads expecting the phrase to be multiples of 32 bytes when it should pad based on multiples
// 16 bytes. Also, every javascript implementation of PKCS7 assumes utf-8 char encoding. C# however is unicode or utf-16.
// This means that chars need to be treated in our code as 2 byte chars and not 1 byte chars.
function padBytes(string){
const strArray = [...new Buffer(string, 'ucs2')];
const need = 16 - ((strArray.length) % 16);
for(let i = 0; i < need; i++) {
strArray.push(need);
}
return Buffer.from(strArray);
}
I'm trying to understand exactly what this function is doing to convert it. As I think I understand it, it's converting the string to UTF-16 (UCS2) and then adding padding to each character. However, I don't understand why the need variable is the value it is, nor how exactly to achieve that in CF.
I also don't understand why it's only pushing the same value into the array over and over again. For starters, in my example script the string is 2018-06-14T15:44:10Z testaccount. The string array length is 64. I'm not sure how to achieve even that in CF.
I've tried character encoding, converting to binary and stuff to UTF-16 and just don't understand well enough the js function to replicate it in ColdFusion. I feel I'm missing something with the encoding.
EDIT:
The selected answer solves this problem, but because I was eventually trying to use the input data for encryption, the easier method was to not use this function at all but do the following:
<cfset stringToEncrypt = charsetDecode(input,"utf-16le") />
<cfset variables.result = EncryptBinary(stringToEncrypt, theKey, theAlgorithm, theIV) />

Update:
We followed up in chat and turns out the value is ultimately used with encrypt(). Since encrypt() already handles padding (automatically), no need for the custom padBytes() function. However, it did require switching to the less commonly used encryptBinary() function to maintain the UTF-16 encoding. The regular encrypt() function only handles UTF-8, which produces totally different results.
Trycf.com Example:
// Result with sample key/iv: P22lWwtD8pDrNdQGRb2T/w==
result = encrypt("abc", theKey, theAlgorithm, theEncoding, theIV);
// Result Result with sample key/iv: LJCROj8trkXVq1Q8SQNrbA==
input = charsetDecode("abc", "utf-16le");
result= binaryEncode(encryptBinary(input, theKey, theAlgorithm, theIV), "base64);
it's converting the string to utf-16
(ucs2) and then adding padding to each character.
... I feel I'm missing something with the encoding.
Yes, the first part seems to be decoding the string as UTF-16 (or UCS2 which are slightly different). As to what you're missing, you're not the only one. I couldn't get it to work either until I found this comment which explained "UTF-16" prepends a BOM. To omit the BOM, use either "UTF-16BE" or "UTF-16LE" depending on the endianess needed.
why it's only pushing the same value into the array over and over again.
Because that's the definition of PCKS7 padding. Instead of padding with something like nulls or zeroes, it calculates how many bytes padding are needed. Then uses that number as the padding value. For example, say a string needs an extra three bytes padding. PCKS7 appends the value 3 - three times: "string" + "3" + "3" + "3".
The rest of the code is similar in CF. Unfortunately, the results of charsetDecode() aren't mutable. You must build a separate array to hold the padding, then combine the two.
Note, this example combines the arrays using CF2016 specific syntax, but it could also be done with a simple loop instead
Function:
function padBytes(string text){
var combined = [];
var padding = [];
// decode as utf-16
var decoded = charsetDecode(arguments.text,"utf-16le");
// how many padding bytes are needed?
var need = 16 - (arrayLen(decoded) % 16);
// fill array with any padding bytes
for(var i = 0; i < need; i++) {
padding.append(need);
}
// concatenate the two arrays
// CF2016+ specific syntax. For earlier versions, use a loop
combined = combined.append(decoded, true);
combined = combined.append(padding, true);
return combined;
}
Usage:
result = padBytes("2018-06-14T15:44:10Z testaccount");
writeDump(binaryEncode( javacast("byte[]", result), "base64"));

Buffer constructor not treating encoding correctly for buffer length

I am trying to construct a utf16le string from a javascript string as a new buffer object.
It appears that setting a new Buffer('xxxxxxxxxx', utf16le) will actually have a length of 1/2 what it is expected to have. Such as we will only see 5 x's in the console logs.
var test = new Buffer('xxxxxxxxxx','utf16le');
for (var i=0;i<test.length;i++) {
console.log(i+':'+String.fromCharCode(test[i]));
}
Node version is v0.8.6

It is really unclear what you want to accomplish here. Your statement can mean (at least) 2 things:
How to convert an JS-String into a UTF-16-LE Byte-Array
How to convert a Byte-Array containing a UTF-16-LE String into a JS-String
What you are doing in your code sample is decoding a Byte-Array in a string represented as UTF-16-LE to a UTF-8 string and storing that as a buffer. Until you actually state what you want to accomplish, you have 0 chance of getting a coherent answer.
new Buffer('FF', 'hex') will yield a buffer of length 1 with all bits of the octet set. Which is likely the opposite of what you think it does.

Reading stream line by line without knowing its encoding

I have a situation where I need to process some data from a stream line by line. The problem is that the encoding of the data is not known in advance; it might be UTF-8 or any legacy single-byte encoding (e.g. Latin1, ISO-8859-5, etc). It will not be UTF16 or exotics like EBCDIC, so I can reasonably expect \n to be unambiguous, so in theory I can split it into lines. At some point, when I encounter an empty line, I will need to feed the rest of the stream somewhere else (without splitting it into lines, but still without any reencoding); think in terms of HTTP-style headers followed by an opaque body.
Here is what I got:
function processStream(stream) {
var buffer = '';
function splitLines(data) {
buffer += data;
var lf = buffer.indexOf('\n');
while (lf >= 0) {
var line = buffer.substr(0, lf - 1);
buffer = buffer.substr(lf + 1);
this.emit('line', line);
lf = buffer.indexOf('\n');
}
}
function processHeader(line) {
if (line.length) {
// do something with the line
} else {
// end of headers, stop splitting lines and start processing the body
this
.removeListener('data', splitLines)
.removeAllListeners('line')
.on('data', processBody);
if (buffer.length) {
// process leftover buffer as part of the body
processBody(buffer);
buffer = '';
}
}
}
function processBody(data) {
// do something with the body chunks
}
stream.setEncoding('binary');
stream
.on('data', splitLines)
.on('line', processHeader);
}
It does the job, but the problem is that the binary encoding is deprecated and will probably disappear in the future, leaving me without that option. All other Buffer encodings will either mangle the data or fail to decode it altogether if (most likely, when) it does not match the encoding. Working with Uint8Array instead will mean slow and inconvenient Javascript loops over the data just to find a newline.
Any suggestions on how to split a stream into lines on the fly, while remaining encoding-agnostic without using binary encoding?

Disclaimer: I'm not a Javascript developer.
At some point, when I encounter an empty line, I will need to feed the rest of the stream somewhere else (without splitting it into lines, but still without any reencoding)
Right. In that case, it sounds like you really don't want to think about the data as text at all. Treat it as you would any binary data, and split it on byte 0x0A. (Note that if it came from Windows to start with, you might want to also remove any trailing 0x0D value.)
I know it's text really, but without any encoding information, it's dangerous to impose any sort of interpretation on the data.
So you should keep two pieces of state:
A list of byte arrays
The current buffer
When you receive data, you logically want to create a new array with the current buffer prepending the new data. (For efficiency you may not want to actually create such an array, but I would do so to start with, until you've got it working.) Look for any 0x0A bytes, and split the array accordingly (create a new byte array as a "slice" of the existing array, and add the slice to the list). The new "current buffer" will be whatever data you have left after the final 0x0A.
If you see two 0x0A values in a row, then you'd go into your second mode of just copying the data.
This is all assuming that the Javascript / Node combination allows you to manipulate binary data as binary data, but I'd be shocked if it didn't. The important point is not to interpret it as text at any point.

Remove trailing "=" when base64 encoding

I am noticing that whenever I base64 encode a string, a "=" is appended at the end. Can I remove this character and then reliably decode it later by adding it back, or is this dangerous? In other words, is the "=" always appended, or only in certain cases?
I want my encoded string to be as short as possible, that's why I want to know if I can always remove the "=" character and just add it back before decoding.

The = is padding. <!------------>
Wikipedia says
An additional pad character is
allocated which may be used to force
the encoded output into an integer
multiple of 4 characters (or
equivalently when the unencoded binary
text is not a multiple of 3 bytes) ;
these padding characters must then be
discarded when decoding but still
allow the calculation of the effective
length of the unencoded text, when its
input binary length would not be a
multiple of 3 bytes (the last non-pad
character is normally encoded so that
the last 6-bit block it represents
will be zero-padded on its least
significant bits, at most two pad
characters may occur at the end of the
encoded stream).
If you control the other end, you could remove it when in transport, then re-insert it (by checking the string length) before decoding.
Note that the data will not be valid Base64 in transport.
Also, Another user pointed out (relevant to PHP users):
Note that in PHP base64_decode will accept strings without padding, hence if you remove it to process it later in PHP it's not necessary to add it back. – Mahn Oct 16 '14 at 16:33
So if your destination is PHP, you can safely strip the padding and decode without fancy calculations.

I wrote part of Apache's commons-codec-1.4.jar Base64 decoder, and in that logic we are fine without padding characters. End-of-file and End-of-stream are just as good indicators that the Base64 message is finished as any number of '=' characters!
The URL-Safe variant we introduced in commons-codec-1.4 omits the padding characters on purpose to keep things smaller!
http://commons.apache.org/codec/apidocs/src-html/org/apache/commons/codec/binary/Base64.html#line.478
I guess a safer answer is, "depends on your decoder implementation," but logically it is not hard to write a decoder that doesn't need padding.

In JavaScript you could do something like this:
// if this is your Base64 encoded string
var str = 'VGhpcyBpcyBhbiBhd2Vzb21lIHNjcmlwdA==';
// make URL friendly:
str = str.replace(/\+/g, '-').replace(/\//g, '_').replace(/\=+$/, '');
// reverse to original encoding
if (str.length % 4 != 0){
str += ('===').slice(0, 4 - (str.length % 4));
}
str = str.replace(/-/g, '+').replace(/_/g, '/');
See also this Fiddle: http://jsfiddle.net/7bjaT/66/

= is added for padding. The length of a base64 string should be multiple of 4, so 1 or 2 = are added as necessary.
Read: No, you shouldn't remove it.

On Android I am using this:
Global
String CHARSET_NAME ="UTF-8";
Encode
String base64 = new String(
Base64.encode(byteArray, Base64.URL_SAFE | Base64.NO_PADDING | Base64.NO_CLOSE | Base64.NO_WRAP),
CHARSET_NAME);
return base64.trim();
Decode
byte[] bytes = Base64.decode(base64String,
Base64.URL_SAFE | Base64.NO_PADDING | Base64.NO_CLOSE | Base64.NO_WRAP);
equals this on Java:
Encode
private static String base64UrlEncode(byte[] input)
{
Base64 encoder = new Base64(true);
byte[] encodedBytes = encoder.encode(input);
return StringUtils.newStringUtf8(encodedBytes).trim();
}
Decode
private static byte[] base64UrlDecode(String input) {
byte[] originalValue = StringUtils.getBytesUtf8(input);
Base64 decoder = new Base64(true);
return decoder.decode(originalValue);
}
I had never problems with trailing "=" and I am using Bouncycastle as well

If you're encoding bytes (at fixed bit length), then the padding is redundant. This is the case for most people.
Base64 consumes 6 bits at a time and produces a byte of 8 bits that only uses six bits worth of combinations.
If your string is 1 byte (8 bits), you'll have an output of 12 bits as the smallest multiple of 6 that 8 will fit into, with 4 bits extra. If your string is 2 bytes, you have to output 18 bits, with two bits extra. For multiples of six against multiple of 8 you can have a remainder of either 0, 2 or 4 bits.
The padding says to ignore those extra four (==) or two (=) bits. The padding is there tell the decoder about your padding.
The padding isn't really needed when you're encoding bytes. A base64 encoder can simply ignore left over bits that total less than 8 bits. In this case, you're best off removing it.
The padding might be of some use for streaming and arbitrary length bit sequences as long as they're a multiple of two. It might also be used for cases where people want to only send the last 4 bits when more bits are remaining if the remaining bits are all zero. Some people might want to use it to detect incomplete sequences though it's hardly reliable for that. I've never seen this optimisation in practice. People rarely have these situations, most people use base64 for discrete byte sequences.
If you see answers suggesting to leave it on, that's not a good encouragement if you're simply encoding bytes, it's enabling a feature for a set of circumstances you don't have. The only reason to have it on in that case might be to add tolerance to decoders that don't work without the padding. If you control both ends, that's a non-concern.

If you're using PHP the following function will revert the stripped string to its original format with proper padding:
<?php
$str = 'base64 encoded string without equal signs stripped';
$str = str_pad($str, strlen($str) + (4 - ((strlen($str) % 4) ?: 4)), '=');
echo $str, "\n";

Using Python you can remove base64 padding and add it back like this:
from math import ceil
stripped = original.rstrip('=')
original = stripped.ljust(ceil(len(stripped) / 4) * 4, '=')

Yes, there are valid use cases where padding is omitted from a Base 64 encoding.
The JSON Web Signature (JWS) standard (RFC 7515) requires Base 64 encoded data to omit
padding. It expects:
Base64 encoding [...] with all trailing '='
characters omitted (as permitted by Section 3.2) and without the
inclusion of any line breaks, whitespace, or other additional
characters. Note that the base64url encoding of the empty octet
sequence is the empty string. (See Appendix C for notes on
implementing base64url encoding without padding.)
The same applies to the JSON Web Token (JWT) standard (RFC 7519).
In addition, Julius Musseau's answer has indicated that Apache's Base 64 decoder doesn't require padding to be present in Base 64 encoded data.

I do something like this with java8+
private static String getBase64StringWithoutPadding(String data) {
if(data == null) {
return "";
}
Base64.Encoder encoder = Base64.getEncoder().withoutPadding();
return encoder.encodeToString(data.getBytes());
}
This method gets an encoder which leaves out padding.
As mentioned in other answers already padding can be added after calculations if you need to decode it back.

For Android You may have trouble if You want to use android.util.base64 class, since that don't let you perform UnitTest others that integration test - those uses Adnroid environment.
In other hand if You will use java.util.base64, compiler warns You that You sdk may to to low (below 26) to use it.
So I suggest Android developers to use
implementation "commons-codec:commons-codec:1.13"
Encoding object
fun encodeObjectToBase64(objectToEncode: Any): String{
val objectJson = Gson().toJson(objectToEncode).toString()
return encodeStringToBase64(objectJson.toByteArray(Charsets.UTF_8))
}
fun encodeStringToBase64(byteArray: ByteArray): String{
return Base64.encodeBase64URLSafeString(byteArray).toString() // encode with no padding
}
Decoding to Object
fun <T> decodeBase64Object(encodedMessage: String, encodeToClass: Class<T>): T{
val decodedBytes = Base64.decodeBase64(encodedMessage)
val messageString = String(decodedBytes, StandardCharsets.UTF_8)
return Gson().fromJson(messageString, encodeToClass)
}
Of course You may omit Gson parsing and put straight away into method Your String transformed to ByteArray

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string