Reading very large gzip file stream in node.js

I'm trying to read a very large gzipped csv file in node.js. So far, I've been using zlib for this:
file.createReadStream().pipe(zlib.createGunzip())
is the stream I pass to Papa.parse. This works fine for most files, but it fails with a very large gzipped CSV file (250 MB, unzips to 1.2 GB), throwing this error:
Error: incorrect header check
at Zlib.zlibOnError [as onerror] (zlib.js:180:17) {
errno: -3,
code: 'Z_DATA_ERROR'
}
Originally I thought the size of the file caused the error, but now I'm not so sure; maybe the file was compressed with a different algorithm. The answers to zlib.error: Error -3 while decompressing: incorrect header check suggest passing either -zlib.Z_MAX_WINDOWBITS or zlib.Z_MAX_WINDOWBITS|16 to correct for that, but I tried both and that's not the problem.
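For reference, the failing pipeline looks roughly like the sketch below. This is a minimal sketch, not the exact original code: the file object, the Papa.parse configuration and the row handling are assumptions.

const zlib = require('zlib');
const Papa = require('papaparse');

// 'file' is assumed to be an object exposing createReadStream(), e.g. an S3 object wrapper.
const csvStream = file.createReadStream().pipe(zlib.createGunzip());

// Errors from the gunzip side surface on the destination stream; this is where Z_DATA_ERROR shows up.
csvStream.on('error', (err) => console.error('gunzip failed:', err));

Papa.parse(csvStream, {
  header: true,
  step: (results) => { /* handle one parsed row */ },
  complete: () => console.log('parsing finished'),
});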

Despite being absolutely sure we had a gzip stream, it turns out we didn't. We got this file from an AWS S3 bucket which contained many versions of it with different timestamps. For that reason, we selected files by prefix and loaded only the most recent one.
However, the bucket also contained JSON files with metadata about these files. It was pure luck that for so long the most recent match was always the gzip rather than the JSON, and recently that luck ran out: this time we got a JSON file instead.
The header check error was entirely correct: the file we were looking at was not the gzip file we thought we had, so it didn't have the proper header.
Leaving this answer here instead of deleting the question because someone running into this error in the future may be absolutely sure they're gunzipping the correct file when they're actually not. Double-check which file you're loading.
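One cheap sanity check that would have caught this: a real gzip stream starts with the magic bytes 0x1f 0x8b. The sketch below is illustrative rather than the code we actually run, and assumes you can buffer the object (or at least its first bytes) before decompressing.

const zlib = require('zlib');

// Refuse to gunzip anything that doesn't carry the gzip magic header.
function gunzipIfGzip(buffer) {
  if (buffer.length >= 2 && buffer[0] === 0x1f && buffer[1] === 0x8b) {
    return zlib.gunzipSync(buffer);
  }
  throw new Error('Not a gzip file - double check which S3 key you actually loaded');
}

A JSON metadata file starts with plain text rather than that header, so a check like this would have flagged the wrong object immediately.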

Related

Linux split of a tar.gz works when joined locally, but not after transfer to a remote machine via an S3 bucket

I have a few files which I packed into a tar.gz archive.
As this file can get too big, I split it with the Linux split command.
As the parts need to be transferred to a different machine, I used an S3 bucket to transfer them, uploading with the application/octet-stream content type.
The downloaded files show exactly the same size as the originals, so no bytes were lost.
Now when I do cat downloaded_files_* > tarball.tar.gz, the size is exactly the same as the original file,
but only the part with _aa gets extracted.
I checked the type of the files:
file downloaded_files_aa
reports a gzip archive (gzip compressed data, from Unix, last modified: Sun May 17 15:00:41 2020),
but all the other parts are reported as plain data.
I am wondering how I can recover the files.
Note: the files were uploaded to S3 with an HTTP upload via API Gateway.
Just putting my debugging findings here, in the hope that they help someone facing the same problem.
As we wanted to use API Gateway, our upload calls were plain HTTP calls rather than calls through the regular AWS SDK:
https://docs.aws.amazon.com/AmazonS3/latest/API/sigv4-post-example.html
Code samples: https://docs.aws.amazon.com/AmazonS3/latest/API/samples/AWSS3SigV4JavaSamples.zip
After some debugging, we found that this leg was working fine.
As the machine to which we wanted to download the files had direct access to S3, we used the AWS SDK for the download, following this URL:
https://docs.aws.amazon.com/AmazonS3/latest/dev/RetrievingObjectUsingJava.html
That code did not work well: although it reported exactly the same file size after download as after upload, the downloaded file lost some information, and the code also complained about bytes still pending. We made some changes to get rid of the error, but it never worked.
The code below, which I found here, works like magic:
// Copy the S3 object to a local file, byte by byte, through buffered streams.
InputStream reader = new BufferedInputStream(object.getObjectContent());
File file = new File("localFilename");
OutputStream writer = new BufferedOutputStream(new FileOutputStream(file));
int read = -1;
while ((read = reader.read()) != -1) {
    writer.write(read);
}
writer.flush();
writer.close();
reader.close();
This code also made the download much faster than our previous approach.

CSV UTF-8 vs Normal CSV in Excel

We have a CSV file that was creating validation errors in a process we run. The validation error made no sense, as the problem it indicated shouldn't have existed: the file was created exactly as instructed. I tried several ways to resolve it without success. I eventually tried re-saving the file as CSV via Excel, noticed the original was in CSV UTF-8 format, and re-saving apparently resolved the error. The new file is 3 bytes smaller than the old one, even though the content should be exactly the same. The file is entirely in English, so I am not sure what was causing this. Can anyone advise why the file is 3 bytes smaller when saved as CSV rather than as CSV UTF-8, even though the data in the file should be identical? These extra 3 bytes have likely caused the validation error.
Thanks for your help
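Not a definitive answer, but a 3-byte difference between Excel's CSV UTF-8 format and plain CSV is typically the UTF-8 byte-order mark (the bytes EF BB BF) that the UTF-8 variant prepends to the file, and some validators choke on those leading bytes. A quick check like the sketch below (the file name is a placeholder) shows whether that is what you are seeing.

const fs = require('fs');

// Placeholder path: point this at the CSV you want to inspect.
const firstBytes = fs.readFileSync('export.csv').subarray(0, 3);

// Excel's CSV UTF-8 variant starts with the byte-order mark EF BB BF.
const hasBom = firstBytes[0] === 0xef && firstBytes[1] === 0xbb && firstBytes[2] === 0xbf;
console.log(hasBom ? 'File starts with a UTF-8 BOM' : 'No UTF-8 BOM');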

AmazonClientException when uploading file with spring-integration-aws S3MessageHandler

I have configured an S3MessageHandler from spring-integration-aws to upload a File object to S3.
The upload fails with the following trace:
Caused by: com.amazonaws.AmazonClientException: Data read has a different length than the expected: dataLength=0; expectedLength=26; includeSkipped=false; in.getClass()=class com.amazonaws.internal.ResettableInputStream; markedSupported=true; marked=0; resetSinceLastMarked=false; markCount=1; resetCount=0
at com.amazonaws.util.LengthCheckInputStream.checkLength(LengthCheckInputStream.java:152)
...
Looking at the source code for S3MessageHandler, I'm not sure how uploading a File would ever succeed. The s3MessageHandler.upload() method does the following when I trace its execution:
1. Creates a FileInputStream for the File.
2. Computes the MD5 hash of the file contents, using that input stream.
3. Resets the stream if it can be reset (not possible for a FileInputStream).
4. Sets up the S3 transfer using the same input stream. This fails because the stream is at EOF, so the number of transferable bytes doesn't match the Content-Length header.
Am I missing something, or is this a bug in the message handler?
Yes; it's a bug; please open an Issue in GitHub and/or a JIRA Issue.
For a FileInputStream, a new one should be created; for InputStream payloads, we need to assert that markSupported() is true if computing the MD5 hash consumes the stream.
Consider contributing a fix after "signing" the CLA.
EDIT
I opened JIRA Issue INTEXT-225.

NodeJS - Reading image source returns incorrect file size

This could be a basic question, but I wanted to understand why the size of a file read using fs.readFileSync is incorrect when the source refers to an image or other non-text file.
Example:
fs.writeFileSync(outputPath, fs.readFileSync(source, 'utf8'));
Because you are calling fs.readFileSync(source, 'utf8').
The important part is the 'utf8': you are telling Node to decode the file as if it were UTF-8 text. A non-text file will not survive that decoding intact, which is why the resulting file size is wrong.
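A minimal sketch of the fix, assuming the goal is simply to copy the file byte for byte (outputPath and source are the same placeholders as above): either drop the encoding argument so readFileSync returns a raw Buffer, or skip the manual read entirely and use fs.copyFileSync.

const fs = require('fs');

// Omitting the encoding returns a Buffer, so binary files keep their exact bytes and size.
fs.writeFileSync(outputPath, fs.readFileSync(source));

// Or let Node copy the file directly.
fs.copyFileSync(source, outputPath);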

saving an image to bytes and uploading to boto3 returning content-MD5 mismatch

I'm trying to pull an image from s3, quantize it/manipulate it, and then store it back into s3 without saving anything to disk (entirely in-memory). I was able to do it once, but upon returning to the code and trying it again it did not work. The code is as follows:
import boto3
import io
from PIL import Image

client = boto3.client('s3', aws_access_key_id='', aws_secret_access_key='')
cur_image = client.get_object(Bucket='mybucket', Key='2016-03-19 19.15.40.jpg')['Body'].read()
loaded_image = Image.open(io.BytesIO(cur_image))
quantized_image = loaded_image.quantize(colors=50)
saved_quantized_image = io.BytesIO()
quantized_image.save(saved_quantized_image, 'PNG')
client.put_object(ACL='public-read', Body=saved_quantized_image, Key='testimage.png', Bucket='mybucket')
The error I received is:
botocore.exceptions.ClientError: An error occurred (BadDigest) when calling the PutObject operation: The Content-MD5 you specified did not match what we received.
It works fine if I just pull an image, and then put it right back without manipulating it. I'm not quite sure what's going on here.
I had this same problem, and the solution was to seek to the beginning of the saved in-memory file:
from io import BytesIO

out_img = BytesIO()
image.save(out_img, img_type)
out_img.seek(0)  # Without this line it fails
self.bucket.put_object(Bucket=self.bucket_name, Key=key, Body=out_img)
The file may need to be saved and reloaded before you send it off to S3, and the file pointer needs to be back at position 0 (hence the seek).
My problem was sending a file after having read out its first few bytes. Re-opening the file cleanly did the trick.
I found this question while getting the same error trying to upload files -- two scripts clashed, one creating the file and the other uploading it. My answer was to create the file as ".filename" and then rename it once it was complete:
os.rename(filename, filename.replace(".filename", "filename"))
The upload script then needs to ignore dot files. This ensured the file was done being created before it was uploaded.
To anyone else facing similar errors: this usually happens when the content of the file is modified during the upload, possibly because another process or thread is writing to it.
A classic example is two scripts modifying the same file at the same time, which triggers the bad digest because the MD5 of the content changes. In the example below, the data file is being uploaded to S3; if another process overwrites it while the upload is in progress, you will end up with this exception.
random_uuid=$(uuidgen)
cat data
aws s3api put-object --acl bucket-owner-full-control --bucket $s3_bucket --key $random_uuid --body data
