AmazonClientException when uploading file with spring-integration-aws S3MessageHandler - spring-integration

I have configured an S3MessageHandler from spring-integration-aws to upload a File object to S3.
The upload fails with the following trace:
Caused by: com.amazonaws.AmazonClientException: Data read has a different length than the expected: dataLength=0; expectedLength=26; includeSkipped=false; in.getClass()=class com.amazonaws.internal.ResettableInputStream; markedSupported=true; marked=0; resetSinceLastMarked=false; markCount=1; resetCount=0
at com.amazonaws.util.LengthCheckInputStream.checkLength(LengthCheckInputStream.java:152)
...
Looking at the source code for S3MessageHandler, I'm not sure how uploading a File would ever succeed. The s3MessageHandler.upload() method does the following when I trace its execution:
Creates a FileInputStream for the File.
Computes the MD5 hash for the file contents, using the input stream.
Resets the stream if it can be reset (not possible for FileInputStream).
Sets up the S3 transfer using the input stream. This fails because the stream is at the EOF, so the number of transferable bytes doesn't match what's in the Content-Length header.
Am I missing something, or is this a bug in the message handler?

Yes, it's a bug; please open an issue on GitHub and/or a JIRA issue.
For a FileInputStream, a new one should be created; for InputStream payloads, we need to assert that markSupported() is true if the MD5 computation consumes the stream.
Consider contributing a fix after "signing" the CLA.
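In the meantime, one possible workaround is to avoid the File payload altogether: read the file into memory and send a byte[] payload, so both the MD5 pass and the transfer read from a resettable in-memory stream. A minimal sketch, assuming the handler accepts byte[] payloads and that the S3 key is resolved by its configured keyExpression (the class and channel names here are made up for illustration):

import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

import org.springframework.integration.support.MessageBuilder;
import org.springframework.messaging.Message;
import org.springframework.messaging.MessageChannel;

public class S3UploadWorkaround {

    // Read the file fully into memory and send the bytes instead of the File,
    // so the handler's MD5 computation and the actual transfer both work from
    // a resettable ByteArrayInputStream rather than a consumed FileInputStream.
    public static void sendAsBytes(MessageChannel s3UploadChannel, File file) throws IOException {
        byte[] contents = Files.readAllBytes(file.toPath());
        Message<byte[]> message = MessageBuilder.withPayload(contents).build();
        s3UploadChannel.send(message);
    }
}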
EDIT
I opened JIRA Issue INTEXT-225.

Related

Reading very large gzip file stream in node.js

I'm trying to read a very large gzipped CSV file in node.js. So far, I've been using zlib for this:
file.createReadStream().pipe(zlib.createGunzip())
is the stream I pass to Papa.parse. This works fine for most files, but it fails with a very large gzipped CSV file (250 MB, unzips to 1.2 GB), throwing this error:
Error: incorrect header check
at Zlib.zlibOnError [as onerror] (zlib.js:180:17) {
errno: -3,
code: 'Z_DATA_ERROR'
}
Originally I thought it was the size of the file that caused the error, but now I'm not so sure; maybe it's because the file has been compressed using a different algorithm. zlib.error: Error -3 while decompressing: incorrect header check suggests passing either -zlib.Z_MAX_WINDOWBITS or zlib.Z_MAX_WINDOWBITS|16 to correct for that, but I tried it and that's not the problem.
Despite being absolutely sure we had a gzip stream, it turns out we didn't. We got this file from an AWS S3 bucket which contained a lot of versions of this file with different time stamps. For that reason, we selected files based on prefix and loaded only the most recent one.
However, the S3 bucket also contained JSON files with metadata about these files. It was pure luck that for so long we always got the gzip file instead of the JSON, and recently that luck ran out: this time we got a JSON file instead.
The header check error was entirely correct: the file we were looking at was not the gzip file we thought we had, so it didn't have the proper header.
Leaving this answer here instead of removing the question because it's always possible that someone in the future running into this error is absolutely sure they're gunzipping the correct file when they're actually not. Double check which file you're loading.

Linux split of a tar.gz works when the parts are joined locally, but not when transferred to a remote machine via an S3 bucket

I have a few files which I packed into a tar.gz archive.
As this archive can get quite big, I used the Linux split command to break it into parts.
As these parts need to be transferred to a different machine, I used an S3 bucket to transfer them, uploading with the application/octet-stream content type.
The downloaded files show exactly the same size as the originals, so no bytes appear to be lost.
Now when I do cat downloaded_files_* > tarball.tar.gz, the size is exactly the same as the original file,
but only the part with _aa gets extracted.
I checked the type of the files:
file downloaded_files_aa
reports gzip compressed data, from Unix, last modified: Sun May 17 15:00:41 2020,
but all the other files are reported as just data.
I am wondering how I can get my files back.
Note: the files were uploaded to S3 via an HTTP upload through API Gateway.
Just putting my debugging findings here in the hope that they will help someone facing the same problem.
As we wanted to use API Gateway, our upload calls were plain HTTP calls rather than going through the regular AWS SDK.
https://docs.aws.amazon.com/AmazonS3/latest/API/sigv4-post-example.html
Code Samples: https://docs.aws.amazon.com/AmazonS3/latest/API/samples/AWSS3SigV4JavaSamples.zip
After some debugging, we found this leg was working fine.
As the machine to which we wanted to download the files had direct access to S3, we used the AWS SDK for downloading them.
This is the URL
https://docs.aws.amazon.com/AmazonS3/latest/dev/RetrievingObjectUsingJava.html
That code did not work well: although the downloaded file showed exactly the same size as the uploaded one, the file lost some information, and the code also complained about bytes still pending. We made some changes to get rid of the error, but it never worked.
The code which I found here works like magic:
// 'object' is the com.amazonaws.services.s3.model.S3Object returned by s3Client.getObject(...)
InputStream reader = new BufferedInputStream(object.getObjectContent());
File file = new File("localFilename");
OutputStream writer = new BufferedOutputStream(new FileOutputStream(file));

// copy one byte at a time until end of stream
int read = -1;
while ((read = reader.read()) != -1) {
    writer.write(read);
}

writer.flush();
writer.close();
reader.close();
This code also made the download much faster than our previous approach.
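Reading one byte at a time still makes a method call per byte, so a buffered chunk copy should be faster again. A minimal sketch under the same assumptions (the S3Object comes from the AWS SDK for Java; the class and method names are made up for illustration):

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

import com.amazonaws.services.s3.model.S3Object;

public class S3Download {

    // Copy the S3 object to a local file in 8 KB chunks instead of byte by byte.
    public static void downloadToFile(S3Object object, String localFilename) throws IOException {
        try (InputStream in = object.getObjectContent();
             OutputStream out = new FileOutputStream(localFilename)) {
            byte[] buffer = new byte[8192];
            int bytesRead;
            while ((bytesRead = in.read(buffer)) != -1) {
                out.write(buffer, 0, bytesRead);
            }
        }
    }
}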

DocuSign Connect: pdfbytes leads to corrupted pdf file

I am trying to connect DocuSign with my Java application, and I was successful.
I have created a listener for the DocuSign response after the user completes the signing process, so that the document is saved/updated automatically in my system.
I am able to get that response in XML format with the pdfbytes, but as soon as I create a PDF from those pdfBytes, I am not able to open it (the pdfbytes might be corrupted).
I am Base64 decoding those bytes before generating the PDF.
This is a common problem when the pdfbytes are not managed as a run of binary bytes. At some point you may be treating the data as a string. The PDF file becomes corrupted at that point.
Issues to check:
When you Base64 decode the string, the result is binary. Is your receiving variable capable of receiving binary data? (No codeset transformations.)
When you write your binary buffer to the output file, check that your output file format is binary clean. This is especially an issue on Windows systems.
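For reference, a minimal binary-safe sketch in Java (the pdfBytesBase64 variable is assumed to hold the Base64 string taken from the Connect XML; the class and variable names are made up for illustration):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Base64;

public class PdfBytesWriter {

    // Decode the Base64 text to raw bytes and write them unchanged to disk,
    // with no character-set conversion anywhere in the pipeline.
    // The MIME decoder tolerates line breaks inside the Base64 payload.
    public static void writePdf(String pdfBytesBase64, String outputPath) throws IOException {
        byte[] pdfBytes = Base64.getMimeDecoder().decode(pdfBytesBase64);
        Files.write(Paths.get(outputPath), pdfBytes);
    }
}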
If you're still having a problem, edit your question to include your code.

Saving an image to bytes and uploading with boto3 returns a Content-MD5 mismatch

I'm trying to pull an image from s3, quantize it/manipulate it, and then store it back into s3 without saving anything to disk (entirely in-memory). I was able to do it once, but upon returning to the code and trying it again it did not work. The code is as follows:
import boto3
import io
from PIL import Image
client = boto3.client('s3', aws_access_key_id='',
                      aws_secret_access_key='')
cur_image = client.get_object(Bucket='mybucket',Key='2016-03-19 19.15.40.jpg')['Body'].read()
loaded_image = Image.open(io.BytesIO(cur_image))
quantized_image = loaded_image.quantize(colors=50)
saved_quantized_image = io.BytesIO()
quantized_image.save(saved_quantized_image,'PNG')
client.put_object(ACL='public-read',Body=saved_quantized_image,Key='testimage.png',Bucket='mybucket')
The error I received is:
botocore.exceptions.ClientError: An error occurred (BadDigest) when calling the PutObject operation: The Content-MD5 you specified did not match what we received.
It works fine if I just pull an image, and then put it right back without manipulating it. I'm not quite sure what's going on here.
I had this same problem, and the solution was to seek to the beginning of the saved in-memory file:
out_img = BytesIO()
image.save(out_img, img_type)
out_img.seek(0) # Without this line it fails
self.bucket.put_object(Bucket=self.bucket_name,
                       Key=key,
                       Body=out_img)
The file may need to be saved and reloaded before you send it off to S3, and the file pointer also needs to be seeked back to position 0.
My problem was sending a file after reading out the first few bytes of it. Reopening the file cleanly did the trick.
I found this question while getting the same error trying to upload files -- two scripts clashed, one creating the file and the other uploading it. My answer was to create the file under a dot-prefixed temporary name ending in ".filename" and rename it once writing is finished:
os.rename(filename, filename.replace(".filename", "filename"))
The upload script then needs to ignore dot files. This ensured the file was done being created before it was uploaded.
To anyone else facing similar errors: this usually happens when the content of the file gets modified during the upload, possibly because the file is being written by another process or thread.
A classic example would be two scripts modifying the same file at the same time, which throws the BadDigest error because the MD5 of the content changes. In the example below, the data file is being uploaded to S3; if another process overwrites it while the upload is in progress, you will end up with this exception:
random_uuid=$(uuidgen)
cat data
aws s3api put-object --acl bucket-owner-full-control --bucket $s3_bucket --key $random_uuid --body data

ParseExceptions when using HQL file on HDInsight

I'm following this tutorial http://azure.microsoft.com/en-us/documentation/articles/hdinsight-use-hive/ but have become stuck when changing the source of the query to use a file.
It all works happily when using New-AzureHDInsightHiveJobDefinition -Query $queryString, but when I try New-AzureHDInsightHiveJobDefinition -File "/example.hql", with example.hql stored in the "root" of the blob container, I get ExitCode 40000 and the following in standard error:
Logging initialized using configuration in file:/C:/apps/dist/hive-0.11.0.1.3.7.1-01293/conf/hive-log4j.properties
FAILED: ParseException line 1:0 character 'ï' not supported here
line 1:1 character '»' not supported here
line 1:2 character '¿' not supported here
Even when I deliberately misspell the HQL filename, the above error is still generated along with the expected file-not-found error, so it's not the content of the HQL file that's causing the error.
I have not been able to find the hive-log4j.properties file in the blob store to see if it's corrupt. I have torn down the HDInsight cluster, deleted the associated blob store, and started again, but ended up with the same result.
Would really appreciate some help!
I am able to induce a similar error by putting a UTF-8 or Unicode encoded .hql file into blob storage and attempting to run it. The three unsupported characters in the error are a tell-tale sign: they are the UTF-8 byte order mark (bytes EF BB BF) at the start of the file, which the Hive parser does not strip. Try saving your example.hql file as 'ANSI' in Notepad (open the file, choose Save As, and the encoding option is at the bottom of the dialog), then copy it to blob storage and try again.
If the file is not found by Start-AzureHDInsightJob, then that cmdlet errors out and does not return a new AzureHDInsightJob object. If you had a previous instance of the result saved, the subsequent Wait-AzureHDInsightJob and Get-AzureHDInsightJobOutput would be referring to a previous run, giving the illusion of the same error for the not-found case. That error should definitely indicate a problem reading a UTF-8 or Unicode file when one is not expected.
