How to copy S3 object with special character in key - node.js

I have objects in an S3 bucket, and I do not have control over the names of the keys. Some of these keys contain special characters, and the AWS SDK does not like them.
For example, one object key is: folder/‍Johnson, Scott to JKL-Discovery.pdf. It might look fine at first glance, but if I URL-encode it: folder%2F%E2%80%8DJohnson%2C+Scott+to+JKL-Discovery.pdf, you can see that after folder/ (or folder%2F when encoded) there is a stray sequence of characters, %E2%80%8D, before Johnson.
It is unclear where these characters come from; however, I need to be able to handle this case. When I try to make a copy of this object using the Node.js AWS SDK,
const copyParams = {
  Bucket,
  CopySource,
  Key: `folder/‍Johnson, Scott to JKL-Discovery.pdf`
};
let metadata = await s3.copyObject(copyParams).promise();
It fails because it can't find the object; if I encodeURI() the key, it also fails.
How can I deal with this?
DO NOT SUGGEST I CHANGE THE ALLOWED CHARACTERS IN THE KEY NAME. I DO NOT HAVE CONTROL OVER THIS

I faced the same problem, but with PHP. The copyObject() method automatically encodes the destination parameters (Bucket and Key), but not the source parameter (CopySource), so the source has to be encoded manually. In PHP it looks like this:
$s3->copyObject([
    'Bucket' => $targetBucket,
    'Key' => $targetFilePath,
    'CopySource' => $s3::encodeKey("{$sourceBucket}/{$sourceFilePath}"),
]);
I'm not familiar with Node.js, but there may be a similar encodeKey() helper you can use there as well.
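I'm not aware of a public encodeKey() in the Node.js SDK, so the closest equivalent is probably to encode the CopySource value yourself (roughly what the PHP helper does: percent-encode each path segment and keep the slashes) while leaving Key un-encoded. A sketch against SDK v2, where the bucket and key arguments are placeholders for your own values:
const AWS = require('aws-sdk');
const s3 = new AWS.S3();

async function copyWithSpecialChars(sourceBucket, sourceKey, targetBucket, targetKey) {
  // Percent-encode each segment of the source key, keeping the '/' separators.
  const encodedSourceKey = sourceKey.split('/').map(encodeURIComponent).join('/');

  return s3.copyObject({
    Bucket: targetBucket,
    Key: targetKey, // destination key: the SDK encodes this one for you
    CopySource: `${sourceBucket}/${encodedSourceKey}` // the source has to be encoded manually
  }).promise();
}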

Trying your string, there's a tricky zero-width Unicode character in it (the %E2%80%8D sequence is U+200D, a zero-width joiner)...
http://www.ltg.ed.ac.uk/~richard/utf-8.cgi?input=342+200+213&mode=obytes
I would sanitize the string by removing the non-ASCII characters and then proceed with the URL encoding requested by the official docs:
encodeURI('folder/‍Johnson, Scott to JKL-Discovery.pdf'.replace(/[^\x00-\x7F]/g, ""))

Related

How to get and submit characters like letters with accents and 'ñ' in a Loopback 4 API

I have an API created in Loopback 4 which retrieves data from a PostgreSQL 13 database encoded with UTF8. Visiting the API explorer (localhost:3000/explorer) and executing GET requests, I realize that even when the database fields contain characters like letters with accents and ñ's, the retrieved JSON only shows blanks in the positions where those characters should have appeared. For example, if the database has a field with a word like 'piña', the JSON returns 'pi a'.
When I try a POST request, inserting a field like 'ramírez' (note the í), the field is shown in the database as 'ramφrez', and when I execute a GET of that entry, the JSON now has the correct value, 'ramírez'.
How can I fix that?
I'd recommend using the Buffer class:
var encodedString = Buffer.from('string', 'utf-8');
This way you will be able to return anything you want. In Node.js the Buffer class is already included, so you don't need to install any dependencies.
If you don't get what you need, you can change the 'utf-8' part.
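To make that concrete, this is the kind of round trip Buffer is typically used for when UTF-8 bytes have been decoded with the wrong charset; whether it applies depends on where in your stack the mis-decoding happens, so treat it as a sketch:
// 'piña' stored as UTF-8 but decoded as latin1 comes out as 'piÃ±a'.
const garbled = 'piÃ±a';

// Recover the original bytes, then decode them as UTF-8.
const repaired = Buffer.from(garbled, 'latin1').toString('utf-8');
console.log(repaired); // 'piña'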

Can't list bucket objects on Scaleway using boto3

I saw a few similar posts, but unfortunately none helped me.
I have an S3 bucket (on Scaleway), and I'm trying to simply list all objects contained in that bucket, using the boto3 S3 client as follows:
s3 = boto3.client('s3',
    region_name=AWS_S3_REGION_NAME,
    endpoint_url=AWS_S3_ENDPOINT_URL,
    aws_access_key_id=AWS_ACCESS_KEY_ID,
    aws_secret_access_key=AWS_SECRET_ACCESS_KEY
)
all_objects = s3.list_objects_v2(Bucket=AWS_STORAGE_BUCKET_NAME)
This simple piece of code responds with an error:
botocore.errorfactory.NoSuchKey: An error occurred (NoSuchKey) when calling the ListObjects operation: The specified key does not exist.
First, the error seems inappropriate to me since I'm not specifying any key to search for. I also tried to pass a Prefix argument to this method to narrow down the search to a specific subdirectory; same error.
Second, I tried to achieve the same thing using a boto3 Resource rather than a Client, as follows:
session = boto3.Session(
    region_name=AWS_S3_REGION_NAME,
    aws_access_key_id=AWS_ACCESS_KEY_ID,
    aws_secret_access_key=AWS_SECRET_ACCESS_KEY
)
resource = session.resource(
    's3',
    endpoint_url=AWS_S3_ENDPOINT_URL,
)
for bucket in resource.buckets.all():
    print(bucket.name)
That code produces absolutely nothing. One weird thing that strikes me is that I don't pass the bucket name anywhere here, which seems to be normal according to the AWS documentation.
There's no chance that I misconfigured the client, since I'm able to use the put_object method perfectly with that same client. One strange thing, though: when I want to put a file, I pass the whole path to put_object as Key (as I found that to be the way to go), but the object is inserted with the bucket name prepended to it. So if I call put_object(Key='/path/to/myfile.ext'), the object ends up as /bucket-name/path/to/myfile.ext.
Is this strange behavior the key to my problem? How can I investigate what's happening, or is there another way I could try to list the bucket's files?
Thank you
EDIT: So, after logging the request that the boto3 client is sending, I noticed that the bucket name is appended to the URL, so instead of requesting https://<bucket_name>.s3.<region>.<provider>/, it requests https://<bucket_name>.s3.<region>.<provider>/<bucket-name>/, which leads to the NoSuchKey error.
I took a look into the botocore library, and I found this:
url = _urljoin(endpoint_url, r['url_path'], host_prefix)
in botocore.awsrequest line 252, where r['url_path'] contains /skichic-bucket?list-type=2. So from here, I should be able to easily patch the library core to make it work for me.
Plus, the Prefix argument is not working; whatever I pass into it, I always receive the whole bucket's contents, but I guess I can easily patch this too.
Now, this isn't satisfying: since there's no issue related to this on GitHub, I can't believe the library contains such a bug and that I'm the first one to encounter it.
Can anyone explain this whole mess? >.<
For those who are facing the same issue, try changing the endpoint_url parameter in your boto3 client or resource instantiation from https://<bucket_name>.s3.<region>.<provider> to https://s3.<region>.<provider>; e.g. for Scaleway: https://s3.<region>.scw.cloud.
You can then set the Bucket parameter to select the bucket you want.
list_objects_v2(Bucket=<bucket_name>)
You can try this. You'll have to use your own resource instead of my s3sr.
from boto3 import resource

s3sr = resource('s3')
bucket = 'your-bucket'
prefix = 'your-prefix/'  # if no prefix, pass ''

def get_keys_from_prefix(bucket, prefix):
    '''gets list of keys for given bucket and prefix'''
    keys_list = []
    paginator = s3sr.meta.client.get_paginator('list_objects_v2')
    # use Delimiter to limit the search to that level of the hierarchy
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix, Delimiter='/'):
        # a page may have no 'Contents' key if nothing matches the prefix
        keys = [content['Key'] for content in page.get('Contents', [])]
        print('keys in page: ', len(keys))
        keys_list.extend(keys)
    return keys_list

keys_list = get_keys_from_prefix(bucket, prefix)
After looking more closely into things, I've found out that a lot of botocore service endpoint patterns start with the bucket name. For example, here's the definition of the list_objects_v2 operation:
"ListObjectsV2":{
"name":"ListObjectsV2",
"http":{
"method":"GET",
"requestUri":"/{Bucket}?list-type=2"
},
My guess is that in the standard implementation of AWS S3 there's a generic endpoint_url (which explains @jordanm's comment) and the targeted bucket is reached through the request path.
Now, in the case of Scaleway, there's an endpoint_url for each bucket, with the bucket name contained in that URL (e.g. https://<bucket_name>.s3.<region>.<provider>), so the request path should start directly with an object key.
I made a fork of botocore where I rewrote every endpoint to remove the bucket name, in case that can help someone in the future.
Thanks again to all contributors!

Generate a consistent sha256 hash from an object in Node

I have an object that I'd like to hash with SHA-256 in Node. The contents of the object are simple JavaScript types. For example's sake, let's say:
var payload = {
  "id": "3bab3f00-7d55-11e7-9b0a-4c32759242a5",
  "foo": "a message",
  "version": 7,
};
I create a hash like this:
const crypto = require('crypto');
var hash = crypto.createHash('sha256');
hash.update( ... ).digest('hex');
The question is, what to pass to update? The documentation for crypto says you can pass a <string> | <Buffer> | <TypedArray> | <DataView>, which seems to suggest an object is not a good thing to pass.
I can't use toString() because that prints "[object Object]". I could use JSON.stringify; however, I have read elsewhere that the output from stringify is not guaranteed to be deterministic for the same input.
Are there any other options? I do not want to download a package from NPM.
The right terms are "canonical" and, for the action, "canonicalization" (I'm assuming EN-US here); you can find a stringify that produces canonical output here.
Beware that you must make sure the output also has the right character set (UTF-8 should be preferred) and line endings. Spurious data should not be present; e.g. a byte order mark or a NUL terminator is enough to change the hash value.
After that you can pass it in as a string, I suppose.
You can of course use any canonical encoding. Note that XML has defined XML-DSig, which includes canonicalization during signature generation and verification, which means that verification will even succeed if the XML is reformatted (without altering the structure or contents, of course; whitespace and indentation will not matter).
I'd still recommend regression testing between implementations and even version updates of the libraries.
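If you'd rather avoid an NPM package, a minimal sketch of the canonicalization idea is to sort object keys recursively before calling JSON.stringify. This assumes the payload only contains plain objects, arrays, strings, numbers and booleans (no cycles, Dates or other special types), so it is not a full canonicalization scheme:
const crypto = require('crypto');

// Rebuild the value with object keys in sorted order so that equal data
// always serializes to the same JSON string.
function canonicalize(value) {
  if (Array.isArray(value)) return value.map(canonicalize);
  if (value !== null && typeof value === 'object') {
    return Object.keys(value).sort().reduce((acc, key) => {
      acc[key] = canonicalize(value[key]);
      return acc;
    }, {});
  }
  return value;
}

const digest = crypto.createHash('sha256')
  .update(JSON.stringify(canonicalize(payload)), 'utf8')
  .digest('hex');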

JSON stored in AWS EB environment variables is retrieved without quotes

I'm running a Node.js EB container and trying to store JSON inside an environment variable. The JSON is stored correctly, but when retrieving it via process.env.MYVARIABLE it comes back with all the double quotes stripped.
E.g. MYVARIABLE looks like this:
{ "prop": "value" }
When I retrieve it via process.env.MYVARIABLE, its value is actually { prop: value }, which isn't valid JSON. I've tried to escape the quotes with '\', i.e. { \"prop\": \"value\" }, but that just adds more weird behavior where the string comes back as {\ \"prop\\":\ \"value\\" }. I've also tried wrapping the whole thing in single quotes, e.g. '{ "prop": "value" }', but it seems to strip those out too.
Anyone know how to store JSON in environment variables?
EDIT: Some more info: it would appear that certain characters are being doubly escaped when you set an environment variable. E.g. if I wrap the object in single quotes, the value when I fetch it using the SDK becomes:
\'{ "prop": "value"}\'
Also, if I leave the quotes out, backslashes get escaped, so if the object looks like {"url": "http://..."} the result when I query it via the SDK is {"url": "http:\\/\\/..."}
Not only is it mangling the text, it's also rearranging the JSON properties, so properties are appearing in a different order than what I set them to.
UPDATE
So I would say this seems to be a bug in AWS, based on the fact that it appears to mangle the values that are submitted. This happens whether I use the Node.js SDK or the web console. As a workaround I've taken to replacing double quotes with single quotes on the JSON object during deployment and then swapping them back in the application.
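For reference, that quote-swapping workaround looks roughly like this (sketch only; it obviously breaks if any value legitimately contains a single or double quote):
// At deploy time, before writing the environment variable:
const safeValue = JSON.stringify({ prop: 'value' }).replace(/"/g, "'");

// In the application, before parsing:
const settings = JSON.parse(process.env.MYVARIABLE.replace(/'/g, '"'));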
Use base64 encoding
An important string is being auto-magically mangled. We don't know the internals of EB, but we can guess it is parsing JSON. So don't store JSON, store the base64-encoded JSON:
a = `{ "public": { "s3path": "https://d2v4p3rms9rvi3.cloudfront.net" } }`
x = btoa(a) // store this as B_MYVAR
// "eyAicHVibGljIjogeyAiczNwYXRoIjogImh0dHBzOi8vZDJ2NHAzcm1zOXJ2aTMuY2xvdWRmcm9udC5uZXQiIH0gfQ=="
settings = JSON.parse(atob(process.env.B_MYVAR))
settings.public.s3path
// "https://d2v4p3rms9rvi3.cloudfront.net"
// Or even:
process.env.MYVAR = atob(process.env.B_MYVAR)
// Sets MYVAR at runtime, hopefully soon enough for your purposes
Since this is JS, there are caveats about UTF-8 and Node/browser support, but I think atob and btoa are widely available. Docs.
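In Node specifically you can sidestep the atob/btoa availability question with Buffer (those two only became Node globals in relatively recent versions); a quick sketch of the same idea:
// Encode once, e.g. in your deployment script:
const b64 = Buffer.from('{ "public": { "s3path": "https://d2v4p3rms9rvi3.cloudfront.net" } }', 'utf-8')
  .toString('base64'); // store this as B_MYVAR

// Decode in the application:
const settings = JSON.parse(Buffer.from(process.env.B_MYVAR, 'base64').toString('utf-8'));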

mongodb, node.js and encrypted data

I'm working on a project which involves a lot of encrypted data. Basically, these are JSON objects serialized into a string, then encrypted with AES-256 into ciphertext, which then has to be stored in Mongo.
I could of course do this the way described above, which stores the ciphertext as a String in a BSON document. However, this way, if for some reason along the way the ciphertext isn't treated properly (for instance, a different charset or whatever other reason), the ciphertext is altered and I cannot rebuild the original string anymore. With millions of records, that's unacceptable (it's also slow).
Is there a proper way to save the ciphertext in some kind of native binary format, retrieve it as binary, and then turn it back into the original string? I'm used to working with strings; my skills with binary formats are pretty rusty. I'm very interested in hearing your thoughts on the subject.
Thanks everyone for your input,
Fabian
yes :)
var Binary = require('mongodb').Binary;
var doc = {
  data: new Binary(new Buffer(256))
}
or with 1.1.5 of the driver you can do
var doc = {
  data: new Buffer(256)
}
The data is always returned as a Binary object, however, and not a Buffer. The link to the docs is below.
http://mongodb.github.com/node-mongodb-native/api-bson-generated/binary.html
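A rough round-trip sketch with a recent driver, assuming the AES-256 ciphertext is already in a Buffer (the connection string, database and collection names are placeholders):
const { MongoClient, Binary } = require('mongodb');

async function storeAndLoad(ciphertext /* Buffer */) {
  const client = await MongoClient.connect('mongodb://localhost:27017'); // placeholder URI
  const secrets = client.db('mydb').collection('secrets');               // placeholder names

  // Store the raw bytes; no charset conversion can touch them.
  await secrets.insertOne({ _id: 'example', data: new Binary(ciphertext) });

  // Read them back: .buffer exposes the underlying bytes of the Binary value.
  const doc = await secrets.findOne({ _id: 'example' });
  const recovered = doc.data.buffer; // the exact ciphertext bytes

  await client.close();
  return recovered; // decrypt this, then toString('utf-8') gives the original JSON string
}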
