How to read and write data in Spark via an S3 access point - apache-spark

I am attempting to use an S3 access point to store data in an S3 bucket. I have tried saving as I would if I had access to the bucket directly:
someDF.write.format("csv").option("header","true").mode("Overwrite")
.save("arn:aws:s3:us-east-1:000000000000:accesspoint/access-point/prefix/")
This returns the error
IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: "arn:aws:s3:us-east-1:000000000000:accesspoint/access-point/prefix/"
I haven't been able to find any documentation on how to do this. Are access points not supported? Is there a way to set up the access point as a custom data source?
Thank you

The problem is that you have provided the ARN instead of the S3 URL. The URL would be something like this (assuming accesspoint is the bucket name):
s3://accesspoint/access-point/prefix/
If you navigate to the object or prefix in the AWS console, there is a Copy S3 URL button at the top right.
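For reference, here is a minimal PySpark sketch of the same write using an S3 URL instead of the ARN. The bucket name and prefix are just the placeholders from the question, and the s3a:// scheme assumes the Hadoop S3A connector is configured (on EMR, plain s3:// works as well).

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3-write-example").getOrCreate()

# Toy DataFrame standing in for someDF from the question.
someDF = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# Save to the S3 URL rather than the access point ARN.
someDF.write.format("csv") \
    .option("header", "true") \
    .mode("overwrite") \
    .save("s3a://accesspoint/access-point/prefix/")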

Related

How can TensorFlow read a file from an S3 byte stream

I have built a deep learning model in TensorFlow for image recognition. It works when reading an image file from a local directory with the tf.read_file() method, but now I need TensorFlow to read the file from a byte stream that pulls the image from an Amazon S3 bucket, without storing the stream in a local directory.
You should be able to pass in the fully formed s3 path to tf.read_file(), like:
s3://bucket-name/path/to/file.jpeg, where bucket-name is the name of your S3 bucket and path/to/file.jpeg is where the file is stored in your bucket. It seems possible you are running into an access-permissions issue, depending on whether your bucket is private. You can follow https://github.com/tensorflow/examples/blob/master/community/en/docs/deploy/s3.md to set up your credentials.
Is there an error you ran into when doing this?
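For what it's worth, here is a minimal sketch of that call, assuming a TensorFlow build with S3 filesystem support (built into older releases, provided by tensorflow-io in newer ones) and AWS credentials available in the environment; the bucket and key are the placeholders from the answer.

import tensorflow as tf

# Read the object directly from S3 as raw bytes.
image_bytes = tf.io.read_file("s3://bucket-name/path/to/file.jpeg")

# Decode the JPEG bytes into an image tensor, without touching local disk.
image = tf.io.decode_jpeg(image_bytes)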

BOTO3 - Getting Access Denied when copying an S3 object

I am trying to copy from one bucket to another, and each bucket has its own access key and secret.
I can connect to the first bucket and download a file just fine. It might be important to note that I do not have full access to the bucket I am copying from, meaning I cannot read all keys in the bucket, just a subset I have access to. I have complete control over the second bucket I am copying to.
client2 is where I am copying to and client is where I am copying from.
copy_source = {
    'Bucket': bucketName,
    'Key': key
}
client2.copy(CopySource=copy_source, Bucket=bucketName2, Key=key, SourceClient=client)
Here is the error I get:
botocore.exceptions.ClientError: An error occurred (AccessDenied) when calling the UploadPartCopy operation: Access Denied
I am a newbie and any help would be greatly appreciated!!
The reason you're likely getting the Access Denied is that the SourceClient is only used for getting the size of the object, to determine whether it can be copied directly or a multi-part upload is required.
When it comes to the actual copy itself, it delegates to the underlying copy_object method on the destination client, which does not accept a SourceClient and calls out to the S3 API's PUT Object - Copy operation.
As such, if you want to perform an S3 copy from one bucket to another, you can either give the user associated with the access key used by client2 permission to read from the source bucket, or you can perform an S3 Get using client and then an S3 Put with client2.
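Here is a rough sketch of that second option (Get with the source client, Put with the destination client); the bucket and key names mirror the question, the credentials are placeholders, and reading the whole body into memory is only reasonable for smallish objects.

import boto3

# Placeholders mirroring the question; replace with your own values.
bucketName = "source-bucket"
bucketName2 = "destination-bucket"
key = "path/to/object"

# Source client: can read the subset of keys you have access to.
client = boto3.client("s3",
                      aws_access_key_id="SOURCE_KEY",
                      aws_secret_access_key="SOURCE_SECRET")
# Destination client: full control over the second bucket.
client2 = boto3.client("s3",
                       aws_access_key_id="DEST_KEY",
                       aws_secret_access_key="DEST_SECRET")

# Get the object with the client that can read it...
obj = client.get_object(Bucket=bucketName, Key=key)

# ...then put it with the client that can write to the destination bucket.
client2.put_object(Bucket=bucketName2, Key=key, Body=obj["Body"].read())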

Amazonka: How to generate an s3:// URI from Network.AWS.S3.Types.Object?

I've been using turtle to call "aws s3 ls" and I'm trying to figure out how to replace that with amazonka.
Absolute S3 URLs were central to how my program worked. I now know how to get objects and filter them, but I don't know how to convert an object to an S3 URL to integrate with my existing program.
I came across the getFile function and tried downloading a file from S3.
Perhaps I had something wrong, but it didn't seem like the S3 bucket and S3 object key alone were enough to download a given file. If I'm wrong about that, I need to double-check my configuration.

Amazon S3 403 Forbidden Error for KML files but not JPG files

I am able to successfully upload (put object) jpg files to S3 with a particular code path, but receive a 403 forbidden error when using the same code path to upload a KML file. I am not restricting file types explicitly with "bucket policy," but feel that this must somehow be tied to bucket policy or CORS configuration.
I was using code based off the Heroku tutorial for uploading images to Amazon S3. The issue ended up being that the appropriate MIME type is "application/vnd.google-earth.kml+xml", and the '+' symbol was being replaced with a space when fetching the file-type query parameter for our own S3 endpoint that generates signed requests. We were able to fix this quickly by forcing the ContentType to "application/vnd.google-earth.kml+xml" for all KML files going to our signed-request endpoint.
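As a rough illustration of that fix (the original endpoint followed the Heroku Node tutorial, so this boto3 version is only an approximation): force the KML MIME type for .kml keys instead of trusting the possibly '+'-mangled query parameter when generating the signed request.

import boto3

s3 = boto3.client("s3")

def signed_put_url(bucket, key, requested_type):
    # '+' in "application/vnd.google-earth.kml+xml" is commonly decoded as a
    # space in query strings, so hard-code the type for .kml keys.
    if key.lower().endswith(".kml"):
        content_type = "application/vnd.google-earth.kml+xml"
    else:
        content_type = requested_type
    return s3.generate_presigned_url(
        "put_object",
        Params={"Bucket": bucket, "Key": key, "ContentType": content_type},
        ExpiresIn=3600,
    )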

Setting Metadata in Google Cloud Storage (Export from BigQuery)

I am trying to update the metadata (programmatically, from Python) of several CSV/JSON files that are exported from BigQuery. The application that exports the data is the same as the one modifying the files (thus using the same server certificate). The export goes well, until I try to use the objects.patch() method to set the metadata I want. The problem is that I keep getting the following error:
apiclient.errors.HttpError: <HttpError 403 when requesting https://www.googleapis.com/storage/v1/b/<bucket>/<file>?alt=json returned "Forbidden">
Obviously, this has something to do with bucket or file permissions, but I can't manage to get around it. If the same certificate is used for writing the files and updating their metadata, why am I unable to update it? The bucket was created with the same certificate.
If that's the exact URL you're using, it's a URL problem: you're missing the /o/ between the bucket name and the object name.
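For comparison, a hedged sketch of the corrected request: the URL needs the /o/ segment between the bucket and the object, and the equivalent call through google-api-python-client (the newer name for the apiclient import in the traceback) is objects().patch(). The bucket/object names and metadata below are placeholders, and the exact parameter names are worth checking against your client version.

from googleapiclient.discovery import build

# Correct URL shape: https://www.googleapis.com/storage/v1/b/<bucket>/o/<file>?alt=json
# (note the /o/ between the bucket name and the object name)

service = build("storage", "v1")  # assumes application default credentials

service.objects().patch(
    bucket="my-bucket",
    object="exports/data.json",
    body={"metadata": {"source": "bigquery-export"}},
).execute()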
