Append files to existing S3 bucket folder via Spark - apache-spark

I am working in Spark, where we need to write data to an S3 bucket after performing some transformations. I know that writing data to HDFS/S3 via Spark throws an exception if the folder path already exists. So in our case, if S3://bucket_name/folder already exists when we write data to that same S3 bucket path, it will throw an exception.
Now, one possible solution is to use the OVERWRITE mode while writing through Spark. But that would delete all the files already present in the folder. I want some kind of APPEND functionality for the same folder: if the folder already has some files, the write would just add more files to it.
I am not sure whether the API gives any such functionality out of the box. Of course, there is the option of creating a temporary folder inside the folder and saving the file there, then moving that file to its parent folder and deleting the temporary folder. But this kind of approach is not ideal.
So please suggest how to proceed with this.
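For what it's worth, Spark's DataFrameWriter has an "append" save mode alongside "overwrite"; the exception described above comes from the default mode, which errors out when the path exists. Below is a minimal PySpark sketch, not a definitive implementation. The helper name append_to_folder, the parquet format, and the s3a:// scheme in the usage note are assumptions made for illustration.

```python
def append_to_folder(df, path, fmt="parquet"):
    """Write df under path, keeping whatever files are already there.

    df is assumed to be a pyspark.sql.DataFrame; the helper name and the
    default format are hypothetical choices for this sketch.
    """
    # mode("append") adds new part files next to the existing ones;
    # mode("overwrite") would remove the existing files first.
    df.write.mode("append").format(fmt).save(path)
```

Assuming the S3 connector and credentials are already configured for the cluster, a call could look like append_to_folder(df, "s3a://bucket_name/folder").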

Related

How to copy a folder containing files and sub-folders to an AWS S3 bucket with the same folder structure using Python

Hi all, can you please help me figure out this issue?
How can I copy a folder that contains .py files, and also sub-folders with their own files, from a particular local path to an AWS S3 bucket path while keeping the same folder structure? The files should appear in the S3 bucket path in the same folder layout as they have locally.
There is a fully fledged Python library for this. Install it via
pip install s3
The documentation is well written and you should have no trouble following it. The examples section shows how to upload a file to S3:
storage.write("example", remote_name, headers=headers)
You should be able to instruct the package to upload the folders and subfolders while maintaining their folder structure. You can also use os.walk to walk through your files and directories if you need to individually pick which files and folders need to be uploaded.
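The os.walk approach can be sketched as follows. This is only an illustration, and it swaps in boto3's upload_file call rather than the s3 package mentioned above. The helper name upload_folder and the bucket, prefix, and folder names in the usage note are assumptions, and client is expected to be a boto3 S3 client.

```python
import os


def upload_folder(client, local_root, bucket, prefix=""):
    """Upload every file under local_root, keeping relative paths as S3 keys.

    client is assumed to be a boto3 S3 client (boto3.client("s3")).
    Returns the list of keys that were uploaded.
    """
    uploaded = []
    for dirpath, _dirnames, filenames in os.walk(local_root):
        for name in filenames:
            local_path = os.path.join(dirpath, name)
            # Path relative to the root, with forward slashes for the S3 key.
            rel = os.path.relpath(local_path, local_root).replace(os.sep, "/")
            key = "/".join(part for part in (prefix.strip("/"), rel) if part)
            client.upload_file(local_path, bucket, key)
            uploaded.append(key)
    return uploaded
```

With a real client this would be called as upload_folder(boto3.client("s3"), "my_project", "my-bucket", "code"). Since S3 has no real directories, the sub-folder layout is carried entirely by the slashes in each key.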

Unzip and Rename underlying File using Azure Logic App

Is it possible to rename an underlying file while unzipping with a Logic App? I am calling an HTTP activity to download a ZIP file. That ZIP contains only one underlying file, with some value appended to its name. I want to store the unzipped file under a better name so that it can be used further. Is it possible?
Incoming ZIP File --> SAMPLEFile.ZIP
Underlying File --> SampleTextFile20200824121212.TXT
Desired File --> SampleTextFile.TXT
Any suggestions?
As far as I know, this requirement can't be implemented directly in the "Extract archive to folder" action. The file can only be renamed by copying it from one folder to another under the new name.
You can create a new ticket on the feedback page to ask the Azure team for this feature.

How can I delete a file with images in s3?

I am new to web development and new to using S3. I'm using Node.js to delete files in S3, which I am able to do, but I run into trouble when a file has things inside it, like images or more files. I don't get any errors when I try to delete the files with things inside. I heard I had to add some metadata, but I'm not sure what type of metadata needs to be added. Is there a good example of how to delete those files?

How to download all the files from S3 bucket irrespective of file key using python

I am working on an automation piece where I need to download all files from a folder inside an S3 bucket, irrespective of the file name. I understand that using boto3 in Python I can download a file like this:
s3BucketObj = boto3.client('s3', region_name=awsRegion, aws_access_key_id=s3AccessKey, aws_secret_access_key=s3SecretKey)
s3BucketObj.download_file(bucketName, "abc.json", "/tmp/abc.json")
but I was then trying to download all files, regardless of filename, in this way:
s3BucketObj.download_file(bucketName, "test/*.json", "/test/")
I know the syntax above could be totally wrong but is there a simple way to do that?
I did find a thread which helps here but seems a bit complex: Boto3 to download all files from a S3 Bucket
There is no API call to Amazon S3 that can download multiple files.
The easiest way is to use the AWS Command-Line Interface (CLI), which has aws s3 cp --recursive and aws s3 sync commands. It will do everything for you.
If you choose to program it yourself, then Boto3 to download all files from a S3 Bucket is a good way to do it. This is because you need to do several things:
Loop through every object (there is no S3 API to copy multiple files)
Create a local directory if it doesn't exist
Download the object to the appropriate local directory
The task can be made simpler if you do not need to reproduce the directory structure (e.g., if all objects are in the same path). In that case, you can simply loop through the objects and download each of them to the same directory.
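The three steps listed above could look roughly like the sketch below, which uses boto3's list_objects_v2 paginator and download_file. Treat it as an illustration under assumptions rather than a definitive implementation: the helper name download_prefix is made up here, and client is expected to be a boto3 S3 client such as the s3BucketObj from the question.

```python
import os


def download_prefix(client, bucket, prefix, dest_dir):
    """Download every object under prefix into dest_dir, mirroring the key layout.

    client is assumed to be a boto3 S3 client. Returns the local paths written.
    """
    downloaded = []
    # Listing is paginated, so walk every page of results.
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if key.endswith("/"):
                continue  # skip zero-byte "folder" placeholder objects
            # Path of the object relative to the prefix being downloaded.
            rel = key[len(prefix):].lstrip("/") or os.path.basename(key)
            local_path = os.path.join(dest_dir, *rel.split("/"))
            parent = os.path.dirname(local_path)
            if parent:
                os.makedirs(parent, exist_ok=True)  # create local directory if missing
            client.download_file(bucket, key, local_path)
            downloaded.append(local_path)
    return downloaded
```

With the names from the question, the call would be download_prefix(s3BucketObj, bucketName, "test/", "/test/"). To keep only the .json files, add a check on key.endswith(".json") inside the loop, since S3 does not expand wildcards in keys.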

How can I have Azure File Share automatically generate non-existing directories?

With AWS S3, I can upload a file test.png to any directory I like, regardless of whether or not it exists... because S3 will automatically generate the full path & directories.
For example, if I upload to S3 using the path this/is/a/new/home/for/test.png, S3 will create directories this, is, a, ... and upload test.png to the correct folder.
I am migrating over to Azure, and I am looking to use their file storage. However, it seems that I must manually create EVERY directory... I could obviously do it programmatically by checking to see if the folder exists and if not, create it... but wow...why should I work so hard?
I did try:
file_service.create_file_from_path('testshare', 'some/long/path', 'test.png', 'path/to/local/location/of/test.png')
However, that complains that the directory does not exist... and will only work if I either manually create the directories or replace some/long/path with None.
Is it possible to just hand Azure a path and have it create the directories?
Azure Files closely mimics OS File System and thus in order to push a file in a directory, that directory must exist. What that means is if you need to create a file in a nested directory structure, that directory structure must exist. Azure File service will not create that structure for you.
A better option in your scenario would be to use Azure Blob Storage. It closely mimics Amazon S3 behavior that you mentioned above. You can create a Container (similar to Bucket in S3) and then upload a file with a name like this/is/a/new/home/for/test.png.
However, please note that the folders are virtual in Blob Storage (same as S3), not real ones. Essentially, the name under which the blob (similar to an Object in S3) is saved is this/is/a/new/home/for/test.png.
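If you stay with Azure Files rather than moving to Blob Storage, the directories have to be created one level at a time before the upload. Below is a minimal sketch of that loop. The helper name ensure_directories is hypothetical, and it assumes file_service is the same legacy FileService object used in the question and that its create_directory method accepts fail_on_exist=False to tolerate directories that already exist.

```python
def ensure_directories(file_service, share, directory_path):
    """Create each level of directory_path on the share, parents first.

    file_service is assumed to be the legacy azure-storage-file FileService
    from the question. Returns the directory paths that were requested.
    """
    requested = []
    current = ""
    for part in directory_path.strip("/").split("/"):
        if not part:
            continue  # ignore empty segments such as a doubled slash
        current = f"{current}/{part}" if current else part
        # Assumption: fail_on_exist=False makes the call a no-op for existing dirs.
        file_service.create_directory(share, current, fail_on_exist=False)
        requested.append(current)
    return requested
```

Calling ensure_directories(file_service, 'testshare', 'some/long/path') before the create_file_from_path call from the question would request some, some/long, and some/long/path in that order.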

Resources