AWS S3 File Sync - multithreading

I am trying to sync files from a local source to an S3 bucket. I upload each file after calculating its MD5 checksum and storing it in the object's metadata. To avoid duplicate uploads, I also check the files that already exist at the destination: I build a list of files to upload containing only those that do not match on both name and MD5. Fetching the metadata for the S3 objects, computing the MD5 for the local files on the fly, and then matching them is taking a lot of time, as I have around 200,000 to 500,000 files to compare.
Is there a better way to achieve this, either with multithreading or something else? I don't have much experience with multithreading, and I eventually need a single list with multiple threads doing the processing and adding to that same list. Any code sample or help is much appreciated.
This Windows job application is written in C# on the .NET Framework 4.6.1.

You could use the AWS Command Line Interface (CLI), which has an aws s3 sync command that does something very similar to what you describe. However, with several hundred thousand files, the matching is going to be slow there, too.
Or, you could use Amazon S3 Inventory to obtain a daily listing of the objects in the S3 bucket (including MD5 checksum) and then compare your local files against that.
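If you do want to parallelise the matching yourself, the usual pattern is a pool of worker threads that each hash one local file, compare it against the matching object's metadata, and add any mismatch to a single thread-safe collection; in .NET 4.6.1 that would be Parallel.ForEach plus a ConcurrentBag<string>. A rough sketch of the idea, written in Python with boto3 purely for illustration (the bucket name and the "md5" metadata key are assumptions):

import hashlib
import os
from concurrent.futures import ThreadPoolExecutor

import boto3
from botocore.exceptions import ClientError

BUCKET = "my-sync-bucket"   # hypothetical bucket name
s3 = boto3.client("s3")

def local_md5(path):
    # Hash in chunks so large files are not read fully into memory.
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            h.update(chunk)
    return h.hexdigest()

def needs_upload(root, key):
    # Return the key if the object is missing or its stored MD5 differs.
    md5 = local_md5(os.path.join(root, key))
    try:
        head = s3.head_object(Bucket=BUCKET, Key=key)
    except ClientError:
        return key                      # not in the bucket yet
    if head.get("Metadata", {}).get("md5") != md5:
        return key                      # present, but content changed
    return None

def files_to_upload(root, keys, workers=32):
    # The executor collects the results into one list for us, so no
    # hand-rolled locking around a shared list is required.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return [k for k in pool.map(lambda k: needs_upload(root, k), keys) if k]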

Related

Which way to go when Zipping files on a cloud bucket?

I have a pipeline process which streams data from a DB, maps it, converts the mapped data into XML, and finally creates the files on Google Cloud.
Think of it this way:
pipeline(
  [
    db_stream(),
    mapJsonStream(),
    convertToXmlStream(),
    checkIfNewFileIsNeeded(),
    // option 1. Add a zipping stream here before writing to the cloud
    writeToCloudStream(),
    callback() // cleanup stuff - option 2. I can re-stream the files, zip them here, and stream the archive back to the cloud bucket
  ]
)
This will create multiple XML files on the bucket (there's a limit on each XML file's size, so our data needs to be split across different XML files).
Now I need all these files to be zipped. I searched for a way to do that directly on the cloud using the Google Node.js client, but it seems it doesn't support this kind of functionality.
I am a bit confused about how to proceed, because after searching a bit I found that the zip libraries I looked at need to build the archive in memory before streaming it, which defeats the purpose of the pipeline I already have (option 1).
Option 2 seems a bit expensive and unnecessary.
If you have any thoughts on this I would really appreciate it. Thanks in advance!
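One hedged illustration of "option 1": some zip implementations can write entries to a non-seekable output stream, so the archive never has to be built fully in memory. The sketch below uses Python's zipfile together with the google-cloud-storage client only to show the shape of the idea; the bucket name and prefix are placeholders, and a Node.js pipeline would need an equivalent streaming zip library.

import zipfile
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-output-bucket")   # hypothetical bucket

def zip_xml_files(prefix, archive_name):
    archive_blob = bucket.blob(archive_name)
    # blob.open("wb") streams a resumable upload; ZipFile appends one
    # entry at a time, so nothing is buffered beyond the current entry.
    with archive_blob.open("wb") as out, \
         zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as zf:
        for xml_blob in client.list_blobs(bucket, prefix=prefix):
            with zf.open(xml_blob.name.split("/")[-1], "w") as entry:
                xml_blob.download_to_file(entry)   # stream blob -> zip entry

zip_xml_files("exports/run-42/", "exports/run-42.zip")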

AWS Lambda with Node - saving files into Lambda's file system

I need to save files I get from S3 into a Lambda's file system, and I wanted to know if I can do that simply using fs.writeFileSync?
Or do I still have to use the context function as described here:
How to Write and Read files to Lambda-AWS with Node.js
(I tried to find newer examples, but could not.)
What is the recommended method?
Please advise.
Yes, you can use the typical fs functions to read/write from local disk, but be aware that writing is limited to the /tmp directory, and the maximum disk space available to your Lambda function in that location defaults to 512 MB. Also note that files written there may persist into the next (warm) Lambda invocation.
If you want to simply download an object from S3 to the local disk (assuming it will fit in the available diskspace) then you can combine AWS SDK methods and Node.js streaming to stream the content to disk.
Also, it's worth noting that, depending on your app, you may be able to process the entire S3 object via streaming, without any need to actually persist it to disk. This is helpful if your object is larger than 512 MB.
Update: as of March 2022, you can now configure Lambda functions with more ephemeral storage in /tmp, up to 10 GB. You get 512 MB included with your Lambda function invocation and are charged for the additional configured storage above 512 MB.
If you need to persist very large files, consider using Elastic File System.
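For illustration, here is a minimal sketch of the download-to-/tmp pattern. The question is about Node.js, but the same idea in Python with boto3 (the bucket and key names are placeholders) looks like this:

import boto3

s3 = boto3.client("s3")

def handler(event, context):
    local_path = "/tmp/input.dat"     # /tmp is the only writable location
    # download_file streams the object to disk in chunks
    s3.download_file("my-bucket", "incoming/input.dat", local_path)
    # ... process the file from local disk here ...
    return {"saved_to": local_path}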
Lambda does not give you general access to a persistent local file system; it is meant to be an ephemeral environment. It does allow access to the /tmp folder, but only up to a maximum of 512 MB. If you want storage to go along with your function, you will need to use Amazon S3 or Amazon EFS.
Here's an article from AWS explaining this.
Here's the docs on adding storage to Lambda.

Quick way of uploading many images (3000+) to a google cloud server

I am working on object detection for a school project. To train my CNN model I am using a google cloud server because I do not own a strong enough GPU to train it locally.
The training data consists of images (.jpg files) and annotations (.txt files) and is spread over around 20 folders, because the data comes from different sources and I do not want to mix pictures from different sources, so I want to keep this directory structure.
My current issue is that I could not find a fast way of uploading them to my google cloud server.
My workaround was to upload those image folders as .zip files to Google Drive, download them on the cloud server, and unzip them there. This process takes way too much time, because I have to upload many folders and Google Drive does not have a good API for downloading folders to Linux.
On my local computer, I am using Windows 10 and my cloud server runs Debian.
Therefore, I'd be really grateful if you know a fast and easy way to either upload my images directly to the server or at least to upload my zipped folders.
Couldn't you just create a loop that looks for .jpg files and scp/sftp each one directly to your server as soon as the file is there? On Windows, you can achieve this using WSL.
(Sorry, this may not be your final answer, but I don't have the reputation to ask you this as a question.)
I would upload them to a Google Cloud Storage bucket using gsutil with multithreading. This means that multiple files are copied at once, so the only limitation is your internet speed. gsutil installers for Windows and Linux are found here. Example command:
gsutil -m cp -r dir gs://my-bucket
Then on the VM you do exactly the opposite:
gsutil -m cp -r gs://my-bucket dir
This is super fast, and you only pay a small amount for the storage, which is super cheap and/or falls within the GCP free tier.
Note: make sure you have write permissions on the storage bucket and that the default Compute Engine service account (i.e. the VM's service account) has read permissions on the storage bucket.
The best stack for this use case is gsutil plus a Cloud Storage bucket.
Copy the zip files to the Cloud Storage bucket and set up a sync cron job to pull the files onto the VM.
Make use of gsutil:
https://cloud.google.com/storage/docs/gsutil

How do I load a file from Cloud Storage into memory

I have end users who are going to be uploading a CSV file into a bucket, which will then be loaded into BigQuery.
The issue is that the content of the data is unreliable:
it contains free-text fields that may contain linefeeds, extra commas, invalid date formats, etc.
I have a python script that will pre-process the file and write out a new one with all errors corrected.
I need to be able to automate this into the cloud.
I was thinking I could load the contents of the file (it's only small) into memory and process the records then write it back out to the Bucket.
I do not want to process the file locally.
Despite extensive searching I can't find how to load a file in a bucket into memory and then write it back out again.
Can anyone help?
I believe what you're looking for is Google Cloud Functions. You can set a Cloud Function to be triggered by an upload to the GCS bucket, and use your Python code in that same Cloud Function to process the .csv and load it into BigQuery. However, please bear in mind that Python 3.7.1 support for Cloud Functions is currently in beta.
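As a rough sketch (the bucket names are placeholders and clean_csv stands in for your existing pre-processing script), a background Cloud Function triggered by the upload could look something like this:

from google.cloud import storage

client = storage.Client()
CLEAN_BUCKET = "my-clean-bucket"      # hypothetical output bucket

def clean_csv(text):
    # Placeholder for the real clean-up (linefeeds, extra commas, dates, ...)
    return text

def process_upload(data, context):
    # Entry point for a google.storage.object.finalize trigger.
    source = client.bucket(data["bucket"]).blob(data["name"])
    contents = source.download_as_text()          # whole file into memory
    cleaned = clean_csv(contents)
    client.bucket(CLEAN_BUCKET).blob(data["name"]).upload_from_string(
        cleaned, content_type="text/csv")

You would deploy it with gcloud functions deploy using a google.storage.object.finalize trigger on the upload bucket, and then load the cleaned file into BigQuery however you prefer.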

Update the ID3 tags of S3 bucket files

In my AWS S3 bucket I have thousands of MP3 files, and I want to modify the ID3 tags for those files. Please suggest the best way.
Sorry to give you the bad news, but the only way to do this is to download the files one by one, update the ID3 tags, and upload them back to the S3 bucket. You cannot edit files in place, because AWS S3 is object storage: it keeps data as key/value pairs, where the key is the folder/filename and the value is the file content. It is not suitable for use as a file system, a database, etc.
If you do it this way, one warning: check whether versioning is on or off for your bucket. Sometimes it's nice to have versioning handled automatically by S3, but remember that each version adds to the storage space you're paying for.
If you want to edit/modify your files every now and then, you can use AWS EBS or EFS. EBS is block storage and EFS is a network file system; you can attach an EBS volume to an EC2 instance (or mount EFS on one) and then edit or modify your files there. The main difference is that EFS can be mounted on multiple EC2 instances at the same time and share the files between them.
One more thing about EBS and EFS, though: to reach your files, you need to attach or mount them on an EC2 instance. There is no other way to reach your files as easily as in S3.
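As a sketch of that download/retag/upload loop, using boto3 plus the mutagen library as one possible ID3 editor (the bucket name, prefix, and tag values are placeholders, and it assumes the files already have ID3 headers):

import os
import tempfile

import boto3
from mutagen.easyid3 import EasyID3

BUCKET = "my-music-bucket"            # hypothetical bucket
s3 = boto3.client("s3")

def retag(key, new_tags):
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, os.path.basename(key))
        s3.download_file(BUCKET, key, path)      # 1. download
        audio = EasyID3(path)                    # 2. edit the ID3 tags
        for tag, value in new_tags.items():
            audio[tag] = value
        audio.save()
        s3.upload_file(path, BUCKET, key)        # 3. upload back

# Walk every .mp3 under a prefix and apply the same change
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix="albums/"):
    for obj in page.get("Contents", []):
        if obj["Key"].endswith(".mp3"):
            retag(obj["Key"], {"album": "Remastered"})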
