How to organize S3 uploads client/server with the AWS SDK - Node.js

I have a bucket with multiple users, and I would like to pre-sign URLs so the client can upload to S3 directly (some files can be large, so I'd rather they not pass through the Node server). My question is this: until the Mongo database is hit, there is no Mongo ObjectId to use as a prefix for the file. I'm separating the files in this structure (UserID/PostID/resource) so you can check all of a user's pictures by looking under /UserID, and you can target a specific post by also adding the PostID. Conversely, there is no object URL until the client uploads the file, so I'm at a bit of an impasse.
Is it bad practice to rename files after they touch the bucket? I just can't know the ObjectId ahead of time (the post has to be created in Mongo first), but the user has to select which files they want to upload before the object is created. I was thinking the best flow could be one of two options:
Client selects files -> Mongo creates the document -> server responds to the client with the ObjectId and pre-signed URLs for each file, with the key set to /UserID/PostID/name. After a successful upload, the client triggers an update function on the server to edit the URLs of the post. After the update, send success to the client.
Client uploads files to the root of the bucket -> Mongo document is created, storing the URLs of the uploaded S3 files -> iterate over the list and prepend the UserID and the newly created PostID, updating the Mongo document -> success response to the client.
Is there another approach that I don't know about?

Answering your question:
Is it bad practice to rename files after they touch the server?
If you are planning to use S3 to store your files, there is no server holding them, so there is no problem with changing the files after you upload them.
The only thing you need to understand is that renaming an object requires two requests:
copy the object with a new name
delete the old object with the old name
This means it can become a cost/latency problem if you have a huge number of changes (but for most cases it will not be a problem).
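A minimal sketch of that two-request rename with the AWS SDK for JavaScript v3 (the bucket, region and key names here are placeholders, not from the question):

// "Rename" an S3 object by copying it to the new key, then deleting the old one.
const {
  S3Client,
  CopyObjectCommand,
  DeleteObjectCommand,
} = require("@aws-sdk/client-s3");

const s3 = new S3Client({ region: "us-east-1" });

async function renameObject(bucket, oldKey, newKey) {
  // 1) copy the object under the new name
  await s3.send(new CopyObjectCommand({
    Bucket: bucket,
    CopySource: `${bucket}/${oldKey}`, // URL-encode the key if it has special characters
    Key: newKey,
  }));
  // 2) delete the object under the old name
  await s3.send(new DeleteObjectCommand({ Bucket: bucket, Key: oldKey }));
}

// e.g. move an upload from the bucket root into the UserID/PostID layout:
// renameObject("my-bucket", "photo.jpg", "user123/post456/photo.jpg");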
The first option would be a good one for you. The only thing I would change is adding serverless processing for your objects/files; AWS Lambda is a good option for that.
In this case, instead of updating the files on the server, you update them with a Lambda function. You only need to add a trigger on your bucket for the S3 PutObject event; this way you can change the names of your files quickly for your client and at low cost.
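One way that handler could look. This sketch assumes option 1's key layout (UserID/PostID/filename), so instead of renaming it records the finished object URL on the post; the Mongo connection string, database, collection and field names are all assumptions:

// Lambda sketch: triggered by the bucket's s3:ObjectCreated:Put event.
const { MongoClient, ObjectId } = require("mongodb");

const mongo = new MongoClient(process.env.MONGO_URI); // connection string assumed

exports.handler = async (event) => {
  await mongo.connect();
  const posts = mongo.db("app").collection("posts"); // placeholder names

  for (const record of event.Records) {
    const bucket = record.s3.bucket.name;
    // S3 event keys arrive URL-encoded, with "+" for spaces
    const key = decodeURIComponent(record.s3.object.key.replace(/\+/g, " "));
    const [userId, postId] = key.split("/"); // relies on UserID/PostID/name keys

    const url = `https://${bucket}.s3.amazonaws.com/${key}`;
    await posts.updateOne(
      { _id: new ObjectId(postId), userId },   // userId field is an assumption
      { $push: { attachments: url } }
    );
  }
};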

Related

Handling file uploads and storage in a node.js app using AWS S3

I am busy with a ToDo-like app where I want users to be able to add attachments to tasks.
I am struggling with the architecture of my app more than with the code.
For my frontend I am using Vue.js, with Node.js as the backend and MongoDB for my database, which I'm considering hosting on Heroku. I was thinking of using AWS S3 for storing the attachments for my tasks.
I am unsure whether I should do file uploads via my Node server to S3, or via pre-signed URLs. I am also unsure of the best way to download the attachments from S3; I was thinking pre-signed URLs would be the best way to do this.
My main confusion is how to keep an index of all attachments of a task. Would storing an index in MongoDB that is related to my Task model be the best way to do this? Also what conventions are there as to what meta-data should be stored?
Lastly, I was wondering if there are any conventions as to how to organize the files uploaded to S3. Is it fine to just save the file under the Task's database ID? Should I change the file's name at all?
Store your attachments in S3. I would recommend you keep a separate bucket in S3 for attachments and keep track of those files in a MongoDB collection called Attachments.
For each file you keep the following document:
{
  "source_name": "helloworld.txt",
  "s3_url": "https://bucket-name.s3-eu-west-1.amazonaws.com/A591A6D40BF420404A011733CFB7B190D62C65BF0BCDA32B57B277D9AD9F146E",
  "sha256": "A591A6D40BF420404A011733CFB7B190D62C65BF0BCDA32B57B277D9AD9F146E",
  "uploaded": "Mon May 11 2020 13:40:28",  # always UTC time
  "size": 12
}
The source_name is the name of the file that was uploaded. The s3_url is its location in S3; this should be a non-public bucket. The sha256 is the SHA-256 checksum of the file, which you generate; in this example it also serves as the object key, and you store it as a separate field as well. Finally, store the upload date and the size in bytes.
Why go to the overhead of a checksum? It is more secure, you automatically dedupe your files, and you can easily detect uploaded files that you already have in your collection.
This means you can find files by checksum and name quickly and you can add other discriminator fields in the future.
Upload and download should be managed by your application. You store the _id field of this document in your Task documents so attachments can be retrieved quickly.
A final optimisation is to embed this document in your Task document and save the complexity and overhead of an additional collection. Do this if the ratio of attachments to Tasks is low.
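A sketch of building that attachment document in Node, assuming the file is already in memory as a Buffer; the bucket name and region are placeholders, and the field names follow the example above:

// Hash the file, store it under its checksum, and build the attachment document.
const { createHash } = require("node:crypto");
const { S3Client, PutObjectCommand } = require("@aws-sdk/client-s3");

const s3 = new S3Client({ region: "eu-west-1" });
const BUCKET = "my-attachments-bucket"; // placeholder, non-public

async function buildAttachment(buffer, sourceName) {
  const sha256 = createHash("sha256").update(buffer).digest("hex").toUpperCase();

  // Keying the object by its checksum dedupes identical uploads automatically.
  await s3.send(new PutObjectCommand({ Bucket: BUCKET, Key: sha256, Body: buffer }));

  return {
    source_name: sourceName,
    s3_url: `https://${BUCKET}.s3-eu-west-1.amazonaws.com/${sha256}`,
    sha256,
    uploaded: new Date().toISOString(), // always UTC
    size: buffer.length,
  };
}

// Insert the returned document into the Attachments collection (or embed it in
// the Task document) and reference its _id from the Task.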

AWS Lambda Function - Image Upload - Process Review

I'm trying to better understand how the overall flow should work with AWS Lambda and my Web App.
I would like the client to upload a file to a public bucket (completely bypassing my API resources), with the client UI putting it into a folder for their account based on a GUID. From there, I have a Lambda that runs when it detects a change to the public bucket, resizes the file, and places it into the processed bucket.
However, I need to update a row in my RDS Database.
Issue
I'm struggling to understand the best practice for identifying the row to update. Should I be uploading another file with the necessary details (where every image upload really consists of two files - an image and a JSON config)? Should the image be processed, and then the client receives some data and makes an API request to update the row in the database? What is the right flow for this step?
Thanks.
You should use a pre-signed URL for the upload. This allows your application to put restrictions on the upload, such as file type, directory and size. It means that, when the file is uploaded, you already know who did the upload. It also prevents people from uploading randomly to the bucket, since it does not need to be public.
The upload can then use an Amazon S3 Event to trigger the Lambda function. The filename/location can be used to identify the user, so the database can be updated at the time that the file is processed.
See: Uploading Objects Using Presigned URLs - Amazon Simple Storage Service
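A sketch of both halves with the AWS SDK v3: the API issues a pre-signed PUT URL whose key encodes the account GUID, and the Lambda that fires on the S3 event recovers that GUID from the key before updating the database. The bucket name, key layout, and expiry are assumptions:

// 1) API side: issue a pre-signed PUT URL scoped to the user's folder.
const { S3Client, PutObjectCommand } = require("@aws-sdk/client-s3");
const { getSignedUrl } = require("@aws-sdk/s3-request-presigner");

const s3 = new S3Client({ region: "us-east-1" });

async function getUploadUrl(accountGuid, fileName, contentType) {
  const key = `uploads/${accountGuid}/${fileName}`; // key layout is an assumption
  const command = new PutObjectCommand({
    Bucket: "my-upload-bucket", // placeholder
    Key: key,
    ContentType: contentType,   // the upload must send the same Content-Type
  });
  return { key, url: await getSignedUrl(s3, command, { expiresIn: 300 }) };
}

// 2) Lambda side: the event key tells you which account the image belongs to.
exports.handler = async (event) => {
  for (const record of event.Records) {
    const key = decodeURIComponent(record.s3.object.key.replace(/\+/g, " "));
    const [, accountGuid] = key.split("/"); // "uploads/<guid>/<file>"
    // ...resize, write to the processed bucket, then update the RDS row for
    // accountGuid with your usual database client (pg, mysql2, etc.).
  }
};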
I'd avoid uploading a file directly to S3, bypassing the API. Uploading files through your API allows you to control the file type, size, etc., and you will know exactly who is uploading the file (API auth ID or user ID in the API body). Opening a bucket to the public for writes is also a security risk.
Your API clients can then upload the file via the API, which can store the file on S3 (triggering another Lambda for processing) and then update your RDS database with the appropriate metadata for that user.
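If you go the through-the-API route, a minimal Express sketch (multer, the route path, and the auth middleware are my choices here, not something from the question):

// Pass-through upload: the API receives the file, stores it on S3, and can
// then record metadata or trigger further processing.
const express = require("express");
const multer = require("multer");
const { S3Client, PutObjectCommand } = require("@aws-sdk/client-s3");

const app = express();
const upload = multer({
  storage: multer.memoryStorage(),
  limits: { fileSize: 5 * 1024 * 1024 }, // enforce a size cap at the API
});
const s3 = new S3Client({ region: "us-east-1" });

app.post("/attachments", upload.single("file"), async (req, res) => {
  // req.user is assumed to be set by your auth middleware
  const key = `users/${req.user.id}/${Date.now()}-${req.file.originalname}`;

  await s3.send(new PutObjectCommand({
    Bucket: "my-upload-bucket", // placeholder
    Key: key,
    Body: req.file.buffer,
    ContentType: req.file.mimetype,
  }));

  // ...store { key, user, size, mimetype } in RDS here, or let an S3-triggered
  // Lambda do the processing as described above.
  res.status(201).json({ key });
});

app.listen(3000);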

Upload to AWS S3 Best Practice

I'm creating a node backend application and I have an entity which can have files assigned.
I have the following options:
Make a request and upload the files as soon as the user selects them in the frontend form and assign them to the entity when the user makes the request to create / update it
Upload the files in the same request which creates / updates the entity
I was wondering if there is a best practice for this scenario. I can't really decide what's better.
This is one of those "it depends" answers: it depends on how you are doing uploads and whether you plan to clean up your S3 buckets.
I'd suggest creating the entity first (option #2), because then you can store which S3 files belong to that entity. If you tried option #1, you might have untracked files (or some kind of staging area), which could require cleanup at some point in the future. (If your files are small, it may never matter, and you just eat that $0.03/GB fee each month : )
I've been seeing some web sites that look like option #1, where files are included in my form/document as I'm "editing". Pasting an image from my clipboard is particularly sweet, and sometimes I see a text placeholder while it uploads, showing the picture when complete. I think these "documents" are actually saved on their servers in some kind of draft status, so it might be your option #2 anyway. You could do the same: create a draft entity and finalize it later (and then have a way to clean out drafts and their attachments at some point).
Also, depending on the bucket privacy you need to achieve, have a look at AWS Cognito to upload directly from the browser. You could save your server bandwidth and reduce your request time by not using your server as a pass-through.
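A sketch of the direct-from-browser variant with the AWS SDK v3 and a Cognito identity pool; the pool ID, bucket, region, and key layout are placeholders:

// Browser-side: get temporary credentials from a Cognito identity pool and
// PUT the file straight to S3, skipping the Node server entirely.
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";
import { fromCognitoIdentityPool } from "@aws-sdk/credential-providers";

const s3 = new S3Client({
  region: "eu-west-1",
  credentials: fromCognitoIdentityPool({
    clientConfig: { region: "eu-west-1" },
    identityPoolId: "eu-west-1:00000000-0000-0000-0000-000000000000", // placeholder
  }),
});

export async function uploadAttachment(entityId, file) {
  const key = `entities/${entityId}/${file.name}`; // key layout is an assumption
  await s3.send(new PutObjectCommand({
    Bucket: "my-attachments-bucket", // placeholder
    Key: key,
    Body: file,            // the File object from an <input type="file">
    ContentType: file.type,
  }));
  return key; // hand this back to your API so it can be assigned to the entity
}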

NodeJS File Upload to Amazon S3 Store Mapping to Database

I'm looking at the Knox S3 library. In my web app, users are allowed to upload files, and I'm thinking of using Amazon S3.
With Knox, how do I map an uploaded file to the user? Let's say I want to know which files a user uploaded. This will be stored in MongoDB.
Just an idea.
To figure out what data was uploaded by a specific user, you have two options:
Use an external data store (MongoDB or any other database) and record the relation between the uploaded object on S3 (its key) and the user. This recording would be done by your app logic when a new object is uploaded to S3.
Upload the data to S3 under a common prefix (e.g. all files uploaded by a user would be placed under /<user-id>/file.ext) and do a LIST request on the bucket to find all the keys under that prefix (a sketch of this appears below).
Usually the first option works best, especially if you need to record multiple relations between your S3 objects and other entities.
These scenarios are not specific to using Knox as an S3 client. They apply independent of the programming language or S3 library.
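For the second option, a sketch with the current AWS SDK rather than Knox (as the answer says, the approach is library-agnostic); the bucket name is a placeholder:

// List everything a user has uploaded by listing under their key prefix.
const { S3Client, ListObjectsV2Command } = require("@aws-sdk/client-s3");

const s3 = new S3Client({ region: "us-east-1" });

async function listUserFiles(bucket, userId) {
  const keys = [];
  let ContinuationToken;
  do {
    const page = await s3.send(new ListObjectsV2Command({
      Bucket: bucket,
      Prefix: `${userId}/`, // e.g. "1234/" matches "1234/file.ext"
      ContinuationToken,
    }));
    for (const obj of page.Contents ?? []) keys.push(obj.Key);
    ContinuationToken = page.NextContinuationToken;
  } while (ContinuationToken);
  return keys;
}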

Amazon S3 Browser Based Upload - Prevent Overwrites

We are using Amazon S3 for images on our website, and users upload the images/files directly to S3 through our website. In our policy file we ensure it "begins-with" "upload/". Anyone is able to see the full URLs of these images, since they are publicly readable after they are uploaded. Could a hacker use the policy data in the JavaScript and the URL of the image to overwrite these images with their own data? I see no way to prevent overwrites after uploading once. The only solution I've seen is to copy/rename the file to a folder that is not publicly writeable, but that requires downloading the image and then uploading it again to S3 (since Amazon can't really rename in place).
If I understood you correctly, the images are uploaded to Amazon S3 via your server application.
So only your application has Amazon S3 write permission. Clients can upload images only through your application (which stores them on S3). A hacker could only force your application to upload an image with the same name and overwrite the original one.
How do you handle the situation when a user uploads an image with a name that already exists in your S3 storage?
Consider the following sequence:
The first user uploads an image some-name.jpg
Your app stores that image in S3 under the name upload-some-name.jpg
A second user uploads an image some-name.jpg
Will your application overwrite the original one stored in S3?
I think the question implies the content goes directly to S3 from the browser, using a policy file supplied by the server. If that policy file sets an expiration, for example one day in the future, then the policy becomes invalid after that. Additionally, you can set a starts-with condition on the writeable path.
So the only way a hacker could use your policy files to maliciously overwrite files is to get a new policy file, and then overwrite files only in the path specified. But by that point, you will have had the chance to refuse to provide the policy file, since I assume that is something that happens after authenticating your users.
So in short, I don't see a danger here if you are handing out properly constructed policy files and authenticating users before doing so. No need for making copies of stuff.
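With the AWS SDK for JavaScript v3, handing out such a policy looks like a pre-signed POST; a sketch, where the bucket, prefix, size cap, and expiry are assumptions:

// Issue a short-lived, prefix-restricted upload policy per authenticated user.
const { randomUUID } = require("node:crypto");
const { S3Client } = require("@aws-sdk/client-s3");
const { createPresignedPost } = require("@aws-sdk/s3-presigned-post");

const s3 = new S3Client({ region: "us-east-1" });

async function getUploadPolicy(userId) {
  const key = `upload/${userId}/${randomUUID()}.jpg`; // server picks the key
  const { url, fields } = await createPresignedPost(s3, {
    Bucket: "my-public-images",                       // placeholder
    Key: key,
    Conditions: [
      ["starts-with", "$key", `upload/${userId}/`],   // writeable path
      ["content-length-range", 0, 5 * 1024 * 1024],   // size limit in bytes
    ],
    Expires: 600, // seconds; S3 rejects the policy after this
  });
  return { url, fields, key }; // the browser POSTs the file with these fields
}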
Actually, S3 does have a copy feature that works great:
Copying Amazon S3 Objects
But as amra stated above, doubling your space by copying sounds inefficient.
Maybe it would be better to give the object some kind of unique ID like a GUID, and set additional user metadata beginning with "x-amz-meta-" for more information about the object, such as the user who uploaded it, the display name, etc.
On the other hand, you could always check whether the key already exists and return an error.
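A sketch of both ideas together: a GUID key with "x-amz-meta-" user metadata, plus an existence check before writing; the bucket name and metadata fields are placeholders:

// Store the object under a GUID, keep the original name and uploader in user
// metadata, and refuse to overwrite a key that already exists.
const { randomUUID } = require("node:crypto");
const {
  S3Client,
  PutObjectCommand,
  HeadObjectCommand,
} = require("@aws-sdk/client-s3");

const s3 = new S3Client({ region: "us-east-1" });
const BUCKET = "my-images-bucket"; // placeholder

async function keyExists(key) {
  try {
    await s3.send(new HeadObjectCommand({ Bucket: BUCKET, Key: key }));
    return true;
  } catch (err) {
    if (err.name === "NotFound") return false;
    throw err;
  }
}

async function storeImage(buffer, originalName, userId) {
  const key = `upload/${randomUUID()}`; // GUID key makes collisions practically impossible
  if (await keyExists(key)) throw new Error("key already exists");

  await s3.send(new PutObjectCommand({
    Bucket: BUCKET,
    Key: key,
    Body: buffer,
    Metadata: {                       // stored as x-amz-meta-* headers
      "original-name": originalName,
      "uploaded-by": userId,
    },
  }));
  return key;
}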
