Refresh every 5 seconds - how to cache s3 files? - node.js

I store the image files of my user model on S3. My frontend fetches new data from the backend (Node.js) every 5 seconds. Each of those fetches retrieves all users, which involves getting each image file from S3. Once the application scales, this results in a huge number of requests to S3 and high costs, so I guess caching the files on the backend makes sense, since they rarely change once uploaded.
How would I do it? Cache each file on the server's local file system once it has been downloaded from S3, and only download it again if a new upload happened? Or is there a better mechanism for this?
Alternatively, when I set the cache header on the S3 files, are they still fetched every time I call s3.getObject, or does that already achieve what I'm trying to do?

You were right in terms of the cost, which CloudFront would not improve. I was misleading.
Back to your problem: you can make the files in the S3 bucket cacheable by adding the corresponding metadata to them.
For example:
Cache-Control: max-age=172800
You can do that in the console, or through the AWS CLI, for instance.
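The same header can also be set at upload time from Node.js. A minimal sketch, assuming the AWS SDK for JavaScript, whose upload parameters accept a CacheControl field; the helper name, bucket, and key below are placeholders, not part of the original setup:

```javascript
// Builds the parameter object for an S3 upload (for example s3.putObject).
// The bucket and key used below are placeholders.
function buildUploadParams(bucket, key, body, maxAgeSeconds) {
  return {
    Bucket: bucket,
    Key: key,
    Body: body,
    // Stored with the object and returned as its Cache-Control header.
    CacheControl: `max-age=${maxAgeSeconds}`,
  };
}

const uploadParams = buildUploadParams(
  'my-bucket',
  'users/42/avatar.png',
  Buffer.from('image bytes'),
  172800
);
console.log(uploadParams.CacheControl); // max-age=172800
```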
If the browser requests the files directly and they carry these headers, it should validate its cached copy using the ETag:
Validating cached responses with ETags. TL;DR: The server uses the ETag HTTP header to communicate a validation token. The validation token enables efficient resource update checks: no data is transferred if the resource has not changed.
If you request the files with the s3.getObject method, however, the request is made every time, so the file is downloaded again.
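To avoid that repeated download on the backend, the server can keep its own copy and only revalidate it occasionally. A minimal sketch, assuming a hypothetical fetchObject(key, etag) wrapper around the S3 call that resolves to { notModified: true } when the ETag still matches and to { etag, body } otherwise. With the AWS SDK, such a wrapper could pass the stored ETag as the IfNoneMatch parameter of getObject; how a 304 response surfaces depends on the SDK version, so that part is left out here:

```javascript
// Hypothetical in-memory cache for S3 objects.
// fetchObject(key, etag) is an assumed wrapper around the real S3 request.
function createS3Cache(fetchObject, maxAgeMs, now = Date.now) {
  const entries = new Map(); // key -> { etag, body, checkedAt }

  return async function getCached(key) {
    const cached = entries.get(key);

    // Fresh enough: answer from memory without contacting S3 at all.
    if (cached && now() - cached.checkedAt < maxAgeMs) {
      return cached.body;
    }

    // Stale or missing: ask S3, sending the known ETag if there is one.
    const result = await fetchObject(key, cached ? cached.etag : undefined);
    if (cached && result.notModified) {
      cached.checkedAt = now(); // unchanged, so keep the local copy
      return cached.body;
    }

    entries.set(key, { etag: result.etag, body: result.body, checkedAt: now() });
    return result.body;
  };
}
```

The maxAgeMs window is what reduces the number of S3 requests; the ETag check only saves the data transfer once that window has passed. The same idea works with files on the local disk instead of a Map.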
Pushing instead of requesting:
If you can't do this, you might want to have the backend push only new data to the frontend, instead of the frontend requesting all data every 5 seconds, which would lower the load significantly.
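One way to sketch the "only new data" part, assuming each user record carries an id and an updatedAt value (both hypothetical field names): the backend remembers what it last sent and forwards only the records that changed, for example as Server-Sent Events messages.

```javascript
// Hypothetical helper: remembers the last version sent per user and
// returns only the records that changed since the previous call.
function createChangeTracker() {
  const lastSent = new Map(); // user id -> updatedAt of the version already sent

  return function takeChanges(users) {
    const changed = [];
    for (const user of users) {
      if (lastSent.get(user.id) !== user.updatedAt) {
        lastSent.set(user.id, user.updatedAt);
        changed.push(user);
      }
    }
    return changed;
  };
}

// Formats a payload as a single Server-Sent Events message.
function toSseMessage(payload) {
  return `data: ${JSON.stringify(payload)}\n\n`;
}
```

With this, unchanged users (and therefore their images) are never looked up again just because the timer fired.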
---
Not so cost-effective, more speed-focused.
You could use CloudFront as a CDN for your S3 bucket. This will allow you to get the file faster, and also CloudFront would handle the cache for you.
You would need to set the TTL according to your needs. You can also invalidate the cache every time you upload a file, if you need to.
From the docs:
Storing your static content with S3 provides a lot of advantages. But to help optimize your application’s performance and security while effectively managing cost, we recommend that you also set up Amazon CloudFront to work with your S3 bucket to serve and protect the content. CloudFront is a content delivery network (CDN) service that delivers static and dynamic web content, video streams, and APIs around the world, securely and at scale. By design, delivering data out of CloudFront can be more cost effective than delivering it from S3 directly to your users.

Related

Serving and processing large CSV and XML files from API

I am working on a node web application and require a form to enable users to provide a URL containing a (potentially 100mb) large CSV or XML file. This would then be submitted and trigger the server (Express) to download the file using fetch, process it and then save it to my Postgres database.
The problem I am having is the size of the file. Responses from the API take minutes to return and I'm worried this solution is not optimal for a production application. I've also seen that many servers (including cloud based ones) have response size limits on them, which would obviously be exceeded here.
Is there a better way to do this than simply via a fetch request?
Thanks

AWS Lambda Function - Image Upload - Process Review

I'm trying to better understand how the overall flow should work with AWS Lambda and my Web App.
I would like to have the client upload a file to a public bucket (completely bypassing my API resources), with the client UI putting it into a folder for their account based on a GUID. From there, I've got lambda to run when it detects a change to the public bucket, then resizing the file and placing it into the processed bucket.
However, I need to update a row in my RDS Database.
Issue
I'm struggling to understand the best practice to use for identifying the row to update. Should I be uploading another file with the necessary details (where every image upload consists really of two files - an image and a json config)? Should the image be processed, and then the client receives some data and it makes an API request to update the row in the database? What is the right flow for this step?
Thanks.
You should use a pre-signed URL for the upload. This allows your application to put restrictions on the upload, such as file type, directory and size. It means that, when the file is uploaded, you already know who did the upload. It also prevents people from uploading randomly to the bucket, since it does not need to be public.
The upload can then use an Amazon S3 Event to trigger the Lambda function. The filename/location can be used to identify the user, so the database can be updated at the time that the file is processed.
See: Uploading Objects Using Presigned URLs - Amazon Simple Storage Service
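A rough sketch of the Lambda side, assuming keys are laid out as "<account GUID>/<file name>" as described in the question; updateRow is a hypothetical database helper, not a real API. Object keys in S3 event records arrive URL-encoded, with spaces sent as "+":

```javascript
// Extracts bucket, key and account GUID from one S3 event record.
// Assumes keys look like "<account-guid>/<file name>" (hypothetical layout).
function parseUploadRecord(record) {
  const bucket = record.s3.bucket.name;
  // Keys in S3 event notifications are URL-encoded, with spaces sent as "+".
  const key = decodeURIComponent(record.s3.object.key.replace(/\+/g, ' '));
  const [accountId, ...rest] = key.split('/');
  return { bucket, key, accountId, fileName: rest.join('/') };
}

// Sketch of a handler body; updateRow is a hypothetical database helper.
async function handleUploadEvent(event, updateRow) {
  for (const record of event.Records) {
    const upload = parseUploadRecord(record);
    await updateRow(upload.accountId, upload.key);
  }
}
```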
I'd avoid uploading a file directly to S3, bypassing the API. Uploading the file through your API lets you control the file type, size, etc., and you will know exactly who is uploading the file (the API auth ID or the user ID in the request body). Opening a bucket to the public for writes is also a security risk.
Your API clients can then upload the file via the API, which can store the file on S3 (triggering another Lambda for processing) and then update your RDS with the appropriate metadata for that user.

S3 access private bucket files

I have gone through all the existing questions, but none of them seems to fulfill my requirements.
I have a private S3 bucket with 10,000 files, which I access privately via a Node.js server to display them in my Angular application, at least 25 per page.
I found multiple solutions, but they seem inefficient to me:
Generate pre-signed URLs for the files.
Pull the image from S3 via the Node.js API.
To display 10 or more images, I need to generate signed URLs each time, which is a time-consuming process. Pulling the image via the API using the s3.getObject method gives me Buffer data; converting it to Base64 is hard to handle on the client side, and fetching each image takes time too.
Are there any solutions out there that I'm not aware of, and how can this be implemented without affecting the user experience?
PS: My bucket is private, not public.
Have you tried signed cookies?
I think this may help: consider AWS CloudFront and sign the cookie once, so that the client can access the file(s) directly after that.
There is some reference.
Also, CloudFront will give you more benefits, such as faster access, attaching SSL certificates in front of your S3 buckets, and more.
"Sorry for my English"

For a web app that allows simple image uploads, how should I store the images? Confused about file system vs. cdn

Every search result says something about storing the images in the file system but store the paths in the database, but I'm not sure exactly what "file system" means. Would that mean you have something like:
/public (assets)
  /js
  /css
  /img
/app (frontend)
/server (backend)
and you'd upload directly to that /public/img directory?
I remember trying something like that in the past with a Node.js app hosted on Heroku, and it wouldn't let me. I had to set up Amazon S3 and upload the images THERE, which leads to my confusion.
Is using something like Amazon S3 the usual practice or do people upload directly to the /img directory (assuming this is the "file system"?) and it just happened to be the case that Heroku doesn't allow this but other hosts do?
I'd characterize the pattern as "store the data in a blob storage service, store a pointer in your database". The uploaded file is the "blob" - once it has left the user's computer and filesystem, is it really a file anymore? :) On the server, a file system can store that "blob". S3 can store that blob. In the first case, you are storing a path. In the second case, you are storing the URL to the S3 object. A database could even store that blob (not at all recommended, though...)
In any case, the question to ask is: "what happens when I need two app servers to support my traffic?". Wherever that blob goes, both app servers need access to it.
In a data center under your control, there are many ways to share a filesystem across servers - network attached storage (NFS- or SMB-mounted volumes), or storage area networks (iSCSI, Fibre Channel). With more limited network/hardware configuration options in cloud-based Infrastructure/Platform-as-a-Service providers, the de facto standard is S3 because it is inexpensive, reliable, easy to use, and can completely offload serving the file from your servers.
For Heroku, though, you don't have much control over the file system. And, know that the file system for each of your dynos is "ephemeral" - it goes away when the dyno restarts. Which will happen when your app goes idle, or every 24 hours, whichever comes first. So that forces the choice a little.
Final point - S3 comes with the ancillary benefit of taking the burden of serving the blob off of your servers. You can also store files directly to S3 from the browser, without routing it through your app (see https://devcenter.heroku.com/articles/s3-upload-node). The benefit in both cases is that those downloads/uploads can take up lots of your application's precious time for stuff that's pretty rote.
Uploading directly to a host file system is generally not a best practice. This is one reason services like S3 are so popular.
If you're using the host file system and ever need more than one instance of a server, the file systems will grow out of sync. Imagine one user uploads 'foo.jpg' to server A (A/app/uploads) and another uploads 'bar.jpg' to server B (B/app/uploads). When either of these images is later requested, the request has a 50% chance of failing, depending on whether the load balancer routes the request to server A or server B.
There are several ancillary benefits to avoiding the host filesystem. For instance, you can set the filesystem serving your app to read-only for increased security. Files are a form of state, and stateless web servers allow you to do things like blow away one instance and deploy another instance to take over its work.
You might find this of help:
https://codeforgeek.com/2014/11/file-uploads-using-node-js/
I used multer in my Node.js server file to handle uploads from the front end. Basically, I had an HTML form that submitted the image to the server file, where it was handled by multer. This led to it being saved in the file system (to answer your question concretely: yes, this was to something like the /img directory right in your project file structure). My application is running on Heroku, and this feature works there as well. However, I would not recommend using the file system to store your images like this (I doubt you will have enough space for a large number of images/files); using AWS storage or a DB would be better.

Amazon S3 Browser Based Upload - Prevent Overwrites

We are using Amazon S3 for images on our website and users upload the images/files directly to S3 through our website. In our policy file we ensure it "begins-with" "upload/". Anyone is able to see the full urls of these images since they are publicly readable images after they are uploaded. Could a hacker come in and use the policy data in the javascript and the url of the image to overwrite these images with their data? I see no way to prevent overwrites after uploading once. The only solution I've seen is to copy/rename the file to a folder that is not publicly writeable but that requires downloading the image then uploading it again to S3 (since Amazon can't really rename in place)
If I understood you correctly, the images are uploaded to Amazon S3 storage via your server application.
So only your application has write permission on Amazon S3. Clients can upload images only through your application (which stores them on S3). A hacker can only force your application to upload an image with the same name and overwrite the original one.
How do you handle the situation when a user uploads an image with a name that already exists in your S3 storage?
Consider the following actions:
The first user uploads an image, some-name.jpg.
Your app stores that image in S3 under the name upload-some-name.jpg.
A second user uploads an image, some-name.jpg.
Will your application overwrite the original one stored in S3?
I think the question implies the content goes directly through to S3 from the browser, using a policy file supplied by the server. If that policy file has set an expiration, for example, one day in the future, then the policy becomes invalid after that. Additionally, you can set a starts-with condition on the writeable path.
So the only way a hacker could use your policy files to maliciously overwrite files is to get a new policy file, and then overwrite files only in the path specified. But by that point, you will have had the chance to refuse to provide the policy file, since I assume that is something that happens after authenticating your users.
So in short, I don't see a danger here if you are handing out properly constructed policy files and authenticating users before doing so. No need for making copies of stuff.
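For illustration, a policy document with an expiration and a starts-with condition could be built roughly as follows. This is a sketch under assumptions: the bucket name, prefix, and size limit are placeholders, and the signing step that must accompany the encoded policy is omitted.

```javascript
// Builds and base64-encodes a browser-upload (POST) policy document.
// Signing the encoded policy is a separate step that is not shown here.
function buildPostPolicy(bucket, keyPrefix, maxBytes, expiresAt) {
  const policy = {
    expiration: expiresAt.toISOString(), // the policy is rejected after this time
    conditions: [
      { bucket },
      ['starts-with', '$key', keyPrefix], // uploads may only target this path
      ['content-length-range', 0, maxBytes], // cap the upload size
    ],
  };
  return Buffer.from(JSON.stringify(policy)).toString('base64');
}

// Placeholder values: a policy that stays valid for one day.
const oneDayFromNow = new Date(Date.now() + 24 * 60 * 60 * 1000);
const encodedPolicy = buildPostPolicy('my-bucket', 'upload/', 5 * 1024 * 1024, oneDayFromNow);
```

Handing out a policy like this only after authenticating the user is what limits who can write, and where.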
Actually, S3 does have a copy feature that works great:
Copying Amazon S3 Objects
But, as amra stated above, doubling your space by copying sounds inefficient.
Maybe it would be better to give the object some kind of unique ID, like a GUID, and set additional user metadata that begins with "x-amz-meta-" for more information on the object, such as the user who uploaded it, the display name, etc.
On the other hand, you could always check whether the key already exists and show an error.
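A minimal sketch of the unique-ID idea, assuming the server chooses the object key and that Node's built-in crypto.randomUUID is available (Node 14.17 or later). The helper name and metadata fields are illustrative; with the AWS SDK, the entries of Metadata are sent as "x-amz-meta-" headers.

```javascript
const crypto = require('crypto');
const path = require('path');

// Builds upload parameters with a collision-free key under "upload/".
// The original file name is kept only as metadata, never as the key.
function buildUniqueUploadParams(bucket, originalName, userId) {
  const extension = path.extname(originalName).toLowerCase();
  return {
    Bucket: bucket,
    Key: `upload/${crypto.randomUUID()}${extension}`,
    // Sent to S3 as x-amz-meta-uploaded-by and x-amz-meta-display-name.
    Metadata: {
      'uploaded-by': String(userId),
      'display-name': originalName,
    },
  };
}
```

Because every upload gets a fresh key, two users uploading some-name.jpg no longer collide, and a known public URL cannot be used to guess the key of a future upload.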
