Scan files in AWS S3 bucket for virus using lambda - node.js

We have a requirement to scan files uploaded by users, check whether they contain a virus, and then tag them as infected. I checked a few blogs and other Stack Overflow answers and learned that we can use clamscan for this.
However, I'm confused about what the path for the virus scan should be in the clamscan config. Also, is there a tutorial I can refer to? Our application backend is in Node.js.
I'm open to other libraries/services as well.

Hard to say without further info (e.g. the architecture your code runs on).
I would say the easiest possible way to achieve what you want is to hook up a trigger on every PUT event on your S3 bucket. I have never used any virus scan tool, but I believe they all run as a daemon within a server, so you could subscribe an SQS queue to your S3 bucket's events and have a server (an EC2 instance or an ECS task) with the virus scan tool installed poll the SQS queue for new messages.
Once a message is processed and an infection is detected, you can simply invoke the PutObjectTagging API on the malicious object.
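For reference, a minimal sketch of that tagging call with the AWS SDK for JavaScript v3 (the tag key and value here are illustrative, not an AWS convention):

```js
// Tag an S3 object as infected once the scanner has flagged it.
const { S3Client, PutObjectTaggingCommand } = require('@aws-sdk/client-s3');
const s3 = new S3Client({});

async function tagAsInfected(bucket, key) {
  await s3.send(new PutObjectTaggingCommand({
    Bucket: bucket,
    Key: key,
    Tagging: { TagSet: [{ Key: 'scan-status', Value: 'infected' }] },
  }));
}
```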

We have been doing something similar, but in our case the scan happens before the file is stored in S3. That's OK, I think; the solution should still work for you.
We have one EC2 instance where we have installed ClamAV. We then wrote a web service that accepts a multipart file, takes the file content, and internally invokes the ClamAV command to scan it. In its response, the service returns whether the file is infected or not.
Your solution could be:
1. Create a web service as described above and host it on EC2 (let's call it the virus scan service); a sketch is shown after these steps.
2. In your Lambda function, call the virus scan service, passing it the file content.
3. Based on the virus scan service's response, tag your S3 file appropriately.
If you are open to a paid service too, then step #1 above won't be applicable; just replace it with a call to the virus scan service of Symantec or another such provider.
I hope it helps.
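A rough sketch of what such a scan service could look like in Node.js, using Express, multer, and the clamscan npm package the question asks about (the path options point at the ClamAV binaries/socket on the host; option names shown are from clamscan 2.x, 1.x uses snake_case, and paths vary by distro):

```js
const express = require('express');
const multer = require('multer');     // handles multipart file uploads
const NodeClam = require('clamscan');

const upload = multer({ dest: '/tmp/uploads' });
const app = express();

// Initialize once; init() returns a promise for the configured scanner.
const clamscanPromise = new NodeClam().init({
  preference: 'clamdscan', // the clamd daemon is much faster per scan
  clamdscan: { socket: '/var/run/clamav/clamd.ctl', path: '/usr/bin/clamdscan' },
  clamscan: { path: '/usr/bin/clamscan' }, // fallback one-shot binary
});

app.post('/scan', upload.single('file'), async (req, res) => {
  const clamscan = await clamscanPromise;
  const { isInfected, viruses } = await clamscan.isInfected(req.file.path);
  res.json({ infected: isInfected, viruses });
});

app.listen(8080);
```

The Lambda in step 2 would then POST the file content to this service and tag the object in step 3 based on the returned verdict.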

You can check out this solution from AWS; it will give you an idea of a similar architecture: https://aws.amazon.com/blogs/developer/virus-scan-s3-buckets-with-a-serverless-clamav-based-cdk-construct/

Related

Listen to an AWS S3 bucket when a file is uploaded

I have a problem and I need to know if anyone has an idea how to solve it.
I need to create something that listens to an S3 bucket for file uploads, takes the uploaded file, and manipulates it on my website with the various processes I already have.
So basically, is there something that lets me listen for uploads made to S3 and then manipulate the files?
Thank you
There are many ways to achieve this.
First, enable S3 event notifications triggered on S3 PutObject, and have them trigger any of these:
Lambda - gets the object and processes it (not for large files; a Lambda can run for at most 15 minutes)
SQS - put new-object notifications in an SQS queue, then launch EC2 instances to process the files. You can use Auto Scaling and CloudWatch alarms with it.
Or other options.
My suggestion would be this:
s3 notification -> trigger Lambda -> get the object key and launch an EC2 instance -> the EC2 instance does the hard work (a minimal handler is sketched below)
No idea is perfect; it highly depends on your system. Look for the solution that best meets your needs.
Best wishes.
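If you go the Lambda route, a minimal handler wired to the S3 notification might look like this (a sketch; what you do with the object body is up to your existing processes):

```js
// Node.js Lambda handler triggered by an S3 PutObject event notification.
const { S3Client, GetObjectCommand } = require('@aws-sdk/client-s3');
const s3 = new S3Client({});

exports.handler = async (event) => {
  for (const record of event.Records) {
    const bucket = record.s3.bucket.name;
    // Keys in S3 events are URL-encoded, with spaces encoded as '+'.
    const key = decodeURIComponent(record.s3.object.key.replace(/\+/g, ' '));
    const obj = await s3.send(new GetObjectCommand({ Bucket: bucket, Key: key }));
    // ...process obj.Body here, or hand the key off to an EC2 worker
  }
};
```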

Media conversion on AWS

I have an API written in Node.js (/api/uploadS3) which handles a PUT request and accepts a video file and a URL (an AWS S3 URL in this case). Once called, its task is to upload the file to the S3 URL.
Now, users are uploading files to this Node API in different formats (thanks to different browsers recording videos in different formats), and I want to convert all these videos to mp4 and then store them in S3.
I wanted to know what the best approach is to do this.
I have two solutions so far:
1. Convert on the Node server using ffmpeg -
The issue with this is that ffmpeg can only execute a single operation at a time. And since I have only one server, I will have to implement a queue for multiple requests, which can lead to longer waiting times for users at the end of the queue. Apart from that, I am worried that my Node server's traffic-handling capability will be affected during any ongoing video conversion.
Can someone help me understand what the effect of other requests coming to my server will be while a video conversion is going on? How will it impact RAM and CPU usage, and the speed of processing other requests?
2. Using an AWS Lambda function -
To avoid load on my Node server, I was thinking of using AWS Lambda: my Node API uploads the file to S3 in the format provided by the user. Once done, S3 triggers a Lambda function, which takes that S3 file, converts it into .mp4 using ffmpeg or AWS MediaConvert, and then uploads the mp4 file to a new S3 path. However, I don't want the output path to be just any S3 path, but the path that was received by the Node API in the first place.
Moreover, I want the user to wait while all this happens, as I have to enable other UI features based on the success or failure of this upload.
The question here is: is it possible to do this using just a single API like /api/uploadS3, which uploads to S3 --> triggers Lambda --> converts the file --> uploads the mp4 version --> returns success or error?
Currently, if I upload to S3, the request ends then and there. So is there a way to defer the API response until all the operations have completed?
Also, how will the Lambda function access the path of the output S3 bucket which was passed to the Node API?
Any other, better approach is welcome.
PS - the S3 path received by the Node API is different for each user.
Thanks for your post. The output S3 bucket generates file events when a new file arrives (i.e., is delivered from AWS MediaConvert).
This file event can trigger a second Lambda function, which can move the file elsewhere using any of the supported transfer protocols, retry if necessary, log a status to AWS CloudWatch and/or AWS SNS, and then send a final API response based on the success/completion of the move.
AWS has a Step Functions feature which can maintain state across successive Lambda functions, for automating simple workflows. This should work for what you want to accomplish. See https://docs.aws.amazon.com/step-functions/latest/dg/tutorial-creating-lambda-state-machine.html
Note that any one Lambda function has a 15-minute maximum runtime, so any single transcoding or file-copy operation must complete within 15 minutes. The alternative is to run on EC2.
I hope this helps you out!
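As a hedged sketch of one way to wire this together with the AWS SDK for JavaScript v3 (the bucket names and the output-key metadata field are illustrative assumptions, not an AWS convention): the API can stash the per-user output path in the uploaded object's metadata, so the conversion Lambda can read it back via HeadObject, and then defer its own response with a waiter until the converted file exists.

```js
const { S3Client, PutObjectCommand, waitUntilObjectExists } = require('@aws-sdk/client-s3');
const s3 = new S3Client({});

async function uploadAndWait(fileBuffer, inputKey, outputKey) {
  // Upload the original file; carry the desired output path as user metadata
  // so the conversion Lambda can read it when the trigger fires.
  await s3.send(new PutObjectCommand({
    Bucket: 'input-bucket',                 // illustrative
    Key: inputKey,
    Body: fileBuffer,
    Metadata: { 'output-key': outputKey },  // hypothetical metadata field
  }));

  // Defer the API response: poll until the converted .mp4 appears.
  // Keep maxWaitTime well below any HTTP gateway/client timeout.
  await waitUntilObjectExists(
    { client: s3, maxWaitTime: 120 },       // seconds
    { Bucket: 'output-bucket', Key: outputKey }
  );
}
```

For conversions that routinely exceed an HTTP timeout, a job-status endpoint the client polls (or a push via WebSocket/SNS) is usually more robust than holding the request open.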

Spring Integration: Inbound File Adapter drops files when service restarts

We're using the S3InboundFileSynchronizingMessageSource feature of Spring Integration to locally sync and then send messages for any files retrieved from an S3 bucket.
Before syncing, we apply a couple of S3PersistentAcceptOnceFileListFilter filters (to check the file's TimeModified and Hash/ETag) to make sure we only sync "new" files.
Note: We use the JdbcMetadataStore table to persist the record of the files that have previously made it through the filters (using a different REGION for each filter).
Finally, for the S3InboundFileSynchronizingMessageSource local filter, we have a FileSystemPersistentAcceptOnceFileListFilter -- again on TimeModified, and again persisted, but in a different REGION.
The issue is: if the service is restarted after the file has made it through the first filter but before the message source has successfully sent the message along, we essentially drop the file and never actually process it.
What are we doing wrong? How can we avoid this "dropped file" issue?
I assume you use a FileSystemPersistentAcceptOnceFileListFilter for the localFilter since S3PersistentAcceptOnceFileListFilter is not going to work there.
Let's see how you use those filters in your configuration! I wonder if switching to a ChainFileListFilter for your remote files would help you somehow.
See docs: https://docs.spring.io/spring-integration/docs/current/reference/html/file.html#file-reading
EDIT
if the service is restarted after the file has made it through the 1st filter but before the message source successfully sent the message along
I think Gary is right: you need a transaction around that polling operation which includes filter logic as well.
See docs: https://docs.spring.io/spring-integration/docs/current/reference/html/jdbc.html#jdbc-metadata-store
This way the TX is not going to be committed until the message for a file leaves the polling channel adapter. Therefore, after a restart you will simply be able to synchronize the rolled-back files again, because they are no longer present in the store used for filtering.

NodeJS app best way to process large logic in azure vm

I have created a Node.js Express POST API which processes a file stored in Azure Blob Storage. In the POST call to this API, I just send the filename in the body. The call gets the data from the blob, creates a local file with that data, runs a command on that file, and stores the result in the blob again. But that command takes 10-15 minutes to process one file, which is why the request times out. I have 2 questions here:
1. Is there a way to respond to the call before processing starts? That is, respond to the API call and then start the file processing.
2. If 1 is not possible, what is the best solution for this problem?
Thank you in advance.
You must use a queue for long-running tasks. For that you can choose any library like agendajs, bull, bee-queue, kue, etc.
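For example, with Bull (one of the libraries mentioned above; this sketch assumes a Redis instance on localhost, and the route and field names are illustrative), the API can acknowledge immediately and let a worker do the slow part:

```js
const express = require('express');
const Queue = require('bull');

const app = express();
app.use(express.json());
const fileQueue = new Queue('file-processing'); // connects to redis://127.0.0.1:6379 by default

// 1. Respond right away with 202 Accepted; the work happens later.
app.post('/process', async (req, res) => {
  const job = await fileQueue.add({ filename: req.body.filename });
  res.status(202).json({ jobId: job.id, status: 'queued' });
});

// 2. Worker: download the blob, run the 10-15 minute command, upload the result.
fileQueue.process(async (job) => {
  // ...long-running processing for job.data.filename goes here
});

app.listen(3000);
```

The client can then poll a status endpoint (or receive a notification) to learn when the job has finished.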

Is AWS Lambda using Elastic IP?

First, my question: do AWS Lambda "instances" use EIPs?
My background:
I'm using Lambda as a solution to reduce my application's load for a certain task (downloading YouTube videos).
In the past I had problems trying to do this very thing on my EC2 instances, which I used with EIPs: it always returned a "limit exceeded" message and prompted a human captcha verification. I solved this at the time by using the instances without EIPs, and it worked like a charm.
Now, using Lambda, for certain videos it throws Error: Code 150: The uploader has not made this video available in your country. I double-checked that the video was not blocked for the US, and it wasn't. So I decided to go back and test with an instance with an EIP, and that was it: the same message that was being returned in my Lambda function.
It seems to be a change on YouTube's side, because around 3-4 months ago the error when using an EIP was "limit exceeded", but now it has turned into a country-blocked issue. So it's as if Lambda uses EIPs or some similar service that YouTube doesn't seem to like.
PS: I'm running my Lambda function with Node.js and downloading the videos with ytdl-core, btw.
PS2: I asked this very question on the AWS forums, but no luck so far after a week or so. So I decided to try asking here.
Thanks in advance
AWS Lambda is not the same as an EC2 instance. It runs on containers within the AWS infrastructure. Traffic will "appear" to come from certain IP addresses, but there is no way to configure which IP address is used.
It is possible that the range of "IP addresses from which Lambda appears to come" is not correctly updated in the geo-database used by the video service, and it thinks they are located somewhere else.
Bottom line: there is nothing you can configure.
