How to grep into large gzip/zstd compressed files about 3GB in aws lambda (docker container) without decompressing it? - node.js

Currently I am getting an ArrayBuffer from the API call. I write it to a temporary file, created with the Node.js tmp package, by calling fs.appendFileSync() chunk by chunk:
while (chunkCount) {
    var chunk = buffer.subarray(index, index + MAX_DATA_SEND_SIZE);
    index += MAX_DATA_SEND_SIZE;
    fs.appendFileSync(tmpObj.name, chunk);
    chunkCount--;
}
The temporary files are stored in the memory of the Lambda (which runs as a Docker container based on CentOS). I then use child_process and the zstd tools to grep the file for a string (zstdgrep) and read the result.
This is the command I run on the file:
const commandToSearch = `zstdgrep "${requestQueryParams.requestId}" -f ${tmpObj.name}`;
This works fine for files under 1.8 GB, but for larger files Lambda fails with ENOMEM. Any help would be appreciated.
Can I use streams in Node.js to run commands on chunks of the temporary .zst/.gz files?
Or can I run the command on the buffer directly, instead of creating a file and storing it?
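One way to avoid the temporary file for the .gz case is to decompress and search in a single streaming pass with Node's built-in zlib module. The following is a minimal sketch, not a tested Lambda solution: grepGzipBuffer and its chunkSize parameter are hypothetical names, and it assumes the payload is newline-delimited UTF-8 text. For .zst input, the gunzip stage would have to be replaced by a zstd decompressor, for example a child process running zstd -dc that reads from stdin.

```javascript
const zlib = require('node:zlib');
const { Readable } = require('node:stream');

// Hypothetical helper: returns the lines of a gzip-compressed Buffer that
// contain `needle`, decompressing incrementally instead of writing a file.
async function grepGzipBuffer(compressed, needle, chunkSize = 1024 * 1024) {
  // Feed the compressed bytes to the decompressor in slices (views, not copies).
  function* slices() {
    for (let i = 0; i < compressed.length; i += chunkSize) {
      yield compressed.subarray(i, i + chunkSize);
    }
  }

  const gunzip = Readable.from(slices()).pipe(zlib.createGunzip());
  gunzip.setEncoding('utf8'); // decoded strings, safe across multi-byte boundaries

  const matches = [];
  let tail = ''; // incomplete last line carried over to the next chunk
  for await (const text of gunzip) {
    const parts = (tail + text).split('\n');
    tail = parts.pop();
    for (const line of parts) {
      if (line.includes(needle)) matches.push(line);
    }
  }
  if (tail && tail.includes(needle)) matches.push(tail);
  return matches;
}
```

Only one decompressed chunk is held at a time, so peak memory is roughly the compressed buffer plus the matches. If the API response can be consumed as a stream, piping it into the decompressor directly would avoid holding the whole compressed buffer as well.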

Related

Node-RED docker problem reading directory contents

I have a Node-RED app running in a docker container, with the aim to periodically read contents of a directory where .csv files are constantly updated and new .csv files are sometimes added. The point is to read new entries periodically, parse data, and send it onward.
I have not utilized the numerous 'contrib' nodes, as I have enabled the NodeJS 'fs' module and played with it. Additionally the built-in 'file' and 'file in' Node-RED modules are useful when reading the .csv files' contents, so that is not an issue.
The problem comes with the new .csv files being added into the directory where all the .csv files are. I want to be able to read all the file names and subsequently read all the .csv files.
I have mounted the .csv file directory into the docker container, and when testing whether I'm able to read the file names, weird things happen. Even though the files are visible in the container (viewed using docker exec -it CONTAINER /bin/bash), a piece of code containing fs.readdir does not list the files. When I try fs.readdir to see the contents of the /data directory, which is mounted into the container, it lists the contents only about 10% of the time (I inject a timestamp into the node to run it).
As you can see from the image, the contents of the directory in question are not listed on every execution of the node. The contents of the mounted directory containing the .csv files are never listed when running this node with the correct path as parameter.
The operating system is CentOS 7, where I am not a sudoer. I have managed to make it so that none of the mounted files or directories are owned by root, so they are owned by user node-red within the container. I managed to get this directory listing working on my Ubuntu machine, where I am a sudoer, but as none of the files are root-owned there either, I am not sure if that is the problem. I have a feeling this might be an operating-system-related issue.
Notes:
All relevant files and directories have permissions rwxr-xr-x
I have tried to mount the .csv files containing directory under /data directory, and as its own directory directly under root as /files
I am able to read the file contents with the Node-RED file nodes, just not the directories. Reading static file names is not enough as the directory contents keep changing
I have enabled NodeJS 'fs' module from the settings.js file which is mounted into the container
The Node-RED node (in image) does not output any errors (I tried this by adding an error return to the function in the image)
I have tried to run the Node-RED container as root user and without defining the user
I am running the Node-RED container using docker-compose
I hope this was not too much text or too unclear, I just wanted to make sure at least most of the stuff I have tried would be written here. If someone has some insight on the workings of Node-RED under docker and using the NodeJS fs module, it would be most appreciated :)
The core Watch node should do all of this for you, no need to write function nodes.
If you want to walk subdirectories, make sure you tick the right box in the config.
From the Sidebar docs for the watch node:
The full filename of the file that actually changed is put into
msg.payload and msg.filename, while a stringified version of the watch
list is returned in msg.topic.
msg.file contains just the short filename of the file that changed.
msg.type has the type of thing changed, usually file or directory,
while msg.size holds the file size in bytes.
To answer my own question of why Node-RED was unable to read directory contents most of the time: it was because I was using the asynchronous fs.readdir function. When I switched to the synchronous version, fs.readdirSync, Node-RED was able to read directory contents without problems.
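For comparison, the asynchronous API can also be used if the result is awaited before it is passed on. The sketch below is plain Node.js rather than Node-RED-specific code, and listCsvFiles is a hypothetical helper name. In a Function node, a message returned before the readdir callback fires cannot contain the listing, which may be why the asynchronous call appeared to fail; sending the message from inside the asynchronous continuation (for example with node.send) avoids that.

```javascript
const fs = require('node:fs/promises');
const path = require('node:path');

// Hypothetical helper: resolves with the sorted full paths of the .csv files
// found directly inside `dir` (subdirectories are not walked).
async function listCsvFiles(dir) {
  const entries = await fs.readdir(dir, { withFileTypes: true });
  return entries
    .filter((entry) => entry.isFile() && entry.name.toLowerCase().endsWith('.csv'))
    .map((entry) => path.join(dir, entry.name))
    .sort();
}
```

Calling listCsvFiles('/data') on each trigger and comparing the result with the previous listing would be one way to detect newly added files.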

Rscript and Nodejs integration on Ubuntu Server

I am trying to build a Node.js app in which I call an R script to do some statistical computation and return an array with 8 elements, which I then pass back to Node.js so that we can display those elements on EJS pages.
This works on localhost: the R script runs and returns its output. But when we do the same on an Ubuntu server, console.log(out) prints null in the terminal (out is the variable that receives the output from the R script).
We call the script the same way on localhost and on the server, as shown:
console.log(data);
var out = rscript("abc.R")
    .data(data.xyz, data.abc)
    .callSync();
console.log(out);
In the code above, the data variable holds JSON, and it is logged correctly both locally and on the server.
I have installed all the libraries needed in Node.js via npm (such as rscript), have already installed R and RStudio on my Ubuntu server, and have installed all the R libraries needed to run the script.
The R script is in the same folder as my index.js. All the EJS pages are stored in another folder, which the Node app is able to access and display.
You will have to deploy your R script somewhere else and then call that R script using API calls in your node server file.
One of the services that you can use to call rscript as an API in node is Algorithmia. You will just need to follow their instructions and wrap all your code inside a function. It will appear as a sample there, once you create an R project.

Streaming the file with different name nodejs

I am using a Lambda function, written in Node.js, to read a file and stream it under a different name:
https.get('https://www.blog.google/static/blog/images/google-200x200.7714256da16f.png', res => res.pipe(fs.createWriteStream('data.png')));
request('https://www.blog.google/static/blog/images/google-200x200.7714256da16f.png').pipe(fs.createWriteStream('data.png'))
It gives me following error:
Error: EROFS: read-only file system, open 'data.png'
at Error (native)
This error is caused by the AWS Lambda environment. By default, Lambda runs your code in the /var/task directory, which is read-only. You do have ephemeral storage (512 MB by default) mounted under /tmp, which is writable. This can be found in the docs: http://docs.aws.amazon.com/lambda/latest/dg/limits.html
This means you have to modify your code to write the file into /tmp, like this:
https.get('https://www.blog.google/static/blog/images/google-200x200.7714256da16f.png', res => res.pipe(fs.createWriteStream('/tmp/data.png')));
request('https://www.blog.google/static/blog/images/google-200x200.7714256da16f.png').pipe(fs.createWriteStream('/tmp/data.png'))
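The same idea can be written with only core modules and with error handling, which the one-liners above leave out. This is a minimal sketch: downloadToTmp is a hypothetical helper name, and it assumes os.tmpdir() resolves to /tmp, as it normally does on Linux when TMPDIR is not set.

```javascript
const fs = require('node:fs');
const http = require('node:http');
const https = require('node:https');
const os = require('node:os');
const path = require('node:path');
const { pipeline } = require('node:stream/promises');

// Hypothetical helper: streams `url` into a file called `fileName` inside the
// temp directory and resolves with the path of the written file.
async function downloadToTmp(url, fileName) {
  const client = url.startsWith('https:') ? https : http;
  const dest = path.join(os.tmpdir(), fileName);

  const res = await new Promise((resolve, reject) => {
    client.get(url, resolve).on('error', reject);
  });
  if (res.statusCode !== 200) {
    res.resume(); // drain the response so the socket is released
    throw new Error(`Unexpected status code ${res.statusCode}`);
  }

  // pipeline() rejects if either the download or the file write fails.
  await pipeline(res, fs.createWriteStream(dest));
  return dest;
}
```

In a handler this could be used as `const file = await downloadToTmp(imageUrl, 'data.png');`, where imageUrl is whatever source URL the function receives.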

What is the fastest way to copy a large file to S3 from Node.js?

I imagine using child_process with the AWS CLI using aws s3 cp is faster than using the aws-sdk module, but I'm wondering if this has actually been tested before? Is there a faster way to copy files asynchronously from a Node.js environment, ideally using a separate thread?
To copy a single file to an S3 bucket:
aws s3 cp filename s3://bucket/
To copy all files in a directory to an S3 bucket:
aws s3 sync directory/ s3://bucket/
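From Node.js, the CLI can be run in a separate process without blocking the event loop by using child_process.spawn. The sketch below makes no claim about speed relative to the aws-sdk module; s3CopyArgs and copyToS3 are hypothetical helper names, and it assumes the aws executable is installed and on the PATH.

```javascript
const { spawn } = require('node:child_process');

// Builds the argument list for: aws s3 cp <localPath> s3://<bucket>/<key>
function s3CopyArgs(localPath, bucket, key) {
  return ['s3', 'cp', localPath, `s3://${bucket}/${key}`];
}

// Hypothetical helper: runs the AWS CLI in a child process and resolves when
// it exits with code 0. `command` is a parameter so it can be overridden.
function copyToS3(localPath, bucket, key, command = 'aws') {
  return new Promise((resolve, reject) => {
    const child = spawn(command, s3CopyArgs(localPath, bucket, key), { stdio: 'inherit' });
    child.on('error', reject); // for example, the CLI is not installed
    child.on('close', (code) => {
      if (code === 0) resolve();
      else reject(new Error(`${command} exited with code ${code}`));
    });
  });
}
```

Passing the arguments as an array, rather than building one shell string, keeps file names with spaces or quotes from being misinterpreted.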

AWS Lambda making video thumbnails

I want to make thumbnails from videos uploaded to S3. I know how to make them with Node.js and ffmpeg.
According to this forum post I can add libraries:
ImageMagick is the only external library that is currently provided by
default, but you can include any additional dependencies in the zip
file you provide when you create a Lambda function. Note that if this
is a native library or executable, you will need to ensure that it
runs on Amazon Linux.
But how can I put a static ffmpeg binary on AWS Lambda?
And how can I call this static binary (ffmpeg) from Node.js within AWS Lambda?
I'm a newbie with AWS and Linux.
Can anyone help me?
The process as outlined by Naveen is correct, but it glosses over a detail that can be pretty painful - including the ffmpeg binary in the zip and accessing it within your lambda function.
I just went through this, and it went like this:
1. Include the ffmpeg static binary in your zipped Lambda function package (I have a gulp task to copy this into /dist every time it builds).
2. When your function is called, move the binary to a /tmp/ dir and chmod it to give yourself access (Update Feb 2017: it's reported that this is no longer necessary, re: @loretoparisi's and @allen's answers).
3. Update your PATH to include the ffmpeg executable (I used fluent-ffmpeg, which lets you set two env vars to handle that more easily).
Let me know if more detail is necessary, I can update this answer.
The copy and chmod (step 2) is obviously not ideal. I would love to know if anyone's found a better way to handle this, or if this is typical for this architecture style.
(2nd Update, writing it before the first update b/c it's more relevant):
The copy + chmod step is no longer necessary, as @Allen pointed out. I'm executing ffmpeg in Lambda functions directly from /var/task/ with no trouble at this point. Be sure to chmod 755 any binaries before uploading them to Lambda (also as @Allen pointed out).
I'm no longer using fluent-ffmpeg to do the work. Rather, I'm updating the PATH to include the process.env['LAMBDA_TASK_ROOT'] and executing simple bash scripts.
At the top of your Lambda function:
process.env['PATH'] = process.env['PATH'] + ":" + process.env['LAMBDA_TASK_ROOT']
For an example that uses ffmpeg: lambda-pngs-to-mp4.
For a slew of useful lambda components: lambduh.
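To make the "call the binary from Node.js" part concrete, here is a minimal sketch that runs a bundled ffmpeg with child_process.execFile to grab a single frame. The helper names thumbnailArgs and makeThumbnail are hypothetical, and the sketch assumes the binary is named ffmpeg and sits at the root of the deployment package with its executable bit set.

```javascript
const { execFile } = require('node:child_process');
const path = require('node:path');

// ffmpeg arguments for writing one frame, taken `atSeconds` into `input`,
// to `output` (overwriting it if it already exists).
function thumbnailArgs(input, output, atSeconds = 1) {
  return ['-ss', String(atSeconds), '-i', input, '-frames:v', '1', '-y', output];
}

// Hypothetical helper: runs the bundled ffmpeg and resolves with `output`.
function makeThumbnail(input, output, atSeconds = 1) {
  // Assumed location: the package root, which Lambda exposes as LAMBDA_TASK_ROOT.
  const taskRoot = process.env.LAMBDA_TASK_ROOT || process.cwd();
  const ffmpeg = path.join(taskRoot, 'ffmpeg');
  return new Promise((resolve, reject) => {
    execFile(ffmpeg, thumbnailArgs(input, output, atSeconds), (error, stdout, stderr) => {
      if (error) reject(new Error(`ffmpeg failed: ${stderr || error.message}`));
      else resolve(output);
    });
  });
}
```

Since the package directory is read-only, the output path should point under /tmp, for example makeThumbnail('/tmp/video.mp4', '/tmp/thumb.png').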
The below update left in for posterity, but no longer necessary:
UPDATE WITH MORE DETAIL:
I downloaded the static ffmpeg binary here. Amazon recommends booting up an EC2 and building a binary for your use on there, because that environment will be the same as the conditions Lambda runs on. Probably a good idea, but more work, and this static download worked for me.
I pulled only the ffmpeg binary into my project's to-be-archived /dist folder.
When you upload your zip to lambda, it lives at /var/task/. For whatever reason, I ran into access issues trying to use the binary at that location, and more issues trying to edit permissions on the file there. A quick work-around is to move the binary to /tmp/ and chmod permissions on it there.
In Node, you can run shell via a child_process. What I did looks like this:
require('child_process').exec(
    'cp /var/task/ffmpeg /tmp/.; chmod 755 /tmp/ffmpeg;',
    function (error, stdout, stderr) {
        if (error) {
            //handle error
        } else {
            console.log("stdout: " + stdout)
            console.log("stderr: " + stderr)
            //handle success
        }
    }
)
This much should give you an executable ffmpeg binary in your lambda function – but you still need to make sure it's on your $PATH.
I abandoned fluent-ffmpeg and using node to launch ffmpeg commands in favor of just launching a bash script out of node, so for me, I had to add /tmp/ to my path at the top of the lambda function:
process.env.PATH = process.env.PATH + ':/tmp/'
If you use fluent-ffmpeg, you can set the path to ffmpeg via:
process.env['FFMPEG_PATH'] = '/tmp/ffmpeg';
Somewhat related/shameless self-plug: I'm working on a set of modules to make building Lambda functions out of composable modules easier under the name Lambduh. Might save some time getting these things together. A quick example: handling this scenario with lambduh-execute would be as simple as:
promises.push(execute({
    shell: "cp /var/task/ffmpeg /tmp/.; chmod 755 /tmp/ffmpeg",
}));
Where promises is an array of promises to be run.
I created a GitHub repo that does exactly this (as well as resizes the video at the same time). Russ Matney's answer was extremely helpful to make the FFmpeg file executable.
I am not sure which custom Node library you would use for the ffmpeg task; nevertheless, the steps to accomplish it are the same.
1. Create a separate directory for your Lambda project.
2. Run npm install <package name> inside that directory (this automatically puts node_modules and the appropriate files in place).
3. Create an index.js file in the Lambda project directory, then require(<package-name>) and perform your main task of creating video thumbnails.
4. Once you are done, zip the Lambda project folder, upload it in the AWS Management Console, and configure the index file and handler.
5. The rest of the configuration follows the usual process: IAM execution role, trigger, memory and timeout settings, etc.
I got this working without moving it to /tmp. I ran chmod 755 on my executable and then it worked! I had problems when I previously set it to chmod 777.
At the time of writing, as described above, there is no longer any need to copy binaries from the current folder (that is, /var/task or the process.env['LAMBDA_TASK_ROOT'] folder) to the /tmp folder.
So it is only necessary to run
chmod 755 dist/ff*
if you have your ffmpeg and ffprobe binaries there.
By the way, my previous two-cents solution, which wasted two days of my time, was this:
Configure : function(options, logger) {
    // default options
    this._options = {
        // Temporary files folder for caching and modified/downloaded binaries
        tempDir : '/tmp/',
        /**
         * Copy binaries to temp and fix permissions
         * default to false - since this is no longer necessary
         * @see http://stackoverflow.com/questions/27708573/aws-lambda-making-video-thumbnails/29001078#29001078
         */
        copyBinaries : false
    };
    // override defaults
    for (var attrname in options) { this._options[attrname] = options[attrname]; }
    this.logger = logger;
    var self = this;
    // add temporary folder and task root folder to PATH
    process.env['PATH'] = process.env['PATH'] + ':/tmp/:' + process.env['LAMBDA_TASK_ROOT']
    if (self._options.copyBinaries) {
        var result = {}
        execute(result, {
            shell: "cp ./ffmpeg /tmp/.; chmod 755 /tmp/ffmpeg", // copies the ffmpeg binary to /tmp/ and chmods permissions to run it
            logOutput: true
        })
        .then(function(result) {
            return execute(result, {
                shell: "cp ./ffprobe /tmp/.; chmod 755 /tmp/ffprobe", // copies the ffprobe binary to /tmp/ and chmods permissions to run it
                logOutput: true
            })
        })
        .then(function(result) {
            self.logger.info("LambdaAPIHelper.Configure done.");
        })
        .fail(function(err) {
            self.logger.error("LambdaAPIHelper.Configure: error %s", err);
        });
    } //copyBinaries
}
helped by the good lambduh module:
// lambduh & dependencies
var Q = require('q');
var execute = require('lambduh-execute');
As described here and confirmed by the module author, this can now be considered unnecessary. Still, it is worth having a good understanding of the Lambda runtime environment (the machine), which is well described in Exploring the Lambda Runtime environment.
I just went through the same issues as described above and ended up using the same approach of moving the scripts I needed to execute to the /tmp directory.
var childProcess = require("child_process");
var Q = require('q');
The code I used is below, with promises:
.then(function(result) {
    console.log('Move shell ffmpeg shell script to executable state and location');
    var def = Q.defer();
    childProcess.exec("mkdir /tmp/bin; cp /var/task/bin/ffmpeg /tmp/bin/ffmpeg; chmod 755 /tmp/bin/ffmpeg",
        function (error, stdout, stderr) {
            if (error) {
                console.log("error: " + error)
                def.reject(error); // reject so the promise chain does not hang on failure
            } else {
                def.resolve(result);
            }
        }
    )
    return def.promise;
})
In order for the binary to be directly executable on AWS Lambda (without first having to copy to /tmp and chmod), you need to ensure the binary has executable permission when it is added to the ZIP file.
This is problematic on Windows because Windows doesn't recognize Linux binaries. If you're using Windows 10, use the Ubuntu Bash shell to create the package.
I created a Node.js function template specifically for this purpose here. It allows you to deploy one or more binaries to Lambda, then execute an arbitrary shell command and capture the output.
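The executable-permission requirement above can be checked before the package is zipped. A minimal sketch, assuming a POSIX file system (the mode bits are not meaningful on Windows) and a hypothetical helper name hasExecuteBit:

```javascript
const fs = require('node:fs');

// Returns true when any execute bit (owner, group, or other) is set on `file`.
function hasExecuteBit(file) {
  return (fs.statSync(file).mode & 0o111) !== 0;
}
```

A build script could call hasExecuteBit('dist/ffmpeg') and abort, or run chmod 755 on the file, before creating the ZIP.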
