Which nodejs library should I use to write into HDFS? - node.js

I have a nodejs application and I want to write data into hadoop HDFS file system. I have seen two main nodejs libraries that can do it: node-hdfs and node-webhdfs. Someone have tried it? Any hints? Which one should I use in production?
I am inclined to use node-webhdfs since it uses WebHDFS REST API. node-hdfs seem to be a c++ binding.
Any help will be greatly appreciated.

You may want to check out webhdfs library. It provides nice and straightforward (similar to fs module API) interface for WebHDFS REST API calls.
Writing to the remote file:
var WebHDFS = require('webhdfs');
var hdfs = WebHDFS.createClient();
var localFileStream = fs.createReadStream('/path/to/local/file');
var remoteFileStream = hdfs.createWriteStream('/path/to/remote/file');
localFileStream.pipe(remoteFileStream);
remoteFileStream.on('error', function onError (err) {
// Do something with the error
});
remoteFileStream.on('finish', function onFinish () {
// Upload is done
});
Reading from the remote file:
var WebHDFS = require('webhdfs');
var hdfs = WebHDFS.createClient();
var remoteFileStream = hdfs.createReadStream('/path/to/remote/file');
remoteFileStream.on('error', function onError (err) {
// Do something with the error
});
remoteFileStream.on('data', function onChunk (chunk) {
// Do something with the data chunk
});
remoteFileStream.on('finish', function onFinish () {
// Upload is done
});

Not good news!!!
Do not use node-hdfs. Although it seems promising, it is now two years obsolete. I've tried to compile it but it does not match the symbols of current libhdfs. If you want to use something like that you'll have to make your own nodejs binding.
You can use node-webhdfs but IMHO there's not much advantage on that. It is better to use an http nodejs lib to make your own requests. The hardest part here is try to hold the very async nature of nodejs, since you might want first to create a folder, and then after successfully create it, create a file and then, at last, write or append data. Everything through http requests that you must send and wait the for answer to then go on....
At least node-webhdfs might be a good reference to you take a look and start your own code.
Br,
Fabio Moreira

Related

node.js - res.end vs fs.createWriteStream

I am rather new to Node and am attempting to learn streaming; please correct me if my understanding is flawed.
Using fs.createReadStream and fs.createWriteStream together with .pipe method will effectively stream any kind of data.
Also res.end method utilizes streaming by default.
So could we use fs.createReadStream together with res.end to create the same streaming effect?
How would this look?
Under what circumstances would you normally use res.end?
Thank you
You can use pipe like:
readStream.pipe(res);
To stream some readable stream to the response.
See this answer for a working example of using it.
Basically it's something like:
var s = fs.createReadStream(file);
s.on('open', function () {
s.pipe(res);
});
plus some error handling and MIME types support - see this for full code:
How to serve an image using nodejs
where you can find it used in three examples using three node modules:
express
connect
http

using streamlinejs with nodejs express framework

I am new to the 'nodejs' world.So wanting to explore the various technologies,frameworks involved i am building a simple user posts system(users posting something everybody else seeing the posts) backed by redis.I am using express framework which is recommended by most tutorials.But i have some difficulty in gettting data from the redis server i need to do 3 queries from the redis server to display the posts.In which case have to use neested callback after each redis call.So i wanted to use streamline.js to simplify the callbacks.But i am unable to get it to work even after i used npm install streamline -g and require('streamline').register(); before calling
var keys=['comments','timestamp','id'];
var posts=[];
for(var key in keys){
var post=client.sort("posts",'by','nosort',"get","POST:*->"+keys[key],_);
posts.push(post);
}
i get the error ReferenceError: _ is not defined.
Please point me in the right direction or point to any resources i might have missed.
The require('streamline').register() call should be in the file that starts your application (with a .js extension). The streamline code should be in another file with a ._js extension, which is required by the main script.
Streamline only allows you to have async calls (calls with _ argument) at the top level in a main script. Here, your streamline code is in a module required by the main script. So you need to put it inside a function. Something like:
exports.myFunction = function(_) {
var keys=['comments','timestamp','id'];
var posts=[];
for(var key in keys){
var post=client.sort("posts",'by','nosort',"get","POST:*->"+keys[key],_);
posts.push(post);
}
}
This is because require is synchronous. So you cannot put asynchronous code at the top level of a script which is required by another script.

Node.js // A module to manage a sqlite database

I managed to create a module to handle all the database call. It uses this lib: https://github.com/developmentseed/node-sqlite3
My issues are the following.
Everytime I make a call, I need to make sure the database exist, and if not to create it.
Plus, as all the calls are asynchronous, I end up having loads of functions in functions in callbacks ... etc.
It pretty much looks like this:
getUsers : function (callback){
var _aUsers = [];
var that = this;
this._setupDb(function(){
var db = that.db;
db.all("SELECT * FROM t_client", function(err, rows) {
rows.forEach(function (row) {
_aUsers.push({"cli_id":row.id,"cli_name":row.cli_name,"cli_path":row.cli_path});
});
callback(_aUsers);
});
});
},
So, is there any way I can export my module only when the database is ready and fully created if it does not exist yet?
Does anyone see a way around the "asynchronous" issue?
You could also try using promises or fibers ...
I don't think so. If you make it synchronous, you are taking away the advantage. Javascript functions are meant to be that way. Such a situation is referred to as callback hell. If you are facing problems managing callbacks then you can use these libraries :
underscore
async
See these guides to understand basics of asynchronous programming
Node.js: Style and structure
Using underscore.js managing-callback-spaghetti-in-nodejs
Using async.js node-js-async-programming

Observe file changes with node.js

I have the following UseCase:
A creates a Chat and invites B and C - On the Server A creates a
File. A, B and C writes messages into this file. A, B and C read this
file.
I want a to create a file on server and observe this file if anybody else writes something into this file send the new content back with websockets.
So, any change of this file should be observed by my node.js application.
How can I observe files-changes? Is this possible with node js without locking the files?
If not possible with files, would it be possible with database object (NoSQL)
Good news is that you can observe filechanges with Node's API.
This however doesn't give you access to the contents that has been written into the file.
You can maybe use the fs.appendFile(); function so that when something is being written into the file you emit an event to something else that "logs" your new data that is being written.
fs.watch(): Directly pasted from the docs
fs.watch('somedir', function (event, filename) {
console.log('event is: ' + event);
if (filename) {
console.log('filename provided: ' + filename);
} else {
console.log('filename not provided');
}
});
Read here about the fs.watch(); function
EDIT: You can use the function
fs.watchFile();
Read here about the fs.watchFile(); function
This will allow you to watch a file for changes. Ie. whenever it is accessed by some other processes of any kind.
Also you could use node-watch. Here's an easy example:
const watch = require('node-watch')
watch('README.md', function(event, filename) {
console.log(filename, ' changed.')
})
I do not think you need to have observe file changes or use a NoSQL database for this (if you do not want to). My advice would be to look at events(Observer pattern). There are more than enough tutorials on this topic available online (Google). For example Felix's article about Using EventEmitters
This publish/subcribe semantic can also be achieved with NoSQL. In Redis for example, I think you should have a look at pubsub.
In MongoDB I think tailable cursors is what you are looking for. On their blog they have a post explaining pub/sub.

node-cloudfiles module - Is there a way to track upload progress

If anyone here is familiar with the node-cloudfiles module for node.js, I could use some help in several different areas. Unfortunately, is seems the authors are nearly impossible to reach via their github repo (EDIT: nevermind, someone did reach out to me, I'll send an update when I have an answer of some sort prepared.)
I'll start with my most basic challenge: is there a way to track the progress of the upload? I have tried many things, but the object returned from the .addFile command does not seem to hold any sort of progress stats.
Here is a basic outline of what I am working with.
var readStream = fs.createReadStream(path+'.'+extension, streamopts);
var upOpts = {
headers: {
'content-type': 'video/'+extension,
'content-length': totalBytes
},
remote: CDNfilename,
stream: readStream
};
//reqStream is the object returned from the 'request' module,
//which is used by the 'cloudfiles' module.
var reqStream = cloudClient.addFile(Container.name, upOpts, function (err, uploaded) {
if (err) { console.log(err); }
});
At first I thought I could just use the .bytesWritten property connected to an interval timer, but the object is not a normal node writeStream, so there is no such property.
Charlie (the author of the module) told me that this is possible because it's using a pipe and you just check the data events from the object returned from .addFile, like so:
reqStream.on('data', function () {
/* track progress /*
});
Whenever you need to contact somebody from the nodejitsu team, join the #nodejitsu channel on IRC, they're really active.
At the time of writing this answer, there isn't really a good way to get upload progress for files being sent to cloudfiles. However, one of the nodejitsu geniuses implemented chunked uploading, which in my case, eliminates the need for progress reports. Thanks Bradley.

Resources