There are several tutorials that describe how to scrape websites with request and cheerio. In these tutorials they send the output to the console or stream the DOM with fs into a file as seen in the example below.
request(link, function (err, resp, html) {
if (err) return console.error(err)
var $ = cheerio.load(html),
img = $('#img_wrapper').data('src');
console.log(img);
}).pipe(fs.createWriteStream('img_link.txt'));
But what if I would like to process the output during script execution? How can I access the output or send it back to the calling function? I could, of course, load img_link.txt and get the information from there, but this would be too costly and doesn't make sense.
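You can wrap request in a function that calls back with the HTML:
// fetchHtml is an illustrative name; call it whatever you like
function fetchHtml(link, callback) {
  request(link, function (err, resp, body) {
    // Hand the result to the caller instead of printing it
    callback(err, body);
  });
}
Then assign it to module.exports and use it in any other module.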
Remove the pipe altogether.
request(link, function (err, resp, html) {
if (err) return console.error(err)
var $ = cheerio.load(html);
var img = $('#img_wrapper').data('src'); // the var img now has the src attr of some image
return img; // Returns from this callback only; use img here or pass it on to your own callback
});
Update
From your comments, it seems your request function is working as expected, and the problem is rather how to access the data from another module.
I suggest you read "Purpose of Node.js module.exports and how you use it".
There is also a good article describing how require and exports work.
Put the code above in a module
Use the module.exports
Require the module in another file
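A minimal sketch of those three steps. It assumes a hypothetical file name (scraper.js) and uses a stubbed fetchHtml in place of request, plus a regular expression in place of cheerio, so that it runs without network access or extra packages. getImageSrc is an illustrative name, not part of either library.

```javascript
// scraper.js (hypothetical file name)

// Stand-in for request(link, cb): calls back asynchronously with fixed HTML.
function fetchHtml(link, callback) {
  setImmediate(function () {
    callback(null, '<div id="img_wrapper" data-src="/images/cat.png"></div>');
  });
}

// Calls back with the data-src value instead of returning it.
function getImageSrc(link, callback) {
  fetchHtml(link, function (err, html) {
    if (err) return callback(err);
    // With cheerio this would be $('#img_wrapper').data('src').
    var match = /id="img_wrapper"[^>]*data-src="([^"]*)"/.exec(html);
    callback(null, match ? match[1] : null);
  });
}

// Step 2: expose the function to other files.
module.exports = { getImageSrc: getImageSrc };

// Step 3: in another file (for example app.js) you would write:
//   var scraper = require('./scraper');
//   scraper.getImageSrc(link, function (err, src) { /* use src here */ });
getImageSrc('http://example.com/', function (err, src) {
  if (err) return console.error(err);
  console.log(src); // prints "/images/cat.png"
});
```

The value is delivered through the callback because the request has not finished by the time getImageSrc returns.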
I'm quite new to Nodejs. In the following code I am getting json data from an API.
let data_json = ''; // global variable
app.get('/', (req, res) => {
request('http://my-api.com/data-export.json', (error, response, body) => {
data_json = JSON.parse(body);
console.log( data_json ); // data prints successfully
});
console.log(data_json, 'Data Test - outside request code'); // no data is printed
})
data_json is my global variable, and I assign it the data returned by the request function. Within that function the JSON data prints just fine. But when I try printing the same data outside the request function, nothing prints out.
What mistake am I making?
Node.js does not wait for request to finish fetching data from your API. It moves on to the code outside the callback, which prints nothing because data_json is still empty at that moment. Only after the response arrives (which takes at least a few milliseconds) does Node run the code inside the request callback. This is because I/O in Node.js is asynchronous and non-blocking: it does not halt your code until the API returns data, it keeps going and runs the callback later, when the response is available.
It's good practice to do all of your data manipulation inside the callback function. Unfortunately, you can't rely on the structure you have.
Here's your code again, annotated with comments showing the order of operations:
let data_json = ''; // global variable
app.get('/', (req, res) => {
//NodeJS STARTS executing this code
request('http://my-api.com/data-export.json', (error, response, body) => {
//NodeJS executes this code last, after the data is loaded from the server
data_json = JSON.parse(body);
console.log( data_json );
//You should do all of your data_json manipulation here
//Eg saving stuff to the database, processing data, just usual logic ya know
});
//NodeJS executes this code 2nd, before your server responds with data
//Because it doesn't want to block the entire code until it gets a response
console.log(data_json, 'Data Test - outside request code');
})
So let's say you want to make another request with the data from the first request - you will have to do something like this:
request('https://your-api.com/export-data.json', (err, res, body) => {
request('https://your-api.com/2nd-endpoint.json', (err, res, body) => {
//Process data and repeat
})
})
As you can see, that pattern can become very messy very quickly. This is called callback hell. To avoid a lot of nested requests, there is syntactic sugar that makes this code far more readable and maintainable: the async/await pattern. It works with functions that return a promise. Here's how it works:
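let data_json = ''
app.get('/', async (req, res) => {
  try {
    //Assumes request is a promise-returning client such as request-promise-native;
    //the plain request package takes a callback and cannot be awaited directly
    let body = await request('https://your-api.com/endpoint')
    data_json = JSON.parse(body)
  } catch (error) {
    //Handle error how you see fit
  }
  console.log(data_json) //It will work
})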
This code does the same thing as yours, but you can make as many await request(...) calls as you want, one after another, with no nesting.
The only differences are that you have to declare your function as asynchronous, async (req, res) => {...}, and that the let result = await request(...) calls should sit inside a try-catch block so you can catch errors. You can put all of your requests inside one try block if you think that's necessary.
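await only pauses on a promise, so a callback-style function has to be wrapped first. Below is a minimal sketch using Node's built-in util.promisify. fakeRequest and loadBoth are hypothetical names, and fakeRequest is an offline stand-in whose callback takes (error, body), which is simpler than request's (error, response, body).

```javascript
const util = require('util');

// Hypothetical stand-in for a callback-style HTTP call. It replies
// asynchronously with a fixed JSON body so the sketch runs offline.
function fakeRequest(url, callback) {
  setTimeout(() => callback(null, JSON.stringify({ url: url, ok: true })), 10);
}

// util.promisify turns a function that ends with callback(err, value)
// into one that returns a Promise, which is what await needs.
const fakeRequestAsync = util.promisify(fakeRequest);

async function loadBoth() {
  try {
    // Each await pauses this function (not the whole process) until the
    // promise settles, so the calls run one after the other without nesting.
    const first = JSON.parse(await fakeRequestAsync('https://your-api.com/export-data.json'));
    const second = JSON.parse(await fakeRequestAsync('https://your-api.com/2nd-endpoint.json'));
    return [first, second];
  } catch (error) {
    // A callback invoked with an error becomes a rejection and lands here.
    console.error(error);
    return [];
  }
}

loadBoth().then((results) => console.log(results.length)); // prints 2
```

Note that loadBoth itself returns a promise, so its caller also has to await it or use .then.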
Hopefully this helped a bit :)
The outer console.log runs before your request completes. Check out the ways to get asynchronous data: callbacks, promises, or async-await. Most Node.js APIs are asynchronous, so the outer console.log is executed before the request call completes.
let data_json = ''; // global variable
app.get('/', (req, res) => {
let pr = new Promise(function(resolve, reject) {
request('http://my-api.com/data-export.json', (error, response, body) => {
if (error) {
reject(error)
} else {
data_json = JSON.parse(body);
console.log(data_json); // data prints successfully
resolve(data_json)
}
});
})
pr.then(function(data) {
// data also will have data_json
// handle response here
console.log(data_json); // data prints successfully
}).catch(function(err) {
// handle error here
})
})
If you don't want to create a promise wrapper, you can use request-promise-native (uses native Promises) created by the Request module team.
Learn callbacks, promises and of course async-await.
I have a bog standard nodejs and express app. I then have a 3rd party API call (https://github.com/agilecrm/nodejs) that has a set function to collect the data I require. Normally, with a DB call, I am fine: I fetch the data, return it via res.json(data), and it's available client side in the public folder from express. But I am really struggling with the format of the 3rd party function, and with getting the data to return so I can collect it client side.
Here is an example of the api call:
var AgileCRMManager = require("./agilecrm.js");
var obj = new AgileCRMManager("DOMAIN", "KEY", "EMAIL");
var success = function (data) {
console.log(data);
};
var error = function (data) {
console.log(data);
};
obj.contactAPI.getContactsByTagFilter('tester tag',success, error);
This works fine for logging the data to the console, but I need to get it client side so I can use it in the front end, and the only method I know is via routing. How would I achieve this, or is there a better method? The problem is that the data comes back through the 2nd argument of the function, and I can't get it into my response with the various methods I have tried.
app.get('/get_contacts_by_tag', function (req, res) {
obj.contactAPI.getContactsByTagFilter('Confirmed', success, error);
var success = function (data) {
res.json(data);
};
});
Any help would be greatly appreciated.
You didn't define the error callback, and you assign the success callback after the API call, so success is still undefined when the call runs.
app.get('/get_contacts_by_tag', function (req, res) {
var success = function (data) {
res.json(data);
};
var error = function (data) {
res.status(500).json(data);
};
obj.contactAPI.getContactsByTagFilter('Confirmed', success, error);
});
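The ordering matters because var declarations are hoisted to the top of the function, but their assignments are not. A small sketch of the difference follows. callApi, brokenRoute and fixedRoute are hypothetical names used only for illustration, and callApi simply reports the type of what it was handed.

```javascript
// Reports the type of the callback it receives.
function callApi(onSuccess) {
  return typeof onSuccess;
}

function brokenRoute() {
  // The name success exists here (hoisted), but its value is still undefined.
  var seen = callApi(success);
  var success = function (data) { return data; };
  return seen;
}

function fixedRoute() {
  // Assign the callback first, then pass it to the API call.
  var success = function (data) { return data; };
  return callApi(success);
}

console.log(brokenRoute()); // prints "undefined"
console.log(fixedRoute());  // prints "function"
```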
I am using the Twilio Node Helper Library to make a call and record it.
According to the API link, GET should return a WAV file, but in my case it just returns a json with the recording metadata.
This is what I'm writing:
var twilioClient = require('twilio')(config.twilio.acct_sid, config.twilio.auth_token);
var request = twilioClient.recordings('RE01234567890123456789012345678901')
  .get(function (err, recording) { // <- this "recording" is JSON
  });
It doesn't matter if I tack on a '.mp3' to the end of the SID, I always get a JSON.
Ideally I want to write something like this:
var file = fs.createWriteStream('/Users/yasemin/Desktop/rec.mp3');
twilioClient.recordings('RE01234567890123456789012345678901')
.get(function (err, recording) {
if(!err){ recording.pipe(file); }});
Thanks!
I came across this and had to write my own code to handle it.
Here is the code I came up with:
con.on('getvmx', function(data){
comModel.find({_id: data.id}, function(err, results){
var https = require('https');
var options = {
host: 'api.twilio.com',
port: 443,
path: '/2010-04-01/Accounts/' + sid + '/Recordings/'+ results[0].sid + '.mp3',
method: 'GET',
auth: sid + ":" + auth,
agent: false
};
var req = https.request(options, function(res) {
res.setEncoding('binary');
var mp3data = '';
res.on('data', function (chunk) {
mp3data += chunk;
});
res.on('end', function(){
try{
var fileName = "/var/www/tcc/public/vm/" + results[0].sid + '.mp3';
fs.writeFile(fileName, mp3data, 'binary', function(err){
if(err){
return console.log(err);
}else{
console.log("File Saved");
con.emit('vmload', results);
}
});
}catch(err){
console.log(err.message);
}
});
});
req.end();
console.log(results);
//load all messages
//load line from reach message
});
});
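The code above collects the whole recording in a binary string before writing it. An alternative, sketched here as one option rather than the poster's method, is to pipe the response straight into a file with Node's built-in stream.pipeline. saveStream is a hypothetical helper name, and the demo uses a few in-memory bytes in place of the real HTTPS response so that it runs offline.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');
const { pipeline, Readable } = require('stream');

// Streams any readable (for example the res object passed to an
// https.request callback) into a file, then calls back when writing is done
// or when either stream fails.
function saveStream(readable, filePath, callback) {
  pipeline(readable, fs.createWriteStream(filePath), callback);
}

// Demo: in-memory chunks standing in for the MP3 response body.
const fakeAudio = Readable.from([Buffer.from([0x49, 0x44]), Buffer.from([0x33])]);
const target = path.join(os.tmpdir(), 'rec-demo.mp3');

saveStream(fakeAudio, target, function (err) {
  if (err) return console.error(err);
  console.log('File saved:', fs.readFileSync(target).length, 'bytes');
});
```

In the HTTPS case you would call saveStream(res, fileName, ...) inside the response callback and leave out res.setEncoding('binary'), so the chunks stay as Buffers.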
TLDR: The Node Helper Library doesn't have recorded file downloading capability at the moment.
This is the response from Twilio Support:
Looking at the documentation on our web portal, you are certainly
correct, downloading the .wav or .mp3 is possible via API call.
However, from what I can see looking at the Node example code here:
https://www.twilio.com/user/account/developer-tools/api-explorer/recording
And the documentation from the Twilio-Node developer here:
http://twilio.github.io/twilio-node/#recordings
It looks to me like the helper library doesn't actually support direct
downloading, just viewing the recording data. You can download the
application through an HTTP call, as shown in the original docs link
you noted on your Stackoverflow question. Let me know if you need help
with that.
In the mean time, I've reached out to the author of the library to see
if this is by design or a feature to be added to the library. It's
open source of course, so you could make a pull and add it yourself if
you like!
I am trying to read a file in node.js but I am getting tired of writing so many callbacks. Is there a way I can just read a file in one line?
If you're just loading a config or a template you can use the sync read method
var myFileData = fs.readFileSync('myFileName');
If you need to do it as a reply to an HTTP request, you can use the streaming API
function (req, res) {
fs.createReadStream('myFileName').pipe(res);
}
Callbacks are king, but you can use anonymous callbacks...
fs.readFile('/etc/passwd', function (err, data) {
if (err) throw err;
console.log(data);
});
http://nodejs.org/api/fs.html#fs_fs_readfile_filename_options_callback
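For comparison, here is a sketch of the one-line reads side by side, assuming a Node version that ships fs.promises. The temporary file and the readText name are made up for the example.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Create a small file so the sketch is self-contained.
const fileName = path.join(os.tmpdir(), 'read-demo.txt');
fs.writeFileSync(fileName, 'hello from disk\n');

// Synchronous, one line: blocks until the file has been read.
// Passing 'utf8' returns a string instead of a Buffer.
const textSync = fs.readFileSync(fileName, 'utf8');

// Asynchronous, still one line, via the promise-based API.
async function readText(file) {
  return fs.promises.readFile(file, 'utf8');
}

readText(fileName).then((textAsync) => {
  console.log(textSync === textAsync); // prints true
});
```

The synchronous call is fine at startup (config, templates). Inside a request handler it blocks every other request while the file is being read.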
I'm apparently a little newer to Javascript than I'd care to admit. I'm trying to pull a webpage using Node.js and save the contents as a variable, so I can parse it however I feel like.
In Python, I would do this:
from bs4 import BeautifulSoup # for parsing
import urllib
text = urllib.urlopen("http://www.myawesomepage.com/").read()
parse_my_awesome_html(text)
How would I do this in Node?
I've gotten as far as:
var request = require("request");
request("http://www.myawesomepage.com/", function (error, response, body) {
/*
Something here that lets me access the text
outside of the closure
This doesn't work:
this.text = body;
*/
})
var request = require("request");
var parseMyAwesomeHtml = function(html) {
//Have at it
};
request("http://www.myawesomepage.com/", function (error, response, body) {
if (!error) {
parseMyAwesomeHtml(body);
} else {
console.log(error);
}
});
Edit: As Kishore noted, there are nice options for parsing available. Also see cheerio (on GitHub) if you have python/gyp issues with jsdom on Windows.
That request() call is asynchronous, so the response is only available inside the callback. You have to call your parse function from it:
function parse_my_awesome_html(text){
...
}
request("http://www.myawesomepage.com/", function (error, response, body) {
parse_my_awesome_html(body)
})
Get used to chaining callbacks, that's essentially how any I/O will happen in javascript :)
JsDom is pretty good to achieve things like this if you want to parse the response.
var request = require('request'),
jsdom = require('jsdom');
request({ uri:'http://www.myawesomepage.com/' }, function (error, response, body) {
if (error || response.statusCode !== 200) {
  // response is undefined when error is set, so check error first and stop here
  console.log('Error when contacting myawesomepage.com');
  return;
}
jsdom.env({
html: body,
scripts: [
'http://code.jquery.com/jquery-1.5.min.js'
]
}, function (err, window) {
var $ = window.jQuery;
// jQuery is now loaded on the jsdom window created from body
console.log($('body').html());
});
});
Also, if your page loads a lot of JavaScript/AJAX content, you might want to consider using PhantomJS.
Source http://blog.nodejitsu.com/jsdom-jquery-in-5-lines-on-nodejs/