I am using request and cheerio to parse some web pages in Node.js. We do this more than 20 times every day, so we waste a lot of bandwidth loading images and CSS content that is not useful for parsing.
I used some code like this:
var request = require('request');
var cheerio = require('cheerio');

request(url, function (error, response, html) {
  if (!error && response.statusCode == 200) {
    var $ = cheerio.load(html);
    $('.n-item').each(function (i, element) {
      // do something with each matched element
    });
  }
});
1- I want to know: is it correct that request loads images/content and may waste my server's bandwidth?
2- If so, show me a solution to prevent loading images/content.
Thanks.
Request itself doesn't parse HTML or run JavaScript. It only downloads the source of the URL you enter; for a normal website, that is literally the HTML source.
The only time request pulls an image is when the URL you pass links directly to an image, e.g. http://example.com/image.jpg. So your snippet above transfers just the page's HTML, not the images or stylesheets it references.
I'm working on a react-express project.
On the back end I made a small API that streams some information on my /API/ routes: just a JSON object.
The thing is, I do not know how I am supposed to get that information into my front end and use it.
I'm using the project as a learning exercise; I have never used an API before.
My main problem (I think) is that English is not my first language, so when I try to google this issue I get all kinds of results, probably because I'm not using the right words.
Any help would be appreciated!
You typically pull the data using a JSON HTTP request. Let's say you have a route /API/myData that returns a JSON-formatted response. Your server code should look like:
app.get('/API/myData', function (request, response) {
  response.json(myData);
});
In your React app you can pull this data with any request library. For example, with request:

var request = require('request');

request('http://localhost/API/myData', function (error, response, body) {
  if (!error && response.statusCode == 200) {
    var result = JSON.parse(body); // here is your JSON data
  }
});
It's just a starting point. You should have a look at express examples, request examples and other similar libraries to get familiar with it.
I'm using window.fetch here because it's the easiest thing to start with (even though it's not supported in all browsers yet). You could also use jQuery's ajax function or any number of things.
fetch('https://httpbin.org/ip')
  .then(data => data.json())
  .then(json => document.getElementById('your-ip').innerHTML = json.origin)
Your IP is: <div id="your-ip"></div>
I would like to calculate the distance between 2 coordinates using the GMap API.
I'm looking for any way to catch the return data from this URL:
https://maps.googleapis.com/maps/api/distancematrix/json?origins=Seattle&destinations=San+Francisco&key={{myKey}}
I tried searching but found nothing that fits my purpose.
Please help me or give me keywords. Thanks a lot!
You can use the super awesome request package.
From its documentation:
Request is designed to be the simplest way possible to make http calls. It supports HTTPS and follows redirects by default.
var request = require('request');

request('http://www.google.com', function (error, response, body) {
  if (!error && response.statusCode == 200) {
    console.log(body); // Show the HTML for the Google homepage.
  }
});
Hope that helps!
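Once the body arrives, you still have to pull the distance out of the JSON. A small sketch of that parsing step, using a trimmed sample response (the field names follow Google's Distance Matrix documentation; the numbers here are illustrative, not real API output):

```javascript
// A trimmed sample of the JSON shape the Distance Matrix endpoint returns.
var body = JSON.stringify({
  status: 'OK',
  rows: [{
    elements: [{
      status: 'OK',
      distance: { text: '808 mi', value: 1299026 },
      duration: { text: '12 hours 2 mins', value: 43320 }
    }]
  }]
});

// rows[i].elements[j] pairs origin i with destination j.
var data = JSON.parse(body);
var element = data.rows[0].elements[0];
console.log(element.distance.text); // 808 mi
```

With a single origin and destination, as in the URL above, the answer is always at rows[0].elements[0].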
Google's own API documentation should contain everything you need:
https://developers.google.com/maps/documentation/javascript/distancematrix
First geocode your origin and destination from cities (or whatever) to LatLng objects:
Geocoding is the process of converting addresses (like "1600
Amphitheatre Parkway, Mountain View, CA") into geographic coordinates
(like latitude 37.423021 and longitude -122.083739), which you can use
to place markers on a map, or position the map.
https://developers.google.com/maps/documentation/geocoding/intro
Using the LatLng objects you get in response, call the DistanceMatrixService.
I understand that the major application for cheerio is web scraping. Is there any way to manipulate and update the html using cheerio commands?
var request = require('request');
var cheerio = require('cheerio');

request('http://localhost:3000', function (error, response, html) {
  if (!error && response.statusCode == 200) {
    var $ = cheerio.load(html);
    $('ul').append('<li class="plum">Plum</li>');
    console.log($.html());
  }
});
While the above code does not change the original HTML, is there any way that changes made to the DOM, such as $('ul').append('<li class="plum">Plum</li>'), can be reflected in the HTML output?
The required code is already present in your snippet: it is $.html(). The result of that call is exactly the updated markup you need. But if you want that result to be saved back on the requested server, that is another story, and there will be questions:
Do you have access to the server's contents?
Does the server build its responses from static files or dynamically?
I would like to scrape Google Translate with NodeJS and the cheerio library:
var request = require('request');
var cheerio = require('cheerio');

request("http://translate.google.de/#de/en/hallo%20welt", function (err, resp, body) {
  if (err) throw err;
  var $ = cheerio.load(body);
  console.log($('#result_box').find('span').length);
});
But it can't find the expected span elements in the translation box (result_box). In the website's source code it looks like this:
<span id="result_box">
  <span class="hps">hello</span>
  <span class="hps">world</span>
</span>
So I thought I could wait 5-10 seconds until Google has created all the span elements, but no, that doesn't seem to be it:
setTimeout(function () {
  $ = cheerio.load(body);
  console.log($('#result_box').find('span').length);
}, 15000);
Could you help me, please? :)
Solution:
Instead of cheerio I use http.get:
http.get(
  this.prepareURL("http://translate.google.de/translate_a/t?client=t&sl=de&tl=en&hl=de&ie=UTF-8&oe=UTF-8&oc=2&otf=1&ssel=5&tsel=5&pc=1&q=Hallo"),
  function (result) {
    result.setEncoding('utf8');
    result.on("data", function (chunk) {
      console.log(chunk);
    });
  }
);
So I get a result string with the translation. The URL used is the request sent to Google's server.
I know you've already resolved this, but I think the reason your code didn't work is that you should have written [...].find("span.hps")[...]
At least for me, it only ever worked with the class identifier present.
The reason you can't use cheerio in Node to scrape the Google translation is that Google does not render the translation page on its side.
They reply to your request with a script; that script then makes an API request that includes your string, runs on the user's side, and builds the content you see. None of that happens in cheerio, which only parses the static HTML it is given.
So you would need to make a request to that API directly, but it's Google: they can detect scraping, and they will block you after a few attempts.
You can still fake user behavior, but it will take a long time and they may block you at any time.
I've been using Google Translate API for a while now, without any problems.
I recently pushed my app to my new server, and even though it works perfectly on my local server, the same source code always gives me the error message "Required parameter: q".
I'm using NodeJS + ExpressJS + Request to send this request. Here's my test case:
var request = require('request');

request.post({
  url: "https://www.googleapis.com/language/translate/v2",
  headers: { "X-HTTP-Method-Override": "GET" },
  form: {
    key: /* My Google API server key */,
    target: "en",
    q: ["Mon premier essai", "Mon second essai"]
  }
}, function (error, response, data) {
  if (!error && response.statusCode == 200) {
    console.log("everything works fine");
  } else {
    console.log("something went wrong");
  }
});
Running on my local machine gives me "everything works fine", and running it on my server gives me "something went wrong". Digging more into it, I get the error message mentioned above.
As you can see, I'm trying to translate two sentences in one request. It's just a test case, but I really need to do this through a POST request instead of doing two GET requests.
I have no idea why this is happening; I double-checked my Google settings and can't find anything wrong there.
Also, I'm having no problem using the Google Places API with this same API key on my server.
I'm stuck. Anyone has any idea what's wrong here?
Well, I finally found what was wrong: the new version of request doesn't behave like the old one, and my server was running 2.16 while my local machine was running 2.14.
The difference is the way the array is sent. I debugged it, and the old version was sending
key=my_api_key&target=en&q=Mon%20premier%20essai&q=Mon%20second%20essai
When the new version is sending
key=my_api_key&target=en&q[0]=Mon%20premier%20essai&q[1]=Mon%20second%20essai
So I just pinned 2.14.x instead of 2.x in my package.json file for now; hopefully it will get fixed soon, or maybe it's not a bug, I don't know.
This answer is a little late, but it should help people out there with this problem. The problem comes from the way the querystring module converts array parameters:
https://github.com/visionmedia/node-querystring
Its qs.stringify function converts field names (q in the given example) that have an array value to the format:
q[0]=...&q[1]=...
This is not a bug but intended functionality. To overcome this problem without reverting to an old version of the request module, you need to create your POST body manually by using the body option instead of the form option. You will also need to add the content-type header yourself with this method:
var request = require('request');

request.post({
  url: "https://www.googleapis.com/language/translate/v2",
  headers: {
    "X-HTTP-Method-Override": "GET",
    "content-type": "application/x-www-form-urlencoded; charset=utf-8"
  },
  body: 'key=xxxx&target=en&q=Mon%20premier%20essai&q=Mon%20second%20essai'
}, function (error, response, data) {
  if (!error && response.statusCode == 200) {
    console.log("everything works fine");
  } else {
    console.log("something went wrong");
  }
});
Obviously this is not as clean, but you can easily create a utility function that builds the body string from an object the way you want it.
things that pop into my head:
jQuery file version on the server and the local PC are not the same
file encoding issues (UTF-8 on the PC, ASCII on the server?)
Have you tried testing it in Chrome with Developer Tools open? Check the Network tab and verify exactly what is being sent to Google.
For me at least, when something works on one machine and not the other, it is usually due to the first two options.
Good Luck!