I would like to scrape Google Translate with Node.js and the cheerio library:
request("http://translate.google.de/#de/en/hallo%20welt", function(err, resp, body) {
if(err) throw err;
$ = cheerio.load(body);
console.log($('#result_box').find('span').length);
}
But it can't find the span elements inside the translation box (result_box). In the page source it looks like this:
<span id="result_box">
<span class="hps">hello</span>
<span class="hps">world</span>
</span>
So I thought I could wait 5-10 seconds until Google has created all the span elements, but that doesn't seem to work either:
setTimeout(function() {
  var $ = cheerio.load(body);
  console.log($('#result_box').find('span').length);
}, 15000);
Could you help me, please? :)
Solution:
Instead of cheerio I use http.get:
var http = require('http');

http.get(
  // prepareURL is the poster's own helper, defined elsewhere
  this.prepareURL("http://translate.google.de/translate_a/t?client=t&sl=de&tl=en&hl=de&ie=UTF-8&oe=UTF-8&oc=2&otf=1&ssel=5&tsel=5&pc=1&q=Hallo"),
  function(result) {
    result.setEncoding('utf8');
    result.on("data", function(chunk) {
      console.log(chunk);
    });
  }
);
This gives me a result string with the translation. The URL used is the request the translate page itself sends to the server.
I know you've already resolved this, but I think the reason your code didn't work is that you should have written [...].find("span.hps")[...]
At least for me, it only ever worked when I included the class identifier, where present.
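For completeness, a tiny sketch of that selector change (it only helps if the spans are actually present in the HTML that request returns):

var $ = cheerio.load(body);
// Count only the translated fragments, which carry the hps class.
console.log($('#result_box').find('span.hps').length);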
The reason you can't use cheerio in Node to scrape the Google translation is that Google does not render the translation page on the server side!
They reply to your request with a script; that script then makes an API request that includes your string, runs on the user's side and builds the content you see, and that's exactly what doesn't happen in cheerio!
So you need to make the request to the API yourself, but it's Google and they can detect scraping, so they will block you after a few attempts!
You can still fake user behavior, but it will take a long time and they may block you at any time!
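For illustration, a minimal sketch of such a direct API call, reusing the translate_a/t URL from the solution above (the query parameters are a subset of those in that answer, and Google may reject or change this endpoint at any time):

var request = require('request');

// Hedged sketch: call the endpoint the translate page itself uses.
// Google can detect this and may block the client after a few attempts.
var url = 'http://translate.google.de/translate_a/t' +
  '?client=t&sl=de&tl=en&ie=UTF-8&oe=UTF-8&q=' + encodeURIComponent('Hallo Welt');

request({ url: url, headers: { 'User-Agent': 'Mozilla/5.0' } }, function (err, resp, body) {
  if (err) throw err;
  console.log(body); // raw response string containing the translation
});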
Related
I am scraping Make My Trip flight data for a project, but for some reason it doesn't work. I've tried many selectors but none of them worked. On the other hand, I also tried scraping another site with the same logic, and it worked. Can someone point out where I went wrong?
I am using cheerio and axios:
const cheerio = require('cheerio');
const axios = require('axios');
Make My Trip:
axios.get('https://www.makemytrip.com/flight/search?itinerary=BOM-DEL-14/11/2020&tripType=O&paxType=A-1_C-0_I-0&intl=false&cabinClass=E').then(urlRes => {
  const $ = cheerio.load(urlRes.data);
  $('.fli-list.one-way').each((i, el) => {
    const airway = $(el).find('.airways-name').text();
    console.log(airway);
  });
}).catch(err => console.log(err));
The other site for which the code works:
axios.get('https://arstechnica.com/gadgets/').then(urlRes => {
  const $ = cheerio.load(urlRes.data);
  $('.tease.article').each((i, el) => {
    const link = $(el).find('a.overlay').attr('href');
    console.log(link);
  });
}).catch(err => console.log(err));
TLDR you should parse
https://voyager.goibibo.com/api/v2/flights_search/find_node_by_name_v2/?search_query=DEL&limit=15&v=2
instead of
https://www.makemytrip.com/flight/search?itinerary=BOM-DEL-14/11/2020&tripType=O&paxType=A-1_C-0_I-0&intl=false&cabinClass=E
Explanation (hope it is clear enough)
Because you're trying to parse a heavy web application with a single plain GET request ... it is impossible that way :)
The main difference between the provided URLs:
the second web page (yes, just a page, not a JS app like MakeMyTrip), 'https://arstechnica.com/gadgets/', responds with the complete content;
MakeMyTrip responds only with a JS script, which then does the work: loading the data and so on.
To parse such complicated web apps you should investigate (press F12 in the browser -> Network tab) all the requests your browser runs on page load and repeat those requests in your script ... in this case you would notice an API endpoint that responds with all the data you need, as in the sketch below.
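For example, a hedged sketch along those lines, using the voyager.goibibo.com endpoint quoted above (the exact query parameters and response shape are assumptions; the endpoint may require extra headers or change without notice):

const axios = require('axios');

// Hypothetical sketch: hit the JSON API the flight search page calls,
// instead of the HTML shell returned by the search URL.
const apiUrl = 'https://voyager.goibibo.com/api/v2/flights_search/find_node_by_name_v2/?search_query=DEL&limit=15&v=2';

axios.get(apiUrl)
  .then(res => {
    // res.data is already parsed JSON; inspect it to find the fields you need
    console.log(JSON.stringify(res.data, null, 2));
  })
  .catch(err => console.log(err));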
I think cheerio works just fine here; I would recommend going over the HTML again and finding another element, class or something else to search for.
When I went to the given URL I did not find .fli-list.one-way in any combination.
Just try to find something more specific to filter on.
If you still need help I can try to scrape it myself and send you some code.
I am using request and cheerio to parse some web pages in Node.js. We do this more than 20 times every day, so we lose a lot of bandwidth loading images and CSS content that is not useful for parsing.
I used some code like this:
request(url, function (error, response, html) {
  if (!error && response.statusCode == 200) {
    var $ = cheerio.load(html);
    $('.n-item').each(function(i, element) {
      // do something
    });
  }
});
1- Is it true that request loads images/other content and may waste my server's bandwidth?
2- Show me a solution to prevent loading images/content.
Thanks
Request itself doesn't parse HTML or run JavaScript. It only downloads the source of the URL that you enter. If it's a normal website, it literally returns the HTML source.
The only time you can pull images with request is if you use a URL that directly links to an image, e.g. http://example.com/image.jpg
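As a small illustrative sketch (the URL is just a placeholder), you can see this by inspecting what a single request actually returns:

var request = require('request');

// One request fetches exactly one resource: the HTML source of this URL.
// Images and CSS referenced inside the HTML are never downloaded unless
// you explicitly request their URLs yourself.
request('http://example.com/', function (error, response, html) {
  if (!error && response.statusCode == 200) {
    console.log(response.headers['content-type']); // e.g. text/html
    console.log(html.length + ' characters of HTML, no images fetched');
  }
});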
I'm working on a react-express project.
On the back end I made a small API that streams some information on my /API/ routes. Just a JSON object.
The thing is, I do not know how I am supposed to get that information into my front end and use it.
I'm using the project as a learning exercise. I have never used an API before.
My main problem (I think) is that English is not my first language. So when I try to google this issue, I get all kinds of results because I'm probably not using the right words.
Any help would be appreciated!
You typically pull the data using a JSON HTTP request. Let's say you have a route /API/myData that returns a JSON formatted response. Your server code should look like:
app.get('/API/myData', function(request, response) {
  response.json(myData);
});
On your react app you can pull this data with any request library. For example with request:
var request = require('request');

// use the full URL of your express server, including the port it listens on
request('http://localhost/API/myData', function (error, response, body) {
  if (!error && response.statusCode == 200) {
    var result = JSON.parse(body); // here is your JSON data
  }
});
It's just a starting point. You should have a look at express examples, request examples and other similar libraries to get familiar with it.
I'm using window.fetch here because it's the easiest thing to start with (even though it's not supported in all browsers yet). You could also use jQuery's ajax function or any number of things.
fetch('https://httpbin.org/ip')
.then(data => data.json())
.then(json => document.getElementById('your-ip').innerHTML = json.origin)
Your IP is: <div id="your-ip"></div>
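If you want to wire the same idea into your React front end, here is a hedged sketch (the component and state names are made up for illustration; the only assumption about the back end is the /API/myData route shown above):

import React from 'react';

// Hypothetical component: fetches the JSON from the express route on mount
// and renders it once it arrives.
class MyData extends React.Component {
  constructor(props) {
    super(props);
    this.state = { data: null };
  }

  componentDidMount() {
    fetch('/API/myData')
      .then(res => res.json())
      .then(json => this.setState({ data: json }))
      .catch(err => console.error(err));
  }

  render() {
    return <pre>{JSON.stringify(this.state.data, null, 2)}</pre>;
  }
}

export default MyData;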
I am about to teach creating a simple web server in node.js to my students. I am doing it initially using the http module and returning a static page. The server code looks like this:
var http = require('http');
var fs = require('fs');

http.createServer(function(request, response) {
  getFile(response);
}).listen(8080);

function getFile(response) {
  var fileName = __dirname + "/public/index.html";
  fs.readFile(fileName, function(err, contents) {
    if (!err) {
      response.end(contents);
    } else {
      response.end();
      console.log("ERROR ERROR ERROR");
    }
  });
}
index.html looks like this:
<!DOCTYPE html>
<html>
  <head>
    <title>Static Page</title>
  </head>
  <body>
    <h1>Returned static page</h1>
    <p>This is the content returned from node as the default file</p>
    <img src="./images/portablePhone.png" />
  </body>
</html>
As I would expect, I am getting the index.html page display without the image (because I am not handling the mime-type). This is fine; what is confusing me is, when I look at the network traffic, I would expect to have the index.html returned three times (the initial request, the image request and one for favicon.ico request). This should happen, because the only thing the web server should ever return is the index.html page in the current folder. I logged the __dirname and fileName var and they came out correctly on each request and there were indeed three requests.
So my question is, what am I missing? Why am I not seeing three index.html response objects in the network monitor on Chrome? I know one of the students will ask and I'd like to have the right answer for him.
what is confusing me is, when I look at the network traffic, I would
expect to have the index.html returned three times (the initial
request, the image request and one for favicon.ico request)
When I run your app, I see exactly three network requests in the network tab in the Chrome debugger, exactly as you proposed and exactly as the HTML page and the web server are coded to do. One for the initial page request, one for the image and one for favicon.ico.
The image doesn't work because you don't actually serve an image (you are serving index.html for all requests) - but perhaps you already know that.
So my question is, what am I missing? Why am I not seeing three
index.html response objects in the network monitor on Chrome?
Here's my screenshot from the network tab of the Chrome debugger when I run your app:
The code that you actually wrote (originally, can't be sure you won't edit the question) just serves an index.html. There is nothing in there that could read any other file (like an image).
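If you wanted the image request to actually succeed, a hedged sketch of extending the handler could look like this (the /public/images layout and the hard-coded PNG mime type are assumptions based on the question's code):

var http = require('http');
var fs = require('fs');

// Minimal sketch: serve index.html by default, but serve files under
// /images/ with an image content type so the <img> tag can load.
http.createServer(function(request, response) {
  var isImage = request.url.indexOf('/images/') === 0;
  var fileName = isImage
    ? __dirname + '/public' + request.url
    : __dirname + '/public/index.html';

  fs.readFile(fileName, function(err, contents) {
    if (err) {
      response.statusCode = 404;
      response.end();
      return;
    }
    response.writeHead(200, { 'Content-Type': isImage ? 'image/png' : 'text/html' });
    response.end(contents);
  });
}).listen(8080);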
I don't think you should teach students that syntax/mechanism because it is outdated. For starters, do not teach them to indent with tabs or four spaces. Indent with 2 spaces for JavaScript. Also, it just doesn't make sense to teach ES5 at this point. They should learn ES2015 or later (ES6/ECMAScript 2016/whatever they call it). For the current version of Node out of the box (6.6 as of writing), this would be the equivalent of what you wrote:
const http = require('http');
const fs = require('fs-promise');

http.createServer((request, response) => {
  fs.readFile(`${__dirname}/public/index.html`)
    .then(data => { response.end(data); })
    .catch(console.error);
}).listen(8080);
But what you seem to be trying to do is create a gallery script.
Another thing about Node is, there are more than 300,000 modules available. So it just absolutely does not make sense to start from 0 and ignore all 300,000+ modules.
Also, within about three months, 6 at the most, async/await will land in Node 7 without requiring babel. And people will argue that kids will be confused if they don't have enough time toiling with promises, but I don't think I buy that. I think you should just teach them how to set up babel and use async/await. Overall its going to make more sense and they will learn a much clearer way to do things. And then the next time you teach the class you won't need babel probably.
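As a hedged sketch, the same static server written with async/await (assuming a Node version or babel setup that already supports it, and the same fs-promise module as above) would look like:

const http = require('http');
const fs = require('fs-promise');

http.createServer(async (request, response) => {
  try {
    const data = await fs.readFile(`${__dirname}/public/index.html`);
    response.end(data);
  } catch (err) {
    console.error(err);
    response.end();
  }
}).listen(8080);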
So this is one way I would make a simple gallery script that doesn't ignore all of the modules on npm and uses up-to-date syntax:
import {readFile} from 'fs-promise';
import listFilepaths from 'list-filepaths';
import Koa from 'koa';

const app = new Koa();

app.use(async (ctx) => {
  if (ctx.request.querystring.indexOf('.jpg') > 0) {
    const fname = ctx.request.querystring.split('=')[1];
    ctx.body = await readFile(`images/${fname}`);
  } else {
    let images = await listFilepaths('./images', {relative: true});
    images = images.map(i => i.replace('images/', ''));
    ctx.body = `${images.map(i => `<img src="/?i=${i}" />`)}`;
  }
});

app.listen(3000);
I'm using cheerio and node.js to parse a webpage and then use css selectors to find data on it. Cheerio doesn't perform so well on malformed html. jsdom is more forgiving, but both behave differently and I've seen both break when the other works fine in certain cases.
Chrome seems to do a fine job with the same malformed html in creating a DOM.
How can I replicate Chrome's ability to create a DOM from malformed HTML, then give the 'cleaned' html representation of this DOM to cheerio for processing?
This way I'll know the HTML it gets is well-formed. I tried phantomjs by setting page.content, but when I then read page.content back, the HTML is still malformed.
So you can use https://github.com/aredridel/html5/ which is a lot more forgiving and, from my experience, works where jsdom fails.
But the last time I tested it, a few months back, it was super slow. I hope it has gotten better.
Then there is also the possibility of spawning a phantomjs process, having it output a JSON of the data you want on stdout, and feeding that back to your Node process.
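As an alternative sketch (not part of the answer above; it swaps in the parse5 package as an assumption about what a more forgiving, browser-like parser could be), you can round-trip the markup through a lenient parser and hand the re-serialized result to cheerio:

const parse5 = require('parse5');
const cheerio = require('cheerio');

// parse5 implements the WHATWG parsing algorithm, so it repairs malformed
// markup roughly the way a browser does before building the DOM.
const dirty = '<p>malformed <span>nested';
const clean = parse5.serialize(parse5.parse(dirty));

const $ = cheerio.load(clean);
console.log($('p span').length); // selectors now run against well-formed HTML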
This seems to do the trick, using phantomjs-node and jquery:
function cleanHtmlWithPhantom(html, callback) {
  var phantom = require('phantom');
  phantom.create(function(ph) {
    ph.createPage(function(page) {
      page.injectJs("/some_local_location/jquery_1.6.1.min.js", function() {
        page.evaluate(
          function() {
            $('html').html(newHtml);
            return $('html').html();
          }.toString().replace(/newHtml/g, "'" + html + "'"),
          function(result) {
            callback("<html>" + result + "</html>");
            ph.exit();
          }
        );
      });
    });
  });
}

cleanHtmlWithPhantom(
  "<p>malformed",
  function(newHtml) {
    console.log(newHtml);
  }
);