I am scraping Make My Trip flight data for a project, but for some reason it doesn't work. I've tried many selectors, but none of them worked. On the other hand, I also tried scraping another site with the same logic, and that worked. Can someone point out where I went wrong?
I am using cheerio and axios:
const cheerio = require('cheerio');
const axios = require('axios');
Make My Trip:
axios.get('https://www.makemytrip.com/flight/search?itinerary=BOM-DEL-14/11/2020&tripType=O&paxType=A-1_C-0_I-0&intl=false&cabinClass=E').then(urlRes => {
  const $ = cheerio.load(urlRes.data);
  $('.fli-list.one-way').each((i, el) => {
    const airway = $(el).find('.airways-name').text();
    console.log(airway);
  });
}).catch(err => console.log(err));
The other site for which the code works:
axios.get('https://arstechnica.com/gadgets/').then(urlRes => {
  const $ = cheerio.load(urlRes.data);
  $('.tease.article').each((i, el) => {
    const link = $(el).find('a.overlay').attr('href');
    console.log(link);
  });
}).catch(err => console.log(err));
TL;DR: you should parse
https://voyager.goibibo.com/api/v2/flights_search/find_node_by_name_v2/?search_query=DEL&limit=15&v=2
instead of
https://www.makemytrip.com/flight/search?itinerary=BOM-DEL-14/11/2020&tripType=O&paxType=A-1_C-0_I-0&intl=false&cabinClass=E
Explanation (I hope it is clear enough):
You're trying to parse a heavy web application with a single plain GET request, and it simply can't be done that way. :)
The main difference between the two URLs:
The second one ('https://arstechnica.com/gadgets/') is just a web page, not a JS app like makemytrip, so it responds with the complete content.
makemytrip responds only with a JS script, which then does the actual work: loads the data, renders it, and so on.
To parse such complicated web apps, you should inspect (press F12 in the browser -> Network tab) all the requests your browser makes on page load and replay those requests in your script. In this case you'll notice an API endpoint that responds with all the data you need.
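For example, a minimal sketch of calling that endpoint directly with axios (the query parameters come from the URL above; the response shape is an assumption, and the API may require extra headers or block automated clients):
const axios = require('axios');

// Hit the JSON endpoint the page itself calls, instead of the HTML shell.
const apiUrl = 'https://voyager.goibibo.com/api/v2/flights_search/find_node_by_name_v2/?search_query=DEL&limit=15&v=2';

axios.get(apiUrl)
  .then(res => {
    // res.data is already parsed JSON, so no cheerio is needed.
    console.log(JSON.stringify(res.data, null, 2));
  })
  .catch(err => console.log(err));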
I think cheerio works just fine; I'd recommend going over the HTML again and finding another element, class, or something else to search for.
When I went to the given URL I did not find .fli-list.one-way in any combination.
Just try to find something more specific to filter on.
If you still need help, I can try scraping it myself and send you some code.
Related
I am currently using the below server-side rendering logic (using ReactJS + Node.js + Redux) to fetch the data synchronously the first time and set it as the initial state in the store.
fetchInitialData.js
export function fetchInitialData(q, callback) {
  const url = 'http://....';
  axios.get(url)
    .then((response) => {
      callback(response.data);
    }).catch((error) => {
      console.log(error);
    });
}
I fetch the data asynchronously and load the output into the store the first time the page loads, using the callback.
handleRender(req, res) {
  fetchInitialData(q, apiResult => {
    const data = apiResult;
    const results = { data, fetched: true, fetching: false, queryValue: q };
    const store = configureStore(results, reduxRouterMiddleware);
    ....
    const html = renderToString(component);
    res.status(200);
    res.send(html);
  });
}
I have a requirement to make 4 to 5 API calls on the initial page load, so I thought I'd check whether there is an easy way to make multiple calls on page load.
Do I need to chain the API calls and manually merge the responses from the different calls before sending the result back to load the initial state?
Update 1:
I am thinking of using the axios.all approach; can someone let me know if that is an ideal approach?
You want to make sure that requests happen in parallel, and not in sequence.
I have solved this previously by creating a Promise for each API call and waiting for all of them with axios.all(). The code below is basically pseudocode; I could dig into a better implementation at a later time.
handleRender(req, res) {
  fetchInitialData()
    .then(initialResponse => {
      return axios.all([
        buildFirstCallResponse(),
        buildSecondCallResponse(),
        buildThirdCallResponse()
      ]);
    })
    .then(allResponses => res.send(renderToString(component)));
}
buildFirstCallResponse() {
  return axios.get('https://jsonplaceholder.typicode.com/posts/1')
    .then(response => response.data)
    .catch(err => console.error('Something went wrong in the first call', err));
}
Notice how all responses are bundled up into an array.
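If it helps, here is a rough sketch (the endpoints are placeholders and the state shape is made up, not your app's) of spreading those bundled responses into the initial state before creating the store:
const axios = require('axios');

function loadInitialState() {
  return axios.all([
    axios.get('https://jsonplaceholder.typicode.com/posts/1'),
    axios.get('https://jsonplaceholder.typicode.com/users/1')
  ]).then(axios.spread((post, user) => ({
    // Responses come back in the same order the promises were passed in.
    post: post.data,
    user: user.data,
    fetched: true,
    fetching: false
  })));
}

// handleRender could then do something like:
// loadInitialState().then(results => configureStore(results, reduxRouterMiddleware));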
The Redux documentation goes into server-side rendering a bit, but you might already have seen that. redux.js.org/docs/recipes/ServerRendering
Also check out the Promise docs to see exactly what .all() does: developer.mozilla.org/en/docs/Web/JavaScript/Reference/Global_Objects/Promise/all
Let me know if anything is unclear.
You could try express-batch; GraphQL is another option.
You could also use Redux-Saga to trigger multiple API calls from pure Redux actions and handle all of those calls with pure actions. Introduction to Sagas
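A minimal sketch of that idea (the action types and endpoints here are hypothetical, not from your app) using redux-saga's all/call effects:
import { all, call, put, takeEvery } from 'redux-saga/effects';
import axios from 'axios';

// Fire several API calls in parallel and dispatch one action with the results.
function* loadInitialData() {
  const [post, user] = yield all([
    call(axios.get, 'https://jsonplaceholder.typicode.com/posts/1'),
    call(axios.get, 'https://jsonplaceholder.typicode.com/users/1')
  ]);
  yield put({ type: 'INITIAL_DATA_LOADED', post: post.data, user: user.data });
}

export default function* rootSaga() {
  yield takeEvery('LOAD_INITIAL_DATA', loadInitialData);
}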
I am about to teach creating a simple web server in node.js to my students. I am doing it initially using the http module and returning a static page. The server code looks like this:
var http = require('http');
var fs = require('fs');

http.createServer(function(request, response) {
  getFile(response);
}).listen(8080);

function getFile(response) {
  var fileName = __dirname + "/public/index.html";
  fs.readFile(fileName, function(err, contents) {
    if (!err) {
      response.end(contents);
    } else {
      response.end();
      console.log("ERROR ERROR ERROR");
    }
  });
}
index.html looks like this:
<!DOCTYPE html>
<html>
<head>
  <title>Static Page</title>
</head>
<body>
  <h1>Returned static page</h1>
  <p>This is the content returned from node as the default file</p>
  <img src="./images/portablePhone.png" />
</body>
</html>
As I would expect, I am getting the index.html page displayed without the image (because I am not handling the MIME type). This is fine; what confuses me is that, when I look at the network traffic, I would expect index.html to be returned three times (the initial request, the image request, and the favicon.ico request), because the only thing the web server ever returns is the index.html page in the current folder. I logged the __dirname and fileName vars and they came out correctly on each request, and there were indeed three requests.
So my question is, what am I missing? Why am I not seeing three index.html response objects in the network monitor on Chrome? I know one of the students will ask and I'd like to have the right answer for him.
what is confusing me is, when I look at the network traffic, I would
expect to have the index.html returned three times (the initial
request, the image request and one for favicon.ico request)
When I run your app, I see exactly three network requests in the network tab in the Chrome debugger, exactly as you proposed and exactly as the HTML page and the web server are coded to do. One for the initial page request, one for the image and one for favicon.ico.
The image doesn't work because you don't actually serve an image (you are serving index.html for all requests) - but perhaps you already know that.
So my question is, what am I missing? Why am I not seeing three
index.html response objects in the network monitor on Chrome?
Here's my screenshot from the network tab of the Chrome debugger when I run your app:
The code that you actually wrote (originally, can't be sure you won't edit the question) just serves an index.html. There is nothing in there that could read any other file (like an image).
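If you did want the image to show up, a minimal sketch (not part of your original assignment code; the type map below is just an illustration) would map the request URL to a file and set a Content-Type:
var http = require('http');
var fs = require('fs');
var path = require('path');

// Map a few file extensions to MIME types.
var types = { '.html': 'text/html', '.png': 'image/png', '.ico': 'image/x-icon' };

http.createServer(function(request, response) {
  var urlPath = request.url === '/' ? '/index.html' : request.url;
  var fileName = path.join(__dirname, 'public', urlPath);
  fs.readFile(fileName, function(err, contents) {
    if (err) {
      response.statusCode = 404;
      response.end();
      return;
    }
    response.setHeader('Content-Type', types[path.extname(fileName)] || 'application/octet-stream');
    response.end(contents);
  });
}).listen(8080);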
I don't think you should teach students that syntax/mechanism because it is outdated. For starters, do not teach them to indent with tabs or four spaces. Indent with 2 spaces for JavaScript. Also, it just doesn't make sense to teach ES5 at this point. They should learn ES2015 or later (ES6/ECMAScript 2016/whatever they call it). For the current version of Node out of the box (6.6 as of writing), this would be the equivalent of what you wrote:
const http = require('http');
const fs = require('fs-promise');

http.createServer((request, response) => {
  fs.readFile(`${__dirname}/public/index.html`)
    .then(data => { response.end(data); })
    .catch(console.error);
}).listen(8080);
But what you seem to be trying to do is create a gallery script.
Another thing about Node is that there are more than 300,000 modules available, so it absolutely does not make sense to start from zero and ignore all of them.
Also, within about three months, six at the most, async/await will land in Node 7 without requiring Babel. People will argue that kids will be confused if they don't spend enough time toiling with promises, but I don't buy that. I think you should just teach them how to set up Babel and use async/await. Overall it's going to make more sense and they will learn a much clearer way to do things. And then the next time you teach the class you probably won't need Babel.
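For instance, a minimal sketch of the same static-file server written with async/await (assuming Babel, or a Node version that supports it, and the same fs-promise module as above):
const http = require('http');
const fs = require('fs-promise');

http.createServer(async (request, response) => {
  try {
    // Read the file and send it back once the promise resolves.
    const data = await fs.readFile(`${__dirname}/public/index.html`);
    response.end(data);
  } catch (err) {
    console.error(err);
    response.end();
  }
}).listen(8080);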
So this is one way I would make a simple gallery script that doesn't ignore all of the modules on npm and uses up-to-date syntax:
import { readFile } from 'fs-promise';
import listFilepaths from 'list-filepaths';
import Koa from 'koa';

const app = new Koa();

app.use(async (ctx) => {
  if (ctx.request.querystring.indexOf('.jpg') > 0) {
    const fname = ctx.request.querystring.split('=')[1];
    ctx.body = await readFile(`images/${fname}`);
  } else {
    let images = await listFilepaths('./images', { relative: true });
    images = images.map(i => i.replace('images/', ''));
    ctx.body = `${images.map(i => `<img src="/?i=${i}" />`)}`;
  }
});

app.listen(3000);
Let's say I configure redstone as follows
#app.Route("/raw/user/:id", methods: const [app.GET])
getRawUser(int id) => json_about_user_id;
When I run the server and go to /raw/user/10 I get raw JSON data in the form of a string.
Now I would like to be able to go to, say, /user/10 and get a nice representation of the JSON I get from /raw/user/10.
Solutions that come to my mind are as follows:
First
create web/user/user.html and web/user/user.dart, configure the latter to run when index.html is accessed
in user.dart monitor query parameters (user.dart?id=10), make appropriate requests and present everything in user.html, i.e.
var uri = Uri.parse(window.location.href);
String id = uri.queryParameters['id'];
HttpRequest.getString(new Uri.http(SERVER, '/raw/user/${id}').toString()).then(presentation);
A downside of this solution is that I do not achieve /user/10-like urls at all.
Another way is to additionally configure redstone as follows:
#app.Route("/user/:id", methods: const [app.GET])
getUser(int id) => app.redirect('/person/index.html?id=${id}');
in this case at least urls like "/user/10" are allowed, but this simply does not work.
How would I do this correctly? The example web app in redstone's git repo is, to my mind, cryptic and involved.
I am not sure whether this has to be explained in connection with redstone or just dart, but I cannot find anything related.
I guess you are trying to generate HTML files on the server with a template engine. Redstone was designed mainly for building services, so it doesn't have a built-in template engine, but you can use any engine available on pub, such as mustache. However, if you use Polymer, AngularDart or another framework that implements a client-side template system, you don't need to generate HTML files on the server at all.
Moreover, if you want to reuse other services, you can just call them directly, for example:
#app.Route("/raw/user/:id")
getRawUser(int id) => json_about_user_id;
#app.Route("/user/:id")
getUser(int id) {
var json = getRawUser();
...
}
Redstone v0.6 (still in alpha) also includes a new forward() function, which you can use to dispatch a request internally. However, the response is received as a shelf.Response object, so you have to read it:
#app.Route("/user/:id")
getUser(int id) async {
var resp = await chain.forward("/raw/user/$id");
var json = await resp.readAsString();
...
}
Edit:
To serve static files, like html files and dart scripts which are executed in the browser, you can use the shelf_static middleware. See here for a complete Redstone + Polymer example (shelf_static is configured in the bin/server.dart file).
I would like to scrape Google Translate with Node.js and the cheerio library:
request("http://translate.google.de/#de/en/hallo%20welt", function(err, resp, body) {
if(err) throw err;
$ = cheerio.load(body);
console.log($('#result_box').find('span').length);
}
But it can't find the necessary span elements from the translation box (result_box). In the source code of the website it looks like this:
<span id="result_box">
<span class="hps">hello</span>
<span class="hps">world</span>
</span>
So I thought I could wait 5-10 seconds until Google has created all the span elements, but no, that doesn't seem to be it:
setTimeout(function() {
  $ = cheerio.load(body);
  console.log($('#result_box').find('span').length);
}, 15000);
Could you help me, please? :)
Solution:
Instead of cheerio I use http.get:
http.get(
  this.prepareURL("http://translate.google.de/translate_a/t?client=t&sl=de&tl=en&hl=de&ie=UTF-8&oe=UTF-8&oc=2&otf=1&ssel=5&tsel=5&pc=1&q=Hallo"),
  function(result) {
    result.setEncoding('utf8');
    result.on("data", function(chunk) {
      console.log(chunk);
    });
  }
);
So I get a result string with the translation. The URL used is the request sent to the server.
I know you've already resolved this, but I think the reason your code didn't work is that you should have written [...].find("span.hps")[...]
Or at least for me it only ever worked with the class identifier, when one was present.
The reason you can't use cheerio in Node to scrape Google Translate is that Google does not render the translation page on its side!
They reply to your request with a script; that script then makes an API request containing your string, runs on the user's side, and builds the content you see. None of that happens in cheerio.
So you need to make a request to the API directly, but it's Google: they can detect scraping and will block you after a few attempts.
You can still fake user behaviour, but it'll take a long time and they may block you at any time.
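For illustration, a minimal sketch of requesting the API directly (reusing the translate_a endpoint shown in the solution above; the parameters and response format are assumptions, and Google may rate-limit or block such requests):
const request = require('request');

const url = 'http://translate.google.de/translate_a/t?client=t&sl=de&tl=en'
  + '&ie=UTF-8&oe=UTF-8&q=' + encodeURIComponent('Hallo Welt');

request(url, (err, resp, body) => {
  if (err) throw err;
  console.log(body); // raw response body containing the translation
});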
I'm using cheerio and node.js to parse a webpage and then use css selectors to find data on it. Cheerio doesn't perform so well on malformed html. jsdom is more forgiving, but both behave differently and I've seen both break when the other works fine in certain cases.
Chrome seems to do a fine job with the same malformed html in creating a DOM.
How can I replicate Chrome's ability to create a DOM from malformed HTML, then give the 'cleaned' html representation of this DOM to cheerio for processing?
This way I'll know the HTML it gets is well-formed. I tried phantomjs by setting page.content, but when I read page.content's value back, the HTML is still malformed.
You can use https://github.com/aredridel/html5/, which is a lot more forgiving and, in my experience, works where jsdom fails.
But the last time I tested it, a few months back, it was super slow. I hope it has gotten better.
Then there is also the possibility of spawning a phantomjs process and outputting the data you want on stdout as JSON, then feeding that back to your Node process.
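A minimal sketch of that spawn-a-phantomjs-process idea (the render.js filename and the target URL are placeholders; it assumes the phantomjs binary is on your PATH):
// render.js - run by phantomjs, prints the browser-fixed DOM to stdout
var system = require('system');
var page = require('webpage').create();

page.open(system.args[1], function() {
  console.log(page.content);
  phantom.exit();
});

// In your Node script: spawn phantomjs and hand the cleaned HTML to cheerio
var execFile = require('child_process').execFile;
var cheerio = require('cheerio');

execFile('phantomjs', ['render.js', 'http://example.com/'], function(err, stdout) {
  if (err) throw err;
  var $ = cheerio.load(stdout);
  console.log($('p').length);
});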
This seems to do the trick, using phantomjs-node and jquery:
function cleanHtmlWithPhantom(html, callback) {
  var phantom = require('phantom');
  phantom.create(function(ph) {
    ph.createPage(function(page) {
      page.injectJs(
        "/some_local_location/jquery_1.6.1.min.js",
        function() {
          page.evaluate(
            function() {
              $('html').html(newHtml);
              return $('html').html();
            }.toString().replace(/newHtml/g, "'" + html + "'"),
            function(result) {
              callback("<html>" + result + "</html>");
              ph.exit();
            }
          );
        }
      );
    });
  });
}
cleanHtmlWithPhantom(
  "<p>malformed",
  function(newHtml) {
    console.log(newHtml);
  }
);