Disclaimer: I'm not a Node pro. I've read so many tickets and sites today trying to solve my issues, but whenever one problem was solved, another occurred, and vice versa.
Currently, Puppeteer is used in the following way:
const browser = await puppeteer.launch({
headless: true,
ignoreHTTPSErrors: true
});
const page = await browser.newPage();
const response = await page.goto(targetUrl, {waitUntil: 'load'});
const cdp = await page.target().createCDPSession();
const cookies = (await cdp.send('Network.getAllCookies')).cookies;
const localStorage = await page.evaluate(() => Object.assign({}, window.localStorage));
const sessionStorage = await page.evaluate(() => Object.assign({}, window.sessionStorage));
This works for most pages, but when trying to grab https://cioudways.com for example, I get Execution context was destroyed, most likely because of a navigation.
Replacing {waitUntil: 'load'} with {waitUntil: 'networkidle2'} makes it fail only intermittently. But when trying to grab https://github.com using networkidle2, the whole process times out with Navigation timeout of 30000 ms exceeded (with load instead of networkidle2 it works).
How can I solve this to get a stable script that works with nearly every URL?
The answer to the error in the case of the first URL is literally in the error message: Execution context was destroyed, most likely because of a navigation. When you open https://cioudways.com, it immediately replaces location.href with https://www.cloudways.com/en/?id={foo}&data1={bar}&data2=in (note: this is not a regular HTTP 301 redirect; both responses are HTTP 200, and HTTP 30x redirects are handled by Puppeteer), so your page is destroyed before you have a chance to evaluate it.
For this specific URL, awaiting a new load event right after page.goto() solves your issue:
await page.goto('https://cioudways.com', { waitUntil: 'load' })
await page.waitForNavigation({ waitUntil: 'load' })
Of course, this would break the script for any other target URL (as it is unusual to have additional navigation on site launch). So you can't apply it as a general solution.
For this specific site, you could use the redirect target https://www.cloudways.com/ to avoid this issue.
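As a more general workaround, you could tolerate one extra JS-driven navigation after load by racing a second waitForNavigation against a short grace period. This is only a sketch: gotoSettled is a hypothetical helper, not a Puppeteer API, and the grace period is an arbitrary assumption.

```javascript
// Hypothetical helper: load the page, then allow one extra JS-driven
// navigation to settle. The waitForNavigation is raced against a short
// grace period so pages without a second navigation don't stall the script.
async function gotoSettled(page, url, graceMs = 3000) {
  const response = await page.goto(url, { waitUntil: 'load' });
  await Promise.race([
    page.waitForNavigation({ waitUntil: 'load' }).catch(() => null),
    new Promise(resolve => setTimeout(resolve, graceMs)),
  ]);
  return response;
}
```

Note the .catch(() => null): if no second navigation ever happens, waitForNavigation would otherwise reject with its own timeout error.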
The second case has a different cause. The https://github.com page seems to have an issue with its resources. If I log all network calls:
await page.setRequestInterception(true)
page.on('request', request => {
console.log(request.url())
request.continue()
})
await page.goto('https://github.com', { waitUntil: 'networkidle2' })
It always stops at https://github.githubassets.com/images/modules/site/home/globe/flag.obj. I have no answer for this; the login page is full of canvases and animations that may affect networkidle2 (which fires when there are no more than 2 network connections in progress for a while). It may be caused by a bug on GitHub's side. Maybe it is worth its own question.
Suggestion
As your problem lies in the unreliability of page loads, I suggest using { waitUntil: 'load' } (this is the default, so you can omit the argument completely) and pausing the page for a short while (page.waitForTimeout()) to give localStorage etc. time to be filled in the case of Angular/React apps. This is only a workaround: pausing script execution across a huge number of URLs is not ideal; for slower pages the hardcoded pause may not be enough, while for others it will be an unnecessarily long wait.
await page.goto(targetUrl)
await page.waitForTimeout(4000)
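One way to soften the hardcoded-pause tradeoff is to wait for the network to go idle but cap the wait, so a page that never settles (like the github.com case above) can't hang the script. This is a sketch under the assumption that your Puppeteer version is new enough to have page.waitForNetworkIdle; settle is an illustrative name, not an API.

```javascript
// Wait for network idle, but never longer than capMs. The .catch absorbs
// waitForNetworkIdle's own timeout rejection so losing the race is harmless.
async function settle(page, capMs = 5000) {
  await Promise.race([
    page.waitForNetworkIdle({ idleTime: 500 }).catch(() => null),
    new Promise(resolve => setTimeout(resolve, capMs)),
  ]);
}

// usage:
// await page.goto(targetUrl)
// await settle(page)
```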
Related
I'm currently scraping a public webpage in case it goes down, and this site has some files that, when opened in Chromium, are usually downloaded to your downloads folder automatically. For example, accessing https://www.7-zip.org/a/7z2201-x64.exe downloads a file instead of showing you the binary.
My code is fairly complicated, but the main part is this:
const page = await browser.newPage();
page.on("response", async response => {
// saves the file to a place I want it, but doesn't cancel the chrome-based download.
const buffer = Buffer.from(new Uint8Array(await page.evaluate(function(x:string) {
return fetch(x).then(r=>r.arrayBuffer())
}, response.url())));
fs.writeFileSync('path', buffer);
return void 0;
});
await page.goto('https://www.7-zip.org/a/7z2201-x64.exe', { waitUntil: "load", timeout: 120000 });
I can't just assume the mime type either; the page could go to any URL, from an HTML file to a zip file. So is it possible to disable downloads or rewire them to /dev/null? I've looked into response intercepting and it doesn't seem to be a thing based on this.
After reading a bit about /dev/null and seeing this answer, I figured out that I can do this:
const page = await browser.newPage();
await (await page.target().createCDPSession()).send("Page.setDownloadBehavior", {
behavior: "deny",
downloadPath: "/dev/null"
});
Setting the download path to /dev/null is redundant here, but if you don't want to fully deny the download behavior for a tab and also don't want files landing in your downloads folder, /dev/null will essentially delete whatever it receives on the spot.
I set the download behavior before navigating to the page. Note that this addresses download behavior at the Chromium level, so it works regardless of mime type.
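If you still want to capture the file bodies yourself while denying the browser download, one option is to inspect each response and save attachment-like ones via response.buffer() instead of re-fetching inside the page. This is a sketch: isDownloadResponse is an illustrative helper, and its header heuristics are assumptions about how the target serves files.

```javascript
// Heuristic: treat a response as a "download" if the server marks it as
// an attachment or serves it as a raw octet stream.
function isDownloadResponse(headers) {
  const disposition = (headers['content-disposition'] || '').toLowerCase();
  const type = (headers['content-type'] || '').toLowerCase();
  return disposition.startsWith('attachment') ||
         type.startsWith('application/octet-stream');
}

// Usage inside your existing setup, where `page` is a Puppeteer Page:
// page.on('response', async response => {
//   if (isDownloadResponse(response.headers())) {
//     // response.buffer() reads the body Chromium already received
//     fs.writeFileSync('path', await response.buffer());
//   }
// });
```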
I am scraping Make My Trip flight data for a project, but for some reason it doesn't work. I've tried many selectors but none of them worked. On the other hand, I tried scraping another site with the same logic, and it worked. Can someone point out where I went wrong?
I am using cheerio and axios
const cheerio = require('cheerio');
const axios = require('axios');
Make My Trip:
axios.get('https://www.makemytrip.com/flight/search?itinerary=BOM-DEL-14/11/2020&tripType=O&paxType=A-1_C-0_I-0&intl=false&cabinClass=E').then(urlRes => {
const $ = cheerio.load(urlRes.data);
$('.fli-list.one-way').each((i, el) => {
const airway = $(el).find('.airways-name ').text();
console.log(airway);
});
}).catch(err => console.log(err));
The other site for which the code works:
axios.get('https://arstechnica.com/gadgets/').then(urlRes => {
const $ = cheerio.load(urlRes.data);
$('.tease.article').each((i, el) => {
const link = $(el).find('a.overlay').attr('href');
console.log(link);
});
}).catch(err => console.log(err));
TL;DR: you should parse
https://voyager.goibibo.com/api/v2/flights_search/find_node_by_name_v2/?search_query=DEL&limit=15&v=2
instead of
https://www.makemytrip.com/flight/search?itinerary=BOM-DEL-14/11/2020&tripType=O&paxType=A-1_C-0_I-0&intl=false&cabinClass=E
Explanation (I hope it is clear enough)
You're trying to parse a heavy web application using one plain GET request, and that is impossible this way :)
The main difference between the provided URLs:
- the second page (just a page, not a JS app like makemytrip), https://arstechnica.com/gadgets/, responds with the complete content;
- makemytrip responds only with a JS script, which does the actual work: loading the data, etc.
To parse such complicated web apps, you should investigate (press F12 in the browser -> Network tab) all the requests your browser makes on page load and repeat those requests in your script. In this case you will notice an API endpoint that responds with all the needed data.
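A minimal sketch of that approach, assuming the goibibo endpoint above accepts the query parameters seen in the network tab (buildSearchUrl is an illustrative helper, not part of any API):

```javascript
// Build the API URL from its query pieces rather than hardcoding the string.
function buildSearchUrl(query, limit = 15) {
  const url = new URL('https://voyager.goibibo.com/api/v2/flights_search/find_node_by_name_v2/');
  url.searchParams.set('search_query', query);
  url.searchParams.set('limit', String(limit));
  url.searchParams.set('v', '2');
  return url.toString();
}

// The endpoint returns JSON, so no cheerio/HTML parsing is needed:
// axios.get(buildSearchUrl('DEL')).then(res => console.log(res.data));
```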
I think cheerio works just fine; I recommend going over the HTML again and finding a new element, class, or something else to search for.
When I visited the given URL, I did not find .fli-list.one-way in any combination.
Just try to find something more specific to filter on.
If you still need help, I can try to scrape it myself and send you some code.
This code runs correctly locally every time. However, when I deploy it to the server (Ubuntu on a Raspberry Pi using chromium-browser), I get errors on around 3 out of 10 attempts. This code works best...
await page.goto('http://mywebsite.com')
const element = await page.$('div[class="user-tags"]')
const value = await page.evaluate(el => el.textContent, element)
but sometimes returns... "Error Getting Experience Level Error: Evaluation failed: TypeError: Cannot read property 'textContent' of null"
So I looked around for solutions and tried this, but it fails every time (both code blocks run fine locally)...
await page.goto('http://mywebsite.com')
await page.waitForSelector('div[class="user-tags"]')
const element = await page.$('div[class="user-tags"]')
const value = await page.evaluate(el => el.textContent, element)
Which throws " Error Getting Experience Level TimeoutError: waiting for selector "div[class="user-tags"]" failed: timeout 30000ms exceeded 9/10/2020 # 06:02:35"
Thanks for any suggestions!
The difference between the first and second code snippets
In the second code sample you instruct puppeteer to wait until div[class="user-tags"] exists on the target page:
await page.waitForSelector('div[class="user-tags"]')
That is the correct way of getting data from an element: first make sure it is available, then query it.
The timeout error happens because the given element is not found within 30 seconds (the default timeout).
Ways to solve this
First you need to figure out why the element is not found by Puppeteer.
Maybe div.user-tags is not supposed to be present on every page?
Maybe the Raspberry Pi is not powerful enough to load and process the target page within 30 seconds? It is possible to increase the timeout.
You can also wait until the page is completely loaded, this way puppeteer will make sure all of the resources are loaded before going on with the script:
await page.goto(url, { waitUntil: 'networkidle0' })
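A sketch combining both ideas: a longer per-selector timeout for the slower Raspberry Pi, and a graceful null instead of a crash when the element genuinely isn't on the page. getUserTags is a hypothetical helper name.

```javascript
// Wait for the element with a generous timeout; return null rather than
// throwing when it never appears (missing element or too-slow page).
async function getUserTags(page, timeoutMs = 60000) {
  try {
    await page.waitForSelector('div[class="user-tags"]', { timeout: timeoutMs });
    return await page.$eval('div[class="user-tags"]', el => el.textContent);
  } catch (err) {
    return null;
  }
}
```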
I am developing a bot using Node.js which should ask the user a set of questions, and then after a break, ask another set of questions.
I am using await sleep(milliseconds) in between.
While testing in the Emulator, I noticed that the questions from the first set are sent one by one, saving the user's response each time. The second set is sent all at once, without allowing the user to respond to each question individually.
await turnContext.sendActivity(askFirstSetOfQuestions(question));
await sleep(60000);
await turnContext.sendActivity(askSecondSetOfQuestions(question));
await next();
image - screen capture from emulator
Quite lost here... any help would be appreciated.
There are two things to consider here. First, if you truly need a delay, it is better to await a promise than to use sleep. You can do this via await new Promise(resolve => setTimeout(resolve, DELAY_LENGTH));. I use this frequently in conjunction with the typing indicator to give the bot a more natural conversation flow when it sends two or more discrete messages without waiting for user input.
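That pattern might look like the sketch below: a promise-based delay plus the typing indicator, sent between two messages. sendWithPause is a hypothetical helper; turnContext is the Bot Framework turn context from your handler.

```javascript
// Promise-based replacement for sleep().
const delay = ms => new Promise(resolve => setTimeout(resolve, ms));

// Show a typing indicator, pause, then send the actual message.
async function sendWithPause(turnContext, text, ms = 2000) {
  await turnContext.sendActivity({ type: 'typing' });
  await delay(ms);
  await turnContext.sendActivity(text);
}
```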
However, as Eric mentioned, it seems like you might want a waterfall dialog instead for your use case. This sample from botbuilder-samples is a good example. Each step would be a prompt, and it will wait for user input before proceeding. I won't try to write an entire bot here, but a single question-answer step would look something like:
async firstQuestion(stepContext) {
return await stepContext.prompt(TEXT_PROMPT, 'Question 1');
}
async secondQuestion(stepContext) {
stepContext.values.firstAnswer = stepContext.result;
return await stepContext.prompt(TEXT_PROMPT, 'Question 2');
}
and so forth. I'm not sure what you are doing with the responses, but in the example above I'm saving them to stepContext.values so that the answers are all available in later steps as part of the context object.
If you can detail your use case/expected behavior and share what askFirstSetOfQuestions is, we can provide further assistance.
When I try to add multiple emojis to a message on Discord, there is a slight delay before adding the next one. This is a little annoying, and I was wondering if there is any way to add all the emojis at the same time.
Snippet of Current Code:
await client.add_reaction(message=msg,emoji='❌')
await client.add_reaction(message=msg,emoji='\u0031\u20E3')
await client.add_reaction(message=msg,emoji='\u0032\u20E3')
await client.add_reaction(message=msg,emoji='\u0033\u20E3')
await client.add_reaction(message=msg,emoji='\u0034\u20E3')
await client.add_reaction(message=msg,emoji='\u0035\u20E3')
await client.add_reaction(message=msg,emoji='\u0036\u20E3')
await client.add_reaction(message=msg,emoji='\u0037\u20E3')
await client.add_reaction(message=msg,emoji='\u0038\u20E3')
await client.add_reaction(message=msg,emoji='\u0039\u20E3')
await client.add_reaction(message=msg,emoji='\u0030\u20E3')
await client.add_reaction(message=msg,emoji='\u27A1')
No. There isn't a coroutine for adding more than one reaction to a message at a time, because there isn't an endpoint in the Discord API for doing it. Since that means you have to make multiple API calls, discord.py rate-limits itself to stay within the API's restrictions.