I'm fairly new to Node.js and I am using Puppeteer to automate some browsing, but I am getting a bit lost with the complexity of a certain scenario.
I am clicking a button, and it will search some records (using AJAX) and put the results on the page.
page.waitForResponse / page.waitForRequest doesn't really fit, because I am waiting for 2-3 requests depending on the type of search, and the response URLs are exactly the same for each. So I guess I want to wait for 3 responses of this particular URL to be complete.
Maybe I need to rethink this, or maybe I am close? The promise always times out even though responseCount seems to be increasing:
async function intercepted(resp) {
if (resp.url().includes('/ajaxpro/')) {
return 1
}
return 0
}
let responseCount = 0
page.on('response', async resp => {
responseCount += await intercepted(resp)
})
const getResponse = await new Promise((resolve, reject) => {
setTimeout(() => resolve(responseCount > 3), 60000)
})
Try checking the condition after each response is received.
async function intercepted(resp) {
if (resp.url().includes('/ajaxpro/')) {
return 1
}
return 0
}
let responseCount = 0
page.on('response', async resp => {
let isTargetSearch = await intercepted(resp);
responseCount += isTargetSearch;
// - The current response is what we are looking for
// - and reached 3 times.
if(isTargetSearch && responseCount == 3) {
// Do what you need to do here
}
})
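If the rest of your flow needs to block until that third response arrives, you can wrap the counter in a promise that resolves at the threshold, instead of resolving on a fixed timer. A minimal sketch (the /ajaxpro/ fragment and the count of 3 come from the question; the timeout value and button selector are assumptions):
function waitForResponses(page, urlFragment, count, timeoutMs = 60000) {
    return new Promise((resolve, reject) => {
        let seen = 0;
        // Reject instead of hanging forever if the responses never arrive.
        const timer = setTimeout(
            () => reject(new Error(`Timed out after ${seen}/${count} responses`)),
            timeoutMs
        );
        const onResponse = resp => {
            if (resp.url().includes(urlFragment) && ++seen === count) {
                clearTimeout(timer);
                page.removeListener('response', onResponse);
                resolve();
            }
        };
        page.on('response', onResponse);
    });
}

// Start waiting before the click so no response is missed:
// const waiter = waitForResponses(page, '/ajaxpro/', 3);
// await page.click('#searchButton'); // hypothetical selector
// await waiter;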
Maybe you don't have to wait until a certain number of responses have been intercepted, but instead wait until the results from those AJAX calls have rendered something on the page (wait for visual results). In that case, you would do a page.waitForSelector(selector), where the selector matches an element that the results of those calls render on screen.
When dealing with Puppeteer, I usually find waiting for visible results to be better...
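For example (a sketch; the selectors are hypothetical placeholders for whatever your results markup looks like):
await page.click('#searchButton'); // hypothetical button selector
// Resolves once at least one result row is rendered and visible.
await page.waitForSelector('.search-results .row', { visible: true });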
I am trying to extract the year, make, model, colour and plate number from carjam.co.nz. An example of a URL I am scraping from is https://www.carjam.co.nz/car/?plate=JKY242.
If the plate has been recently requested, then the response will be an HTML document with the vehicle details.
Result where the plate details have been recently requested.
If the plate details haven't been recently requested (as is the case with most plates), the response is an HTML document with "Trying to get some vehicle data". I'm guessing that this page displays while the information is fetched from the database, then the page is reloaded to show the vehicle details. This appears to be rendered server-side; I can't see any AJAX requests.
The URL is the same for each result.
Result where the vehicle hasn't been recently requested.
How do I 'wait' for the correct information?
I am using request (deprecated, I know, but it is what I am most comfortable using) on a Node.js server with Express.
My (very reduced) code:
app.get("/:numberPlate", (req, res) => {
request("https://www.carjam.co.nz/car/?plate=" + req.params.numberPlate, function(error, response, body) {
const $ = cheerio.load(body);
res.status(200).send(JSON.stringify({
year: $("[data-key=year_of_manufacture]").next().html(),
make: toTitleCase($("[data-key=make]").next().html()),
model: toTitleCase($("[data-key=model]").next().html()),
colour: toTitleCase($("[data-key=main_colour]").next().html()),
}));
});
});
I have considered:
Making a request and discarding it, sleeping for 2-3 seconds, then making a second request. The advantage of this approach is that every request would work. The disadvantage is that every request takes 2-3 seconds (too slow).
Making a request and checking to see if the body contains "Trying to get some vehicle data". If so, sleep a few seconds, make another request and take action on the result of that second request (but how?).
I'm sure this is a common problem with an easy answer, but I don't have enough experience to figure it out myself, or to know exactly what to Google!
To test: New Zealand has number plates in the format "ABC123" – three letters, three numbers. These are released in alphabetical-ish order; currently we have nothing past NLU999 (excluding custom number plates, number plates issued out of sequence, etc).
To reproduce the "Trying to get some vehicle data" response, you need to find a fresh number plate each time – most number plates earlier in the sequence than NLU999 should work.
This code snippet should generate a valid numberplate.
console.log(
    Math.random().toString(36).replace(/[^a-n]+/g, '').substr(0, 1).toUpperCase() +
    Math.random().toString(36).replace(/[^a-z]+/g, '').substr(0, 2).toUpperCase() +
    Math.floor(Math.random() * 10).toString() +
    Math.floor(Math.random() * 10).toString() +
    Math.floor(Math.random() * 10).toString()
);
05 May 2021 update
Upon further thought, this pseudocode could be what I'm after – but I'm unsure how to practically implement it.
request(url) {
if (url body contains "Trying to get some vehicle data") {
wait(2 seconds)
request(url again) {
return second_result
}
} else {
return first_result
}
}
then
process(first_result or second_result)
My difficulty here: I am used to the format request().then(), taking action directly from the request.
Assuming this approach is correct, how would I conduct the following?
Send the request, then
Assess the response, then
Pass this response on, or send another request then pass that response on
Process response
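For what it's worth, this conditional re-request maps naturally onto a promise chain, because returning a promise from a .then() callback makes the next .then() wait for it. A sketch, assuming a promisified requestBody(url) helper (a hypothetical name) that resolves with the response body:
const { promisify } = require("util");
const request = require("request");

// Hypothetical helper: resolve with just the body of a GET request.
const requestBody = url =>
    promisify(request)(url).then(response => response.body);

const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

const url = "https://www.carjam.co.nz/car/?plate=JKY242"; // example from above

requestBody(url)
    .then(body => {
        if (body.includes("Trying to get some vehicle data")) {
            // Returning a promise here makes the next .then() wait
            // for the sleep and for the second request.
            return sleep(2000).then(() => requestBody(url));
        }
        return body; // the first result was already complete
    })
    .then(body => {
        // process(first_result or second_result)
    });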
From this javascript file, the website reloads the page every X seconds if the data is not found, with a maximum of 10 retries. The refresh value in seconds is retrieved from the Refresh HTTP header.
You can reproduce this flow so that you have exactly the same behaviour as the frontend code.
In the following example I'm using axios:
const axios = require("axios");
const cheerio = require("cheerio");
const rootUrl = "https://www.carjam.co.nz/car/";
const plate = "NLU975";
const maxRetry = 10;
const waitingString = "Waiting for a few more things";
async function getResult() {
return axios.get(rootUrl, {
params: {
plate: plate,
},
});
}
async function processRetry(result) {
const refreshSeconds = parseInt(result.headers["refresh"]);
var retryCount = 0;
while (retryCount < maxRetry) {
console.log(
`retry: ${retryCount} time, waiting for ${refreshSeconds} second(s)`
);
retryCount++;
await timeout(refreshSeconds * 1000);
result = await getResult();
if (!result.data.includes(waitingString)) {
break;
}
}
return result;
}
(async () => {
var result = await getResult();
if (result.data.includes(waitingString)) {
result = await processRetry(result);
}
const $ = cheerio.load(result.data);
console.log({
year: $("[data-key=year_of_manufacture]").next().html(),
make: $("[data-key=make]").next().html(),
model: $("[data-key=model]").next().html(),
colour: $("[data-key=main_colour]").next().html(),
});
})();
function timeout(ms) {
return new Promise((resolve) => setTimeout(resolve, ms));
}
repl.it link: https://replit.com/#bertrandmartel/ScrapeCarJam
Sample output:
retry: 0 time, waiting for 1 second(s)
retry: 1 time, waiting for 1 second(s)
retry: 2 time, waiting for 1 second(s)
{ year: 'XXXX', make: 'XXXXXX', model: 'XX', colour: 'XXXX' }
It uses async/await instead of promise chains.
Note that request is deprecated.
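To wire this into the Express route from the question, a sketch (assuming getResult and processRetry are adapted to take the plate as an argument; toTitleCase is the question's helper):
app.get("/:numberPlate", async (req, res) => {
    const plate = req.params.numberPlate;
    let result = await getResult(plate);
    if (result.data.includes(waitingString)) {
        result = await processRetry(result, plate);
    }
    const $ = cheerio.load(result.data);
    res.status(200).json({
        year: $("[data-key=year_of_manufacture]").next().html(),
        make: toTitleCase($("[data-key=make]").next().html()),
        model: toTitleCase($("[data-key=model]").next().html()),
        colour: toTitleCase($("[data-key=main_colour]").next().html()),
    });
});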
My issues
Make 1000+ calls to an online API that limits the number of API calls to 10 calls/sec.
Wait for all the API calls to give back a result (or retry); it can take 5 sec before the API sends its data.
Use the combined data in the rest of my app
What I have tried, while looking at a lot of different questions and answers here on the site:
Use a promise to wait for one API request:
const https = require("https");

function myRequest(param) {
    const options = {
        host: "api.xxx.io",
        port: 443,
        path: "/custom/path/" + param,
        method: "GET"
    };
    return new Promise(function(resolve, reject) {
        const req = https.request(options, function(result) {
            let str = "";
            result.on('data', function(chunk) { str += chunk; });
            result.on('end', function() { resolve(JSON.parse(str)); });
        });
        // Reject instead of only logging, so callers can handle failures.
        req.on('error', function(err) { reject(err); });
        req.end();
    });
}
Use Promise.all to do all the requests and wait for them to finish
const params = [{item: "param0"}, ... , {item: "param1000+"}]; // imagine 1000+ items
const promises = [];
params.map(function(param) {
    promises.push(myRequest(param.item));
});
const result = Promise.all(promises).then(function(data) {
    // doing some funky stuff with data
});
So far, so good – sort of.
It works when I limit the number of API requests to a maximum of 10; with more than that, the rate limiter kicks in. When I console.log(promises), it gives back an array of 'request'.
I have tried to add setTimeout in different places, like:
...
base.map(function(params){
promises.push(setTimeout(function() {
myRequest(params.item);
}, 100));
});
...
But that does not seem to work. When I console.log(promises), it gives back an array of 'function'
My questions
Now I am stuck ... any ideas?
How do I build in retries when the API gives an error?
Thank you for reading up to here, you are already a hero in my book!
When you have a complicated control flow, using async/await helps a lot to clarify its logic.
Let's start with the following simple algorithm to limit everything to 10 requests per second:
make 10 requests
wait 1 second
repeat until no more requests
For this the following simple implementation will work:
async function rateLimitedRequests(params) {
    let results = [];
    while (params.length > 0) {
        let batch = [];
        for (let i = 0; i < 10; i++) {
            // Use shift instead of pop if you want to process
            // the params in their original order.
            let thisParam = params.pop();
            if (thisParam) {
                batch.push(myRequest(thisParam.item));
            }
        }
        results = results.concat(await Promise.all(batch));
        await delayOneSecond();
    }
    return results;
}
Now we just need to implement the one second delay. We can simply promisify setTimeout for this:
function delayOneSecond() {
return new Promise(ok => setTimeout(ok, 1000));
}
This will definitely give you a rate limit of just 10 requests each second. In fact, it performs somewhat slower than that, because each batch takes the request time plus one second. This is perfectly fine and already meets your original intent, but we can improve it to squeeze in a few more requests and get as close as possible to exactly 10 requests per second.
We can try the following algorithm:
remember the start time
make 10 requests
compare end time with start time
delay one second minus request time
repeat until no more requests
Again, we can use almost exactly the same logic as the simple code above, but tweak it to do time calculations:
const ONE_SECOND = 1000;
async function rateLimitedRequests (params) {
let results = [];
while (params.length > 0) {
let batch = [];
let startTime = Date.now();
for (let i = 0; i < 10; i++) {
let thisParam = params.pop();
if (thisParam) {
batch.push(myRequest(thisParam.item));
}
}
results = results.concat(await Promise.all(batch));
let endTime = Date.now();
let requestTime = endTime - startTime;
let delayTime = ONE_SECOND - requestTime;
if (delayTime > 0) {
await delay(delayTime);
}
}
return results;
}
Now, instead of hardcoding the one second delay function, we can write one that accepts a delay period:
function delay(milliseconds) {
return new Promise(ok => setTimeout(ok, milliseconds));
}
We now have a simple, easy-to-understand function that will rate limit as close as possible to 10 requests per second. It is rather bursty, in that it makes 10 parallel requests at the beginning of each one-second period, but it works. We could of course implement more complicated algorithms to smooth out the request pattern, but I leave that to your creativity and as homework for the reader.
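As for building in retries when the API gives an error: one common pattern is to wrap myRequest in a helper that retries a few times with a delay between attempts. A sketch (the retry count and the one second pause are arbitrary assumptions):
async function requestWithRetry(param, retries = 3) {
    for (let attempt = 1; attempt <= retries; attempt++) {
        try {
            return await myRequest(param);
        } catch (err) {
            if (attempt === retries) throw err; // out of retries: give up
            await delay(1000); // brief pause before trying again
        }
    }
}

// Then call requestWithRetry(thisParam.item) instead of
// myRequest(thisParam.item) inside rateLimitedRequests.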
I'm working on a proxy that caches files and I'm trying to add some logic that prevents multiple clients from downloading the same files before the proxy has a chance to cache them.
Basically, the logic I'm trying to implement is the following:
Client 1 requests a file. The proxy checks if the file is cached. If it's not, it requests it from the server, caches it, then sends it to the client.
Client 2 requests the same file after client 1 requested it, but before the proxy has a chance to cache it. So the proxy will tell client 2 to wait a few seconds because there is already a download in progress.
A better approach would probably be to give client 2 a "try again later" message, but let's just say that's currently not an option.
I'm using Node.js with the anyproxy library. According to the documentation, delayed responses are possible by using promises.
However, I don't really see a way to achieve what I want using Promises. From what I can tell, I could do something like this:
module.exports = {
*beforeSendRequest(requestDetail) {
if(thereIsADownloadInProgressFor(requestDetail.url)) {
return new Promise((resolve, reject) => {
setTimeout(() => { // delay
resolve({ response: responseDetail.response });
}, 10000);
});
}
}
};
But that would mean simply waiting for a maximum amount of time and hoping the download finishes by then.
And I don't want that.
I would prefer to be able to do something like this (but with Promises, somehow):
module.exports = {
*beforeSendRequest(requestDetail) {
if(thereIsADownloadInProgressFor(requestDetail.url)) {
var i = 0;
for(i = 0 ; i < 10 ; i++) {
JustSleep(1000);
if(!thereIsADownloadInProgressFor(requestDetail.url))
return { response: responseDetail.response };
}
}
}
};
Is there any way I can achieve this with Promises in Nodejs?
Thanks!
You can use a Map to cache your file downloads.
The mapping in Map would be url -> Promise { file }
// Map { url => Promise { file } }
const cache = new Map()
const thereIsADownloadInProgressFor = url => cache.has(url)
const getCachedFilePromise = url => cache.get(url)
const downloadFile = async url => {/* download file code here */}
const setAndReturnCachedFilePromise = url => {
const filePromise = downloadFile(url)
cache.set(url, filePromise)
return filePromise
}
module.exports = {
beforeSendRequest(requestDetail) {
if(thereIsADownloadInProgressFor(requestDetail.url)) {
return getCachedFilePromise(requestDetail.url).then(file => ({ response: file }))
} else {
return setAndReturnCachedFilePromise(requestDetail.url).then(file => ({ response: file }))
}
}
};
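The body of downloadFile is left open above; a minimal sketch using Node's built-in https module (assuming the proxy only needs the raw response body as a Buffer):
const https = require('https')

const downloadFile = url =>
    new Promise((resolve, reject) => {
        https.get(url, res => {
            const chunks = []
            res.on('data', chunk => chunks.push(chunk))
            res.on('end', () => resolve(Buffer.concat(chunks)))
        }).on('error', reject)
    })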
You don't need to send a "try again" response; simply serve the same data to both requests. All you need to do is store the pending requests somewhere in the caching system and trigger all of them when the fetching is done.
Here's a cache implementation that does only a single fetch for multiple requests. No delays and no try-laters:
class Cache {
constructor() {
this.resultCache = {}; // this object is the cache storage
}
async get(key, cachedFunction) {
let cached = this.resultCache[key];
if (cached === undefined) { // No cache so fetch data
this.resultCache[key] = {
pending: [] // This is the magic, store further
// requests in this pending array.
// This way pending requests are directly
// linked to this cache data
}
try {
let result = await cachedFunction(); // Wait for result
// Once we get result we need to resolve all pending
// promises. Loop through the pending array and
// resolve them. See code below for how we store pending
// requests.. it will make sense:
this.resultCache[key].pending
.forEach(waiter => waiter.resolve(result));
// Store the result of the cache so later we don't
// have to fetch it again:
this.resultCache[key] = {
data: result
}
// Return result to original promise:
return result;
// Note: yes, this means pending promises will get triggered
// before the original promise is resolved but normally
// this does not matter. You will need to modify the
// logic if you want promises to resolve in original order
}
catch (err) { // Error when fetching result
// We still need to trigger all pending promises to tell
// them about the error. Only we reject them instead of
// resolving them:
if (this.resultCache[key]) {
this.resultCache[key].pending
.forEach(waiter => waiter.reject(err));
}
throw err;
}
}
else if (cached.data === undefined && cached.pending !== undefined) {
// Here's the condition where there was a previous request for
// the same data. Instead of fetching the data again we store
// this request in the existing pending array.
let wait = new Promise((resolve, reject) => {
// This is the "waiter" object mentioned above. It is basically
// the resolve and reject functions of this promise:
cached.pending.push({
resolve: resolve,
reject: reject
});
});
return await wait; // await response from original request.
// The code above will cause this to return.
}
else {
// Return cached data as normal
return cached.data;
}
}
}
The code may look a bit complicated but it is actually quite simple. First we need a way to store the cached data. Normally I'd just use a regular object for this:
{ key : result }
Where the cached data is stored in the result. But we also need to store additional metadata such as pending requests for the same result. So we need to modify our cache storage:
{ key : {
data: result,
pending: [ array of requests ]
}
}
All this is invisible and transparent to code using this Cache class.
Usage:
const cache = new Cache();
// Illustrated with w3c fetch API but you may use anything:
cache.get( URL , () => fetch(URL) )
Note that wrapping the fetch in an anonymous function is important: we want the Cache.get() function to call the fetch conditionally, to avoid multiple fetches being started. It also gives the Cache class the flexibility to handle any kind of asynchronous operation.
Here's another example for caching a setTimeout. It's not very useful but it illustrates the flexibility of the API:
cache.get( 'example' , () => {
return new Promise((resolve, reject) => {
setTimeout(resolve, 1000);
});
});
Note that the Cache class above does not have any invalidation or expiry logic, for the sake of clarity, but it's fairly easy to add. For example, if you want the cache to expire after some time, you can just store a timestamp along with the other cache data:
{ key : {
data: result,
timestamp: timestamp,
pending: [ array of requests ]
}
}
Then in the "no-cache" logic simply detect the expiry time:
if (cached === undefined || (cached.timestamp + timeout) < now) ...
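Fleshed out slightly (a sketch; the timeout constant and its value are assumptions):
const CACHE_TIMEOUT = 60 * 1000; // hypothetical: entries expire after one minute

function isExpired(cached) {
    // Only completed entries carry a timestamp; pending ones never expire here.
    return cached.data !== undefined &&
        cached.timestamp + CACHE_TIMEOUT < Date.now();
}

// The miss condition in Cache.get() then becomes:
// if (cached === undefined || isExpired(cached)) { /* fetch again */ }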
I'm writing a script that is intended to load some stuff from .txt files and then perform multiple requests (in a loop) to a website with the Node.js browser emulator Nightmare.
I have no problem with reading from the txt files, but I can't manage to make it run sequentially and without exceptions.
function visitPage(url, code) {
new Promise((resolve, reject) => {
Nightmare
.goto(url)
.click('.vote')
.insert('input[name=username]', 'testadmin')
.insert('.test-code-verify', code)
.click('.button.vote.submit')
.wait('.tag.vote.disabled,.validation-error')
.evaluate(() => document.querySelector('.validation-error').innerHTML)
.end()
.then(text => {
return text;
})
});
}
async function myBackEndLogic() {
try {
var br = 0, user, proxy, current, agent;
while(br < loops){
current = Math.floor(Math.random() * (maxLoops-br-1));
/*...getting user and so on..*/
const response = await visitPage('https://example.com/admin/login',"code")
br++;
}
} catch (error) {
console.error('ERROR:');
console.error(error);
}
}
myBackEndLogic();
The error that occurs is:
UnhandledPromiseRejectionWarning: TypeError: Cannot read property 'webContents' of undefined
So the questions are a few:
1) How to fix the exception?
2) How to make it actually run sequentially and open the browser each time? (In a previous attempt, which I didn't save, I fixed the exception, but the browser wasn't actually opening and the step was basically skipped.)
3) (Not so important) Is it possible to select a few objects with
.wait('.class1,.class2,.validation-error')
and save each value in a different variable, or just get the text from the first one that occurred? (If none of these occurred, return 0, for example.)
I see a few issues with the code above.
In the visitPage function, you are creating a wrapping Promise. You don't have to: Nightmare returns a promise for you, and as written you're dropping any errors that promise produces by wrapping it. Instead, just use an async function!
async function visitPage(url, code) {
return Nightmare
.goto(url)
.click('.vote')
.insert('input[name=username]', 'testadmin')
.insert('.test-code-verify', code)
.click('.button.vote.submit')
.wait('.tag.vote.disabled,.validation-error')
.evaluate(() => document.querySelector('.validation-error').innerHTML)
.end();
}
You probably don't want to wrap the content of this method in a 'try/catch'. Just let the promises flow :)
async function myBackEndLogic() {
var br = 0, user, proxy, current, agent;
while(br < loops){
current = Math.floor(Math.random() * (maxLoops-br-1));
const response = await visitPage('https://example.com/admin/login',"code")
br++;
}
}
When you run your method - make sure to include a catch! Or a then! Otherwise, your app may exit early.
myBackEndLogic()
.then(() => console.log('donesies!'))
.catch(console.error);
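As for the third question: .wait('.class1,.class2,.validation-error') only waits until the first of those selectors appears; it doesn't tell you which one. One option (a sketch, reusing the question's placeholder class names) is to inspect the matching element afterwards inside .evaluate:
async function whichMatched(url) {
    return Nightmare
        .goto(url)
        .wait('.class1, .class2, .validation-error')
        .evaluate(() => {
            // querySelector with a selector list returns the first match
            // in document order, or null if nothing matched.
            const el = document.querySelector('.class1, .class2, .validation-error');
            return el ? { className: el.className, text: el.innerHTML } : 0;
        })
        .end();
}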
I'm not sure if any of this will help with your specific issue, but hopefully it gets you on the right path :)
So here's the code snippet:
for (let item of items)
{
await page.waitFor(10000)
await page.click("#item_"+item)
await page.click("#i"+item)
let pages = await browser.pages()
let tempPage = pages[pages.length-1]
await tempPage.waitFor("a.orange", {timeout: 60000, visible: true})
await tempPage.click("a.orange")
counter++
}
page and tempPage are two different pages.
What happens is that page waits for 10 seconds, then clicks some stuff, which opens a second page.
What's supposed to happen is that tempPage waits for an element, clicks it, then page should wait 10 seconds before doing it all over again.
However, what actually happens is that page waits for 10 seconds, clicks the stuff, then starts waiting for 10 seconds without waiting for tempPage to finish its tasks.
Is this a bug, or am I misunderstanding something? How should I fix this so that the for loop only loops again after tempPage has clicked?
Generally, you cannot rely on await tempPage.click("a.orange") to pause execution until tempPage has "finish[ed] its tasks". For super simple code that executes synchronously, it may work. But in general, you cannot rely on it.
If the click triggers an Ajax operation, or starts a CSS animation, or starts a computation that cannot be immediately computed, or opens a new page, etc., then the result you are waiting for is asynchronous, and the .click method will not wait for this asynchronous operation to complete.
What can you do? In some cases you may be able to hook into the code that is running on the page and wait for some event that matters to you. For instance, if you want to wait for an Ajax operation to be done and the code on the page uses jQuery, then you might use ajaxComplete to detect when the operation is complete. If you cannot hook into any event system to detect when the operation is done, then you may need to poll the page to wait for evidence that the operation is done.
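For instance, a sketch of the jQuery route (assuming the page really does use jQuery; window._ajaxDone is a flag name invented for this example):
// Set a flag in the page when any jQuery Ajax request completes.
await page.evaluate(() => {
    window._ajaxDone = false;
    jQuery(document).ajaxComplete(() => { window._ajaxDone = true; });
});
await page.click("a.orange");
// Poll the page until the flag flips.
await page.waitForFunction(() => window._ajaxDone === true);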
Here is an example that shows the issue:
const puppeteer = require('puppeteer');
function getResults(page) {
return page.evaluate(() => ({
clicked: window.clicked,
asynchronousResponse: window.asynchronousResponse,
}));
}
puppeteer.launch().then(async browser => {
const page = await browser.newPage();
await page.goto("https://example.com");
// We add a button to the page that will click later.
await page.evaluate(() => {
const button = document.createElement("button");
button.id = "myButton";
button.textContent = "My Button";
document.body.appendChild(button);
window.clicked = 0;
window.asynchronousResponse = 0;
button.addEventListener("click", () => {
// Synchronous operation
window.clicked++;
// Asynchronous operation.
setTimeout(() => {
window.asynchronousResponse++;
}, 1000);
});
});
console.log("before clicks", await getResults(page));
const button = await page.$("#myButton");
await button.click();
await button.click();
console.log("after clicks", await getResults(page));
await page.waitForFunction(() => window.asynchronousResponse === 2);
console.log("after wait", await getResults(page));
await browser.close();
});
The setTimeout code simulates any kind of asynchronous operation started by the click.
When you run this code, you'll see on the console:
before clicks { clicked: 0, asynchronousResponse: 0 }
after clicks { clicked: 2, asynchronousResponse: 0 }
after wait { clicked: 2, asynchronousResponse: 2 }
You see that clicked is immediately incremented twice by the two clicks. However, it takes a while before asynchronousResponse is incremented. The statement await page.waitForFunction(() => window.asynchronousResponse === 2) polls the page until the condition we are waiting for is realized.
You mentioned in a comment that the button is closing the tab. Opening and closing tabs are asynchronous operations. Here's an example:
puppeteer.launch().then(async browser => {
let pages = await browser.pages();
console.log("number of pages", pages.length);
const page = pages[0];
await page.goto("https://example.com");
await page.evaluate(() => {
window.open("https://example.com");
});
do {
pages = await browser.pages();
// For whatever reason, I need to have this here otherwise
// browser.pages() always returns the same value. And the loop
// never terminates.
await page.evaluate(() => {});
console.log("number of pages after evaluating open", pages.length);
} while (pages.length === 1);
let tempPage = pages[pages.length - 1];
// Add a button that will close the page when we click it.
await tempPage.evaluate(() => {
const button = document.createElement("button");
button.id = "myButton";
button.textContent = "My Button";
document.body.appendChild(button);
window.clicked = 0;
window.asynchronousResponse = 0;
button.addEventListener("click", () => {
window.close();
});
});
const button = await tempPage.$("#myButton");
await button.click();
do {
pages = await browser.pages();
// For whatever reason, I need to have this here otherwise
// browser.pages() always returns the same value. And the loop
// never terminates.
await page.evaluate(() => {});
console.log("number of pages after click", pages.length);
} while (pages.length > 1);
await browser.close();
});
When I run the above, I get:
number of pages 1
number of pages after evaluating open 1
number of pages after evaluating open 1
number of pages after evaluating open 2
number of pages after click 2
number of pages after click 1
You can see it takes a bit before window.open() and window.close() have detectable effects.
In your comment you also wrote:
I thought await was basically what turned an asynchronous function into a synchronous one
I would not say it turns asynchronous functions into synchronous ones. It makes the current code wait for an asynchronous operation's promise to be resolved or rejected. More importantly for the issue at hand, the problem is that you have two virtual machines executing JavaScript code: there's Node, which runs Puppeteer and the script that controls the browser, and there's the browser itself, which has its own JavaScript virtual machine. Any await that you use on the Node side affects only the Node code; it has no bearing on the code that runs in the browser.
It can get confusing when you see things like await page.evaluate(() => { some code; }). It looks like it is all of one piece, all executing in the same virtual machine, but it is not. Puppeteer takes the function passed to .evaluate, serializes it, and sends it over to the browser, where it executes. Try adding something like await page.evaluate(() => { button.click(); }); in the script above, after const button = .... Something like this:
const button = await tempPage.$("#myButton");
await button.click();
await page.evaluate(() => { button.click(); });
In the script, button is defined before page.evaluate, but you'll get a ReferenceError when page.evaluate runs because button is not defined on the browser side!