I am able to fetch data from the API, and I can loop through the multiple pages of data if there are multiple pages. However, to speed it up, I would like to try and retrieve multiple pages at a time.
But I can't get this code to work.
async people() {
// https://swapi.co/api/people/
// The API returns 10 items per page
const perPage = 10
let promises = []
// Start with an empty array and add the results from each API call
let allResults = []
// Get first page and total number of pages
// based on total number of results (data.count)
let people = await fetch(`https://swapi.co/api/people`)
let data = await people.json()
const totalPages = Math.ceil(data.count / perPage)
// Add results to array
allResults = allResults.concat(data.results)
// If the total results is greater than the results per page,
// get the rest of the results and add to the allResults array
if (data.count > perPage) {
for (let page = 2; page <= totalPages; page++) {
promises.push(
new Promise((resolve, reject) => {
people = fetch(`https://swapi.co/api/people/?page=${page}`).
then(response => {
data = response.json()
},
response => {
allResults = allResults.concat(response.results)
}
)
})
)
}
return Promise.all(promises)
}
return allResults
},
then(response => {
data = response.json()
},
response => {
allResults = allResults.concat(response.results)
}
In this section, the allResults = allResults.concat(response.results) is only triggered as the error-handling callback, equivalent to a .catch((err) => { ... }). Is that what you intend?
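A minimal sketch of one way to fetch the remaining pages in parallel. It assumes the response shape from the question (count and results) and takes the request function as a parameter: fetchJson is a hypothetical helper, not part of the API, so the paging logic can be read on its own.

```javascript
// fetchJson is a hypothetical helper: (url) => Promise resolving to parsed JSON.
// In the browser it could be: url => fetch(url).then(res => res.json())
async function fetchAllPages(baseUrl, fetchJson, perPage = 10) {
  // The first request tells us how many results exist in total
  const first = await fetchJson(baseUrl)
  const totalPages = Math.ceil(first.count / perPage)

  // Start every remaining request before awaiting any of them
  const requests = []
  for (let page = 2; page <= totalPages; page++) {
    requests.push(fetchJson(`${baseUrl}?page=${page}`))
  }

  // Promise.all resolves with the responses in the order the requests were made
  const rest = await Promise.all(requests)
  return rest.reduce((all, data) => all.concat(data.results), first.results)
}
```

Because each promise pushed into the array is the request itself, there is no need to wrap it in new Promise, and the results come back in page order.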
I'm relatively new to working with NodeJS, and I'm doing a practice project using the YouTube API to get some data on a user's videos. The YouTube API returns a list of videos with a page token. To collect all of a user's videos, you have to make several API requests, each with a different page token. When you reach the end of these requests, there is no new page token in the response, so you can move on. Doing it in a for or while loop seemed like the way to handle this, but these are synchronous constructs that do not appear to work with promises, so I had to look for an alternative.
I looked at a few previous answers to similar questions, including the ones here and here. I got the general idea of the code in the answers, but I couldn't quite figure out how to get it working fully myself. The request I am making is already chained in a .then() of a previous API call. I would like to complete the recursive fetch calls with new page tokens, and then move on to another .then(). Right now, when I run my code, it moves on to the next .then() before the requests that use the tokens are complete. Is there any way to stop this from happening? I know async/await may be a solution, but I've decided to post here to see if there are any possible solutions without going down that route, in the hope that I learn a bit about fetch/promises in general. Any other suggestions or advice about the way the code is structured are welcome too, as I'm pretty conscious that this is probably not the best way to handle making all of these API calls.
Code :
let body = req.body
let resData = {}
let channelId = body.channelId
let videoData = []
let pageToken = ''
const fetchWithToken = (nextPageToken) => {
let uploadedVideosUrlWithToken = `https://youtube.googleapis.com/youtube/v3/playlistItems?part=ContentDetails&playlistId=${uploadedVideosPlaylistId}&pageToken=${nextPageToken}&maxResults=50&key=${apiKey}`
fetch(uploadedVideosUrlWithToken)
.then(res => res.json())
.then(uploadedVideosTokenPart => {
let {items} = uploadedVideosTokenPart
videoData.push(...items.map(v => v.contentDetails.videoId))
pageToken = (uploadedVideosTokenPart.nextPageToken) ? uploadedVideosTokenPart.nextPageToken : ''
if (pageToken) {
fetchWithToken(pageToken)
} else {
// tried to return a promise so I can chain .then() to it?
// return new Promise((resolve) => {
// return(resolve(true))
// })
}
})
}
const channelDataUrl = `https://youtube.googleapis.com/youtube/v3/channels?part=snippet%2CcontentDetails%2Cstatistics&id=${channelId}&key=${apiKey}`
// promise for channel data
// get channel data then store it in variable (resData) that will eventually be sent as a response,
// contentDetails.relatedPlaylists.uploads is the playlist ID which will be used to get individual video data.
fetch(channelDataUrl)
.then(res => res.json())
.then(channelData => {
let {snippet, contentDetails, statistics } = channelData.items[0]
resData.snippet = snippet
resData.statistics = statistics
resData.uploadedVideos = contentDetails.relatedPlaylists.uploads
return resData.uploadedVideos
})
.then(uploadedVideosPlaylistId => {
// initial call to get first set of videos + first page token
let uploadedVideosUrl = `https://youtube.googleapis.com/youtube/v3/playlistItems?part=ContentDetails&playlistId=${uploadedVideosPlaylistId}&maxResults=50&key=${apiKey}`
fetch(uploadedVideosUrl)
.then(res => res.json())
.then(uploadedVideosPart => {
let {nextPageToken, items} = uploadedVideosPart
videoData.push(...items.map(v => v.contentDetails.videoId))
// idea is to do api calls until pageToken is non existent, and add the video id's to the existing array.
fetchWithToken(nextPageToken)
})
})
.then(() => {
// can't seem to get here synchronously - code in this block will happen before all the fetchWithToken's are complete - need to figure this out
})
Thanks to anyone who takes the time out to read this.
Edit:
After some trial and error, this seemed to work - it is a complete mess. The way I understand it is that this function now recursively creates promises that resolve to true only when there is no page token from the api response allowing me to return this function from a .then() and move on to a new .then() synchronously. I am still interested in better solutions, or just suggestions to make this code more readable as I don't think it's very good at all.
const fetchWithToken = (playlistId, nextPageToken) => {
let uploadedVideosUrlWithToken = `https://youtube.googleapis.com/youtube/v3/playlistItems?part=ContentDetails&playlistId=${playlistId}&pageToken=${nextPageToken}&maxResults=50&key=${apiKey}`
return new Promise((resolve) => {
resolve( new Promise((res) => {
fetch(uploadedVideosUrlWithToken)
.then(res => res.json())
.then(uploadedVideosTokenPart => {
let {items} = uploadedVideosTokenPart
videoData.push(...items.map(v => v.contentDetails.videoId))
pageToken = (uploadedVideosTokenPart.nextPageToken) ? uploadedVideosTokenPart.nextPageToken : ''
// tried to return a promise so I can chain .then() to it?
if (pageToken) {
res(fetchWithToken(playlistId, pageToken))
} else {
res(new Promise(r => r(true)))
}
})
}))
})
}
You would be much better off using async/await which are basically a wrapper for promises. Promise chaining, which is what you are doing with the nested thens, can get messy and confusing...
I converted your code to use async/await so hopefully this will help you see how to solve your problem. Good luck!
Your initial code:
let { body } = req
let resData = {}
let { channelId } = body
let videoData = []
let pageToken = ''
const fetchWithToken = async (playlistId, nextPageToken) => {
  // playlistId is passed in because it is not otherwise in scope here
  const response = await fetch(
    `https://youtube.googleapis.com/youtube/v3/playlistItems?part=ContentDetails&playlistId=${playlistId}&pageToken=${nextPageToken}&maxResults=50&key=${apiKey}`,
  )
  // json() also returns a promise, so it needs its own await
  const someData = await response.json()
  let { items } = someData
  videoData.push(...items.map((v) => v.contentDetails.videoId))
  pageToken = someData.nextPageToken ? someData.nextPageToken : ''
  if (pageToken) {
    // Awaiting the recursive call keeps the caller waiting until the last page
    await fetchWithToken(playlistId, pageToken)
  }
}
const MainMethod = async () => {
  const channelResponse = await fetch(
    `https://youtube.googleapis.com/youtube/v3/channels?part=snippet%2CcontentDetails%2Cstatistics&id=${channelId}&key=${apiKey}`,
  )
  const channelData = await channelResponse.json()
  let { snippet, contentDetails, statistics } = channelData.items[0]
  resData.snippet = snippet
  resData.statistics = statistics
  resData.uploadedVideos = contentDetails.relatedPlaylists.uploads
  const uploadedVideosPlaylistId = resData.uploadedVideos
  const uploadedVideosResponse = await fetch(
    `https://youtube.googleapis.com/youtube/v3/playlistItems?part=ContentDetails&playlistId=${uploadedVideosPlaylistId}&maxResults=50&key=${apiKey}`,
  )
  const uploadedVideosPart = await uploadedVideosResponse.json()
  let { nextPageToken, items } = uploadedVideosPart
  videoData.push(...items.map((v) => v.contentDetails.videoId))
  // Only follow the token if the first page actually returned one
  if (nextPageToken) {
    await fetchWithToken(uploadedVideosPlaylistId, nextPageToken)
  }
}
MainMethod()
Your Edit:
const fetchWithToken = async (playlistId, nextPageToken) => {
  // An async function already returns a promise, so the manual new Promise wrappers are not needed
  const response = await fetch(
    `https://youtube.googleapis.com/youtube/v3/playlistItems?part=ContentDetails&playlistId=${playlistId}&pageToken=${nextPageToken}&maxResults=50&key=${apiKey}`,
  )
  // json() returns a promise too, so it has to be awaited
  const uploadedVideosTokenPart = await response.json()
  let { items } = uploadedVideosTokenPart
  videoData.push(...items.map((v) => v.contentDetails.videoId))
  pageToken = uploadedVideosTokenPart.nextPageToken
    ? uploadedVideosTokenPart.nextPageToken
    : ''
  if (pageToken) {
    // Returning the recursive call makes this promise wait for the remaining pages
    return fetchWithToken(playlistId, pageToken)
  }
  return true
}
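Since the question also asked for a version without async/await, here is a minimal sketch of the same token loop using plain promise chaining. The key point is that each .then() returns the next promise, so the outer chain waits for the whole recursion. fetchPage is a hypothetical helper standing in for the fetch(...).then(res => res.json()) call, and the response shape (items, nextPageToken) is taken from the question.

```javascript
// fetchPage is a hypothetical helper: (pageToken) => Promise of { items, nextPageToken }
function fetchAllVideoIds(fetchPage, pageToken = '', collected = []) {
  // Returning the promise is what lets callers chain .then() onto the whole recursion
  return fetchPage(pageToken).then((part) => {
    const ids = collected.concat(part.items.map((v) => v.contentDetails.videoId))
    // Returning another promise from .then() makes the chain wait for it
    return part.nextPageToken
      ? fetchAllVideoIds(fetchPage, part.nextPageToken, ids)
      : ids
  })
}
```

Inside an existing chain, writing return fetchAllVideoIds(...) in a .then() means the following .then() only runs once every page has been collected, and it receives the complete list of IDs.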
I am trying to fetch all users with the Firebase auth listUsers function and write them to an Excel sheet using the xlsx-populate library. Because the limit is 1000 users per call, the results come back as several arrays of objects, and I can't figure out how to combine them. In the code below I fetch the results with auth and try to collect them into an array of objects. I also tried writing the objects to the Excel sheet, but the resulting sheet only contains the response of the last 1000-user query.
const admin = require("firebase-admin");
const momentTz = require("moment-timezone");
const XlsxPopulate = require("xlsx-populate");
momentTz.suppressDeprecationWarnings = true;
const {
alphabetsArray
} = require("./constant");
var start = momentTz().subtract(4, "days").startOf("day").format();
var start = momentTz(start).valueOf();
const end = momentTz().subtract(1, "days").endOf("day").format();
const listAllUsers = async(nextPageToken) =>{
const [workbook] = await Promise.all([
XlsxPopulate.fromBlankAsync()
]);
const reportSheet = workbook.addSheet("Signup Report");
workbook.deleteSheet("Sheet1");
reportSheet.row(1).style("bold", true);
[
"Date",
"TIME",
"Phone Number"
].forEach((field, index) => {
reportSheet.cell(`${alphabetsArray[index]}1`).value(field);
});
let count = 0
// List batch of users, 1000 at a time.
const data = [];
admin
.auth()
.listUsers(1000, nextPageToken)
.then (async (listUsersResult) => {
listUsersResult.users.forEach((userRecord) =>{
const time = userRecord.metadata.creationTime;
const timestamp = momentTz(time).valueOf();
// console.log(timestamp)
if (timestamp >= 1585704530967 ) {
console.log(time);
let column = count+2;
count++;
data.push(userRecord.toJSON())
reportSheet.cell(`A${column}`).value(time);
reportSheet.cell(`C${column}`).value(userRecord.phoneNumber);
}
});
console.log(JSON.stringify(data))//this is the array of the object and i am getting after 1000 response
if (listUsersResult.pageToken) {
// List next batch of users.
listAllUsers(listUsersResult.pageToken);
await workbook.toFileAsync("./SignUp.xlsx");
}
})
// .catch(function (error) {
// console.log("Error listing users:", error);
// });
// const datas = []
// datas.push(data)
// console.log(datas)
return ;
}
// Start listing users from the beginning, 1000 at a time.
listAllUsers();
and the output i am getting is like this
[]
[]
[]
[]
[]
I want to combine these into a single array of responses.
You have a race condition. When you perform your console.log(JSON.stringify(data)), your listUsers query is still in progress (it runs asynchronously), so you don't have the answer yet when you print the array. Thus the array is empty.
Try this (I'm not sure this is the optimal solution, I'm not a NodeJS dev):
admin
.auth()
.listUsers(1000, nextPageToken)
.then (async (listUsersResult) => {
listUsersResult.users.forEach((userRecord) =>{
const time = userRecord.metadata.creationTime;
const timestamp = momentTz(time).valueOf();
// console.log(timestamp)
if (timestamp >= 1585704530967 ) {
console.log(time);
let column = count+2;
count++;
data.push(userRecord.toJSON())
reportSheet.cell(`A${column}`).value(time);
reportSheet.cell(`C${column}`).value(userRecord.phoneNumber);
}
});
console.log(JSON.stringify(data))//this is the array of the object and i am getting after 1000 response
if (listUsersResult.pageToken) {
// List next batch of users.
listAllUsers(listUsersResult.pageToken);
await workbook.toFileAsync("./SignUp.xlsx");
}
);
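One way to end up with a single array is to keep all the paging inside one async function and only write the file once, after the last page. A minimal sketch, assuming a listPage function (hypothetical; with firebase-admin it could wrap admin.auth().listUsers(1000, pageToken)) that resolves to an object with users and an optional pageToken, as in the question:

```javascript
// listPage is a hypothetical wrapper: (pageToken) => Promise of { users, pageToken }
// keep is an optional filter, e.g. the creation-time check from the question
async function collectAllUsers(listPage, keep = () => true) {
  const all = []
  let pageToken // undefined requests the first page
  do {
    const result = await listPage(pageToken)
    // Append this batch to the single shared array
    all.push(...result.users.filter(keep))
    pageToken = result.pageToken
  } while (pageToken) // no token means there are no more pages
  return all
}
```

The workbook can then be created once, filled from the returned array, and saved with a single toFileAsync call, instead of being recreated on every recursive call.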
I'm building a simple NodeJS web scraper, and I want to re-run the function like a 'for loop' until pageNum = totalNumberOfPages. I'm having a brain fart and am unable to re-run the function from inside itself, since it returns an array fragment and kills itself. Could someone help me overcome this obstacle? I'm pretty sure it's very simple.
I looked at this and this but didn't figure it out...
const cheerio = require("cheerio");
const axios = require("axios");
let pageNum = 0;
let siteUrl = "https://whatever.com?&page=" + pageNum + "&viewAll=true";
let productArray = [];
let vendor = [];
let productTitle = [];
let plantType = [];
let thcRange = [];
let cbdRange = [];
let price = [];
let totalNumberOfPages = undefined;
// called by getResults()
const fetchData = async () => {
const result = await axios.get(siteUrl);
return cheerio.load(result.data);
};
// this function is called from index.js
const getResults = async () => {
// >>>>>>>>>>>>>>>>>> HOW DO I RERUN FROM HERE <<<<<<<<<<<<<<<<<<<<<<<<<<<
const $ = await fetchData();
// first check how many total pages there are
totalNumberOfPages = parseInt($('.pagination li:nth-last-child(2)').text());
// use fetched data to grab elements (and their text) and push into arrays defined above
$('.product-tile__vendor').each((index, element) => {
vendor.push($(element).text());
});
$('.product-tile__title').each((index, element) => {
productTitle.push($(element).text());
});
$('.product-tile__plant-type').each((index, element) => {
plantType.push($(element).text());
});
$('.product-tile__properties li:nth-child(2) p').each((index, element) => {
thcRange.push($(element).text());
});
$('.product-tile__properties li:nth-child(3) p').each((index, element) => {
cbdRange.push($(element).text());
});
$('.product-tile__price').each((index, element) => {
price.push($(element).text());
});
// increment page number to get more products if the page count is less than total number of pages
if (pageNum < totalNumberOfPages) {
pageNum ++;
};
//Convert to an array so that we can sort the results.
productArray.push ({
vendors: [...vendor],
productTitle: [...productTitle],
plantType: [...plantType],
thcRange: [...thcRange],
cbdRange: [...cbdRange],
price: [...price],
pageNum
});
// >>>>>>>>>>>>>>>>>> UNTIL HERE I THINK <<<<<<<<<<<<<<<<<<<<<<<<<<<
return productArray;
};
module.exports = getResults;
You can use recursion in your code, which means the function calls itself. So what you can do is:
const getResults = async () => {
// >>>>>>>>>>>>>>>>>> HOW DO I RERUN FROM HERE <<<<<<<<<<<<<<<<<<<<<<<<<<<
const $ = await fetchData();
// first check how many total pages there are
totalNumberOfPages = parseInt($('.pagination li:nth-last-child(2)').text());
// use fetched data to grab elements (and their text) and push into arrays defined above
$('.product-tile__vendor').each((index, element) => {
vendor.push($(element).text());
});
$('.product-tile__title').each((index, element) => {
productTitle.push($(element).text());
});
$('.product-tile__plant-type').each((index, element) => {
plantType.push($(element).text());
});
$('.product-tile__properties li:nth-child(2) p').each((index, element) => {
thcRange.push($(element).text());
});
$('.product-tile__properties li:nth-child(3) p').each((index, element) => {
cbdRange.push($(element).text());
});
$('.product-tile__price').each((index, element) => {
price.push($(element).text());
});
// increment page number to get more products if the page count is less than total number of pages
if (pageNum < totalNumberOfPages) {
pageNum ++;
};
//Convert to an array so that we can sort the results.
productArray.push ({
vendors: [...vendor],
productTitle: [...productTitle],
plantType: [...plantType],
thcRange: [...thcRange],
cbdRange: [...cbdRange],
price: [...price],
pageNum
});
// >>>>>>>>>>>>>>>>>> UNTIL HERE I THINK <<<<<<<<<<<<<<<<<<<<<<<<<<<
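// Recurse while pages remain, and return the call so the caller waits for its result.
// Note: siteUrl is built once with pageNum = 0, so it has to be rebuilt from pageNum before each fetch.
if (pageNum < totalNumberOfPages) return getResults();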
return productArray;
};
I am using google-play-scraper module in node.js to scrape google play reviews. The review function for a single page is as below:
var gplay = require('google-play-scraper');
gplay.reviews({
appId: 'es.socialpoint.chefparadise',
page: 0,
}).then(console.log, console.log);
Now, I would like to scrape all the comments on all pages at once and save them in a logger. For this, I am using the winston logger and a for loop as below:
var gplay = require('google-play-scraper');
const winston= require('winston');
const logger = winston.createLogger({
transports: [
new winston.transports.Console(),
new winston.transports.File({ filename: 'rev1.log' })
]
});
package_id='com.jetstartgames.chess'
for (i=0; i<112; i++){
gplay.reviews({
appId: package_id,
page: i,
}).then(logger.info, logger.info);
}
The problem is that I have to predefine the maximum number of review pages each application has (I have to determine the maximum value of i for the loop). To do this, I thought of checking for a null value, but I couldn't find a plausible way of doing it. The log entry for a page that doesn't actually exist has the structure below:
{"message":[],"level":"info"}
I tried this code which doesn't work:
max=0
for (i=0; i<10000; i++){
data=gplay.reviews({
appId: 'com.jetstartgames.chess',
page: i,
});
if (data.message==null || data.message==undefined){
break;
} else {
max+=1;
}
}
Is there any way that I can figure out the maximum number of pages by checking for the first null output? Or any other suggestion for this purpose?
So there are a couple of issues. It looks like the API you're using returns Promises, so the returned value isn't available to you inside the loop at the point where you check it.
If you're using Node.js 7.6 or later, you can use async/await like so:
import gplay from 'google-play-scraper';
// Pages are zero-based (the reviews documentation says page defaults to 0)
async function getReviews(appId, page = 0) {
  return await gplay.reviews({
    appId,
    page,
  });
}
async function process(appId) {
  let page = 0;
let messages = [];
let result;
do {
result = await getReviews(appId, page);
messages = messages.concat(result);
++page;
} while (result.length > 0);
return messages;
}
process('com.jetstartgames.chess')
.then((messages) => {
console.log(messages);
})
I tried to implement it like this. Please try it and let me know if it works :)
Please note this from the reviews documentation:
Note that this method returns reviews in a specific language (english
by default), so you need to try different languages to get more
reviews. Also, the counter displayed in the Google Play page refers to
the total number of 1-5 stars ratings the application has, not the
written reviews count. So if the app has 100k ratings, don't expect to
get 100k reviews by using this method.
var gplay = require('google-play-scraper');
var appId = 'com.jetstartgames.chess';
var taskList = [];
for(var i = 1 ; i < 10000; i++){
taskList.push(new Promise((res, rej)=>{
gplay.reviews({
appId: appId,
page: i,
sort: gplay.sort.RATING
}).then(result =>{
res(result.length);
})
.catch(err => rej(err))
}));
}
Promise.all(taskList)
.then(results => {
results = results.filter(x => x > 0);
var maxPage = results.length;
console.log('maxPage', maxPage);
})
.catch(err => console.log(err))
The problem is that I have to predefine the maximum number of review pages each application has (I have to determine the maximum value of i for the loop).
I think we can get this data from the app response.
{
appId: 'es.socialpoint.chefparadise',
...
ratings: 27904,
reviews: 11372, // data to determine pagenumber
...
}
Also, the reviews documentation gives a ballpark figure for calculating the number of pages:
page (optional, defaults to 0): Number of page that contains reviews. Every page has 40 reviews at most.
Making those changes,
'use strict';
const gplay = require('google-play-scraper');
const packageId = 'es.socialpoint.chefparadise';
function getAppDetails(packageId) {
return gplay.app({ appId: packageId })
.catch(console.log);
}
getAppDetails(packageId).then(appDetails => {
let { reviews, ratings } = appDetails;
const totalPages = Math.round(reviews / 40);
console.log(`Total reviews => ${reviews} \nTotal ratings => ${ratings}\nTotal pages => ${totalPages} `);
let rawReview = [];
let pageNumber = 0;
while (pageNumber < totalPages) {
console.log(`pageNumber =${pageNumber},totalPages=${totalPages}`);
rawReview.push(gplay.reviews({
appId: packageId,
page: pageNumber,
}).catch(err => {
console.log(packageId, pageNumber);
console.log(err);
}));
pageNumber++;
}
return Promise.all(rawReview);
}).then(reviewsResults => {
console.log('***Reviews***');
for (let review of reviewsResults) {
console.log(review);
}
}).catch(err => {
console.log('Err ', err);
});
It worked well for a packageId with fewer reviews. But for es.socialpoint.chefparadise I frequently ran into Issue #298, since the data size is huge.
Output
Total reviews => 215922
Total ratings => 688107
Total pages => 5398
Reviews
....
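One common way to reduce the load when there are thousands of pages is to send the requests in small batches instead of all at once. This is only a sketch of that idea and is not confirmed to resolve the issue mentioned above. fetchPage is a hypothetical function (with google-play-scraper it could wrap gplay.reviews({ appId, page })), and the batch size of 10 is an arbitrary assumption.

```javascript
// fetchPage is a hypothetical function: (pageNumber) => Promise of that page's reviews
async function fetchInBatches(totalPages, fetchPage, batchSize = 10) {
  const results = []
  for (let start = 0; start < totalPages; start += batchSize) {
    const end = Math.min(start + batchSize, totalPages)
    const batch = []
    for (let page = start; page < end; page++) {
      batch.push(fetchPage(page))
    }
    // Wait for this batch to finish before starting the next one
    results.push(...(await Promise.all(batch)))
  }
  return results
}
```

At most batchSize requests are in flight at any time, and the results array keeps one entry per page in page order.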
I am using Puppeteer to build a basic web-scraper and so far I can return all the data I require from any given page, however when pagination is involved my scraper comes unstuck (only returning the 1st page).
See example - this returns Title/Price for 1st 20 books, but doesn't look at the other 49 pages of books.
Just looking for guidance on how to overcome this - I can't see anything in the docs.
Thanks!
const puppeteer = require('puppeteer');
let scrape = async () => {
const browser = await puppeteer.launch({headless: false});
const page = await browser.newPage();
await page.goto('http://books.toscrape.com/');
const result = await page.evaluate(() => {
let data = [];
let elements = document.querySelectorAll('.product_pod');
for (var element of elements){
let title = element.childNodes[5].innerText;
let price = element.childNodes[7].children[0].innerText;
data.push({title, price});
}
return data;
});
browser.close();
return result;
};
scrape().then((value) => {
console.log(value);
});
To be clear. I am following a tutorial here - this code comes from Brandon Morelli on codeburst.io!! https://codeburst.io/a-guide-to-automating-scraping-the-web-with-javascript-chrome-puppeteer-node-js-b18efb9e9921
I was following the same article to teach myself how to use Puppeteer.
The short answer to your question is that you need to introduce one more loop to iterate over all the available pages in the online book catalogue.
I took the following steps to collect all the book titles and prices:
1. Extracted the page.evaluate part into a separate async function that takes page as an argument
2. Introduced a for loop with a hardcoded last catalogue page number (you can extract it with the help of Puppeteer if you wish)
3. Called the async function from step one inside the loop
The same code from the Brandon Morelli article, but now with one extra loop:
const puppeteer = require('puppeteer');
let scrape = async () => {
const browser = await puppeteer.launch({ headless: true });
const page = await browser.newPage();
await page.goto('http://books.toscrape.com/');
var results = []; // variable to hold collection of all book titles and prices
var lastPageNumber = 50; // this is the hardcoded last catalogue page, you can set it dynamically if you wish
// defined simple loop to iterate over number of catalogue pages
for (let index = 0; index < lastPageNumber; index++) {
// wait 1 sec for page load
await page.waitFor(1000);
// call and wait extractedEvaluateCall and concatenate results every iteration.
// You can use results.push, but will get collection of collections at the end of iteration
results = results.concat(await extractedEvaluateCall(page));
// this is where next button on page clicked to jump to another page
if (index != lastPageNumber - 1) {
// no next button on last page
await page.click('#default > div > div > div > div > section > div:nth-child(2) > div > ul > li.next > a');
}
}
browser.close();
return results;
};
async function extractedEvaluateCall(page) {
// just extracted same exact logic in separate function
// this function should use async keyword in order to work and take page as argument
return page.evaluate(() => {
let data = [];
let elements = document.querySelectorAll('.product_pod');
for (var element of elements) {
let title = element.childNodes[5].innerText;
let price = element.childNodes[7].children[0].innerText;
data.push({ title, price });
}
return data;
});
}
scrape().then((value) => {
console.log(value);
console.log('Collection length: ' + value.length);
console.log(value[0]);
console.log(value[value.length - 1]);
});
Console output:
...
{ title: 'In the Country We ...', price: '£22.00' },
... 900 more items ]
Collection length: 1000
{ title: 'A Light in the ...', price: '£51.77' }
{ title: '1,000 Places to See ...', price: '£26.08' }