How can I download FULL QUALITY pictures from google images using puppeteer? - node.js

I need help creating a stable selector that retrieves full-quality image URLs from Google Images.
I am trying to download 4-25 pictures from Google Images in full quality using Puppeteer, but it doesn't work.
The problem is building a stable selector that returns the URLs of the full-quality pictures rather than the URLs of Google's preview thumbnails.
I had this running before, but it broke, apparently because the selector was too brittle. Now I am trying to rebuild it.
Old selector that results in "elements" being undefined:
let previewimagexpath =
"/html/body/div[2]/c-wiz/div[3]/div[2]/div[3]/div/div/div[3]/div[2]/c-wiz/div/div[1]/div[1]/div[2]/div/a/img";
// previewimagexpath = '//*[@id="Sva75c"]/div/div/div[3]/div[2]/c-wiz/div/div[1]/div[1]/div[2]/div/a/img'
for (let i = 1; i < numOfPics; i++) {
let imagexpath =
"/html/body/div[2]/c-wiz/div[3]/div[1]/div/div/div/div[1]/div[1]/span/div[1]/div[1]/div[" +
i +
"]/a[1]/div[1]/img";
const elements = await page.$x(imagexpath);
await elements[0].click();
await page.waitForTimeout(3000);
const image = await page.$x(previewimagexpath);
let d = await image[0].getProperty("src");
//console.log(d._remoteObject.value);
imagelinkslist.push(d._remoteObject.value);
}
await browser.close();
};
New selector, which returns the preview URLs instead of the URLs of the full-quality images:
axios
.get(
"https://www.google.com/search?q=dogs&sxsrf=ALiCzsZW27NYppMFDO9xwabkhmXUQMku8g:1651495383126&source=lnms&tbm=isch&sa=X&ved=2ahUKEwj4-qLd68D3AhUR3KQKHdk3CFYQ_AUoAXoECAIQAw&biw=1680&bih=948&dpr=2"
)
.then(response => {
const $ = cheerio.load(response.data);
const image = $("img");
$("img").each((i, elem) => {});
console.log(image);
});
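One likely reason the cheerio approach only yields previews is that the static HTML returned to axios carries thumbnail sources, while full-size URLs only show up after the page's scripts have run. A minimal sketch for separating the two kinds of URL, assuming previews are either inline data: URIs or thumbnails served from encrypted-tbn*.gstatic.com hosts; the helper names isLikelyPreview and keepFullQuality and the host pattern are assumptions, not a documented Google contract:

```javascript
// Hypothetical helper: flags image sources that look like Google preview thumbnails.
// Assumption: previews are inline data: URIs or come from encrypted-tbn*.gstatic.com.
function isLikelyPreview(src) {
  if (typeof src !== "string" || src.length === 0) return true; // nothing usable
  if (src.startsWith("data:")) return true; // inline base64 thumbnail
  try {
    const { hostname } = new URL(src);
    return /^encrypted-tbn\d*\.gstatic\.com$/.test(hostname);
  } catch {
    return true; // relative or malformed URL: not a usable full-quality link
  }
}

// Keeps only unique sources that do not look like previews.
function keepFullQuality(sources) {
  return [...new Set(sources)].filter((src) => !isLikelyPreview(src));
}
```

In the Puppeteer loop above, the collected src values could be passed through keepFullQuality before they are pushed to imagelinkslist, so that a selector which silently falls back to thumbnails is noticed early.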

Related

How to get lastModified property of another website

When I use the inspect/developer tool in chrome I can find the last modified date from browser but I want to see the same date in my nodeJS application.
I have already tried
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://www.tbsnews.net/economy/bsec-chairman-stresses-restoring-investor-confidence-mutual-funds-500126');
const names = page.evaluate( ()=> {
console.log(document.lastModified);
})
Unfortunately, this code shows the current time, i.e. the time the new DOM was created, since we are using newPage(). Can anyone help me?
I have also tried JSDOM.
Thanks in advance.
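document.lastModified is derived from the Last-Modified response header, and browsers fall back to the current time when the server does not send one. A minimal sketch, assuming the server does send that header, that reads it from the response object returned by page.goto; parseLastModified and getLastModified are hypothetical names:

```javascript
// Hypothetical helper: turns a headers object (lowercase keys, as returned by
// Puppeteer's response.headers()) into a Date, or null when the header is
// absent or cannot be parsed.
function parseLastModified(headers) {
  const raw = headers["last-modified"];
  if (!raw) return null;
  const date = new Date(raw);
  return Number.isNaN(date.getTime()) ? null : date;
}

// Navigates to the URL and reads the header from the main document's response.
async function getLastModified(page, url) {
  const response = await page.goto(url);
  return response ? parseLastModified(response.headers()) : null;
}
```

If this returns null, the server most likely did not send the header, and the date would have to come from somewhere else, such as a date printed in the page itself.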

Xpath: Working in Devtools but returns empty object with document.evaluate()

The following command works as expected in Devtools console (for example, here - https://news.ycombinator.com/)
$x('//a[@class="storylink"]') (Edge browser)
But the following code:
const page = await browser.newPage();
await page.goto("https://news.ycombinator.com/");
let urls = await page.evaluate(() => {
var item = document.evaluate(
'//a[@class="storylink"]',
document,
null,
XPathResult.FIRST_ORDERED_NODE_TYPE,
null
).singleNodeValue;
return item
})
browser.close();
returns an empty object: {}. The same is happening in every other website.
Why is that happening?
If you are automating Chrome with Puppeteer from Node.js, then the page object you already have exposes a method $x for XPath evaluation; see https://pptr.dev/#?product=Puppeteer&version=v10.4.0&show=api-pagexexpression. That means that doing
let urls = await page.$x('//a[@class="storylink"]')
should suffice.
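The empty object appears because page.evaluate serializes its return value, and a DOM node does not survive that serialization. A minimal sketch, assuming only plain data such as link text and href is needed, that runs the XPath inside the page and returns strings instead of nodes; extractByXPath is a hypothetical name:

```javascript
// Hypothetical helper: evaluates an XPath inside the page and returns plain,
// serializable data (text and href) instead of DOM nodes.
async function extractByXPath(page, xpath) {
  return page.evaluate((expression) => {
    const snapshot = document.evaluate(
      expression,
      document,
      null,
      XPathResult.ORDERED_NODE_SNAPSHOT_TYPE,
      null
    );
    const items = [];
    for (let i = 0; i < snapshot.snapshotLength; i++) {
      const node = snapshot.snapshotItem(i);
      items.push({ text: node.textContent, href: node.href || null });
    }
    return items;
  }, xpath);
}

// Usage (not executed here):
// const links = await extractByXPath(page, '//a[@class="storylink"]');
```

Because the XPath is passed as an argument rather than captured from the surrounding scope, the function given to page.evaluate stays self-contained, which it must be since it runs in the browser.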

How to use axios to get all response data when there are multiple pages?

I have a URL like https://abc123.com/api/records that gives me:
{"page":1,"per_page":10,"total":99,"total_pages":10,"data":[object1, object2, object3,...]}
axios.get("https://abc123.com/api/records?page=2")
.then(response => console.log(response.data))
Will then give me something like
{"page":2,"per_page":10,"total":99,"total_pages":10,"data":[object1, object2, object3,...]}
I want to access all the objects in "data" from all 10 pages. How would I do that using nodeJS and axios?
You can use a loop such as do ... while to download each page of data. We'll stop downloading once we have retrieved the last page of data.
We set the page parameter using the params object, starting with page 1.
const axios = require("axios");
// Replace with the appropriate url.
const url = "https://abc123.com/api/records"
async function downloadRecords() {
let records = [];
let page = 0;
let totalPages = 0;
do {
let { data: response } = await axios.get(url, { params: { page: ++page } });
totalPages = response.total_pages;
console.log(`downloadRecords: page ${page} of ${totalPages} downloaded...`);
records = records.concat(response.data);
console.log("records.length:", records.length);
} while (page < totalPages)
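console.log("downloadRecords: download complete.")
// Return the collected records so the caller can use them.
return records;
}
downloadRecords().then(records => console.log("total records:", records.length));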
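Since the first response already reports total_pages, the remaining pages could also be requested concurrently instead of one after another. A minimal sketch under that assumption; fetchAllRecords and the injected fetchPage callback are hypothetical, and the usage comment reuses the placeholder URL from above:

```javascript
// Hypothetical helper: fetchPage(n) must resolve to the parsed body of page n,
// i.e. an object with `total_pages` and `data` as in the example response.
async function fetchAllRecords(fetchPage) {
  const first = await fetchPage(1);
  const rest = [];
  for (let page = 2; page <= first.total_pages; page++) {
    rest.push(fetchPage(page)); // start every remaining request without waiting
  }
  const bodies = [first, ...(await Promise.all(rest))];
  return bodies.flatMap((body) => body.data); // Promise.all keeps page order
}

// Usage with axios (not executed here):
// const axios = require("axios");
// const url = "https://abc123.com/api/records";
// fetchAllRecords((page) => axios.get(url, { params: { page } }).then((r) => r.data))
//   .then((records) => console.log(records.length));
```

The sequential loop is gentler on the server; the concurrent variant trades that for speed and may run into rate limits on APIs that enforce them.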

node js puppeteer metadata

I am new to Puppeteer, and I am trying to extract meta data from a Web site using Node.JS and Puppeteer. I just can't seem to get the syntax right. The code below works perfectly extracting the Title tag, using two different methods, as well as text from a paragraph tag. How would I extract the content text for the meta data with the name of "description" for example?
meta name="description" content="Stack Overflow is the largest, etc"
I would be seriously grateful for any suggestions! I can't seem to find any examples of this anywhere (5 hours of searching and code hacking later). My sample code:
const puppeteer = require('puppeteer');
async function main() {
const browser = await puppeteer.launch({headless: false});
const page = await browser.newPage();
await page.goto('https://stackoverflow.com/', {waitUntil: 'networkidle2'});
const pageTitle1 = await page.evaluate(() => document.querySelector('title').textContent);
const pageTitle2 = await page.title();
const innerText = await page.evaluate(() => document.querySelector('p').innerText);
console.log(pageTitle1);
console.log(pageTitle2);
console.log(innerText);
};
main();
You need a thorough tutorial on CSS selectors; see MDN's CSS Selectors documentation.
I highly recommend testing your selectors directly in the browser console on the page you want to automate; this saves hours of repeatedly starting and stopping your script. Try this:
document.querySelectorAll("head > meta[name='description']")[0].content;
For Puppeteer, copy that selector and paste it into a Puppeteer function. I prefer this notation:
await page.$eval("head > meta[name='description']", element => element.content);
If you have any other questions or problems, just comment.
For anyone struggling to get the OG tags in Puppeteer, here is the solution.
let dom2 = await page.evaluate(() => {
return document.head.querySelector('meta[property="og:description"]').getAttribute("content");
});
console.log(dom2);
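If you prefer to avoid $eval, you can get an element handle first and read the attribute through it (a Puppeteer ElementHandle has no getAttribute method of its own, so the attribute is read inside evaluate):
const descriptionTag = await page.$('meta[name="description"]');
const description = await descriptionTag?.evaluate(el => el.getAttribute('content'));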
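When several tags are needed at once, a single round trip to the page can collect them all. A minimal sketch, assuming the tags of interest carry either a name or a property attribute; collectMeta is a hypothetical helper built on page.$$eval:

```javascript
// Hypothetical helper: returns an object mapping each meta tag's name (or
// property, for Open Graph tags) to the value of its content attribute.
async function collectMeta(page) {
  return page.$$eval("meta[name], meta[property]", (elements) => {
    const result = {};
    for (const el of elements) {
      const key = el.getAttribute("name") || el.getAttribute("property");
      const content = el.getAttribute("content");
      if (key && content !== null) result[key] = content;
    }
    return result;
  });
}

// Usage (not executed here):
// const meta = await collectMeta(page);
// console.log(meta["description"], meta["og:description"]);
```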

Puppeteer: How to handle multiple tabs?

Scenario: Web form for developer app registration with two part workflow.
Page 1: Fill out developer app details and click on button to create Application ID, which opens, in a new tab...
Page 2: The App ID page. I need to copy the App ID from this page, then close the tab and go back to Page 1 and fill in the App ID (saved from Page 2), then submit the form.
I understand basic usage - how to open Page 1 and click the button which opens Page 2 - but how do I get a handle on Page 2 when it opens in a new tab?
Example:
const puppeteer = require('puppeteer');
(async() => {
const browser = await puppeteer.launch({headless: false, executablePath: '/Applications/Google Chrome.app'});
const page = await browser.newPage();
// go to the new bot registration page
await page.goto('https://register.example.com/new', {waitUntil: 'networkidle'});
// fill in the form info
const form = await page.$('new-app-form');
await page.focus('#input-appName');
await page.type('App name here');
await page.focus('#input-appDescription');
await page.type('short description of app here');
await page.click('.get-appId'); //opens new tab with Page 2
// handle Page 2
// get appID from Page 2
// close Page 2
// go back to Page 1
await page.focus('#input-appId');
await page.type(appIdSavedFromPage2);
// submit the form
await form.evaluate(form => form.submit());
browser.close();
})();
Update 2017-10-25
The work for Browser.pages has been completed and merged.
It fixes "Emit new Page objects when new tabs created" (#386) and "Request: browser.currentPage() or similar way to access Pages" (#443).
Still looking for a good usage example.
A new patch was committed two days ago, and you can now use browser.pages() to access all Pages in the current browser.
It works fine; I tried it myself yesterday :)
Edit:
An example of how to get a JSON value from a new page opened via a 'target: _blank' link.
const page = await browser.newPage();
await page.goto(url, {waitUntil: 'load'});
// click on a 'target:_blank' link
await page.click(someATag);
// get all the currently open pages as an array
let pages = await browser.pages();
// get the last element of the array (index 3 in my case) and do some
// hocus-pocus to get it as JSON...
const lastPage = pages[pages.length - 1];
const aHandle = await lastPage.evaluateHandle(() => document.body);
const resultHandle = await lastPage.evaluateHandle(body =>
body.innerHTML, aHandle);
// get the JSON value of the page.
let jsonValue = await resultHandle.jsonValue();
// ...do something with JSON
This will work for you in the latest alpha branch:
const newPagePromise = new Promise(x => browser.once('targetcreated', target => x(target.page())));
await page.click('my-link');
// handle Page 2: you can access the new page's DOM through the newPage object
const newPage = await newPagePromise;
await newPage.waitForSelector('#appid');
// query the new tab, not the original page
const appidHandle = await newPage.$('#appid');
const appID = await newPage.evaluate(element => element.innerHTML, appidHandle);
await newPage.close();
[...]
//back to page 1 interactions
Be sure to use the latest Puppeteer version (from the GitHub master branch) by setting the package.json dependency to
"dependencies": {
"puppeteer": "git://github.com/GoogleChrome/puppeteer"
},
Source: JoelEinbinder @ https://github.com/GoogleChrome/puppeteer/issues/386#issuecomment-343059315
According to the Official Documentation:
browser.pages()
returns: <Promise<Array<Page>>> Promise which resolves to an array of all open pages. Non visible pages, such as "background_page", will not be listed here. You can find them using target.page().
An array of all pages inside the Browser. In case of multiple browser contexts, the method will return an array with all the pages in all browser contexts.
Example Usage:
let pages = await browser.pages();
await pages[0].evaluate(() => { /* ... */ });
await pages[1].evaluate(() => { /* ... */ });
await pages[2].evaluate(() => { /* ... */ });
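Fixed indexes such as pages[1] or pages[3] stop pointing at the intended tab as soon as the number of open tabs changes. A minimal sketch that looks a tab up by its URL instead; findPage is a hypothetical helper, and the "/appid" fragment in the usage comment is an assumed example, not taken from a real site:

```javascript
// Hypothetical helper: returns the first open page whose URL satisfies the
// predicate, or null when no tab matches.
async function findPage(browser, predicate) {
  const pages = await browser.pages();
  return pages.find((p) => predicate(p.url())) || null;
}

// Usage (not executed here):
// const appIdPage = await findPage(browser, (url) => url.includes("/appid"));
// if (appIdPage) await appIdPage.bringToFront();
```

A tab opened by a click may not be listed immediately, so a null result right after the click can simply mean the lookup ran too early.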
In theory, you could override the window.open function to always open "new tabs" on your current page and navigate via history.
Your workflow would then be:
Override the window.open function:
await page.evaluateOnNewDocument(() => {
window.open = (url) => {
top.location = url
}
})
Go to your first page and perform some actions:
await page.goto(PAGE1_URL)
// ... do stuff on page 1
Navigate to your second page by clicking the button and perform some actions there:
await page.click('#button_that_opens_page_2')
await page.waitForNavigation()
// ... do stuff on page 2, extract any info required on page 1
// e.g. const handle = await page.evaluate(() => { ... })
Return to your first page:
await page.goBack()
// or: await page.goto(PAGE1_URL)
// ... do stuff on page 1, injecting info saved from page 2
This approach obviously has its drawbacks, but I find it simplifies multi-tab navigation drastically, which is especially useful if you're already running parallel jobs on multiple tabs. Unfortunately, the current API doesn't make this an easy task.
If the new tab is caused by a target="_blank" attribute, you can remove the need to switch pages by setting target="_self".
Example:
const element = await page.$(selector);
await page.evaluateHandle((el) => {
el.target = '_self';
}, element);
await element.click();
If your click action triggers a page load, any scripts run before that load finishes are effectively lost. To get around this, trigger the action (a click in this case) and wait for the navigation at the same time, rather than awaiting the click alone:
await Promise.all([
page.waitForNavigation(),
page.click('.get-appId'),
]);
This makes your script wait for the next page load before proceeding with further actions.
You can't currently. Follow https://github.com/GoogleChrome/puppeteer/issues/386 to know when the ability is added to Puppeteer (hopefully soon).
It looks like there's a simple 'popup' event on the page object. From the documentation:
Page corresponding to "popup" window
Emitted when the page opens a new tab or window.
const [popup] = await Promise.all([
new Promise(resolve => page.once('popup', resolve)),
page.click('a[target=_blank]'),
]);
const [popup] = await Promise.all([
new Promise(resolve => page.once('popup', resolve)),
page.evaluate(() => window.open('https://example.com')),
]);
Credit to this GitHub issue for an easier alternative to 'targetcreated'.
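Applied to the registration workflow in the question, the popup event could be used roughly as follows. A minimal sketch; readFromPopup is a hypothetical helper, and the selectors in the usage comment ('.get-appId' and '#input-appId' from the question, '#appid' from an earlier answer) are assumed to match the real pages:

```javascript
// Hypothetical helper: clicks something that opens a new tab, reads the text of
// one element in that tab, closes the tab, and returns the trimmed text.
async function readFromPopup(page, clickSelector, valueSelector) {
  const [popup] = await Promise.all([
    new Promise((resolve) => page.once("popup", resolve)), // listen before clicking
    page.click(clickSelector),
  ]);
  await popup.waitForSelector(valueSelector);
  const value = await popup.$eval(valueSelector, (el) => el.textContent.trim());
  await popup.close();
  await page.bringToFront(); // return focus to the original tab
  return value;
}

// Usage (not executed here):
// const appId = await readFromPopup(page, ".get-appId", "#appid");
// await page.type("#input-appId", appId);
```

Registering the listener before the click matters: if the click were awaited first, the popup event could fire before anything is listening for it.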
You can call browser.newPage() multiple times to open multiple tabs.
Example
const page = await browser.newPage();
await page.goto("https://www.google.com/");
const page2 = await browser.newPage();
await page2.goto("https://www.youtube.com/");
