Node.js Puppeteer metadata

I am new to Puppeteer, and I am trying to extract metadata from a website using Node.js and Puppeteer. I just can't seem to get the syntax right. The code below works perfectly for extracting the title tag, using two different methods, as well as the text from a paragraph tag. How would I extract the content text of the meta tag named "description", for example?
<meta name="description" content="Stack Overflow is the largest, etc">
I would be seriously grateful for any suggestions! I can't seem to find any examples of this anywhere (5 hours of searching and code hacking later). My sample code:
const puppeteer = require('puppeteer');

async function main() {
  const browser = await puppeteer.launch({headless: false});
  const page = await browser.newPage();
  await page.goto('https://stackoverflow.com/', {waitUntil: 'networkidle2'});

  const pageTitle1 = await page.evaluate(() => document.querySelector('title').textContent);
  const pageTitle2 = await page.title();
  const innerText = await page.evaluate(() => document.querySelector('p').innerText);

  console.log(pageTitle1);
  console.log(pageTitle2);
  console.log(innerText);
}

main();

For a deeper dive into CSS selectors, see the MDN CSS Selectors reference.
Something I highly recommend is testing your selectors in the console, directly on the page you are going to automate; this will save you hours of run-and-stop cycles. Try this:
document.querySelectorAll("head > meta[name='description']")[0].content;
Now for Puppeteer, you just copy that selector and paste it into a Puppeteer function. I also prefer this notation:
await page.$eval("head > meta[name='description']", element => element.content);
If you have any other questions or problems, just comment.

For anyone struggling to get the OG (Open Graph) tags in Puppeteer, here is the solution.
let dom2 = await page.evaluate(() => {
  return document.head.querySelector('meta[property="og:description"]').getAttribute("content");
});
console.log(dom2);

If you prefer to avoid $eval, you can do:
const descriptionTag = await page.$('meta[name="description"]');
const description = await descriptionTag?.evaluate(element => element.getAttribute('content'));
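Putting those answers together, here is a minimal sketch (my own consolidation; it assumes the tags exist on the page, and the optional chaining returns null for any that are missing) that collects several common meta tags in one evaluate call:
const meta = await page.evaluate(() => {
  // Returns the tag's content attribute, or null if the tag is absent.
  const content = (selector) => document.querySelector(selector)?.getAttribute('content') ?? null;
  return {
    description: content('meta[name="description"]'),
    ogTitle: content('meta[property="og:title"]'),
    ogDescription: content('meta[property="og:description"]'),
  };
});
console.log(meta);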

Related

How to get lastModified property of another website

When I use the inspect/developer tools in Chrome, I can find the last-modified date in the browser, but I want to see the same date in my Node.js application.
I have already tried
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://www.tbsnews.net/economy/bsec-chairman-stresses-restoring-investor-confidence-mutual-funds-500126');
const names = await page.evaluate(() => {
  console.log(document.lastModified);
});
Unfortunately, this code shows the current time of the new DOM creation, since we are using newPage(). Can anyone help me?
I have also tried JSDOM.
Thanks in advance.
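One possible workaround (a sketch, assuming the server actually sends the header; when it doesn't, document.lastModified falls back to the current time, which matches the behavior described above): read the Last-Modified HTTP header from the navigation response instead:
const response = await page.goto('https://www.tbsnews.net/economy/bsec-chairman-stresses-restoring-investor-confidence-mutual-funds-500126');
// headers() returns lower-cased header names; this is undefined if the header is absent.
const lastModified = response.headers()['last-modified'];
console.log(lastModified);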

How can I download full-quality pictures from Google Images using Puppeteer?

I need help creating a stable selector that captures the full-quality URL from Google Images.
I am trying to download 4-25 pictures from Google Images using Puppeteer in full quality.
It doesn't work.
The problem is creating a stable selector and getting the URLs of the pictures in full quality, not the URLs of Google's preview mode.
I had it running already, but it broke down due to what I understand to be a poorly chosen selector. Now I am trying to rebuild it.
Old selector that results in "elements" being undefined:
let previewimagexpath =
  "/html/body/div[2]/c-wiz/div[3]/div[2]/div[3]/div/div/div[3]/div[2]/c-wiz/div/div[1]/div[1]/div[2]/div/a/img";
// previewimagexpath = '//*[@id="Sva75c"]/div/div/div[3]/div[2]/c-wiz/div/div[1]/div[1]/div[2]/div/a/img'
for (let i = 1; i < numOfPics; i++) {
  let imagexpath =
    "/html/body/div[2]/c-wiz/div[3]/div[1]/div/div/div/div[1]/div[1]/span/div[1]/div[1]/div[" +
    i +
    "]/a[1]/div[1]/img";
  const elements = await page.$x(imagexpath);
  await elements[0].click();
  await page.waitForTimeout(3000);
  const image = await page.$x(previewimagexpath);
  let d = await image[0].getProperty("src");
  //console.log(d._remoteObject.value);
  imagelinkslist.push(d._remoteObject.value);
}
await browser.close();
};
New selector, which results in URLs from the preview mode rather than URLs of the full-quality images:
const axios = require("axios");
const cheerio = require("cheerio");

axios
  .get(
    "https://www.google.com/search?q=dogs&sxsrf=ALiCzsZW27NYppMFDO9xwabkhmXUQMku8g:1651495383126&source=lnms&tbm=isch&sa=X&ved=2ahUKEwj4-qLd68D3AhUR3KQKHdk3CFYQ_AUoAXoECAIQAw&biw=1680&bih=948&dpr=2"
  )
  .then(response => {
    const $ = cheerio.load(response.data);
    const image = $("img");
    $("img").each((i, elem) => {});
    console.log(image);
  });
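One way to attack this (a sketch only; every class name below is an assumption, and Google changes its markup often, so expect to update the selectors): click each thumbnail, then wait until the preview <img> src flips from the base64 data: placeholder to a real https URL, which is the full-quality source:
const imagelinkslist = [];
const thumbs = await page.$$('img.rg_i'); // thumbnail class: an assumption that may be outdated
for (const thumb of thumbs.slice(0, numOfPics)) {
  await thumb.click();
  try {
    // Poll until the preview image's src is a real URL rather than a data: URI.
    const handle = await page.waitForFunction(() => {
      const img = document.querySelector('img.sFlh5c'); // preview class: also an assumption
      return img && img.src.startsWith('https') ? img.src : false;
    }, { timeout: 5000 });
    imagelinkslist.push(await handle.jsonValue());
  } catch (e) {
    // The preview never resolved to a full-quality URL for this thumbnail; skip it.
  }
}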

How to read a PDF file using Puppeteer and display it in HTML?

I hope you are safe.
I'm making a script which performs some scraping on a site. The issue is, one site has a PDF, and I'm not able to read that PDF file using Puppeteer and Node.js.
I'm able to read text from other links fine.
What I tried
const puppeteer = require('puppeteer');

async function printPDF() {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto('https://blog.risingstack.com', {waitUntil: 'networkidle0'});
  const pdf = await page.pdf({ format: 'A4' });
  await browser.close();
  return pdf;
}
This works for generating a PDF from a page, but I need PDF to text.
Can someone help me with this?
There is an npm module named "pdfreader". You can check that out.
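For example, here is a minimal sketch following pdfreader's documented callback API (the file path is a placeholder; you would first save the PDF bytes to disk):
const { PdfReader } = require('pdfreader');

// Emits one callback per parsed item: text chunks, page markers, and finally a null item.
new PdfReader().parseFileItems('sample.pdf', (err, item) => {
  if (err) console.error('error:', err);
  else if (!item) console.log('end of file');
  else if (item.text) console.log(item.text);
});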
FYI, this was possible in Playwright by using Firefox and navigating to a PDF file, which would be opened using PDF.js. However, recent versions of Playwright broke this functionality:
https://github.com/microsoft/playwright/issues/13157

Upload web-scraped data with Puppeteer to Firebase Cloud Storage in Node.js

I'm trying to scrape a press site, open every article link, and get the data. I was able to scrape it with Puppeteer but cannot upload it to Firebase Cloud Storage. How do I do that every hour or so?
I scraped in an asynchronous function and then called it in the cloud function:
I used Puppeteer to scrape the links of the articles from the newsroom website and then used the links to get more information from the articles. I first had everything in a single async function, but Cloud Functions threw an error that there should not be any awaits in a loop.
UPDATE:
I implemented the code above in a Firebase function but still get the no-await-in-loop error.
There are a couple of things wrong here, but you are on a good path to getting this to work. The main problem is that you can't have await within a try {} catch {} block. Asynchronous JavaScript has a different way of dealing with errors. See: try/catch blocks with async/await.
In your case, it's totally fine to write everything in one async function. Here is how I would do it:
async function scrapeIfc() {
  const completeData = [];
  const url = 'https://www.ifc.org/wps/wcm/connect/news_ext_content/ifc_external_corporate_site/news+and+events/pressroom/press+releases';
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  page.setDefaultNavigationTimeout(0); // disable the navigation timeout before navigating
  await page.goto(url);

  const links = await page.evaluate(() =>
    Array.from(document.querySelectorAll('h3 > a')).map(anchor => anchor.href)
  );

  for (const link of links) {
    const newPage = await browser.newPage();
    await newPage.goto(link);
    const data = await newPage.evaluate(() => {
      const titleElement = document.querySelector('td[class="PressTitle"] > h3');
      const contactElement = document.querySelector('center > table > tbody > tr:nth-child(1) > td');
      const txtElement = document.querySelector('center > table > tbody > tr:nth-child(2) > td');
      return {
        source: 'IFC',
        title: titleElement ? titleElement.innerText : undefined,
        contact: contactElement ? contactElement.innerText : undefined,
        txt: txtElement ? txtElement.innerText : undefined,
      };
    });
    completeData.push(data);
    await newPage.close();
  }

  await browser.close();
  return completeData;
}
There are a couple of other things you should note:
You have a bunch of unused imports (title, link, resolve, and reject) at the head of your script, which might have been added automatically by your code editor. Get rid of them, as they might shadow the real variables.
I changed your document.querySelector calls to be more specific, as I couldn't select the actual elements on the IFC website. You might need to revise them.
For local development I use Google's functions-framework, which helps me run and test the function locally before deploying. If you have errors on your local machine, you'll have errors when deploying to Google Cloud.
(Opinion) If you don't need Firebase, I would run this with Google Cloud Functions, Cloud Scheduler, and Cloud Firestore. For me, this has been the go-to workflow for periodic web scraping. For the hourly schedule you asked about, see the sketch after this list.
(Opinion) Puppeteer might be overkill for scraping a simple static website, since it runs a headless browser. Something like Cheerio is much more lightweight and much faster.
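To cover the "every hour or so" part: here is a minimal sketch using the Firebase Functions v1 scheduler (the export name and the Firestore collection are placeholders of mine; swap the write for a Cloud Storage upload if you need files rather than documents):
const functions = require('firebase-functions');
const admin = require('firebase-admin');
admin.initializeApp();

// Runs the scraper every hour and writes the results to Firestore in one batch.
exports.scrapeHourly = functions.pubsub.schedule('every 60 minutes').onRun(async () => {
  const articles = await scrapeIfc();
  const batch = admin.firestore().batch();
  articles.forEach(article => {
    batch.set(admin.firestore().collection('articles').doc(), article);
  });
  await batch.commit();
});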
Hope I could help. If you encounter other problems, let us know. Welcome to the Stack Overflow community!

Selective Rendering in Puppeteer

Is it possible to render only a single div (or any selector) in Puppeteer?
Example: there's a lot of information on my page and I want to screenshot only a part of it, a div. Currently I use the clip option of the screenshot API,
but is there a way I can screenshot by specifying a selector?
There are many cool examples in the ElementHandle.screenshot tests, e.g.:
await page.setViewport({width: 500, height: 500});
await page.goto(server.PREFIX + '/grid.html');
await page.evaluate(() => window.scrollBy(50, 100));
const elementHandle = await page.$('.box:nth-of-type(3)');
const screenshot = await elementHandle.screenshot();
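As a self-contained version of the same idea outside Puppeteer's test harness (the URL and selector are placeholders):
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');
  // Screenshots only the element matched by the selector, not the whole viewport.
  const element = await page.$('div'); // replace with your selector
  if (element) await element.screenshot({ path: 'element.png' });
  await browser.close();
})();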
