Puppeteer cannot read elements loaded by data-react-helmet - node.js

I need to read a website to get some SEO tags, but these tags are embedded by React Helmet (I believe). During the standard process (page.goto(url)) everything works fine, but on SPA pages that lazy-load this data, I cannot read the tags.
const page = await browser.newPage();
await page.emulate(device);
await page.setRequestInterception(false);
const json_headers = process.argv[3];
const extra_headers = JSON.parse(json_headers);
await page.setExtraHTTPHeaders(extra_headers);
const response = await page.goto(process.argv[2],{ waitUntil: 'networkidle0',referer: process.argv[2]});
await autoScroll(page);
If I add any kind of "wait" function, the program simply stops, because the DOM has already been received and it does not contain the expected element, for example:
await page.waitForSelector('meta[name="description"]');
I tried more than 30 different approaches, but the natural request order does not apply here, because the developer injects (I don't know how) the tags after the response is delivered, and in this scenario it seems impossible to crawl the page.
Here is an example of a tag generated on demand (it does not exist during the initial page load):
<link rel="canonical" href="https://someexample.com/testes" data-react-helmet="true">
Any suggestions? (Again, I tried all the wait*** methods.)
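One approach that often works for Helmet-injected tags is to poll for the attribute value rather than for the element alone (in Puppeteer, via page.waitForFunction with a generous timeout). The polling logic itself is just a retry loop; below is a minimal standalone sketch of that idea (pollUntil is a hypothetical helper, not a Puppeteer API, and the delayed value stands in for the injected tag):

```javascript
// Minimal retry loop, the same idea page.waitForFunction implements:
// keep evaluating a condition until it returns a truthy value or a
// timeout elapses. In Puppeteer you would pass the condition to
// page.waitForFunction instead of using this helper.
async function pollUntil(fn, { timeout = 5000, interval = 100 } = {}) {
  const deadline = Date.now() + timeout;
  for (;;) {
    const value = await fn();
    if (value) return value;
    if (Date.now() >= deadline) throw new Error('pollUntil: timed out');
    await new Promise(resolve => setTimeout(resolve, interval));
  }
}

// Example: a value that only becomes available after a short delay,
// standing in for a meta tag injected by React Helmet after load.
let canonical = null;
setTimeout(() => { canonical = 'https://someexample.com/testes'; }, 50);

pollUntil(() => canonical).then(value => console.log(value));
```

The key difference from waiting on the selector alone is that the condition returns the attribute value, so it stays falsy until Helmet has actually filled the tag in.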

Related

Web Scraping NodeJs - How to recover resources when the page loads in full after several requests

I'm trying to retrieve each item (composed of an image, a word, and its translation) from this page.
Link of the website: https://livingdictionaries.app/hazaragi/entries/gallery?entries_prod%5Btoggle%5D%5BhasImage%5D=true
I used JsDom and Got.
Here is the code
const jsdom = require("jsdom");
const { JSDOM } = jsdom;
const got = require('got');
(async () => {
  const response = await got("https://livingdictionaries.app/hazaragi/entries/gallery?entries_prod%5Btoggle%5D%5BhasImage%5D=true");
  console.log(response.body);
  const dom = new JSDOM(response.body);
  console.log(dom.window.document.querySelectorAll(".ld-egdn1r"));
})();
When I display the HTML code that is returned to me, it does not correspond to what I see when I open the site in my browser. There are no HTML tags containing the items.
When I look at the Network tab, other resources are loaded, but again I can't find the query that retrieves the words.
I think what I am looking for is loaded across several queries, but I don't know which one.
Here are the steps:
Then you will get code like this:
fetch("https://xcvbaysyxd-dsn.algolia.net/1/indexes/*/queries", {
  "credentials": "omit",
  "headers": {},
  "referrer": "https://livingdictionaries.app/",
  "body": "...",
  "method": "POST",
  "mode": "cors"
});
You will just have to process the data manually after that:
const fetch = require("node-fetch") // npm i node-fetch
const data = await fetch(...).then(r=>r.json())
const product = data.results.map(r=>r.hits)
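To make the shape of that processing concrete: an Algolia multi-query response is roughly { results: [ { hits: [...] }, ... ] }, so collecting every item is a flat-map over results. The sample entries below are made up for illustration; only the results/hits structure is taken from the response format:

```javascript
// Simplified stand-in for the JSON returned by the Algolia endpoint.
// The word/translation fields are hypothetical sample data.
const data = {
  results: [
    { hits: [{ word: 'salam', translation: 'hello' }] },
    { hits: [{ word: 'ab', translation: 'water' }] },
  ],
};

// One request can carry several queries, so there can be several
// result sets; flatten them into a single list of items.
const items = data.results.flatMap(r => r.hits);
console.log(items.length); // 2
```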
In your case:
The site you are trying to scrape is a Single Page Application (SPA) built with Svelte and the individual elements are dynamically rendered as needed, as many websites are today. Since the HTML is not hard-coded, these sites are notoriously difficult to scrape.
If you just log the response, you will see that the elements you are selecting do not exist. This is because it is the browser that interprets the JavaScript at run time and updates the UI. A GET request using got, axios, fetch, or whatever else cannot perform such tasks.
You will need to use a headless browser like Puppeteer in order to dynamically render the site and scrape it.

Simple way to add Firefox Extensions/Add Ons

I know with Pyppeteer (Puppeteer) or Selenium, I can simply add chrome/chromium extensions by including them in args like this:
args=[
f'--disable-extensions-except={pathToExtension}',
f'--load-extension={pathToExtension}'
]
I also know Selenium has the very useful load_extension function.
I was wondering if there was a similarly easy way to load extensions/add-ons in Firefox for Playwright? Or perhaps just with firefox_user_args?
I've seen an example in JS using this:
const path = require('path');
const {firefox} = require('playwright');
const webExt = require('web-ext').default;
(async () => {
  // 1. Enable verbose logging and start capturing logs.
  webExt.util.logger.consoleStream.makeVerbose();
  webExt.util.logger.consoleStream.startCapturing();
  // 2. Launch firefox
  const runner = await webExt.cmd.run({
    sourceDir: path.join(__dirname, 'webextension'),
    firefox: firefox.executablePath(),
    args: [`-juggler=1234`],
  }, {
    shouldExitProgram: false,
  });
  // 3. Parse firefox logs and extract juggler endpoint.
  const JUGGLER_MESSAGE = `Juggler listening on`;
  const message = webExt.util.logger.consoleStream.capturedMessages.find(msg => msg.includes(JUGGLER_MESSAGE));
  const wsEndpoint = message.split(JUGGLER_MESSAGE).pop();
  // 4. Connect playwright and start driving browser.
  const browser = await firefox.connect({ wsEndpoint });
  const page = await browser.newPage();
  await page.goto('https://mozilla.org');
  // .... go on driving ....
})();
Is there anything similar for python?
TL;DR: code at the end.
After spending too much time on this, I have found a way to install extensions in Firefox with Playwright, a feature that I believe is not officially supported for now (Chromium has that feature and it works).
Since adding an extension in Firefox requires the user to accept a special popup that appears when you click to install it, I figured it was easier just to download the .xpi file and then install the extension from the file.
To install a file as an extension, we need to go to the URL 'about:debugging#/runtime/this-firefox', which allows installing a temporary extension.
But on that page you cannot use the console or the DOM, due to protections in Firefox that I haven't been able to bypass.
However, we know that about:debugging runs in a tab with a special id, so we can open a new 'about:devtools-toolbox' tab, where we can fake user input to run commands in a GUI console.
The trick is to load the file as an 'nsIFile'. To do that, we make use of the modules already available in 'about:debugging' and import the ones we need.
The following code is Python, but I guess translating it into JavaScript should be no big deal.
# get the absolute paths of all the .xpi extensions
extensions = [os.path.abspath(f"Settings/Addons/{file}") for file in os.listdir("Settings/Addons") if file.endswith(".xpi")]
if not extensions:
    return
c1 = "const { AddonManager } = require('resource://gre/modules/AddonManager.jsm');"
c2 = "const { FileUtils } = require('resource://gre/modules/FileUtils.jsm');"
c3 = "AddonManager.installTemporaryAddon(new FileUtils.File('{}'));"
context = await browser.new_context()
page = await context.new_page()
page2 = await context.new_page()
await page.goto("about:debugging#/runtime/this-firefox", wait_until="domcontentloaded")
await page2.goto("about:devtools-toolbox?id=9&type=tab", wait_until="domcontentloaded")
await asyncio.sleep(1)
await page2.keyboard.press("Tab")
await page2.keyboard.down("Shift")
await page2.keyboard.press("Tab")
await page2.keyboard.press("Tab")
await page2.keyboard.up("Shift")
await page2.keyboard.press("ArrowRight")
await page2.keyboard.press("Enter")
await page2.keyboard.type(f"{' '*10}{c1}{c2}")
await page2.keyboard.press("Enter")
for extension in extensions:
    print(f"Adding extension: {extension}")
    await asyncio.sleep(0.2)
    await page2.keyboard.type(f"{' '*10}{c3.format(extension)}")
    await page2.keyboard.press("Enter")
# await asyncio.sleep(0.2)
await page2.bring_to_front()
Note that there are some sleeps, because the page needs to load but Playwright cannot detect when it has.
I needed to add some leading whitespace because, for some reason, Playwright or Firefox was dropping some of the first characters of the commands.
Also, if you want to install more than one add-on, I suggest tuning the amount of sleep before bringing the page to the front, in case an add-on opens a new tab.

Upload webscraped data with puppeteer to firebase cloud storage in node.js

I'm trying to scrape a press site, open every article link, and get the data. I was able to scrape it with Puppeteer but cannot upload the results to Firebase Cloud Storage. How do I do that every hour or so?
I scraped in an asynchronous function and then called it in the cloud function:
I used Puppeteer to scrape the article links from the newsroom website and then used those links to get more information from the articles. I first had everything in a single async function, but Cloud Functions threw an error that there should not be any awaits in a loop.
UPDATE:
I implemented the code above in a Firebase function but still get the no-await-in-loop error.
There are a couple of things wrong here, but you are on a good path to getting this to work. One issue is how your try {} catch {} blocks interact with await; asynchronous JavaScript has its own way of dealing with errors. See: try/catch blocks with async/await. (The no-await-in-loop message itself is an ESLint rule, not a runtime error, and can be disabled when sequential awaits are intentional.)
In your case, it's totally fine to write everything in one async function. Here is how I would do it:
const puppeteer = require('puppeteer');

async function scrapeIfc() {
  const completeData = [];
  const url = 'https://www.ifc.org/wps/wcm/connect/news_ext_content/ifc_external_corporate_site/news+and+events/pressroom/press+releases';
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  page.setDefaultNavigationTimeout(0); // set before goto, so it applies to the navigation
  await page.goto(url);
  const links = await page.evaluate(() =>
    Array.from(document.querySelectorAll('h3 > a')).map(anchor => anchor.href)
  );
  for (const link of links) {
    const newPage = await browser.newPage();
    await newPage.goto(link);
    const data = await newPage.evaluate(() => {
      const titleElement = document.querySelector('td[class="PressTitle"] > h3');
      const contactElement = document.querySelector('center > table > tbody > tr:nth-child(1) > td');
      const txtElement = document.querySelector('center > table > tbody > tr:nth-child(2) > td');
      return {
        source: 'ITC',
        title: titleElement ? titleElement.innerText : undefined,
        contact: contactElement ? contactElement.innerText : undefined,
        txt: txtElement ? txtElement.innerText : undefined,
      };
    });
    completeData.push(data);
    await newPage.close();
  }
  await browser.close();
  return completeData;
}
There are a couple of other things you should note:
You have a bunch of unused imports (title, link, resolve, and reject) at the head of your script, which might have been added automatically by your code editor. Get rid of them, as they might shadow the real variables.
I changed your document.querySelectors to be more specific, as I couldn't select the actual elements from the ITC website. You might need to revise them.
For local development I use Google's functions-framework, which helps me run and test the function locally before deploying. If you have errors on your local machine, you'll have errors when deploying to Google Cloud.
(Opinion) If you don't need Firebase, I would run this with Google Cloud Functions, Cloud Scheduler and the Cloud Firestore. For me, this has been the go-to workflow for periodic web scraping.
(Opinion) Puppeteer might be overkill for scraping a simple static website, since it runs a full headless browser. Something like Cheerio is much more lightweight and much faster.
Hope I could help. If you encounter other problems, let us know. Welcome to the Stack Overflow community!
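On the "every hour or so" part of the question: in production, a Cloud Scheduler-triggered function (as mentioned above) is the usual answer, but for a long-running Node process a plain interval is enough. A minimal sketch; schedule is a hypothetical helper, not a Firebase API, and saveToStorage in the usage comment is a placeholder for your upload code:

```javascript
// Run a task immediately, then again every intervalMs milliseconds.
// Returns the timer so the caller can stop it with clearInterval.
function schedule(task, intervalMs) {
  task();
  return setInterval(task, intervalMs);
}

// Usage (placeholder names): re-run the scraper every hour.
// const timer = schedule(() => scrapeIfc().then(saveToStorage), 60 * 60 * 1000);
```

The trade-off is that an in-process timer dies with the process, which is why a managed scheduler is preferable for anything unattended.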

How to get visual DOM structure from url in node.js

I am wondering how to get the "visual" DOM structure from a URL in Node.js. When I try to get the HTML content with the request library, the HTML structure is not correct.
const request = require('request');
const jsdom = require("jsdom");
const { JSDOM } = jsdom;
request({ url: 'https://www.washingtonpost.com/news/acts-of-faith/wp/2017/06/30/trump-promised-to-destroy-the-johnson-amendment-congress-is-targeting-it-now/', jar: true }, function (e, r, body) {
  console.log(body);
});
The returned HTML structure is below, where the meta tags are not correct:
<meta property="og:title" content=""/>
<meta itemprop="description" name="description" content=""/>
If I open the website in a web browser, I can see the correct meta tags in the web inspector:
<meta property="og:title" content="Trump promised to destroy the Johnson Amendment. Congress is targeting it now."/>
<meta itemprop="description" name="description" content="Observers believe the proposed legislation would make it harder for the IRS to enforce a law preventing pulpit endorsements."/>
I might need more clarification on what a "visual" DOM structure is, but as a commenter pointed out, a headless browser like Puppeteer is probably the way to go when a website has complex loading behavior.
The advantage here is that, with Puppeteer at least, you can navigate to a page and then programmatically wait until some condition is satisfied before continuing. In this case, I chose to wait until the content attribute of one of the meta tags you specified is truthy, but depending on your needs you could wait for something else, or even for multiple conditions to be true.
You might have to analyze the behavior of the page in question a little more deeply to figure out what you should wait for, but at the very least the following code seems to correctly load the tags in your question.
import puppeteer from 'puppeteer'

(async () => {
  const url = 'https://www.washingtonpost.com/news/acts-of-faith/wp/2017/06/30/trump-promised-to-destroy-the-johnson-amendment-congress-is-targeting-it-now/'
  const browser = await puppeteer.launch()
  const page = await browser.newPage()
  await page.goto(url)
  // wait until <meta property="og:title"> has a truthy value for its content attribute
  // (guard against the tag not existing yet, so the predicate doesn't throw)
  await page.waitForFunction(() => {
    const tag = document.querySelector('meta[property="og:title"]')
    return tag && tag.getAttribute('content')
  })
  const html = await page.content()
  console.log(html)
  await browser.close()
})()
(pastebin of formatted html result)
Also, since this solution uses puppeteer I'd recommend not working with the html string and instead using the puppeteer API to extract the information you need.
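Following that advice, one pattern is to keep the extraction logic as a pure function and hand it to page.evaluate, so it runs against the live DOM instead of an HTML string. Because the function below only touches the standard DOM API, it can also be exercised outside the browser with a small stub document; the stub is purely for illustration:

```javascript
// Extract selected meta values from a Document. In Puppeteer this can
// run in the page context via: const meta = await page.evaluate(extractMeta)
// (the default parameter resolves to the page's document there).
function extractMeta(doc = document) {
  const grab = selector => {
    const el = doc.querySelector(selector);
    return el ? el.getAttribute('content') : undefined;
  };
  return {
    ogTitle: grab('meta[property="og:title"]'),
    description: grab('meta[itemprop="description"]'),
  };
}

// Tiny stub standing in for a real Document, just to show the shape:
const stubDoc = {
  querySelector: sel =>
    sel.includes('og:title') ? { getAttribute: () => 'Example title' } : null,
};
console.log(extractMeta(stubDoc)); // { ogTitle: 'Example title', description: undefined }
```

Keeping the logic pure like this makes it easy to unit-test the selectors without launching a browser at all.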

node js puppeteer metadata

I am new to Puppeteer, and I am trying to extract metadata from a website using Node.js and Puppeteer. I just can't seem to get the syntax right. The code below works perfectly, extracting the title tag (using two different methods) as well as the text from a paragraph tag. How would I extract the content text of the meta tag with the name "description", for example?
meta name="description" content="Stack Overflow is the largest, etc"
I would be seriously grateful for any suggestions! I can't seem to find any examples of this anywhere (5 hours of searching and code hacking later). My sample code:
const puppeteer = require('puppeteer');

async function main() {
  const browser = await puppeteer.launch({headless: false});
  const page = await browser.newPage();
  await page.goto('https://stackoverflow.com/', {waitUntil: 'networkidle2'});
  const pageTitle1 = await page.evaluate(() => document.querySelector('title').textContent);
  const pageTitle2 = await page.title();
  const innerText = await page.evaluate(() => document.querySelector('p').innerText);
  console.log(pageTitle1);
  console.log(pageTitle2);
  console.log(innerText);
}
main();
You need a deep dive into CSS selectors; see MDN's CSS Selectors reference.
Something I highly recommend is testing your selectors in the console, directly on the page you will automate; this will save hours of running and stopping your script. Try this:
document.querySelectorAll("head > meta[name='description']")[0].content;
Now for Puppeteer, you just need to copy that selector and paste it into a Puppeteer function. I also prefer this notation:
await page.$eval("head > meta[name='description']", element => element.content);
For any other question or problem, just comment.
For anyone struggling to get the OG tags in Puppeteer, here is the solution.
let dom2 = await page.evaluate(() => {
  return document.head.querySelector('meta[property="og:description"]').getAttribute("content");
});
console.log(dom2);
If you prefer to avoid $eval, you can do the following (note that in Puppeteer an ElementHandle has no getAttribute method, unlike Playwright, so you have to evaluate in the page context):
const descriptionTag = await page.$('meta[name="description"]');
const description = await descriptionTag?.evaluate(el => el.getAttribute('content'));
