I know that with Pyppeteer (Puppeteer) or Selenium I can simply add Chrome/Chromium extensions by including them in args like this:
args=[
    f'--disable-extensions-except={pathToExtension}',
    f'--load-extension={pathToExtension}'
]
I also know Selenium has the very useful load_extension function.
I was wondering if there is a similarly easy way to load extensions/add-ons in Firefox for Playwright, or perhaps just with firefox_user_args?
I've seen an example in JS using this:
const path = require('path');
const {firefox} = require('playwright');
const webExt = require('web-ext').default;
(async () => {
  // 1. Enable verbose logging and start capturing logs.
  webExt.util.logger.consoleStream.makeVerbose();
  webExt.util.logger.consoleStream.startCapturing();

  // 2. Launch Firefox.
  const runner = await webExt.cmd.run({
    sourceDir: path.join(__dirname, 'webextension'),
    firefox: firefox.executablePath(),
    args: [`-juggler=1234`],
  }, {
    shouldExitProgram: false,
  });

  // 3. Parse Firefox logs and extract the Juggler endpoint.
  const JUGGLER_MESSAGE = `Juggler listening on`;
  const message = webExt.util.logger.consoleStream.capturedMessages.find(msg => msg.includes(JUGGLER_MESSAGE));
  const wsEndpoint = message.split(JUGGLER_MESSAGE).pop();

  // 4. Connect Playwright and start driving the browser.
  const browser = await firefox.connect({ wsEndpoint });
  const page = await browser.newPage();
  await page.goto('https://mozilla.org');
  // ... go on driving ...
})();
Is there anything similar for Python?
Tl;dr: code at the end.
After wasting too much time on this, I have found a way to install extensions in Firefox with Playwright, a feature I believe is not supported for now, whereas Chromium has it and it works.
Since installing an extension in Firefox requires the user to click a confirmation popup that appears when you try to install it, I figured it was easier to just download the .xpi file and install the extension from that file.
To install a file as an extension, we need to go to the URL about:debugging#/runtime/this-firefox, which allows installing a temporary extension.
But on that page you cannot use the console or the DOM, due to protections in Firefox that I haven't been able to get around.
However, we know that about:debugging runs in a tab with a known id, so we can open a new tab at about:devtools-toolbox, where we can fake user input to run commands in a GUI console.
The way to load the file is as an nsIFile. To do that we make use of the modules already available to about:debugging and import the ones we need.
The following code is Python, but translating it into JavaScript should be no big deal.
# get the absolute path of every .xpi extension
extensions = [
    os.path.abspath(f"Settings/Addons/{file}")
    for file in os.listdir("Settings/Addons")
    if file.endswith(".xpi")
]
if not extensions:
    return

# commands to type into the devtools console
c1 = "const { AddonManager } = require('resource://gre/modules/AddonManager.jsm');"
c2 = "const { FileUtils } = require('resource://gre/modules/FileUtils.jsm');"
c3 = "AddonManager.installTemporaryAddon(new FileUtils.File('{}'));"

context = await browser.new_context()
page = await context.new_page()
page2 = await context.new_page()
await page.goto("about:debugging#/runtime/this-firefox", wait_until="domcontentloaded")
await page2.goto("about:devtools-toolbox?id=9&type=tab", wait_until="domcontentloaded")
await asyncio.sleep(1)

# tab over to the console input of the devtools toolbox
await page2.keyboard.press("Tab")
await page2.keyboard.down("Shift")
await page2.keyboard.press("Tab")
await page2.keyboard.press("Tab")
await page2.keyboard.up("Shift")
await page2.keyboard.press("ArrowRight")
await page2.keyboard.press("Enter")

# load the modules needed to build an nsIFile
await page2.keyboard.type(f"{' ' * 10}{c1}{c2}")
await page2.keyboard.press("Enter")

for extension in extensions:
    print(f"Adding extension: {extension}")
    await asyncio.sleep(0.2)
    await page2.keyboard.type(f"{' ' * 10}{c3.format(extension)}")
    await page2.keyboard.press("Enter")
    # await asyncio.sleep(0.2)

await page2.bring_to_front()
Note that there are some sleeps because the page needs time to load and Playwright cannot detect it.
I needed to add some whitespace because, for some reason, Playwright or Firefox was missing some of the first characters in the commands.
Also, if you want to install more than one add-on, I suggest you tune the amount of sleep before bringing the page to the front, in case an add-on opens a new tab.
When I use the inspect/developer tools in Chrome I can find the last-modified date from the browser, but I want to get the same date in my Node.js application.
I have already tried:
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://www.tbsnews.net/economy/bsec-chairman-stresses-restoring-investor-confidence-mutual-funds-500126');
const lastModified = await page.evaluate(() => document.lastModified);
console.log(lastModified);
Unfortunately, this code shows the current time of the new DOM creation, since we are using newPage(). Can anyone help me?
I have also tried JSDOM.
Thanks in advance.
I'm trying to run puppeteer but I keep getting the error below.
C:\Users\Diamo\Documents\Project Laura\node_modules\puppeteer-core\lib\cjs\puppeteer\node\Launcher.js:105
throw new Error(missingText);
^
Error: Could not find expected browser (chrome) locally. Run `npm install` to download the correct Chromium revision (982053).
at ChromeLauncher.launch (C:\Users\Diamo\Documents\Project Laura\node_modules\puppeteer-core\lib\cjs\puppeteer\node\Launcher.js:105:23)
at PuppeteerNode.launch (C:\Users\Diamo\Documents\Project Laura\node_modules\puppeteer-core\lib\cjs\puppeteer\node\Puppeteer.js:125:31)
at PuppeteerExtra.launch (C:\Users\Diamo\Documents\Project Laura\node_modules\puppeteer-extra\dist\index.cjs.js:129:41)
at async credittest (C:\Users\Diamo\Documents\Project Laura\laura\credit.js:8:21)
Here is my code; there isn't much here, since I'm just trying to figure out if it works.
const puppeteer = require('puppeteer-extra')

async function credittest(creditcost) {
    const StealthPlugin = require('puppeteer-extra-plugin-stealth')
    puppeteer.use(StealthPlugin())
    const browser = await puppeteer.launch()
    const page = await browser.newPage();
    await page.goto('https://www.capitalone.com/');
    await page.waitFor(2000)
    await page.click('#ods-input-0')
    await page.keyboard.type('username')
    await page.screenshot({ path: './test/example.png' });
    await browser.close();
}

credittest()
If anyone is curious what I'm doing here: I'm trying to automate grabbing balances from my bank account. Capital One loves headless browsers, and someone told me to use Puppeteer's stealth plugin. The code ran before, but once I installed the stealth extra this is the error I get.
Also, I did try npm install on puppeteer-core and puppeteer, but nothing came of it.
I am trying to scrape data from different websites using only one Puppeteer instance. I don't want to launch a new browser for each website, so I need to check whether an existing browser has already been launched and, if so, just open a new tab. I did something like the below; these are the conditions I always check before launching any browser:
const browser = await puppeteer.launch();
browser?.isConnected()
browser.process() // null if browser is still running
but still, I found that my script sometimes re-launches the browser even though an old browser has already been launched. So I am thinking of killing any old browser that has been launched; or what would be the best check? Any other good suggestions will be highly appreciated.
I'm not sure whether that specific operation (closing existing browsers) is possible through Puppeteer's API, but what I can recommend is how people usually handle this situation, which is to make sure that the browser instance is closed whenever an issue is encountered:
let browser = null;
try {
    browser = await puppeteer.launch({
        headless: true,
        args: ['--no-sandbox'],
    });
    const page = await browser.newPage();
    url = req.query.url;
    await page.goto(url);
    const bodyHTML = await page.evaluate(() => document.body.innerHTML);
    res.send(bodyHTML);
} catch (e) {
    console.log(e);
} finally {
    if (browser)
        await browser.close();
}
Otherwise, you can use shell commands like kill or pkill if you have access to the process ID of the previous browser.
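As a sketch of that process-ID approach from Node itself (not part of the original suggestion): Puppeteer exposes the underlying child process via browser.process(), so you can save its pid right after launch and signal it later. killByPid is a hypothetical helper name.

```javascript
// Sketch: terminate a previously launched browser by PID.
// With Puppeteer, the PID is available as browser.process().pid right
// after launch; persist it somewhere if the Browser object may be lost.
function killByPid(pid) {
  try {
    process.kill(pid, 'SIGTERM'); // same signal the shell `kill` sends
    return true;
  } catch (err) {
    // ESRCH means no such process, i.e. the old browser is already gone
    return false;
  }
}
```

From a shell, pkill achieves the same thing by matching the command line instead of a saved PID.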
The most reliable way of closing a Puppeteer instance that I've found is to close all of the pages within a BrowserContext, which automatically closes the browser. I've seen instances of Chromium linger in Task Manager even after calling await browser.close().
Here is how I do this:
const openAndCloseBrowser = async () => {
    const browser = await puppeteer.launch();
    try {
        // your logic
    } catch (error) {
        // error handling
    } finally {
        const pages = await browser.pages();
        for (const page of pages) await page.close();
    }
}
If you try running await browser.close() after running the loop and closing each page individually, you should see an error stating that the browser was already closed and your Task Manager should not have lingering chromium instances.
I hope you are safe.
I'm making a script which performs some scraping on a site. Now the issue is, one of the sites has a PDF, and I'm not able to read that PDF file using Puppeteer and Node.js.
I'm able to read text from the other links.
What I tried
const puppeteer = require('puppeteer')

async function printPDF() {
    const browser = await puppeteer.launch({ headless: true });
    const page = await browser.newPage();
    await page.goto('https://blog.risingstack.com', { waitUntil: 'networkidle0' });
    const pdf = await page.pdf({ format: 'A4' });
    await browser.close();
    return pdf;
}
This works for rendering a page to a PDF, but I need to go from PDF to text.
Can someone help me with this?
There is an npm module named "pdfreader". You can check that out.
FYI, this was possible in Playwright by using Firefox and navigating to a PDF file, which would be opened using PDF.js. However, recent versions of Playwright broke this functionality:
https://github.com/microsoft/playwright/issues/13157
I'm trying to web-scrape a press site, open every article link, and get the data. I was able to scrape it with Puppeteer, but I cannot upload the results to Firebase Cloud Storage. How do I do that every hour or so?
I web-scraped in an asynchronous function and then called it in the Cloud Function:
I used Puppeteer to scrape the article links from the newsroom website and then used the links to get more information from the articles. I first had everything in a single async function, but Cloud Functions threw an error that there should not be any awaits in a loop.
UPDATE:
I implemented the code above in a Firebase function but still get the no-await-in-loop error.
There are a couple of things wrong here, but you are on a good path to getting this to work. The main problem is that you can't have await within a try {} catch {} block; asynchronous JavaScript has a different way of dealing with errors. See: try/catch blocks with async/await.
In your case, it's totally fine to write everything in one async function. Here is how I would do it:
async function scrapeIfc() {
    const completeData = [];
    const url = 'https://www.ifc.org/wps/wcm/connect/news_ext_content/ifc_external_corporate_site/news+and+events/pressroom/press+releases';
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.setDefaultNavigationTimeout(0);
    await page.goto(url);

    const links = await page.evaluate(() =>
        Array.from(document.querySelectorAll('h3 > a')).map(anchor => anchor.href)
    );

    for (const link of links) {
        const newPage = await browser.newPage();
        await newPage.goto(link);
        const data = await newPage.evaluate(() => {
            const titleElement = document.querySelector('td[class="PressTitle"] > h3');
            const contactElement = document.querySelector('center > table > tbody > tr:nth-child(1) > td');
            const txtElement = document.querySelector('center > table > tbody > tr:nth-child(2) > td');
            return {
                source: 'ITC',
                title: titleElement ? titleElement.innerText : undefined,
                contact: contactElement ? contactElement.innerText : undefined,
                txt: txtElement ? txtElement.innerText : undefined,
            };
        });
        completeData.push(data);
        await newPage.close();
    }

    await browser.close();
    return completeData;
}
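If the no-await-in-loop lint rule still flags the for loop in the scraper, an alternative (not used in this answer) is to start all the page visits at once and await them together with Promise.all. A dependency-free sketch of the pattern, where scrapeLink is a stand-in for the per-link work (newPage, goto, evaluate):

```javascript
// Sketch: replace "await inside a for loop" with a single Promise.all.
// scrapeLink is a placeholder for the async per-link work.
async function scrapeAllLinks(links, scrapeLink) {
  // map() starts every task immediately; Promise.all awaits them together,
  // so there is no await statement inside any loop body.
  const results = await Promise.all(links.map(link => scrapeLink(link)));
  return results;
}
```

Note that this opens all pages concurrently; for many links you may want to process the array in chunks to limit the number of parallel tabs.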
There are a couple of other things you should note:
You have a bunch of unused imports (title, link, resolve and reject) at the head of your script, which might have been added automatically by your code editor. Get rid of them, as they might overwrite the real variables.
I changed your document.querySelector calls to be more specific, as I couldn't select the actual elements from the IFC website. You might need to revise them.
For local development I use Google's functions-framework, which helps me run and test the function locally before deploying. If you have errors on your local machine, you'll have errors when deploying to Google Cloud.
(Opinion) If you don't need Firebase, I would run this with Google Cloud Functions, Cloud Scheduler and Cloud Firestore. For me, this has been the go-to workflow for periodic web scraping.
(Opinion) Puppeteer might be overkill for scraping a simple static website, since it runs a headless browser. Something like Cheerio is much more lightweight and much faster.
Hope I could help. If you encounter other problems, let us know. Welcome to the Stack Overflow community!