Puppeteer in headless false mode on ubuntu error - node.js

I have this web-scraping function
async () => {
try {
const browser = await puppeteer.launch({
headless: false,
ignoreHTTPSErrors: true,
args: ['--no-sandbox', "--disabled-setupid-sandbox"]
})
const page = await browser.newPage();
await page.goto('https://finviz.com/map.ashx');
await page.waitForTimeout(3000);
await page.click('.content #root > div > div:nth-child(3) > button:nth-child(1)'); // Fullscreen
await page.click('.content #root > div > div:nth-child(3) > button:nth-child(2)'); // Share map
await page.waitForTimeout(3000);
const imageUrl = await page.$eval('img.w-full', el => el.src);
console.log(imageUrl);
await browser.close();
} catch (err) {
console.log(err);
}
};
When I try to run it on ubuntu I get an error
Missing X server or $DISPLAY The platform failed to initialize.
Exiting. NaCl helper process running without a sandbox!
If I try to run it in headless mode, I get an error
Error: No node found for selector: .content #root > div > div:nth-child(3) > button:nth-child(1)
On my local machine the script runs fine in mode headless : true
How can you get out of this situation?

Related

Node puppeteer scraping YouTube and encountering redirected you too many times

I'm trying to scrape a YouTube playlists URL using Node / puppeteer. It was working, but now I'm getting ERR_TOO_MANY_REDIRECTS error. I can still access the page using chrome from my desktop.
I've tried using the chromium browser and chrome browsers. I've also tried using the puppeteer-extra stealth plugin and the random-useragent.
This is how my code stand at the moment:
const browser = await puppeteer.launch({
stealth: true,
headless: false // true,
executablePath: "C:\\Program Files (x86)\\Google\\Chrome\\Application\\chrome.exe",
args: [
'--disable-notifications', '--disable-features=site-per-process'
],
defaultViewport: null
});
const page = await browser.newPage()
await page.setUserAgent(random_useragent.getRandom());
await page.goto(<playlist-url, {
waitUntil: 'networkidle2',
timeout: 0
})
await page.waitForSelector('button[aria-label="Agree to the use of cookies and other data for the purposes described"')
It at the page.goto it bombs. And it happens even if I try going to https://www.youtube.com.
Any suggestions what I should try next. I tried a proxy server but couldn't get it to work. I suspect I need a proxy to actually route through.
If all you need is playlist IDs for a given channel, it's possible to query a feed at:
https://youtube.com/feeds/videos.xml?channel_id=<Channel ID>
To get IDs of videos you can query a feed at:
https://youtube.com/feeds/videos.xml?playlist_id=PLAYLIST_ID
You can get playlists (and Mixes) links from YouTube like in the code example below (also check full code the online IDE):
const puppeteer = require("puppeteer-extra");
const StealthPlugin = require("puppeteer-extra-plugin-stealth");
puppeteer.use(StealthPlugin());
const searchString = "java course";
const requestParams = {
baseURL: `https://www.youtube.com`,
encodedQuery: encodeURI(searchString), // what we want to search for in URI encoding
};
async function fillPlaylistsDataFromPage(page) {
const dataFromPage = await page.evaluate((requestParams) => {
const mixes = Array.from(document.querySelectorAll("#contents > ytd-radio-renderer")).map((el) => ({
title: el.querySelector("a > h3 > #video-title")?.textContent.trim(),
link: `${requestParams.baseURL}${el.querySelector("a#thumbnail")?.getAttribute("href")}`,
videos: Array.from(el.querySelectorAll("ytd-child-video-renderer a")).map((el) => ({
title: el.querySelector("#video-title")?.textContent.trim(),
link: `${requestParams.baseURL}${el.getAttribute("href")}`,
length: el.querySelector("#length")?.textContent.trim(),
})),
thumbnail: el.querySelector("a#thumbnail #img")?.getAttribute("src"),
}));
const playlists = Array.from(document.querySelectorAll("#contents > ytd-playlist-renderer")).map((el) => ({
title: el.querySelector("a > h3 > #video-title")?.textContent.trim(),
link: `${requestParams.baseURL}${el.querySelector("a#thumbnail")?.getAttribute("href")}`,
channel: {
name: el.querySelector("#channel-name a")?.textContent.trim(),
link: `${requestParams.baseURL}${el.querySelector("#channel-name a")?.getAttribute("href")}`,
},
videoCount: el.querySelector("yt-formatted-string.ytd-thumbnail-overlay-side-panel-renderer")?.textContent.trim(),
videos: Array.from(el.querySelectorAll("ytd-child-video-renderer a")).map((el) => ({
title: el.querySelector("#video-title")?.textContent.trim(),
link: `${requestParams.baseURL}${el.getAttribute("href")}`,
length: el.querySelector("#length")?.textContent.trim(),
})),
thumbnail: el.querySelector("a#thumbnail #img")?.getAttribute("src"),
}));
return [...mixes, ...playlists];
}, requestParams);
return dataFromPage;
}
async function getYoutubeSearchResults() {
const browser = await puppeteer.launch({
headless: false,
args: ["--no-sandbox", "--disable-setuid-sandbox"],
});
const page = await browser.newPage();
const URL = `${requestParams.baseURL}/results?search_query=${requestParams.encodedQuery}`;
await page.setDefaultNavigationTimeout(60000);
await page.goto(URL);
await page.waitForSelector("#contents > ytd-video-renderer");
const playlists = await fillPlaylistsDataFromPage(page);
await browser.close();
return playlists;
}
getYoutubeSearchResults().then(console.log);
📌Note: to get thumbnail you need to scroll playlist into view (using .scrollIntoView() method).
Output:
[
{
"title":"Java Complete Course | Placement Series",
"link":"https://www.youtube.com/watch?v=yRpLlJmRo2w&list=PLfqMhTWNBTe3LtFWcvwpqTkUSlB32kJop",
"channel":{
"name":"Apna College",
"link":"https://www.youtube.com/c/ApnaCollegeOfficial"
},
"videoCount":"35",
"videos":[
{
"title":"Introduction to Java Language | Lecture 1 | Complete Placement Course",
"link":"https://www.youtube.com/watch?v=yRpLlJmRo2w&list=PLfqMhTWNBTe3LtFWcvwpqTkUSlB32kJop",
"length":"18:46"
},
{
"title":"Variables in Java | Input Output | Complete Placement Course | Lecture 2",
"link":"https://www.youtube.com/watch?v=LusTv0RlnSU&list=PLfqMhTWNBTe3LtFWcvwpqTkUSlB32kJop",
"length":"42:36"
}
],
"thumbnail":null
},
{
"title":"Java Tutorials For Beginners In Hindi",
"link":"https://www.youtube.com/watch?v=ntLJmHOJ0ME&list=PLu0W_9lII9agS67Uits0UnJyrYiXhDS6q",
"channel":{
"name":"CodeWithHarry",
"link":"https://www.youtube.com/c/CodeWithHarry"
},
"videoCount":"113",
"videos":[
{
"title":"Introduction to Java + Installing Java JDK and IntelliJ IDEA for Java",
"link":"https://www.youtube.com/watch?v=ntLJmHOJ0ME&list=PLu0W_9lII9agS67Uits0UnJyrYiXhDS6q",
"length":"19:00"
},
{
"title":"Basic Structure of a Java Program: Understanding our First Java Hello World Program",
"link":"https://www.youtube.com/watch?v=zIdg7hkqNE0&list=PLu0W_9lII9agS67Uits0UnJyrYiXhDS6q",
"length":"14:09"
}
],
"thumbnail":null
}
]
You can read more about scraping YouTube playlists from blog post Web scraping YouTube secondary search results with Nodejs.

Run Chai and Puppeteer Test in Google Cloud Functions not Working

I am running a Script based in Puppeteer, Chai and Mocha that runs locally fine, but not in Google Cloud functions.
I can´t make chai work in Google Cloud Functions.
This is the error I get:
and this is Code I am using in index.js :
/**
* Responds to any HTTP request.
*
* #param {!express:Request} req HTTP request context.
* #param {!express:Response} res HTTP response context.
*/
const puppeteer = require('puppeteer');
//const expect = require('chai')
describe('My First Puppeteer Test', () => {
let browser
let page
before(async function () {
browser = await puppeteer.launch({
headless: true,
devtools: false,
})
page = await browser.newPage()
await page.setDefaultTimeout(100000)
await page.setDefaultNavigationTimeout(20000)
})
after(async function () {
await browser.close()
})
beforeEach(async function () {
//Runs before each Test step
})
afterEach(async function () {
//Runs after each Test step
})
it('login', async function () {
await page.goto('https://my.website.com/user/')
await page.goto('https://my.website.com/user/')
await page.waitForSelector('#mat-input-0')
await page.type('#mat-input-0', 'MyUsername')
await page.waitForTimeout(1000)
await page.waitForSelector('#mat-input-1')
await page.type('#mat-input-1', 'MyPassword')
await page.waitForTimeout(1000)
await page.click(
'body > sbnb-root > sbnb-login > section > article > sbnb-login-form > form > button'
)
await page.waitForTimeout(10000)
})
it('Go To metrics', async function () {
await page.goto('https://my.website.com/stadistic')
//Click Reviews
await page.waitForSelector(
'body > sbnb-root > sbnb-exports > div > main > section.create-export-container > div:nth-child(2) > section:nth-child(2) > div > div:nth-child(1) > div > a'
)
await page.click(
'body > sbnb-root > sbnb-exports > div > main > section.create-export-container > div:nth-child(2) > section:nth-child(2) > div > div.text__small'
)
await page.waitForTimeout(10000)
// Click Single File
await page.waitForSelector(
'body > sbnb-root > sbnb-exports > div > main > section.create-export-container > div:nth-child(8) > div:nth-child(1)'
)
await page.click(
'body > sbnb-root > sbnb-exports > div > main > section.create-export-container > div:nth-child(8) > div:nth-child(1)'
)
await page.waitForTimeout(10000)
})
it('Click send and close', async function () {
// Click Send File
await page.waitForSelector(
'body > sbnb-root > sbnb-exports > div > main > section.create-export-container > div.export-final > button'
)
await page.click(
'body > sbnb-root > sbnb-exports > div > main > section.create-export-container > div.export-final > button'
)
await page.waitForTimeout(80000)
})
})
and this is how my package.json looks like:
{
"name": "sample-http",
"version": "0.0.1",
"scripts": {
"test": "mocha --reporter spec"
},
"dependencies": {
"chai": "^4.3.4",
"mocha": "^9.1.3",
"puppeteer": "^1.20.0"
}
}
How I can I make it work in Google Cloud Functions?
Some ideas about how to rewrite the code just using puppeteer in a way that it works?
Based on the GCP documentation and a related question, it looks like the proper way to test a Cloud Function is to separate the testing and the actual code to be tested. My suggestion is to build your Puppeteer function and deploy it (here is a guide for using Puppeteer with Cloud functions), and then perform tests against it, as shown in the documentation.
As for the specific error you are seeing, it appears that it’s caused by not running the tests through the correct mocha commands. You will be able to run these commands by refactoring your code according to the GCP documentation.

Package puppeteer don't work properly (centos 7)

I used the package puppeteer to generate pdf in nodejs in local developement i use lubuntu 18.04 and every thing works great but in production after deploying the code in centos 7 it works sometimes but the majority of time it doesn't work.
i installed necessary packages of chromium using this command:
sudo yum install pango.x86_64 libXcomposite.x86_64 libXcursor.x86_64 libXdamage.x86_64 libXext.x86_64 libXi.x86_64 libXtst.x86_64 cups-libs.x86_64 libXScrnSaver.x86_64 libXrandr.x86_64 GConf2.x86_64 alsa-lib.x86_64 atk.x86_64 gtk3.x86_64 ipa-gothic-fonts xorg-x11-fonts-100dpi xorg-x11-fonts-75dpi xorg-x11-utils xorg-x11-fonts-cyrillic xorg-x11-fonts-Type1 xorg-x11-fonts-misc
I added the argument {args:['--no-sandbox']} but the same problem.
This is my code :
const generator = (data, fileName, prepa) =>
new Promise(async (resolve, reject) => {
try {
const browser = await puppeteer.launch({ args: ["--no-sandbox"] });
const page = await browser.newPage();
let content;
if (prepa === true) {
content = templateCarnetPrepa(data);
} else {
content = templateCarnet(data);
}
await page.setContent(content);
await page.addStyleTag({ path: "./bootstrap.min.css" });
await page.emulateMediaType("screen");
await page.pdf({
path: `./${ENV.MEDIA_STORAGE}/carnet/${fileName}.pdf`,
format: "A4",
margin: { top: 20, left: 20, right: 20, bottom: 20 },
printBackground: true,
});
await browser.close();
//process.exit()
resolve(0);
} catch (e) {
console.log(e);
reject(e);
}
});
Does any one have a solution to this problem ?

Using the default chrome profile with puppeteer that my chrome app uses

I'm having issues getting puppeteer to use the default profile that my chrome browser uses. I've tried setting path to the user profile, but when I go to a site with puppeteer that I know is saved with chrome app's userDataDir, there's nothing saved there. What am I doing wrong? I appreciate any help!
const browser = await puppeteer.launch({
headless: false,
userDataDir: 'C:\\Users\\Bob\\AppData\\Local\\Google\\Chrome\\User Data',
}).then(async browser => {
I've also tried userDataDir: 'C:/Users/Phil/AppData/Local/Google/Chrome/User Data',, but still nothing.
UPDATED:
const username = os.userInfo().username;
(async () => {
try {
const browser = await puppeteer.launch({
headless: false, args: [
`--user-data-dir=C:/Users/${username}/AppData/Local/Google/Chrome/User Data`]
}).then(async browser => {
I had same exact issue before. However connecting my script to a real chrome instance helped to solve a lot of problems specially the profile one.
You can see the steps here:
https://medium.com/#jaredpotter1/connecting-puppeteer-to-existing-chrome-window-8a10828149e0
//MACOS
/*
Open this instance first:
/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome --remote-debugging-port=9222 --no-first-run --no-default-browser-check --user-data-dir=$(mktemp -d -t 'chrome-remote_data_dir')
// Windows:
- Add this to Target of launching chrome --remote-debugging-port=9222
- Navigate to http://127.0.0.1:9222/json/version
- copy webSocketDebuggerUrl
More Info: https://medium.com/#jaredpotter1/connecting-puppeteer-to-existing-chrome-window-8a10828149e0
*/
// Puppeteer Part
// Always update this socket after running the instance in terminal (look up ^)
and this is abstracted controller written in Typescript, that I always use in any project:
import * as puppeteer from 'puppeteer';
import { Browser } from 'puppeteer/lib/cjs/puppeteer/common/Browser';
import { Page } from 'puppeteer/lib/cjs/puppeteer/common/Page';
import { PuppeteerNode } from 'puppeteer/lib/cjs/puppeteer/node/Puppeteer';
import { getPuppeteerWSUrl } from './config/config';
export default class Puppeteer {
public browser: Browser;
public page: Page;
getBrowser = () => {
return this.browser;
};
getPage = () => {
return this.page;
};
init = async () => {
const webSocketUrl = await getPuppeteerWSUrl();
try {
this.browser = await ((puppeteer as unknown) as PuppeteerNode).connect({
browserWSEndpoint: webSocketUrl,
defaultViewport: {
width: 1920,
height: 1080,
},
});
console.log('BROWSER CONNECTED OK');
} catch (e) {
console.error('BROWSER CONNECTION FAILED', e);
}
this.page = await this.browser.newPage();
this.page.on('console', (log: any) => console.log(log._text));
};
}
Abstracted webosocket fecther:
import axios from "axios";
import { exit } from "process";
export const getPuppeteerWSUrl = async () => {
try {
const response = await axios.get("http://127.0.0.1:9222/json/version");
return response.data.webSocketDebuggerUrl;
} catch (error) {
console.error("Can not get puppeteer ws url. error %j", error);
console.info(
"Make sure you run this command (/Applications/Google Chrome.app/Contents/MacOS/Google Chrome --remote-debugging-port=9222 --no-first-run --no-default-browser-check --user-data-dir=$(mktemp -d -t 'chrome-remote_data_dir')) first on a different shell"
);
exit(1);
}
};
Feel free to adjust the template to suit whatever you enviroment/tools currrently look like.

While using capture-website, puppeteer with webpack throwing error:browser not downloaded, at ChromeLauncher.launch (webpack-internal)

I am trying to take screenshot by providing html file on node js.
I have used capture-website package.
Here is the code:
try{
await captureWebsite.file('file.html', 'file.png', {overwrite: true}, function (error) {
if (error) {
console.log('error',error);
}
});
}catch(e){
console.log('error in capture image:', e)
}
Version:
node:12.16.1
capture-website: "0.8.1"
Angular : 7
You can do something like:
async function screenshotFromHtml ({ html, timeout = 2000 }: ScreenshotOptions) {
const browser = await puppeteer.launch({
headless: !process.env.DEBUG_HEADFULL,
args: [
'--no-sandbox',
'--disable-setuid-sandbox',
]
})
const page = await browser.newPage()
// Set viewport to something big
// Prevents Carbon from cutting off lines
await page.setViewport({
width: 2560,
height: 1080,
deviceScaleFactor: 2
})
page.setContent(html)
const base64 = await page.screenshot({ encoding: "base64" }) as string;
// Wait some more as `waitUntil: 'load'` or `waitUntil: 'networkidle0'
await page.waitFor(timeout)
// Close browser
await browser.close()
return base64
}
This code is in typescript, but you can use the function body in your JS project
In this github file you can see a html render code too

Resources