On Heroku, Puppeteer's Network.webSocketFrameReceived event is never triggered. Why? - node.js

I have built a small app that I deployed to Heroku. Locally, the whole thing works as expected, but when deployed, the Network.webSocketFrameReceived event is never triggered. It is a Node app that runs on Express with a minimal WebSocket server.
The goal of the app is to open some URL using headless Chrome (I am using Puppeteer here), record the WebSocket frames and parse them if they contain some specific fields, then close the connection when successful and move on to the next URL.
async function openUrlAndParseFrames(page, url) {
  await new Promise(async function (resolve) {
    const parseWebsocketFrame = (response) => {
      console.log('parsing websocket frame...', response);
      let payload;
      try {
        // some parsing here
      } catch (e) {
        console.error(`Error while parsing payload ${response.response.payloadData}`)
      }
    }

    console.log('Go to url', url);
    await page.goto(url);

    const cdp = await page.target().createCDPSession();
    await cdp.send('Network.enable');
    await cdp.send('Page.enable');
    cdp.on('Network.webSocketFrameReceived', parseWebsocketFrame);
  });
}
Is it not possible to make this WebSocket connection on Heroku using Puppeteer? I never receive the "parsing websocket frame..." logs...
PS:
I am aware of the special args I need to set for Puppeteer to run on Heroku:
puppeteer.launch({ args: ['--no-sandbox', '--disable-setuid-sandbox'] });
Also, I added the buildpacks heroku/nodejs and https://github.com/jontewks/puppeteer-heroku-buildpack.

I found the answer myself. The real problem was that the IP range (from Heroku) was blocked: I never even reached the page I was trying to open, but was served a 403 from CloudFront instead.
I figured it out by logging the page content with const websiteContent = await page.content();, which showed the error page HTML.
After trying various things I decided to move away from Heroku and now successfully deployed to Google App Engine.
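For anyone hitting the same symptom, a minimal sketch of that sanity check could look like the following (the string matches are only illustrative assumptions; inspect your own page.content() output to see what the block page actually contains):
// After page.goto(url): dump the served HTML to verify the target page was actually reached.
const websiteContent = await page.content();
// Hypothetical check: a CloudFront block page typically mentions the 403 error in its HTML.
if (websiteContent.includes('403') || websiteContent.toLowerCase().includes('cloudfront')) {
  console.warn('Blocked before the app ever loaded:', websiteContent.slice(0, 500));
}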

Related

How can I speed up firebase function request?

I just migrated my website over to Firebase from localhost and it's working fine; however, my Firebase functions take a pretty significant amount of time to resolve. One of the core features is pulling files from a Google Cloud Storage bucket, which took only 3 seconds on localhost and is now taking around 3x as long after migrating.
Is there any way for me to speed up my Firebase function query time? If not, is there at least a way for me to wait for a request to resolve before redirecting to a new page?
Here is the code for pulling the file in case it helps at all.
app.get('/gatherFromStorage/:filepath', async (req, res) => {
  try {
    const { filepath } = req.params;
    const file = bucket.file(filepath);
    let fileDat = [];
    const newStream = file.createReadStream();
    newStream.setEncoding('utf8')
      .on('data', function (chunk) {
        fileDat.push(chunk);
      })
      .on('end', function () {
        console.log('done');
        res.json(fileDat);
      })
      .on('error', function (err) {
        console.log(err);
      });
  } catch (error) {
    res.status(500).send(error);
    console.log(error);
  }
});
Also, this question may come off as silly, but I just don't know the answer: when I create an Express endpoint, should each endpoint be its own Firebase function, or is it fine to wrap all my endpoints into one Firebase function?

Loading dynamic webpage with Puppeteer works on localhost but not Heroku

Node.js app with Express, deployed on Heroku. It's just dynamic webpages. Loading static webpages works fine.
Loading dynamic webpages works on localhost, but on Heroku it throws me code=H12, desc="Request timeout", service=30000ms, status=503.
In addition, fresh after doing heroku restart or making a deployment, there always seems to be one instance of a status=200 that loads only the static portion of a dynamic webpage.
Screenshot of logs here.
I've tried the following, which have all led to either the same or other unexpected results when deployed on Heroku (such as Error R14 (Memory quota exceeded) and code=H13 desc="Connection closed without response"):
Switching the Puppeteer Heroku buildpack I was using. I've tried the ones mentioned in this troubleshooting guide and this comment.
Adding headless: true in Puppeteer's launch arguments.
Adding the --no-sandbox, --disable-setuid-sandbox, --single-process, and --no-zygote flags in args of Puppeteer's launch arguments. (Reference: this comment & this comment)
Setting the waitUntil argument in Puppeteer's goto function to domcontentloaded, networkidle0 and networkidle2. (Reference: this comment)
Passing a timeout argument in Puppeteer goto function; I've tried 30000 and 60000 specifically, as well as 0 per this comment.
Using the waitForSelector function.
Clearing Heroku's build cache, as per this article.
Printing the url variable (see my code below) in the console. Output is as expected.
I've observed that:
With the code I have right now (see below), the try-catch-finally block never catches any error. It's always one of the following: I get an incomplete result (static portion of requested dynamic webpage), or the app crashes (code=H13 desc="Connection closed without response"). So I haven't been able to get anything out of attempting to print exception in the console from within the catch block.
Any ideas on how I could get this to work?
const app = express();
const puppeteer = require("puppeteer");
let port = process.env.PORT || 3000;
let browser;
...
app.listen(port, async () => {
  browser = await puppeteer.launch({
    timeout: 0,
    headless: true,
    args: [
      "--no-sandbox",
      "--disable-setuid-sandbox",
      "--single-process",
      "--no-zygote",
    ],
  });
});
...
app.get("/appropriate-route-name", async (req, res) => {
  let url = req.query.url;
  let page = await browser.newPage();
  try {
    await page.goto(url, {
      waitUntil: "networkidle2",
    });
    res.send({ data: await page.content() });
  } catch (exception) {
    res.send({ data: null });
  } finally {
    await browser.close();
  }
});
I was able to get it to work by using user-agents. Dynamic pages now load just fine on Heroku; requests no longer time out every single time.
const app = express();
const puppeteer = require("puppeteer");
let port = process.env.PORT || 3000;
var userAgent = require("user-agents");
...
app.get("/route-name", async (req, res) => {
  let url = req.query.url;
  let browser = await puppeteer.launch({
    args: ["--no-sandbox"],
  });
  let page = await browser.newPage();
  try {
    await page.setUserAgent(userAgent.toString()); // added this
    await page.goto(url, {
      timeout: 30000,
      waitUntil: "networkidle2", // or "networkidle0", depending on what you need
    });
    res.send({ data: await page.content() });
  } catch (e) {
    res.send({ data: null });
  } finally {
    await browser.close();
  }
});

puppeteer - how to intercept requests and responses only from a certain url in nodejs [duplicate]

Using Puppeteer, I'd like to load a URL in Chrome and capture the following information:
request URL
request headers
request post data
response headers text (including duplicate headers like set-cookie)
transferred response size (i.e. compressed size)
full response body
Capturing the full response body is what causes the problems for me.
Things I've tried:
Getting response content with response.buffer - this does not work if there are redirects at any point, since buffers are wiped on navigation
Intercepting requests and using getResponseBodyForInterception - this means I can no longer access the encodedLength, and I also had problems getting the correct request and response headers in some cases
Using a local proxy works, but this slowed down page load times significantly (and also changed some behavior for e.g. certificate errors)
Ideally the solution should only have a minor performance impact and have no functional differences from loading a page normally. I would also like to avoid forking Chrome.
You can enable request interception with page.setRequestInterception() for each request, and then, inside page.on('request'), use the request-promise-native module as a middleman to gather the response data before continuing the request with request.continue() in Puppeteer.
Here's a full working example:
'use strict';

const puppeteer = require('puppeteer');
const request_client = require('request-promise-native');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  const result = [];

  await page.setRequestInterception(true);

  page.on('request', request => {
    request_client({
      uri: request.url(),
      resolveWithFullResponse: true,
    }).then(response => {
      const request_url = request.url();
      const request_headers = request.headers();
      const request_post_data = request.postData();
      const response_headers = response.headers;
      const response_size = response_headers['content-length'];
      const response_body = response.body;

      result.push({
        request_url,
        request_headers,
        request_post_data,
        response_headers,
        response_size,
        response_body,
      });

      console.log(result);
      request.continue();
    }).catch(error => {
      console.error(error);
      request.abort();
    });
  });

  await page.goto('https://example.com/', {
    waitUntil: 'networkidle0',
  });

  await browser.close();
})();
Puppeteer-only solution
This can be done with Puppeteer alone. The problem you are describing, that response.buffer is cleared on navigation, can be circumvented by processing each request one after another.
How it works
The code below uses page.setRequestInterception to intercept all requests. If there is currently a request being processed/being waited for, new requests are put into a queue. Then, response.buffer() can be used without the problem that other requests might asynchronously wipe the buffer as there are no parallel requests. As soon as the currently processed request/response is handled, the next request will be processed.
Code
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const [page] = await browser.pages();

  const results = []; // collects all results

  let paused = false;
  let pausedRequests = [];

  const nextRequest = () => { // continue the next request or "unpause"
    if (pausedRequests.length === 0) {
      paused = false;
    } else {
      // continue first request in "queue"
      (pausedRequests.shift())(); // calls the request.continue function
    }
  };

  await page.setRequestInterception(true);
  page.on('request', request => {
    if (paused) {
      pausedRequests.push(() => request.continue());
    } else {
      paused = true; // pause, as we are processing a request now
      request.continue();
    }
  });

  page.on('requestfinished', async (request) => {
    const response = await request.response();

    const responseHeaders = response.headers();
    let responseBody;
    if (request.redirectChain().length === 0) {
      // body can only be accessed for non-redirect responses
      responseBody = await response.buffer();
    }

    const information = {
      url: request.url(),
      requestHeaders: request.headers(),
      requestPostData: request.postData(),
      responseHeaders: responseHeaders,
      responseSize: responseHeaders['content-length'],
      responseBody,
    };
    results.push(information);

    nextRequest(); // continue with next request
  });
  page.on('requestfailed', (request) => {
    // handle failed request
    nextRequest();
  });

  await page.goto('...', { waitUntil: 'networkidle0' });
  console.log(results);

  await browser.close();
})();
I would suggest searching for a quick proxy server that can write request logs together with the actual content.
The target setup is to let the proxy server just write a log file, and then analyze the log, searching for the information you need.
Don't intercept requests while the proxy is working (this will lead to a slowdown).
The performance issues you may encounter (with the proxy-as-logger setup) are mostly related to TLS support; make sure the proxy setup allows a quick TLS handshake and the HTTP/2 protocol.
E.g. Squid benchmarks show that it is able to process hundreds of requests per second, which should be enough for testing purposes.
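If you go down that route, pointing Puppeteer at the logging proxy is just a launch flag. A minimal sketch, assuming a proxy such as Squid is already listening on its default 127.0.0.1:3128 and is configured to log what you need:
const puppeteer = require('puppeteer');
(async () => {
  // All of Chromium's traffic is routed through the local logging proxy.
  const browser = await puppeteer.launch({
    args: ['--proxy-server=127.0.0.1:3128'],
  });
  const page = await browser.newPage();
  await page.goto('https://example.com/', { waitUntil: 'networkidle0' });
  await browser.close();
  // Request/response details are then read from the proxy's logs, not from Puppeteer itself.
})();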
I would suggest using a tool named Fiddler. It will capture all the information that you mentioned when you load a URL.
Here's my workaround which I hope will help others.
I had issues with the await page.setRequestInterception(True) command blocking the flow and making the page hang until the timeout.
So I added this function
async def request_interception(req):
    """ await page.setRequestInterception(True) would block the flow, the interception is enabled individually """
    # enable interception
    req.__setattr__('_allowInterception', True)
    if req.url.startswith('http'):
        print(f"\nreq.url: {req.url}")
        print(f" req.resourceType: {req.resourceType}")
        print(f" req.method: {req.method}")
        print(f" req.postData: {req.postData}")
        print(f" req.headers: {req.headers}")
        print(f" req.response: {req.response}")
    return await req.continue_()
Then I removed the await page.setRequestInterception(True) and called the function above with
page.on('request', lambda req: asyncio.ensure_future(request_interception(req)))
in my main().
Without the req.__setattr__('_allowInterception', True) statement, Pyppeteer would complain about interception not being enabled for some requests, but with it everything works fine for me.
Just in case someone is interested in the system I'm running Pyppeteer on:
Ubuntu 20.04
Python 3.7 (venv)
...
pyee==8.1.0
pyppeteer==0.2.5
python-dateutil==2.8.1
requests==2.25.1
urllib3==1.26.3
websockets==8.1
...
I also posted the solution at https://github.com/pyppeteer/pyppeteer/issues/198
Cheers
Go to Chrome, press F12, then go to the "Network" tab; there you can see all the HTTP requests the website sends, and you'll be able to see the details you mentioned.

Why does puppeteer page.goto() throw a timeout error?

Why does the following code throw this error?
Navigation Timeout Exceeded: 60000ms exceeded
I'm using puppeteer version 1.19.0
const puppeteer = require("puppeteer");

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.setCacheEnabled(false);

  try {
    const response = await page.goto("https://www.gatsbyjs.com", {
      waitUntil: "networkidle0",
      timeout: 60000
    });
    console.log("Status code:", response.status());
  } catch (error) {
    console.log(error.message);
  }

  await browser.close();
})();
Some other URLs work fine, so I wonder if there is anything special with this particular URL?
If you change the waitUntil option to "networkidle2", there is no timeout.
networkidle2 - consider navigation to be finished when there are no more than 2 network connections for at least 500 ms.
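Applied to the snippet from the question, the change is a single option (a sketch; everything else stays as in the original code):
// Same call as in the question, but waiting for "networkidle2" instead of "networkidle0".
const response = await page.goto("https://www.gatsbyjs.com", {
  waitUntil: "networkidle2",
  timeout: 60000
});
console.log("Status code:", response.status());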
As pointed out in Erez's answer, a service worker may be holding the connection. You can check it by going to chrome://serviceworker-internals/, or DevTools -> Application tab -> Service Workers.
Service Worker: chrome://serviceworker-internals/
Scope: https://www.gatsbyjs.com/
Registration ID: 295
Navigation preload enabled: false
Navigation preload header length: 4
Active worker:
Installation Status: ACTIVATED
Running Status: RUNNING
Fetch handler existence: EXISTS
Script: https://www.gatsbyjs.com/sw.js
Version ID: 10330
Renderer process ID: 11892
Renderer thread ID: 18124
DevTools agent route ID: 8
From Network : installingWorker ServiceWorker {scriptURL: "https://www.gatsbyjs.com/sw.js", state: "installing", onerror: null, onstatechange: null}
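If you want to test whether the service worker really is what keeps networkidle0 from ever settling, one thing you can try is bypassing it through the DevTools Protocol. This is only a diagnostic sketch: Network.setBypassServiceWorker is an experimental CDP command, and bypassing the worker changes how the page loads.
// Sketch: route requests around the page's service worker before navigating,
// then check whether waitUntil: "networkidle0" resolves.
const client = await page.target().createCDPSession();
await client.send('Network.enable');
await client.send('Network.setBypassServiceWorker', { bypass: true });
await page.goto("https://www.gatsbyjs.com", { waitUntil: "networkidle0", timeout: 60000 });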
References:
Navigation Timeout Exceeded when using networkidle0 and no insight into what timed out
Support ServiceWorkers #2634
Removing waitUntil: "networkidle0" works so I'm assuming the site is still holding a connection to the server.
I couldn't figure out which connection it is (maybe the service worker?) using the developer tools (accessible in non-headless mode by running await puppeteer.launch({ headless: false })).
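In other words, letting page.goto fall back to its default waitUntil of "load" avoids waiting on that lingering connection (a sketch based on the question's code):
// Default waitUntil is "load", so navigation resolves once the load event fires,
// without waiting for the network to go completely idle.
const response = await page.goto("https://www.gatsbyjs.com", { timeout: 60000 });
console.log("Status code:", response.status());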

This is a general expressjs running on node.js inside a docker container and on the cloud question

I have built two docker images. One with nginx that serves my angular web app and another with node.js that serves a basic express app. I have tried to access the express app from my browser in two different tabs at the same time.
In one tab the angular dev server (ng serve) serves up the web page. In the other tab the docker nginx container serves up the web page.
While accessing the Node.js Express app at the same time from both tabs, the data starts to mix and mingle, and the results returned to both tabs are a mishmash of the two requests (one from each browser tab)...
I'll try and make this more simple by showing my express app code here...but to answer this question you may not even need to know what the code is at all...so maybe check the question as stated below the code first.
'use strict';

/***********************************
GOOGLE GMAIL AND OAUTH SETUP
***********************************/
const fs = require('fs');
const { google } = require('googleapis');
const gmail = google.gmail('v1');
const clientSecretJson = JSON.parse(fs.readFileSync('./client_secret.json'));
const oauth2Client = new google.auth.OAuth2(
  clientSecretJson.web.client_id,
  clientSecretJson.web.client_secret,
  'https://us-central1-labelorganizer.cloudfunctions.net/oauth2callback'
);

/***********************************
EXPRESS WITH CORS SETUP
***********************************/
const PORT = 8000;
const HOST = '0.0.0.0';

const express = require('express');
const cors = require('cors');
const cookieParser = require('cookie-parser');
const bodyParser = require('body-parser');

const whiteList = [
  'http://localhost:4200',
  'http://localhost:80',
  'http://localhost',
];

const googleApi = express();
googleApi.use(
  cors({
    origin: whiteList
  }),
  cookieParser(),
  bodyParser()
);

function getPageOfThreads(pageToken, userId, labelIds) {
  return new Promise((resolve, reject) => {
    gmail.users.threads.list(
      {
        'auth': oauth2Client,
        'userId': userId,
        'labelIds': labelIds,
        'pageToken': pageToken
      },
      (error, response) => {
        if (error) {
          console.error(error);
          reject(error);
        }
        resolve(response.data);
      }
    );
  });
}

async function getPages(nextPageToken, userId, labelIds, result) {
  while (nextPageToken) {
    let pageOfThreads = await getPageOfThreads(nextPageToken, userId, labelIds);
    console.log(pageOfThreads.nextPageToken);
    pageOfThreads.threads.forEach((thread) => {
      result = result.concat(thread.id);
    });
    nextPageToken = pageOfThreads.nextPageToken;
  }
  return result;
}

googleApi.post('/threads', (req, res) => {
  console.log(req.body);
  let threadIds = [];
  oauth2Client.credentials = req.body.token;

  let getAllThreadIds = new Promise((resolve, reject) => {
    gmail.users.threads.list(
      { 'auth': oauth2Client, 'userId': 'me', 'maxResults': 500 },
      (err, response) => {
        if (err) {
          console.error(err);
          reject(err);
        }
        if (response.data.threads) {
          response.data.threads.forEach((thread) => {
            threadIds = threadIds.concat(thread.id);
          });
        }
        if (response.data.nextPageToken) {
          getPages(response.data.nextPageToken, 'me', ['INBOX'], threadIds).then(result => {
            resolve(result);
          }).catch((err) => {
            console.error(err);
            reject(err);
          });
        } else {
          resolve(threadIds);
        }
      }
    );
  });

  getAllThreadIds
    .then((result) => {
      res.send({ threadIds: result });
    })
    .catch((error) => {
      res.status(500).send({ error: 'Request failed with error: ' + error });
    });
});

googleApi.get('/', (req, res) => res.send('Hello World!'));

googleApi.listen(PORT, HOST);
console.log(`Running on http://${HOST}:${PORT}`);
The angular app makes a simple request to the express app and waits for the reply...which it properly receives...but when I try to make two requests at the exact same time data starts to get mixed together and results are given back to each browser tab from different accounts...
...and the question is... When running containers in the cloud is this kind of thing an issue? Does one need to spin up a new container for each client that wants to actively connect to the express service so that their data doesn't get mixed?
...or is this an issue I am seeing only because the Express app is being accessed locally from inside my machine? If two machines with two different IP addresses tried to access this Express server at the same time, would this sort of data mixing still be an issue, or would each get back its own set of results?
Is this why people use CaaS instead of IaaS solutions?
FYI: this is demo code and the data will not be actually going back to the consumer directly...plans are to have it placed into a database and then re-extracted from the database to download all of the metadata headers for each email.
-Thank you for your time
I can only clear up a small part of this question:
When running containers in the cloud is this kind of thing an issue?
No. Docker is not causing any of the quirky behaviour that you are describing.
Does one need to spin up a new container for each client?
A Docker container can generally serve as many users as the application inside of it can. So as long as your application can handle a lot of users (and it should), you don't have to start the same application in multiple containers. That said, when you expect a very large number of customers, there are Docker tools like Docker Compose, Docker Swarm and a lot of alternatives that will enable you to scale up later. For now, you don't need to worry about this at all.
I think I may have found out the issue with my code...and this is actually very important if you are using the node.js googleapis client library...
It is entirely necessary to create a new oauth2Client for each request that comes in:
const oauth2Client = new google.auth.OAuth2(
  clientSecretJson.web.client_id,
  clientSecretJson.web.client_secret,
  'https://us-central1-labelorganizer.cloudfunctions.net/oauth2callback'
);
Problem:
When this oauth2Client is shared, it is shared by each and every person that connects at the same time. So it is necessary to create a new one each and every time a user connects to my /threads endpoint, so that they do not share the same memory space (i.e. access_token etc.) while the processing is done.
Setting the client secret etc. and creating the oauth2Client just once at the top and then simply resetting the token for each request leads to the conflicts mentioned above.
Solution:
For now, simply moving the creation of this oauth2Client into each and every request that comes in makes this work properly.
Each client that connects to the service NEEDS its own newly created oauth2Client instance, or these types of conflicts will occur...
...it's kind of a no-brainer, but I still find it odd that there is nothing about this in the docs, and their own examples (https://github.com/googleapis/google-api-nodejs-client) seem to show only one instance being created for the whole app...but those examples are snippets, so...
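A minimal sketch of that change, based on the handler above (only the relevant lines; pagination, error handling and the helper functions stay as in the original, but they need to receive this request-scoped client instead of a module-level one):
googleApi.post('/threads', (req, res) => {
  // Create a fresh OAuth2 client per request so concurrent users never share credentials.
  const oauth2Client = new google.auth.OAuth2(
    clientSecretJson.web.client_id,
    clientSecretJson.web.client_secret,
    'https://us-central1-labelorganizer.cloudfunctions.net/oauth2callback'
  );
  oauth2Client.credentials = req.body.token;

  // ...the rest of the /threads handler is unchanged, except that every
  // gmail.users.threads.list call uses this request-scoped oauth2Client.
});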
