I'm scraping a public website to archive it in case it goes down, and this site has some files that, when opened in Chromium, are usually downloaded straight to your downloads folder. For example, accessing https://www.7-zip.org/a/7z2201-x64.exe downloads the file instead of showing you the binary.
My code is fairly involved, but the relevant part is this:
const page = await browser.newPage();
page.on("response", async response => {
  // Saves the file where I want it, but doesn't cancel the Chromium-managed download.
  const buffer = Buffer.from(new Uint8Array(await page.evaluate((x: string) => {
    return fetch(x).then(r => r.arrayBuffer());
  }, response.url())));
  fs.writeFileSync('path', buffer);
});
await page.goto('https://www.7-zip.org/a/7z2201-x64.exe', { waitUntil: "load", timeout: 120000 });
I can't just assume the MIME type either; the page could lead to anything from an HTML file to a ZIP archive. So is it possible to disable downloads, or to rewire them to /dev/null? I've looked into response interception, and based on this it doesn't seem to be possible.
After reading a bit about /dev/null and seeing this answer, I figured out that I can do this:
const page = await browser.newPage();
await (await page.target().createCDPSession()).send("Page.setDownloadBehavior", {
  behavior: "deny",
  downloadPath: "/dev/null"
});
Setting the download path to /dev/null is redundant when the behavior is "deny", but if you don't want to fully deny downloads for a tab and also don't want them landing in your downloads folder, /dev/null will essentially delete whatever it receives on the spot.
Note that I set the download behavior before navigating to the page, and that this targets Chromium's download behavior itself rather than relying on MIME types.
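For completeness, here is a minimal sketch of that second option: keep downloads "allowed" but point them at /dev/null so whatever Chromium writes there is discarded (an assumption on my part that this matches the behavior described above; it also presumes a POSIX system, as Windows has no /dev/null path):

const cdp = await page.target().createCDPSession();
// "allow" instead of "deny": the download proceeds, but its bytes are discarded.
await cdp.send("Page.setDownloadBehavior", {
  behavior: "allow",
  downloadPath: "/dev/null"
});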
Disclaimer: I'm not a Node pro. I've read through a lot of tickets and sites today trying to solve my issues, but whenever one problem was solved, another occurred.
Currently, Puppeteer is used in the following way:
const browser = await puppeteer.launch({
  headless: true,
  ignoreHTTPSErrors: true
});
const page = await browser.newPage();
const response = await page.goto(targetUrl, {waitUntil: 'load'});
const cdp = await page.target().createCDPSession();
const cookies = (await cdp.send('Network.getAllCookies')).cookies;
const localStorage = await page.evaluate(() => Object.assign({}, window.localStorage));
const sessionStorage = await page.evaluate(() => Object.assign({}, window.sessionStorage));
This works for most pages, but when trying to grab https://cioudways.com for example, I get Execution context was destroyed, most likely because of a navigation.
After replacing {waitUntil: 'load'} with {waitUntil: 'networkidle2'}, it only fails intermittently. But when trying to grab https://github.com using networkidle2, the whole process times out with Navigation timeout of 30000 ms exceeded (with load instead of networkidle2 it works).
How can I solve this to get a stable script that works with nearly every URL?
For the first URL, the answer is literally in the error message: Execution context was destroyed, most likely because of a navigation. When you open https://cioudways.com, it immediately replaces location.href with https://www.cloudways.com/en/?id={foo}&data1={bar}&data2=in (note: this is not a regular HTTP 301 redirection, both responses are HTTP 200; HTTP 30x redirects are handled by Puppeteer), so your page is destroyed before you have the chance to evaluate it.
For this specific URL, awaiting a new load event right after page.goto() solves the issue:
await page.goto('https://cioudways.com', { waitUntil: 'load' })
await page.waitForNavigation({ waitUntil: 'load' })
Of course, this would break the script for any other target URL (it is unusual to have an additional navigation right after a site loads), so you can't apply it as a general solution.
You could use (for this specific site) the redirected page https://www.cloudways.com/ to avoid this issue.
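If you do need something closer to a general solution, one hedged workaround (my own addition, not part of the original answer) is to wait for the possible second navigation with a short timeout, so pages without a client-side redirect don't stall:

await page.goto(targetUrl, { waitUntil: 'load' })
// Wait briefly for a possible client-side redirect; if none happens
// within 3 s, swallow the timeout error and carry on.
await page.waitForNavigation({ waitUntil: 'load', timeout: 3000 })
  .catch(() => {})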
The second case has a different cause. The https://github.com page seems to have an issue with its resources. If I log all network calls:
await page.setRequestInterception(true)
page.on('request', request => {
  console.log(request.url())
  request.continue()
})
await page.goto('https://github.com', { waitUntil: 'networkidle2' })
It always stops at https://github.githubassets.com/images/modules/site/home/globe/flag.obj. I have no definite answer for this; the login page is full of canvases and animations that may affect networkidle2 (which fires when there have been no more than 2 network connections in progress for a while). It may be caused by a bug on GitHub's side. Maybe it is worth its own question.
Suggestion
As your problem lies in the unreliability of page loads, I suggest using { waitUntil: 'load' } (as this is the default, you can omit the argument completely) and pausing the page for a short while with page.waitForTimeout() to give localStorage etc. time to be filled, which also covers Angular/React apps. This is only a workaround: pausing script execution across a huge number of URLs is not a good thing, and while the hardcoded pause may not be enough for slower pages, for others it will be an unnecessarily long wait.
await page.goto(targetUrl)
await page.waitForTimeout(4000)
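If the hardcoded pause proves too brittle, a hedged alternative is to poll for the data you actually need instead of sleeping, e.g. waiting until localStorage has at least one entry (purely illustrative; the predicate and timeout are assumptions to adjust per target):

await page.goto(targetUrl)
// Wait up to 5 s for the app to write something into localStorage;
// fall through quietly if it never does.
await page.waitForFunction(() => window.localStorage.length > 0, { timeout: 5000 })
  .catch(() => {})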
I have a download GET endpoint in my express app. For now it simply reads a file from the file system and streams it after setting some headers.
When I open the endpoint in Chrome, I can see that the response is treated as a "document", while in Firefox it is treated as type png.
I can't seem to understand why it is being treated differently.
Chrome: title bar - "download"
Firefox: title bar - "image name"
In Chrome, this also leads to no caching of the image if I refresh the address bar.
In Firefox it is being cached just fine.
This is my express code:
app.get("/download", function(req, res) {
let file = `${__dirname}/graph-colors.png`;
var mimetype = "image/png";
res.set("Content-Type", mimetype);
res.set("Cache-Control", "public, max-age=1000");
res.set("Content-Disposition", "inline");
res.set("Vary", "Origin");
var filestream = fs.createReadStream(file);
filestream.pipe(res);
});
I'm also attaching screenshots of the browsers' network tabs.
These are all down to Chrome's behavior; you can test with another site, like Example.png on Wikipedia.
Chrome always treats whatever you open in the address bar as a document, regardless of what it really is. You can even load a CSS file and it will read document.
For the title, it reads download because your path is /download; according to this SO thread, you cannot change it.
For caching, Chrome apparently ignores the cache when you reload anything, page or image. If you try the Wikipedia Example.png, you will get a 304 instead of "(from cache)". (A 304 means the request was sent, and the server has implemented ETag/If-None-Match or a similar validation technique.)
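If you also want a Chrome reload to come back as a 304 instead of a full transfer, one hedged option (my suggestion, not the only approach) is to let Express serve the file via res.sendFile, which sets Content-Type and Last-Modified and answers conditional requests for you:

app.get("/download", function (req, res) {
  res.set("Cache-Control", "public, max-age=1000");
  res.set("Content-Disposition", "inline");
  // sendFile infers Content-Type from the extension and handles
  // If-Modified-Since / If-None-Match, so reloads can yield a 304.
  res.sendFile(`${__dirname}/graph-colors.png`);
});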
I've tried several solutions from around the web; nothing really works, or they are simply outdated.
Here is my WebDriver profile:
let firefox = require('selenium-webdriver/firefox');
let profile = new firefox.Profile();
profile.setPreference("pdfjs.disabled", true);
profile.setPreference("browser.download.dir", 'C:\\MYDIR');
// The MIME types must be one comma-separated string, not separate arguments.
profile.setPreference("browser.helperApps.neverAsk.saveToDisk",
  "application/pdf,application/x-pdf,application/acrobat,applications/vnd.pdf,text/pdf,text/x-pdf,application/vnd.cups-pdf");
I simply want to download a file and set the destination path. It looks like browser.download.dir is ignored.
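For what it's worth, Firefox only honors browser.download.dir when browser.download.folderList is set to 2 (0 means the desktop, 1 the default downloads folder), so a likely fix is adding that preference to the profile above:

// 2 = use the custom directory given in browser.download.dir
profile.setPreference("browser.download.folderList", 2);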
This is how I download the file:
function _getDoc(i) {
  driver.sleep(1000)
    .then(function () {
      driver.get(`http://mysite/pdf_showcase/${i}`);
      driver.wait(until.titleIs('here my pdf'), 5000);
    });
}

for (let i = 1; i < 5; i++) {
  _getDoc(i);
}
The page contains an iframe with a PDF. I can read its src attribute, but with pdfjs.disabled=true, simply visiting the page with driver.get() triggers the download (so I'm OK with that).
The only problem is that the download dir is ignored and the file is saved to Firefox's default download directory.
Side question: if I wrap _getDoc() in a for loop over the parameter i, how can I be sure I won't flood the server? If I use the same driver instance (as everyone usually does), are the requests sequential?
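On the side question: a hedged sketch (assuming the promise-based selenium-webdriver API) is to have _getDoc return its promise and chain the calls, so each navigation starts only after the previous one settles:

function _getDoc(i) {
  // Returning the promise lets the caller sequence the navigations.
  return driver.sleep(1000)
    .then(() => driver.get(`http://mysite/pdf_showcase/${i}`))
    .then(() => driver.wait(until.titleIs('here my pdf'), 5000));
}

let chain = Promise.resolve();
for (let i = 1; i < 5; i++) {
  // Each iteration waits for the previous one, so requests stay sequential.
  chain = chain.then(() => _getDoc(i));
}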
I'm making an extension in which the user sets some configs, and based on their selection a PNG file needs to be downloaded (from a remote URL) in the background and saved as a file somewhere to make it accessible later. It must be saved as a file and also injected into the page by its final path.
Reading Chrome's fileSystem API, all of the methods prompt the user before saving a file. Is there any way I can save files to a location without prompting the user for a download?
In an effort to close this years-old question, I'm quoting and expanding on wOxxOm's comments and posting them as an answer.
The key is to download the raw data as a blob type using fetch. You can then save it via an anchor tag (a link) using the download attribute. You can simulate the "click" on the link using JavaScript if you need to keep it in the background.
Here's an example (more or less copied from another SO post).
const data = ''; // Your binary image data
const fileName = 'image.png'; // hypothetical name for the saved file

const a = document.createElement("a");
document.body.appendChild(a);
a.style = "display: none";

const blob = new Blob([data], { type: "application/octet-stream" });
const url = window.URL.createObjectURL(blob);
a.href = url;
a.download = fileName; // Optional
a.click();
window.URL.revokeObjectURL(url);
This is not without its faults. For instance, it works best with small files (the exact limit varies by browser, typically around 2 MB). The download attribute also requires HTML5 support.
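To tie it back to the original question (fetching a remote PNG in the background), here is a hedged sketch of producing that blob with fetch; the URL is a placeholder:

// Fetch the remote image as a Blob, then reuse the anchor-click
// trick above. The URL below is a placeholder, not a real endpoint.
const response = await fetch("https://example.com/image.png");
const blob = await response.blob();
const url = URL.createObjectURL(blob);
// ...assign url to a.href, click, then revokeObjectURL(url) as shown above.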
I have a quick question for you guys. I'm new to making Chrome extensions, and I'm not sure whether the idea I have can be done with an extension or not; I've been looking through the API but haven't run across anything that might help. The idea: whoever installs the extension can set a PIN code. When they click the icon, it locks down the browser, so if someone else came to the browser they would only be able to access that one page and whatever it leads to; they wouldn't be able to use the URL bar or access the tabs unless permitted. Then the owner can press a hotkey, which asks for their PIN and unlocks the browser if needed. Or maybe even put it in presentation mode, but make it impossible to get out of without a password? Is this something a Chrome extension could do, or am I going at this the wrong way? I noticed there are some options in the chrome://about settings where you can compact the URL bar and also move the tabs to a side bar. Any help or direction for this would be great, thanks!
You can create an options page where the extension settings are saved, and then create an option called, e.g., DisableBrowser.
In background.js, listen for the onBeforeRequest event and check the value of the DisableBrowser variable. If it is true, return {cancel: true} from the listener; when cancel is true, the request is canceled.
In short, just set cancel to true and everything is rejected, i.e., the browser will not open any URLs while the extension is installed and enabled.
Update:
The sample code below is the content of the background.js file. It shows how to allow only the URLs on an allow-list to load successfully; all other URLs are denied and fail to open.
// Callback
var onBeforeRequestCallback = function (details) {
  // List of allowed URLs.
  // You could keep the allowed URLs in an array, or in localStorage managed
  // through the options.html page. If the URL is allowed, the request is not
  // canceled (i.e., it is permitted); otherwise it is canceled.
  if (details.url === 'https://www.google.com/' || details.url === 'http://www.bing.com/') {
    return {
      cancel: false
    };
  } else {
    return {
      cancel: true
    };
  }
};
// Filter
var onBeforeRequestFilter = {
  urls: [
    "http://*/*",
    "https://*/*"
  ]
};

// opt_extraInfoSpec
var onBeforeRequestInfo = [
  "blocking",
  "requestBody"
];

// Monitor the onBeforeRequest event
chrome.webRequest.onBeforeRequest.addListener(onBeforeRequestCallback, onBeforeRequestFilter, onBeforeRequestInfo);
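For this listener to run at all, the extension's manifest must request the matching permissions. A hedged sketch for a Manifest V2 extension (the blocking form of webRequest used above is an MV2 feature; the name and version are placeholders):

{
  "manifest_version": 2,
  "name": "Browser Lock",
  "version": "1.0",
  "permissions": [
    "webRequest",
    "webRequestBlocking",
    "http://*/*",
    "https://*/*"
  ],
  "background": {
    "scripts": ["background.js"],
    "persistent": true
  }
}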
Help Links:
options
background
onBeforeRequest
localStorage