I often come across PhantomJS being used to run a headless browser.
What's the difference between a normal browser (user-invoked, or driven by regular Selenium WebDriver code) and a headless browser?
In particular, I'd like to clarify how these browser features behave:
headers
local storage
cookies
A headless browser is, by definition, a web browser without a graphical user interface (GUI).
Normally, interaction with a website is done with a mouse and keyboard in a browser with a GUI, while most headless browsers provide an API to manipulate the page/DOM, download resources, etc. So instead of actually clicking an element with the mouse, for example, a headless browser lets you click the element from code.
Example of interacting with a page using PhantomJS:
page.evaluate(function() {
    // Fill in the form on the page
    document.getElementById('Name').value = 'John Doe';
    document.getElementById('Email').value = 'john.doe@john.doe';
    // Submit (assumes jQuery is available on the page)
    $('#SubmitButton').click();
});
Headers, local storage and cookies work the same way in most headless browsers as they do in regular browsers with a GUI, provided they are implemented. PhantomJS and HtmlUnit have support for all of these features.
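For example, custom request headers can be set in PhantomJS through the page.customHeaders property. A minimal sketch (the header names and values are made up):

var page = require('webpage').create();

// These headers will be sent with every request this page makes
page.customHeaders = {
    'X-Test-Header': 'foo',
    'Accept-Language': 'en-US'
};

page.open('http://localhost/', function (status) {
    console.log('Page loaded with status: ' + status);
    phantom.exit();
});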
In PhantomJS, you may also add your own cookies. For example, you may copy a cookie from Chrome and programmatically add it to the PhantomJS browser at runtime. It will automatically be added to page requests for the given domain.
Adding a cookie to a page before loading it:
var webPage = require('webpage');
var page = webPage.create();

phantom.addCookie({
    'name'    : 'Valid-Cookie-Name',  /* required property */
    'value'   : 'Valid-Cookie-Value', /* required property */
    'domain'  : 'localhost',
    'path'    : '/',                  /* required property */
    'httponly': false,
    'secure'  : false,
    'expires' : (new Date()).getTime() + (1000 * 60 * 60) /* <-- expires in 1 hour */
});

page.open('localhost', function (status) {
    // Cookie automatically added to request headers for localhost
    ...
});
For some runnable examples using PhantomJS, see the PhantomJS examples page.
I am trying to delete the history in a headless=false browser with Node.js Puppeteer using the code below, but none of the methods work.
await page.goto('chrome://settings/clearBrowserData');
await page.keyboard.down('Enter');
Second attempt:
await page.keyboard.down('ControlLeft');
await page.keyboard.down('ShiftLeft');
await page.keyboard.down('Delete');
await page.keyboard.down('Enter');
I tried using the .evaluateHandle() and .click() functions too, but none of them work. If anyone knows how to clear the history with Puppeteer, please answer.
It is not possible to navigate to the browser settings pages (chrome://...) like that.
You have three options:
Use an incognito window (called context in puppeteer)
Use a command from the Chrome DevTools Protocol to clear history.
Restart the browser
Option 1: Use an incognito window
To clear the history (including cookies and any other data), you can use an "incognito" window, called a BrowserContext in puppeteer.
You create a context by calling browser.createIncognitoBrowserContext(). Quote from the docs:
Creates a new incognito browser context. This won't share cookies/cache with other browser contexts.
Example
const context = await browser.createIncognitoBrowserContext();
const page = await context.newPage();
// Execute your code
await page.goto('...');
// ...
await context.close(); // clear history
This example will create a new incognito browser window and open a page inside. From there on you can use the page handle like you normally would.
To clear any cookies or history inside, simply close the context via context.close().
Option 2: Use the Chrome DevTools Protocol to clear history
If you cannot rely on using contexts (as they are not supported when using extensions), you can use the Chrome DevTools Protocol to clear the browser's history. It offers functions for resetting cookies and cache which are not implemented in puppeteer itself. You can call Chrome DevTools Protocol functions directly by using a CDPSession.
Example
const client = await page.target().createCDPSession();
await client.send('Network.clearBrowserCookies');
await client.send('Network.clearBrowserCache');
This will instruct the browser to clear the cookies and cache by directly calling Network.clearBrowserCookies and Network.clearBrowserCache.
Option 3: Restart the browser
If both approaches are not feasible, you can always restart the browser by closing the old instance and creating a new one. This will clear any stored data.
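A minimal puppeteer sketch of this pattern (the withFreshBrowser helper name is just for illustration):

const puppeteer = require('puppeteer');

// Run a task in a brand-new browser instance. Closing the instance
// discards its temporary profile, including history, cookies and cache.
async function withFreshBrowser(task) {
    const browser = await puppeteer.launch();
    try {
        await task(browser);
    } finally {
        await browser.close();
    }
}

// Usage: every call starts with a clean slate
withFreshBrowser(async (browser) => {
    const page = await browser.newPage();
    await page.goto('https://example.com');
});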
I am new to Node.js.
I want to write a Node.js client for testing my website
(things like login, filling forms, etc.).
Which module should I use for that?
Since I want to test user login followed by other user functionality,
it should be able to keep a session like a browser does.
Also, is there any site with examples of using that module?
Thanks
As Amenadiel said in the comments, you might want to use something like PhantomJS for testing websites.
But if you're new to Node.js, maybe try something light first, like Zombie.js.
An example from their home page:
var Browser = require("zombie");
var assert = require("assert");

// Load the page from localhost
var browser = new Browser();
browser.visit("http://localhost:3000/", function () {
    // Fill email, password and submit form
    browser.
        fill("email", "zombie@underworld.dead").
        fill("password", "eat-the-living").
        pressButton("Sign Me Up!", function() {
            // Form submitted, new page loaded.
            assert.ok(browser.success);
            assert.equal(browser.text("title"), "Welcome To Brains Depot");
        });
});
Later on, when you get the hang of it, maybe switch to PhantomJS (which has WebKit underneath, so it's not emulating the DOM).
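For comparison, here is a rough PhantomJS sketch of the same login flow (the URL and field IDs are hypothetical):

var page = require('webpage').create();

page.open('http://localhost:3000/', function (status) {
    // Fill email, password and submit the form inside the page context
    page.evaluate(function () {
        document.getElementById('email').value = 'zombie@underworld.dead';
        document.getElementById('password').value = 'eat-the-living';
        document.forms[0].submit();
    });
    // Give the submission time to load the next page, then check the title
    setTimeout(function () {
        console.log(page.evaluate(function () { return document.title; }));
        phantom.exit();
    }, 1000);
});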
I want to perform the following actions on the server side:
1) Scrape a webpage
2) Simulate a click on that page and then navigate to the new page.
3) Scrape the new page
4) Simulate some button clicks on the new page
5) Send the data back to the client via JSON or something similar
I am thinking of doing this with Node.js,
but am confused as to which module I should use:
a) Zombie
b) Node.io
c) Phantomjs
d) JSDOM
e) Anything else
I have installed node.io, but am not able to run it via the command prompt.
PS: I am working on Windows Server 2008.
Zombie.js and Node.io run on JSDOM, hence your options are either going with JSDOM (or any equivalent wrapper), a headless browser (PhantomJS, SlimerJS), or Cheerio.
JSDOM is fairly slow because it has to recreate DOM and CSSOM in Node.js.
PhantomJS/SlimerJS are proper headless browsers, so performance is fine and they are also very reliable.
Cheerio is a lightweight alternative to JSDOM. It doesn't recreate the entire page in Node.js (it just downloads and parses the HTML; no JavaScript is executed). Therefore you can't really click on buttons/links, but it's very fast for scraping webpages.
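For instance, a minimal Cheerio sketch (using the request module to fetch the page; the URL and selector are placeholders):

var request = require('request');
var cheerio = require('cheerio');

request('https://www.domain.com/page1', function (err, res, body) {
    if (err) throw err;
    // Parse the downloaded HTML - no JavaScript is executed
    var $ = cheerio.load(body);
    console.log($('h1#foobar').text());
});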
Given your requirements, I'd probably go with a headless browser. In particular, I'd choose CasperJS, because it has a nice and expressive API, it's fast and reliable (it doesn't need to reinvent the wheel of parsing and rendering the DOM or CSS like JSDOM does), and it's very easy to interact with elements such as buttons and links.
Your workflow in CasperJS should look more or less like this:
casper.start();

casper
    .then(function() {
        console.log("Start:");
    })
    .thenOpen("https://www.domain.com/page1")
    .then(function() {
        // scrape something
        this.echo(this.getHTML('h1#foobar'));
    })
    .thenClick("#button1")
    .then(function() {
        // scrape something else
        this.echo(this.getHTML('h2#foobar'));
    })
    .thenClick("#button2")
    .thenOpen("http://myserver.com", {
        method: "post",
        data: {
            my: 'data'
        }
    }, function() {
        this.echo("data sent back to the server");
    });

casper.run();
Short answer (in 2019): Use puppeteer
If you need a full (headless) browser, use puppeteer instead of PhantomJS, as it offers an up-to-date Chromium browser with a rich API to automate any browser crawling and scraping tasks. If you only want to parse an HTML document (without executing JavaScript inside the page), you should check out jsdom and cheerio.
Explanation
Tools like jsdom (or cheerio) let you extract information from an HTML document by parsing it. This is fast and works well as long as the website does not rely on JavaScript; it will be very hard or even impossible to extract information from a website built on JavaScript. jsdom, for example, is able to execute scripts, but it runs them inside a sandbox in your Node.js environment, which can be very dangerous and possibly crash your application. To quote the docs:
However, this is also highly dangerous when dealing with untrusted content.
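For the safe, parse-only case, a minimal jsdom sketch (the inline HTML string is made up):

const { JSDOM } = require('jsdom');

// Scripts inside the HTML are NOT executed by default, which keeps this safe
const dom = new JSDOM('<body><h1 id="title">Hello world</h1></body>');
console.log(dom.window.document.querySelector('#title').textContent); // "Hello world"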
Therefore, to reliably crawl more complex websites, you need an actual browser. For years, the most popular solution for this task was PhantomJS. But in 2018, the development of PhantomJS was officially suspended. Thankfully, since April 2017 the Google Chrome team has made it possible to run the Chrome browser headlessly (announcement).
This makes it possible to crawl websites using an up-to-date browser with full JavaScript support.
To control the browser, the library puppeteer, which is also maintained by Google developers, offers a rich API for use within the Node.js environment.
Code sample
The lines below show a simple example. It uses Promises and the async/await syntax to execute a number of tasks. First, the browser is started (puppeteer.launch) and a URL is opened (page.goto).
After that, functions like page.evaluate and page.click are used to extract information and execute actions on the page. Finally, the browser is closed (browser.close).
const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto('https://example.com');

    // example: get innerHTML of an element
    const someContent = await page.$eval('#selector', el => el.innerHTML);

    // Use Promise.all to wait for two actions (navigation and click)
    await Promise.all([
        page.waitForNavigation(), // wait for navigation to happen
        page.click('a.some-link'), // click link to cause navigation
    ]);

    // another example, this time using the evaluate function to return the innerText of the body
    const moreContent = await page.evaluate(() => document.body.innerText);

    // click another button
    await page.click('#button');

    // close browser when we are done
    await browser.close();
})();
The modules you listed do the following:
PhantomJS/Zombie - simulate a browser (headless; nothing is actually displayed). They can be used for scraping static or dynamic pages, or for testing your HTML pages.
Node.io/JSDOM - web scraping: extracting data from a (static) page.
Looking at your requirements, you could use PhantomJS or Zombie.
Currently I am using AngularJS and Node.js.
I have some Angular templates on my server. When Angular requests a template, it is rendered in the browser and cached by the browser. If I then change the layout of the template and update it on the server, and the user is redirected to the same template, the browser renders the old version from its cache.
How do I overcome this problem?
NOTE: I don't want to clear the whole browser cache, since that would affect my overall website performance.
Practical
On some page you are not caching, you can append a version parameter such as ?ver=2 to the URL to "break" the cache. Whenever you change the URL, the browser will reload the resource. Add the ?ver= parameter to cached resources and change its value when you want a reload.
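A small sketch of this idea (the APP_VERSION constant and the versioned helper are made up; bump the version on each deploy):

var APP_VERSION = '1.0.3'; // change this whenever templates change

function versioned(url) {
    // Append ?ver=<version> (or &ver= if the URL already has a query string)
    return url + (url.indexOf('?') === -1 ? '?' : '&') + 'ver=' + APP_VERSION;
}

// e.g. in an AngularJS route definition:
// templateUrl: versioned('/templates/home.html')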
Interesting but less practical
When creating a single-page app, I solved this sort of issue using AppCache. With the application cache you can ask the page to reload a resource. From the linked page:
// Check if a new cache is available on page load.
window.addEventListener('load', function(e) {
    window.applicationCache.addEventListener('updateready', function(e) {
        if (window.applicationCache.status == window.applicationCache.UPDATEREADY) {
            // Browser downloaded a new app cache.
            // Swap it in and reload the page to get the new hotness.
            window.applicationCache.swapCache();
            if (confirm('A new version of this site is available. Load it?')) {
                window.location.reload();
            }
        } else {
            // Manifest didn't change. Nothing new from the server.
        }
    }, false);
}, false);
Note: this only works in newer browsers like Chrome, Firefox, Safari, Opera, and IE10. I just wanted to suggest an alternative approach.
I am trying to create a simple JavaScript-based extension for Google Chrome that takes data from one specific iframe and sends it as part of a POST request to a webpage.
That webpage then sends the data submitted by the POST request to my email address.
I tried running the extension; it looks to be running fine, but I am not getting any email.
The servlet which receives the form data is very simple; I don't think there is any error in it.
What I want is some way to check whether the JavaScript-based extension works or not.
The JavaScript code is given below:
// ajaxRequest() is a helper assumed to be defined elsewhere in the extension
var mypostrequest = new ajaxRequest();
mypostrequest.onreadystatechange = function() {
    if (mypostrequest.readyState == 4) {
        if (mypostrequest.status == 200 || window.location.href.indexOf("http") == -1) {
            document.getElementById("result").innerHTML = mypostrequest.responseText;
        } else {
            alert("An error has occurred making the request");
        }
    }
};

var namevalue = encodeURIComponent("Arvind");
var descvalue = encodeURIComponent(window.frames['test_iframe'].document.body.innerHTML);
var emailvalue = encodeURIComponent("arvindikchari@yahoo.com");
var parameters = "name=" + namevalue + "&description=" + descvalue + "&email=" + emailvalue;

mypostrequest.open("POST", "http://taurusarticlesubmitter.appspot.com/sampleform", true);
mypostrequest.setRequestHeader("Content-type", "application/x-www-form-urlencoded");
mypostrequest.send(parameters);
UPDATE
I made changes so that the content of the js file is invoked by the background page, but even now the extension is not working.
I put the following code in background.html:
<script>
// Called when the user clicks on the browser action.
chrome.browserAction.onClicked.addListener(function(tab) {
    chrome.tabs.executeScript(null, {file: "content.js"});
});

chrome.browserAction.setBadgeBackgroundColor({color: [0, 200, 0, 100]});
</script>
Looking at your code, it seems you are trying to send a cross-domain Ajax request from a content script. This is not allowed; you can do that only from background pages, and only after the corresponding domains are declared in the manifest. More info here.
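A rough sketch of that setup (manifest v2 era; the message format here is made up): declare the domain in the manifest and move the request into the background page.

// manifest.json (excerpt):
// "permissions": ["http://taurusarticlesubmitter.appspot.com/"]

// content script: collect the data and hand it to the background page
chrome.runtime.sendMessage({description: document.body.innerHTML});

// background page: perform the cross-domain POST
chrome.runtime.onMessage.addListener(function (msg) {
    var xhr = new XMLHttpRequest();
    xhr.open('POST', 'http://taurusarticlesubmitter.appspot.com/sampleform', true);
    xhr.setRequestHeader('Content-type', 'application/x-www-form-urlencoded');
    xhr.send('description=' + encodeURIComponent(msg.description));
});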
To check if your extension works, you can open the dev tools and check if there are any errors in the console. Open the "Network" tab and see if the request was sent to your URL. Place console.log in various places in your code for debugging, or use the full-featured built-in JavaScript debugger for step-by-step debugging.