Handling popups while scraping with PhantomJS or other libraries

There are lots of libraries out there to scrape information from web pages. Some that I have looked at are:
http://phantomjs.org/
http://webdriver.io/
http://casperjs.org/
http://www.nightmarejs.org/
http://codecept.io/
https://data-miner.io
http://chaijs.com/
Surprisingly, none of them seems to provide a way to scrape a popup window, or if they do, I couldn't figure out how it's done.
The scenario is something like this:
-Visit a url (example.com)
-Fill login form
-Click login button
...and now, webpage opens a popup (an actual browser window) which I need to scrape.
Any suggestions or workarounds for popups?

webdriver.io can switch the active frame, so if your popup's content sits inside frame or iframe tags you can switch to that frame and work with it. You can read more here:
http://webdriver.io/api/protocol/frame.html#Usage
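
If the popup is a real browser window rather than a frame, the usual WebDriver approach is to switch window handles instead. Below is a minimal sketch, assuming a WebdriverIO `browser` object that exposes the WebDriver protocol commands `getWindowHandles` and `switchToWindow`. Command names differ between WebdriverIO versions, so check the API docs for the version you run. `switchToPopup` is a hypothetical helper name, not part of any library:

```javascript
// Hypothetical helper: switch the session to the first window that is not
// the one we started from, and return its handle (or null if none opened).
// `browser` is assumed to expose getWindowHandles() and switchToWindow().
async function switchToPopup(browser, originalHandle) {
  const handles = await browser.getWindowHandles();
  const popup = handles.find((handle) => handle !== originalHandle);
  if (popup === undefined) {
    return null; // the popup has not opened (yet)
  }
  await browser.switchToWindow(popup);
  return popup;
}
```

The idea would be to record the current handle before clicking the login button, call the helper after the click, run the usual selectors against the popup, and then switch back to the original handle when done.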

Related

Selenium, Python and Chrome WebDriver problem

I am learning Python and attempting to build a program that will scrape specific data from a website, store it and then manipulate it.
Currently, when I run my application, it opens a new Chrome browser window and loads the page correctly. The problem is that it should then start scrolling down and loading the remaining elements on the page, but it does not.
I know the code works because if I manually click somewhere on the page that doesn't normally elicit a response (white space/empty areas), the browser somehow comes into "focus" and begins to iterate through the loop that scrolls down the page (by sending keys) and prints the data I am after. I also noticed that if I click another similar "dead space" area that contains the header, it doesn't have the same effect. I am unsure if this is something specific to Chrome, iframes or something of that nature, but I am completely stumped and would greatly appreciate any help.
Any thoughts on why I need to manually click on the new chrome window for it to work would be great.
Update: Still having the same issue, even tried with Safari and the same problem seems to exist.

Fixed this with:
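from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By
# Click the app container so the window takes focus (driver is the existing WebDriver).
element = driver.find_element(By.CSS_SELECTOR, "div[id^='app-container']")
action = ActionChains(driver)
action.click(on_element=element)
action.perform()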

How to click a button and scrape text from a website using python scrapy

I have used Python Scrapy to extract data from a website, and I am able to scrape most of the details of the site. My main problem is that I am not able to extract all the product reviews. I can only extract the top 4 reviews displayed on the page; to get the others I have to open a popup window that holds all the reviews. I looked for an 'href' for the popup window but could not find one. This is the link that I tried to scrape. The reviews and ratings are at the bottom of the page: https://www.coursera.org/learn/big-data-introduction
Can anyone help me by explaining how to extract the reviews from this popup window? Another thing to note is that the popup uses infinite scrolling.
Thanks in advance.

Scrapy, unlike tools like Selenium and PhantomJS, does not drive a full web browser in the background. You cannot just click a button.
You need to understand what the button does (e.g. does it simply submit a form? Does it do something with JavaScript? Etc.) and reproduce the functionality in your own code.
For example, you might need to read the content of a script element, apply regular expressions to it to pull a URL from a string literal, then make a new HTTP request to that URL, then pull the data you want from the new DOM.
... and then repeat for the next “page” of the infinite scroll.
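
The extraction step described above can be sketched as follows. The sketch uses JavaScript purely for illustration; the same logic carries over to Python's `re` module and `urljoin` inside a Scrapy callback. The `reviewsUrl` variable, the sample script text, and the `extractUrlFromScript` helper are all made up here, and the real page will use different names:

```javascript
// Hypothetical helper: find a string literal assigned to `variableName`
// inside inline script text and resolve it against the page URL.
// Assumes `variableName` is a plain identifier (no regex metacharacters).
function extractUrlFromScript(scriptText, variableName, pageUrl) {
  // Matches e.g.  reviewsUrl = "/api/reviews?page=1"  (single or double quotes).
  const pattern = new RegExp(variableName + '\\s*[:=]\\s*["\']([^"\']+)["\']');
  const match = pattern.exec(scriptText);
  if (match === null) {
    return null;
  }
  // new URL(relative, base) turns a relative path into an absolute URL.
  return new URL(match[1], pageUrl).href;
}

const script = 'var reviewsUrl = "/api/reviews?page=1";';
console.log(extractUrlFromScript(script, 'reviewsUrl', 'https://example.com/course'));
// prints "https://example.com/api/reviews?page=1"
```

The returned URL is what you would request next; for infinite scroll, the response typically carries the value needed to build the following request.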

Using jquery for parsing causes image network traffic in Chrome extension?

I'm writing an extension that scrapes web pages using jquery. After a while I start getting net errors saying resources not available and errors in the console loading images in the pages I'm scraping. I thought it might be $.get() loading it as html somehow, but it still happens when I use a raw XMLHttpRequest and it appears even when I call $(text) with static text.
Looking in the application tab of my background page I can see that there are images, even though they don't exist in the html. For example run this in the console of any extension background page:
$('<div>Hello, world!<img src="https://www.gravatar.com/avatar/fdc806d0a8834e57b2d9309849dea8cd"/></div>')
You can then see on the Application tab in dev tools that the image was loaded. It isn't in the HTML of the page when inspected, but it is visible on the Network tab.
I assume that jquery is creating dom elements to use the browser's capabilities for finding elements, and that chrome is happily pre-fetching that image even though the element isn't on the page and the page will never be visible anyway, but it is causing me errors besides the extra network traffic.
I've tried disabling 'precache' in chrome://flags but that didn't work. For now I'm replacing <img with <noimg which seems to work but is not ideal:
$(text.replace(/<img /g, '<noimg '))
Is there a way to keep this from happening? Is there another library besides jQuery (like cheerio in node) that wouldn't actually create dom objects?

Use the built-in DOMParser to parse the HTML into a detached document, then use jQuery on that document object:
var doc = new DOMParser().parseFromString(yourHTMLstring, 'text/html');
$('.some.selector', doc).attr('foo', 'bar');
In case there may be relative links in the HTML, add a base element explicitly:
$(doc.head).append('<base href="' + realFullURL + '">')

In Chrome extensions, why use a background page with HTML?

I understand that the background page of a Chrome extension is never displayed. It makes sense to me that a background page should contain only scripts. In what situations would HTML markup ever be needed?
At https://developer.chrome.com/extensions/background_pages there is an example with an HTML background page, but I haven't been able to get it to work (perhaps because I am not sure what it should be doing).
Are there any examples of simple Chrome extensions which demonstrate how HTML markup can be useful in a background page?

Historical reasons
The background page is, technically, a whole separate document - except it's not rendered in an actual tab.
For simplicity's sake, perhaps, extensions started with requiring a full HTML page for the background page through the background_page manifest property. That was the only form.
But, as evidenced by your question, most of the time it's not clear what the page can actually be used for except for holding scripts. That made the entire thing just a piece of boilerplate.
That's why when Chrome introduced "manifest_version": 2 in 2012 as a big facelift to extensions, they added an alternative format, background.scripts array. This will offload the boilerplate to Chrome, which will then create a background page document for you, succinctly called _generated_background_page.html.
Today, this is the preferred method, though background.page is still available.
Practical reasons
With all the above said, you still sometimes want to have actual elements in your background page's document.
<script> for dynamically adding scripts to the background page (as long as they conform to extension CSP).
Among other things, since you can't include external scripts through background.scripts array, you need to create a <script> element for those you whitelist for the purpose.
<canvas> for preparing image data for use elsewhere, for example in Browser Action icons.
<audio> for producing sounds.
<textarea> for (old-school) working with clipboard (don't actually do this).
<iframe> for embedding an external page into the background page, which can sometimes help extracting dynamic data.
..possibly more.
It's debatable which boilerplate is "better": creating the elements in advance as a document, or using document.createElement and its friends as needed.
In any case, a background page is always a page, whether provided by you or autogenerated by Chrome. You can use all the DOM functions you want.

My two cents:
Take Google Mail Checker as an example: it declares a canvas in background.html:
<canvas id="canvas" width="19" height="19">
Then it could manipulate the canvas in background.js and call chrome.browserAction.setIcon({imageData: canvasContext.getImageData(...)}) to change the browser action icon.
I know we could create the canvas dynamically in background.js; however, when doing something that involves a DOM element, using HTML directly seems easier.

How can one debug the Chrome extension "options" page using the new OptionsV2 method?

https://developer.chrome.com/extensions/optionsV2 tells me that I should be using options_ui in my manifest, rather than options_page, and recommends I start upgrading immediately.
However, I can't find any way to actually debug the script run by my options page when I use options_ui: the Options popup is embedded in its own element, and the developer tools don't show me the source, or even the HTML content.
For now, I just comment out options_ui and let options_page take effect when I need to debug. I'm guessing that setting "options_ui": {"open_in_tab": true,...} would have the same effect, but it would be really nice to figure out how to actually debug the script when it's running the new way.

Auspex,
Teepeemm's comment is correct.
Alternatively, you can open your options page in another tab using its full URL, like this:
chrome-extension://{your extension id here}/{your options page path here, from the extension root}
For example, if my extension ID is aaabbbcccdddeeefffggg and my options page is located (from the extension root) at app/html/options.html, then I can load the URL below in a new tab:
chrome-extension://aaabbbcccdddeeefffggg/app/html/options.html
In this tab you can do your regular debugging of the HTML and JavaScript.
I hope this covers your debugging needs for the new options UI in Chrome.
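
The URL pattern above can be written as a small helper. This is only a sketch; `extensionPageUrl` is a hypothetical name, and the ID and path are the example values from this answer:

```javascript
// Hypothetical helper: build the full URL of a page packaged in an extension.
// `extensionId` and `pagePath` are placeholders for your own values.
function extensionPageUrl(extensionId, pagePath) {
  // Drop any leading slashes so the result has exactly one after the ID.
  const relativePath = pagePath.replace(/^\/+/, '');
  return 'chrome-extension://' + extensionId + '/' + relativePath;
}

console.log(extensionPageUrl('aaabbbcccdddeeefffggg', 'app/html/options.html'));
// prints "chrome-extension://aaabbbcccdddeeefffggg/app/html/options.html"
```

From inside the extension's own scripts, chrome.runtime.getURL('app/html/options.html') returns the same URL without hard-coding the ID.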

Teepeemm's comment is correct.
It's as simple as right-clicking inside the options page modal and selecting "Inspect element" - it will open the correct Dev Tools.
