I'm coding a parser for Google Ads and I need to separate top and bottom sections.
When I fetch a page with the Node.js request module, I get all the ads with the '.ads-ad' selector, the same as I get via a browser.
But in a browser I can also see a parent DIV with id='taw' (for top ads) and a parent DIV with id='bottomads' (for bottom ads). I don't see these DIV elements with request.
Is there an existing simple way to get the Google page the way a browser does?
Thank you in advance!
Try jsdom or CasperJS. These are headless browser tools that render the page and run any JavaScript on it, which ensures the final content is the same as in your browser.
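For example, here is a minimal jsdom sketch, assuming a recent jsdom version; the search URL is just an illustration. With runScripts enabled, the page's own JavaScript executes, so containers that scripts inject (like #taw and #bottomads) should show up:

    // minimal jsdom sketch: fetch the page and let its scripts run
    const { JSDOM } = require('jsdom');

    JSDOM.fromURL('https://www.google.com/search?q=test', {
      runScripts: 'dangerously', // execute the page's own scripts
      resources: 'usable',       // also load external scripts the page references
    }).then((dom) => {
      const doc = dom.window.document;
      console.log(doc.querySelector('#taw'));       // top-ads container
      console.log(doc.querySelector('#bottomads')); // bottom-ads container
    });

CasperJS (on top of PhantomJS) does the same job if you prefer a scriptable browser over an in-process DOM.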
I'm using artoo.js for web scraping; however, for some reason the scraped image URLs change when working with cheerio in Node, i.e. the original image URL is:
"https://images-na.ssl-images-amazon.com/images/M/MV5BNWU4NmY3MTMtMTBmMi00NjFjLTkwMmItYWZhZWUwNDg5M2ExXkEyXkFqcGdeQXVyNDUyOTg3Njg#._V1_SX300.jpg"
However, after scraping, the URL turns into this:
"http://ia.media-imdb.com/images/G/01/imdb/images/nopicture/156x231/tv-3797070466._CB522736147_.png#._V1_SX300.jpg"
If I scrape it in the Chrome browser console using the Artoo.js bookmarklet, the URL stays the same as the original.
Why is it changing when I use it in Node? Any suggestions?
UPDATE: I think I found the issue but not the solution. It seems the scraper method runs before the correct images have loaded on the page; the changed URL is just a placeholder image. How can I wait until the entire page loads?
It may be caused by some JS code. If you are using request + cheerio to scrape the page, the JS code does nothing when you make the request in Node (it's not interpreted). So you are probably getting the original URL before any lib or piece of code changes it. Try looking at the source code of the page in the browser (Ctrl+U). If it's "http://ia.media-imdb.com/images/G/01/imdb/images/nopicture/156x231/tv-3797070466._CB522736147_.png#._V1_SX300.jpg", then you will know some piece of code is changing it.
Edit
If you absolutely need to run the JS to obtain the URL, you should use PhantomJS. It's a headless browser, so the images will load. You can use it directly from Node.js, or if you want a simpler way, go with CasperJS. I assume you're not used to scraping complicated web apps; if that's the case, I would go with CasperJS. It's easy and it does the job. It's not as fast as request + cheerio, but it works, and you can put your code to run on a server.
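A rough CasperJS sketch of the "wait for the real image" idea; the page URL and the selector are hypothetical stand-ins for your actual page:

    var casper = require('casper').create();

    casper.start('http://www.imdb.com/title/tt0000000/', function () {
      // wait until the page's JS has swapped the placeholder for the real poster
      this.waitForSelector('img[src*="images-na.ssl-images-amazon.com"]', function () {
        this.echo(this.getElementAttribute(
          'img[src*="images-na.ssl-images-amazon.com"]', 'src'));
      });
    });

    casper.run();

waitForSelector keeps polling (with a timeout) until the selector matches, which covers exactly the "placeholder first, real image later" case from the update.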
I'm writing an application using Node.js.
One of the functions I want to create is to navigate to a specific URL.
for example:
navigate inside a website
taking some information from the page (contacts, etc.)
I'm able to fetch a URL, but I don't know how to navigate inside it.
Any advice?
If your goal is to make some sort of crawler, you should have a look at request and cheerio, which is a sort of jQuery for Node and an HTML parser.
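A minimal sketch of that combination; the start URL and the '.contact' selector are assumptions for illustration. It fetches a page, extracts some text, and "navigates" by following the first link it finds:

    const request = require('request');
    const cheerio = require('cheerio');
    const { URL } = require('url');

    function crawl(url) {
      request(url, (err, res, body) => {
        if (err) return console.error(err);
        const $ = cheerio.load(body);

        // extract information; '.contact' is a hypothetical selector
        $('.contact').each((i, el) => console.log($(el).text().trim()));

        // follow the first link, resolving relative hrefs against the current page
        const next = $('a').first().attr('href');
        if (next) crawl(new URL(next, url).href);
      });
    }

    crawl('http://example.com/');

A real crawler would also track visited URLs so it doesn't loop forever, but this is the basic request-parse-follow cycle.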
I have a fundamental question that I have been searching about for a long time, but I still don't know the exact answer.
I am working with browsers and web applications. I am wondering how, and based on what, a web browser decides to send a particular request to the web server.
For example, when you enter http://www.google.com in the address bar of your web browser, the browser will send a bunch of requests to the web server to render the web page properly.
Now, my question is: how does the web browser decide which requests it needs to send to the web server?
Is it related to tags like 'link' or 'script' inside the body of the responses?
Does the browser parse the JavaScript functions to see if it should send a request based on those functions?
Let's take an example to explain this.
Consider that you want to search for something and you hit http://www.google.com in your browser. These are the events that unfold to fetch you the page that will let you type in your query.
First, the networking stack on your machine will try to figure out which actual internet address matches www.google.com. This is called a DNS lookup. Once it receives a response for this lookup in form of an IP address, it can make a connection to the actual server that is serving google.com.
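You can reproduce this step directly in Node.js, just as an illustration:

    const dns = require('dns');

    // resolve the hostname to an IP address, like the browser's network stack does
    dns.lookup('www.google.com', (err, address) => {
      if (err) throw err;
      console.log('www.google.com ->', address);
    });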
The machine makes a socket connection and uses the HTTP protocol to communicate with the server. It queries for the resource at / (which is the root) of the address you are trying to reach. This is called a GET request. The request is normally described like so: GET /
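For illustration, here is roughly what that exchange looks like at the socket level (plain HTTP on port 80, just to show the shape of the request; the live server will answer with a redirect to HTTPS):

    const net = require('net');

    const socket = net.connect(80, 'www.google.com', () => {
      // the GET request for the resource at "/", plus the required Host header
      socket.write('GET / HTTP/1.1\r\nHost: www.google.com\r\nConnection: close\r\n\r\n');
    });

    socket.on('data', (chunk) => process.stdout.write(chunk.toString()));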
Google will respond with an HTML page, normally "index.html", which gets downloaded by the browser.
Once the HTML is downloaded, all linked resources, such as images needed to render the HTML as well as JavaScript referenced by the HTML page, get downloaded.
The downloaded HTML page is parsed and an in-memory tree is created called the "DOM Tree". This tree contains the elements of the HTML page in a hierarchy. Once the DOM is created, you can see the page being rendered in the browser.
During this parsing, the browser discovers more resources to be downloaded, such as images, stylesheets, javascript files. The HTML page references these resources via different tags such as <img> for images, <script> for javascript.
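As a toy illustration of that discovery step, here is how a parser can pull out the resource URLs a browser would queue for download (cheerio in Node, with an inline HTML snippet as the assumed input):

    const cheerio = require('cheerio');

    const html = '<html><head><link rel="stylesheet" href="style.css"></head>' +
      '<body><img src="logo.png"><script src="app.js"></script></body></html>';

    const $ = cheerio.load(html);
    const resources = [];
    $('link[href]').each((i, el) => resources.push($(el).attr('href')));
    $('img[src]').each((i, el) => resources.push($(el).attr('src')));
    $('script[src]').each((i, el) => resources.push($(el).attr('src')));

    console.log(resources); // [ 'style.css', 'logo.png', 'app.js' ]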
All detected resources are downloaded. Browsers download many of these resources in parallel, but apply them (JavaScript and stylesheets) sequentially in the order they were found on the page.
Stylesheets are parsed, and the styles are applied to the DOM of the HTML page. If stylesheets take longer to download, you can sometimes see the "raw" HTML page being rendered before the styles are applied; this typically happens over a slow connection.
Once the HTML page and related JavaScript files have been downloaded, the browser calls the "onload" callback function of JavaScript. Most JavaScript-heavy applications start at this time.
Once onload is called, Javascript takes over and can attach handlers for different elements on the web page. Once the handlers have all been installed, interacting with the webpage could call one or more javascript functions that are listening for these events.
Javascript can also manipulate the DOM (the elements on the page), which results in UI updates (what the user sees) and therefore can be used to build a complete app on a single page.
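A small browser-side sketch of those last two steps; the element ids are hypothetical:

    // runs once the page and its referenced scripts have loaded
    window.onload = function () {
      // attach a handler to an element; 'search-button' is a made-up id
      document.getElementById('search-button').addEventListener('click', function () {
        // manipulate the DOM: the user sees an update without a new page load
        document.getElementById('results').textContent = 'Searching...';
      });
    };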
Here is some more reading on the process: http://friendlybit.com/css/rendering-a-web-page-step-by-step/
The best way to examine this interaction is to use the developer tools in Chrome/Firefox or IE and view the network activity when you visit a web page.
There is a webpage with live text data in a span tag that updates without the page refreshing. Is it possible to use cheerio or maybe another Node.js module to get the page info and keep it open so Node.js also sees the updates?
I would like not to keep re-requesting. As a human with the webpage open in the browser, I do not need to refresh, so logically the same should be doable in Node.js.
True?
You can use PhantomJS.
It's like a real browser but without a window.
You can handle all browser events, so you can know when an element is added to the page.
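A hedged sketch of that idea; it assumes a PhantomJS 2.x build (whose WebKit supports MutationObserver) and a hypothetical URL and '#live-data' selector. The script inside the page reports every update back to the controlling script via callPhantom:

    var page = require('webpage').create();

    // receives messages sent from inside the page via window.callPhantom
    page.onCallback = function (text) {
      console.log('span updated:', text);
    };

    page.open('http://example.com/live', function (status) {
      if (status !== 'success') { phantom.exit(1); }
      page.evaluate(function () {
        var span = document.querySelector('#live-data'); // hypothetical selector
        // fires whenever the page's own JS rewrites the span's contents
        new MutationObserver(function () {
          window.callPhantom(span.textContent);
        }).observe(span, { childList: true, characterData: true, subtree: true });
      });
    });

This keeps one page open instead of re-requesting, which is exactly the browser-like behavior you described.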
I find that when I am doing web development there are a few browser plugins that are very useful to me.
For Firefox I am using:
Firebug - Great for inspecting the HTML elements and working with CSS.
YSlow for Firebug - Developed by Yahoo! and gives timing and tips about page resources.
Live HTTP headers - Lets you inspect the headers that are sent to your browser.
For IE I am using:
Fiddler - "a Web Debugging Proxy which logs all HTTP(S) traffic between your computer and the Internet"
I am always looking for other great tools to use. So what is everyone else using?
In addition to what you have:
Web Developer toolbar adds a lot of extra functionality (cookie, form, and image inspection, viewing the generated DOM, etc.).
HTML Validator - great for a quick check to make sure your pages are valid. Also good when there are display errors: you can quickly see if they come from improperly generated HTML.
ColorZilla - I use this a lot to pull exact colors from a page to the clipboard.
Fireshot - takes screenshots and annotates them conveniently; helpful.
Extended Statusbar modifies the status bar to show speed, percentage, time, and loaded size (useful for seeing how many images are being loaded, page weight, etc)
ShowIP displays the IP address of the current page in the status bar.
External IP displays your external IP address in the status bar.
On a side note, I also find it useful to run these extensions in FirefoxPortable, so that I've got a browser set up specifically for development work with the relevant extensions installed, and to avoid slowing down or destabilizing my primary browser (e.g. Firebug used to crash my browser all the time when accessing Gmail).
URL Params (Firefox extension) to view the POST and GET parameters of a webpage. Useful for checking your forms.
HttpFox
The one that prevents you from accessing StackOverflow is pretty useful.
All of these are Firefox plugins.
Firebug for JavaScript and CSS debugging. Firebug allows you, for example, to examine the DOM tree while JavaScript modifies it. Firebug is my main tool.
Live HTTP Headers for looking at what data is actually inside requests and responses.
Web Developer toolbar contains smaller utilities. For example, it can validate HTML and CSS.
Dust Me Selectors finds which pieces of CSS are unused.
IE Developer Toolbar
Venkman debugger for Firefox
Firecookie and Console²
How about TwitterFox, to keep in touch with developer colleagues and friends on Twitter.
MeasureIt
For getting the exact size of items rendered on a page in Firefox.
Firebug - Also lets me see the JS requests being sent from one page to another and which data is being sent.
- I can see the data inside the JS variables.
- Replaces the Error Console. It also reports in the status bar if it has found an error, so I can inspect it.
- Good for seeing the structure of the HTML when developing AJAX applications.