I have been trying to scrape 10 websites, for a site we are building that links back to the originals, using cheerio on Node.js. The problem is that some of the sites have changed and now load their data through AJAX calls. My question is: how can we get that information, for instance by triggering a button click first and then reading the DOM?
Secondly, the same DOM structure is not giving me all the data: it retrieves the information for one page, but not the elements on another page with an identical DOM structure. Any help would be appreciated.
Thanks and regards.
Edit 1: Relevant code
$('#ProductContent').filter(function(){
var price = undefined;
var ukulele = false;
var model = $(this).find('.ProductSubtitle').text().replace(/\n\s*/g,"");
if(model.indexOf(/m/i) != 0){
var description = $(this).find('.RomanceCopy').text().replace(/\n\s*|\r/g,"");
.
.code removed for brevity and the variables present here are populated
.
//this children is populated only for one page.
children = $(this).find('.SpecsColumn .SpecsTable table tbody').children('tr');
console.log('children: '+children.length)
console.log(guitar_url);
children.each(function(){
var key = $(this).children('td').first().text();
var value = $(this).children('td').last().text();
specs[key] = value;
console.log(specs);
});
Edit 2: Cheerio initialization
request(guitar_url,function(error,response,html){
if(!error){
var $ = cheerio.load(html);
$("#content #right-content").filter(function(){..children and other variables are populated inside here....})
}
})
To summarise all the comments you received:
Cheerio is a minimalistic DOM reader inspired by jQuery. It is designed for reading data and is not a browser emulator in which you could click a button.
An alternative is to use a headless browser like PhantomJS or CasperJS.
Those two run outside of Node.js, and you may have a hard time passing data back and forth between Node.js and the headless browser.
If it is important for you to stay inside the Node.js environment, you can use JSDOM.
All of them are more complicated to use than Cheerio, but if you want to manipulate the DOM, execute JavaScript on the DOM, and so on, they are your best bet.
Removing the 'tbody' tags solved the problem. Once they were removed, it started to fetch the data normally for all three sites.
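One plausible explanation, stated here as an assumption rather than something confirmed in the thread: browsers insert a tbody element when they build the DOM, so it shows up in the inspector even when the markup sent by the server has none. A parser that does not add it will then find nothing for a selector such as 'table tbody'. A minimal sketch for checking the raw HTML follows; hasExplicitTbody and the sample markup are made up for illustration.

```javascript
// Hypothetical helper: report whether the HTML the server actually sent
// contains an explicit <tbody> tag.
function hasExplicitTbody(html) {
  return /<tbody[\s>]/i.test(html);
}

// Made-up sample markup without <tbody>, as a server might send it.
var rawHtml =
  '<div class="SpecsTable"><table>' +
  '<tr><td>Body</td><td>Mahogany</td></tr>' +
  '</table></div>';

console.log(hasExplicitTbody(rawHtml)); // false
```

If the check returns false, a descendant selector such as '.SpecsColumn .SpecsTable table tr' does not depend on tbody being present and should match in either case.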
I have been trying to load some PDFs that will be shown inside a Bootstrap accordion. The problem is that they load in many different ways depending on the browser. I've been trying the iframe and object HTML tags with different results, and I have a huge flaw in Safari, where the accordion functionality breaks completely when I embed a PDF inside a panel.
So I guess my question is: is there any sort of cross-browser standard for making embedded PDFs work in Chrome, Safari, IE11 and Firefox?
Since I need this to work on mobile, the situation is even worse. Any advice would be really appreciated.
Create a canvas element with the class "panel-body" and give it an id of your choosing. Then add the following code to your document ready event.
PDFJS.getDocument('YOURPDF.pdf').then(function(pdf) {
pdf.getPage(1).then(function(page) {
var scale = 1;
var viewport = page.getViewport(scale);
var canvas = document.getElementById('pdfOne'); // The id of your canvas
var context = canvas.getContext('2d');
canvas.height = viewport.height;
canvas.width = viewport.width;
var renderContext = {
canvasContext: context,
viewport: viewport
};
page.render(renderContext);
});
});
That will get the first page rendered. You'll need to create buttons to let the user navigate through the document but it should be easy enough to get that working based on what I've provided and the samples.
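As a rough sketch of the navigation part, assuming the same older PDFJS API used in the snippet above: the names pdfDoc, currentPage, clampPage, renderPage and goToPage are hypothetical, and the button ids in the usage note are placeholders.

```javascript
// Assumed state (hypothetical names): the loaded document and the current page.
var pdfDoc = null;
var currentPage = 1;

// Keep a requested page number inside the range 1..numPages.
function clampPage(pageNumber, numPages) {
  return Math.min(Math.max(pageNumber, 1), numPages);
}

// Render one page into the canvas, using the same steps as the snippet above.
function renderPage(pageNumber) {
  pdfDoc.getPage(pageNumber).then(function (page) {
    var viewport = page.getViewport(1);
    var canvas = document.getElementById('pdfOne'); // The id of your canvas
    canvas.height = viewport.height;
    canvas.width = viewport.width;
    page.render({ canvasContext: canvas.getContext('2d'), viewport: viewport });
  });
}

// Move by delta pages: +1 for "next", -1 for "previous".
function goToPage(delta) {
  currentPage = clampPage(currentPage + delta, pdfDoc.numPages);
  renderPage(currentPage);
}
```

To wire it up, store the document once it has loaded (PDFJS.getDocument('YOURPDF.pdf').then(function (pdf) { pdfDoc = pdf; renderPage(1); });) and call goToPage(1) or goToPage(-1) from the click handlers of your own "next" and "previous" buttons.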
I'm new to node.js. My experience has been in Java and VBA. I'm trying to scrape a website for a friend and all is going well until I can't get what I’m after.
<div class="gwt-Label ADC2X2-c-q ADC2X2-b-nb ADC2X2-b-Zb">Phone: +4576 102900</div>
That tag just has text, no attributes or anything. Yet I cannot scrape it using cheerio.
if(!err && resp.statusCode == 200){
var $ = cheerio.load(body);
var number = $('//tried everything here!').text();
console.log(number);
I also played around with this function:
$('.ADC2X2').filter(function(i){
console.log("Sdfs");
console.log (i);
Any suggestions would be greatly appreciated.
Thanks all!
I took this answer from the cheerio documentation.
$(".gwt-Label").text();
If that's not working, the page may contain several frames.
Another possibility is that the page is rendered on the client side, like Angular pages, so the element you are searching for is not in the server's HTML but is only created after the page loads.
If that's true, you will need to use a full browser like PhantomJS, not just a DOM traversal tool like cheerio.
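A quick way to tell these cases apart is to look at the raw HTML string before handing it to cheerio. The sketch below is only a rough check; isInServerHtml and hasFrames are hypothetical helpers, and the sample strings are made up.

```javascript
// Hypothetical helper: is the text we want present in the HTML the server sent?
// If not, the element is probably created later by client-side JavaScript.
function isInServerHtml(html, text) {
  return html.toLowerCase().indexOf(text.toLowerCase()) !== -1;
}

// Hypothetical helper: does the page embed other documents via frame or iframe?
function hasFrames(html) {
  return /<i?frame[\s>]/i.test(html);
}

// Made-up samples: one server-rendered page, one client-rendered shell.
var serverRendered = '<div class="gwt-Label">Phone: +4576 102900</div>';
var clientRendered = '<div id="app"></div><script src="app.js"></script>';

console.log(isInServerHtml(serverRendered, 'Phone:')); // true
console.log(isInServerHtml(clientRendered, 'Phone:')); // false
console.log(hasFrames('<iframe src="inner.html"></iframe>')); // true
```

In the question's code, the same checks could be run on body right after the statusCode test. If the text is missing and there are no frames, a selector change in cheerio is unlikely to help.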
I am trying to build out a harness for a page so that we can write tests against it. What I would like to be able to do is use a CSS selector to find the given element or elements instead of manually modifying the SearchProperties or FilterProperties. For a web test, a CSS selector seems far more intuitive than the SearchProperties do. Is there some mechanism for doing this that I am simply not seeing?
Try this...
https://github.com/rpearsondev/CodedUI.jQueryExtensions/
It adds extension methods to the BrowserWindow object...
var example1 = browser.JQuerySelect<HtmlHyperlink>("a.class1");
var example2 = browser.JQuerySelect<HtmlListItem>("li.class2");
Be aware, though, that I regularly run into casting errors with it.
Try BrowserWindow.ExecuteScript: if the script returns an element you found via CSS or XPath, it returns the corresponding UI control object.
const string javascript = "document.querySelector(\"{0}\");"; // double quotes, because the selector itself contains single quotes
var bw = BrowserWindow.Launch(new Uri("http://rawstack.azurewebsites.net"));
string selector = "[ng-model='filterOptions.filterText']";
var control = bw.ExecuteScript(string.Format(javascript,selector));
HtmlEdit filter= control as HtmlEdit;
filter.Text = "Alien";
As sjdirect noted, the jQuery extensions are probably the way to go if you want to use that type of selector.
However, it seems that you may be interested in some abstraction that doesn't require directly setting search / filter properties on the UITestControl objects.
There are good abstractions that do not use the same selectors as jQuery, but provide a readable, consistent approach for finding elements in the page and interacting with them.
I would recommend also looking into Code First and CodedUI Fluent (I wrote the fluent extensions) or even CodedUI Enhanced (CUITe).
These provide query support that looks like this (from CUITe):
// Launch the web browser and navigate to the homepage
BrowserWindowUnderTest browserWindow = BrowserWindowUnderTest.Launch("https://website.com");
// Enter the first name
browserWindow.Find<HtmlEdit>(By.Id("FirstName")).Text = "John";
// Enter the last name
browserWindow.Find<HtmlPassword>(By.Id("LastName")).Text ="Doe";
// Click the Save button
browserWindow.Find<HtmlInputButton>(By.Id("Save")).Click();
Recently I integrated Node and PhantomJS using phantomjs-node. I opened a page that contains an iframe element. I can get the hyperlink element inside the iframe, but executing a click on it fails.
Is there a way to do this? Can anyone help me?
example:
page.open(url);
...
page.evaluate(function(res){
var childDoc = $(window.frames["iframe"].document),
submit = childDoc.find("[id='btnSave']"),
cf = submit.text(); // succeeds: returns the text
submit.click(); // fails
return cf;
},function(res){
console.log("result="+res); // prints result=submit
page.render("test.png"); // the form has not been submitted
ph.exit();
});
You can't execute stuff in an iframe. You can only read from it. You even created a new document from the iframe, which will only contain the textual representation of the iframe, but it is in no way linked to the original iframe.
You would need to use page.switchToFrame to switch to the frame to execute stuff on the frame without copying it first.
It looks like switchToFrame is not implemented in phantomjs-node. You could try node-phantom.
If the iframe is on the same domain, you can try the following:
submit = $("iframe").contents().find("[id='btnSave']")
cf = submit.text();
submit.click()
If the iframe is not from the same domain, you will need to create the page with web security turned off:
phantom.create('--web-security=false', function(ph){...});
So this might be a convoluted question, but here goes:
I'm creating a simple, locally hosted web scraper with node.js. It's working perfectly fine when I manually define the URL to be scraped in the source file, and I'm now trying to prompt the user for a URL of their choice. I then append the URL they've entered to an empty div, and ideally, would be able to use cheerio to grab the content of that div.
Unfortunately, I have no idea how to parse the data that is being created on the same page that the script is running on. Any insight would be much, much appreciated!
var cheerio = require("cheerio");
response.write('<div id="newsStory"></div>');
response.write("<script type='text/javascript'>var userPrompt = prompt('input a url');");
response.write("if(userPrompt) {document.getElementById('newsStory').innerHTML = userPrompt;}");
response.write("</script>");
var $ = cheerio.load();
var url = $('div#newsStory').text(); //does not work!
var url = "http://www.cnn.com/2013/09/23/us/south-carolina-powerball-winner/"; //manually inputting a url works!
The problem you're having is you're mixing the browser-side DOM with the document Cheerio has server-side. The div newsStory is client-side, so you have to find some way to send its contents to the server.
Since you're familiar with Cheerio syntax, you could use jQuery on the client side, where the text() method acts the same, and you could use $.post() to send the URL to the server.
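On the server side, the posted URL then has to be read from the request body. A minimal sketch, assuming the client sends it with something like $.post('/scrape', { url: userPrompt }), which arrives form-encoded; the route name, the url field, extractUrl and handleScrapeRequest are all hypothetical.

```javascript
// Hypothetical helper: pull the "url" field out of a form-encoded POST body
// and accept it only if it is an absolute http(s) URL.
function extractUrl(body) {
  var value = new URLSearchParams(body).get('url');
  if (!value) return null;
  try {
    var parsed = new URL(value);
    return /^https?:$/.test(parsed.protocol) ? parsed.href : null;
  } catch (err) {
    return null; // not a valid absolute URL
  }
}

// Hypothetical request handler for a route such as POST /scrape.
function handleScrapeRequest(req, res) {
  var body = '';
  req.on('data', function (chunk) { body += chunk; });
  req.on('end', function () {
    var url = extractUrl(body);
    if (!url) {
      res.writeHead(400, { 'Content-Type': 'text/plain' });
      res.end('Missing or invalid url');
      return;
    }
    // Here `url` can be fetched and passed to cheerio.load(), as with the
    // manually defined URL in the question.
    res.writeHead(200, { 'Content-Type': 'text/plain' });
    res.end('Will scrape: ' + url);
  });
}

// Wiring it up (not run here):
// require('http').createServer(handleScrapeRequest).listen(3000);
```

Validating the URL on the server is a design choice of this sketch, since anything typed into the prompt ends up in the request.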