node-spider wait for lazy-load content - node.js

I am using Node.js' node-spider to crawl website content, but it has a problem: it won't wait for lazy-loaded content to finish loading before grabbing the page content, which is especially noticeable on Angular-style pages that rely heavily on lazy loading.
I'm new to spider/crawler technology, so I'm not sure what configuration is needed, or whether there is a better library for solving this issue.
My code is exactly the same as the code sample from the library: https://www.npmjs.com/package/node-spider

Related

How do I pull an image from a webpage in node js?

I'm currently trying to pull an image URL from a website and use that URL in my code, if that makes any sense.
Essentially, the request to the page goes through, and I need my code to grab an image on the page (the URL of the image on the page) so it can be sent to a Discord webhook.
What's the best way of doing this in Node.js? If this isn't enough information, please feel free to let me know and I will try my best to expand on this! Thanks.
Well, since you have only posted your question and no code/situation, I'll just elaborate on how you can get it in general.
Page is Dynamic
What I mean by that is that your page has content you want to fetch, and that content is loaded by JavaScript. In that case you could try using headless libraries like Puppeteer and Nightmare. Note that all of these packages are kinda heavy; for example, Puppeteer installs Chromium (not the element, the browser! If you don't know about it, read this) and Nightmare works with Electron (again, not the chemistry one; it's an NPM package).
You can leverage the built-in functions there to get the element you want. However, you will need to do a lot of Inspect Element work to find the exact element you want!
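For instance, a minimal Puppeteer sketch (the URL and the img selector are placeholders, not from the question) that lets the page's JavaScript run before reading an image URL could look like this:

```js
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // waitUntil: 'networkidle0' waits until the page has stopped making
  // network requests, giving lazy-loaded / JS-rendered content a chance to appear.
  await page.goto('https://example.com', { waitUntil: 'networkidle0' });

  // Grab the src of the first matching image; 'img' is a placeholder selector.
  const imageUrl = await page.$eval('img', img => img.src);
  console.log(imageUrl);

  await browser.close();
})();
```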
Page is Static
What I mean by Static is that all of the stuff you need is already in the default HTML itself! So you won't need headless browsers for that, which means fewer heavy dependencies for you.
So what you will want to do is fetch the site using packages like node-fetch or axios. I know it is possible to do it with the core Node.js module called http, but it is waaaaaay too much of a hassle to use and is not really recommended.
From that request you will get the raw HTML of the site (looks kinda rough, ngl). Now what you need to do with this raw HTML is parse it to get your image URL! You can use cheerio or jsdom to load your HTML and then query the document to get the image URL you want. Both are pretty awesome, ngl.
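A minimal sketch of the static approach with node-fetch and cheerio (the URL and selector are placeholders, and this assumes a CommonJS-compatible node-fetch v2):

```js
const fetch = require('node-fetch'); // node-fetch v2 style require
const cheerio = require('cheerio');

(async () => {
  // Fetch the raw HTML of the page (placeholder URL).
  const res = await fetch('https://example.com');
  const html = await res.text();

  // Load the HTML into cheerio and query it like jQuery.
  const $ = cheerio.load(html);
  const imageUrl = $('img').first().attr('src');

  console.log(imageUrl);
})();
```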

How to crawl websites that have front end js framework

We are currently trying Swiftype and wanted to see how to crawl our website, which uses a JavaScript framework and makes async calls.
I created an engine and was able to run a crawl based on my sitemap, but instead of reading the actual content, it is reading my AngularJS code.
For example:
if I have Angular code something like
<div ng-class='grey title'> {{ctrl.title}}</div>
and this data gets bound on page load, then instead of reading the title, the crawler reads the actual code as {{ctrl.title}},
so when I search, the page returns something like
"This article is about {{ctrl.title}} . We take you through.... "
Any idea on how to make it compatible with js frameworks?
You can use a "headless" browser through i.e. Playwright.dev. "Headless" means it doesn't have a GUI. Since it's actually a browser it'll interpret the page correctly. It can be started from a JavaScript that runs server-side. Check out Web Scraping : Handling AJAX website part I and the code on GitHub: introWebScraping.

Image urls change while scraping in Node (works in browser console)

I'm using artoo.js for web scraping; however, for some reason the scraped image URLs change when working with cheerio in Node. I.e. the original image URL is:
"https://images-na.ssl-images-amazon.com/images/M/MV5BNWU4NmY3MTMtMTBmMi00NjFjLTkwMmItYWZhZWUwNDg5M2ExXkEyXkFqcGdeQXVyNDUyOTg3Njg#._V1_SX300.jpg"
However, after scraping, the URL turns into this URL:
"http://ia.media-imdb.com/images/G/01/imdb/images/nopicture/156x231/tv-3797070466._CB522736147_.png#._V1_SX300.jpg"
If I scrape it in the Chrome browser console using the Artoo.js bookmarklet, the URL stays the same as the original.
Why is it changing when I use it in Node? Any suggestions?
UPDATE: I think I found the issue but not the solution. It seems the scraper method runs before the correct images have loaded on the page; the changed URL is just the placeholder image. How can I wait until the entire page loads?
It may be caused by some JS code. If you are using request + cheerio to scrape the page, then when you make the request in Node the JS code does nothing (it's not interpreted), so you are probably getting the original URL before any lib or piece of code changes it. Try looking at the source code of the page in the browser (Ctrl+U). If it's "http://ia.media-imdb.com/images/G/01/imdb/images/nopicture/156x231/tv-3797070466._CB522736147_.png#._V1_SX300.jpg", then you will know some piece of code is doing something to change it.
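A quick way to see exactly what Node receives, as a sketch with request + cheerio (the URL and selector are placeholders):

```js
const request = require('request');
const cheerio = require('cheerio');

request('https://example.com', (err, res, body) => {
  if (err) throw err;
  // body is the raw HTML exactly as the server sent it. No JavaScript has
  // run yet, so you see the same src as Ctrl+U shows in the browser.
  const $ = cheerio.load(body);
  console.log($('img').first().attr('src'));
});
```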
Edit
If you absolutely need to run the JS to obtain the URL, you should use PhantomJS. It's a headless browser, so the images will load. You can use it directly from Node.js or, if you want a simpler way, go with CasperJS. I assume you're not used to scraping complicated web apps; if that's the case, I would go with CasperJS. It's easy and it does the job. It's not as fast as request + cheerio, but it works, and you can put your code to run on a server.
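A minimal PhantomJS sketch along these lines (the URL, selector, and fixed delay are placeholder assumptions), run with `phantomjs script.js`:

```js
// script.js — run with: phantomjs script.js
var page = require('webpage').create();

page.open('https://example.com', function (status) {
  if (status !== 'success') {
    console.log('Failed to load page');
    phantom.exit(1);
  }

  // Give the page's JS a moment to swap the placeholder image for the real one.
  // A fixed delay is crude; a sturdier script would poll for the element instead.
  setTimeout(function () {
    var src = page.evaluate(function () {
      var img = document.querySelector('img.poster'); // placeholder selector
      return img ? img.src : null;
    });
    console.log(src);
    phantom.exit();
  }, 3000);
});
```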

React SPA - Show loading screen until multiple components are rendered

Apologies for the relatively conceptual question. I'm attempting to emulate the full-page Heroku loading screen that appears when the page first loads and disappears after the navbar, images, libraries, and other components have mounted.
I know the underlying stack of Heroku is Ember, but is there a good practice/way of accomplishing a similar effect using Node.js, webpack, React, React Router, and Express? It's not isomorphic, and I would prefer all rendering to take place on the client.
Thank you for your help and let me know if there's any other information that would be helpful in understanding what I'm trying to do.
I haven't tried this, but I imagine you could simply provide the HTML for your 'full page loading screen' as the initial HTML that comes with your page, and then replace it with your ReactDOM.render call, which will only run after everything has been processed.
More specifically, if you put all of your JS script tags in the footer, the rest of your document will load and render while the scripts are being fetched, giving you your 'loading screen' effect. It should work to simply put the loading graphic inside a div and then render over it with React.
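A minimal sketch of that idea (the file names, loader markup, and App component are assumptions, not from the question):

```js
// index.html ships a static loader inside the React mount point:
//
//   <div id="root">
//     <div class="full-page-loader">Loading…</div>
//   </div>
//   <script src="/bundle.js"></script>   <!-- at the end of <body> -->
//
// index.js — once the bundle has downloaded and run, rendering into #root
// replaces the static loading screen with the real app.
import React from 'react';
import ReactDOM from 'react-dom';
import App from './App'; // hypothetical root component

ReactDOM.render(<App />, document.getElementById('root'));
```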

Automatically saving web pages requiring login/HTTPS

I'm trying to automate some data scraping from a website. However, because the user has to go through a login screen, a wget cron job won't work, and because I need to make an HTTPS request, a simple Perl script won't work either. I've tried looking at the "DejaClick" add-on for Firefox to simply replay a series of browser events (logging into the website, navigating to where the interesting data is, downloading the page, etc.), but for some reason the add-on's developers didn't include saving pages as a feature.
Is there any quick way of accomplishing what I'm trying to do here?
A while back I used mechanize (wwwsearch.sourceforge.net/mechanize) and found it very helpful. It supports urllib2, so as I read now it should also work with HTTPS requests, in which case my comment above could hopefully prove wrong.
You can record your actions with the IRobotSoft web scraper. See the demo here: http://irobotsoft.com/help/
Then use the saveFile(filename, TargetPage) function to save the target page.
