Image URLs change while scraping in Node (works in browser console)

I'm using artoo.js for web scraping, but for some reason the scraped image URLs change when working with cheerio in Node. For example, the original image URL is:
"https://images-na.ssl-images-amazon.com/images/M/MV5BNWU4NmY3MTMtMTBmMi00NjFjLTkwMmItYWZhZWUwNDg5M2ExXkEyXkFqcGdeQXVyNDUyOTg3Njg#._V1_SX300.jpg"
However, after scraping, the URL turns into this:
"http://ia.media-imdb.com/images/G/01/imdb/images/nopicture/156x231/tv-3797070466._CB522736147_.png#._V1_SX300.jpg"
If I scrape it in the Chrome browser console using the Artoo.js bookmarklet, the URL stays the same as the original.
Why is it changing when I use it in Node? Any suggestions?
UPDATE: I think I found the issue but not the solution. It seems the scraper method runs before the correct images have loaded on the page; the changed URL is just the placeholder image. How can I wait until the entire page has loaded?

It may be caused by some JS code. If you are using request + cheerio to scrape the page, then when you make the request in Node the page's JS code does nothing (it is not interpreted), so you are probably getting the original URL before any library or piece of code changes it. Try looking at the source code of the page in the browser (Ctrl+U). If it shows "http://ia.media-imdb.com/images/G/01/imdb/images/nopicture/156x231/tv-3797070466._CB522736147_.png#._V1_SX300.jpg", then you will know some piece of code is changing it.
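For reference, here is a minimal sketch of that request + cheerio flow; the page URL and the img.poster selector are placeholders, not the asker's actual values:

    var request = require('request');
    var cheerio = require('cheerio');

    request('https://example.com/page', function (err, res, body) {
      if (err) throw err;
      var $ = cheerio.load(body);
      // Prints whatever src is in the raw HTML, i.e. the placeholder image,
      // because no page JavaScript has run to swap in the real one.
      console.log($('img.poster').attr('src'));
    });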
Edit
If you absolutely need to run the JS to obtain the URL, you should use PhantomJS. It's a headless browser, so the images will load. You can use it directly from Node.js, or, if you want a simpler way, go with CasperJS. I assume you're not used to scraping complicated web apps; if that's the case, I would go with CasperJS. It's easy and it does the job. It's not as fast as request + cheerio, but it works, and you can put your code to run on a server.
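A rough CasperJS sketch of that idea, assuming the image can be matched by a CSS selector (img.poster here is hypothetical); it waits for the element to appear before reading its src:

    // Run with `casperjs script.js`, not node.
    var casper = require('casper').create();

    casper.start('https://example.com/page');

    // Wait until the real image is in the DOM before reading its src.
    casper.waitForSelector('img.poster', function () {
        this.echo(this.getElementAttribute('img.poster', 'src'));
    });

    casper.run();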

Related

How do I pull an image from a webpage in Node.js?

I'm currently trying to pull an image URL from a website and have that image URL put into my code, if that makes any sense.
Essentially, the request to the page goes through, and I need my code to grab an image on the page (the URL of the image on the page) so it can be sent to a Discord webhook.
What's the best way of doing this in Node.js? If this isn't enough information, please feel free to let me know and I will try my best to expand on this! Thanks.
Well, since you have only posted your question and no code or concrete situation, I'll just elaborate on how you can do it in general.
Page is Dynamic
What I mean by that is that the content you want to fetch is loaded by JavaScript. In that case you could try headless libraries like Puppeteer and Nightmare. Note that all of these packages are fairly heavy; for example, Puppeteer installs Chromium (not the element, the browser! If you don't know about it, read this) and Nightmare works with Electron (again, not the chemistry one; it's an NPM package).
You can use their built-in functions to get the element you want, as in the sketch below. However, you will need to do a lot of Inspect Element work to find the exact element you want!
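A minimal Puppeteer sketch of this approach; the URL and the img.target selector are placeholders you would replace with your own:

    const puppeteer = require('puppeteer');

    (async () => {
      const browser = await puppeteer.launch();
      const page = await browser.newPage();
      // Wait until network activity settles so the page's scripts have run.
      await page.goto('https://example.com/page', { waitUntil: 'networkidle2' });

      // $eval runs in the page context, after JavaScript has executed.
      const imageUrl = await page.$eval('img.target', el => el.src);
      console.log(imageUrl);

      await browser.close();
    })();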
Page is Static
What I mean by static is that everything you need is already in the default HTML itself, so you won't need a headless browser for it. That's a few heavy dependencies fewer for you.
So what you will want to do is fetch the site using packages like node-fetch or Axios. It is possible to do it with the core Node.js module called http, but it is way too much of a hassle to use and is not really recommended.
Your request will return the raw HTML of the site (it looks kinda messy, not gonna lie). What you need to do with this raw HTML is parse it to get your image URL. You can use cheerio or jsdom to load the HTML and then query the document for the image URL you want, as sketched below. Both are pretty awesome.
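A small sketch with Axios and cheerio; again, the URL and the img.target selector are placeholders:

    const axios = require('axios');
    const cheerio = require('cheerio');

    async function getImageUrl(pageUrl) {
      const { data: html } = await axios.get(pageUrl); // raw HTML string
      const $ = cheerio.load(html);
      return $('img.target').attr('src'); // whatever src is in the markup
    }

    getImageUrl('https://example.com/page').then(console.log);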

How to crawl websites that have a front-end JS framework

We are currently trying Swiftype and wanted to see how to crawl our website, which is built with a JavaScript framework and makes async calls.
I created an engine and was able to run a crawl based on my sitemap, but instead of reading the actual content, it is reading my AngularJS code.
For example, if I have Angular markup like
<div ng-class='grey title'> {{ctrl.title}}</div>
and this data gets bound on page load, then instead of reading the title, the crawler reads the raw template, i.e. {{ctrl.title}},
so when I search, the result page returns something like
"This article is about {{ctrl.title}} . We take you through.... "
Any idea on how to make it compatible with js frameworks?
You can use a "headless" browser, e.g. through Playwright.dev. "Headless" means it doesn't have a GUI. Since it's actually a browser, it will interpret the page correctly. It can be started from JavaScript that runs server-side. Check out Web Scraping: Handling AJAX website part I and the code on GitHub: introWebScraping.
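A minimal Playwright sketch of that idea; the URL is a placeholder, and page.content() returns the DOM after the framework has rendered, so a binding like {{ctrl.title}} is already resolved:

    const { chromium } = require('playwright');

    (async () => {
      const browser = await chromium.launch(); // headless by default
      const page = await browser.newPage();
      await page.goto('https://example.com/article');

      // The HTML after Angular has bound its data.
      const renderedHtml = await page.content();
      console.log(renderedHtml);

      await browser.close();
    })();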

Responsive design for a working website

I'm a beginner web developer, front-end only. Sometimes I need to make existing working websites responsive. I use a browser extension (Styler for Chrome) that shows a window where I can insert my styles, which are then applied to the page. But it's a little tedious (I need to write the code in my text editor, copy it into the extension form, then again, again and again...). Is there a way to integrate my local stylesheet into an existing website, make changes only in my editor, and have the page reload automatically, like with a local page? I found something on the LiveReload website
(http://feedback.livereload.com/knowledgebase/articles/86220-preview-css-changes-against-a-live-site-then-uplo), but I can't use their app because I'm on Windows (LiveReload is still in beta there). If anybody uses something similar, can you please explain how to get it to work? Thanks.
While experimenting I found a simple solution for this:
1. Install the Chrome extension CSS Inject.
2. Run a web server on your machine that hosts the CSS file you want to inject (CSS Inject only works over HTTP) and point CSS Inject at it; in my case it looks like http://adapt/css/style.css.
3. Set up a livereload server. I'm using Node.js and this package, https://www.npmjs.com/package/livereload, for it.
4. Create a file in your website root (for example, server.js).
5. Paste this code into server.js:
    // Watch the local css directory and push changes to the browser.
    var livereload = require('livereload');
    var server = livereload.createServer();

    server.watch(__dirname + "/css");
    console.log('waiting for changes');
6. Go to your live website and activate CSS Inject.
7. Run node ./server.js.
That's it. You can now modify your styles locally and see the changes on the real website.
If anybody knows a better solution (using the API from this package, https://www.npmjs.com/package/livereload#api-options, specifically the overrideURL option) or has more experience with Node.js and understands how to implement it, please post your solution here; I will be grateful.

node.js cheerio (scraping) live updating page

There is a webpage with live text data in a span tag that updates without the page refreshing. Is it possible to use cheerio or maybe another Node.js module to get the page info and keep the page open so Node.js also sees the updates?
I would like to avoid re-requesting. As a human with the webpage open in the browser I do not need to refresh, so logically the same should be doable in Node.js.
True?
You can use PhantomJS.
It's like a real browser, but without a window.
You can handle all browser events, so you can know when an element is added to the page.
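A rough PhantomJS sketch of that idea: it keeps the page open and reports the span's text whenever it changes. The URL and the #live selector are placeholders, and the polling could be replaced with a MutationObserver:

    // Run with `phantomjs script.js`.
    var page = require('webpage').create();

    // Receives values sent from inside the page via window.callPhantom(...).
    page.onCallback = function (text) {
      console.log('span updated:', text);
    };

    page.open('https://example.com/live', function (status) {
      if (status !== 'success') phantom.exit(1);
      page.evaluate(function () {
        var el = document.querySelector('#live');
        var last = el.textContent;
        setInterval(function () {          // poll for changes
          if (el.textContent !== last) {
            last = el.textContent;
            window.callPhantom(last);
          }
        }, 500);
      });
      // No phantom.exit() here: the script stays alive to watch updates.
    });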

In SpiderMonkey, how to get page URL

In SpiderMonkey, I want to get the page URL in a string-creating function. How do I do that? Also, I don't know where the SpiderMonkey forum site is; can someone tell me? Thanks!
The standalone SpiderMonkey is just a JavaScript engine; it does not expose web-related features (HTML, DOM). If you are running a JavaScript script inside Firefox, you can get the URL through the standard DOM object window.
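For example (this only works where the DOM is available, such as a script running in a Firefox page, not in the bare SpiderMonkey shell):

    // Hypothetical string-creating function using the DOM's window object.
    function buildMessage() {
      return 'Current page: ' + window.location.href;
    }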
