node.js cheerio (scraping) live updating page

There is a webpage with live text data in a span tag that updates without the page refreshing. Is it possible to use cheerio, or maybe another node.js module, to get the page info and keep the page open so that node.js also sees the updates?
I would like to avoid re-requesting the page. As a human with the webpage open in the browser, I do not need to refresh, so logically the same should be doable in node.js.
True?

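cheerio alone cannot do this: it only parses the HTML string you give it and does not run the page's JavaScript. You can use PhantomJS instead.
It's like a real browser, but without a window.
You can handle all browser events, so you can know when an element is added to the page.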
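
The idea in this answer can be sketched as follows. This is only a sketch under assumptions: it uses Puppeteer (headless Chrome) rather than PhantomJS, which is no longer actively developed, and the URL and selector in the usage comment are hypothetical. The helper watchText is a made-up name, not part of any library. It keeps one page open and reports every text change of the element back to Node, so nothing is re-requested.

```javascript
// Sketch: watch a live-updating element from Node with a headless browser.
// Assumes Puppeteer is installed (npm install puppeteer).

/**
 * Calls onChange(text) in Node each time the element's text changes.
 * `page` is a Puppeteer Page that has already navigated to the target URL.
 */
async function watchText(page, selector, onChange) {
  // Make the Node-side callback available in the page as window.__reportText.
  await page.exposeFunction('__reportText', onChange);

  // This function runs inside the browser, not in Node.
  await page.evaluate((sel) => {
    const el = document.querySelector(sel);
    if (!el) throw new Error(`No element matches ${sel}`);
    window.__reportText(el.textContent); // report the current value once
    new MutationObserver(() => window.__reportText(el.textContent))
      .observe(el, { childList: true, characterData: true, subtree: true });
  }, selector);
}

// Usage (hypothetical URL and selector):
// const puppeteer = require('puppeteer');
// (async () => {
//   const browser = await puppeteer.launch();
//   const page = await browser.newPage();
//   await page.goto('https://example.com/live');
//   await watchText(page, 'span#value', (text) => console.log('now:', text));
// })();
```

The browser stays open for as long as the Node process runs, which mirrors a human leaving the tab open.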

Related

How do I pull an image from a webpage in node js?

I'm currently trying to pull an image url from a website, and have that image url put in my code if that makes any sense.
Essentially, the request to the page goes through, and I need my code to grab an image on the page (the url of the image on the page), so it can send to a discord webhook.
What's the best way of doing this in node js? If this isn't enough information, please feel free to let me know and I will try my best to expand on this! Thanks.

Well, since you have only put your question and no code/situation, I'll just elaborate on how you can get it in general.
Page is Dynamic
By dynamic I mean that the content you want to fetch is loaded by JavaScript. In that case you could try a headless browser library like Puppeteer or Nightmare. Note that these packages are kinda heavy: Puppeteer installs Chromium (the browser, not the element!) and Nightmare works with Electron (again, not the chemistry one; it is an NPM package).
You can leverage their built-in functions to get the element you want. However, you will need to do a lot of Inspect Element work to find the exact selector for it!
Page is Static
By static I mean that all of the stuff you need is in the default HTML itself! So you won't need a headless browser for that. That's a few heavy dependencies fewer for you.
So what you will want to do is fetch the site using a package like node-fetch or Axios. I know it is possible to do it with the core Node.js module called http, but it is waaaaaay too much of a hassle to use and is not really recommended.
From your request you will get the raw HTML of the site (looks kinda trash ngl). So now what you need to do with this raw HTML is PARSE it to get your image URL! You can use cheerio or jsdom to load your HTML and then query the document for the image URL you want. Both are pretty awesome ngl.
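
For the static case, a minimal sketch could look like this. It assumes Node 18+ (for the global fetch) and cheerio; the function names are made up for illustration, and the default 'img' selector is a placeholder you would replace with whatever Inspect Element shows for your page. Sending the result to the Discord webhook is left out.

```javascript
// Sketch: pull image URLs out of a static page.
// Assumes Node 18+ (global fetch) and cheerio (npm install cheerio).

// Turn a possibly relative src into an absolute URL, using the page URL as base.
function toAbsoluteUrl(src, pageUrl) {
  return new URL(src, pageUrl).href;
}

// Collect the src of every element matching `selector` in the given HTML.
function extractImageUrls(html, pageUrl, selector = 'img') {
  // Loaded here so that toAbsoluteUrl stays usable without the dependency.
  const cheerio = require('cheerio');
  const $ = cheerio.load(html);
  return $(selector)
    .map((_, el) => $(el).attr('src'))
    .get()
    .filter(Boolean)
    .map((src) => toAbsoluteUrl(src, pageUrl));
}

// Fetch the page and return the absolute image URLs found in it.
async function imageUrlsFromPage(pageUrl, selector) {
  const res = await fetch(pageUrl);
  if (!res.ok) throw new Error(`Request failed: ${res.status}`);
  return extractImageUrls(await res.text(), pageUrl, selector);
}
```

Resolving against the page URL matters because many pages use relative src values, which a webhook cannot load on their own.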

How to crawl websites that have front end js framework

We are currently trying Swiftype and wanted to see how to crawl our website, which uses JavaScript frameworks, because there are async calls.
I created an engine and was able to run a crawl based on my sitemap, but instead of reading the actual content, it is reading my AngularJS code.
For example, if I have Angular code something like
<div ng-class='grey title'> {{ctrl.title}}</div>
and this data gets bound on page load, then instead of reading the title, it reads the actual code as {{ctrl.title}},
so when I search, the page returns something like
"This article is about {{ctrl.title}} . We take you through.... "
Any idea on how to make it compatible with JS frameworks?

You can use a "headless" browser through, e.g., Playwright (playwright.dev). "Headless" means it doesn't have a GUI. Since it's actually a browser, it will run the page's JavaScript and see the rendered content. It can be started from JavaScript that runs server-side. Check out Web Scraping : Handling AJAX website part I and the code on GitHub: introWebScraping.
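
As a rough sketch of what this looks like with Playwright (assuming the playwright package is installed; renderedHtml is a made-up helper name and the URL in the usage comment is hypothetical), the following returns the HTML after the framework has filled in bindings such as {{ctrl.title}}. How that HTML is then handed to the crawler is not covered here.

```javascript
// Sketch: get a page's HTML after its client-side JavaScript has run.
// Assumes Playwright is installed (npm install playwright).

/**
 * `browser` is a Playwright Browser, e.g. from chromium.launch().
 * Resolves to the serialized DOM once network activity has settled.
 */
async function renderedHtml(browser, url) {
  const page = await browser.newPage();
  try {
    // 'networkidle' waits until the page has stopped making requests,
    // which gives async calls a chance to finish before we read the DOM.
    await page.goto(url, { waitUntil: 'networkidle' });
    return await page.content();
  } finally {
    await page.close();
  }
}

// Usage (hypothetical URL):
// const { chromium } = require('playwright');
// (async () => {
//   const browser = await chromium.launch();
//   console.log(await renderedHtml(browser, 'https://example.com/article'));
//   await browser.close();
// })();
```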

Image urls change while scraping in Node (works in browser console)

I'm using artoo.js for web scraping. However, for some reason the scraped image URLs change when working with cheerio in Node, i.e. the original image URL is:
"https://images-na.ssl-images-amazon.com/images/M/MV5BNWU4NmY3MTMtMTBmMi00NjFjLTkwMmItYWZhZWUwNDg5M2ExXkEyXkFqcGdeQXVyNDUyOTg3Njg#._V1_SX300.jpg"
However, after scraping, the URL turns into this:
"http://ia.media-imdb.com/images/G/01/imdb/images/nopicture/156x231/tv-3797070466._CB522736147_.png#._V1_SX300.jpg"
If I scrape it in the Chrome browser console using the artoo.js bookmarklet, the URL stays the same as the original.
Why is it changing when I use it in Node? Any suggestions?
UPDATE: I think I found the issue but not the solution. It seems the scraper method runs before the correct images have loaded on the page. The changed URL is just the placeholder image. How can I wait until the entire page loads?

It may be caused by some JS code. If you are using request + cheerio to scrape the page, the JS code does nothing when you make the request in Node (it's not interpreted). So you are probably getting the original URL before any library or piece of code changes it. Try looking at the source code of the page in the browser (Ctrl+U). If it's "http://ia.media-imdb.com/images/G/01/imdb/images/nopicture/156x231/tv-3797070466._CB522736147_.png#._V1_SX300.jpg" then you will know some piece of code is changing it.
Edit
If you absolutely need to run the JS to obtain the URL, you should use PhantomJS. It's a headless browser, so the images will load. You can use it directly from Node.js or, if you want a simpler way, go with CasperJS. I assume you're not used to scraping complicated web apps. If that's the case, I would go with CasperJS. It's easy and it does the job. It's not as fast as request + cheerio, but it works, and you can run your code on a server.
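
To illustrate waiting until the placeholder has been swapped out, here is a hedged sketch that uses Puppeteer in place of PhantomJS or CasperJS. The helper name waitForImageSrc is made up, the 15 second default timeout is arbitrary, and the marker 'nopicture' is only a guess taken from the placeholder URL in the question.

```javascript
// Sketch: wait until an <img> no longer points at the placeholder image.
// Assumes Puppeteer is installed (npm install puppeteer).

/**
 * `page` is a Puppeteer Page that has already navigated to the target URL.
 * Resolves to the image's src once it no longer contains `placeholderMarker`.
 */
async function waitForImageSrc(page, selector, placeholderMarker, timeout = 15000) {
  // The predicate runs inside the browser and is polled until it returns true.
  await page.waitForFunction(
    (sel, marker) => {
      const img = document.querySelector(sel);
      return Boolean(img && img.src && !img.src.includes(marker));
    },
    { timeout },
    selector,
    placeholderMarker
  );
  return page.$eval(selector, (img) => img.src);
}

// Usage (hypothetical selector):
// const src = await waitForImageSrc(page, 'img.poster', 'nopicture');
```

Waiting on the attribute itself tends to be more reliable than waiting for a generic "page loaded" event, because scripts may swap the image after load has fired.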

How to perform a POST through a Chrome extension?

How can I perform a POST through a Chrome extension? Let's say I want to send the current tab's page title to a webpage.

You can do POST XHRs from chrome extensions to any URL, as long as you have host permissions defined in your manifest. See these docs.
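
A small sketch of such a POST, written with fetch instead of a raw XHR. The endpoint URL, the JSON body shape, and the function names are all assumptions for illustration; the endpoint's origin would have to be listed in the manifest's host permissions, and reading tab.title is assumed to need the "tabs" permission.

```javascript
// Sketch: POST the current tab's title from an extension script.

// Build the fetch options for sending a title as JSON.
function buildTitleRequest(title) {
  return {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ title }),
  };
}

// Send the title; fetchImpl can be swapped out when testing.
async function postTitle(endpoint, title, fetchImpl = fetch) {
  const res = await fetchImpl(endpoint, buildTitleRequest(title));
  if (!res.ok) throw new Error(`POST failed: ${res.status}`);
  return res;
}

// Usage inside the extension (hypothetical endpoint):
// chrome.tabs.query({ active: true, currentWindow: true }, ([tab]) => {
//   postTitle('https://example.com/api/titles', tab.title);
// });
```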

In a Chrome extension, the best way to do what I think you want is via a content script (see the documentation). A word of warning, however: pinging your server with a POST request every time someone with your extension installed opens a web page is going to be extremely heavy on your servers, especially if you have a lot of installs. A possible solution is to use the content script to keep a tally of the sites a user visits and save this data in an HTML5 database (which Chrome supports), then have background.html send the data in bulk at given intervals with an AJAX request. This will significantly cut down the number of times your server is pinged.
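
The batching idea can be sketched like this. It is a simplification: the tally lives in memory rather than in an HTML5 database, the class name VisitTally is made up, and postBatch in the usage comment is a hypothetical upload function.

```javascript
// Sketch: count visits per host and hand them over in bulk.

class VisitTally {
  constructor() {
    this.counts = new Map();
  }

  // Count one visit for the URL's hostname.
  record(url) {
    const host = new URL(url).hostname;
    this.counts.set(host, (this.counts.get(host) || 0) + 1);
  }

  // Return the tally as a plain object and reset it.
  drain() {
    const batch = Object.fromEntries(this.counts);
    this.counts.clear();
    return batch;
  }
}

// Usage in the background page (interval chosen arbitrarily):
// const tally = new VisitTally();
// setInterval(() => {
//   const batch = tally.drain();
//   if (Object.keys(batch).length > 0) postBatch(batch);
// }, 15 * 60 * 1000);
```

With this shape the server receives one request per interval per user instead of one per page view.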

How can I pass a message from outside URL to my Chrome Extension?

I know there's a way for extensions and pages to communicate locally, but I need to send a message from an outside URL and have my Chrome extension listen for it.
I have tried easyXDM in the background page, but it seems to stop listening after a while, as if Google "turns off" the JavaScript in the background page after a while.

I think you could try a workaround: build a site with a specific data structure, and then implement a content script that looks for that data structure. When it finds one, it can fetch the data you want and pass it to your extension.

Yes, you need a content script that communicates with the page using DOM events. Instructions on how to do that are here:
http://code.google.com/chrome/extensions/content_scripts.html#host-page-communication
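
A minimal sketch of such a bridge, assuming the page sends its data with window.postMessage and tags it with a marker string. The function name installBridge, the marker 'my-site', and the message shape { source, payload } are all made up for illustration.

```javascript
// Sketch: content script that forwards tagged page messages to the extension.

/**
 * `win` is the page's window and `runtime` is chrome.runtime.
 * Only messages posted by the page itself and tagged with
 * `expectedSource` are forwarded.
 */
function installBridge(win, runtime, expectedSource) {
  win.addEventListener('message', (event) => {
    if (event.source !== win) return; // ignore messages from other frames
    if (!event.data || event.data.source !== expectedSource) return;
    runtime.sendMessage(event.data.payload);
  });
}

// In the content script:
//   installBridge(window, chrome.runtime, 'my-site');
// In the web page:
//   window.postMessage({ source: 'my-site', payload: { hello: 'extension' } },
//                      window.location.origin);
```

The background page would then pick the payload up with a chrome.runtime.onMessage listener.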
