We are currently trying Swiftype and wanted to see how to crawl our website, which uses JavaScript frameworks and makes async calls.
I created an engine and was able to run a crawl based on my sitemap, but instead of reading the actual content, it is reading my AngularJS code.
For example:
if I have Angular code something like
<div ng-class='grey title'> {{ctrl.title}}</div>
and if this data gets bound on page load, then instead of reading the title, it reads the actual code as {{ctrl.title}},
so when I search, the page returns something like
"This article is about {{ctrl.title}} . We take you through.... "
Any idea how to make it compatible with JS frameworks?
You can use a "headless" browser, e.g. through Playwright.dev. "Headless" means it doesn't have a GUI. Since it's actually a browser, it will interpret the page correctly, so Angular bindings like {{ctrl.title}} are rendered before you read the content. It can be started from JavaScript that runs server-side. Check out Web Scraping: Handling AJAX website part I and the code on GitHub: introWebScraping.
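For example, here is a minimal Playwright sketch of that idea; the URL and the div.title selector are placeholders for your own page:

```js
// Minimal sketch assuming `npm install playwright`; the URL and selector are hypothetical.
const { chromium } = require('playwright');

(async () => {
  const browser = await chromium.launch(); // headless by default
  const page = await browser.newPage();
  await page.goto('https://example.com/article', { waitUntil: 'networkidle' });

  // By now Angular has replaced {{ctrl.title}} with the bound value,
  // so we read the rendered text instead of the template expression.
  const title = await page.textContent('div.title');
  console.log(title);

  await browser.close();
})();
```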
I'm currently trying to pull an image URL from a website and have that image URL available in my code, if that makes any sense.
Essentially, the request to the page goes through, and I need my code to grab an image on the page (the URL of the image), so it can be sent to a Discord webhook.
What's the best way of doing this in Node.js? If this isn't enough information, please feel free to let me know and I will try my best to expand on this! Thanks.
Well, since you have only posted your question and no code or context, I'll just elaborate on how you can do it in general.
Page is Dynamic
What I mean by that is that the content you want to fetch is loaded by JavaScript after the page loads. In that case you could try headless browser libraries like Puppeteer and Nightmare. Note that all of these packages are kinda heavy; for example, Puppeteer installs Chromium (not the element, the browser! If you don't know about it, read this) and Nightmare works with Electron (again, not the chemistry one, it is an NPM package).
You can leverage their built-in functions to get the element you want. However, you will need to do a lot of Inspect Element work to find the exact element you're after!
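For instance, a minimal Puppeteer sketch; the URL and the img.hero selector are placeholders you would replace after inspecting the page:

```js
// Minimal sketch assuming `npm install puppeteer`; the URL and selector are hypothetical.
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  // Wait until the network is quiet so JS-loaded images have appeared.
  await page.goto('https://example.com/some-page', { waitUntil: 'networkidle0' });

  // Grab the resolved src of the image you found via Inspect Element.
  const imageUrl = await page.$eval('img.hero', img => img.src);
  console.log(imageUrl); // pass this on to your Discord webhook

  await browser.close();
})();
```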
Page is Static
What I mean by static is that everything you need is already in the initial HTML itself! So you won't need headless browsers for that, which means fewer heavy dependencies for you.
So what you will want to do is fetch the site using packages like node-fetch or Axios. I know it is possible to do it with the core Node.js module called http, but it is way too much of a hassle to use and not really recommended.
From your request you will get the raw HTML of the site (looks kinda trash, ngl). What you need to do with this raw HTML is parse it to get your image URL! You can use cheerio or jsdom to load the HTML and then query the document for the image URL you want. Both are pretty awesome, ngl.
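A minimal sketch of that flow with node-fetch and cheerio, again with a placeholder URL and selector:

```js
// Minimal sketch assuming `npm install node-fetch cheerio`; the URL and selector are hypothetical.
const fetch = require('node-fetch');
const cheerio = require('cheerio');

(async () => {
  const res = await fetch('https://example.com/some-page');
  const html = await res.text();               // the raw HTML string

  const $ = cheerio.load(html);                 // parse it
  const imageUrl = $('img.hero').attr('src');   // query it like jQuery
  console.log(imageUrl);
})();
```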
I have developed and hosted a few React applications, but I'm still confused about the term server-side rendering. So, I'm curious to know: what is the benefit of serving my React applications from an Express server?
Thank you as I look forward to your response.
Please consider clarifying your question, because I also don't know whether you use Next.js or Nuxt.js.
Angular, React, and Vue.js are JavaScript frontend frameworks/libraries, which means they use the JavaScript language.
Websites on the Internet are ultimately made of HTML, CSS, and vanilla JavaScript code.
The application you developed with React.js uses the React.js library, and your React code then needs to be compiled down to vanilla JavaScript, because in the end the browser needs native JavaScript code that it can read.
When I want to see your hosted website, I type the domain name, and the HTTP request ends up at the server hosting your React app. So if your React application contains only static content, for example a landing HTML page, then there is no need for server-side rendering from Express (Node.js), Ruby, PHP, Java, ...
What I mean by static content is content that doesn't change from user to user; all of them get the same content.
Notice you can host a static website on GitHub and you still don't need any server-side rendering.
Let's have a small application for a better explanation:
If you developed a portfolio that contains a description of yourself, images of your projects, and your skills, then there is no need for server-side rendering.
But if you developed a system that lets an authorized user create a short link from a full URL, then you need a backend server (Java, Ruby, C#, PHP, ...) to host the logic that generates a tiny URL from the full URL and saves it in a database. That way, when any user clicks the generated tiny URL, the request goes to your backend server, which redirects the user to the correct full URL. An application like this cannot be built with React.js alone; you need a server to handle the logic.
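A minimal sketch of that redirect logic in Express, using an in-memory Map purely for illustration (a real app would use a database) and hypothetical route names:

```js
// Minimal sketch; the Map and routes are hypothetical stand-ins for real storage and API design.
const express = require('express');
const app = express();
app.use(express.json());

const links = new Map(); // shortCode -> full URL (stand-in for a database)

// Create a short code for a full URL.
app.post('/shorten', (req, res) => {
  const code = Math.random().toString(36).slice(2, 8);
  links.set(code, req.body.url);
  res.json({ short: `https://sho.rt/${code}` });
});

// Redirect a short code to the original URL.
app.get('/:code', (req, res) => {
  const url = links.get(req.params.code);
  if (!url) return res.sendStatus(404);
  res.redirect(url);
});

app.listen(3000);
```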
Returning to your question: "So, I'm curious to know what is the benefit of serving my react applications from an express server."
If you only have static content you can avoid Express, but if you think your application will need some backend logic in the future, then Express or any other backend framework will help you with that.
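For completeness, here is a minimal sketch of serving a React production build from Express, with room to add backend routes later; the `build` folder is the default create-react-app output and the API route is just an example:

```js
// Minimal sketch: serve the compiled React app as static files, plus a sample backend route.
const path = require('path');
const express = require('express');
const app = express();

// Serve the static files produced by the React build.
app.use(express.static(path.join(__dirname, 'build')));

// Example backend route you could add later.
app.get('/api/health', (req, res) => res.json({ ok: true }));

// Send index.html for any other route so client-side routing keeps working.
app.get('*', (req, res) => {
  res.sendFile(path.join(__dirname, 'build', 'index.html'));
});

app.listen(3000, () => console.log('Listening on http://localhost:3000'));
```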
*Notice that when you have a static website and you edit its content, users who have already visited your website may have its old content cached in their browser (unless they disabled caching). If your website is cached in users' browsers, they might not get the updated content unless you change the static file name, for example by appending something like ?092130123 to it, so that their browsers download the updated file.
I'm using artoo.js for web scraping, however for some reason the scraped image URLs change when working with cheerio in Node, i.e. the original image URL is:
"https://images-na.ssl-images-amazon.com/images/M/MV5BNWU4NmY3MTMtMTBmMi00NjFjLTkwMmItYWZhZWUwNDg5M2ExXkEyXkFqcGdeQXVyNDUyOTg3Njg#._V1_SX300.jpg"
However, after scraping, the URL turns into this URL:
"http://ia.media-imdb.com/images/G/01/imdb/images/nopicture/156x231/tv-3797070466._CB522736147_.png#._V1_SX300.jpg"
If I scrape it in the Chrome browser console using the Artoo.js bookmarklet, the URL stays the same as the original.
Why is it changing when I use it in Node? Any suggestions?
UPDATE: I think I found the issue but not the solution. It seems the scraper method runs before the correct images have loaded on the page; the changed URL is just the placeholder image. How can I wait until the entire page loads?
It may be caused by some JS code. If you are using request + cheerio to scrape the page, the JS code does nothing when you make the request in Node (it's not interpreted). So you are probably getting the URL as it appears in the raw HTML, before any library or piece of code changes it. Try looking at the source code of the page in the browser (Ctrl+U). If it's "http://ia.media-imdb.com/images/G/01/imdb/images/nopicture/156x231/tv-3797070466._CB522736147_.png#._V1_SX300.jpg" then you will know some piece of code is changing it afterwards.
Edit
If you absolutely need to run the JS to obtain the URL, you should use PhantomJS. It's a headless browser, so the images will load. You can use it directly from Node.js, or if you want a simpler way, go with CasperJS. I assume you're not used to scraping complicated web apps; if that's the case, I would go with CasperJS. It's easy and it does the job. It's not as fast as using request + cheerio, but it works, and you can put your code to run on a server.
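A minimal CasperJS sketch of that idea, waiting for the real image before reading its src; the selector here is a placeholder, so inspect the page to find the right one:

```js
// Minimal sketch; run with `casperjs scrape.js`. The URL and selector are hypothetical.
var casper = require('casper').create();

casper.start('https://www.imdb.com/some-title/');

// Wait until the selector matches, i.e. the page's JS has put the real image in place.
casper.waitForSelector('img.poster', function () {
  var src = this.getElementAttribute('img.poster', 'src');
  this.echo('Image URL: ' + src);
}, function onTimeout() {
  this.echo('Image never appeared');
}, 10000);

casper.run();
```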
There is a webpage with live text data in a span tag that updates without the page refreshing. Is it possible to use cheerio or maybe another Node.js module to get the page info and keep it open, so Node.js also sees the updates?
I would like to avoid re-requesting over and over. As a human with the webpage open in the browser, I do not need to refresh, so logically the same should be doable in Node.js.
True?
You can use PhantomJS.
It's like a real browser, but without a window.
You can handle all browser events, so you can know when an element is added to the page.
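A minimal PhantomJS sketch of that approach, assuming a hypothetical URL and span id, and a PhantomJS build whose WebKit supports MutationObserver; it injects an observer into the page and reports every change back to the script via callPhantom:

```js
// Minimal sketch; run with `phantomjs watch.js`. The URL and #live-span id are hypothetical.
var page = require('webpage').create();

// Receives messages sent from inside the page via window.callPhantom(...).
page.onCallback = function (text) {
  console.log('span updated: ' + text);
};

page.open('https://example.com/live-data', function (status) {
  if (status !== 'success') {
    console.log('Failed to load the page');
    phantom.exit(1);
    return;
  }

  // Inside the page: watch the span and report every change.
  page.evaluate(function () {
    var span = document.querySelector('#live-span');
    var observer = new MutationObserver(function () {
      window.callPhantom(span.textContent);
    });
    observer.observe(span, { childList: true, characterData: true, subtree: true });
  });
  // The page stays open; call phantom.exit() when you are done watching.
});
```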
In SpiderMonkey, I want to get the page URL inside a function that builds a string. How can I do that? I also don't know where the SpiderMonkey forum is; can someone tell me? Thanks!
Standalone SpiderMonkey is just a JavaScript engine; it does not expose web-related features (HTML, DOM). If you are running a JavaScript script inside Firefox, you can get the URL through the standard DOM object window.
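For example, where a window object exists (i.e. in a browser context, not the standalone shell):

```js
// Only works where the DOM is available, e.g. inside Firefox; not in the SpiderMonkey shell.
var pageUrl = window.location.href;
```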