I'm writing an application using Node.js.
One of the functions I want to create is to navigate to a specific URL.
for example:
navigate inside a website
taking some informations of the page like (contacts,etc..)
I'm able to take a URL but i don't know how to navigate inside it?
some advices?
If your goal is to make some sort of crawler, you should have a look at request and cheerio which is a sort of jquery for node, and HTML parser.
Related
I'm currently trying to pull an image url from a website, and have that image url put in my code if that makes any sense.
Essentially, the request to the page goes through, and I need my code to grab an image on the page (the url of the image on the page), so it can send to a discord webhook.
What's the best way of doing this in node js? If this isn't enough information, please feel free to let me know and I will try my best to expand on this! Thanks.
Well, since you have only put your question and no code/situation, I'll just elaborate on how you can get it in general.
Page is Dynamic
What I mean by that is like, your page has content which you want to fetch and that content is loaded by JavaScript. Then you could try using Headless libraries like Puppeteer and Nightmare. Note that, all of these packages are kinda heavy, for example, Puppeteer installs Chromium (Not the Element, the Browser! If you don't know about it, read this) and Nightmare works with Electron (Again not the chemistry one, it is an NPM Package)
You can leverage the built-in functions in there to get the element which you want. However, you will need to do a lot of Inspect Element stuff to get the exact element you want!
Page is Static
What I mean by Static is that all of the stuff you need is in the default HTML yourself! So you won't need headless browsers for that. That's some heavy dependencies less for you.
So what you will want to do is, fetch the site using packages like Node-Fetch and Axios. I know it is possible to do it with the core Node.js Module called http but it is waaaaaay too much of a hassle to use and is not really that suggested.
On your request, you will get the Raw HTML of the site (looks kinda trash ngl). So now what you will need to do with this Raw HTML is to PARSE it to get your Image URL! You can use cheerio and JSDom to load your HTML and then parse the document to get the Image URL you want. Both are pretty awesome ngl.
Website renders products after filters are applied but source code remains same. I want to scrape products which are present on the screen. I am using cheerio and axios in node.js but cheerio gets only the source code body which is not what I require.
I think easiest way to do it would be to take a look into network tab. My suspicion is that website is dynamicly loading this data fro some sort of API, so you actually need to scrape incoming data from that API and not website itself.
We are currently trying swiftype and wanted to see how to Crawl our website that has javascript frameworks becauase there are async calls.
I created a engine and was able to run a crawl based my sitemap, but instead of reading the actual content, it is reading my Angular js code.
For eg:
if have an angular code something like
<div ng-class='grey title'> {{ctrl.title}}</div>
and if this data gets binded on page load, instead of reading the title, it reads the actual code as {{ctrl.title}}
so when i search, the page returns something like
"This article is about {{ctrl.title}} . We take you through.... "
Any idea on how to make it compatible with js frameworks?
You can use a "headless" browser through i.e. Playwright.dev. "Headless" means it doesn't have a GUI. Since it's actually a browser it'll interpret the page correctly. It can be started from a JavaScript that runs server-side. Check out Web Scraping : Handling AJAX website part I and the code on GitHub: introWebScraping.
I'm using artoo.js for web scraping however For some reason the scraped image url's change when working with cheerio in node . i.e the original image url is :
"https://images-na.ssl-images-amazon.com/images/M/MV5BNWU4NmY3MTMtMTBmMi00NjFjLTkwMmItYWZhZWUwNDg5M2ExXkEyXkFqcGdeQXVyNDUyOTg3Njg#._V1_SX300.jpg"
However after scraping the Url turns to this url:
"http://ia.media-imdb.com/images/G/01/imdb/images/nopicture/156x231/tv-3797070466._CB522736147_.png#._V1_SX300.jpg"
If I scrape it while in chrome browser console using Artoo.js bookmark. The Url stays same as original.
Why is it changing when i use it in node?.Any Suggestions
UPDATE: Update: I think I found the issue but not the solution. It seems the scraper method runs before the correct images have loaded on page. the changed URL is just the placeholder image. How can I wait till the entire page loads.
It may be caused by some JS code. If you are using request+cheerio to scrap the page. When you make the request in node the JS code does nothing (it's not interpreted). So you are probably getting the original url before any lib or piece of code changes it. Try to look at the source code of the page in the browser Crtl+u. If it's "http://ia.media-imdb.com/images/G/01/imdb/images/nopicture/156x231/tv-3797070466._CB522736147_.png#._V1_SX300.jpg" then you will know some piece of code is doing something to change it.
Edit
If you absolutly need to run the JS to obtain the URL. You sould use phantomjs. It's a headless browser. The imaes will load. You can use it directly from nodejs or if you want a simpler way go with casperjs. I assume you're not used to scraping complicated web apps. If it's the case would go with casperjs. It's easy and it does the job. It's not as fast as using request + cheerio but it works. And you can put your code to run on a server.
I know there's a way for extensions and pages to communicate locally, but I need to send a message from an outside URL, have my Chrome Extension listen for it.
I have tried easyXDM in the background page, but it seems to stop listening after awhile, as if Google "turns off" the Javascript in the background page after awhile.
I think you may try some walk around and build a site with some specific data structure, and then implement a content script which will look for this specific that specific data structure, and when i finds one it can fetch the data you want to be passed to your extension.
Yes, you need a content script that communicates with the page using DOM Events.. Instructions on how to do that are here:
http://code.google.com/chrome/extensions/content_scripts.html#host-page-communication