Node.js web scraping from e-commerce platforms' search results - node.js

I'm learning to scrape search results from e-commerce platforms (eBay, etc.) with Node.js.
The problem I'm facing is that these platforms mix sponsored products into their results,
so sponsored items appear together with the non-sponsored but keyword-relevant items on the search result page.
When I use Postman to inspect the API responsible for the search results,
it appears that only the non-sponsored, relevant items can be retrieved from that API call.
As a result, simply calling the API doesn't work here, because I want to scrape the sponsored items as well.
I would like to ask:
using Node.js, how do I scrape both the sponsored and non-sponsored items that appear on the search result page?
I'm thinking about using packages such as jsdom or puppeteer; may I ask if my thinking is on the right track? Thanks a lot!

I think you should send GET requests with axios or similar and then parse the returned HTML, or render the page with Puppeteer; regex can work for simple extractions. You are on the right path.
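For the rendering route, here is a minimal Puppeteer sketch (the search URL and the CSS selectors are placeholders; the real markup on eBay and similar platforms differs and changes often):

```javascript
// Minimal sketch: render the search page in a headless browser so that
// sponsored items (often injected client-side) are present in the DOM.
// The URL and all selectors below are placeholders, not real eBay markup.
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://www.ebay.com/sch/i.html?_nkw=keyboard', {
    waitUntil: 'networkidle2', // wait until client-side content has settled
  });

  // Grab every result item, sponsored or not, from the rendered DOM.
  const items = await page.$$eval('.s-item', nodes =>
    nodes.map(node => ({
      title: node.querySelector('.s-item__title')?.textContent.trim(),
      price: node.querySelector('.s-item__price')?.textContent.trim(),
      // Hypothetical check: flag items whose markup labels them as sponsored.
      sponsored: node.textContent.includes('Sponsored'),
    }))
  );

  console.log(items);
  await browser.close();
})();
```

Because Puppeteer executes the page's JavaScript, whatever the browser would show (including injected ad slots) ends up in the DOM you query, which is exactly what the plain API call was missing.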

Related

How to extract data from HTML with XPath

I'm trying to extract product prices from Google Shopping with Google Spreadsheet:
=IMPORTXML("https://www.google.com.br/?source=pshome-c-0-3&sa=X&ved=0ahUKEwjfguD5xaHKAhXMiZAKHWjuBi8Q7j8IEA#tbm=shop&q=Samsung+Galaxy+S6&spd=0";"/div[@class='product-results']/div[@class='psli'][2]/div[@class='pslicont']/div[@class='pslmain']/div[@class='pslline'][1]/div[@class='_tyb shop__secondary']/span[@class='price']/b")
My xpath query is:
/div[@class='product-results']/div[@class='psli'][2]/div[@class='pslicont']/div[@class='pslmain']/div[@class='pslline'][1]/div[@class='_tyb shop__secondary']/span[@class='price']/b
But I don't get any results.
What's wrong?
HTML from Google Shopping
Because Google is not actually returning HTML. See the source code of the page:
view-source:https://www.google.com.br/?source=pshome-c-0-3&sa=X&ved=0ahUKEwjfguD5xaHKAhXMiZAKHWjuBi8Q7j8IEA#tbm=shop&q=Samsung+Galaxy+S6&spd=0
Try sending a User-Agent header while getting the HTML from Google. This was a problem I faced a few days ago and got around by mimicking the user agent of the Chrome browser.
You can find the different ways to mimic the User-Agent on Google itself (no pun intended).
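For illustration, a minimal sketch of that header trick with axios in Node.js (the User-Agent string is just an example; any current browser UA string should work):

```javascript
// Minimal sketch: fetch a page while presenting a browser-like User-Agent,
// since some sites return different (or no) HTML to unknown clients.
const axios = require('axios');

async function fetchAsBrowser(url) {
  const response = await axios.get(url, {
    headers: {
      // Example Chrome User-Agent string (an assumption; use any browser UA).
      'User-Agent':
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ' +
        '(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    },
  });
  return response.data; // the raw HTML
}

fetchAsBrowser('https://www.google.com/search?q=Samsung+Galaxy+S6')
  .then(html => console.log(html.slice(0, 500)))
  .catch(err => console.error(err.message));
```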

Is there a way to get all pagination links at once in the Facebook page/feed API

I want to get all Facebook page/feed pagination links at once. Is it possible?
I don't want to wait until the current fetch is done before I can get the next page link.
I have tried the date-range approach, which is the only solution I have found so far:
page_id/feed?since=2014-01-01&until=2014-02-02&limit=100
Is there any better way to get all pagination links at once without missing any posts?
My intention is to fetch these links asynchronously.
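To show what the date-range workaround looks like when fetched asynchronously, here is a minimal sketch (PAGE_ID and ACCESS_TOKEN are placeholders, and any window containing more than 100 posts would still need a follow-up request):

```javascript
// Sketch: instead of walking paging.next links one at a time, pre-compute
// month-sized since/until windows and fetch them all concurrently.
// PAGE_ID and ACCESS_TOKEN are placeholders; deduplicate posts by id after.
const axios = require('axios');

function monthlyWindows(startYear, startMonth, endYear, endMonth) {
  const windows = [];
  let d = new Date(Date.UTC(startYear, startMonth - 1, 1));
  const end = new Date(Date.UTC(endYear, endMonth - 1, 1));
  while (d < end) {
    const next = new Date(Date.UTC(d.getUTCFullYear(), d.getUTCMonth() + 1, 1));
    windows.push({
      since: d.toISOString().slice(0, 10),    // e.g. 2014-01-01
      until: next.toISOString().slice(0, 10), // e.g. 2014-02-01
    });
    d = next;
  }
  return windows;
}

async function fetchAllPosts() {
  const windows = monthlyWindows(2014, 1, 2014, 7);
  // All windows are requested in parallel rather than sequentially.
  const responses = await Promise.all(
    windows.map(w =>
      axios.get('https://graph.facebook.com/PAGE_ID/feed', {
        params: {
          since: w.since,
          until: w.until,
          limit: 100,
          access_token: 'ACCESS_TOKEN',
        },
      })
    )
  );
  return responses.flatMap(r => r.data.data);
}
```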

Get/show Google search results in my app

I am facing a problem while developing an app: I need to display search engine results directly on my app page without redirecting to www.google.com.
This is how it works: in the search box I'll enter an RSS feed site's name, and I want to get the Google search results on my app page so that I can easily extract the RSS feed website and perform the operation I intended to do.
I intend only to get RSS feeds from a site just by typing the site name.
Thank you!
Answer:
Almost working...
Thank you @Chandan, @Suzi.
Check under "2. A Better Approach".
I didn't try it out practically, and I am not sure whether it's deprecated by this time or not.
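For reference, one documented way to get Google results inside an app without scraping google.com is the Custom Search JSON API. A minimal sketch (API_KEY and CX are placeholder credentials created in the Google developers console; this may or may not be the "better approach" referenced above):

```javascript
// Sketch: query the Google Custom Search JSON API instead of scraping
// google.com directly. API_KEY and CX (the search engine id) are
// placeholders obtained from the Google developers console.
const axios = require('axios');

async function searchGoogle(query) {
  const { data } = await axios.get(
    'https://www.googleapis.com/customsearch/v1',
    { params: { key: 'API_KEY', cx: 'CX', q: query } }
  );
  // Each result item carries a title, link and snippet; the link is what
  // you would feed into your RSS discovery step.
  return (data.items || []).map(item => ({
    title: item.title,
    link: item.link,
  }));
}

searchGoogle('smashingmagazine rss')
  .then(results => console.log(results))
  .catch(err => console.error(err.message));
```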

Is it possible to scrape any given URL with NodeJS?

I'll preface this by saying this is something that is new to me and is purely a learning exercise, so please excuse any naivety.
I've been looking through some articles on scraping, and it seems that NodeJS, ExpressJS, Request and Cheerio would be my preferred method as a front-end guy who is comfortable with JS/jQuery.
All the articles I've read so far focus on scraping data from a specific website in the absence of an API. What I am looking to achieve, to start with, is a tool which takes any given URL and returns a true/false for a list of which common libraries are being used and which social networks are linked.
For example, a user enters a URL and the results return "This website uses jQuery, MooTools, BackboneJS, AngularJS, etc." and "This website is linked with Facebook, Twitter, etc." Somewhat similar to Tregia: http://www.tregia.com/process?q=http://smashingmagazine.com.
Is my chosen setup (above) appropriate, or is it limited to scraping only specific pages due to CSS selectors?
You should be able to scrape all pages and then inspect their script tags to read which tools they're using (although keep in mind they may have renamed the files [e.g. angularjs3.1.0.js -> foobar.js] to keep people from knowing their stack). You should also be able to get the specific text within the rest of the tags that you feel is relevant.
You should try to pay attention to every page's robots.txt as well.
Edit: You probably won't be able to scrape "members"/"login only" areas of sites, though.
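A rough sketch of that idea, using axios in place of Request and Cheerio for parsing (the library and social-network lists are illustrative, not exhaustive, and renamed bundles will slip through as noted above):

```javascript
// Sketch: fetch a page, then look through <script src> values and anchor
// hrefs for well-known library names and social domains. Renamed or
// bundled files (e.g. angular repackaged as foobar.js) go undetected.
const axios = require('axios');
const cheerio = require('cheerio');

const LIBRARIES = ['jquery', 'mootools', 'backbone', 'angular'];
const SOCIAL = ['facebook.com', 'twitter.com'];

async function inspect(url) {
  const { data: html } = await axios.get(url);
  const $ = cheerio.load(html);

  const scripts = $('script[src]')
    .map((_, el) => $(el).attr('src').toLowerCase())
    .get();
  const hrefs = $('a[href]')
    .map((_, el) => $(el).attr('href').toLowerCase())
    .get();

  return {
    // true/false style report: which known names appear in the page's tags
    libraries: LIBRARIES.filter(lib => scripts.some(src => src.includes(lib))),
    socialLinks: SOCIAL.filter(site => hrefs.some(href => href.includes(site))),
  };
}

inspect('http://smashingmagazine.com').then(console.log);
```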

How do search engines recognize search boxes on websites?

I've noticed that a lot of the time when I search something on Google, Google automatically uses the search function of relevant websites and returns the results of the website's search as if it were just another URL.
How do I let Google and other search engines know what the search box on my own website is, and does OpenSearch have anything to do with it?
Do you maybe mean the site search function via the Google Chrome omnibar?
To get there you just need to have, on the root page of your domain:
- a form with method type GET
- an input type text element
- a submit button
If users go directly to your root page and search something there, Google learns of this form and adds it to the search engines accessible via the omnibar (the Google Chrome address bar). A minimal example is sketched below.
Did you mean this?
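A minimal sketch of such a form (the /search action path and the q field name are placeholders; any GET form matching the description above works):

```html
<!-- Minimal sketch of a plain GET search form on the root page.
     The action path and the "q" field name are placeholders. -->
<form method="GET" action="/search">
  <input type="text" name="q">
  <button type="submit">Search</button>
</form>
```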
Google doesn't use anyone's search forms - it just finds a link to search results. You need to:
- use GET for your search parameters to make this possible,
- create links to common/useful search results pages,
- make sure Google finds those links.
Google makes it look like just another URL because that is exactly what it is.
Most of the time, though, Google will do a better job than your search engine, so actually doing this could lower the quality of results from your site...
I don't think it does; it's impossible to spider sites in real time.
It's just an SEO technique some sites use to improve their ranking by spamming Google with fake results. They feed the Google bot an endless stream of links to bogus pages:
http://en.wikipedia.org/wiki/Spamdexing
