How to crawl javascript (vuejs, reactjs) web site with nodejs

I am trying to crawl a website with a Vue.js front end, but when I fetch it, the content never reaches cheerio: all I get is a blank page. My code is as follows:
getSiteContentAsJs = (url) => {
return new Promise((resolve, reject) => {
let j = request.jar();
request.get({url: url, jar: j}, function(err, response, body) {
if(err)
return resolve({body: null, jar: j, error: err});
return resolve({body: body, jar: j, error: null});
});
})
}
I get the content as follows:
const { body, jar, error} = await getSiteContentAsJs(url);
//I passed body to cheerio to get the js object out of the web content
const $ = cheerio.load(body);
But nothing is rendered: the result is a blank page with no content in it.

I found that cheerio doesn't run JavaScript. Since this website is built on a Vue front end, I needed a virtual browser that actually runs the JS and gives me the rendered output.
So instead of request, I used phantom to render JS web pages:
const phantom = require('phantom');
const cheerio = require('cheerio');
loadJsSite = async (url) => {
const instance = await phantom.create();
const page = await instance.createPage();
await page.on('onResourceRequested', function(requestData) {
console.info('Requesting', requestData.url);
});
const status = await page.open(url);
const content = await page.property('content');
// console.log(content);
// let $ = cheerio.load(content);
await instance.exit();
return {$: cheerio.load(content), content: content};
}
Now I can get the rendered page like this:
const {$, content} = await loadJsSite(url);
// I can query like this
// get the body
$('body').html();

Related

Problem with picking HTML element with cheerio.js [duplicate]

I am trying to scrape a website but I don't get some of the elements, because these elements are dynamically created.
I use cheerio in Node.js, and my code is below.
var request = require('request');
var cheerio = require('cheerio');
var url = "http://www.bdtong.co.kr/index.php?c_category=C02";
request(url, function (err, res, html) {
var $ = cheerio.load(html);
$('.listMain > li').each(function () {
console.log($(this).find('a').attr('href'));
});
});
This code returns an empty result because, when the page is first loaded, the <ul id="store_list" class="listMain"> is empty.
The content has not been appended yet.
How can I get these elements using node.js? How can I scrape pages with dynamic content?

Here you go;
var phantom = require('phantom');
phantom.create(function (ph) {
ph.createPage(function (page) {
var url = "http://www.bdtong.co.kr/index.php?c_category=C02";
page.open(url, function() {
page.includeJs("http://ajax.googleapis.com/ajax/libs/jquery/1.6.1/jquery.min.js", function() {
page.evaluate(function() {
$('.listMain > li').each(function () {
console.log($(this).find('a').attr('href'));
});
}, function(){
ph.exit()
});
});
});
});
});

Check out GoogleChrome/puppeteer
Headless Chrome Node API
It makes scraping pretty trivial. The following example will scrape the headline over at npmjs.com (assuming .npm-expansions remains)
const puppeteer = require('puppeteer');
(async () => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://www.npmjs.com/');
const textContent = await page.evaluate(() => {
return document.querySelector('.npm-expansions').textContent
});
console.log(textContent); /* No Problem Mate */
await browser.close();
})();
evaluate runs a function in the page's own context, after the page's scripts have executed, so it can inspect dynamically created elements.
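Dynamic content often appears some time after navigation finishes, so reading the DOM immediately can still come back empty. Below is a minimal sketch of waiting for an element before reading the HTML, assuming a Puppeteer Browser instance is passed in. The helper name getRenderedHtml and its parameters are hypothetical, not part of Puppeteer.

```javascript
// Hypothetical helper: returns the page HTML once `selector` exists in the DOM.
// `browser` is assumed to be a Puppeteer Browser (the result of puppeteer.launch()).
async function getRenderedHtml(browser, url, selector, timeoutMs = 30000) {
  const page = await browser.newPage();
  try {
    // Wait until the network has been idle, then until the element is present.
    await page.goto(url, { waitUntil: 'networkidle0' });
    await page.waitForSelector(selector, { timeout: timeoutMs });
    return await page.content();
  } finally {
    // Always release the tab, even if a wait timed out.
    await page.close();
  }
}

// Usage (requires `npm install puppeteer`):
// const puppeteer = require('puppeteer');
// const browser = await puppeteer.launch();
// const html = await getRenderedHtml(browser, 'https://www.npmjs.com/', '.npm-expansions');
// await browser.close();
```

The returned HTML can then be passed to cheerio.load() if you prefer jQuery-style queries over page.evaluate.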

Use the new npm module x-ray, with a pluggable web driver x-ray-phantom.
Examples in the pages above, but here's how to do dynamic scraping:
var phantom = require('x-ray-phantom');
var Xray = require('x-ray');
var x = Xray()
.driver(phantom());
x('http://google.com', 'title')(function(err, str) {
if (err) return console.error(err);
console.log(str); // the page title
})

Answering this as a canonical: an alternative to Puppeteer for scraping dynamic sites, also well supported as of 2023, is Playwright. Here's a simple example:
const playwright = require("playwright"); // ^1.28.1
let browser;
(async () => {
browser = await playwright.chromium.launch();
const page = await browser.newPage();
await page.goto("https://example.com");
const text = await page.locator('h1:text("Example")').textContent();
console.log(text); // => Example Domain
})()
.catch(err => console.error(err))
.finally(() => browser?.close());

The easiest reliable solution is Puppeteer, which is suitable for both static and dynamic scraping, as described in https://pusher.com/tutorials/web-scraper-node.
The only change needed is to raise the timeout in Browser.js, TimeoutSettings.js, and Launcher.js from 300000 to 3000000.
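Edits made inside the installed library files are lost on the next install. A hedged alternative is to raise the timeouts through Puppeteer's Page API. The sketch below assumes a Puppeteer Page object. The helper name applyTimeouts is hypothetical.

```javascript
// Hypothetical helper: raise timeouts on a Puppeteer Page without editing library files.
function applyTimeouts(page, timeoutMs) {
  page.setDefaultNavigationTimeout(timeoutMs); // applies to goto, reload, waitForNavigation
  page.setDefaultTimeout(timeoutMs);           // applies to waitForSelector and other waits
  return page;
}

// Usage (requires `npm install puppeteer`):
// const page = applyTimeouts(await browser.newPage(), 3000000);
// await page.goto(url);
```

A single call can also override the default, for example page.goto(url, { timeout: 3000000 }).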

How to get response from nodejs after posting data from reactjs to nodejs?

This is my react code (works after entering url):
const SendToServer = (url) => {
axios.post('http://localhost:4000/app/urls', url)
.then(response => console.log(response))
}
and this is the nodejs code:
router.post('/urls',(request, response)=>{
console.log(request.body.url);
console.log(request.body.depth);
var Crawler = require("simplecrawler");
var crawler = Crawler(request.body.url)
.on("fetchcomplete", function (queueItem) {
//console.log("Fetched a resource!",queueItem)
});
crawler.maxDepth = request.body.depth;
crawler.maxConcurrency = 3;
crawler.start();
});
How can I send the "queueItem" as a response to the reactjs?

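Use the response object in your route handler. A response can be sent only once, but fetchcomplete fires for every fetched resource, so collect the items and send them when the crawl completes:
var items = [];
var crawler = Crawler(request.body.url)
.on("fetchcomplete", function (queueItem) {
items.push(queueItem);
})
.on("complete", function () {
response.status(200).send(items);
});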

Send Multiple HTTP Request with Axios

I am trying to crawl this website to get university names using Node.js and Axios. I noticed that the website uses a paginated API, so to crawl all the university names I have to send multiple requests.
const url = 'https://www.usnews.com/best-colleges/search?_sort=rank&_sortDirection=asc&study=Engineering&_mode=table&_page=1';
const url = 'https://www.usnews.com/best-colleges/search?_sort=rank&_sortDirection=asc&study=Engineering&_mode=table&_page=2';
const url = 'https://www.usnews.com/best-colleges/search?_sort=rank&_sortDirection=asc&study=Engineering&_mode=table&_page=3';
...
const url = 'https://www.usnews.com/best-colleges/search?_sort=rank&_sortDirection=asc&study=Engineering&_mode=table&_page=55';
I have written code to crawl only one page. I do not know how to crawl more than 1 page.
Here is my code
const axios = require('axios');
const cheerio = require('cheerio');
var request = require('request');
const fs = require('fs');
// table view
const page = 1;
const url = 'https://www.usnews.com/best-colleges/search?_sort=rank&_sortDirection=asc&study=Engineering&_mode=table&_page=' +page;
fetchData(url).then((res) => {
const html = res.data;
const $ = cheerio.load(html);
const unilist = $('.TableTabular__TableContainer-febmbj-0.guaRKP > tbody > tr >td ');
unilist.each(function() {
let title = $(this).find('div').attr("name");
if (typeof(title) == 'string') {
console.log(title);
// appendFileSync is synchronous and takes no callback; it throws on failure
fs.appendFileSync('universityRanking.txt', title + '\n');
}
});
})
async function fetchData(url){
console.log("Crawling data...")
// make http call to url
let response = await axios(url).catch((err) => console.log(err));
if(!response || response.status !== 200){
console.log("Error occurred while fetching data");
return;
}
return response;
}
I would like help making all 55 Axios requests. I checked, and the site has 55 pages. I need to append all the university names from each page to a text file.

The axios.all() method can help with your use case.
axios.all([]) // Pass the array of axios requests for all the 55 pages here
.then((responses) => {
// Multiple requests complete
});
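To make the array of requests concrete, here is a minimal sketch that builds the 55 page URLs and fetches them with Promise.all, which behaves like axios.all for this purpose. The helper names pageUrl and fetchAllPages are hypothetical, and the fetching function is passed in so that any HTTP client can be used.

```javascript
// Build the URL for one results page (query string copied from the question).
function pageUrl(page) {
  return 'https://www.usnews.com/best-colleges/search?_sort=rank&_sortDirection=asc' +
    '&study=Engineering&_mode=table&_page=' + page;
}

// Fetch pages 1..lastPage in parallel. `fetchPage` is any function url -> Promise.
// Results come back in page order because Promise.all preserves input order.
function fetchAllPages(lastPage, fetchPage) {
  const urls = Array.from({ length: lastPage }, (_, i) => pageUrl(i + 1));
  return Promise.all(urls.map((url) => fetchPage(url)));
}

// Usage with axios (requires `npm install axios`):
// const axios = require('axios');
// fetchAllPages(55, (url) => axios.get(url)).then((responses) => {
//   responses.forEach((res) => { /* parse res.data with cheerio as in the question */ });
// });
```

Sending 55 requests at once may trigger rate limiting on some sites. If that happens, a plain for loop with await on each request fetches the pages one at a time instead.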

Can I force SSR for a Nuxt page?

In a Nuxt app I need to render a page with a lot of data displayed on a google map, obtained from a 100MB .jsonl file. I'm using fs.createReadStream inside asyncData() to parse the data and feed it to the Vue component. Since fs is a server-side only module, this means my app errors when it attempts to render that page client-side.
I would like it so this specific page will exclusively be rendered with SSR so I can use fs in the Vue component.
I thought of using a custom Express middleware to process the data, but this still results in downloading dozens of MB to the client, which is unacceptable. You can see how I request it with Axios in my example.
async asyncData( {$axios} ) {
const fs = require('fs');
if (process.server) {
console.log("Server");
async function readData() {
const DelimiterStream = require('delimiter-stream');
const StringDecoder = require('string_decoder').StringDecoder;
const decoder = new StringDecoder('utf8');
let linestream = new DelimiterStream();
let input = fs.createReadStream('/Users/Chibi/WebstormProjects/OPI/OPIExamen/static/stream.jsonl');
return new Promise((resolve, reject) => {
console.log("LETS GO");
let data = [];
linestream.on('data', (chunk) => {
let parsed = JSON.parse(chunk);
if (parsed.coordinates)
data.push({
coordinates: parsed.coordinates.coordinates,
country: parsed.place && parsed.place.country_code
});
});
linestream.on('end', () => {
return resolve(data);
});
input.pipe(linestream);
});
}
const items = await readData();
return {items};
} else {
console.log("CLIENT");
// `this` is not available in asyncData; use the $axios from the context and await it
const items = await $axios.$get('http://localhost:3000/api/stream');
return {items };
}
}
Even when it renders correctly, Nuxt shows me an error overlay complaining about the issue.
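The line-by-line parsing in readData() can also be written with Node's built-in readline module in place of delimiter-stream. The sketch below makes that swap and keeps the same fields as the question. The function name readCoordinates is hypothetical. It covers only the server-side parsing step and does not by itself force SSR.

```javascript
const fs = require('fs');
const readline = require('readline');

// Stream a .jsonl file line by line and keep only the fields the map needs.
async function readCoordinates(filePath) {
  const rl = readline.createInterface({
    input: fs.createReadStream(filePath, { encoding: 'utf8' }),
    crlfDelay: Infinity, // treat \r\n as a single line break
  });
  const data = [];
  for await (const line of rl) {
    if (!line.trim()) continue; // skip blank lines
    const parsed = JSON.parse(line);
    if (parsed.coordinates) {
      data.push({
        coordinates: parsed.coordinates.coordinates,
        country: parsed.place && parsed.place.country_code,
      });
    }
  }
  return data;
}
```

Because the file is read as a stream, memory use is bounded by the size of the filtered result rather than by the 100MB input.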

Scrape part of page that is not html

I want to scrape this site.
I'm using Node.js and PhantomJS through the phantom module.
This is my code:
var phantom = require('phantom');
var loadInProgress = false;
var url = 'http://apps.who.int/flumart/Default?ReportNo=12';
(async function() {
const instance = await phantom.create();
const page = await instance.createPage();
await page.on('onResourceRequested', function(requestData) {
console.info('Requesting', requestData.url);
});
await page.on('onConsoleMessage', function(msg) {
console.info(msg);
});
await page.on('onLoadStarted', function() {
loadInProgress = true;
console.log('Load started...');
});
await page.on('onLoadFinished', function() {
loadInProgress = false;
console.log('Load end');
});
const status = await page.open(url);
await console.log('STATUS:', status);
const content = await page.property('content');
await console.log('CONTENT:', content);
// submit
await page.evaluate(function() {
document.getElementById('lblFilteBy').value = 'Country, area or territory'; //'WHO region';
document.getElementById('lblSelectBy').value = 'Italy'; //'European Region of WHO';
document.getElementById('lbl_YearFrom').value = '1995';
document.getElementById('lbl_WeekFrom').value = '1';
document.getElementById('lbl_YearTo').value = '2018';
document.getElementById('ctl_list_WeekTo').value = '53';
//console.log('SUBMIT:', document.getElementById('ctl_ViewReport'));
document.getElementById('ctl_ViewReport').submit();
});
var result = await page.evaluate(function() {
return document.querySelectorAll('html')[0].outerHTML; // Problem here
});
await console.log('RESULT:', result);
await instance.exit();
}());
I don't understand what this part (in red) of page is:
It's not HTML, how do I scrape the displayed data?
Thanks!
EDIT 1
If I go to 'Network' tab of Chrome dev tools:

You can catch the AJAX request. Check the Network tab:
the XHR request outlined in blue is the one you need to call yourself in your phantom script, and the AJAX result is outlined in red. In the Headers tab, you will see the form data sent via POST to the page.
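Once the XHR is identified, it can often be replayed directly from Node without a browser. The sketch below assumes the request is a form-encoded POST and that Node 18+ is available for the global fetch. The helper name postForm is hypothetical, and the field names in the usage comment are placeholders: copy the real ones, including any hidden fields, from the Form Data shown in the Headers tab.

```javascript
// Hypothetical helper: replay a form POST seen in the browser's Network tab.
// `fetchImpl` defaults to the global fetch available in Node 18+.
async function postForm(url, fields, fetchImpl = fetch) {
  const res = await fetchImpl(url, {
    method: 'POST',
    headers: { 'Content-Type': 'application/x-www-form-urlencoded' },
    body: new URLSearchParams(fields).toString(),
  });
  if (!res.ok) throw new Error('Request failed with status ' + res.status);
  return res.text();
}

// Usage (field names below are placeholders, not the site's real ones):
// const html = await postForm('http://apps.who.int/flumart/Default?ReportNo=12', {
//   someField: 'Italy',
//   anotherField: '1995',
// });
```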

This is going to be hard. Take a look at this: Node.js web browser with JavaScript execution
Basically, you need a lib that simulates a browser with js execution and use that to render the report, then you can parse it.
