Scraping data from a website with node.js - node.js

I am new to scraping data from a website. I would like to scrape the level number from: https://fortnitetracker.com/profile/pc/Twitch.BadGuyBen. I have tried using cheerio and request for this task, but I'm not sure if I'm using the right selector; maybe some tips on what I should do? This is my code:
var request = require('request');
var cheerio = require('cheerio');

var options = {
    url: `https://fortnitetracker.com/profile/pc/Twitch.BadGuyBen`,
    method: 'GET'
};

request(options, function (error, response, body) {
    var $ = cheerio.load(body);
    var level = "";
    var xp = "";
    $('.top-stats').filter(function () {
        var data = $(this);
        level = data.children().first().find('.value').text();
        console.log(level);
    });
});
Again, I am not sure if I have even selected the right class. Much appreciated.
EDIT:
'.top-stats' is also present further down the page:
[screenshot: website open in Chrome dev tools]
[screenshot: the other .top-stats class]

You can't use request to get the body, since the stats are rendered with JavaScript. You will have to use something like puppeteer to request the page, execute the JavaScript, and then scrape the stats.
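A minimal puppeteer sketch of that approach; the '.top-stats .value' selector is carried over from the question and may need adjusting against the live page:

```javascript
// Minimal puppeteer sketch; the '.top-stats .value' selector is an
// assumption carried over from the question and may need adjusting.
const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto('https://fortnitetracker.com/profile/pc/Twitch.BadGuyBen', {
        waitUntil: 'networkidle2' // let the page's client-side JS settle
    });
    await page.waitForSelector('.top-stats'); // stats are rendered client-side
    const level = await page.$eval('.top-stats .value', el => el.textContent.trim());
    console.log(level);
    await browser.close();
})();
```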

Related

XML scraping using nodeJs

I have a very large XML file that I got by exporting all the data from Tally. I am trying to get elements out of it using cheerio, but I am having trouble with the formatting or something similar. Reading it with fs.readFileSync() works fine and console.log shows the complete XML file, but when I write the file using fs.writeFileSync it makes it look like this:
[screenshot of the mangled output]
And my web-scraping code outputs an empty file:
const cheerio = require('cheerio');
const fs = require('fs');

var xml = fs.readFileSync('Master.xml', 'utf8');
const htmlC = cheerio.load(xml);
var list = [];
list = htmlC('ENVELOPE').find('BODY>TALLYMESSAGE>STOCKITEM>LANGUAGENAME.LIST>NAME.LIST>NAME').each(function (index, element) {
    list.push(htmlC(element).attr('data-prefix'));
});
console.log(list);
fs.writeFileSync("data.html", list, () => {});
You might try checking to make sure that Cheerio isn't decoding all the HTML entities. Change:
const htmlC = cheerio.load(xml);
to:
const htmlC = cheerio.load(xml, { decodeEntities: false });
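To see the effect, here is a small self-contained sketch; the inline XML is a made-up stand-in for the Tally export structure:

```javascript
// Sketch of the decodeEntities fix with an inline sample; the tag
// names here are a stand-in for the Tally export structure.
const cheerio = require('cheerio');

const xml = '<ENVELOPE><NAME>5 &lt; 10 &amp; 3</NAME></ENVELOPE>';

// The default load() decodes entities, so serializing can mangle them:
const decoded = cheerio.load(xml, { xmlMode: true });
// With decodeEntities: false, the entities pass through untouched:
const raw = cheerio.load(xml, { decodeEntities: false, xmlMode: true });

console.log(decoded('NAME').text());
console.log(raw('NAME').html());
```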

Problem with scraping OP.GG website with node.js and cheerio

I'm a beginner with node.js and cheerio, and a little help would be awesome :D
I'm trying to scrape the pubg.op.gg website to get two simple elements and show them in the console.
Here is my code:
var url = "https://pubg.op.gg/user/K1uu";
var request = require('request');
var cheerio = require('cheerio');
var cheerioAdv = require('cheerio-advanced-selectors');

request(url, function (err, resp, body) {
    var $ = cheerio.load(body);
    var playerName = $('.player-summary__name');
    var playerNameText = playerName.text();
    console.log(playerNameText);
    var playerRank = $('.ranked-stats__rating-point');
    var playerRankText = playerRank.text();
    console.log(playerRankText);
});
I'm trying to get something like this: "Kyuu - 1503".
No problem getting the Kyuu value for the player nickname, but it's impossible to get the 1503, even though the name of the div is correct!
Where is my problem?
Thanks guys!!
Hey and welcome to StackOverflow!
That website uses AJAX to fetch the ratings, so when the HTML is loaded the ratings are not yet available and the ranked-stats__rating-point class does not exist. If you check with the browser's developer tools, you can see that it requests 3 additional URLs for the 3 different rating points (the only difference is the queue_size URL param):
https://pubg.op.gg/api/users/59fdce2bdf1b210001a9324d/ranked-stats?season=pc-2018-01&queue_size=1&mode=tpp
https://pubg.op.gg/api/users/59fdce2bdf1b210001a9324d/ranked-stats?season=pc-2018-01&queue_size=2&mode=tpp
https://pubg.op.gg/api/users/59fdce2bdf1b210001a9324d/ranked-stats?season=pc-2018-01&queue_size=4&mode=tpp
You should be able to request the first rating like this:
var url = "https://pubg.op.gg/api/users/59fdce2bdf1b210001a9324d/ranked-stats?season=pc-2018-01&queue_size=1&mode=tpp";
var request = require('request');
request(url, function (err, resp, body) {
    var jsonData = JSON.parse(body);
    var score = jsonData['stats']['rating'];
    console.log(score); // outputs "1520"
});
However the username is not available from these endpoints, so you need to find another API endpoint for that if you want to fetch these for arbitrary usernames.
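The parsing step can also be exercised locally with a canned response body; the stats.rating shape below mirrors the snippet above:

```javascript
// Exercising the parsing step with a canned body; the JSON shape
// (stats.rating) mirrors what the op.gg endpoint returns above.
function extractRating(body) {
    var jsonData = JSON.parse(body);
    return jsonData['stats']['rating'];
}

var sampleBody = JSON.stringify({ stats: { rating: 1520 } });
console.log(extractRating(sampleBody)); // 1520
```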
Hi korsosa and thanks for your answer!
Yes, there are multiple elements with ranked-stats__rating-point for the name.
Here is the result of your code:
var playerRankText = playerRank[1].text();
TypeError: Cannot read property 'text' of undefined
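That TypeError happens because indexing with [1] returns a raw DOM node, which has no .text() method; use cheerio's .eq() or re-wrap the node. A small self-contained sketch (the markup is made up for illustration):

```javascript
// ranks[1] is a plain DOM node; .eq(1) keeps you in cheerio land.
const cheerio = require('cheerio');
const $ = cheerio.load('<span class="r">Kyuu</span><span class="r">1503</span>');

const ranks = $('.r');
console.log(ranks.eq(1).text());  // cheerio's positional selector
console.log($(ranks[1]).text());  // or re-wrap the raw node
```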

Parsing Get HTTP Request for parameters and using in http response

I am trying to learn node.js. Here is a basic hello-world example where I expect an HTTP request like
http://localhost:3000?fname=ABC&lname=XYZ
and a response that prints in the browser:
Hello, ABC XYZ!
This is working fine. But if you look at the response.end call, I have something like query.lname || "Anonymous". I was expecting that if lname is not specified in the URL, the response would contain 'Anonymous' in place of the last name. But this doesn't happen and I get
Hello, ABC undefined!
The code is as follows. Kindly help me understand this. Thanks for the help.
var http = require('http');
var url = require('url');
var querystring = require('querystring');

http.createServer(function (request, response) {
    var query = querystring.parse(url.parse(request.url).query || "");
    response.writeHead(200, {'content-type': "text/plain"});
    response.end("Hello, " + (query.fname + " " + query.lname || "Anonymous") + "!\n");
}).listen(3000);

Retrieving HTML from CouchBase into Node.js / Express 4 leaves it unrendered

I'm having a small issue with rendering HTML stored in CouchBase and fetched by Node.js.
In CouchBase I have several small HTML snippets. They contain text, tags such as <br />, and HTML entities such as &lt;. They are of course stored as escaped strings in JSON.
So far, so good. However, when I pull one out and display it on the page, it is rendered "as-is", without being interpreted as HTML.
For example:
[ some content ...]
<p>Lorem is &gt; ipsum<br />And another line</p>
[rest of content ...]
From the controller in Express 4:
var express = require('express');
var router = express.Router();
var couchbase = require('couchbase');
var cluster = new couchbase.Cluster('couchbase://myserver');
var bucket = cluster.openBucket('someBucket', 'somePassword');
var Entities = require('html-entities').XmlEntities;
var entities = new Entities();
var utf8 = require('utf8');

/* GET home page. */
router.get('/', function (req, res) {
    bucket.get('my:thingie:44', function (err, result) {
        if (err) throw err;
        console.log(result);
        var html = utf8.decode(entities.decode(result.value.thingie.html));
        // var html = utf8.encode(result.value.thingie.html);
        // var html = utf8.decode(result.value.thingie.html);
        res.render('index', { title: 'PageTitle', content: html });
    });
});
It is then passed to the template (using hogan.js) for rendering.
When looking into this I found that it might have something to do with the encoding of the <'s and &lt;'s preventing the snippet from being parsed. You can see my conversion attempts in the code; none of the options gave the desired result, i.e. rendering the contents as HTML.
When using utf8.decode(), no difference.
Using utf8.encode(), no difference.
Using entities.decode() it converts &lt; into < as expected, but the result is still not rendered, even though &lt;div&gt; becomes <div>.
Any ideas?
I found the solution over here: Partials with Node.js + Express + Hogan.js
When putting HTML into a Hogan template, you have to use {{{var}}} instead of {{var}}.
And thus it renders beautifully, as intended :)
It wasn't an encoding issue at all ;)
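A minimal sketch of the difference, assuming hogan.js is available as a dependency:

```javascript
// {{var}} HTML-escapes the value; {{{var}}} inserts it unescaped.
const hogan = require('hogan.js');

const context = { content: '<p>Hello<br />world</p>' };

console.log(hogan.compile('{{content}}').render(context));   // escaped tags, shown literally
console.log(hogan.compile('{{{content}}}').render(context)); // raw HTML, rendered by the browser
```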

Web Scrape Meteor Pages

I'm trying to write an application that scrapes a Meteor webpage. This is rather difficult, as Meteor pages initially render entirely via JavaScript. Is there some way to render the page with some sort of scraper?
I'll probably be doing it with Node, if that helps.
Thanks
You could use phantomjs to render the webpage. This is an example, specifically designed for Meteor webpages (taken from spiderable), to capture their HTML:
var fs = require('fs');
var child_process = require('child_process');

console.log('Loading a web page');
var page = require('webpage').create();
page.open("http://localhost:3000", function (status) {
});

var i = 0;
setInterval(function () {
    var ready = page.evaluate(function () {
        if (typeof Meteor !== 'undefined'
            && typeof(Meteor.status) !== 'undefined'
            && Meteor.status().connected) {
            Deps.flush();
            return DDP._allSubscriptionsReady();
        }
        return false;
    });

    console.log("Ready", ready);
    if (ready) {
        var out = page.content;
        console.log(out);
        phantom.exit();
    }
}, 100);
As written it prints to the console, but you could wrap the script and capture its output from Node using require('child_process').exec and stdout.
You can run the code with phantomjs script.js and it will give you back the HTML of the Meteor page.
If the site has the spiderable package enabled, you can pretend to be a web crawler to get the server to render the page.
If you don't control the server or spiderable isn't enabled, you will probably have to use Selenium, but the crawling will be CPU-intensive and slow.
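If spiderable is enabled, the crawler trick is, as a sketch, to append the _escaped_fragment_ query parameter from Google's AJAX-crawling scheme, which spiderable responds to with server-rendered HTML:

```javascript
// Sketch: spiderable serves pre-rendered HTML when it sees the
// _escaped_fragment_ parameter (Google's AJAX-crawling scheme).
// The localhost URL is a placeholder for the target Meteor app.
var http = require('http');

http.get('http://localhost:3000/?_escaped_fragment_=', function (res) {
    var html = '';
    res.on('data', function (chunk) { html += chunk; });
    res.on('end', function () { console.log(html); });
});
```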
