Problem with scraping OP.GG website with node.js and cheerio - node.js

I'm a beginner with node.js and cheerio, and a little help would be awesome :D
I'm trying to scrape the pubg.op.gg website to get two simple elements and print them to the console.
Here is my code:
var url = "https://pubg.op.gg/user/K1uu"
var request = require('request');
var cheerio = require('cheerio');
var cheerioAdv = require('cheerio-advanced-selectors');
request(url, function(err, resp, body) {
  var $ = cheerio.load(body);
  var playerName = $('.player-summary__name');
  var playerNameText = playerName.text();
  console.log(playerNameText);
  var playerRank = $('.ranked-stats__rating-point');
  var playerRankText = playerRank.text();
  console.log(playerRankText);
})
I'm trying to get something like this: "Kyuu - 1503"
The Kyuu value for the player nickname comes through without a problem, but I can't get the 1503, even though the class name of the div is correct!
Where is my problem?
Thanks guys!!

Hey and welcome to StackOverflow!
That website uses AJAX to fetch the ratings, so when the HTML is loaded the ratings are not yet available and no element with the ranked-stats__rating-point class exists. If you check with the browser's developer tools, you can see that the page requests three additional URLs for the three different rating points (the only difference is the queue_size URL parameter).
https://pubg.op.gg/api/users/59fdce2bdf1b210001a9324d/ranked-stats?season=pc-2018-01&queue_size=1&mode=tpp
https://pubg.op.gg/api/users/59fdce2bdf1b210001a9324d/ranked-stats?season=pc-2018-01&queue_size=2&mode=tpp
https://pubg.op.gg/api/users/59fdce2bdf1b210001a9324d/ranked-stats?season=pc-2018-01&queue_size=4&mode=tpp
You should be able to request the first rating like this:
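var url = "https://pubg.op.gg/api/users/59fdce2bdf1b210001a9324d/ranked-stats?season=pc-2018-01&queue_size=1&mode=tpp";
var request = require('request');
request(url, function(err, resp, body) {
  // Stop early on network errors instead of parsing an undefined body.
  if (err) { return console.error(err); }
  // The endpoint responds with JSON, so cheerio is not needed here.
  var jsonData = JSON.parse(body);
  var score = jsonData['stats']['rating'];
  console.log(score); // outputs "1520"
});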
However, the username is not available from these endpoints, so if you want to fetch ratings for arbitrary usernames you need to find another API endpoint that provides it.
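Since the three URLs differ only in queue_size, they can be generated rather than hard-coded. The following is a minimal sketch using only Node's built-in URL class; `buildRankedStatsUrl` and `extractRating` are hypothetical helper names, the user ID and parameter values are copied from the URLs above, and the sample JSON body is made up for illustration.

```javascript
// The user ID and query parameter values are copied from the URLs listed above.
const USER_ID = '59fdce2bdf1b210001a9324d';

// Hypothetical helper: builds the ranked-stats URL for one queue size (1, 2 or 4).
function buildRankedStatsUrl(userId, queueSize) {
  const apiUrl = new URL('https://pubg.op.gg/api/users/' + userId + '/ranked-stats');
  apiUrl.searchParams.set('season', 'pc-2018-01');
  apiUrl.searchParams.set('queue_size', String(queueSize));
  apiUrl.searchParams.set('mode', 'tpp');
  return apiUrl.href;
}

// Hypothetical helper: returns stats.rating from a response body,
// or null if the body is not JSON or the field is missing.
function extractRating(body) {
  try {
    const data = JSON.parse(body);
    return data && data.stats && data.stats.rating !== undefined ? data.stats.rating : null;
  } catch (e) {
    return null;
  }
}

const urls = [1, 2, 4].map(function (size) {
  return buildRankedStatsUrl(USER_ID, size);
});
console.log(urls);

// Sample body made up for illustration; a real response may contain other fields.
console.log(extractRating('{"stats":{"rating":1520}}'));
```

Each generated URL can then be passed to request in the same way as in the snippet above.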

Hi korsosa, and thanks for your answer!
Yes, there are multiple elements with the ranked-stats__rating-point class name.
Here is the result of your code:
var playerRankText = playerRank[1].text();
TypeError: Cannot read property 'text' of undefined

Related

XML scraping using nodeJs

I have a very large XML file that I got by exporting all the data from Tally. I am trying to extract elements from it using cheerio, but I am having trouble with the formatting or something similar. Reading it with fs.readFileSync() works fine and console.log shows the complete XML file, but when I write the file using fs.writeFileSync it looks like this:
And my web scraping code outputs an empty file:
const cheerio = require('cheerio');
const fs = require('fs');
var xml = fs.readFileSync('Master.xml','utf8');
const htmlC = cheerio.load(xml);
var list = [];
list = htmlC('ENVELOPE').find('BODY>TALLYMESSAGE>STOCKITEM>LANGUAGENAME.LIST>NAME.LIST>NAME').each(function (index, element) {
  list.push(htmlC(element).attr('data-prefix'));
})
console.log(list)
fs.writeFileSync("data.html",list,()=>{})
You might try checking to make sure that Cheerio isn't decoding all the HTML entities. Change:
const htmlC = cheerio.load(xml);
to:
const htmlC = cheerio.load(xml, { decodeEntities: false });
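Separately from entity decoding, the last line of the question's code passes an array and a callback to fs.writeFileSync, which is synchronous and takes no callback. Assigning the result of .each() back to list also replaces the array with the Cheerio selection. A minimal sketch of writing collected values one per line might look like this; the `saveList` helper, the sample values, and the output path are made up for illustration.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper: writes an array of strings to a file, one item per line.
function saveList(filePath, items) {
  // writeFileSync blocks until the write finishes; failures are thrown, not passed to a callback.
  fs.writeFileSync(filePath, items.join('\n'), 'utf8');
}

// Sample data standing in for the values pushed inside the .each() loop.
const names = ['Item A', 'Item B'];
const outFile = path.join(os.tmpdir(), 'data.txt');

saveList(outFile, names);
console.log(fs.readFileSync(outFile, 'utf8'));
```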

scraping data from a website node js

I am new to scraping data from websites. I would like to scrape the level number from https://fortnitetracker.com/profile/pc/Twitch.BadGuyBen. I have tried using cheerio and request for this task, and I'm not sure whether I'm using the right selector. Any tips on what I should do would help. This is my code:
var request = require('request');
var cheerio = require('cheerio');
var options = {
  url: `https://fortnitetracker.com/profile/pc/Twitch.BadGuyBen`,
  method: 'GET'
}
request(options, function (error, response, body) {
  var $ = cheerio.load(body);
  var level = "";
  var xp = "";
  $('.top-stats').filter(function(){
    var data = $(this);
    level = data.children().first().find('.value').text();
    console.log(level);
  })
});
Again, I am not sure whether I have even selected the right class. Much appreciated.
EDIT:
'.top-stats' also appears further down the page.
website open in chrome dev tools
other .top-stats class
You can't get the stats with request, because it only downloads the initial HTML and the stats are rendered afterwards by JavaScript. You will have to use something like Puppeteer to load the page, let it execute the JavaScript, and then scrape the stats.

url.searchParams returns undefined in node.js

In the following node.js example:
var url = require('url');
var urlString='/status?name=ryan'
var parseObj= url.parse(urlString);
console.log(urlString);
var params = parseObj.searchParams;
console.log(JSON.stringify(params));
the property searchParams is undefined. I would expect searchParams to contain the parameters of the search query.
As you can see in https://nodejs.org/dist/latest-v8.x/docs/api/url.html#url_class_urlsearchparams
searchParams is a property of the URL object, not of the object returned by url.parse. You must construct a complete URL object (with protocol and domain), and then you can use searchParams:
var url = require('url');
var urlString='https://this.com/status?name=ryan'
var parseObj= new url.URL(urlString);
console.log(urlString);
var params = parseObj.searchParams;
console.log(params);
Another way is to use the query property (you must pass true as the second parameter to url.parse):
var url = require('url');
var urlString='/status?name=ryan'
var parseObj= url.parse(urlString, true);
console.log(parseObj);
var params = parseObj.query;
console.log(params);
It is recommended to use var parsedUrl = new URL(request.url, 'https://your-host'); instead of url.parse.
url.parse shouldn't be used in new applications: it is deprecated and could cause security issues.
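For a relative path like the one in the question, the URL constructor needs a base as its second argument. A minimal sketch, assuming a placeholder base of 'http://localhost' (any valid origin works, since only the query string is read):

```javascript
// URL is available as a global in current Node.js versions.
const urlString = '/status?name=ryan';

// A relative path cannot be parsed on its own; 'http://localhost' is only a placeholder base.
const parsed = new URL(urlString, 'http://localhost');

// Read a single parameter by name.
const name = parsed.searchParams.get('name');

// Convert to a plain object so it serializes as key/value pairs.
const asObject = Object.fromEntries(parsed.searchParams);

console.log(name);
console.log(JSON.stringify(asObject));
```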

Parsing Get HTTP Request for parameters and using in http response

I am trying to learn node.js. Here is the basic Hello World example, where I expect an HTTP request like
http://localhost:3000?fname=ABC&lname=XYZ
And return response to print on the browser
Hello, ABC XYZ!
This is working fine. But if you see the response.end function I have something like query.lname || "Anonymous". I was expecting that in case the lname is not specified in the URL then the response contains 'Anonymous' in place of last name. But this doesn't happen and I get
Hello, ABC undefined!
The code is as follows. Kindly help me understand this. Thanks for the help.
var http = require('http');
var url = require('url');
var querystring = require('querystring');
http.createServer(function(request, response) {
  var query = querystring.parse(url.parse(request.url).query || "");
  response.writeHead(200, {'content-type': "text/plain"});
  response.end("Hello, "+(query.fname+" "+ query.lname || "Anonymous")+"!\n");
}).listen(3000);
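The behavior described above follows from operator precedence: + binds more tightly than ||, so query.fname+" "+ query.lname is evaluated first, and because that concatenation is always a non-empty string the "Anonymous" fallback is never reached. A minimal sketch of one possible fix, where `greeting` is a hypothetical helper and the fallback is assumed to apply to the last name only:

```javascript
const querystring = require('querystring');

// Hypothetical helper: builds the response text for a parsed query object.
function greeting(query) {
  // Parenthesize the fallback so || applies to lname alone, not to the whole concatenation.
  return 'Hello, ' + query.fname + ' ' + (query.lname || 'Anonymous') + '!\n';
}

console.log(greeting(querystring.parse('fname=ABC&lname=XYZ')));
console.log(greeting(querystring.parse('fname=ABC')));
```

Inside the server, the same expression can be passed straight to response.end.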

Retrieving HTML from CouchBase into Node.js / Express 4 leaves it unrendered

I'm having a small issue with rendering HTML that is stored in CouchBase and fetched by Node.js.
In CouchBase I have several small HTML snippets. They contain text, tags such as <br /> and HTML entities such as &lt;. They are of course stored as an escaped string in JSON.
So far, so good. However, when I pull it out and display it on the page, it is rendered "as-is", without being interpreted as HTML.
For example:
[ some content ...]
<p>Lorem is > ipsum<br />And another line</p>
[rest of content ...]
From the controller in Express 4:
var express = require('express');
var router = express.Router();
var couchbase = require('couchbase');
var cluster = new couchbase.Cluster('couchbase://myserver');
var bucket = cluster.openBucket('someBucket', 'somePassword');
var Entities = require('html-entities').XmlEntities;
entities = new Entities();
var utf8 = require('utf8');
/* GET home page. */
router.get('/', function(req, res) {
  bucket.get('my:thingie:44', function(err, result) {
    if(err) throw err
    console.log(result);
    var html = utf8.decode(entities.decode(result.value.thingie.html));
    // var html = utf8.encode(result.value.thingie.html);
    // var html = utf8.decode(result.value.thingie.html);
    res.render('index', { title: 'PageTitle', content: html });
  });
});
It is then passed to the template (using hogan.js) for rendering.
When looking into this I found that it might have something to do with the encoding of the &lt;'s and &gt;'s that prevents it from being parsed. You can see my conversion attempts in the code; none of the options gave the desired result, i.e. rendering the contents as HTML.
Using utf8.decode() made no difference.
Using utf8.encode() made no difference.
Using entities.decode() converts &lt; into < as predicted, but it's still not rendered even though &lt;div&gt; becomes <div>.
Any ideas?
I found the solution over here: Partials with Node.js + Express + Hogan.js
When putting HTML in a Hogan template, you have to use {{{var}}} instead of {{var}}.
And thus it renders beautifully, as intended :)
It wasn't an encoding issue at all ;)
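In Mustache-style templates such as Hogan's, {{var}} HTML-escapes the value while {{{var}}} inserts it unchanged, which is why the markup showed up as literal text. The sketch below only illustrates that difference in plain JavaScript; `escapeHtml` and `renderValue` are made-up helpers, not Hogan.js functions or its actual implementation.

```javascript
// Illustrative only: approximates the escaping applied to a double-mustache value.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;');
}

// Hypothetical stand-in for the two tag styles: escaped ({{var}}) versus raw ({{{var}}}).
function renderValue(value, raw) {
  return raw ? String(value) : escapeHtml(value);
}

const html = '<p>Lorem<br />ipsum</p>';
console.log(renderValue(html, false)); // escaped: the browser shows the tags as text
console.log(renderValue(html, true));  // raw: the browser parses the tags as markup
```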
