Scraping with Node.js - node.js

I want to scrape the page "https://www.ukr.net/ua/news/sport.html" with Node.js.
I'm trying to make a basic GET request with the 'request' npm module; here is an example:
const inspect = require('eyespect').inspector();
const request = require('request');
const url = 'https://www.ukr.net/news/dat/sport/2/';
const options = {
  method: 'get',
  json: true,
  url: url
};
request(options, (err, res, body) => {
  if (err) {
    inspect(err, 'error posting json');
    return;
  }
  const headers = res.headers;
  const statusCode = res.statusCode;
  inspect(headers, 'headers');
  inspect(statusCode, 'statusCode');
  inspect(body, 'body');
});
But in the response body I only get:
body: '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01
Transitional//EN">\n<html>\n<head>\n<META HTTP-EQUIV="expires"
CONTENT="Wed, 26 Feb 1997 08:21:57 GMT">\n<META HTTP-EQUIV=Refresh
CONTENT="10">\n<meta HTTP-EQUIV="Content-type" CONTENT="text/html;
charset=utf-8">\n<title>www.ukr.net</title>\n</head>\n<body>\n
Идет загрузка, подождите .....\n</body>\n</html>'
If I make a GET request from Postman, I get exactly what I need.
Please help me, guys.

You might have been blocked by bot protection - this can be checked with curl:
curl -vL https://www.ukr.net/news/dat/sport/2/
curl seems to get the result, and if curl is working then there is probably something missing in the request from Node; a solution could be to mimic a browser of your choice.
For example, here is a Chrome-like request taken from the developer tools, deriving the following options for the request:
const options = {
  method: 'get',
  json: true,
  url: url,
  gzip: true,
  headers: {
    "Host": "www.ukr.net",
    "Pragma": "no-cache",
    "Cache-Control": "no-cache",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate, sdch, br",
    "Accept-Language": "en-US,en;q=0.8"
  }
};
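For completeness, here is a hedged sketch of the full call with those options (a trimmed-down subset of the headers above, and console logging in place of eyespect):
const request = require('request');

const url = 'https://www.ukr.net/news/dat/sport/2/';
const options = {
  method: 'get',
  json: true, // parse the JSON response automatically
  gzip: true, // decompress the gzipped response
  url: url,
  headers: {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.8"
  }
};

request(options, (err, res, body) => {
  if (err) {
    console.error('request failed:', err);
    return;
  }
  console.log('status:', res.statusCode);
  console.log('body:', body);
});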

If you have experience with jQuery, there is a library (Cheerio) that gives you jQuery-like access to the HTML. For example:
Markup example we'll be using:
<ul id="fruits">
  <li class="apple">Apple</li>
  <li class="orange">Orange</li>
  <li class="pear">Pear</li>
</ul>
First you need to load in the HTML. This step in jQuery is implicit, since jQuery operates on the one, baked-in DOM. With Cheerio, we need to pass in the HTML document.
const cheerio = require('cheerio');
const $ = cheerio.load('<ul id="fruits">...</ul>');
Selectors
$('ul .pear').attr('class')
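//=> pear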
You can probably do something like this:
request(options, (err, res, body) => {
  const $ = cheerio.load(body);
});
https://github.com/cheeriojs/cheerio
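Putting the two answers together, a rough sketch of fetching the page and pulling text out with Cheerio might look like this (the '.sport-item a' selector is purely hypothetical - you would need to inspect ukr.net's actual markup to pick the right one):
const request = require('request');
const cheerio = require('cheerio');

const options = {
  method: 'get',
  url: 'https://www.ukr.net/ua/news/sport.html',
  gzip: true,
  headers: {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'
  }
};

request(options, (err, res, body) => {
  if (err) {
    console.error(err);
    return;
  }
  const $ = cheerio.load(body);
  // '.sport-item a' is a hypothetical selector; replace it with the real one.
  $('.sport-item a').each((i, el) => {
    console.log($(el).text());
  });
});
Note that Cheerio only sees the HTML the server returns; if the page builds its content with client-side JavaScript (as the "loading, please wait" placeholder in the question suggests), you would need a headless browser instead.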

Related

Axios POST data does not send in correct format to Express Server

Hi, I'm running an Express server that has this .post route and uses express-formidable and express.json() as middleware.
Express Server
const formidable = require('express-formidable');
app.use(express.json());
app.use(formidable());
app.post('/test', function(req, res){
  console.log(req.fields);
});
Using AJAX (No Issues)
When I send a POST request using AJAX like so:
$.ajax({
  url: 'http://localhost:3000/test',
  type: "POST",
  crossDomain: true,
  dataType: "json",
  data: {
    "file": "background.js"
  },
  success: async function (response) {
  }
});
The server outputs:
{ file: 'background.js' }
The Problem
However, when I send the same POST request using Axios:
var fUrl = 'http://localhost:3000/test';
var fHeader = {
  'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.88 Safari/537.36',
  'content-type': 'application/x-www-form-urlencoded; charset=UTF-8'
};
var req = await axios({
  method: "POST",
  url: fUrl,
  withCredentials: true,
  data: { "file": 'background.js' },
  headers: fHeader
});
The server outputs in the wrong format:
{ '{"file":"background.js"}': '' }
I suspect that the issue may be because of the content-type header; however, when I change it to application/json, the request doesn't complete and waits for an apparently infinite amount of time.
app.use(express.json());
app.use(formidable());
Never use both at the same time.
Also, that is not the way to send a file, but that would be another Q&A.
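As a hedged sketch of one way around the body-format problem (keeping the application/x-www-form-urlencoded content type and express-formidable alone as the parser), you can URL-encode the data yourself so the body actually matches the declared content type:
const axios = require('axios');

async function sendFileName() {
  // URL-encode the body so it matches the declared content type.
  const body = new URLSearchParams({ file: 'background.js' }).toString(); // "file=background.js"

  const res = await axios.post('http://localhost:3000/test', body, {
    headers: { 'content-type': 'application/x-www-form-urlencoded; charset=UTF-8' }
  });
  console.log(res.status);
}

sendFileName().catch(console.error);
With that shape of request, express-formidable should populate req.fields as { file: 'background.js' }.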

Fetch working in HTML page but not in JS one [duplicate]

This question already has answers here:
CORS error even after setting Access-Control-Allow-Origin or other Access-Control-Allow-* headers on client side
(2 answers)
Closed 1 year ago.
A very simple HTML page fetches the JSON I want with a 200 status code:
<body>
  <button id="login" name="login">login</button>
  <script type="text/javascript">
    document.getElementById('login').onclick = function () {
      fetch('https://api.contonso.com/search/company',
        { method: 'GET' }).then(_ => console.log(_.status));
    };
  </script>
</body>
which I serve this way:
// index.js
const fs = require('fs');
const http = require('http');

fs.readFile('./index.html', function (err, html) {
  if (err) {
    throw err;
  }
  http.createServer(function (request, response) {
    response.writeHead(200, { "Content-Type": "text/html" });
    response.write(html);
    response.end();
  }).listen(8000);
});
So, clicking in the HTML page, fetch works and console.log prints 200 (OK).
Now consider that I modify the code in index.js to the following:
// index.js (modified)
const fetch = require('node-fetch');
fetch('https://api.contonso.com/search/company',
{ method: 'GET' }).then(_ => console.log(_.status));
Here console.log prints 403 (Forbidden).
Could you please explain what I'm doing wrong? Why does it work in the HTML page but not in the JS one?
I'm developing a bot that does not use any frontend, I only need JS files.
Thanks,
In the JS-only version I added the following headers (seen in the browser but missing in JS); still the same error:
headers: {
  'Origin': 'https://localhost:8000',
  'Accept': '*/*',
  'Accept-Encoding': 'gzip, deflate, br',
  'Accept-Language': 'en-US,en;q=0.9',
  'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36'
}
The request from the browser sends some headers, like the user agent (which the library sets to node-fetch by default).
You have to inspect the response to your request for the reason behind the 403, and check the API documentation for further instructions.
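For example, a minimal sketch of dumping the status and body of the failing response with node-fetch (v2-style require, matching the question) so you can see what the API is actually rejecting:
const fetch = require('node-fetch');

fetch('https://api.contonso.com/search/company', { method: 'GET' })
  .then(async res => {
    console.log(res.status, res.statusText);
    // The body often explains the rejection (missing API key, blocked user agent, etc.).
    console.log(await res.text());
  })
  .catch(console.error);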

request nodejs gets unreadable data

I'm trying to scrape the HTML using the request library on Node.js. The response code is 200, but the data I get is unreadable. Here is my code:
var request = require("request");

const options = {
  uri: 'https://www.wikipedia.org',
  encoding: 'utf-8',
  headers: {
    "Accept": "text/html,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3",
    "charset": "utf-8",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "en-US,en;q=0.9",
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/78.0.3904.108 Chrome/78.0.3904.108 Safari/537.36"
  }
};

request(options, function(error, response, body) {
  console.log(body);
});
As you can see, I requested HTML as utf-8 but got a large string like f��j���+���x��,�G�Y�l
My node version is v8.10.0 and the request version is 2.88.0.
Is something wrong with the code, or am I missing something?
Any hint to overcome this problem would be appreciated.
Updated Answer:
In response to your latest post:
The reason it is not working for Amazon is that the response is gzipped. In order to decompress the gzip response, you simply need to add gzip: true to the options object you are using. This will work for both Amazon and Wikipedia:
const request = require('request');

const options = {
  uri: "https://www.amazon.com",
  gzip: true
};

request(options, function(error, response, body) {
  if (error) throw error;
  console.log(body);
});
Lastly, if you want to scrape web pages like this, it is probably best to use a web scraping framework like Puppeteer, since it is built for that job.
See here for Puppeteer GitHub.
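As a rough sketch of the Puppeteer approach (assuming puppeteer is installed; it drives a headless Chromium, so gzip and browser headers are handled for you):
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://www.amazon.com', { waitUntil: 'networkidle2' });
  const html = await page.content(); // fully rendered HTML
  console.log(html);
  await browser.close();
})();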
Original Answer:
Since you are just grabbing the HTML from the main page, you do not have to specify charset, encoding, or Accept-Encoding:
const request = require('request');

const options = {
  uri: 'https://www.wikipedia.org',
  //encoding: 'utf-8',
  headers: {
    "Accept": "text/html,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3",
    //"charset": "utf-8",
    //"Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "en-US,en;q=0.9",
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/78.0.3904.108 Chrome/78.0.3904.108 Safari/537.36"
  }
};

request(options, function (error, response, body) {
  if (error) throw error;
  console.log(body);
});
To take it a bit further, in this scenario you don't need to specify headers at all:
const request = require('request');
request('https://www.wikipedia.org', function (error, response, body) {
  if (error) throw error;
  console.log(body);
});
Thank you for the reply. When I use that for the Wikipedia page it works properly, but when I use it to scrape another website like Amazon, I get the same bad result:
const request = require('request');
request('https://www.amazon.com', function (error, response, body) {
  if (error) throw error;
  console.log(body);
});

How to do a get request and get the same results that a browser would with Nodejs?

I'm trying to do a GET request for an image search, and I'm not getting the same result that I get in my browser. Is there a way to get the same result using Node.js?
Here's the code I'm using:
const https = require('https');

var keyword = "Photographie";
keyword = keyword.replace(/[^a-zA-Z0-9éàèùâêîôûçëïü]/g, "+");

var httpOptions = {
  hostname: 'yandex.com',
  path: '/images/search?text=' + keyword, // path does not accept spaces or dashes
  headers: { 'Content-Type': 'application/x-www-form-urlencoded', 'user-agent': 'Mozilla/5.0' }
};

console.log(httpOptions.hostname + httpOptions.path);

https.get(httpOptions, (httpResponse) => {
  console.log(`STATUS: ${httpResponse.statusCode}`);
  httpResponse.setEncoding('utf8');
  httpResponse.on('data', (htmlBody) => {
    console.log(`BODY: ${htmlBody}`);
  });
});
By switching to the request-promise library and using the proper capitalization of the User-Agent header name and an actual user agent string from the Chrome browser, this code works for me:
const rp = require('request-promise');

let keyword = "Photographie";
let options = {
  url: 'http://yandex.com/images/search?text=' + keyword,
  headers: { 'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36' }
};

rp(options).then(response => {
  console.log(response);
}).catch(err => {
  console.log(err);
});
When I try to run your actual code, I get a 302 redirect and a cookie set. I'm guessing they expect you to follow the redirect and retain the cookie. But you can apparently just switch to the above code, and it appears to work for me. I don't know exactly what makes my code work, but it could be that it has a more recognizable user agent.
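If the 302-plus-cookie theory is the real cause, a hedged sketch of handling it with request-promise would be to enable a cookie jar and redirect following (both are standard request options that request-promise passes through):
const rp = require('request-promise');

const options = {
  url: 'http://yandex.com/images/search?text=Photographie',
  jar: true,                     // keep cookies across the redirect
  followAllRedirects: true,      // follow redirects for all methods
  resolveWithFullResponse: true, // get the status code and headers, not just the body
  headers: { 'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36' }
};

rp(options)
  .then(res => console.log(res.statusCode, res.body.length))
  .catch(err => console.log(err));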

What is Postman "Interception Mode" equivalent to in Node.js?

I used to send requests using Postman Interceptor. This is how I handled the headers and body of the request:
You can try it yourself. You can see that once you turn on "interception mode", you get a different response than without it.
Now, I want to send the same request, but using the HTTPS module in Node.js.
I followed the following pattern:
var https = require('https');
var querystring = require('querystring');

var post_data = querystring.stringify({
  hid_last: "SMITH",
  hid_first: "JOHN",
  __RequestVerificationToken: "EiO369xBXRY9sHV/x26RNwlMzWjM9sR/mNlO9p9tor0PcY0j3dRItKH8XeljXmTfFWT0vQ1DYBzlGpLtnBBqEcOB51E9lh6wrEQbtMLUNOXpKKR3RzFqGc9inDP+OBIyD7s9fh9aMAypCHFCNFatUkx666nf7NOMHHKfiJKhfxc=",
  hid_max_rows: 20,
  hid_page: 1,
  hid_SearchType: 'PARTYNAME'
});

// An object of options to indicate where to post to
var post_options = {
  host: 'a836-acris.nyc.gov',
  path: '/DS/DocumentSearch/PartyNameResult',
  method: 'POST',
  headers: {
    'Content-Type': 'application/x-www-form-urlencoded',
    'Origin': 'https://a836-acris.nyc.gov',
    'Referer': "https://a836-acris.nyc.gov/DS/DocumentSearch/PartyName",
    'Upgrade-Insecure-Requests': 1,
    'User-Agent': "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36",
    'Content-Length': Buffer.byteLength(post_data),
    'Cookie': '_ga=GA1.2.1526584332.1483281720; WT_FPC=id=2fb6833e-6ae6-4529-b84a-4a1c61f24978:lv=1483256520738:ss=1483256520738'
  }
};

// Set up the request
var post_req = https.request(post_options, function(res) {
  res.setEncoding('utf8');
  res.on('data', function (chunk) {
    console.log('Response: ' + chunk);
  });
});

// Post the data
post_req.write(post_data);
post_req.end();
The only thing missing is the "interceptor" part. When I use this code now, I get the same response I used to get without using "interceptor" mode in Postman.
My question is: how do I "convert" Postman's "interceptor mode" to the HTTPS module in Node.js?
