I'm trying to scrape HTML using the request library on Node.js. The response code is 200, but the data I get back is unreadable. Here is my code:
var request = require("request");

const options = {
  uri: 'https://www.wikipedia.org',
  encoding: 'utf-8',
  headers: {
    "Accept": "text/html,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3",
    "charset": "utf-8",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "en-US,en;q=0.9",
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/78.0.3904.108 Chrome/78.0.3904.108 Safari/537.36"
  }
};

request(options, function(error, response, body) {
  console.log(body);
});
As you can see, I requested HTML with UTF-8 encoding, but I got back a large garbled string like f��j���+���x��,�G�Y�l.
My node version is v8.10.0 and the request version is 2.88.0.
Is something wrong with the code, or am I missing something?
Any hint to overcome this problem would be appreciated.
Updated Answer:
In response to your latest post:
The reason it is not working for Amazon is that the response is gzipped. In order to decompress the gzip response, you simply need to add gzip: true to the options object you are using. This will work for both Amazon and Wikipedia:
const request = require('request');

const options = {
  uri: "https://www.amazon.com",
  gzip: true
};

request(options, function(error, response, body) {
  if (error) throw error;
  console.log(body);
});
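For reference, gzip: true is roughly equivalent to requesting the raw bytes and decompressing them yourself with Node's built-in zlib module. Here is a minimal sketch of that manual approach, assuming the response really is gzip-encoded (the Content-Encoding response header is checked first):

const request = require('request');
const zlib = require('zlib');

// encoding: null makes request hand back a raw Buffer instead of a string.
request({
  uri: 'https://www.amazon.com',
  encoding: null,
  headers: { 'Accept-Encoding': 'gzip' }
}, function (error, response, body) {
  if (error) throw error;
  if (response.headers['content-encoding'] === 'gzip') {
    // Decompress the gzipped Buffer, then decode it as UTF-8 text.
    zlib.gunzip(body, function (err, decoded) {
      if (err) throw err;
      console.log(decoded.toString('utf-8'));
    });
  } else {
    // The server sent plain text; just decode it.
    console.log(body.toString('utf-8'));
  }
});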
Lastly, if you want to scrape webpages like this, it is probably best to use a headless-browser tool like Puppeteer, since it is built for this kind of automation.
See the Puppeteer GitHub repo for details.
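As a rough illustration (a minimal sketch, assuming you have installed it with npm install puppeteer), fetching a page's rendered HTML with Puppeteer looks something like this:

const puppeteer = require('puppeteer');

(async () => {
  // Launch a headless Chromium instance and open a new tab.
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Navigate to the page; Puppeteer handles compression, cookies,
  // and JavaScript rendering like a real browser would.
  await page.goto('https://www.amazon.com');

  // Grab the fully rendered HTML of the page.
  const html = await page.content();
  console.log(html);

  await browser.close();
})();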
Original Answer:
Since you are just grabbing the HTML from the main page, you do not have to specify charset, encoding, or Accept-Encoding:
const request = require('request');

const options = {
  uri: 'https://www.wikipedia.org',
  //encoding: 'utf-8',
  headers: {
    "Accept": "text/html,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3",
    //"charset": "utf-8",
    //"Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "en-US,en;q=0.9",
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/78.0.3904.108 Chrome/78.0.3904.108 Safari/537.36"
  }
};

request(options, function (error, response, body) {
  if (error) throw error;
  console.log(body);
});
To take it a bit further... in this scenario, you don't need to specify headers at all...
const request = require('request');

request('https://www.wikipedia.org', function (error, response, body) {
  if (error) throw error;
  console.log(body);
});
Thank you for the reply. When I use that on the Wikipedia page it works properly, but when I use it to scrape another website, like Amazon, I get the same bad result:
const request = require('request');

request('https://www.amazon.com', function (error, response, body) {
  if (error) throw error;
  console.log(body);
});
Related
Good evening everyone. I've been playing with HTTPS requests for a while and I've been reading the axios documentation. I've been trying to send a request to a specific API (Zalando), but I got stuck with a 403 Forbidden. Modules like Puppeteer and Chromium work fine because they sit inside a real browser, so they don't get easily detected by Akamai. Below is the code I'm using. I've been doing this ONLY for educational purposes. Feel free to comment. Have a nice day all.
const axios = require('axios');

const params = [{
  id: "e7f9dfd05f6b992d05ec8d79803ce6a6bcfb0a10972d4d9731c6b94f6ec75033",
  variables: {
    addToCartInput: {
      productId: "PU111A0P1-C110035000",
      clientMutationId: "addToCartMutation"
    }
  }
}];

const instance = axios.create({
  withCredentials: true,
  baseURL: 'https://www.zalando.it'
});

instance.post('/api/graphql/add-to-cart/', params, {
  headers: {
    'accept': '*/*',
    'content-type': 'application/json',
    'user-agent': 'Mozilla/5.0 (X11; CrOS x86_64 8172.45.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.64 Safari/537.36',
  },
})
  .then(function (response) {
    console.log(response);
  })
  .catch(function (error) {
    console.log(error);
  });
Response: [AxiosError: Request failed with status code 403]
A very simple HTML page fetches the JSON I want with a 200 status code:
<body>
  <button id="login" name="login">login</button>
  <script type="text/javascript">
    document.getElementById('login').onclick = function () {
      fetch('https://api.contonso.com/search/company',
        { method: 'GET' }).then(_ => console.log(_.status));
    };
  </script>
</body>
Which I serve this way:
// index.js
const fs = require('fs');
const http = require('http');

fs.readFile('./index.html', function (err, html) {
  if (err) {
    throw err;
  }
  // Serve index.html with a text/html content type on port 8000.
  http.createServer(function (request, response) {
    response.writeHead(200, { "Content-Type": "text/html" });
    response.write(html);
    response.end();
  }).listen(8000);
});
So when clicking the button in the HTML page, fetch works, and console.log prints 200 (OK).
Now consider that I modify the code in index.js as follows:
// index.js (modified)
const fetch = require('node-fetch');

fetch('https://api.contonso.com/search/company',
  { method: 'GET' }).then(_ => console.log(_.status));
Here console.log prints 403 (Forbidden).
Could you please explain what I'm doing wrong? Why does it work from the HTML page but not from the JS-only one?
I'm developing a bot that does not use any frontend; I only need JS files.
Thanks,
In the JS-only version I added the following headers (seen in the browser, missing in JS), but I still get the same error:
headers: {
  'Origin': 'https://localhost:8000',
  'Accept': '*/*',
  'Accept-Encoding': 'gzip, deflate, br',
  'Accept-Language': 'en-US,en;q=0.9',
  'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36'
}
The browser request sends some headers that your script does not, such as the User-Agent (which node-fetch sets to node-fetch by default).
You have to inspect the response to your request for the reason behind the 403, and check the API documentation for further instructions.
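As a rough sketch of that kind of inspection (the URL is the placeholder from the question, and the header values are merely illustrative), you could log the status and body of the rejected response while trying browser-like headers:

const fetch = require('node-fetch');

(async () => {
  const res = await fetch('https://api.contonso.com/search/company', {
    method: 'GET',
    headers: {
      // Mimic a browser instead of the default "node-fetch" user agent.
      'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36',
      'Accept': '*/*'
    }
  });

  // A 403 body often explains why the request was rejected
  // (bot protection, missing API key, etc.).
  console.log(res.status, res.statusText);
  console.log(await res.text());
})();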
I'm trying to make a request with Node.js, but in the response I'm getting these characters: ��!̨=��}�oZdW���Z������ξ���q��~. When I make the same request in my browser with jQuery, it works well and returns the correct JSON. How can I do it with Node.js?
I'm using the request lib to send the POST request.
My code:
var options = {
  method: 'POST',
  url: 'myapi',
  headers: {
    connection: 'keep-alive',
    referer: 'myapirefere',
    'cache-control': 'no-cache',
    'content-type': 'application/x-www-form-urlencoded',
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
    origin: 'myapiorigin',
    pragma: 'no-cache'
  },
  body: 'code=123'
};

request(options, function (error, response, body) {
  if (error) throw new Error(error);
  console.log('body', body);
});
I just added the gzip: true option and it works!
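For completeness, the fix amounts to adding gzip: true to the options from the question above (the placeholder URL and trimmed headers are from that snippet):

var request = require('request');

var options = {
  method: 'POST',
  url: 'myapi',
  gzip: true, // tell request to decompress the gzipped response body
  headers: {
    'content-type': 'application/x-www-form-urlencoded',
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'
  },
  body: 'code=123'
};

request(options, function (error, response, body) {
  if (error) throw new Error(error);
  console.log('body', body); // readable text instead of raw gzip bytes
});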
I want to scrape the page "https://www.ukr.net/ua/news/sport.html" with Node.js.
I'm trying to make a basic GET request with the 'request' npm module; here is an example:
const inspect = require('eyespect').inspector();
const request = require('request');

const url = 'https://www.ukr.net/news/dat/sport/2/';
const options = {
  method: 'get',
  json: true,
  url: url
};

request(options, (err, res, body) => {
  if (err) {
    inspect(err, 'error posting json');
    return;
  }
  const headers = res.headers;
  const statusCode = res.statusCode;
  inspect(headers, 'headers');
  inspect(statusCode, 'statusCode');
  inspect(body, 'body');
});
But in the response body I only get:
body: '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">\n<html>\n<head>\n<META HTTP-EQUIV="expires" CONTENT="Wed, 26 Feb 1997 08:21:57 GMT">\n<META HTTP-EQUIV=Refresh CONTENT="10">\n<meta HTTP-EQUIV="Content-type" CONTENT="text/html; charset=utf-8">\n<title>www.ukr.net</title>\n</head>\n<body>\nИдет загрузка, подождите .....\n</body>\n</html>'
(The Russian text "Идет загрузка, подождите" means "Loading, please wait" — a stub page with a meta refresh.)
If I make the same GET request from Postman, I get exactly what I need.
Please help me, guys.
You might have been blocked by bot protection; this can be checked with curl:
curl -vL https://www.ukr.net/news/dat/sport/2/
curl seems to get the result, and if curl works then there is probably something missing in the request from Node; a solution could be to mimic a browser of your choice. For example, here are Chrome-like request headers taken from the browser's developer tools, from which we derive the following options for the request:
const options = {
  method: 'get',
  json: true,
  url: url,
  gzip: true,
  headers: {
    "Host": "www.ukr.net",
    "Pragma": "no-cache",
    "Cache-Control": "no-cache",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate, sdch, br",
    "Accept-Language": "en-US,en;q=0.8"
  }
};
If you have experience with jQuery, there is a library, Cheerio, that gives you jQuery-style access to the HTML. For example:
Markup example we'll be using:
<ul id="fruits">
  <li class="apple">Apple</li>
  <li class="orange">Orange</li>
  <li class="pear">Pear</li>
</ul>
First you need to load in the HTML. This step in jQuery is implicit, since jQuery operates on the one, baked-in DOM. With Cheerio, we need to pass in the HTML document.
var cheerio = require('cheerio');
var $ = cheerio.load('<ul id="fruits"><li class="apple">Apple</li><li class="orange">Orange</li><li class="pear">Pear</li></ul>');
Selectors
$('ul .pear').attr('class')
//=> pear
Probably you can make something like this:
request(options, (err, res, body) => {
  if (err) throw err;
  // body contains the fetched HTML, so load it into Cheerio.
  var $ = cheerio.load(body);
});
https://github.com/cheeriojs/cheerio
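Putting the pieces together, a minimal request-plus-Cheerio sketch might look like this (the URL is the one from the question; the li selector is only illustrative and would need to match the real markup of the page):

const request = require('request');
const cheerio = require('cheerio');

const options = {
  url: 'https://www.ukr.net/ua/news/sport.html',
  gzip: true // decompress the response, as discussed above
};

request(options, (err, res, body) => {
  if (err) throw err;
  const $ = cheerio.load(body);
  // Print the text of every list item, as in the fruits example above.
  $('li').each((i, el) => {
    console.log($(el).text());
  });
});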
I'm using the Node.js "request" module to access this particular page
http://www.actapress.com/PaperInfo.aspx?PaperID=28602
via:
r = request(i, (err, resp, body) ->
  if err
    console.log err
  else
    console.log body
)
The content of "body" is different compared to when I actually access the URL in a browser. Are there some extra settings that I need to configure for the request module?
Try setting the User-Agent header:
request({
  uri: 'http://www.actapress.com/PaperInfo.aspx?PaperID=28602',
  headers: {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.116 Safari/537.36'
  }
}, function(err, res, body) {
  console.log(body);
});
If the body is a JSON string, you can simply parse it with JSON.parse:
body = JSON.parse(body);
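A slightly safer sketch, using the request library from the answers above: JSON.parse throws on malformed input, and passing json: true makes request parse the body for you (the 'myapi' URL is just the placeholder from the earlier question):

const request = require('request');

// Option 1: let request parse the JSON for you.
request({ uri: 'myapi', json: true }, function (err, res, body) {
  if (err) throw err;
  console.log(body); // already a parsed object
});

// Option 2: parse manually, guarding against malformed JSON.
request('myapi', function (err, res, body) {
  if (err) throw err;
  try {
    const data = JSON.parse(body);
    console.log(data);
  } catch (e) {
    console.error('Response was not valid JSON:', e.message);
  }
});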