I'm stuck on a problem and need some help.
I'm using Node.js to crawl a list of websites, and some of them give me this error, for example:
http://www.fz-juelich.de/portal/DE/Home/home_node.html, Parse Error, HPE_INVALID_HEADER_TOKEN
request.get({
url: uri,
timeout: timeout,
headers: {
referer: domain
}
}, (error, response, body) => {
if (error)
console.log(error);
console.log(body);
});
However, curl -i --raw http://www.fz-juelich.de/portal/DE/Home/home_node.html
works just fine:
HTTP/1.1 404 Not Found
Server: Apache-Coyote/1.1
Cache-Control: no-cache
JSESSIONID=E594677A6CCA13BE0338E1D00A729C34; Path=/cae:
Content-Type: text/html;charset=utf-8
Content-Language: de
Set-Cookie: JSESSIONID=E594677A6CCA13BE0338E1D00A729C34; Path=/
Content-Length: 19677
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" >
I'm also able to open this website in my Chrome browser.
Any ideas where I should dig to get rid of these errors?
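One detail in the curl output above may be relevant: the line starting with JSESSIONID= has no valid header name in front of its colon. HTTP header names must be tokens (letters, digits, and a few punctuation characters), so a strict parser such as Node's could plausibly reject that line with HPE_INVALID_HEADER_TOKEN while curl and Chrome tolerate it. The sketch below is only a diagnostic aid under that assumption; findInvalidHeaderLines is a hypothetical helper, not part of any library.

```javascript
// Characters allowed in an HTTP header field name (the "token" rule of RFC 7230).
const TOKEN = /^[!#$%&'*+\-.^_`|~0-9A-Za-z]+$/;

// Hypothetical helper: returns the raw header lines whose name is not a valid token.
// `rawHead` is the response head as text, e.g. copied from `curl -i --raw`.
function findInvalidHeaderLines(rawHead) {
  const lines = rawHead.split(/\r?\n/);
  const invalid = [];
  // Skip the status line (index 0) and stop at the blank line that ends the head.
  for (let i = 1; i < lines.length; i++) {
    const line = lines[i];
    if (line === '') break;
    const colon = line.indexOf(':');
    const name = colon === -1 ? line : line.slice(0, colon);
    if (colon === -1 || !TOKEN.test(name)) invalid.push(line);
  }
  return invalid;
}
```

Running it over the head of a failing response narrows down which line the server is sending incorrectly.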
I quoted the property names, and that resolved it for me:
request.post(url,{
headers: {
'Authorization': 'Basic onEnAGrosEncodedBase64',
'Content-Type': 'application/x-www-form-urlencoded'
},
form: {
'grant_type': 'client_credentials'
}
})
I hope that can help someone ;)
In the end I stopped using Node.js for crawling and parsing.
A Go crawler fits much better here: the HTTP library is more flexible, and it is easier to write truly concurrent code.
I've been using the NPM Request package for years now and despite the fact that it's deprecated, it has always worked perfectly for my needs and I haven't ever had any issues up until today...
I am trying to do a regular GET request with a user:pass authenticated proxy and some additional headers - the same thing I've done a thousand times in the past, however this time the url is HTTP instead of HTTPS.
Apparently because the link is HTTP, something goes wrong with my proxy authentication: the response code is 407 and the response body shows the error Cache Access Denied (ERR_CACHE_ACCESS_DENIED). This is where I'm confused, because I know there shouldn't be anything wrong with my proxy authentication; it's the same thing I've done for years.
Request Code:
const request = require('request').defaults({
timeout: 30000,
gzip: true,
forever: true
});
const cookieJar = request.jar();
// Credentials are separated from the proxy host by "@".
const proxyUrl = "http://proxyUsername:proxyPassword@proxyDomain:proxyPort";
request({
method: "GET",
url: "http://mylink.com",
proxy: proxyUrl,
jar: cookieJar,
headers: {
"Proxy-Authorization": new Buffer('proxyUsername:proxyPassword').toString('base64'),
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
"Accept-Language": "en-GB,en-US;q=0.9,en;q=0.8",
"Connection": "keep-alive",
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.112 Safari/537.36",
},
followAllRedirects: true
}, (err, resp, body) => {
if (err || resp.statusCode != 200) {
if (err) {
console.log(err);
} else {
console.log(resp.statusCode);
console.log(resp.body);
}
return;
}
console.log(resp.body);
});
Snippet of Response Body (Status Code 407):
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html><head>
<meta type="copyright" content="Copyright (C) 1996-2019 The Squid Software Foundation and contributors">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>ERROR: Cache Access Denied</title>
<style type="text/css"><!--
body
:lang(fa) { direction: rtl; font-size: 100%; font-family: Tahoma, Roya, sans-serif; float: right; }
:lang(he) { direction: rtl; }
--></style>
</head><body id=ERR_CACHE_ACCESS_DENIED>
<div id="titles">
<h1>ERROR</h1>
<h2>Cache Access Denied.</h2>
</div>
<hr>
...
<p>Sorry, you are not currently allowed to request http://mylink.com from this cache until you have authenticated yourself.</p>
...
Other things to note:
I have also connected to the exact same proxy on my browser and gone to the same site and it works perfectly so it's not an issue with the proxy itself.
If I remove the proxy from the request it works perfectly, so maybe I have configured the request wrong for HTTP is all I can think of
Now as I said this exact code works flawlessly with any HTTPS link so that's where I'm stumped. Any help would be appreciated!
I changed the Proxy-Authorization header to Proxy-Authenticate and that seemed to work perfectly. I have no clue why HTTP requires that, and there is not much documentation on the matter.
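For comparison, a Proxy-Authorization value under the Basic scheme normally consists of the word Basic, a space, and the base64 encoding of user:password, whereas the header in the question sends only the base64 part. Whether that was the cause of the 407 here is an assumption; the thread does not confirm it. basicProxyAuth is a hypothetical helper name.

```javascript
// Builds a Proxy-Authorization value for the Basic scheme:
// "Basic " followed by base64("username:password").
function basicProxyAuth(username, password) {
  // Buffer.from replaces the deprecated `new Buffer(...)` constructor.
  const credentials = Buffer.from(`${username}:${password}`, 'utf8').toString('base64');
  return `Basic ${credentials}`;
}

// Usage sketch with the placeholder credentials from the question:
//   headers: { 'Proxy-Authorization': basicProxyAuth('proxyUsername', 'proxyPassword') }
```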
I need to load a page from this URL: https://medalist-a805c.web.app/HeardSuggest_Medalist.html, which in turn loads a page from an Express server. I'm able to use cors, the Express middleware. Eventually, though, I need to load this URL: https://www.alwaysheard.com/embed/Medalist/brand/$2a$10$Z9ib22YZyRWWZz/Vlzju3u9eZCZM.a8oUYorNPsKGzLPKu6vM696K?uname=embed&orgin=https%3A%2F%2Fwww.alwaysheard.com
The weird problem I'm facing is that it always returns the homepage rather than the URL I am trying to get, which is: https://www.alwaysheard.com/embed/Medalist/brand/$2a$10$Z9ib22YZyRWWZz/Vlzju3u9eZCZM.a8oUYorNPsKGzLPKu6vM696K?uname=embed&orgin=https%3A%2F%2Fwww.alwaysheard.com
Any help will be highly appreciated. Many thanks.
This is for an Express.js server.
This is my frontend calling code:
$.ajax({
url: _url,
data: { uname: "embed", orgin: _origin },
// dataType: 'json',
crossDomain: true,
type: "GET",
mode: 'no-cors',
headers: {
'mode': 'no-cors',
// "X-TOKEN": 'xxxxx',
// 'x-Trigger': 'CORS',
// 'Access-Control-Allow-Headers': 'x-requested-with',
// "Access-Control-Allow-Origin": "*",
// 'Authorization': 'Bearer fadsfasf asdfasdf'
},
success: function (data) {
console.log("data: ", data);
$("body").html(data);
}
});
And in the backend I'm using app.use(cors());
Your URL is first returning a 302 redirect to the home page and then the $.ajax() system (actually the XMLHttpRequest system underneath it) is following that redirect automatically.
You can see a discussion of that topic here:
How to prevent ajax requests to follow redirects using jQuery
In newer browsers, you can use the fetch() interface instead and it contains a redirect option that allows you to tell it not to follow redirects.
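A minimal sketch of that fetch() approach follows. fetchWithoutFollowing is a hypothetical helper name, and the fetchImpl parameter exists only so the function can be exercised without a network. Note that in browsers a redirect stopped this way is reported as an "opaqueredirect" response, so the page can tell that a redirect happened but cannot read its status or Location header.

```javascript
// Hypothetical helper: requests `url` without following redirects.
// `fetchImpl` defaults to the global fetch and is injectable only to ease testing.
async function fetchWithoutFollowing(url, fetchImpl = fetch) {
  const response = await fetchImpl(url, { method: 'GET', redirect: 'manual' });
  // Browsers report a stopped redirect as type "opaqueredirect" (status 0);
  // other runtimes may expose the real 3xx status instead.
  const wasRedirect =
    response.type === 'opaqueredirect' ||
    (response.status >= 300 && response.status < 400);
  return { wasRedirect, response };
}

// Usage sketch:
//   const { wasRedirect } = await fetchWithoutFollowing(_url);
//   if (wasRedirect) console.log('The server answered with a redirect.');
```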
FYI, it's unclear what you're really trying to accomplish here. There is no "page" to retrieve from this URL. That returns a 302 status and a few headers, no page.
In case this isn't clear to you, your long url returns this response:
HTTP/1.1 302 Found
Date: Fri, 18 Oct 2019 17:04:05 GMT
Content-Type: text/plain; charset=utf-8
Content-Length: 23
Connection: keep-alive
Server: nginx/1.14.1
Access-Control-Allow-Origin: *
Set-Cookie: _csrf=j6JfsEb_JbmSJlvHDoyhgatS; Path=/
Set-Cookie: connect.sid=s%3A8C1PVNBcm5395dwmPJ4Xbl7BNASxPH4W.WFZ6uz3NklxaObnWvZaZXI%2BoH5VdM9dnOi2VCTw3J%2BI; Path=/; Expires=Mon, 15 Oct 2029 17:04:05 GMT; HttpOnly
Surrogate-Control: no-store
Cache-Control: no-store, no-cache, must-revalidate, proxy-revalidate
Pragma: no-cache
Expires: 0
Location: /
Vary: Accept
This content was retrieved with curl -l -v yourlongurl.
I am trying to call the Emotion API via JavaScript within a PhoneGap app. I encoded the image as base64 and verified that the data can be decoded by one of the online tools. This is the code I found on the web:
var apiKey = "e371fd4333ccad2"; //(you can get a free key on site this is modified for here)
//apiUrl: The base URL for the API. Find out what this is for other APIs via the API documentation
var apiUrl = "https://api.projectoxford.ai/emotion/v1.0/recognize";
"file" is the base64 string.
function CallAPI(file, apiUrl, apiKey)
{
// console.log("file=> " +file);
$.ajax({
url: apiUrl,
beforeSend: function (xhrObj) {
xhrObj.setRequestHeader("Content-Type", "application/octet-stream");
xhrObj.setRequestHeader("Ocp-Apim-Subscription-Key", apiKey);
},
type: "POST",
data: file,
processData: false
})
.done(function (response) {
console.log("in call api a");
ProcessResult(response);
})
.fail(function (error) {
console.log(error.getAllResponseHeaders())
});
}
function ProcessResult(response)
{
console.log("in processrult");
var data = JSON.stringify(response);
console.log(data);
}
I got back this:
Expires: -1
Access-Control-Allow-Origin: *
Pragma: no-cache
Cache-Control: no-cache
Content-Length: 60
Date: Fri, 01 Apr 2016 13:34:32 GMT
Content-Type: application/json; charset=utf-8
X-Powered-By: ASP.NET
So I tried their console test page:
https://dev.projectoxford.ai/docs/services/5639d931ca73072154c1ce89/operations/563b31ea778daf121cc3a5fa/console
I can put in an image URL like "example.com/man.jpg" and it works great. But if I take the same image and send it base64-encoded, all I get is "Bad Body". I have tried both content types, "application/octet-stream" and "application/json", and get the same error. The HTTP request looks like this:
POST https://api.projectoxford.ai/emotion/v1.0/recognize HTTP/1.1
Content-Type: application/octet-stream
Host: api.projectoxford.ai
Ocp-Apim-Subscription-Key: ••••••••••••••••••••••••••••••••
Content-Length: 129420
...
I get back:
Pragma: no-cache
Cache-Control: no-cache
Date: Fri, 01 Apr 2016 16:23:09 GMT
X-Powered-By: ASP.NET
Content-Length: 60
Content-Type: application/json; charset=utf-8
Expires: -1
{
"error": {
"code": "BadBody",
"message": "Invalid face image."
}
}
I am now not sure whether you can send an image like this from JavaScript. Can anyone tell me if my JavaScript is correct, or whether it is possible to send a base64-encoded image string to the site?
thanks for your help
tim
This API does not accept data URIs for images. What you'll need to do is convert it to a binary blob. Though this answer is for a different Project Oxford API, you can apply the same technique.
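As a rough sketch of that conversion, the base64 text can be decoded into raw bytes before it is posted. base64ToBytes is a hypothetical helper name, and the assumption is that the string is either plain base64 or a data URI; the result can be sent as-is or wrapped in a Blob.

```javascript
// Hypothetical helper: turns a base64 string (optionally a data URI) into raw bytes.
function base64ToBytes(input) {
  // Drop a leading "data:<mime>;base64," prefix if one is present.
  const base64 = input.replace(/^data:[^,]*,/, '');
  const binary = atob(base64); // one character per decoded byte
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  return bytes;
}

// Usage sketch with the $.ajax call from the question:
//   data: base64ToBytes(file),   // or new Blob([base64ToBytes(file)])
//   processData: false
```

With processData set to false, jQuery hands the bytes to the underlying XMLHttpRequest unchanged, which matches the application/octet-stream content type.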
I am trying to import a JSON array into ArangoDB using the HTTP API from various Node modules such as needle, http, and request. Each time I get the following error or something similar:
{ error: true,
errorMessage: 'expecting a JSON array in the request',
code: 400,
errorNum: 400 }
The code is below (similar for most modules listed above with minor variations). Various scenarios (single document import, etc.) all seem to point to the post body not being correctly recognized for some reason.
var needle = require('needle');
var data = [{
"lastname": "ln",
"firstname": "fn",
},
{
"lastname": "ln2",
"firstname": "fn2"
}];
var options = { 'Content-Type': 'application/json; charset=utf-8' };
needle.request('POST', 'http://ip:8529/_db/mydb/_api/import?type=array&collection=accounts&createCollection=false', data, options, function(err, resp) {
console.log(resp.body);
});
While I am able to upload the documents using curl and browser dev tools, I have not been able to get it working in Node.js. What am I doing wrong? This is driving me crazy. Any help would be appreciated. Thank you very much.
You can use ngrep (or Wireshark) to quickly find out what's wrong:
ngrep -Wbyline port 8529 -d lo
T 127.0.0.1:53440 -> 127.0.0.1:8529 [AP]
POST /_db/mydb/_api/import?type=array&collection=accounts&createCollection=true HTTP/1.1.
Accept: */*.
Connection: close.
User-Agent: Needle/0.9.2 (Node.js v1.8.1; linux x64).
Content-Type: application/x-www-form-urlencoded.
Content-Length: 51.
Host: 127.0.0.1:8529.
.
##
T 127.0.0.1:53440 -> 127.0.0.1:8529 [AP]
lastname=ln&firstname=fn&lastname=ln2&firstname=fn2
The body sent to ArangoDB has to be JSON (which is what you were trying to achieve by setting the content type), but the capture shows needle sending it form-encoded.
To make needle actually post JSON, set the json option (see https://github.com/tomas/needle#request-options ):
var options = {
// The key must be quoted: a bare Content-Type is a syntax error in JavaScript.
'Content-Type': 'application/json; charset=utf-8',
json: true
};
which produces the proper reply:
{ error: false,
created: 2,
errors: 0,
empty: 0,
updated: 0,
ignored: 0 }
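To make the difference on the wire concrete, the JSON body can also be serialized by hand and sent with any HTTP client. The sketch below only builds the headers and body and makes no request; buildJsonImportRequest is a hypothetical helper, not part of needle or ArangoDB.

```javascript
// Hypothetical helper: prepares headers and body for posting `docs` as a JSON array.
function buildJsonImportRequest(docs) {
  const body = JSON.stringify(docs);
  return {
    headers: {
      'Content-Type': 'application/json; charset=utf-8',
      // Content-Length counts bytes, not characters.
      'Content-Length': Buffer.byteLength(body, 'utf8'),
    },
    body,
  };
}

const importRequest = buildJsonImportRequest([
  { lastname: 'ln', firstname: 'fn' },
  { lastname: 'ln2', firstname: 'fn2' },
]);
// The body is a JSON array rather than lastname=ln&firstname=fn&...
console.log(importRequest.headers, importRequest.body);
```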
I am using nodejs and the REST API to interact with bigquery. I am using the google-oauth-jwt module for JWT signing.
I granted a service account write permission. So far I can list projects, list datasets, create a table, and delete a table. But when it comes to uploading a file via multipart POST, I run into two problems:
A gzipped JSON file doesn't work; I get an error saying "end boundary missing".
When I use an uncompressed JSON file, I get a 401 unauthorized error.
I don't think this is related to my machine's time being out of sync since other REST api calls worked as expected.
var url = 'https://www.googleapis.com/upload/bigquery/v2/projects/' + projectId + '/jobs';
var request = googleOauthJWT.requestWithJWT();
var jobResource = {
jobReference: {
projectId: projectId,
jobId: jobId
},
configuration: {
load: {
sourceFormat: 'NEWLINE_DELIMITED_JSON',
destinationTable: {
projectId: projectId,
datasetId: datasetId,
tableId: tableId
},
createDisposition: '',
writeDisposition: ''
}
}
};
request(
{
url: url,
method: 'POST',
jwt: jwtParams,
headers: {
'Content-Type': 'multipart/related'
},
qs: {
uploadType: 'multipart'
},
multipart: [
{
'Content-Type':'application/json; charset=UTF-8',
body: JSON.stringify(jobResource)
},
{
'Content-Type':'application/octet-stream',
body: fileBuffer.toString()
}
]
},
function(err, response, body) {
console.log(JSON.parse(body).selfLink);
}
);
Can anyone shed some light on this?
P.S. The documentation for the BigQuery REST API is out of date in many places; I wish the Google folks would keep it updated.
Update 1:
Here is the full HTTP request:
POST /upload/bigquery/v2/projects/239525534299/jobs?uploadType=multipart HTTP/1.1
content-type: multipart/related; boundary=71e00bd1-1c17-4892-8784-2facc6998699
authorization: Bearer ya29.AHES6ZRYyfSUpQz7xt-xwEgUfelmCvwi0RL3ztHDwC4vnBI
host: www.googleapis.com
content-length: 876
Connection: keep-alive
--71e00bd1-1c17-4892-8784-2facc6998699
Content-Type: application/json
{"jobReference":{"projectId":"239525534299","jobId":"test-upload-2013-08-07_2300"},"configuration":{"load":{"sourceFormat":"NEWLINE_DELIMITED_JSON","destinationTable":{"projectId":"239525534299","datasetId":"performance","tableId":"test_table"},"createDisposition":"CREATE_NEVER","writeDisposition":"WRITE_APPEND"}}}
--71e00bd1-1c17-4892-8784-2facc6998699
Content-Type: application/octet-stream
{"practiceId":2,"fanCount":5,"mvp":"Hello"}
{"practiceId":3,"fanCount":33,"mvp":"Hello"}
{"practiceId":4,"fanCount":71,"mvp":"Hello"}
{"practiceId":5,"fanCount":93,"mvp":"Hello"}
{"practiceId":6,"fanCount":92,"mvp":"Hello"}
{"practiceId":7,"fanCount":74,"mvp":"Hello"}
{"practiceId":8,"fanCount":100,"mvp":"Hello"}
{"practiceId":9,"fanCount":27,"mvp":"Hello"}
--71e00bd1-1c17-4892-8784-2facc6998699--
You are most likely sending duplicate content-type headers to the Google API.
I don't have the capability to effortlessly make a request to Google BigQuery to test, but I'd start with removing the headers property of your options object to request().
Remove this:
headers: {
'Content-Type': 'multipart/related'
},
The Node.js request module automatically detects that you have passed in a multipart array, and it adds the appropriate content-type header. If you provide your own content-type header, you most likely end up with a "duplicate" one, which does not contain the multipart boundary.
If you modify your code slightly to print out the actual headers sent:
var req = request({...}, function(..) {...});
console.log(req.headers);
You should see something like this for your original code above (I'm using the Node REPL):
> req.headers
{ 'Content-Type': 'multipart/related',
'content-type': 'multipart/related; boundary=af5ed508-5655-48e4-b43c-ae5be91b5ae9',
'content-length': 271 }
And the following if you remove the explicit headers option:
> req.headers
{ 'content-type': 'multipart/related; boundary=49d2371f-1baf-4526-b140-0d4d3f80bb75',
'content-length': 271 }
Some servers don't deal well with multiple headers having the same name. Hopefully this solves the end boundary missing error from the API!
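A small check like the following can flag the situation before a request goes out. It is only a sketch: findDuplicateHeaderNames is a hypothetical helper, and it assumes the headers are available as a plain object such as the req.headers output shown above.

```javascript
// Hypothetical helper: lists header names that occur more than once when compared
// case-insensitively, which is how HTTP treats header names.
function findDuplicateHeaderNames(headers) {
  const counts = new Map();
  for (const name of Object.keys(headers)) {
    const key = name.toLowerCase();
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  return [...counts].filter(([, n]) => n > 1).map(([key]) => key);
}

// Usage sketch:
//   console.log(findDuplicateHeaderNames(req.headers));
```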
I figured this out myself. This is one of those silly mistakes that keeps you stuck for a whole day, and when you finally find the solution you want to knock yourself on the head.
I got the 401 by typing the selfLink URL into the browser. Of course that request is not authorized.