I am just starting out with web scraping using axios, and I am trying to take it one step at a time.
The page I want to scrape sits behind a login page with a URL like
webpage/login
If I inspect the page, the login form has two input fields, one for the username and one for the password:
<input type="text" name="login">
<input type="password" name="password">
Once I enter the username and password in the browser, I get redirected to a page which contains the data that I need.
axios({
  method: 'post',
  url: 'https://mywebsite/login/',
  data: {
    login: 'Dave232',
    password: 'pass23456'
  }
})
  .then(response => {
    console.log(response.data)
    console.log(response.headers);
  })
  .catch(error => {
    console.log(error)
  })
However, when I run my Node app, I get the login page HTML back instead of the next page.
console.log(response.headers)
returns
{ date: 'Wed, 18 Sep 2019 03:17:35 GMT',
server: 'Apache-Coyote/1.1',
'content-type': 'text/html;charset=UTF-8',
'content-length': '24979',
'set-cookie':
[ 'cfid=04d1704f-88f6-49b6-bcd4-4a7467b8e4ab;Path=/;Expires=Thu, 16-Sep-2049 11:09:05 GMT;HTTPOnly',
'cftoken=0;Path=/;Expires=Thu, 16-Sep-2049 11:09:05 GMT;HTTPOnly',
'JSESSIONID=6874842A05E89B0D8B1D33BEBAD537AF.NE1ITCPRHEWS06; Path=/website/; Secure; HttpOnly',
'LOGGED_IN=;Path=/;Expires=Wed, 18-Sep-2019 03:17:35 GMT',
'CF_CLIENT_WEBSITE=%7B%27queueLastTab%27%3A%27workorder%27%2C%27last_contract_category_sk%27%3A%27%27%7D;Path=/;Expires=Thu, 19-Sep-2019 03:17:35 GMT',
'CF_CLIENT_WEBSITE_LV=1568776655537;Path=/;Expires=Thu, 19-Sep-2019 03:17:35 GMT',
'CF_CLIENT_WEBSITE_TC=1568776655537;Path=/;Expires=Thu, 19-Sep-2019 03:17:35 GMT',
'CF_CLIENT_WEBSITE_HC=2;Path=/;Expires=Thu, 19-Sep-2019 03:17:35 GMT' ],
connection: 'close' }
Please help me.
Edit 1:
I just realized that the form's action property looks something like this:
<form action="index.cfm?fuseaction=security.login_check" method="post">
I have never seen an action property like this. But I tried the url:
url/index.cfm?fuseaction=security.login_check
which did not work either.
You need to handle the redirect in your Node app. You can read the redirect URL from:
response.request.res.responseUrl
https://github.com/axios/axios/issues/799
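Separately from following the redirect, it can help to check what is actually being posted. Below is a minimal sketch, using only Node built-ins, that resolves the form's relative action against the login page URL and encodes the two fields the way an HTML form would. It assumes (the question does not confirm this) that the server expects application/x-www-form-urlencoded, which is what a plain HTML form with method="post" sends by default. The URL and credentials are the placeholders from the question.

```javascript
// Placeholders taken from the question; the real host is unknown.
const loginPage = 'https://mywebsite/login/';
const action = 'index.cfm?fuseaction=security.login_check';

// Resolve the relative form action against the page that served the form.
const postUrl = new URL(action, loginPage).toString();

// Encode the fields as a form body instead of JSON.
// The keys must match the "name" attributes of the inputs.
const body = new URLSearchParams({
  login: 'Dave232',
  password: 'pass23456',
}).toString();

console.log(postUrl);
console.log(body);
```

The resulting string could then be passed as `data` to axios together with a `Content-Type: application/x-www-form-urlencoded` header. Whether that alone gets past this particular login is an assumption to verify against the request the browser sends.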
Related
I am currently having trouble mocking a particular request that mapbox-gl makes. When the map loads, .pbf files are requested from Mapbox, and I have not been able to mock those requests.
My guess is that the core issue is an open Cypress bug, issue-16420.
I tried a lot of different intercept variants and all kinds of response headers. I served the fixture gzipped, compressed, and Brotli-compressed. I tried different encodings for the fixture. Nothing worked. One of the interceptors looks basically like this:
cy.intercept({
method: 'GET',
url: '**/fonts/v1/mapbox/DIN%20Offc%20Pro%20Italic,Arial%20Unicode%20MS%20Regular/0-255.pbf?*',
}, {
fixture: 'fonts/italic.arial.0-255.pbf,binary',
statusCode: 204,
headers: {
'Connection': 'keep-alive',
'Keep-Alive': 'timeout=5',
'Transfer-Encoding': 'chunked',
'access-control-allow-origin': '*',
'access-control-expose-headers': 'Link',
'age': '11631145',
'cache-control': 'max-age=31536000',
'content-encoding': 'compress',
'content-type': 'application/x-protobuf',
'date': 'Sat, 19 Feb 2022 20:46:43 GMT',
'etag': 'W/"b040-+eCb/OHkPqToOcONTDlvpCrjmvs"',
'via': '1.1 4dd80d99fd5d0f6baaaf5179cd921f72.cloudfront.net (CloudFront)',
'x-amz-cf-id': '4uY9rjBgR_R12nkfHFrBMLEpNuWygW9DkmODlMEzwJHABTGCGg8pww==',
'x-amz-cf-pop': 'FRA56-P7',
'x-cache': 'Hit from cloudfront',
'x-origin': 'Mbx-Fonts'
}
}).as('get.0-255.pbf').as('getItalicArial0-255');
Now even if this is a bug there has to be some kind of workaround to serve the file in a cypress test without having an active internet connection. It would be great not having to rely on the network on tests. So all kinds of workarounds and dirty tricks are welcome in making this intercept work.
I discovered that axios is returning a string instead of a valid json.
headers:Object {date: "Tue, 02 Jun 2020 08:44:06 GMT", server: "Apache", connection: "close", …}
connection:"close"
content-type:"text/html; charset=UTF-8"
date:"Tue, 02 Jun 2020 08:44:06 GMT"
server:"Apache"
transfer-encoding:"chunked"
How do I change the content-type to application/json in a NestJS application?
I tried this, but it did not work:
const meterInfo = await this.httpService.get(url, { headers: { "Content-Type": "application/json" } }).toPromise();
Here is the invalid json returned.
"{"status":"00","message":"OK","access_token":"2347682423567","customer":{"name":"John Doe","address":"Mr. John Doe 34 Tokai, leaflet. 7999.","util":"Demo Utility","minimumAmount":"13897"},"response_hash":"c43c9d74480f340f55156f6r5c56487v8w"}"
Instead of sending a Content-Type header, you should send an Accept header with the same MIME type. This tells the server what you are expecting to receive, and if Content Negotiation is set up properly, it will allow you to get a JSON back instead of that weird string.
this.httpService.get(
  url,
  {
    headers: {
      'Accept': 'application/json',
    },
  },
).toPromise();
If that doesn't work, you'll need to provide your own serializer to take the string from that wonky format to JSON, or get in touch with the server admins and see if they can provide you better documentation about how to consume their API.
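As a rough illustration of the "own serializer" fallback, here is a minimal sketch of a helper that accepts either an already-parsed object or a string body and returns an object. The name `toJson` is hypothetical, and the sketch assumes the body is valid JSON text once a leading byte-order mark and surrounding whitespace are removed; the question does not say why the automatic parse failed, so this may not be the actual cause.

```javascript
// Hypothetical helper: normalise a response body to a parsed object.
function toJson(body) {
  // Already parsed by the HTTP client: return unchanged.
  if (typeof body !== 'string') {
    return body;
  }
  // Strip a leading BOM and surrounding whitespace, then parse.
  // JSON.parse throws a SyntaxError if the text is still not valid JSON.
  const text = body.replace(/^\uFEFF/, '').trim();
  return JSON.parse(text);
}

// Sample string modelled on the response in the question (shortened).
const raw = '\uFEFF{"status":"00","message":"OK"}';
const parsed = toJson(raw);
console.log(parsed.status, parsed.message);
```

In the NestJS call above, this would be applied to the `data` field of the response, for example `toJson(meterInfo.data)`.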
I have got this unit test code which mocks an HTTP request:
nock(/localhost/)
.post(/validate-user$/, credentials)
.reply(401);
This works fine. However, I don't really care about the domain. In fact, the domain will change. This will break my unit tests.
Can I specify a wildcard domain for Nock?
I've tried nock(/.*/) but it does not work.
The Nock recorder outputs:
<<<<<<-- cut here -->>>>>>
nock('https://localhost:41109', {"encodedQueryParams":true})
.post('/mock-validate-user', {"username":"alexxx","password":"alexxx100"})
.reply(200, {"username":"alexxx","email":"alexxx#foo.com","admin":true},
[ 'Content-Type',
'application/json; charset=utf-8',
'Content-Length',
'59',
'ETag',
'W/"3b-JaaJntKHf2/nzwhkqdNoFA"',
'Date',
'Thu, 02 Mar 2017 16:11:06 GMT',
'Connection',
'close' ]);
<<<<<<-- cut here -->>>>>>
According to https://developers.google.com/admin-sdk/email-audit/#creating_a_mailbox_for_export, I am trying to request the email audit export of a user in G Suite this way:
def requestAuditExport(account):
    credentials = getCredentials()
    http = credentials.authorize(httplib2.Http())
    url = 'https://apps-apis.google.com/a/feeds/compliance/audit/mail/export/helpling.com/'+account
    status, response = http.request(url, 'POST', headers={'Content-Type': 'application/atom+xml'})
    print(status)
    print(response)
And I get the following result:
{'content-length': '22', 'expires': 'Tue, 13 Dec 2016 14:19:37 GMT', 'date': 'Tue, 13 Dec 2016 14:19:37 GMT', 'x-frame-options': 'SAMEORIGIN', 'transfer-encoding': 'chunked', 'x-xss-protection': '1; mode=block', 'content-type': 'text/html; charset=UTF-8', 'x-content-type-options': 'nosniff', '-content-encoding': 'gzip', 'server': 'GSE', 'status': '400', 'cache-control': 'private, max-age=0', 'alt-svc': 'quic=":443"; ma=2592000; v="35,34"'}
b'Premature end of file.'
I cannot see where the problem is, can someone please give me a hint?
Thanks in advance!
Kay
Fix it by going into the Admin Console, opening the Manage API client access page under Security, and adding the Client ID and the scope needed for the Directory API. For more information, check this document.
Okay, found out what was wrong and fixed it myself. Finally it looks like this:
http = getCredentials().authorize(httplib2.Http())
url = 'https://apps-apis.google.com/a/feeds/compliance/audit/mail/export/helpling.com/'+account
headers = {'Content-Type': 'application/atom+xml'}
xml_data = """<atom:entry xmlns:atom='http://www.w3.org/2005/Atom' xmlns:apps='http://schemas.google.com/apps/2006'> \
<apps:property name='includeDeleted' value='true'/> \
</atom:entry>"""
status, response = http.request(url, 'POST', headers=headers, body=xml_data)
Not sure if it was about the body or the header. It works now and I hope it will help others.
Thanks anyway.
Not sure if this is really a stackoverflow-ey question, but here goes:
I'm trying to web-scrape a page that requires you to be logged in to view the extra data. I already have an account, and am trying to replicate the POST of the login page to log me in and get a cookie to use for the rest of the pages.
If I chrome debug the POST, the request headers are:
POST /authenticate/login?ReturnUrl=%2Fauthorize%3Fresponse_type%3Dcode%26client_id%3Dc82cb4c9-7cfa-4483-938b-2d3c61efabea%26redirect_uri%3Dhttps%253A%252F%252Fthenuel.com%252Fsignin-nuel%26scope%3Didentity%2520offline%26state%3Db9lB42PxZ6AZU-cmP-zOOLjaPHsif9z5yI7mVfxlLtiv00R_4O-FDtsh-GmFMYvZa7-mw6WdJMGWd1owC2SABiQOKJXdGvPC7XahPStVpsdoVypeGl3Rk-oDSNU7V4700LRV2D9URjtmIpCfAwE1WjLeUXJOZZ2GjQzF_UdBz9BtdlHR7hdR4iUMmebLKpZU2Y-vGuocOhT1D3G_gxcJ0aE9jw_PhPuFe3IGnpno86XKEsQlcviK5aYj2vyhXXTs HTTP/1.1
Host: login.thenuel.com
Connection: keep-alive
Content-Length: 177
Cache-Control: max-age=0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Origin: https://login.thenuel.com
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36
Content-Type: application/x-www-form-urlencoded
Referer: https://login.thenuel.com/authenticate/login?ReturnUrl=%2Fauthorize%3Fresponse_type%3Dcode%26client_id%3Dc82cb4c9-7cfa-4483-938b-2d3c61efabea%26redirect_uri%3Dhttps%253A%252F%252Fthenuel.com%252Fsignin-nuel%26scope%3Didentity%2520offline%26state%3Db9lB42PxZ6AZU-cmP-zOOLjaPHsif9z5yI7mVfxlLtiv00R_4O-FDtsh-GmFMYvZa7-mw6WdJMGWd1owC2SABiQOKJXdGvPC7XahPStVpsdoVypeGl3Rk-oDSNU7V4700LRV2D9URjtmIpCfAwE1WjLeUXJOZZ2GjQzF_UdBz9BtdlHR7hdR4iUMmebLKpZU2Y-vGuocOhT1D3G_gxcJ0aE9jw_PhPuFe3IGnpno86XKEsQlcviK5aYj2vyhXXTs
Accept-Encoding: gzip, deflate
Accept-Language: en-US,en;q=0.8
Cookie: ARRAffinity=f02c8a40711ffa249ac8dcf17e82c47021b4939e86cdd8caa8a1729b4a81838d; _ga=GA1.2.592576149.1458220828; _gat=1; __RequestVerificationToken=-t6-ReUs7KPoo2ioYs4h3OQ-2VLJYE5IRq5GcEGBG5YKGe84VXNdJO4taMK4CCV_HXbFJI_ZflWZqALWjrA29pLZcjWakodi19rtT0sJwiQ1; ARRAffinity=e310baf6f2079f1b7c40c521ea7e13fd41184f9683f30fea9f5312b081e077ba
and I also need to pass in some form data:
__RequestVerificationToken=tTi9aRzgeb0zA1z3QMZ1iWbGuC4ajR9Ke2VctCLnUlaTKFg1m-70WSOEsZf3PLEUgRdr4n1rEPVvwmfvN6RwrGMKvqnQvjP_gWsAAxAHPY1&UserName=email%40email.com&Password=asdas
I know this is a bit complicated, but I am completely stuck, so I'll post my code and try to run through it as cleanly as I can:
'I am using Superagent-cache and Cheerio'
var url = "https://login.thenuel.com/authenticate/login?ReturnUrl=%2Fauthorize%3Fresponse_type%3Dcode%26client_id%3Dc82cb4c9-7cfa-4483-938b-2d3c61efabea%26redirect_uri%3Dhttps%253A%252F%252Fthenuel.com%252Fsignin-nuel%26scope%3Didentity%2520offline%26state%3DiNj9r1juVrmUK5DyLfXnjq6bM6Fci5E1seI-faOadJQsfBKC9PQJJA-wve3TrusBfhrcjNk8C932FDA_vgQIyrlg36K6ucoC3HZkAO-Yn-mRXmaVqZcdKPRvgwYr55UkeETK4ZsjyuOXNixzk0Z3AslC2ZVN2dqiqoPfpoYtz_n-xgtJlvN5WwRt6cEAvzSwhHkFX4UPUF_1OalC8J4aYO-FHfUjTp8Bv4xBe7w0j0exmjcsMIjpmnp4qbN3qz7u";
request
.get(url)
.end(function(err, res) {
if (err) {
console.log(err);
} else {
// This is for getting the response headers from the GET page
console.log("~~~GET~~~~~~~~~~~~~~~~~~~~~~~~")
$ = cheerio.load(res.headers);
console.log(res.headers);
var secondRequestVToken = res.headers['set-cookie'][0]
console.log("~~~Second RequestVerificationToken: " + secondRequestVToken);
var firstARRAffinity = "ARRAffinity=f02c8a40711ffa249ac8dcf17e82c47021b4939e86cdd8caa8a1729b4a81838d; _ga=GA1.2.592576149.1458220828; _gat=1;"
var secondARRAffinity = res.headers['set-cookie'][1]
console.log("~~~Second ARRAfinnity: " + secondARRAffinity);
// This is the getting of the post Request data from the login page
console.log("~~~POST~~~~~~~~~~~~~~~~~~~~~~~~");
$ = cheerio.load(res.text)
var postUrl = "https://login.thenuel.com" + $('body > div > div > div.content-pane.login > form').attr("action");
console.log("~~~URL POST: " + postUrl);
var firstRequestVToken = $('body > div > div > div.content-pane.login > form > input[type="hidden"]:nth-child(1)').attr("value");
console.log("~~~First POST RequestVerificationToken: " + firstRequestVToken);
var formData= {
UserName:'email#email.com',
Password:'Password',
__RequestVerificationToken:firstRequestVToken
}
console.log(formData);
console.log("~~~COOKIE STUFF~~~")
secondRequestVToken = secondRequestVToken.slice(0, -7);
console.log(secondRequestVToken);
secondARRAffinity = secondARRAffinity.slice(0, -31)
console.log(secondARRAffinity);
var finalCookieString = firstARRAffinity + " "
+ secondRequestVToken + " " + secondARRAffinity;
console.log("~~~Final Cookie String: " + finalCookieString);
request
.post(postUrl)
.send(formData)
.set("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8")
.set("Accept-Encoding", "gzip, deflate")
.set("Accept-Language", "en-US,en;q=0.8")
.set("Cache-Control", "max-age=0")
.set("Connection", "keep-alive")
.set("Content-Length", 177)
.set("Content-Type", "application/x-www-form-urlencoded")
.set("Cookie", finalCookieString)
.set("Host", "login.thenuel.com")
.set("Origin", "https://login.thenuel.com")
.set("Referer", postUrl)
.set("Upgrade-Insecure-Requests", 1)
.set("User-Agent", "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36")
.end(function(err, res) {
// Do response here
if (err) {
console.log("~~~Hello failure world!");
console.log(err);
} else {
console.log("Hello success world!");
console.log(res);
}
})
So I am sending a GET to the login page to grab the 'RequestVerificationToken' cookie and the second ARRAffinity string, both of which I add to the cookie string I send with the POST.
I am then getting the POST Url from the Login Form Action, and the 'RequestVerificationToken' that gets put in the form data to send along with my username and password.
I then remove some of the guff from the tokens and stuff to make it the same as the request my browser sends off to that POST. I then make the POST call.
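The trimming step described above can be sketched without hard-coded slice offsets. This is a minimal illustration, not a fix for the login itself: each Set-Cookie line is cut at its first semicolon so that only the name=value pair is kept, and the pairs are joined into one Cookie header value. The function name `cookieHeaderFrom` and the sample values `TOKEN` and `AFFINITY` are placeholders, not taken from the real site.

```javascript
// Hypothetical helper: turn an array of Set-Cookie header lines
// into a single Cookie request-header value.
function cookieHeaderFrom(setCookieLines) {
  return setCookieLines
    // Keep only "name=value"; attributes such as path or domain
    // follow the first semicolon and are not sent back to the server.
    .map((line) => line.split(';')[0].trim())
    .join('; ');
}

// Shaped like res.headers['set-cookie'] in the log, with placeholder values.
const setCookie = [
  '__RequestVerificationToken=TOKEN; path=/',
  'ARRAffinity=AFFINITY;Path=/;Domain=login.thenuel.com',
];

const cookieHeader = cookieHeaderFrom(setCookie);
console.log(cookieHeader);
```

Because the attributes are split off rather than sliced by length, this keeps working if the server changes the attribute list, which `slice(0, -7)` and `slice(0, -31)` would not.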
Here is my log with the console.logs:
Server Running...
~~~GET~~~~~~~~~~~~~~~~~~~~~~~~
{ 'cache-control': 'private',
'content-length': '1777',
'content-type': 'text/html; charset=utf-8',
'content-encoding': 'gzip',
vary: 'Accept-Encoding',
server: 'Microsoft-IIS/8.0',
'set-cookie':
[ '__RequestVerificationToken=SL9MYCWkPdY3dI66vBq7BKt4wxfzmNQCO6IEg8EteTdCIe-BCiKbBNCIbWtb3jD9ZbNSR
ZmUIlVxzICnKGX5PpPOsQvp5me7NJoc4BHu1Ew1; path=/',
'ARRAffinity=e310baf6f2079f1b7c40c521ea7e13fd41184f9683f30fea9f5312b081e077ba;Path=/;Domain=login
.thenuel.com' ],
'x-aspnetmvc-version': '5.2',
'x-aspnet-version': '4.0.30319',
date: 'Wed, 30 Mar 2016 13:48:55 GMT',
connection: 'close',
prev: null,
next: null,
root:
{ type: 'root',
name: 'root',
attribs: {},
children: [ [Circular] ],
next: null,
prev: null,
parent: null },
parent: null }
~~~Second RequestVerificationToken: __RequestVerificationToken=SL9MYCWkPdY3dI66vBq7BKt4wxfzmNQCO6IEg8E
teTdCIe-BCiKbBNCIbWtb3jD9ZbNSRZmUIlVxzICnKGX5PpPOsQvp5me7NJoc4BHu1Ew1; path=/
~~~Second ARRAfinnity: ARRAffinity=e310baf6f2079f1b7c40c521ea7e13fd41184f9683f30fea9f5312b081e077ba;Pa
th=/;Domain=login.thenuel.com
~~~POST~~~~~~~~~~~~~~~~~~~~~~~~
~~~URL POST: https://login.thenuel.com/authenticate/login?ReturnUrl=%2Fauthorize%3Fresponse_type%3Dcod
e%26client_id%3Dc82cb4c9-7cfa-4483-938b-2d3c61efabea%26redirect_uri%3Dhttps%253A%252F%252Fthenuel.com%
252Fsignin-nuel%26scope%3Didentity%2520offline%26state%3DiNj9r1juVrmUK5DyLfXnjq6bM6Fci5E1seI-faOadJQsf
BKC9PQJJA-wve3TrusBfhrcjNk8C932FDA_vgQIyrlg36K6ucoC3HZkAO-Yn-mRXmaVqZcdKPRvgwYr55UkeETK4ZsjyuOXNixzk0Z
3AslC2ZVN2dqiqoPfpoYtz_n-xgtJlvN5WwRt6cEAvzSwhHkFX4UPUF_1OalC8J4aYO-FHfUjTp8Bv4xBe7w0j0exmjcsMIjpmnp4q
bN3qz7u
~~~First POST RequestVerificationToken: bqsa4HwniG2PeJVTBKWjg2ux0S4zUJ-Y1U4C_YF93za33dPnNortSOTeHZyyWW
dT_WECqgr44IbJ_FjUSfu9_N3ITdjnJuJhuMvBp_dYQ4c1
{ UserName: 'email#email.com',
Password: 'Password',
__RequestVerificationToken: 'bqsa4HwniG2PeJVTBKWjg2ux0S4zUJ-Y1U4C_YF93za33dPnNortSOTeHZyyWWdT_WECqgr
44IbJ_FjUSfu9_N3ITdjnJuJhuMvBp_dYQ4c1' }
~~~COOKIE STUFF~~~
__RequestVerificationToken=SL9MYCWkPdY3dI66vBq7BKt4wxfzmNQCO6IEg8EteTdCIe-BCiKbBNCIbWtb3jD9ZbNSRZmUIlV
xzICnKGX5PpPOsQvp5me7NJoc4BHu1Ew1;
ARRAffinity=e310baf6f2079f1b7c40c521ea7e13fd41184f9683f30fea9f5312b081e077ba;
~~~Final Cookie String: ARRAffinity=f02c8a40711ffa249ac8dcf17e82c47021b4939e86cdd8caa8a1729b4a81838d;
_ga=GA1.2.592576149.1458220828; _gat=1; __RequestVerificationToken=SL9MYCWkPdY3dI66vBq7BKt4wxfzmNQCO6I
Eg8EteTdCIe-BCiKbBNCIbWtb3jD9ZbNSRZmUIlVxzICnKGX5PpPOsQvp5me7NJoc4BHu1Ew1; ARRAffinity=e310baf6f2079f1
b7c40c521ea7e13fd41184f9683f30fea9f5312b081e077ba;
Hello success world!
{ body: {},
text: '<!doctype html>\r\n<html>\r\n<head>\r\n <link rel="stylesheet" type="text/css" href="//nu
el-ui-playgroundofwonders.azurewebsites.net/Content/nuel-reset.css" async />\r\n <link rel="stylesh
eet" type="text/css" href="//nuel-ui-playgroundofwonders.azurewebsites.net/Content/nuel-base.css" asyn
c />\r\n <title>Sorry :( - The NUEL</title>\r\n</head>\r\n<body>\r\n <div class="stretch" style=
"width: 23rem; padding: 1.3rem 1rem; background-color: #ffffff; box-shadow: #e0e0e0 0px 0px 1px; margi
n: 3rem auto 0px;">\r\n <h1 class="non-content">Something\'s up!</h1>\r\n <p style="line
-height: 1.3rem;">An error occured during your last request, if you were logging in - try clearing you
r cookies and then retrying. You can <a href="https://support.google.com/chrome/answer/95647?hl=en-GB"
target="_blank">find out how to do that here</a>.</p>\r\n <p style="line-height: 1.3rem;">If t
hat doesn\'t work then just drop us an email at issues#thenuel.com and we\'ll get back to you ASAP.</p
>\r\n <a class="button minimal primary small" href="https://thenuel.com/">Go to thenuel.com</a>
\r\n </div>\r\n</body>\r\n</html>',
headers:
{ 'content-length': '765',
'content-type': 'text/html',
'content-encoding': 'gzip',
'last-modified': 'Wed, 28 Oct 2015 23:22:44 GMT',
'accept-ranges': 'bytes',
etag: '"1c315c8cd711d11:0"',
vary: 'Accept-Encoding',
server: 'Microsoft-IIS/8.0',
'set-cookie': [ 'ARRAffinity=e310baf6f2079f1b7c40c521ea7e13fd41184f9683f30fea9f5312b081e077ba;Pat
h=/;Domain=login.thenuel.com' ],
date: 'Wed, 30 Mar 2016 13:48:57 GMT' },
statusCode: 200,
status: 200,
ok: true }
I would have hoped that it would return a cookie showing that I am logged in.
The Verification tokens change each time. The correct verification token is being sent to the form and the correct one is being sent to the cookie string.
I know this is super complicated, but I'm completely stumped.
The website is login.thenuel.com.
Any help is appreciated, but I think this might just be a keep on trying sort of thing! And if there's any other info I need to provide just ask - and it might be easier to just try and do it yourself and see if you can do it!
Thanks