HTTP headers format using python's requests - python-3.x

I use python requests to capture a website's http headers. For example, this is a response header:
{'Connection': 'keep-alive',
'Access-Control-Allow-Origin': '*', 'cache-control': 'max-age=600',
'Content-Type': 'text/html; charset=utf-8', 'Expires': 'Fri, 19 Apr
2019 03:16:28 GMT', 'Via': '1.1 varnish, 1.1 varnish', 'X-ESI': 'on',
'Verso': 'false', 'Accept-Ranges': 'none', 'Date': 'Fri, 19 Apr 2019
03:11:12 GMT', 'Age': '283', 'Set-Cookie':
'CN_xid=08f66bff-4001-4173-b4e2-71ac31bb58d7; Expires=Wed, 16 Oct 2019
03:11:12 GMT; path=/;, xid1=1; Expires=Fri, 19 Apr 2019 03:11:27 GMT;
path=/;, verso_bucket=281; Expires=Sat, 18 Apr 2020 03:11:12 GMT;
path=/;', 'X-Served-By': 'cache-iad2133-IAD, cache-gru17122-GRU',
'X-Cache': 'HIT, MISS', 'X-Cache-Hits': '1, 0', 'X-Timer':
'S1555643472.999490,VS0,VE302', 'Content-Security-Policy':
"default-src https: data: 'unsafe-inline' 'unsafe-eval'; child-src
https: data: blob:; connect-src https: data: blob:; font-src https:
data:; img-src https: data: blob:; media-src https: data: blob:;
object-src https:; script-src https: data: blob: 'unsafe-inline'
'unsafe-eval'; style-src https: 'unsafe-inline';
block-all-mixed-content; upgrade-insecure-requests; report-uri
https://l.com/csp/gq",
'X-Fastly-Device-Detect': 'desktop', 'Strict-Transport-Security':
'max-age=7776000; preload', 'Vary': 'Accept-Encoding, Verso,
Accept-Encoding', 'content-encoding': 'gzip', 'transfer-encoding':
'chunked'}
I noted that from several examples I tested, the headers I receive from requests are formatted as 'key':'value' (plz note the single colons surrounding the key and the value). However, when I check the headers from the Firefox-> Web developer -> Inspector, and choose to view the header in raw format, I do not see commas:
HTTP/2.0 200 OK date: Thu, 09 May 2019 18:49:07 GMT expires: -1
cache-control: private, max-age=0 content-type: text/html;
charset=UTF-8 strict-transport-security: max-age=31536000
content-encoding: br server: gws content-length: 55844
x-xss-protection: 0 x-frame-options: SAMEORIGIN set-cookie:
1P_JAR=2019-05-09-18; expires=Sat, 08-Jun-2019 18:49:07 GMT; path=/;
domain=.google.com alt-svc: quic=":443"; ma=2592000; v="46,44,43,39"
X-Firefox-Spdy: h2
I need to know: Does python's requests module always adds single colons? This important from me as I need to include/exclude them in my regex that is used to analyze the headers.

The issue I think you are running into is the request coming back as a dict instead of a value as firefox inspector is giving you. When you do this you could be getting mixed results if one of the value pairs has a numeric or boolean value so when doing your regex you may want to use a Try/Except if you can remove the exterior apostrophes or just use the value given.

It's not the requests module that's adding the colons. Request represents headers as a dict, but you seem to be treating them as a string. When Python converts dicts to strings, they get the colons, the commas, the quotation marks.
The right fix for your program is probably to treat the dictionary as a dictionary, not convert it into a string. But if you really want the headers in string form, you should consider using different tool, such as curl.

Related

Curl HEAD request returns a last-modified header but Node https.request(..., {method: HEAD},...) does not

Apparently there's something I don't know about HEAD requests.
Here's the URL: 'https://theweekinchess.com/assets/files/pgn/eurbli22.pgn', which I'll refer to as <URL> below.
If I curl this, I see a last-modified entry in the headers:
curl --head <URL>
HTTP/2 200
last-modified: Sun, 18 Dec 2022 18:07:16 GMT
accept-ranges: bytes
content-length: 1888745
host-header: c2hhcmVkLmJsdWVob3N0LmNvbQ==
content-type: application/x-chess-pgn
date: Wed, 11 Jan 2023 23:09:14 GMT
server: Apache
But if I make a HEAD request in Node using https, That information is missing:
https.request(<URL>, { method: 'HEAD' }, res => {
console.log([<URL>, res.headers])}).end()
This returns:
[
<URL>
{
date: 'Wed, 11 Jan 2023 23:16:15 GMT',
server: 'Apache',
p3p: 'CP="NOI NID ADMa OUR IND UNI COM NAV"',
'cache-control': 'private, must-revalidate',
'set-cookie': [
'evof3sqa=4b412b5913b38669fc928a0cca9870e4; path=/; secure; HttpOnly'
],
upgrade: 'h2,h2c',
connection: 'Upgrade, Keep-Alive',
'host-header': 'c2hhcmVkLmJsdWVob3N0LmNvbQ==',
'keep-alive': 'timeout=5, max=75',
'content-type': 'text/html; charset=UTF-8'
}
]
I tried axios instead of https:
const response = await axios.head('https://theweekinchess.com/assets/files/pgn/eurbli22.pgn');
console.log({response: response.headers})
And that works (incl. the proper MIME type):
date: 'Thu, 12 Jan 2023 20:00:48 GMT',
server: 'Apache',
upgrade: 'h2,h2c',
connection: 'Upgrade, Keep-Alive',
'last-modified': 'Sun, 18 Dec 2022 18:07:16 GMT',
'accept-ranges': 'bytes',
'content-length': '1888745',
'host-header': 'c2hhcmVkLmJsdWVob3N0LmNvbQ==',
'keep-alive': 'timeout=5, max=75',
'content-type': 'application/x-chess-pgn'
I also tried waiting for re2.on('end', console.log(res.headers)), but same output as before.
I'm going to close this issue and post it instead as a 'bug' on Node's site. I'm sure there's something that needs to be changed in how I'm executing the HEAD request.

Finding the final URL that the request will be redirected to in nodeJS?

I have the following URL:
https://forecast.weather.gov/zipcity.php?inputstring=95014
I would like to figure out which URL it will redirect to. In this example it is:
https://forecast.weather.gov/MapClick.php?CityName=Cupertino&state=CA&site=MTR&lat=37.3042&lon=-122.095
I tried multiple solutions such as:
res.headers.location
and
res.headers.get('location')
But none of them seem to work. I know that the URL redirects because, I could successfully redirect in curl and google-chrome. Here is the code that I am running:
https.get('https://forecast.weather.gov/zipcity.php?inputstring=95014', res => console.log(res.headers.location))
When I was running in curl, I ran the following:
curl -Ls -o /dev/null -w %{url_effective} http://forecast.weather.gov/zipcity.php?inputstring=95014
And got the desired output of:
http://forecast.weather.gov/MapClick.php?CityName=Cupertino&state=CA&site=MTR&lat=37.3042&lon=-122.095
When I run curl:
$ curl -I https://forecast.weather.gov/zipcity.php?inputstring=95014
HTTP/2 302
server: Apache
x-nids-serverid: www1.mo
location: https://forecast.weather.gov/MapClick.php?CityName=Cupertino&state=CA&site=MTR&lat=37.3042&lon=-122.095
x-ua-compatible: IE=Edge
access-control-allow-origin: *
content-type: text/html; charset=UTF-8
content-length: 0
cache-control: max-age=722
expires: Mon, 26 Apr 2021 19:28:03 GMT
date: Mon, 26 Apr 2021 19:16:01 GMT
strict-transport-security: max-age=31536000 ; includeSubDomains ; preload
As you can see, there are multiple headers present here that are not present inside nodejs's results:
{
server: 'AkamaiGHost',
'mime-version': '1.0',
'content-type': 'text/html',
'content-length': '288',
expires: 'Mon, 26 Apr 2021 19:21:50 GMT',
date: 'Mon, 26 Apr 2021 19:21:50 GMT',
connection: 'close',
'strict-transport-security': 'max-age=31536000 ; includeSubDomains ; preload'
}
I would like to stick to the http module.

After I getting a set-cookie in response, is not saved and transmitted in requests

Basically after an auth, I setting a cookie, but apparently after page refresh on the cookie that was set by cloudflare is saved
And the cookie that I transmitted with set-cookie is not used in after set-cookie requests
# Response headers
HTTP/2.0 200 OK
date: Thu, 18 Jul 2019 10:03:25 GMT
content-type: application/json; charset=utf-8
content-length: 29
set-cookie: __cfduid=d578c7a5e4378dc1b1946964a08ebc4ec1563444205; expires=Fri, 17-Jul-20 10:03:25 GMT; path=/; domain=.doc.io; HttpOnly; Secure
set-cookie: __doc=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJlIjoiNzQ2NTczNzQ0MDY2NjE3Mzc0NmQ2MTY5NmMyZTZkNzgiLCJwIjoiNTUzMjQ2NzM2NDQ3NTY2YjU4MzEzODMzNTU0NjUwNTU2ZjRiMzkzMTY3NDUzNDY5NDc3MzM3MzgzOTU5MzczMDUxNjk0ZjQxNjQ0OTM5Nzg0YjZiNzU1Njc3Nzk0NDc0NjE3NDMxNTE0NzcwMzE0YjQxNmY1MjU5MzM3YTZhNDU2NDJiNmU0ZTc0NGE3NTMyNTQ1ODc2NjI1YTczNDc1MTQ1Njc0MjVhNGQ0MTNkM2QiLCJkIjoiMzEzNTM2MzMzNDM0MzQzMjMwMzUzNTM2MzMiLCJpYXQiOjE1NjM0NDQyMDV9.go1jDpc2rBe5FjK2sKX4ybW4PhCPFq1xT1WIX-mSI84; Domain=.doc.io; Path=/; Expires=Thu, 18 Jul 2019 16:03:25 GMT; HttpOnly; Secure
expect-ct: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
server: cloudflare
cf-ray: 4f83a02a3dc36455-FRA
X-Firefox-Spdy: h2
reply
.code(200)
.header('Access-Control-Allow-Origin', '*')
.header('Content-Type', 'application/json; charset=utf-8')
.setCookie('__doc', token, {
domain: '.doc.io',
path: '/',
secure: true,
httpOnly: true,
expires: new Date(new Date().setHours(new Date().getHours() + 6))})
.send({ 'success': 'Sign In success' })
All my websites are https
First I do POST request for an auth on /auth, and you could see response in response headers above and after I do GET on (trying to load page) from /page and get cookies, but with reply.log.info(request.cookies) I see only cookies from cloudflare. Surely I tried to refresh and go to address in different table, there just no any cookies, but from cloudflare.
# Request headers
Host: test.doc.io
User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:67.0) Gecko/20100101 Firefox/67.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate, br
DNT: 1
Connection: keep-alive
Cookie: __cfduid=d578c7a5e4378dc1b1946964a08ebc4ec1563444205
Upgrade-Insecure-Requests: 1
TE: Trailers
After I tested it with an actual page with js code for a XHR request it works fine. However cookie was only displayed, but not actually saved in the storage when I was sending request directly in Firefox's inspector. Lost a day to figure out that created in inspector cookies seems preventing from installing

electron-updater unable to parse latest.yml in artifacts of gitlab private repo

I am trying to use Electron Updater with a GitLab Private Repository.
Main Electron File (partial):
autoUpdater.requestHeaders = { 'PRIVATE-TOKEN': process.env.VUE_APP_GITLABSECRET }
autoUpdater.autoDownload = true
autoUpdater.setFeedURL({
provider: 'generic',
url: 'https://gitlab.com/SmellydogCoding/mchd-electronic-field-guide/-/jobs/artifacts/master/raw/dist_electron?job=build'
})
autoUpdater.on('checking-for-update', function () {
console.log('Checking for update...')
})
When I start the app I get this error message:
Error: Error: Cannot parse update info from latest.yml in the latest release artifacts (https://gitlab.com/SmellydogCoding/mchd-electronic-field-guide/-/jobs/artifacts/master/raw/dist_electron/latest.yml?job=build): YAMLException: end of the stream or a document separator is expected at line 3, column 17:
<head prefix="og: http://ogp.me/ns#">
What is happening is that the server is responding with a string of HTML, which is the Gitlab login page.
If I curl
--header 'PRIVATE-TOKEN': 'mygitlabprivatetoken' https://gitlab.com/SmellydogCoding/mchd-electronic-field-guide/-/jobs/artifacts/master/raw/dist_electron/latest.yml?job=build
The server returns:
Header
HTTP/1.1 302 Found
Server: nginx
Date: Tue, 19 Mar 2019 17:57:21 GMT
Content-Type: text/html; charset=utf-8
Content-Length: 98
Cache-Control: no-cache
Location: https://gitlab.com/users/sign_in
Set-Cookie: _gitlab_session=da00cbc69f2d50ea4192f4e3002f84a9; path=/; secure; HttpOnly
X-Request-Id: dGkxtbboHy7
X-Runtime: 0.049129
Strict-Transport-Security: max-age=31536000
Content-Security-Policy: object-src 'none'; worker-src https://assets.gitlab-static.net https://gl-canary.freetls.fastly.net https://gitlab.com blob:; script-src 'self' 'unsafe-inline' 'unsafe-eval' https://assets.gitlab-static.net https://gl-canary.freetls.fastly.net https://www.google.com/recaptcha/ https://www.recaptcha.net/ https://www.gstatic.com/recaptcha/ https://apis.google.com; style-src 'self' 'unsafe-inline' https://assets.gitlab-static.net https://gl-canary.freetls.fastly.net; img-src * data: blob:; frame-src 'self' https://www.google.com/recaptcha/ https://www.recaptcha.net/ https://content.googleapis.com https://content-compute.googleapis.com https://content-cloudbilling.googleapis.com https://content-cloudresourcemanager.googleapis.com https://*.codesandbox.io; frame-ancestors 'self'; connect-src 'self' https://assets.gitlab-static.net https://gl-canary.freetls.fastly.net wss://gitlab.com https://sentry.gitlab.net https://customers.gitlab.com https://snowplow.trx.gitlab.net
Body
<html><body>You are being redirected.</body></html>
It seems like i'm not authenticating properly. I'm really not sure what i'm doing incorrectly.

How do I download an mp3 file with Python3

I am trying to download some playlists off soundcloud and found a site that does this for you. Of course if the playlist is long, then it's super tedious to click each link to download. So I saved the HTML of the page and have parsed out the links. The idea is to use urllib or requests to download the files.
Here's my code:
opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
urllib.request.install_opener(opener)
track_url = 'https://scdownloader.io/download?track=zandex-hazerback-erox-stroke-bth-release&token=be1bc7997695495f756312886f566110'
track_name = 'BANG_THE_HOUSE___zandex-hazerback-erox-stroke-bth-release.mp3'
output_file = '/Users/ms/Desktop/playlist/{}'.format(track_name)
urllib.request.urlretrieve(track_url, output_file)
When I run the above code, it does save the file, but it arrives as a 1 byte file only.
I've tried other permutations using requests but basically either it doesn't work, downloads and saves a zero byte file, or does work to download and save a 1 byte file... just can't get the whole thing!
Also note, I have to send headers b/c otherwise I get a 403 error.
Any help is greatly appreciated!
Thank you!
EDIT:
Per the comments below, here's what the urlretrieve http response is:
Date: Fri, 15 Mar 2019 23:52:44 GMT
Content-Type: text/html; charset=UTF-8
Transfer-Encoding: chunked
Connection: close
Set-Cookie: __cfduid=dcc5f95391fac83973cc77648c0e8c0391552693964; expires=Sat, 14-Mar-20 23:52:44 GMT; path=/; domain=.scdownloader.io; HttpOnly; Secure
X-Powered-By: PHP/5.6.36
Set-Cookie: PHPSESSID=fsnrrrtpnrav3vq5u2t9vfvrp7; path=/
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
Vary: Accept-Encoding,User-Agent
Expect-CT: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
Server: cloudflare
CF-RAY: 4b82671d38067790-LAX

Resources