Loading a GRIB from the web in Python without saving the file locally - python-3.x

I would like to download a GRIB file from the web:
Opt1: https://noaa-gfs-bdp-pds.s3.amazonaws.com/gfs.20210801/12/atmos/gfs.t12z.pgrb2.1p00.f000
Opt2: https://www.ncei.noaa.gov/data/global-forecast-system/access/grid-003-1.0-degree/analysis/202108/20210801/gfs_3_20210801_1200_000.grb2
(FYI, data is identical)
A similar question is asked here:
How can a GRIB file be opened with pygrib without first downloading the file?
However, ultimately a final solution isn't given there, as that data source returns multiple responses (causing issues). When I tried to recreate the code:
import urllib.request
import pygrib

url = 'https://noaa-gfs-bdp-pds.s3.amazonaws.com/gfs.20210801/12/atmos/gfs.t12z.pgrb2.1p00.f000'
req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
data = urllib.request.urlopen(req, timeout=300)
grib = pygrib.fromstring(data.read())
My error is different:
ECCODES ERROR : grib_handle_new_from_message: No final 7777 in message!
Traceback (most recent call last):
File "grib_data_test.py", line 139, in <module>
grib = pygrib.fromstring(data.read())
File "src\pygrib\_pygrib.pyx", line 627, in pygrib._pygrib.fromstring
File "src\pygrib\_pygrib.pyx", line 1390, in pygrib._pygrib.gribmessage._set_projparams
File "src\pygrib\_pygrib.pyx", line 1127, in pygrib._pygrib.gribmessage.__getitem__
RuntimeError: b'Key/value not found'
I have also tried working directly with the osgeo.gdal library (preferable for my project, as it is already in use there). Documentation: https://gdal.org/user/virtual_file_systems.html#vsimem-in-memory-files:
Attempt1: url = "/vsicurl/https://noaa-gfs-bdp-pds.s3.amazonaws.com/gfs.20210801/12/atmos/gfs.t12z.pgrb2.0p50.f000"
Attempt2: url = "/vsis3/https://noaa-gfs-bdp-pds.s3.amazonaws.com/gfs.20210801/12/atmos/gfs.t12z.pgrb2.0p50.f000"
Attempt3: url = "/vsicurl/https://www.ncei.noaa.gov/data/global-forecast-system/access/grid-003-1.0-degree/analysis/202108/20210801/gfs_3_20210801_0000_000.grb2"
grib = gdal.Open(url)
Errors:
Attempt1: ERROR 11: CURL error: schannel: next InitializeSecurityContext failed: Unknown error (0x80092012) - The revocation function was unable to check revocation for the certificate.
Attempt2: ERROR 11: HTTP response code: 0
Attempt3: ERROR 4: /vsicurl/https://www.ncei.noaa.gov/data/global-forecast-system/access/grid-003-1.0-degree/analysis/202108/20210801/gfs_3_20210801_0000_000.grb2 is a grib file, but no raster dataset was successfully identified.
For Attempt1 and Attempt2: I feel like both attempts should work here - normally you do not need any specific AWS credentials or connection settings to access the publicly available S3 bucket data (as it is accessed using urllib above).
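I also wondered whether the /vsis3/ path is supposed to reference the bucket and key directly (rather than the full https URL), combined with unsigned requests - a guess based on the GDAL docs, not something I have working:
from osgeo import gdal

# guess: tell GDAL not to sign requests, since the bucket is public
gdal.SetConfigOption('AWS_NO_SIGN_REQUEST', 'YES')
# guess: /vsis3/ expects bucket/key rather than a full https:// URL
grib = gdal.Open('/vsis3/noaa-gfs-bdp-pds/gfs.20210801/12/atmos/gfs.t12z.pgrb2.0p50.f000')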
For Attempt3, there is a similar issue here:
https://gis.stackexchange.com/questions/395867/opening-a-grib-from-the-web-with-gdal-in-python-using-vsicurl-throws-error-on-m
However, I do not experience the same issue. My output:
gdalinfo /vsicurl/https://www.ncei.noaa.gov/data/global-forecast-system/access/grid-003-1.0-degree/analysis/202108/20210801/gfs_3_20210801_0000_000.grb2 --debug on --config CPL_CURL_VERBOSE YES
HTTP: libcurl/7.78.0 Schannel zlib/1.2.11 libssh2/1.9.0
HTTP: GDAL was built against curl 7.77.0, but is running against 7.78.0.
* Couldn't find host www.ncei.noaa.gov in the (nil) file; using defaults
* Trying 205.167.25.177:443...
* Connected to www.ncei.noaa.gov (205.167.25.177) port 443 (#0)
* schannel: disabled automatic use of client certificate
> HEAD /data/global-forecast-system/access/grid-003-1.0-degree/analysis/202108/20210801/gfs_3_20210801_0000_000.grb2 HTTP/1.1
Host: www.ncei.noaa.gov
Accept: */*
* schannel: server closed the connection
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Date: Mon, 20 Sep 2021 14:38:59 GMT
< Server: Apache
< Strict-Transport-Security: max-age=31536000
< Last-Modified: Sun, 01 Aug 2021 04:46:49 GMT
< ETag: "287c05a-5c87824012f22"
< Accept-Ranges: bytes
< Content-Length: 42451034
< Access-Control-Allow-Origin: *
< Access-Control-Allow-Headers: X-Requested-With, Content-Type
< Connection: close
<
* Closing connection 0
* schannel: shutting down SSL/TLS connection with www.ncei.noaa.gov port 443
VSICURL: GetFileSize(https://www.ncei.noaa.gov/data/global-forecast-system/access/grid-003-1.0-degree/analysis/202108/20210801/gfs_3_20210801_0000_000.grb2)=42451034 response_code=200
VSICURL: Downloading 0-16383 (https://www.ncei.noaa.gov/data/global-forecast-system/access/grid-003-1.0-degree/analysis/202108/20210801/gfs_3_20210801_0000_000.grb2)...
* Couldn't find host www.ncei.noaa.gov in the (nil) file; using defaults
* Hostname www.ncei.noaa.gov was found in DNS cache
* Trying 205.167.25.177:443...
* Connected to www.ncei.noaa.gov (205.167.25.177) port 443 (#1)
* schannel: disabled automatic use of client certificate
> GET /data/global-forecast-system/access/grid-003-1.0-degree/analysis/202108/20210801/gfs_3_20210801_0000_000.grb2 HTTP/1.1
Host: www.ncei.noaa.gov
Accept: */*
Range: bytes=0-16383
* schannel: failed to decrypt data, need more data
* Mark bundle as not supporting multiuse
< HTTP/1.1 206 Partial Content
< Date: Mon, 20 Sep 2021 14:39:00 GMT
< Server: Apache
< Strict-Transport-Security: max-age=31536000
< Last-Modified: Sun, 01 Aug 2021 04:46:49 GMT
< ETag: "287c05a-5c87824012f22"
< Accept-Ranges: bytes
< Content-Length: 16384
< Content-Range: bytes 0-16383/42451034
< Access-Control-Allow-Origin: *
< Access-Control-Allow-Headers: X-Requested-With, Content-Type
< Connection: close
<
* schannel: failed to decrypt data, need more data
* schannel: failed to decrypt data, need more data
* schannel: failed to decrypt data, need more data
* Closing connection 1
* schannel: shutting down SSL/TLS connection with www.ncei.noaa.gov port 443
VSICURL: Got response_code=206
VSICURL: Downloading 81920-98303 (https://www.ncei.noaa.gov/data/global-forecast-system/access/grid-003-1.0-degree/analysis/202108/20210801/gfs_3_20210801_0000_000.grb2)...
* Couldn't find host www.ncei.noaa.gov in the (nil) file; using defaults
* Hostname www.ncei.noaa.gov was found in DNS cache
* Trying 205.167.25.177:443...
* Connected to www.ncei.noaa.gov (205.167.25.177) port 443 (#2)
* schannel: disabled automatic use of client certificate
* schannel: failed to receive handshake, SSL/TLS connection failed
* Closing connection 2
VSICURL: DownloadRegion(https://www.ncei.noaa.gov/data/global-forecast-system/access/grid-003-1.0-degree/analysis/202108/20210801/gfs_3_20210801_0000_000.grb2): response_code=0, msg=schannel: failed to receive handshake, SSL/TLS connection failed
VSICURL: Got response_code=0
GRIB: ERROR: Ran out of file in Section 7
ERROR: Problems Jumping past section 7
ERROR 4: /vsicurl/https://www.ncei.noaa.gov/data/global-forecast-system/access/grid-003-1.0-degree/analysis/202108/20210801/gfs_3_20210801_0000_000.grb2 is a grib file, but no raster dataset was successfully identified.
gdalinfo failed - unable to open '/vsicurl/https://www.ncei.noaa.gov/data/global-forecast-system/access/grid-003-1.0-degree/analysis/202108/20210801/gfs_3_20210801_0000_000.grb2'.
Basically, I can currently download the files using urllib, then:
file_out = open(<file.path>, 'wb')   # write bytes, not text
file_out.write(data.read())
file_out.close()
grib = gdal.Open(<file.path>)        # gdal.Open takes the path, not the file object
does what I require. But there is a desire not to save the files locally during this temporary processing step, due to the system we are working within.
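Ideally I would keep the bytes in memory instead, e.g. via GDAL's /vsimem/ filesystem - a rough, untested sketch of what I am after (the /vsimem/ filename is arbitrary):
import urllib.request
from osgeo import gdal

url = 'https://noaa-gfs-bdp-pds.s3.amazonaws.com/gfs.20210801/12/atmos/gfs.t12z.pgrb2.1p00.f000'
req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
payload = urllib.request.urlopen(req, timeout=300).read()

# write the downloaded bytes to an in-memory file and open it with GDAL
gdal.FileFromMemBuffer('/vsimem/gfs_temp.grb2', payload)
grib = gdal.Open('/vsimem/gfs_temp.grb2')
# ... process the dataset ...
grib = None  # close the dataset
gdal.Unlink('/vsimem/gfs_temp.grb2')  # free the in-memory file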
Python Versions:
gdal 3.3.1 py38hacca965_1 defaults
pygrib 2.1.4 py38hceae430_0 defaults
python 3.8.10 h7840368_1_cpython defaults
Cheers.

This worked for me. I was using the GEFS data hosted on AWS instead, though. I believe there is GFS on AWS also. No account should be needed, so it would just be a matter of changing the bucket and s3_object names to point to GFS data instead:
import pygrib
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# anonymous (unsigned) client - the NOAA buckets are public
s3 = boto3.client('s3', config=Config(signature_version=UNSIGNED))
bucket_name = 'noaa-gefs-pds'
s3_object = 'gefs.20220425/00/atmos/pgrb2ap5/gec00.t00z.pgrb2a.0p50.f000'

# read the whole object into memory and parse it with pygrib
obj = s3.get_object(Bucket=bucket_name, Key=s3_object)['Body'].read()
grbs = pygrib.fromstring(obj)
print(type(grbs))
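For the GFS file from the question it would presumably just be a matter of swapping in the GFS bucket and key (untested; names taken from the question's first URL):
bucket_name = 'noaa-gfs-bdp-pds'
s3_object = 'gfs.20210801/12/atmos/gfs.t12z.pgrb2.1p00.f000'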

Related

No MESSAGE-ID and get imap_tools work for imap.mail.yahoo.com

The question is twofold: about getting MESSAGE-ID, and about using imap_tools. For an email client ("handmade") in Python I need to reduce the amount of data read from the server (presently it takes 2 min to read the whole mbox folder of ~170 messages for Yahoo), and I believe that having MESSAGE-ID will help me.
imap_tools has an IDLE command, which is essential to keep the Yahoo server connection alive, and other features which I believe will simplify the code.
To learn about MESSAGE-ID I started with the following code (file fetch_ssl.py):
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import imaplib
import email
import os
import ssl

import conf

# Why UID==1 has no MESSAGE-ID ?
if __name__ == '__main__':
    args = conf.parser.parse_args()
    host, port, env_var = conf.config[args.host]
    if 0 < args.verbose:
        print(host, port, env_var)
    with imaplib.IMAP4_SSL(host, port,
                           ssl_context=ssl.create_default_context()) as mbox:
        user, pass_ = os.getenv('USER_NAME_EMAIL'), os.getenv(env_var)
        mbox.login(user, pass_)
        mbox.select()
        typ, data = mbox.search(None, 'ALL')
        for num in data[0].split():
            typ, data = mbox.fetch(num, '(RFC822)')
            msg = email.message_from_bytes(data[0][1])
            print(f'num={int(num)}, MESSAGE-ID={msg["MESSAGE-ID"]}')
            ans = input('Continue[Y/n]? ')
            if ans.upper() in ('', 'Y'):
                continue
            else:
                break
Where conf.py is:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import argparse

HOST = 'imap.mail.yahoo.com'
PORT = 993
config = {'gmail': ('imap.gmail.com', PORT, 'GMAIL_APP_PWD'),
          'yahoo': ('imap.mail.yahoo.com', PORT, 'YAHOO_APP_PWD')}
parser = argparse.ArgumentParser(description="""\
Fetch MESSAGE-ID from imap server""")
parser.add_argument('host', choices=config)
parser.add_argument('-verbose', '-v', action='count', default=0)
fetch_ssl.py outputs:
$ python fetch_ssl.py yahoo
num=1, MESSAGE-ID=None
Continue[Y/n]?
num=2, MESSAGE-ID=<83895140.288751#communications.yahoo.com>
Continue[Y/n]? n
I'd like to understand why the message with UID == 1 has no MESSAGE-ID. Does that happen from time to time (i.e., are there messages with no MESSAGE-ID)? How should I handle these cases? I haven't found such cases for Gmail.
Then I attempted to do similar with imap_tools (Version: 0.56.0), (file fetch_tools.py):
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import os
import ssl

from imap_tools import MailBoxTls

import conf

# https://github.com/ikvk/imap_tools/blob/master/examples/tls.py
# advices
# ctx.load_cert_chain(certfile="./one.crt", keyfile="./one.key")
if __name__ == '__main__':
    args = conf.parser.parse_args()
    host, port, env_var = conf.config[args.host]
    if 0 < args.verbose:
        print(host, port, env_var)
    user, pass_ = os.getenv('USER_NAME_EMAIL'), os.getenv(env_var)
    ctx = ssl.create_default_context(ssl.Purpose.CLIENT_AUTH)
    ctx.options &= ~ssl.OP_NO_SSLv3
    # imaplib.abort: socket error: EOF
    with MailBoxTls(host=host, port=port, ssl_context=ctx) as mbox:
        mbox.login(user, pass_, 'INBOX')
        for msg in mbox.fetch():
            print(msg.subject, msg.date_str)
Command
$ python fetch_tools.py yahoo
outputs:
Traceback (most recent call last):
File "/home/vlz/Documents/python-scripts/programming_python/Internet/Email/ymail/imap_tools_lab/fetch_tools.py", line 20, in <module>
with MailBoxTls(host=host, port=port, ssl_context=ctx) as mbox:
File "/home/vlz/Documents/.venv39/lib/python3.9/site-packages/imap_tools/mailbox.py", line 322, in __init__
super().__init__()
File "/home/vlz/Documents/.venv39/lib/python3.9/site-packages/imap_tools/mailbox.py", line 35, in __init__
self.client = self._get_mailbox_client()
File "/home/vlz/Documents/.venv39/lib/python3.9/site-packages/imap_tools/mailbox.py", line 328, in _get_mailbox_client
client = imaplib.IMAP4(self._host, self._port, self._timeout) # noqa
File "/usr/lib/python3.9/imaplib.py", line 205, in __init__
self._connect()
File "/usr/lib/python3.9/imaplib.py", line 247, in _connect
self.welcome = self._get_response()
File "/usr/lib/python3.9/imaplib.py", line 1075, in _get_response
resp = self._get_line()
File "/usr/lib/python3.9/imaplib.py", line 1185, in _get_line
raise self.abort('socket error: EOF')
imaplib.abort: socket error: EOF
Command
$ python fetch_tools.py gmail
produces identical results. What are my mistakes?
Using Python 3.9.2, Debian GNU/Linux 11 (bullseye), imap_tools (Version: 0.56.0).
EDIT
Headers from the message with no MESSAGE-ID
X-Apparently-To: vladimir.zolotykh#yahoo.com; Sun, 25 Oct 2015 20:54:21 +0000
Return-Path: <mail#product.communications.yahoo.com>
Received-SPF: fail (domain of product.communications.yahoo.com does not designate 216.39.62.96 as permitted sender)
...
X-Originating-IP: [216.39.62.96]
Authentication-Results: mta1029.mail.bf1.yahoo.com from=product.communications.yahoo.com; domainkeys=neutral (no sig); from=product.communications.yahoo.com; dkim=pass (ok)
Received: from 127.0.0.1 (EHLO n3-vm4.bullet.mail.gq1.yahoo.com) (216.39.62.96)
by mta1029.mail.bf1.yahoo.com with SMTPS; Sun, 25 Oct 2015 20:54:21 +0000
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=product.communications.yahoo.com; s=201402-std-mrk-prd; t=1445806460; bh=5PTgF8Jghm92xeMD5mSHp6A3eRVV70PWo1oQ15K7Tfk=; h=Date:From:Reply-To:To:Subject:From:Subject; b=D7ItgOiuLbiexJGHvORgbpRi22X+sYso6gwZKDXVca79DxMMy2R1dUtZTIg7tcft1lovVJUDw/7fC51orDltRidlfnpayeY8lT+94DRlSBwopuxgOqqR9oTTjTBZ0oEvdxUcXl/q54N2GxuBFvmg8UO0OZoCnFPpUVYo9x4arMjt/0TOW1Q5d/yjdmO7iwiued/rliP/Bsq0TaZYcb0oCAT7Q50tb1fB7wcXLYNSC1OCQ1l1LajbUqmU1LWWNse36mUUTBieO2sZT0ERFrHaCTaTNQSXKQG2AxYF7Dd/8i0Iq3xqdcS0bDpjmWE25uoKvCdtXtUbylsuQSChuLFMTw==
Received: from [216.39.60.185] by n3.bullet.mail.gq1.yahoo.com with NNFMP; 25 Oct 2015 20:54:20 -0000
Received: from [98.137.101.84] by t1.bullet.mail.gq1.yahoo.com with NNFMP; 25 Oct 2015 20:54:20 -0000
Date: 25 Oct 2015 20:54:20 +0000
Received: from [127.0.0.1] by nu-repl01.direct.gq1.yahoo.com with NNFMP; 25 Oct 2015 20:54:20 -0000
X-yahoo-newman-expires: 1445810060
From: "Yahoo Mail" <mail#product.communications.yahoo.com>
Reply-To: replies#communications.yahoo.com
To: <ME>#yahoo.com
Subject: Welcome to Yahoo! Vladimir
X-Yahoo-Newman-Property: ydirect
Content-Type: text/html
Content-Length: 25180
I skipped only X-YMailISG.
EDIT II
Of 167 messages 21 have no MESSAGE-ID header.
fetch_ssl.py takes 4m12.342s, and fetch_tools.py -- 3m41.965s
It looks like the email without a Message-ID simply and legitimately does not have one; it appears the welcome email Yahoo sent you actually lacks it. Since it's a system-generated email, that's not unexpected. You'd just have to skip over it.
The second problem is that you need to use imap_tools.MailBox.
Looking at the documentation and source at the repo it appears that the relevant classes to use are:
MailBox - for a normal encrypted connection. This is what most email servers use these days, aka IMAPS (imap with SSL/TLS)
MailBoxTls - For a STARTTLS connection: this creates a plaintext connection then upgrades it later by using a STARTTLS command in the protocol. The internet has mostly gone to the "always encrypted" rather than "upgrade" paradigm, so this is not the class to use.
MailBoxUnencrypted - Standard IMAP without SSL/TLS. You should not use this on the public internet.
The naming is a bit confusing. MailBox corresponds to imaplib.IMAP4_SSL; MailBoxTls corresponds to imaplib.IMAP4 followed by starttls() on the resulting connection; and MailBoxUnencrypted corresponds to imaplib.IMAP4 with no security applied. I imagine it's this way so the most common one (MailBox) is a safe default.
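For example, a minimal sketch of the MailBox (IMAPS) variant, reusing the question's environment variables; this is an illustration, not tested against Yahoo:
import os
from imap_tools import MailBox

user, pass_ = os.getenv('USER_NAME_EMAIL'), os.getenv('YAHOO_APP_PWD')
# MailBox connects with SSL/TLS directly, like imaplib.IMAP4_SSL
with MailBox('imap.mail.yahoo.com', 993).login(user, pass_, 'INBOX') as mbox:
    for msg in mbox.fetch():
        print(msg.subject, msg.date_str)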

How to upload a 10 Gb file using SAS token

I'm trying to upload a large file (over 10Gb) to Azure Blob Storage using SAS tokens.
I generate the tokens like this
val storageConnectionString = s"DefaultEndpointsProtocol=https;AccountName=${accountName};AccountKey=${accountKey}"
val storageAccount = CloudStorageAccount.parse(storageConnectionString)
val client = storageAccount.createCloudBlobClient()
val container = client.getContainerReference(CONTAINER_NAME)
val blockBlob = container.getBlockBlobReference(path)
val policy = new SharedAccessAccountPolicy()
policy.setPermissionsFromString("racwdlup")
val date = new Date().getTime();
val expiryDate = new Date(date + 8640000).getTime()
policy.setSharedAccessStartTime(new Date(date))
policy.setSharedAccessExpiryTime(new Date(expiryDate))
policy.setResourceTypeFromString("sco")
policy.setServiceFromString("bfqt")
val token = storageAccount.generateSharedAccessSignature(policy)
Then I tried the Put Blob API and hit the following error
$ curl -X PUT -H 'Content-Type: multipart/form-data' -H 'x-ms-date: 2020-09-04' -H 'x-ms-blob-type: BlockBlob' -F file=@10gb.csv https://ACCOUNT.blob.core.windows.net/CONTAINER/10gb.csv\?ss\=bfqt\&sig\=.... -v
< HTTP/1.1 413 The request body is too large and exceeds the maximum permissible limit.
< Content-Length: 290
< Content-Type: application/xml
< Server: Windows-Azure-Blob/1.0 Microsoft-HTTPAPI/2.0
< x-ms-request-id: f08a1473-301e-006a-4423-837a27000000
< x-ms-version: 2019-02-02
< x-ms-error-code: RequestBodyTooLarge
< Date: Sat, 05 Sep 2020 01:24:35 GMT
* HTTP error before end of send, stop sending
<
<?xml version="1.0" encoding="utf-8"?><Error><Code>RequestBodyTooLarge</Code><Message>The request body is too large and exceeds the maximum permissible limit.
RequestId:f08a1473-301e-006a-4423-837a27000000
* Closing connection 0
* TLSv1.2 (OUT), TLS alert, close notify (256):
Time:2020-09-05T01:24:35.7712576Z</Message><MaxLimit>268435456</MaxLimit></Error>%
After that, I tried uploading it as a PageBlob (I saw in the documentation that the size can be up to 8 TiB):
$ curl -X PUT -H 'Content-Type: multipart/form-data' -H 'x-ms-date: 2020-09-04' -H 'x-ms-blob-type: PageBlob' -H 'x-ms-blob-content-length: 1099511627776' -F file=@10gb.csv https://ACCOUNT.blob.core.windows.net/CONTAINER/10gb.csv\?ss\=bfqt\&sig\=... -v
< HTTP/1.1 400 The value for one of the HTTP headers is not in the correct format.
< Content-Length: 331
< Content-Type: application/xml
< Server: Windows-Azure-Blob/1.0 Microsoft-HTTPAPI/2.0
< x-ms-request-id: b00d5c32-101e-0052-3125-83dee7000000
< x-ms-version: 2019-02-02
< x-ms-error-code: InvalidHeaderValue
< Date: Sat, 05 Sep 2020 01:42:24 GMT
* HTTP error before end of send, stop sending
<
<?xml version="1.0" encoding="utf-8"?><Error><Code>InvalidHeaderValue</Code><Message>The value for one of the HTTP headers is not in the correct format.
RequestId:b00d5c32-101e-0052-3125-83dee7000000
* Closing connection 0
* TLSv1.2 (OUT), TLS alert, close notify (256):
Time:2020-09-05T01:42:24.5137237Z</Message><HeaderName>Content-Length</HeaderName><HeaderValue>10114368132</HeaderValue></Error>%
I'm not sure what the proper way is to go about uploading such a large file.
Check the different blob types here: https://learn.microsoft.com/en-us/rest/api/storageservices/understanding-block-blobs--append-blobs--and-page-blobs
The page blob actually limits the maximum size to 8 TiB, but it's optimal for random read and write operations.
On the other hand:
Block blobs are optimized for uploading large amounts of data efficiently. Block blobs are comprised of blocks, each of which is identified by a block ID. A block blob can include up to 50,000 blocks.
So block blobs are the way to go, as they support sizes of up to 190.7 TB (preview).
Now you need to use Put Block (https://learn.microsoft.com/en-us/rest/api/storageservices/put-block) to upload the blocks that will form your blob, and then commit them with Put Block List.
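If you can use the Azure SDK rather than raw REST calls, the Python azure-storage-blob package will stage a large upload as blocks and commit the block list for you. A rough sketch, assuming the SAS URL points at the target blob (account, container, blob name and token below are placeholders):
from azure.storage.blob import BlobClient

# placeholder SAS URL for the target blob
sas_url = "https://ACCOUNT.blob.core.windows.net/CONTAINER/10gb.csv?<sas-token>"
blob = BlobClient.from_blob_url(sas_url)

with open("10gb.csv", "rb") as data:
    # upload_blob stages the stream as blocks and commits them as a block blob
    blob.upload_blob(data, overwrite=True, max_concurrency=4)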
To copy large files to a blob you can use azcopy:
Authenticate first:
azcopy login
Then copy the file:
azcopy copy 'C:\myDirectory\myTextFile.txt' 'https://mystorageaccount.blob.core.windows.net/mycontainer/myTextFile.txt'
https://learn.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-blobs?toc=/azure/storage/blobs/toc.json

is maxTotalHeaderLength working as expected?

Warp has a settingsMaxTotalHeaderLength field, which by default is 50*1024: https://hackage.haskell.org/package/warp-3.3.10/docs/src/Network.Wai.Handler.Warp.Settings.html#defaultSettings
I suppose this means 50 KB? But when I try to send a header of ~33 KB, the server throws a Bad Request:
curl -v -w '%{size_request} %{size_upload}' -H @temp.log localhost:8080/v1/version
Result:
* Trying 127.0.0.1...
* TCP_NODELAY set
* Connected to localhost (127.0.0.1) port 8080 (#0)
> GET /v1/version HTTP/1.1
> Host: localhost:8080
> User-Agent: curl/7.58.0
> Accept: */*
> myheader2: <big header snipped>
>
* HTTP 1.0, assume close after body
< HTTP/1.0 400 Bad Request
< Date: Wed, 22 Jul 2020 13:15:19 GMT
< Server: Warp/3.3.10
< Content-Type: text/plain; charset=utf-8
<
* Closing connection 0
Bad Request33098 0
(note that the request size is 33098)
Same thing works with 32.5KB header file.
My real problem is actually that I need to set settingsMaxTotalHeaderLength = 160000 to send a header of size ~55 KB. I'm not sure if this is a bug or if I am misreading something.
Congratulations, it looks like you found a bug in warp. In the definition of push, there's some double-counting going on. Near the top, bsLen is calculated as the complete length of the header so far, but further down in the push' Nothing case that adds newline-free chunks, the length is updated as:
len' = len + bsLen
when bsLen already accounts for len. There are similar problems in the other push' cases, where start and end are offsets into the complete header and so shouldn't be added to len.

How can I view raw content with HTTP request?

I cannot seem to make the script print out JUST the content of the page.
I would like to do this using the socket module only - no other libraries like requests or urllib.
I cannot really try much. So I instantly committed a sin and came here first ^^'
My code:
import socket
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.connect(("pastebin.com", 80))
sock.sendall(b"GET /raw/yWmuKZyb HTTP/1.1\r\nHost: pastebin.com\r\n\r\n")
r = sock.recv(4096).decode("utf-8")
print(r)
sock.close()
I want the printed result to be:
test
test1
test2
test3
but what I get is
HTTP/1.1 200 OK
Date: Tue, 09 Apr 2019 14:20:45 GMT
Content-Type: text/plain; charset=utf-8
Transfer-Encoding: chunked
Connection: keep-alive
Set-Cookie: __cfduid=xxx; expires=Wed, 08-Apr-20 14:20:45 GMT; path=/; domain=.pastebin.com; HttpOnly
Cache-Control: no-cache, must-revalidate
Pragma: no-cache
Expires: Sat, 26 Jul 1997 05:00:00 GMT
Vary: Accept-Encoding
X-XSS-Protection: 1; mode=block
CF-Cache-Status: MISS
Server: cloudflare
CF-RAY: 4c4d1f9f685ece41-LHR
19
test
test1
test2
test3
Just extract the content after \r\n\r\n by using string.split and print it:
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.connect(("pastebin.com", 80))
sock.sendall(b"GET /raw/yWmuKZyb HTTP/1.1\r\nHost: pastebin.com\r\n\r\n")
r = sock.recv(4096).decode("utf-8")
# Extract the content after splitting the string on \r\n\r\n
content_list = r.split('\r\n\r\n')[1].split('\r\n')
content = '\r\n'.join(content_list)
print(content)
# 19
# test
# test1
# test2
# test3
sock.close()
You are doing an HTTP/1.1 request and therefore the web server may reply with a response body in chunked transfer encoding. In this mode each chunk is prefixed by its size in hexadecimal. You either need to implement this mode, or you could simply do an HTTP/1.0 request, in which case the server will not use chunked transfer encoding, since it was only introduced with HTTP/1.1.
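For illustration, a sketch of the HTTP/1.0 variant of the question's code (adds Connection: close and reads until the server closes the socket; untested against pastebin today):
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.connect(("pastebin.com", 80))
# HTTP/1.0 request: the server will not use chunked transfer encoding
sock.sendall(b"GET /raw/yWmuKZyb HTTP/1.0\r\nHost: pastebin.com\r\nConnection: close\r\n\r\n")

chunks = []
while True:
    data = sock.recv(4096)
    if not data:  # server closed the connection, response is complete
        break
    chunks.append(data)
sock.close()

response = b"".join(chunks).decode("utf-8")
# the body starts after the blank line that ends the headers
body = response.split("\r\n\r\n", 1)[1]
print(body)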
Anyway, if you don't want to use any existing libraries but want to do your own HTTP, it is expected that you actually understand HTTP. Understanding means that you have read the relevant standards, because that's what standards are for. For HTTP/1.1 this is originally RFC 2616, which was later slightly reworked into RFCs 7230-7235. Once you start reading these standards, you will likely appreciate that there are existing libraries which deal with these protocols, since they are far from trivial.

Cloudfront to use different Content-Type for compressed and uncompressed files

I am serving a website generated by a static site generator via S3 and CloudFront. The files are uploaded to S3 with correct Content-Types. The DNS points to CloudFront, which uses the S3 bucket as its origin. CloudFront takes care of encryption and compression; I told CloudFront to compress objects automatically. That worked fine until I decided to change some of the images from PNG to SVG.
Whenever the file is requested uncompressed, it is delivered as-is with the set Content-Type (image/svg+xml) and the site renders correctly. However, if the file is requested compressed, it is delivered with the default Content-Type (application/octet-stream) and the image is missing from the rendering. If I then right-click on the image and choose to open it in a new tab, it is shown correctly (without the rest of the page).
The result is the same regardless of the browser used. In Firefox I know how to force requesting compressed or uncompressed pages. I also tried curl to check the headers. These are the results:
λ curl --compressed -v -o /dev/null http://dev.example.com/img/logo-6998bdf68c.svg
* STATE: INIT => CONNECT handle 0x20049798; line 1090 (connection #-5000)
* Added connection 0. The cache now contains 1 members
* Trying 52.222.157.200...
* STATE: CONNECT => WAITCONNECT handle 0x20049798; line 1143 (connection #0)
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0* Connected to dev.example.com (52.222.157.200) port 80 (#0)
* STATE: WAITCONNECT => SENDPROTOCONNECT handle 0x20049798; line 1240 (connection #0)
* STATE: SENDPROTOCONNECT => DO handle 0x20049798; line 1258 (connection #0)
> GET /img/logo-6998bdf68c.svg HTTP/1.1
> Host: dev.example.com
> User-Agent: curl/7.44.0
> Accept: */*
> Accept-Encoding: deflate, gzip
>
* STATE: DO => DO_DONE handle 0x20049798; line 1337 (connection #0)
* STATE: DO_DONE => WAITPERFORM handle 0x20049798; line 1464 (connection #0)
* STATE: WAITPERFORM => PERFORM handle 0x20049798; line 1474 (connection #0)
* HTTP 1.1 or later with persistent connection, pipelining supported
< HTTP/1.1 200 OK
< Content-Type: application/octet-stream
< Content-Length: 7468
< Connection: keep-alive
< Date: Wed, 01 Mar 2017 13:31:33 GMT
< x-amz-meta-cb-modifiedtime: Wed, 01 Mar 2017 13:28:26 GMT
< Last-Modified: Wed, 01 Mar 2017 13:30:24 GMT
< ETag: "6998bdf68c8812d193dd799c644abfb6"
* Server AmazonS3 is not blacklisted
< Server: AmazonS3
< X-Cache: RefreshHit from cloudfront
< Via: 1.1 36c13eeffcddf77ad33d7874b28e6168.cloudfront.net (CloudFront)
< X-Amz-Cf-Id: jT86EeNn2vFYAU2Jagj_aDx6qQUBXFqiDhlcdfxLKrj5bCdAKBIbXQ==
<
{ [7468 bytes data]
* STATE: PERFORM => DONE handle 0x20049798; line 1632 (connection #0)
* Curl_done
100 7468 100 7468 0 0 44526 0 --:--:-- --:--:-- --:--:-- 48493
* Connection #0 to host dev.example.com left intact
* Expire cleared
and for uncompressed it looks better:
λ curl -v -o /dev/null http://dev.example.com/img/logo-6998bdf68c.svg
* STATE: INIT => CONNECT handle 0x20049798; line 1090 (connection #-5000)
* Added connection 0. The cache now contains 1 members
* Trying 52.222.157.203...
* STATE: CONNECT => WAITCONNECT handle 0x20049798; line 1143 (connection #0)
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0* Connected to dev.example.com (52.222.157.203) port 80 (#0)
* STATE: WAITCONNECT => SENDPROTOCONNECT handle 0x20049798; line 1240 (connection #0)
* STATE: SENDPROTOCONNECT => DO handle 0x20049798; line 1258 (connection #0)
> GET /img/logo-6998bdf68c.svg HTTP/1.1
> Host: dev.example.com
> User-Agent: curl/7.44.0
> Accept: */*
>
* STATE: DO => DO_DONE handle 0x20049798; line 1337 (connection #0)
* STATE: DO_DONE => WAITPERFORM handle 0x20049798; line 1464 (connection #0)
* STATE: WAITPERFORM => PERFORM handle 0x20049798; line 1474 (connection #0)
* HTTP 1.1 or later with persistent connection, pipelining supported
< HTTP/1.1 200 OK
< Content-Type: image/svg+xml
< Content-Length: 7468
< Connection: keep-alive
< Date: Wed, 01 Mar 2017 20:56:11 GMT
< x-amz-meta-cb-modifiedtime: Wed, 01 Mar 2017 20:39:17 GMT
< Last-Modified: Wed, 01 Mar 2017 20:41:13 GMT
< ETag: "6998bdf68c8812d193dd799c644abfb6"
* Server AmazonS3 is not blacklisted
< Server: AmazonS3
< Vary: Accept-Encoding
< X-Cache: RefreshHit from cloudfront
< Via: 1.1 ac27d939fa02703c4b28926f53f95083.cloudfront.net (CloudFront)
< X-Amz-Cf-Id: AlodMvGOKIoNb8zm5OuS7x_8TquQXzAAXg05efSMdIKgrPhwEPv4kA==
<
{ [2422 bytes data]
* STATE: PERFORM => DONE handle 0x20049798; line 1632 (connection #0)
* Curl_done
100 7468 100 7468 0 0 27667 0 --:--:-- --:--:-- --:--:-- 33639
* Connection #0 to host dev.example.com left intact
I don't want to switch off compression for performance reasons. And it looks like this only happens for SVG files; all other types have the correct, i.e. the same, Content-Type. I already tried to invalidate the cache and even switched caching off completely by setting the cache time to 0 seconds. I cannot upload a compressed version when uploading to S3, because the upload process is automated and cannot easily be changed for a single file.
I hope that I did something wrong, because that would be the easiest to fix. But I have no clue what could be wrong with the setup. I already used Google to find someone with a similar issue, but it looks like it's only me. Anyone who has an idea?
You're misdiagnosing the problem. CloudFront doesn't change the Content-Type.
CloudFront, however, does cache different versions of the same object, based on variations in the request.
If you notice, your Last-Modified times on these objects are different. You originally had the content-type set wrong in S3. You subsequently fixed that, but CloudFront doesn't realize the metadata has changed, since the ETag didn't change, so you're getting erroneous RefreshHit responses. It's serving the older version on requests that advertise gzip encoding support. If the actual payload of the object had changed, CloudFront would have likely already updated its cache.
Do an invalidation to purge the cache and within a few minutes, this issue should go away.
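For example, an invalidation can be issued programmatically with boto3 (the distribution ID below is a placeholder, and the path should match wherever the SVGs live):
import time
import boto3

cf = boto3.client('cloudfront')
cf.create_invalidation(
    DistributionId='EXXXXXXXXXXXXX',  # placeholder distribution ID
    InvalidationBatch={
        'Paths': {'Quantity': 1, 'Items': ['/img/*']},
        'CallerReference': str(time.time()),  # any unique string per invalidation
    },
)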
I was able to solve it by forcing the MIME type to "image/svg+xml" instead of "binary/octet-stream", which was selected after syncing the files with Python boto3.
When you right-click on an SVG in an S3 bucket, you can check the MIME type by viewing the metadata.
I'm not sure if this was caused by the Python sync or by some weirdness in S3/CloudFront. I have to add that just a cache invalidation didn't work after that; I had to re-upload my files with the correct MIME type to get CloudFront access to the SVG working OK.
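For reference, a sketch of re-uploading a single file with an explicit MIME type via boto3 (the bucket name is a placeholder; the key matches the logo from the question):
import boto3

s3 = boto3.client('s3')
# ExtraArgs sets the Content-Type stored on the S3 object,
# so CloudFront can serve image/svg+xml instead of a generic octet-stream type
s3.upload_file(
    'img/logo-6998bdf68c.svg',   # local file
    'my-bucket',                 # placeholder bucket name
    'img/logo-6998bdf68c.svg',   # object key
    ExtraArgs={'ContentType': 'image/svg+xml'},
)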
