How to upload a 10 Gb file using SAS token - azure

I'm trying to upload a large file (over 10 GB) to Azure Blob Storage using SAS tokens.
I generate the tokens like this:
val storageConnectionString = s"DefaultEndpointsProtocol=https;AccountName=${accountName};AccountKey=${accountKey}"
val storageAccount = CloudStorageAccount.parse(storageConnectionString)
val client = storageAccount.createCloudBlobClient()
val container = client.getContainerReference(CONTAINER_NAME)
val blockBlob = container.getBlockBlobReference(path)
val policy = new SharedAccessAccountPolicy()
policy.setPermissionsFromString("racwdlup")
val date = new Date().getTime();
val expiryDate = new Date(date + 8640000).getTime()
policy.setSharedAccessStartTime(new Date(date))
policy.setSharedAccessExpiryTime(new Date(expiryDate))
policy.setResourceTypeFromString("sco")
policy.setServiceFromString("bfqt")
val token = storageAccount.generateSharedAccessSignature(policy)
Then I tried the Put Blob API and hit the following error
$ curl -X PUT -H 'Content-Type: multipart/form-data' -H 'x-ms-date: 2020-09-04' -H 'x-ms-blob-type: BlockBlob' -F file=@10gb.csv https://ACCOUNT.blob.core.windows.net/CONTAINER/10gb.csv\?ss\=bfqt\&sig\=.... -v
< HTTP/1.1 413 The request body is too large and exceeds the maximum permissible limit.
< Content-Length: 290
< Content-Type: application/xml
< Server: Windows-Azure-Blob/1.0 Microsoft-HTTPAPI/2.0
< x-ms-request-id: f08a1473-301e-006a-4423-837a27000000
< x-ms-version: 2019-02-02
< x-ms-error-code: RequestBodyTooLarge
< Date: Sat, 05 Sep 2020 01:24:35 GMT
* HTTP error before end of send, stop sending
<
<?xml version="1.0" encoding="utf-8"?><Error><Code>RequestBodyTooLarge</Code><Message>The request body is too large and exceeds the maximum permissible limit.
RequestId:f08a1473-301e-006a-4423-837a27000000
* Closing connection 0
* TLSv1.2 (OUT), TLS alert, close notify (256):
Time:2020-09-05T01:24:35.7712576Z</Message><MaxLimit>268435456</MaxLimit></Error>%
After that I tried uploading it as a PageBlob (the documentation says page blobs can be up to 8 TiB):
$ curl -X PUT -H 'Content-Type: multipart/form-data' -H 'x-ms-date: 2020-09-04' -H 'x-ms-blob-type: PageBlob' -H 'x-ms-blob-content-length: 1099511627776' -F file=@10gb.csv https://ACCOUNT.blob.core.windows.net/CONTAINER/10gb.csv\?ss\=bfqt\&sig\=... -v
< HTTP/1.1 400 The value for one of the HTTP headers is not in the correct format.
< Content-Length: 331
< Content-Type: application/xml
< Server: Windows-Azure-Blob/1.0 Microsoft-HTTPAPI/2.0
< x-ms-request-id: b00d5c32-101e-0052-3125-83dee7000000
< x-ms-version: 2019-02-02
< x-ms-error-code: InvalidHeaderValue
< Date: Sat, 05 Sep 2020 01:42:24 GMT
* HTTP error before end of send, stop sending
<
<?xml version="1.0" encoding="utf-8"?><Error><Code>InvalidHeaderValue</Code><Message>The value for one of the HTTP headers is not in the correct format.
RequestId:b00d5c32-101e-0052-3125-83dee7000000
* Closing connection 0
* TLSv1.2 (OUT), TLS alert, close notify (256):
Time:2020-09-05T01:42:24.5137237Z</Message><HeaderName>Content-Length</HeaderName><HeaderValue>10114368132</HeaderValue></Error>%
What is the proper way to upload such a large file?

Check the different blob types here: https://learn.microsoft.com/en-us/rest/api/storageservices/understanding-block-blobs--append-blobs--and-page-blobs
Page blobs do allow a maximum size of 8 TiB, but they are optimized for random read and write operations.
On the other hand:
Block blobs are optimized for uploading large amounts of data efficiently. Block blobs are comprised of blocks, each of which is identified by a block ID. A block blob can include up to 50,000 blocks.
So a block blob is the way to go, as it supports sizes of up to about 190.7 TiB (preview mode).
To build one, upload the blocks with Put Block (https://learn.microsoft.com/en-us/rest/api/storageservices/put-block) and then commit them into the final blob with Put Block List (https://learn.microsoft.com/en-us/rest/api/storageservices/put-block-list).
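A rough sketch of that flow in Python (standard library only), in case it helps: each chunk of the file is sent with Put Block, and the blob is then committed with Put Block List. The helper names, the 64 MiB block size, and the assumption that `sas_token` is the SAS query string without a leading `?` are mine; check the per-block size limit for your `x-ms-version` before relying on it.

```python
import base64
import urllib.parse
import urllib.request
from xml.sax.saxutils import escape

# Assumed block size; it must stay under the service's per-block limit.
BLOCK_SIZE = 64 * 1024 * 1024


def block_id(index: int) -> str:
    """Return a base64 block ID. All IDs within one blob must have equal length."""
    return base64.b64encode(f"{index:08d}".encode("ascii")).decode("ascii")


def block_list_xml(block_ids) -> str:
    """Build the request body for Put Block List."""
    items = "".join(f"<Latest>{escape(b)}</Latest>" for b in block_ids)
    return f'<?xml version="1.0" encoding="utf-8"?><BlockList>{items}</BlockList>'


def _put(url: str, body: bytes) -> None:
    """Send one PUT request; urllib raises HTTPError on a 4xx/5xx response."""
    req = urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/octet-stream"},
    )
    with urllib.request.urlopen(req) as resp:
        resp.read()


def upload_in_blocks(path: str, blob_url: str, sas_token: str,
                     block_size: int = BLOCK_SIZE) -> int:
    """Upload `path` block by block, then commit. Returns the number of blocks."""
    ids = []
    with open(path, "rb") as f:
        while True:
            chunk = f.read(block_size)
            if not chunk:
                break
            bid = block_id(len(ids))
            # The block ID is base64, so it has to be URL-encoded in the query.
            quoted = urllib.parse.quote(bid, safe="")
            _put(f"{blob_url}?comp=block&blockid={quoted}&{sas_token}", chunk)
            ids.append(bid)
    # Nothing is visible as a blob until the block list is committed.
    _put(f"{blob_url}?comp=blocklist&{sas_token}",
         block_list_xml(ids).encode("utf-8"))
    return len(ids)
```

Only one block is held in memory at a time, so the 10 GB file never has to fit in RAM, and a failed block can be retried on its own.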

To copy large files to a blob you can use azcopy:
Authenticate first (or skip this step and append your SAS token to the destination URL):
azcopy login
Then copy the file:
azcopy copy 'C:\myDirectory\myTextFile.txt' 'https://mystorageaccount.blob.core.windows.net/mycontainer/myTextFile.txt'
https://learn.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-blobs?toc=/azure/storage/blobs/toc.json


Loading a GRIB from the web in Python without saving the file locally

I would like to download a GRIB file from the web:
Opt1: https://noaa-gfs-bdp-pds.s3.amazonaws.com/gfs.20210801/12/atmos/gfs.t12z.pgrb2.1p00.f000
Opt2: https://www.ncei.noaa.gov/data/global-forecast-system/access/grid-003-1.0-degree/analysis/202108/20210801/gfs_3_20210801_1200_000.grb2
(FYI, data is identical)
A similar question is asked here:
How can a GRIB file be opened with pygrib without first downloading the file?
However, ultimately a final solution isn't given as the data source has multiple responses (Causing issues). When I tried to recreate the code:
url = 'https://noaa-gfs-bdp-pds.s3.amazonaws.com/gfs.20210801/12/atmos/gfs.t12z.pgrb2.1p00.f000'
urllib.request.install_opener(opener)
req = urllib.request.Request(url,headers={'User-Agent': 'Mozilla/5.0'})
data = urllib.request.urlopen(req, timeout = 300)
import pygrib
grib = pygrib.fromstring(data.read())
My error is different:
ECCODES ERROR : grib_handle_new_from_message: No final 7777 in message!
Traceback (most recent call last):
File "grib_data_test.py", line 139, in <module>
grib = pygrib.fromstring(data.read())
File "src\pygrib\_pygrib.pyx", line 627, in pygrib._pygrib.fromstring
File "src\pygrib\_pygrib.pyx", line 1390, in pygrib._pygrib.gribmessage._set_projparams
File "src\pygrib\_pygrib.pyx", line 1127, in pygrib._pygrib.gribmessage.__getitem__
RuntimeError: b'Key/value not found'
I have also tried to work directly with the osgeo.gdal library (preferable for my project as it is already in use in the project). Documentation: https://gdal.org/user/virtual_file_systems.html#vsimem-in-memory-files:
Attempt1: url = "/vsicurl/https://noaa-gfs-bdp-pds.s3.amazonaws.com/gfs.20210801/12/atmos/gfs.t12z.pgrb2.0p50.f000"
Attempt2: url = "/vsis3/https://noaa-gfs-bdp-pds.s3.amazonaws.com/gfs.20210801/12/atmos/gfs.t12z.pgrb2.0p50.f000"
Attempt3: url = "/vsicurl/https://www.ncei.noaa.gov/data/global-forecast-system/access/grid-003-1.0-degree/analysis/202108/20210801/gfs_3_20210801_0000_000.grb2"
grib = gdal.Open(url)
Errors:
Attempt1: ERROR 11: CURL error: schannel: next InitializeSecurityContext failed: Unknown error (0x80092012) - The revocation function was unable to check revocation for the certificate.
Attempt2: ERROR 11: HTTP response code: 0
Attempt3: ERROR 4: /vsicurl/https://www.ncei.noaa.gov/data/global-forecast-system/access/grid-003-1.0-degree/analysis/202108/20210801/gfs_3_20210801_0000_000.grb2 is a grib file, but no raster dataset was successfully identified.
For Attempt1 and 2: I feel like both attempts should work here. Normally you do not need any specific AWS credentials / connection to access the publicly available S3 bucket data (as it is accessed using urllib above).
For Attempt3, there is a similar issue here:
https://gis.stackexchange.com/questions/395867/opening-a-grib-from-the-web-with-gdal-in-python-using-vsicurl-throws-error-on-m
However I do not experience the same issue, my Output:
gdalinfo /vsicurl/https://www.ncei.noaa.gov/data/global-forecast-system/access/grid-003-1.0-degree/analysis/202108/20210801/gfs_3_20210801_0000_000.grb2 --debug on --config CPL_CURL_VERBOSE YES
HTTP: libcurl/7.78.0 Schannel zlib/1.2.11 libssh2/1.9.0
HTTP: GDAL was built against curl 7.77.0, but is running against 7.78.0.
* Couldn't find host www.ncei.noaa.gov in the (nil) file; using defaults
* Trying 205.167.25.177:443...
* Connected to www.ncei.noaa.gov (205.167.25.177) port 443 (#0)
* schannel: disabled automatic use of client certificate
> HEAD /data/global-forecast-system/access/grid-003-1.0-degree/analysis/202108/20210801/gfs_3_20210801_0000_000.grb2 HTTP/1.1
Host: www.ncei.noaa.gov
Accept: */*
* schannel: server closed the connection
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Date: Mon, 20 Sep 2021 14:38:59 GMT
< Server: Apache
< Strict-Transport-Security: max-age=31536000
< Last-Modified: Sun, 01 Aug 2021 04:46:49 GMT
< ETag: "287c05a-5c87824012f22"
< Accept-Ranges: bytes
< Content-Length: 42451034
< Access-Control-Allow-Origin: *
< Access-Control-Allow-Headers: X-Requested-With, Content-Type
< Connection: close
<
* Closing connection 0
* schannel: shutting down SSL/TLS connection with www.ncei.noaa.gov port 443
VSICURL: GetFileSize(https://www.ncei.noaa.gov/data/global-forecast-system/access/grid-003-1.0-degree/analysis/202108/20210801/gfs_3_20210801_0000_000.grb2)=42451034 response_code=200
VSICURL: Downloading 0-16383 (https://www.ncei.noaa.gov/data/global-forecast-system/access/grid-003-1.0-degree/analysis/202108/20210801/gfs_3_20210801_0000_000.grb2)...
* Couldn't find host www.ncei.noaa.gov in the (nil) file; using defaults
* Hostname www.ncei.noaa.gov was found in DNS cache
* Trying 205.167.25.177:443...
* Connected to www.ncei.noaa.gov (205.167.25.177) port 443 (#1)
* schannel: disabled automatic use of client certificate
> GET /data/global-forecast-system/access/grid-003-1.0-degree/analysis/202108/20210801/gfs_3_20210801_0000_000.grb2 HTTP/1.1
Host: www.ncei.noaa.gov
Accept: */*
Range: bytes=0-16383
* schannel: failed to decrypt data, need more data
* Mark bundle as not supporting multiuse
< HTTP/1.1 206 Partial Content
< Date: Mon, 20 Sep 2021 14:39:00 GMT
< Server: Apache
< Strict-Transport-Security: max-age=31536000
< Last-Modified: Sun, 01 Aug 2021 04:46:49 GMT
< ETag: "287c05a-5c87824012f22"
< Accept-Ranges: bytes
< Content-Length: 16384
< Content-Range: bytes 0-16383/42451034
< Access-Control-Allow-Origin: *
< Access-Control-Allow-Headers: X-Requested-With, Content-Type
< Connection: close
<
* schannel: failed to decrypt data, need more data
* schannel: failed to decrypt data, need more data
* schannel: failed to decrypt data, need more data
* Closing connection 1
* schannel: shutting down SSL/TLS connection with www.ncei.noaa.gov port 443
VSICURL: Got response_code=206
VSICURL: Downloading 81920-98303 (https://www.ncei.noaa.gov/data/global-forecast-system/access/grid-003-1.0-degree/analysis/202108/20210801/gfs_3_20210801_0000_000.grb2)...
* Couldn't find host www.ncei.noaa.gov in the (nil) file; using defaults
* Hostname www.ncei.noaa.gov was found in DNS cache
* Trying 205.167.25.177:443...
* Connected to www.ncei.noaa.gov (205.167.25.177) port 443 (#2)
* schannel: disabled automatic use of client certificate
* schannel: failed to receive handshake, SSL/TLS connection failed
* Closing connection 2
VSICURL: DownloadRegion(https://www.ncei.noaa.gov/data/global-forecast-system/access/grid-003-1.0-degree/analysis/202108/20210801/gfs_3_20210801_0000_000.grb2): response_code=0, msg=schannel: failed to receive handshake, SSL/TLS connection failed
VSICURL: Got response_code=0
GRIB: ERROR: Ran out of file in Section 7
ERROR: Problems Jumping past section 7
ERROR 4: /vsicurl/https://www.ncei.noaa.gov/data/global-forecast-system/access/grid-003-1.0-degree/analysis/202108/20210801/gfs_3_20210801_0000_000.grb2 is a grib file, but no raster dataset was successfully identified.
gdalinfo failed - unable to open '/vsicurl/https://www.ncei.noaa.gov/data/global-forecast-system/access/grid-003-1.0-degree/analysis/202108/20210801/gfs_3_20210801_0000_000.grb2'.
Basically, I can currently download the files using urllib, and then:
with open(file_path, 'wb') as file_out:  # file_path: placeholder for a local path; 'wb' because the response is bytes
    file_out.write(data.read())
grib = gdal.Open(file_path)  # gdal.Open takes a path, not a file object
does what I require. But there is a desire to not save the files locally during the temporary processing step, due to the system we are working within.
Python Versions:
gdal 3.3.1 py38hacca965_1 defaults
pygrib 2.1.4 py38hceae430_0 defaults
python 3.8.10 h7840368_1_cpython defaults
Cheers.
This worked for me. I was using the GEFS data hosted on AWS instead, though. I believe there is GFS on AWS as well. No account should be needed, so it would just be a matter of changing the bucket and s3_object names to point to GFS data instead:
import pygrib
import boto3
from botocore import UNSIGNED
from botocore.config import Config
s3 = boto3.client('s3', config=Config(signature_version=UNSIGNED))
bucket_name = 'noaa-gefs-pds'
s3_object = 'gefs.20220425/00/atmos/pgrb2ap5/gec00.t00z.pgrb2a.0p50.f000'
obj = s3.get_object(Bucket=bucket_name, Key=s3_object)['Body'].read()
grbs = pygrib.fromstring(obj)
print(type(grbs))
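One caveat, as far as I understand it: `pygrib.fromstring` builds a single message from the bytes it is given, while a GFS/GEFS file holds many messages back to back. Below is a minimal sketch for walking every message in the downloaded buffer, assuming GRIB edition 2 (where the 16-byte indicator section stores the total message length in its last 8 bytes). `iter_grib2_messages` is a hypothetical helper name, not part of pygrib.

```python
import struct


def iter_grib2_messages(buf: bytes):
    """Yield the raw bytes of each GRIB2 message found in `buf`."""
    pos = 0
    while True:
        start = buf.find(b"GRIB", pos)
        if start < 0:
            return  # no further messages
        if start + 16 > len(buf):
            raise ValueError("truncated GRIB indicator section")
        edition = buf[start + 7]
        if edition != 2:
            raise ValueError(f"expected GRIB edition 2, found {edition}")
        # Octets 9-16 of section 0: total message length, big-endian.
        (total_len,) = struct.unpack(">Q", buf[start + 8:start + 16])
        end = start + total_len
        # Every message must close with the 4-byte end marker "7777".
        if total_len < 20 or end > len(buf) or buf[end - 4:end] != b"7777":
            raise ValueError("truncated or corrupt GRIB2 message")
        yield buf[start:end]
        pos = end
```

With the `obj` bytes from the snippet above, each yielded chunk could then be passed to `pygrib.fromstring` in turn.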

is maxTotalHeaderLength working as expected?

Warp has a settingsMaxTotalHeaderLength field which by default is 50*1024 : https://hackage.haskell.org/package/warp-3.3.10/docs/src/Network.Wai.Handler.Warp.Settings.html#defaultSettings
I suppose this means 50KB? But, when I try to send a header with ~33KB, server throws bad request:
curl -v -w '%{size_request} %{size_upload}' -H @temp.log localhost:8080/v1/version
Result:
* Trying 127.0.0.1...
* TCP_NODELAY set
* Connected to localhost (127.0.0.1) port 8080 (#0)
> GET /v1/version HTTP/1.1
> Host: localhost:8080
> User-Agent: curl/7.58.0
> Accept: */*
> myheader2: <big header snipped>
>
* HTTP 1.0, assume close after body
< HTTP/1.0 400 Bad Request
< Date: Wed, 22 Jul 2020 13:15:19 GMT
< Server: Warp/3.3.10
< Content-Type: text/plain; charset=utf-8
<
* Closing connection 0
Bad Request33098 0
(note that the request size is 33098)
Same thing works with 32.5KB header file.
My real problem is actually that I need to set settingsMaxTotalHeaderLength = 160000 to send a header of size ~55KB. Not sure if this is a bug or I am misreading something?
Congratulations, it looks like you found a bug in warp. In the definition of push, there's some double-counting going on. Near the top, bsLen is calculated as the complete length of the header so far, but further down in the push' Nothing case that adds newline-free chunks, the length is updated as:
len' = len + bsLen
when bsLen already accounts for len. There are similar problems in the other push' cases, where start and end are offsets into the complete header and so shouldn't be added to len.
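To see why this makes the effective limit much smaller than the configured one, here is a toy model of the double counting in Python. It is not Warp's code: `counted_length` and the way the header is split into chunks are made up purely for illustration.

```python
def counted_length(chunks, double_count):
    """Return the length a header-size check would see after all chunks."""
    counted = 0    # value compared against the configured maximum
    received = 0   # true number of header bytes seen so far
    for chunk in chunks:
        received += len(chunk)
        # Correct accounting adds only the new bytes.
        # The buggy variant adds the running total on top of the old count.
        counted += received if double_count else len(chunk)
    return counted


chunks = [b"a" * 10, b"b" * 10, b"c" * 10]
print(counted_length(chunks, double_count=False))
print(counted_length(chunks, double_count=True))
```

In this model the counted length grows faster than the real header size whenever the header arrives in more than one piece, which is consistent with needing a limit well above the real header size.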

How can I view raw content with HTTP request?

I cannot seem to make the script print out JUST the content shown on the page.
I would like this to use the socket module only. No other libraries like requests or urllib.
I cannot really try much. So I instantly committed a sin and came here first ^^'
My code:
import socket
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.connect(("pastebin.com", 80))
sock.sendall(b"GET /raw/yWmuKZyb HTTP/1.1\r\nHost: pastebin.com\r\n\r\n")
r = sock.recv(4096).decode("utf-8")
print(r)
sock.close()
I want the printed result to be:
test
test1
test2
test3
but what I get is
HTTP/1.1 200 OK
Date: Tue, 09 Apr 2019 14:20:45 GMT
Content-Type: text/plain; charset=utf-8
Transfer-Encoding: chunked
Connection: keep-alive
Set-Cookie: __cfduid=xxx; expires=Wed, 08-Apr-20 14:20:45 GMT; path=/; domain=.pastebin.com; HttpOnly
Cache-Control: no-cache, must-revalidate
Pragma: no-cache
Expires: Sat, 26 Jul 1997 05:00:00 GMT
Vary: Accept-Encoding
X-XSS-Protection: 1; mode=block
CF-Cache-Status: MISS
Server: cloudflare
CF-RAY: 4c4d1f9f685ece41-LHR
19
test
test1
test2
test3
Just extract the content after \r\n\r\n by using str.split and print it:
import socket
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.connect(("pastebin.com", 80))
sock.sendall(b"GET /raw/yWmuKZyb HTTP/1.1\r\nHost: pastebin.com\r\n\r\n")
r = sock.recv(4096).decode("utf-8")
#Extract the content after splitting the string on \r\n\r\n
content_list = r.split('\r\n\r\n')[1].split('\r\n')
content = '\r\n'.join(content_list)
print(content)
#19
#test
#test1
#test2
#test3
sock.close()
You are doing a HTTP/1.1 request and therefore the web server might reply with a response body in chunked transfer encoding. In this mode each chunk is prefixed by the size in hexadecimal. You either need to implement this mode or you could simply do a HTTP/1.0 request in which case the server will not use chunked transfer encoding since this was only introduced with HTTP/1.1.
Anyway, if you don't want to use any existing libraries but do your own HTTP, it is expected that you actually understand HTTP. Understanding means that you have read the relevant standards, because that's what standards are for. For HTTP/1.1 this is originally RFC 2616, which was later slightly reworked into RFC 7230-7235. Once you start reading these standards you will likely appreciate that there are existing libraries which deal with these protocols, since they are far from trivial.
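If you do want to handle the chunked case yourself, a minimal sketch might look like this. It assumes the whole response has already been read into `raw` (a single `recv(4096)` is not guaranteed to return everything), it ignores trailers, and the helper names are made up.

```python
def decode_chunked(body: bytes) -> bytes:
    """Decode a chunked transfer-encoded body (trailers are ignored)."""
    out = bytearray()
    pos = 0
    while True:
        line_end = body.index(b"\r\n", pos)
        # The size line is hexadecimal and may carry ";extension" suffixes.
        size_field = body[pos:line_end].split(b";", 1)[0]
        size = int(size_field.strip(), 16)
        if size == 0:
            return bytes(out)  # a zero-sized chunk marks the end
        start = line_end + 2
        out += body[start:start + size]
        pos = start + size + 2  # skip the CRLF that follows the chunk data


def extract_body(raw: bytes) -> bytes:
    """Split headers from body and undo chunked encoding if it was used."""
    head, _, body = raw.partition(b"\r\n\r\n")
    headers = {}
    for line in head.split(b"\r\n")[1:]:  # first line is the status line
        name, _, value = line.partition(b":")
        headers[name.strip().lower()] = value.strip().lower()
    if b"chunked" in headers.get(b"transfer-encoding", b""):
        return decode_chunked(body)
    return body
```

For the response in the question, the `19` line is the chunk size in hex (25 bytes), which is why it shows up in front of the text.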

Cloudfront to use different Content-Type for compressed and uncompressed files

I am serving a website generated by a static site generator via S3 and Cloudfront. The files are being uploaded to S3 with correct Content-Types. The DNS points to Cloudfront, which uses the S3 bucket as its origin. Cloudfront takes care of encryption and compression. I told Cloudfront to compress objects automatically. That worked fine until I decided to change some of the images from PNG to SVG.
Whenever a file is requested as uncompressed it is delivered as is with the set Content-Type (image/svg+xml) and the site is rendered correctly. However, if the file is requested as compressed it is delivered with the default Content-Type (application/octet-stream) and the image is missing in the rendering. If I then right-click on the image and choose to open the image in a new tab, it will be shown correctly (without the rest of the page).
The result is the same independent of the used browser. In Firefox I know how to set it to force requesting compressed or uncompressed pages. I also tried curl to check the headers. These are the results:
λ curl --compressed -v -o /dev/null http://dev.example.com/img/logo-6998bdf68c.svg
* STATE: INIT => CONNECT handle 0x20049798; line 1090 (connection #-5000)
* Added connection 0. The cache now contains 1 members
* Trying 52.222.157.200...
* STATE: CONNECT => WAITCONNECT handle 0x20049798; line 1143 (connection #0)
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0* Connected to dev.example.com (52.222.157.200) port 80 (#0)
* STATE: WAITCONNECT => SENDPROTOCONNECT handle 0x20049798; line 1240 (connection #0)
* STATE: SENDPROTOCONNECT => DO handle 0x20049798; line 1258 (connection #0)
> GET /img/logo-6998bdf68c.svg HTTP/1.1
> Host: dev.example.com
> User-Agent: curl/7.44.0
> Accept: */*
> Accept-Encoding: deflate, gzip
>
* STATE: DO => DO_DONE handle 0x20049798; line 1337 (connection #0)
* STATE: DO_DONE => WAITPERFORM handle 0x20049798; line 1464 (connection #0)
* STATE: WAITPERFORM => PERFORM handle 0x20049798; line 1474 (connection #0)
* HTTP 1.1 or later with persistent connection, pipelining supported
< HTTP/1.1 200 OK
< Content-Type: application/octet-stream
< Content-Length: 7468
< Connection: keep-alive
< Date: Wed, 01 Mar 2017 13:31:33 GMT
< x-amz-meta-cb-modifiedtime: Wed, 01 Mar 2017 13:28:26 GMT
< Last-Modified: Wed, 01 Mar 2017 13:30:24 GMT
< ETag: "6998bdf68c8812d193dd799c644abfb6"
* Server AmazonS3 is not blacklisted
< Server: AmazonS3
< X-Cache: RefreshHit from cloudfront
< Via: 1.1 36c13eeffcddf77ad33d7874b28e6168.cloudfront.net (CloudFront)
< X-Amz-Cf-Id: jT86EeNn2vFYAU2Jagj_aDx6qQUBXFqiDhlcdfxLKrj5bCdAKBIbXQ==
<
{ [7468 bytes data]
* STATE: PERFORM => DONE handle 0x20049798; line 1632 (connection #0)
* Curl_done
100 7468 100 7468 0 0 44526 0 --:--:-- --:--:-- --:--:-- 48493
* Connection #0 to host dev.example.com left intact
* Expire cleared
and for uncompressed it looks better:
λ curl -v -o /dev/null http://dev.example.com/img/logo-6998bdf68c.svg
* STATE: INIT => CONNECT handle 0x20049798; line 1090 (connection #-5000)
* Added connection 0. The cache now contains 1 members
* Trying 52.222.157.203...
* STATE: CONNECT => WAITCONNECT handle 0x20049798; line 1143 (connection #0)
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0* Connected to dev.example.com (52.222.157.203) port 80 (#0)
* STATE: WAITCONNECT => SENDPROTOCONNECT handle 0x20049798; line 1240 (connection #0)
* STATE: SENDPROTOCONNECT => DO handle 0x20049798; line 1258 (connection #0)
> GET /img/logo-6998bdf68c.svg HTTP/1.1
> Host: dev.example.com
> User-Agent: curl/7.44.0
> Accept: */*
>
* STATE: DO => DO_DONE handle 0x20049798; line 1337 (connection #0)
* STATE: DO_DONE => WAITPERFORM handle 0x20049798; line 1464 (connection #0)
* STATE: WAITPERFORM => PERFORM handle 0x20049798; line 1474 (connection #0)
* HTTP 1.1 or later with persistent connection, pipelining supported
< HTTP/1.1 200 OK
< Content-Type: image/svg+xml
< Content-Length: 7468
< Connection: keep-alive
< Date: Wed, 01 Mar 2017 20:56:11 GMT
< x-amz-meta-cb-modifiedtime: Wed, 01 Mar 2017 20:39:17 GMT
< Last-Modified: Wed, 01 Mar 2017 20:41:13 GMT
< ETag: "6998bdf68c8812d193dd799c644abfb6"
* Server AmazonS3 is not blacklisted
< Server: AmazonS3
< Vary: Accept-Encoding
< X-Cache: RefreshHit from cloudfront
< Via: 1.1 ac27d939fa02703c4b28926f53f95083.cloudfront.net (CloudFront)
< X-Amz-Cf-Id: AlodMvGOKIoNb8zm5OuS7x_8TquQXzAAXg05efSMdIKgrPhwEPv4kA==
<
{ [2422 bytes data]
* STATE: PERFORM => DONE handle 0x20049798; line 1632 (connection #0)
* Curl_done
100 7468 100 7468 0 0 27667 0 --:--:-- --:--:-- --:--:-- 33639
* Connection #0 to host dev.example.com left intact
I don't want to switch off the compression for performance reasons. And it looks that this only happens for SVG file types. All other types have the correct, ie. the same Content-Type. I already tried to invalidate the cache and even switched it off completely by setting the cache time to 0 seconds. I cannot upload a compressed version when uploading to S3 because the upload process is automated and cannot easily be changed for a single file.
I hope that I did something wrong because that would be easiest to be fixed. But I have no clue what could be wrong with the setting. I already used Google to find someone having a similar issue, but it looks like it's only me. Anyone, who has an idea?
You're misdiagnosing the problem. CloudFront doesn't change the Content-Type.
CloudFront, however, does cache different versions of the same object, based on variations in the request.
If you notice, your Last-Modified times on these objects are different. You originally had the content-type set wrong in S3. You subsequently fixed that, but CloudFront doesn't realize the metadata has changed, since the ETag didn't change, so you're getting erroneous RefreshHit responses. It's serving the older version on requests that advertise gzip encoding support. If the actual payload of the object had changed, CloudFront would have likely already updated its cache.
Do an invalidation to purge the cache and within a few minutes, this issue should go away.
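If you'd rather script the invalidation than click through the console, a sketch with boto3 could look like the following. The distribution ID and paths are placeholders you would supply, and the helper names are my own; the only boto3 call used is `create_invalidation` on the CloudFront client.

```python
import time


def invalidation_batch(paths, caller_reference=None):
    """Build the InvalidationBatch structure expected by create_invalidation."""
    paths = list(paths)
    return {
        "Paths": {"Quantity": len(paths), "Items": paths},
        # CallerReference must be unique per request; a timestamp is one option.
        "CallerReference": caller_reference or str(time.time()),
    }


def invalidate(distribution_id, paths):
    """Ask CloudFront to drop the given paths from its edge caches."""
    import boto3  # imported lazily so the helper above works without boto3

    client = boto3.client("cloudfront")
    return client.create_invalidation(
        DistributionId=distribution_id,
        InvalidationBatch=invalidation_batch(paths),
    )
```

For the case in the question, something like `invalidate("<your distribution id>", ["/img/*"])` would cover the SVG files.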
I was able to solve it by forcing the MIME type to "image/svg+xml" instead of "binary/octet-stream" which was selected after syncing the files with python boto3.
When you right-click on an SVG in an S3 bucket, you can check the MIME type by viewing the object's metadata.
I'm not sure if this was caused by the python sync, or by some weirdness in S3/Cloudfront. I have to add that just a cache invalidation didn't work after that. I had to re-upload my files with the correct mimetype to get cloudfront access to the svg working ok.
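For the re-upload, here is a sketch of how the Content-Type could be set explicitly with boto3 instead of leaving it to a default. The `OVERRIDES` table and both function names are my own; bucket, key and path are placeholders.

```python
import mimetypes
import os

# Explicit overrides win over whatever the local MIME database says.
OVERRIDES = {".svg": "image/svg+xml"}


def content_type_for(path):
    """Pick a Content-Type for `path`, falling back to a generic binary type."""
    ext = os.path.splitext(path)[1].lower()
    if ext in OVERRIDES:
        return OVERRIDES[ext]
    guessed, _ = mimetypes.guess_type(path)
    return guessed or "application/octet-stream"


def upload_with_type(path, bucket, key):
    """Upload one file to S3 with its Content-Type set explicitly."""
    import boto3  # imported lazily so content_type_for works without boto3

    s3 = boto3.client("s3")
    s3.upload_file(path, bucket, key,
                   ExtraArgs={"ContentType": content_type_for(path)})
```

After re-uploading this way, an invalidation is probably still needed so that CloudFront fetches the corrected metadata.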

mod_sec trigger on CRS rule _23

I'm using mod_security with the latest core rules.
It triggers on all my pages whenever I use a query string, e.g.
www.mypage.com/index.php?querystring=1
I get a warning that the request exceeds the maximum allowed number of arguments; however, the base config sets max_num_args to 255, which of course it doesn't exceed.
Any ideas why?
Base conf:
SecRuleEngine On
SecAuditEngine RelevantOnly
SecAuditLog /var/log/apache2/modsec_audit.log
SecDebugLog /var/log/apache2/modsec_debug_log
SecDebugLogLevel 3
SecDefaultAction "phase:2,pass,log,status:500"
SecRule REMOTE_ADDR "^127.0.0.1$" nolog,allow
SecRequestBodyAccess On
SecResponseBodyAccess On
SecResponseBodyMimeType (null) text/html text/plain text/xml
SecResponseBodyLimit 2621440
SecServerSignature Apache
SecUploadDir /tmp
SecUploadKeepFiles Off
SecAuditLogParts ABIFHZ
SecArgumentSeparator "&"
SecCookieFormat 0
SecRequestBodyInMemoryLimit 131072
SecDataDir /tmp
SecTmpDir /tmp
SecAuditLogStorageDir /var/log/apache2/audit
SecResponseBodyLimitAction ProcessPartial
SecAction "phase:1,t:none,nolog,pass,setvar:tx.max_num_args=255"
Rule that triggers:
# Maximum number of arguments in request limited
SecRule &TX:MAX_NUM_ARGS "@eq 1" "chain,phase:2,t:none,pass,nolog,auditlog,msg:'Maximum number of arguments in request reached',id:'960335',severity:'4',rev:'2.0.7'"
SecRule &ARGS "@gt %{tx.max_num_args}" "t:none,setvar:'tx.msg=%{rule.msg}',setvar:tx.anomaly_score=+%{tx.notice_anomaly_score},setvar:tx.policy_score=+%{tx.notice_anomaly_score},setvar:tx.%{rule.id}-POLICY/SIZE_LIMIT-%{matched_var_name}=%{matched_var}"
And the log output:
--ad5dc005-C--
queryString=2
--ad5dc005-F--
HTTP/1.1 200 OK
X-Powered-By: PHP/5.3
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
Set-Cookie: SESSION=ak19oq36gpi94rco2qbi6j2k20; path=/
Vary: Accept-Encoding
Content-Encoding: gzip
Content-Length: 1272
Keep-Alive: timeout=15, max=99
Connection: Keep-Alive
Content-Type: text/html; charset=utf-8
--ad5dc005-H--
Message: Operator GT matched 0 at ARGS. [file "/etc/apache2/conf/modsecurity_crs/base_rules/modsecurity_crs_23_request_limits.conf"] [line "30"] [id "960335"] [rev "2.0.7"] [msg "Maximum number of arguments in request reached"] [severity "WARNING"]
Message: Operator GE matched 0 at TX:anomaly_score. [file "/etc/apache2/conf/modsecurity_crs/base_rules/modsecurity_crs_49_inbound_blocking.conf"] [line "18"] [msg "Inbound Anomaly Score Exceeded (Total Score: 5, SQLi=, XSS=): Maximum number of arguments in request reached"]
Message: Warning. Operator GE matched 0 at TX:inbound_anomaly_score. [file "/etc/apache2/conf/modsecurity_crs/base_rules/modsecurity_crs_60_correlation.conf"] [line "35"] [msg "Inbound Anomaly Score Exceeded (Total Inbound Score: 5, SQLi=, XSS=): Maximum number of arguments in request reached"]
Apache-Handler: application/x-httpd-php
Stopwatch: 1279667800315092 76979 (1546* 7522 72931)
Producer: ModSeurity for Apache/2.5.11 (http://www.modsecurity.org/); core ruleset/2.0.7.
Server: Apache
I was using the package from Ubuntu, which shipped version 2.5.11. I uninstalled it and compiled version 2.5.12 from source, and now it's alive, kicking and screaming!
The latest CRS rules need the .12 version. Cheers.
