Unable to programmatically download Kaggle dataset "nih-chest-xrays/data" specific file Version 3 - python-3.x

I need to download Kaggle's NIH Chest X-rays dataset programmatically, specifically the file Data_Entry_2017.csv.
I want to download it if it doesn't exist on the system, and re-download it if its update date on Kaggle is newer than the date of my last download.
I know I can do this manually, but I'd appreciate it if there were a way to do it programmatically.
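Conceptually, the freshness check I'm after looks like this (a minimal sketch; download_data_entry() is a placeholder for whichever download call ends up working, which is what this question is about):

import datetime
import os
import kaggle

CSV_PATH = 'data/Data_Entry_2017.csv'

def ensure_data_entry():
    # Look the dataset up by ref so we don't depend on search result ordering.
    meta = next(d for d in kaggle.api.dataset_list(search='nih-chest-xrays')
                if d.ref == 'nih-chest-xrays/data')
    if os.path.exists(CSV_PATH):
        local_mtime = datetime.datetime.fromtimestamp(os.path.getmtime(CSV_PATH))
        if meta.lastUpdated <= local_mtime:
            return  # local copy is up to date
    download_data_entry()  # placeholder: the missing piece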
Here is what I have tried:
Direct download:
kaggle.api.dataset_download_file(dataset='nih-chest-xrays/data', file_name='Data_Entry_2017.csv', path='data/')
This gives:
File "C:\Program Files\Python311\Lib\site-packages\kaggle\rest.py", line 241, in request
raise ApiException(http_resp=r)
kaggle.rest.ApiException: (404)
Reason: Not Found
HTTP response headers: HTTPHeaderDict({'Content-Type': 'application/json', 'Date': 'Sun, 18 Dec
2022 03:35:46 GMT', 'Access-Control-Allow-Credentials': 'true', 'Set-Cookie':
'ka_sessionid=bdaa2b71c677d6e48bf93fc71b85a9c6; max-age=2626560; path=/, GCLB=CIeroe2g5sixeA;
path=/; HttpOnly', 'Transfer-Encoding': 'chunked', 'Vary': 'Accept-Encoding', 'Turbolinks-
Location': 'https://www.kaggle.com/api/v1/datasets/download/nih-chest-
xrays/data/Data_Entry_2017.csv', 'X-Kaggle-MillisecondsElapsed': '526', 'X-Kaggle-RequestId':
'03383f8cdac3aa4b6ab83d5dadbc507c', 'X-Kaggle-ApiVersion': '1.5.12', 'X-Frame-Options':
'SAMEORIGIN', 'Strict-Transport-Security': 'max-age=63072000; includeSubDomains; preload',
'Content-Security-Policy': "object-src 'none'; script-src 'nonce-gn6A4mfrbofbZRyE9gREnA=='
'report-sample' 'unsafe-inline' 'unsafe-eval' 'strict-dynamic' https: http:; frame-src 'self'
https://www.kaggleusercontent.com https://www.youtube.com/embed/ https://polygraph-cool.github.io
https://www.google.com/recaptcha/ https://form.jotform.com https://submit.jotform.us
https://submit.jotformpro.com https://submit.jotform.com https://www.docdroid.com
https://www.docdroid.net https://kaggle-static.storage.googleapis.com https://kaggle-static-
staging.storage.googleapis.com https://kkb-dev.jupyter-proxy.kaggle.net https://kkb-
staging.jupyter-proxy.kaggle.net https://kkb-production.jupyter-proxy.kaggle.net https://kkb-
dev.firebaseapp.com https://kkb-staging.firebaseapp.com
https://kkb-production.firebaseapp.com https://kaggle-metastore-test.firebaseapp.com
https://kaggle-metastore.firebaseapp.com https://apis.google.com https://content-
sheets.googleapis.com/ https://accounts.google.com/ https://storage.googleapis.com
https://docs.google.com https://drive.google.com https://calendar.google.com/;
base-uri 'none'; report-uri https://csp.withgoogle.com/csp/kaggle/20201130;", 'X-Content-Type-
Options': 'nosniff', 'Referrer-Policy': 'strict-origin-when-cross-origin', 'Via': '1.1 google',
'Alt-Svc': 'h3=":443"; ma=2592000,h3-29=":443"; ma=2592000'})
HTTP response body: b'{"code":404,"message":"Not found"}'
The same happens for any other file in this dataset:
kaggle.api.dataset_download_file(dataset='nih-chest-xrays/data', file_name='test_list.txt', path='data/')
Which again gives:
kaggle.rest.ApiException: (404)
Reason: Not Found
HTTP response headers: (identical in shape to the previous error; 'Turbolinks-Location': 'https://www.kaggle.com/api/v1/datasets/download/nih-chest-xrays/data/test_list.txt')
HTTP response body: b'{"code":404,"message":"Not found"}'
The strange thing is that I can search this dataset:
dataset_list = kaggle.api.dataset_list(search='nih')
print(dataset_list)
Which gives:
[nih-chest-xrays/data, nih-chest-xrays/sample, allen-institute-for-ai/CORD-19-research-challenge, kmader/nih-deeplesion-subset, nickuzmenkov/nih-chest-xrays-tfrecords, iarunava/cell-images-for-detecting-malaria, tunguz/covid19-genomes, nlm-nih/nlm-rxnorm, akhileshdkapse/nih-image-600x600-data, tunguz/nih-awarded-grant-text, danielmadmon/nih-xray-dataset-tfrec-with-labels, miracle9to9/files1, nlm-nih/rxnorm-drug-name-conventions, kmader/rsna-bone-age, redwankarimsony/chestxray8-dataframe, akhileshdkapse/nih-dataframe, luigisaetta/cxr-tfrec256-may2020, kokyew93/nihdata, ammarali32/startingpointschestx, dannellyz/cancer-incidence-totals-and-rates-per-us-county]
It's literally the first one on the list.
And checking its metadata:
dataset = vars(dataset_list[0])
print(dataset)
Which gives:
{'subtitleNullable': 'Over 112,000 Chest X-ray images from more than 30,000 unique patients',
'creatorNameNullable': 'Timo Bozsolik', 'creatorUrlNullable': 'timoboz', 'totalBytesNullable':
45096150231, 'urlNullable': 'https://www.kaggle.com/datasets/nih-chest-xrays/data',
'licenseNameNullable': 'CC0: Public Domain', 'descriptionNullable': None, 'ownerNameNullable':
'National Institutes of Health Chest X-Ray Dataset', 'ownerRefNullable': 'nih-chest-xrays',
'titleNullable': 'NIH Chest X-rays', 'currentVersionNumberNullable': 3,
'usabilityRatingNullable': 0.7352941, 'id': 5839, 'ref': 'nih-chest-xrays/data', 'subtitle':
'Over 112,000 Chest X-ray images from more than 30,000 unique patients', 'hasSubtitle': True,
'creatorName': 'Timo Bozsolik', 'hasCreatorName': True, 'creatorUrl': 'timoboz', 'hasCreatorUrl':
True, 'totalBytes': 45096150231, 'hasTotalBytes': True, 'url':
'https://www.kaggle.com/datasets/nih-chest-xrays/data', 'hasUrl': True, 'lastUpdated':
datetime.datetime(2018, 2, 21, 20, 52, 23), 'downloadCount': 65214, 'isPrivate': False,
'isFeatured': False, 'licenseName': 'CC0: Public Domain', 'hasLicenseName': True, 'description':
'', 'hasDescription': False, 'ownerName': 'National Institutes of Health Chest X-Ray Dataset',
'hasOwnerName': True, 'ownerRef': 'nih-chest-xrays', 'hasOwnerRef': True, 'kernelCount': 323,
'title': 'NIH Chest X-rays', 'hasTitle': True, 'topicCount': 0, 'viewCount': 454490, 'voteCount':
967, 'currentVersionNumber': 3, 'hasCurrentVersionNumber': True, 'usabilityRating': 0.7352941,
'hasUsabilityRating': True, 'tags': [biology, health, medicine, computer science, software,
health conditions], 'files': [], 'versions': [], 'size': '42GB'}
I can even list the files for that dataset, and the file I need is there:
print(kaggle.api.dataset_list_files('nih-chest-xrays/data').files)
Which gives:
[BBox_List_2017.csv, LOG_CHESTXRAY.pdf, ARXIV_V5_CHESTXRAY.pdf, README_CHESTXRAY.pdf,
train_val_list.txt, Data_Entry_2017.csv, FAQ_CHESTXRAY.pdf, test_list.txt]
A problem I noticed was that kaggle.api.dataset_download_file was only fetching from Version 1, where these files don't exist. This was further confirmed when I successfully fetched an image file from Version 1:
res = kaggle.api.dataset_download_file(dataset='nih-chest-xrays/data', file_name='images_001/images/00000001_000.png', path='data/')
print(res)
Which downloads the image and prints True.
The file I need is in Version 3. Is there any way to configure kaggle to use Version 3?
Note that I can download the whole 45GB dataset, but I only require the single file Data_Entry_2017.csv.
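One workaround I'm considering is to skip the client and call the v1 REST endpoint directly, pinning the version with the datasetVersionNumber query parameter that the Kaggle API docs mention for dataset downloads (I'm assuming it is also honoured for single-file downloads; I haven't confirmed that). A sketch, authenticating with the username/key the CLI stores in kaggle.json:

import json
import os
import requests

# Credentials as stored by the kaggle CLI (~/.kaggle/kaggle.json)
with open(os.path.expanduser('~/.kaggle/kaggle.json')) as f:
    creds = json.load(f)

url = ('https://www.kaggle.com/api/v1/datasets/download/'
       'nih-chest-xrays/data/Data_Entry_2017.csv')
resp = requests.get(url, params={'datasetVersionNumber': 3},  # assumed to apply to file downloads
                    auth=(creds['username'], creds['key']), stream=True)
resp.raise_for_status()
# Kaggle may serve the file zipped; check resp.headers.get('Content-Type')
# and unzip afterwards if needed.
with open('data/Data_Entry_2017.csv.tmp', 'wb') as out:
    for chunk in resp.iter_content(chunk_size=1 << 20):
        out.write(chunk)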
Even more strangely, it's also throwing an error when checking its status:
print(kaggle.api.dataset_status(dataset='nih-chest-xrays/data'))
Which gives:
Traceback (most recent call last):
  File "C:\Users\jaide\OneDrive\Documents\MScCS2022\Data-Mining-I\chest-datasets\visualise.py", line 117, in <module>
    full_df = load_nih_data()
    ^^^^^^^^^^^^^^^
  File "C:\Users\jaide\OneDrive\Documents\MScCS2022\Data-Mining-I\chest-datasets\visualise.py", line 89, in load_nih_data
    print(kaggle.api.dataset_status(dataset='nih-chest-xrays/data'))
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Program Files\Python311\Lib\site-packages\kaggle\api\kaggle_api_extended.py", line 1135, in dataset_status
    self.datasets_status_with_http_info(owner_slug=owner_slug,
  File "C:\Program Files\Python311\Lib\site-packages\kaggle\api\kaggle_api.py", line 1910, in datasets_status_with_http_info
    return self.api_client.call_api(
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Program Files\Python311\Lib\site-packages\kaggle\api_client.py", line 329, in call_api
    return self.__call_api(resource_path, method,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Program Files\Python311\Lib\site-packages\kaggle\api_client.py", line 161, in __call_api
    response_data = self.request(
                    ^^^^^^^^^^^^^
  File "C:\Program Files\Python311\Lib\site-packages\kaggle\api_client.py", line 351, in request
    return self.rest_client.GET(url,
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Program Files\Python311\Lib\site-packages\kaggle\rest.py", line 247, in GET
    return self.request("GET", url,
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Program Files\Python311\Lib\site-packages\kaggle\rest.py", line 241, in request
    raise ApiException(http_resp=r)
kaggle.rest.ApiException: (404)
Reason: Not Found
HTTP response headers: (same shape as the download errors above; 'Turbolinks-Location': 'https://www.kaggle.com/api/v1/datasets/status/nih-chest-xrays/data')
HTTP response body: {"code":404,"message":"Not found"}
CLI download
Check if dataset exists:
C:\Users\jaide\OneDrive\Documents\MScCS2022\Data-Mining-I\chest-datasets>kaggle datasets list -s "nih-chest-xrays/data"
ref title size lastUpdated downloadCount voteCount usabilityRating
------------------------------------------------------------- ----------------------------------------------- ----- ------------------- ------------- --------- ---------------
nickuzmenkov/nih-chest-xrays-tfrecords NIH Chest X-rays TFRecords 11GB 2021-03-09 04:49:23 2121 119 0.9411765
nih-chest-xrays/data NIH Chest X-rays 42GB 2018-02-21 20:52:23 65221 967 0.7352941
nih-chest-xrays/sample Random Sample of NIH Chest X-ray Dataset 4GB 2017-11-23 02:58:24 14930 241 0.7647059
ammarali32/startingpointschestx StartingPoints-ChestX 667MB 2021-02-27 17:35:19 668 68 0.6875
redwankarimsony/chestxray8-dataframe ChestX-ray8_DataFrame 51MB 2020-07-12 02:28:33 416 10 0.64705884
amanullahasraf/covid19-pneumonia-normal-chest-xray-pa-dataset COVID19_Pneumonia_Normal_Chest_Xray_PA_Dataset 2GB 2020-07-13 05:54:22 1482 14 0.8125
harshsoni/nih-chest-xray-tfrecords NIH Chest X-ray TFRecords 42GB 2020-09-26 18:48:04 52 6 0.5294118
amanullahasraf/covid19-pneumonia-normal-chest-xraypa-dataset COVID19_Pneumonia_Normal_Chest_Xray(PA)_Dataset 1GB 2020-06-29 08:28:01 161 3 0.5882353
roderikmogot/nih-chest-x-ray-models nih chest x ray models 2GB 2022-08-17 01:32:33 66 4 0.47058824
ericspod/project-monai-2020-bootcamp-challenge-dataset Project MONAI 2020 Bootcamp Challenge Dataset 481MB 2021-01-25 01:11:42 22 2 0.6875
sunghyunjun/nih-chest-xrays-600-jpg-dataset NIH Chest X-rays 600 JPG Dataset 7GB 2021-03-15 07:53:12 35 3 0.64705884
zhuangjw/chest-xray-cleaned chest_xray_cleaned 2GB 2019-11-30 19:05:24 13 1 0.375
kambarakun/nih-chest-xrays-trained-models NIH Chest X-rays: Trained Models 3GB 2020-04-30 11:52:56 10 2 0.3125
jessevent/all-kaggle-datasets Complete Kaggle Datasets Collection 390KB 2018-01-16 12:32:58 2256 118 0.8235294
vzaguskin/nih-chest-xrays-tfrecords-756 NIH Chest X-rays tfrecords 756 11GB 2021-02-25 14:25:12 3 1 0.29411766
Yes, it is the second one. Now, does it have the file I need?
C:\Users\jaide\OneDrive\Documents\MScCS2022\Data-Mining-I\chest-datasets>kaggle datasets files "nih-chest-xrays/data"
name size creationDate
---------------------- ----- -------------------
ARXIV_V5_CHESTXRAY.pdf 9MB 2018-02-21 20:52:23
FAQ_CHESTXRAY.pdf 71KB 2018-02-21 20:52:23
Data_Entry_2017.csv 7MB 2018-02-21 20:52:23
BBox_List_2017.csv 90KB 2018-02-21 20:52:23
README_CHESTXRAY.pdf 827KB 2018-02-21 20:52:23
LOG_CHESTXRAY.pdf 4KB 2018-02-21 20:52:23
train_val_list.txt 1MB 2018-02-21 20:52:23
test_list.txt 425KB 2018-02-21 20:52:23
Yes, it does. Now download it:
C:\Users\jaide\OneDrive\Documents\MScCS2022\Data-Mining-I\chest-datasets>kaggle datasets download -f "Data_Entry_2017.csv" -p "data/" "nih-chest-xrays/data"
404 - Not Found
Nope, error 404.
How do I download Kaggle's NIH Chest X-rays Dataset programmatically, specifically the Data_Entry_2017.csv?
Any help is appreciated!
I have checked:
I checked out the Kaggle Python API, but it doesn't appear to have any way to select a version.
I checked out the Kaggle API on GitHub, but it also doesn't appear to have any way to select a version.
Related Questions Read:
Kaggle Dataset Download - irrelevant.
Download a Kaggle Dataset - is about downloading competition files.
How to download data set into Colab? Stuck with a problem, it says "401 - Unauthorized"? - they had API key issues.
How to load just one chosen file of a way too large Kaggle dataset from Kaggle into Colab - is about Kaggle and Jupyter notebooks; I am just using a plain Python script.
Trouble turning comorbidity data into a table using Python and Pandas - again uses Kernels (Kaggle Notebooks).
How to download kaggle dataset? - downloads the whole dataset.

Related

Cypress: intercept a network request with compression-type gzip to simulate mapbox

I am currently having trouble mocking a particular request mapbox-gl makes. When the map is loaded from Mapbox, .pbf files are requested, and I have not been able to mock them.
My guess is that the core issue is an open bug with Cypress, issue-16420.
I tried a lot of different intercept variants. I tried all kinds of response headers. I gzipped, compressed, and brotli-compressed the file that I serve via fixture. I tried different encodings for the fixture. Nothing worked. One of the interceptors looks basically like this:
cy.intercept({
  method: 'GET',
  url: '**/fonts/v1/mapbox/DIN%20Offc%20Pro%20Italic,Arial%20Unicode%20MS%20Regular/0-255.pbf?*',
}, {
  fixture: 'fonts/italic.arial.0-255.pbf,binary',
  statusCode: 204,
  headers: {
    'Connection': 'keep-alive',
    'Keep-Alive': 'timeout=5',
    'Transfer-Encoding': 'chunked',
    'access-control-allow-origin': '*',
    'access-control-expose-headers': 'Link',
    'age': '11631145',
    'cache-control': 'max-age=31536000',
    'content-encoding': 'compress',
    'content-type': 'application/x-protobuf',
    'date': 'Sat, 19 Feb 2022 20:46:43 GMT',
    'etag': 'W/"b040-+eCb/OHkPqToOcONTDlvpCrjmvs"',
    'via': '1.1 4dd80d99fd5d0f6baaaf5179cd921f72.cloudfront.net (CloudFront)',
    'x-amz-cf-id': '4uY9rjBgR_R12nkfHFrBMLEpNuWygW9DkmODlMEzwJHABTGCGg8pww==',
    'x-amz-cf-pop': 'FRA56-P7',
    'x-cache': 'Hit from cloudfront',
    'x-origin': 'Mbx-Fonts'
  }
}).as('get.0-255.pbf').as('getItalicArial0-255');
Now, even if this is a bug, there has to be some kind of workaround to serve the file in a Cypress test without an active internet connection. It would be great not to have to rely on the network in tests, so all kinds of workarounds and dirty tricks are welcome to make this intercept work.

Stuck with xml download using python. How to handle that?

I need a hint from you about an issue I'm handling. I'm using requests to do some web scraping in Python; the URL gives me a file to download, but when I get the content from the request, I get the following result:
b'"PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0iVVRGLTgiIHN0YW5kYWxvbmU9InllcyI/Pg0KPERhZG9zRWNvbm9taWNvRmluYW5jZWlyb3MgeG1sbnM6eHNpPSJodHRwOi8vd3d3LnczLm9yZy8yMDAxL1hNTFNjaGVtYS1pbnN0YW5jZSI+DQoJPERhZG9zR2VyYWlzPg0KCQk8Tm9tZUZ1bmRvPkZJSSBCVEdQIExPR0lTVElDQTwvTm9tZUZ1bmRvPg0KCQk8Q05QSkZ1bmRvPjExODM5NTkzMDAwMTA5PC9DTlBKRnVuZG8+DQoJCTxOb21lQWRtaW5pc3RyYWRvcj5CVEcgUGFjdHVhbCBTZXJ2acOnb3MgRmluYW5jZWlyb3MgUy5BLiBEVFZNPC9Ob21lQWRtaW5pc3RyYWRvcj4NCgkJPENOUEpBZG1pbmlzdHJhZG9yPjU5MjgxMjUzMDAwMTIzPC9DTlBKQWRtaW5pc3RyYWRvcj4NCgkJPFJlc3BvbnNhdmVsSW5mb3JtYWNhbz5MdWNhcyBNYXNzb2xhPC9SZXNwb25zYXZlbEluZm9ybWFjYW8+DQoJCTxUZWxlZm9uZUNvbnRhdG8+KDExKSAzMzgzLTI1MTM8L1RlbGVmb25lQ29udGF0bz4NCgkJPENvZElTSU5Db3RhPkJSQlRMR0NURjAwMDwvQ29kSVNJTkNvdGE+DQoJCTxDb2ROZWdvY2lhY2FvQ290YT5CVExHMTE8L0NvZE5lZ29jaWFjYW9Db3RhPg0KCTwvRGFkb3NHZXJhaXM+DQoJPEluZm9ybWVSZW5kaW1lbnRvcz4NCgkJPFJlbmRpbWVudG8+DQoJCQk8RGF0YUFwcm92YWNhbz4yMDIxLTEyLTE1PC9EYXRhQXByb3ZhY2FvPg0KCQkJPERhdGFCYXNlPjIwMjEtMTItMTU8L0RhdGFCYXNlPg0KCQkJPERhdGFQYWdhbWVudG8+MjAyMS0xMi0yMzwvRGF0YVBhZ2FtZW50bz4NCgkJCTxWYWxvclByb3ZlbnRvQ290YT4wLjcyPC9WYWxvclByb3ZlbnRvQ290YT4NCgkJCTxQZXJpb2RvUmVmZXJlbmNpYT5Ob3ZlbWJybzwvUGVyaW9kb1JlZmVyZW5jaWE+DQoJCQk8QW5vPjIwMjE8L0Fubz4NCgkJCTxSZW5kaW1lbnRvSXNlbnRvSVI+dHJ1ZTwvUmVuZGltZW50b0lzZW50b0lSPg0KCQk8L1JlbmRpbWVudG8+DQoJCTxBbW9ydGl6YWNhbyB0aXBvPSIiLz4NCgk8L0luZm9ybWVSZW5kaW1lbnRvcz4NCjwvRGFkb3NFY29ub21pY29GaW5hbmNlaXJvcz4="'
and these headers:
{'Date': 'Thu, 13 Jan 2022 13:25:03 GMT', 'Set-Cookie': 'dtCookie=v_4_srv_27_sn_A24AD4C76E5194F3DB0056C40CBABEF7_perc_100000_ol_0_mul_1_app-3A97e61c3a8a7c6a0b_1_rcs-3Acss_0; Path=/; Domain=.bmfbovespa.com.br, JSESSIONID=LWB+pcQEPreUbb+BtwZ9pyOm.sfnNODE01; Path=/fnet; Secure; HttpOnly, TS01871345=011d592ce1f641d52fa6af8d3b5a924eddc7997db2f6611d8d70aeab610f5e34ea2706a45b6f2c35f2b500d01fc681c74e5caa356c; Path=/; HTTPOnly, TS01e3f871=011d592ce1f641d52fa6af8d3b5a924eddc7997db2f6611d8d70aeab610f5e34ea2706a45b6f2c35f2b500d01fc681c74e5caa356c; path=/; domain=.bmfbovespa.com.br; HTTPonly, TS01d1c2dd=011d592ce1f641d52fa6af8d3b5a924eddc7997db2f6611d8d70aeab610f5e34ea2706a45b6f2c35f2b500d01fc681c74e5caa356c; path=/fnet; HTTPonly', 'X-OneAgent-JS-Injection': 'true', 'X-Frame-Options': 'SAMEORIGIN', 'Cache-Control': 'no-cache, no-store, must-revalidate', 'Pragma': 'no-cache', 'Expires': '0', 'Content-Disposition': 'attachment; filename="08706065000169-ACE28022020V01-000083505.xml"', 'Server-Timing': 'dtRpid;desc="258920448"', 'Connection': 'close', 'Content-Type': 'text/xml', 'X-XSS-Protection': '1; mode=block', 'Transfer-Encoding': 'chunked'}
But it works perfectly and download the .xml file when I point the browser to https://fnet.bmfbovespa.com.br/fnet/publico/downloadDocumento?id=247031 URL address, for example, with the following data
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<DadosEconomicoFinanceiros xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<DadosGerais>
<NomeFundo>FII BTGP LOGISTICA</NomeFundo>
<CNPJFundo>11839593000109</CNPJFundo>
<NomeAdministrador>BTG Pactual Serviços Financeiros S.A. DTVM</NomeAdministrador>
<CNPJAdministrador>59281253000123</CNPJAdministrador>
<ResponsavelInformacao>Lucas Massola</ResponsavelInformacao>
<TelefoneContato>(11) 3383-2513</TelefoneContato>
<CodISINCota>BRBTLGCTF000</CodISINCota>
<CodNegociacaoCota>BTLG11</CodNegociacaoCota>
</DadosGerais>
<InformeRendimentos>
<Rendimento>
<DataAprovacao>2021-12-15</DataAprovacao>
<DataBase>2021-12-15</DataBase>
<DataPagamento>2021-12-23</DataPagamento>
<ValorProventoCota>0.72</ValorProventoCota>
<PeriodoReferencia>Novembro</PeriodoReferencia>
<Ano>2021</Ano>
<RendimentoIsentoIR>true</RendimentoIsentoIR>
</Rendimento>
<Amortizacao tipo=""/>
</InformeRendimentos>
</DadosEconomicoFinanceiros>
It seems to me that the data is encrypted, but I have no idea how to get the XML data out so I can use what's inside it. Can you help me?
Thank you very much.
EDIT:
The example code I've used is quite simple:
Python 3.8.10 (tags/v3.8.10:3d8993a, May 3 2021, 11:48:03) [MSC v.1928 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import requests
>>> url='fnet.bmfbovespa.com.br/fnet/publico/downloadDocumento?id=247031'
>>> xhtml = requests.get(url,verify=False, headers={'User-Agent':'Mozzila/5.0'})
Then the xhtml.content command shows the string. (There is an HTTPS warning due to verify=False, which I will handle later.)
I have also tried a solution using urllib.request, but got the same result.
The data seems to be base64-encoded. Try decoding it:
import requests
import base64
url = 'http://fnet.bmfbovespa.com.br/fnet/publico/downloadDocumento?id=247031'
response = requests.get(url,verify=False, headers={'User-Agent':'Mozzila/5.0'})
decoded = base64.b64decode(response.content)
print(decoded)
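Once decoded it is plain XML, so it can be parsed as usual, e.g. with the standard library (element names taken from the sample document above):

import xml.etree.ElementTree as ET

root = ET.fromstring(decoded)  # 'decoded' comes from base64.b64decode above
print(root.findtext('DadosGerais/NomeFundo'))  # FII BTGP LOGISTICA
print(root.findtext('InformeRendimentos/Rendimento/ValorProventoCota'))  # 0.72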

HTTP headers format using python's requests

I use Python's requests to capture a website's HTTP headers. For example, this is a response header:
{'Connection': 'keep-alive', 'Access-Control-Allow-Origin': '*', 'cache-control': 'max-age=600', 'Content-Type': 'text/html; charset=utf-8', 'Expires': 'Fri, 19 Apr 2019 03:16:28 GMT', 'Via': '1.1 varnish, 1.1 varnish', 'X-ESI': 'on', 'Verso': 'false', 'Accept-Ranges': 'none', 'Date': 'Fri, 19 Apr 2019 03:11:12 GMT', 'Age': '283', 'Set-Cookie': 'CN_xid=08f66bff-4001-4173-b4e2-71ac31bb58d7; Expires=Wed, 16 Oct 2019 03:11:12 GMT; path=/;, xid1=1; Expires=Fri, 19 Apr 2019 03:11:27 GMT; path=/;, verso_bucket=281; Expires=Sat, 18 Apr 2020 03:11:12 GMT; path=/;', 'X-Served-By': 'cache-iad2133-IAD, cache-gru17122-GRU', 'X-Cache': 'HIT, MISS', 'X-Cache-Hits': '1, 0', 'X-Timer': 'S1555643472.999490,VS0,VE302', 'Content-Security-Policy': "default-src https: data: 'unsafe-inline' 'unsafe-eval'; child-src https: data: blob:; connect-src https: data: blob:; font-src https: data:; img-src https: data: blob:; media-src https: data: blob:; object-src https:; script-src https: data: blob: 'unsafe-inline' 'unsafe-eval'; style-src https: 'unsafe-inline'; block-all-mixed-content; upgrade-insecure-requests; report-uri https://l.com/csp/gq", 'X-Fastly-Device-Detect': 'desktop', 'Strict-Transport-Security': 'max-age=7776000; preload', 'Vary': 'Accept-Encoding, Verso, Accept-Encoding', 'content-encoding': 'gzip', 'transfer-encoding': 'chunked'}
I noted that, in the several examples I tested, the headers I receive from requests are formatted as 'key': 'value' (please note the single quotes surrounding the key and the value). However, when I check the headers from Firefox -> Web Developer -> Inspector and choose to view the header in raw format, I do not see the quotes or commas:
HTTP/2.0 200 OK
date: Thu, 09 May 2019 18:49:07 GMT
expires: -1
cache-control: private, max-age=0
content-type: text/html; charset=UTF-8
strict-transport-security: max-age=31536000
content-encoding: br
server: gws
content-length: 55844
x-xss-protection: 0
x-frame-options: SAMEORIGIN
set-cookie: 1P_JAR=2019-05-09-18; expires=Sat, 08-Jun-2019 18:49:07 GMT; path=/; domain=.google.com
alt-svc: quic=":443"; ma=2592000; v="46,44,43,39"
X-Firefox-Spdy: h2
I need to know: does Python's requests module always add these quotes? This is important for me, as I need to include/exclude them in the regex I use to analyze the headers.
The issue, I think, is that requests gives you the headers back as a dict, while the Firefox inspector is giving you plain text. You could get mixed results if one of the value pairs has a numeric or boolean value, so when writing your regex you may want to use a try/except, strip the exterior quotes, or just use the value as given.
It's not the requests module that's adding those characters. requests represents headers as a dict, but you seem to be treating them as a string. When Python converts a dict to a string, you get the quotation marks, the commas, and the colons.
The right fix for your program is probably to treat the dictionary as a dictionary, not convert it into a string. But if you really want the headers in string form, you should consider using a different tool, such as curl.
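For example, to get something close to the raw key: value form, build it from the dict instead of running a regex over Python's string representation; a small sketch (the URL is just a placeholder):

import requests

r = requests.get('https://www.example.org')
# r.headers is a case-insensitive dict; the quotes, commas and braces you saw
# come from printing the dict, not from the HTTP response itself.
for name, value in r.headers.items():
    print('{}: {}'.format(name, value))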

Request email audit export fails with status 400 and "Premature end of file."

According to https://developers.google.com/admin-sdk/email-audit/#creating_a_mailbox_for_export, I am trying to request the email audit export of a user in G Suite this way:
def requestAuditExport(account):
    credentials = getCredentials()
    http = credentials.authorize(httplib2.Http())
    url = 'https://apps-apis.google.com/a/feeds/compliance/audit/mail/export/helpling.com/' + account
    status, response = http.request(url, 'POST', headers={'Content-Type': 'application/atom+xml'})
    print(status)
    print(response)
And I get the following result:
{'content-length': '22', 'expires': 'Tue, 13 Dec 2016 14:19:37 GMT', 'date': 'Tue, 13 Dec 2016 14:19:37 GMT', 'x-frame-options': 'SAMEORIGIN', 'transfer-encoding': 'chunked', 'x-xss-protection': '1; mode=block', 'content-type': 'text/html; charset=UTF-8', 'x-content-type-options': 'nosniff', '-content-encoding': 'gzip', 'server': 'GSE', 'status': '400', 'cache-control': 'private, max-age=0', 'alt-svc': 'quic=":443"; ma=2592000; v="35,34"'}
b'Premature end of file.'
I cannot see where the problem is, can someone please give me a hint?
Thanks in advance!
Kay
Fix it by going into the Admin Console's Manage API client access page under Security and adding the Client ID and scope needed for the Directory API. For more information, check this document.
Okay, found out what was wrong and fixed it myself. Finally it looks like this:
http = getCredentials().authorize(httplib2.Http())
url = 'https://apps-apis.google.com/a/feeds/compliance/audit/mail/export/helpling.com/'+account
headers = {'Content-Type': 'application/atom+xml'}
xml_data = """<atom:entry xmlns:atom='http://www.w3.org/2005/Atom' xmlns:apps='http://schemas.google.com/apps/2006'> \
<apps:property name='includeDeleted' value='true'/> \
</atom:entry>"""
status, response = http.request(url, 'POST', headers=headers, body=xml_data)
Not sure if it was about the body or the header. It works now and I hope it will help others.
Thanks anyway.

Python 3.5.2 Iterating a get request

Hoping someone can tell me whether this script is functioning the way I intended it to, and if not explain what I am doing wrong.
The RESTful API I am using has a pageSize parameter ranging from 10 to 50; I used pageSize=50. There was another parameter, pageNumber, that I did not use.
So, I thought this would be the right way to make the get request:
# Python 3.5.2
import requests

r = requests.get(url, stream=True)
with open("file.txt", 'w', newline='', encoding='utf-8') as fd:
    text_out = r.text
    fd.write(text_out)
UPDATE
I think I understand a bit better. I read the documentation in more detail, but I am still missing how to get the entire data set from the API. Here is some more information:
verbs = requests.options(r.url)
print(verbs.headers)
{'Server': 'ninx', 'Date': 'Sat, 24 Dec 2016 22:50:13 GMT',
'Allow': 'OPTIONS,HEAD,GET', 'Content-Length': '0', 'Connection': 'keep-alive'}
print(r.headers)
{'Transfer-Encoding': 'chunked', 'Vary': 'Accept-Encoding',
'X-Entity-Count': '50', 'Connection': 'keep-alive',
'Content-Encoding': 'gzip', 'Date': 'Sat, 24 Dec 2016 23:59:07 GMT',
'Server': 'ninx', 'Content-Type': 'application/json; charset=UTF-8'}
Should I create a session and use the previously unused pageNumber parameter to create a new url until the 'X-Entity-Count' is zero? Or, is there a better way?
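Roughly, the loop I had in mind (a sketch; pageNumber, pageSize and the X-Entity-Count header are the ones described above, and the URL is a placeholder for the real endpoint):

import requests

url = 'https://api.example.com/records'  # placeholder for the actual endpoint
page = 1
with requests.Session() as session, \
        open('file.txt', 'w', newline='', encoding='utf-8') as fd:
    while True:
        r = session.get(url, params={'pageSize': 50, 'pageNumber': page})
        r.raise_for_status()
        # X-Entity-Count reports how many records this page returned;
        # stop once a page comes back empty.
        if int(r.headers.get('X-Entity-Count', '0')) == 0:
            break
        fd.write(r.text)
        page += 1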
I found a discussion that helped clear this matter up for me (so this updated question should probably be deleted):
API pagination best practices
