I need to download Kaggle's NIH Chest X-rays Dataset programmatically, specifically the Data_Entry_2017.csv.
I want to download it if it doesn't exist on the system, and re-download it if its last-updated date is later than the date of my last download.
I know I can do this manually, but I would appreciate a way to do it programmatically.
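The "download if missing, re-download if stale" rule can be sketched independently of Kaggle. This is a minimal sketch, not a tested integration: `needs_download` is a hypothetical helper name, and it assumes the dataset's `lastUpdated` value is a naive datetime expressed in UTC (the API output does not state a timezone).

```python
import os
from datetime import datetime, timezone


def needs_download(local_path, remote_last_updated):
    """Return True when the local copy is missing or older than the remote one.

    remote_last_updated is assumed to be a naive datetime expressed in UTC.
    """
    if not os.path.exists(local_path):
        return True  # never downloaded
    # File modification time as a naive UTC datetime, comparable to the remote value
    local_mtime = datetime.fromtimestamp(
        os.path.getmtime(local_path), tz=timezone.utc
    ).replace(tzinfo=None)
    return remote_last_updated > local_mtime
```

The second argument would be the `lastUpdated` field of the dataset metadata; the file's modification time stands in for "last download date", which only holds if nothing else touches the file.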
What I have tried:
Direct download:
kaggle.api.dataset_download_file(dataset='nih-chest-xrays/data', file_name='Data_Entry_2017.csv', path='data/')
This gives:
File "C:\Program Files\Python311\Lib\site-packages\kaggle\rest.py", line 241, in request
raise ApiException(http_resp=r)
kaggle.rest.ApiException: (404)
Reason: Not Found
HTTP response headers: HTTPHeaderDict({'Content-Type': 'application/json', 'Date': 'Sun, 18 Dec
2022 03:35:46 GMT', 'Access-Control-Allow-Credentials': 'true', 'Set-Cookie':
'ka_sessionid=bdaa2b71c677d6e48bf93fc71b85a9c6; max-age=2626560; path=/, GCLB=CIeroe2g5sixeA;
path=/; HttpOnly', 'Transfer-Encoding': 'chunked', 'Vary': 'Accept-Encoding', 'Turbolinks-
Location': 'https://www.kaggle.com/api/v1/datasets/download/nih-chest-
xrays/data/Data_Entry_2017.csv', 'X-Kaggle-MillisecondsElapsed': '526', 'X-Kaggle-RequestId':
'03383f8cdac3aa4b6ab83d5dadbc507c', 'X-Kaggle-ApiVersion': '1.5.12', 'X-Frame-Options':
'SAMEORIGIN', 'Strict-Transport-Security': 'max-age=63072000; includeSubDomains; preload',
'Content-Security-Policy': "object-src 'none'; script-src 'nonce-gn6A4mfrbofbZRyE9gREnA=='
'report-sample' 'unsafe-inline' 'unsafe-eval' 'strict-dynamic' https: http:; frame-src 'self'
https://www.kaggleusercontent.com https://www.youtube.com/embed/ https://polygraph-cool.github.io
https://www.google.com/recaptcha/ https://form.jotform.com https://submit.jotform.us
https://submit.jotformpro.com https://submit.jotform.com https://www.docdroid.com
https://www.docdroid.net https://kaggle-static.storage.googleapis.com https://kaggle-static-
staging.storage.googleapis.com https://kkb-dev.jupyter-proxy.kaggle.net https://kkb-
staging.jupyter-proxy.kaggle.net https://kkb-production.jupyter-proxy.kaggle.net https://kkb-
dev.firebaseapp.com https://kkb-staging.firebaseapp.com
https://kkb-production.firebaseapp.com https://kaggle-metastore-test.firebaseapp.com
https://kaggle-metastore.firebaseapp.com https://apis.google.com https://content-
sheets.googleapis.com/ https://accounts.google.com/ https://storage.googleapis.com
https://docs.google.com https://drive.google.com https://calendar.google.com/;
base-uri 'none'; report-uri https://csp.withgoogle.com/csp/kaggle/20201130;", 'X-Content-Type-
Options': 'nosniff', 'Referrer-Policy': 'strict-origin-when-cross-origin', 'Via': '1.1 google',
'Alt-Svc': 'h3=":443"; ma=2592000,h3-29=":443"; ma=2592000'})
HTTP response body: b'{"code":404,"message":"Not found"}'
Same for any other files in this dataset:
kaggle.api.dataset_download_file(dataset='nih-chest-xrays/data', file_name='test_list.txt', path='data/')
Which again gives
kaggle.rest.ApiException: (404)
Reason: Not Found
HTTP response headers: HTTPHeaderDict({'Content-Type': 'application/json', 'Date': 'Sun, 18 Dec
2022 03:37:20 GMT', 'Access-Control-Allow-Credentials': 'true', 'Set-Cookie':
'ka_sessionid=b59676e67aa4a36e92657a7fa50e95f3; max-age=2626560; path=/, GCLB=CKOV0bPZ-p6ZSA;
path=/; HttpOnly', 'Transfer-Encoding': 'chunked', 'Vary': 'Accept-Encoding', 'Turbolinks-
Location': 'https://www.kaggle.com/api/v1/datasets/download/nih-chest-xrays/data/test_list.txt',
'X-Kaggle-MillisecondsElapsed': '520', 'X-Kaggle-RequestId': '79fde70e368d73ba7c767a007bacc086',
'X-Kaggle-ApiVersion': '1.5.12', 'X-Frame-Options': 'SAMEORIGIN', 'Strict-Transport-Security':
'max-age=63072000; includeSubDomains; preload', 'Content-Security-Policy': "object-src 'none';
script-src 'nonce-U88j/npGgHfKpSCwewojZg==' 'report-sample' 'unsafe-inline' 'unsafe-eval'
'strict-dynamic' https: http:; frame-src 'self' https://www.kaggleusercontent.com
https://www.youtube.com/embed/ https://polygraph-cool.github.io https://www.google.com/recaptcha/
https://form.jotform.com https://submit.jotform.us https://submit.jotformpro.com
https://submit.jotform.com https://www.docdroid.com https://www.docdroid.net https://kaggle-
static.storage.googleapis.com https://kaggle-static-staging.storage.googleapis.com https://kkb-
dev.jupyter-proxy.kaggle.net https://kkb-staging.jupyter-proxy.kaggle.net https://kkb-
production.jupyter-proxy.kaggle.net https://kkb-dev.firebaseapp.com https://kkb-
staging.firebaseapp.com https://kkb-production.firebaseapp.com https://kaggle-metastore-
test.firebaseapp.com https://kaggle-metastore.firebaseapp.com https://apis.google.com
https://content-sheets.googleapis.com/ https://accounts.google.com/
https://storage.googleapis.com https://docs.google.com https://drive.google.com
https://calendar.google.com/; base-uri 'none'; report-uri
https://csp.withgoogle.com/csp/kaggle/20201130;", 'X-Content-Type-Options': 'nosniff', 'Referrer-
Policy': 'strict-origin-when-cross-origin', 'Via': '1.1 google', 'Alt-Svc': 'h3=":443";
ma=2592000,h3-29=":443"; ma=2592000'})
HTTP response body: b'{"code":404,"message":"Not found"}'
The strange thing is that I can search this dataset:
dataset_list = kaggle.api.dataset_list(search='nih')
print(dataset_list)
Which gives:
[nih-chest-xrays/data, nih-chest-xrays/sample, allen-institute-for-ai/CORD-19-research-challenge,
kmader/nih-deeplesion-subset, nickuzmenkov/nih-chest-xrays-tfrecords, iarunava/cell-images-for-
detecting-malaria, tunguz/covid19-genomes, nlm-nih/nlm-rxnorm, akhileshdkapse/nih-image-600x600-
data, tunguz/nih-awarded-grant-text, danielmadmon/nih-xray-dataset-tfrec-with-labels,
miracle9to9/files1, nlm-nih/rxnorm-drug-name-conventions, kmader/rsna-bone-age,
redwankarimsony/chestxray8-dataframe, akhileshdkapse/nih-dataframe, luigisaetta/cxr-tfrec256-
may2020, kokyew93/nihdata, ammarali32/startingpointschestx, dannellyz/cancer-incidence-totals-
and-rates-per-us-county]
It's literally the first one on the list.
And checking its metadata:
dataset = vars(dataset_list[0])
print(dataset)
Which gives:
{'subtitleNullable': 'Over 112,000 Chest X-ray images from more than 30,000 unique patients',
'creatorNameNullable': 'Timo Bozsolik', 'creatorUrlNullable': 'timoboz', 'totalBytesNullable':
45096150231, 'urlNullable': 'https://www.kaggle.com/datasets/nih-chest-xrays/data',
'licenseNameNullable': 'CC0: Public Domain', 'descriptionNullable': None, 'ownerNameNullable':
'National Institutes of Health Chest X-Ray Dataset', 'ownerRefNullable': 'nih-chest-xrays',
'titleNullable': 'NIH Chest X-rays', 'currentVersionNumberNullable': 3,
'usabilityRatingNullable': 0.7352941, 'id': 5839, 'ref': 'nih-chest-xrays/data', 'subtitle':
'Over 112,000 Chest X-ray images from more than 30,000 unique patients', 'hasSubtitle': True,
'creatorName': 'Timo Bozsolik', 'hasCreatorName': True, 'creatorUrl': 'timoboz', 'hasCreatorUrl':
True, 'totalBytes': 45096150231, 'hasTotalBytes': True, 'url':
'https://www.kaggle.com/datasets/nih-chest-xrays/data', 'hasUrl': True, 'lastUpdated':
datetime.datetime(2018, 2, 21, 20, 52, 23), 'downloadCount': 65214, 'isPrivate': False,
'isFeatured': False, 'licenseName': 'CC0: Public Domain', 'hasLicenseName': True, 'description':
'', 'hasDescription': False, 'ownerName': 'National Institutes of Health Chest X-Ray Dataset',
'hasOwnerName': True, 'ownerRef': 'nih-chest-xrays', 'hasOwnerRef': True, 'kernelCount': 323,
'title': 'NIH Chest X-rays', 'hasTitle': True, 'topicCount': 0, 'viewCount': 454490, 'voteCount':
967, 'currentVersionNumber': 3, 'hasCurrentVersionNumber': True, 'usabilityRating': 0.7352941,
'hasUsabilityRating': True, 'tags': [biology, health, medicine, computer science, software,
health conditions], 'files': [], 'versions': [], 'size': '42GB'}
I can even list the files for that dataset, and the file I need is there:
print(kaggle.api.dataset_list_files('nih-chest-xrays/data').files)
Which gives:
[BBox_List_2017.csv, LOG_CHESTXRAY.pdf, ARXIV_V5_CHESTXRAY.pdf, README_CHESTXRAY.pdf,
train_val_list.txt, Data_Entry_2017.csv, FAQ_CHESTXRAY.pdf, test_list.txt]
A problem I noticed was that kaggle.api.dataset_download_file was only fetching from Version 1, where these files don't exist. This was further confirmed when I successfully fetched an image file from Version 1:
res = kaggle.api.dataset_download_file(dataset='nih-chest-xrays/data', file_name='images_001/images/00000001_000.png', path='data/')
print(res)
Which downloads the image and prints True.
The files I need are in Version 3. Is there any way to configure kaggle to use Version 3?
Note that I can download the whole 45GB dataset, but I only require the single file Data_Entry_2017.csv.
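For reference, the 404 responses above show the URL the client requests, and it carries no version. A minimal sketch of what a version-pinned request URL might look like follows. It rests on an unverified assumption: that the REST endpoint accepts a `datasetVersionNumber` query parameter. `versioned_file_url` is a hypothetical helper, not part of the `kaggle` package, and an actual request would still need Kaggle credentials.

```python
from urllib.parse import quote, urlencode

# Base path taken from the Turbolinks-Location header in the error output
BASE = "https://www.kaggle.com/api/v1/datasets/download"


def versioned_file_url(dataset, file_name, version=None):
    """Build a single-file download URL, optionally pinned to a version.

    ASSUMPTION: the endpoint honours a datasetVersionNumber query parameter.
    """
    owner, slug = dataset.split("/", 1)
    # quote() keeps "/" so nested file paths stay intact
    url = f"{BASE}/{quote(owner)}/{quote(slug)}/{quote(file_name)}"
    if version is not None:
        url += "?" + urlencode({"datasetVersionNumber": version})
    return url


print(versioned_file_url("nih-chest-xrays/data", "Data_Entry_2017.csv", version=3))
```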
Even more strangely, it's also throwing an error when checking its status:
print(kaggle.api.dataset_status(dataset='nih-chest-xrays/data'))
Which gives:
Traceback (most recent call last):
File "C:\Users\jaide\OneDrive\Documents\MScCS2022\Data-Mining-I\chest-datasets\visualise.py",
line 117, in <module>
full_df = load_nih_data()
^^^^^^^^^^^^^^^
File "C:\Users\jaide\OneDrive\Documents\MScCS2022\Data-Mining-I\chest-datasets\visualise.py",
line 89, in load_nih_data
print(kaggle.api.dataset_status(dataset='nih-chest-xrays/data'))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Program Files\Python311\Lib\site-packages\kaggle\api\kaggle_api_extended.py", line
1135, in dataset_status
self.datasets_status_with_http_info(owner_slug=owner_slug,
File "C:\Program Files\Python311\Lib\site-packages\kaggle\api\kaggle_api.py", line 1910, in
datasets_status_with_http_info
return self.api_client.call_api(
^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Program Files\Python311\Lib\site-packages\kaggle\api_client.py", line 329, in call_api
return self.__call_api(resource_path, method,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Program Files\Python311\Lib\site-packages\kaggle\api_client.py", line 161, in
__call_api
response_data = self.request(
^^^^^^^^^^^^^
File "C:\Program Files\Python311\Lib\site-packages\kaggle\api_client.py", line 351, in request
return self.rest_client.GET(url,
^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Program Files\Python311\Lib\site-packages\kaggle\rest.py", line 247, in GET
return self.request("GET", url,
^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Program Files\Python311\Lib\site-packages\kaggle\rest.py", line 241, in request
raise ApiException(http_resp=r)
kaggle.rest.ApiException: (404)
Reason: Not Found
HTTP response headers: HTTPHeaderDict({'Content-Type': 'application/json', 'Date': 'Sun, 18 Dec
2022 04:14:32 GMT', 'Access-Control-Allow-Credentials': 'true', 'Set-Cookie':
'ka_sessionid=1ab6e66561c7d9ae43ead26df25beea3; max-age=2626560; path=/, GCLB=CNS4vKLYyPuCpgE;
path=/; HttpOnly', 'Transfer-Encoding': 'chunked', 'Vary':
'Accept-Encoding', 'Turbolinks-Location': 'https://www.kaggle.com/api/v1/datasets/status/nih-
chest-xrays/data', 'X-Kaggle-MillisecondsElapsed': '46', 'X-Kaggle-RequestId':
'66473cc55f62c39d2deb0aa4cd2953a7', 'X-Kaggle-ApiVersion': '1.5.12', 'X-Frame-Options':
'SAMEORIGIN', 'Strict-Transport-Security': 'max-age=63072000; includeSubDomains; preload',
'Content-Security-Policy': "object-src 'none'; script-src 'nonce-5AJVxKPJRg4Gsi/iT3D5Aw=='
'report-sample' 'unsafe-inline' 'unsafe-eval' 'strict-dynamic' https: http:; frame-src 'self'
https://www.kaggleusercontent.com https://www.youtube.com/embed/ https://polygraph-cool.github.io
https://www.google.com/recaptcha/ https://form.jotform.com https://submit.jotform.us
https://submit.jotformpro.com https://submit.jotform.com https://www.docdroid.com
https://www.docdroid.net https://kaggle-static.storage.googleapis.com https://kaggle-static-
staging.storage.googleapis.com https://kkb-dev.jupyter-proxy.kaggle.net https://kkb-
staging.jupyter-proxy.kaggle.net https://kkb-production.jupyter-proxy.kaggle.net https://kkb-
dev.firebaseapp.com https://kkb-staging.firebaseapp.com https://kkb-production.firebaseapp.com
https://kaggle-metastore-test.firebaseapp.com https://kaggle-metastore.firebaseapp.com
https://apis.google.com https://content-sheets.googleapis.com/ https://accounts.google.com/
https://storage.googleapis.com https://docs.google.com https://drive.google.com
https://calendar.google.com/; base-uri 'none'; report-uri
https://csp.withgoogle.com/csp/kaggle/20201130;", 'X-Content-Type-Options': 'nosniff', 'Referrer-
Policy': 'strict-origin-when-cross-origin', 'Via': '1.1 google', 'Alt-Svc': 'h3=":443";
ma=2592000,h3-29=":443"; ma=2592000'})
HTTP response body: {"code":404,"message":"Not found"}
CLI download
Check if dataset exists:
C:\Users\jaide\OneDrive\Documents\MScCS2022\Data-Mining-I\chest-datasets>kaggle datasets list -s "nih-chest-xrays/data"
ref                                                            title                                            size   lastUpdated          downloadCount  voteCount  usabilityRating
-------------------------------------------------------------  -----------------------------------------------  -----  -------------------  -------------  ---------  ---------------
nickuzmenkov/nih-chest-xrays-tfrecords                         NIH Chest X-rays TFRecords                       11GB   2021-03-09 04:49:23  2121           119        0.9411765
nih-chest-xrays/data                                           NIH Chest X-rays                                 42GB   2018-02-21 20:52:23  65221          967        0.7352941
nih-chest-xrays/sample                                         Random Sample of NIH Chest X-ray Dataset         4GB    2017-11-23 02:58:24  14930          241        0.7647059
ammarali32/startingpointschestx                                StartingPoints-ChestX                            667MB  2021-02-27 17:35:19  668            68         0.6875
redwankarimsony/chestxray8-dataframe                           ChestX-ray8_DataFrame                            51MB   2020-07-12 02:28:33  416            10         0.64705884
amanullahasraf/covid19-pneumonia-normal-chest-xray-pa-dataset  COVID19_Pneumonia_Normal_Chest_Xray_PA_Dataset   2GB    2020-07-13 05:54:22  1482           14         0.8125
harshsoni/nih-chest-xray-tfrecords                             NIH Chest X-ray TFRecords                        42GB   2020-09-26 18:48:04  52             6          0.5294118
amanullahasraf/covid19-pneumonia-normal-chest-xraypa-dataset   COVID19_Pneumonia_Normal_Chest_Xray(PA)_Dataset  1GB    2020-06-29 08:28:01  161            3          0.5882353
roderikmogot/nih-chest-x-ray-models                            nih chest x ray models                           2GB    2022-08-17 01:32:33  66             4          0.47058824
ericspod/project-monai-2020-bootcamp-challenge-dataset         Project MONAI 2020 Bootcamp Challenge Dataset    481MB  2021-01-25 01:11:42  22             2          0.6875
sunghyunjun/nih-chest-xrays-600-jpg-dataset                    NIH Chest X-rays 600 JPG Dataset                 7GB    2021-03-15 07:53:12  35             3          0.64705884
zhuangjw/chest-xray-cleaned                                    chest_xray_cleaned                               2GB    2019-11-30 19:05:24  13             1          0.375
kambarakun/nih-chest-xrays-trained-models                      NIH Chest X-rays: Trained Models                 3GB    2020-04-30 11:52:56  10             2          0.3125
jessevent/all-kaggle-datasets                                  Complete Kaggle Datasets Collection              390KB  2018-01-16 12:32:58  2256           118        0.8235294
vzaguskin/nih-chest-xrays-tfrecords-756                        NIH Chest X-rays tfrecords 756                   11GB   2021-02-25 14:25:12  3              1          0.29411766
Yes, it is the second one. Now does it have my required file?
C:\Users\jaide\OneDrive\Documents\MScCS2022\Data-Mining-I\chest-datasets>kaggle datasets files "nih-chest-xrays/data"
name                    size   creationDate
----------------------  -----  -------------------
ARXIV_V5_CHESTXRAY.pdf  9MB    2018-02-21 20:52:23
FAQ_CHESTXRAY.pdf       71KB   2018-02-21 20:52:23
Data_Entry_2017.csv     7MB    2018-02-21 20:52:23
BBox_List_2017.csv      90KB   2018-02-21 20:52:23
README_CHESTXRAY.pdf    827KB  2018-02-21 20:52:23
LOG_CHESTXRAY.pdf       4KB    2018-02-21 20:52:23
train_val_list.txt      1MB    2018-02-21 20:52:23
test_list.txt           425KB  2018-02-21 20:52:23
Yes it does. Now download it.
C:\Users\jaide\OneDrive\Documents\MScCS2022\Data-Mining-I\chest-datasets>kaggle datasets download -f "Data_Entry_2017.csv" -p "data/" "nih-chest-xrays/data"
404 - Not Found
Nope, error 404.
How do I download Kaggle's NIH Chest X-rays Dataset programmatically, specifically the Data_Entry_2017.csv?
Any help is appreciated!
I have checked:
I checked the Kaggle Python API, but it doesn't appear to have any way to select a version.
I checked the Kaggle API on GitHub, but it also doesn't appear to have any way to select a version.
Related Questions Read:
Kaggle Dataset Download - is irrelevant.
Download a Kaggle Dataset - is downloading competition files.
How to download data set into Colab? Stuck with a problem, it says "401 - Unauthorized"? - They had API key issues.
How to load just one chosen file of a way too large Kaggle dataset from Kaggle into Colab - Is about Kaggle and Jupyter notebooks; I am just using a Python script.
Trouble turning comorbidity data into a table using Python and Pandas - Again uses Kernels (Kaggle Notebooks).
How to download kaggle dataset? - Downloads the whole dataset.
python code
# -*- coding: utf-8 -*-
import requests
import json
import logging
logging.basicConfig(level=logging.DEBUG)
def dump_resp(resp: requests.Response):
    print(resp.json())
    print('{}\n{}\r\n{}\r\n\r\n{}'.format(
        '-----------START-----------',
        resp.url,
        '\r\n'.join('{}: {}'.format(k, v) for k, v in resp.headers.items()),
        json.dumps(resp.json(), indent=2),
    ))
req = requests.Request(
    url='https://echo-echo-eaqnidmlfz.cn-hongkong.fcapp.run/path-with- blank?key1=value1&key1=value2&key2=value2&key blank=value blank',
    method='POST',
    headers={},
    data='anything',
)
with requests.session() as s:
    prep = s.prepare_request(req)
    resp = s.send(prep)
    dump_resp(resp)
The code above eventually succeeds, but response parsing takes at least 20 seconds, and I don't know why. It looks like a bug in the Python requests package.
errors and warnings:
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): echo-echo-eaqnidmlfz.cn-hongkong.fcapp.run:443
DEBUG:urllib3.connectionpool:https://echo-echo-eaqnidmlfz.cn-hongkong.fcapp.run:443 "POST /path-with-%20blank?key1=value1&key1=value2&key2=value2&key%20blank=value%20blank HTTP/1.1" 200 None
WARNING:urllib3.connectionpool:Failed to parse headers (url=https://echo-echo-eaqnidmlfz.cn-hongkong.fcapp.run:443/path-with-%20blank?key1=value1&key1=value2&key2=value2&key%20blank=value%20blank): [MissingHeaderBodySeparatorDefect()], unparsed data: 'key blank: value blank\r\nDate: Mon, 30 May 2022 12:19:08 GMT\r\nContent-Length: 601\r\n\r\n'
Traceback (most recent call last):
File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 469, in _make_request
assert_header_parsing(httplib_response.msg)
File "/usr/local/lib/python3.9/site-packages/urllib3/util/response.py", line 91, in assert_header_parsing
raise HeaderParsingError(defects=defects, unparsed_data=unparsed_data)
urllib3.exceptions.HeaderParsingError: [MissingHeaderBodySeparatorDefect()], unparsed data: 'key blank: value blank\r\nDate: Mon, 30 May 2022 12:19:08 GMT\r\nContent-Length: 601\r\n\r\n'
{'path': '/path-with- blank', 'queries': {'key blank': 'value blank', 'key1': ['value1', 'value2'], 'key2': 'value2'}, 'headers': {'accept': '*/*', 'content-length': '8', 'host': 'echo-echo-eaqnidmlfz.cn-hongkong.fcapp.run', 'user-agent': 'python-requests/2.27.1', 'x-forwarded-proto': 'https'}, 'method': 'POST', 'requestURI': '/path-with-%20blank?key1=value1&key1=value2&key2=value2&key%20blank=value%20blank', 'clientIP': '42.120.75.226', 'body': 'anything'}
-----------START-----------
https://echo-echo-eaqnidmlfz.cn-hongkong.fcapp.run/path-with-%20blank?key1=value1&key1=value2&key2=value2&key%20blank=value%20blank
Access-Control-Expose-Headers: Date,x-fc-request-id,x-fc-error-type,x-fc-code-checksum,x-fc-invocation-duration,x-fc-max-memory-usage,x-fc-log-result,x-fc-invocation-code-version
Content-Disposition: attachment
Content-Type: text/plain
Key1: value1, value2
Key2: value2
X-Fc-Code-Checksum: 5253856105889449316
X-Fc-Instance-Id: c-6294b4c2-e964f124955247518971
X-Fc-Invocation-Duration: 2
X-Fc-Invocation-Service-Version: LATEST
X-Fc-Max-Memory-Usage: 49.12
X-Fc-Request-Id: 8aa9adc4-c23c-4e2c-a198-0d8933563064
{
"path": "/path-with- blank",
"queries": {
"key blank": "value blank",
"key1": [
"value1",
"value2"
],
"key2": "value2"
},
"headers": {
"accept": "*/*",
"content-length": "8",
"host": "echo-echo-eaqnidmlfz.cn-hongkong.fcapp.run",
"user-agent": "python-requests/2.27.1",
"x-forwarded-proto": "https"
},
"method": "POST",
"requestURI": "/path-with-%20blank?key1=value1&key1=value2&key2=value2&key%20blank=value%20blank",
"clientIP": "42.120.75.226",
"body": "anything"
}
I have found that when query parameters contain blank spaces, response parsing goes wrong, but I don't know how to fix or avoid it.
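The warning can be reproduced offline. Python's `http.client` parses response headers with the standard `email` parser, and a header name containing a space (here `key blank`, echoed back by the server) is not a valid header line, so parsing stops there. The sketch below feeds a hand-written header block (made up for illustration, modelled on the unparsed data in the warning) to `email.parser.Parser`. Every header after the bad line, including `Content-Length`, is lost, which may be related to the delay, although that link is an inference and not something the logs prove.

```python
import email.parser

# Hand-written header block for illustration; the third line has a space in its name
raw_headers = (
    "Content-Type: text/plain\r\n"
    "Key2: value2\r\n"
    "key blank: value blank\r\n"
    "Content-Length: 601\r\n"
    "\r\n"
)

msg = email.parser.Parser().parsestr(raw_headers)

parsed_names = msg.keys()
defect_names = [type(d).__name__ for d in msg.defects]

print(parsed_names)           # headers seen before the invalid line
print(defect_names)           # defects recorded by the parser
print(msg["Content-Length"])  # None: the header was never parsed
```

If this is the cause, the problem sits in the response the server sends, not in how the request is encoded: the debug log shows the blanks already sent as `%20`.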
I am trying to scrape Myntra, but I keep getting errors. I have changed the code many times and tried the requests package as well as urllib, but I still get errors.
Sometimes I get a timeout error, and sometimes urllib.error.URLError:
urllib.error.URLError: <urlopen error Tunnel connection failed: 502 Proxy Error (no funds available)>
Here is my code.
import os, ssl, http, gzip
import urllib.request
from bs4 import BeautifulSoup
import re
from http.cookiejar import CookieJar
import json
import requests
def myntraScraper(url):
    # Disable HTTPS certificate verification unless PYTHONHTTPSVERIFY is set
    if (not os.environ.get('PYTHONHTTPSVERIFY', '') and getattr(ssl, '_create_unverified_context', None)):
        ssl._create_default_https_context = ssl._create_unverified_context
    cj = CookieJar()
    proxy = {
        'https': '------',
        'http': '-------'
    }
    # user_agent = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'
    try:
        import urllib.request as urllib2
    except ImportError:
        import urllib2
    # Route all requests through the proxy and keep cookies between them
    urllib2.install_opener(
        urllib2.build_opener(
            urllib2.ProxyHandler(proxy),
            urllib.request.HTTPCookieProcessor(cj)
        )
    )
    request = urllib2.Request(url, headers={
        'accept-encoding': 'gzip',
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36',
    })
    page = urllib2.urlopen(request)
    html = gzip.decompress(page.read()).decode('utf-8')
    soup = BeautifulSoup(html, 'lxml')
    print(soup)
myntraScraper("https://www.myntra.com/sports-shoes/puma/puma-men-blue-hybrid-fuego-running-shoes/11203218/buy")
Currently I am using Smartproxy, but I tried the same thing with PacketStream and Luminati. Most of the time I get the proxy error.
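Separately from the proxy errors, one step in the code above can fail on its own: `gzip.decompress` is called unconditionally, but a server is free to ignore `accept-encoding` and send an uncompressed body. A minimal sketch of a safer decode step, checking for the gzip magic bytes first; `decode_body` is a hypothetical helper name.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def decode_body(raw: bytes, encoding: str = "utf-8") -> str:
    """Decompress raw only if it looks like gzip, then decode it to text."""
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    return raw.decode(encoding)
```

In the function above, this would replace the `gzip.decompress(page.read()).decode('utf-8')` expression with `decode_body(page.read())`. It does nothing about the 502 from the proxy itself.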
Myntra stores all the product data inside a script tag, in a variable called pdpData.
The script below extracts the whole JSON object that contains all the data about the product.
import requests, json
from bs4 import BeautifulSoup
headers = {'User-Agent' : 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36'}
s = requests.Session()
res = s.get("https://www.myntra.com/sports-shoes/puma/puma-men-blue-hybrid-fuego-running-shoes/11203218/buy", headers=headers, verify=False)
soup = BeautifulSoup(res.text,"lxml")
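script = None
# Find the script tag whose text mentions pdpData
# (the loop variable is named tag so it does not overwrite the session s)
for tag in soup.find_all("script"):
    if 'pdpData' in tag.text:
        script = tag.get_text(strip=True)
        break
# Parse everything from the first '{' onwards as JSON
print(json.loads(script[script.index('{'):]))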
Output:
{'pdpData': {'id': 11203218, 'name': 'Puma Men Blue Hybrid Fuego Running Shoes', 'mrp': 6499, 'manufacturer': 'SSIPL RETAIL LIMITED, KUNDLI,75, SERSA ROAD, 131028 SONEPAT', 'countryOfOrigin': 'India', 'colours': None, 'baseColour': 'Blue', 'brand': {'uidx': '', 'name': 'Puma', 'image': '', 'bio': ''}, 'media': {'videos': [], 'albums': [{'name': 'default', 'images': [{'src': 'http://assets.myntassets.com/h_($height),q_($qualityPercentage),w_($width)/v1/assets/images/productimage/2019/12/20/0c15e03c-863b-4a4a-9bb7-709a733fd4821576816965952-1.jpg', 'secureSrc': 'https://assets.myntassets.com/h_($height),q_($qualityPercentage),w_($width)/v1/assets/images/productimage/2019/12/20/0c15e03c-863b-4a4a-9bb7-709a733fd4821576816965952-1.jpg', 'host': None, 'imageURL': 'http://assets.myntassets.com/assets/images/productimage/2019/12/20/0c15e03c-863b-4a4a-9bb7-709a733fd4821576816965952-1.jpg', 'annotation': []}, {'src': 'http://assets.myntassets.com/h_($height),q_($qualityPercentage),w_($width)/v1/assets/images/productimage/2019/12/20/69bfa4e0-1ac4-4adf-b84e-4815ff60e8831576816966007-2.jpg', 'secureSrc': 'https://assets.myntassets.com/h_($height),q_($qualityPercentage),w_($width)/v1/assets/images/productimage/2019/12/20/69bfa4e0-1ac4-4adf-b84e-4815ff60e8831576816966007-2.jpg', 'host': None, 'imageURL': 'http://assets.myntassets.com/assets/images/productimage/2019/12/20/69bfa4e0-1ac4-4adf-b84e-4815ff60e8831576816966007-2.jpg', 'annotation': []}, {'src': 'http://assets.myntassets.com/h_($height),q_($qualityPercentage),w_($width)/v1/assets/images/productimage/2019/12/20/d2fd0ca0-1643-43ae-a0fc-fb1309580e151576816966049-3.jpg', 'secureSrc': 'https://assets.myntassets.com/h_($height),q_($qualityPercentage),w_($width)/v1/assets/images/productimage/2019/12/20/d2fd0ca0-1643-43ae-a0fc-fb1309580e151576816966049-3.jpg', 'host': None, 'imageURL': 'http://assets.myntassets.com/assets/images/productimage/2019/12/20/d2fd0ca0-1643-43ae-a0fc-fb1309580e151576816966049-3.jpg', 'annotation': []}, 
{'src': 'http://assets.myntassets.com/h_($height),q_($qualityPercentage),w_($width)/v1/assets/images/productimage/2019/12/20/0edae428-b9c0-4755-9127-0961d872b78a1576816966095-4.jpg', 'secureSrc': 'https://assets.myntassets.com/h_($height),q_($qualityPercentage),w_($width)/v1/assets/images/productimage/2019/12/20/0edae428-b9c0-4755-9127-0961d872b78a1576816966095-4.jpg', 'host': None, 'imageURL': 'http://assets.myntassets.com/assets/images/productimage/2019/12/20/0edae428-b9c0-4755-9127-0961d872b78a1576816966095-4.jpg', 'annotation': []}, {'src': 'http://assets.myntassets.com/h_($height),q_($qualityPercentage),w_($width)/v1/assets/images/productimage/2019/12/20/c59c7677-2bbd-4dbe-9b02-7c321c29cb701576816966142-5.jpg', 'secureSrc': 'https://assets.myntassets.com/h_($height),q_($qualityPercentage),w_($width)/v1/assets/images/productimage/2019/12/20/c59c7677-2bbd-4dbe-9b02-7c321c29cb701576816966142-5.jpg', 'host': None, 'imageURL': 'http://assets.myntassets.com/assets/images/productimage/2019/12/20/c59c7677-2bbd-4dbe-9b02-7c321c29cb701576816966142-5.jpg', 'annotation': []}]}, {'name': 'animatedImage', 'images': []}]}, 'sbpEnabled': False, 'sizechart': {'sizeChartUrl': None, 'sizeRepresentationUrl': 'http://assets.myntassets.com/assets/images/sizechart/2016/12/12/11481538267795-footwear.png'}, 'sizeRecoLazy': {'actionType': 'lazy', 'action': '/product/11203218/size/recommendation', 'sizeProfileAction': '/user/size-profiles?gender=male&articleType=Sports%20Shoes'}, 'analytics': {'articleType': 'Sports Shoes', 'subCategory': 'Shoes', 'masterCategory': 'Footwear', 'gender': 'Men', 'brand': 'Puma', 'colourHexCode': None}, 'crossLinks': [{'title': 'More Sports Shoes by Puma', 'url': 'sports-shoes?f=Brand:Puma::Gender:men'}, {'title': 'More Blue Sports Shoes', 'url': 'sports-shoes?f=Color:Blue_0074D9::Gender:men'}, {'title': 'More Sports Shoes', 'url': 'sports-shoes?f=Gender:men'}], 'relatedStyles': None, 'disclaimerTitle': '', 'productDetails': [{'type': None, 'content': 
None, 'title': 'Product Details', 'description': "<b>FEATURES + BENEFITS</b><br>HYBRID: PUMA's combination of two of its best technologies: IGNITE foam and NRGY beads<br>IGNITE: PUMA's foam midsole and branded heel cage supports and stabilises by locking the heel onto the platform<br>NRGY: PUMA's foam midsole offers superior cushion from heel to toe so you can power through your run<br>Heel-to-toe drop: 12mm<br><br><b>Product Design Details</b><ul><li>A pair of blue & brown running sports shoes, has regular styling, lace-up detail</li><li>Low boot silhouette</li><li>Lightweight synthetic upper</li><li>Overlays to secure the heel</li><li>Classic tongue</li><li>Lace-up closure</li><li>Rubber outsole for traction and durability</li><li>PUMA Wordmark at the tongue</li><li>PUMA Cat Logo at heel</li><li>Warranty: 3 months</li><li>Warranty provided by brand/manufacturer</li></ul><br><b>PRODUCT STORY</b><br>Change the name of the game with the HYBRID Fuego running sneakers. This bold colour-blocked shoe pairs a HYBRID foam midsole and a grippy rubber outsole for the ultimate in comfort and stability while still maintaining a stylish edge."}, {'type': None, 'content': None, 'title': 'MATERIAL & CARE', 'description': 'Textile<br>Wipe with a clean, dry cloth to remove dust'}], 'preOrder': None, 'sizeChartDisclaimerText': '', 'tags': None, 'articleAttributes': {'Ankle Height': 'Regular', 'Arch Type': 'Medium', 'Cleats': 'No Cleats', 'Cushioning': 'Medium', 'Distance': 'Medium', 'Fastening': 'Lace-Ups', 'Material': 'Textile', 'Outsole Type': 'Marking', 'Pronation for Running Shoes': 'Neutral', 'Running Type': 'Road Running', 'Sole Material': 'Rubber', 'Sport': 'Running', 'Surface Type': 'Outdoor', 'Technology': 'NA', 'Warranty': '3 months'}, 'systemAttributes': [], 'ratings': None, 'urgency': [{'value': '0', 'type': 'PURCHASED', 'ptile': 0}, {'value': '0', 'type': 'CART', 'ptile': 0}, {'value': '0', 'type': 'WISHLIST', 'ptile': 0}, {'value': '0', 'type': 'PDP', 'ptile': 0}], 
'catalogAttributes': {'catalogDate': '1576751286000', 'season': 'summer', 'year': '2020'}, 'productContentGroupEntries': [{'title': '', 'type': 'DETAILS', 'attributes': [{'attributeName': 'Product Details', 'attributeType': 'STRING', 'value': "<b>FEATURES + BENEFITS</b><br>HYBRID: PUMA's combination of two of its best technologies: IGNITE foam and NRGY beads<br>IGNITE: PUMA's foam midsole and branded heel cage supports and stabilises by locking the heel onto the platform<br>NRGY: PUMA's foam midsole offers superior cushion from heel to toe so you can power through your run<br>Heel-to-toe drop: 12mm<br><br><b>Product Design Details</b><ul><li>A pair of blue & brown running sports shoes, has regular styling, lace-up detail</li><li>Low boot silhouette</li><li>Lightweight synthetic upper</li><li>Overlays to secure the heel</li><li>Classic tongue</li><li>Lace-up closure</li><li>Rubber outsole for traction and durability</li><li>PUMA Wordmark at the tongue</li><li>PUMA Cat Logo at heel</li><li>Warranty: 3 months</li><li>Warranty provided by brand/manufacturer</li></ul><br><b>PRODUCT STORY</b><br>Change the name of the game with the HYBRID Fuego running sneakers. This bold colour-blocked shoe pairs a HYBRID foam midsole and a grippy rubber outsole for the ultimate in comfort and stability while still maintaining a stylish edge."}, {'attributeName': 'Material & Care', 'attributeType': 'STRING', 'value': 'Textile<br>Wipe with a clean, dry cloth to remove dust'}, {'attributeName': 'Style Note', 'attributeType': 'STRING', 'value': "You'll look and feel super stylish in these trendsetting sports shoes by Puma. 
Match this blue pair with track pants and a sleeveless sports T-shirt when heading out for a casual day with friends."}]}], 'shoppableLooks': None, 'descriptors': [{'title': 'description', 'description': "<b>FEATURES + BENEFITS</b><br>HYBRID: PUMA's combination of two of its best technologies: IGNITE foam and NRGY beads<br>IGNITE: PUMA's foam midsole and branded heel cage supports and stabilises by locking the heel onto the platform<br>NRGY: PUMA's foam midsole offers superior cushion from heel to toe so you can power through your run<br>Heel-to-toe drop: 12mm<br><br><b>Product Design Details</b><ul><li>A pair of blue & brown running sports shoes, has regular styling, lace-up detail</li><li>Low boot silhouette</li><li>Lightweight synthetic upper</li><li>Overlays to secure the heel</li><li>Classic tongue</li><li>Lace-up closure</li><li>Rubber outsole for traction and durability</li><li>PUMA Wordmark at the tongue</li><li>PUMA Cat Logo at heel</li><li>Warranty: 3 months</li><li>Warranty provided by brand/manufacturer</li></ul><br><b>PRODUCT STORY</b><br>Change the name of the game with the HYBRID Fuego running sneakers. This bold colour-blocked shoe pairs a HYBRID foam midsole and a grippy rubber outsole for the ultimate in comfort and stability while still maintaining a stylish edge."}, {'title': 'style_note', 'description': "You'll look and feel super stylish in these trendsetting sports shoes by Puma. 
Match this blue pair with track pants and a sleeveless sports T-shirt when heading out for a casual day with friends."}, {'title': 'materials_care_desc', 'description': 'Textile<br>Wipe with a clean, dry cloth to remove dust'}], 'flags': {'isExchangeable': True, 'isReturnable': True, 'openBoxPickupEnabled': True, 'tryAndBuyEnabled': True, 'isLarge': False, 'isHazmat': False, 'isFragile': False, 'isJewellery': False, 'outOfStock': False, 'codEnabled': True, 'globalStore': False, 'loyaltyPointsEnabled': False, 'emiEnabled': True, 'chatEnabled': False, 'measurementModeEnabled': False, 'sampleModeEnabled': False, 'disableBuyButton': False}, 'earlyBirdOffer': None, 'serviceability': {'launchDate': '', 'returnPeriod': 30, 'descriptors': ['Pay on delivery might be available', 'Easy 30 days returns and exchanges', 'Try & Buy might be available'], 'procurementTimeInDays': {'6206': 4}}, 'buyButtonSellerOrder': [{'skuId': 38724440, 'sellerPartnerId': 6206}, {'skuId': 38724442, 'sellerPartnerId': 6206}, {'skuId': 38724446, 'sellerPartnerId': 6206}, {'skuId': 38724450, 'sellerPartnerId': 6206}, {'skuId': 38724452, 'sellerPartnerId': 6206}, {'skuId': 38724444, 'sellerPartnerId': 6206}, {'skuId': 38724448, 'sellerPartnerId': 6206}], 'sellers': [{'sellerPartnerId': 6206, 'sellerName': 'Puma Sports India Pvt. 
Ltd.(NSCM)'}], 'sizes': [{'skuId': 38724440, 'styleId': 11203218, 'action': '/product/11203218/related/6?co=1', 'label': '6', 'available': True, 'sizeType': 'UK Size', 'originalStyle': True, 'measurements': [{'type': 'Body Measurement', 'name': 'To Fit Foot Length', 'value': '24.5', 'minValue': '24.5', 'maxValue': '24.5', 'unit': 'cm', 'displayText': '24.5cm'}], 'allSizesList': [{'scaleCode': 'uk_size', 'sizeValue': '6', 'size': 'UK Size', 'order': 1, 'prefix': 'UK'}, {'scaleCode': 'us_size', 'sizeValue': '7', 'size': 'US Size', 'order': 2, 'prefix': 'US'}, {'scaleCode': 'euro_size', 'sizeValue': '39', 'size': 'Euro Size', 'order': 3, 'prefix': 'EURO'}], 'sizeSellerData': [{'mrp': 6499, 'sellerPartnerId': 6206, 'availableCount': 32, 'sellableInventoryCount': 32, 'warehouses': ['106', '328'], 'supplyType': 'ON_HAND', 'discountId': '11203218:23363948', 'discountedPrice': 2924}]}, {'skuId': 38724442, 'styleId': 11203218, 'action': '/product/11203218/related/7?co=1', 'label': '7', 'available': True, 'sizeType': 'UK Size', 'originalStyle': True, 'measurements': [{'type': 'Body Measurement', 'name': 'To Fit Foot Length', 'value': '25.4', 'minValue': '25.4', 'maxValue': '25.4', 'unit': 'cm', 'displayText': '25.4cm'}], 'allSizesList': [{'scaleCode': 'uk_size', 'sizeValue': '7', 'size': 'UK Size', 'order': 1, 'prefix': 'UK'}, {'scaleCode': 'us_size', 'sizeValue': '8', 'size': 'US Size', 'order': 2, 'prefix': 'US'}, {'scaleCode': 'euro_size', 'sizeValue': '40.5', 'size': 'Euro Size', 'order': 3, 'prefix': 'EURO'}], 'sizeSellerData': [{'mrp': 6499, 'sellerPartnerId': 6206, 'availableCount': 86, 'sellableInventoryCount': 86, 'warehouses': ['106'], 'supplyType': 'ON_HAND', 'discountId': '11203218:23363948', 'discountedPrice': 2924}]}, {'skuId': 38724444, 'styleId': 11203218, 'action': '/product/11203218/related/8?co=1', 'label': '8', 'available': True, 'sizeType': 'UK Size', 'originalStyle': True, 'measurements': [{'type': 'Body Measurement', 'name': 'To Fit Foot Length', 
'value': '26.2', 'minValue': '26.2', 'maxValue': '26.2', 'unit': 'cm', 'displayText': '26.2cm'}], 'allSizesList': [{'scaleCode': 'uk_size', 'sizeValue': '8', 'size': 'UK Size', 'order': 1, 'prefix': 'UK'}, {'scaleCode': 'us_size', 'sizeValue': '9', 'size': 'US Size', 'order': 2, 'prefix': 'US'}, {'scaleCode': 'euro_size', 'sizeValue': '42', 'size': 'Euro Size', 'order': 3, 'prefix': 'EURO'}], 'sizeSellerData': [{'mrp': 6499, 'sellerPartnerId': 6206, 'availableCount': 188, 'sellableInventoryCount': 188, 'warehouses': ['106'], 'supplyType': 'ON_HAND', 'discountId': '11203218:23363948', 'discountedPrice': 2924}]}, {'skuId': 38724446, 'styleId': 11203218, 'action': '/product/11203218/related/9?co=1', 'label': '9', 'available': True, 'sizeType': 'UK Size', 'originalStyle': True, 'measurements': [{'type': 'Body Measurement', 'name': 'To Fit Foot Length', 'value': '27.1', 'minValue': '27.1', 'maxValue': '27.1', 'unit': 'cm', 'displayText': '27.1cm'}], 'allSizesList': [{'scaleCode': 'uk_size', 'sizeValue': '9', 'size': 'UK Size', 'order': 1, 'prefix': 'UK'}, {'scaleCode': 'us_size', 'sizeValue': '10', 'size': 'US Size', 'order': 2, 'prefix': 'US'}, {'scaleCode': 'euro_size', 'sizeValue': '43', 'size': 'Euro Size', 'order': 3, 'prefix': 'EURO'}], 'sizeSellerData': [{'mrp': 6499, 'sellerPartnerId': 6206, 'availableCount': 163, 'sellableInventoryCount': 163, 'warehouses': ['106'], 'supplyType': 'ON_HAND', 'discountId': '11203218:23363948', 'discountedPrice': 2924}]}, {'skuId': 38724448, 'styleId': 11203218, 'action': '/product/11203218/related/10?co=1', 'label': '10', 'available': True, 'sizeType': 'UK Size', 'originalStyle': True, 'measurements': [{'type': 'Body Measurement', 'name': 'To Fit Foot Length', 'value': '27.9', 'minValue': '27.9', 'maxValue': '27.9', 'unit': 'cm', 'displayText': '27.9cm'}], 'allSizesList': [{'scaleCode': 'uk_size', 'sizeValue': '10', 'size': 'UK Size', 'order': 1, 'prefix': 'UK'}, {'scaleCode': 'us_size', 'sizeValue': '11', 'size': 'US Size', 
'order': 2, 'prefix': 'US'}, {'scaleCode': 'euro_size', 'sizeValue': '44.5', 'size': 'Euro Size', 'order': 3, 'prefix': 'EURO'}], 'sizeSellerData': [{'mrp': 6499, 'sellerPartnerId': 6206, 'availableCount': 153, 'sellableInventoryCount': 153, 'warehouses': ['106'], 'supplyType': 'ON_HAND', 'discountId': '11203218:23363948', 'discountedPrice': 2924}]}, {'skuId': 38724450, 'styleId': 11203218, 'action': '/product/11203218/related/11?co=1', 'label': '11', 'available': True, 'sizeType': 'UK Size', 'originalStyle': True, 'measurements': [{'type': 'Body Measurement', 'name': 'To Fit Foot Length', 'value': '28.8', 'minValue': '28.8', 'maxValue': '28.8', 'unit': 'cm', 'displayText': '28.8cm'}], 'allSizesList': [{'scaleCode': 'uk_size', 'sizeValue': '11', 'size': 'UK Size', 'order': 1, 'prefix': 'UK'}, {'scaleCode': 'us_size', 'sizeValue': '12', 'size': 'US Size', 'order': 2, 'prefix': 'US'}, {'scaleCode': 'euro_size', 'sizeValue': '46', 'size': 'Euro Size', 'order': 3, 'prefix': 'EURO'}], 'sizeSellerData': [{'mrp': 6499, 'sellerPartnerId': 6206, 'availableCount': 43, 'sellableInventoryCount': 43, 'warehouses': ['106'], 'supplyType': 'ON_HAND', 'discountId': '11203218:23363948', 'discountedPrice': 2924}]}, {'skuId': 38724452, 'styleId': 11203218, 'action': '/product/11203218/related/12?co=1', 'label': '12', 'available': False, 'sizeType': 'UK Size', 'originalStyle': True, 'measurements': [{'type': 'Body Measurement', 'name': 'To Fit Foot Length', 'value': '29.6', 'minValue': '29.6', 'maxValue': '29.6', 'unit': 'cm', 'displayText': '29.6cm'}], 'allSizesList': [{'scaleCode': 'uk_size', 'sizeValue': '12', 'size': 'UK Size', 'order': 1, 'prefix': 'UK'}, {'scaleCode': 'us_size', 'sizeValue': '13', 'size': 'US Size', 'order': 2, 'prefix': 'US'}, {'scaleCode': 'euro_size', 'sizeValue': '47', 'size': 'Euro Size', 'order': 3, 'prefix': 'EURO'}], 'sizeSellerData': []}], 'discounts': [{'type': 1, 'freeItem': False, 'label': '(55% OFF)', 'discountText': '', 'timerStart': '0', 
'timerEnd': '1597084200', 'discountPercent': 55, 'offer': '', 'discountId': '11203218:23363948', 'heading': None, 'description': None, 'link': None, 'freeItemImage': None}], 'offers': [{'type': 'EMI', 'title': 'EMI option available', 'description': '', 'action': '/faqs', 'image': None}], 'bundledSkus': None, 'richPdp': None, 'landingPageUrl': 'sports-shoes/puma/puma-men-blue-hybrid-fuego-running-shoes/11203218/buy'}, 'pageName': 'Pdp', 'atsa': ['Sport', 'Material', 'Fastening', 'Ankle Height', 'Outsole Type', 'Cleats', 'Pronation for Running Shoes', 'Arch Type', 'Cushioning', 'Running Type', 'Warranty', 'Distance', 'Number of Components', 'Surface Type', 'Technology']}
I need to get the data returned by the v2?count=3 request that is made by the page https://support.hpe.com/hpesc/public/km/Security-Bulletin-Library#sort=relevancy&layout=table&numberOfResults=25&f:#kmdocsecuritybulletin=[4000003]&f:#kmdoclanguagecode=[cv1871440,cv1871463]&hpe=1
The data I need is shown in the image
import scrapy
class HPUXSpider(_BaseSpider):
    name = 'hp_ux_spider'
    def start_requests(self):
        return [scrapy.FormRequest(
url='https://platform.cloud.coveo.com/rest/search/v2?count=3',
method='POST',
formdata={
'actionsHistory': r'[{"name":"Query","time":"\"2020-07-13T12:49:51.480Z\""},{"name":"Query","time":"\"2020-07-13T10:44:35.303Z\""},{"name":"Query","time":"\"2020-07-13T07:49:10.078Z\""},{"name":"Query","time":"\"2020-07-13T06:58:59.532Z\""},{"name":"Query","time":"\"2020-07-13T06:57:24.599Z\""},{"name":"Query","time":"\"2020-07-12T21:47:41.323Z\""},{"name":"Query","time":"\"2020-07-12T16:38:19.741Z\""},{"name":"Query","time":"\"2020-07-12T06:04:36.049Z\""},{"name":"Query","time":"\"2020-07-12T05:59:39.814Z\""},{"name":"Query","time":"\"2020-07-11T19:31:55.963Z\""},{"name":"Query","time":"\"2020-07-11T19:29:55.997Z\""},{"name":"Query","time":"\"2020-07-11T19:23:29.999Z\""},{"name":"Query","time":"\"2020-07-11T19:21:09.859Z\""},{"name":"Query","time":"\"2020-07-11T19:19:03.748Z\""},{"name":"Query","time":"\"2020-07-11T19:17:23.735Z\""},{"name":"Query","time":"\"2020-07-11T19:14:51.152Z\""},{"name":"Query","time":"\"2020-07-11T18:54:03.418Z\""},{"name":"Query","time":"\"2020-07-11T12:28:39.484Z\""},{"name":"Query","time":"\"2020-07-10T13:08:42.876Z\""},{"name":"Query","time":"\"2020-07-10T12:57:51.285Z\""}]',
'referrer': 'https://support.hpe.com/hpesc/public/km/Security-Bulletin-Library',
'visitorId': '33b0ede7-3274-486f-a31c-23ed3001ad91',
'isGuestUser': 'false',
'aq': '(#kmdoctypedetails==cv66000018) ((NOT #kmdoctype=cv60000001)) (#kmdocsecuritybulletin==4000003) (#kmdoclanguagecode==(cv1871440,cv1871463))',
'cq': '(#source=="cdp-km-document-pro-h4-v2")',
'searchHub': 'HPE-SecurityBulletins-Page',
'locale': 'ru',
'firstResult': '0',
'numberOfResults': '25',
'excerptLength': '500',
'enableDidYouMean': 'true',
'sortCriteria': 'relevancy',
'queryFunctions': '[]',
'rankingFunctions': '[]',
'groupBy': r'[{"field":"#kmdocsecuritybulletin","maximumNumberOfValues":20,"sortCriteria":"nosort","injectionDepth":1000,"completeFacetWithStandardValues":true,"allowedValues":["4000019","4000018","4000005","4000004","4000017","4000003","4000009","4000006","4000007","4000008","4000001","4000002","4000010","4000011","4000012","4000013","4000014","4000015","4000016"],"advancedQueryOverride":"(#kmdoctypedetails==cv66000018) ((NOT #kmdoctype=cv60000001)) (#kmdoclanguagecode==(cv1871440,cv1871463))","constantQueryOverride":"(#source==\"cdp-km-document-pro-h4-v2\")"},{"field":"#kmdoclanguagecode","maximumNumberOfValues":6,"sortCriteria":"Score","injectionDepth":1000,"completeFacetWithStandardValues":true,"allowedValues":["cv1871440","cv1871463"],"advancedQueryOverride":"(#kmdoctypedetails==cv66000018) ((NOT #kmdoctype=cv60000001)) (#kmdocsecuritybulletin==4000003)","constantQueryOverride":"(#source==\"cdp-km-document-pro-h4-v2\")"},{"field":"#kmdoctopissue","maximumNumberOfValues":6,"sortCriteria":"Score","injectionDepth":1000,"completeFacetWithStandardValues":true,"allowedValues":[],"advancedQueryOverride":"(#kmdoctypedetails==cv66000018) ((NOT #kmdoctype=cv60000001)) (#kmdocsecuritybulletin==4000003) (#kmdoclanguagecode==(cv1871440,cv1871463))","constantQueryOverride":"(#source==\"cdp-km-document-pro-h4-v2\") #kmdoctopissueexpirationdate>today"},{"field":"#kmdocdisclosurelevel","maximumNumberOfValues":6,"sortCriteria":"Score","injectionDepth":1000,"completeFacetWithStandardValues":true,"allowedValues":[]},{"field":"#hpescuniversaldate","completeFacetWithStandardValues":true,"maximumNumberOfValues":1,"sortCriteria":"nosort","generateAutomaticRanges":true,"advancedQueryOverride":"(#kmdoctypedetails==cv66000018) ((NOT #kmdoctype=cv60000001)) (#kmdocsecuritybulletin==4000003) (#kmdoclanguagecode==(cv1871440,cv1871463)) #uri","constantQueryOverride":"(#source==\"cdp-km-document-pro-h4-v2\") 
#hpescuniversaldate>1970/01/01#00:00:00"},{"field":"#hpescuniversaldate","completeFacetWithStandardValues":true,"maximumNumberOfValues":1,"sortCriteria":"nosort","generateAutomaticRanges":true,"constantQueryOverride":"(#source==\"cdp-km-document-pro-h4-v2\") #hpescuniversaldate>1970/01/01#00:00:00 #hpescuniversaldate>1970/01/01#00:00:00"},{"field":"#hpescuniversaldate","maximumNumberOfValues":5,"sortCriteria":"nosort","injectionDepth":1000,"completeFacetWithStandardValues":true,"rangeValues":[{"start":"1900-01-31T18:20:09.000Z","end":"2020-07-13T17:00:00.000Z","label":"All dates","endInclusive":false},{"start":"2020-07-05T17:00:00.000Z","end":"2020-07-13T17:00:00.000Z","label":"Last 7 days","endInclusive":false},{"start":"2020-06-12T17:00:00.000Z","end":"2020-07-13T17:00:00.000Z","label":"Last 30 days","endInclusive":false},{"start":"2020-05-13T17:00:00.000Z","end":"2020-07-13T17:00:00.000Z","label":"Last 60 days","endInclusive":false},{"start":"2020-04-13T17:00:00.000Z","end":"2020-07-12T17:00:00.000Z","label":"Last 90 days","endInclusive":false}]}]',
'facetOptions': '{}',
'categoryFacets': '[]',
'retrieveFirstSentences': 'true',
'timezone': 'Asia/Tomsk',
'enableQuerySyntax': 'false',
'enableDuplicateFiltering': 'false',
'enableCollaborativeRating': 'false',
'debug': 'false',
'context': '{"tracking_id":"HPESCXwxYkRD5BgcAAFnGlJ0AAAAY","active_features":"DCS,DHFWS,SA2,patchCoveoSearchToggle,sa2_product_focus_target_levels_toggle,toggleCsr,toggleSecBulletin","user_tracking_id":"XwRimRD5AcgAAFl2OMkAAAAW"}',
'allowQueriesWithoutKeywords': 'true',
},
callback=self.save_response,
cb_kwargs=dict(path_dir=DATA_DIR, file_name='1.json')
) ]
Log
2020-07-14 07:17:33 [scrapy.core.engine] INFO: Spider opened
2020-07-14 07:17:33 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-07-14 07:17:33 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-07-14 07:17:34 [scrapy_user_agents.middlewares] DEBUG: Assigned User-Agent Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.85 Safari/537.36
2020-07-14 07:17:34 [scrapy.core.engine] DEBUG: Crawled (401) <POST https://platform.cloud.coveo.com/rest/search/v2?count=3> (referer: None)
2020-07-14 07:17:34 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <401 https://platform.cloud.coveo.com/rest/search/v2?count=3>: HTTP status code is not handled or not allowed
2020-07-14 07:17:34 [scrapy.core.engine] INFO: Closing spider (finished)
2020-07-14 07:17:34 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
What am I doing wrong?
Traceback1
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/scrapy/crawler.py", line 192, in crawl
return self._crawl(crawler, *args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/scrapy/crawler.py", line 196, in _crawl
d = crawler.crawl(*args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/twisted/internet/defer.py", line 1613, in unwindGenerator
return _cancellableInlineCallbacks(gen)
File "/usr/local/lib/python3.7/site-packages/twisted/internet/defer.py", line 1529, in _cancellableInlineCallbacks
_inlineCallbacks(None, g, status)
--- <exception caught here> ---
File "/usr/local/lib/python3.7/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
result = g.send(result)
File "/usr/local/lib/python3.7/site-packages/scrapy/crawler.py", line 88, in crawl
start_requests = iter(self.spider.start_requests())
File "/code/hp_ux/splash/spiders/hp_ux_spider.py", line 50, in start_requests
cb_kwargs=dict(path_dir=DATA_DIR, file_name='1.json')
File "/usr/local/lib/python3.7/site-packages/scrapy/http/request/form.py", line 27, in __init__
super(FormRequest, self).__init__(*args, **kwargs)
builtins.TypeError: __init__() got an unexpected keyword argument 'params'
2020-07-14 11:32:04 [twisted] CRITICAL:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
result = g.send(result)
File "/usr/local/lib/python3.7/site-packages/scrapy/crawler.py", line 88, in crawl
start_requests = iter(self.spider.start_requests())
File "/code/hp_ux/splash/spiders/hp_ux_spider.py", line 50, in start_requests
cb_kwargs=dict(path_dir=DATA_DIR, file_name='1.json')
File "/usr/local/lib/python3.7/site-packages/scrapy/http/request/form.py", line 27, in __init__
super(FormRequest, self).__init__(*args, **kwargs)
TypeError: __init__() got an unexpected keyword argument 'params'
You have to send an Authorization header with requests to this site; the 401 response in your log means the request was rejected as unauthenticated:
# Methods of the spider class; the module needs `import json` and `import scrapy`.
def parse(self, response):
    headers = {
'Connection': 'keep-alive',
'Authorization': 'Bearer eyJhbGciOiJIUzI1NiJ9.eyJwaXBlbGluZSI6ImNkcC1ocGVzYy1waXBlbGluZS1wcm8taDQtdjEyIiwidXNlckdyb3VwcyI6WyJMT0NBTF9QT1JUQUxfSFBQX1VTRVJTIiwiTE9DQUxfUE9SVEFMX0NPVU5UUllfVVMiLCJMT0NBTF9QT1JUQUxfTEFOR1VBR0VfRU4iLCJMT0NBTF9QT1JUQUxfQ09NUEFOWV9IUEUiLCJMT0NBTF9QT1JUQUxfR1VFU1RfVVNFUlMiXSwidjgiOnRydWUsIm9yZ2FuaXphdGlvbiI6Imhld2xldHRwYWNrYXJkcHJvZHVjdGlvbml3bWc5Yjl3IiwidXNlcklkcyI6W3sicHJvdmlkZXIiOiJFbWFpbCBTZWN1cml0eSBQcm92aWRlciIsIm5hbWUiOiJhbm9ueW1vdXNAY292ZW8uY29tIiwidHlwZSI6IlVzZXIifV0sInJvbGVzIjpbInF1ZXJ5RXhlY3V0b3IiXSwiZXhwIjoxNTk0ODEzODI0LCJpYXQiOjE1OTQ3Mjc0MjR9.O-SGmzsy2QdMClI9CfmN5MY9G1JBQmCe9m379zFpa4Y',
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36',
'Content-Type': 'application/x-www-form-urlencoded; charset="UTF-8"',
'Accept': '*/*',
'Origin': 'https://support.hpe.com',
'Sec-Fetch-Site': 'cross-site',
'Sec-Fetch-Mode': 'cors',
'Sec-Fetch-Dest': 'empty',
'Referer': 'https://support.hpe.com/hpesc/public/km/Security-Bulletin-Library',
'Accept-Language': 'en-US,en;q=0.9,ru-RU;q=0.8,ru;q=0.7,uk;q=0.6,en-GB;q=0.5',
    }
    data = {
'actionsHistory': '[{"name":"Query","time":"\\"2020-07-14T11:50:24.995Z\\""},{"name":"Query","time":"\\"2020-07-14T11:15:14.602Z\\""}]',
'referrer': '',
'visitorId': 'deabe929-cc0e-41eb-ab62-f62e40aca82a',
'isGuestUser': 'false',
'aq': '(#kmdoctypedetails==cv66000018) ((NOT #kmdoctype=cv60000001)) (#kmdocsecuritybulletin==4000003) (#kmdoclanguagecode==(cv1871440,cv1871463))',
'cq': '(#source=="cdp-km-document-pro-h4-v2")',
'searchHub': 'HPE-SecurityBulletins-Page',
'locale': 'en',
'firstResult': '25',
'numberOfResults': '25',
'excerptLength': '500',
'enableDidYouMean': 'true',
'sortCriteria': 'relevancy',
'queryFunctions': '[]',
'rankingFunctions': '[]',
'groupBy': '[{"field":"#kmdocsecuritybulletin","maximumNumberOfValues":20,"sortCriteria":"nosort","injectionDepth":1000,"completeFacetWithStandardValues":true,"allowedValues":["4000019","4000018","4000005","4000004","4000017","4000003","4000009","4000006","4000007","4000008","4000001","4000002","4000010","4000011","4000012","4000013","4000014","4000015","4000016"],"advancedQueryOverride":"(#kmdoctypedetails==cv66000018) ((NOT #kmdoctype=cv60000001)) (#kmdoclanguagecode==(cv1871440,cv1871463))","constantQueryOverride":"(#source==\\"cdp-km-document-pro-h4-v2\\")"},{"field":"#kmdoclanguagecode","maximumNumberOfValues":6,"sortCriteria":"Score","injectionDepth":1000,"completeFacetWithStandardValues":true,"allowedValues":["cv1871440","cv1871463"],"advancedQueryOverride":"(#kmdoctypedetails==cv66000018) ((NOT #kmdoctype=cv60000001)) (#kmdocsecuritybulletin==4000003)","constantQueryOverride":"(#source==\\"cdp-km-document-pro-h4-v2\\")"},{"field":"#kmdoctopissue","maximumNumberOfValues":6,"sortCriteria":"Score","injectionDepth":1000,"completeFacetWithStandardValues":true,"allowedValues":[],"advancedQueryOverride":"(#kmdoctypedetails==cv66000018) ((NOT #kmdoctype=cv60000001)) (#kmdocsecuritybulletin==4000003) (#kmdoclanguagecode==(cv1871440,cv1871463))","constantQueryOverride":"(#source==\\"cdp-km-document-pro-h4-v2\\") #kmdoctopissueexpirationdate>today"},{"field":"#kmdocdisclosurelevel","maximumNumberOfValues":6,"sortCriteria":"Score","injectionDepth":1000,"completeFacetWithStandardValues":true,"allowedValues":[]},{"field":"#hpescuniversaldate","maximumNumberOfValues":5,"sortCriteria":"nosort","injectionDepth":1000,"completeFacetWithStandardValues":true,"rangeValues":[{"start":"1900-01-31T21:57:56.000Z","end":"2020-07-14T21:00:00.000Z","label":"All dates","endInclusive":false},{"start":"2020-07-06T21:00:00.000Z","end":"2020-07-14T21:00:00.000Z","label":"Last 7 days","endInclusive":false},{"start":"2020-06-13T21:00:00.000Z","end":"2020-07-14T21:00:00.000Z","label":"Last 30 
days","endInclusive":false},{"start":"2020-05-14T21:00:00.000Z","end":"2020-07-14T21:00:00.000Z","label":"Last 60 days","endInclusive":false},{"start":"2020-04-14T21:00:00.000Z","end":"2020-07-13T21:00:00.000Z","label":"Last 90 days","endInclusive":false}]},{"field":"#hpescuniversaldate","completeFacetWithStandardValues":true,"maximumNumberOfValues":1,"sortCriteria":"nosort","generateAutomaticRanges":true,"advancedQueryOverride":"(#kmdoctypedetails==cv66000018) ((NOT #kmdoctype=cv60000001)) (#kmdocsecuritybulletin==4000003) (#kmdoclanguagecode==(cv1871440,cv1871463)) #uri","constantQueryOverride":"(#source==\\"cdp-km-document-pro-h4-v2\\") #hpescuniversaldate>1970/01/01#00:00:00"},{"field":"#hpescuniversaldate","completeFacetWithStandardValues":true,"maximumNumberOfValues":1,"sortCriteria":"nosort","generateAutomaticRanges":true,"constantQueryOverride":"(#source==\\"cdp-km-document-pro-h4-v2\\") #hpescuniversaldate>1970/01/01#00:00:00 #hpescuniversaldate>1970/01/01#00:00:00"}]',
'facetOptions': '{}',
'categoryFacets': '[]',
'retrieveFirstSentences': 'true',
'timezone': 'Europe/Kiev',
'enableQuerySyntax': 'false',
'enableDuplicateFiltering': 'false',
'enableCollaborativeRating': 'false',
'debug': 'false',
'context': '{"tracking_id":"HPESCXw2cKBD5AcgAADvUM8IAAAAa","active_features":"DCS,DHFWS,SA2,patchCoveoSearchToggle,sa2_product_focus_target_levels_toggle,toggleCsr,toggleSecBulletin","user_tracking_id":"Xw2TthD5AcgAACecWi0AAAAZ"}',
'allowQueriesWithoutKeywords': 'true'
    }
    url = 'https://platform.cloud.coveo.com/rest/search/v2?count=3'
    yield scrapy.FormRequest(
        url=url,
        formdata=data,
        headers=headers,
        callback=self.parse_result
    )
def parse_result(self, response):
    # response.text is the decoded body; body_as_unicode() is deprecated in recent Scrapy releases
    j_obj = json.loads(response.text)
    print(j_obj)
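A side note on the hard-coded bearer token: it has the three dot-separated segments of a JWT, and decoding its payload appears to show an `exp` claim 24 hours after `iat`, so a copied value will presumably start returning 401 again once it expires. How to obtain a fresh token is not covered here. As a minimal sketch (the helper names `jwt_claims` and `is_expired` are made up for illustration, and the signature is deliberately not verified), you could check the expiry before sending the request:

```python
import base64
import json
import time


def jwt_claims(token):
    """Decode the payload (middle segment) of a JWT without verifying its signature."""
    payload = token.split(".")[1]
    # JWT segments are unpadded URL-safe base64; restore the padding before decoding.
    payload += "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(payload))


def is_expired(token, now=None):
    """Return True if the token's `exp` claim (seconds since the epoch) has passed."""
    now = time.time() if now is None else now
    return jwt_claims(token)["exp"] <= now
```

Pass the part after `Bearer ` to `is_expired`; if it returns True, the token has to be replaced before the spider runs.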
I have a Python script that tries to get every EC2 instance ID in every AWS account I own. A custom library (nwmaws) lists every account ID for me. For each account, I interpolate the account ID into a role ARN and call a function that generates an STS token, so that I can assume a role in that account and get its instance IDs. I am able to generate the STS tokens, but the response contains no instance IDs, just an HTTP 200 status code. Below is my code and the response.
CODE:
import boto3
import nwmaws
client = boto3.client('ec2')
accounts = nwmaws.Accounts().list()
def get_sts_token(**kwargs):
    role_arn = kwargs['RoleArn']
    region_name = kwargs['RegionName']
    sts = boto3.client(
        'sts',
        region_name=region_name,
    )
    token = sts.assume_role(
        RoleArn=role_arn,
        RoleSessionName='GetInstances',
        DurationSeconds=900,
    )
    return token["Credentials"]
def get_all_instances():
    for i in accounts:
        account_list = i.account_id
        role_arn = "arn:aws:iam::{}:role/ADFS-GlobalAdmins".format(account_list)
        get_sts_token(RoleArn=role_arn, RegionName="us-east-1")
        response = client.describe_instances()
        print(response)
get_all_instances()
RESPONSE:
{'Reservations': [], 'ResponseMetadata': {'RequestId': '5c1e8326-5a36-4866-9cfd-bd83bff62d05', 'HTTPStatusCode': 200, 'HTTPHeaders': {'content-type': 'text/xml;charset=UTF-8', 'transfer-encoding': 'chunked', 'vary': 'Accept-Encoding', 'date': 'Sun, 13 May 2018 21:23:25 GMT', 'server': 'AmazonEC2'}, 'RetryAttempts': 0}}
{'Reservations': [], 'ResponseMetadata': {'RequestId': '1e165d98-0b5c-4172-8917-bf688afbad7c', 'HTTPStatusCode': 200, 'HTTPHeaders': {'content-type': 'text/xml;charset=UTF-8', 'content-length': '230', 'date': 'Sun, 13 May 2018 21:23:25 GMT', 'server': 'AmazonEC2'}, 'RetryAttempts': 0}}
{'Reservations': [], 'ResponseMetadata': {'RequestId': 'e18526d5-c7e9-465f-a1fd-87e1d652e95c', 'HTTPStatusCode': 200, 'HTTPHeaders': {'content-type': 'text/xml;charset=UTF-8', 'transfer-encoding': 'chunked', 'vary': 'Accept-Encoding', 'date': 'Sun, 13 May 2018 21:23:25 GMT', 'server': 'AmazonEC2'}, 'RetryAttempts': 0}}
etc. etc...
DESIRED RESPONSE:
i-xxxxxx
i-xxxxxx
i-xxxxxx
i-xxxxxx
i-xxxxxx
etc etc
As @Michael - sqlbot mentioned, you are not using the credentials returned by the assume_role API call. Create your EC2 client from those credentials: replace the get_sts_token(RoleArn=role_arn, RegionName="us-east-1") line in your code with the following lines, which retrieve the temporary credentials and use them to list the instances:
credentials = get_sts_token(RoleArn=role_arn, RegionName="us-east-1")
access_key = credentials['AccessKeyId']
secret_key = credentials['SecretAccessKey']
token = credentials['SessionToken']
session = boto3.session.Session(
    aws_access_key_id=access_key,
    aws_secret_access_key=secret_key,
    aws_session_token=token
)
client = session.client('ec2', region_name='us-east-1')
response = client.describe_instances()
print(response)
This will return all the instances in us-east-1. If you need the list of instances in all regions, call describe_regions API and iterate through the list.
References:
See the boto3 documentation for the Session object.
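To get from the raw describe_instances response to the bare instance IDs asked for in the question, and to cover the region iteration mentioned above, something along these lines might work. It is only a sketch: the function names are made up for illustration, and `session` is assumed to be a boto3.session.Session built from the assumed-role credentials as shown earlier.

```python
def extract_instance_ids(response):
    """Collect instance IDs from a describe_instances response dict."""
    return [
        instance["InstanceId"]
        for reservation in response.get("Reservations", [])
        for instance in reservation.get("Instances", [])
    ]


def instance_ids_in_all_regions(session):
    """List instance IDs in every region returned by describe_regions.

    `session` is assumed to be a boto3.session.Session created from the
    temporary credentials of the assumed role.
    """
    regions = session.client("ec2", region_name="us-east-1").describe_regions()["Regions"]
    ids = []
    for region in regions:
        ec2 = session.client("ec2", region_name=region["RegionName"])
        # describe_instances is paginated; the paginator follows NextToken.
        for page in ec2.get_paginator("describe_instances").paginate():
            ids.extend(extract_instance_ids(page))
    return ids
```

Inside the loop over accounts you would then print one ID per line with `for instance_id in instance_ids_in_all_regions(session): print(instance_id)`.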
The output of print(response) is correct.
However, you can try this to get your desired output:
import boto3
# The instances collection belongs to the EC2 resource, not the low-level client
ec2 = boto3.resource('ec2')
instances = ec2.instances.filter(
    Filters=[{'Name': 'instance-state-name', 'Values': ['running']}])
for instance in instances:
    print(instance.id, instance.instance_type)