requests POST and retry mechanism problem - python-3.x

I have the following code in my web scraper:
postbody = {'Submit': {}, 'czas_kon2': '', 'czas_pocz2': '', 'num_pacz': '', 'typ': 'wsz'}
post = requests.post(spolka, data=postbody)
data = post.text
I am executing it across 400 web pages in a loop, scraping data with multiprocessing (8 processes).
data is supposed to contain the whole HTML page for further XML processing.
But out of the 400 pages, I get 2 that do not return meaningful content. I suspect it is because of the heavy load I create. I tried time.sleep(1) and time.sleep(10), but no luck.
How can I ensure that the data (or post) variable always contains the whole page, as it does for the 398 working ones?
I tried a simple while loop for the retry... but it is far from perfect (I was able to get 1 of the remaining 2 pages after one extra attempt):
while len(data) < 1024:
    postbody = {'Submit': {}, 'czas_kon2': '', 'czas_pocz2': '', 'num_pacz': '', 'typ': 'wsz'}
    post = requests.post(spolka, data=postbody)
    data = post.text

I think you should add request headers.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0'}
postbody = {'Submit': {}, 'czas_kon2': '', 'czas_pocz2': '', 'num_pacz': '', 'typ': 'wsz'}
post = requests.post(spolka, data=postbody, headers=headers)
And here is a fuller headers example:
headers = {
    'Accept': 'text/javascript, application/javascript, application/ecmascript, application/x-ecmascript, */*; q=0.01',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
    'Host': 'www.google.com',
    'Sec-Fetch-Mode': 'cors',
    'Sec-Fetch-Site': 'same-origin',
    'X-Requested-With': 'XMLHttpRequest',
    'Cookie': '',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0'
}
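Headers help with sites that filter non-browser clients, but note that they will not fix a response that arrives as a 200 with a truncated body. For that, a bounded retry with backoff is safer than an open-ended while loop. A minimal sketch (the 1024-byte threshold is the same "whole page" heuristic as in the question, and spolka is the URL variable from the question):

import time
import requests

postbody = {'Submit': {}, 'czas_kon2': '', 'czas_pocz2': '', 'num_pacz': '', 'typ': 'wsz'}
data = ''
for attempt in range(5):  # cap the attempts instead of looping forever
    post = requests.post(spolka, data=postbody, timeout=30)
    data = post.text
    if post.ok and len(data) >= 1024:  # looks like a whole page; stop retrying
        break
    time.sleep(2 ** attempt)  # exponential backoff between attempts

urllib3's Retry class (mounted on a Session via HTTPAdapter) can retry connection errors and selected status codes automatically, but it cannot see a short-but-successful body, so a content check like the one above is still needed.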


HTTP Request using Python

For those who are reading my thread, I'd like to thank you in advance for your assistance, and also ask for a bit of leniency when it comes to incorrect terminology, as I am still a newbie.
I've been trying to retrieve stock codes from the KRX website, as I could not find any other resource for the information that I need. I tried to use the requests library in Python, but the data I needed is loaded asynchronously, which made it inaccessible.
The problem is that in order to retrieve the information, I need to make two requests to an endpoint: one to retrieve a code to be used in the body of the second request. But when I make the second request, it returns an empty list.
I managed to locate the API calls which retrieved the stock codes as shown below.
[screenshot: TwoRequests]
To my knowledge, it requires two API calls: the first retrieves a code that works as an access token for the second, which returns the stock codes I am after.
I've managed to retrieve the code from the first request with the following code:
import requests

url = 'https://global.krx.co.kr/contents/COM/GenerateOTP.jspx'
headers = {
    'Cookie': 'SCOUTER=x22rkf7ltsmr7l; __utma=88009422.986813715.1652669493.1652669493.1652669493.1; SCOUTER=z6pj0p85muce99; JSESSIONID=bOnAJtLWSpK1BiCuhWD0ldj1TqW5z6wEcn65oVgtyie841OlbdJs3fEHpUs1QtAV.bWRjX2RvbWFpbi9tZGNvd2FwMS1tZGNhcHAwMQ==; JSESSIONID=C2794518AD56B7119F0DA630B73B05AA.58tomcat2',
    'Connection': 'keep-alive',
    'accept': '*/*',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9,ko;q=0.8',
    'host': 'global.krx.co.kr',
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36',
}
params = {
    'bld': 'COM/stock_isu_info',
    'name': 'finderBld',
    '_': '1668677450106',
}
# make a GET request to the url and keep the connection open
response = requests.get(url, headers=headers, params=params, stream=True)
# response = requests.get(url, params=params, headers=headers)
relay_data = response.text
But upon sending a request to the second endpoint with the code as the payload, it returns an empty list, whereas I was expecting a response like the following:
[screenshot: PayloadNeeded]
The code I used to make the second request is the following (I added lots of values to the headers and body in the hope of retrieving the data by simulating the values used on the web page):
url = 'https://global.krx.co.kr/contents/GLB/99/GLB99000001.jspx'
headers = {
    # ':authority': 'global.krx.co.kr',
    # ':method': 'POST',
    # ':path': '/contents/GLB/99/GLB99000001.jspx',
    # ':scheme': 'https',
    'accept': '*/*',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9,ko;q=0.8',
    'content-length': '0',
    'content-type': 'application/x-www-form-urlencoded; charset=UTF-8',
    'Cookie': 'SCOUTER=x22rkf7ltsmr7l; __utma=88009422.986813715.1652669493.1652669493.1652669493.1; SCOUTER=z6pj0p85muce99; JSESSIONID=bOnAJtLWSpK1BiCuhWD0ldj1TqW5z6wEcn65oVgtyie841OlbdJs3fEHpUs1QtAV.bWRjX2RvbWFpbi9tZGNvd2FwMS1tZGNhcHAwMQ==; JSESSIONID=C2794518AD56B7119F0DA630B73B05AA.58tomcat2',
    'origin': 'https://global.krx.co.kr',
    'referer': 'https://global.krx.co.kr/contents/GLB/99/GLB99000001.jsp',
    'sec-ch-ua': '"Google Chrome";v="89", "Chromium";v="89", ";Not A Brand";v="99"',
    'sec-ch-ua-mobile': '?0',
    'sec-fetch-dest': 'empty',
    'sec-fetch-mode': 'cors',
    'sec-fetch-site': 'same-origin',
    'sec-gpc': '1',
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36',
    'x-requested-with': 'XMLHttpRequest',
}
payload = {
    'market_gubun': '0',
    'isu_cdnm': 'All',
    'isu_cd': '',
    'isu_nm': '',
    'isu_srt_cd': '',
    'sort': '',
    'ck_std_ind_cd': '20',
    'par_pr': '',
    'cpta_scl': '',
    'sttl_trm': '',
    'lst_stk_vl': '1',
    'in_lst_stk_vl': '',
    'in_lst_stk_vl2': '',
    'cpt': '1',
    'in_cpt': '',
    'in_cpt2': '',
    'nat_tot_amt': '1',
    'in_nat_tot_amt': '',
    'in_nat_tot_amt2': '',
    'pagePath': '/contents/GLB/03/0308/0308010000/GLB0308010000.jsp',
    'code': relay_data,
    'pageFirstCall': 'Y',
}
# make the request with url, headers, and body
response = requests.post(url, headers=headers, data=payload)
print(response.text)
And here is the output for the code above:
{"DS1":[]}
Any help would be very much appreciated
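One thing worth trying (a sketch under assumptions, not a confirmed fix): issue both requests from a single requests.Session, so that any cookies the OTP endpoint sets (such as JSESSIONID) are carried into the second request automatically, and use the freshly generated code immediately, since OTP-style tokens typically expire quickly:

import requests

# Sketch: one shared session so server-set cookies flow between the two calls;
# headers trimmed to the ones most likely to matter.
with requests.Session() as s:
    s.headers.update({
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36',
        'referer': 'https://global.krx.co.kr/contents/GLB/99/GLB99000001.jsp',
        'x-requested-with': 'XMLHttpRequest',
    })
    otp = s.get(
        'https://global.krx.co.kr/contents/COM/GenerateOTP.jspx',
        params={'bld': 'COM/stock_isu_info', 'name': 'finderBld'},
    )
    payload = {
        # ... the same form fields as in the question ...
        'code': otp.text,  # use the fresh token right away
        'pageFirstCall': 'Y',
    }
    r = s.post('https://global.krx.co.kr/contents/GLB/99/GLB99000001.jspx', data=payload)
    print(r.text)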

Can I use the POST method from the requests lib on this Binance site?

Here is the site:
https://www.binance.com/ru/futures-activity/leaderboard?type=myProfile&tradeType=PERPETUAL&encryptedUid=E921F42DCD4D9F6ECC0DFCE3BAB1D11A
I am parsing the trader's positions with Selenium, but today I realized that I could use a POST request instead.
Here is what the browser's "Network" tab shows: [screenshot]
Here is the response preview: [screenshot]
I have no experience with the POST method of requests. I tried this, but it doesn't work:
import requests
hd = {'accept':"*/*",'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.67 Safari/537.36'}
ses = requests.Session()
c = ses.post('https://www.binance.com/bapi/futures/v1/public/future/leaderboard/getOtherPosition',headers=hd)
print(c.text)
Output is:
{"code":"000002","message":"illegal parameter","messageDetail":null,"data":null,"success":false}
Can someone help me do it, please? Is it even possible?
It works as a POST request:
import requests

url = 'https://www.binance.com/bapi/futures/v1/public/future/leaderboard/getOtherPerformance'
headers = {
    "content-type": "application/json",
    "x-trace-id": "4c3d6fce-a2d8-421e-9d5b-e0c12bd2c7c0",
    "x-ui-request-trace": "4c3d6fce-a2d8-421e-9d5b-e0c12bd2c7c0"
}
payload = {"encryptedUid": "E921F42DCD4D9F6ECC0DFCE3BAB1D11A", "tradeType": "PERPETUAL"}
req = requests.post(url, headers=headers, json=payload).json()
#print(req)
for item in req['data']:
    roi = item['value']
    print(roi)
Output:
-0.023215
-91841.251668
0.109495
390421.996614
-0.063094
-266413.73955621
0.099181
641189.24407088
0.072079
265977.556474
-0.09197
-400692.52138279
-0.069988
-469016.33171481
0.0445
292594.20440128
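The detail that makes this work is the json= argument: requests serializes the dict to a JSON body and sets Content-Type: application/json, which is what this endpoint expects. The original attempt sent no body at all, which is the likely cause of the "illegal parameter" error. For contrast, a sketch of the two call styles:

import requests

url = 'https://www.binance.com/bapi/futures/v1/public/future/leaderboard/getOtherPosition'
payload = {"encryptedUid": "E921F42DCD4D9F6ECC0DFCE3BAB1D11A", "tradeType": "PERPETUAL"}

# json= sends a JSON body and the application/json content type
r = requests.post(url, json=payload)

# data= would send a form-encoded body (encryptedUid=...&tradeType=...),
# which a JSON API will typically reject:
# r = requests.post(url, data=payload)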
I used curlconverter, and it helped me a lot! Here is the working code:
import requests

headers = {
    'authority': 'www.binance.com',
    'accept': '*/*',
    'accept-language': 'ru-RU,ru;q=0.9,en-US;q=0.8,en;q=0.7,ja;q=0.6,ko;q=0.5,zh-CN;q=0.4,zh;q=0.3',
    'bnc-uuid': '0202c537-8c2b-463a-bdef-33761d21986a',
    'clienttype': 'web',
    'csrftoken': 'd41d8cd98f00b204e9800998ecf8427e',
    'device-info': 'eyJzY3JlZW5fcmVzb2x1dGlvbiI6IjE5MjAsMTA4MCIsImF2YWlsYWJsZV9zY3JlZW5fcmVzb2x1dGlvbiI6IjE5MjAsMTA0MCIsInN5c3RlbV92ZXJzaW9uIjoiV2luZG93cyAxMCIsImJyYW5kX21vZGVsIjoidW5rbm93biIsInN5c3RlbV9sYW5nIjoicnUtUlUiLCJ0aW1lem9uZSI6IkdNVCszIiwidGltZXpvbmVPZmZzZXQiOi0xODAsInVzZXJfYWdlbnQiOiJNb3ppbGxhLzUuMCAoV2luZG93cyBOVCAxMC4wOyBXaW42NDsgeDY0KSBBcHBsZVdlYktpdC81MzcuMzYgKEtIVE1MLCBsaWtlIEdlY2tvKSBDaHJvbWUvMTAxLjAuNDk1MS42NyBTYWZhcmkvNTM3LjM2IiwibGlzdF9wbHVnaW4iOiJQREYgVmlld2VyLENocm9tZSBQREYgVmlld2VyLENocm9taXVtIFBERiBWaWV3ZXIsTWljcm9zb2Z0IEVkZ2UgUERGIFZpZXdlcixXZWJLaXQgYnVpbHQtaW4gUERGIiwiY2FudmFzX2NvZGUiOiI1ZjhkZDMyNCIsIndlYmdsX3ZlbmRvciI6Ikdvb2dsZSBJbmMuIChJbnRlbCkiLCJ3ZWJnbF9yZW5kZXJlciI6IkFOR0xFIChJbnRlbCwgSW50ZWwoUikgVUhEIEdyYXBoaWNzIDYyMCBEaXJlY3QzRDExIHZzXzVfMCBwc181XzAsIEQzRDExKSIsImF1ZGlvIjoiMTI0LjA0MzQ3NTI3NTE2MDc0IiwicGxhdGZvcm0iOiJXaW4zMiIsIndlYl90aW1lem9uZSI6IkV1cm9wZS9Nb3Njb3ciLCJkZXZpY2VfbmFtZSI6IkNocm9tZSBWMTAxLjAuNDk1MS42NyAoV2luZG93cykiLCJmaW5nZXJwcmludCI6IjE5YWFhZGNmMDI5ZTY1MzU3N2Q5OGYwMmE0NDE4Nzc5IiwiZGV2aWNlX2lkIjoiIiwicmVsYXRlZF9kZXZpY2VfaWRzIjoiMTY1MjY4OTg2NTQwMGdQNDg1VEtmWnVCeUhONDNCc2oifQ==',
    'fvideo-id': '3214483f88c0abbba34e5ecf5edbeeca1e56e405',
    'lang': 'ru',
    'origin': 'https://www.binance.com',
    'referer': 'https://www.binance.com/ru/futures-activity/leaderboard?type=myProfile&tradeType=PERPETUAL&encryptedUid=E921F42DCD4D9F6ECC0DFCE3BAB1D11A',
    'sec-ch-ua': '" Not A;Brand";v="99", "Chromium";v="101", "Google Chrome";v="101"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Windows"',
    'sec-fetch-dest': 'empty',
    'sec-fetch-mode': 'cors',
    'sec-fetch-site': 'same-origin',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.67 Safari/537.36',
    'x-trace-id': 'e9d5223c-5d71-4834-8563-c253a1fc3ae8',
    'x-ui-request-trace': 'e9d5223c-5d71-4834-8563-c253a1fc3ae8',
}
json_data = {
    'encryptedUid': 'E921F42DCD4D9F6ECC0DFCE3BAB1D11A',
    'tradeType': 'PERPETUAL',
}
response = requests.post('https://www.binance.com/bapi/futures/v1/public/future/leaderboard/getOtherPosition', headers=headers, json=json_data)
print(response.text)
So the output now is:
{"code":"000000","message":null,"messageDetail":null,
"data":{
"otherPositionRetList":[{"symbol":"ETHUSDT","entryPrice":1985.926527932,"markPrice":2013.57606795,"pnl":41926.93300012,"roe":0.05492624,"updateTime":[2022,5,22,15,35,39,358000000],"amount":1516.370,"updateTimeStamp":1653233739358,"yellow":true,"tradeBefore":false},{"symbol":"KSMUSDT","entryPrice":80.36574159583,"markPrice":79.46000000,"pnl":-1118.13799285,"roe":-0.01128900,"updateTime":[2022,5,16,11,0,5,608000000],"amount":1234.5,"updateTimeStamp":1652698805608,"yellow":false,"tradeBefore":false},{"symbol":"IMXUSDT","entryPrice":0.9969444089129,"markPrice":0.97390429,"pnl":-13861.75961996,"roe":-0.02365747,"updateTime":[2022,5,22,15,57,3,329000000],"amount":601636,"updateTimeStamp":1653235023329,"yellow":true,"tradeBefore":false},{"symbol":"MANAUSDT","entryPrice":1.110770201096,"markPrice":1.09640000,"pnl":-6462.14960820,"roe":-0.05242685,"updateTime":[2022,5,21,16,6,2,291000000],"amount":449691,"updateTimeStamp":1653149162291,"yellow":false,"tradeBefore":false},{"symbol":"EOSUSDT","entryPrice":1.341744945184,"markPrice":1.35400000,"pnl":-4572.78323455,"roe":-0.09051004,"updateTime":[2022,5,22,11,47,48,542000000],"amount":-373134.3,"updateTimeStamp":1653220068542,"yellow":true,"tradeBefore":false},{"symbol":"BTCUSDT","entryPrice":29174.44207538,"markPrice":30015.10000000,"pnl":-173841.33354801,"roe":-0.47613317,"updateTime":[2022,5,21,15,13,0,252000000],"amount":-206.792,"updateTimeStamp":1653145980252,"yellow":false,"tradeBefore":false},{"symbol":"DYDXUSDT","entryPrice":2.21378804417,"markPrice":2.11967778,"pnl":-48142.71521969,"roe":-0.08879676,"updateTime":[2022,5,18,16,40,18,654000000],"amount":511556.5,"updateTimeStamp":1652892018654,"yellow":false,"tradeBefore":false}],"updateTime":[2022,5,16,11,0,5,608000000],"updateTimeStamp":1652698805608},"success":true}

Scraping values from View Source using Requests Python 3

So, the code below works fine, but when I change the URL to another site it doesn't work:
import requests
import re
url = "https://www.autotrader.ca/a/ram/1500/hamilton/ontario/19_12052335_/?showcpo=ShowCpo&ncse=no&ursrc=pl&urp=2&urm=8&sprx=-2"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'}
response = requests.get(url, headers=headers)
phone_number = re.findall(r'"phoneNumber":"([\d-]+)"', response.text)
print(phone_number)
['905-870-7127']
The code below doesn't work; it gives the output []. Please tell me what I am doing wrong:
import requests
import re
urls = "https://www.kijijiautos.ca/vip/22686710/", "https://www.kijijiautos.ca/vip/22686710/"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'}
for url in urls:
    response = requests.get(url, headers=headers)
    number = re.findall(r'"number":"([\d-]+)"', response.text)
    print(number)
[]
I think you are not getting the HTTP 200 OK success status as a response, which is why you are unable to get the expected output. To get the HTTP 200 OK status, I changed the headers after inspecting the HTTP requests.
Please try this:
import re
import requests

headers = {
    'authority': 'www.kijijiautos.ca',
    'sec-ch-ua': '"Chromium";v="94", "Google Chrome";v="94", ";Not A Brand";v="99"',
    'pragma': 'no-cache',
    'accept-language': 'en-CA',
    'sec-ch-ua-mobile': '?0',
    'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36',
    'content-type': 'application/json',
    'accept': 'application/json',
    'cache-control': 'no-cache',
    'x-client-id': 'c89e7ff8-1d5a-4c2b-a095-c08dc08ccd3b',
    'x-client': 'ca.move.web.app',
    'sec-ch-ua-platform': '"Linux"',
    'sec-fetch-site': 'same-origin',
    'sec-fetch-mode': 'cors',
    'sec-fetch-dest': 'empty',
    'referer': 'https://www.kijijiautos.ca/cars/hyundai/sonata/used/',
    'cookie': 'mvcid=c89e7ff8-1d5a-4c2b-a095-c08dc08ccd3b; locale=en-CA; trty=e; _gcl_au=1.1.1363596757.1633936124; _ga=GA1.2.1193080228.1633936126; _gid=GA1.2.71842091.1633936126; AAMC_kijiji_0=REGION%7C3; aam_uuid=43389576784435124231935699643302941454; _fbp=fb.1.1633936286669.1508597061; __gads=ID=bb71a6fc168c1c33:T=1633936286:S=ALNI_MZk3lgy-9xgSGLPnfrkBET60uS6fA; GCLB=COyIgrWs-PWPsQE; lux_uid=163402080128473094; cto_bundle=zxCnjF95NFglMkZrTG5EZ2dzNHFSdjJ6QSUyQkJvM1BUbk5WTkpjTms0aWdZb3RxZUR3Qk1nd1BmcSUyQjZaZVFUdFdSelpua3pKQjFhTFk0U2ViTHVZbVg5ODVBNGJkZ2NqUGg1cHZJN3V0MWlwRkQwb1htcm5nNDlqJTJGUUN3bmt6ZFkzU1J0bjMzMyUyRkt5aGVqWTJ5RVJCa2ZJQUwxcFJnJTNEJTNE; _td=7f855061-c320-4570-b2d2-73c94bd22b13; rbzid=54THgSkyCRKwhVBqy+iHmjb1RG+uE6uH1XjpsXIazO5nO45GtpIXHGYii/PbJcdG3ahjIgKaBrjh0Yx2J6YCOLHEv3QYL559oz3jQaVrssH2/1Ui9buvIpuCwBOGG2xXGWW2qvcU5807PGsdubQDUvLkxmy4sor+4EzCI1OoUHMOG2asQwsgChqwzJixVvrE21E/NJdRfDLlejb5WeGEgU4B3dOYH95yYf5h+7fxV6H/XLhqbNa8e41DM3scfyeYWeqWCWmOH2VWZ7i3oQ0OXW1SkobLy0D6G+V9J5QMxb0=; rbzsessionid=ca53a07d3404ca93b3f8bc879291dc83; _uetsid=131a47702a6211ecba407d9ff6588dde; _uetvid=131b13602a6211ecacd0b56b0815e9b2',
}

response = requests.get('https://www.kijijiautos.ca/consumer/svc/a/22686710', headers=headers)
if response.status_code == 200:
    # print(response.text)
    numbers = re.findall(r'"number":"\+\d+"', response.text)  # "+" followed by one or more digits
    print(numbers[0])
else:
    print('status code is ', response.status_code)
Output:
# "number":"+17169905088"

How can I get the JSON data automatically instead of copying and pasting manually?

I want to get the JSON data from the target URL: https://xueqiu.com/stock/cata/stocktypelist.json?page=1&size=300
To get it manually, I open it in a browser and copy and paste. I want a smarter way: programmatically and automatically. I have tried several ways; all failed.
Method 1--traditional way with wget or curl:
wget https://xueqiu.com/stock/cata/stocktypelist.json?page=1&size=300
--2021-02-09 11:55:44-- https://xueqiu.com/stock/cata/stocktypelist.json?page=1
Resolving xueqiu.com (xueqiu.com)... 39.96.249.191
Connecting to xueqiu.com (xueqiu.com)|39.96.249.191|:443... connected.
HTTP request sent, awaiting response... 403 Forbidden
2021-02-09 11:55:44 ERROR 403: Forbidden.
Method 2--scraping with selenium:
>>> from selenium import webdriver
>>> browser = webdriver.Chrome()
>>> url="https://xueqiu.com/stock/cata/stocktypelist.json?page=1&size=300"
>>> browser.get(url)
This is what I get in the browser:
{"error_description":"遇到错误,请刷新页面或者重新登录帐号后再试","error_uri":"/stock/cata/stocktypelist.json","error_code":"400016"}
(The error_description translates to: "An error was encountered; please refresh the page or log in again and try again.")
Method 3--build a mitmproxy:
mitmweb --listen-host 127.0.0.1 -p 8080
I set the proxy in the browser and opened the target URL.
Error info in terminal:
Web server listening at http://127.0.0.1:8081/
Opening in existing browser session.
Proxy server listening at http://127.0.0.1:8080
127.0.0.1:41268: clientconnect
127.0.0.1:41270: clientconnect
127.0.0.1:41268: HTTP/2 connection terminated by client: error code: 0, last stream id: 0, additional data: None
Error info in browser:
error_description "遇到错误,请刷新页面或者重新登录帐号后再试"
error_uri "/stock/cata/stocktypelist.json"
error_code "400016"
The site protects its data aggressively. Is there really no way to get it automatically?
You could use the requests module:

import requests

cookies = {
    'xq_a_token': '176b14b3953a7c8a2ae4e4fae4c848decc03a883',
    'xqat': '176b14b3953a7c8a2ae4e4fae4c848decc03a883',
    'xq_r_token': '2c9b0faa98159f39fa3f96606a9498edb9ddac60',
    'xq_id_token': 'eyJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiJ9.eyJ1aWQiOi0xLCJpc3MiOiJ1YyIsImV4cCI6MTYxMzQ0MzE3MSwiY3RtIjoxNjEyODQ5MDY2ODI3LCJjaWQiOiJkOWQwbjRBWnVwIn0.VuyNicSjIvVkp9FrCzIlRyx8487XM4HH1C3X9KsFA2FipFiilSifBhux9pMNRyziHHiEifhX-xOgccc8IG1mn8cOylOVy3b-L1YG2T5Hs8MKgx7qm4gnV5Mzm_5_G5BiNtO44aczUcmp0g53dp7-0_Bvw3RlwXzT1DTvCKTV-s_zfBsOPyFTfiqyDUxU-oBRvkz1GpgVJzJL4EmZ8zDE2PBqeW00ueLLC7qPW50WeDCsEFS4ZPAvd2SbX9JPk-lU2WzlcMck2S9iFYmpDwuTeQuPbSeSl6jt5suwTImSgJDIUP9o2TX_Z7nNRDTYxvbP8XlejSt8X0pRDPDd_zpbMQ',
    'u': '661612849116563',
    'device_id': '24700f9f1986800ab4fcc880530dd0ed',
    'Hm_lvt_1db88642e346389874251b5a1eded6e3': '1612849123',
    's': 'c111f3y1kn',
    'Hm_lpvt_1db88642e346389874251b5a1eded6e3': '1612849252',
}
headers = {
    'Connection': 'keep-alive',
    'Cache-Control': 'no-cache',
    'sec-ch-ua': '"Chromium";v="88", "Google Chrome";v="88", ";Not A Brand";v="99"',
    'sec-ch-ua-mobile': '?0',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36',
    'Accept': 'image/avif,image/webp,image/apng,image/svg+xml,image/*,*/*;q=0.8',
    'Sec-Fetch-Site': 'same-origin',
    'Sec-Fetch-Mode': 'no-cors',
    'Sec-Fetch-User': '?1',
    'Sec-Fetch-Dest': 'image',
    'Accept-Language': 'en-US,en;q=0.9',
    'Pragma': 'no-cache',
    'Referer': '',
}
params = (
    ('page', '1'),
    ('size', '300'),
)
response = requests.get('https://xueqiu.com/stock/cata/stocktypelist.json', headers=headers, params=params, cookies=cookies)
print(response.status_code)
json_data = response.json()
print(json_data)
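Note that the hard-coded cookie values above will expire. A possible refinement (a sketch, not verified against the site's current protection): warm up a Session with a GET to the xueqiu.com homepage first, so the anonymous-session tokens are set automatically before the JSON request:

import requests

with requests.Session() as s:
    s.headers.update({
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0',
    })
    s.get('https://xueqiu.com')  # the homepage visit sets xq_a_token etc. on the session
    r = s.get('https://xueqiu.com/stock/cata/stocktypelist.json',
              params={'page': '1', 'size': '300'})
    print(r.status_code)
    print(r.json())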
You could use scrapy:
import json
import scrapy

class StockSpider(scrapy.Spider):
    name = 'stock_spider'
    start_urls = ['https://xueqiu.com/stock/cata/stocktypelist.json?page=1&size=300']
    custom_settings = {
        'DEFAULT_REQUEST_HEADERS': {
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.16; rv:85.0) Gecko/20100101 Firefox/85.0',
            'Host': 'xueqiu.com',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
            'Accept-Language': 'en-US',
            'Accept-Encoding': 'gzip,deflate,br',
            'Connection': 'keep-alive',
            'Cache-Control': 'no-cache',
            'Sec-Fetch-Dest': 'document',
            'Sec-Fetch-Mode': 'navigate',
            'Sec-Fetch-Site': 'none',
            'Sec-Fetch-User': '?1',
            'Upgrade-Insecure-Requests': '1',
            'Pragma': 'no-cache',
            'Referer': '',
        },
        'ROBOTSTXT_OBEY': False
    }
    handle_httpstatus_list = [400]

    def parse(self, response):
        json_result = json.loads(response.body)
        yield json_result
Run spider: scrapy crawl stock_spider
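One detail worth noting in this spider: handle_httpstatus_list = [400] lets parse() receive the response even when the server answers with HTTP 400 (which the browser error above suggests it does for unauthenticated calls). Without it, Scrapy's HttpError middleware would drop the response before the callback ran.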

How to show the hidden contents under "View More" when using Selenium - Python

driver = webdriver.Chrome(r'XXXX\chromedriver.exe')
FB_bloomberg_URL = 'https://www.bloomberg.com/quote/FB:US'
driver.get(FB_bloomberg_URL)
board_members = driver.find_elements_by_xpath('//*[@id="root"]/div/div/section[3]/div[10]/div[1]/div[2]/div/div[2]')[0]
board = board_members.text
board.split('\n')
I wrote the code above to scrape the board information for Facebook from Bloomberg. But I have trouble extracting all board members because some are hidden behind the "View More" link. How can I extract all the names?
Thanks for the help.
You can do the whole thing with requests by grabbing the appropriate cookie from a prior GET and passing it to the API. The API can be found in the network tab when clicking the "View More" link and inspecting the web traffic.
import requests

headers = {
    'dnt': '1',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36',
    'accept': '*/*',
    'referer': 'https://www.bloomberg.com/quote/FB:US',
    'authority': 'www.bloomberg.com',
    'cookie': ''
}

with requests.Session() as s:
    r = s.get('https://www.bloomberg.com/quote/FB:US')
    # forward the bot-protection cookie issued on the page GET to the API call
    headers['cookie'] = s.cookies.get_dict()['_pxhd']
    r = s.get('https://www.bloomberg.com/markets2/api/peopleForCompany/11092218', headers=headers).json()
    board_members = [item['name'] for item in r['boardMembers']]
    print(board_members)
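The design relies on the fact that the first GET to the quote page sets the _pxhd bot-protection cookie on the session; passing that cookie along is what lets the subsequent call to the peopleForCompany API succeed where a bare request would be blocked.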
