Using WebSocket for Web Page Data Scraping - python-3.x

I want to scrape some data from a page that is populated over WebSockets. After inspecting Chrome DevTools for the wss address, the request headers, and the negotiation message, I wrote:
from websocket import create_connection
headers = {
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'en-US,en;q=0.9,fa;q=0.8',
    'Cache-Control': 'no-cache',
    'Connection': 'Upgrade',
    'Host': 'stream179.forexpros.com',
    'Origin': 'https://www.investing.com',
    'Pragma': 'no-cache',
    'Sec-WebSocket-Extensions': 'client_max_window_bits',
    'Sec-WebSocket-Key': 'ldcvnZNquzPkSNvpSdI09g==',
    'Sec-WebSocket-Version': '13',
    'Upgrade': 'websocket',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36'
}
ws = create_connection('wss://stream179.forexpros.com/echo/894/l27e2ja8/websocket', header=headers)
nego_message = '''["{\"_event\":\"bulk-subscribe\",\"tzID\":8,\"message\":\"pid-1:%%pid-8839:%%pid-166:%%pid-20:%%pid-169:%%pid-170:%%pid-44336:%%pid-27:%%pid-172:%%pid-2:%%pid-3:%%pid-5:%%pid-7:%%pid-9:%%pid-10:%%pid-945629:%%pid-11:%%pid-16:%%pid-68:%%pidTechSumm-1:%%pidTechSumm-2:%%pidTechSumm-3:%%pidTechSumm-5:%%pidTechSumm-7:%%pidTechSumm-9:%%pidTechSumm-10:%%pidExt-1:%%event-393634:%%event-393633:%%event-393636:%%event-393638:%%event-394479:%%event-394518:%%event-394514:%%event-394516:%%event-394515:%%event-394517:%%event-393654:%%event-394467:%%event-393653:%%event-394468:%%event-394545:%%event-394549:%%event-394548:%%event-394547:%%event-394550:%%event-394546:%%event-394551:%%event-394553:%%event-394552:%%event-394743:%%event-394744:%%event-393661:%%event-394469:%%event-394470:%%event-393680:%%event-393682:%%event-393681:%%event-393687:%%event-393694:%%event-393685:%%event-393689:%%event-393688:%%event-393695:%%event-393698:%%event-393704:%%event-393705:%%event-393724:%%event-393723:%%event-393725:%%event-393726:%%event-394591:%%event-393736:%%event-393733:%%event-393734:%%event-393740:%%event-393731:%%event-393732:%%event-393730:%%event-394617:%%event-394616:%%event-393737:%%event-378304:%%event-393645:%%event-394619:%%event-393755:%%event-393757:%%event-393760:%%event-393756:%%event-393758:%%event-393759:%%event-393761:%%event-393762:%%event-394481:%%event-394625:%%event-393754:%%event-394483:%%event-393775:%%event-394621:%%event-394622:%%event-376710:%%event-394623:%%event-394484:%%event-394624:%%isOpenExch-1:%%isOpenExch-2:%%isOpenExch-13:%%isOpenExch-3:%%isOpenExch-4:%%isOpenPair-1:%%isOpenPair-8839:%%isOpenPair-44336:%%cmt-1-5-1:%%domain-1:\"}"]'''
ws.send(nego_message)
while True:
    print(ws.recv())
but I'm getting:
o
Traceback (most recent call last):
File "test.py", line 647, in <module>
print(ws.recv())
File "C:\Users\me\AppData\Local\Programs\Python\Python37\lib\site-packages\websocket\_core.py", line 313, in recv
opcode, data = self.recv_data()
File "C:\Users\me\AppData\Local\Programs\Python\Python37\lib\site-packages\websocket\_core.py", line 330, in recv_data
opcode, frame = self.recv_data_frame(control_frame)
File "C:\Users\me\AppData\Local\Programs\Python\Python37\lib\site-packages\websocket\_core.py", line 343, in recv_data_frame
frame = self.recv_frame()
File "C:\Users\me\AppData\Local\Programs\Python\Python37\lib\site-packages\websocket\_core.py", line 377, in recv_frame
return self.frame_buffer.recv_frame()
File "C:\Users\me\AppData\Local\Programs\Python\Python37\lib\site-packages\websocket\_abnf.py", line 361, in recv_frame
self.recv_header()
File "C:\Users\me\AppData\Local\Programs\Python\Python37\lib\site-packages\websocket\_abnf.py", line 309, in recv_header
header = self.recv_strict(2)
File "C:\Users\me\AppData\Local\Programs\Python\Python37\lib\site-packages\websocket\_abnf.py", line 396, in recv_strict
bytes_ = self.recv(min(16384, shortage))
File "C:\Users\me\AppData\Local\Programs\Python\Python37\lib\site-packages\websocket\_core.py", line 452, in _recv
return recv(self.sock, bufsize)
File "C:\Users\me\AppData\Local\Programs\Python\Python37\lib\site-packages\websocket\_socket.py", line 115, in recv
"Connection is already closed.")
websocket._exceptions.WebSocketConnectionClosedException: Connection is already closed.
[Finished in 1.9s]
What am I missing here?
Update 1: updated the code to use WebSocketApp:
import time
import websocket

def on_message(ws, message):
    print("message:", message)

def on_error(ws, error):
    print("error:", error)

def on_close(ws):
    print("closed.")

def on_open(ws):
    print("opened")
    time.sleep(1)
    ws.send(nego_message)

ws = websocket.WebSocketApp(
    "wss://stream179.forexpros.com/echo/894/l27e2ja8/websocket",
    on_open=on_open,
    on_message=on_message,
    on_error=on_error,
    on_close=on_close,
    header=headers
)
websocket.enableTrace(True)
ws.run_forever()
but still no success:
--- request header ---
GET /echo/894/l27e2ja8/websocket HTTP/1.1
Upgrade: websocket
Connection: Upgrade
Host: stream179.forexpros.com
Origin: http://stream179.forexpros.com
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.9,fa;q=0.8
Cache-Control: no-cache
Connection: Upgrade
Host: stream179.forexpros.com
Origin: https://www.investing.com
Pragma: no-cache
Sec-WebSocket-Extensions: client_max_window_bits
Sec-WebSocket-Key: ldcvnZNquzPkSNvpSdI09g==
Sec-WebSocket-Version: 13
Upgrade: websocket
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36
-----------------------
--- response header ---
HTTP/1.1 101 Switching Protocols
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Accept: XPKKpUMZLpSYx/1z8Q0499hcobs=
-----------------------
opened
send: b'\x81\xfe\x06{_\xda7\xd2\x04\xf8L\xf0\x00\xbfA\xb71\xae\x15\xe8}\xb8B\xbe4\xf7D\xa7=\xa9T\xa06\xb8R\xf0s\xf8C\xa8\x16\x9e\x15\xe8...' (masked WebSocket frame bytes, truncated)
message: o
send: b'\x88\x82!\xdd\x07\xcf"5'
closed.
[Finished in 2.3s]

I eventually got it to work by removing the backslash escapes and the surrounding ["..."] framing from the message I send:
nego_message = '{"_event":"bulk-subscribe","tzID":8,"message":"pid-0:%%isOpenExch-1:%%pid-8849:%%isOpenExch-1004:%%pid-8833:%%pid-8862:%%pid-8830:%%pid-8836:%%pid-8831:%%pid-8916:%%pid-8832:%%pid-169:%%pid-20:%%isOpenExch-2:%%pid-166:%%pid-172:%%isOpenExch-4:%%pid-27:%%isOpenExch-3:%%pid-167:%%isOpenExch-9:%%pid-178:%%isOpenExch-20:%%pid-6408:%%pid-6369:%%pid-13994:%%pid-6435:%%pid-13063:%%pid-26490:%%pid-243:%%pid-1:%%isOpenExch-1002:%%pid-2:%%pid-3:%%pid-5:%%pid-7:%%pid-9:%%pid-10:%%pid-23705:%%pid-23706:%%pid-23703:%%pid-23698:%%pid-8880:%%isOpenExch-118:%%pid-8895:%%pid-1141794:%%pid-1175152:%%isOpenExch-152:%%pid-1175153:%%pid-14958:%%pid-44336:%%isOpenExch-97:%%pid-8827:%%pid-6497:%%pid-941155:%%pid-104395:%%pid-1013048:%%pid-1055979:%%pid-1177973:%%pid-1142416:%%pidExt-1:%%cmt-1-5-1:%%pid-252:%%pid-1031244:%%isOpenExch-125:"}'
ws.send(nego_message)
while True:
    print(ws.recv())
Outputs:
a["{\"message\":\"pid-3::{\\\"pid\\\":\\\"3\\\",\\\"last_dir\\\":\\\"greenBg\\\",\\\"last_numeric\\\":149.19,\\\"last\\\":\\\"149.19\\\",\\\"bid\\\":\\\"149.18\\\",\\\"ask\\\":\\\"149.19\\\",\\\"high\\\":\\\"149.29\\\",\\\"low\\\":\\\"149.12\\\",\\\"last_close\\\":\\\"149.26\\\",\\\"pc\\\":\\\"-0.07\\\",\\\"pcp\\\":\\\"-0.05%\\\",\\\"pc_col\\\":\\\"redFont\\\",\\\"turnover\\\":\\\"18.13K\\\",\\\"turnover_numeric\\\":\\\"18126\\\",\\\"time\\\":\\\"0:39:09\\\",\\\"timestamp\\\":1666139948}\"}"]

The while loop calls ws.recv() a second time after the server has already closed the connection. If you simply do:
print(ws.recv())
it will not attempt to call .recv() on a closed connection; that single call is what prints the o (the SockJS open frame) before the stack trace in your output.
As an aside, it seems like you might want a longer-running connection using websocket.WebSocketApp for a scrape, as in the sketch below.
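For reference, here is a minimal sketch of that longer-running approach, assuming the stream still speaks SockJS framing (the o open frame and the a[...] array frames visible in the output above). The URL is the one from the question, and the subscribe message is a trimmed, hypothetical version of the working one:

import json
import websocket  # pip install websocket-client

# Trimmed, hypothetical subscribe message in the working (unescaped) format.
SUBSCRIBE = '{"_event":"bulk-subscribe","tzID":8,"message":"pid-1:%%pid-2:%%pid-3:"}'

def on_open(ws):
    # Send the bulk-subscribe as soon as the socket is up.
    ws.send(SUBSCRIBE)

def on_message(ws, message):
    # SockJS framing: 'o' = open, 'h' = heartbeat, 'a[...]' = array of payloads.
    if message.startswith('a'):
        for payload in json.loads(message[1:]):
            data = json.loads(payload)  # each array item is itself a JSON string
            print("update:", data.get("message"))

ws = websocket.WebSocketApp(
    "wss://stream179.forexpros.com/echo/894/l27e2ja8/websocket",
    on_open=on_open,
    on_message=on_message,
)
ws.run_forever()  # keeps receiving until the server closes the connection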

Related

Response provided for One Page(200) but not for Other(401)

On the same website, a valid response is consistently returned for one page (200, url/nifty) but not for another (401, url/oispurtscontracts). The second endpoint sometimes returns a valid response and at other times a 401 error. I have cleared the browser cache and reloaded. Please provide a solution.
Error:
response.status_code = 401 for https://www.nseindia.com/api/live-analysis-oi-spurts-contracts
Code:
import time

import requests

def connRequest(url, headers):
    session = requests.Session()
    request = session.get(url, headers=headers)
    cookies = dict(request.cookies)
    # print(cookies)
    print(f"response.status_code = {request.status_code} for {url}")
    response = session.get(url, headers=headers, cookies=cookies).json()
    print(f"response = {response}")
    return response

# Working - Response Provided
def nifty_Working():
    url = 'https://www.nseindia.com/api/option-chain-indices?symbol=NIFTY'
    # data = requests.get(url)
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36',
        'accept-encoding': 'gzip, deflate, br',
        'accept-language': 'en-US,en;q=0.5',
        'Accept': 'application/json'
    }
    response = connRequest(url, headers)

# 401 Error
def oiSpurtsContracts_NotWorking():
    url = 'https://www.nseindia.com/api/live-analysis-oi-spurts-contracts'
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36',
        'accept-encoding': 'gzip, deflate, br',
        'accept-language': 'en-US,en',
        'Accept': 'application/json'
    }
    response = connRequest(url, headers)

def main():
    # Working - Response Provided
    nifty_Working()
    print()
    print()
    print()
    print()

    # 401 Error
    time.sleep(1)
    oiSpurtsContracts_NotWorking()

main()
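No accepted answer is shown here, but one workaround often suggested for NSE endpoints (and the same pattern the next question uses) is to warm the session up against a regular page first, so the site's cookies are set before the API call, then reuse the same Session. A minimal sketch under that assumption, not a confirmed fix:

import requests

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36',
    'accept-language': 'en-US,en;q=0.5',
    'Accept': 'application/json',
}

session = requests.Session()
# Visit a normal page first so the anti-bot cookies land on the session.
session.get('https://www.nseindia.com/option-chain', headers=headers, timeout=10)
# The Session resends its cookies automatically; no manual cookies= needed.
resp = session.get('https://www.nseindia.com/api/live-analysis-oi-spurts-contracts',
                   headers=headers, timeout=10)
print(resp.status_code)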

Connection timeout on port 443 on linode instance

I have been trying to write a Python script and run it on one of my Linode instances. The OS on the instance is Debian 11. This is my code:
# Libraries
import requests
import json
import math
url_oc = "https://www.nseindia.com/option-chain"
url_bnf = 'https://www.nseindia.com/api/option-chain-indices?symbol=BANKNIFTY'
url_nf = 'https://www.nseindia.com/api/option-chain-indices?symbol=NIFTY'
url_indices = "https://www.nseindia.com/api/allIndices"
# Headers
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36',
    'accept-language': 'en,gu;q=0.9,hi;q=0.8',
    'accept-encoding': 'gzip, deflate, br'
}
sess = requests.Session()
cookies = dict()
request = sess.get(url_oc, headers=headers, timeout=5)
cookies = dict(request.cookies)
When I run the program, it throws a read timeout error:
Traceback (most recent call last):
File "/home/python-ws/sniper1.py", line 19, in <module>
request = sess.get(url_oc, headers=headers, timeout=5)
File "/usr/local/lib/python3.9/dist-packages/requests/sessions.py", line 600, in get
return self.request("GET", url, **kwargs)
File "/usr/local/lib/python3.9/dist-packages/requests/sessions.py", line 587, in request
resp = self.send(prep, **send_kwargs)
File "/usr/local/lib/python3.9/dist-packages/requests/sessions.py", line 701, in send
r = adapter.send(request, **kwargs)
File "/usr/local/lib/python3.9/dist-packages/requests/adapters.py", line 578, in send
raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='www.nseindia.com', port=443): Read timed out. (read timeout=5)
I tried to:
- increase the timeout, but the connection still timed out
- ping the URL directly and telnet to it on port 443, both of which succeeded
- ufw allow 443/tcp
But the connection is still timing out, even though the instance has internet connectivity. Of all these, what could be the reason for the timeout?
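One thing worth checking (a diagnostic sketch, not a confirmed answer): telnet only proves the TCP connect succeeds; the request can still be stalled after the TLS handshake or at the HTTP layer. NSE sits behind a CDN that is often reported to stall requests from datacenter IP ranges, which would match a Linode instance timing out while a home connection works. This narrows it down:

import socket
import ssl

# If this handshake succeeds but the GET above still times out, the block is
# at the HTTP layer (IP range / request fingerprint), not basic connectivity.
ctx = ssl.create_default_context()
with socket.create_connection(("www.nseindia.com", 443), timeout=5) as sock:
    with ctx.wrap_socket(sock, server_hostname="www.nseindia.com") as tls:
        print("TLS handshake OK, negotiated:", tls.version())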

requests.get return error HTTPSConnectionPool Python

The code below should return 200, but an error occurs for some domains.
import requests
url1 = 'https://www.pontofrio.com.br/'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) '
                         'AppleWebKit/537.11 (KHTML, like Gecko) '
                         'Chrome/23.0.1271.64 Safari/537.11',
           'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
           'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
           'Accept-Encoding': 'none',
           'Accept-Language': 'en-US,en;q=0.8',
           'Connection': 'keep-alive'}
response = requests.get(url1, headers, timeout=10)
print(response.status_code)
Return:
Traceback (most recent call last):
File "C:\Python34\lib\site-packages\urllib3\connectionpool.py", line 384, in _make_request
six.raise_from(e, None)
File "<string>", line 2, in raise_from
File "C:\Python34\lib\site-packages\urllib3\connectionpool.py", line 380, in _make_request
httplib_response = conn.getresponse()
File "C:\Python34\lib\http\client.py", line 1148, in getresponse
response.begin()
File "C:\Python34\lib\http\client.py", line 352, in begin
version, status, reason = self._read_status()
File "C:\Python34\lib\http\client.py", line 314, in _read_status
line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
File "C:\Python34\lib\socket.py", line 371, in readinto
return self._sock.recv_into(b)
File "C:\Python34\lib\site-packages\urllib3\contrib\pyopenssl.py", line 309, in recv_into
return self.recv_into(*args, **kwargs)
File "C:\Python34\lib\site-packages\urllib3\contrib\pyopenssl.py", line 307, in recv_into
raise timeout('The read operation timed out')
socket.timeout: The read operation timed out
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Python34\lib\site-packages\requests\adapters.py", line 449, in send
timeout=timeout
File "C:\Python34\lib\site-packages\urllib3\connectionpool.py", line 638, in urlopen
_stacktrace=sys.exc_info()[2])
File "C:\Python34\lib\site-packages\urllib3\util\retry.py", line 367, in increment
raise six.reraise(type(error), error, _stacktrace)
File "C:\Python34\lib\site-packages\urllib3\packages\six.py", line 686, in reraise
raise value
File "C:\Python34\lib\site-packages\urllib3\connectionpool.py", line 600, in urlopen
chunked=chunked)
File "C:\Python34\lib\site-packages\urllib3\connectionpool.py", line 386, in _make_request
self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
File "C:\Python34\lib\site-packages\urllib3\connectionpool.py", line 306, in _raise_timeout
raise ReadTimeoutError(self, url, "Read timed out. (read timeout=%s)" % timeout_value)
urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='www.pontofrio.com.br', port=443): Read timed out. (read timeout=10)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "c:/teste.py", line 219, in <module>
url = montaurl(dominio)
File "c:/teste.py", line 81, in montaurl
response = requests.get(url1, headers, timeout=10)
File "C:\Python34\lib\site-packages\requests\api.py", line 75, in get
return request('get', url, params=params, **kwargs)
File "C:\Python34\lib\site-packages\requests\api.py", line 60, in request
return session.request(method=method, url=url, **kwargs)
File "C:\Python34\lib\site-packages\requests\sessions.py", line 533, in request
resp = self.send(prep, **send_kwargs)
File "C:\Python34\lib\site-packages\requests\sessions.py", line 646, in send
r = adapter.send(request, **kwargs)
File "C:\Python34\lib\site-packages\requests\adapters.py", line 529, in send
raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='www.pontofrio.com.br', port=443): Read timed out. (read timeout=10)
Domain that works:
https://www.pichau.com.br/
Domains that don't work:
casasbahia.com.br
extra.com.br
boticario.com.br
I believe there is some kind of block on pontofrio's server. How can I get around it?
There seemed to be a couple of issues, the first being how the headers were being set. The below doesn't actually pass the custom headers to the requests.get function.
response = requests.get(url1, headers, timeout=10)
This can be tested against httpbin:
import requests
url1 = 'https://httpbin.org/headers'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) '
                         'AppleWebKit/537.11 (KHTML, like Gecko) '
                         'Chrome/23.0.1271.64 Safari/537.11',
           'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
           'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
           'Accept-Encoding': 'none',
           'Accept-Language': 'en-US,en;q=0.8',
           'Connection': 'keep-alive'}
response = requests.get(url1, headers, timeout=10)
print(response.text)
print(response.status_code)
Which outputs:
{
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.25.1",
    "X-Amzn-Trace-Id": "Root=1-608a0391-3f1cfa79444ac04865ad9111"
  }
}
200
To properly set the headers argument:
response = requests.get(url1, headers=headers, timeout=10)
Let's test:
import requests
url1 = 'https://httpbin.org/headers'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) '
                         'AppleWebKit/537.11 (KHTML, like Gecko) '
                         'Chrome/23.0.1271.64 Safari/537.11',
           'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
           'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
           'Accept-Encoding': 'none',
           'Accept-Language': 'en-US,en;q=0.8',
           'Connection': 'keep-alive'}
response = requests.get(url1, headers=headers, timeout=10)
print(response.text)
print(response.status_code)
Here's the output:
{
  "headers": {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Charset": "ISO-8859-1,utf-8;q=0.7,*;q=0.3",
    "Accept-Encoding": "none",
    "Accept-Language": "en-US,en;q=0.8",
    "Host": "httpbin.org",
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11",
    "X-Amzn-Trace-Id": "Root=1-608a0533-40c8281f5faa85d1050c6b6a"
  }
}
200
Finally, the order of the headers, and the 'Connection': 'keep-alive' header in particular, were causing problems. Once I reordered them and removed the Connection header, it started working on all of the URLs.
Here's the code I used to test:
import requests
urls = ['https://www.pontofrio.com.br/',
        'https://www.casasbahia.com.br',
        'https://www.extra.com.br',
        'https://www.boticario.com.br']

headers = {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
           'Accept-Encoding': 'gzip, deflate, br',
           'Accept-Language': 'en-US,en;q=0.9',
           'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4491.0 Safari/537.36'}

for url1 in urls:
    print("Trying url: %s" % url1)
    response = requests.get(url1, headers=headers, timeout=10)
    print(response.status_code)
And the output:
Trying url: https://www.pontofrio.com.br/
200
Trying url: https://www.casasbahia.com.br
200
Trying url: https://www.extra.com.br
200
Trying url: https://www.boticario.com.br
200
I tried to access the page with wget, but without success. The problem seems to be that the server responds only to HTTP/2 requests.
Test with curl:
This times out:
$ curl --http1.1 -A "Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/81.0" "https://www.pontofrio.com.br/"
# times out
This succeeds (note the --http2 parameter):
$ curl --http2 -A "Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/81.0" "https://www.pontofrio.com.br/"
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:og="http://ogp.me/ns#" xmlns:fb="http://www.facebook.com/2008/fbml">
...
Unfortunately, the requests module doesn't support HTTP/2. You can, however, use the httpx module, which has experimental HTTP/2 support:
import httpx
import asyncio

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:88.0) Gecko/20100101 Firefox/88.0",
}

async def get_text(url):
    async with httpx.AsyncClient(http2=True, headers=headers) as client:
        r = await client.get(url)
        return r.text

txt = asyncio.run(get_text("https://www.pontofrio.com.br/"))
print(txt)
Prints:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:og="http://ogp.me/ns#" xmlns:fb="http://www.facebook.com/2008/fbml">
...
To install the httpx module with HTTP/2 support, use for example: pip install httpx[http2]
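If you don't need async, httpx also offers a synchronous client that takes the same http2 flag; a short sketch with the same URL and headers:

import httpx

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:88.0) Gecko/20100101 Firefox/88.0",
}

# Synchronous variant of the example above (still needs `pip install httpx[http2]`).
with httpx.Client(http2=True, headers=headers) as client:
    r = client.get("https://www.pontofrio.com.br/")
    print(r.http_version, r.status_code)  # e.g. "HTTP/2 200"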

Connection aborted.', RemoteDisconnected('Remote end closed connection without response') while using python

Hello, I am attempting to reach https://api.louisvuitton.com/api/eng-us/catalog/availability/M80016 through a session using requests in Python. Currently I am unable to reach it and get a Remote end closed connection without response error.
I have been trying to debug but haven't been successful. Below is my code and the output.
Code:
import requests
from requests.auth import HTTPBasicAuth
import json

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)'}

s = requests.Session()
r = s.get("https://us.louisvuitton.com/eng-us/products/pocket-organizer-damier-graphite-nvprod2630093v#N60432", headers=headers)
if r:
    print("Requested Successfully")
else:
    print("Request Failed ==> " + str(r))
    exit()

url2 = "https://api.qubit.com/graphql"
payload = json.dumps({
    "query": "query ($trackingId: String!, $contextId: String!) {\n property(trackingId: $trackingId) {\n visitor(contextId: $contextId) {\n ipAddress\n ipLocation: location {\n city\n cityCode\n country\n countryCode\n latitude\n longitude\n area\n areaCode\n region\n regionCode\n }\n segment: segments {\n state\n }\n history {\n conversionCycleNumber: conversionCycle\n conversionNumber: conversions\n entranceNumber: entrances\n firstConversionTs: firstConversion\n firstViewTs: firstView\n lastConversionTs: lastConversion\n lastViewTs: lastView\n lifetimeValue\n sessionNumber: sessions\n viewNumber: views\n }\n }\n }\n}",
    "variables": {
        "trackingId": "louisvuitton_prod",
        "contextId": "o6vfrf9jm4g-0k999shdp-fiadwa4"
    }})
headers2 = {
    'Content-Type': 'application/json'
}
x = s.post(url2, headers=headers2, data=payload)
if x:
    print("Post Successfully")
else:
    print("Post Failed ==> " + str(x))
    exit()

headers3 = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)',
    'Accept': "*/*",
    'Cache-Control': "no-cache",
    'Host': "api.louisvuitton.com",
    'Accept-Encoding': "gzip, deflate",
    'Connection': "keep-alive",
    'cache-control': "no-cache",
    'Content-Type': 'application/json'
}
cookies = s.cookies
t = s.get("https://api.louisvuitton.com/api/eng-us/catalog/availability/M80016", headers=headers3, cookies=cookies)
if t:
    print("Get Successfully")
else:
    print("Get Failed ==> " + str(t))
    exit()
Output
Requested Successfully
Post Successfully
Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/urllib3-1.25.10-py3.8.egg/urllib3/connectionpool.py", line 670, in urlopen
httplib_response = self._make_request(
File "/usr/local/lib/python3.8/site-packages/urllib3-1.25.10-py3.8.egg/urllib3/connectionpool.py", line 426, in _make_request
six.raise_from(e, None)
File "<string>", line 3, in raise_from
File "/usr/local/lib/python3.8/site-packages/urllib3-1.25.10-py3.8.egg/urllib3/connectionpool.py", line 421, in _make_request
httplib_response = conn.getresponse()
File "/usr/local/Cellar/python#3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/http/client.py", line 1347, in getresponse
response.begin()
File "/usr/local/Cellar/python#3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/http/client.py", line 307, in begin
version, status, reason = self._read_status()
File "/usr/local/Cellar/python#3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/http/client.py", line 276, in _read_status
raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response
Does anyone have a clue or an idea how to resolve this issue? I would appreciate any help.
If you inspect the cookies on the web page in Chrome with Inspect Element -> Application -> Storage -> Cookies -> https://us.louisvuitton.com/, you see about 40 cookies. However, if you add import pprint to your code and call pprint.pprint(s.cookies.get_dict()) just before the final request, you see only 4 cookies. So you are missing many cookies.
The response you get is actually an Access Denied message, as you can see if you use Inspect Element -> Network, copy the request to the https://api.louisvuitton.com/api/eng-us/catalog/availability/nvprod... URL as cURL, and remove all cookies except your 4. If you run it with all the cookies, it works fine.
Since many XHR requests can set cookies, I suggest you either go through all the requests (decoding them if needed) and read all the JavaScript files to see where cookies are set, or, as a much easier solution, use Selenium, requests-html (https://pypi.org/project/requests-html/), or PyQt. A sketch of the Selenium approach follows.
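As a rough illustration of the Selenium route (a sketch, not tested against this site; the product URL is the one from the question, and the page may need extra waits before all ~40 cookies exist), the idea is to let a real browser run the JavaScript and then copy its cookies into a requests session:

import requests
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://us.louisvuitton.com/eng-us/products/"
           "pocket-organizer-damier-graphite-nvprod2630093v#N60432")
# The page's JavaScript sets the remaining cookies; copy them all over.
session = requests.Session()
for c in driver.get_cookies():
    session.cookies.set(c['name'], c['value'], domain=c.get('domain'))
driver.quit()

t = session.get("https://api.louisvuitton.com/api/eng-us/catalog/availability/M80016",
                headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                                       'AppleWebKit/537.36 (KHTML, like Gecko)'})
print(t.status_code)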

python tor stem HTTP error 503

I'm currently trying to get a new IP via Python. Source:
import urllib.request
from stem import Signal
from stem.control import Controller
import socks, socket, time, random
proxy_support = urllib.request.ProxyHandler({"http" : "127.0.0.1:8118"})
opener = urllib.request.build_opener(proxy_support)
UA = [
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1) AppleWebKit/537.73.11 (KHTML, like Gecko) Version/7.0.1 Safari/537.73.11',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.76 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:26.0) Gecko/20100101 Firefox/26.0',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.107 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.77 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.107 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.102 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.102 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:26.0) Gecko/20100101 Firefox/26.0',
    'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:26.0) Gecko/20100101 Firefox/26.0',
    'Mozilla/5.0 (Windows NT 6.1; rv:26.0) Gecko/20100101 Firefox/26.0',
    'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.76 Safari/537.36'
]
def newI():
    controller = Controller.from_port(port=9051)
    try:
        controller.authenticate()
        controller.signal(Signal.NEWNYM)
        bytes_read = controller.get_info("traffic/read")
        bytes_written = controller.get_info("traffic/written")
        print(bytes_read)
        print(bytes_written)
    finally:
        controller.close()

if __name__ == '__main__':
    params = 'site:google.com admin'
    page = 0
    for i in range(100):
        url = 'http://www.google.co.kr/search?hl=ko&q=%s&start=%d' % (urllib.parse.quote(params), page)
        proxy_support = urllib.request.ProxyHandler({"http": "127.0.0.1:8118"})
        urllib.request.install_opener(opener)
        user_agent = random.choice(UA)
        headers = {'User-Agent': user_agent}
        random_interval = random.randrange(1, 5, 1)
        time.sleep(random_interval)
        req = urllib.request.Request(url, headers=headers)
        res = urllib.request.urlopen(req)
        html = res.read()
        print(len(html))
        page = page + 10
        newI()
I have Vidalia and Privoxy running, and my settings are set correctly:
Web Proxy (HTTP): 127.0.0.1:8118, and the same for HTTPS.
In my Privoxy config file I have this line:
forward-socks5 / 127.0.0.1:9050 .
My Vidalia settings are:
1. Settings > Sharing > Run as client only
2. Settings > Advanced > 127.0.0.1:9051
Still, when I run the code it gets stuck on case 1 and I can't get an IP. This is the error I get:
Traceback (most recent call last):
File "C:/Users/kwon/PycharmProjects/google_search/test.py", line 50, in <module>
res = urllib.request.urlopen(req)
File "C:\Python33\lib\urllib\request.py", line 156, in urlopen
return opener.open(url, data, timeout)
File "C:\Python33\lib\urllib\request.py", line 475, in open
response = meth(req, response)
File "C:\Python33\lib\urllib\request.py", line 587, in http_response
'http', request, response, code, msg, hdrs)
File "C:\Python33\lib\urllib\request.py", line 507, in error
result = self._call_chain(*args)
File "C:\Python33\lib\urllib\request.py", line 447, in _call_chain
result = func(*args)
File "C:\Python33\lib\urllib\request.py", line 692, in http_error_302
return self.parent.open(new, timeout=req.timeout)
File "C:\Python33\lib\urllib\request.py", line 475, in open
response = meth(req, response)
File "C:\Python33\lib\urllib\request.py", line 587, in http_response
'http', request, response, code, msg, hdrs)
File "C:\Python33\lib\urllib\request.py", line 513, in error
return self._call_chain(*args)
File "C:\Python33\lib\urllib\request.py", line 447, in _call_chain
result = func(*args)
File "C:\Python33\lib\urllib\request.py", line 595, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 503: Service Unavailable
What am I doing wrong?
Google prevents automated requests; take a look at this other post: Tor blocked by Google.
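Setting the 503 aside, if you only want the identity-renewal part to behave, a common stem pattern is to respect Tor's NEWNYM rate limit before signaling. A minimal sketch, assuming the same Privoxy-on-8118 / control-port-9051 setup as above (Google will still likely block Tor exits, so it checks the IP against httpbin instead):

import time
import urllib.request

from stem import Signal
from stem.control import Controller

def renew_identity():
    with Controller.from_port(port=9051) as controller:
        controller.authenticate()
        # Tor rate-limits NEWNYM; wait until the signal will actually be honored.
        time.sleep(controller.get_newnym_wait())
        controller.signal(Signal.NEWNYM)

# Route traffic through Privoxy, which forwards to Tor's SOCKS port per the config above.
opener = urllib.request.build_opener(urllib.request.ProxyHandler(
    {"http": "127.0.0.1:8118", "https": "127.0.0.1:8118"}))

renew_identity()
print(opener.open("http://httpbin.org/ip").read())  # should show the new exit IP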
