Why does my Python script not work on AWS EC2? - python-3.x

When my Python code runs on localhost it works, but it fails on AWS EC2.
My code is simple: it sends an HTTPS request with the POST method. It works on localhost, but it runs into a problem on AWS EC2.
If I switch to a different HTTPS request, it works, so the network itself is fine.
The code and the received content are below:
#!/usr/bin/python
# -*- coding:utf-8 -*-
import random
import requests
url = 'https://www.haidilao.com/eportal/ui?moduleId=5&pageId=9c8cf76c4ca84fc686ca11aaa936f5c7&struts.portlet.action=/portlet/map-portlet!getMapDotDataByRegion.action&random='+str(random.random())
header = {
'host':'www.haidilao.com',
'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36',
'content-type': 'application/x-www-form-urlencoded; charset=UTF-8',
'accept': 'text/plain, */*; q=0.01',
'accept-encoding': 'gzip, deflate, br',
'accept-language': 'zh-CN,zh;q=0.9,en;q=0.8'
}
data = {
'queryContent': 'CA',
'country':'6bf4038f63234217aecf93668a63f04b',
'myLat':'',
'myLng':''
}
def test():
    # requests form-encodes a dict body itself, so no manual urlencode is
    # needed; the original `data = parse.urlencode(data)` line used an
    # unimported module and also raised UnboundLocalError, because
    # assigning to `data` inside the function made the name local.
    req = requests.post(url, headers=header, data=data, verify=False)
    print(req.status_code)
    print(req.headers)
    print(req.text)

test()
Content-Type: text/html
Content-Length: 60416
Connection: close
Date: Sat, 10 Aug 2019 13:02:31 GMT
Server: nginx
Last-Modified: Fri, 09 Aug 2019 14:05:06 GMT
ETag: "5d4d7d92-ec00"
Accept-Ranges: bytes
Vary: Accept-Encoding
X-Cache: Miss from cloudfront
Via: 1.1 46dd9ae2d97161deaefbdceeae5f57ac.cloudfront.net (CloudFront)
X-Amz-Cf-Pop: SIN2-C1
X-Amz-Cf-Id: XNkaD2emKes3BpaY3ZVSGb1bxlnsHD1KZeHCZPXnOcspTaYXXjVzKA==
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
<title>sorry,the channel has expired!</title>
<link rel="stylesheet" type="text/css" href="/eportal/uiFramework/css/tip.css">
</head>
<body>
<div class="easysite-error">
<img src="/eportal/uiFramework/images/picNews/tip.png" />
<font class="easysite-404"></font>
<font>sorry,the channel has expired!</font>
</div>
</body>
</html>
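As a side note, the URL in the question is built by hand-concatenating query parameters; urllib.parse.urlencode does this more safely. A minimal sketch (the struts.portlet.action parameter is omitted here for brevity):

```python
import random
from urllib.parse import urlencode

def build_url(base, params):
    """Append a urlencoded query string plus a cache-busting random value."""
    query = dict(params, random=str(random.random()))
    return base + '?' + urlencode(query)

url = build_url('https://www.haidilao.com/eportal/ui', {
    'moduleId': '5',
    'pageId': '9c8cf76c4ca84fc686ca11aaa936f5c7',
})
print(url)
```

This keeps special characters correctly percent-encoded regardless of what the parameter values contain.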

Related

HTTP get request Access Denied

Trying to understand why I am getting Access Denied when attempting to download index.html from www.gamestop.com. I have figured out how to get around it by requesting a static asset directly, e.g. https://www.gamestop.com/on/demandware.static/Sites-gamestop-us-Site/-/default/v1592871955944/js/main.js. I was wondering if anyone understood why the basic URL (www.gamestop.com) is rejected.
Code:
import requests
import http.client as http_client
import logging
headers = {
'accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'accept-encoding':'gzip, deflate, br',
'accept-language':'en-US,en;q=0.9',
'cache-control':'max-age=0',
'connection':'keep-alive',
'dnt':'1',
'downlink':'10',
'ect':'4g',
'rtt':'50',
'sec-fetch-dest':'document',
'sec-fetch-mode':'navigate',
'sec-fetch-site':'none',
'sec-fetch-user':'?1',
'upgrade-insecure-requests':'1',
'user-agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36'
}
http_client.HTTPConnection.debuglevel = 1
logging.basicConfig()
logging.getLogger().setLevel(logging.DEBUG)
requests_log = logging.getLogger("requests.packages.urllib3")
requests_log.setLevel(logging.DEBUG)
requests_log.propagate = True
r = requests.get('https://www.gamestop.com', headers=headers)
print(r.text)
print(r.status_code)
print(r.headers)
Output:
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): www.gamestop.com:443
send: b'GET / HTTP/1.1\r\nHost: www.gamestop.com\r\nuser-agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36\r\naccept-encoding: gzip, deflate, br\r\naccept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9\r\nconnection: keep-alive\r\naccept-language: en-US,en;q=0.9\r\ncache-control: max-age=0\r\ndnt: 1\r\ndownlink: 10\r\nect: 4g\r\nrtt: 50\r\nsec-fetch-dest: document\r\nsec-fetch-mode: navigate\r\nsec-fetch-site: none\r\nsec-fetch-user: ?1\r\nupgrade-insecure-requests: 1\r\n\r\n'
reply: 'HTTP/1.1 403 Forbidden\r\n'
header: Server: AkamaiGHost
header: Mime-Version: 1.0
header: Content-Type: text/html
header: Content-Length: 265
header: Expires: Fri, 26 Jun 2020 19:54:19 GMT
header: Date: Fri, 26 Jun 2020 19:54:19 GMT
header: Connection: close
header: Server-Timing: cdn-cache; desc=HIT
header: Server-Timing: cdn-cache; desc=HIT
DEBUG:urllib3.connectionpool:https://www.gamestop.com:443 "GET / HTTP/1.1" 403 265
<HTML><HEAD>
<TITLE>Access Denied</TITLE>
</HEAD><BODY>
<H1>Access Denied</H1>
You don't have permission to access "http://www.gamestop.com/" on this server.<P>
Reference #18.19e8d93f.1593201259.5c2b9d0
</BODY>
</HTML>
403
{'Server': 'AkamaiGHost', 'Mime-Version': '1.0', 'Content-Type': 'text/html', 'Content-Length': '265', 'Expires': 'Fri, 26 Jun 2020 19:54:19 GMT', 'Date': 'Fri, 26 Jun 2020 19:54:19 GMT', 'Connection': 'close', 'Server-Timing': 'cdn-cache; desc=HIT, edge; dur=1'}
This is code from another project of mine.
By using a random (fake) user agent you can bypass this.
See the documentation of the modules used here to learn more about them.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from fake_useragent import UserAgent

ua = UserAgent()
userAgent = ua.random

chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.add_argument(f'user-agent={userAgent}')
driver = webdriver.Chrome(
    executable_path=r'C:\Users\ASHIK\Desktop\chromedriver.exe', options=chrome_options)
driver.get("https://www.myntra.com/men?f=Categories%3ATshirts&p=1")
html_doc = driver.page_source

# The with-block closes the file automatically; an explicit close() inside
# the block is redundant.
with open('myntra-ecom.html', 'w', encoding='utf-8') as hfile:
    hfile.writelines(html_doc)
print("Html file Downloaded...")
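A similar user-agent rotation can be sketched without Selenium by choosing a User-Agent per request. The list below is hand-picked for illustration (fake_useragent can supply these instead); the header values are assumptions, not taken from the answer above:

```python
import random

# Illustrative, hand-picked desktop User-Agent strings.
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_1) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64; rv:70.0) Gecko/20100101 Firefox/70.0',
]

def random_headers():
    """Return browser-like request headers with a randomly chosen User-Agent."""
    return {
        'User-Agent': random.choice(USER_AGENTS),
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.9',
    }

headers = random_headers()
print(headers['User-Agent'])
```

Whether this is enough depends on the anti-bot layer; CDNs like Akamai also fingerprint properties beyond the headers.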

Why am I being detected as a robot when I am replicating the exact request a browser makes?

The website "https://www.interlinecenter.com/" makes a request to "http://cs.cruisebase.com/cs/forms/hotdeals.aspx?skin=605&nc=y" to load HTML content in an iframe. I am making the exact same request with the same headers the browser sends, but I am not getting the same content.
Here is the code I am using:
import requests
from lxml import html  # provides html.fromstring used below

url = 'http://cs.cruisebase.com/cs/forms/hotdeals.aspx?skin=605&nc=y'
header = {
'Host': 'cs.cruisebase.com',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language': 'en-US,en;q=0.5',
'Accept-Encoding': 'gzip, deflate, br',
'Referer': 'https://www.interlinecenter.com/',
'Connection': 'keep-alive',
'Cookie': 'visid_incap_312345=yt2dprI6SuGoy44xQsnF36dOwV0AAAAAQUIPAAAAAAAqm0pG5WAWOGjtyY8GOrLv; __utma=15704100.1052110012.1572947038.1574192877.1575447075.6; __utmz=15704100.1575447075.6.6.utmcsr=interlinecenter.com|utmccn=(referral)|utmcmd=referral|utmcct=/; ASP.NET_SessionId=pzd3a0l5kso41hhbqf3jiqlg; nlbi_312345=/7dzbSeGvDjg2/oY/eQfhwAAAACv806Zf3m7TsjHAou/y177; incap_ses_1219_312345=tMxeGkIPugj4d1gaasLqECHE5l0AAAAAg1IvjaYhEfuSIYLXtc2f/w==; LastVisitedClient=605; AWSELB=85D5DF550634E967F245F317B00A8C32EB84DA2B6B927E6D5CCB7C26C3821788BFC50D95449A1BA0B0AFD152140A70F5EA06CBB8492B21E10EC083351D7EBC4C68F086862A; incap_ses_500_312345=6PJ9FxwJ3gh0vta6kVvwBthz510AAAAAvUZPdshu8GVWM2sbkoUXmg==; __utmb=15704100.2.10.1575447075; __utmc=15704100; __utmt_tt=1',
'Upgrade-Insecure-Requests': '1',
'Cache-Control': 'max-age=0'
}
response = requests.get(url, timeout=10, headers=header)
byte_data = response.content
source_code = html.fromstring(byte_data)
print(response)
print(byte_data)
This is the response I am getting:
<Response [200]>
<html style="height:100%"><head><META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"><meta name="format-detection" content="telephone=no"><meta name="viewport" content="initial-scale=1.0"><meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"></head><body style="margin:0px;height:100%"><iframe id="main-iframe" src="/_Incapsula_Resource?SWUDNSAI=9&xinfo=10-99927380-0%200NNN%20RT%281575456049298%202%29%20q%280%20-1%20-1%200%29%20r%281%20-1%29%20B12%284%2c316%2c0%29%20U2&incident_id=500000240101726326-477561257670738314&edet=12&cinfo=04000000&rpinfo=0" frameborder=0 width="100%" height="100%" marginheight="0px" marginwidth="0px">Request unsuccessful. Incapsula incident ID: 500000240101726326-477561257670738314</iframe></body></html>
I need to extract/scrape data at "https://cs.cruisebase.com/cs/forms/hotdeals.aspx?skin=605&nc=y".
Note: I don't want to use the Selenium WebDriver to get the data. Any help will be much appreciated, thanks!
Did you try getting the headers by loading the target URL directly?
I sent a GET request to https://cs.cruisebase.com/cs/forms/hotdeals.aspx?skin=605&nc=y with the following headers, and I was able to get the complete response.
headers = {
'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
'Accept-Encoding':'gzip, deflate',
'Accept-Language':'en-GB,en;q=0.9,en-US;q=0.8,hi;q=0.7,la;q=0.6',
'Cache-Control':'no-cache',
'Connection':'keep-alive',
'Cookie':'ENTER COOKIES',
'DNT':'1',
'Host':'cs.cruisebase.com',
'Pragma':'no-cache',
'Upgrade-Insecure-Requests':'1',
'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36',
}
I have left the Cookie field blank; you will have to enter cookies, otherwise the page won't load. You can get the cookies from Chrome's developer tools.
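Rather than pasting the whole raw Cookie header, the cookies can be passed to requests as a dict; the library assembles the header itself. A minimal sketch (the cookie names are taken from the question; their values here are placeholders that must be replaced with real ones from the browser):

```python
import requests

# Placeholder cookie values; copy the real ones from the browser's dev tools.
cookies = {
    'ASP.NET_SessionId': 'PLACEHOLDER',
    'LastVisitedClient': '605',
}

session = requests.Session()
session.cookies.update(cookies)

# Prepare the request without sending it, to inspect the exact Cookie
# header that would go on the wire.
req = requests.Request(
    'GET', 'https://cs.cruisebase.com/cs/forms/hotdeals.aspx?skin=605&nc=y')
prepared = session.prepare_request(req)
print(prepared.headers['Cookie'])
```

A Session also keeps any Set-Cookie values the server returns, which helps on sites that rotate session cookies between requests.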

Trying to connect through the socket, and it always gets an HTTP/1.1 405 Method Not Allowed error

I'm trying to make an HTTPS proxy server, but I can't make a connection to any server.
Edit: this is the part of the code where, after a client connects to my server, I take its message and try to forward it to the web. This is the message of a Firefox client trying to connect to Google:
import socket
import ssl

# The hostname must be a quoted string; the bare `google.com` in the
# original was a NameError. The initial socket.socket(...) call was also
# dead code, since create_connection() returns a new, connected socket.
sock = socket.create_connection(('www.google.com', 443))
ssl_sock = ssl.wrap_socket(sock)
fullData = b''
ssl_sock.send(b'CONNECT www.google.com:443 HTTP/1.1\r\nUser-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:67.0) Gecko/20100101 Firefox/67.0\r\nProxy-Connection: keep-alive\r\nConnection: keep-alive\r\nHost: www.google.com:443\r\n\r\n')
while 1:
    # receive data from web server
    data = ssl_sock.recv(4026)
    print(data)
    if len(data) > 0:
        fullData += data
    else:
        break
# clientSock is the accepted client connection, defined elsewhere in the proxy.
clientSock.send(fullData)
Google should send me an OK message, but instead I am getting an error:
HTTP/1.1 405 Method Not Allowed
Content-Type: text/html; charset=UTF-8
Referrer-Policy: no-referrer
Content-Length: 1592
Date: Fri, 24 May 2019 05:28:17 GMT
Alt-Svc: quic=":443"; ma=2592000; v="46,44,43,39"
Connection: close
<!DOCTYPE html>
<html lang=en>
<meta charset=utf-8>
<meta name=viewport content="initial-scale=1, minimum-scale=1, width=device-width">
<title>Error 405 (Method Not Allowed)!!1</title>
<style>
*{margin:0;padding:0}html,code{font:15px/22px arial,sans-serif}html{background:#fff;color:#222;padding:15px}body{margin:7% auto 0;max-width:390px;min-height:180px;padding:30px 0 15px}* > body{background:url(//www.google.com/images/errors/robot.png) 100% 5px no-repeat;padding-right:205px}p{margin:11px 0 22px;overflow:hidden}ins{color:#777;text-decoration:none}a img{border:0}#media screen and (max-width:772px){body{background:none;margin-top:0;max-width:none;padding-right:0}}#logo{background:url(//www.google.com/images/branding/googlelogo/1x/googlelogo_color_150x54dp.png) no-repeat;margin-left:-5px}#media only screen and (min-resolution:192dpi){#logo{background:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) no-repeat 0% 0%/100% 100%;-moz-border-image:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) 0}}#media only screen and (-webkit-min-device-pixel-ratio:2){#logo{background:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) no-repeat;-webkit-background-size:100% 100%}}#logo{display:inline-block;height:54px;width:150px}
</style>
<a href=//www.google.com/><span id=logo aria-label=Google></span></a>
<p><b>405.</b> <ins>That\xe2\x80\x99s an error.</ins>
<p>The request method <code>CONNECT</code> is inappropriate for the URL <code>/</code>. <ins>That\xe2\x80\x99s all we know.</ins>
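For context on the 405: CONNECT is addressed to a proxy, not to an origin server. The code above opens a TLS connection straight to Google and sends CONNECT, so Google's web server sees an unsupported method for `/` and answers 405. A proxy receiving CONNECT from its client should instead open a plain TCP connection to the target, reply 200, and then relay bytes blindly while the client negotiates TLS end-to-end with the target. A minimal sketch of that handshake (names and error handling are illustrative only):

```python
import socket

def parse_connect(request_line: bytes):
    """Parse a 'CONNECT host:port HTTP/1.1' request line into (host, port)."""
    method, target, _version = request_line.split()
    if method != b'CONNECT':
        raise ValueError('not a CONNECT request')
    host, _, port = target.partition(b':')
    return host.decode(), int(port or b'443')

def open_tunnel(client_sock, request_line):
    """Open a plain TCP connection to the target and acknowledge the client.

    The proxy must not wrap either socket in TLS itself: after the 200
    response it only relays raw bytes in both directions, while the client
    does the TLS handshake end-to-end with the target server."""
    host, port = parse_connect(request_line)
    upstream = socket.create_connection((host, port))
    client_sock.sendall(b'HTTP/1.1 200 Connection Established\r\n\r\n')
    return upstream

host, port = parse_connect(b'CONNECT www.google.com:443 HTTP/1.1')
print(host, port)
```

After the 200, a real proxy would pump bytes between the two sockets (e.g. with select or two threads) until either side closes.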

Expressjs Route contains weird characters

What could possibly be the reason for an Express.js route to return the following data? I am expecting it to return JSON data. I am making an AJAX call to the server (Express.js), which gives me the data below with weird characters. Is this data gzipped? I have set the headers and contentType as follows:
headers: {"Access-Control-Allow-Origin":"*"}
contentType: 'application/json; charset=utf-8'
�=O�0�b��K�)�%7�܈9���G��%NOU���O'6��k�~6��S.���,��/�wأ%6�K�)��e�
The HTTP response is as follows:
General:
Request URL: http://localhost/expressRoute.js
Request Method: GET
Status Code: 200 OK
Remote Address: [::1]:80
Referrer Policy: no-referrer-when-downgrade
Response Headers:
Accept-Ranges: bytes
Connection: Keep-Alive
Content-Length: 29396
Content-Type: application/javascript
Date: Thu, 22 Nov 2018 00:50:36 GMT
ETag: "72d4-57b124e0c372e"
Keep-Alive: timeout=5, max=100
Last-Modified: Tue, 20 Nov 2018 05:57:12 GMT
Server: Apache/2.4.34 (Win32) OpenSSL/1.1.0i PHP/7.2.10
Request Headers:
Accept: */*
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.9
Cache-Control: no-cache
Connection: keep-alive
Host: localhost
Pragma: no-cache
Referer: http://localhost/index.html
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36
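To answer "is this data gzipped?" directly: a gzip stream is easy to recognize by its two-byte magic header, and can be round-tripped with the standard library. A small self-contained check (not tied to the Express.js setup above):

```python
import gzip

def looks_gzipped(data: bytes) -> bool:
    """Gzip streams start with the magic bytes 0x1f 0x8b."""
    return data[:2] == b'\x1f\x8b'

payload = gzip.compress(b'{"ok": true}')
print(looks_gzipped(payload))          # True
print(looks_gzipped(b'{"ok": true}'))  # False
print(gzip.decompress(payload))        # b'{"ok": true}'
```

Note, though, that the response headers above show no Content-Encoding and a Content-Type of application/javascript, which suggests Apache served the raw server-side file rather than the route's JSON output.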

Access to a web page via a robot

I need to occasionally access an HTML page to update a database. This page is easily accessible via a web browser, but when I try to access it from a Node.js application it doesn't work (the website detects that the request is made by a robot). However:
The robot request contains the same headers (including the user-agent) as the browser request.
The robot request doesn't contain a Referer or Cookie header, but neither does the browser request.
The IP of the robot is the same as the IP I use to browse the website.
To my eyes the robot request and the browser request are strictly identical. Nevertheless they are processed differently.
I'm running out of ideas... Maybe the request contains metadata like "this request was sent by node.js", but that would be really weird.
EDIT, here is a code sample:
const https = require ('https');

// callback (error, responseContent)
function getPage (callback){
let options = {
protocol : 'https:',
hostname : 'xxx.yyy.fr',
port : 443,
path : '/abc/def',
agent : false,
headers : {
'Accept' : 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Encoding' : 'gzip, deflate, br',
'Accept-Language' : 'fr,fr-FR;q=0.8,en-US;q=0.5,en;q=0.3',
'Cache-Control' : 'no-cache',
'Connection' : 'keep-alive',
'DNT' : '1',
'Host' : 'ooshop.carrefour.fr',
'Pragma' : 'no-cache',
'Upgrade-Insecure-Requests' : '1',
'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:60.0) Gecko/20100101 Firefox/60.0'
}
};
https.get (options, function (res){
if (res.statusCode !== 200){
res.resume ();
callback ('Error : res code != 200, res code = ' + res.statusCode);
return;
}
res.setEncoding ('utf-8');
let content = '';
res.on ('data', chunk => content += chunk);
res.on ('end', () => callback (null, content));
}).on ('error', e => callback (e));
}
EDIT: here is a comparison of the requests/responses:
Mozilla Firefox
request headers :
GET /3274080001005/eau-de-source-cristaline HTTP/1.1
Host: ooshop.carrefour.fr
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:61.0) Gecko/20100101 Firefox/61.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: fr,fr-FR;q=0.8,en-US;q=0.5,en;q=0.3
Accept-Encoding: gzip, deflate, br
DNT: 1
Connection: keep-alive
Upgrade-Insecure-Requests: 1
Pragma: no-cache
Cache-Control: no-cache
response headers :
HTTP/2.0 200 OK
date: Wed, 11 Jul 2018 21:25:25 GMT
server: Unknown
content-type: text/html; charset=UTF-8
age: 0
x-varnish-cache: MISS
accept-ranges: bytes
set-cookie: visid_incap_1213048=G8a0mWzmQYi0GKuT2Ht7YeQ9QVsAAAAAQkIPAAAAAADvVZnsZHK18dQQxHakBprg; expires=Thu, 11 Jul 2019 11:17:56 GMT; path=/; Domain=.carrefour.fr
incap_ses_466_1213048=/2NKHS4HXU0T7FpkwpJ3BsV1RlsAAAAAAY3wbUkXacAceu2NkgUrhw==; path=/; Domain=.carrefour.fr
x-iinfo: 7-11020186-11020187 NNNN CT(1 2 0) RT(1531344324722 0) q(0 0 0 0) r(4 4) U12
x-cdn: Incapsula
content-encoding: gzip
X-Firefox-Spdy: h2
response content : expected HTML page
Node.js robot
request headers :
Accept : text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Encoding : gzip, deflate, br
Accept-Language : fr,fr-FR;q=0.8,en-US;q=0.5,en;q=0.3
Cache-Control : no-cache
Connection : keep-alive
DNT : 1
Host : ooshop.carrefour.fr
Pragma : no-cache
Upgrade-Insecure-Requests : 1
User-Agent : Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:60.0) Gecko/20100101 Firefox/60.0
response headers :
content-type : text/html
connection : close, close
cache-control : no-cache
content-length : 210
x-iinfo : 1-17862634-0 0NNN RT(1531344295049 65) q(0 -1 -1 0) r(0 -1) B10(4,314,0) U19
set-cookie : incap_ses_466_1213048=j34jMBWkPFYT7FpkwpJ3Bqd1RlsAAAAAVBfoZBShAvoun/M8UFxPPA==; path=/; Domain=.carrefour.fr
response content :
<html>
<head>
<META NAME="robots" CONTENT="noindex,nofollow">
<script src="/_Incapsula_Resource?SWJIYLWA=5074a744e2e3d891814e9a2dace20bd4,719d34d31c8e3a6e6fffd425f7e032f3">
</script>
<body>
</body></html>
