How to write these 3 sentences with requests instead of urllib2? [duplicate] - python-3.x

Just a short, simple one about the excellent Requests module for Python.
I can't seem to find in the documentation what the variable 'proxies' should contain. When I send it a dict with a standard "IP:PORT" value it rejected it asking for 2 values.
So, I guess (because this doesn't seem to be covered in the docs) that the first value is the ip and the second the port?
The docs mention this only:
proxies – (optional) Dictionary mapping protocol to the URL of the proxy.
So I tried this... what should I be doing?
proxy = { ip: port}
and should I convert these to some type before putting them in the dict?
r = requests.get(url,headers=headers,proxies=proxy)

The proxies' dict syntax is {"protocol": "scheme://ip:port", ...}. With it you can specify different (or the same) proxie(s) for requests using http, https, and ftp protocols:
http_proxy = "http://10.10.1.10:3128"
https_proxy = "https://10.10.1.11:1080"
ftp_proxy = "ftp://10.10.1.10:3128"
proxies = {
"http" : http_proxy,
"https" : https_proxy,
"ftp" : ftp_proxy
}
r = requests.get(url, headers=headers, proxies=proxies)
Deduced from the requests documentation:
Parameters:
method – method for the new Request object.
url – URL for the new Request object.
...
proxies – (optional) Dictionary mapping protocol to the URL of the proxy.
...
On linux you can also do this via the HTTP_PROXY, HTTPS_PROXY, and FTP_PROXY environment variables:
export HTTP_PROXY=10.10.1.10:3128
export HTTPS_PROXY=10.10.1.11:1080
export FTP_PROXY=10.10.1.10:3128
On Windows:
set http_proxy=10.10.1.10:3128
set https_proxy=10.10.1.11:1080
set ftp_proxy=10.10.1.10:3128

You can refer to the proxy documentation here.
If you need to use a proxy, you can configure individual requests with the proxies argument to any request method:
import requests
proxies = {
"http": "http://10.10.1.10:3128",
"https": "https://10.10.1.10:1080",
}
requests.get("http://example.org", proxies=proxies)
To use HTTP Basic Auth with your proxy, use the http://user:password#host.com/ syntax:
proxies = {
"http": "http://user:pass#10.10.1.10:3128/"
}

I have found that urllib has some really good code to pick up the system's proxy settings and they happen to be in the correct form to use directly. You can use this like:
import urllib
...
r = requests.get('http://example.org', proxies=urllib.request.getproxies())
It works really well and urllib knows about getting Mac OS X and Windows settings as well.

The accepted answer was a good start for me, but I kept getting the following error:
AssertionError: Not supported proxy scheme None
Fix to this was to specify the http:// in the proxy url thus:
http_proxy = "http://194.62.145.248:8080"
https_proxy = "https://194.62.145.248:8080"
ftp_proxy = "10.10.1.10:3128"
proxyDict = {
"http" : http_proxy,
"https" : https_proxy,
"ftp" : ftp_proxy
}
I'd be interested as to why the original works for some people but not me.
Edit: I see the main answer is now updated to reflect this :)

If you'd like to persisist cookies and session data, you'd best do it like this:
import requests
proxies = {
'http': 'http://user:pass#10.10.1.0:3128',
'https': 'https://user:pass#10.10.1.0:3128',
}
# Create the session and set the proxies.
s = requests.Session()
s.proxies = proxies
# Make the HTTP request through the session.
r = s.get('http://www.showmemyip.com/')

8 years late. But I like:
import os
import requests
os.environ['HTTP_PROXY'] = os.environ['http_proxy'] = 'http://http-connect-proxy:3128/'
os.environ['HTTPS_PROXY'] = os.environ['https_proxy'] = 'http://http-connect-proxy:3128/'
os.environ['NO_PROXY'] = os.environ['no_proxy'] = '127.0.0.1,localhost,.local'
r = requests.get('https://example.com') # , verify=False

The documentation
gives a very clear example of the proxies usage
import requests
proxies = {
'http': 'http://10.10.1.10:3128',
'https': 'http://10.10.1.10:1080',
}
requests.get('http://example.org', proxies=proxies)
What isn't documented, however, is the fact that you can even configure proxies for individual urls even if the schema is the same!
This comes in handy when you want to use different proxies for different websites you wish to scrape.
proxies = {
'http://example.org': 'http://10.10.1.10:3128',
'http://something.test': 'http://10.10.1.10:1080',
}
requests.get('http://something.test/some/url', proxies=proxies)
Additionally, requests.get essentially uses the requests.Session under the hood, so if you need more control, use it directly
import requests
proxies = {
'http': 'http://10.10.1.10:3128',
'https': 'http://10.10.1.10:1080',
}
session = requests.Session()
session.proxies.update(proxies)
session.get('http://example.org')
I use it to set a fallback (a default proxy) that handles all traffic that doesn't match the schemas/urls specified in the dictionary
import requests
proxies = {
'http': 'http://10.10.1.10:3128',
'https': 'http://10.10.1.10:1080',
}
session = requests.Session()
session.proxies.setdefault('http', 'http://127.0.0.1:9009')
session.proxies.update(proxies)
session.get('http://example.org')

i just made a proxy graber and also can connect with same grabed proxy without any input
here is :
#Import Modules
from termcolor import colored
from selenium import webdriver
import requests
import os
import sys
import time
#Proxy Grab
options = webdriver.ChromeOptions()
options.add_argument('headless')
driver = webdriver.Chrome(chrome_options=options)
driver.get("https://www.sslproxies.org/")
tbody = driver.find_element_by_tag_name("tbody")
cell = tbody.find_elements_by_tag_name("tr")
for column in cell:
column = column.text.split(" ")
print(colored(column[0]+":"+column[1],'yellow'))
driver.quit()
print("")
os.system('clear')
os.system('cls')
#Proxy Connection
print(colored('Getting Proxies from graber...','green'))
time.sleep(2)
os.system('clear')
os.system('cls')
proxy = {"http": "http://"+ column[0]+":"+column[1]}
url = 'https://mobile.facebook.com/login'
r = requests.get(url, proxies=proxy)
print("")
print(colored('Connecting using proxy' ,'green'))
print("")
sts = r.status_code

here is my basic class in python for the requests module with some proxy configs and stopwatch !
import requests
import time
class BaseCheck():
def __init__(self, url):
self.http_proxy = "http://user:pw#proxy:8080"
self.https_proxy = "http://user:pw#proxy:8080"
self.ftp_proxy = "http://user:pw#proxy:8080"
self.proxyDict = {
"http" : self.http_proxy,
"https" : self.https_proxy,
"ftp" : self.ftp_proxy
}
self.url = url
def makearr(tsteps):
global stemps
global steps
stemps = {}
for step in tsteps:
stemps[step] = { 'start': 0, 'end': 0 }
steps = tsteps
makearr(['init','check'])
def starttime(typ = ""):
for stemp in stemps:
if typ == "":
stemps[stemp]['start'] = time.time()
else:
stemps[stemp][typ] = time.time()
starttime()
def __str__(self):
return str(self.url)
def getrequests(self):
g=requests.get(self.url,proxies=self.proxyDict)
print g.status_code
print g.content
print self.url
stemps['init']['end'] = time.time()
#print stemps['init']['end'] - stemps['init']['start']
x= stemps['init']['end'] - stemps['init']['start']
print x
test=BaseCheck(url='http://google.com')
test.getrequests()

It’s a bit late but here is a wrapper class that simplifies scraping proxies and then making an http POST or GET:
ProxyRequests
https://github.com/rootVIII/proxy_requests

Already tested, the following code works. Need to use HTTPProxyAuth.
import requests
from requests.auth import HTTPProxyAuth
USE_PROXY = True
proxy_user = "aaa"
proxy_password = "bbb"
http_proxy = "http://your_proxy_server:8080"
https_proxy = "http://your_proxy_server:8080"
proxies = {
"http": http_proxy,
"https": https_proxy
}
def test(name):
print(f'Hi, {name}') # Press Ctrl+F8 to toggle the breakpoint.
# Create the session and set the proxies.
session = requests.Session()
if USE_PROXY:
session.trust_env = False
session.proxies = proxies
session.auth = HTTPProxyAuth(proxy_user, proxy_password)
r = session.get('https://www.stackoverflow.com')
print(r.status_code)
if __name__ == '__main__':
test('aaa')

I share some code how to fetch proxies from the site "https://free-proxy-list.net" and store data to a file compatible with tools like "Elite Proxy Switcher"(format IP:PORT):
##PROXY_UPDATER - get free proxies from https://free-proxy-list.net/
from lxml.html import fromstring
import requests
from itertools import cycle
import traceback
import re
######################FIND PROXIES#########################################
def get_proxies():
url = 'https://free-proxy-list.net/'
response = requests.get(url)
parser = fromstring(response.text)
proxies = set()
for i in parser.xpath('//tbody/tr')[:299]: #299 proxies max
proxy = ":".join([i.xpath('.//td[1]/text()')
[0],i.xpath('.//td[2]/text()')[0]])
proxies.add(proxy)
return proxies
######################write to file in format IP:PORT######################
try:
proxies = get_proxies()
f=open('proxy_list.txt','w')
for proxy in proxies:
f.write(proxy+'\n')
f.close()
print ("DONE")
except:
print ("MAJOR ERROR")

Related

Using proxies with TLS in python

I am currently trying to use TLS 1.1 with python requests. So far I've been using this snippet:
CIPHERS = (
'ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-SHA256:AES256-SHA'
)
class TlsAdapter(HTTPAdapter):
def __init__(self, ssl_options=0, **kwargs):
self.ssl_options = ssl_options
super(TlsAdapter, self).__init__(**kwargs)
def init_poolmanager(self, *pool_args, **pool_kwargs):
ctx = ssl_.create_urllib3_context(ciphers=CIPHERS, cert_reqs=ssl.CERT_REQUIRED, options=self.ssl_options)
self.poolmanager = PoolManager(*pool_args,
ssl_context=ctx,
**pool_kwargs)
self.session = requests.session()
adapter = TlsAdapter(ssl.OP_NO_TLSv1 | ssl.OP_NO_TLSv1_1)
self.session.mount("https://", adapter)
It works fine if I run it without an HTTP proxy, but does not seem to work/create a TLS connection when I do set an HTTP proxy in my session, like so:
proxies = {
'http': 'http://{}:{}#{}:{}'.format(user, password, ip, port),
'https': 'http://{}:{}#{}:{}'.format(user, password, ip, port)
}
self.session.proxies.update(proxies = proxies)
Any ideas on how I could get it to still work with proxies? I am not too sure what I am doing wrong/need to change. Thank you!
It's the way the proxies are being set or updated. The current way isn't setting the proxy at all.
Instead of using this:
self.session.proxies.update(proxies = proxies)
It should be:
self.session.proxies.update(proxies)
https://2.python-requests.org/en/master/user/advanced/#proxies

How to mock multiple urls in request mock

I have a method which is calling two different end points and validating there response.
def foo_bar:
status_1 = requests.post(
"http://myapi/test/status1", {},
headers=headers)
status_2 = requests.post(
"http://myapi/test/status2", {},
headers=headers)
# and check the responses ...
I want to mock the both the url in pytest like this:
def foo_test:
with requests_mock.Mocker() as m1:
m1.post('http://myapi/test/status1',
json={},
headers={'x-api-key': my_api_key})
m1.post('http://myapi/test/status2',
json={},
headers={'x-api-key': my_api_key})
It always throws the error
**NO mock address: http://myapi/test/status2**
seems like its only mocking first url.
So is there any way to mock more than one url in one method?
Yes there is. From the docs: "There is a special symbol at requests_mock.ANY which acts as the wildcard to match anything. It can be used as a replace for the method and/or the URL."
import requests_mock
with requests_mock.Mocker() as rm:
rm.post(requests_mock.ANY, text='resp')
I am not sure if this is the best way but it works for me. You can assert afterwards which URLs were called with:
urls = [r._request.url, for r in rm._adapter.request_history]
I think you have something else going on, it's very normal to mock out a single path like this at a time so that you can return different values from different paths simply. Your example works for me:
import requests
import requests_mock
with requests_mock.Mocker() as m1:
my_api_key = 'key'
m1.post('http://myapi/test/status1',
json={},
headers={'x-api-key': my_api_key})
m1.post('http://myapi/test/status2',
json={},
headers={'x-api-key': my_api_key})
headers = {'a': 'b'}
status_1 = requests.post("http://myapi/test/status1", {}, headers=headers)
status_2 = requests.post("http://myapi/test/status2", {}, headers=headers)
assert status_1.status_code == 200
assert status_2.status_code == 200
Yes, there is a way!
You need to use additional_matcher callback (see docs) and requests_mock.ANY as URL.
Your example (with context manager)
import requests
import requests_mock
headers = {'key': 'val', 'another': 'header'}
def my_matcher(request):
url = request.url
mocked_urls = [
"http://myapi/test/status1",
"http://myapi/test/status2",
]
return url in mocked_urls # True or False
# as Context manager
with requests_mock.Mocker() as m1:
m1.post(
requests_mock.ANY, # Mock any URL before matching
additional_matcher=my_matcher, # Mock only matched
json={},
headers=headers,
)
r = requests.post('http://myapi/test/status1')
print(f"{r.text} | {r.headers}")
r = requests.post('http://myapi/test/status2')
print(f"{r.text} | {r.headers}")
# r = requests.get('http://myapi/test/status3').text # 'NoMockAddress' exception
Adaptation for pytest
Note: import requests_mock library with alias (because requests_mock is a fixture in pytest tests)
See example for pytest framework, GET method and your URLs:
# test_some_module.py
import requests
import requests_mock as req_mock
def my_matcher(request):
url = request.url
mocked_urls = [
"http://myapi/test/status1",
"http://myapi/test/status2",
]
return url in mocked_urls # True or False
def test_mocking_several_urls(requests_mock): # 'requests_mock' is fixture here
requests_mock.get(
req_mock.ANY, # Mock any URL before matching
additional_matcher=my_matcher, # Mock only matched
text="Some fake response for all matched URLs",
)
... Do your requests ...
# GET URL#1 -> response "Some fake response for all matched URLs"
# GET URL#2 -> response "Some fake response for all matched URLs"
# GET URL#N -> Raised exception 'NoMockAddress'

Get requests body using selenium and proxies

I want to be able to get a body of the specific subrequest using a selenium behind the proxy.
Now I'm using python + selenium + chromedriver. With logging I'm able to get each subrequest's headers but not body. My logging settings:
caps['loggingPrefs'] =
{'performance': 'ALL',
'browser': 'ALL'}
caps['perfLoggingPrefs'] = {"enableNetwork": True,
"enablePage": True,
"enableTimeline": True}
I know there are several options to form a HAR with selenium:
Use geckodriver and har-export-trigger. I tried to make it work with the following code:
window.foo = HAR.triggerExport().then(harLog => { return(harLog); });
return window.foo;
Unfortunately, I don't see the body of the response in the returning data.
Use browsermob proxy. The solution seems totally fine but I didn't find the way to make browsermob proxy work behind the proxy.
So the question is: how can I get the body of the specific network response on the request made during the downloading of the webpage with selenium AND use proxies.
UPD: Actually, with har-export-trigger I get the response bodies, but not all of them: the response body I need is in json, it's MIME type is 'text/html; charset=utf-8' and it is missing from the HAR file I generate, so the solution is still missing.
UPD2: After further investigation, I realized that a response body is missing even on my desktop firefox when the har-export-trigger add-on is turned on, so this solution may be a dead-end (issue on Github)
UPD3: This bug can be seen only with the latest version of har-export-trigger. With version 0.6.0. everything works just fine.
So, for future googlers: you may use har-export-trigger v. 0.6.0. or the approach from the accepted answer.
I have actually just finished to implemented a selenium HAR script with tools you are mentioned in the question. Both HAR getting from har-export-trigger and BrowserMob are verified with Google HAR Analyser.
A class using selenium, gecko driver and har-export-trigger:
class MyWebDriver(object):
# a inner class to implement custom wait
class PageIsLoaded(object):
def __call__(self, driver):
state = driver.execute_script('return document.readyState;')
MyWebDriver._LOGGER.debug("checking document state: " + state)
return state == "complete"
_FIREFOX_DRIVER = "geckodriver"
# load HAR_EXPORT_TRIGGER extension
_HAR_TRIGGER_EXT_PATH = os.path.abspath(
"har_export_trigger-0.6.1-an+fx_orig.xpi")
_PROFILE = webdriver.FirefoxProfile()
_PROFILE.set_preference("devtools.toolbox.selectedTool", "netmonitor")
_CAP = DesiredCapabilities().FIREFOX
_OPTIONS = FirefoxOptions()
# add runtime argument to run with devtools opened
_OPTIONS.add_argument("-devtools")
_LOGGER = my_logger.get_custom_logger(os.path.basename(__file__))
def __init__(self, log_body=False):
self.browser = None
self.log_body = log_body
# return the webdriver instance
def get_instance(self):
if self.browser is None:
self.browser = webdriver.Firefox(capabilities=
MyWebDriver._CAP,
executable_path=
MyWebDriver._FIREFOX_DRIVER,
firefox_options=
MyWebDriver._OPTIONS,
firefox_profile=
MyWebDriver._PROFILE)
self.browser.install_addon(MyWebDriver._HAR_TRIGGER_EXT_PATH,
temporary=True)
MyWebDriver._LOGGER.info("Web Driver initialized.")
return self.browser
def get_har(self):
# JSON.stringify has to be called to return as a string
har_harvest = "myString = HAR.triggerExport().then(" \
"harLog => {return JSON.stringify(harLog);});" \
"return myString;"
har_dict = dict()
har_dict['log'] = json.loads(self.browser.execute_script(har_harvest))
# remove content body
if self.log_body is False:
for entry in har_dict['log']['entries']:
temp_dict = entry['response']['content']
try:
temp_dict.pop("text")
except KeyError:
pass
return har_dict
def quit(self):
self.browser.quit()
MyWebDriver._LOGGER.warning("Web Driver closed.")
A subclass adding BrowserMob proxy for your reference as well:
class MyWebDriverWithProxy(MyWebDriver):
_PROXY_EXECUTABLE = os.path.join(os.getcwd(), "venv", "lib",
"browsermob-proxy-2.1.4", "bin",
"browsermob-proxy")
def __init__(self, url, log_body=False):
super().__init__(log_body=log_body)
self.server = Server(MyWebDriverWithProxy._PROXY_EXECUTABLE)
self.server.start()
self.proxy = self.server.create_proxy()
self.proxy.new_har(url,
options={'captureHeaders': True,
'captureContent': self.log_body})
super()._LOGGER.info("BrowserMob server started")
super()._PROFILE.set_proxy(self.proxy.selenium_proxy())
def get_har(self):
return self.proxy.har
def quit(self):
self.browser.quit()
self.proxy.close()
MyWebDriver._LOGGER.info("BroswerMob server and Web Driver closed.")

Get cookie using aiohttp

I am trying to get cookies from the browser using aiohttp. From the docs and googling I have only found articles about setting cookies in aiohttp.
In flask I would get the cookies as simply as
cookie = request.cookies.get('name_of_cookie')
# do something with cookie
Is there a simple way to fetch the cookie from browser using aiohttp?
Is there a simple way to fetch the cookie from the browser using aiohttp?
Not sure about whether this is simple but there is a way:
import asyncio
import aiohttp
async def main():
urls = [
'http://httpbin.org/cookies/set?test=ok',
]
for url in urls:
async with aiohttp.ClientSession(cookie_jar=aiohttp.CookieJar()) as s:
async with s.get(url) as r:
print('JSON', await r.json())
cookies = s.cookie_jar.filter_cookies('http://httpbin.org')
for key, cookie in cookies.items():
print('Key: "%s", Value: "%s"' % (cookie.key, cookie.value))
loop = asyncio.get_event_loop()
loop.run_until_complete(main())
The program generates the following output:
JSON: {'cookies': {'test': 'ok'}}
Key: "test", Value: "ok"
Example adapted from https://aiohttp.readthedocs.io/en/stable/client_advanced.html#custom-cookies + https://docs.aiohttp.org/en/stable/client_advanced.html#cookie-jar
Now if you want to do a request using a previously set cookie:
import asyncio
import aiohttp
url = 'http://example.com'
# Filtering for the cookie, saving it into a varibale
async with aiohttp.ClientSession(cookie_jar=aiohttp.CookieJar()) as s:
cookies = s.cookie_jar.filter_cookies('http://example.com')
for key, cookie in cookies.items():
if key == 'test':
cookie_value = cookie.value
# Using the cookie value to do anything you want:
# e.g. sending a weird request containing the cookie in the header instead.
headers = {"Authorization": "Basic f'{cookie_value}'"}
async with s.get(url, headers=headers) as r:
print(await r.json())
loop = asyncio.get_event_loop()
loop.run_until_complete(main())
For testing urls containing a host part made up by an IP address use aiohttp.ClientSession(cookie_jar=aiohttp.CookieJar(unsafe=True)), according to https://github.com/aio-libs/aiohttp/issues/1183#issuecomment-247788489
Yes, the cookies are stored in request.cookies as a dict, just like in flask, so request.cookies.get('name_of_cookie') works the same.
In the examples section of the aiohttp repository there is a file, web_cookies.py that shows how to retrieve, set, and delete a cookie. Here's the section from that script that reads the cookies and returns it to the template as a preformatted string:
from pprint import pformat
from aiohttp import web
tmpl = '''\
<html>
<body>
Login<br/>
Logout<br/>
<pre>{}</pre>
</body>
</html>'''
async def root(request):
resp = web.Response(content_type='text/html')
resp.text = tmpl.format(pformat(request.cookies))
return resp
You can get the cookie value, domain, path etc, without having to loop thru all cookies.
s.cookie_jar._cookies
gives you all the cookies in a defaultdict with the domains as keys and their respective cookies as values. aiohttp uses SimpleCookie
So, to get the value of a cookie
s.cookie_jar._cookies.get("https://httpbin.org")["cookie_name"].value
for domain, path:
s.cookie_jar._cookies.get("https://httpbin.org")["cookie_name"]["domain"]
s.cookie_jar._cookies.get("https://httpbin.org")["cookie_name"]["path"]
more info can be found here: https://docs.python.org/3/library/http.cookies.html

Use tornado as a proxy, got a memory leak and don't know why?

I want to use tornado to act as a proxy_server. The things are simple:
1. recieve request from some client
2. get url, param, headers from the request
3. use a proxy to make request with the things in 2
4. return response content, headers and the proxy to client
here's the code:
# -*- coding: utf-8 -*-
import json
import functools
import base64
import sys
from tornado.httpserver import HTTPServer
from tornado.ioloop import IOLoop
from tornado.httpclient import AsyncHTTPClient
import tornado.web
import util
AsyncHTTPClient.configure("tornado.curl_httpclient.CurlAsyncHTTPClient")
class MainHandler(tornado.web.RequestHandler):
def handle_response(self, proxy, response):
if response.error:
self.write("error")
else:
headers = {name.lower(): value for name, value in response.headers.get_all()}
result = {
"content": base64.b64encode(response.body),
"headers": headers,
"proxy": proxy,
}
self.write(json.dumps(result))
self.finish()
#tornado.web.asynchronous
def post(self):
url = self.get_argument("url", "")
body = self.get_argument("body", "")
headers = json.loads(self.get_argument("headers", ""))
method = self.get_argument("method", "GET")
proxy = self.get_argument("proxy", "")
client = AsyncHTTPClient()
if proxy == "":
proxy = util.get_random_ip() # get a proxy string, like '118.150.101.160:18088'
proxy_info = proxy.split(":")
request_info = {
"method": method,
"headers": headers,
"connect_timeout": 2,
"request_timeout": 5,
"proxy_host": proxy_info[0],
"proxy_port": int(proxy_info[1]),
}
cf = functools.partial(self.handle_response, proxy)
if method == "POST":
request_info["body"] = body
client.fetch(url, callback=cf, **request_info)
if __name__ == "__main__":
port = sys.argv[1]
print "start at port : ", port
application = tornado.web.Application([(r"/proxy", MainHandler)])
server = HTTPServer(application)
server.listen(int(port))
IOLoop.instance().start()
when the requests growing to about 300 per min(5/s not too high), the memory start to goes wrong. I use "glances" to check system info, at the begining, the memory used is 16.6M, but 3 min after, the memory raise to 107.8M, 10 min latter, the number can up to 8518.6M, that's horrible!
I don't know why, any help?

Resources