Get request body using Selenium and proxies - python-3.x

I want to be able to get the body of a specific subrequest using Selenium behind a proxy.
Right now I'm using Python + Selenium + chromedriver. With logging enabled I'm able to get each subrequest's headers, but not the body. My logging settings:
caps['loggingPrefs'] = {'performance': 'ALL',
                        'browser': 'ALL'}
caps['perfLoggingPrefs'] = {"enableNetwork": True,
                            "enablePage": True,
                            "enableTimeline": True}
I know there are several options for producing a HAR with Selenium:
Use geckodriver and har-export-trigger. I tried to make it work with the following code:
window.foo = HAR.triggerExport().then(harLog => { return(harLog); });
return window.foo;
Unfortunately, I don't see the body of the response in the returned data.
Use BrowserMob Proxy. The solution seems fine in itself, but I couldn't find a way to make BrowserMob Proxy work behind another proxy.
So the question is: how can I get the body of a specific network response for a request made while Selenium downloads a webpage, AND use proxies?
UPD: Actually, with har-export-trigger I do get response bodies, just not all of them: the response body I need is JSON, its MIME type is 'text/html; charset=utf-8', and it is missing from the HAR file I generate, so the solution is still missing.
UPD2: After further investigation, I realized that the response body is missing even in desktop Firefox when the har-export-trigger add-on is enabled, so this solution may be a dead end (issue on GitHub).
UPD3: This bug appears only with the latest version of har-export-trigger. With version 0.6.0 everything works just fine.
So, for future googlers: you can use har-export-trigger v0.6.0 or the approach from the accepted answer.

I have actually just finished implementing a Selenium HAR script with the tools you mentioned in the question. The HAR output from both har-export-trigger and BrowserMob has been verified with Google's HAR Analyser.
A class using Selenium, geckodriver and har-export-trigger:
import json
import os

from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.firefox.options import Options as FirefoxOptions

import my_logger  # the author's own logging helper module


class MyWebDriver(object):
    # an inner class to implement a custom wait
    class PageIsLoaded(object):
        def __call__(self, driver):
            state = driver.execute_script('return document.readyState;')
            MyWebDriver._LOGGER.debug("checking document state: " + state)
            return state == "complete"

    _FIREFOX_DRIVER = "geckodriver"
    # load HAR_EXPORT_TRIGGER extension
    _HAR_TRIGGER_EXT_PATH = os.path.abspath(
        "har_export_trigger-0.6.1-an+fx_orig.xpi")
    _PROFILE = webdriver.FirefoxProfile()
    _PROFILE.set_preference("devtools.toolbox.selectedTool", "netmonitor")
    _CAP = DesiredCapabilities().FIREFOX
    _OPTIONS = FirefoxOptions()
    # add runtime argument to run with devtools opened
    _OPTIONS.add_argument("-devtools")
    _LOGGER = my_logger.get_custom_logger(os.path.basename(__file__))

    def __init__(self, log_body=False):
        self.browser = None
        self.log_body = log_body

    # return the webdriver instance
    def get_instance(self):
        if self.browser is None:
            self.browser = webdriver.Firefox(capabilities=MyWebDriver._CAP,
                                             executable_path=MyWebDriver._FIREFOX_DRIVER,
                                             firefox_options=MyWebDriver._OPTIONS,
                                             firefox_profile=MyWebDriver._PROFILE)
            self.browser.install_addon(MyWebDriver._HAR_TRIGGER_EXT_PATH,
                                       temporary=True)
            MyWebDriver._LOGGER.info("Web Driver initialized.")
        return self.browser

    def get_har(self):
        # JSON.stringify has to be called to return the HAR as a string
        har_harvest = "myString = HAR.triggerExport().then(" \
                      "harLog => {return JSON.stringify(harLog);});" \
                      "return myString;"
        har_dict = dict()
        har_dict['log'] = json.loads(self.browser.execute_script(har_harvest))
        # remove content body unless explicitly requested
        if self.log_body is False:
            for entry in har_dict['log']['entries']:
                temp_dict = entry['response']['content']
                try:
                    temp_dict.pop("text")
                except KeyError:
                    pass
        return har_dict

    def quit(self):
        self.browser.quit()
        MyWebDriver._LOGGER.warning("Web Driver closed.")
A subclass adding BrowserMob proxy for your reference as well:
from browsermobproxy import Server  # required for the BrowserMob subclass


class MyWebDriverWithProxy(MyWebDriver):

    _PROXY_EXECUTABLE = os.path.join(os.getcwd(), "venv", "lib",
                                     "browsermob-proxy-2.1.4", "bin",
                                     "browsermob-proxy")

    def __init__(self, url, log_body=False):
        super().__init__(log_body=log_body)
        self.server = Server(MyWebDriverWithProxy._PROXY_EXECUTABLE)
        self.server.start()
        self.proxy = self.server.create_proxy()
        self.proxy.new_har(url,
                           options={'captureHeaders': True,
                                    'captureContent': self.log_body})
        super()._LOGGER.info("BrowserMob server started")
        super()._PROFILE.set_proxy(self.proxy.selenium_proxy())

    def get_har(self):
        return self.proxy.har

    def quit(self):
        self.browser.quit()
        self.proxy.close()
        MyWebDriver._LOGGER.info("BrowserMob server and Web Driver closed.")
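For completeness, a minimal usage sketch of the MyWebDriver class above (the target URL and output filename are placeholders):

import json
from selenium.webdriver.support.ui import WebDriverWait

my_driver = MyWebDriver(log_body=True)
browser = my_driver.get_instance()
browser.get("https://example.com")                            # placeholder URL
WebDriverWait(browser, 30).until(MyWebDriver.PageIsLoaded())  # use the custom wait
with open("example.har", "w") as har_file:                    # placeholder output file
    json.dump(my_driver.get_har(), har_file)
my_driver.quit()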

Related

Scrapy: How to use init_request and start_requests together?

I need to make an initial call to a service before I start my scraper (the initial call gives me some cookies and headers). I decided to use InitSpider and override the init_request method to achieve this. However, I also need to use start_requests to build my links and add some meta values like proxies to that specific spider, and I'm facing a problem: whenever I override start_requests, my crawler no longer calls init_request, so I cannot do the initialization. The only way I've found to get init_request working is to not override start_requests, which is impossible in my case. Any suggestions or possible solutions for my code:
import random

import scrapy
from scrapy.spiders.init import InitSpider


class SomethingSpider(InitSpider):
    name = 'something'
    allowed_domains = ['something.something']
    aod_url = "https://something?="
    start_urls = ["id1", "id2", "id3"]

    custom_settings = {
        'DOWNLOAD_FAIL_ON_DATALOSS': False,
        'CONCURRENT_ITEMS': 20,
        'DOWNLOAD_TIMEOUT': 10,
        'CONCURRENT_REQUESTS': 3,
        'COOKIES_ENABLED': True,
        'CONCURRENT_REQUESTS_PER_DOMAIN': 20
    }

    def init_request(self):
        yield scrapy.Request(url="https://something", callback=self.check_temp_cookie,
                             meta={'proxy': 'someproxy:1111'})

    def check_temp_cookie(self, response):
        """Check the response returned by a login request to see if we are
        successfully logged in.
        """
        if response.status == 200:
            print("H2")
            # Now the crawling can begin..
            return self.initialized()
        else:
            print("H3")
            # Something went wrong, we couldn't log in, so nothing happens.

    def start_requests(self):
        print("H4")
        proxies = ["xyz:0000", "abc:1111"]
        for url in self.start_urls:
            yield scrapy.Request(url=self.aod_url + url, callback=self.parse,
                                 meta={'proxy': random.choice(proxies)})

    def parse(self, response):
        try:
            # some processing happens
            yield {
                # some data
            }
        except Exception as err:
            print("Connecting to...")
The Spiders page (generic spiders section) of the official Scrapy docs doesn't mention the InitSpider you are trying to use.
The InitSpider class from https://github.com/scrapy/scrapy/blob/2.5.0/scrapy/spiders/init.py was written roughly ten years ago (in those ancient Scrapy versions the start_requests method worked completely differently).
From this perspective I recommend you not use the undocumented and probably outdated InitSpider.
On current versions of Scrapy the required functionality can be implemented using the regular Spider class:
import scrapy


class SomethingSpider(scrapy.Spider):
    ...

    def start_requests(self):
        yield scrapy.Request(url="https://something", callback=self.check_temp_cookie,
                             meta={'proxy': 'someproxy:1111'})

    def check_temp_cookie(self, response):
        """Check the response returned by a login request to see if we are
        successfully logged in.
        """
        if response.status == 200:
            print("H2")
            # Now the crawling can begin..
            ...
            # Schedule the next requests here:
            for url in self.start_urls:
                yield scrapy.Request(url=self.aod_url + url, callback=self.parse, ...)
        else:
            print("H3")
            # Something went wrong, we couldn't log in, so nothing happens.

    def parse(self, response):
        ...
If you are looking specifically at incorporating a login, then I would recommend you look at Using FormRequest.from_response() to simulate a user login in the Scrapy docs.
Here is the spider example they give:
import scrapy


def authentication_failed(response):
    # TODO: Check the contents of the response and return True if it failed
    # or False if it succeeded.
    pass


class LoginSpider(scrapy.Spider):
    name = 'example.com'
    start_urls = ['http://www.example.com/users/login.php']

    def parse(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'john', 'password': 'secret'},
            callback=self.after_login
        )

    def after_login(self, response):
        if authentication_failed(response):
            self.logger.error("Login failed")
            return
        # continue scraping with authenticated session...
Finally, you can have a look at how to add proxies to your Scrapy middleware as per this example (Zyte are the people who wrote Scrapy): "How to set up a custom proxy in Scrapy?"
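As a rough illustration of that last point, a custom downloader middleware for proxies might look something like this (the class name, module path and proxy addresses are placeholders, not part of the linked answer):

# settings.py (assumed module path):
# DOWNLOADER_MIDDLEWARES = {"myproject.middlewares.CustomProxyMiddleware": 350}

import random

class CustomProxyMiddleware:
    PROXIES = ["http://xyz:0000", "http://abc:1111"]  # placeholder proxies

    def process_request(self, request, spider):
        # attach a randomly chosen proxy to every outgoing request
        request.meta["proxy"] = random.choice(self.PROXIES)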

Access Locust Host attribute - Locust 1.0.0+

I had previously asked about and solved the problem of dumping stats using an older version of Locust, but the setup and teardown methods were removed in Locust 1.0.0, and now I'm unable to get the host (base URL).
I'm looking to print out some information about requests after they've run. Following the docs at https://docs.locust.io/en/stable/extending-locust.html, I have a request_success listener inside my sequential task set; some rough sample code below:
import json
import uuid

from locust import SequentialTaskSet, task, events


class SearchSequentialTest(SequentialTaskSet):

    @task
    def search(self):
        path = '/search/tomatoes'
        headers = {"Content-Type": "application/json"}
        unique_identifier = uuid.uuid4()
        data = {
            "name": f"Performance-{unique_identifier}",
        }
        with self.client.post(
            path,
            data=json.dumps(data),
            headers=headers,
            catch_response=True,
        ) as response:
            json_response = json.loads(response.text)
            self.items = json_response['result']['payload'][0]['uuid']
            print(json_response)

    @events.request_success.add_listener
    def my_success_handler(request_type, name, response_time, response_length, **kw):
        print(f"Successfully made a request to: {self.host}/{name}")
But I cannot access self.host, and if I remove it I only get a relative URL.
How do I access the base_url inside a TaskSet's event hooks?
How do I access the base_url inside a TaskSet's event hooks?
You can do it by accessing the class variable directly in your request handler:
print(f"Successfully made a request to: {YourUser.host}/{name}")
Or you can use absolute URLs in your test (task) like this:
with self.client.post(
    self.user.host + path,
    ...
Then the full URL will be passed to your request listener.
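A minimal sketch of the first suggestion, assuming a user class named MyUser that runs the task set from the question (the host is a placeholder):

from locust import HttpUser, between, events

class MyUser(HttpUser):                  # hypothetical user class
    host = "https://example.com"         # placeholder base URL
    wait_time = between(1, 2)
    tasks = [SearchSequentialTest]       # the SequentialTaskSet from the question

@events.request_success.add_listener
def my_success_handler(request_type, name, response_time, response_length, **kw):
    # reference the host via the user class instead of self
    print(f"Successfully made a request to: {MyUser.host}/{name}")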

how to resolve Uncaught Exception POST and tornado.access:500

It's a basic setup of a Tornado web application with the intention of reading a JSON file POSTed by a client.
Originally it was a Flask app but it has now been converted to Tornado. I tried using tornado-cors and also the set_default_headers() function, but it still shows the same errors.
class MainHandler(CorsMixin, tornado.web.RequestHandler):

    def get(self):
        self.write("Hello, world")

    def set_default_headers(self):
        self.set_header("Access-Constrol-Allow-Origin", "*")
        self.set_header("Access-Constrol-Allow-Headers", "Content-Type")
        self.set_header("Access-Constrol-Allow-Methods", "POST")

    def post(self):
        try:
            data = tornado.escape.json_decode(self.request.body)
            return data
        except Exception as err:
            print(str(err))

    CORS_ORIGIN = "*"
    CORS_HEADERS = "*"
    CORS_METHODS = 'POST'
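For reference, the conventional Tornado pattern is to write the decoded JSON back to the response rather than returning it from post(), and the CORS header names are spelled Access-Control-*; a minimal sketch under those assumptions (handler name and port are illustrative):

import tornado.escape
import tornado.ioloop
import tornado.web

class JsonHandler(tornado.web.RequestHandler):  # illustrative handler name
    def set_default_headers(self):
        self.set_header("Access-Control-Allow-Origin", "*")
        self.set_header("Access-Control-Allow-Headers", "Content-Type")
        self.set_header("Access-Control-Allow-Methods", "POST")

    def post(self):
        try:
            data = tornado.escape.json_decode(self.request.body)
            self.write(data)                    # write the response instead of returning it
        except Exception as err:
            self.set_status(400)
            self.write({"error": str(err)})

if __name__ == "__main__":
    app = tornado.web.Application([(r"/", JsonHandler)])
    app.listen(8888)                            # illustrative port
    tornado.ioloop.IOLoop.current().start()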

Set a passthrough using browsermob proxy?

So I'm using BrowserMob Proxy to log Selenium tests in and get past IAP for Google Cloud. But that just gets the user to the site; they still need to log in to the site itself using a Firebase login form. IAP has me adding an Authorization header through BrowserMob so you can reach the site, but when you try to log in through the Firebase form you get a 401 error: "Request had invalid authentication credentials. Expected OAuth 2 access token, login cookie or other valid authentication credential."
I thought I could get around this using the whitelist or blacklist feature to just not apply those headers to URLs related to the Firebase login, but it seems that whitelist and blacklist just return status codes and empty responses for calls that match the regex.
Is there a way to just pass certain calls through based on the host? Or, on the off chance I'm doing something wrong, let me know. Code below:
import time
import unittest

from browsermobproxy import Server
from selenium import webdriver


class ExampleTest(unittest.TestCase):

    def setUp(self):
        server = Server("env/bin/browsermob-proxy/bin/browsermob-proxy")
        server.start()
        proxy = server.create_proxy()

        bearer_header = {}
        bearer_header['Authorization'] = 'Bearer xxxxxxxxexamplexxxxxxxx'
        proxy.headers({"Authorization": bearer_header["Authorization"]})

        profile = webdriver.FirefoxProfile()
        proxy_info = proxy.selenium_proxy()
        profile.set_proxy(proxy_info)

        proxy.whitelist('.*googleapis.*, .*google.com.*', 200)  # This fakes 200 from urls on regex match
        # proxy.blacklist('.*googleapis.*', 200)  # This fakes 200 from urls if not regex match

        self.driver = webdriver.Firefox(firefox_profile=profile)
        proxy.new_har("file-example")

    def test_wait(self):
        self.driver.get("https://example.com/login/")
        time.sleep(3)

    def tearDown(self):
        self.driver.close()
Figured this out a bit later. There isn't anything built into the BrowserMob proxy/client to do this, but you can achieve it through the webdriver's own proxy settings.
Chrome
self.chrome_options = webdriver.ChromeOptions()
proxy_address = '{}:{}'.format(server.host, proxy.port)
self.chrome_options.add_argument('--proxy-server=%s' % proxy_address)

no_proxy = ['google.com', 'example.com']  # hosts that should bypass the proxy (not shown in the original)
no_proxy_string = ''
for item in no_proxy:
    no_proxy_string += '*' + item + ';'
self.chrome_options.add_argument('--proxy-bypass-list=%s' % no_proxy_string)

self.desired_capabilities = webdriver.DesiredCapabilities.CHROME
self.desired_capabilities['acceptInsecureCerts'] = True
Firefox
self.desired_capabilities = webdriver.DesiredCapabilities.FIREFOX
proxy_address = '{}:{}'.format(server.host, proxy.port)
self.desired_capabilities['proxy'] = {
    'proxyType': "MANUAL",
    'httpProxy': proxy_address,
    'sslProxy': proxy_address,
    'noProxy': ['google.com', 'example.com']
}
self.desired_capabilities['acceptInsecureCerts'] = True
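These options and capabilities are then handed to the driver constructors; a rough sketch of that wiring (assuming Selenium 3 constructor arguments):

# Chrome
self.driver = webdriver.Chrome(options=self.chrome_options,
                               desired_capabilities=self.desired_capabilities)

# Firefox
self.driver = webdriver.Firefox(capabilities=self.desired_capabilities)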

CherryPy server name tag

When running a CherryPy app it sends a Server header of something like CherryPy/version.
Is it possible to rename/override that from the app, without modifying CherryPy, so it shows something else?
Maybe something like MyAppName/version (CherryPy/version).
This can now be set on a per-application basis in the config file/dict:
[/]
response.headers.server = "CherryPy Dev01"
Actually, asking on IRC in their official channel, fumanchu gave me a cleaner way to do this (using the latest svn):
import cherrypy
from cherrypy import _cpwsgi_server


class HelloWorld(object):
    def index(self):
        return "Hello World!"
    index.exposed = True


serverTag = "MyApp/%s (CherryPy/%s)" % ("1.2.3", cherrypy.__version__)
_cpwsgi_server.CPWSGIServer.environ['SERVER_SOFTWARE'] = serverTag
cherrypy.config.update({'tools.response_headers.on': True,
                        'tools.response_headers.headers': [('Server', serverTag)]})

cherrypy.quickstart(HelloWorld())
This string appears to be set in the CherryPy Response class:
def __init__(self):
    self.status = None
    self.header_list = None
    self._body = []
    self.time = time.time()
    self.headers = http.HeaderMap()
    # Since we know all our keys are titled strings, we can
    # bypass HeaderMap.update and get a big speed boost.
    dict.update(self.headers, {
        "Content-Type": 'text/html',
        "Server": "CherryPy/" + cherrypy.__version__,
        "Date": http.HTTPDate(self.time),
    })
So when you're creating your Response object, you can update the "Server" header to display your desired string. From the CherryPy Response Object documentation:
headers
A dictionary containing the headers of the response. You may set values in
this dict anytime before the finalize phase, after which CherryPy switches
to using header_list ...
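Following that note, a minimal sketch of setting the header inside a handler, before the finalize phase (the app name and version are placeholders):

import cherrypy

class HelloWorld(object):
    def index(self):
        # overwrite the default Server header on the current response
        cherrypy.response.headers["Server"] = "MyAppName/1.0"
        return "Hello World!"
    index.exposed = True

cherrypy.quickstart(HelloWorld())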
EDIT: To avoid having to make this change on every response object you create, one simple way around it is to wrap the Response object. For example, you can create your own Response class that inherits from CherryPy's Response and updates the headers key after initializing:
class MyResponse(Response):
    def __init__(self):
        Response.__init__(self)
        dict.update(self.headers, {
            "Server": "MyServer/1.0",
        })

RespObject = MyResponse()
print RespObject.headers["Server"]
Then you can call your own class wherever you need to create a Response object, and it will always have the Server header set to your desired string.
