I would like to use mitmproxy with a custom Python script to duplicate some requests multiple times.
For example, if the client sends a POST request to aaa.bbb.ccc and the URL contains one of the strings ["test","runner"], the script should take that request and send it 10 times.
amount = 10
wordwatch = ["test","dev"]
domain = "aaa.bbb.ccc"

def request(context, flow):
    print(type(flow.request.url))
    if domain in flow.request.url:
        if any(s in flow.request.url for s in wordwatch):
            for i in range(0, amount + 1):
                flow.response
The problem I have is that I get neither an error nor any other output. Also, when I watch the Apache log on the server side, I can't even see the requests coming in. The server uses only HTTP, so it shouldn't be an HTTPS/certificate issue. It is quite possible that I misunderstood the documentation and the examples in the manual.
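For reference, below is a minimal sketch of one way the duplication could be done, assuming a recent mitmproxy release where client replay is exposed as the replay.client command (the context argument in the script above suggests an older release, whose script API differs):

# duplicate_requests.py - run with: mitmproxy -s duplicate_requests.py
# Sketch only: assumes a recent mitmproxy where client replay is available
# via the "replay.client" command; older releases use a different script API.
from mitmproxy import ctx

AMOUNT = 10
WORDWATCH = ["test", "dev"]
DOMAIN = "aaa.bbb.ccc"


def request(flow):
    # Skip flows that are themselves replays, so the duplicates created
    # below do not get duplicated again (assumes flow.is_replay exists).
    if getattr(flow, "is_replay", None):
        return
    url = flow.request.pretty_url
    if DOMAIN in url and any(s in url for s in WORDWATCH):
        for _ in range(AMOUNT):
            duplicate = flow.copy()
            ctx.master.commands.call("replay.client", [duplicate])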
Background
I have a python app, built using requests, that is used to upload files from client sites to a web server, using POST.
These files are usually small (1-300 KB), but sometimes larger (15-20 MB). Usually the uploads take a few seconds; however, large files over slow networks may take minutes to complete.
Problem
I'm having a problem figuring out how to use requests' timeout in a rational way to handle sending large uploads using POST over slow networks (where the POST may take 1-2 minutes to complete).
What I'd like
I'd like to be able to declare a session and then a POST using the session, so that
a) the initial timeout is small (so network/gateway/... connection problems etc. get detected quickly), BUT
b) the subsequent timeout is long, so that once the connection is established, a POST whose data takes a few minutes to upload won't time out.
I can't seem to figure out how to do that.
I'm also a bit confused by how/what/where the timeout parameter is used when specified as a tuple in conjunction with POST (looks like I'm not alone: https://stackoverflow.com/a/63994047/9423009).
Specifically, to illustrate this (meta code - my production code is below), if I have a file to POST that may take 1-2 minutes to upload:
file_to_upload = '/path_to_a_big_file'

my_session.post(
    url,
    timeout=2,
    files={'file': open(file_to_upload, 'rb')}
)
# above will time out if the POST takes > 2 seconds

my_session.post(
    url,
    timeout=60,
    files={'file': open(file_to_upload, 'rb')}
)
# above will succeed if the POST takes 40 seconds, BUT will also take up to
# 60 seconds to throw any exception for routine network/gateway/40X-type
# problems

my_session.post(
    url,
    timeout=(2, 60),
    files={'file': open(file_to_upload, 'rb')}
)
# THIS WILL ALSO TIMEOUT AFTER 2 SECONDS!?
So, based on the above, how do you specify a small initial 'make connection' timeout, and then a longer, separate timeout for the POST to finish sending?
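For reference, the requests documentation describes the tuple form as (connect timeout, read timeout): the first value limits how long establishing the connection may take, and the second how long the client will wait for the server to send data (it is not a hard cap on the total duration of the request). Below is a minimal sketch of that intent, with purely illustrative timeout values and a hypothetical upload URL:

import requests

file_to_upload = '/path_to_a_big_file'
url = 'https://example.com/upload'  # hypothetical upload endpoint

with requests.Session() as my_session:
    with open(file_to_upload, 'rb') as fh:
        response = my_session.post(
            url,
            files={'file': fh},
            # (connect timeout, read timeout): fail fast while connecting,
            # be patient while waiting for the server's response
            timeout=(3.05, 120),
        )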
Actual code and Additional Stuff
As the sending sites may have variable-speed networks, and to handle flaky network problems etc., I use urllib3's Retry to generate Sessions (courtesy of some great code at https://www.peterbe.com/plog/best-practice-with-retries-with-requests).
With this code, I have a small-ish initial timeout, and the Retry code will retry a certain number of times (with increasing backoff) until things fail. But I don't believe this affects the problem here:
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def requests_retry_session(
    retries=3,
    backoff_factor=0.3,
    status_forcelist=(500, 502, 504),
    session=None,
) -> requests.Session:
    """Return a requests session using Retry to automatically retry on failures."""
    # add POST to the list of methods to retry on
    methods = frozenset({'DELETE', 'GET', 'HEAD', 'OPTIONS', 'PUT', 'POST', 'TRACE'})
    session = session or requests.Session()
    retry = Retry(
        total=retries,
        read=retries,
        connect=retries,
        status=retries,
        backoff_factor=backoff_factor,
        method_whitelist=methods,  # renamed to allowed_methods in newer urllib3 releases
        status_forcelist=status_forcelist,
    )
    adapter = HTTPAdapter(max_retries=retry)
    session.mount('http://', adapter)
    session.mount('https://', adapter)
    return session
# ...
# send file
with open(file_to_send, 'rb') as fh:
    file_arg = [(server_key, fh)]
    with requests_retry_session() as s:
        # try to specify a small initial and a long subsequent POST timeout,
        # but this doesn't work - if the POST takes > 2 seconds it still
        # times out
        timeout = (2.0, 60.0)
        response = s.post(
            url,
            headers=headers,
            data={},
            files=file_arg,
            timeout=timeout,
            proxies=proxies
        )
        response.raise_for_status()
I've implemented a web scraper with Node.js, cheerio and request-promise that scrapes an endpoint (a basic HTML page) and returns certain information. The content of the page I'm crawling differs based on a parameter at the end of the URL (http://some-url.com?value=12345 where 12345 is my dynamic value).
I need this crawler to run every x minutes and crawl multiple pages, and to do that I've set up a cron job using Google Cloud Scheduler. (I'm fetching the dynamic values I need from Firebase.)
There could be more than 50 different values for which I'd need to crawl the specific page, but I would like to spread out the requests so the server doesn't choke. To accomplish this, I've tried to add a delay
1) using setTimeout
2) using setInterval
3) using a custom sleep implementation:
const sleep = require('util').promisify(setTimeout);
All 3 of these methods work locally; all of the requests are made with y seconds delay as intended.
But when tried with Firebase Cloud Functions and Google Cloud Scheduler:
1) not all of the requests are sent
2) the delay is NOT consistent (some requests fire with the proper delay, then there are no requests made for a while and other requests are sent with a major delay)
I've tried many things but I wasn't able to solve this problem.
I was wondering if anyone could suggest a different theoretical approach or a certain library etc. I can take for this scenario, since the one I have now doesn't seem to work as I intended. I'm adding one of the approaches that locally work below.
Cheers!
courseDataRefArray.forEach(async (dataRefObject: CourseDataRef, index: number) => {
    console.log(`Foreach index = ${index} -- Hello StackOverflow`);
    setTimeout(async () => {
        console.log(`Index in setTimeout = ${index} -- Hello StackOverflow`);
        await CourseUtil.initiateJobForCourse(dataRefObject.ref, dataRefObject.data);
    }, 2000 * index);
});
(Note: I can provide more code samples if necessary; but it's mostly following a loop & async/await & setTimeout pattern, and since it works locally I'm assuming that's not the main problem.)
I had a code snippet that I tried a couple of weeks ago that was getting the HTML of the page with no problem (I was learning about NLP and wanted to try a couple of things with the titles). However, now when I perform a request with proxies (proxies=session.proxies) I get a 404. When I omit the proxies everything is fine (that means that I am using my own IP; the headers are identical in both cases)... Can someone help me with what is going on here? I am positive that the IPs are blocked. When I use some free proxies from the Internet everything is fine. But they are super unstable, so it's not possible to do anything with them. I have also looked at the IPs that this code snippet produces:
import time
from stem import Signal
from stem.control import Controller

with Controller.from_port(port=9051) as controller:
    controller.authenticate('my_pass')
    controller.signal(Signal.NEWNYM)
    time.sleep(controller.get_newnym_wait())
2 IPs worked (Netherlands and France) out of the roughly 20 I have tried so far (Austria, UK, Netherlands (but starting with different two leading digits), Liberia, France (also different two leading digits)). Is it possible to tell Tor to use exit nodes from specific countries? I guess that wouldn't help me much. If only I could cycle through the working IPs, but I have read somewhere that that is not possible.
Here is the code that precedes the above code snippet:
import requests

global session
session = requests.session()
session.proxies = {}
session.proxies['http'] = 'socks5h://localhost:9050'
session.proxies['https'] = 'socks5h://localhost:9050'

url = 'https://www.reuters.com/finance/stocks/company-news/AAPL.OQ?date=12222016'
html = requests.get(url, timeout=(120, 120), proxies=session.proxies)
print(html)  # prints the Response object (e.g. <Response [404]>), not the page HTML
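For checking which exit IP the requests actually leave from, here is a minimal sketch that routes a request to an IP echo service through the same SOCKS proxy as above (httpbin.org/ip simply reports the caller's IP; any similar echo service works):

import requests

tor_proxies = {
    'http': 'socks5h://localhost:9050',
    'https': 'socks5h://localhost:9050',
}

# httpbin.org/ip echoes back the IP the request arrives from,
# i.e. the current Tor exit node when routed through the SOCKS proxy
r = requests.get('https://httpbin.org/ip', proxies=tor_proxies, timeout=30)
print('current exit IP:', r.json()['origin'])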
Tried changing user agents to no avail.
Also note that, for the above to work, you need this.
So the title is a little confusing I guess..
I have a script that I've been writing that will display some random data and other non-essentials when I open my shell. I'm using grequests to make my API calls since I'm using more than one URL. For my weather data, I use WeatherUnderground's API since it will offer active alerts. The alerts and conditions data are on separate pages. What I can't figure out is how to insert the appropriate name in the grequests object when it is making requests. Here is the code that I have:
import grequests

api_id = 'myapikeyhere'  # WeatherUnderground API key

URLS = ['http://api.wunderground.com/api/' + api_id + '/conditions/q/autoip.json',
        'http://www.ourmanna.com/verses/api/get/?format=json',
        'http://quotes.rest/qod.json',
        'http://httpbin.org/ip']

requests = (grequests.get(url) for url in URLS)
responses = grequests.map(requests)
data = [response.json() for response in responses]
# json parsing from here
In the URL 'http://api.wunderground.com/api/'+api_id+'/conditions/q/autoip.json' I need to make an API request to conditions and alerts to retrieve the data I need. How do I do this without rewriting a fourth URLS string?
I've tried
pages = ['conditions', 'alerts']
URL = ['http://api.wunderground.com/api/'+api_id+([p for p in pages])/q/autoip.json']
but, as I'm sure some of you more seasoned programmers know, it threw an exception. So how can I iterate through these pages, or will I have to write out both complete URLs?
Thanks!
Ok, I was actually able to figure out how to call each individual page within the grequests object by using a simple for loop. Here is the code that I used to produce the expected results:
import grequests

pages = ['conditions', 'alerts']
api_id = 'myapikeyhere'

for p in pages:
    URLS = ['http://api.wunderground.com/api/' + api_id + '/' + p + '/q/autoip.json',
            'http://www.ourmanna.com/verses/api/get/?format=json',
            'http://quotes.rest/qod.json',
            'http://httpbin.org/ip']
    # create grequests object and retrieve results
    requests = (grequests.get(url) for url in URLS)
    responses = grequests.map(requests)
    data = [response.json() for response in responses]
    # json parsing from here
I'm still not sure why I couldn't figure this out before.
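For what it's worth, here is a variant sketch that builds both WeatherUnderground URLs up front and fires everything in a single grequests.map call, so the three shared endpoints are only requested once (the names mirror the snippet above):

import grequests

pages = ['conditions', 'alerts']
api_id = 'myapikeyhere'

# one WeatherUnderground URL per page, plus the shared endpoints, all in one batch
URLS = ['http://api.wunderground.com/api/' + api_id + '/' + p + '/q/autoip.json'
        for p in pages]
URLS += ['http://www.ourmanna.com/verses/api/get/?format=json',
         'http://quotes.rest/qod.json',
         'http://httpbin.org/ip']

responses = grequests.map(grequests.get(url) for url in URLS)
data = [response.json() for response in responses if response is not None]
# json parsing from here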
Documentation for the grequests library here
I've been scratching my head about this the whole day yesterday and, to my surprise, I can't seem to find an easy way to check it.
I am using Python's Requests library to pass my proxy, like so:
import requests
from requests.adapters import HTTPAdapter


def make_request(url):
    with requests.Session() as s:
        s.mount("http://", HTTPAdapter(max_retries=3))
        s.mount("https://", HTTPAdapter(max_retries=3))
        page = None
        # d is a deque of proxy dicts; rotating it uses a different proxy
        # every time make_request is called
        d.rotate(-1)
        s.proxies = d[0]
        page = s.get(url, timeout=3)
        print('proxy used: ' + str(d[0]))
        return page.content
Problem is, I can't seem to make the request fail when the proxy is not expected to work. It seems there is always a fallback to my own internet IP if the proxy is not working.
For example: I tried passing a random proxy IP like 101.101.101.101:8800, or removing the IP authentication that is needed on my proxies, and the request still goes through, even though it shouldn't.
I thought adding the timeout parameter when making the request would do the trick, but obviously it didn't.
So
Why does this happen?
How can I check which IP a request is being made from?
From what I have seen so far, you should use the form
r = s.get(url, proxies=d)
This should use the proxies in the dict d to make the connection.
This form allowed me to check the status_code with both working and non-working proxies:
print(r.status_code)
I will update once I find out whether it just cycles through the proxies in the dict until it finds a working one, or whether one is able to actually select which one is used.
[UPDATE]
Tried to work around the dict in proxies so I could use a different proxy if I wanted to. However, proxies must be a dict to work. So I used a dict of the form:
d = {"https" : 'https://' + str(proxy_ips[n].strip('\n'))}
This seems to work and allows me to use the IP I want to. Although it seems quite dull, I hope someone might come along and help!
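Along those lines, here is a small sketch for verifying which IP each proxy actually presents; the proxy_ips entries below are hypothetical placeholders (in practice they would be loaded from wherever your list lives), and httpbin.org/ip is only used to echo back the caller's IP:

import requests

# hypothetical example entries; in practice proxy_ips would be loaded elsewhere
proxy_ips = ['203.0.113.10:8800\n', '203.0.113.11:8800\n']

for n, proxy_ip in enumerate(proxy_ips):
    d = {"https": 'https://' + str(proxy_ip.strip('\n'))}
    try:
        # ask an echo service which IP it sees for this request
        r = requests.get('https://httpbin.org/ip', proxies=d, timeout=5)
        print(n, 'origin:', r.json()['origin'], 'status:', r.status_code)
    except requests.exceptions.RequestException as exc:
        # a dead or unreachable proxy should fail here instead of silently
        # falling back to the local IP
        print(n, 'proxy failed:', exc)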
The proxies used can be seen through:
requests.utils.getproxies()
or
requests.utils.get_environ_proxies(url)
I hope that helps; obviously this is quite an old question, but still!