I am trying to use the requests_mock as part of my unit test to test API Calls which run in a while loop so in order to end the while loop or to meet the condition of the while loop I need to send different response to my API. My URL remains the same but the param changes but requests_mock doesn't really cares about it.
My function goes like this :
def func():
response = requests.get(
url=<url>
params={"limit": 1000}
headers=<headers>
).json()
while "next" in response["info"].keys():
response = requests.get(
url=<url>
params={"limit": 1000, "info": response["info"]}
headers=<headers>
).json()
My test looks like:
def test_url(requests_mock):
requests_mock.get(url, json=<response_with_info_key>)
func()
data_request = requests_mock.request_history[0]
assert data_request.query = "limit=1000"
What i want is my while loop to end with my second response to be without the "next" key. What I have already tried :
def test_url(requests_mock):
requests_mock.get(url, json=[<response_with_info_key>, <response_without_info_key>])
func()
data_request = requests_mock.request_history[0]
assert data_request.query = "limit=1000"
The simplest way to explain the whole question would be : How do i make requests-mock send two different response for the same API?
Related
I have a function that submits a search job to a REST API, waits for the API to respond, then downloads 2 sets of JSON data, converts the both JSON's into Pandas dataframes, and returns both dataframes. below is a very simplified version of the function(minus error handling, logging, data scrubbing, etc...)
def getdata(searchstring, url, uname, passwd):
headers = {'content-type': 'application/json'}
json_data = CreateJSONPayload(searchstring)
rPOST = requests.post(url, auth=(uname, passwd), data=json_data, headers=headers)
statusURL = (str(json.loads(rPOST.text)[u'link'][u'href']))
Processing = True
while Processing == True:
rGET = requests.get(statusURL, auth=(uname, passwd))
if rGET.status_code== 200:
url1 = url + "/dataset1"
url2 = url + "/dataset2"
rGET1 = requests.get(url1, auth=(uname, passwd))
rGET2 = requests.get(url2, auth=(uname, passwd))
dfData1 = pd.read_json(rGET1.text)
dfData2 = pd.read_json(rGET2.text)
Processing = False
elif StatusCode == "Other return code handling":
print("handle errors") # Not relevant to question.
else:
sleep(15)
return dfData1, dfData2
The function itself works as expected. However the API being called can take anywhere from a couple of minutes to an hour to return the data depending on the parameters I pass to it and I need to submit multiple searches to it, so I'd rather not submit each search one after the other.
What's the best way to parallelize the calling of a function like this so that I can submit multiple requests to it at the same time, wait for all calls of the function have returned data, and finally continue on with data processing in the script?
I also need to be able to throttle the requests too, as the API rate limits me to no more than 15 concurrent connections at a time.
I'm writing a short Python program to request a JSON file using a Rest API call. The API limits me to a relatively small results set (50 or so) and I need to retrieve several thousand result sets. I've implemented a while loop to achieve this and it's working fairly well but I can't figure out the logic for 'continuing the while loop' until there are no more results to retrieve. Right now I've implemented a hard number value but would like to replace it with a conditional that stops the loop if no more results come back. The 'offset' field is the parameter that the API forces you to use to specify which set of results you want in your 50. My logic looks something like...
import requests
import json
from time import sleep
url = "https://someurl"
offsetValue = 0
PARAMS = {'limit':50, 'offset':offsetValue}
headers = {
"Accept": "application/json"
}
while offsetValue <= 1000:
response = requests.request(
"GET",
url,
headers=headers,
params = PARAMS
)
testfile = open("testfile.txt", "a")
testfile.write(json.dumps(json.loads(response.text), sort_keys=True, indent=4, separators=(",", ": ")))
testfile.close()
offsetValue = offsetValue + 1
sleep(1)
So I want to change the conditional the controls the while loop from a fixed number to a check to see if the result set for the getRequest is empty. Hopefully this makes sense.
Your loop can be while true. After each fetch, convert the payload to a dict. If the number of results is 0, then break.
Depending on how the API works, there may be other signals that there’s nothing more to fetch, e.g. some HTTP error, not necessarily the result count — you’ll have to discover the API’s logic for that.
I am trying to mock a simple POST request that creates a resource from the request body, and returns the resource that was created. For simplicity, let's assume the created resource is exactly as passed in, but given an ID when created. Here is my code:
def test_create_resource(requests_mock):
# Helper function to generate dynamic response
def get_response(request, context):
context.status_code = 201
# I assumed this would contain the request body
response = request.json()
response['id'] = 100
return response
# Mock the response
requests_mock.post('test-url/resource', json=get_response)
resource = function_that_creates_resource()
assert resource['id'] == 100
I end up with runtime error JSONDecodeError('Expecting value: line 1 column 1 (char 0)'). I assume this is because request.json() does not contain what I am looking for. How can I access the request body?
I had to hack up your example a little bit as there is some information missing - but the basic idea works fine for me. I think as mentioned something is wrong with the way you're creating the post request.
import requests
import requests_mock
with requests_mock.mock() as mock:
# Helper function to generate dynamic response
def get_response(request, context):
context.status_code = 201
# I assumed this would contain the request body
response = request.json()
response['id'] = 100
return response
# Mock the response
mock.post('http://example.com/test-url/resource', json=get_response)
# resource = function_that_creates_resource()
resp = requests.post('http://example.com/test-url/resource', json={'a': 1})
assert resp.json()['id'] == 100
This example is not complete and so we cannot truly see what is happening.
In particular, it would be useful to see a sample function_that_creates_resource.
That said, I think your get_response code is valid.
I believe that you are not sending valid JSON data in your post request in function_that_creates_resource.
I'm trying to do some web-scraping, as learning, using a predefined number of workers.
I'm using None as as sentinel to break out of the while loop and stop the worker.
The speed of each worker varies, and all workers are closed before the last
url is passed to gather_search_links to get the links.
I tried to use asyncio.Queue, but I had less control than with deque.
async def gather_search_links(html_sources, detail_urls):
while True:
if not html_sources:
await asyncio.sleep(0)
continue
data = html_sources.pop()
if data is None:
html_sources.appendleft(None)
break
data = BeautifulSoup(data, "html.parser")
result = data.find_all("div", {"data-component": "search-result"})
for record in result:
atag = record.h2.a
url = f'{domain_url}{atag.get("href")}'
detail_urls.appendleft(url)
print("apended data", len(detail_urls))
await asyncio.sleep(0)
async def get_page_source(urls, html_sources):
client = httpx.AsyncClient()
while True:
if not urls:
await asyncio.sleep(0)
continue
url = urls.pop()
print("url", url)
if url is None:
urls.appendleft(None)
break
response = await client.get(url)
html_sources.appendleft(response.text)
await asyncio.sleep(8)
html_sources.appendleft(None)
async def navigate(urls):
for i in range(2, 7):
url = f"https://www.example.com/?page={i}"
urls.appendleft(url)
await asyncio.sleep(0)
nav_urls.appendleft(None)
loop = asyncio.get_event_loop()
nav_html = deque()
nav_urls = deque()
products_url = deque()
navigate_workers = [asyncio.ensure_future(navigate(nav_urls)) for _ in range(1)]
page_source_workers = [asyncio.ensure_future(get_page_source(nav_urls, nav_html)) for _ in range(2)]
product_urls_workers = [asyncio.ensure_future(gather_search_links(nav_html, products_url)) for _ in range(1)]
workers = asyncio.wait([*navigate_workers, *page_source_workers, *product_urls_workers])
loop.run_until_complete(workers)
I'm a bit of a newbie, so this could be wrong as can be, but I believe that the issue is that all three of the functions: navigate(), gather_search_links(), and get_page_source() are asynchronous tasks that can be completed in any order. However, your checks for empty deques and your use of appendleft to ensure None is the leftmost item in your deques, look like they would appropriately prevent this. For all intents and purposes the code looks like it should run correctly.
I think the issue arises at this line:
workers = asyncio.wait([*navigate_workers, *page_source_workers, *product_urls_workers])
According to this post, the asyncio.wait function does not order these tasks according to the order they're written above, instead it fires them according to IO as coroutines. Again, your checks at the beginning of gather_search_links and get_page_source are ensuring that one function runs after the other and thus this code should work if there is only a single worker for each function. If there are multiple workers for each function, I can see issues arising where None doesn't wind up being the leftmost item in your deques. Perhaps a print statement at the end of each function to show the contents of your deques would be useful in troubleshooting this.
I guess my major question would be, why do these tasks asnychronously if you're going to write extra code because the steps must be completed synchronously? In order to get the HTML you must first have the URL. In order to scrape the HTML you must first have the HTML. What benefit does asyncio provide here? All three of these make more sense to me as synchronous tasks. Get URL, get HTML, scrape HTML, and in that order.
EDIT: It occurred to me that the main benefit of asynchronous code here is that you don't want to have to wait on each individual URL to respond back synchronously when you fetch the HTML from them. What I would do in this situation is gather my URLs synchronously first, and then combine the get and scrape functions into a single asynchronous function, which would be your only asynchronous function. Then you don't need a sentinel or a check for a "None" value or any of that extra code and you get the full value of the asynchronous fetch. You could then store your scraped data in a list (or deque or whatever) of futures. This would simplify your code and provide you with the fastest possible scrape time.
LAST EDIT:
Here's my quick and dirty rewrite. I liked your code so I decided to do my own spin. I have no idea if it works, I'm not a Python person.
import asyncio
from collections import deque
import httpx as httpx
from bs4 import BeautifulSoup
# Get or build URLs from config
def navigate():
urls = deque()
for i in range(2, 7):
url = f"https://www.example.com/?page={i}"
urls.appendleft(url)
return urls
# Asynchronously fetch and parse data for a single URL
async def fetchHTMLandParse(url):
client = httpx.AsyncClient()
response = await client.get(url)
data = BeautifulSoup(response.text, "html.parser")
result = data.find_all("div", {"data-component": "search-result"})
for record in result:
atag = record.h2.a
#Domain URL was defined elsewhere
url = f'{domain_url}{atag.get("href")}'
products_urls.appendleft(url)
loop = asyncio.get_event_loop()
products_urls = deque()
nav_urls = navigate()
fetch_and_parse_workers = [asyncio.ensure_future(fetchHTMLandParse(url)) for url in nav_urls]
workers = asyncio.wait([*fetch_and_parse_workers])
loop.run_until_complete(workers)
I use Scrapy 1.5.1
My Goal is to go through entire chain of requests for each variable before moving to the next variable. For some reason Scrapy takes 2 variables, then sends 2 requests, then takes another 2 variables and so on.
CONCURRENT_REQUESTS = 1
Here is my code sample:
def parsed ( self, response):
# inspect_response(response, self)
search = response.meta['search']
for idx, i in enumerate(response.xpath("//table[#id='ctl00_ContentPlaceHolder1_GridView1']/tr")[1:]):
__EVENTARGUMENT = 'Select${}'.format(idx)
data = {
'__EVENTARGUMENT': __EVENTARGUMENT,
}
yield scrapy.Request(response.url, method = 'POST', headers = self.headers, body = urlencode(data),callback = self.res_before_get,meta = {'search' : search}, dont_filter = True)
def res_before_get ( self, response):
# inspect_response(response, self)
url = 'http://www.moj-yemen.net/Search_detels.aspx'
yield scrapy.Request(url, callback = self.results, dont_filter = True)
My desired behavior is:
1 value from Parse is sent to res_before_get and then i do smth with it.
then another values from Parse is sent to res_before_get and so on.
Post
Get
Post
Get
But currently Scrapy takes 2 values from Parse and adds them to queue , then sends 2 requests from res_before_get. Thus im getting duplicate results.
Post
Post
Get
Get
What do I miss?
P.S.
This is asp.net site. Its logic is as follows:
makes POST request with search payload.
Make GET request to get actual data.
Both request share the same sessionID
Thats why it is important to preserve the order.
At the moment im getting POST1 and POST2. And since the sessionID is associated with POST2, both GET1 and GET2 return the same page.
Scrapy works asynchronously, so you cannot expect it to respect the order of your loops or anything.
If you need it to work sequentially, you'll have to accommodate the callbacks to work like that, for example:
def parse1(self, response):
...
yield Request(..., callback=self.parse2, meta={...(necessary information)...})
def parse2(self, response):
...
if (necessary information):
yield Request(...,
callback=self.parse2,
meta={...(remaining necessary information)...},
)