Python3 requests.get is too slow

EDIT Adding info:
requests version: 2.21.0
Server info: a Python implementation running on Windows that spawns 10 threading.Thread instances, each creating an HTTPServer with a handler based on BaseHTTPRequestHandler. My do_GET looks like this:
def do_GET(self):
    rc = 'some response'
    self.send_response(200)
    self.send_header('Content-type', 'text/html')
    self.send_header('Access-Control-Allow-Origin', '*')
    self.end_headers()
    self.wfile.write(rc.encode('utf-8'))
I'm seeing some strange behaviour.
Using the curl command line, the GET request finishes quickly:
curl "http://localhost:3020/pbio/button2?cmd=uz-crosslink-leds&g1=0&g2=0&g3=0&g4=1&tmr=1"
However, Python's requests.get() takes far too long. I've isolated it down to:
python -c "import requests; requests.get('http://localhost:3020/pbio/button2?cmd=uz-crosslink-leds&g1=0&g2=0&g3=0&g4=1&tmr=1')"
I scanned through many other questions here and have tried many things, without success.
Here are some of my findings:
If I add timeout=0.2, the call ends quickly without any error.
However, adding timeout=5 or timeout=(5,5) doesn't make it take longer. It always seems to wait a full second before returning the result.
Working with a session wrapper and disabling keep-alive didn't improve things either. I mean this:
with requests.Session() as session:
    session.headers.update({'Connection': 'close'})
    url = "http://localhost:3020/pbio/button2?cmd=uz-crosslink-leds&g1=0&g2=0&g3=0&g4=%d&tmr=0" % i
    session.get(url, timeout=2)
Enabling full debug, I'm getting the following output:
url=http://localhost:3020/pbio/button2?cmd=uz-crosslink-leds&g1=0&g2=0&g3=0&g4=1&tmr=0
DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): localhost:3020
send: b'GET /pbio/button2?cmd=uz-crosslink-leds&g1=0&g2=0&g3=0&g4=1&tmr=0 HTTP/1.1\r\nHost: localhost:3020\r\nUser-Agent: python-requests/2.21.0\r\nAccept-Encoding: gzip, deflate\r\nAccept: */*\r\nConnection: close\r\n\r\n'
reply: 'HTTP/1.0 200 OK\r\n'
header: Server: BaseHTTP/0.6 Python/3.7.2
header: Date: Wed, 01 May 2019 15:28:29 GMT
header: Content-type: text/html
header: Access-Control-Allow-Origin: *
DEBUG:urllib3.connectionpool:http://localhost:3020 "GET /pbio/button2?cmd=uz-crosslink-leds&g1=0&g2=0&g3=0&g4=1&tmr=0 HTTP/1.1" 200 None
url=http://localhost:3020/pbio/powermtr?cmd=read-power-density
DEBUG:urllib3.connectionpool:Resetting dropped connection: localhost
slight pause here
send: b'GET /pbio/powermtr?cmd=read-power-density HTTP/1.1\r\nHost: localhost:3020\r\nUser-Agent: python-requests/2.21.0\r\nAccept-Encoding: gzip, deflate\r\nAccept: */*\r\nConnection: close\r\n\r\n'
reply: 'HTTP/1.0 200 OK\r\n'
header: Server: BaseHTTP/0.6 Python/3.7.2
header: Date: Wed, 01 May 2019 15:28:30 GMT
header: Content-type: text/html
header: Access-Control-Allow-Origin: *
DEBUG:urllib3.connectionpool:http://localhost:3020 "GET /pbio/powermtr?cmd=read-power-density HTTP/1.1" 200 None
6.710,i=4
url=http://localhost:3020/pbio/button2?cmd=uz-crosslink-leds&g1=0&g2=0&g3=0&g4=4&tmr=0
DEBUG:urllib3.connectionpool:Resetting dropped connection: localhost
slight pause here
...

From the docs:
timeout is not a time limit on the entire response download; rather, an exception is raised if the server has not issued a response for timeout seconds (more precisely, if no bytes have been received on the underlying socket for timeout seconds). If no timeout is specified explicitly, requests do not time out.

It took me 3 years to find an answer.
I still do not understand why, but at least I can suggest a working solution.
According to these docs, the timeout can be specified as a tuple, like this:
(timeout for connection, timeout for interval without data)
Although I do not understand why requests is waiting for [timeout] before issuing the connection, I can tell it to wait very little for the connection and specify a separate timeout for the data.
So what I'm doing now is passing a timeout of, say, (0.01, 4). The connection is then immediate, and if no data arrives for 4 seconds, a timeout exception is raised.
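For illustration, a minimal sketch of this workaround; the URL is the one from the question, and the exact numbers are only an example:
import requests

url = 'http://localhost:3020/pbio/button2?cmd=uz-crosslink-leds&g1=0&g2=0&g3=0&g4=1&tmr=1'

# timeout=(connect timeout, read timeout): give the connection almost no slack,
# but allow up to 4 seconds without receiving any data.
r = requests.get(url, timeout=(0.01, 4))
print(r.status_code, r.text)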
Some interesting reading can be found here.
Hoping this info will help others!

Related

Bottle: HEAD always falls back to GET, thus functions always executed twice?

I'm using Bottle to implement a web interface for a simple database system. As documented, Bottle handles HTTP HEAD requests by falling back to the corresponding GET route and cutting off the response body. However, in my experience, this means that the function attached to the GET route is executed twice in response to a GET request. This can be problematic if that function performs an operation with side effects, such as a database operation.
Is there a way to prevent this double execution from happening? Or should I define a fake HEAD route for every GET route?
Update: It sounds like Bottle is working as designed (calling the function only once per request). Your browser is the apparent source of the HEAD requests.
On HEAD requests, Bottle calls the method once, not twice. Can you demonstrate some code that shows the behaviour you're describing? When I run the following code, I see the "Called" line only once:
from bottle import Bottle, request

app = Bottle()

@app.get("/")
def home():
    print(f"Called: {request.method}")
    return "Some text\n"

app.run()
Output:
$ curl --head http://127.0.0.1:8080/
Called: HEAD
HTTP/1.0 200 OK
127.0.0.1 - - [13/Jan/2021 08:28:02] "HEAD / HTTP/1.1" 200 0
Date: Wed, 13 Jan 2021 13:28:02 GMT
Server: WSGIServer/0.2 CPython/3.8.6
Content-Length: 10
Content-Type: text/html; charset=UTF-8

Python3 Unable to Decode POST Request Result

I am sending a POST request with sockets and trying to decode the received HTML and print it to the terminal.
This works fine for my initial GET request, but when I try to decode and print the POST response I just get garbled text.
How can I change my decode so that the text is readable?
POST
body = "hash="+md5
headers = """\
POST / HTTP/1.1\r
Host: url.com:57555\r
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:66.0) Gecko/20100101 Firefox/66.0\r
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8\r
Accept-Language: en-US,en;q=0.5\r
Accept-Encoding: gzip, deflate\r
Referer: http://url.com:57555/\r
Content-Type: application/x-www-form-urlencoded\r
Content-Length: 32\r
Connection: close\r
Cookie: PHPSESSID=some_cookie\r
Upgrade-Insecure-Requests: 1\r
\r\n"""
payload = headers + body
s.sendall(payload.encode('utf-8'))
res = s.recv(4096)
print(str(res, errors='replace'))
Result...
python3 emdee5.py
HTTP/1.1 200 OK
Date: Sun, 26 May 2019 22:01:26 GMT
Server: Apache/2.4.18 (Ubuntu)
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate
Pragma: no-cache
Vary: Accept-Encoding
Content-Encoding: gzip
Content-Length: 282
Connection: close
Content-Type: text/html; charset=UTF-8
]�1o� ���
ʒ��Ҩ��b�V��LN��؜
p�$����Py��d��FP��l� ^�֞i�ĜmA��F7i�zd}��VͩK8}ߠ���!�n�W>�wL9ۅr�#Ȑ����� 4i��ec{"%��0���)������W���A�I��"��GD�;�܉"J��JA}x��l1��3٠.y�>Om�#5��9
��ڨ�p�j����JN���MQ̀)�:�p�P{K���4J^-��+�7�oV'E;'=�����l�
Your request explicitly says that you are willing to accept a compressed response:
Accept-Encoding: gzip, deflate\r
And this is therefore what you get in the response:
Content-Encoding: gzip
So the body is compressed with gzip (which explains the garbled output) and you would need to decompress it. Given that you currently don't seem to be able to deal with compressed responses properly, you should not claim in your request that you support them, i.e. remove the Accept-Encoding header.
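If you do want to keep advertising gzip support, you have to decompress the body yourself. A rough sketch, reusing the question's socket s and reading until the server closes the connection (note it still does not handle chunked responses):
import gzip

# Read until the server closes the connection (Connection: close was requested),
# rather than relying on a single recv().
chunks = []
while True:
    chunk = s.recv(4096)
    if not chunk:
        break
    chunks.append(chunk)
raw = b''.join(chunks)

# Split the raw HTTP response into the header block and the body at the blank line.
header_blob, _, body = raw.partition(b'\r\n\r\n')

# Decompress only if the server declared gzip encoding (as it does here).
if b'Content-Encoding: gzip' in header_blob:
    body = gzip.decompress(body)

print(body.decode('utf-8', errors='replace'))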
Apart from that, more is likely wrong with your request:
body = "hash="+md5
...
Content-Length: 32\r
...
payload = headers + body
...
Given that md5 is 32 hex characters (or 16 bytes binary), the body consisting of "hash="+md5 is 37 characters long, not the 32 you claim in your Content-Length.
POST / HTTP/1.1\r
Additionally, you send an HTTP/1.1 request, so you have to be able to deal with chunked responses - but your code does not handle these.
res = s.recv(4096)
Similarly, your code blindly assumes that the full response can be retrieved with a single recv, which need not be the case.
In summary: unless you have a deeper understanding of how HTTP works (which you do not seem to have), it is recommended that you use an existing library to handle HTTP for you, since those were written by developers who do understand it.
And even if you already understand HTTP, you will likely use these libraries anyway, since you'll know that HTTP is far from trivial and that it makes no sense to reimplement all the necessary details and edge cases yourself when something robust already exists.
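For comparison, roughly the same request with requests is only a few lines; the URL, port, and cookie below are the placeholder values from the question, and md5 stands in for the digest computed earlier:
import requests

md5 = '0' * 32  # placeholder for the digest computed earlier in the question's script

resp = requests.post(
    'http://url.com:57555/',                 # placeholder host and port from the question
    data={'hash': md5},                      # Content-Length is set correctly for you
    cookies={'PHPSESSID': 'some_cookie'},
)
print(resp.text)                             # gzip and chunked encoding are handled transparently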

Check if a large file exists without downloading it

Not sure if this is possible, but I would like to check the status code of an HTTP request to a large file without downloading it; I just want to check if it's present on the server.
Is it possible to do this with Python's requests? I already know how to check the status code but I can only do that after the file has been downloaded.
I guess what I'm asking is: can you issue a GET request and stop it as soon as you've received the response headers?
Use requests.head(). This only retrieves the response headers, not the content; in other words, it will not return the body of the message, but you can get all the information you need from the headers.
The HEAD method is identical to GET except that the server MUST NOT
return a message-body in the response. The metainformation contained
in the HTTP headers in response to a HEAD request SHOULD be identical
to the information sent in response to a GET request. This method can
be used for obtaining metainformation about the entity implied by the
request without transferring the entity-body itself. This method is
often used for testing hypertext links for validity, accessibility,
and recent modification.
For example:
import requests
url = 'http://lmsotfy.com/so.png'
r = requests.head(url)
r.headers
Output:
{'Content-Type': 'image/png', 'Content-Length': '6347', 'ETag': '"18cb-4f7c2f94011da"', 'Accept-Ranges': 'bytes', 'Date': 'Mon, 09 Jan 2017 11:23:53 GMT', 'Last-Modified': 'Thu, 24 Apr 2014 05:18:04 GMT', 'Server': 'Apache', 'Keep-Alive': 'timeout=2, max=100', 'Connection': 'Keep-Alive'}
This code does not download the picture; it only returns the response headers, which contain its size, type, and date. If the picture does not exist, there will be no such information.
Use the HEAD method. For example, with urllib:
import urllib.request

request = urllib.request.Request(url, method='HEAD')
response = urllib.request.urlopen(request)
if response.getcode() == 200:
    print(response.headers['content-length'])
In your case with requests:
import requests

response = requests.head(url)
if response.status_code == 200:
    print(response.headers['content-length'])
Normally, you use the HEAD method instead of GET for this sort of thing. If you query some random server on the web, be prepared that it may be configured to return inconsistent results for HEAD requests (this is typical for servers requiring registration). In such cases you may want to use a GET request with a Range header to download only a small number of bytes, as sketched below.
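A rough sketch of that Range fallback with requests (reusing the example image URL from above; the exact headers you get back depend on the server):
import requests

url = 'http://lmsotfy.com/so.png'   # example URL from above

# Ask for only the first byte; stream=True keeps the body from being downloaded eagerly.
r = requests.get(url, headers={'Range': 'bytes=0-0'}, stream=True)
print(r.status_code)                     # 206 Partial Content if ranges are supported, 200 otherwise
print(r.headers.get('Content-Range'))    # e.g. 'bytes 0-0/6347' reveals the full size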

What makes conditional GETs "conditional" if the resource is obtained in the initial request?

Breaking down what makes a conditional GET:
In RFC 2616 it states that the GET method changes to a "conditional GET" if the request message includes an If-* (If-Modified-Since, If-Unmodified-Since, If-Match, If-None-Match, or If-Range) header field.
It then states:
A conditional GET method requests that the entity be transferred ONLY under the circumstances described by the conditional header field(s).
From my understanding, this is saying it will only return the data being requested if the condition described by the "If-*" header is met in a subsequent request. For example, if a GET request returns a response with an ETag header, then the next request must include If-None-Match with that ETag value in order to have the requested resource transferred back to the client.
However, if a client has to send an initial request before getting the returned "ETag" header (to send back with If-None-Match), then they already have the requested resource. Thus, any future request that sends the If-None-Match header with the ETag value only dictates the return of the requested value, returning 200 OK (if the client does not send the If-None-Match and ETag value from the initial request) or 304 Not Modified (if it does), which helps the client and server by caching the resource.
My Question:
Why does it state that the entity (the resource from a request) will "be transferred ONLY" if the If-* condition is met (like in my example where the client sends the ETag value with an If-None-Match in order to cache the requested resource), when the resource or "entity" is returned with or without an "If-*" header being sent? It doesn't return a resource "only under the circumstances described by the conditional header", because it returns the resource regardless, answering with 200 OK or 304 Not Modified depending on whether an "If-*" header is sent. What am I misunderstanding about this?
Full conditional GET reference from RFC 2616:
The semantics of the GET method change to a "conditional GET" if the request message includes an If-Modified-Since, If-Unmodified-Since, If-Match, If-None-Match, or If-Range header field. A conditional GET method requests that the entity be transferred only under the circumstances described by the conditional header field(s). The conditional GET method is intended to reduce unnecessary network usage by allowing cached entities to be refreshed without requiring multiple requests or transferring data already held by the client.
First of all, please note that RFC 2616 is obsolete, and you should refer instead to RFC 7232.
It's hard to see what exactly is confusing you. So let me just illustrate with examples instead.
Scenario 1
Client A: I need http://example.com/foo/bar.
GET /foo/bar HTTP/1.1
Host: example.com
Server: Here you go.
HTTP/1.1 200 OK
Content-Type: text/plain
Content-Length: 12
ETag: "2ac07d4"
Hello world!
(some time passes)
Client A: I need http://example.com/foo/bar again. But I already have the "2ac07d4" version in my cache. Maybe that will do?
GET /foo/bar HTTP/1.1
Host: example.com
If-None-Match: "2ac07d4"
Server: Yeah, "2ac07d4" is fine. Just take it from your cache, I'm not sending it to you.
HTTP/1.1 304 Not Modified
Scenario 2
Client A: I need http://example.com/foo/bar.
GET /foo/bar HTTP/1.1
Host: example.com
Server: Here you go.
HTTP/1.1 200 OK
Content-Type: text/plain
Content-Length: 12
ETag: "2ac07d4"
Hello world!
(some time passes)
Client B: I want to upload a new version of http://example.com/foo/bar.
PUT /foo/bar HTTP/1.1
Content-Type: text/plain
Content-Length: 17
Hello dear world!
Server: This looks good, I'm saving it. I will call this version "f6049b9".
HTTP/1.1 204 No Content
ETag: "f6049b9"
(more time passes)
Client A: I need http://example.com/foo/bar again. But I already have the "2ac07d4" version in my cache. Maybe that will do?
GET /foo/bar HTTP/1.1
Host: example.com
If-None-Match: "2ac07d4"
Server: I'm sorry, but "2ac07d4" is out of date. We have a new version now, it's called "f6049b9". Here, let me send it to you.
HTTP/1.1 200 OK
Content-Type: text/plain
Content-Length: 17
ETag: "f6049b9"
Hello dear world!
Analysis
A conditional GET method requests that the entity be transferred ONLY under the circumstances described by the conditional header field(s).
Consider Client A's second request (in both scenarios).
The conditional header field is: If-None-Match: "2ac07d4".
The circumstances described by it are: "a selected representation of the resource does not match entity-tag "2ac07d4"".
Scenario 1: the circumstances do not hold, because the selected representation of the resource (the one containing Hello world!) does indeed match entity-tag "2ac07d4". Therefore, in accordance with the protocol, the server does not transfer the entity in its response.
Scenario 2: the circumstances do hold: the selected representation of the resource (the one containing Hello dear world!) doesn't match entity-tag "2ac07d4" (it matches "f6049b9" instead). Therefore, in accordance with the protocol, the server does transfer the entity in its response.
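To make the two scenarios concrete from the client's side, here is a small illustrative sketch in Python with requests (example.com/foo/bar is just the made-up URL from the scenarios):
import requests

url = 'http://example.com/foo/bar'        # made-up URL from the scenarios above

# First request: unconditional; the server answers 200 with a body and an ETag.
first = requests.get(url)
etag = first.headers.get('ETag')          # e.g. '"2ac07d4"'
cached_body = first.content               # keep this in a local cache

# Later request: conditional on the cached ETag.
second = requests.get(url, headers={'If-None-Match': etag})
if second.status_code == 304:
    body = cached_body                    # Scenario 1: not modified, reuse the cached copy
else:
    body = second.content                 # Scenario 2: a new representation was transferred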
How does the server come up with these "2ac07d4" and "f6049b9", anyway? Of course, this depends on the application, but one straightforward way to do it is to compute a hash (such as SHA-1) of the entity body: a value that changes dramatically when even small changes are introduced.
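For instance, a hypothetical ETag scheme along those lines could be as simple as:
import hashlib

def compute_etag(body: bytes) -> str:
    # Hypothetical scheme: a short prefix of the SHA-1 digest of the entity body.
    return '"%s"' % hashlib.sha1(body).hexdigest()[:7]

print(compute_etag(b'Hello world!'))      # a completely different tag results from b'Hello dear world!'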

ArangoDB batch mode doesn't work with Java Driver

I am evaluating ArangoDB using Spring Batch.
I tried to insert some data and, without batch mode, it works as expected.
However, if batch mode is on, the execution of the program hangs.
I am using arango 2.3.3 and com.arangodb:arangodb-java-driver:[2.2-SNAPSHOT,2.2]
arangoDriver.startBatchMode();
for (Account acc : items) {
    acc.getRecordHash();
    acc.getIdHash();
    arangoDriver.createDocument("AccountCollection", acc);
}
arangoDriver.executeBatch();
Any ideas what I am doing wrong?
I tried to reproduce what you are doing. First of all, does the collection "AccountCollection" exist? If not, you would get an error in the batch result, but the program still should not hang. I created a unit test:
@Test
public void test_StartCancelExecuteBatchMode() throws ArangoException {
    driver.startBatchMode();
    ArrayList<Account> items = new ArrayList<Account>();
    items.add(new Account());
    items.add(new Account());
    items.add(new Account());
    items.add(new Account());
    for (Account acc : items) {
        acc.getRecordHash();
        acc.getIdHash();
        driver.createDocument("AccountCollection", acc, true, false);
    }
    driver.executeBatch();
}
This works perfectly and returns:
EOB
16:47:01.862 [main] DEBUG com.arangodb.http.HttpManager - [RES]http-POST: statusCode=200
16:47:01.862 [main] DEBUG com.arangodb.http.HttpManager - [RES]http-POST: text=--dlmtrMLTPRT
Content-Type: application/x-arango-batchpart
Content-Id: request1
HTTP/1.1 202 Accepted
Location: /_db/unitTestDatabase/_api/document/AccountCollection/48033214501
Content-Type: application/json; charset=utf-8
Etag: "48033214501"
Content-Length: 95
{"error":false,"_id":"AccountCollection/48033214501","_rev":"48033214501","_key":"48033214501"}
--dlmtrMLTPRT
Content-Type: application/x-arango-batchpart
Content-Id: request2
HTTP/1.1 202 Accepted
Location: /_db/unitTestDatabase/_api/document/AccountCollection/48033411109
Content-Type: application/json; charset=utf-8
Etag: "48033411109"
Content-Length: 95
{"error":false,"_id":"AccountCollection/48033411109","_rev":"48033411109","_key":"48033411109"}
--dlmtrMLTPRT
Content-Type: application/x-arango-batchpart
Content-Id: request3
HTTP/1.1 202 Accepted
Location: /_db/unitTestDatabase/_api/document/AccountCollection/48033607717
Content-Type: application/json; charset=utf-8
Etag: "48033607717"
Content-Length: 95
{"error":false,"_id":"AccountCollection/48033607717","_rev":"48033607717","_key":"48033607717"}
--dlmtrMLTPRT
Content-Type: application/x-arango-batchpart
Content-Id: request4
HTTP/1.1 202 Accepted
Location: /_db/unitTestDatabase/_api/document/AccountCollection/48033804325
Content-Type: application/json; charset=utf-8
Etag: "48033804325"
Content-Length: 95
{"error":false,"_id":"AccountCollection/48033804325","_rev":"48033804325","_key":"48033804325"}
--dlmtrMLTPRT--
But even when I create intentional errors, the application never "hangs".
Frank just sent me your source code; I'll take a look at it. Can you try to find out where the program is hanging? Is "executeBatch" reached at all?
I have already imported 1.6 million documents with your code and everything still works.
I guess it might be necessary to monitor your system resources during the import; if anything unusual occurs, let us know. Generally speaking, it does not seem to be best practice to perform a one-time bulk import like this using the Java API. I would recommend using arangoimp to import the data directly into the database, which will be much faster. It is documented here.
You need to increase the number of open file descriptors. The Mac has a very low default limit (256). ArangoDB stores the data in datafiles of a certain chunk size. With large datasets more files are needed (and some file descriptors are already used for communication and other things).
When ArangoDB runs out of file descriptors, it can neither extend the dataset nor answer new requests. Therefore the import process hangs.
