Check if a large file exists without downloading it

Not sure if this is possible, but I would like to check the status code of an HTTP request to a large file without downloading it; I just want to check if it's present on the server.
Is it possible to do this with Python's requests? I already know how to check the status code but I can only do that after the file has been downloaded.
I guess what I'm asking is: can you issue a GET request and stop it as soon as you've received the response headers?

Use requests.head(). It sends a HEAD request, so the server returns only the response headers and no message body. The status code and all the header information (size, type, dates) are still available.
The HEAD method is identical to GET except that the server MUST NOT
return a message-body in the response. The metainformation contained
in the HTTP headers in response to a HEAD request SHOULD be identical
to the information sent in response to a GET request. This method can
be used for obtaining metainformation about the entity implied by the
request without transferring the entity-body itself. This method is
often used for testing hypertext links for validity, accessibility,
and recent modification.
For example:
import requests
url = 'http://lmsotfy.com/so.png'
r = requests.head(url)
r.headers
Output:
{'Content-Type': 'image/png', 'Content-Length': '6347', 'ETag': '"18cb-4f7c2f94011da"', 'Accept-Ranges': 'bytes', 'Date': 'Mon, 09 Jan 2017 11:23:53 GMT', 'Last-Modified': 'Thu, 24 Apr 2014 05:18:04 GMT', 'Server': 'Apache', 'Keep-Alive': 'timeout=2, max=100', 'Connection': 'Keep-Alive'}
This code does not download the picture; it returns only the headers of the response, which contain the size, type and date. If the picture does not exist, the server will normally answer with an error status such as 404 (check r.status_code), and the headers will not describe the file.

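Use the HEAD method.
For example, with urllib (note that urlopen sends a GET unless the request says otherwise):
import urllib.request
# method='HEAD' asks the server for the headers only
request = urllib.request.Request(url, method='HEAD')
response = urllib.request.urlopen(request)
if response.getcode() == 200:
    print(response.headers['content-length'])
In your case, with requests:
import requests
response = requests.head(url)
if response.status_code == 200:
    print(response.headers['content-length'])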

Normally you would use the HEAD method instead of GET for this kind of check. If you query an arbitrary server on the web, be prepared for it to answer HEAD inconsistently with GET (this is typical for servers that require registration). In such cases you may want to send a GET request with a Range header so that only a small number of bytes is downloaded.
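The HEAD-then-Range idea can be sketched as follows. This is a minimal illustration using only the standard library, not a definitive implementation: the helper names file_exists and _status, the choice of which status codes count as conclusive, and the bytes=0-0 range are all assumptions made for the example. The opener parameter exists only so the function can be exercised without a network.

```python
import urllib.error
import urllib.request


def _status(request, opener, timeout):
    """Send one request and return its status code without reading the body."""
    try:
        with opener(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urlopen raises for 4xx/5xx responses; the code is still what we need.
        return error.code


def file_exists(url, opener=urllib.request.urlopen, timeout=10):
    """Best-effort existence check that avoids downloading the file."""
    head = urllib.request.Request(url, method="HEAD")
    status = _status(head, opener, timeout)
    if status == 200:
        return True
    if status in (404, 410):
        return False
    # HEAD was inconclusive (for example 403 or 405): ask for one byte only.
    # A server that ignores Range answers 200; the response is closed
    # without reading the body either way.
    ranged = urllib.request.Request(url, headers={"Range": "bytes=0-0"})
    return _status(ranged, opener, timeout) in (200, 206)
```

With requests, the same fallback would pass headers={'Range': 'bytes=0-0'} to requests.get. Network failures (urllib.error.URLError) are deliberately left to propagate here, since "unreachable" is not the same answer as "missing".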

Related

Python Client Rest API Invocation - Invalid character found in method name [{}POST]. HTTP method names must be tokens

Client
Python Version - 3.9,
Python Requests module version - 2.25
Server
Java 13,
Tomcat 9.
I have a Tomcat+Java based server exposing REST APIs. I am writing a client in Python to consume those APIs. Everything is fine until I send an empty body in a POST request, which is a valid use case for us. If I send an empty body I get a 400 Bad Request error: Invalid character found in method name [{}POST]. HTTP method names must be tokens. If I send the empty request from Postman, Java or curl it works fine; the problem occurs only when I use Python as the client.
Following is python snippet -
json_object={}
header = {'alias': 'A', 'Content-Type' : 'application/json', 'Content-Length' : '0'}
resp = requests.post(url, auth=(username, password), headers=header, json=json_object)
I also tried the data parameter instead of json to send the payload, without much success.
I captured Wireshark dumps to understand it further and found that the request Tomcat received does not conform to RFC 2616 (https://www.w3.org/Protocols/rfc2616/rfc2616-sec5.html), especially this part:
Request-Line = Method SP Request-URI SP HTTP-Version CRLF
In the Wireshark dumps the request line looked like this: {}POST MY-APP-URI HTTP/1.1
As we can see, the empty body is being prefixed to the HTTP method, so Tomcat reports an error.
I then looked at python http library code -client.py. Following are relevant details -
File - client.py
Method - _send_output (starting at line # 1001) - It first sends the headers at line #1010 and then the body further down in the code. I thought (I could be wrong here) that because the header (310 bytes) is much longer than the body (2 bytes), the body is pushed before the complete header has been sent on the wire, so the TCP frames are ordered in such a way that the body appears first. To corroborate this I added a delay of 1 second just after sending the header (line #1011) and bingo, the error disappeared and it started working fine. I am not sure this analysis is completely correct; can someone in the know confirm it or let me know how to fix this?

Python requests: chunked post request

I am trying to send a POST request through the requests module with headers["Transfer-encoding"] = "chunked", but I am getting back:
<BODY><h2>Bad Request - Invalid Content Length</h2><hr><p>HTTP Error 400. There is an invalid content length or chunk length in the request.</p>
I am sending a json string. headers["Content-Type"] = "application/json" is also given.
Does anybody know if I am missing some setting? Maybe I should set the chunk-size somewhere?
Analysing the headers of the request attached to the response I actually get a content-length header different from zero.
I also tried to create a custom generator from the JSON string and pass it to the post method as data=, but it seems to simply hang there (even beyond the given timeout=).

Your error says you didn't create the request properly (it's a 4xx error, not a 5xx, which would indicate a server issue).
Transfer-Encoding: chunked is for sending data in chunks: the body of your message consists of an unspecified number of chunks, and you send them as, let's say, a stream. I would suggest reading this.
Each chunk must have its size (in hexadecimal) in front of the data, and a zero-length chunk ends the body. For instance:
HTTP/1.1 200 OK
Content-Type: text/plain
Transfer-Encoding: chunked
9\r\n
Some data\r\n
6\r\n
Python\r\n
0\r\n
\r\n
If you want to send chunked requests with the Python requests module, you probably need a generator for that. Please see this. With so little information I can't help you more.
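To make the framing described above concrete, here is a small sketch of how a body is split and wrapped for Transfer-Encoding: chunked. The function names encode_chunked and split_bytes and the default chunk size are made up for the illustration. It is meant to show the wire format only; an HTTP client library would normally do this framing itself.

```python
def split_bytes(data, size=8):
    """Yield successive slices of `data`, each at most `size` bytes long."""
    for start in range(0, len(data), size):
        yield data[start:start + size]


def encode_chunked(chunks):
    """Yield each chunk in wire form: hex size, CRLF, data, CRLF."""
    for chunk in chunks:
        if not chunk:
            continue  # a zero-length chunk would end the body early
        yield b"%x\r\n%s\r\n" % (len(chunk), chunk)
    yield b"0\r\n\r\n"  # terminating zero-length chunk


if __name__ == "__main__":
    body = b'{"language": "Python", "topic": "chunked"}'
    for frame in encode_chunked(split_bytes(body)):
        print(frame)
```

As far as the requests documentation describes it, passing a generator of bytes as data= is enough for requests to use chunked encoding and add this framing on its own, so the Transfer-Encoding and Content-Length headers are best left unset. Setting the header by hand while passing a plain string body would be consistent with the non-zero Content-Length the question reports.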

What does "content-type" mean in headers of python requests library and if the value is text/html;charset=UTF-8?

I want to do some operations with the response from the Python requests library. After I call the function below:
response = requests.get(f'{AUTHORIZE_URL}?client_id={CLIENT_ID}&response_type=code&state={STATE}&redirect_uri={REDIRECT_URI}')
I need to get back a URL like this:
http://127.0.0.1:8000/products/auth/?state=2b33fdd45jbevd6nam&code=MGY1MTMyNWY0YjQ0MzEwNmMxMjY2ZjcwMWE2MWY5ZDE5MzJlMjA1YjdkNWExNGRhYjIzOGI5NzQ5OWZkNTA5NA
It would be easier to use JSON to get the state and code values from the URL, but I cannot, because I think the content type does not allow it.

See this for Content-Type explanation: Content-Type
In short, the "content-type" header of the response returned by requests.get tells you what kind of content the server sent. In your case you've got a response in the form of HTML (like an .html document), and you can read it with response.text. If the "content-type" is "application/json", you can read it as JSON with response.json().
I see that you use a local server. It should send "Content-Type": "application/json" in its headers, and then you should be able to read JSON from the response like this (the server needs to send JSON, not HTML or text):
import requests
targetURL = 'http://127.0.0.1:8000/products/auth/?state=2b33fdd45jbevd6nam&code=MGY1MTMyNWY0YjQ0MzEwNmMxMjY2ZjcwMWE2MWY5ZDE5MzJlMjA1YjdkNWExNGRhYjIzOGI5NzQ5OWZkNTA5NA'
response = requests.get(targetURL)
data = response.json()  # raises an error if the body is not valid JSON
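As a rough illustration of how a value such as text/html;charset=UTF-8 breaks down, the sketch below splits a Content-Type header into its media type and charset and then decides how to read the body. It uses only the standard library. The helper names parse_content_type and decode_body are invented for the example, and falling back to UTF-8 when no charset is given is an assumption.

```python
import json
from email.message import Message


def parse_content_type(value):
    """Split a Content-Type header into (media_type, charset or None)."""
    message = Message()
    message["Content-Type"] = value
    return message.get_content_type(), message.get_content_charset()


def decode_body(content_type, body):
    """Decode raw bytes as JSON or text, depending on the declared type."""
    media_type, charset = parse_content_type(content_type)
    text = body.decode(charset or "utf-8")  # assumed default: UTF-8
    if media_type == "application/json":
        return json.loads(text)
    return text  # HTML and other text types are returned as a string
```

With requests, response.headers['Content-Type'] holds the header value, response.text plays the role of the decoded text, and response.json() of the JSON branch.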

What makes conditional GETs "conditional" if the resource is obtained in the initial request?

Breaking down what makes a conditional GET:
RFC 2616 states that the semantics of the GET method change to a "conditional GET" if the request message includes an If-* (If-Modified-Since, If-Unmodified-Since, If-Match, If-None-Match, or If-Range) header field.
It then states:
A conditional GET method requests that the entity be transferred ONLY under the circumstances described by the conditional header field(s).
From my understanding, this says the server will only return the requested data if the condition in the "If-*" header is met in any subsequent request. For example, if a GET request returns a response with an ETag header, then the next request must include If-None-Match with the ETag value for the requested resource to be transferred back to the client.
However, if a client has to send an initial request before getting the returned ETag header (to send back with If-None-Match), then it already has the requested resource. Thus, any future requests that send the If-None-Match header with the ETag value only dictate the return of the requested value, returning 200 OK (if the client does not send the If-None-Match and ETag value from the initial request) or 304 Not Modified (if it does), where this helps the client and server by caching the resource.
My Question:
Why does it state that the entity (the resource from a request) will "be transferred ONLY" if the If-* condition is met (as in my example, where the client sends the ETag value with an If-None-Match in order to cache the requested resource) if the resource or "entity" is returned with or without an "If-*" header being sent? It doesn't return a resource "only under the circumstances described by the conditional header", because it returns the resource regardless, with 200 OK or 304 Not Modified depending on whether an "If-*" header is sent. What am I misunderstanding about this?
Full conditional GET reference from RFC 2616:
The semantics of the GET method change to a "conditional GET" if the request message includes an If-Modified-Since, If-Unmodified-Since, If-Match, If-None-Match, or If-Range header field. A conditional GET method requests that the entity be transferred only under the circumstances described by the conditional header field(s). The conditional GET method is intended to reduce unnecessary network usage by allowing cached entities to be refreshed without requiring multiple requests or transferring data already held by the client.

First of all, please note that RFC 2616 is obsolete. Conditional requests were redefined in RFC 7232, which has in turn been superseded by RFC 9110, so you should refer to those instead.
It's hard to see what exactly is confusing you. So let me just illustrate with examples instead.
Scenario 1
Client A: I need http://example.com/foo/bar.
GET /foo/bar HTTP/1.1
Host: example.com
Server: Here you go.
HTTP/1.1 200 OK
Content-Type: text/plain
Content-Length: 12
ETag: "2ac07d4"
Hello world!
(some time passes)
Client A: I need http://example.com/foo/bar again. But I already have the "2ac07d4" version in my cache. Maybe that will do?
GET /foo/bar HTTP/1.1
Host: example.com
If-None-Match: "2ac07d4"
Server: Yeah, "2ac07d4" is fine. Just take it from your cache, I'm not sending it to you.
HTTP/1.1 304 Not Modified
Scenario 2
Client A: I need http://example.com/foo/bar.
GET /foo/bar HTTP/1.1
Host: example.com
Server: Here you go.
HTTP/1.1 200 OK
Content-Type: text/plain
Content-Length: 12
ETag: "2ac07d4"
Hello world!
(some time passes)
Client B: I want to upload a new version of http://example.com/foo/bar.
PUT /foo/bar HTTP/1.1
Host: example.com
Content-Type: text/plain
Content-Length: 17
Hello dear world!
Server: This looks good, I'm saving it. I will call this version "f6049b9".
HTTP/1.1 204 No Content
ETag: "f6049b9"
(more time passes)
Client A: I need http://example.com/foo/bar again. But I already have the "2ac07d4" version in my cache. Maybe that will do?
GET /foo/bar HTTP/1.1
Host: example.com
If-None-Match: "2ac07d4"
Server: I'm sorry, but "2ac07d4" is out of date. We have a new version now, it's called "f6049b9". Here, let me send it to you.
HTTP/1.1 200 OK
Content-Type: text/plain
Content-Length: 17
ETag: "f6049b9"
Hello dear world!
Analysis
A conditional GET method requests that the entity be transferred ONLY under the circumstances described by the conditional header field(s).
Consider Client A's second request (in both scenarios).
The conditional header field is: If-None-Match: "2ac07d4".
The circumstances described by it are: "a selected representation of the resource does not match entity-tag "2ac07d4"".
Scenario 1: the circumstances do not hold, because the selected representation of the resource (the one containing Hello world!) does indeed match entity-tag "2ac07d4". Therefore, in accordance with the protocol, the server does not transfer the entity in its response.
Scenario 2: the circumstances do hold: the selected representation of the resource (the one containing Hello dear world!) doesn't match entity-tag "2ac07d4" (it matches "f6049b9" instead). Therefore, in accordance with the protocol, the server does transfer the entity in its response.
How does the server come up with these "2ac07d4" and "f6049b9", anyway? Of course, this depends on the application, but one straightforward way to do it is to compute a hash (such as SHA-1) of the entity body--a value that changes dramatically when even small changes are introduced.
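The two scenarios can be replayed with a toy server-side function. This is only a sketch under stated assumptions: make_etag and conditional_get are hypothetical names, the tag is a truncated SHA-1 of the body (one possible scheme, as suggested above), and splitting If-None-Match on commas is a simplification that holds because these tags never contain commas.

```python
import hashlib


def make_etag(body):
    """Derive a quoted entity-tag from a SHA-1 hash of the body."""
    return '"%s"' % hashlib.sha1(body).hexdigest()[:7]


def conditional_get(body, if_none_match=None):
    """Return (status, headers, payload) for a GET of the current `body`."""
    etag = make_etag(body)
    if if_none_match is not None:
        offered = []
        for tag in if_none_match.split(","):
            tag = tag.strip()
            if tag.startswith("W/"):
                tag = tag[2:]  # If-None-Match compares tags weakly
            offered.append(tag)
        if "*" in offered or etag in offered:
            # The client's cached copy is current: send no entity.
            return 304, {"ETag": etag}, b""
    # No condition, or the condition holds: transfer the entity.
    return 200, {"ETag": etag, "Content-Length": str(len(body))}, body


if __name__ == "__main__":
    status, headers, _ = conditional_get(b"Hello world!")
    cached = headers["ETag"]
    print(conditional_get(b"Hello world!", cached)[0])       # Scenario 1
    print(conditional_get(b"Hello dear world!", cached)[0])  # Scenario 2
```

Calling conditional_get with the cached tag against the unchanged body corresponds to Scenario 1, and calling it after the body has changed corresponds to Scenario 2.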

How to handle get completed POST request if 100:Continue in Express.js?

I have a relatively large (50K) text file being POSTed:
curl -H "Content-Type:text/plain" -d @system.log http://localhost:8888/
Testing using a proxy, I can see the full contents of the file are being posted.
However Express.JS sees 100:Continue in the headers, and a blank body. I have :
app.use(express.bodyParser());
enabled, BTW. Here are the headers:
{ 'user-agent': 'curl/7.21.4 (universal-apple-darwin11.0) libcurl/7.21.4 OpenSSL/0.9.8r zlib/1.2.5',
host: 'localhost:8888',
accept: '*/*',
'content-type': 'text/plain',
'content-length': '55909',
expect: '100-continue' }
req.body is empty:
{}
How can I see all the data being posted in Express.JS?

My guess is that node is automatically sending the 100-continue response, as described in the node http module docs (assuming you are using a new enough node version). What is probably happening is simply that text/plain is not a content type that bodyParser can parse into something else (as opposed to json or www-form-urlencoded), so req.body stays empty. You can still get your data from the standard data event, which the request emits for each chunk of data it reads.
