How to serialize/deserialize hyper::Request and hyper::Response? - rust

Hyper provides an API that abstracts serialization/deserialization away from you (it does this internally and sends the result over the wire).
In most cases, that's great. However, in my case I need to be able to serialize/deserialize them to/from a byte array (or, alternatively, to/from AsyncRead/AsyncWrite). And I need it in the form in which it is sent on the wire (including HttpVersion, Uri, Headers, Body).
I am aware that I can get all these parts of the request (or response) separately. However, I don't want to duplicate the serialization/deserialization functionality by implementing it on my own.
I saw https://docs.rs/hyper/latest/hyper/client/conn/index.html which looks kind of similar to what I need. However, it looks like it's designed to work over a TcpStream (though I could be wrong about that).

An HTTP server (like Hyper) needs to parse part of the HTTP message in order to process a request: for routing it needs the URI, and to read the body it needs the Content-Length header. So part of the incoming byte stream has already been consumed, and you need to rebuild that part. The body part is easy, because it already gives you the byte stream you want.
What you want is similar to what a reverse proxy does (see, for example, proxy-hyper).
There is a problem with reconstructing the raw request bytes, because the request may have arrived over HTTP/1.1 or HTTP/2, and their wire formats are different. If you want something simple, you can use let (parts, body) = req.into_parts(); and build the request string yourself (see HTTP Request Format).
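For the simple HTTP/1.1 case, a minimal sketch of rebuilding the raw bytes from those parts could look like the following. serialize_http1_request is a hypothetical helper, it assumes you have already collected the body into a byte slice, and it ignores chunked transfer encoding and HTTP/2 framing entirely:

use http::request::Parts;

// Hypothetical helper: rebuild an HTTP/1.1-style request from its parts.
// `parts` comes from `req.into_parts()`; collecting the body into bytes is
// left to whichever hyper/body version you are on.
fn serialize_http1_request(parts: &Parts, body: &[u8]) -> Vec<u8> {
    let mut out = Vec::new();
    // Request line, e.g. "GET /index.html HTTP/1.1"
    out.extend_from_slice(
        format!("{} {} {:?}\r\n", parts.method, parts.uri, parts.version).as_bytes(),
    );
    // Headers, exactly as hyper parsed them.
    for (name, value) in &parts.headers {
        out.extend_from_slice(name.as_str().as_bytes());
        out.extend_from_slice(b": ");
        out.extend_from_slice(value.as_bytes());
        out.extend_from_slice(b"\r\n");
    }
    // Blank line separates the head from the body.
    out.extend_from_slice(b"\r\n");
    out.extend_from_slice(body);
    out
}

The same idea applies to responses via response.into_parts(), with a status line instead of a request line.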

Related

Can std::io::BufReader on a TcpStream lead to data loss?

Can a single instance of std::io::BufReader on a tokio::net::TcpStream lead to data loss when the BufReader is used to read_until a given (byte) delimiter?
That is, is there any possibility that after I use the BufReader for:
let buffer = Vec::new();
let reader = BufReader::new(tcp_stream);
tokio::io::read_until(reader, delimiter, buffer)
.map(move |(s, _)| s.into_inner())
a subsequent tokio::io::read using the same stream would return data that is actually beyond the delimiter + 1, therefore causing data loss?
I have an issue (and complete reproducible example on Linux) that I have trouble explaining if the above assumption isn't correct.
I have a TCP server that is supposed to send the content of a file to multiple TCP clients following multiple concurrent requests.
Sometimes, always using the same inputs, the data received by the client is less than expected, and therefore the transfer fails.
The error is not raised 100% of the times (that is, some of the client requests still succeed), but with the 100 tries defined in tcp_client.rs it was always reproducible for at least one of them.
The sequence of data transferred between client and server is composed of:
1. the client sends a request
2. the server reads the request and sends a response
3. the client reads the response
4. the server sends the file data
5. the client reads the file data
This issue is reproducible only if steps 1, 2 and 3 are involved; otherwise it works as expected.
The error is raised when this tokio::io::read (used to read the file content) returns 0, as if the server had closed the connection, even though the server is actually up and running and all the data has been sent (there is an assertion after tokio::io::copy, and I checked the TCP packets using a packet sniffer). On a side note, in all my runs the amount of data read before the error was always > 95% of the expected amount.
Most importantly, the common.rs module defines 2 different read_* functions:
read_until currently used.
read_exact not used.
The logic of the two is the same: they read the request/response (and both client and server can be updated to use one or the other). What is surprising is that the bug presents itself only when tokio::io::read_until is used, while tokio::io::read_exact works as expected.
Unless I misused tokio::io::read_until or there is a bug in my implementation, I expected both versions to work without any issue. What I am seeing instead is this panic being raised because some clients cannot read all the data sent by the server.
Yes. This is described in the documentation for BufReader (emphasis mine):
When the BufReader is dropped, the contents of its buffer will be discarded.
The next sentence is correct but not extensive enough:
Creating multiple instances of a BufReader on the same stream can cause data loss.
The BufReader has read data from the underlying source and put it in its internal buffer; then, by dropping the BufReader, you've thrown that buffer away. The data is gone.
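To make the pitfall concrete, here is a minimal sketch using plain std::io (the same reasoning applies to the tokio BufReader): once read_until has pulled extra bytes into the internal buffer, you must keep reading through the same BufReader instead of going back to the raw stream.

use std::io::{BufRead, BufReader, Read};
use std::net::TcpStream;

fn read_framed(stream: TcpStream) -> std::io::Result<Vec<u8>> {
    let mut reader = BufReader::new(stream);

    // read_until may pull MORE than the header line into the internal buffer.
    let mut header = Vec::new();
    reader.read_until(b'\n', &mut header)?;

    // Wrong: reader.into_inner() here would silently drop any body bytes
    // already sitting in the BufReader's buffer.
    // Right: keep reading through the SAME BufReader.
    let mut body = vec![0u8; 1024]; // size chosen arbitrarily for the sketch
    reader.read_exact(&mut body)?;
    Ok(body)
}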

node-red use several sources to build http post request

I am new to node-red and I am confused by the "message payload flow system".
I want to send a POST request that contains, among other params, files into the request payload. These files should be in an array called "files".
I read my files from my file system, and this works fine, but in the function node, how do I build my POST payload?
So far I have this:
The problem is that the payload contains both files and I can't find a way to get them separately. How can I retrieve both my files, separately, into the BUILD-POST-REQ function?
The core Join node can be used to combine the output of parallel input streams. It has a number of modes that control how many input messages it will collect together.
These include a count of messages.
You can also choose how it combines the input messages: either as an array, or as an object that uses msg.topic as the key for each incoming msg.payload.
OK, I found a solution. But I don't know if this is best practice. Feel free to correct me!
The idea is that after each file read, I store the file in a new property of the msg object so that I can access it later in the flow.
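A sketch of what that can look like in the function nodes, assuming the two file reads are chained one after the other (msg.file1 and msg.file2 are just placeholder property names):

// function node right after the first file read
msg.file1 = msg.payload;  // keep the first file out of the way of the next read
return msg;

// function node right after the second file read
msg.file2 = msg.payload;
return msg;

// BUILD-POST-REQ function node
msg.payload = {
    files: [msg.file1, msg.file2]  // the "files" array the endpoint expects
};
return msg;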

Why is urllib.request so slow?

When I use urllib.request and .decode() to get the Python dictionary from the JSON response, it takes far too long. However, upon looking at the data, I realized that I don't even want all of it.
Is there any way that I can only get some of the data, for example get the data from one of the keys of the JSON dictionary rather than all of them?
Alternatively, if there was any faster way to get the data that could work as well?
Or is it simply a problem with the connection and cannot be helped?
Also, is the problem with urllib.request.urlopen, or is it with json.loads, or with the .read().decode()?
The main symptom of the problem is that it takes roughly 5 seconds to receive information that is not even that large (less than one page of unformatted dictionary). The other symptom is that, as I try to receive more and more information, there is a point at which I simply receive no response from the webpage at all!
The 2 lines which take up the most time are:
response = urllib.request.urlopen(url) # url is a string with the url
data = json.loads(response.read().decode())
For some context on what this is part of, I am using the Edamam Recipe API.
Help would be appreciated.
Is there any way that I can only get some of the data, for example get the data from one of the keys of the JSON dictionary rather than all of them?
You could try with a streaming json parser, but I don't think you're going to get any speedup from this.
Alternatively, if there was any faster way to get the data that could work as well?
If you have to retrieve a json document from an url and parse the json content, I fail to imagine what could be faster than sending an http request, reading the response content and parsing it.
Or is it simply a problem with the connection and cannot be helped?
Given the figures you mention, the issue is almost certainly in the networking part, which means anything between your Python process and the server's process. Note that this includes your whole system (proxy/firewall, your network card, your OS's TCP/IP stack, etc., and possibly some antivirus on Windows), your network itself, and of course the end server, which may be slow or a bit overloaded at times, or just deliberately throttling your requests to avoid overload.
Also is the problem with the urllib.request.urlopen or is it with the json.loads or with the .read().decode().
How can we know without timing it on your own machine? But you can easily check this yourself: just time the execution of the various parts and log the results.
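For example, a rough way to time the parts separately (just a sketch; url is assumed to be your request URL):

import json
import time
import urllib.request

t0 = time.perf_counter()
response = urllib.request.urlopen(url)   # url is a string with the url
t1 = time.perf_counter()
raw = response.read()
t2 = time.perf_counter()
data = json.loads(raw.decode())
t3 = time.perf_counter()

print(f"urlopen: {t1 - t0:.3f}s  read: {t2 - t1:.3f}s  decode+parse: {t3 - t2:.3f}s")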
The other symptom is that as I try to receive more and more information, there is a point when I simply receive no response from the webpage at all!
cf. above - if you're sending hundreds of requests in a row, the server might either throttle your requests to avoid overload (most API endpoints behave that way) or just plain be overloaded. Do you at least check the HTTP response status code? You may get 503 (server overloaded) or 429 (too many requests) responses.
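Note that urllib raises an HTTPError for most error status codes, so checking for throttling could look roughly like this (again assuming url holds your request URL):

import urllib.error
import urllib.request

try:
    response = urllib.request.urlopen(url)
except urllib.error.HTTPError as err:
    # e.g. 429 (too many requests) or 503 (server overloaded)
    print("request failed with status", err.code)
else:
    print("status", response.status)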

Node JS Streams: Understanding data concatenation

One of the first things you learn when you look at node's http module is this pattern for concatenating all of the data events coming from the request read stream:
let body = [];
request.on('data', chunk => {
  body.push(chunk);
}).on('end', () => {
  body = Buffer.concat(body).toString();
});
However, if you look at a lot of streaming library implementations, they seem to gloss over this entirely. Also, when I inspect the request.on('data',...) event, it almost always emits only once for a typical JSON payload with a few to a dozen properties.
You can do things with the request stream like pipe it through some transforms in object mode and through to some other read streams. It looks like this concatenating pattern is never needed.
Is this because the request stream, when handling POST and PUT bodies, pretty much only ever emits one data event, because the payload is way below the chunk partition size limit? In practice, how large would a JSON-encoded object need to be to be streamed in more than one data chunk?
It seems to me that objectMode streams don't need to worry about concatenating, because if you're dealing with an object it is almost always no larger than one emitted data chunk, which atomically transforms into one object? I could see there being an issue if a client were uploading something like a massive collection (which is when a stream would be very useful, as long as it could parse the individual objects in the collection and emit them one by one or in batches).
I find this to be probably the most confusing aspect of really understanding the node.js specifics of streams: there is a weird disconnect between streaming raw data and dealing with atomic chunks like objects. Do objectMode stream transforms have internal logic for automatically concatenating up to object boundaries? If someone could clarify this, it would be much appreciated.
The job of the code you show is to collect all the data from the stream into one buffer so when the end event occurs, you then have all the data.
request.on('data',...) may emit only once or it may emit hundreds of times. It depends upon the size of the data, the configuration of the stream object and the type of stream behind it. You cannot ever reliably assume it will only emit once.
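If you want to see how a given request body actually arrives, a quick sketch is to just count the events:

let chunks = 0;
request.on('data', chunk => {
  chunks += 1;   // how many data events actually fired for this body
}).on('end', () => {
  console.log(`body arrived in ${chunks} chunk(s)`);
});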
You can do things with the request stream like pipe it through some transforms in object mode and through to some other read streams. It looks like this concatenating pattern is never needed.
You only use this concatenating pattern when you are trying to get the entire data from this stream into a single variable. The whole point of piping to another stream is that you don't need to fetch the entire data from one stream before sending it to the next stream. .pipe() will just send data as it arrives to the next stream for you. Same for transforms.
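For instance, a sketch of streaming an upload straight to disk without ever concatenating it (upload.bin is just a placeholder path):

const http = require('http');
const fs = require('fs');

http.createServer((request, response) => {
  // Each incoming chunk is written to disk as it arrives; nothing is buffered in full.
  request.pipe(fs.createWriteStream('upload.bin'))
    .on('finish', () => response.end('done'));
}).listen(8080);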
Is this because the request stream, when handling POST and PUT bodies, pretty much only ever emits one data event, because the payload is way below the chunk partition size limit?
It is likely because the payload is below some internal buffer size, the transport is sending all the data at once, you aren't running on a slow link, and so on. The point here is that you cannot make assumptions about how many data events there will be. You must assume there can be more than one, and that the first data event does not necessarily contain all the data or data separated on a nice boundary. Lots of things can cause the incoming data to get broken up differently.
Keep in mind that a readStream reads data until there's momentarily no more data to read (up to the size of the internal buffer) and then it emits a data event. It doesn't wait until the buffer fills before emitting a data event. So, since all data at the lower levels of the TCP stack is sent in packets, all it takes is a momentary delivery delay with some packet and the stream will find no more data available to read and will emit a data event. This can happen because of the way the data is sent, because of things that happen in the transport over which the data flows or even because of local TCP flow control if lots of stuff is going on with the TCP stack at the OS level.
In practice, how large would a JSON encoded object need to be to be streamed in more than one data chunk?
You really should not know or care because you HAVE to assume that any size object could be delivered in more than one data event. You can probably safely assume that a JSON object larger than the internal stream buffer size (which you could find out by studying the stream code or examining internals in the debugger) WILL be delivered in multiple data events, but you cannot assume the reverse because there are other variables such as transport-related things that can cause it to get split up into multiple events.
It seems to me that objectMode streams don't need to worry about concatenating, because if you're dealing with an object it is almost always no larger than one emitted data chunk, which atomically transforms into one object? I could see there being an issue if a client were uploading something like a massive collection (which is when a stream would be very useful, as long as it could parse the individual objects in the collection and emit them one by one or in batches).
Object mode streams must do their own internal buffering to find the boundaries of whatever objects they are parsing so that they can emit only whole objects. At some low level, they are concatenating data buffers and then examining them to see if they yet have a whole object.
Yes, you are correct that if you were using an object mode stream and the objects themselves were very large, they could consume a lot of memory. That likely wouldn't be the most optimal way of dealing with that type of data.
Do objectMode stream transforms have internal logic for automatically concatenating up to object boundaries?
Yes, they do.
FYI, the first thing I do when making http requests is to use the request-promise library so I don't have to do my own concatenating. It handles all of this for you. It also provides a promise-based interface and about 100 other features which I find helpful.

How to correctly build a TCP frame decoder in nodejs

I'm trying to find a simple, modular and idiomatic way of parsing a text based protocol for TCP streams.
Say the protocol looks like this:
"[begin][length][blah][blah]...[blah][end][begin]...[end][begin]...[end]"
I'd like to correctly use streams (Transform?) to build a small component which just extracts individual messages (starting with [begin] and ending with [end]). Parsing into higher-level data structures is left to other components.
I'm not so concerned about performance right now either so I'd just like to use a simple regex (this protocol is parsable with a regex).
I'm having trouble with a couple of concepts:
Since the buffer may not contain a complete message, how do I correctly handle state and leave the partial message alone, to be parsed when more data comes in? Do I have to keep my own buffer, or is there a way to "put back" the data I didn't use?
Since new data may contain several messages, can the Transform stream handle multiple messages (i.e., do I call this.push(data) multiple times)?
(Note that I'm trying to build this frame decoder outside of the socket connection logic... I imagine it will be a class which extends stream.Transform and implements the _transform method.)
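Both of those points can be handled by keeping your own string buffer inside a Transform and calling this.push() once per complete frame (and yes, push may be called several times per chunk). A minimal sketch, assuming the [begin]/[end] markers never appear inside a message body:

const { Transform } = require('stream');

class FrameDecoder extends Transform {
  constructor() {
    super({ readableObjectMode: true }); // emit one message per frame
    this.buffer = '';                    // holds any partial frame between chunks
  }

  _transform(chunk, encoding, callback) {
    this.buffer += chunk.toString();
    // Extract every complete [begin]...[end] frame currently in the buffer.
    const frameRe = /\[begin\]([\s\S]*?)\[end\]/g;
    let match;
    let lastIndex = 0;
    while ((match = frameRe.exec(this.buffer)) !== null) {
      this.push(match[1]);               // one complete message per push
      lastIndex = frameRe.lastIndex;
    }
    // Keep only the trailing partial frame for the next chunk.
    this.buffer = this.buffer.slice(lastIndex);
    callback();
  }
}

// Usage: socket.pipe(new FrameDecoder()).on('data', message => { ... });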

Resources