Why is urllib.request so slow? - python-3.x

When I use urllib.request to fetch JSON and then decode it to get a Python dictionary, it takes far too long. However, upon looking at the data, I realized that I don't even want all of it.
Is there any way that I can only get some of the data, for example get the data from one of the keys of the JSON dictionary rather than all of them?
Alternatively, if there was any faster way to get the data that could work as well?
Or is it simply a problem with the connection and cannot be helped?
Also, is the problem with urllib.request.urlopen, or is it with json.loads, or with the .read().decode()?
The main symptom of the problem is that it takes roughly 5 seconds to receive information which is not even that much (less than 1 page of a non-formatted dictionary). The other symptom is that as I try to receive more and more information, there is a point when I simply receive no response from the webpage at all!
The 2 lines which take up the most time are:
import json
import urllib.request

response = urllib.request.urlopen(url)  # url is a string with the url
data = json.loads(response.read().decode())
For some context on what this is part of, I am using the Edamam Recipe API.
Help would be appreciated.

Is there any way that I can only get some of the data, for example get the data from one of the keys of the JSON dictionary rather than all of them?
You could try a streaming JSON parser, but I don't think you're going to get any speedup from this.
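If you do want to try it, a minimal sketch with the third-party ijson library might look like this (the "hits" key and the URL are placeholders for your actual API):

import urllib.request
import ijson  # third-party: pip install ijson

url = "https://example.com/recipes.json"  # stand-in for your API URL

with urllib.request.urlopen(url) as response:
    # Parse incrementally, keeping only the objects under the "hits" key
    for hit in ijson.items(response, "hits.item"):
        print(hit)

This saves parsing time and memory, but the full response still crosses the network, which is why it rarely helps with latency.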
Alternatively, if there was any faster way to get the data that could work as well?
If you have to retrieve a JSON document from a URL and parse the JSON content, I fail to imagine what could be faster than sending an HTTP request, reading the response content, and parsing it.
Or is it simply a problem with the connection and cannot be helped?
Given the figures you mention, the issue is very certainly in the networking part indeed, which means anything between your Python process and the server's process. Note that this includes your whole system (proxy/firewall, your network card, your OS TCP/IP stack, etc., and possibly some antivirus on Windows), your network itself, and of course the end server, which may be slow or a bit overloaded at times, or just deliberately throttling your requests to avoid overload.
Also, is the problem with urllib.request.urlopen, or is it with json.loads, or with the .read().decode()?
How can we know without timing it on your own machine? But you can easily check this out yourself: just time each part's execution and log the results.
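For example, a minimal sketch:

import json
import time
import urllib.request

url = "https://example.com/recipes.json"  # stand-in for your API URL

t0 = time.perf_counter()
response = urllib.request.urlopen(url)   # connection + start of response
t1 = time.perf_counter()
raw = response.read()                    # rest of the transfer
t2 = time.perf_counter()
data = json.loads(raw.decode())          # parsing
t3 = time.perf_counter()

print(f"urlopen: {t1 - t0:.3f}s  read: {t2 - t1:.3f}s  parse: {t3 - t2:.3f}s")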
The other symptom is that as I try to receive more and more information, there is a point when I simply receive no response from the webpage at all!
cf. above - if you're sending hundreds of requests in a row, the server might either throttle your requests to avoid overload (most API endpoints will behave that way) or just plain be overloaded. Do you at least check the HTTP response status code? You may get 503 (server overloaded) or 429 (too many requests) responses.
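Note that urlopen raises on non-2xx statuses, so the check goes in an exception handler; a minimal sketch:

import urllib.error
import urllib.request

url = "https://example.com/recipes.json"  # stand-in for your API URL

try:
    response = urllib.request.urlopen(url)
except urllib.error.HTTPError as err:
    if err.code in (429, 503):
        # Throttled or overloaded; the server may say how long to wait.
        print("Retry after:", err.headers.get("Retry-After"))
    else:
        raise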

Related

How to manage the conversation flow if face timeout limit (5 seconds) in Dialogflow / Api.ai?

I am making a bot on Dialogflow with a Fulfillment. Given the strict 5-second window in Dialogflow, I am getting [empty response] as a response.
I want to overcome this issue, but my web service requires more than 9 seconds for its execution.
I am considering redesigning the conversation flow in such a way that we will start streaming audio until the response is processed.
Example:
User Question: xx xxx xxx xxxx xxxxx?
Response: (a) we'll play fixed audio to keep the user engaged for a few seconds until the back end finds the response text; (b) receive the answer from the web service and save it in the session to display afterwards.
How can I achieve this and how can I handle the Timeout issue?
You're on the right track, but there are a number of other things to consider.
First, however, keep in mind that anything that is trying to "avoid" the 5 second timeout already indicates some issues with the design. Waiting 10 seconds for a reply is a pretty long time with something as interactive as voice! Even 5 seconds, which is the timeout, is a long time. (And there is no way to change this timeout.)
So the first thing you may want to do is consider if there is a better/faster way to do what you want.
If not, the rough approach would be something like this:
1. Get the request from the user.
2. Track a unique identifier, either tied to the user or tied to the session. You'll be using this as a key into some kind of database or data store.
3. Start the API call as part of an asynchronous request or in another thread.
4. Reply immediately that you're working on it, in a way that ensures the user will send another request. (See below for this issue.) You'll want to make sure that the ID is maintained as part of this session - so you'll need to save it as part of the Session data.
At this point - you're basically doing two things in parallel. When the API call completes, it needs to save the result in the datastore against the identifier. (It can't save it in the session itself - that response was already sent back to the Assistant.)
5. You're also waiting for a reply from the user. When it comes in:
- Check to see if you have a response saved for this session yet.
- If not, then go back to step 4. (You may want to track how many times you get here and give up at some point.)
- If you do have the result, reply to the user with the information.
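For illustration, here is a minimal sketch of steps 2-5 in Python, with an in-memory dict standing in for the datastore; all names are hypothetical, and a real fulfillment would key results by the Dialogflow session ID:

import threading

results = {}  # in-memory stand-in for a real datastore (Firestore, Redis, ...)

def slow_api_call(query):
    pass  # the 9+ second backend call goes here

def handle_first_request(session_id, query):
    # Steps 2-4: kick off the work in another thread, reply immediately.
    def work():
        results[session_id] = slow_api_call(query)
    threading.Thread(target=work).start()
    return "I'm looking that up..."  # plus a Media response or a question

def handle_followup_request(session_id):
    # Step 5: on each later request, check whether the result has arrived.
    if session_id in results:
        return results.pop(session_id)
    return "Still working on it..."  # loops back to step 4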
There is an issue with how you reply in step 4, since you want to do something that will guarantee you another request from the person expecting an answer. There are a few possible approaches:
The most straightforward way would be to send back a Media response to play a few seconds of "hold music". This has the advantage that, when the music stops, it will send an event to Dialogflow which you can capture as an Intent and then continue with step 5.
But there are some problems:
Not all versions of the Assistant support the Media response. You will need to check that the feature is supported before you use it and, if not, use another approach (see below).
The media player presented on some Assistants allows the user to stop playback, or in some situations will not correctly send an event when the audio stops. So you may never get another request in this session.
Another approach involves some more advanced conversation design tricks, so may not always be suitable for your conversation. Your response can say that you're looking up the results but then ask the user a question - possibly one that is related to other information that you will need. With their reply, you can collect this information (if you need it) and then see if you have a result yet.
In some conversations - this works really well. For example, if you're looking up flights to somewhere, while you're looking that up you might ask them if they will need a hotel or rental car, which you might ask about anyway.
Other conversations, however, don't easily have such questions. In these cases, you may need to ask something that isn't relevant while you stall for time.

Speech Services STT- Possible to Link Request to Result?

I have a use case where a mobile app records a long series of commands. Each command is a short, single word (or number). They can happen quickly one right after the other, but the use case does not care if it takes several seconds to get results back from the Cognitive server. It is currently being implemented as discrete asynchronous requests rather than streaming (seems to be more reliable for us).
Since results are coming back async, I see no easy way to map the result back to its corresponding request (and ultimately the app command). Can I embed a unique ID somewhere that will get passed back to me? Is there some other option?
Are you using the SDK?
If you use recognizeOnce, you get the result for the audio as the return value of the (synchronous) call, so pairing it with its request is straightforward.
If you use continuous recognition, there is currently no way to tag an audio segment.
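For illustration, a sketch of pairing each discrete request with its result using the Python SDK's recognize_once (the key, region, and function names are placeholders):

import azure.cognitiveservices.speech as speechsdk

# Placeholders; use your own key and region.
speech_config = speechsdk.SpeechConfig(subscription="KEY", region="REGION")

def recognize_command(command_id, wav_path):
    audio_config = speechsdk.audio.AudioConfig(filename=wav_path)
    recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config,
                                            audio_config=audio_config)
    result = recognizer.recognize_once()  # blocks until this clip is done
    # The call is synchronous, so this result belongs to command_id.
    return command_id, result.text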

Efficient Way of Getting URL Redirect from Persistent URLs

I have a dataset that, in part, has a URL field indicating the location of a resource. Some URLs are persistent (e.g. handles and DOIs) and thus need to be resolved to their original URL. I am primarily working with Python, and the solution that seems to work thus far involves using the Requests HTTP library.
import requests

# GET follows redirects by default; .url holds the final resolved URL
var_output_url = requests.get("http://hdl.handle.net/10179/619")
var_output_url.url
While this solution works, it is quite slow as I have to loop through ~4,000 files, each with around 2,000 URLs. Is there a more efficient way of resolving the URL redirects?
I tested my current solution on one batch and it took almost 5 minutes; at this rate, it will take me a couple of days (13 days) to process all the batches [...] I know, it will not necessarily be that long and I can run them in parallel
Using HEAD instead of GET should give you only the headers and not the resource body, which in your example is an HTML page. If you only need to resolve URL redirections, this means far less time spent on data transfer over the network. Note that, unlike GET, HEAD does not follow redirects by default in Requests, so pass allow_redirects=True.
var_output_url = requests.head("http://hdl.handle.net/10179/619", allow_redirects=True)
var_output_url.url
>>> 'https://mro.massey.ac.nz/handle/10179/619'
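If the per-request latency is the bottleneck, a thread pool plus a shared Session (which reuses connections) is a simple way to parallelize; a sketch, with the worker count being just a guess:

import concurrent.futures
import requests

session = requests.Session()  # reuses TCP connections between requests

def resolve(url):
    try:
        return url, session.head(url, allow_redirects=True, timeout=10).url
    except requests.RequestException:
        return url, None  # record failures for a later retry

urls = ["http://hdl.handle.net/10179/619"]  # one batch of URLs
with concurrent.futures.ThreadPoolExecutor(max_workers=20) as pool:
    for original, resolved in pool.map(resolve, urls):
        print(original, "->", resolved)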

"Resequencing" messages after processing them out-of-order

I'm working on what's basically a highly-available distributed message-passing system. The system receives messages from someplace over HTTP or TCP, performs various transformations on them, and then sends them to one or more destinations (also using TCP/HTTP).
The system has a requirement that all messages sent to a given destination are in-order, because some messages build on the content of previous ones. This limits us to processing the messages sequentially, which takes about 750ms per message. So if someone sends us, for example, one message every 250ms, we're forced to queue the messages behind each other. This eventually introduces intolerable delay in message processing under high load, as each message may have to wait for hundreds of other messages to be processed before it gets its turn.
In order to solve this problem, I want to be able to parallelize our message processing without breaking the requirement that we send them in-order.
We can easily scale our processing horizontally. The missing piece is a way to ensure that, even if messages are processed out-of-order, they are "resequenced" and sent to the destinations in the order in which they were received. I'm trying to find the best way to achieve that.
Apache Camel has a thing called a Resequencer that does this, and it includes a nice diagram (which I don't have enough rep to embed directly). This is exactly what I want: something that takes out-of-order messages and puts them in-order.
But, I don't want it to be written in Java, and I need the solution to be highly available (i.e. resistant to typical system failures like crashes or system restarts) which I don't think Apache Camel offers.
Our application is written in Node.js, with Redis and Postgresql for data persistence. We use the Kue library for our message queues. Although Kue offers priority queueing, the featureset is too limited for the use-case described above, so I think we need an alternative technology to work in tandem with Kue to resequence our messages.
I was trying to research this topic online, and I can't find as much information as I expected. It seems like the type of distributed architecture pattern that would have articles and implementations galore, but I don't see that many. Searching for things like "message resequencing", "out of order processing", "parallelizing message processing", etc. turns up solutions that mostly just relax the "in-order" requirement based on partitions or topics or whatnot. Alternatively, they talk about parallelization on a single machine. I need a solution that:
Can handle processing on multiple messages simultaneously in any order.
Will always send messages in the order in which they arrived in the system, no matter what order they were processed in.
Is usable from Node.js.
Can operate in an HA environment (i.e. multiple instances of it running on the same message queue at once without inconsistencies).
Our current plan, which makes sense to me but which I cannot find described anywhere online, is to use Redis to maintain sets of in-progress and ready-to-send messages, sorted by their arrival time. Roughly, it works like this:
When a message is received, that message is put on the in-progress set.
When message processing is finished, that message is put on the ready-to-send set.
Whenever the same message is at the front of both the in-progress and ready-to-send sets, that message can be sent, and it will be in order.
I would write a small Node library that implements this behavior with a priority-queue-esque API using atomic Redis transactions. But this is just something I came up with myself, so I am wondering: Are there other technologies (ideally using the Node/Redis stack we're already on) that are out there for solving the problem of resequencing out-of-order messages? Or is there some other term for this problem that I can use as a keyword for research? Thanks for your help!
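For illustration only, a minimal sketch of that plan with two sorted sets keyed by arrival time, written in Python with redis-py for brevity (a Node port is mechanical, and the atomic-transaction wrapping is omitted):

import time
import redis

r = redis.Redis()

def send(msg_id):
    print("sending", msg_id)  # placeholder for the real delivery

def on_receive(msg_id):
    # Arrival order is captured by the score.
    r.zadd("in-progress", {msg_id: time.time()})

def on_processed(msg_id):
    # Mark as ready, keeping the original arrival score.
    score = r.zscore("in-progress", msg_id)
    r.zadd("ready-to-send", {msg_id: score})
    drain()

def drain():
    # Send for as long as the oldest in-progress message is also ready.
    while True:
        head = r.zrange("in-progress", 0, 0)
        if not head or r.zscore("ready-to-send", head[0]) is None:
            break
        send(head[0])
        r.zrem("in-progress", head[0])
        r.zrem("ready-to-send", head[0])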
This is a common problem, so there are surely many solutions available. This is also quite a simple problem, and a good learning opportunity in the field of distributed systems. I would suggest writing your own.
You're going to have a few problems building this, namely
1: Guaranteed order of messages
2: Exactly-once delivery
You've found number 1, and you're solving it by resequencing messages in Redis, which is an OK solution. The other one, however, is not solved.
It looks like your architecture is not geared towards fault tolerance, so currently, if a server crashes, you restart it and continue with your life. This works fine when processing all requests sequentially, because then you know exactly where you were when you crashed, based on what the last successfully completed request was.
What you need is either a strategy for finding out what requests you actually completed, and which ones failed, or a well-written apology letter to send to your customers when something crashes.
If Redis is not sharded, it is strongly consistent. It will fail and possibly lose all data if that single node crashes, but you will not have any problems with out-of-order data, or data popping in and out of existence. A single Redis node can thus hold the guarantee that if a message is inserted into the to-process-set, and then into the done-set, no node will see the message in the done-set without it also being in the to-process-set.
How I would do it
Using Redis seems like too much fuss, assuming that the messages are not huge, that losing them is OK if a process crashes, and that running them more than once, or even running multiple copies of a single request at the same time, is not a problem.
I would recommend setting up a supervisor server that takes incoming requests, dispatches each to a randomly chosen slave, stores the responses, and puts them back in order again before sending them on. You said you expected the processing to take 750ms. If a slave hasn't responded within, say, 2 seconds, dispatch the request again to another randomly chosen node after waiting 0-1 seconds. The first one to respond is the one we're going to use. Beware of duplicate responses.
If the retry request also fails, double the maximum wait time. After 5 failures or so, each waiting up to twice (or any multiple greater than one) as long as the previous one, we probably have a permanent error, so we should probably ask for human intervention. This algorithm is called exponential backoff, and it prevents a sudden spike in requests from taking down the entire cluster. Without the random interval, retrying after exactly n seconds would probably cause a DoS attack every n seconds until the cluster dies, if it ever gets a big enough load spike.
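A sketch of that retry strategy (the dispatch callable and its timeout parameter are hypothetical):

import random
import time

def dispatch_with_backoff(request, dispatch, max_wait=2.0, retries=5):
    # Randomized exponential backoff: each round may wait up to twice
    # as long as the previous one.
    for attempt in range(retries):
        result = dispatch(request, timeout=max_wait)
        if result is not None:
            return result
        time.sleep(random.uniform(0, max_wait))
        max_wait *= 2
    raise RuntimeError("probably a permanent error; ask for human intervention")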
There are many ways this could fail, so make sure this system is not the only place data is stored. However, this will probably work 99+% of the time, it's probably at least as good as your current system, and you can implement it in a few hundred lines of code. Just make sure your supervisor is using asynchronous requests so that you can handle retries and timeouts. Javascript is by nature single-threaded, so this is slightly trickier than normal, but I'm confident you can do it.

nodejs - writing strings to socket takes much time

I heard that "writing strings to a socket takes more time in Node.js because the core modules do not allow copying data directly to the socket; it requires an intermediate copy in memory before going to the socket". I heard Ryan Dahl himself say this in an interview. I will post the link once I find it.
Please correct me if I am wrong in understanding any of this, thanks.
My question is: can we skip this intermediate copy by modifying code in Node's core modules? I have experienced a 5-6 second lag in my server when it copies bulky/large/huge strings to 150+ sockets.
I am trying to minimize the amount of data to broadcast, but, on the other hand, can we optimize this copying of strings to the socket?
As per the comment, adding more content.
Example of what I am doing -
I am broadcasting a leaderboard of n (>100) users [all of these are in one room]. It is in JSON format. "leaderboard" is an array of players. Every player object contains name, email, profile_pic_url, score, and rank.
All objects are in JSON format.
User information is fetched from Redis, and then the rank is calculated. Then this leaderboard is broadcast to the room.
The above operation happens every 2 seconds, so after the first successful broadcast, I can see a lag.
Adding the code - I am using:
socket.io for accepting the connections
a Redis store
the room feature of socket.io
code -
io.sockets.in(RoomID).emit(StateName, LeaderboardObject);
can we skip this intermediate copying issue by modifying any code in core modules of the node ?
No.
You're using Socket.IO where the bulk of the work happens in JavaScript, not in a compiled extension. Even if you did find a way to get around the buffer copying, you wouldn't be able to use it in this case.
I suggest posting a separate question, asking about the actual speed problem you are having and ways to optimize your code.
