I am writing a program for school, and I have come across a problem. It is a webcrawler, and sometimes it can get stuck on a url for over 24 hours. I was wondering if there was a way to continue the while loop and go to the next url if it takes over a set amount of time. Thanks
If you are using urllib in Python 3, you can use the timeout argument: urllib.request.urlopen(url, timeout=<t in secs>). That parameter is used for all blocking operations used internally by urllib.
But if you are using a different library, consult its documentation or mention it in the question.
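For example, a minimal sketch of skipping to the next URL when a request exceeds the time budget (the URL list and the 10-second limit here are just placeholders):

import socket
import urllib.error
import urllib.request

urls = ["https://example.com/a", "https://example.com/b"]  # placeholder URLs

for url in urls:
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            page = response.read()
    except (socket.timeout, urllib.error.URLError):
        continue  # took too long (or failed), move on to the next URL
    # ... process `page` here ...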
I'm writing a game in Rust where each player can submit some python scripts to the server in order to automate various tasks in the game. I plan on using pyo3 to run the python from rust.
However, I can see an issue arising if a player submits a script like this:
def on_event(e):
    while True:
        pass
Now when the server calls the function (using something like PyAny::call1()) the thread will hang as it reaches the infinite loop.
My first thought was to have pyo3 execute the Python one statement at a time, so the server could bail out if the script has been running for over a certain threshold, but I don't think pyo3 supports this.
My next idea was to give each player their own thread to run their own scripts on, that way if one of their scripts got stuck it only affected their gameplay. However, I still have the issue of not being able to kill a thread when it gets stuck in an infinite loop - if a lot of players submitted scripts that just looped, lots of threads would start using a lot of CPU time.
All I need is a way to execute Python scripts such that if one of them does loop, it does not affect the server's performance at all.
Thanks :)
One solution is to restrict the time that you give each user script to run.
You can do it via PyThreadState_SetAsyncExc; see here for some code. It uses C-level calls into the interpreter, which you can probably access from Rust (with PyO3 FFI magic).
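For reference, a minimal pure-Python sketch of that approach using ctypes (the one-second budget and the looping script are placeholders); the same C call should be reachable from Rust through PyO3's FFI layer:

import ctypes
import threading
import time

def raise_in_thread(thread_id, exc_type):
    # Ask the interpreter to raise exc_type asynchronously in the target thread.
    affected = ctypes.pythonapi.PyThreadState_SetAsyncExc(
        ctypes.c_long(thread_id), ctypes.py_object(exc_type))
    if affected > 1:
        # More than one thread state matched; undo to stay safe.
        ctypes.pythonapi.PyThreadState_SetAsyncExc(ctypes.c_long(thread_id), None)

def user_script():
    try:
        while True:  # stand-in for a player script that never returns
            pass
    except TimeoutError:
        print("script cancelled after exceeding its time budget")

t = threading.Thread(target=user_script)
t.start()
time.sleep(1)  # the time budget given to the script
raise_in_thread(t.ident, TimeoutError)
t.join()

Note that the exception is only delivered between bytecode instructions, so this works for pure-Python loops but not for code blocked inside a long C call.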
Another way would be to do it at the OS level: spawn a process for the user script and kill it when it runs for too long. This might be more secure if you limit what the process can access (with some OS calls), but it requires some boilerplate for communication between the host and the subprocess.
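A rough sketch of the process-based variant (the script path and the 2-second limit are assumptions):

import subprocess

try:
    # Run the untrusted script in its own interpreter with a hard time limit;
    # subprocess.run kills the child if the timeout expires.
    subprocess.run(["python", "user_script.py"], timeout=2)
except subprocess.TimeoutExpired:
    print("script exceeded its time budget and was killed")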
When I use urllib.request and .read().decode() with json.loads to get a Python dictionary from JSON, it takes far too long. However, upon looking at the data, I realized that I don't even want all of it.
Is there any way that I can only get some of the data, for example get the data from one of the keys of the JSON dictionary rather than all of them?
Alternatively, if there was any faster way to get the data that could work as well?
Or is it simply a problem with the connection and cannot be helped?
Also, is the problem with urllib.request.urlopen, with json.loads, or with .read().decode()?
The main symptom of the problem is that it takes roughly 5 seconds to receive information that is not even that large (less than one page of unformatted dictionary). The other symptom is that as I try to receive more and more information, there is a point where I simply get no response from the webpage at all!
The 2 lines which take up the most time are:
response = urllib.request.urlopen(url) # url is a string with the url
data = json.loads(response.read().decode())
For some context on what this is part of, I am using the Edamam Recipe API.
Help would be appreciated.
Is there any way that I can only get some of the data, for example get the data from one of the keys of the JSON dictionary rather than all of them?
You could try with a streaming json parser, but I don't think you're going to get any speedup from this.
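If you do want to see what that looks like, a sketch with the third-party ijson library might be (the URL is a placeholder and "hits.item" is an assumed prefix for the response layout):

import urllib.request
import ijson  # third-party: pip install ijson

response = urllib.request.urlopen("https://example.com/api")  # placeholder URL
# Pull objects out of the response as they are parsed, without loading the whole body.
for hit in ijson.items(response, "hits.item"):
    print(hit)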
Alternatively, if there was any faster way to get the data that could work as well?
If you have to retrieve a json document from an url and parse the json content, I fail to imagine what could be faster than sending an http request, reading the response content and parsing it.
Or is it simply a problem with the connection and cannot be helped?
Given the figures you mention, the issue is very certainly in the networking part indeed, which means anything between your Python process and the server's process. Note that this includes your whole system (proxy/firewall, your network card, your OS TCP/IP stack, etc., and possibly some antivirus on Windows), your network itself, and of course the end server, which may be slow or a bit overloaded at times, or just deliberately throttling your requests to avoid overload.
Also, is the problem with urllib.request.urlopen, with json.loads, or with .read().decode()?
How can we know without timing it on your own machine? But you can easily check this out: just time the various parts' execution and log the results.
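For example, something along these lines (the URL is a placeholder) will show where the time goes:

import json
import time
import urllib.request

url = "https://example.com/api"  # placeholder URL

t0 = time.perf_counter()
response = urllib.request.urlopen(url)
t1 = time.perf_counter()
raw = response.read()
t2 = time.perf_counter()
data = json.loads(raw.decode())
t3 = time.perf_counter()

print("status:", response.status)
print("urlopen: %.3fs  read: %.3fs  decode+parse: %.3fs" % (t1 - t0, t2 - t1, t3 - t2))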
The other symptom is that as I try to receive more and more information, there is a point where I simply get no response from the webpage at all!
Cf. above: if you're sending hundreds of requests in a row, the server might either throttle your requests to avoid overload (most API endpoints will behave that way) or just plainly be overloaded. Do you at least check the HTTP response status code? You may get 503 (server overloaded) or 429 (too many requests) responses.
I am writing a producer-consumer program: the parent process (producer) creates many child processes (consumers), then the parent process reads a file and passes the data to the child processes.
But here comes a performance problem: passing messages between processes costs too much time (I think).
For example, with 200 MB of original data, the parent process takes less than 8 seconds to read and preprocess it, then just passing the data to the child processes through multiprocessing.Pipe costs another 8 seconds, and the child processes take only another 3-4 seconds to do the remaining work.
So a complete workflow costs less than 18 seconds, and more than 40% of that time is spent on communication between processes. That is much more than I expected, and I tried multiprocessing.Queue and Manager; they are worse.
I am working with Windows 7 / Python 3.4.
I have been googling for several days; POSH might be a good solution, but it does not build with Python 3.4.
So I have three questions:
1. Is there any way to share Python objects directly between processes in Python 3.4, as POSH does?
or
2. Is it possible to pass the "pointer" of an object to a child process so that the child process can recover the "pointer" into a Python object?
or
3. multiprocessing.Array may be a valid solution, but if I want to share a complex data structure such as a list, how does it work? Should I make a new class based on it and provide list-like interfaces?
Edit 1:
I tried the third way, but it performs worse.
I defined these values:
p_pos = multiprocessing.Value('i')  # producer write position
c_pos = multiprocessing.Value('i')  # consumer read position
databuff = multiprocessing.Array('c', buff_len)  # shared buffer
and two functions:
send_data(msg)
get_data()
In the send_data function (parent process), it copies msg into databuff and sends the start and end positions (two integers) to the child process via a pipe.
Then, in the get_data function (child process), it receives the two positions and copies msg out of databuff.
In the end, it costs twice as much as just using a pipe #_#
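For reference, a rough, runnable sketch of this scheme might look as follows (buffer size, message, and names are assumed):

import multiprocessing

BUFF_LEN = 1024 * 1024

def send_data(databuff, conn, msg):
    # Parent: copy the message into the shared buffer, then send only the
    # start/end positions (two integers) through the pipe.
    start, end = 0, len(msg)
    databuff[start:end] = msg
    conn.send((start, end))

def get_data(databuff, conn):
    # Child: receive the positions and copy the message back out of the buffer.
    start, end = conn.recv()
    return bytes(databuff[start:end])

def child(databuff, conn):
    msg = get_data(databuff, conn)
    print(len(msg), "bytes received")

if __name__ == "__main__":
    databuff = multiprocessing.Array('c', BUFF_LEN)  # shared byte buffer
    parent_conn, child_conn = multiprocessing.Pipe()
    p = multiprocessing.Process(target=child, args=(databuff, child_conn))
    p.start()
    send_data(databuff, parent_conn, b'x' * 1000)
    p.join()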
Edit 2:
Yes, I tried Cython, and the result looks good.
I just changed my Python script's suffix to .pyx and compiled it, and the program sped up by 15%.
As expected, I hit the "Unable to find vcvarsall.bat" and "The system cannot find the file specified" errors; I spent a whole day solving the first one and was then blocked by the second.
Finally, I found Cyther, and all my troubles were gone ^_^
I was in your place five months ago. I looked around a few times, but my conclusion is that multiprocessing with Python has exactly the problem you describe:
Pipes and Queues are good, but not for big objects, in my experience
Manager() proxy objects are slow, except for arrays, and those are limited. If you want to share a complex data structure, use a Namespace as is done here: multiprocessing in python - sharing large object (e.g. pandas dataframe) between multiple processes
Manager() has the shared list you are looking for (see the sketch after this list): https://docs.python.org/3.6/library/multiprocessing.html
There are no pointers or real memory management in Python, so you can't share selected memory cells
I solved this kind of problem by learning C++, but it's probably not what you want to read...
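A minimal sketch of that Manager-backed shared list (names and contents are illustrative):

import multiprocessing

def worker(shared):
    # Runs in the child process; mutations go through the Manager proxy.
    shared.append("added by child")

if __name__ == "__main__":
    with multiprocessing.Manager() as manager:
        shared = manager.list(["initial item"])
        p = multiprocessing.Process(target=worker, args=(shared,))
        p.start()
        p.join()
        print(list(shared))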
To pass data (especially big numpy arrays) to a child process, I think mpi4py can be very efficient since it can work directly on buffer-like objects.
An example of using mpi4py to spawn processes and communicate (also using trio, but that is another story) can be found here.
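For reference, a minimal hedged sketch of sending a numpy array through mpi4py's buffer interface (run under mpiexec with two ranks; the array size is arbitrary):

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

data = np.empty(1000, dtype='d')
if rank == 0:
    data[:] = np.arange(1000, dtype='d')
    comm.Send([data, MPI.DOUBLE], dest=1, tag=0)  # sends the raw buffer, no pickling
elif rank == 1:
    comm.Recv([data, MPI.DOUBLE], source=0, tag=0)
    print("rank 1 received", data[:3])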
I'm writing a crawler module which calls itself recursively to download more and more links, depending on a depth option parameter passed in.
Besides that, I'm doing more tasks on the returned resources I've downloaded (enriching/changing them depending on the configuration passed to the crawler). This process goes on recursively until it's done, which might take a lot of time (or not) depending on the configuration used.
I wish to optimize it to be as fast as possible without hindering any Node.js application that will use it. I've set up an Express server where one of its routes launches the crawler for a user-defined (query string) host. After launching a few crawling sessions for different hosts, I've noticed that I sometimes get really slow responses from other routes that only return simple text. The delay can be anywhere from a few milliseconds to something like 30 seconds, and it seems to happen at random times (well, nothing is random, but I can't pinpoint the cause).
I've read an article by JetBrains about CPU profiling using the V8 profiler functionality that is integrated with WebStorm, but unfortunately it only shows how to collect the information and how to view it; it doesn't give me any hints on how to find such problems, so I'm pretty much stuck here.
Could anyone help me with this and guide me? Any tips on what my crawler might be doing (a lot of recursive calls) that could hinder the Express server, or maybe on how to find the hotspots I'm looking for and optimize them?
It's hard to say anything more specific on how to optimize code that is not shown, but I can give some advice that is relevant to the described situation.
One thing that comes to mind is that you may be running some blocking code. Never use deep recursion without using setTimeout or process.nextTick to break it up and give the event loop a chance to run once in a while.
The program I am developing uses threads to deal with long-running processes. I want to use Gauge Pulse to show the user that, while a long-running thread is in progress, something is actually taking place. Otherwise, visually nothing will happen for quite some time when processing large files, and the user might think that the program is doing nothing.
I have placed a gauge within the status bar of the program. My problem is this: I am having trouble calling gauge Pulse. No matter where I place the code, it either runs too fast and then halts, or runs at the correct speed for a few seconds and then halts.
I've tried placing the one line of code below into the thread itself. I have also tried creating another thread from within the long-running process thread to call the code below. I still get the same sort of problems.
I do not think that I could use wx.CallAfter, as that would defeat the point; Pulse needs to be called while the process is running, not after the fact. I also tried using time.sleep(2), which is not good either, as it slows the process down, which is something I want to avoid. Even with time.sleep(2) I still had the same problems.
Any help would be massively appreciated!
progress_bar.Pulse()
You will need to find some way to send update requests to the main GUI from your thread during the long-running process. For example, if you were downloading a very large file using a thread, you would download it in chunks, and after each chunk is complete, you would send an update to the GUI.
If you are running something that doesn't really allow chunks, such as creating a large PDF with fop, then I suppose you could use a wx.Timer() that just tells the gauge to pulse every so often. Then when the thread finishes, it would send a message to stop the timer object from updating the gauge.
The former is best for showing progress, while the latter works if you just want to show the user that your app is doing something (a minimal sketch of the timer approach follows the links below). See also
http://wiki.wxpython.org/LongRunningTasks
http://www.blog.pythonlibrary.org/2010/05/22/wxpython-and-threads/
http://www.blog.pythonlibrary.org/2013/09/04/wxpython-how-to-update-a-progress-bar-from-a-thread/
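A minimal hedged sketch of the wx.Timer approach (the 5-second sleep stands in for the real long-running work, and the names are illustrative):

import threading
import time
import wx

class MainFrame(wx.Frame):
    def __init__(self):
        super().__init__(None, title="Pulse demo")
        self.gauge = wx.Gauge(self, range=100)
        self.timer = wx.Timer(self)
        self.Bind(wx.EVT_TIMER, self.on_timer, self.timer)
        self.start_work()

    def start_work(self):
        self.timer.Start(100)  # pulse every 100 ms
        threading.Thread(target=self.long_task, daemon=True).start()

    def long_task(self):
        time.sleep(5)  # stand-in for the real long-running work
        wx.CallAfter(self.on_done)  # notify the GUI thread safely

    def on_timer(self, event):
        self.gauge.Pulse()

    def on_done(self):
        self.timer.Stop()
        self.gauge.SetValue(self.gauge.GetRange())  # show completion

if __name__ == "__main__":
    app = wx.App()
    MainFrame().Show()
    app.MainLoop()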