Datalab occasionally very slow on simple tasks - apache-spark

I've noticed that Datalab is occasionally extremely slow (to the point where I believe it's just hanging).
import json
import pprint
import subprocess
import pyspark
For instance, this really trivial code block takes forever to run. If I keep refreshing the page and rerunning it, sometimes it works. What can cause this?

Resize your VM, or create a new one with more memory/CPUs. You can do this by stopping the VM and then editing it in Compute Engine.

Related

USB serial data, network, database and GUI, should I go for multiprocessing in Python?

I am designing a program in Python which:
- reads data via USB from an Arduino every two seconds into an SQLite table (128 kB per readout)
- processes the incoming data and stores the results in another table
- finally queries the data in that table, shows it in a GUI built with tkinter, and sends the same data over the network to a server.
The question is: for which parts should I use multiprocessing or threading? Do I even need them? If I run the first part from a separate Python file in the background, will it necessarily use a different CPU core?
EDIT:
I found out about pickling; now the question is:
Is it a good idea to pickle a 1 kB string every 3 seconds (in a ramdrive, of course) and unpickle it in another script?
I have already tested two scripts and it works, but I am not sure whether this solution holds up for long-term running.
It looks promising, especially since I don't find myself stuck in the multithreading or multiprocessing modules, and it seems the OS will assign the necessary cores and threads.
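Roughly what I'm doing now, as a sketch (the ramdrive path is made up):
import os
import pickle

SNAPSHOT = "/mnt/ramdisk/latest.pkl"  # made-up ramdrive path

def write_snapshot(data):
    # write to a temp file first, then rename: os.replace is atomic on POSIX,
    # so the reading script never sees a half-written pickle
    tmp = SNAPSHOT + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(data, f)
    os.replace(tmp, SNAPSHOT)

def read_snapshot():
    # called from the other script every few seconds
    with open(SNAPSHOT, "rb") as f:
        return pickle.load(f)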

webbrowser imports really slow

Importing webbrowser takes 30+ seconds, which really slows down my program's startup. I've tried setting my default browser to IE and to Chrome, but got the same result either way. I tried it on other machines too.
I'm running Python 3.6.4 (Windows 7 x64) with a fairly fast internet connection. I'm fairly new to Python programming as well.
My questions are:
What causes this slowdown? In the YouTube videos I've watched, people import webbrowser instantaneously. What can I do about it?
I've desperately tried "cheating" my way around this by putting the import inside a button's function so that it wouldn't affect the program's startup (it didn't work, and it seems a bit silly in hindsight: startup still took 30+ seconds).
Another desperate measure I'm planning is to run the import in a separate thread so it can happen in the background while the program starts up. I haven't done any multithreading yet, so I still need to learn it. Would this work, though? Something like the sketch below is what I have in mind.
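(A rough sketch; the loader thread and button handler names are made up:)
import threading

webbrowser = None

def _load_webbrowser():
    # the slow import happens off the main thread
    global webbrowser
    import webbrowser as wb
    webbrowser = wb

_loader = threading.Thread(target=_load_webbrowser, daemon=True)
_loader.start()

def on_button_click(url):
    _loader.join()  # only blocks if the import hasn't finished yet
    webbrowser.open(url)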
I don't know what other information I could share regarding this since I'm really lost here. Any advice would be much appreciated.
Edit: I made a simple script to time the import:
import timeit
start = timeit.default_timer()
import webbrowser
stop = timeit.default_timer()
print('Time: ', stop - start)
Output:
How are you so sure the slowdown is due to the import? It could be due to slow loading in your Chrome browser; try clearing Chrome's cache. Is your Chrome browser up to the mark as far as speed is concerned? Also, please show me the code.

Web scraping simultaneous request

from urllib import request
from bs4 import BeautifulSoup as bs
page = request.urlopen("http://someurl.ulr").read()
soup = bs(page, "lxml")  # parse the downloaded page with the lxml parser
Now this process is very slow because it makes one request, parses the data, goes through the specified steps, and only then goes back to making the next request.
For example:
for link in soup.findAll('a'):
    print(link.get('href'))
and then we go back to making another request because we want to scrape another URL. How does one speed up this process?
Should I store the whole source code of each URL in a file and then do the necessary operations (parsing, finding the data we need) on those files?
I got this idea from a DoS (Denial of Service) script that uses the socks module and threads to make a large number of requests. Note: this is only an idea.
Is there a more efficient way of doing this?
The most efficient way to do this is most probably to use asyncio (see the asyncio documentation), at some point spawning as many Python processes as you have CPU threads, and to call your script like this:
for i in $(seq $(nproc)); do python yourscript.py "$entry" & done; wait
(the trailing & starts each process in the background so they actually run in parallel; wait blocks until all of them finish). This will lead to a massive speed improvement. To further increase processing speed, you could use a regex parser instead of BeautifulSoup; this gave me a speedup of about 5 times.
You could also use a specialized library for this task, e.g. Scrapy.
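A rough sketch of the asyncio approach (assuming the third-party aiohttp package for the HTTP client; the URLs are just placeholders):
import asyncio
import aiohttp
from bs4 import BeautifulSoup

async def fetch(session, url):
    # one coroutine per URL, so the requests overlap instead of running one by one
    async with session.get(url) as resp:
        return await resp.text()

async def main(urls):
    async with aiohttp.ClientSession() as session:
        pages = await asyncio.gather(*(fetch(session, url) for url in urls))
    for page in pages:
        soup = BeautifulSoup(page, "lxml")
        for link in soup.find_all("a"):
            print(link.get("href"))

asyncio.run(main(["http://someurl.ulr/page1", "http://someurl.ulr/page2"]))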

What happens when a PHP-FPM process crashes?

For example, I import a CSV file and for some reason the process crashes. In this case the process is restarted, so I want to know: will the restarted process import the file again?
No. Since the process crashed, you will need to start the import again. It's possible that only part of the data was imported before the crash, so you'll also want to check for that.
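Language aside, the resume logic could look something like this sketch (Python here purely for illustration; the progress file and insert_row are assumptions, not anything PHP-FPM provides):
import csv
import os

CSV_PATH = "data.csv"           # hypothetical input file
PROGRESS_PATH = "data.csv.pos"  # hypothetical progress marker

def import_csv():
    # remember how many rows made it in, so a restarted process can skip them
    done = 0
    if os.path.exists(PROGRESS_PATH):
        with open(PROGRESS_PATH) as f:
            done = int(f.read() or 0)
    with open(CSV_PATH, newline="") as f:
        for i, row in enumerate(csv.reader(f)):
            if i < done:
                continue  # already imported before the crash
            insert_row(row)  # hypothetical database insert
            with open(PROGRESS_PATH, "w") as p:
                p.write(str(i + 1))

def insert_row(row):
    pass  # placeholder for the actual database insert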

Running a method in the background of the Python interpreter

I've made a class in Python 3.x that acts as a server. One method manages sending and receiving data via UDP/IP using the socket module (the data is stored in self.cmd and self.msr, respectively). I want to be able to modify the self.msr and self.cmd variables from within a live Python interpreter session. For example:
>>> from myserver import MyServer
>>> s = MyServer()
>>> s.background_recv_send() # runs in the background, constantly calling s.recv_msr(), s.send_cmd()
>>> process_data(s.msr) # I use the latest received data
>>> s.cmd[0] = 5 # this will be sent automatically
>>> s.msr # I can see what the newest data is
So far, s.background_recv_send() does not exist. I need to call s.recv_msr() manually each time I want to update the value of s.msr (s.recv_msr uses a blocking socket), and then call s.send_cmd() to send s.cmd.
In this particular case, which module makes more sense: multiprocessing or threading?
Any hints on how I could best solve this? I have no experience with either processes or threads (I've read a lot, but I'm still unsure which way to go).
In this case, threading makes the most sense. In short, multiprocessing is for spreading CPU-bound work across multiple processor cores, while threading is for doing I/O-bound things, like your blocking socket calls, in the background.
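A minimal sketch of what background_recv_send could look like (the loop body is an assumption based on the question; recv_msr and send_cmd are the asker's own methods):
import threading

class MyServer:
    def __init__(self):
        self.cmd = [0]
        self.msr = None
        self._worker = None

    def recv_msr(self):
        ...  # blocking socket receive that updates self.msr (from the question)

    def send_cmd(self):
        ...  # socket send of self.cmd (from the question)

    def background_recv_send(self):
        # daemon=True so the worker thread dies with the interpreter
        def loop():
            while True:
                self.recv_msr()
                self.send_cmd()
        self._worker = threading.Thread(target=loop, daemon=True)
        self._worker.start()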
