Run Jupyter Notebook non-interactively - Keras

I have an Ubuntu virtual machine on Azure. I am running a Jupyter notebook on this VM to train a Keras sequential model, which takes ~24 hours to fully train. Is there a way to run the notebook without using my browser, i.e. leaving the Jupyter server and notebook running in the background, but with all results/outputs/plots still displayed within the Jupyter notebook?
When I try to do this by starting the Jupyter notebook interactively through my browser, then closing the browser but leaving the Jupyter kernel running, the cells within the notebook execute but no output is displayed.
I have looked at this similar question: Running an IPython/Jupyter notebook non-interactively
But that deals with a different issue, where the kernel is not running when trying to execute.
This question is about the same issue: Jupyter notebook output cell freezes (Keras related?)
But the solution there is to save some outputs to CSV files, which is not ideal.

For anyone who finds this question looking for the same functionality in Jupyter:
Unfortunately, as of May 2020 an issue relating to this is still open: https://github.com/jupyter/notebook/issues/1647
Therefore it looks like it is currently not possible to run a Jupyter notebook with the browser closed and still see the outputs in the notebook afterwards.
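A commonly suggested workaround, which does not fix the open issue but does keep the outputs, is to execute the notebook headlessly and have the results written into a separate output notebook. A minimal sketch using papermill (the file names are placeholders):

import papermill as pm

# Runs train.ipynb top to bottom with no browser attached; every cell's
# output (including plots) is written into train_output.ipynb, which can
# be opened later to inspect the results.
pm.execute_notebook(
    "train.ipynb",          # input notebook (placeholder name)
    "train_output.ipynb",   # output notebook with captured outputs
)

Running this under nohup or tmux keeps the execution alive after the SSH session to the VM is closed.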

Related

aws auto-stop-idle does not detect papermill

I am using papermill to parametrize a Jupyter notebook deployed on AWS SageMaker. I also used this lifecycle configuration, which auto-shuts-down the instance if there are no running/idle notebooks. Unfortunately, it does not detect the papermill process and shuts the instance down after reaching the specified idle time. What do I need to do to keep SageMaker alive until papermill completes?
You could edit the idleness detection script to account for papermill processes.
Alternatively, if you have async jobs that you can formulate as Python code, you could use SageMaker Processing jobs to execute them; those do not depend on your notebook instance being up.
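For the first suggestion: the auto-stop script installed by the lifecycle configuration is a Python script, so one possible tweak (illustrative only, names are assumptions) is to skip the shutdown whenever a papermill process is still running:

import subprocess

def papermill_is_running():
    # Look for any process whose command line mentions papermill.
    result = subprocess.run(["pgrep", "-f", "papermill"], capture_output=True)
    return result.returncode == 0

# Inside the auto-stop script, before it decides to stop the instance:
if papermill_is_running():
    print("papermill still running - skipping idle shutdown")
else:
    pass  # fall through to the existing idle-time logic that stops the instance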

PySpark / Kafka - org.apache.spark.sql.kafka010.KafkaSourceProvider could not be instantiated

So, I'm working on setting up a development environment for working with PySpark and Kafka. I'm working through getting things set up so I can run this tutorial in a Jupyter notebook as a 'hello world' exercise: https://spark.apache.org/docs/3.1.1/structured-streaming-kafka-integration.html
Unfortunately, I'm currently hitting the following error when I attempt to connect to the Kafka stream:
Py4JJavaError: An error occurred while calling o68.load.
: java.util.ServiceConfigurationError: org.apache.spark.sql.sources.DataSourceRegister: Provider org.apache.spark.sql.kafka010.KafkaSourceProvider could not be instantiated
at java.base/java.util.ServiceLoader.fail(ServiceLoader.java:583)
at java.base/java.util.ServiceLoader$ProviderImpl.newInstance(ServiceLoader.java:805)
at java.base/java.util.ServiceLoader$ProviderImpl.get(ServiceLoader.java:723)
at java.base/java.util.ServiceLoader$3.next(ServiceLoader.java:1395)
...
Now, some digging has told me that the most common cause of this issue is version mismatches (either the Spark or Scala versions in use). However, I'm able to confirm that these are aligned properly:
Spark: 3.1.1
Scala: 2.12.10
conf/spark-defaults.conf
...
spark.jars.packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1
EDIT
So, some additional observations from trying to figure this out:
It looks like this is at least partially a Jupyter notebook issue, as I can now get things working just fine via the pyspark shell.
Looks like my notebook is firing up its own instance of Spark, so maybe there's some difference in how Spark is being run there vs from a terminal window?
At a loss for how they're different though, as both environments should be using mostly default configurations.
Ok - looks like it doesn't work when invoked via the regular Python REPL either, which is leading me to think there's something different about the spark context being created by the pyspark shell and the one I'm creating in my notebook.
Ok - looks like something differs when things are run via Jupyter - hadoop.common.configuration.version has a value of 0.23.0 for the notebook instance, but 3.0.0 for the pyspark shell instance. Not sure why this might be or what it may mean yet.
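For reference, that value can be read from a running session through the JVM-side Hadoop configuration, which makes it easy to compare the two environments (a quick diagnostic snippet, not part of the fix):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hadoop configuration actually in use by the current Spark context
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
print(hadoop_conf.get("hadoop.common.configuration.version"))
print(spark.version)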
What else should I check to confirm that this is set up correctly?
Ok - so it looks like the difference was that findspark was locating and using a different Spark Home directory (one that came installed with the pyspark installation via pip).
It also looks like Spark 3.1.1 for Hadoop 2.7 has issues with the Kafka client (or maybe needs to be configured differently) but Spark 3.1.1 for Hadoop 3.2 works fine.
The solution was to ensure that I explicitly chose my SPARK_HOME by passing the spark_home path to findspark.init():
findspark.init(spark_home='/path/to/desired/home')
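Expanded into a minimal sketch (the home path is a placeholder for the Spark 3.1.1 / Hadoop 3.2 install; the Kafka options follow the linked tutorial):

import findspark
# Point findspark at the intended Spark build before anything imports pyspark,
# otherwise it may pick up the pip-installed Spark home instead.
findspark.init(spark_home="/opt/spark-3.1.1-bin-hadoop3.2")

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("kafka-hello-world")
    .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1")
    .getOrCreate()
)

# If the right Spark home is in use, this loads without the ServiceConfigurationError.
df = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "test-topic")
    .load()
)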
Things to watch out for that got me and might trip you up too:
If you've installed pyspark through pip / mambaforge this will also deploy a second SPARK_HOME - this can create dependency / library confusion.
Many of the scripts in bin/ use SPARK_HOME to determine where to execute, so don't assume that just because you're running a script from one home that you're running spark IN that home.

AWS EMR PySpark Jupyter notebook not running

I'm working on AWS EMR, running PySpark in a Jupyter notebook. All of a sudden I cannot run the scripts any more. When I click Run, nothing happens. Below is a screenshot of when I try to change the kernel; no kernels are displayed.
What's the problem here?
I had the same problem, and apparently you also need Livy installed. If I don't add Livy to the list of applications, I can use the notebook for a few minutes and then it stops working. The docs don't call this out clearly, but Livy does need to be there.
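For example, when creating the cluster programmatically, Livy can be included in the application list (a sketch using boto3; the release label, instance types and IAM roles below are placeholders):

import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="notebook-cluster",
    ReleaseLabel="emr-5.30.0",
    # Livy alongside Spark so the notebook can talk to the cluster.
    Applications=[{"Name": "Spark"}, {"Name": "Livy"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])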

Using Databricks for Twitter sentiment analysis - issue running the official tutorial

I am starting to use Databricks and tried to implement one of the official tutorials (https://learn.microsoft.com/en-gb/azure/azure-databricks/databricks-sentiment-analysis-cognitive-services) from the website. However, I run into an issue - not even sure if I can call it an issue - when I run the second notebook (analysetweetsfromeventhub): all subsequent commands (2nd, 3rd, 4th ...) sit in the "Waiting to run" state but never actually run. See the picture. Any idea what the problem might be? Thanks.
After you cancel a running streaming cell in a notebook attached to a Databricks Runtime cluster, you cannot run any subsequent commands in the notebook. The commands are left in the “waiting to run” state, and you must clear the notebook’s state or detach and reattach the cluster before you can successfully run commands on the notebook.
Note that this issue occurs only when you cancel a single cell; it does not apply when you run all and cancel all cells.
In the meantime, you can do either of the following:
To remediate an affected notebook without restarting the cluster, go to the notebook's Clear menu and select Clear State.
If restarting the cluster is acceptable, you can solve the issue by turning off idle context tracking. Set the following Spark configuration value on the cluster:
spark.databricks.chauffeur.enableIdleContextTracking false
Then restart the cluster.

Corrupt file getting generated when launching jupyter-pyspark kernel

We have installed and set up Jupyter notebook on two of our Linux Hadoop servers with PySpark kernels. Both servers have the same kernel.json configuration, with the same Spark and Python versions.
On one server the Jupyter notebook UI --> PySpark kernel works fine. On the other server, launching the PySpark kernel generates a file (with a name like ??????????) in the user's home directory. We can execute queries in the already-open PySpark kernel session, but when a new Jupyter notebook UI is launched with the corresponding PySpark kernel, we are unable to execute queries. We can only execute queries again after removing the ??????? file and relaunching the Jupyter notebook.
We see this behaviour for all users, and it happens only on the one server. Can someone please help with resolving this issue?
versions:
Python 2.7.12
Spark 2.1.1
Steps performed:
-Verified the PySpark kernel configs against the Jupyter install running on the other server, which has no issues.
-Restarted the Spark client on the server.
-Tried rebooting the server, which did not resolve the issue.
Looks like it might be an issue with the server hardware.
