AWS EMR PySpark Jupyter notebook not running - apache-spark

I'm working on AWS EMR, running PySpark in a Jupyter notebook. All of a sudden I cannot run the scripts any more: when I click run, nothing happens. Below is a screenshot of what I see when I try to change the kernel; no kernels are displayed.
What's the problem here?

I had the same problem: apparently you also need Livy installed. If I don't add Livy to the list of applications, I can use the notebook for a few minutes and then it stops working. The docs don't call this out prominently, but they do mention that Livy needs to be there.
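For reference, a minimal sketch of creating a cluster with Livy in the application list via boto3 (the name, release label, instance types, and roles are placeholders; adjust them to your own setup):

import boto3

emr = boto3.client('emr', region_name='us-east-1')

# Include Livy (and JupyterEnterpriseGateway for EMR Notebooks) alongside Spark
response = emr.run_job_flow(
    Name='notebook-cluster',
    ReleaseLabel='emr-6.4.0',
    Applications=[
        {'Name': 'Spark'},
        {'Name': 'Livy'},
        {'Name': 'JupyterEnterpriseGateway'},
    ],
    Instances={
        'MasterInstanceType': 'm5.xlarge',
        'SlaveInstanceType': 'm5.xlarge',
        'InstanceCount': 3,
        'KeepJobFlowAliveWhenNoSteps': True,
    },
    JobFlowRole='EMR_EC2_DefaultRole',
    ServiceRole='EMR_DefaultRole',
)
print(response['JobFlowId'])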

Related

PySpark / Kafka - org.apache.spark.sql.kafka010.KafkaSourceProvider could not be instantiated

So, I'm working on setting up a development environment for working with PySpark and Kafka. I'm working through getting things set up so I can run this tutorial in a Jupyter notebook as a 'hello world' exercise: https://spark.apache.org/docs/3.1.1/structured-streaming-kafka-integration.html
Unfortunately, I'm currently hitting the following error when I attempt to connect to the Kafka stream:
Py4JJavaError: An error occurred while calling o68.load.
: java.util.ServiceConfigurationError: org.apache.spark.sql.sources.DataSourceRegister: Provider org.apache.spark.sql.kafka010.KafkaSourceProvider could not be instantiated
at java.base/java.util.ServiceLoader.fail(ServiceLoader.java:583)
at java.base/java.util.ServiceLoader$ProviderImpl.newInstance(ServiceLoader.java:805)
at java.base/java.util.ServiceLoader$ProviderImpl.get(ServiceLoader.java:723)
at java.base/java.util.ServiceLoader$3.next(ServiceLoader.java:1395)
...
Now, some digging has told me that the most common cause of this issue is a version mismatch (either in the Spark or Scala versions in use). However, I'm able to confirm that these are aligned properly:
Spark: 3.1.1
Scala: 2.12.10
conf/spark-defaults.conf
...
spark.jars.packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1
EDIT
So, some additional observations from trying to figure this out:
It looks like this is at least partially a Jupyter notebook issue, as I can now get things working just fine via the pyspark shell.
Looks like my notebook is firing up its own instance of Spark, so maybe there's some difference in how Spark is being run there vs from a terminal window?
At a loss for how they're different though, as both environments should be using mostly default configurations.
Ok - looks like it doesn't work when invoked via the regular Python REPL either, which is leading me to think there's something different about the spark context being created by the pyspark shell and the one I'm creating in my notebook.
Ok - looks like something differs when things are run via Jupyter - hadoop.common.configuration.version has a value of 0.23.0 for the notebook instance, but 3.0.0 for the pyspark shell instance. Not sure why this might be or what it may mean yet.
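For anyone wanting to compare the two environments, one way to read that value from a live session (a sketch, assuming an active SparkSession named spark) is:

# Run this in both the notebook and the pyspark shell and compare the output
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
print(hadoop_conf.get("hadoop.common.configuration.version"))
print(spark.version)  # Spark version of the running session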
What else should I check to confirm that this is set up correctly?
Ok - so it looks like the difference was that findspark was locating and using a different Spark Home directory (one that came installed with the pyspark installation via pip).
It also looks like Spark 3.1.1 for Hadoop 2.7 has issues with the Kafka client (or maybe needs to be configured differently) but Spark 3.1.1 for Hadoop 3.2 works fine.
Solution was to ensure that I explicitly chose my SPARK_HOME by passing the spark_home path to findspark.init():
import findspark
findspark.init(spark_home='/path/to/desired/home')
Things to watch out for that got me and might trip you up too:
If you've installed pyspark through pip / mambaforge, that will also deploy a second SPARK_HOME, which can create dependency / library confusion (see the sketch after this list).
Many of the scripts in bin/ use SPARK_HOME to determine where to execute, so don't assume that just because you're running a script from one home that you're running spark IN that home.
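A quick sanity check along these lines (a sketch; the path is the same placeholder as above) shows which Spark home findspark would pick up by default and which one the session will actually use:

import os
import findspark

# findspark.find() reports the Spark home findspark would pick by default
print(findspark.find())

# Pin the desired installation explicitly before creating any SparkSession
findspark.init(spark_home='/path/to/desired/home')
print(os.environ['SPARK_HOME'])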

Corrupt file getting generated when launching jupyter-pyspark kernel

We have installed and set up Jupyter Notebook on two of our Linux Hadoop servers with pyspark kernels. Both servers have the same kernel.json configuration, with the same Spark and Python versions.
On one server, the pyspark kernel launched from the Jupyter Notebook UI works fine. On the other server, launching the pyspark kernel generates a file (with a garbled name like ??????????) in the user's home directory. We can run queries in the already-open pyspark kernel session, but when a new Jupyter Notebook UI is launched, we cannot run queries in its pyspark kernel. We can only run queries again after removing the generated ??????? file and relaunching Jupyter Notebook.
We see this behaviour for all users, and it happens on only one server. Can someone please help us resolve this issue?
versions:
Python 2.7.12
Spark 2.1.1
Steps performed:
-Verified the pyspark kernel configs against the Jupyter setup on the other server, which has no issues.
-Restarted the Spark client on the server.
-Tried rebooting the server, which did not resolve the issue.
Looks like it might be an issue with the server hardware.

Spark/k8s: How do I install Spark 2.4 on an existing kubernetes cluster, in client mode?

I want to install Apache Spark v2.4 on my Kubernetes cluster, but there does not seem to be a stable helm chart for this version. An older/stable chart (for v1.5.1) exists at
https://github.com/helm/charts/tree/master/stable/spark
How can I create/find a v2.4 chart?
Then: The reason for needing v2.4 is to enable client-mode, because I would like to be able to submit (PySpark/Jupyter notebook) jobs to the cluster from my laptop's dev environment. What extra steps are required to enable client-mode (including exposing the service)?
The closest attempt I have found so far (though for Spark v2.0.0), which I haven't yet got working, is at
https://github.com/Uninett/kubernetes-apps/tree/master/spark
At https://github.com/phatak-dev/kubernetes-spark (also two years old), there is nothing about jupyter deployment.
Pangeo-specific: https://discourse.jupyter.org/t/spark-integration-documentation/243
GitHub issue: https://github.com/jupyterhub/zero-to-jupyterhub-k8s/issues/1030
I have searched for up-to-date resources on this but have found nothing that has everything in one place. I will update this question with other relevant links if and when people are able to point them out to me. Hopefully it will be possible to cobble together an answer.
As ever, huge thanks in advance.
Update:
https://github.com/SnappyDataInc/spark-on-k8s for v2.2 is extremely easy to deploy - looks promising...
See https://hub.helm.sh/charts/microsoft/spark. This chart is based on https://github.com/helm/charts/tree/master/stable/spark and uses Spark 2.4.6 with Hadoop 3.1. You can check the source for this chart at https://github.com/dbanda/charts. The Livy service makes it easy to submit Spark jobs via REST API. You can also submit jobs using Zeppelin. We made this chart as an alternative way to run Spark on K8s without using the spark-submit k8s mode. I hope it helps.
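As a rough sketch of the REST route (host, port, and file path are placeholders for whatever the chart exposes in your cluster), a Livy batch submission looks roughly like this:

import requests

# Livy batches endpoint; replace the host/port with the service the chart exposes
livy_url = "http://livy.example.com:8998/batches"

payload = {
    "file": "local:///opt/jobs/my_job.py",  # PySpark script reachable from the cluster
    "name": "example-pyspark-job",
    "conf": {"spark.executor.instances": "2"},
}

resp = requests.post(livy_url, json=payload)
resp.raise_for_status()
print(resp.json())  # contains the batch id and its state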

How to schedule a pyspark job in a Jupyter notebook in Microsoft Azure for a Spark cluster?

I am new to Spark. I have developed a pyspark script through the Jupyter notebook interactive UI installed on our HDInsight cluster. As of now I have been running the code from Jupyter itself, but now I have to automate the script. I tried to use Azure Data Factory but could not find a way to run the pyspark script from there. I also tried oozie but could not figure out how to use it. I have tried saving the notebook, reopening it, and running all cells, but that is still a manual process.
Please help me schedule a pyspark job in Microsoft Azure.
I found a discussion about the best practice for running scheduled jobs (crontab-style) with Apache Spark / pyspark, which you might have reviewed already.
If you don't want to use oozie, a simple idea is to save the Jupyter notebook as a local Python script and write a small submit script that sends it to HDInsight Spark via Livy, with Linux crontab as the scheduler (a sketch follows the reference links below). For reference, see:
IPython Notebook save location
How can I configure pyspark on livy to use anaconda python instead of the default one
Submit Spark jobs remotely to an Apache Spark cluster on HDInsight using Livy
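As a rough sketch of that idea (the cluster name, credentials, and storage path are placeholders): upload the saved .py script to the cluster's default storage, then have a small submit script post it to the HDInsight Livy endpoint and schedule that script with crontab:

import requests

# HDInsight exposes Livy at https://<clustername>.azurehdinsight.net/livy
livy_url = "https://mycluster.azurehdinsight.net/livy/batches"

payload = {
    # Script previously uploaded to the cluster's default storage
    "file": "wasbs:///scripts/my_pyspark_job.py",
    "name": "scheduled-pyspark-job",
}

resp = requests.post(
    livy_url,
    json=payload,
    auth=("admin", "cluster-login-password"),  # cluster login credentials
    headers={"X-Requested-By": "admin"},
)
resp.raise_for_status()
print(resp.json())

# Example crontab entry to run this submit script every day at 02:00:
# 0 2 * * * /usr/bin/python3 /home/me/submit_job.py >> /home/me/submit_job.log 2>&1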
Hope it helps.

Suggest a similar installer for apache spark and a notebook?

I am new to big data analytics. I am trying to install Apache Spark and a notebook to execute code, like iPython. Is there an installer that comes with both Spark and a good notebook tool built in? I come from a background in PHP and Apache, and I am used to tools like XAMPP and WAMP that install multiple services in one click. Can anyone suggest a similar installer for Apache Spark and a notebook? I am on Windows.
If iPython is not a mandatory requirement and you can work with the Zeppelin notebook and Apache Spark, I think you will want Sparklet. It's similar to what you seek: a XAMPP-like installer for the Spark engine and the Zeppelin tool.
You can see details here - Sparklet
It supports Windows. Let me know if it solves your problem.
