Suggest a similar installer for Apache Spark and a notebook?

I am new to big data analytics. I am trying to install Apache Spark and a notebook to execute code the way IPython does. Is there an installer that comes with both Spark and a good notebook tool built in? I come from a background in PHP and Apache, and I am used to tools like XAMPP and WAMP that install multiple services in one click. Can anyone suggest a similar installer for Apache Spark and a notebook? I am on Windows.

If IPython is not a mandatory requirement and you can work with the Zeppelin notebook alongside Apache Spark, I think what you need is Sparklet. It is similar to what you are looking for: an XAMPP-like installer for the Spark engine and the Zeppelin tool.
You can see details here - Sparklet
It supports Windows. Let me know if it solves your problem.
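If an all-in-one installer does not work out, a common do-it-yourself alternative is a plain Spark download plus a Jupyter/IPython notebook. A minimal sketch, assuming Spark is unpacked at C:\spark\spark-2.4.6-bin-hadoop2.7 and the findspark package is pip-installed (the path and version are illustrative, not from the original answer):
import findspark
findspark.init(r"C:\spark\spark-2.4.6-bin-hadoop2.7")  # point Python at the unpacked Spark folder

from pyspark.sql import SparkSession

# Purely local session; on Windows you also need winutils.exe and HADOOP_HOME set
spark = SparkSession.builder.master("local[*]").appName("notebook-test").getOrCreate()
print(spark.range(5).count())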

Related

Using Pyspark locally when installed using databricks-connect

I have databricks-connect 6.6.0 installed, which bundles Spark 2.4.6. I have been using the Databricks cluster until now, but I am trying to switch to a local Spark session for unit testing.
However, every time I run it, the job still shows up on the cluster Spark UI as well as on the local Spark UI at xxxxxx:4040.
I have tried initiating it using SparkConf(), SparkContext(), and SQLContext(), but they all do the same thing. I have also set SPARK_HOME, HADOOP_HOME, and JAVA_HOME correctly, downloaded winutils.exe separately, and none of these directories have spaces. I have also tried running it from the console as well as from the terminal using spark-submit.
This is one of the pieces of sample code I tried:
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local").appName("name").getOrCreate()
inp = spark.createDataFrame([('Person1',12),('Person2',14)],['person','age'])
op = inp.toPandas()
I am using:
Windows 10, databricks-connect 6.6.0, Spark 2.4.6, JDK 1.8.0_265, Python 3.7, PyCharm Community 2020.1.1
Do I have to override the default/global spark session to initiate a local one? How would I do that?
I might be missing something - the code itself runs fine; it's just a matter of local vs. cluster.
TIA
You can't run them side by side. I recommend having two virtual environments managed with Conda: one for databricks-connect and one for plain pyspark. Then just switch between the two as needed.
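In the plain-pyspark environment, a minimal sketch of confirming which package is active and creating a purely local session (the app name is illustrative, not from the original answer):
import pyspark
print(pyspark.__version__)  # version of the pyspark/databricks-connect package on the path
print(pyspark.__file__)     # the site-packages path shows which environment is active

from pyspark.sql import SparkSession

# With only plain pyspark installed this stays local; with databricks-connect
# on the path, the same builder call is wired to the remote cluster instead.
spark = SparkSession.builder.master("local[*]").appName("unit-tests").getOrCreate()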

Spark/k8s: How do I install Spark 2.4 on an existing kubernetes cluster, in client mode?

I want to install Apache Spark v2.4 on my Kubernetes cluster, but there does not seem to be a stable helm chart for this version. An older/stable chart (for v1.5.1) exists at
https://github.com/helm/charts/tree/master/stable/spark
How can I create/find a v2.4 chart?
Then: The reason for needing v2.4 is to enable client-mode, because I would like to be able to submit (PySpark/Jupyter notebook) jobs to the cluster from my laptop's dev environment. What extra steps are required to enable client-mode (including exposing the service)?
The closest attempt so far (but for Spark v2.0.0) that I have found, but which I haven't yet got working, is at
https://github.com/Uninett/kubernetes-apps/tree/master/spark
At https://github.com/phatak-dev/kubernetes-spark (also two years old), there is nothing about jupyter deployment.
Pangeo-specific: https://discourse.jupyter.org/t/spark-integration-documentation/243
Related GitHub issue: https://github.com/jupyterhub/zero-to-jupyterhub-k8s/issues/1030
I have searched for up-to-date resources on this but have found nothing that has everything in one place. I will update this question with other relevant links if and when people are able to point them out to me. Hopefully it will be possible to cobble together an answer.
As ever, huge thanks in advance.
Update:
https://github.com/SnappyDataInc/spark-on-k8s for v2.2 is extremely easy to deploy - looks promising...
See https://hub.helm.sh/charts/microsoft/spark. It is based on https://github.com/helm/charts/tree/master/stable/spark and uses Spark 2.4.6 with Hadoop 3.1. You can check the source for this chart at https://github.com/dbanda/charts. The Livy service makes it easy to submit Spark jobs via a REST API. You can also submit jobs using Zeppelin. We made this chart as an alternative way to run Spark on K8s without using the spark-submit k8s mode. I hope it helps.
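For the Livy route, a rough sketch of submitting a batch job through its REST API from Python (the host name and the application path are assumptions; 8998 is Livy's default port):
import time
import requests

livy = "http://livy-host:8998"  # placeholder; substitute the chart's Livy service address
payload = {
    "file": "local:///opt/spark/examples/src/main/python/pi.py",  # must be reachable from the cluster
    "args": ["100"],
}

batch = requests.post(livy + "/batches", json=payload).json()

# Poll the batch state until Livy reports a terminal status
while True:
    state = requests.get("{}/batches/{}/state".format(livy, batch["id"])).json()["state"]
    print(state)
    if state in ("success", "dead", "killed"):
        break
    time.sleep(5)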

Is Apache Spark recommended to run on Windows?

I have a requirement to run Spark on Windows in a production environment. I would like advice on whether running Apache Spark on Windows is recommended, and if not, I would like to know the reasons.

Installing Hadoop on Linux Mint

I have started a course on Hadoop on Udemy. The instructor is using the Windows OS, installs VirtualBox, and then runs a Hortonworks Sandbox image to use Hadoop.
I am using Linux Mint, and after doing some research on installing Hadoop on Linux I found out (click for ref) that we can install the VM on Linux, download the Hortonworks Sandbox image, and run it.
I also found another method which does not use a VM (click for ref). I am confused as to which is the best way to install Hadoop.
Should I use the VM or the second method? Which is better for learning and development?
Thanks a lot for the help!
can install the VM on linux
You can use a VM on any host OS... That's the point of a VM.
The last link covers only Hadoop, whereas Hortonworks ships much, much more, like Spark, Hive, HBase, Pig, etc. - things you would otherwise need to install and configure yourself.
Which is better for learning and development?
I would strongly suggest using a VM (or containers) overall:
1) rather than messing up your local OS trying to get Hadoop working, and
2) because the Hortonworks documentation has lots of tutorials that can really only be run in the sandbox with the pre-installed datasets.

Submit spark application from laptop

I want to submit spark python applications from my laptop. I have a standalone spark cluster, and the master is running at some visible IP (MASTER_IP). After downloading and unzipping Spark on my laptop, I got this to work
./bin/spark-submit --master spark://MASTER_IP:7077 ~/PATHTO/pi.py
From what I understand, it is defaulting to client mode (vs cluster mode). According to Spark (http://spark.apache.org/docs/latest/submitting-applications.html) -
"only YARN supports cluster mode for Python applications." Since I'm not using YARN, I must use client mode.
My question is - do I need to download all of Spark on my laptop? Or just a few libraries?
I want to allow the rest of my team to use my Spark cluster, but I want them to do as little work as possible. They don't need to set up a cluster; they only need to submit jobs to it. Having them download all of Spark seems like overkill.
So, what exactly is the minimum that they need?
The spark-1.5.0-bin-hadoop2.6 package I have here is 304 MB unpacked. More than half of that, 175 MB, is spark-assembly-1.5.0-hadoop2.6.0.jar, the main Spark code. You can't get rid of this unless you want to compile your own package. A large part of the rest is spark-examples-1.5.0-hadoop2.6.0.jar, at 113 MB. Removing this and zipping the package back up is harmless and already saves you a lot.
However, as suggested by Reactormonk, using tools that spare them from working with the Spark package directly makes it even easier for them, for example spark-jobserver (I have never used it, and have not heard anyone very positive about its current state) or spark-kernel (it still needs your own code to interface with it, or, when used with a notebook (see below), it is limited compared to the alternatives).
A popular thing to do in that sense is to set up access to a notebook. As you're using Python, IPython with a PySpark profile would be the most straightforward to set up. Other alternatives are Zeppelin and spark-notebook (my favourite) for Scala.
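For reference, a minimal sketch of the kind of script a teammate could run against the cluster in client mode once they have a trimmed-down Spark (MASTER_IP is the placeholder from the question; everything else here is illustrative):
from operator import add
from random import random

from pyspark import SparkContext

# Client mode: the driver runs on the laptop and connects to the standalone master
sc = SparkContext(master="spark://MASTER_IP:7077", appName="pi-estimate")

def inside(_):
    x, y = random(), random()
    return 1 if x * x + y * y < 1.0 else 0

n = 100000
count = sc.parallelize(range(n)).map(inside).reduce(add)
print("Pi is roughly %f" % (4.0 * count / n))
sc.stop()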
