I've used the Windows version of HDInsight before, and that has a tab where you can set the number of cores and the amount of RAM per worker node for Zeppelin.
I followed this tutorial to get Zeppelin working:
https://azure.microsoft.com/en-us/documentation/articles/hdinsight-apache-spark-use-zeppelin-notebook/
The Linux version of HDInsight uses Ambari to manage the resources, but I can't seem to find a way to change the settings for Zeppelin.
Zeppelin is not selectable as a separate service in the list of services on the left. It also seems like it isn't available to be added when I choose 'add service' in actions.
I tried editing the general Spark configs in Ambari by using an override: I added the worker nodes to my new config group and increased the number of cores and RAM in custom spark-defaults, then clicked save and restarted all affected services.
I tried editing the spark settings using
vi /etc/spark/conf/spark-defaults.conf
on the headnode, but that wasn't picked up by Ambari.
The performance in Zeppelin seems to stay the same regardless: a test query takes about 1000-1100 seconds every time.
Zeppelin is not an Ambari-managed service, so it shouldn't show up in Ambari. If you are committed to managing it that way, you may be able to get this plugin to work:
https://github.com/tzolov/zeppelin-ambari-plugin
To edit via SSH you'll need to edit the zeppelin-env.sh file. First give yourself write permission:
sudo chmod u+w /usr/hdp/current/incubator-zeppelin/conf/zeppelin-env.sh
and then edit the Zeppelin configs using
vi /usr/hdp/current/incubator-zeppelin/conf/zeppelin-env.sh
Here you can configure the ZEPPELIN_JAVA_OPTS variable, adding:
-Dspark.executor.memory=1024m -Dspark.executor.cores=16
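For example, the line in zeppelin-env.sh would look something like the sketch below (adjust the values to your node sizes, and append to any options already set in that variable). The bin path for the restart is an assumption based on the conf path above:
export ZEPPELIN_JAVA_OPTS="-Dspark.executor.memory=1024m -Dspark.executor.cores=16"
# restart Zeppelin so the new options are picked up (path assumed to mirror the conf path)
/usr/hdp/current/incubator-zeppelin/bin/zeppelin-daemon.sh restart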
All that being said... any reason you can't just use a Jupyter notebook instead?
I want to install Apache Spark v2.4 on my Kubernetes cluster, but there does not seem to be a stable helm chart for this version. An older/stable chart (for v1.5.1) exists at
https://github.com/helm/charts/tree/master/stable/spark
How can I create/find a v2.4 chart?
Secondly: the reason for needing v2.4 is to enable client mode, because I would like to be able to submit (PySpark/Jupyter notebook) jobs to the cluster from my laptop's dev environment. What extra steps are required to enable client mode (including exposing the service)?
The closest attempt I have found so far (though for Spark v2.0.0, and I haven't yet got it working) is at
https://github.com/Uninett/kubernetes-apps/tree/master/spark
At https://github.com/phatak-dev/kubernetes-spark (also two years old), there is nothing about Jupyter deployment.
Pangeo-specific: https://discourse.jupyter.org/t/spark-integration-documentation/243
Related GitHub issue: https://github.com/jupyterhub/zero-to-jupyterhub-k8s/issues/1030
I have searched for up-to-date resources on this but have found nothing that has everything in one place. I will update this question with other relevant links if and when people are able to point them out to me. Hopefully it will be possible to cobble together an answer.
As ever, huge thanks in advance.
Update:
https://github.com/SnappyDataInc/spark-on-k8s for v2.2 is extremely easy to deploy - looks promising...
See https://hub.helm.sh/charts/microsoft/spark. It is based on https://github.com/helm/charts/tree/master/stable/spark and uses Spark 2.4.6 with Hadoop 3.1. You can check the source for this chart at https://github.com/dbanda/charts. The Livy service makes it easy to submit Spark jobs via REST API. You can also submit jobs using Zeppelin. We made this chart as an alternative way to run Spark on K8s without using the spark-submit k8s mode. I hope it helps.
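For reference, a minimal sketch of installing that chart and then submitting a job through the bundled Livy endpoint. The Helm repo URL, release name, Livy service address, and example jar path below are assumptions, so check the chart's README for the exact values:
# add the repo and install the chart (repo URL is an assumption; helm 3 syntax)
helm repo add microsoft https://microsoft.github.io/charts/repo
helm install my-spark microsoft/spark
# submit a batch job via Livy's REST API (service host/port and jar path are assumptions)
curl -X POST -H "Content-Type: application/json" \
  -d '{"file": "local:///opt/spark/examples/jars/spark-examples_2.11-2.4.6.jar", "className": "org.apache.spark.examples.SparkPi"}' \
  http://<livy-host>:8998/batches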
I am completely new to Spark and am trying to run a tutorial example, which counts the number of lines containing 'a' and 'b' in a text file in the local file system.
I am running it with SparkContext with master = "local", i.e. Spark is running in the same JVM. Now I would like to try it in "cluster mode".
So I would like to run a Spark cluster with a cluster manager and two worker nodes locally on my Mac laptop. What is the easiest way to do that?
Quoting the official documentation about Spark Standalone Mode:
./sbin/start-master.sh
./sbin/start-slave.sh <master-spark-URL>
In other words, you should start the standalone Master first (using ./sbin/start-master.sh) followed by starting one or more standalone Workers (using ./sbin/start-slave.sh).
Quoting the docs again:
Once you have started a worker, look at the master's web UI (http://localhost:8080 by default)
You're done. Congrats!
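If you specifically want two workers on one laptop, here is a minimal sketch. SPARK_WORKER_INSTANCES is one way to start multiple workers; the master URL below assumes the master registered under localhost, so check the master web UI for the exact spark:// URL:
# start the master (web UI at http://localhost:8080 by default)
./sbin/start-master.sh
# start two worker instances pointed at it
SPARK_WORKER_INSTANCES=2 ./sbin/start-slave.sh spark://localhost:7077
# run your application against the cluster
./bin/spark-shell --master spark://localhost:7077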
If you are looking to learn various ways to use Spark, I would suggest you download the Cloudera QuickStart VM, which will give you a simple cluster setup.
All you need to do is download the QuickStart VM and play around with the settings.
The QuickStart VM can be found here:
Reference: Cloudera QuickStart VM
I know this question has been asked before, but those answers seem to revolve around Hadoop. For Spark you don't really need all the extra Hadoop cruft. With the spark-ec2 script (available via GitHub for 2.0), your environment is prepared for Spark. Are there any compelling use cases (other than a far superior boto3 SDK interface) for running with EMR rather than plain EC2?
This question boils down to the value of managed services, IMHO.
Running Spark standalone in local mode only requires that you get the latest Spark, untar it, cd to its bin path, and run spark-submit, etc.
However, creating a multi-node cluster that runs in cluster mode requires that you actually do real networking, configuring, tuning, etc. This means you've got to deal with IAM roles, Security groups, and there are subnet considerations within your VPC.
When you use EMR, you get a turnkey cluster in which you can 1-click install many popular applications (Spark included), and all of the security groups are already configured properly for network communication between nodes. Logging is already set up and pointing at S3, you get easy SSH instructions, and there is an already-installed apparatus for tunneling and viewing the various UIs. You get visual usage metrics at the IO level, node level, and job-submission level. You also have the ability to create and run Steps -- jobs that can be run on the command line of the driver node or as Spark applications that leverage the whole cluster. Then, on top of that, you can export that whole cluster, Steps included, copy-paste the CLI script into a recurring job via Data Pipeline, and literally create an ETL pipeline in 60 seconds flat.
You wouldn't get any of that if you built it yourself in EC2. I know which one I would choose... EMR. But that's just me.
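To make that concrete, here is a hedged sketch of what the one-command cluster creation looks like from the CLI; the release label, instance types, key name, and S3 paths are placeholders you would replace:
aws emr create-cluster \
  --name "spark-cluster" \
  --release-label emr-5.29.0 \
  --applications Name=Spark \
  --instance-type m5.xlarge --instance-count 3 \
  --use-default-roles \
  --ec2-attributes KeyName=my-key \
  --log-uri s3://my-bucket/emr-logs/ \
  --steps Type=Spark,Name=MyJob,ActionOnFailure=CONTINUE,Args=[--class,com.example.MyApp,s3://my-bucket/jars/my-app.jar]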
I am trying to test out Spark so I can summarize some data I have in Cassandra. I've been through all the DataStax tutorials and they are very vague as to how you actually enable Spark. The only indication I can find is that it comes enabled automatically when you select an "Analytics" node during install. However, I have an existing Cassandra node and I don't want to have to use a different machine for testing, as I am just evaluating everything on my laptop.
Is it possible to just enable Spark on the same node and deal with any performance implications? If so how can I enable it so that it can be tested?
I see the folders there for Spark (although I'm not positive all the files are present), but when I check whether it's set as the Spark master using the command below, it says that no Spark nodes are enabled.
dsetool sparkmaster
I am using Linux Ubuntu Mint.
I'm just looking for a quick and dirty way to get my data averaged and so forth and Spark seems like the way to go since it's a massive amount of data, but I want to avoid having to pay to host multiple machines (at least for now while testing).
Yes. Spark is also able to interact with a cluster even if it is not enabled on all of the nodes.
Package install
Edit the /etc/default/dse file, setting the appropriate lines in this file depending on the type of node you want:
...
Spark nodes:
SPARK_ENABLED=1
HADOOP_ENABLED=0
SOLR_ENABLED=0
Then restart the DSE service
http://docs.datastax.com/en/datastax_enterprise/4.5/datastax_enterprise/reference/refDseServ.html
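For the restart mentioned above, on a package install something like the following should do it (the exact init mechanism varies by distro, so adjust for systemd if needed):
sudo service dse stop
sudo service dse start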
Tar Install
Stop DSE on the node and then restart it using the following command from the install directory:
...
Spark-only node: $ bin/dse cassandra -k (starts Spark trackers on a cluster of Analytics nodes).
http://docs.datastax.com/en/datastax_enterprise/4.5/datastax_enterprise/reference/refDseStandalone.html
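Putting that together with the check from the question, a quick sketch run from the tarball install directory (assumes a single local node):
bin/dse cassandra -k      # start the node with Spark trackers enabled
bin/dsetool sparkmaster   # should now print the Spark master address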
Enable Spark by setting SPARK_ENABLED=1 in the dse.default file, which you can edit with:
sudo nano /usr/share/dse/resources/dse/conf/dse.default
I am trying to set up Hadoop permanently on Amazon EC2. Currently, every morning I launch EC2 instances and set up Hadoop. Is there any way I can avoid this tedious step? I am looking for a Hadoop image which can be loaded on EC2 and make things easy for me.
I know I can use EMR for Hadoop services, but I don't know how to start an EMR (Hadoop) cluster without submitting a job flow. I mean I need a Hadoop cluster without any jobs running in it.
Ultimately my aim is to run bioinformatics applications like Distmap and Seal. These applications have many dependencies, so I need a Hadoop cluster that is up but idle, where I can set up the environment and then run them.
I hope it's clear what I am trying to do.
Thanks.
What you can do is one of the below:
Option 1. Start out with an EBS-backed EC2 instance running your favourite Linux distro and install the Hadoop software you need. Create as many EC2 instances as there are node types you are going to need (master / slaves / etc.). You can then create your own AMIs in the AWS Console (right-click on the EC2 instance and click "Create AMI") and launch as many instances as you need from those AMIs. You can also create AMIs from instance-store-backed instances, but that means dumping everything to S3 and creating an AMI from there. There are a lot of tutorials about this available; please leave a comment if you need directions :)
Option 2. Start out with a Hadoop-based AMI and repeat the steps above after doing your own configuration / adding dependencies. I went ahead and searched for Hadoop AMIs from the AWS console and there are 48 in EU-West-1 (not sure which region you're working with).
Option 3. Start an EMR cluster in interactive mode. There is also an option to keep the cluster alive after the job flows finish. If you also set the EC2 keys for the EMR instances, you should be able to SSH into them and have a functional Hadoop cluster (not sure about the dependencies though; you might be better off rolling your own).
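If you go with option 3, here is a hedged sketch of keeping the cluster alive so you can SSH in and install your own dependencies; the release label, instance type, and key name are placeholders:
aws emr create-cluster \
  --name "dev-hadoop" \
  --release-label emr-5.29.0 \
  --applications Name=Hadoop \
  --instance-type m5.xlarge --instance-count 3 \
  --use-default-roles \
  --ec2-attributes KeyName=my-key \
  --no-auto-terminate
# then SSH to the master node
aws emr ssh --cluster-id j-XXXXXXXXXXXX --key-pair-file ~/my-key.pem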
I hope I understood correctly what you're trying to achieve and this helps a little bit.
This is more of a configuration management and automation problem. Try configuration management tools like Chef and Puppet to get this done the way you want.