How do you install custom software on worker nodes in Azure HDInsight?

I have created an Azure HDInsight cluster using PowerShell. Now I need to install some custom software on the worker nodes that is required for the mappers I will be running using Hadoop streaming. I haven't found any PowerShell command that could help me with this task. I can prepare a custom job that will set up all the workers, but I'm not convinced that this is the best solution. Are there better options?
With AWS Elastic MapReduce there is an option to install additional software in a bootstrap action that is defined when you create a cluster. I was looking for something similar.
You can use a bootstrap action to install additional software and to change the configuration of applications on the cluster. Bootstrap actions are scripts that are run on the cluster nodes when Amazon EMR launches the cluster. They run before Hadoop starts and before the node begins processing data.
from: Create Bootstrap Actions to Install Additional Software

The short answer is that you don't. It's not ideal from a caching perspective, but you should be able to bundle all your job dependencies into the MapReduce JAR, which YARN (part of Hadoop) distributes across the cluster for you. This is broadly transparent to the end user, as it's all handled through the job submission process.
If you need something large that is a shared dependency across many jobs, and you don't want it copied out every time, you can keep it on wasb:// storage and reference it in a classpath, but that might add complexity if you are, for instance, using the .NET Streaming API.

I've just heard from a colleague that I need to update my Azure PowerShell, because a new cmdlet, Add-AzureHDInsightScriptAction, was recently added, and it does just that.
Customize HDInsight clusters using Script Action


Updating jar job on databricks

I have a shared cluster that is used by several jobs on Databricks.
When I launch a job after updating its JAR, the update is not picked up: on the cluster, I see that it still uses an old version of the JAR.
To clarify, I publish the JAR through API 2.0 in Databricks.
My question: why does the cluster always use an old version when I start my job?
The old JAR is removed from the cluster only when the cluster is terminated. If you have a shared cluster that never terminates, that never happens. This is a limitation not of Databricks but of Java, which can't unload classes that are already in use (or at least it's very hard to implement reliably).
In most cases it's really not recommended to use a shared cluster, for several reasons:
it costs significantly more (~4x)
tasks affect each other's performance
there is a high probability of dependency conflicts, and libraries can't be updated without affecting other tasks
a kind of "garbage" accumulates on the driver nodes
If you use a shared cluster to get faster execution, I recommend looking into Instance Pools, especially in combination with preloading the Databricks Runtime onto the nodes in the instance pool.
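To make the alternative concrete, here is a minimal sketch of job settings that run the JAR on a fresh job cluster drawn from an instance pool instead of an always-on shared cluster. The field names (new_cluster, instance_pool_id, libraries, spark_jar_task) follow the Jobs API 2.0 payload as I understand it, so check them against the current Databricks documentation. Every value (job name, JAR path, class name, pool ID, runtime version) is a hypothetical placeholder.

```python
import json


def job_settings(name, jar_path, main_class, pool_id, spark_version, num_workers=2):
    """Build a Jobs API 2.0 style payload that runs a JAR on a new job cluster.

    Because the cluster is created per run and terminated afterwards, each run
    starts a fresh JVM and loads whatever JAR is currently at jar_path.
    """
    return {
        "name": name,
        # A new_cluster block (rather than existing_cluster_id) asks for a
        # fresh cluster per run; the pool keeps start-up time low.
        "new_cluster": {
            "spark_version": spark_version,
            "instance_pool_id": pool_id,
            "num_workers": num_workers,
        },
        "libraries": [{"jar": jar_path}],
        "spark_jar_task": {"main_class_name": main_class},
    }


# All values below are hypothetical placeholders.
settings = job_settings(
    name="nightly-etl",
    jar_path="dbfs:/jars/nightly-etl.jar",
    main_class="com.example.NightlyEtl",
    pool_id="pool-1234",
    spark_version="<runtime-version>",
)
print(json.dumps(settings, indent=2))
```

The printed JSON is what you would send as the body of the job create or reset request. The point of the sketch is only that the payload has no existing_cluster_id, so no long-lived JVM keeps an old class version loaded.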

Is it possible to run ANY application or program with HADOOP YARN?

I've been studying distributed computing recently and found out that Hadoop YARN is one option.
So I thought that if I just set up a Hadoop YARN cluster, every application would run distributed.
But now someone has told me that Hadoop YARN cannot do anything by itself and needs other things like MapReduce, Spark, and HBase.
If this is correct, does that mean only limited tasks can be run with YARN?
Or can I apply YARN's distributed computing to any application I want?
Hadoop is the name which refers to the entire system.
HDFS is the actual storage system. Think of it as S3 or a distributed Linux filesystem.
YARN is a framework for scheduling jobs and allocating resources. It handles these things for you, but you don't interact very much with it.
Spark and MapReduce are managed by YARN. With these two, you can actually write your code/applications and give work to the cluster.
HBase uses the HDFS storage (which is file-based) and provides NoSQL storage.
Theoretically you can run more than just Spark and MapReduce on YARN, and you can use something other than YARN (Kubernetes support is in the works or available now). You can even write your own processing tool, queue/resource management system, storage... Hadoop has many pieces which you may or may not use, depending on your case. But the majority of Hadoop systems use YARN and Spark.
If you want to deploy Docker containers, for example, a plain Kubernetes cluster would be a better choice. If you need batch/real-time processing with Spark, use Hadoop.
YARN itself is a resource manager. You will need to write code that can be deployed onto those resources, and that code could then do anything, provided the nodes running the tasks are themselves capable of running the job. For example, you cannot distribute a Python script without first installing that script's dependencies on the nodes. Mesos is a bit more generalized / accessible than YARN, if you want more flexibility for the same effect.
YARN mostly supports running JAR files and shell scripts (at least, from Oozie); Docker containers can be deployed to it as well (refer to the Apache docs).
You may also refer to the Apache Slider or Twill projects for more information.

Does EMR still have any advantages over EC2 for Spark?

I know this question has been asked before but those answers seem to revolve around Hadoop. For Spark you don't really need all the extra Hadoop cruft. With the spark-ec2 script (available via GitHub for 2.0) your environment is prepared for Spark. Are there any compelling use cases (other than a far superior boto3 sdk interface) for running with EMR over EC2?
This question boils down to the value of managed services, IMHO.
Running Spark standalone in local mode only requires that you get the latest Spark, untar it, cd to its bin path, and run spark-submit, etc.
However, creating a multi-node cluster that runs in cluster mode requires that you actually do real networking, configuring, tuning, etc. This means you've got to deal with IAM roles, Security groups, and there are subnet considerations within your VPC.
When you use EMR, you get a turnkey cluster in which you can 1-click install many popular applications (Spark included), and all of the Security Groups are already configured properly for network communication between nodes. You've got logging already set up and pointing at S3, easy SSH instructions, an already-installed apparatus for tunneling and viewing the various UIs, and visual usage metrics at the IO level, node level, and job submission level. You also have the ability to create and run Steps, which are jobs that can be run in the command line of the driver node or as Spark applications that leverage the whole cluster. Then, on top of that, you can export that whole cluster, steps included, and copy-paste the CLI script into a recurring job via DataPipeline and literally create an ETL pipeline in 60 seconds flat.
You wouldn't get any of that if you built it yourself in EC2. I know which one I would choose... EMR. But that's just me.

Which cluster type should I choose for Spark?

I am new to Apache Spark, and I just learned that Spark supports three types of cluster:
Standalone - meaning Spark will manage its own cluster
YARN - using Hadoop's YARN resource manager
Mesos - Apache's dedicated resource manager project
I think I should try Standalone first. In the future, I need to build a large cluster (hundreds of instances).
Which cluster type should I choose?
Spark Standalone Manager : A simple cluster manager included with Spark that makes it easy to set up a cluster. By default, each application uses all the available nodes in the cluster.
A few benefits of YARN over Standalone & Mesos:
YARN allows you to dynamically share and centrally configure the same pool of cluster resources between all frameworks that run on YARN.
You can take advantage of all the features of YARN schedulers for categorizing, isolating, and prioritizing workloads.
The Spark standalone mode requires each application to run an executor on every node in the cluster; whereas with YARN, you choose the number of executors to use
YARN directly handles rack and machine locality in your requests, which is convenient.
The resource request model is, oddly, backwards in Mesos. In YARN, you (the framework) request containers with a given specification and give locality preferences. In Mesos you get resource "offers" and choose to accept or reject those based on your own scheduling policy. The Mesos model is arguably more flexible, but seemingly more work for the person implementing the framework.
If you have a big Hadoop cluster already in place, YARN is a better choice.
The Standalone manager requires the user to configure each of the nodes with the shared secret. Mesos' default authentication module, Cyrus SASL, can be replaced with a custom module. YARN has security for authentication, service-level authorization, authentication for web consoles, and data confidentiality. Hadoop uses Kerberos to verify that each user and service is authenticated.
High availability is offered by all three cluster managers but Hadoop YARN doesn’t need to run a separate ZooKeeper Failover Controller.
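The request-versus-offer contrast described above can be illustrated with a toy model. This is only a sketch of the two interaction styles, not of either scheduler's real API. The Slot type, the function names, and the node names are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slot:
    """A free chunk of resources on one node (hypothetical toy type)."""
    node: str
    cpus: int
    mem_gb: int


def yarn_style_request(free_slots, cpus, mem_gb, preferred_nodes=()):
    """Request model: the framework states what it needs, the manager picks a slot."""
    candidates = [s for s in free_slots if s.cpus >= cpus and s.mem_gb >= mem_gb]
    # Honour the locality preference when possible, otherwise fall back to any fit.
    candidates.sort(key=lambda s: s.node not in preferred_nodes)
    return candidates[0] if candidates else None


def mesos_style_offers(free_slots, accept):
    """Offer model: the manager offers what is free, and the framework's own
    policy (the accept callback) decides which offers to take."""
    return [s for s in free_slots if accept(s)]


free = [Slot("node-a", 2, 4), Slot("node-b", 8, 32), Slot("node-c", 4, 16)]

# Request model: "give me 4 CPUs and 8 GB, ideally on node-c".
granted = yarn_style_request(free, cpus=4, mem_gb=8, preferred_nodes=("node-c",))

# Offer model: the framework itself encodes the policy for each offer.
accepted = mesos_style_offers(free, accept=lambda s: s.cpus >= 4 and s.node != "node-b")

print(granted.node)
print([s.node for s in accepted])
```

In the request model the matching logic lives in the manager. In the offer model it lives in the accept callback, which is where the extra flexibility, and the extra work for the framework author, comes from.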
I think those best placed to answer this are the people who work on Spark. So, from Learning Spark:
Start with a standalone cluster if this is a new deployment.
Standalone mode is the easiest to set up and will provide almost all
the same features as the other cluster managers if you are only
running Spark.
If you would like to run Spark alongside other applications, or to use
richer resource scheduling capabilities (e.g. queues), both YARN and
Mesos provide these features. Of these, YARN will likely be
preinstalled in many Hadoop distributions.
One advantage of Mesos over both YARN and standalone mode is its
fine-grained sharing option, which lets interactive applications such
as the Spark shell scale down their CPU allocation between commands.
This makes it attractive in environments where multiple users are
running interactive shells.
In all cases, it is best to run Spark on the same nodes as HDFS for
fast access to storage. You can install Mesos or the standalone
cluster manager on the same nodes manually, or most Hadoop
distributions already install YARN and HDFS together.
Standalone is pretty clear: as others mentioned, it should be used only when you have a Spark-only workload.
Between YARN and Mesos, one thing to consider is that, unlike MapReduce, a Spark job grabs executors and holds them for the entire lifetime of the job, whereas a MapReduce job can acquire and release mappers and reducers over its lifetime.
If you have long-running Spark jobs that don't fully utilize all the resources they got at the beginning, you may want to share those resources with other applications, and you can only do that via either Mesos or Spark dynamic allocation.
So with YARN, the only way to get dynamic allocation for Spark is to use Spark's own dynamic allocation. YARN won't interfere in that, while Mesos will. Again, this whole point is only important if you have a long-running Spark application and you would like to scale it up and down dynamically.
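As a rough illustration of the two options on YARN, the sketch below builds spark-submit argument lists for a fixed executor count and for Spark's dynamic allocation. The flags and property names are standard Spark settings. The helper function, the JAR name, the class name, and the executor numbers are hypothetical, and enabling spark.shuffle.service.enabled assumes the external shuffle service is already running on the cluster's NodeManagers.

```python
import shlex


def spark_submit_args(app_jar, main_class, num_executors=None, dynamic_range=None):
    """Return a spark-submit argument list for a YARN cluster-mode run.

    Pass num_executors for a fixed allocation held for the whole job, or
    dynamic_range=(min, max) to let Spark grow and shrink its executor count.
    """
    args = ["spark-submit", "--master", "yarn", "--deploy-mode", "cluster",
            "--class", main_class]
    if dynamic_range is not None:
        min_execs, max_execs = dynamic_range
        conf = {
            "spark.dynamicAllocation.enabled": "true",
            # Assumes the external shuffle service is set up on the cluster.
            "spark.shuffle.service.enabled": "true",
            "spark.dynamicAllocation.minExecutors": str(min_execs),
            "spark.dynamicAllocation.maxExecutors": str(max_execs),
        }
        for key, value in conf.items():
            args += ["--conf", f"{key}={value}"]
    elif num_executors is not None:
        args += ["--num-executors", str(num_executors)]
    args.append(app_jar)
    return args


# Hypothetical application: fixed allocation vs. dynamic allocation.
fixed = spark_submit_args("etl.jar", "com.example.Etl", num_executors=10)
elastic = spark_submit_args("etl.jar", "com.example.Etl", dynamic_range=(2, 50))

print(shlex.join(fixed))
print(shlex.join(elastic))
```

With the fixed variant the job holds its 10 executors until it finishes. With the dynamic variant Spark itself releases idle executors back to YARN and requests more when tasks queue up.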
In this case, as with similar dilemmas in data engineering, there are many side questions to answer before choosing one distribution method over another.
For example, if you are not running your processing engine on more than 3 nodes, you are usually not facing a problem big enough for the performance-tuning margin between YARN and Spark Standalone to settle the decision (in my experience). In that situation you will usually try to keep your pipeline simple, especially when your services are not managed by a cloud provider and bugs and failures happen often.
I choose standalone for relatively small or simple pipelines, but if I'm feeling confident and have a Hadoop cluster already in place, I prefer to take advantage of all the extra configuration options that Hadoop (YARN) gives me.
Mesos has a more sophisticated scheduling design, allowing applications like Spark to negotiate with it. It's better suited to the diversity of applications today. I found this site really insightful:
"... YARN is optimized for scheduling Hadoop jobs, which are historically (and still typically) batch jobs with long run times. This means that YARN was not designed for long-running services, nor for short-lived interactive queries (like small and fast Spark jobs), and while it’s possible to have it schedule other kinds of workloads, this is not an ideal model. The resource demands, execution model, and architectural demands of MapReduce are very different from those of long-running services, such as web servers or SOA applications, or real-time workloads like those of Spark or Storm..."

How to submit Apache Spark job to Hadoop YARN on Azure HDInsight

I am very excited that HDInsight switched to Hadoop version 2, which supports Apache Spark through YARN. Apache Spark is a much better fitting parallel programming paradigm than MapReduce for the task that I want to perform.
However, I was unable to find any documentation on how to remotely submit an Apache Spark job to my HDInsight cluster. For remote submission of standard MapReduce jobs, I know that there are several REST endpoints like Templeton and Oozie. But as far as I was able to find, running Spark jobs is not possible through Templeton. I did find that it is possible to incorporate Spark jobs into Oozie, but I've read that this is very tedious to do, and I've also read some reports of job failure detection not working in this case.
There must be a more appropriate way to submit Spark jobs. Does anyone know how to remotely submit Apache Spark jobs to HDInsight?
You can install Spark on an HDInsight cluster. You have to do it by creating a custom cluster and adding a script action that installs Spark on the cluster when it creates the VMs for the cluster.
Installing with a script action at cluster creation is pretty easy: you can do it in C# or PowerShell by adding a few lines of code to a standard custom create-cluster script/program.
$config = Add-AzureHDInsightScriptAction -Config $config -Name "Install Spark" -ClusterRoleCollection HeadNode -Urin
clusterInfo.ConfigActions.Add(new ScriptAction(
    "Install Spark", // Name of the config action
    new ClusterNodeType[] { ClusterNodeType.HeadNode }, // List of nodes to install Spark on
    new Uri(""), // Location of the script to install Spark
    null // null because the script used does not require any parameters
));
You can then RDP into the head node and use spark-shell or spark-submit to run jobs. I am not sure how you would run a Spark job without RDPing into the head node, but that is another question.
I also asked the Azure team the same question. The following is their solution:
"Two questions to the topic: 1. How can we submit a job outside of the cluster without "Remote to…" — Tao Li
Currently, this functionality is not supported. One workaround is to build a job submission web service yourself:
Create a Scala web service that uses Spark APIs to start jobs on the cluster.
Host this web service in a VM inside the same VNet as the cluster.
Expose the web service endpoint externally through some authentication scheme. You can also employ an intermediate MapReduce job, though it would take longer.
You might consider using Brisk, which offers Spark on Azure as a provisioned service (with support available). There's a free tier, and it lets you access blob storage with a wasb://path/to/files path, just like HDInsight.
It doesn't sit on YARN; instead it is a lightweight, Azure-oriented distribution of Spark.
Disclaimer: I work on the project!
