My problem: I am developing a Spark extension and I would like to run tests and performance benchmarks at scale before making the changes public. Currently such tests are a bit too manual: I compile and package my libraries, copy the jar files to a cluster where I have a private Spark deployment, restart Spark, then fire off tests and benchmarks by hand. After each test I manually inspect logs and console output.
Could someone with more experience offer hints on how to make this more automatic? I am particularly interested in:
Ability to integrate with GitHub and Jenkins. Ideally I would only have to push a commit to the GitHub repo; Jenkins would then automatically pull and build, add the new libraries to a Spark environment, start Spark and trigger the tests and benchmarks, and finally collect the output files and make them available.
How to run and manage the Spark cluster. I see a number of options:
a) continue with having a single Spark installation: The test framework would update my jar files, restart Spark so the new libraries are picked up and then run the tests/benchmarks. The advantage would be that I only have to set up Spark (and maybe HDFS for sharing data & application binaries, YARN as the resource manager, etc) once.
b) run Spark in containers: My cluster would run a container management system (like Kubernetes). The test framework would create/update the Spark container image, fire up & configure a number of containers to start Spark, submit the test/benchmarks and collect results. The big advantage of this is that multiple developers can run tests in parallel and that I can test various versions of Spark & Hadoop.
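For what it's worth, the push-to-results flow (closest to option b) can be sketched as the shell steps a Jenkins job would run. Everything here (registry, deployment names, paths) is a hypothetical placeholder, and the commands are echoed rather than executed so the outline itself stays runnable without a cluster:

```shell
#!/bin/sh
# Sketch of one CI run: build -> image -> deploy -> test -> collect.
# All names below are hypothetical placeholders.
BUILD_CMD="sbt package"                                           # compile & package the extension
IMAGE_CMD="docker build -t registry.local/spark-ext:ci ."         # bake the jars into a Spark image
DEPLOY_CMD="kubectl set image deployment/spark spark=registry.local/spark-ext:ci"
TEST_CMD="spark-submit --master yarn tests/benchmarks.py"         # fire tests & benchmarks
COLLECT_CMD="kubectl cp spark-driver:/results ./results"          # gather logs & output files

for step in "$BUILD_CMD" "$IMAGE_CMD" "$DEPLOY_CMD" "$TEST_CMD" "$COLLECT_CMD"; do
  echo "would run: $step"
done
```

In a real Jenkins job, each of these would be its own build step so a failure is attributed to the right stage.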
Create a Docker image that contains your entire solution, including tests; push it to GitHub and have Drone CI or Travis CI build it and listen for updates. It works great for me. 😀
There are many Spark Docker images on GitHub and Docker Hub; I use this one:
https://github.com/jupyter/docker-stacks/tree/master/all-spark-notebook
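To make that concrete, a minimal Dockerfile layered on the image above might look like this. The jar and test paths are hypothetical placeholders; the file is only written out here (rather than built) so the sketch runs without a Docker daemon:

```shell
#!/bin/sh
# Sketch: bake the extension jar and tests on top of the all-spark-notebook
# image; `docker build -t myorg/spark-ext-tests .` would then consume it.
cat > Dockerfile.test <<'EOF'
FROM jupyter/all-spark-notebook
# Hypothetical paths: the packaged extension jar and the test suite.
COPY target/my-spark-ext.jar /usr/local/spark/jars/
COPY tests/ /home/jovyan/tests/
CMD ["spark-submit", "/home/jovyan/tests/run_all.py"]
EOF

echo "wrote Dockerfile.test"
```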
I've been studying distributed computing recently and found out that Hadoop YARN is one such system.
So I thought that if I just set up a Hadoop YARN cluster, then every application would run distributed.
But now someone has told me that Hadoop YARN cannot do anything by itself and needs other components like MapReduce, Spark, and HBase.
If this is correct, does that mean only a limited set of tasks can be run with YARN?
Or can I apply YARN's distributed computing to any application I want?
Hadoop is the name which refers to the entire system.
HDFS is the actual storage system. Think of it as S3 or a distributed Linux filesystem.
YARN is a framework for scheduling jobs and allocating resources. It handles these things for you, but you don't interact very much with it.
Spark and MapReduce are managed by YARN. With these two, you can actually write your code/applications and give work to the cluster.
HBase uses HDFS storage (which is file-based) and provides NoSQL storage.
Theoretically you can run more than just Spark and MapReduce on YARN, and you can use something other than YARN (Kubernetes support is in the works, or available now). You can even write your own processing tool, queue/resource management system, storage... Hadoop has many pieces which you may use or not, depending on your case. But the majority of Hadoop systems use YARN and Spark.
If you want to deploy Docker containers for example, just a Kubernetes cluster would be a better choice. If you need batch/real time processing with Spark, use Hadoop.
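As a concrete illustration of "giving work to the cluster": a Spark application is handed to YARN through spark-submit. The class and jar names below are hypothetical placeholders, and the command is composed and printed rather than executed, so the sketch runs without a cluster:

```shell
#!/bin/sh
# Sketch: how a Spark application is handed to YARN for scheduling.
# com.example.MyApp and myapp.jar are hypothetical placeholders.
APP_CLASS="com.example.MyApp"
APP_JAR="myapp.jar"

# --master yarn asks YARN (not Spark's own standalone manager) for resources.
CMD="spark-submit --master yarn --deploy-mode cluster \
  --num-executors 4 --executor-memory 2g \
  --class $APP_CLASS $APP_JAR"

echo "$CMD"   # printed, not executed
```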
YARN itself is a resource manager. You will need to write code that can be deployed onto those resources, and then it can do anything, provided the nodes running the tasks are themselves capable of running the job. For example, you cannot distribute a Python library without first installing the dependencies for that script. Mesos is a bit more generalized/accessible than YARN, if you want more flexibility for the same effect.
YARN mostly supports running JAR files and shell scripts (at least via Oozie); Docker containers can be deployed to it as well (see the Apache docs).
You may also refer to the Apache Slider or Twill projects for more information.
I know this question has been asked before, but those answers seem to revolve around Hadoop. For Spark you don't really need all the extra Hadoop cruft. With the spark-ec2 script (available via GitHub for 2.0) your environment is prepared for Spark. Are there any compelling use cases (other than a far superior boto3 SDK interface) for running EMR over EC2?
This question boils down to the value of managed services, IMHO.
Running Spark standalone in local mode only requires that you grab the latest Spark, untar it, cd to its bin path, and run spark-submit, etc.
However, creating a multi-node cluster that runs in cluster mode requires that you actually do real networking, configuring, tuning, etc. This means you've got to deal with IAM roles and Security Groups, and there are subnet considerations within your VPC.
When you use EMR, you get a turnkey cluster in which you can 1-click install many popular applications (Spark included), and all of the Security Groups are already configured properly for network communication between nodes. You've got logging already set up and pointing at S3, easy SSH instructions, an already-installed apparatus for tunneling and viewing the various UIs, and visual usage metrics at the IO level, node level, and job-submission level. You also have the ability to create and run Steps -- jobs that can be run on the command line of the driver node or as Spark applications that leverage the whole cluster. Then, on top of that, you can export that whole cluster, Steps included, and copy-paste the CLI script into a recurring job via Data Pipeline -- literally creating an ETL pipeline in 60 seconds flat.
You wouldn't get any of that if you built it yourself in EC2. I know which one I would choose... EMR. But that's just me.
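The "turnkey" point can be seen in how little the EMR CLI needs. A hedged sketch: the name, bucket, instance type, and count are placeholders, and the command is composed and printed rather than executed, so no AWS account is needed to run it:

```shell
#!/bin/sh
# Sketch: one command stands in for all the EC2 networking/config work.
# mybucket, m4.large, and the counts are hypothetical placeholders.
CMD="aws emr create-cluster \
  --name spark-test --release-label emr-5.0.0 \
  --applications Name=Spark \
  --instance-type m4.large --instance-count 3 \
  --use-default-roles \
  --log-uri s3://mybucket/emr-logs/"

echo "$CMD"   # printed, not executed
```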
I have several Spark jobs on an EMR cluster using YARN that must run on a regular basis and are submitted from Jenkins. Currently the Jenkins machine SSHes into the master node on EMR, where a copy of the code sits ready in a folder to be executed. I would like to be able to clone my repo into the Jenkins workspace and submit the code from Jenkins to be executed on the cluster. Is there a simple way to do this? What is the best way to deploy Spark from Jenkins?
You can use the REST API to make HTTP requests from Jenkins to start/stop the jobs.
If you have Python available in Jenkins, implementing a script using Boto3 is a good, easy, flexible, and powerful option.
You can manage EMR (and therefore Spark) by creating the full cluster or by adding jobs to an existing one.
Also, using the same library, you can manage all AWS services.
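If the cluster already exists, the same EMR API that Boto3 wraps can also be driven from a Jenkins shell step via the AWS CLI. A hedged sketch: the cluster id, class, and S3 path are placeholders, and the command is composed and printed rather than run:

```shell
#!/bin/sh
# Sketch: submit a Spark job to an existing EMR cluster as a Step.
# j-XXXXXXXXXXXXX, com.example.App, and the S3 path are placeholders.
CMD="aws emr add-steps --cluster-id j-XXXXXXXXXXXXX \
  --steps Type=Spark,Name=MyJob,ActionOnFailure=CONTINUE,Args=[--class,com.example.App,s3://mybucket/app.jar]"

echo "$CMD"   # printed, not executed
```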
I want to be able to run Spark 2.0 and Spark 1.6.1 in cluster mode on a single cluster so that they can share resources. What are the best practices for doing this? I want to shield the set of applications that relies on 1.6.1 from code changes, while other applications rely on Spark 2.0.
Basically the cluster could rely on dynamic allocation for Spark 2.0 but maybe not for 1.6.1 - this is flexible.
This is possible by using Docker: you can run various versions of Spark applications, since Docker runs each application in isolation.
Docker is an open platform for developing, shipping, and running applications. With Docker you can separate your applications from your infrastructure and treat your infrastructure like a managed application.
The industry is adopting Docker since it provides the flexibility to run various application versions side by side on a single host, and much more.
Mesos also allows you to run Docker containers, using Marathon.
For more information, please refer to:
https://www.docker.com/
https://mesosphere.github.io/marathon/docs/native-docker.html
Hope this helps!
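The version-isolation idea can be sketched in a few lines. The image name is hypothetical, and the docker commands are collected and echoed rather than executed, so the sketch runs without a Docker daemon:

```shell
#!/bin/sh
# Sketch: Spark 1.6.1 and 2.0.0 side by side as containers, so applications
# pinned to either version never share one installation.
# myorg/spark is a hypothetical image; each master gets its own host port.
CMDS=""
port=7077
for v in 1.6.1 2.0.0; do
  CMDS="$CMDS docker run -d --name spark-$v -p $port:7077 myorg/spark:$v master;"
  port=$((port + 1))
done

echo "$CMDS"   # printed, not executed
```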
I have created an Azure HDInsight cluster using PowerShell. Now I need to install some custom software on the worker nodes that is required for the mappers I will be running using Hadoop streaming. I haven't found any PowerShell command that could help me with this task. I can prepare a custom job that will setup all the workers, but I'm not convinced that this is the best solution. Are there better options?
edit:
With AWS Elastic MapReduce there is an option to install additional software in a bootstrap action that is defined when you create a cluster. I was looking for something similar.
You can use a bootstrap action to install additional software and to change the configuration of applications on the cluster. Bootstrap actions are scripts that are run on the cluster nodes when Amazon EMR launches the cluster. They run before Hadoop starts and before the node begins processing data.
from: Create Bootstrap Actions to Install Additional Software
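For comparison, an EMR bootstrap action is just a shell script in S3 that every node runs before Hadoop starts. A hedged sketch: the package names and bucket are placeholders, and the script is only written to a local file here rather than uploaded or run:

```shell
#!/bin/sh
# Sketch: an EMR bootstrap action installing extra software on every node.
# The packages and bucket below are hypothetical placeholders.
cat > install-deps.sh <<'EOF'
#!/bin/bash
set -e
sudo yum install -y gcc             # hypothetical native dependency
sudo pip install my-mapper-deps     # hypothetical package the mappers need
EOF

# The script would then be referenced at cluster creation, e.g.:
echo "aws emr create-cluster ... --bootstrap-actions Path=s3://mybucket/install-deps.sh"
</dev/null
```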
The short answer is that you don't. It's not ideal from a caching perspective, but you ought to be able to bundle all your job dependencies into the MapReduce jar, which is distributed across the cluster for you by YARN (part of Hadoop). This is, broadly speaking, transparent to the end user, as it's all handled through the job-submission process.
If you need something large that is a shared dependency across many jobs, and you don't want it copied out every time, you can keep it on wasb:// storage and reference it in a classpath, but that might cause you complexity if you are, for instance, using the .NET Streaming API.
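To illustrate the "bundled with the job" route for a streaming job: Hadoop Streaming's -files option ships local files to every node as part of submission, instead of pre-installing them on the workers. A sketch with hypothetical script names and paths; the command is composed and printed rather than executed:

```shell
#!/bin/sh
# Sketch: ship mapper.py and its helper module with the job itself.
# mapper.py, shared_dep.py, and the wasb:// paths are placeholders.
CMD="hadoop jar hadoop-streaming.jar \
  -files mapper.py,shared_dep.py \
  -mapper 'python mapper.py' -reducer aggregate \
  -input wasb:///example/input -output wasb:///example/output"

echo "$CMD"   # printed, not executed
```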
I've just heard from a colleague that I need to update my Azure PowerShell module, because a new cmdlet, Add-AzureHDInsightScriptAction, was recently added, and it does just that.
Customize HDInsight clusters using Script Action