How to run Spark processes in develop environment using a cluster? - apache-spark

I'm implementing different Apache Spark solutions using IntelliJ IDEA, Scala and SBT. However, each time I want to run my implementation I need to go through the following steps after creating the jar:
Amazon: Send the .jar to the master node over SSH, and then run spark-shell from the command line.
Azure: I'm using the Databricks CLI, so each time I want to upload a jar, I uninstall the old library, remove the jar stored on the cluster, and finally upload and install the new .jar.
So I was wondering whether it is possible to do all of this in one click, for example using the IntelliJ IDEA Run button, or with some other method that simplifies the whole process. I was also considering Jenkins as an alternative.
Basically, I'm looking for easier deployment options.
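One common way to get close to a one-click flow is to script the packaging and upload as a custom sbt task and trigger that task from the IDE. Below is a minimal sketch, assuming an SSH-reachable master node; the host name, remote path, and main class are placeholders rather than details from the question:

// build.sbt (fragment) -- hypothetical deploy task, not an official Spark or sbt feature
import scala.sys.process._

lazy val deployToCluster = taskKey[Unit]("Package the jar, copy it to the cluster and spark-submit it")

deployToCluster := {
  val jar    = (Compile / packageBin).value        // the jar produced by `sbt package`
  val master = "hadoop@spark-master-node"          // placeholder SSH target
  val remote = s"/home/hadoop/${jar.getName}"

  // Copy the jar to the master node, then launch it with spark-submit.
  require(Seq("scp", jar.getAbsolutePath, s"$master:$remote").! == 0, "scp failed")
  require(Seq("ssh", master, s"spark-submit --class com.example.Main $remote").! == 0, "spark-submit failed")
}

Running sbt deployToCluster (or binding it to an IntelliJ sbt run configuration) would then cover the Amazon flow in a single action; the Databricks side could be scripted the same way by shelling out to the same Databricks CLI commands the question already uses.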

Related

SBT console vs Spark-Shell for interactive development

I'm wondering if there are any important differences between using SBT console and Spark-shell for interactively developing new code for a Spark project (notebooks are not really an option w/ the server firewalls).
Both can import project dependencies, but for me SBT is a little more convenient: it automatically brings in all the dependencies from build.sbt, while spark-shell can use the --jars, --packages, and --repositories arguments on the command line.
SBT has the handy initialCommands setting to automatically run lines at startup. I use this for initializing the SparkContext.
Are there others?
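For reference, here is a minimal sketch of the kind of initialCommands setup mentioned above; the Spark version and app name are illustrative, not taken from the question:

// build.sbt (fragment)
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.8"   // example version

// Run a local SparkContext automatically whenever `sbt console` starts.
console / initialCommands :=
  """
    |import org.apache.spark.{SparkConf, SparkContext}
    |val conf = new SparkConf().setMaster("local[*]").setAppName("sbt-console")
    |val sc = new SparkContext(conf)
  """.stripMargin

// Stop the context cleanly when the console session ends.
console / cleanupCommands := "sc.stop()"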
With SBT, you theoretically don't need to install Spark itself.
I use Databricks.
From my experience, sbt pulls in external jars natively, while spark-shell provides a series of imports and contexts natively. I prefer spark-shell because it follows the standard you need to adhere to when building the spark-submit session.
For running the code in production you need to build the code into jars and call them via spark-submit. To do that you need to package it via sbt (compilation check) and run the spark-submit call (logic check).
You can develop using either tool, but you should code as if you did not have the advantages of sbt (pulling in the jars) or spark-shell (providing the imports and contexts), because spark-submit does neither.
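In other words, the entry point should set up its own session and imports rather than relying on what spark-shell injects. A minimal sketch of such a spark-submit-ready main object follows; the object name and input handling are illustrative:

import org.apache.spark.sql.SparkSession

// A self-contained entry point: nothing here relies on the implicit
// `spark`/`sc` values or the imports that spark-shell provides for you.
object WordCountJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("WordCountJob")        // the master is supplied by spark-submit, not hard-coded
      .getOrCreate()
    import spark.implicits._

    val counts = spark.read.textFile(args(0))
      .flatMap(_.split("\\s+"))
      .groupBy("value")
      .count()

    counts.show()
    spark.stop()
  }
}

Packaged with sbt package and launched with spark-submit --class WordCountJob <app-jar> <input-path>, the same code runs unchanged on the cluster.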

How to get Selenium working with Jenkins2 in GCP

I'm trying to get Selenium Grid and Jenkins working together in GKE.
I found the Selenium plugin (https://plugins.jenkins.io/selenium) for Jenkins, but I'm not sure it can be used to get what I want.
I stood Jenkins up by following the steps here:
https://github.com/GoogleCloudPlatform/kube-jenkins-imager
(I changed the image for the Jenkins node to use Jenkins 2.86.)
This creates an instance of Jenkins running in kubernetes that spawns slaves into the cluster as needed.
But I don't believe that this is compatible with the Selenium plug-in. What's the best way to take what I have and get it working with this instance of Jenkins?
I was also able to get an instance of Selenium up and going in the same cluster using this:
https://gist.github.com/elsonrodriguez/261e746cf369a60a5e2d
(I dropped the version 2.x from the instances to pull in the latest containers.)
I had to bump the k8s nodes up to n1-standard-2 (2 vCPUs, 7.5 GB memory) to get those containers to run.
For this proof of concept, the SE nodes don't need to be ephemeral. But I'm unsure what kind of permanent node container image I can deploy in k8s that would have the necessary SE drivers.
On the other hand, maybe it would be easier to just use the stand-alone SE containers that I found. If so, how do I use them with Jenkins2?
Has anyone else gone down this path?
Edit: I'm not interested in third-party selenium services at this time.
SauceLabs is a selenium grid in the cloud.
I wrote Saucery to make integrating from C# or Java with NUnit2, NUnit3 or JUnit 4 easy.
You can see the source code here, here and here or take a look at the Github Pages site here for more information.
Here is what I figured out.
I saw many indications that it was a hassle to run your own instance of Selenium grid. Enough time may have passed for this to be a little easier than it used to be. There seem to be a few ways to do it.
Jenkins itself has a plugin that is supposed to turn your Jenkins cluster into a Selenium 3 grid: https://plugins.jenkins.io/selenium . The problem I had with this is that I'm planning on hosting these instances in the cloud, and I wanted the Jenkins slaves to be ephemeral. I was unable to figure out how to get the plugin to work with ephemeral slaves.
I was trying to get this done as quickly as I could, so I only spent three days total on this project.
These are the forked repos that I'm using for the Jenkins solution:
https://github.com/jnorment-q2/kube-jenkins-imager
which basically implements this:
https://github.com/jnorment-q2/continuous-deployment-on-kubernetes
I'm pointing to my own repos to give a reference to exactly what I used in late October 2017 to get this working. Those repos are forked from the main repos, and it should be easy to compare the differences.
I had contacted Google support with a question, and they responded that this link might actually be a bit clearer:
https://cloud.google.com/solutions/jenkins-on-container-engine-tutorial
From what I can tell, this is a manual version of the more automated scripts I referenced.
To stand up Selenium, I used this:
https://github.com/jnorment-q2/selenium-on-k8s
This is a project I built from a gist referenced in the Readme, which references a project maintained by SeleniumHQ.
The main trick here is that Selenium is resource hungry. I had to use the second tier of Google Compute Engine machine types in order for it to deploy in Kubernetes. I adapted the script I used to stand up Jenkins to deploy Selenium Grid in a similar fashion.
Also of note, there appear to be only Firefox and Chrome options in the project from SeleniumHQ. I have yet to determine if it is even possible to run an instance of Safari.
For now, this is what we're going to go with.
The piece left is how to make a call to the Selenium grid from Jenkins. It turns out that Selenium can be pip-installed onto the ephemeral slaves, and webdriver.Remote can be used to make the call.
Here is the demo script that I wrote to prove that everything works:
https://github.com/jnorment-q2/demo-se-webdriver-pytest/blob/master/test/testmod.py
It has a Jenkinsfile, so it should work with a fresh instance of Jenkins: just create a new pipeline, set the definition to 'Pipeline script from SCM' with Git and https://github.com/jnorment-q2/demo-se-webdriver-pytest, then scroll up, click 'run with parameters', and add the parameter SE_GRID_SERVER with the full URL (including port) of the SE grid server.
It should run three tests and fail on the third. (The third test requires additional parameters, TEST_URL and TEST_URL_TITLE.)
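For comparison, the remote-grid call that the Python demo makes (webdriver.Remote) can be sketched with the Java/Scala Selenium bindings as well; the grid URL below is a placeholder for the SE_GRID_SERVER value:

import java.net.URL
import org.openqa.selenium.chrome.ChromeOptions
import org.openqa.selenium.remote.RemoteWebDriver

object GridSmokeTest {
  def main(args: Array[String]): Unit = {
    // Point at the grid hub rather than a locally installed browser driver.
    val gridUrl = new URL("http://se-grid-server:4444/wd/hub")   // placeholder grid address
    val driver  = new RemoteWebDriver(gridUrl, new ChromeOptions())
    try {
      driver.get("https://www.example.com")
      println(s"Page title: ${driver.getTitle}")
    } finally {
      driver.quit()   // release the grid node
    }
  }
}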

Apache Spark app workflow

How do you organize your Spark development workflow?
My way:
1. Local Hadoop/YARN service.
2. Local Spark service.
3. IntelliJ on one screen.
4. Terminal with a running sbt console.
5. After I change Spark app code, I switch to the terminal and run "package" to compile the jar and "submitSpark", which is an sbt task that runs spark-submit.
6. Wait for an exception in the sbt console :)
I also tried to work with spark-shell:
1. Run the shell and load a previously written app.
2. Write a line in the shell.
3. Evaluate it.
4. If it's fine, copy it to the IDE.
5. After a few rounds of steps 2-4, paste the code into the IDE, compile the Spark app, and start again.
Is there any way to develop Spark apps faster?
I develop the core logic of our Spark jobs using an interactive environment for rapid prototyping. We use the Spark Notebook running against a development cluster for that purpose.
Once I've prototyped the logic and it's working as expected, I "industrialize" the code in a Scala project with the classical build lifecycle: create tests; build, package, and create artifacts with Jenkins.
I found that writing scripts and using :load / :paste streamlined things a bit, since I didn't need to package anything. If you do use sbt, I suggest you start it and use ~package so that it automatically packages the jar whenever changes are made. Eventually, of course, everything will end up in an application jar; this approach is for prototyping and exploring.
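A minimal sketch of that workflow; the script name and contents are illustrative:

// explore.scala -- a scratch script reloaded into spark-shell as it evolves.
// Inside spark-shell the `spark` and `sc` values already exist, so the script just uses them:
val logs   = spark.read.textFile("/tmp/sample.log")     // placeholder input path
val errors = logs.filter(_.contains("ERROR"))
println(s"error lines: ${errors.count()}")

In one terminal, sbt ~package keeps rebuilding the jar on every change; in a spark-shell session, :load explore.scala re-runs the script without restarting the shell.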
Local Spark
Vim
Spark-Shell
APIs
Console
We develop our applications using an IDE (IntelliJ, because we write our Spark applications in Scala) and use ScalaTest for testing.
In those tests we use local[*] as the Spark master in order to allow debugging.
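A minimal sketch of such a test; the class name and data are illustrative, not from the answer:

import org.apache.spark.sql.SparkSession
import org.scalatest.BeforeAndAfterAll
import org.scalatest.funsuite.AnyFunSuite

// Running against local[*] keeps everything in one JVM, so IDE breakpoints work.
class WordCountSpec extends AnyFunSuite with BeforeAndAfterAll {

  private lazy val spark = SparkSession.builder()
    .master("local[*]")
    .appName("WordCountSpec")
    .getOrCreate()

  test("counts words") {
    import spark.implicits._
    val counts = Seq("a b", "a").toDS()
      .flatMap(_.split(" "))
      .groupBy("value").count()
      .collect()
      .map(row => row.getString(0) -> row.getLong(1))
      .toMap

    assert(counts == Map("a" -> 2L, "b" -> 1L))
  }

  override def afterAll(): Unit = spark.stop()
}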
For integration testing we use Jenkins and launch an "end to end" script as a Scala application.
I hope this is useful.

JavaSparkContext - jarOfClass or jarOfObject doesnt work

Hi, I am trying to run my Spark service against a cluster. As it turns out, I have to call setJars and set my application jar there. If I do it using a physical path like the following, it works:
conf.setJars(new String[]{"/path/to/jar/Sample.jar"});
but if I try to use the JavaSparkContext (or SparkContext) API jarOfClass or jarOfObject, it doesn't work. Basically, the API can't find the jar itself.
The following returns empty:
JavaSparkContext.jarOfObject(this);
JavaSparkContext.jarOfClass(this.getClass())
It would be an excellent API, if only it worked! Is anyone else able to make use of this?
[I have included an example for Scala. I am sure it will work the same way for Java.]
It will work if you do:
SparkContext.jarOfObject(this.getClass)
Surprisingly, this works for a Scala object as well as a Scala class.
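For reference, a minimal sketch of wiring that into the SparkConf; note that depending on the Spark version jarOfClass returns an Option or a Seq, so the conversion below is illustrative:

import org.apache.spark.{SparkConf, SparkContext}

object Sample {
  def main(args: Array[String]): Unit = {
    // Resolve the jar containing this class at runtime instead of hard-coding the path.
    // This only yields a result when the class was actually loaded from a packaged jar.
    val appJar: Seq[String] = SparkContext.jarOfClass(this.getClass).toSeq

    val conf = new SparkConf()
      .setAppName("Sample")
      .setJars(appJar)

    val sc = new SparkContext(conf)
    // ... job logic ...
    sc.stop()
  }
}

As the answer below notes, this only works when the application has actually been packaged into a jar, which is exactly why it returns empty when launched straight from an IDE or sbt.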
How are you running the app? If you are running it from an IDE or a build tool such as sbt, then no jar is packaged while running. If you have packaged it once before, then /path/to/jar/Sample.jar exists, so giving the hard-coded path works, but that jar is not where the JVM running your app loaded the .class files from, which is why jarOfClass cannot find anything.

Packaging a Groovy application

I want to package a Groovy CLI application in a form that's easy to distribute, similar to what Java does with JARs. I haven't been able to find anything that seems to be able to do this. I've found a couple of things like this that are intended for one-off scripts, but nothing that can compile an entire Groovy application made up of a lot of separate Groovy files and resource data.
I don't necessarily need to have the Groovy standalone executable be a part of it (though that would be nice), and this is not a library intended to be used by other JVM languages. All I want is a simply packaged version of my application.
EDIT:
Based on the couple of responses I got, I don't think I was being clear enough about my goal. What I'm looking for is basically an archive format that Groovy can support. The goal here is to make this easier to distribute. Right now, the best way is to ZIP it up, have the user unzip it, and then modify a batch/shell file to start it. I was hoping to find a way to make this more like an executable JAR file, where the user just has to run a single file.
I know that Groovy compiles down to JVM-compatible byte-code, but I'm not trying to get this to run as Java code. I'm doing some dynamic addition of Groovy classes at runtime based on the user's configuration, and Java won't be able to handle that. As I said in the original post, having the Groovy executable included in the archive is kind of a nice-to-have. However, I do actually need Groovy to be the executable that runs, not Java.
The Gradle Cookbook shows how to make a "fat jar" from a groovy project: http://wiki.gradle.org/display/GRADLE/Cookbook#Cookbook-Creatingafatjar
This bundles up all the dependencies, including groovy. The resulting jar file can be run on the command line like:
java -jar myapp.jar
I've had a lot of success using a combination of the eclipse Fat Jar plugin and Yet Another Java Service Wrapper.
Essentially this becomes a 'Java' problem, not a Groovy problem. Fat Jar is painless to use. It might take you a couple of tries to get your single jar right, but once all the dependencies are flattened into a single jar you are off and running at the command line with
java -jar application.jar
I then wrap these jars as a service. I often develop standalone Groovy-based services that perform some task. I set them up as services on Windows Server using Yet Another Java Service Wrapper and schedule them using various techniques for interacting with Windows services.
