Compiling Spark program: no 'lib' directory - apache-spark

I am going through the tutorial:
https://www.tutorialspoint.com/apache_spark/apache_spark_deployment.htm
When I got to the "Step 2: Compile program" section I got stuck, because there is no lib folder in the Spark directory.
Where is the lib folder? How can I compile the program?
I looked into the jars folder, but there is no file named spark-assembly-1.4.0-hadoop2.6.0.jar.

I am sorry I am not answering your question directly, but I want to guide you to a more convenient development process for Spark applications. (For what it's worth, the tutorial targets an old release: since Spark 2.0 there is no lib folder or single spark-assembly jar; the individual jars live under jars/ instead.)
When you are developing a Spark application on your local computer, you should use sbt (the Scala build tool). After you are done writing code, compile it with sbt (by running sbt assembly). Sbt will produce a 'fat jar' archive that already contains all the required dependencies for the job. Then upload the jar to the Spark cluster (for example, using the spark-submit script).
There is no reason to install sbt on your cluster because it is needed only for compilation.
You should check the starter project that I created for you.
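For illustration, a minimal sketch of that workflow, assuming a Scala project; the project name, versions, and main class below are placeholders, not something taken from the tutorial:

// project/plugins.sbt -- pulls in the sbt-assembly plugin (version is an example)
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.15.0")

// build.sbt
name := "spark-word-count"
version := "0.1"
scalaVersion := "2.11.12"
// Spark is marked "provided" so it stays out of the fat jar; the cluster already ships it
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.8" % "provided"

Then build the fat jar and submit it to the cluster:

sbt assembly
spark-submit --class com.example.SparkWordCount --master spark://master:7077 target/scala-2.11/spark-word-count-assembly-0.1.jar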

Related

Set Glue Code to External Libraries in Cucumber

We have multiple testing repos, and some of the scenarios depend on the steps already created in a separate repo, so I'm trying to build the JAR file and include it in the external libraries of the other repo. Then I define my glue code in the IntelliJ runner with two separate lines:
com.edge.automation
C:\Users\MY_NAME\.m2\repository\com\reissue-automation\2.0.3-SNAPSHOT\reissue-automation-2.0.3-SNAPSHOT-tests.jar!\com.reissue.automation.stepdefinitions
IntelliJ is able to recognize the Gherkin sentence, but when I run it, it is throwing this exception:
eissueautomationstepdefinitions'
at io.cucumber.core.options.CucumberPropertiesParser.parseAll(CucumberPropertiesParser.java:156)
at io.cucumber.core.options.CucumberPropertiesParser.parse(CucumberPropertiesParser.java:88)
at io.cucumber.core.cli.Main.run(Main.java:48)
at io.cucumber.core.cli.Main.main(Main.java:33)
Does anybody know what this error means or if it's possible to include glue code from external libraries?
I ended up running the following command to copy my dependencies into the target folder, which added them to the classpath.
mvn install dependency:copy-dependencies -DskipTests
Then it picked up the glue no problem.
com.edge.automation
com.reissue.automation.stepdefinitions
If anyone has a better solution feel free to post.
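For illustration only, here is one way the copied dependencies and both glue packages can be wired together on the Cucumber CLI once dependency:copy-dependencies has run; the feature path and the ':' classpath separator (use ';' on Windows) are assumptions about the project layout:

java -cp "target/classes:target/test-classes:target/dependency/*" io.cucumber.core.cli.Main --glue com.edge.automation --glue com.reissue.automation.stepdefinitions classpath:features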
Could you add the detailed steps for how you accomplished it?
I tried running
mvn install dependency:copy-dependencies -DskipTests
And then running my Maven command to execute the test cases, but no luck.
It would be appreciated if you could add your folder structure and show how you set it as the glue code of the Cucumber runner.
The Cucumber version might also help.
Thanks

How to deploy jar dependencies (native dlls) on spark workers in Azure Databricks?

I am writing a Java wrapper, TnHandler.java, that uses JNA to call the mycustom.so native library, which has other dependency files. I export my Java app as a runnable jar and install it on an Azure Databricks cluster.
I call my jar like this from a PySpark notebook in Databricks:
from py4j.java_gateway import java_import
java_import(sc._gateway.jvm, "")
jvm = sc._gateway.jvm
java_import(jvm, "*")
foo = jvm.TnHandler()
def applyTn(s):
    return foo.dummyTn(s)
applyTn("give me $20")
I keep getting this error java.lang.UnsatisfiedLinkError: Unable to load library 'mycustom.so':
libmycustom.so: cannot open shared object file: No such file or directory
I think the reason is that the .so file and all of its dependencies are not present on the worker node where the code is being executed.
How do I ensure that the desired .so and all of its dependencies are found on the classpath of whichever node the code is being executed on?
JNA relies on the LD_LIBRARY_PATH environment variable to search for the libraries that you are trying to load.
I solved the problem by setting the LD_LIBRARY_PATH environment variable in my cluster settings.
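For reference, a hedged sketch of that setting: under the cluster's Advanced Options > Spark > Environment Variables, add a line pointing at wherever the .so and its dependencies were uploaded (the DBFS path below is only an example), then restart the cluster so the workers pick it up:

LD_LIBRARY_PATH=/dbfs/FileStore/native-libs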

Running gauge tests from jar file

I am new to the Gauge testing tool. I have a Maven project that consists of specs and step implementations. The mvn package phase does generate a jar file with all the required classes. However, I can't figure out how I can run the Gauge specs using a Main class in Java, so that I can just run the jar file to run the tests. Is this possible?
Unfortunately no, Gauge binary must be installed and available to execute the specs.
As the Gauge binary is not written in java it cannot be bundled in a jar file and invoked from a Main class.
If you'd like to automatically download and use Gauge in a CI/CD environment, try something like https://github.com/maven-download-plugin/maven-download-plugin to download gauge into a convenient location as part of your mvn build itself.
More info about this here
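As a rough, unverified sketch of that approach, a download-maven-plugin execution in the pom.xml could look like the following; the plugin version, Gauge version, and release URL are assumptions that need checking against the actual Gauge releases:

<plugin>
  <groupId>com.googlecode.maven-download-plugin</groupId>
  <artifactId>download-maven-plugin</artifactId>
  <version>1.6.8</version>
  <executions>
    <execution>
      <id>download-gauge</id>
      <phase>generate-test-resources</phase>
      <goals>
        <goal>wget</goal>
      </goals>
      <configuration>
        <!-- assumed URL pattern for the Gauge Linux release zip -->
        <url>https://github.com/getgauge/gauge/releases/download/v1.4.3/gauge-1.4.3-linux.x86_64.zip</url>
        <unpack>true</unpack>
        <outputDirectory>${project.build.directory}/gauge</outputDirectory>
      </configuration>
    </execution>
  </executions>
</plugin>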
There is a way to do this. You have to package Maven and Gauge inside the project directory and include them in the jar. In the main method, unzip all the files, then run a shell script that exports the Maven and Gauge locations in the project directory onto $PATH and executes mvn gauge:execute as usual. It's a bit of a hack, as it extracts everything to the directory in which the jar is located, but it works on RHEL 7 and I haven't managed to find a cleaner method.

Can we run spark on mesos using the precompiled hadoop-spark package?

I have a Mesos Cluster on which I want to run Spark jobs.
I have downloaded the spark precompiled package and I can use the spark-shell by simply decompressing the archive.
So far, I haven't managed to run spark jobs on the Mesos Cluster.
First question: do I need to build Spark from source to get it to work on Mesos? Or is this precompiled package usable only for Spark on YARN and Hadoop?
Second question: can anyone suggest the best way to build Spark? I have found several ways, such as:
sbt clean assembly
./build/mvn -Pmesos -DskipTests clean package
./build/sbt package
I don't know which one to use, and whether they are all correct or not.

How to use different Spark version (Spark 2.4) on YARN cluster deployed with Spark 2.1?

I have a Hortonworks yarn cluster with Spark 2.1.
However I want to run my application with spark 2.3+ (because an essential third-party ML library in use needs it).
Do we have to use spark-submit from the Spark 2.1 version, or do we have to submit the job to YARN using Java or Scala with a fat jar? Is this even possible? What about the Hadoop libraries?
On a Hortonworks cluster, running a custom spark version in yarn client/cluster mode needs following steps:
Download the Spark prebuilt package with the appropriate Hadoop version.
Extract and unpack it into a spark folder, e.g. /home/centos/spark/spark-2.3.1-bin-hadoop2.7/
Copy the jersey-bundle 1.19.1 jar into the spark jars folder.
Create a zip file containing all the jars in the spark jars folder, e.g. spark-jars.zip.
Put this spark-jars.zip file in a world-accessible HDFS location (hdfs dfs -put spark-jars.zip /user/centos/data/spark/).
Get the HDP version (hdp-select status hadoop-client); example output: hadoop-client - 3.0.1.0-187
Use the above HDP version in the export commands below:
export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-/usr/hdp/3.0.1.0-187/hadoop/conf}
export HADOOP_HOME=${HADOOP_HOME:-/usr/hdp/3.0.1.0-187/hadoop}
export SPARK_HOME=/home/centos/spark/spark-2.3.1-bin-hadoop2.7/
Edit the spark-defaults.conf file in the spark_home/conf directory and add the following entries:
spark.driver.extraJavaOptions -Dhdp.version=3.0.1.0-187
spark.yarn.am.extraJavaOptions -Dhdp.version=3.0.1.0-187
Create a java-opts file in the spark_home/conf directory and add the entry below, using the above-mentioned HDP version:
-Dhdp.version=3.0.1.0-187
export LD_LIBRARY_PATH=/usr/hdp/3.0.1.0-187/hadoop/lib/native:/usr/hdp/3.0.1.0-187/hadoop/lib/native/Linux-amd64-64
spark-shell --master yarn --deploy-mode client --conf spark.yarn.archive=hdfs:///user/centos/data/spark/spark-jars.zip
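To submit an application (rather than a shell) under the same setup, a spark-submit call along the same lines should work; the class and jar names here are placeholders:

spark-submit --master yarn --deploy-mode cluster --conf spark.yarn.archive=hdfs:///user/centos/data/spark/spark-jars.zip --class com.example.MyApp my-app-assembly.jar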
I assume you use sbt as the build tool in your project. The project itself could use Java or Scala. I also think that the answer in general would be similar if you used gradle or maven, but the plugins would simply be different. The idea is the same.
You have to use an assembly plugin (e.g. sbt-assembly) that is going to bundle all non-Provided dependencies together, including Apache Spark, in order to create a so-called fat jar or uber-jar.
If the custom Apache Spark version is part of the application jar, that version is going to be used regardless of which spark-submit you use for deployment. The trick is to make the classloader load the jars and classes of your choice rather than spark-submit's (and hence whatever is installed on the cluster).
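As a rough illustration of that setup, a build.sbt sketch could look like the following; the Spark/Scala versions and the merge strategy are assumptions to adapt to your project:

// project/plugins.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.15.0")

// build.sbt
name := "my-spark-app"
scalaVersion := "2.11.12"
// Spark is deliberately NOT marked "provided", so the 2.4.x jars end up inside the fat jar
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.4.8",
  "org.apache.spark" %% "spark-sql"  % "2.4.8"
)
// bundling Spark brings in many duplicate META-INF entries; discard them
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", _*) => MergeStrategy.discard
  case _                        => MergeStrategy.first
}

Running sbt assembly then produces the uber-jar that you pass to the cluster's existing spark-submit.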
