How to configure spark.io.compression.codec=lzf in Spark - apache-spark

How do I configure spark.io.compression.codec=lzf in Spark?
Usually I use spark-submit to run our driver class, like below:
./spark-submit --master spark://testserver:7077 --class com.spark.test.SparkTest --conf "spark.io.compression.codec=lzf" /tmp/test/target/test.jar
So I can set spark.io.compression.codec=lzf on the command line. But I don't want to use spark-submit to run our driver class; I want to run it in a spark-job-server. How do I configure this in spark-job-server? Thanks.
I tried to set it in environment variables, but it doesn't work. I also tried the code below; it still doesn't work.
SparkConf sparkConf = new SparkConf()
    .setMaster("spark://testserver:7077")
    .setAppName("Javasparksqltest")
    .set("spark.executor.memory", "8g")
    .set("spark.io.compression.codec", "lzf");

You can pass that option to spark-submit or spark-shell by putting it in the conf/spark-defaults.conf associated with it. The details are in the configuration section of the docs.
For the spark-jobserver, you configure a given context, especially if it is a context implicitly created for a job. There are several ways to do so (the gist of it being that settings are nested under spark.context-settings), but the "Context configuration" section of the README.md details how to do it:
https://github.com/spark-jobserver/spark-jobserver/blob/master/README.md
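As a rough sketch, and assuming your job server version passes spark.* keys under context-settings straight through to the SparkConf (check the README above for the exact keys your version supports), the job server's .conf file could look like:
spark {
  context-settings {
    num-cpu-cores = 2
    memory-per-node = 512m
    spark.io.compression.codec = lzf
  }
}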

Use the complete class name "org.apache.spark.io.LZFCompressionCodec" instead of "lzf".
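Applied to the SparkConf from the question, that would look like:
sparkConf.set("spark.io.compression.codec", "org.apache.spark.io.LZFCompressionCodec");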

Related

Changing of tmp directory not working in Spark

I wanted to change the tmp directory used by Spark, so I had something like this in my spark-submit:
spark-submit <other parameters> --conf "spark.local.dir=<somedirectory>" <other parameters>
But I am noticing that it has no effect, as Spark still uses the default tmp directory. What am I doing wrong here?
By the way, I am using Spark's standalone cluster.
From https://spark.apache.org/docs/2.1.0/configuration.html:
"In Spark 1.0 and later, spark.local.dir will be overridden by the SPARK_LOCAL_DIRS (Standalone, Mesos) or LOCAL_DIRS (YARN) environment variables set by the cluster manager."
OK, it looks like this option is deprecated. One method that works is to change the value of SPARK_LOCAL_DIRS in spark-env.sh, for example like this:
SPARK_LOCAL_DIRS="/data/tmp/spark"

Submitting application on Spark Cluster using spark submit

I am new to Spark.
I want to run a Spark Structured Streaming application on a cluster.
The master and workers have the same configuration.
I have a few queries about submitting the app on the cluster using spark-submit:
You may find them comical or strange.
How can I give the path for 3rd-party jars like lib/*? (The application has 30+ jars.)
Will Spark automatically distribute the application and required jars to the workers?
Does it require hosting the application on all the workers?
How can I know the status of my application, as I am working on the console?
I am using the following script for spark-submit:
spark-submit \
  --class <class-name> \
  --master spark://master:7077 \
  --deploy-mode cluster \
  --supervise \
  --conf spark.driver.extraClassPath <jar1, jar2..jarn> \
  --executor-memory 4G \
  --total-executor-cores 8 \
  <running-jar-file>
But the code is not running as expected.
Am I missing something?
To pass multiple jar files to spark-submit, you can set the following attributes in the file SPARK_HOME_PATH/conf/spark-defaults.conf (create it if it does not exist):
Don't forget to use * at the end of the paths.
spark.driver.extraClassPath /fullpath/to/jar/folder/*
spark.executor.extraClassPath /fullpathto/jar/folder/*
Spark will pick up the attributes from spark-defaults.conf when you use the spark-submit command.
Copy your jar files into that directory, and when you submit your Spark app to the cluster, the jar files in the specified paths will be loaded too.
spark.driver.extraClassPath: Extra classpath entries to prepend
to the classpath of the driver. Note: In client mode, this config
must not be set through the SparkConf directly in your application,
because the driver JVM has already started at that point. Instead,
please set this through the --driver-class-path command line option or
in your default properties file.
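For example, in client mode the same folder could be passed on the command line instead (the folder path below is the illustrative one from above):
spark-submit --driver-class-path "/fullpath/to/jar/folder/*" --class <class-name> <running-jar-file>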
--jars will transfer your jar files to the worker nodes and make them available on both the driver's and the executors' classpaths.
Please refer to the link below for more details.
http://spark.apache.org/docs/latest/submitting-applications.html#advanced-dependency-management
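For example, the jars are listed comma-separated (the jar names below are only illustrative):
spark-submit \
  --class <class-name> \
  --master spark://master:7077 \
  --deploy-mode cluster \
  --jars /fullpath/to/jar/folder/lib1.jar,/fullpath/to/jar/folder/lib2.jar \
  <running-jar-file>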
You can also make a fat jar containing all the dependencies. The link below helps you understand that approach.
https://community.hortonworks.com/articles/43886/creating-fat-jars-for-spark-kafka-streaming-using.html

Running Spark Job on Zeppelin

I have written a custom Spark library in Scala. I am able to run it successfully as a spark-submit step by spawning the cluster and running the following commands. Here I first fetch my 2 jars:
aws s3 cp s3://jars/RedshiftJDBC42-1.2.10.1009.jar .
aws s3 cp s3://jars/CustomJar .
and then I run my Spark job as:
spark-submit --deploy-mode client --jars RedshiftJDBC42-1.2.10.1009.jar --packages com.databricks:spark-redshift_2.11:3.0.0-preview1,com.databricks:spark-avro_2.11:3.2.0 --class com.activities.CustomObject CustomJar.jar
This runs my CustomObject successfully. I want to run a similar thing in Zeppelin, but I do not know how to add the jars and then run a spark-submit step.
You can add these dependencies to the Spark interpreter within Zeppelin:
Go to "Interpreter"
Choose edit and add the jar file
Restart the interpreter
EDIT
You might also want to use the %dep paragraph to access the z variable (which is an implicit Zeppelin context) and do something like this:
%dep
z.load("/some_absolute_path/myjar.jar")
It depends on how you run Spark. Most of the time, the Zeppelin interpreter will embed the Spark driver.
The solution is to configure the Zeppelin interpreter instead:
ZEPPELIN_INTP_JAVA_OPTS will configure java options
SPARK_SUBMIT_OPTIONS will configure spark options
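For instance, reusing the jars and packages from the question (the path is illustrative), conf/zeppelin-env.sh could contain something like the following, after which Zeppelin needs a restart:
export SPARK_SUBMIT_OPTIONS="--jars /some_absolute_path/RedshiftJDBC42-1.2.10.1009.jar --packages com.databricks:spark-redshift_2.11:3.0.0-preview1,com.databricks:spark-avro_2.11:3.2.0"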

can't add alluxio.security.login.username to spark-submit

I have a spark driver program which I'm trying to set the alluxio user for.
I read this post: How to pass -D parameter or environment variable to Spark job? and although helpful, none of the methods in there seem to do the trick.
My environment:
- Spark-2.2
- Alluxio-1.4
- packaged jar passed to spark-submit
The spark-submit job is being run as root (under supervisor), and alluxio only recognizes this user.
Here's where I've tried adding "-Dalluxio.security.login.username=alluxio":
spark.driver.extraJavaOptions in spark-defaults.conf
on the command line for spark-submit (using --conf)
within the sparkservices conf file of my jar application
within a new file called "alluxio-site.properties" in my jar application
None of these sets the user for alluxio, though I'm easily able to set this property in a different (non-Spark) client application that is also writing to alluxio.
Anyone able to make this setting apply in spark-submit jobs?
If spark-submit is in client mode, you should use --driver-java-options instead of --conf spark.driver.extraJavaOptions=... in order for the driver JVM to be started with the desired options. Therefore your command would look something like:
./bin/spark-submit ... --driver-java-options "-Dalluxio.security.login.username=alluxio" ...
This should start the driver with the desired Java options.
If the Spark executors also need the option, you can set that with:
--conf "spark.executor.extraJavaOptions=-Dalluxio.security.login.username=alluxio"

replace default application.conf file in spark-submit

My code works like this:
val config = ConfigFactory.load
It gets the key-value pairs from application.conf by default. Then I use -Dconfig.file= to point to another conf file.
It works fine for command below:
dse -u cassandra -p cassandra spark-submit \
  --class packagename.classname \
  --driver-java-options -Dconfig.file=/home/userconfig.conf \
  /home/user-jar-with-dependencies.jar
But now I need to split userconfig.conf into 2 files. I tried the command below, but it doesn't work.
dse -u cassandra -p cassandra spark-submit \
  --class packagename.classname \
  --driver-java-options -Dconfig.file=/home/userconfig.conf,env.conf \
  /home/user-jar-with-dependencies.jar
By default Spark will look in spark-defaults.conf, but you can 1) specify another file using --properties-file, 2) pass individual key-value properties using --conf, or 3) set up the configuration programmatically in your code using the SparkConf object; a rough sketch of all three is shown below.
Does this help or are you looking for the akka application.conf file?
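As a minimal sketch of those three options (the spark-overrides.conf file name and the memory value are illustrative; the class and jar names are taken from the question):
1) spark-submit --properties-file /home/spark-overrides.conf --class packagename.classname /home/user-jar-with-dependencies.jar
2) spark-submit --conf spark.executor.memory=4g --class packagename.classname /home/user-jar-with-dependencies.jar
3) val conf = new SparkConf().set("spark.executor.memory", "4g")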
