Metrics System not recognizing Custom Source/Sink in application jar

Follow-up from an earlier question.
I've added a custom source and sink to my application jar and found a way to get a static, fixed metrics.properties onto the standalone cluster nodes. When I launch my application, I pass the static path - spark.metrics.conf="/fixed-path/to/metrics.properties". Despite my custom source/sink being in my code/fat-jar, I get a ClassNotFoundException for CustomSink.
My fat-jar (with the custom source/sink code in it) is on HDFS with read access for all.
So here's everything I've already tried setting (since executors can't find the custom source/sink in my application fat-jar):
spark.executor.extraClassPath = hdfs://path/to/fat-jar
spark.executor.extraClassPath = fat-jar-name.jar
spark.executor.extraClassPath = ./fat-jar-name.jar
spark.executor.extraClassPath = ./
spark.executor.extraClassPath = /dir/on/cluster/* (the * doesn't reach file level anyway; there are further subdirectories, and I have no way of knowing the random application-id or driver-id needed to build an absolute path before launching the app)
It seems like this is how executors get initialized in this case (please correct me if I am wrong):
The driver announces the jar location - hdfs://../fat-jar.jar - along with properties like spark.executor.memory, etc.
N executors spin up on the cluster (depending on configuration).
Each executor starts downloading hdfs://../fat-jar.jar but initializes its metrics system in the meantime (? - not sure of this step).
The metrics system looks for the custom source/sink classes - since they're named in metrics.properties - even before the fat-jar (which actually contains those classes) has finished downloading (this is my hypothesis).
ClassNotFoundException - CustomSink not found!
Is my understanding correct? Moreover, is there anything else I can try? If anyone has experience with custom sources/sinks, any help would be appreciated.

I stumbled upon the same ClassNotFoundException when I needed to extend the existing GraphiteSink class, and here's how I was able to solve it.
First, I created a CustomGraphiteSink class in the org.apache.spark.metrics.sink package:
package org.apache.spark.metrics.sink;
// note: GraphiteSink has no no-arg constructor; its (Properties, MetricRegistry,
// SecurityManager) constructor must be forwarded to super (signature varies by Spark version)
public class CustomGraphiteSink extends GraphiteSink { /* constructor forwarding elided */ }
Then I specified the class in metrics.properties
*.sink.graphite.class=org.apache.spark.metrics.sink.CustomGraphiteSink
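For a sink that actually ships data, the usual Graphite connection properties go in the same file; a sketch with illustrative values:
*.sink.graphite.host=graphite.example.com
*.sink.graphite.port=2003
*.sink.graphite.period=10
*.sink.graphite.unit=seconds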
And passed this file to spark-submit via:
--conf spark.metrics.conf=metrics.properties

In order to use a custom source/sink, one has to distribute it using spark-submit --files and set it via spark.executor.extraClassPath, for example:
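A sketch of the full launch command (the class and jar names are hypothetical; on a standalone cluster the jar must be reachable from every worker):
spark-submit \
  --class com.example.MyApp \
  --files /fixed-path/to/metrics.properties \
  --conf spark.metrics.conf=metrics.properties \
  --conf spark.executor.extraClassPath=fat-jar-name.jar \
  hdfs://path/to/fat-jar-name.jar
With --files, metrics.properties is copied into each executor's working directory; the extraClassPath entry here assumes the application jar is fetched into that same directory under its plain file name.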

How can I install flashtext on every executor?

I am using the flashtext library in a couple of UDFs. It works when I run it locally in client mode, but once I try to run it in the Cloudera Workbench with several executors, I get a ModuleNotFoundError.
After some research I found that it is possible to add archives (and packages?) to a SparkSession when creating it, so I tried:
SparkSession.builder.config('spark.archives', 'flashtext-2.7-pyh9f0a1d_0.tar.gz')
but it didn't help; the same error remains.
According to the Spark Configuration doc, there are other configs I could try, e.g. spark.submit.pyFiles, but I don't understand what these py-files to be added would have to look like.
Would it be enough to just create a Python script with this content?
from flashtext import KeywordProcessor
Could you tell me the easiest way how I can install flashtext on every node?
Edit:
In the meantime, I figured out that not only Flashtext was causing issues, but also every relative import from other scripts that I intended to use in a UDF. In order to fix it, I followed this article. I also took the source code from Flashtext and imported it into the main file without installing the actual library.
I think that in order to point Spark executors to the Python modules extracted from your archive, you will need to add another config setting that adds their location to PYTHONPATH. Something like this:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .config('spark.archives', 'flashtext-2.7-pyh9f0a1d_0.tar.gz#myUDFs') \
    .config('spark.executorEnv.PYTHONPATH', './myUDFs') \
    .getOrCreate()
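The #myUDFs fragment tells Spark to extract the archive into a directory named myUDFs in each executor's working directory, which is why PYTHONPATH points at ./myUDFs.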
Citing from the same link you have in the question:
spark.executorEnv.[EnvironmentVariableName]...Add the environment
variable specified by EnvironmentVariableName to the Executor process.
The user can specify multiple of these to set multiple environment
variables.
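As for the spark.submit.pyFiles route you asked about, a minimal sketch (deps.zip is a hypothetical archive containing the flashtext package and your own modules at its top level):
from pyspark.sql import SparkSession

# option A: declare the zip before the session starts
spark = SparkSession.builder \
    .config('spark.submit.pyFiles', 'deps.zip') \
    .getOrCreate()

# option B: ship it from an already-running session
spark.sparkContext.addPyFile('deps.zip')
Either way, the archive's contents become importable inside UDFs running on the executors.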
There are no environment details in your question (or I'm simply not familiar with Cloudera Workbench), but if you're trying to run Spark on YARN, you may need to use the slightly different setting spark.yarn.dist.archives, for example:
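A sketch with the same hypothetical archive as above:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .config('spark.yarn.dist.archives', 'flashtext-2.7-pyh9f0a1d_0.tar.gz#myUDFs') \
    .config('spark.executorEnv.PYTHONPATH', './myUDFs') \
    .getOrCreate()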
Also, please make sure that your driver log contains a message confirming that the archive was actually uploaded, as in:
:
22/11/08 INFO yarn.Client: Uploading resource file:/absolute/path/to/your/archive.zip -> hdfs://nameservice/user/<your-user-id>/.sparkStaging/<application-id>/archive.zip
:

Cannot modify the value of a Spark config: spark.executor.instances

I am using Spark 3.0 and I am setting the following parameters:
spark.conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
spark.conf.set("fs.s3a.fast.upload.buffer", "bytebuffer")
spark.conf.set("spark.sql.files.maxPartitionBytes",134217728)
spark.conf.set("spark.executor.instances", 4)
spark.conf.set("spark.executor.memory", 3)
Error:
pyspark.sql.utils.AnalysisException: Cannot modify the value of a Spark config: spark.executor.instances
I DON'T want to pass it through spark-submit, as this is a pytest case that I am writing.
How do I get around this?
According to the official Spark documentation, the spark.executor.instances property may not take effect when set programmatically through SparkConf at runtime, so it is suggested to set it through a configuration file or spark-submit command-line options:
Spark properties mainly can be divided into two kinds: one is related
to deploy, like “spark.driver.memory”, “spark.executor.instances”,
this kind of properties may not be affected when setting
programmatically through SparkConf in runtime, or the behavior is
depending on which cluster manager and deploy mode you choose, so it
would be suggested to set through configuration file or spark-submit
command line options; another is mainly related to Spark runtime
control, like “spark.task.maxFailures”, this kind of properties can be
set in either way.
You can try adding those options to PYSPARK_SUBMIT_ARGS before initializing the SparkContext; its syntax is similar to spark-submit's.
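A minimal sketch (the values are illustrative; the trailing pyspark-shell token is required):
import os

# must be set before any SparkContext/SparkSession is created in this process
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--conf spark.executor.instances=4 '
    '--conf spark.executor.memory=3g '
    'pyspark-shell'
)

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()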

Cannot create a Dataproc cluster when setting the fs.defaultFS property?

This was already the object of discussion in a previous post; however, I'm not convinced by the answers, as the Google docs state that it is possible to create a cluster with the fs.defaultFS property set. Moreover, even if it is possible to set this property programmatically, it is sometimes more convenient to set it from the command line.
So I wanted to know why the following option, when passed to my cluster creation command, does not work: --properties core:fs.defaultFS=gs://my-bucket? Please note I haven't included all the parameters, as the command succeeded in creating the cluster without this flag; however, when passing it, I get: "failed: Cannot start master: Insufficient number of DataNodes reporting."
If anyone has managed to create a Dataproc cluster by setting fs.defaultFS, that would be great to know. Thanks.
It's true that there are still known issues due to certain dependencies on actual HDFS. The docs were not intended to imply that setting fs.defaultFS to a GCS path at cluster-creation time would work, but simply to provide a convenient example of a property that appears in core-site.xml; in theory it would work to set fs.defaultFS to a different preexisting HDFS cluster, for example. I've filed a ticket to change the example in the documentation to avoid confusion.
Two options:
Just override fs.defaultFS at job-submission time using per-job properties (see the sketch after the bdconfig example below)
Workaround some of the known issues by setting fs.defaultFS explicitly using an initialization action instead of cluster properties.
Option 1 is better understood to work because cluster-level HDFS dependencies won't change. Option 2 works because most of the incompatibilities occur during initial startup only, and initialization actions run after the relevant daemons have already started. To override the setting in an init action, you'd use bdconfig:
bdconfig set_property \
--name 'fs.defaultFS' \
--value 'gs://my-bucket' \
--configuration_file /etc/hadoop/conf/core-site.xml \
--clobber
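For option 1, a per-job override can also be expressed from inside the Spark job itself through the spark.hadoop.* prefix (a sketch; the bucket name is hypothetical):
from pyspark.sql import SparkSession

# spark.hadoop.* entries are copied into the job's Hadoop Configuration,
# so fs.defaultFS is overridden for this job only and cluster HDFS is untouched
spark = SparkSession.builder \
    .config('spark.hadoop.fs.defaultFS', 'gs://my-bucket') \
    .getOrCreate()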

Spark saveAsTextFile is really strange in the Java API; it just doesn't work right in my program

I am new to Spark and hit this problem when I ran my test program. I installed Spark on a Linux server, and it has just one master node and one worker node. Then I wrote a test program on my laptop, with code like this:
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

JavaSparkContext ct = new JavaSparkContext("spark://192.168.90.74:7077", "test",
        "/home/webuser/spark/spark-1.5.2-bin-hadoop2.4", new String[0]);
ct.addJar("/home/webuser/java.spark.test-0.0.1-SNAPSHOT-jar-with-dependencies.jar");
List<Integer> list = new ArrayList<>();
list.add(1);
list.add(6);
list.add(9);
JavaRDD<Integer> rdd = ct.parallelize(list);
System.out.println(rdd.collect());
rdd.saveAsTextFile("/home/webuser/temp");
ct.close();
I supposed I would get /home/webuser/temp on my server, but in fact this program creates c://home/webuser/temp on my laptop, whose OS is Windows 8. I don't understand;
shouldn't saveAsTextFile() run on Spark's worker nodes? Why does it run on my laptop, which is Spark's driver, I suppose?
It depends on which filesystem is the default for your Spark installation. From what you're saying, your default filesystem is file:///, which is the default when no Hadoop configuration is present. To change this, you need to modify the fs.defaultFS property in core-site.xml of your Hadoop configuration. Otherwise, you can simply change your code and specify the filesystem URL in the code, i.e.:
rdd.saveAsTextFile("hdfs://192.168.90.74/home/webuser/temp");
if 192.168.90.74 is your NameNode.
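For reference, the corresponding core-site.xml entry would look something like this (a sketch; 8020 is the conventional NameNode RPC port, adjust to your setup):
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://192.168.90.74:8020</value>
</property>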

When to use SPARK_CLASSPATH or SparkContext.addJar

I'm using a standalone spark cluster, one master and 2 workers.
I really don't understand how to use SPARK_CLASSPATH or SparkContext.addJar wisely. I tried both, and it looks like addJar doesn't work the way I believed it did.
In my case, I tried to use some Joda-Time functions, in closures and outside them. If I set SPARK_CLASSPATH with a path to the joda-time jar, everything works fine. But if I remove SPARK_CLASSPATH and add in my program:
JavaSparkContext sc = new JavaSparkContext("spark://localhost:7077", "name", "path-to-spark-home", "path-to-the-job-jar");
sc.addJar("path-to-joda-jar");
it doesn't work anymore, although in the logs I can see:
14/03/17 15:32:57 INFO SparkContext: Added JAR /home/hduser/projects/joda-time-2.1.jar at http://127.0.0.1:46388/jars/joda-time-2.1.jar with timestamp 1395066777041
and immediately after:
Caused by: java.lang.NoClassDefFoundError: org/joda/time/DateTime
at com.xxx.sparkjava1.SimpleApp.main(SimpleApp.java:57)
... 6 more
Caused by: java.lang.ClassNotFoundException: org.joda.time.DateTime
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
I used to suppose that SPARK_CLASSPATH set the classpath for the driver part of the job, and SparkContext.addJar set the classpath for the executors, but that no longer seems right.
Does anyone know better than me?
SparkContext.addJar is broken in 0.9, as is the ADD_JARS environment variable. They used to work as documented in 0.8.x, and the fix is already committed to master, so it's expected in the next release. For now you can either use the workaround described in the Jira issue or make a patched Spark build.
See relevant mailing list discussion: http://mail-archives.apache.org/mod_mbox/spark-user/201402.mbox/%3C5234E529519F4320A322B80FBCF5BDA6#gmail.com%3E
Jira issue: https://spark-project.atlassian.net/plugins/servlet/mobile#issue/SPARK-1089
SPARK_CLASSPATH has been deprecated since Spark 1.0. You can add jars to the classpath programmatically, inside the spark-defaults.conf file, or with spark-submit flags, for example:
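A sketch of the spark-defaults.conf route (the jar path is hypothetical; spark.jars distributes the jar to executors, while the extraClassPath entries expect it to already exist at that path on each node):
spark.jars                      /path/to/joda-time-2.1.jar
spark.driver.extraClassPath     /path/to/joda-time-2.1.jar
spark.executor.extraClassPath   /path/to/joda-time-2.1.jar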
See also: Add jars to a Spark Job - spark-submit
