Apache Hudi DeltaStreamer throws Exception in thread "main" org.apache.hudi.com.beust.jcommander.ParameterException: no main parameter was defined

Versions: Apache Hudi 0.6.1, Spark 2.4.6.
Below is the standard spark-submit command for the Hudi DeltaStreamer. It fails with "no main parameter was defined", even though, as far as I can see, all the required parameters are given. I would appreciate any help with this error.
[hadoop@ip-00-00-00-00 target]$ spark-submit --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer `ls /mnt/hudi/packaging/hudi-utilities-bundle/target/hudi-utilities-bundle_2.11-0.6.1-SNAPSHOT.jar` --master yarn --deploy-mode client --storage-type COPY_ON_WRITE --source-class org.apache.hudi.utilities.sources.JsonKafkaSource --source-ordering-field ts --target-base-path /user/hive/warehouse/stock_ticks_cow --target-table stock_ticks_cow --props /var/demo/config/kafka-source.properties --schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider
20/09/08 05:14:46 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Exception in thread "main" org.apache.hudi.com.beust.jcommander.ParameterException: Was passed main parameter '--master' but no main parameter was defined in your arg class
at org.apache.hudi.com.beust.jcommander.JCommander.initMainParameterValue(JCommander.java:936)
at org.apache.hudi.com.beust.jcommander.JCommander.parseValues(JCommander.java:752)
at org.apache.hudi.com.beust.jcommander.JCommander.parse(JCommander.java:340)
at org.apache.hudi.com.beust.jcommander.JCommander.parse(JCommander.java:319)
at org.apache.hudi.com.beust.jcommander.JCommander.<init>(JCommander.java:240)
at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.getConfig(HoodieDeltaStreamer.java:445)
at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:454)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
20/09/08 05:14:46 INFO util.ShutdownHookManager: Shutdown hook called
20/09/08 05:14:46 INFO util.ShutdownHookManager: Deleting directory /mnt/tmp/spark-3ad6af85-94be-4117-a479-53423a91fd75

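I think the problem was that the spark-submit options and the application's own options were getting mixed up: spark-submit passes everything after the application jar to the main class, so --master ended up being handed to HoodieDeltaStreamer. I put the spark-submit options before the jar and the DeltaStreamer options after it, as shown below, and it worked: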
spark-submit \
--jars "/mnt/hudi/packaging/hudi-utilities-bundle/target/hudi-utilities-bundle_2.11-0.6.1-SNAPSHOT.jar" \
--deploy-mode "client" \
--class "org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer" \
/mnt/hudi/packaging/hudi-utilities-bundle/target/hudi-utilities-bundle_2.11-0.6.1-SNAPSHOT.jar \
--props /var/demo/config/kafka-source.properties \
--table-type COPY_ON_WRITE \
--source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
--source-ordering-field ts \
--target-base-path /user/hive/warehouse/stock_ticks_cow \
--target-table stock_ticks_cow \
--schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider
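
The ordering can be sketched in Python. This is only an illustration, not part of Hudi or Spark: build_spark_submit is a hypothetical helper that assembles an argument list, keeping spark-submit's own options before the application jar and the application's options after it, since spark-submit hands everything after the jar to the main class. The paths and values are taken from the commands above.

```python
from typing import Dict, List


def build_spark_submit(
    spark_opts: Dict[str, str],
    app_jar: str,
    app_opts: Dict[str, str],
) -> List[str]:
    """Assemble a spark-submit argv (hypothetical helper, for illustration only).

    Everything placed after the application jar is passed to the main
    class, so spark-submit's own options have to come before the jar.
    """
    argv = ["spark-submit"]
    for flag, value in spark_opts.items():
        argv += [flag, value]
    argv.append(app_jar)
    for flag, value in app_opts.items():
        argv += [flag, value]
    return argv


jar = (
    "/mnt/hudi/packaging/hudi-utilities-bundle/target/"
    "hudi-utilities-bundle_2.11-0.6.1-SNAPSHOT.jar"
)
cmd = build_spark_submit(
    spark_opts={
        "--master": "yarn",
        "--deploy-mode": "client",
        "--class": "org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer",
    },
    app_jar=jar,
    app_opts={
        "--table-type": "COPY_ON_WRITE",
        "--props": "/var/demo/config/kafka-source.properties",
    },
)
print(" ".join(cmd))
```

Keeping the two groups in separate dictionaries makes it hard to put a spark-submit flag such as --master after the jar by accident.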

Related

spark.driver.extraLibraryPath override original library path

I have a Spark job running on an AWS EMR cluster that needs access to a native library (*.so). Per Spark's documentation (https://spark.apache.org/docs/2.3.0/configuration.html), I need to add the "spark.driver.extraLibraryPath" and "spark.executor.extraLibraryPath" options to the spark-submit command line:
spark-submit \
--class test.Clustering \
--conf spark.executor.extraLibraryPath="/opt/test/lib/native" \
--conf spark.driver.extraLibraryPath="/opt/test/lib/native" \
--master yarn \
--deploy-mode client \
s3-etl-prepare-1.0-SNAPSHOT-jar-with-dependencies.jar "$@"
This works as expected and the native library is loaded. The problem is that during the Spark job I need to run a distributed LZO indexer MR job, which needs the LZO native library, and the LZO code cannot load the native GPL library:
21/06/16 09:49:09 ERROR GPLNativeCodeLoader: Could not load native gpl library
java.lang.UnsatisfiedLinkError: no gplcompression in java.library.path
at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1860)
at java.lang.Runtime.loadLibrary0(Runtime.java:870)
at java.lang.System.loadLibrary(System.java:1124)
at com.hadoop.compression.lzo.GPLNativeCodeLoader.<clinit>(GPLNativeCodeLoader.java:32)
at com.hadoop.compression.lzo.LzoCodec.<clinit>(LzoCodec.java:71)
at com.hadoop.compression.lzo.DistributedLzoIndexer.<init>(DistributedLzoIndexer.java:28)
at test.misc.FileHelper.distributIndexLzoFile(FileHelper.scala:260)
at test.scalaapp.Clustering$.main(Clustering.scala:66)
at test.scalaapp.Clustering.main(Clustering.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:853)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:928)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:937)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
It seems that the "spark.driver.extraLibraryPath" option overrides the whole library path rather than appending to it. How can I keep both the GPL LZO native path and my own library path?
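
One possible approach, offered as an assumption rather than a confirmed fix: a value passed with --conf takes precedence over the value for the same key in spark-defaults.conf, so the directories already configured there have to be repeated alongside the new one. The sketch below reads the current value from the text of a spark-defaults.conf file and appends the extra directory. The helper name merged_library_path and the sample default paths are hypothetical, and the parser assumes whitespace-separated key/value lines; check the actual file on the cluster.

```python
def merged_library_path(defaults_text: str, key: str, extra_dir: str) -> str:
    """Return the value configured for `key` with `extra_dir` appended.

    Hypothetical helper: assumes each setting is written as
    "<key><whitespace><value>" and that directories are separated by ":".
    """
    existing = ""
    for line in defaults_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and comments
        parts = stripped.split(None, 1)
        if len(parts) == 2 and parts[0] == key:
            existing = parts[1].strip()
    dirs = [d for d in existing.split(":") if d]
    if extra_dir not in dirs:
        dirs.append(extra_dir)
    return ":".join(dirs)


# Sample content only; the real defaults on a given cluster may differ.
sample_defaults = """
# hypothetical spark-defaults.conf excerpt
spark.driver.extraLibraryPath    /usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native
"""

value = merged_library_path(
    sample_defaults, "spark.driver.extraLibraryPath", "/opt/test/lib/native"
)
print(f'--conf spark.driver.extraLibraryPath="{value}"')
```

The same call with "spark.executor.extraLibraryPath" as the key would produce the executor-side option.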

Reading data from S3 using pyspark throws java.lang.NumberFormatException: For input string: "100M"

I am using the following code to read some json data from S3:
df = spark_sql_context.read.json("s3a://test_bucket/test.json")
df.show()
The above code throws the following exception:
py4j.protocol.Py4JJavaError: An error occurred while calling o64.json.
: java.lang.NumberFormatException: For input string: "100M"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Long.parseLong(Long.java:589)
at java.lang.Long.parseLong(Long.java:631)
at org.apache.hadoop.conf.Configuration.getLong(Configuration.java:1538)
at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:248)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3303)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3352)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3320)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:361)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary$1.apply(DataSource.scala:547)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary$1.apply(DataSource.scala:545)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.immutable.List.foreach(List.scala:392)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.immutable.List.flatMap(List.scala:355)
at org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:545)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:359)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:391)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
I have read several other SO posts on this topic (like this one or this) and have done everything they mention, but nothing seems to fix my issue.
I am using spark-2.4.4-bin-without-hadoop and hadoop-3.1.2. As for the jar files, I've got:
aws-java-sdk-bundle-1.11.199.jar
hadoop-aws-3.0.0.jar
hadoop-common-3.0.0.jar
Also, I am using the following spark-submit command to run the code:
/opt/spark-2.4.4-bin-without-hadoop/bin/spark-submit \
--conf spark.app.name=read_json --master yarn --deploy-mode client --num-executors 2 \
--executor-cores 2 --executor-memory 2G --driver-cores 2 --driver-memory 1G \
--jars /home/my_project/jars/aws-java-sdk-bundle-1.11.199.jar,/home/my_project/jars/hadoop-aws-3.0.0.jar,/home/my_project/jars/hadoop-common-3.0.0.jar \
--conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" --conf "spark.rpc.askTimeout=600s" /home/my_project/read_json.py
Anything I might be missing here?

From the stack trace, the error is thrown while S3AFileSystem is reading one of its configuration options with Configuration.getLong, so the issue is a default configuration option whose value ("100M") is not in the plain numeric format that this code requires.
In my case the error was resolved after I added the following configuration parameter to the spark-submit command:
--conf fs.s3a.multipart.size=104857600
See Tuning S3A Uploads.
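
The value 104857600 is 100 MiB written as a plain byte count, which is the form that Long.parseLong in the stack trace can accept; a suffixed value such as "100M" cannot be parsed that way. A small sketch of the conversion follows. The helper size_to_bytes is hypothetical and not part of Hadoop or Spark, and it assumes binary (1024-based) suffixes.

```python
# Multipliers for the size suffixes handled by this sketch (assumed binary).
_UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def size_to_bytes(text: str) -> int:
    """Convert a size such as "100M" to a byte count (hypothetical helper)."""
    text = text.strip().upper()
    if text and text[-1] in _UNITS:
        return int(text[:-1]) * _UNITS[text[-1]]
    return int(text)  # already a plain number of bytes


print(size_to_bytes("100M"))  # 104857600
```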

I am posting what I ended up doing to fix the issue for anyone who might see the same exception:
I added hadoop-aws to HADOOP_OPTIONAL_TOOLS in hadoop-env.sh. I also removed all the S3A configuration in Spark except the access and secret keys, and everything worked. My code before the changes:
from pyspark import SparkConf

# Set up the Spark process
conf = SparkConf() \
    .setAppName(app_name) \
    .set("spark.hadoop.mapred.output.compress", "true") \
    .set("spark.hadoop.mapred.output.compression.codec", "true") \
    .set("spark.hadoop.mapred.output.compression.codec", "org.apache.hadoop.io.compress.GzipCodec") \
    .set("spark.hadoop.mapred.output.compression.type", "BLOCK") \
    .set("spark.speculation", "false") \
    .set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.BasicAWSCredentialsProvider") \
    .set("com.amazonaws.services.s3.enableV4", "true")
# Some other configs, set directly on the Hadoop configuration
spark_context._jsc.hadoopConfiguration().set(
    "fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem"
)
spark_context._jsc.hadoopConfiguration().set(
    "fs.s3a.access.key", s3_key
)
spark_context._jsc.hadoopConfiguration().set(
    "fs.s3a.secret.key", s3_secret
)
spark_context._jsc.hadoopConfiguration().set(
    "fs.s3a.multipart.size", "104857600"
)
And after:
from pyspark import SparkConf

# Set up the Spark process
conf = SparkConf() \
    .setAppName(app_name) \
    .set("spark.hadoop.mapred.output.compress", "true") \
    .set("spark.hadoop.mapred.output.compression.codec", "true") \
    .set("spark.hadoop.mapred.output.compression.codec", "org.apache.hadoop.io.compress.GzipCodec") \
    .set("spark.hadoop.mapred.output.compression.type", "BLOCK") \
    .set("spark.speculation", "false")
# Only the S3A credentials are still set on the Hadoop configuration
spark_context._jsc.hadoopConfiguration().set(
    "fs.s3a.access.key", s3_key
)
spark_context._jsc.hadoopConfiguration().set(
    "fs.s3a.secret.key", s3_secret
)
That probably means it was a class path issue. hadoop-aws wasn't being added to the class path, so under the covers Spark was falling back to some other implementation of S3AFileSystem. Hadoop and Spark are a huge pain in this area because there are so many different places and ways to load things, and Java is particular about the order as well: when the same class is available from more than one place, the JVM uses whichever copy it finds first on the class path. I hope this helps others facing the same issue.

Can't setup spark application with spark-atlas-connector

I can't set up my Spark application with Apache Atlas via spark-atlas-connector.
I cloned the https://github.com/hortonworks-spark/spark-atlas-connector project and ran mvn package. Then I put all the jars in my project and set up the config like this:
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

def main(args: Array[String]): Unit = {
  // Register the Atlas listeners
  val sparkConf = new SparkConf()
    .setAppName("atlas-test")
    .setMaster("local[2]")
    .set("spark.extraListeners", "com.hortonworks.spark.atlas.SparkAtlasEventTracker")
    .set("spark.sql.queryExecutionListeners", "com.hortonworks.spark.atlas.SparkAtlasEventTracker")
    .set("spark.sql.streaming.streamingQueryListeners", "com.hortonworks.spark.atlas.SparkAtlasStreamingQueryEventTracker")
  val spark = SparkSession.builder()
    .config(sparkConf)
    .enableHiveSupport()
    .getOrCreate()
  import spark.implicits._
  // Batch read from Kafka; the option name is "startingOffsets"
  val df = spark.read.format("kafka")
    .option("kafka.bootstrap.servers", BROKER_SERVERS)
    .option("subscribe", "foobar")
    .option("startingOffsets", "earliest")
    .load()
  df.show()
  // Write the same rows back to another topic
  df.write
    .format("kafka")
    .option("kafka.bootstrap.servers", BROKER_SERVERS)
    .option("topic", "foobar-out")
    .save()
}
Atlas is started via a Docker container which I pulled.
Kafka and ZooKeeper are also started via Docker containers which I pulled.
The job works without spark-atlas-connector, but when I add the connector it throws exceptions.
19/08/09 16:40:16 ERROR SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: Exception when registering SparkListener
at org.apache.spark.SparkContext.setupAndStartListenerBus(SparkContext.scala:2398)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:555)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2520)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:935)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:926)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:926)
at Boot$.main(Boot.scala:21)
at Boot.main(Boot.scala)
Caused by: org.apache.atlas.AtlasException: Failed to load application properties
at org.apache.atlas.ApplicationProperties.get(ApplicationProperties.java:134)
at org.apache.atlas.ApplicationProperties.get(ApplicationProperties.java:86)
at com.hortonworks.spark.atlas.AtlasClientConf.configuration$lzycompute(AtlasClientConf.scala:25)
at com.hortonworks.spark.atlas.AtlasClientConf.configuration(AtlasClientConf.scala:25)
at com.hortonworks.spark.atlas.AtlasClientConf.get(AtlasClientConf.scala:50)
at com.hortonworks.spark.atlas.AtlasClient$.atlasClient(AtlasClient.scala:120)
at com.hortonworks.spark.atlas.SparkAtlasEventTracker.<init>(SparkAtlasEventTracker.scala:33)
at com.hortonworks.spark.atlas.SparkAtlasEventTracker.<init>(SparkAtlasEventTracker.scala:37)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.spark.util.Utils$$anonfun$loadExtensions$1.apply(Utils.scala:2691)
at org.apache.spark.util.Utils$$anonfun$loadExtensions$1.apply(Utils.scala:2680)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
at org.apache.spark.util.Utils$.loadExtensions(Utils.scala:2680)
at org.apache.spark.SparkContext$$anonfun$setupAndStartListenerBus$1.apply(SparkContext.scala:2387)
at org.apache.spark.SparkContext$$anonfun$setupAndStartListenerBus$1.apply(SparkContext.scala:2386)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.SparkContext.setupAndStartListenerBus(SparkContext.scala:2386)
... 8 more
Caused by: com.hortonworks.spark.atlas.shade.org.apache.commons.configuration.ConfigurationException: Cannot locate configuration source null
at com.hortonworks.spark.atlas.shade.org.apache.commons.configuration.AbstractFileConfiguration.load(AbstractFileConfiguration.java:259)
at com.hortonworks.spark.atlas.shade.org.apache.commons.configuration.AbstractFileConfiguration.load(AbstractFileConfiguration.java:238)
at com.hortonworks.spark.atlas.shade.org.apache.commons.configuration.AbstractFileConfiguration.<init>(AbstractFileConfiguration.java:197)
at com.hortonworks.spark.atlas.shade.org.apache.commons.configuration.PropertiesConfiguration.<init>(PropertiesConfiguration.java:284)
at org.apache.atlas.ApplicationProperties.<init>(ApplicationProperties.java:69)
at org.apache.atlas.ApplicationProperties.get(ApplicationProperties.java:125)
... 32 more
19/08/09 16:40:16 INFO SparkContext: SparkContext already stopped.
Exception in thread "main" org.apache.spark.SparkException: Exception when registering SparkListener
at org.apache.spark.SparkContext.setupAndStartListenerBus(SparkContext.scala:2398)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:555)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2520)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:935)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:926)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:926)
at Boot$.main(Boot.scala:21)
at Boot.main(Boot.scala)
Caused by: org.apache.atlas.AtlasException: Failed to load application properties
at org.apache.atlas.ApplicationProperties.get(ApplicationProperties.java:134)
at org.apache.atlas.ApplicationProperties.get(ApplicationProperties.java:86)
at com.hortonworks.spark.atlas.AtlasClientConf.configuration$lzycompute(AtlasClientConf.scala:25)
at com.hortonworks.spark.atlas.AtlasClientConf.configuration(AtlasClientConf.scala:25)
at com.hortonworks.spark.atlas.AtlasClientConf.get(AtlasClientConf.scala:50)
at com.hortonworks.spark.atlas.AtlasClient$.atlasClient(AtlasClient.scala:120)
at com.hortonworks.spark.atlas.SparkAtlasEventTracker.<init>(SparkAtlasEventTracker.scala:33)
at com.hortonworks.spark.atlas.SparkAtlasEventTracker.<init>(SparkAtlasEventTracker.scala:37)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.spark.util.Utils$$anonfun$loadExtensions$1.apply(Utils.scala:2691)
at org.apache.spark.util.Utils$$anonfun$loadExtensions$1.apply(Utils.scala:2680)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
at org.apache.spark.util.Utils$.loadExtensions(Utils.scala:2680)
at org.apache.spark.SparkContext$$anonfun$setupAndStartListenerBus$1.apply(SparkContext.scala:2387)
at org.apache.spark.SparkContext$$anonfun$setupAndStartListenerBus$1.apply(SparkContext.scala:2386)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.SparkContext.setupAndStartListenerBus(SparkContext.scala:2386)
... 8 more
Caused by: com.hortonworks.spark.atlas.shade.org.apache.commons.configuration.ConfigurationException: Cannot locate configuration source null
at com.hortonworks.spark.atlas.shade.org.apache.commons.configuration.AbstractFileConfiguration.load(AbstractFileConfiguration.java:259)
at com.hortonworks.spark.atlas.shade.org.apache.commons.configuration.AbstractFileConfiguration.load(AbstractFileConfiguration.java:238)
at com.hortonworks.spark.atlas.shade.org.apache.commons.configuration.AbstractFileConfiguration.<init>(AbstractFileConfiguration.java:197)
at com.hortonworks.spark.atlas.shade.org.apache.commons.configuration.PropertiesConfiguration.<init>(PropertiesConfiguration.java:284)
at org.apache.atlas.ApplicationProperties.<init>(ApplicationProperties.java:69)
at org.apache.atlas.ApplicationProperties.get(ApplicationProperties.java:125)
... 32 more
19/08/09 16:40:17 INFO ShutdownHookManager: Shutdown hook called

System.setProperty("atlas.conf", "") is the correct solution as noted by OP.
SAC uses ApplicationProperties.java.
Specifically it uses ApplicationProperties.get.
Source code is here:
https://github.com/apache/atlas/blob/master/intg/src/main/java/org/apache/atlas/ApplicationProperties.java#L118
You can see the variable ATLAS_CONFIGURATION_DIRECTORY_PROPERTY is set to "atlas.conf":
https://github.com/apache/atlas/blob/master/intg/src/main/java/org/apache/atlas/ApplicationProperties.java#L43

I believe you have forgotten one more step from the setup documentation. The error you have stems from
Caused by: com.hortonworks.spark.atlas.shade.org.apache.commons.configuration.ConfigurationException: Cannot locate configuration source null
And to quote their README file in the github repo you've posted:
Also make sure atlas configuration file atlas-application.properties is in the Driver's classpath. For example, putting this file into <SPARK_HOME>/conf.

Please refer to the following from the official spark-atlas-connector GitHub page. The atlas-application.properties file should be reachable.
Also make sure atlas configuration file atlas-application.properties is in the Driver's classpath. For example, putting this file into <SPARK_HOME>/conf.
If you're using cluster mode, please also ship this conf file to the remote Driver using --files atlas-application.properties.

The following should do the trick. Please note the --files and --driver-class-path options, which are necessary to place this configuration file on the CLASSPATH and hence make it available to the Atlas client classes.
Moreover, the paths below are relative to the Spark Atlas Connector project directory, so change them accordingly.
$SPARK_HOME/bin/spark-shell \
--jars spark-atlas-connector-assembly/target/spark-atlas-connector-assembly-0.1.0-SNAPSHOT.jar \
--conf spark.extraListeners=com.hortonworks.spark.atlas.SparkAtlasEventTracker \
--conf spark.sql.queryExecutionListeners=com.hortonworks.spark.atlas.SparkAtlasEventTracker \
--conf spark.sql.streaming.streamingQueryListeners=com.hortonworks.spark.atlas.SparkAtlasStreamingQueryEventTracker \
--files spark-atlas-connector/src/test/resources/atlas-application.properties \
--driver-class-path spark-atlas-connector/src/test/resources

Spark 2.1.1 with typesafeconfig

I'm trying to support an external configuration file for my Spark application using typesafeconfig.
I'm loading the application.conf file in my application code (in the driver) like this:
val config = ConfigFactory.load()
val myProp = config.getString("app.property")
val df = spark.read.avro(myProp)
application.conf looks like this:
app.property="some value"
The spark-submit command looks like this:
spark-submit \
--class com.myapp.Main \
--conf spark.shuffle.service.enabled=true \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.dynamicAllocation.minExecutors=56 \
--conf spark.dynamicAllocation.maxExecutors=1000 \
--driver-class-path $HOME/conf/*.conf \
--files $HOME/conf/application.conf \
my-app-0.0.1-SNAPSHOT.jar
It doesn't seem to work, and I'm getting:
Exception in thread "main" com.typesafe.config.ConfigException$Missing: No configuration setting found for key 'app'
at com.typesafe.config.impl.SimpleConfig.findKey(SimpleConfig.java:124)
at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:147)
at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:159)
at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:164)
at com.typesafe.config.impl.SimpleConfig.getString(SimpleConfig.java:206)
at com.paypal.cfs.fpti.Main$.main(Main.scala:42)
at com.paypal.cfs.fpti.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:750)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Looking at the logs, I do see that "--files" works, so it seems like a classpath issue:
18/03/13 01:08:30 INFO SparkContext: Added file file:/home/user/conf/application.conf at file:/home/user/conf/application.conf with timestamp 1520928510820
18/03/13 01:08:30 INFO Utils: Copying /home/user/conf/application.conf to /tmp/spark-2938fde1-fa4a-47af-8dc6-1c54b5e89d48/userFiles-c2cec57f-18c8-491d-8679-df7e7da45e05/application.conf

It turns out I was pretty close to the answer to begin with. Here is how it worked for me:
spark-submit \
--class com.myapp.Main \
--conf spark.shuffle.service.enabled=true \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.dynamicAllocation.minExecutors=56 \
--conf spark.dynamicAllocation.maxExecutors=1000 \
--driver-class-path $APP_HOME/conf \
--files $APP_HOME/conf/application.conf \
$APP_HOME/my-app-0.0.1-SNAPSHOT.jar
Here $APP_HOME contains the following:
conf/application.conf
my-app-0.0.1-SNAPSHOT.jar
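The trick is that application.conf must sit inside a folder and --driver-class-path must point to that folder: class path entries are directories or jars, so a pattern such as $HOME/conf/*.conf does not put the file on the class path.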
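
The relationship between the two options can be sketched in Python. ConfigFactory.load() looks for application.conf as a class path resource, so the class path entry has to be the directory that holds the file, while --files takes the file itself. The helper config_submit_args and the /opt/my-app location are hypothetical and only illustrate how the two values are derived from one path.

```python
from pathlib import PurePosixPath
from typing import List


def config_submit_args(conf_file: str) -> List[str]:
    """Derive spark-submit options for a config file (hypothetical helper).

    The class path entry is the directory that contains the file;
    --files receives the path of the file itself.
    """
    path = PurePosixPath(conf_file)
    return ["--driver-class-path", str(path.parent), "--files", str(path)]


args = config_submit_args("/opt/my-app/conf/application.conf")
print(" ".join(args))
```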

To specify the config file path, you can pass it as an application argument and then read it from the args variable of the main class.
This is how you would run the spark-submit command. Note that the config file comes after the application jar:
spark-submit \
--class com.myapp.Main \
--conf spark.shuffle.service.enabled=true \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.dynamicAllocation.minExecutors=56 \
--conf spark.dynamicAllocation.maxExecutors=1000 \
my-app-0.0.1-SNAPSHOT.jar $HOME/conf/application.conf
And then load the config file from the path given in args(0):
import java.io.File
import com.typesafe.config.ConfigFactory
[...]
val config = ConfigFactory.parseFile(new File(args(0)))
Now you have access to the properties of your application.conf file:
val myProp = config.getString("app.property")
Hope it helps.

How to get the working directory in executor

I am using the following command to submit a Spark job. I want to send a jar and a config file to each executor and load them there:
spark-submit --verbose \
--files=/tmp/metrics.properties \
--jars /tmp/datainsights-metrics-source-assembly-1.0.jar \
--total-executor-cores 4\
--conf "spark.metrics.conf=metrics.properties" \
--conf "spark.executor.extraClassPath=datainsights-metrics-source-assembly-1.0.jar" \
--class org.microsoft.ofe.datainsights.StartServiceSignalPipeline \
./target/datainsights-1.0-jar-with-dependencies.jar
--files and --jars are used to send files to the executors. I found that the files are sent to the working directory of the executor, such as 'worker/app-xxxxx-xxxx/0/'.
But when the job is running, the executor always throws an exception saying that it could not find the file 'metrics.properties' or the class contained in 'datainsights-metrics-source-assembly-1.0.jar'. It seems that the job is looking for the files in another directory rather than the working directory.
Do you know how to load the files that are sent to the executors?
Here is the trace (the class 'org.apache.spark.metrics.PerfCounterSource' is contained in the jar 'datainsights-metrics-source-assembly-1.0.jar'):
ERROR 2016-01-14 16:10:32 Logging.scala:96 - org.apache.spark.metrics.MetricsSystem: Source class org.apache.spark.metrics.PerfCounterSource cannot be instantiated
java.lang.ClassNotFoundException: org.apache.spark.metrics.PerfCounterSource
at java.net.URLClassLoader$1.run(URLClassLoader.java:366) ~[na:1.7.0_80]
at java.net.URLClassLoader$1.run(URLClassLoader.java:355) ~[na:1.7.0_80]
at java.security.AccessController.doPrivileged(Native Method) [na:1.7.0_80]
at java.net.URLClassLoader.findClass(URLClassLoader.java:354) ~[na:1.7.0_80]
at java.lang.ClassLoader.loadClass(ClassLoader.java:425) ~[na:1.7.0_80]
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308) ~[na:1.7.0_80]
at java.lang.ClassLoader.loadClass(ClassLoader.java:358) ~[na:1.7.0_80]
at java.lang.Class.forName0(Native Method) ~[na:1.7.0_80]
at java.lang.Class.forName(Class.java:195) ~[na:1.7.0_80]

It looks like you have a typo in your --jars argument, so it could be that Spark is not actually loading the file and is continuing silently.
