How to use the EMRFS S3-optimized committer without EMR?

I want to use the EMRFS S3-optimized committer locally, without an EMR cluster.
I set "fs.s3a.impl" = "com.amazon.ws.emr.hadoop.fs.EmrFileSystem" instead of "org.apache.hadoop.fs.s3a.S3AFileSystem" and got the following exception:
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found
I tried the following packages from Maven without any success:
com.amazonaws:aws-java-sdk:1.12.71
com.amazonaws:aws-java-sdk-emr:1.12.70

Sorry, but using EMRFS, including the S3-optimized committer, is not possible outside of EMR.
EMRFS is not an open-source package, nor is the library available in Maven Central. That is why the class is not found: the aws-java-sdk-emr artifact you added is only the AWS Java SDK client for the EMR service API (e.g., for creating clusters), not the EMRFS filesystem implementation.
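Off EMR, the filesystem that does work is the open-source S3A implementation already mentioned in the question. A minimal local sketch follows; the --packages setup, app name, bucket, and path are assumptions for illustration, not from the question:

    import org.apache.spark.sql.SparkSession

    // Sketch: plain S3A, not EMRFS. Assumes a hadoop-aws artifact matching the
    // Hadoop version bundled with this Spark build is on the classpath (e.g.
    // added with --packages) and that AWS credentials come from the default
    // provider chain.
    val spark = SparkSession.builder()
      .appName("s3a-local")                 // placeholder app name
      .master("local[*]")
      .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
      .getOrCreate()

    // Hypothetical bucket and path, only to show the s3a:// scheme in use.
    val df = spark.read.parquet("s3a://my-bucket/input/")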

Related

Unable to build Spark application with multiple main classes for Databricks job

I have a Spark application that contains multiple Spark jobs to be run on Azure Databricks. I want to build and package the application into a fat jar. The application compiles successfully, but when I try to package it (command: sbt package), it reports: "[warn] multiple main classes detected: run 'show discoveredMainClasses' to see the list".
How do I build the application jar (without specifying any main class) so that I can upload it to a Databricks job and specify the main class there?
This message is just a warning (note the [warn] prefix); it doesn't prevent generation of the jar files (normal or fat). You can then upload the resulting jar to DBFS (or ADLS for newer Databricks Runtime versions) and create a Databricks job as either a Jar task or a Spark Submit task.
If sbt actually fails and doesn't produce jars, then you have some plugin that turns warnings into errors.
Also note that sbt package doesn't produce a fat jar - it produces a jar containing only the classes in your project. You will need sbt assembly (install the sbt-assembly plugin for that) to generate a fat jar, but make sure you mark the Spark & Delta dependencies as provided, as in the sketch below.
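A rough build.sbt sketch of that setup; all names and version numbers are placeholders, and it assumes sbt-assembly is declared in project/plugins.sbt:

    // project/plugins.sbt (sketch, placeholder version):
    // addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "1.2.0")

    // build.sbt (sketch): Spark and Delta are "provided" so they stay out of the
    // fat jar; the Databricks Runtime supplies them at run time. No mainClass is
    // set, so the main class is chosen later when the Databricks job is created.
    name := "my-spark-jobs"            // placeholder project name
    scalaVersion := "2.12.15"          // placeholder; match your Databricks Runtime

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-sql"  % "3.3.0" % "provided",  // placeholder version
      "io.delta"         %% "delta-core" % "2.1.0" % "provided"   // placeholder version
    )

Building with sbt assembly then produces a single fat jar containing only your own code and non-provided dependencies.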

What is LongAdder in relation to the Cassandra + Spark connector?

When I load data into Cassandra using Databricks, I get:
Caused by: java.lang.NoClassDefFoundError: com/twitter/jsr166e/LongAdder
It's a simple saveToCassandra to a table.
I looked for this Twitter jsr166e jar in Maven; it's very old, added in 2013.
I don't know why this jar is not available with the Spark + Cassandra connector.
That error indicates you are missing dependencies and/or the Spark Cassandra connector is not on the runtime classpath of the Spark application. I'm not sure how you installed the connector, but you should have used the packages method to ensure that its dependencies are met and the connector is correctly configured.
Read more HERE
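For instance, a hedged sketch of getting the connector, together with the jar that provides com.twitter.jsr166e.LongAdder, onto the runtime classpath; the coordinates and version below are placeholders and must match your Spark and Scala versions:

    // build.sbt (sketch): adding the connector as a normal (non-provided)
    // dependency, or passing the same coordinate through the packages mechanism
    // of spark-submit / the cluster config, pulls in its transitive dependencies,
    // including the jar that provides the missing LongAdder class.
    libraryDependencies +=
      "com.datastax.spark" %% "spark-cassandra-connector" % "2.4.3"  // placeholder version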
Hope that helps,
Pat

Providing Hive support to a deployed Apache Spark

I need to use Hive-specific features in Spark SQL, however I have to work with an already deployed Apache Spark instance that, unfortunately, doesn't have Hive support compiled in.
What would I have to do to include Hive support for my job?
I tried using the spark.sql.hive.metastore.jars setting, but then I always get these exceptions:
DataNucleus.Persistence: Error creating validator of type org.datanucleus.properties.CorePropertyValidator
ClassLoaderResolver for class "" gave error on creation : {1}
and
org.datanucleus.exceptions.NucleusUserException: Persistence process has been specified to use a ClassLoaderResolver of name "datanucleus" yet this has not been found by the DataNucleus plugin mechanism. Please check your CLASSPATH and plugin specification.
In the setting I am providing a fat-jar of spark-hive (excluded spark-core and spark-sql) with all its optional Hadoop dependencies (CDH-specific versions of hadoop-archives, hadoop-common, hadoop-hdfs, hadoop-mapreduce-client-core, hadoop-yarn-api, hadoop-yarn-client and hadoop-yarn-common).
I am also specifying spark.sql.hive.metastore.version with the value 1.2.1.
I am using CDH 5.3.1 (with Hadoop 2.5.0) and Spark 1.5.2 on Scala 2.10.
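For reference, a sketch of how the settings described above are typically wired together in a Spark 1.5.x job; the classpath value and app name are placeholders, not from the question:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    // Sketch for Spark 1.5.x: point Spark SQL at an external Hive metastore client.
    // The classpath below is a placeholder; it must contain Hive 1.2.1 and the
    // matching Hadoop jars, and the spark-hive classes themselves must be on the
    // driver/executor classpath (e.g. via --jars) for HiveContext to load at all.
    val conf = new SparkConf()
      .setAppName("hive-support-demo")  // placeholder app name
      .set("spark.sql.hive.metastore.version", "1.2.1")
      .set("spark.sql.hive.metastore.jars", "/path/to/hive-1.2.1-and-hadoop-jars/*")
    val sc = new SparkContext(conf)
    val sqlContext = new HiveContext(sc)

    sqlContext.sql("SHOW TABLES").show()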

Classpath resolution between spark uber jar and spark-submit --jars when similar classes exist in both

What is the precedence in class loading when both the uber jar of my Spark application and the contents of the --jars option to my spark-submit command contain similar dependencies?
I ask this from a third-party library integration standpoint. If I set --jars to use a third-party library at version 2.0, and the uber jar passed to this spark-submit command was assembled with version 2.1, which class is loaded at runtime?
At present, I am thinking of keeping my dependencies on HDFS and adding them to the --jars option of spark-submit, while asking users, via end-user documentation, to set the scope of this third-party library to 'provided' in their Spark application's Maven POM file.
This is somewhat controlled with params:
spark.driver.userClassPathFirst &
spark.executor.userClassPathFirst
If set to true (the default is false), from the docs:
(Experimental) Whether to give user-added jars precedence over Spark's own jars when loading classes in the driver. This feature can be used to mitigate conflicts between Spark's dependencies and user dependencies. It is currently an experimental feature. This is used in cluster mode only.
I wrote some of the code that controls this, and there were a few bugs in the early releases, but if you're using a recent Spark release it should work (although it is still an experimental feature).
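For illustration, a sketch of setting both flags in code; the app name is a placeholder, and in practice the flags are usually passed as --conf options to spark-submit:

    import org.apache.spark.SparkConf

    // Sketch: ask Spark to prefer user-supplied jars (uber jar / --jars) over its
    // own copies when resolving classes. Experimental, so verify with your actual
    // dependency versions. The driver-side property only takes effect if it is set
    // before the driver JVM starts (e.g. via spark-submit --conf or
    // spark-defaults.conf) and, per the quoted docs, in cluster mode only.
    val conf = new SparkConf()
      .setAppName("classpath-precedence-demo")   // placeholder app name
      .set("spark.driver.userClassPathFirst", "true")
      .set("spark.executor.userClassPathFirst", "true")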

How to add a Breeze library that I built myself to Apache Spark?

I added some methods to the Breeze library, and I can see those methods through the IDE. I was trying to add this custom build of Breeze to my project, which is based on Apache Spark. However, when I package my project with "sbt assembly" and run it on my cluster, it throws a "no such method xxx" error, which means the cluster did not actually use my Breeze build. Could anyone tell me how to make the cluster use the Breeze library that I built myself?
My guess is that Spark ships its own version of the Breeze library and prefers it over the custom jars in your assembly. You can try building Spark against your custom library: install the library into your local Maven repository, reference it in Apache Spark's pom.xml, and build your own Spark version.
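If rebuilding Spark is too heavy, another technique commonly used for this kind of conflict (not part of the answer above, just a sketch) is to shade the Breeze packages inside the assembly so Spark's bundled copy can no longer shadow the custom build:

    // build.sbt (sketch, assumes the sbt-assembly plugin): rename the breeze
    // packages that go into the fat jar and rewrite this project's references to
    // match, so the stock Breeze on the cluster classpath is never picked up
    // instead of the custom build.
    assembly / assemblyShadeRules := Seq(
      ShadeRule.rename("breeze.**" -> "myshaded.breeze.@1").inAll
    )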
