I am using a Redis client (Redisson) in a Spark job and getting an exception:
java.lang.NoSuchMethodError: io.netty.bootstrap.Bootstrap.config()Lio/netty/bootstrap/BootstrapConfig;
at org.redisson.client.RedisClient$1$1.operationComplete(RedisClient.java:234)
It's due to a Netty version mismatch: Spark ships with netty-buffer 4.0.23, but the Redis client needs 4.1. Is it possible to override the Netty jar in the spark-submit command for both the driver and the executors?
It depends on how you are assembling your project.
Typically you create a fat jar containing all dependencies with the maven-shade-plugin or maven-assembly-plugin. To avoid this issue you can specify a relocation in the shade plugin configuration. It looks something like this:
<relocations>
  <relocation>
    <pattern>io.netty</pattern>
    <shadedPattern>your.prefix.io.netty</shadedPattern>
  </relocation>
</relocations>
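For context, a fuller maven-shade-plugin configuration embedding that relocation might look like the following sketch; the plugin version and the `your.prefix` package are placeholders you should adapt:

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>3.2.4</version>
  <executions>
    <execution>
      <!-- Run shading when the jar is packaged -->
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <relocations>
          <!-- Rewrite io.netty classes (and references to them)
               into a private namespace so they cannot collide
               with the Netty 4.0.x that Spark puts on the classpath -->
          <relocation>
            <pattern>io.netty</pattern>
            <shadedPattern>your.prefix.io.netty</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
```

With this in place, the fat jar carries its own renamed copy of Netty 4.1, so Spark's older Netty no longer shadows it.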
Tech stack:
Spark 2.4.4
Hive 2.3.3
HBase 1.4.8
sbt 1.5.8
What is the best practice for Spark dependency overriding?
Suppose that a Spark app (cluster mode) already has the spark-hive (2.4.4) dependency marked as provided.
I compiled and assembled a "custom" spark-hive jar that I want to use in the Spark app.
There is not a lot of information about how you're running Spark, so it's hard to answer exactly.
But typically, you'll have Spark running on a server, in a container, or in a pod (on Kubernetes).
If you're running on a server, go to $SPARK_HOME/jars. In there, you should find the spark-hive jar that you want to replace. Replace that one with your new one.
If running in a container/pod, do the same as above and rebuild your image from the directory with the replaced jar.
Hope this helps!
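On a standalone server, the swap can be sketched like this; the exact jar file names depend on your Spark and Scala versions, so the ones below are illustrative:

```shell
# Back up the stock spark-hive jar shipped with the distribution,
# then drop in the custom build under the same directory.
cd "$SPARK_HOME/jars"
mv spark-hive_2.11-2.4.4.jar spark-hive_2.11-2.4.4.jar.bak
cp /path/to/custom/spark-hive_2.11-2.4.4.jar .
```

For a container image, the same replacement would go into the Dockerfile (e.g. a COPY of the custom jar over the original) before rebuilding the image.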
I have a requirement to connect to Azure Blob Storage from a Spark application to read data. The idea is to access the storage using Hadoop filesystem support (i.e., using the hadoop-azure and azure-storage dependencies, https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-azure/2.8.5).
We submit the job to Spark on a K8s cluster. The embedded Spark library doesn't come prepackaged with the required hadoop-azure jar, so I am building a fat jar with all the dependencies. The problem is that even though the library is part of the fat jar, Spark doesn't seem to load it, and I get the error "java.io.IOException: No FileSystem for scheme: wasbs".
The Spark version is 2.4.8 and the Hadoop version is 2.8.5. Is this behavior expected, that Spark does not load a dependency even though it is part of the fat jar? How do I force Spark to load all the dependencies in the fat jar?
The same thing happened with another dependency, and I had to pass it manually using the --jars option. However, the --jars option is not feasible as the application grows.
I tried adding the fat jar itself to the executor extraClassPath, but that caused a few other version conflicts.
Any information on this would be helpful.
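One workaround sometimes suggested for "No FileSystem for scheme" errors is to name the filesystem implementation explicitly instead of relying on service discovery, since the META-INF/services/org.apache.hadoop.fs.FileSystem entries from several jars can overwrite each other when a fat jar is assembled without a services transformer. A hedged sketch (the main class, account name, key, and jar name are placeholders, and the `NativeAzureFileSystem$Secure` class is the implementation hadoop-azure provides for the wasbs scheme):

```shell
spark-submit \
  --class com.example.MyApp \
  --conf 'spark.hadoop.fs.wasbs.impl=org.apache.hadoop.fs.azure.NativeAzureFileSystem$Secure' \
  --conf 'spark.hadoop.fs.azure.account.key.MYACCOUNT.blob.core.windows.net=MYKEY' \
  my-app-assembly.jar
```

If the fat jar is built with the maven-shade-plugin, adding the ServicesResourceTransformer (which concatenates rather than overwrites the service files) can also resolve this class of problem at build time.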
Thanks & Regards,
Swathi Desai
Let's say that we have a Spark application that writes to and reads from HDFS, and we have some additional dependency, let's call it dep.
Now, let's run spark-submit on our jar built with sbt. I know that spark-submit ships some jars (known as spark-libs). However, my questions are:
(1) How does the choice of Spark package influence the shipped dependencies? I mean, what is the difference between spark-with-hadoop/bin/spark-submit and spark-without-hadoop/bin/spark-submit?
(2) How does the Hadoop version installed on the cluster influence the dependencies?
(3) Who is responsible for providing my dependency dep? Should I build a fat jar (assembly)?
Please note that the first two questions are about where the HDFS calls made by my Spark application (write/read) come from.
Thanks in advance
spark-without-hadoop refers only to the downloaded package, not to application development.
The more accurate phrasing is "bring your own Hadoop," meaning you are still required to have the base Hadoop dependencies for any Spark application.
Should I build a fat jar (assembly)?
If you have libraries beyond hadoop-client and those provided by Spark (core, mllib, streaming), then yes.
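Under that setup, a minimal build.sbt might look like the following sketch; the versions and the `com.example` coordinates for dep are placeholders:

```scala
// Spark and hadoop-client are already on the cluster, so mark them
// "provided": they compile against your code but stay out of the assembly.
libraryDependencies ++= Seq(
  "org.apache.spark"  %% "spark-core"    % "2.4.4" % "provided",
  "org.apache.spark"  %% "spark-sql"     % "2.4.4" % "provided",
  "org.apache.hadoop" %  "hadoop-client" % "2.8.5" % "provided",
  // Your extra dependency "dep" is NOT on the cluster,
  // so it goes into the fat jar built by sbt-assembly.
  "com.example" %% "dep" % "1.0.0"
)
```

Then `sbt assembly` produces a jar containing dep (but not Spark or Hadoop), which is what you hand to spark-submit.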
We used DSE 4.8.3 Cassandra with CDH 5.5.0 Spark jobs run from Oozie, and found that DSE Cassandra has a guava-16.0.1.jar conflict:
Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.SparkMain], main() threw exception, com.google.common.reflect.TypeToken.isPrimitive()Z
java.lang.NoSuchMethodError: com.google.common.reflect.TypeToken.isPrimitive()Z
The Cassandra version in DSE 4.8.3 was 2.1.11.969, and the Spark version in CDH 5.5.0 was 1.5.0. For the Cassandra driver and connector:
1. If we used cassandra-driver-core-2.2.0-rc3.jar and spark-cassandra-connector_2.10-1.5.0-M2.jar, which both depend on guava-16.0.1.jar, it threw the exception above ("Method not found: com.google.common.reflect.TypeToken.isPrimitive()Z"), because CDH 5.5.0 Spark used guava-14.0.1.jar, not guava-16.0.1.jar.
2. If we used the lower versions cassandra-driver-core-2.2.0-rc1.jar and spark-cassandra-connector_2.10-1.5.0-M1.jar, which both depend on guava-14.0.1.jar, it threw the following exception:
Exception in thread "main" java.lang.AbstractMethodError: com.datastax.spark.connector.cql.LocalNodeFirstLoadBalancingPolicy.close()V
at com.datastax.driver.core.Cluster$Manager.close(Cluster.java:1417)
at com.datastax.driver.core.Cluster$Manager.access$200(Cluster.java:1167)
at com.datastax.driver.core.Cluster.closeAsync(Cluster.java:461)
at com.datastax.driver.core.Cluster.close(Cluster.java:472)
at com.datastax.spark.connector.cql.CassandraConnector$.com$datastax$spark$connector$cql$CassandraConnector$$createSession(CassandraConnector.scala:163)
I found an answer for this exception (saying that using the higher-version spark-cassandra-connector_2.10-1.5.0-M2.jar resolves the issue):
Spark + Cassandra connector fails with LocalNodeFirstLoadBalancingPolicy.close()
So now we are mystified by this Cassandra dependency issue. How do we fix the guava-16.0.1 conflict? Is it possible to build a new spark-cassandra-connector.jar that fixes both issues? Can you help resolve this? Thanks!
There should be no explicit C* driver dependency in your build; it is brought in automatically as a transitive dependency of the Spark Cassandra Connector. I would use the 1.5.0 release. Then, when building, you need to make sure that you exclude all other Guava versions.
This means that if you are making a fat jar, make sure you aren't including any Spark distribution in your code, and that Guava is excluded from any Hadoop libs.
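A Guava exclusion on a Hadoop dependency can be sketched like this in Maven; the hadoop-client coordinates are illustrative and should match your CDH version:

```xml
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>2.6.0</version>
  <!-- Keep Hadoop's older Guava out of the fat jar so the
       connector's transitive guava-16.0.1 wins at runtime -->
  <exclusions>
    <exclusion>
      <groupId>com.google.guava</groupId>
      <artifactId>guava</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```

The same idea applies in sbt with `exclude("com.google.guava", "guava")` on the Hadoop dependency.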
There are a few other mail threads on this with more details:
Detected Guava issue #1635 which indicates that a version of Guava less than 16.01 is in use.
https://groups.google.com/a/lists.datastax.com/forum/#!topic/spark-connector-user/HnTsWJkI5jo
Issue with guava
https://groups.google.com/a/lists.datastax.com/forum/#!topic/spark-connector-user/uB_DN_CcK2k
I am trying to read data from Cassandra 2.0.6 using Spark, with the DataStax drivers. While reading, I got an error like "Loss was due to java.lang.ClassNotFoundException:
java.lang.ClassNotFoundException: com.datastax.spark.connector.rdd.CassandraRDD". But I included spark-cassandra-connector_2.10 in my pom.xml, which contains the com.datastax.spark.connector.rdd.CassandraRDD class. Am I missing any other settings or environment variables?
You need to make sure that the connector is on the classpath for the executors, either via the -cp option or by bundling the jar into the Spark context (using SparkConf.setJars()).
Edit for modern Spark
In Spark 1.x and newer, it's usually recommended that you use the spark-submit command to place your dependencies on the executor classpath. See
http://spark.apache.org/docs/latest/submitting-applications.html
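For example, the connector can be pulled onto both the driver and executor classpaths at submit time with --packages; the coordinates below are one plausible version and should be adjusted to your Spark and Scala versions, and the main class and jar name are placeholders:

```shell
# --packages resolves the connector (and its transitive dependencies,
# including the DataStax driver) from Maven Central and ships them
# to every executor automatically.
spark-submit \
  --packages com.datastax.spark:spark-cassandra-connector_2.10:1.5.0 \
  --class com.example.MyApp \
  my-app.jar
```

This avoids hand-listing jars with --jars and keeps the transitive driver version consistent with the connector.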