Getting scalatest to run from databricks

What's missing to get this simple ScalaTest example working in Databricks (mostly as per their user guide)?
I have a cluster into which I've uploaded the following latest ScalaTest jar, then restarted the cluster:
https://oss.sonatype.org/content/groups/public/org/scalatest/scalatest_2.13/3.2.13/scalatest_2.13-3.2.13.jar
Following: https://www.scalatest.org/user_guide/using_the_scalatest_shell
I'm running their sample code:
import org.scalatest._
class ArithmeticSuite extends AnyFunSuite with matchers.should.Matchers {
test("addition works") {
1 + 1 should equal (2)
}
}
run(new ArithmeticSuite)
The errors start with:
command-1362189841383727:3: error: not found: type AnyFunSuite
class ArithmeticSuite extends AnyFunSuite with matchers.should.Matchers {
^
command-1362189841383727:3: error: object should is not a member of package org.scalatest.matchers
class ArithmeticSuite extends AnyFunSuite with matchers.should.Matchers {

Databricks clusters use Scala 2.12, while you have installed a library compiled for Scala 2.13, which is binary-incompatible with the cluster. Use the version built for Scala 2.12 (with the _2.12 suffix).
P.S. ScalaTest may already be part of the Databricks Runtime - check by running %sh ls -ls /databricks/jars/*scalatest*

We needed to use the ScalaTest test structure for 3.0.8
https://www.scalatest.org/scaladoc/3.0.8/org/scalatest/FeatureSpec.html
<artifactId>scalatest_2.12</artifactId>
<version>3.0.8</version>
to match version 3.0.8 included with the Databricks Runtime.
This resolved our errors.
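For reference, here is a minimal sketch (not from the original post) of what the suite could look like against the 3.0.x API, where FunSuite and Matchers sit directly under org.scalatest instead of the funsuite.AnyFunSuite / matchers.should.Matchers names used in 3.2.x:
import org.scalatest._

// ScalaTest 3.0.x style: FunSuite and Matchers are top-level members of org.scalatest
class ArithmeticSuite extends FunSuite with Matchers {
  test("addition works") {
    1 + 1 should equal (2)
  }
}

run(new ArithmeticSuite)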

Related

Spark spark-sql-kafka - java.lang.NoClassDefFoundError: org/apache/kafka/common/serialization/ByteArraySerializer

I am experimenting with Spark reading from a Kafka topic, following the "Structured Streaming + Kafka Integration Guide".
Spark version: 3.2.1
Scala version: 2.12.15
Following their guide for the spark-shell, including the dependencies, I start my shell:
spark-shell --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.1
However, once I run something like the following in my shell:
val df = spark.readStream.format("kafka").option("kafka.bootstrap.servers","http://HOST:PORT").option("subscribe", "my-topic").load()
I get the following exception:
java.lang.NoClassDefFoundError: org/apache/kafka/common/serialization/ByteArraySerializer
Any ideas how to overcome this issue?
My assumption was that with --packages all dependencies would be loaded as well, but this does not seem to be the case. From the logs, the package appears to be resolved successfully, including the kafka-clients dependency:
org.apache.spark#spark-sql-kafka-0-10_2.12 added as a dependency
resolving dependencies :: org.apache.spark#spark-submit-parent-3b04f646-471c-4cc8-88fb-7e32bc3226ed;1.0
confs: [default]
found org.apache.spark#spark-sql-kafka-0-10_2.12;3.2.1 in central
found org.apache.spark#spark-token-provider-kafka-0-10_2.12;3.2.1 in central
found org.apache.kafka#kafka-clients;2.8.0 in central
found org.lz4#lz4-java;1.7.1 in central
found org.xerial.snappy#snappy-java;1.1.8.4 in central
found org.slf4j#slf4j-api;1.7.30 in central
found org.apache.hadoop#hadoop-client-runtime;3.3.1 in central
found org.spark-project.spark#unused;1.0.0 in central
found org.apache.hadoop#hadoop-client-api;3.3.1 in central
found org.apache.htrace#htrace-core4;4.1.0-incubating in central
found commons-logging#commons-logging;1.1.3 in central
found com.google.code.findbugs#jsr305;3.0.0 in central
found org.apache.commons#commons-pool2;2.6.2 in central
The logs look fine, but you can try including the kafka-clients dependency in the --packages argument as well.
Otherwise, I'd suggest creating an uber jar instead of downloading the libraries every time you submit the app.
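If appending the client jar to --packages (e.g. org.apache.kafka:kafka-clients:2.8.0, the version already resolved in the logs above) doesn't help, the uber-jar route could look roughly like the build.sbt sketch below; it assumes the sbt-assembly plugin is configured and reuses the versions from the question, so treat it as a starting point rather than the exact setup:
// build.sbt - hedged sketch for bundling the Kafka connector into an uber jar
// (assumes sbt-assembly is added in project/plugins.sbt)
ThisBuild / scalaVersion := "2.12.15"

libraryDependencies ++= Seq(
  // Spark itself is provided by the cluster / spark-submit
  "org.apache.spark" %% "spark-sql" % "3.2.1" % "provided",
  // the Kafka source and its client library are bundled into the assembly jar
  "org.apache.spark" %% "spark-sql-kafka-0-10" % "3.2.1",
  "org.apache.kafka" % "kafka-clients" % "2.8.0"
)
Running sbt assembly then produces a single jar that already contains the Kafka source and kafka-clients, so nothing needs to be resolved at submit time.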

Spark-HBase - GCP template (1/3) - How to locally package the Hortonworks connector?

I'm trying to test the Spark-HBase connector in the GCP context. I tried to follow [1], which asks you to locally package the connector [2] using Maven (I tried Maven 3.6.3) for Spark 2.4, and this leads to the following issue.
Error on branch-2.4:
[ERROR] Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.2.2:compile (scala-compile-first) on project shc-core: Execution scala-compile-first of goal net.alchim31.maven:scala-maven-plugin:3.2.2:compile failed.: NullPointerException -> [Help 1]
References
[1] https://github.com/GoogleCloudPlatform/cloud-bigtable-examples/tree/master/scala/bigtable-shc
[2] https://github.com/hortonworks-spark/shc/tree/branch-2.4
As suggested in the comments (thanks @Ismail!), building the connector with Java 8 works:
sdk use java 8.0.275-zulu
mvn clean package -DskipTests
One can then import the jar in Dependencies.scala of the GCP template as follows.
...
val shcCore = "com.hortonworks" % "shc-core" % "1.1.3-2.4-s_2.11" from "file:///<path_to_jar>/shc-core-1.1.3-2.4-s_2.11.jar"
...
// shcCore % (shcVersionPrefix + scalaBinaryVersion) excludeAll(
shcCore excludeAll(
...

Getting Invalid Yaml exception after we upgraded java version

We have updated the java minor version and Cassandra version as given below.
Older Java version: 1.8.0_144
Newer Java version: 1.8.0_261
Older Cassandra version: 2.1.16
Newer Cassandra version: 3.11.2
Now we are getting the below exception:
INFO [main] 2020-08-19 10:55:09,364 YamlConfigurationLoader.java:89 - Configuration
location: file:/etc/cassandra/default.conf/cassandra.yaml
Exception (org.apache.cassandra.exceptions.ConfigurationException) encountered during startup: Invalid yaml: file:/etc/cassandra/default.conf/cassandra.yaml
Error: null; Can't construct a java object for tag:yaml.org,2002:org.apache.cassandra.config.Config; exception=java.lang.reflect.InvocationTargetException; in 'reader', line 1, column 1:
cluster_name: VV Cluster
We did not change cassandra.yaml. The same yaml works on other nodes running the older versions.
Sometimes the issue is not with the YAML itself. As suggested in this Datastax post, you can run a tool like cassandra-stress, which bypasses the parser; this may help in identifying the real reason for the failure.
Sometimes the YAML exception swallows the underlying error, so the root cause surfaces as an "invalid yaml" exception.

Zeppelin + Spark: Reading Parquet from S3 throws NoSuchMethodError: com.fasterxml.jackson

Using Zeppelin 0.7.2 binaries from the main download, and Spark 2.1.0 w/ Hadoop 2.6, the following paragraph:
val df = spark.read.parquet(DATA_URL).filter(FILTER_STRING).na.fill("")
Produces the following:
java.lang.NoSuchMethodError: com.fasterxml.jackson.module.scala.deser.BigDecimalDeserializer$.handledType()Ljava/lang/Class;
at com.fasterxml.jackson.module.scala.deser.NumberDeserializers$.<init>(ScalaNumberDeserializersModule.scala:49)
at com.fasterxml.jackson.module.scala.deser.NumberDeserializers$.<clinit>(ScalaNumberDeserializersModule.scala)
at com.fasterxml.jackson.module.scala.deser.ScalaNumberDeserializersModule$class.$init$(ScalaNumberDeserializersModule.scala:61)
at com.fasterxml.jackson.module.scala.DefaultScalaModule.<init>(DefaultScalaModule.scala:20)
at com.fasterxml.jackson.module.scala.DefaultScalaModule$.<init>(DefaultScalaModule.scala:37)
at com.fasterxml.jackson.module.scala.DefaultScalaModule$.<clinit>(DefaultScalaModule.scala)
at org.apache.spark.rdd.RDDOperationScope$.<init>(RDDOperationScope.scala:82)
at org.apache.spark.rdd.RDDOperationScope$.<clinit>(RDDOperationScope.scala)
at org.apache.spark.SparkContext.withScope(SparkContext.scala:701)
at org.apache.spark.SparkContext.parallelize(SparkContext.scala:715)
at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.mergeSchemasInParallel(ParquetFileFormat.scala:594)
at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.inferSchema(ParquetFileFormat.scala:235)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:184)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:184)
at scala.Option.orElse(Option.scala:289)
at org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$getOrInferFileFormatSchema(DataSource.scala:183)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:387)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:441)
at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:425)
... 47 elided
This error does not happen in the normal spark-shell, only in Zeppelin. I have attempted the following fixes, none of which help:
- Downloading jackson 2.6.2 jars to the Zeppelin lib folder and restarting
- Adding jackson 2.9 dependencies from the Maven repositories to the interpreter settings
- Deleting the jackson jars from the Zeppelin lib folder
Googling is turning up no similar situations. Please don't hesitate to ask for more information, or make suggestions. Thanks!
I had the same problem. I added com.amazonaws:aws-java-sdk and org.apache.hadoop:hadoop-aws as dependencies for the Spark interpreter. These dependencies bring in their own versions of com.fasterxml.jackson.core:* and conflict with Spark's.
You must also exclude com.fasterxml.jackson.core:* from the other dependencies. Here is an example ${ZEPPELIN_HOME}/conf/interpreter.json Spark interpreter dependency section:
"dependencies": [
{
"groupArtifactVersion": "com.amazonaws:aws-java-sdk:1.7.4",
"local": false,
"exclusions": ["com.fasterxml.jackson.core:*"]
},
{
"groupArtifactVersion": "org.apache.hadoop:hadoop-aws:2.7.1",
"local": false,
"exclusions": ["com.fasterxml.jackson.core:*"]
}
]
Another way is to include it right in the notebook cell:
%dep
z.load("com.fasterxml.jackson.core:jackson-core:2.6.2")

Spark SQL: TwitterUtils Streaming fails for unknown reason

I am using the latest Spark master and additionally, I am loading these jars:
- spark-streaming-twitter_2.10-1.1.0-SNAPSHOT.jar
- twitter4j-core-4.0.2.jar
- twitter4j-stream-4.0.2.jar
My simple test program that I execute in the shell looks as follows:
import org.apache.spark.streaming._
import org.apache.spark.streaming.twitter._
import org.apache.spark.streaming.StreamingContext._
System.setProperty("twitter4j.oauth.consumerKey", "jXgXF...")
System.setProperty("twitter4j.oauth.consumerSecret", "mWPvQRl1....")
System.setProperty("twitter4j.oauth.accessToken", "26176....")
System.setProperty("twitter4j.oauth.accessTokenSecret", "J8Fcosm4...")
var ssc = new StreamingContext(sc, Seconds(1))
var tweets = TwitterUtils.createStream(ssc, None)
var statuses = tweets.map(_.getText)
statuses.print()
ssc.start()
However, I don't get any tweets. The main error I see is:
14/08/04 10:52:35 ERROR scheduler.ReceiverTracker: Deregistered receiver for stream 0: Error starting receiver 0 - java.lang.NoSuchMethodError: twitter4j.TwitterStream.addListener(Ltwitter4j/StatusListener;)V
at org.apache.spark.streaming.twitter.TwitterReceiver.onStart(TwitterInputDStream.scala:72)
....
And then for each iteration:
INFO scheduler.ReceiverTracker: Stream 0 received 0 blocks
I'm not sure where the problem lies.
How can I verify that my twitter credentials are correctly recognized?
Might there be another jar missing?
NoSuchMethodError should always cause you to ask whether you are running with the same versions of libraries and classes that you compiled with.
If you look at the pom.xml file for the Spark examples module, you'll see that it uses twitter4j 3.0.3. You're bringing incompatible 4.0.2 with you at runtime and that breaks it.
Yes, Sean Owen has given the right reason. After adding these two dependencies to the pom.xml file:
<dependency>
  <groupId>org.twitter4j</groupId>
  <artifactId>twitter4j-core</artifactId>
  <version>3.0.6</version>
</dependency>
<dependency>
  <groupId>org.twitter4j</groupId>
  <artifactId>twitter4j-stream</artifactId>
  <version>3.0.6</version>
</dependency>
This changes the default twitter4j version from 4.0.x to 3.0.x (http://mvnrepository.com/artifact/org.twitter4j/twitter4j-core), which solves the incompatibility problem.
