spark-cassandra-connector on EMR Serverless (PySpark)

I'm working on getting an application running on EMR Serverless and am having trouble pulling in the spark-cassandra-connector. I have no issue pulling it in locally, but all of my attempts at using the library on EMR Serverless have failed.
When I include the library using --jars s3://XXX/XXXX/spark-cassandra-connector-driver_2.12-3.2.0.jar, I error out on the following read
d = spark \
    .read \
    .format("org.apache.spark.sql.cassandra") \
    .options(table="YYYY", keyspace="YYY") \
    .load()
with the error
py4j.protocol.Py4JJavaError: An error occurred while calling o121.load.
: java.lang.ClassNotFoundException:
Failed to find data source: org.apache.spark.sql.cassandra. Please find packages at
http://spark.apache.org/third-party-projects.html
When I try to add the package using --packages com.datastax.spark:spark-cassandra-connector_2.12:3.2.0, the application times out with
com.datastax.spark#spark-cassandra-connector_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-5ee06249-c545-4b92-804f-ecedd322158a;1.0
confs: [default]
:: resolution report :: resolve 524554ms :: artifacts dl 0ms
:: modules in use:
---------------------------------------------------------------------
| | modules || artifacts |
| conf | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
| default | 1 | 0 | 0 | 0 || 0 | 0 |
---------------------------------------------------------------------
:: problems summary ::
:::: WARNINGS
module not found: com.datastax.spark#spark-cassandra-connector_2.12;3.2.0
==== local-m2-cache: tried
file:/home/hadoop/.m2/repository/com/datastax/spark/spark-cassandra-connector_2.12/3.2.0/spark-cassandra-connector_2.12-3.2.0.pom
-- artifact com.datastax.spark#spark-cassandra-connector_2.12;3.2.0!spark-cassandra-connector_2.12.jar:
file:/home/hadoop/.m2/repository/com/datastax/spark/spark-cassandra-connector_2.12/3.2.0/spark-cassandra-connector_2.12-3.2.0.jar
==== local-ivy-cache: tried
/home/hadoop/.ivy2/local/com.datastax.spark/spark-cassandra-connector_2.12/3.2.0/ivys/ivy.xml
-- artifact com.datastax.spark#spark-cassandra-connector_2.12;3.2.0!spark-cassandra-connector_2.12.jar:
/home/hadoop/.ivy2/local/com.datastax.spark/spark-cassandra-connector_2.12/3.2.0/jars/spark-cassandra-connector_2.12.jar
==== central: tried
https://repo1.maven.org/maven2/com/datastax/spark/spark-cassandra-connector_2.12/3.2.0/spark-cassandra-connector_2.12-3.2.0.pom
-- artifact com.datastax.spark#spark-cassandra-connector_2.12;3.2.0!spark-cassandra-connector_2.12.jar:
https://repo1.maven.org/maven2/com/datastax/spark/spark-cassandra-connector_2.12/3.2.0/spark-cassandra-connector_2.12-3.2.0.jar
==== spark-packages: tried
https://repos.spark-packages.org/com/datastax/spark/spark-cassandra-connector_2.12/3.2.0/spark-cassandra-connector_2.12-3.2.0.pom
-- artifact com.datastax.spark#spark-cassandra-connector_2.12;3.2.0!spark-cassandra-connector_2.12.jar:
https://repos.spark-packages.org/com/datastax/spark/spark-cassandra-connector_2.12/3.2.0/spark-cassandra-connector_2.12-3.2.0.jar
::::::::::::::::::::::::::::::::::::::::::::::
:: UNRESOLVED DEPENDENCIES ::
::::::::::::::::::::::::::::::::::::::::::::::
:: com.datastax.spark#spark-cassandra-connector_2.12;3.2.0: not found
::::::::::::::::::::::::::::::::::::::::::::::
:::: ERRORS
Server access error at url https://repo1.maven.org/maven2/com/datastax/spark/spark-cassandra-connector_2.12/3.2.0/spark-cassandra-connector_2.12-3.2.0.pom (java.net.ConnectException: Connection timed out (Connection timed out))
Server access error at url https://repo1.maven.org/maven2/com/datastax/spark/spark-cassandra-connector_2.12/3.2.0/spark-cassandra-connector_2.12-3.2.0.jar (java.net.ConnectException: Connection timed out (Connection timed out))
Server access error at url https://repos.spark-packages.org/com/datastax/spark/spark-cassandra-connector_2.12/3.2.0/spark-cassandra-connector_2.12-3.2.0.pom (java.net.ConnectException: Connection timed out (Connection timed out))
Server access error at url https://repos.spark-packages.org/com/datastax/spark/spark-cassandra-connector_2.12/3.2.0/spark-cassandra-connector_2.12-3.2.0.jar (java.net.ConnectException: Connection timed out (Connection timed out))
I'm betting the --packages issue comes from a firewall configuration problem, but I'm not seeing any way to open up access. As for the --jars issue, I'm not sure why pulling in the .jar isn't enough for Spark to recognize the org.apache.spark.sql.cassandra format.
Any help on either issue would be appreciated, thanks!

Yes, the issue with --packages is most probably due to your egress settings, which prevent access to Maven Central.
To use --jars you need to specify all of the necessary jars yourself: the connector, the Java driver, and so on. The easiest way to avoid this is to use the so-called assembly build, which is also available on Maven Central under the coordinate com.datastax.spark:spark-cassandra-connector-assembly_2.12:3.2.0. Just download the referenced jar file and pass that single jar instead.
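For example, a minimal sketch (not tested on EMR Serverless; the S3 path reuses the placeholder above and "cassandra-host" stands in for your contact point): submit with --jars s3://XXX/XXXX/spark-cassandra-connector-assembly_2.12-3.2.0.jar and then run
from pyspark.sql import SparkSession

# Assumes the single assembly jar was passed via --jars at submit time;
# the connection host below is a placeholder for your Cassandra contact point.
spark = (
    SparkSession.builder
    .appName("cassandra-read")
    .config("spark.cassandra.connection.host", "cassandra-host")
    .getOrCreate()
)

d = (
    spark.read
    .format("org.apache.spark.sql.cassandra")
    .options(table="YYYY", keyspace="YYY")
    .load()
)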

Related

Setting DEBUG or VERBOSE message level before creating spark session

I'm trying to debug my Spark application. It does not start the Spark session correctly, because it is unable to find a specific package hosted in an external repository I set with the spark.jars.repositories configuration.
My code looks like this:
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder.appName("my_app")
    .master("local[*]")
    .config(
        "spark.jars.packages",
        "ml.dmlc:xgboost4j-spark_2.12:1.5.2,"
        "ml.dmlc:xgboost4j_2.12:1.5.2,"
        "my.internal.group:my-internal-package:0.1.0",
    )
    .config("spark.jars.repositories", "https://my.nexus.repository/repository/")
)
spark = builder.getOrCreate()
This raises an exception with the following error:
::::::::::::::::::::::::::::::::::::::::::::::
:: UNRESOLVED DEPENDENCIES ::
::::::::::::::::::::::::::::::::::::::::::::::
:: my.internal.group#my-internal-package;0.1.0: not found
::::::::::::::::::::::::::::::::::::::::::::::
:: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
Exception in thread "main" java.lang.RuntimeException: [unresolved dependency: my.internal.group#my-internal-package;0.1.0: not found]
My question is: how can I set these VERBOSE or DEBUG message levels BEFORE starting the spark session?

Spark spark-sql-kafka - java.lang.NoClassDefFoundError: org/apache/kafka/common/serialization/ByteArraySerializer

I am experimenting with Spark reading from a Kafka topic, following the "Structured Streaming + Kafka Integration Guide".
Spark version: 3.2.1
Scala version: 2.12.15
Following the guide, I start my spark-shell with the dependency included:
spark-shell --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.1
However, once I run something like the following in my shell:
val df = spark.readStream.format("kafka").option("kafka.bootstrap.servers","http://HOST:PORT").option("subscribe", "my-topic").load()
I get the following exception:
java.lang.NoClassDefFoundError: org/apache/kafka/common/serialization/ByteArraySerializer
Any ideas how to overcome this issue?
My assumption was that with --packages, all transitive dependencies would be loaded as well. But this does not seem to be the case. From the logs, I assume that the package is loaded successfully, including the kafka-clients dependency:
org.apache.spark#spark-sql-kafka-0-10_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-3b04f646-471c-4cc8-88fb-7e32bc3226ed;1.0
confs: [default]
found org.apache.spark#spark-sql-kafka-0-10_2.12;3.2.1 in central
found org.apache.spark#spark-token-provider-kafka-0-10_2.12;3.2.1 in central
found org.apache.kafka#kafka-clients;2.8.0 in central
found org.lz4#lz4-java;1.7.1 in central
found org.xerial.snappy#snappy-java;1.1.8.4 in central
found org.slf4j#slf4j-api;1.7.30 in central
found org.apache.hadoop#hadoop-client-runtime;3.3.1 in central
found org.spark-project.spark#unused;1.0.0 in central
found org.apache.hadoop#hadoop-client-api;3.3.1 in central
found org.apache.htrace#htrace-core4;4.1.0-incubating in central
found commons-logging#commons-logging;1.1.3 in central
found com.google.code.findbugs#jsr305;3.0.0 in central
found org.apache.commons#commons-pool2;2.6.2 in central
The logs seem fine, but you can try including the kafka-clients dependency in the --packages argument as well.
Otherwise, I'd suggest building an uber jar instead of downloading the libraries every time you submit the app.
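For example (the kafka-clients version here is taken from the resolution log above; match whatever your own logs report):
spark-shell --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.1,org.apache.kafka:kafka-clients:2.8.0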

Ubuntu 18.04: Pyspark UNRESOLVED DEPENDENCIES: module not found: org.apache.spark#spark-streaming-kafka-0-10;2.3.0

I am trying to execute a spark script with the following command.
spark-submit --packages org.apache.spark:spark-streaming-kafka-0-10_2.11:2.3.0 src/sparkProcessing.py
And I am getting an unresolved dependency error, as shown below.
I am using Spark 2.3.0, Scala 2.12 and Kafka 1.1.0
Following is the error which I am getting:
:: modules in use:
---------------------------------------------------------------------
| | modules || artifacts |
| conf | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
| default | 1 | 0 | 0 | 0 || 0 | 0 |
---------------------------------------------------------------------
:: problems summary ::
:::: WARNINGS
module not found: org.apache.spark#spark-streaming-kafka-0-10;2.3.0
http://dl.bintray.com/spark-packages/maven/org/apache/spark/spark-streaming-kafka-0-10/2.3.0/spark-streaming-kafka-0-10-2.3.0.jar
::::::::::::::::::::::::::::::::::::::::::::::
:: UNRESOLVED DEPENDENCIES ::
::::::::::::::::::::::::::::::::::::::::::::::
:: org.apache.spark#spark-streaming-kafka-0-10;2.3.0: not found
::::::::::::::::::::::::::::::::::::::::::::::
:::: ERRORS
Server access error at url https://repo1.maven.org/maven2/org/apache/spark/spark-streaming-kafka-0-10/2.3.0/spark-streaming-kafka-0-10-2.3.0.pom (javax.net.ssl.SSLException: java.lang.RuntimeException: Unexpected error: java.security.InvalidAlgorithmParameterException: the trustAnchors parameter must be non-empty)
Server access error at url https://repo1.maven.org/maven2/org/apache/spark/spark-streaming-kafka-0-10/2.3.0/spark-streaming-kafka-0-10-2.3.0.jar (javax.net.ssl.SSLException: java.lang.RuntimeException: Unexpected error: java.security.InvalidAlgorithmParameterException: the trustAnchors parameter must be non-empty)
:: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
Exception in thread "main" java.lang.RuntimeException: [unresolved dependency: org.apache.spark#spark-streaming-kafka-0-10;2.3.0: not found]
at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1270)
at org.apache.spark.deploy.DependencyUtils$.resolveMavenDependencies(DependencyUtils.scala:49)
at org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:350)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:170)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:136)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Solved it using https://stackoverflow.com/a/50688351/5808464
I purged the other Java alternatives and installed OpenJDK, which I had earlier removed myself.
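For reference, a hedged sketch of that fix on Ubuntu 18.04 (package names are assumptions; the trustAnchors error usually means the JVM's cacerts trust store is empty or broken):
sudo apt-get purge openjdk-*
sudo apt-get install openjdk-8-jdk ca-certificates-java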

Spark + Amazon S3 "s3a://" urls

AFAIK, the newest, best S3 implementation for Hadoop + Spark is invoked by using the "s3a://" URL scheme. This works great on pre-configured Amazon EMR.
However, when running on a local dev system using the pre-built spark-2.0.0-bin-hadoop2.7.tgz, I get
Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2101)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
... 99 more
Next I tried to launch my Spark job specifying the hadoop-aws addon:
$SPARK_HOME/bin/spark-submit --master local \
    --packages org.apache.hadoop:hadoop-aws:2.7.3 \
    my_spark_program.py
I get
::::::::::::::::::::::::::::::::::::::::::::::
:: FAILED DOWNLOADS ::
:: ^ see resolution messages for details ^ ::
::::::::::::::::::::::::::::::::::::::::::::::
:: com.google.code.findbugs#jsr305;3.0.0!jsr305.jar
:: org.apache.avro#avro;1.7.4!avro.jar
:: org.xerial.snappy#snappy-java;1.0.4.1!snappy-java.jar(bundle)
::::::::::::::::::::::::::::::::::::::::::::::
I made a dummy build.sbt project in a temp directory with those three dependencies to see if a basic sbt build could successfully download those and I got:
[error] (*:update) sbt.ResolveException: unresolved dependency: org.apache.avro#avro;1.7.4: several problems occurred while resolving dependency: org.apache.avro#avro;1.7.4 {compile=[default(compile)]}:
[error] org.apache.avro#avro;1.7.4!avro.pom(pom.original) origin location must be absolute: file:/Users/username/.m2/repository/org/apache/avro/avro/1.7.4/avro-1.7.4.pom
[error] org.apache.avro#avro;1.7.4!avro.pom(pom.original) origin location must be absolute: file:/Users/username/.m2/repository/org/apache/avro/avro/1.7.4/avro-1.7.4.pom
[error]
[error] unresolved dependency: com.google.code.findbugs#jsr305;3.0.0: several problems occurred while resolving dependency: com.google.code.findbugs#jsr305;3.0.0 {compile=[default(compile)]}:
[error] com.google.code.findbugs#jsr305;3.0.0!jsr305.pom(pom.original) origin location must be absolute: file:/Users/username/.m2/repository/com/google/code/findbugs/jsr305/3.0.0/jsr305-3.0.0.pom
[error] com.google.code.findbugs#jsr305;3.0.0!jsr305.pom(pom.original) origin location must be absolute: file:/Users/username/.m2/repository/com/google/code/findbugs/jsr305/3.0.0/jsr305-3.0.0.pom
[error]
[error] unresolved dependency: org.xerial.snappy#snappy-java;1.0.4.1: several problems occurred while resolving dependency: org.xerial.snappy#snappy-java;1.0.4.1 {compile=[default(compile)]}:
[error] org.xerial.snappy#snappy-java;1.0.4.1!snappy-java.pom(pom.original) origin location must be absolute: file:/Users/username/.m2/repository/org/xerial/snappy/snappy-java/1.0.4.1/snappy-java-1.0.4.1.pom
[error] org.xerial.snappy#snappy-java;1.0.4.1!snappy-java.pom(pom.original) origin location must be absolute: file:/Users/username/.m2/repository/org/xerial/snappy/snappy-java/1.0.4.1/snappy-java-1.0.4.1.pom
[error] Total time: 2 s, completed Sep 2, 2016 6:47:17 PM
Any ideas on how I can get this working?
It looks like you need additional jars in your submit flags. The Maven repository has a number of AWS packages for Java that you can use to fix your current error: https://mvnrepository.com/search?q=aws
I continually run into headaches with the S3A filesystem error, but the aws-java-sdk:1.7.4 jar works for Spark 2.0.
Further dialogue on the matter can be found at the link below, although there is indeed an actual package for AWS EC2 in the Maven repository.
https://sparkour.urizone.net/recipes/using-s3/
Try this:
spark-submit --packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3 my_spark_program.py
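Equivalently, the same coordinates can be set from Python before the session is created, as in this sketch (this only takes effect if set before the JVM starts, i.e. before getOrCreate):
from pyspark.sql import SparkSession

# spark.jars.packages must be set before the first session is created;
# Spark then resolves both artifacts (and their transitive deps) at startup.
spark = (
    SparkSession.builder
    .config(
        "spark.jars.packages",
        "com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3",
    )
    .getOrCreate()
)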
If you are using Apache Spark (that is, ignoring the build Amazon ships in EMR), you need to add a dependency on org.apache.hadoop:hadoop-aws for exactly the same version of Hadoop that the rest of Spark uses. This adds the S3A FS and the transitive dependencies. The version of the AWS SDK must be the same as the one used to build the hadoop-aws library, as it's a bit of a moving target.
See: Apache Spark and Object Stores
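One way to find that Hadoop version from a running PySpark session is to ask the bundled VersionInfo class and then pin hadoop-aws to exactly what it prints (a sketch; the _jvm gateway call assumes PySpark):
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

# org.apache.hadoop.util.VersionInfo reports the Hadoop version this Spark
# build ships with; use it as the version for org.apache.hadoop:hadoop-aws.
print(spark.sparkContext._jvm.org.apache.hadoop.util.VersionInfo.getVersion())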

Apache Spark compile failed while installing Netty

We untarred spark-0.9.0-incubating.tgz and are trying to build it for use with YARN.
SPARK_HADOOP_VERSION=2.0.0-cdh4.6.0 SPARK_YARN=true sbt/sbt assembly
...
[info] Resolving io.netty#netty-all;4.0.13.Final ...
[error] Server access Error: Connection timed out url=https://oss.sonatype.org/content/repositories/snapshots/io/netty/netty-all/4.0.13.Final/netty-all-4.0.13.Final.pom
[error] Server access Error: Connection timed out url=https://oss.sonatype.org/service/local/staging/deploy/maven2/io/netty/netty-all/4.0.13.Final/netty-all-4.0.13.Final.pom
...
If I just cut and paste the URL into a browser, I get:
404 - ItemNotFoundException
Retrieval of /io/netty/netty-all/4.0.13.Final/netty-all-4.0.13.Final.pom from M2Repository(id=snapshots) is forbidden by repository policy SNAPSHOT.
org.sonatype.nexus.proxy.ItemNotFoundException: Retrieval of /io/netty/netty-all/4.0.13.Final/netty-all-4.0.13.Final.pom from M2Repository(id=snapshots) is forbidden by repository policy SNAPSHOT.
at org.sonatype.nexus.proxy.maven.AbstractMavenRepository.doRetrieveItem(AbstractMavenRepository.java:380)
at org.sonatype.nexus.proxy.maven.maven2.M2Repository.doRetrieveItem(M2Repository.java:396)
at org.sonatype.nexus.proxy.repository.AbstractRepository.retrieveItem(AbstractRepository.java:765)
at org.sonatype.nexus.proxy.repository.AbstractRepository.retrieveItem(AbstractRepository.java:608)
at org.sonatype.nexus.proxy.router.DefaultRepositoryRouter.retrieveItem(DefaultRepositoryRouter.java:155)
at org.sonatype.nexus.web.content.NexusContentServlet.doGet(NexusContentServlet.java:359)
at org.sonatype.nexus.web.content.NexusContentServlet.service(NexusContentServlet.java:331)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:848)
I have seen this reported in a number of places, but with no solution. Is this error occurring because we are behind a corporate firewall, or is it due to something else? Please advise.
I had the proxy set via environment variables, but it appears they were not being picked up. Adding them to sbt directly worked for me.
Edit $SPARK_HOME/sbt/sbt
For example,
EXTRA_ARGS="-Dhttp.proxySet=true -Dhttp.proxyHost=myproxy.mycompany.com -Dhttp.proxyPort=80 -Dhttps.proxySet=true -Dhttps.proxyHost=myproxy.mycompany.com -Dhttps.proxyPort=80 -Dftp.proxySet=true -Dftp.proxyHost=myproxy.mycompany.com -Dftp.proxyPort=80 -Dhttp.nonProxyHosts=mydomain -Dhttps.nonProxyHosts=mydomain -Dftp.nonProxyHosts=mydomain"
