Spark + Amazon S3 "s3a://" urls

AFAIK, the newest and best S3 implementation for Hadoop + Spark is invoked by using the "s3a://" URL scheme. This works great on pre-configured Amazon EMR.
However, when running on a local dev system using the pre-built spark-2.0.0-bin-hadoop2.7.tgz, I get:
Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2101)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
... 99 more
Next I tried to launch my Spark job specifying the hadoop-aws addon:
$SPARK_HOME/bin/spark-submit --master local \
--packages org.apache.hadoop:hadoop-aws:2.7.3 \
my_spark_program.py
I get
::::::::::::::::::::::::::::::::::::::::::::::
:: FAILED DOWNLOADS ::
:: ^ see resolution messages for details ^ ::
::::::::::::::::::::::::::::::::::::::::::::::
:: com.google.code.findbugs#jsr305;3.0.0!jsr305.jar
:: org.apache.avro#avro;1.7.4!avro.jar
:: org.xerial.snappy#snappy-java;1.0.4.1!snappy-java.jar(bundle)
::::::::::::::::::::::::::::::::::::::::::::::
I made a dummy build.sbt project in a temp directory with those three dependencies to see whether a basic sbt build could download them, and I got:
[error] (*:update) sbt.ResolveException: unresolved dependency: org.apache.avro#avro;1.7.4: several problems occurred while resolving dependency: org.apache.avro#avro;1.7.4 {compile=[default(compile)]}:
[error] org.apache.avro#avro;1.7.4!avro.pom(pom.original) origin location must be absolute: file:/Users/username/.m2/repository/org/apache/avro/avro/1.7.4/avro-1.7.4.pom
[error] org.apache.avro#avro;1.7.4!avro.pom(pom.original) origin location must be absolute: file:/Users/username/.m2/repository/org/apache/avro/avro/1.7.4/avro-1.7.4.pom
[error]
[error] unresolved dependency: com.google.code.findbugs#jsr305;3.0.0: several problems occurred while resolving dependency: com.google.code.findbugs#jsr305;3.0.0 {compile=[default(compile)]}:
[error] com.google.code.findbugs#jsr305;3.0.0!jsr305.pom(pom.original) origin location must be absolute: file:/Users/username/.m2/repository/com/google/code/findbugs/jsr305/3.0.0/jsr305-3.0.0.pom
[error] com.google.code.findbugs#jsr305;3.0.0!jsr305.pom(pom.original) origin location must be absolute: file:/Users/username/.m2/repository/com/google/code/findbugs/jsr305/3.0.0/jsr305-3.0.0.pom
[error]
[error] unresolved dependency: org.xerial.snappy#snappy-java;1.0.4.1: several problems occurred while resolving dependency: org.xerial.snappy#snappy-java;1.0.4.1 {compile=[default(compile)]}:
[error] org.xerial.snappy#snappy-java;1.0.4.1!snappy-java.pom(pom.original) origin location must be absolute: file:/Users/username/.m2/repository/org/xerial/snappy/snappy-java/1.0.4.1/snappy-java-1.0.4.1.pom
[error] org.xerial.snappy#snappy-java;1.0.4.1!snappy-java.pom(pom.original) origin location must be absolute: file:/Users/username/.m2/repository/org/xerial/snappy/snappy-java/1.0.4.1/snappy-java-1.0.4.1.pom
[error] Total time: 2 s, completed Sep 2, 2016 6:47:17 PM
Any ideas on how I can get this working?

It looks like you need to pass additional jars to spark-submit. The Maven repository has a number of AWS packages for Java which you can use to fix your current error: https://mvnrepository.com/search?q=aws
I keep running into this S3A filesystem error myself, but the aws-java-sdk:1.7.4 jar works for Spark 2.0.
Further discussion of the matter can be found here, although there is in fact an actual package in the Maven AWS EC2 repository:
https://sparkour.urizone.net/recipes/using-s3/
Try this:
spark-submit --packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3 my_spark_program.py

If you are using Apache Spark (that is, I'm ignoring the build Amazon ships in EMR), you need to add a dependency on org.apache.hadoop:hadoop-aws for exactly the same version of Hadoop that the rest of Spark uses. This adds the S3A filesystem and its transitive dependencies. The version of the AWS SDK must be the same as the one used to build the hadoop-aws library, as the SDK is a bit of a moving target.
See: Apache Spark and Object Stores
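As a rough illustration of the version-matching rule above, the following Python sketch reads the Hadoop version off the jar names bundled with a Spark distribution and builds the matching hadoop-aws coordinate. The function names are hypothetical, and the assumption that the version can be read from a hadoop-common-<version>.jar (or hadoop-client-api-<version>.jar) file under $SPARK_HOME/jars should be checked against your own install.

```python
import os
import re
from typing import Iterable, Optional

# Assumption: one of these jars is present and its file name carries the
# Hadoop version the Spark distribution was built against.
_HADOOP_JAR = re.compile(r"^hadoop-(?:common|client-api)-(\d+\.\d+\.\d+)\.jar$")


def hadoop_version_from_names(jar_names: Iterable[str]) -> Optional[str]:
    """Return the Hadoop version encoded in the jar names, or None if absent."""
    for name in sorted(jar_names):
        match = _HADOOP_JAR.match(name)
        if match:
            return match.group(1)
    return None


def detect_hadoop_version(spark_home: str) -> Optional[str]:
    """Inspect <spark_home>/jars (assumed layout) for the bundled Hadoop version."""
    return hadoop_version_from_names(os.listdir(os.path.join(spark_home, "jars")))


def hadoop_aws_coordinate(hadoop_version: str) -> str:
    """Build the hadoop-aws Maven coordinate for the same Hadoop version."""
    return f"org.apache.hadoop:hadoop-aws:{hadoop_version}"
```

The resulting coordinate is what you would pass to spark-submit --packages. The AWS SDK that hadoop-aws was built against is then pulled in as a transitive dependency rather than being chosen by hand.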

Related

Azure Databricks - Library Installation Fails

I am running a job from a jar file in Azure Databricks. This jar has a dependency on azure-storage-file-share. Previously, there wasn't an issue installing this dependency using Maven within the Databricks UI. Now, I get a failure with this error message:
Run result unavailable: job failed with error message Library installation failed for library due to user error for maven { coordinates: "com.azure:azure-storage-file-share:12.16.2" } Error messages: Library installation attempted on the driver node of cluster 0131-144423-r1g81134 and failed. Please refer to the following error message to fix the library or contact Databricks support. Error Code: DRIVER_LIBRARY_INSTALLATION_FAILURE. Error Message: java.util.concurrent.ExecutionException: java.io.FileNotFoundException: File file:/local_disk0/tmp/clusterWideResolutionDir/maven/ivy/jars/io.netty_netty-transport-native-kqueue-4.1.86.Final.jar does not exist
To try to work around this, I manually installed this netty library (and several others) as well. I can see in the logs that it was able to download the jar successfully:
downloading https://maven-central.storage-download.googleapis.com/maven2/io/netty/netty-transport-native-kqueue/4.1.86.Final/netty-transport-native-kqueue-4.1.86.Final-osx-x86_64.jar ... [SUCCESSFUL ] io.netty#netty-transport-native-kqueue;4.1.86.Final!netty-transport-native-kqueue.jar (202ms)
However, it still fails. This error message is in the same log after the one above:
23/01/31 14:50:09 WARN LibraryState: [Thread 137] Failed to install library maven;azure-storage-file-share;com.azure;12.16.2;; java.util.concurrent.ExecutionException: java.io.FileNotFoundException: File file:/local_disk0/tmp/clusterWideResolutionDir/maven/ivy/jars/io.netty_netty-transport-native-kqueue-4.1.86.Final.jar does not exist

spark-cassandra-connector on EMR serverless (PySpark)

I'm working on getting an application running on EMR Serverless and am having trouble pulling in the spark-cassandra-connector. I have no issue pulling it in locally, but all of my attempts at using the library on EMR Serverless have failed.
When I include the library using --jars s3://XXX/XXXX/spark-cassandra-connector-driver_2.12-3.2.0.jar, I get an error on the following statement:
d = spark \
    .read \
    .format("org.apache.spark.sql.cassandra") \
    .options(table="YYYY", keyspace="YYY") \
    .load()
with the error
py4j.protocol.Py4JJavaError: An error occurred while calling o121.load.
: java.lang.ClassNotFoundException:
Failed to find data source: org.apache.spark.sql.cassandra. Please find packages at
http://spark.apache.org/third-party-projects.html
When I try to add the package using --packages com.datastax.spark:spark-cassandra-connector_2.12:3.2.0, the application times out with
com.datastax.spark#spark-cassandra-connector_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-5ee06249-c545-4b92-804f-ecedd322158a;1.0
confs: [default]
:: resolution report :: resolve 524554ms :: artifacts dl 0ms
:: modules in use:
---------------------------------------------------------------------
| | modules || artifacts |
| conf | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
| default | 1 | 0 | 0 | 0 || 0 | 0 |
---------------------------------------------------------------------
:: problems summary ::
:::: WARNINGS
module not found: com.datastax.spark#spark-cassandra-connector_2.12;3.2.0
==== local-m2-cache: tried
file:/home/hadoop/.m2/repository/com/datastax/spark/spark-cassandra-connector_2.12/3.2.0/spark-cassandra-connector_2.12-3.2.0.pom
-- artifact com.datastax.spark#spark-cassandra-connector_2.12;3.2.0!spark-cassandra-connector_2.12.jar:
file:/home/hadoop/.m2/repository/com/datastax/spark/spark-cassandra-connector_2.12/3.2.0/spark-cassandra-connector_2.12-3.2.0.jar
==== local-ivy-cache: tried
/home/hadoop/.ivy2/local/com.datastax.spark/spark-cassandra-connector_2.12/3.2.0/ivys/ivy.xml
-- artifact com.datastax.spark#spark-cassandra-connector_2.12;3.2.0!spark-cassandra-connector_2.12.jar:
/home/hadoop/.ivy2/local/com.datastax.spark/spark-cassandra-connector_2.12/3.2.0/jars/spark-cassandra-connector_2.12.jar
==== central: tried
https://repo1.maven.org/maven2/com/datastax/spark/spark-cassandra-connector_2.12/3.2.0/spark-cassandra-connector_2.12-3.2.0.pom
-- artifact com.datastax.spark#spark-cassandra-connector_2.12;3.2.0!spark-cassandra-connector_2.12.jar:
https://repo1.maven.org/maven2/com/datastax/spark/spark-cassandra-connector_2.12/3.2.0/spark-cassandra-connector_2.12-3.2.0.jar
==== spark-packages: tried
https://repos.spark-packages.org/com/datastax/spark/spark-cassandra-connector_2.12/3.2.0/spark-cassandra-connector_2.12-3.2.0.pom
-- artifact com.datastax.spark#spark-cassandra-connector_2.12;3.2.0!spark-cassandra-connector_2.12.jar:
https://repos.spark-packages.org/com/datastax/spark/spark-cassandra-connector_2.12/3.2.0/spark-cassandra-connector_2.12-3.2.0.jar
::::::::::::::::::::::::::::::::::::::::::::::
:: UNRESOLVED DEPENDENCIES ::
::::::::::::::::::::::::::::::::::::::::::::::
:: com.datastax.spark#spark-cassandra-connector_2.12;3.2.0: not found
::::::::::::::::::::::::::::::::::::::::::::::
:::: ERRORS
Server access error at url https://repo1.maven.org/maven2/com/datastax/spark/spark-cassandra-connector_2.12/3.2.0/spark-cassandra-connector_2.12-3.2.0.pom (java.net.ConnectException: Connection timed out (Connection timed out))
Server access error at url https://repo1.maven.org/maven2/com/datastax/spark/spark-cassandra-connector_2.12/3.2.0/spark-cassandra-connector_2.12-3.2.0.jar (java.net.ConnectException: Connection timed out (Connection timed out))
Server access error at url https://repos.spark-packages.org/com/datastax/spark/spark-cassandra-connector_2.12/3.2.0/spark-cassandra-connector_2.12-3.2.0.pom (java.net.ConnectException: Connection timed out (Connection timed out))
Server access error at url https://repos.spark-packages.org/com/datastax/spark/spark-cassandra-connector_2.12/3.2.0/spark-cassandra-connector_2.12-3.2.0.jar (java.net.ConnectException: Connection timed out (Connection timed out))
I'm betting the --packages issue comes from a firewall configuration problem, but I'm not seeing any way to open up access. As for the --jars issue, I'm not sure why pulling in the .jar isn't enough for Spark to recognize the org.apache.spark.sql.cassandra format.
Any help on either issue would be appreciated, thanks!

Yes, the issue with --packages is most probably due to your egress settings, which prevent access to Maven Central.
To use --jars you need to specify all of the necessary jars: the driver, the connector, the Java driver, and so on. The easiest way to avoid this is to use the so-called assembly build, which is also available on Maven Central under the coordinate com.datastax.spark:spark-cassandra-connector-assembly_2.12:3.2.0. Just download the referenced jar file.
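To locate the jar to download, a Maven coordinate can be turned into a Maven Central URL by following the same path layout that appears in the resolver log in the question. This is a minimal sketch: the function name is hypothetical, and it only handles plain group:artifact:version coordinates without classifiers.

```python
def maven_central_jar_url(
    coordinate: str, base: str = "https://repo1.maven.org/maven2"
) -> str:
    """Map 'group:artifact:version' to the jar URL in a Maven-layout repository."""
    # Raises ValueError if the coordinate does not have exactly three parts.
    group_id, artifact_id, version = coordinate.split(":")
    group_path = group_id.replace(".", "/")
    return f"{base}/{group_path}/{artifact_id}/{version}/{artifact_id}-{version}.jar"


if __name__ == "__main__":
    print(
        maven_central_jar_url(
            "com.datastax.spark:spark-cassandra-connector-assembly_2.12:3.2.0"
        )
    )
```

Fetch the jar from a machine that does have internet access, upload it to your own bucket, and pass that s3:// path to --jars in place of the connector-driver jar.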

Spark spark-sql-kafka - java.lang.NoClassDefFoundError: org/apache/kafka/common/serialization/ByteArraySerializer

I am experimenting with Spark reading from a Kafka topic, following the "Structured Streaming + Kafka Integration Guide".
Spark version: 3.2.1
Scala version: 2.12.15
Following the guide's instructions for spark-shell, including the dependencies, I start my shell:
spark-shell --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.1
However, once I run something like the following in my shell:
val df = spark.readStream.format("kafka").option("kafka.bootstrap.servers","http://HOST:PORT").option("subscribe", "my-topic").load()
I get the following exception:
java.lang.NoClassDefFoundError: org/apache/kafka/common/serialization/ByteArraySerializer
Any ideas how to overcome this issue?
My assumption was that with --packages, all transitive dependencies would be loaded as well, but this does not seem to be the case. From the logs I assume that the package gets loaded successfully, including the kafka-clients dependency:
org.apache.spark#spark-sql-kafka-0-10_2.12 added as a dependency
resolving dependencies :: org.apache.spark#spark-submit-parent-3b04f646-471c-4cc8-88fb-7e32bc3226ed;1.0
confs: [default]
found org.apache.spark#spark-sql-kafka-0-10_2.12;3.2.1 in central
found org.apache.spark#spark-token-provider-kafka-0-10_2.12;3.2.1 in central
found org.apache.kafka#kafka-clients;2.8.0 in central
found org.lz4#lz4-java;1.7.1 in central
found org.xerial.snappy#snappy-java;1.1.8.4 in central
found org.slf4j#slf4j-api;1.7.30 in central
found org.apache.hadoop#hadoop-client-runtime;3.3.1 in central
found org.spark-project.spark#unused;1.0.0 in central
found org.apache.hadoop#hadoop-client-api;3.3.1 in central
found org.apache.htrace#htrace-core4;4.1.0-incubating in central
found commons-logging#commons-logging;1.1.3 in central
found com.google.code.findbugs#jsr305;3.0.0 in central
found org.apache.commons#commons-pool2;2.6.2 in central

The logs seem fine, but you can try including the kafka-clients dependency in the --packages argument as well.
Otherwise, I'd suggest creating an uber jar instead of downloading the libraries every time you submit the app.
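One way to pick the kafka-clients version for that explicit --packages entry is to reuse the version the resolver already reported. The sketch below parses the "found ..." lines of a resolution log like the one in the question and appends the matching coordinate. The function names are hypothetical, and whether listing kafka-clients explicitly resolves the NoClassDefFoundError is not guaranteed.

```python
import re
from typing import List

# Matches resolver lines such as:
#   found org.apache.kafka#kafka-clients;2.8.0 in central
_FOUND = re.compile(r"found\s+(\S+?)#(\S+?);(\S+)\s+in\s+\S+")


def found_coordinates(resolver_log: str) -> List[str]:
    """Return 'group:artifact:version' for every 'found' line in the log."""
    return [f"{g}:{a}:{v}" for g, a, v in _FOUND.findall(resolver_log)]


def packages_argument(
    primary: str, resolver_log: str, artifact_id: str = "kafka-clients"
) -> str:
    """Append the resolved coordinate(s) of artifact_id to the primary package."""
    extras = [
        c for c in found_coordinates(resolver_log) if c.split(":")[1] == artifact_id
    ]
    return ",".join([primary] + extras)
```

The returned string is a comma-separated coordinate list, which is the form that spark-shell --packages and spark-submit --packages accept.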

issue on Maven dependency in Azure pipeline

I upgraded log4j to 2.17 in the Azure pipeline on my client's machine, and since then I have been getting the error below. Can anyone help me solve it?
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.1:compile (default-compile) on project Test_Automation: Execution default-compile of goal org.apache.maven.plugins:maven-compiler-plugin:3.1:compile failed: Plugin org.apache.maven.plugins:maven-compiler-plugin:3.1 or one of its dependencies could not be resolved: Failed to collect dependencies at org.apache.maven.plugins:maven-compiler-plugin:jar:3.1 -> org.codehaus.plexus:plexus-container-default:jar:1.5.5 -> org.apache.xbean:xbean-reflect:jar:3.4 -> log4j:log4j:jar:1.2.12: Failed to read artifact descriptor for log4j:log4j:jar:1.2.12: Could not transfer artifact log4j:log4j:pom:1.2.12 from/to central (https://repo.maven.apache.org/maven2): sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target -> [Help 1]
[ERROR]

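This looks like an SSL issue: the JVM running Maven does not trust the certificate presented for https://repo.maven.apache.org. You have two options:
1. Configure SSL, i.e. add the missing CA certificate to the truststore of the JVM that runs the build.
2. Use a non-HTTPS URL for the Maven repository.
For the second option you may use the following, but note that Maven Central itself has required HTTPS since January 2020, so this only helps with a mirror or internal repository that still serves plain HTTP:
<repository>
  <id>maven</id>
  <name>maven</name>
  <url>http://repo1.maven.org/maven2</url>
</repository>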

build failed for datastax spark-cassandra connector

I was trying to build the spark-cassandra connector and followed this link:
http://www.planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra/
Further down, the page asks you to download the connector from git and build it using sbt. But when I try to run the command ./sbt/sbt assembly, it fails with the following errors:
Launching sbt from sbt/sbt-launch-0.13.8.jar
[info] Loading project definition from /home/naresh/Desktop/spark-cassandra-connector/project
Using releases: https://oss.sonatype.org/service/local/staging/deploy/maven2 for releases
Using snapshots: https://oss.sonatype.org/content/repositories/snapshots for snapshots
Scala: 2.10.5 [To build against Scala 2.11 use '-Dscala-2.11=true']
Scala Binary: 2.10
Java: target=1.7 user=1.7.0_79
[info] Set current project to root (in build file:/home/naresh/Desktop/spark-cassandra-connector/)
[warn] Credentials file /home/hduser/.ivy2/.credentials does not exist
[warn] Credentials file /home/hduser/.ivy2/.credentials does not exist
[warn] Credentials file /home/hduser/.ivy2/.credentials does not exist
[warn] Credentials file /home/hduser/.ivy2/.credentials does not exist
[warn] Credentials file /home/hduser/.ivy2/.credentials does not exist
[info] Compiling 140 Scala sources and 1 Java source to /home/naresh/Desktop/spark-cassandra-connector/spark-cassandra-connector/target/scala-2.10/classes...
[error] /home/naresh/Desktop/spark-cassandra-connector/spark-cassandra-connector/src/main/scala/org/apache/spark/sql/cassandra/CassandraCatalog.scala:48: not found: value processTableIdentifier
[error] val id = processTableIdentifier(tableIdentifier).reverse.lift
[error] ^
[error] /home/naresh/Desktop/spark-cassandra-connector/spark-cassandra-connector/src/main/scala/org/apache/spark/sql/cassandra/CassandraCatalog.scala:134: value toSeq is not a member of org.apache.spark.sql.catalyst.TableIdentifier
[error] cachedDataSourceTables.refresh(tableIdent.toSeq)
[error] ^
[error] /home/naresh/Desktop/spark-cassandra-connector/spark-cassandra-connector/src/main/scala/org/apache/spark/sql/cassandra/CassandraSQLContext.scala:94: not found: value BroadcastNestedLoopJoin
[error] BroadcastNestedLoopJoin
[error] ^
[error] three errors found
[info] Compiling 11 Scala sources to /home/naresh/Desktop/spark-cassandra-connector/spark-cassandra-connector-embedded/target/scala-2.10/classes...
[warn] /home/naresh/Desktop/spark-cassandra-connector/spark-cassandra-connector-embedded/src/main/scala/com/datastax/spark/connector/embedded/SparkTemplate.scala:69: value actorSystem in class SparkEnv is deprecated: Actor system is no longer supported as of 1.4.0
[warn] def actorSystem: ActorSystem = SparkEnv.get.actorSystem
[warn] ^
[warn] one warning found
[error] (spark-cassandra-connector/compile:compileIncremental) Compilation failed
[error] Total time: 27 s, completed 4 Nov, 2015 12:34:33 PM

This works for me. Run:
mvn -DskipTests clean package
You can find the Spark build command in the README.md file in your Spark directory.
Before running that command, you'll need to configure Maven to use more memory than usual by setting MAVEN_OPTS:
export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
