how to list spark-packages added to the spark context? - apache-spark

Is it possible to list what spark packages have been added to the spark session?
The class org.apache.spark.deploy.SparkSubmitArguments has a variable for the packages:
var packages: String = null
Assuming this is a list of the spark packages, is this available via SparkContext or somewhere else?

I use the following method to retrieve that information: spark.sparkContext.listJars
For example:
$ spark-shell --packages elsevierlabs-os:spark-xml-utils:1.4.0
scala> spark.sparkContext.listJars.foreach(println)
spark://192.168.0.255:51167/jars/elsevierlabs-os_spark-xml-utils-1.4.0.jar
spark://192.168.0.255:51167/jars/commons-io_commons-io-2.4.jar
spark://192.168.0.255:51167/jars/commons-logging_commons-logging-1.2.jar
spark://192.168.0.255:51167/jars/org.apache.commons_commons-lang3-3.4.jar
spark://192.168.0.255:51167/jars/net.sf.saxon_Saxon-HE-9.6.0-7.jar
In this case, I loaded the spark-xml-utils package, and the other jars were loaded as dependencies.
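If you want the original --packages coordinates rather than the resolved jar URLs, they should also be visible in the SparkConf under the spark.jars.packages key. A minimal sketch for the same spark-shell session (the empty-string default just covers the case where no packages were passed):
scala> spark.sparkContext.getConf.get("spark.jars.packages", "")
// for the shell above this is expected to return "elsevierlabs-os:spark-xml-utils:1.4.0"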

Related

Spark Error: IllegalAccessError: class org.apache.hadoop.hdfs.web.HftpFileSystem

I'm a newbie with this, so please be a little patient.
I'm running a Spark job that writes some data into HBase, and I get this error:
2022-06-22 12:45:22:901 | ERROR | Caused by: java.lang.IllegalAccessError:
class org.apache.hadoop.hdfs.web.HftpFileSystem
cannot access its superinterface
org.apache.hadoop.hdfs.web.TokenAspect$TokenManagementDelegator
I read Error through remote Spark Job: java.lang.IllegalAccessError: class org.apache.hadoop.hdfs.web.HftpFileSystem, and since I'm using Gradle instead of Maven, I tried to exclude the conflicting hadoop-client dependency like this:
compileOnly("org.apache.spark:spark-core_$scala_major:$spark_version") {
    exclude group: "org.apache.hadoop", module: "hadoop-client"
}
Compilation works fine, but execution fails in exactly the same way.
These are my versions:
spark_version = 2.4.7
hadoop_version = 3.1.1
Everything I have read is about conflicts between Spark and Hadoop, so:
How can I fix this? All I can think of is excluding the Hadoop classes from the spark-core dependency and adding the right version of the Hadoop dependency.
Where can I find a reference on which versions are compatible (to pick the right version of the Hadoop libraries)?
Can this be solved by the infra team changing something on the cluster?
I am not sure if I understood the issue correctly.
Thanks.

using spark streaming from pyspark 2.4.4

I have Spark 2.4.4 set up in a k8s container. I am trying to write a simple hello world using Spark Streaming, like this:
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

spark = SparkSession.builder.appName("pyspark-kafka").getOrCreate()
sc = spark.sparkContext  # SparkContext backing the session, needed by the streaming API
sc.setLogLevel("WARN")

ssc = StreamingContext(sc, 60)
kafkaStream = KafkaUtils.createDirectStream(ssc, ['users-update'], {"metadata.broker.list": 'pubsub-0.pubsub:9092,pubsub-1.pubsub:9092,pubsub-2.pubsub:9092'})
Note that pubsub-x.pubsub are kafka brokers that are visible to my container. (And a simple python program that directly uses the kafka-python client with the brokers and topic in my last line of pyspark code works just fine.)
I get this error message:
________________________________________________________________________________________________
Spark Streaming's Kafka libraries not found in class path. Try one of the following.
1. Include the Kafka library and its dependencies with in the
spark-submit command as
$ bin/spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8:2.4.4 ...
2. Download the JAR of the artifact from Maven Central http://search.maven.org/,
Group Id = org.apache.spark, Artifact Id = spark-streaming-kafka-0-8-assembly, Version = 2.4.4.
Then, include the jar in the spark-submit command as
$ bin/spark-submit --jars <spark-streaming-kafka-0-8-assembly.jar> ...
________________________________________________________________________________________________
There are no version 2.4.4 Kafka libraries anywhere on Maven; https://search.maven.org/search?q=spark%20kafka shows that the latest posted jars are for 2.10 or 2.11.
I do have a spark-streaming_2.12-2.4.4.jar in my pyspark installation, but it doesn't seem to have the right Kafka classes.
Thanks for any pointers!
--Sridhar
Spark v2.4.4 is pre-built with Scala v2.11. From the Spark download page:
Note that, Spark is pre-built with Scala 2.11 except version 2.4.2, which is pre-built with Scala 2.12.
So 2.10 and 2.11 are the Scala versions that Spark is built with, and you should download the spark-streaming-kafka jar built with the same Scala version, in your case 2.11.
I have checked the jars folder in Spark 2.4.4 and spark-streaming_2.11-2.4.4.jar is present there. So you should remove spark-streaming_2.12-2.4.4.jar if you added it to the classpath externally, or else you will get a version mismatch.
You can download the spark-streaming-kafka-0-8-assembly jar from here.
And I think you also need to add the kafka-clients jar from here as well.
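Putting those pieces together, the submit command would look roughly like this (a sketch only: the jar file names and script name are illustrative, and the Scala suffix must be _2.11 to match this Spark build):
$ bin/spark-submit --jars spark-streaming-kafka-0-8-assembly_2.11-2.4.4.jar,kafka-clients-0.8.2.1.jar my_streaming_job.py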
I do have a spark-streaming_2.12-2.4.4.jar in my pyspark installation, but it doesn't seem to have the right Kafka classes.
That's the base streaming package for Spark alone; Spark does not come with the Kafka classes.
Spark Streaming is deprecated in favor of Spark Structured Streaming.
You want this package for Spark with Scala 2.12
'org.apache.spark:spark-sql-kafka-0-10_2.12:2.4.4'
And you'd start like this, including options for the bootstrap servers:
df = spark.readStream.format("kafka").option("kafka.bootstrap.servers", "pubsub-0.pubsub:9092,pubsub-1.pubsub:9092,pubsub-2.pubsub:9092").option("subscribe", "users-update").load()

Jar conflicts between apache spark and hadoop

I am trying to set up and run a Spark cluster on top of YARN, using HDFS.
I first set up Hadoop for HDFS using hadoop-3.1.0.
Then I configured YARN and started both.
I was able to upload data to HDFS, and YARN also seems to work fine.
Then I installed spark-2.3.0-bin-without-hadoop on my master only and tried to submit an application.
Since it is Spark without Hadoop, I had to modify spark-env.sh, adding the following line as mentioned in the documentation:
export SPARK_DIST_CLASSPATH=$(/usr/local/hadoop/bin/hadoop classpath)
Using only this line I got the following exception:
Error: Could not find or load main class org.apache.spark.deploy.yarn.ApplicationMaster
which I guess means that it does not find the Spark libraries. So I added the Spark jars to the classpath:
export SPARK_DIST_CLASSPATH=$(/usr/local/hadoop/bin/hadoop classpath):/usr/local/spark/jars/*
But now I get the following Exception:
com.fasterxml.jackson.databind.JsonMappingException: Incompatible Jackson version: 2.7.8
As it turns out, Hadoop 3.1.0 provides Jackson 2.7.8 while Spark 2.3.0 provides Jackson 2.6.7. As I see it, both are now in the classpath resulting in a conflict.
Since it seems I really need both the Hadoop and Spark libraries to submit anything, I do not know how to get around that problem.
In How is Hadoop-3.0.0's compatibility with older versions of Hive, Pig, Sqoop and Spark, there was an answer from @JacekLaskowski that Spark is not supported on Hadoop 3. As far as I know, nothing has changed in that area in the last 6 months.

error while connecting cassandra and spark

I have installed Cassandra 2.1.11, spark-2.0.0-bin-hadoop2.7 and Java 1.8.0_101 on my Ubuntu 14.04.
For the Spark Cassandra Connector, I installed git:
sudo apt-get install git
git clone https://github.com/datastax/spark-cassandra-connector.git
and built it:
cd spark-cassandra-connector
git checkout v1.4.0
./sbt/sbt assembly
and placed the assembled jar in my home directory:
cp spark-cassandra-connector/target/scala-2.10/spark-cassandra-connector-assembly-1.4.0-SNAPSHOT.jar ~
and used the connector:
bin/spark-shell --jars ~/spark-cassandra-connector-assembly-1.4.0-SNAPSHOT.jar
and in the Scala prompt:
sc.stop
import com.datastax.spark.connector._, org.apache.spark.SparkContext, org.apache.spark.SparkContext._, org.apache.spark.SparkConf
val conf = new SparkConf(true).set("spark.cassandra.connection.host", "localhost")
val sc = new SparkContext(conf)
I have created the test keyspace and the my_table table from cqlsh, and to test the connection I ran the following command:
val test_spark_rdd = sc.cassandraTable("test", "my_table")
and got the error
error: missing or invalid dependency detected while loading class file 'CassandraConnector.class'.
Could not access type Logging in package org.apache.spark,
because it (or its dependencies) are missing. Check your build definition for
missing or conflicting dependencies. (Re-run with `-Ylog-classpath` to see the problematic classpath.)
A full rebuild may help if 'CassandraConnector.class' was compiled against an incompatible version of org.apache.spark.
Is this due to a version mismatch between Spark and Cassandra?
This is a mismatch between Spark and Spark: you chose to use a 1.4.0 connector library with Spark 2.0.0.
Use the 2.0.0 release and also use Spark Packages.
https://spark-packages.org/package/datastax/spark-cassandra-connector
> $SPARK_HOME/bin/spark-shell --packages datastax:spark-cassandra-connector:2.0.0-M2-s_2.11
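With the matching connector loaded through --packages there is no need to stop and rebuild the SparkContext by hand. A rough sketch of reading the same table inside that shell, assuming it was started with --conf spark.cassandra.connection.host=localhost and that the test keyspace and my_table from the question exist:
scala> import com.datastax.spark.connector._
scala> val test_spark_rdd = sc.cassandraTable("test", "my_table")  // RDD API provided by the connector
scala> test_spark_rdd.count()  // simple sanity check that the connection works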

how to use graphframes inside SPARK on HDInsight cluster

I have set up a Spark cluster on HDInsight and am trying to use GraphFrames using this tutorial.
I have already used the custom scripts during cluster creation to enable GraphX on the Spark cluster, as described here.
When I run the notebook,
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.graphframes._
I get the following error:
<console>:45: error: object graphframes is not a member of package org
import org.graphframes._
^
I tried to install graphframes from the Spark terminal via Jupyter using the following command:
$SPARK_HOME/bin/spark-shell --packages graphframes:graphframes:0.1.0-spark1.5
but still I am unable to get it working. I am new to Spark and HDInsight, so can someone please point out what else I need to install on this cluster to get this working?
Today, this works in spark-shell, but doesn't work in a Jupyter notebook. So when you run this:
$SPARK_HOME/bin/spark-shell --packages graphframes:graphframes:0.1.0-spark1.5
it works (at least on the Spark 1.6 cluster version) in the context of that spark-shell session.
But in Jupyter there is currently no way to load packages. This feature is going to be added to the Jupyter notebooks in the clusters soon. In the meantime you can use spark-shell, spark-submit, etc.
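Once the package is loaded in such a spark-shell session, a quick smoke test could look roughly like this (toy vertex and edge data, purely for illustration; the column names just follow the usual GraphFrames id/src/dst convention):
scala> import org.graphframes._
scala> val v = sqlContext.createDataFrame(Seq(("a", "Alice"), ("b", "Bob"))).toDF("id", "name")
scala> val e = sqlContext.createDataFrame(Seq(("a", "b", "friend"))).toDF("src", "dst", "relationship")
scala> val g = GraphFrame(v, e)
scala> g.vertices.show()  // if this prints, the graphframes package is on the classpath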
Once you upload or import the graphframes library from the Maven repository, you need to restart your cluster to attach the library.
That worked for me.
