I'm new to Cassandra and PySpark. Initially I installed Cassandra 3.11.1, OpenJDK 1.8, PySpark 3.x, and Scala 1.12. After starting my Python server I was getting a lot of errors, as shown below.
raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o33.load.
: java.lang.NoClassDefFoundError: scala/Product$class
at com.datastax.spark.connector.util.ConfigParameter.<init>(ConfigParameter.scala:7)
at com.datastax.spark.connector.rdd.ReadConf$.<init>(ReadConf.scala:33)
at com.datastax.spark.connector.rdd.ReadConf$.<clinit>(ReadConf.scala)
at org.apache.spark.sql.cassandra.DefaultSource$.<init>(DefaultSource.scala:134)
at org.apache.spark.sql.cassandra.DefaultSource$.<clinit>(DefaultSource.scala)
at org.apache.spark.sql.cassandra.DefaultSource.createRelation(DefaultSource.scala:55)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:355)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:325)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:307)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:307)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:225)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: scala.Product$class
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
... 23 more
I didn't know exactly what this error was, but after some research I realized that the PySpark-Cassandra connection was having issues. Then I checked the versions as well. During my research I saw that Cassandra versions other than 4.x are not compatible with Python 3.9. I uninstalled Cassandra and tried to install a Cassandra 4 distribution, but that throws another set of errors after running the command:
wget http://mirror.cogentco.com/pub/apache/cassandra/4.0-beta2/apache-cassandra-4.0-beta2-bin.tar.gz
Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:
The following packages have unmet dependencies:
cassandra : Depends: python3 (>= 3.6) but 3.5.1-3 is to be installed
Recommends: ntp but it is not going to be installed or
time-daemon
E: Unable to correct problems, you have held broken packages.
Can someone help me understand the issue? How can I install Cassandra and PySpark along with Python 3.9? Is there any version incompatibility here?
Updating the question based on the answer below:
I have updated my versions on another machine:
Currently, I'm using the following versions: PySpark 3.0.1, Cassandra 4.0, cqlsh 5.0.1, Python 3.6, Scala 2.12.
I tried using connector 3.0.0 as well as 3.1.0; both give me errors:
UNRESOLVED DEPENDENCY: com.datastax.spark#spark-cassandra-connector_2.12;3.0.0: not found
:: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
Exception in thread "main" java.lang.RuntimeException: [unresolved dependency: com.datastax.spark#spark-cassandra-connector_2.12;3.0.0: not found]
at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1389)
at org.apache.spark.deploy.DependencyUtils$.resolveMavenDependencies(DependencyUtils.scala:54)
at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:308)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:871)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1007)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1016)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
.......
raise Exception("Java gateway process exited before sending its port number")
Exception: Java gateway process exited before sending its port number
Submit arguments used (as the PySpark version is 3.0.1 now): --packages com.datastax.spark:spark-cassandra-connector_2.12:3.0.0 --conf spark.cassandra.connection.host=127.0.0.1 pyspark-shell
You are using the wrong version of the Cassandra connector: if you are using PySpark 3.x, you need the corresponding connector version, 3.0 or 3.1. Your version is for older versions of Spark:
pyspark --packages com.datastax.spark:spark-cassandra-connector_2.12:3.1.0
P.S. Cassandra 4.0 has also been released already, so it makes no sense to use beta2.
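Once pyspark starts with the 3.x connector on the classpath, reading a table looks roughly like the sketch below. This is a minimal example: test_keyspace and test_table are placeholder names, and it assumes Cassandra is reachable on 127.0.0.1.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("cassandra-read")
    .config("spark.cassandra.connection.host", "127.0.0.1")
    .getOrCreate()
)

# Read a Cassandra table through the connector's data source
df = (
    spark.read
    .format("org.apache.spark.sql.cassandra")
    .options(table="test_table", keyspace="test_keyspace")
    .load()
)
df.show()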
I'm trying to use the spark-avro package as described in Apache Avro Data Source Guide.
When I submit the following command:
val df = spark.read.format("avro").load("~/foo.avro")
I get an error:
java.util.ServiceConfigurationError: org.apache.spark.sql.sources.DataSourceRegister: Provider org.apache.spark.sql.avro.AvroFileFormat could not be instantiated
at java.util.ServiceLoader.fail(ServiceLoader.java:232)
at java.util.ServiceLoader.access$100(ServiceLoader.java:185)
at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:384)
at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404)
at java.util.ServiceLoader$1.next(ServiceLoader.java:480)
at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:43)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at scala.collection.TraversableLike$class.filterImpl(TraversableLike.scala:247)
at scala.collection.TraversableLike$class.filter(TraversableLike.scala:259)
at scala.collection.AbstractTraversable.filter(Traversable.scala:104)
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:630)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:194)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
... 49 elided
Caused by: java.lang.NoSuchMethodError: org.apache.spark.sql.execution.datasources.FileFormat.$init$(Lorg/apache/spark/sql/execution/datasources/FileFormat;)V
at org.apache.spark.sql.avro.AvroFileFormat.<init>(AvroFileFormat.scala:44)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at java.lang.Class.newInstance(Class.java:442)
at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:380)
... 62 more
I've tried different versions of the org.apache.spark:spark-avro_2.12 package (2.4.0, 2.4.1, and 2.4.2), and I currently use Spark 2.4.1, but none of them worked.
I start my spark-shell with the following command:
spark-shell --packages org.apache.spark:spark-avro_2.12:2.4.0
tl;dr Spark 2.4.x+ provides built-in support for reading and writing Apache Avro data, but the spark-avro module is external and not included in spark-submit or spark-shell by default, so you have to make sure that you use the same Scala version (e.g. 2.12) for spark-shell and --packages.
The reason for the exception is that your spark-shell comes from a Spark build compiled against Scala 2.11.12, while --packages specifies a dependency built for Scala 2.12 (in org.apache.spark:spark-avro_2.12:2.4.0).
Use --packages org.apache.spark:spark-avro_2.11:2.4.0 and you should be fine.
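For PySpark users hitting the same mismatch, the idea is identical: the _2.11/_2.12 suffix of the spark-avro artifact must match the Scala version your Spark build was compiled against. A minimal sketch, assuming a Scala 2.11 build of Spark 2.4 and an illustrative file path:

from pyspark.sql import SparkSession

# The Scala suffix (_2.11 here) must match the Scala version of the Spark build.
spark = (
    SparkSession.builder
    .appName("avro-read")
    .config("spark.jars.packages", "org.apache.spark:spark-avro_2.11:2.4.0")
    .getOrCreate()
)

df = spark.read.format("avro").load("/path/to/foo.avro")
df.show()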
Just in case someone is interested: for pyspark 2.7 and Spark 2.4.3, the package below works:
bin/pyspark --packages org.apache.spark:spark-avro_2.11:2.4.3
One more thing I noticed when I had the same issue is that it runs fine the first time and shows the error thereafter. So clear the cache by adding an rm command to the Dockerfile; see the sketch below. That was sufficient in my case.
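For reference, assuming the cache in question is Ivy's local cache (the place --packages downloads artifacts to), a small cleanup sketch follows; in a Dockerfile this would simply be the equivalent rm -rf step.

import os
import shutil

# Clear Ivy's local artifact cache so --packages re-resolves from scratch.
for d in ("~/.ivy2/cache", "~/.ivy2/jars"):
    shutil.rmtree(os.path.expanduser(d), ignore_errors=True)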
I want to resolve the error I get when I run spark-shell.
I am on Arch, installing Apache Spark from the AUR.
When I run the command spark-shell I get the following error:
Error: A JNI error has occurred, please check your installation and try again
Exception in thread "main" java.lang.NoClassDefFoundError: org/slf4j/Logger
at java.lang.Class.getDeclaredMethods0(Native Method)
at java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
at java.lang.Class.privateGetMethodRecursive(Class.java:3048)
at java.lang.Class.getMethod0(Class.java:3018)
at java.lang.Class.getMethod(Class.java:1784)
at sun.launcher.LauncherHelper.validateMainClass(LauncherHelper.java:544)
at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:526)
Caused by: java.lang.ClassNotFoundException: org.slf4j.Logger
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 7 more
I also installed slf4j from the AUR, and it does not help either. How can I run Spark on Arch?
It seems like you have a problem regarding Java which I can't help you with...
Did you do the correct configuration for Spark?
cd /etc/profile.d
sudo nano apache-spark.sh
Modify: SPARK_HOME=/opt/apache-spark to SPARK_HOME=/opt/apache-spark/bin
That worked for me.
I installed apache-spark and pyspark on my machine (Ubuntu), and in PyCharm I also updated the environment variables (e.g. SPARK_HOME, PYSPARK_PYTHON).
I'm trying to do:
import os, sys
os.environ['SPARK_HOME'] = ".../spark-2.3.0-bin-hadoop2.7"
sys.path.append(".../spark-2.3.0-bin-hadoop2.7/bin/pyspark/")
sys.path.append(".../spark-2.3.0-bin-hadoop2.7/python/lib/py4j-0.10.6-src.zip")
from pyspark import SparkContext
from pyspark import SparkConf
sc = SparkContext('local[2]')
words = sc.parallelize(["scala", "java", "hadoop", "spark", "akka"])
print(words.count())
But I receive the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: java.lang.IllegalArgumentException
at org.apache.xbean.asm5.ClassReader.<init>(Unknown Source)
at org.apache.xbean.asm5.ClassReader.<init>(Unknown Source)
at org.apache.xbean.asm5.ClassReader.<init>(Unknown Source)
at org.apache.spark.util.ClosureCleaner$.getClassReader(ClosureCleaner.scala:46)
at org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:449)
at org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:432)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
at scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:103)
at scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:103)
at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
at scala.collection.mutable.HashMap$$anon$1.foreach(HashMap.scala:103)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
at org.apache.spark.util.FieldAccessFinder$$anon$3.visitMethodInsn(ClosureCleaner.scala:432)
at org.apache.xbean.asm5.ClassReader.a(Unknown Source)
at org.apache.xbean.asm5.ClassReader.b(Unknown Source)
at org.apache.xbean.asm5.ClassReader.accept(Unknown Source)
at org.apache.xbean.asm5.ClassReader.accept(Unknown Source)
at org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$14.apply(ClosureCleaner.scala:262)
at org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$14.apply(ClosureCleaner.scala:261)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:261)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:159)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2292)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2066)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2092)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:939)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.collect(RDD.scala:938)
at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:153)
at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:564)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.base/java.lang.Thread.run(Thread.java:844)
How can I solve this problem?
Actually, I found a tricky solution. To solve this problem:
Be sure that you installed Py4J correctly. It's better to install it using an official release. To do so,
download the latest official release from https://pypi.org/project/py4j/.
untar/unzip the file and navigate to the newly created directory, e.g., cd py4j-0.x.
run
sudo python(3) setup.py install
Then downgrade your Java to version 8 (previously, I used version 10).
To do so, first remove the current version of Java using:
sudo apt-get purge openjdk-\* icedtea-\* icedtea6-\*
and then install Java 8 using:
sudo apt install openjdk-8-jre-headless
Now the code works for me properly.
I also confirm that the solution works on Ubuntu 18.04 LTS.
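A quick sanity check after these steps, as a sketch: confirm that Py4J is importable and that the java on your PATH now reports version 8 before creating a SparkContext.

import subprocess

import py4j

print(py4j.__file__)                  # confirms which Py4J installation is picked up
subprocess.run(["java", "-version"])  # should report a 1.8.x runtime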
I had Java 10 installed and tried to run the Python examples from http://spark.apache.org/docs/2.3.1/, i.e. things as simple as:
./bin/spark-submit examples/src/main/python/pi.py 10
It did not work!
After applying the suggested fix:
sudo apt-get purge openjdk-\* icedtea-\* icedtea6-\*
sudo apt autoremove
sudo apt install openjdk-8-jre-headless
the example eventually worked; that is, if you consider that the right answer is:
Pi is roughly 3.142000
Thanks for the solution,
Bagvian
I had two versions of Java before, Java 8 and Java 9. When I deleted Java 9, the problem was solved.
Step 1:
Downgrade or upgrade your Java version to 8, if you have already installed a different one
(see how to alternate among Java versions).
Step 2:
Add the following to ~/.bashrc
export JAVA_HOME='/usr/lib/jvm/java-8-openjdk-amd64'
export PATH=$JAVA_HOME/bin:$PATH
export SPARK_HOME='/path/to/spark-2.x.x-bin-hadoop2.7'
export PATH=$SPARK_HOME/bin:$PATH
and run source ~/.bashrc to load it, or just start a new terminal.
An alternative approach would be to copy /path/to/spark-2.x.x-bin-hadoop2.7/conf/spark-env.sh.template to /path/to/spark-2.x.x-bin-hadoop2.7/conf/spark-env.sh. Then add the following to spark-env.sh
export JAVA_HOME='/usr/lib/jvm/java-8-openjdk-amd64'
export PYSPARK_PYTHON=python3
Then add the following to ~/.bashrc
export SPARK_HOME='/path/to/spark-2.x.x-bin-hadoop2.7'
export PATH=$SPARK_HOME/bin:$PATH
export SPARK_CONF_DIR=$SPARK_HOME/conf
and run source ~/.bashrc.
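After reloading the environment, a quick smoke test (a sketch) confirms that the Py4J gateway starts and a trivial job runs:

from pyspark import SparkContext

sc = SparkContext("local[2]", "smoke-test")
words = sc.parallelize(["scala", "java", "hadoop", "spark", "akka"])
print(words.count())  # expect 5
sc.stop()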
I need to maintain both OpenJDK 11 and JDK 8 for different purposes, so downgrading is not an option. For Spark programs, I work around it by exporting (overriding) JAVA_HOME to point to JDK 8, as below.
export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_65.jdk/Contents/Home/
direnv + adoptopenjdk8 (brew tap homebrew/cask-versions + brew cask install adoptopenjdk8) worked great for me in this situation (macOS)
# ~/.direnvrc
use_java() {
if [ "$#" -ne 1 ]; then
echo "usage: use java VERSION" >&2
return 1
fi
local v
v="$1"
if [ "$v" -le "8" ]; then
v="1.$v"
fi
export JAVA_HOME="$(/usr/libexec/java_home -v "$v")"
PATH_add $JAVA_HOME/bin
}
# .envrc in the project directory
use_java 8
If you are using anaconda, try:
conda install -c cyclus java-jdk
I had the same problem. I had Java 11, so I deleted Java 11 and installed Java 8, and the problem was solved.
The main reason for getting the error is an incorrect or incomplete path in the environment variables. You need to add paths for Java, Spark, PYSPARK_PYTHON, and Hadoop (containing the bin folder). Most probably the problem can be resolved by adding the right paths.
https://youtu.be/WQErwxRTiW0 ---- this video helped me resolve my issue (the video describes all the installation steps and the correct paths).
I had the same issue.
PySpark 2.x.x supports Java 8, and PySpark 3.x.x supports Java 8 and Java 11.
So, check your PySpark and Java versions.
If you are using PySpark 2.x.x, then you need to install/upgrade/downgrade to Java 8 and point your JAVA_HOME to the Java 8 JDK path.
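A small sketch for checking that combination from Python, assuming java is on your PATH:

import os
import subprocess

import pyspark

print("PySpark:", pyspark.__version__)           # 2.x.x -> Java 8; 3.x.x -> Java 8 or 11
print("JAVA_HOME:", os.environ.get("JAVA_HOME"))
subprocess.run(["java", "-version"])             # prints the active Java version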
I am using Spark 1.4.1.
I can use spark-submit without a problem.
But when I run ~/spark/bin/spark-shell, I get the error below.
I have configured SPARK_HOME and JAVA_HOME.
However, it was OK with Spark 1.2.
15/10/08 02:40:30 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Failed to initialize compiler: object scala.runtime in compiler mirror not found.
** Note that as of 2.8 scala does not assume use of the java classpath.
** For the old behavior pass -usejavacp to scala, or if using a Settings
** object programatically, settings.usejavacp.value = true.
Failed to initialize compiler: object scala.runtime in compiler mirror not found.
** Note that as of 2.8 scala does not assume use of the java classpath.
** For the old behavior pass -usejavacp to scala, or if using a Settings
** object programatically, settings.usejavacp.value = true.
Exception in thread "main" java.lang.AssertionError: assertion failed: null
at scala.Predef$.assert(Predef.scala:179)
at org.apache.spark.repl.SparkIMain.initializeSynchronous(SparkIMain.scala:247)
at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:990)
at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
at org.apache.spark.repl.Main$.main(Main.scala:31)
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:665)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:170)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:193)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
I was having the same problem running Spark, but I found it was my fault for not configuring Scala properly.
Make sure you have Java, Scala and sbt installed and Spark is built:
Edit your .bashrc file
vim .bashrc
Set your env variables:
export JAVA_HOME=/usr/lib/jvm/java-7-oracle
export PATH=$JAVA_HOME:$PATH
export SCALA_HOME=/usr/local/src/scala/scala-2.11.5
export PATH=$SCALA_HOME/bin:$PATH
export SPARK_HOME=/usr/local/src/apache/spark.2.0.0/spark
export PATH=$SPARK_HOME/bin:$PATH
Source your settings:
. .bashrc
Check Scala:
scala -version
Make sure the REPL starts:
scala
If your REPL starts, try to start your Spark shell again:
./path/to/spark/bin/spark-shell
You should get the Spark REPL.
You could try running:
spark-shell -usejavacp
It didn't work for me, but it did work for someone in the description of Spark Issue 18778.
Have you installed Scala and sbt?
The log said it didn't find the main class.