Getting Error when I ran hive UDF written in Java in pyspark EMR 5.x

Getting Error when I ran hive UDF written in Java in pyspark EMR 5.x - apache-spark

I have a Hive UDF written in java and I am trying to use it in pyspark 2.0.0. below are the steps
1. Copy the jar file to EMR
2. started a pyspark job like below
pyspark --jars ip-udf-0.0.1-SNAPSHOT-jar-with-dependencies-latest.jar
3. used the below code access the UDF
from pyspark.sql import SparkSession
from pyspark.sql import HiveContext
sc = spark.sparkContext
sqlContext = HiveContext(sc)
sqlContext.sql("create temporary function ip_map as 'com.mediaiq.hive.IPMappingUDF'")
I get the below error:
py4j.protocol.Py4JJavaError: An error occurred while calling o43.sql.
: java.lang.NoSuchMethodError:
org.apache.hadoop.hive.conf.HiveConf.getTimeVar(Lorg/apache/hadoop/hive/conf/HiveConf$ConfVars;Ljava/util/concurrent/TimeUnit;)J
at
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.(RetryingMetaStoreClient.java:76)
at
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:132)
at
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:104)
at
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:98)
at
org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:2453)
at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:2465) at
org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:340)
at
org.apache.spark.sql.hive.client.HiveClientImpl.(HiveClientImpl.scala:189)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
Method) at
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at
org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:258)
at
org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:359)
at
org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:263)
at
org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
at
org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38)
at
org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:46)
at
org.apache.spark.sql.hive.HiveSharedState.externalCatalog(HiveSharedState.scala:45)
at
org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:50)
at
org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48)
at
org.apache.spark.sql.hive.HiveSessionState$$anon$1.(HiveSessionState.scala:63)
at
org.apache.spark.sql.hive.HiveSessionState.analyzer$lzycompute(HiveSessionState.scala:63)
at
org.apache.spark.sql.hive.HiveSessionState.analyzer(HiveSessionState.scala:62)
at
org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64) at
org.apache.spark.sql.SparkSession.sql(SparkSession.scala:582) at
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498) at
py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237) at
py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at
py4j.Gateway.invoke(Gateway.java:280) at
py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79) at
py4j.GatewayConnection.run(GatewayConnection.java:214) at
java.lang.Thread.run(Thread.java:745)

You may have built your UDF with a different version of Hive. Be sure to specify the same version of Hive in your pom.xml used to build the jar containing the UDF. See this previous answer, for example.

Related

How to add Spark-excel to PySpark

I'm trying to read xlsx to PySpark and tried with multiple ways to import the library of Spark-excel but I still get errors while reading xlsx file.
I'm using Spark with standalone mode on my Mac.
My code:
# spark configuration
spark_path = "/spark/spark-3.0.1-bin-hadoop2.7"
findspark.init(spark_path)
spark = SparkSession.builder.master("local").appName("Word Count").config("--packages com.crealytics:spark-excel_2.12:0.13.7").getOrCreate()
data_location = "bank_transactions.xlsx"
df = spark.read.format("com.crealytics.spark.excel").load(data_location)
I got the following error:
Py4JJavaError: An error occurred while calling o37.load.
: java.lang.NoClassDefFoundError: scala/Product$class
at com.crealytics.spark.excel.Utils$MapIncluding.<init>(Utils.scala:9)
at com.crealytics.spark.excel.WorkbookReader$.<init>(WorkbookReader.scala:31)
at com.crealytics.spark.excel.WorkbookReader$.<clinit>(WorkbookReader.scala)
at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:28)
at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:18)
at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:12)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:344)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:297)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:286)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:286)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:232)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:564)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.base/java.lang.Thread.run(Thread.java:832)
Caused by: java.lang.ClassNotFoundException: scala.Product$class
at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:602)
at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
... 23 more

Solutions:
Download proper spark-excel library, for me it's:
https://mvnrepository.com/artifact/com.crealytics/spark-excel_2.12/0.13.7
Create directory spark_jars in the SPARK_HOME then store the spark-excel package in spark_jars directory
Add the spark_jars to spark.executor.extraClassPath of Spark session:
findspark.init(spark_path)
spark = SparkSession.builder.master("local") \
.appName("Word Count") \
.config("spark.jars.packages","com.crealytics:spark-excel_2.12:0.13.7") \
.getOrCreate()
spark

Error when trying to load 30GB SAS file with Pyspark

I am trying to replicate what was done in this article Loading Big SAS files
What I am doing is starting up a jupyter notebook and running the code below. I keep getting a Java load error and I can't figure out why.
Spark Version:2.4.6
Scala Version:2.12.2
Java Version:1.8.0_261
import findspark
findspark.init()
from pyspark.sql.session import SparkSession
spark = SparkSession.builder.\
config("spark.jars.packages","saurfang:spark-sas7bdat:2.0.0-s_2.11")\
.enableHiveSupport().getOrCreate()
df=spark.read.format('com.github.saurfang.sas.spark')\
.load(r'D:\IvyDB\opprcd\opprcd2019.sas7bdat')
Error I always get is below
Py4JJavaError: An error occurred while calling o163.load.
: java.util.concurrent.TimeoutException: Timed out after 60 sec while reading file metadata, file might be corrupt. (Change timeout with 'metadataTimeout' paramater)
at com.github.saurfang.sas.spark.SasRelation.inferSchema(SasRelation.scala:189)
at com.github.saurfang.sas.spark.SasRelation.(SasRelation.scala:62)
at com.github.saurfang.sas.spark.SasRelation$.apply(SasRelation.scala:43)
at com.github.saurfang.sas.spark.DefaultSource.createRelation(DefaultSource.scala:209)
at com.github.saurfang.sas.spark.DefaultSource.createRelation(DefaultSource.scala:42)
at com.github.saurfang.sas.spark.DefaultSource.createRelation(DefaultSource.scala:27)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:341)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:174)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)

In our case, we were able to fix this issue by adding Parso library into pyspark. Parso is one of the requirements in Spark SAS Data Source.

Exception in Pyspark Structured Streaming while reading from Kafka

Environment: Spark 2.4.0
I have included spark-sql-kafka-0-10 jar, and it's of the same version as that of the Spark I am using.
Here's the exception:
py4j.protocol.Py4JJavaError: An error occurred while calling o38.load.
: java.lang.NoClassDefFoundError: org.apache.kafka.common.serialization.ByteArrayDeserializer
at org.apache.spark.sql.kafka010.KafkaSourceProvider$.<init>(KafkaSourceProvider.scala:487)
at org.apache.spark.sql.kafka010.KafkaSourceProvider$.<clinit>(KafkaSourceProvider.scala)
at org.apache.spark.sql.kafka010.KafkaSourceProvider.validateStreamOptions(KafkaSourceProvider.scala:414)
at org.apache.spark.sql.kafka010.KafkaSourceProvider.sourceSchema(KafkaSourceProvider.scala:66)
at org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:209)
at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:95)
at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:95)
at org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:33)
at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:171)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:90)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:55)
at java.lang.reflect.Method.invoke(Method.java:508)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:812)
Caused by: java.lang.ClassNotFoundException: org.apache.kafka.common.serialization.ByteArrayDeserializer
at java.net.URLClassLoader.findClass(URLClassLoader.java:610)
at java.lang.ClassLoader.loadClassHelper(ClassLoader.java:937)
at java.lang.ClassLoader.loadClass(ClassLoader.java:882)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:343)
at java.lang.ClassLoader.loadClass(ClassLoader.java:865)
... 20 more

I didn't have kafka-clients jar in my classpath. Adding it fixes the missing class exception
Starting the spark-shell with the packages option will work too:
spark-shell --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.0

Spark Loading Data from Azure Data Lake Store - Py4JJavaError: NoSuchMethodError

I am trying to load data in Spark 2.3.1 from ADLS using the following:
moviesfileAdls = "adl://xxxxxx.azuredatalakestore.net/Data/movies.csv"
dfMovies = spark.read.format("csv") \
.option("header", "true") \
.option("delimiter",",") \
.load(moviesfileAdls)
The setup: Hadoop-3.1.1 running on the same box as spark-2.3.1-bin-hadoop2.7. In hdfs, I am able to get the file using the following command:
hadoop distcp adl://xxxxxx.azuredatalakestore.net/Data/movies.csv /user/hadoop/movies
The above command successfully copies the file into local HDFS so I believe the hadoop setup is OK.
However, when I try to run the spark.read.format("csv") command, I am getting the following error:
Py4JJavaError: An error occurred while calling o54.load.
: java.lang.NoSuchMethodError: org.apache.hadoop.conf.Configuration.reloadExistingConfigurations()V
at org.apache.hadoop.fs.adl.AdlConfKeys.addDeprecatedKeys(AdlConfKeys.java:126)
at org.apache.hadoop.fs.adl.AdlFileSystem.<clinit>(AdlFileSystem.java:98)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2134)
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2099)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:45)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:354)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:174)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
I tried adding the ADLS jars directly in spark-defaults.conf:
spark.jars /usr/local/hadoop/share/hadoop/tools/lib/azure-data-lake-store-sdk-2.3.1.jar, /usr/local/hadoop/share/hadoop/tools/lib/hadoop-azure-datalake-3.1.1.jar
HADOOP_CLASSPATH refers to the folder where the jars are located according to the spark user:
spark#xxxxx:~$ echo $HADOOP_CLASSPATH /usr/local/hadoop/etc/hadoop/*:/usr/local/hadoop/share/hadoop/common/lib/*:/usr/local/hadoop/share/hadoop/common/*:/usr/local/hadoop/share/hadoop/hdfs/*:/usr/local/hadoop/share/hadoop/hdfs/lib/*:/usr/local/hadoop/share/hadoop/hdfs/*:/usr/local/hadoop/share/hadoop/yarn/lib/*:/usr/local/hadoop/share/hadoop/yarn/*:/usr/local/hadoop/share/hadoop/mapreduce/lib/*:/usr/local/hadoop/share/hadoop/mapreduce/*:/usr/local/hadoop/share/hadoop/tools/lib/*
Any pointers are greatly appreciated.

How to read scylladb table in pyspark dataframe

I am trying to read scylladb table installed one pc into pyspark dataframe on another pc.
The 2 pcs have ssh connectivity and I am able to read the table via python code, there is problem only while connecting with spark.I have used this connector:
--packages datastax:spark-cassandra-connector:2.3.0-s_2.11 ,
My spark -version = 2.3.1 , scala-version-2.11.8.
**First Approach**
from pyspark import SparkConf
from pyspark import SparkContext
from pyspark.sql import SparkSession
conf = SparkConf().set("spark.cassandra.connection.host","192.168.0.118")
sc = SparkContext(conf = conf)
spark=SparkSession.builder.config(conf=conf).appName('FinancialRecon').getOrCreate()
sqlContext =SQLContext(sc)
data=spark.read.format("org.apache.spark.sql.cassandra").options(table="datarecon",keyspace="finrecondata").load().show()
Resulting error:
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 172, in load
File "/usr/local/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in call
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/usr/local/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o43.load.
: java.lang.ClassNotFoundException: org.apache.spark.Logging was removed in Spark 2.0. Please check if your library is compatible with Spark 2.0
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:646)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:190)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:164)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NoClassDefFoundError: org/apache/spark/Logging
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
at java.lang.ClassLoader.loadClass(ClassLoader.java:411)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$23$$anonfun$apply$15.apply(DataSource.scala:618)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$23$$anonfun$apply$15.apply(DataSource.scala:618)
at scala.util.Try$.apply(Try.scala:192)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$23.apply(DataSource.scala:618)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$23.apply(DataSource.scala:618)
at scala.util.Try.orElse(Try.scala:84)
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:618)
... 13 more
Caused by: java.lang.ClassNotFoundException: org.apache.spark.Logging
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 33 more
Another Approch that I have used is :
data=sc.read.format("org.apache.spark.sql.cassandra").options(table="datarecon",keyspace="finrecondata").load().show()
For this I get:
AttributeError: 'SparkContext' object has no attribute 'read'
Third Approach:
data=sqlContext.read.format("org.apache.spark.sql.cassandra").options(table="datarecon",keyspace="finrecondata").load().show()
For this I get the same error as the first approach.
Please advice whether it is scylla spark connector issue or some spark library issue and how to solve it.

Follow these steps :
1.Run the spark-shell with the packages line.To configure the default Spark Configuration pass key value pairs with --conf, In my case scylla host is 172.17.0.2
bin/spark-shell --conf spark.cassandra.connection.host=172.17.0.2 --packages datastax:spark-cassandra-connector:2.3.0-s_2.11
2.Enable Cassandra-specific functions on the SparkContext, SparkSession, RDD, and DataFrame:
import com.datastax.spark.connector._
import org.apache.spark.sql.cassandra._
3.Load data from scylla
val rdd = sc.cassandraTable("my_keyspace", "my_table")
4.Test
scala> rdd.collect().foreach(println)
CassandraRow{id: 1, name: ash}

The resulting error occurs due to a version conflict. Maybe you can solve it reading here.
The first approach will work because read method is available on SparkSession.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Getting Error when I ran hive UDF written in Java in pyspark EMR 5.x - apache-spark

You may have built your UDF with a different version of Hive. Be sure to specify the same version of Hive in your pom.xml used to build the jar containing the UDF. See this previous answer, for example.

Related

How to add Spark-excel to PySpark

Error when trying to load 30GB SAS file with Pyspark

Exception in Pyspark Structured Streaming while reading from Kafka

Spark Loading Data from Azure Data Lake Store - Py4JJavaError: NoSuchMethodError

How to read scylladb table in pyspark dataframe

Categories

Resources