vertica sql in pyspark not running

vertica sql in pyspark not running - apache-spark

I'm new to use pyspark running sql to vertica. I have vertica_python module imported, I have sql to vertica table, I need to convert the query result into pyspark dataframe. I use spark-submit command to run, but got error saying 'MemoryError'. Can someone help?
I tried to increase and decrease the drive and executor memory in spark-submit command, didn't work, got same error message.
from pyspark.sql import SQLContext
from pyspark.sql import HiveContext
sc=SparkContext(appName="build_history_pyspark")
sqlContext=SQLContext(sc)
start_load_time_UTC='2019-06-01 00:00:00'
start_load_time_UTC=datetime.strptime(start_load_time_UTC,'%Y-%m-%d %H:%M:%S')
url="jdbc:vertica://144.60.100.141:5433/SRVVERTICA" . #This is the vertica database host IP and port
properties={"user":"hhhhh","password":"yyyyy","driver": "com.vertica.jdbc.Driver"}
sql_for_kafka="SELECT STARTTIME, LONGITUDE,LATITUDE FROM my_table WHERE (VERTICA_LOAD_TIME >'%s')" %(start_load_time_UTC)
sdf_orig=sqlContext.read.format("JDBC").options(url=url, query=sql_for_kafka, **properties).load()
sdf_orig.show(10)
error saying:
MemoryError
my command:
spark-submit --driver-memory 10g --num-executors 10 --executor-cores 4 --executor-memory 10g build_history_pyspark.py
I did further debugging, I think the error is due to the variable start_load_time_UTC. I really need to use a variable here not the straight value. But don't know the correct syntax.
I also changed the query statement to:
sql_for_kafka="SELECT STARTTIME, LONGITUDE,LATITUDE FROM my_table WHERE VERTICA_LOAD_TIME >$start_load_time_UTC"
I got different error:
Py4JJavaError: An error occurred while calling o57.load.
: java.sql.SQLDataException: [Vertica][VJDBC](3679) ERROR: Invalid
input syntax for timestamp: "$start_load_time_UTC"
at com.vertica.util.ServerErrorData.buildException(Unknown Source)
at com.vertica.io.ProtocolStream.readExpectedMessage(Unknown Source)
at com.vertica.dataengine.VDataEngine.prepareImpl(Unknown Source)
at com.vertica.dataengine.VDataEngine.prepare(Unknown Source)
at com.vertica.dataengine.VDataEngine.prepare(Unknown Source)
at com.vertica.jdbc.common.SPreparedStatement.<init>(Unknown Source)
at com.vertica.jdbc.jdbc4.S4PreparedStatement.<init>(Unknown Source)
at com.vertica.jdbc.VerticaJdbc4PreparedStatementImpl.<init>(Unknown Source)
at com.vertica.jdbc.VJDBCObjectFactory.createPreparedStatement(Unknown Source)
at com.vertica.jdbc.common.SConnection.prepareStatement(Unknown Source)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:58)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:115)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:52)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:340)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:164)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:745)
Caused by: com.vertica.support.exceptions.DataException: [Vertica]VJDBC ERROR: Invalid input syntax for timestamp: "$start_load_time_UTC"
... 28 more
I finally solved the problem by changing the query statement to below:
sql_for_kafka="(SELECT STARTTIME, LONGITUDE,LATITUDE FROM my_table WHERE (VERTICA_LOAD_TIME >'{0}') )temp" .format(start_load_time_UTC)

Related

Error when trying to load 30GB SAS file with Pyspark

I am trying to replicate what was done in this article Loading Big SAS files
What I am doing is starting up a jupyter notebook and running the code below. I keep getting a Java load error and I can't figure out why.
Spark Version:2.4.6
Scala Version:2.12.2
Java Version:1.8.0_261
import findspark
findspark.init()
from pyspark.sql.session import SparkSession
spark = SparkSession.builder.\
config("spark.jars.packages","saurfang:spark-sas7bdat:2.0.0-s_2.11")\
.enableHiveSupport().getOrCreate()
df=spark.read.format('com.github.saurfang.sas.spark')\
.load(r'D:\IvyDB\opprcd\opprcd2019.sas7bdat')
Error I always get is below
Py4JJavaError: An error occurred while calling o163.load.
: java.util.concurrent.TimeoutException: Timed out after 60 sec while reading file metadata, file might be corrupt. (Change timeout with 'metadataTimeout' paramater)
at com.github.saurfang.sas.spark.SasRelation.inferSchema(SasRelation.scala:189)
at com.github.saurfang.sas.spark.SasRelation.(SasRelation.scala:62)
at com.github.saurfang.sas.spark.SasRelation$.apply(SasRelation.scala:43)
at com.github.saurfang.sas.spark.DefaultSource.createRelation(DefaultSource.scala:209)
at com.github.saurfang.sas.spark.DefaultSource.createRelation(DefaultSource.scala:42)
at com.github.saurfang.sas.spark.DefaultSource.createRelation(DefaultSource.scala:27)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:341)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:174)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)

In our case, we were able to fix this issue by adding Parso library into pyspark. Parso is one of the requirements in Spark SAS Data Source.

Tune Spark and YARN properties in HDP Sandbox

I'm using HDP sandbox 3.1 and performing NLTK on 50K files using spark2 Interpreter and Zeppelin Notebook.
It's a single node setup.
I've given 12GB RAM to Guest System and 6CPUs.
In spark, I'm reading all 50K files in a single RDD Operation, but at 63% my process hangs, and then it leads to ERROR.
Now which Values in Spark and YARN I've to set, so Spark can work in full throttle.
Edit: Each file size is around 3KB
Following is the log when Error occurred
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job 0 cancelled because SparkContext was shut down
at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:837)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:835)
at scala.collection.mutable.HashSet.foreach(HashSet.scala:78)
at org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:835)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onStop(DAGScheduler.scala:1848)
at org.apache.spark.util.EventLoop.stop(EventLoop.scala:83)
at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1761)
at org.apache.spark.SparkContext$$anonfun$stop$8.apply$mcV$sp(SparkContext.scala:1931)
at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1361)
at org.apache.spark.SparkContext.stop(SparkContext.scala:1930)
at org.apache.spark.SparkContext$$anon$3.run(SparkContext.scala:1876)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2074)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:939)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.collect(RDD.scala:938)
at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:162)
at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
(<class 'py4j.protocol.Py4JJavaError'>, Py4JJavaError(u'An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.\n', JavaObject id=o101)

Spark Loading Data from Azure Data Lake Store - Py4JJavaError: NoSuchMethodError

I am trying to load data in Spark 2.3.1 from ADLS using the following:
moviesfileAdls = "adl://xxxxxx.azuredatalakestore.net/Data/movies.csv"
dfMovies = spark.read.format("csv") \
.option("header", "true") \
.option("delimiter",",") \
.load(moviesfileAdls)
The setup: Hadoop-3.1.1 running on the same box as spark-2.3.1-bin-hadoop2.7. In hdfs, I am able to get the file using the following command:
hadoop distcp adl://xxxxxx.azuredatalakestore.net/Data/movies.csv /user/hadoop/movies
The above command successfully copies the file into local HDFS so I believe the hadoop setup is OK.
However, when I try to run the spark.read.format("csv") command, I am getting the following error:
Py4JJavaError: An error occurred while calling o54.load.
: java.lang.NoSuchMethodError: org.apache.hadoop.conf.Configuration.reloadExistingConfigurations()V
at org.apache.hadoop.fs.adl.AdlConfKeys.addDeprecatedKeys(AdlConfKeys.java:126)
at org.apache.hadoop.fs.adl.AdlFileSystem.<clinit>(AdlFileSystem.java:98)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2134)
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2099)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:45)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:354)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:174)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
I tried adding the ADLS jars directly in spark-defaults.conf:
spark.jars /usr/local/hadoop/share/hadoop/tools/lib/azure-data-lake-store-sdk-2.3.1.jar, /usr/local/hadoop/share/hadoop/tools/lib/hadoop-azure-datalake-3.1.1.jar
HADOOP_CLASSPATH refers to the folder where the jars are located according to the spark user:
spark#xxxxx:~$ echo $HADOOP_CLASSPATH /usr/local/hadoop/etc/hadoop/*:/usr/local/hadoop/share/hadoop/common/lib/*:/usr/local/hadoop/share/hadoop/common/*:/usr/local/hadoop/share/hadoop/hdfs/*:/usr/local/hadoop/share/hadoop/hdfs/lib/*:/usr/local/hadoop/share/hadoop/hdfs/*:/usr/local/hadoop/share/hadoop/yarn/lib/*:/usr/local/hadoop/share/hadoop/yarn/*:/usr/local/hadoop/share/hadoop/mapreduce/lib/*:/usr/local/hadoop/share/hadoop/mapreduce/*:/usr/local/hadoop/share/hadoop/tools/lib/*
Any pointers are greatly appreciated.

How to read scylladb table in pyspark dataframe

I am trying to read scylladb table installed one pc into pyspark dataframe on another pc.
The 2 pcs have ssh connectivity and I am able to read the table via python code, there is problem only while connecting with spark.I have used this connector:
--packages datastax:spark-cassandra-connector:2.3.0-s_2.11 ,
My spark -version = 2.3.1 , scala-version-2.11.8.
**First Approach**
from pyspark import SparkConf
from pyspark import SparkContext
from pyspark.sql import SparkSession
conf = SparkConf().set("spark.cassandra.connection.host","192.168.0.118")
sc = SparkContext(conf = conf)
spark=SparkSession.builder.config(conf=conf).appName('FinancialRecon').getOrCreate()
sqlContext =SQLContext(sc)
data=spark.read.format("org.apache.spark.sql.cassandra").options(table="datarecon",keyspace="finrecondata").load().show()
Resulting error:
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 172, in load
File "/usr/local/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in call
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/usr/local/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o43.load.
: java.lang.ClassNotFoundException: org.apache.spark.Logging was removed in Spark 2.0. Please check if your library is compatible with Spark 2.0
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:646)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:190)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:164)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NoClassDefFoundError: org/apache/spark/Logging
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
at java.lang.ClassLoader.loadClass(ClassLoader.java:411)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$23$$anonfun$apply$15.apply(DataSource.scala:618)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$23$$anonfun$apply$15.apply(DataSource.scala:618)
at scala.util.Try$.apply(Try.scala:192)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$23.apply(DataSource.scala:618)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$23.apply(DataSource.scala:618)
at scala.util.Try.orElse(Try.scala:84)
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:618)
... 13 more
Caused by: java.lang.ClassNotFoundException: org.apache.spark.Logging
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 33 more
Another Approch that I have used is :
data=sc.read.format("org.apache.spark.sql.cassandra").options(table="datarecon",keyspace="finrecondata").load().show()
For this I get:
AttributeError: 'SparkContext' object has no attribute 'read'
Third Approach:
data=sqlContext.read.format("org.apache.spark.sql.cassandra").options(table="datarecon",keyspace="finrecondata").load().show()
For this I get the same error as the first approach.
Please advice whether it is scylla spark connector issue or some spark library issue and how to solve it.

Follow these steps :
1.Run the spark-shell with the packages line.To configure the default Spark Configuration pass key value pairs with --conf, In my case scylla host is 172.17.0.2
bin/spark-shell --conf spark.cassandra.connection.host=172.17.0.2 --packages datastax:spark-cassandra-connector:2.3.0-s_2.11
2.Enable Cassandra-specific functions on the SparkContext, SparkSession, RDD, and DataFrame:
import com.datastax.spark.connector._
import org.apache.spark.sql.cassandra._
3.Load data from scylla
val rdd = sc.cassandraTable("my_keyspace", "my_table")
4.Test
scala> rdd.collect().foreach(println)
CassandraRow{id: 1, name: ash}

The resulting error occurs due to a version conflict. Maybe you can solve it reading here.
The first approach will work because read method is available on SparkSession.

Getting Error when I ran hive UDF written in Java in pyspark EMR 5.x

I have a Hive UDF written in java and I am trying to use it in pyspark 2.0.0. below are the steps
1. Copy the jar file to EMR
2. started a pyspark job like below
pyspark --jars ip-udf-0.0.1-SNAPSHOT-jar-with-dependencies-latest.jar
3. used the below code access the UDF
from pyspark.sql import SparkSession
from pyspark.sql import HiveContext
sc = spark.sparkContext
sqlContext = HiveContext(sc)
sqlContext.sql("create temporary function ip_map as 'com.mediaiq.hive.IPMappingUDF'")
I get the below error:
py4j.protocol.Py4JJavaError: An error occurred while calling o43.sql.
: java.lang.NoSuchMethodError:
org.apache.hadoop.hive.conf.HiveConf.getTimeVar(Lorg/apache/hadoop/hive/conf/HiveConf$ConfVars;Ljava/util/concurrent/TimeUnit;)J
at
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.(RetryingMetaStoreClient.java:76)
at
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:132)
at
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:104)
at
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:98)
at
org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:2453)
at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:2465) at
org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:340)
at
org.apache.spark.sql.hive.client.HiveClientImpl.(HiveClientImpl.scala:189)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
Method) at
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at
org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:258)
at
org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:359)
at
org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:263)
at
org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
at
org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38)
at
org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:46)
at
org.apache.spark.sql.hive.HiveSharedState.externalCatalog(HiveSharedState.scala:45)
at
org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:50)
at
org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48)
at
org.apache.spark.sql.hive.HiveSessionState$$anon$1.(HiveSessionState.scala:63)
at
org.apache.spark.sql.hive.HiveSessionState.analyzer$lzycompute(HiveSessionState.scala:63)
at
org.apache.spark.sql.hive.HiveSessionState.analyzer(HiveSessionState.scala:62)
at
org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64) at
org.apache.spark.sql.SparkSession.sql(SparkSession.scala:582) at
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498) at
py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237) at
py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at
py4j.Gateway.invoke(Gateway.java:280) at
py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79) at
py4j.GatewayConnection.run(GatewayConnection.java:214) at
java.lang.Thread.run(Thread.java:745)

You may have built your UDF with a different version of Hive. Be sure to specify the same version of Hive in your pom.xml used to build the jar containing the UDF. See this previous answer, for example.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

vertica sql in pyspark not running - apache-spark

Related

Error when trying to load 30GB SAS file with Pyspark

Tune Spark and YARN properties in HDP Sandbox

Spark Loading Data from Azure Data Lake Store - Py4JJavaError: NoSuchMethodError

How to read scylladb table in pyspark dataframe

Getting Error when I ran hive UDF written in Java in pyspark EMR 5.x

Categories

Resources