Error encountered writing to Kudu using Spark / Scala - apache-spark

I'm trying to write data into Kudu from Spark and I'm getting this error
java.lang.UnsupportedOperationException
    at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.decodeDictionaryIds(VectorizedColumnReader.java:296)
    at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:174)
    at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:230)
    at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137)
    at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:105)
Code example:
df.write.options(Map("kudu.master" -> "ip:7051",
  "kudu.table" -> "test_kudu")).mode("append").kudu
Used libraries:
org.apache.kudu:kudu-spark2_2.11:1.8.0
org.apache.kudu:kudu-client:1.8.0
Thanks!
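For context, here is a minimal self-contained sketch of the same write path, assuming a spark-shell session where spark is already defined; the master address and table name are the placeholders from the snippet above, and the toy DataFrame stands in for the real source data.

import org.apache.kudu.spark.kudu._   // kudu-spark implicits, including .kudu on DataFrameWriter
import spark.implicits._              // for toDF on local collections

// Toy DataFrame standing in for the real source data
val df = Seq((1, "a"), (2, "b")).toDF("id", "value")

// Append the rows to an existing Kudu table (the table must already exist)
df.write
  .options(Map("kudu.master" -> "ip:7051", "kudu.table" -> "test_kudu"))
  .mode("append")
  .kudu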

Related

How to run CreateIndex function in Hyperspace (spark)

I am trying to create an index using Hyperspace in PySpark,
but when I run the code below I get an error.
sample_data = [(1, "name1"), (2, "name2")]
spark.createDataFrame(sample_data, ['id', 'name']).write.mode("overwrite").parquet("table")
df = spark.read.parquet("table")
from hyperspace import *
# Create an instance of Hyperspace
hyperspace = Hyperspace(spark)
hyperspace.createIndex(df, IndexConfig("index", ["id"], ["name"]))
java.lang.ClassCastException: org.apache.spark.sql.execution.datasources.SerializableFileStatus cannot be cast to org.apache.hadoop.fs.FileStatus
I am running on an Azure Databricks environment:
Spark 3.0.0, Scala 2.12
When I try to do the same on Spark 2.4.2 with Scala 2.12 or Scala 2.11, I get an error in the same function (createIndex).
Here I get the following error:
Py4JJavaError: An error occurred while calling None.com.microsoft.hyperspace.index.IndexConfig.
: java.lang.NoClassDefFoundError:
Can anyone suggest some solutions?
Per the last comment of https://github.com/microsoft/hyperspace/discussions/285, this is a known issue with the Databricks runtime.
If you use open-source Spark, it should work; see the sketch below.
A solution is being sought with the Databricks team.
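For what it's worth, here is a minimal sketch of the same index creation against open-source Spark, using the Scala API described in the Hyperspace README (it assumes a spark-shell session with the Hyperspace package on the classpath; the path, index name, and columns are the ones from the question).

import com.microsoft.hyperspace._
import com.microsoft.hyperspace.index._

// Read the Parquet data written in the question
val df = spark.read.parquet("table")

// Bind Hyperspace to the current SparkSession and create the index
val hs = new Hyperspace(spark)
hs.createIndex(df, IndexConfig("index", indexedColumns = Seq("id"), includedColumns = Seq("name")))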

Error when attempting to read Parquet in Spark

I am using Python Spark 2.4.3
I read the CSV into a DataFrame and write it to Parquet just fine. The third line below is what breaks:
df = spark.read.csv("file.csv", header=True)
df.write.parquet("result_parquet")
parquetFile = spark.read.parquet("result_parquet")
I am getting this:
Py4JJavaError: An error occurred while calling o1312.parquet.
: java.lang.IllegalArgumentException: Unsupported class file major version 55
What am I doing wrong? I got the line straight from the Spark documentation https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#loading-data-programmatically
The problem was that I was using Java 11, which is not fully supported by Spark. I uninstalled it, installed Java 8, and now it works.

Getting error 'Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient' while running spark.sql query

I am trying to run a simple command
spark.sql('SELECT * FROM df').show()
I am getting this error
AnalysisException: 'java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;'
I just started learning Spark a few days ago and don't understand what is causing this error. Please help.
What steps need to be followed to stop this exception?

Unable to read parquet file locally in spark

I am running PySpark locally and trying to read a Parquet file into a DataFrame from a notebook.
df = spark.read.parquet("metastore_db/tmp/userdata1.parquet")
I am getting this exception
An error occurred while calling o738.parquet.
: org.apache.spark.sql.AnalysisException: java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;
Does anyone know how to do it?
Assuming that you are running Spark locally, you should be doing something like:
df = spark.read.parquet("file:///metastore_db/tmp/userdata1.parquet")

spark connecting to Phoenix NoSuchMethod Exception

I am trying to connect to Phoenix through Spark/Scala to read and write data as a DataFrame. I am following the example on GitHub; however, when I try the very first example, "Load as a DataFrame using the Data Source API", I get the exception below.
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.hadoop.hbase.client.Put.setWriteToWAL(Z)Lorg/apache/hadoop/hbase/client/Put;
There are a couple of things from those examples that are driving me crazy:
1) The import statement import org.apache.phoenix.spark._ gives me the following error in my code:
cannot resolve symbol phoenix
I have included the following jars in my sbt build:
"org.apache.phoenix" % "phoenix-spark" % "4.4.0.2.4.3.0-227" % Provided,
"org.apache.phoenix" % "phoenix-core" % "4.4.0.2.4.3.0-227" % Provided,
2) I get a deprecation warning for the symbol load.
I googled that warning but didn't find any reference, and I was not able to find any example of the suggested method. I am also unable to find any other good resource that explains how to connect to Phoenix. Thanks for your time.
Please use .read instead of load, as shown below:
val df = sparkSession.sqlContext.read
  .format("org.apache.phoenix.spark")
  .option("zkUrl", "localhost:2181")
  .option("table", "TABLE1")
  .load()
It's late to answer, but here's what I did to solve a similar problem (a different method-not-found error and the deprecation warning):
1) About the NoSuchMethodError: I took all the jars from the HBase installation's lib folder and added them to the project. Also add the phoenix-spark jars. Make sure to use compatible versions of Spark and phoenix-spark: Spark 2.0+ is compatible with phoenix-spark 4.10+ (maven-central-link). This resolved the NoSuchMethodError.
2) About load: the load method has long since been deprecated. Use sqlContext.phoenixTableAsDataFrame. For reference, see "Load as a DataFrame directly using a Configuration object".
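For reference, here is a rough sketch of that approach, following the phoenix-spark documentation; the table name, columns, and ZooKeeper URL below are just the illustrative values used in the answers above.

import org.apache.hadoop.conf.Configuration
import org.apache.phoenix.spark._   // adds phoenixTableAsDataFrame to SQLContext

// Hadoop/HBase configuration; Phoenix-specific settings can be added here
val configuration = new Configuration()

// Load the ID and COL1 columns of TABLE1 as a DataFrame
val df = sparkSession.sqlContext.phoenixTableAsDataFrame(
  "TABLE1", Seq("ID", "COL1"), zkUrl = Some("localhost:2181"), conf = configuration)

df.show()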
