Cannot create Spark Phoenix DataFrames - apache-spark

I am trying to load data from Apache Phoenix into a Spark DataFrame.
I have been able to successfully create an RDD with the following code:
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.phoenix.spark._   // provides the phoenixTableAsRDD implicit on SparkContext

val sc = new SparkContext("local", "phoenix-test")
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val foo: RDD[Map[String, AnyRef]] = sc.phoenixTableAsRDD(
  table = "FOO",
  columns = Seq("ID", "MESSAGE_EPOCH", "MESSAGE_VALUE"),
  zkUrl = Some("<zk-ip-address>:2181:/hbase-unsecure"))
foo.collect().foreach(x => println(x))
However, I have not been so lucky trying to create a DataFrame. My current attempt is:
import org.apache.phoenix.spark._   // provides the phoenixTableAsDataFrame implicit on SQLContext
import org.apache.spark.sql.SQLContext

val sc = new SparkContext("local", "phoenix-test")
val sqlContext = new SQLContext(sc)
val df = sqlContext.phoenixTableAsDataFrame(
  table = "FOO",
  columns = Seq("ID", "MESSAGE_EPOCH", "MESSAGE_VALUE"),
  zkUrl = Some("<zk-ip-address>:2181:/hbase-unsecure"))
df.select(df("ID")).show
Unfortunately the above code results in a ClassCastException:
java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericMutableRow cannot be cast to org.apache.spark.sql.Row
I am still very new to Spark. If anyone can help, it would be very much appreciated!

Although you haven't mentioned your Spark version or the details of the exception, please see PHOENIX-2287, which has been fixed and says:
Environment: HBase 1.1.1 running in standalone mode on OS X, Spark 1.5.0, Phoenix 4.5.2

Josh Mahonin added a comment - 23/Sep/15 17:56
Updated patch adds support for Spark 1.5.0, and is backwards compatible back down to 1.3.0 (manually tested, Spark version profiles may be worth looking at in the future). In 1.5.0, they've gone and explicitly hidden the GenericMutableRow data structure. Fortunately, we are able to use the external-facing 'Row' data type, which is backwards compatible, and should remain compatible in future releases as well. As part of the update, Spark SQL deprecated a constructor on their 'DecimalType'. In updating this, I exposed a new issue, which is that we don't carry forward the precision and scale of the underlying Decimal type through to Spark. For now I've set it to use the Spark defaults, but I'll create another issue for that specifically. I've included an ignored integration test in this patch as well.
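In practice that means picking up a Phoenix release that includes the PHOENIX-2287 fix, after which the phoenixTableAsDataFrame call above should work as written against Spark 1.5.x. As a hedged alternative, the phoenix-spark connector also exposes a DataSource-based read; a sketch (option names per the phoenix-spark docs, verify against your version):
// DataSource API variant; "zkUrl" and "table" options per the phoenix-spark plugin docs
val df = sqlContext.read
  .format("org.apache.phoenix.spark")
  .option("table", "FOO")
  .option("zkUrl", "<zk-ip-address>:2181:/hbase-unsecure")
  .load()

df.select("ID").show()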

Related

What changes are required when moving simple synapsesql implementation from Spark 2.4.8 to Spark 3.1.2?

I have a simple implementation of the .write.synapsesql() method (code shown below) that works in Spark 2.4.8 but not in Spark 3.1.2 (documentation/example here). The data in use is a simple notebook-created foobar-type table. Searching online for key phrases from and about the error did not turn up any new information for me.
What is the cause of the error in 3.1.2?
Spark 2.4.8 version (behaves as desired):
val df = spark.sql("SELECT * FROM TEST_TABLE")
df.write.synapsesql("my_local_db_name.schema_name.test_table", Constants.INTERNAL, None)
Spark 3.1.2 version (the extra callback argument is the same as in the documentation; it can also be left out, with a similar result):
val df = spark.sql("SELECT * FROM TEST_TABLE")
df.write.synapsesql("my_local_db_name.schema_name.test_table", Constants.INTERNAL, None,
Some(callBackFunctionToReceivePostWriteMetrics))
The resulting error (only in 3.1.2) is:
WriteFailureCause -> java.lang.IllegalArgumentException: Failed to derive `https` scheme based staging location URL for SQL COPY-INTO}
As the documentation from the question states, ensure that you are setting the options correctly with:
val writeOptionsWithAADAuth: Map[String, String] = Map(
  Constants.SERVER -> "<dedicated-pool-sql-server-name>.sql.azuresynapse.net",
  Constants.TEMP_FOLDER -> "abfss://<storage_container_name>@<storage_account_name>.dfs.core.windows.net/<some_temp_folder>")
and including the options in your .write statement like so:
df.write.options(writeOptionsWithAADAuth).synapsesql(...)
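Putting the question's write call and the answer's options together, a hedged end-to-end sketch of the Spark 3.1.2 path (server, container, account, and table names are placeholders; import paths follow the Synapse connector docs and should be verified against your runtime):
// Import paths per the Azure Synapse Dedicated SQL pool connector docs (verify for your runtime)
import com.microsoft.spark.sqlanalytics.utils.Constants
import org.apache.spark.sql.SqlAnalyticsConnector._

val writeOptionsWithAADAuth: Map[String, String] = Map(
  Constants.SERVER -> "<dedicated-pool-sql-server-name>.sql.azuresynapse.net",
  Constants.TEMP_FOLDER -> "abfss://<storage_container_name>@<storage_account_name>.dfs.core.windows.net/<some_temp_folder>")

val df = spark.sql("SELECT * FROM TEST_TABLE")
df.write
  .options(writeOptionsWithAADAuth)
  .synapsesql("my_local_db_name.schema_name.test_table", Constants.INTERNAL, None)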

Spark Delta table restore to version

I am trying to restore a Delta table to its previous version via Spark (Java); I am using a local IDE. The code is as below:
import io.delta.tables.*;

DeltaTable deltaTable = DeltaTable.forPath(spark, <path-to-table>);
// or, by table name: DeltaTable deltaTable = DeltaTable.forName(spark, <table-name>);
deltaTable.restoreToVersion(0);                 // restore table to oldest version
deltaTable.restoreToTimestamp("2019-02-14");    // restore to a specific timestamp
As per the Databricks documentation, the method given here is not available in delta-core version 0.8.0. The method is also not in the API docs.
Is this only available in the Databricks runtime?
Currently I have to load the previous version and rewrite the DataFrame using Delta. Is there a better way to do it?
Delta Lake version 0.8 does not have the restoreToVersion and restoreToTimestamp methods. There is no trace of such methods in open-source Delta Lake 0.8, as you can check in the delta-lake repository.
So currently, and as far as I know, there is no option other than rewriting from a previous version, as explained in the answers to this question.
EDIT
As commented by boyangeor, restoreToVersion and restoreToTimestamp are now available in Delta Lake from version 1.2.
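With delta-core 1.2 or later on the classpath, the calls from the question should work directly; a minimal sketch in Scala, assuming the example path used below:
import io.delta.tables.DeltaTable

// Requires delta-core 1.2+ (restore is not available in 0.8)
val deltaTable = DeltaTable.forPath(spark, "/tmp/delta-table")
deltaTable.restoreToVersion(0)                 // back to the oldest version
// deltaTable.restoreToTimestamp("2019-02-14") // or back to a specific timestamp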
The rollback to a given version is essentially this:
In python
delta_table_path = "/tmp/delta-table"
df = spark.read.format("delta").option("versionAsOf", 0).load(delta_table_path)
df.show()
In Java:
String delta_table_path = "/tmp/delta-table";
Dataset<Row> df = spark.read().format("delta").option("versionAsOf", 0).load(delta_table_path);
df.show();
In Scala:
val delta_table_path = "/tmp/delta-table"
val df = spark.read.format("delta").option("versionAsOf", 0).load(delta_table_path)
df.show()
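On 0.8, where restore is unavailable, completing the rollback means writing the old snapshot back over the current table. A hedged sketch of that rewrite step in Scala, assuming the same path as above:
val delta_table_path = "/tmp/delta-table"

// Read the table as of version 0, then overwrite the current contents with it
val oldVersion = spark.read.format("delta").option("versionAsOf", 0).load(delta_table_path)
oldVersion.write.format("delta").mode("overwrite").save(delta_table_path)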

Do we need any external jar for xml parsing in Spark?

I'm trying to parse XML in Spark. I am getting the error below. Could you please help me?
import org.apache.spark.sql.SQLContext
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf

object TestSpark {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Test")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    val df = sqlContext.read
      .format("com.databricks.spark.xml")
      .option("rootTag", "book")
      .load("c:\\sample.xml")
  }
}
Error:
Exception in thread "main" java.lang.ClassNotFoundException: Failed to load class for data source: com.databricks.spark.xml.
No other external jars are required except the Databricks spark-xml one. You need to add the dependency for Spark 2.0+; if you are using an older Spark version, then you need to use this.
You need to use
groupId: com.databricks
artifactId: spark-xml_2.11
version: 0.4.1
Match the Scala version to that of Spark. Starting with version 2.0, Spark is built with Scala 2.11 by default. Scala 2.10 users should download the Spark source package and build it with Scala 2.10 support.
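For example, in an sbt build those coordinates could be declared as follows (a sketch; match the artifact's Scala suffix to the Scala version your Spark build uses):
// build.sbt (assumes a Spark 2.x build on Scala 2.11)
libraryDependencies += "com.databricks" % "spark-xml_2.11" % "0.4.1"
From spark-shell, the same dependency can be pulled in with --packages com.databricks:spark-xml_2.11:0.4.1 instead of a build file.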
This may help
Compatibility issue with Scala and Spark for compiled jars
spark-xml

Can't access Spark 2.0 Temporary Table from beeline

With Spark 1.5.1, I've already been able to access spark-shell temporary tables from Beeline using Thrift Server. I've been able to do so by reading answers to related questions on Stackoverflow.
However, after upgrading to Spark 2.0, I can't see temporary tables from Beeline anymore, here are the steps I'm following.
I'm launching spark-shell using the following command:
./bin/spark-shell --master=myHost.local:7077 --conf spark.sql.hive.thriftServer.singleSession=true
Once the spark shell is ready, I enter the following lines to launch the Thrift server and create a temporary view from a DataFrame whose source is a JSON file:
import org.apache.spark.sql.hive.thriftserver._
spark.sqlContext.setConf("hive.server2.thrift.port","10002")
HiveThriftServer2.startWithContext(spark.sqlContext)
val df = spark.read.json("examples/src/main/resources/people.json")
df.createOrReplaceTempView("people")
spark.sql("select * from people").show()
The last statement displays the table, it runs fine.
However when I start beeline and log to my thrift server instance, I can't see any temporary tables:
show tables;
+------------+--------------+--+
| tableName | isTemporary |
+------------+--------------+--+
+------------+--------------+--+
No rows selected (0,658 seconds)
Did I miss something regarding my Spark upgrade from 1.5.1 to 2.0? How can I gain access to my temporary tables?
This worked for me after upgrading to Spark 2.0.1:
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

// sparkMaster and hdfsDataUri are placeholders for your own master URL and HDFS URI
val sparkConf =
  new SparkConf()
    .setAppName("Spark Thrift Server Demo")
    .setMaster(sparkMaster)
    .set("hive.metastore.warehouse.dir", hdfsDataUri + "/hive")

val spark = SparkSession
  .builder()
  .enableHiveSupport()
  .config(sparkConf)
  .getOrCreate()

val sqlContext = new org.apache.spark.sql.SQLContext(spark.sparkContext)
HiveThriftServer2.startWithContext(sqlContext)
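After that, connecting from Beeline looks roughly like this (use the Thrift port you configured; the question sets 10002, the default is 10000):
./bin/beeline -u jdbc:hive2://localhost:10002
show tables;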

NoClassDefFoundError when using avro in spark-shell

I keep getting
java.lang.NoClassDefFoundError: org/apache/avro/mapred/AvroWrapper
when calling show() on a DataFrame object. I'm attempting to do this through the shell (spark-shell --master yarn). I can see that the shell recognizes the schema when creating the DataFrame object, but if I execute any actions on the data it will always throw the NoClassDefFoundError when trying to instantiate the AvroWrapper. I've tried adding avro-mapred-1.8.0.jar to my $HDFS_USER/lib directory on the cluster and even included it using the --jars option when launching the shell. Neither of these options worked. Any advice would be greatly appreciated. Below is example code:
scala> import org.apache.spark.sql._
scala> import com.databricks.spark.avro._
scala> val sqc = new SQLContext(sc)
scala> val df = sqc.read.avro("my_avro_file") // recognizes the schema and creates the DataFrame object
scala> df.show // this is where I get NoClassDefFoundError
The DataFrame object itself is created at the val df = ... line, but the data is not read yet. Spark only starts reading and processing the data when you ask for some kind of output (like df.count() or df.show()).
So the original issue is that the avro-mapred package is missing.
Try launching your Spark Shell like this:
spark-shell --packages org.apache.avro:avro-mapred:1.7.7,com.databricks:spark-avro_2.10:2.0.1
The Spark Avro package marks the Avro Mapred package as provided, but it is not available on your system (or classpath) for one reason or another.
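A quick sanity check (not part of the fix) is to confirm from inside spark-shell that the missing class actually resolves once the package is included:
// Throws ClassNotFoundException if avro-mapred is still missing from the driver classpath
Class.forName("org.apache.avro.mapred.AvroWrapper")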
If anyone else runs into this problem, I finally solved it. I removed the CDH Spark package and downloaded it from http://spark.apache.org/downloads.html. After that everything worked fine. Not sure what the issue was with the CDH version, but I'm not going to waste any more time trying to figure it out.
