Spark persistent table is not available in the other node - apache-spark

I have a simple Spark (2.3.0) Standalone cluster with 1 master and 2 workers (node-1 and node-2). I saved my dataframe as a persistent table into the Hive metastore using the saveAsTable command with pyspark on node-1:
>>> df.write.saveAsTable("test")
It works fine. I can restart pyspark on that node (node-1) and can see that the table is still there:
>>> spark.sql('show tables').show()
+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
| default|     test|      false|
+--------+---------+-----------+
But when I go to the other node (node-2), I get the following:
>>> spark.sql('show tables').show()
+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
+--------+---------+-----------+
Can anyone help me with how I can use the table on node-2?

This is because the metastore data is saved locally in an Apache Derby database by default, so each node ends up with its own metastore. I solved this by using MySQL instead of Derby.
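A minimal sketch of that fix, assuming a shared MySQL instance is reachable from both nodes: place a hive-site.xml like the one below in Spark's conf/ directory on every node (the host, database name, driver class, and credentials are placeholders), and make sure the MySQL JDBC driver jar is on Spark's classpath.
<!-- hive-site.xml (sketch): point the Hive metastore at a shared MySQL database -->
<!-- instead of the node-local Derby database; host/db/credentials are placeholders -->
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://metastore-host:3306/metastore?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hiveuser</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hivepassword</value>
  </property>
</configuration>
With that in place, a table created with saveAsTable("test") on node-1 should show up in show tables on node-2, since both pyspark sessions now talk to the same metastore.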

Related

How to read a csv file from unix server in pyspark

I need to create a Spark dataframe from a CSV file which is located on my UNIX server.
I tried the following:
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local").appName("demo").getOrCreate()
df = spark.read.format('csv').option('header', 'True') \
    .load("ftp://USER:PASSWORD@UNIX_IP/home/user/sample.csv")
df.show(10)
But it throws the following error:
pyspark.sql.utils.IllegalArgumentException: java.net.URISyntaxException: Illegal character in user info at index 32
Could anyone help me resolve this? How should we refer to the FTP location in pyspark? Do we need to include any other library for this?
You need to use the addFile method like this:
import org.apache.spark.SparkFiles
sc.addFile("ftp://user:pwd#host:port/home/user/sample.csv")
spark.read.csv(SparkFiles.get("sample.csv")).show()
To test it, you could use a public ftp like:
sc.addFile("ftp://anonymous:anonymous#ftp.gnu.org/README")
spark.read.csv(SparkFiles.get("README")).show(2)
+--------------------+--------------------+
|                 _c0|                 _c1|
+--------------------+--------------------+
| This is ftp.gnu.org| the FTP server o...|
|NOTICE (Updated O...|                null|
+--------------------+--------------------+
In Python:
from pyspark import SparkFiles
sc.addFile('ftp://user:pwd@host:port/home/user/sample.csv')
spark.read.csv(SparkFiles.get('sample.csv')).show()
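Also, since the URISyntaxException complains about the user info part of the URL, it may simply be (an assumption about the asker's setup) that the real password contains characters such as @ or # that are illegal in a URL. A small sketch of percent-encoding the credentials first, using Python 3's urllib.parse, with placeholder values:
# Sketch: percent-encode FTP credentials that contain URL-special characters
# before building the URL; user, password, and host below are placeholders.
from urllib.parse import quote
from pyspark import SparkFiles

user = 'myuser'
password = 'p@ss#word'
url = 'ftp://{}:{}@UNIX_IP/home/user/sample.csv'.format(quote(user, safe=''), quote(password, safe=''))

sc.addFile(url)
spark.read.option('header', 'true').csv(SparkFiles.get('sample.csv')).show(10)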

Create database spark sql

I'm using Spark 2.4.4 with the AWS Glue catalog.
In my Spark job, I need to create a database in Glue if it doesn't exist. I'm using the following statement in Spark SQL to do so.
spark.sql("CREATE DATABASE IF NOT EXISTS %s".format(hiveDatabase));
It works as expected in spark-shell: a database gets created in Glue.
But when I run the same piece of code using spark-submit, the database is not created. Is there a commit/flush that I need to do when using spark-submit?
EDIT
I'm getting different results for show databases in spark-shell and spark-submit. In spark-shell:
+---------------------+
|databaseName         |
+---------------------+
|all                  |
|default              |
|hive-db              |
|navi-database-account|
|navi-par             |
|testdb               |
+---------------------+
And with spark-submit:
+------------+
|databaseName|
+------------+
|default     |
+------------+
It looks like spark-submit is creating the DB somewhere, but not in Glue.
I needed to add the following config:
("spark.sql.catalogImplementation", "hive")

SQLContext in Spark2 not getting updated hive table records

I have a running application that queries a Hive table using HiveContext, and it works fine if I run the application with spark-submit in Spark 1.6. As part of an upgrade we switched to Spark 2.1 and spark2-submit. Since Spark 2 doesn't support HiveContext, I'm using SQLContext instead. The issue I'm facing is that once I start the context, any incremental changes to the Hive table are not visible in the query results. I am starting the SparkSession with enableHiveSupport(). If I stop and restart the application, I can see the new rows. The application writing the data runs MSCK REPAIR TABLE after writing, so I am not sure what I am missing.
This is the code snippet
val spark= SparkSession.builder().enableHiveSupport().getOrCreate()
val sqlc=spark.sqlContext
sqlc.sql("select * from table1").show(false)
+---+----------+----+
| id|   hire_dt|user|
+---+----------+----+
|1.0|2018-01-01|John|
|2.0|2018-12-01|Adam|
+---+----------+----+
Now in another session I added a new row, but if I run the above code it still returns only 2 rows.
This works fine if I do a refresh table first, i.e.:
val spark= SparkSession.builder().enableHiveSupport().getOrCreate()
val sqlc=spark.sqlContext
sqlc.sql("refresh table table1")
sqlc.sql("select * from table1").show(false)
My question is: why should I have to do a refreshTable, when I never needed to in Spark 1.6 querying with HiveContext, and SQLContext is supposed to behave the same way as HiveContext?
Try
sqlContext.refreshTable("my_table")
In Spark 2.x: spark.catalog.refreshTable("my_table")
In SQL format: spark.sql("refresh table my_table")

Spark and Hive table schema out of sync after external overwrite

I'm having issues with the schema for Hive tables being out of sync between Spark and Hive on a MapR cluster with Spark 2.1.0 and Hive 2.1.1.
I need to try to resolve this problem specifically for managed tables, but the issue can be reproduced with unmanaged/external tables.
Overview of Steps
Use saveAsTable to save a dataframe to a given table.
Use mode("overwrite").parquet("path/to/table") to overwrite the data for the previously saved table. I am actually modifying the data through a process external to Spark and Hive, but this reproduces the same issue.
Use spark.catalog.refreshTable(...) to refresh metadata
Query the table with spark.table(...).show(). Any columns that were the same between the original dataframe and the overwriting one will show the new data correctly, but any columns that were only in the new table will not be displayed.
Example
db_name = "test_39d3ec9"
table_name = "overwrite_existing"
table_location = "<spark.sql.warehouse.dir>/{}.db/{}".format(db_name, table_name)
qualified_table = "{}.{}".format(db_name, table_name)
spark.sql("CREATE DATABASE IF NOT EXISTS {}".format(db_name))
Save as a managed table
existing_df = spark.createDataFrame([(1, 2)])
existing_df.write.mode("overwrite").saveAsTable(table_name)
Note that saving as an unmanaged table with the following will produce the same issue:
existing_df.write.mode("overwrite") \
    .option("path", table_location) \
    .saveAsTable(qualified_table)
View the contents of the table
spark.table(table_name).show()
+---+---+
| _1| _2|
+---+---+
|  1|  2|
+---+---+
Overwrite the parquet files directly
new_df = spark.createDataFrame([(3, 4, 5, 6)], ["_4", "_3", "_2", "_1"])
new_df.write.mode("overwrite").parquet(table_location)
View the contents with the parquet reader; the contents show correctly:
spark.read.parquet(table_location).show()
+---+---+---+---+
| _4| _3| _2| _1|
+---+---+---+---+
|  3|  4|  5|  6|
+---+---+---+---+
Refresh Spark's metadata for the table and read it in again as a table. The data will be updated for the columns that were the same, but the additional columns are not displayed.
spark.catalog.refreshTable(qualified_table)
spark.table(qualified_table).show()
+---+---+
| _1| _2|
+---+---+
|  6|  5|
+---+---+
I have also tried updating the schema in hive before calling spark.catalog.refreshTable with the below command in the hive shell:
ALTER TABLE test_39d3ec9.overwrite_existing REPLACE COLUMNS (`_1` bigint, `_2` bigint, `_3` bigint, `_4` bigint);
After running the ALTER command, I then run DESCRIBE and it shows correctly in Hive:
DESCRIBE test_39d3ec9.overwrite_existing
OK
_1 bigint
_2 bigint
_3 bigint
_4 bigint
Before running the ALTER command it only shows the original columns, as expected:
DESCRIBE test_39d3ec9.overwrite_existing
OK
_1 bigint
_2 bigint
I then ran spark.catalog.refreshTable, but it didn't affect Spark's view of the data.
Additional Notes
From the Spark side, I did most of my testing with PySpark, but I also tested in a spark-shell (Scala) and a spark-sql shell. While in the spark-shell I also tried using a HiveContext, but it didn't work:
import org.apache.spark.sql.hive.HiveContext
import spark.sqlContext.implicits._
val hiveObj = new HiveContext(sc)
hiveObj.refreshTable("test_39d3ec9.overwrite_existing")
After performing the ALTER command in the hive shell, I verified in Hue that the schema also changed there.
I also tried running the ALTER command with spark.sql("ALTER ...") but the version of Spark we are on (2.1.0) does not allow it, and it looks like it won't be available until Spark 2.2.0, based on this issue: https://issues.apache.org/jira/browse/SPARK-19261
I have also read through the spark docs again, specifically this section: https://spark.apache.org/docs/2.1.0/sql-programming-guide.html#hive-metastore-parquet-table-conversion
Based on those docs, spark.catalog.refreshTable should work. The configuration for spark.sql.hive.convertMetastoreParquet is typically false, but I switched it to true for testing and it didn't seem to affect anything.
Any help would be appreciated, thank you!
I faced a similar issue while using Spark 2.2.0 in a CDH 5.11.x package.
After df.write.mode("overwrite").saveAsTable(), when I issued spark.read.table(...).show(), no data was displayed.
On checking, I found it was a known issue with the CDH Spark 2.2.0 version. The workaround was to run the command below after the saveAsTable command was executed.
spark.sql("ALTER TABLE qualified_table set SERDEPROPERTIES ('path'='hdfs://{hdfs_host_name}/{table_path}')")
spark.catalog.refreshTable("qualified_table")
e.g. if your table LOCATION is hdfs://hdfsHA/user/warehouse/example.db/qualified_table,
then assign 'path'='hdfs://hdfsHA/user/warehouse/example.db/qualified_table'.
This worked for me. Give it a try; I assume by now your issue has been resolved, but if not, you can try this method.
workaround source: https://www.cloudera.com/documentation/spark2/2-2-x/topics/spark2_known_issues.html
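Putting that workaround together in PySpark, roughly (the database, table, and HDFS path below are placeholders):
# Sketch of the CDH workaround: point the table's 'path' serde property at the
# real HDFS location, then refresh Spark's cached metadata. Names and paths are placeholders.
qualified_table = 'example.qualified_table'
table_path = 'hdfs://hdfsHA/user/warehouse/example.db/qualified_table'

spark.sql("ALTER TABLE {} SET SERDEPROPERTIES ('path'='{}')".format(qualified_table, table_path))
spark.catalog.refreshTable(qualified_table)
spark.read.table(qualified_table).show()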

Can't access Spark 2.0 Temporary Table from beeline

With Spark 1.5.1, I've already been able to access spark-shell temporary tables from Beeline using the Thrift Server, by following answers to related questions on Stack Overflow.
However, after upgrading to Spark 2.0, I can't see temporary tables from Beeline anymore. Here are the steps I'm following.
I'm launching spark-shell using the following command:
./bin/spark-shell --master=myHost.local:7077 --conf spark.sql.hive.thriftServer.singleSession=true
Once the spark-shell is ready, I enter the following lines to launch the Thrift Server and create a temporary view from a dataframe whose source is a JSON file:
import org.apache.spark.sql.hive.thriftserver._
spark.sqlContext.setConf("hive.server2.thrift.port","10002")
HiveThriftServer2.startWithContext(spark.sqlContext)
val df = spark.read.json("examples/src/main/resources/people.json")
df.createOrReplaceTempView("people")
spark.sql("select * from people").show()
The last statement displays the table; it runs fine.
However, when I start Beeline and connect to my Thrift Server instance, I can't see any temporary tables:
show tables;
+------------+--------------+--+
| tableName | isTemporary |
+------------+--------------+--+
+------------+--------------+--+
No rows selected (0,658 seconds)
Did I miss something regarding my Spark upgrade from 1.5.1 to 2.0? How can I gain access to my temporary tables?
This worked for me after upgrading to Spark 2.0.1:
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

val sparkConf =
  new SparkConf()
    .setAppName("Spark Thrift Server Demo")
    .setMaster(sparkMaster)
    .set("hive.metastore.warehouse.dir", hdfsDataUri + "/hive")

val spark = SparkSession
  .builder()
  .enableHiveSupport()
  .config(sparkConf)
  .getOrCreate()

val sqlContext = new org.apache.spark.sql.SQLContext(spark.sparkContext)
HiveThriftServer2.startWithContext(sqlContext)
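As a follow-up usage note (assuming the Thrift Server started above listens on the default port 10000 on the same host; adjust the host and port if you set hive.server2.thrift.port as in the question), you can then connect from Beeline and list the views:
./bin/beeline -u jdbc:hive2://localhost:10000
# at the beeline prompt:
# show tables;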
