How to refresh a table and do it concurrently? - apache-spark

I'm using Spark Streaming 2.1. I'd like to periodically refresh some cached tables (loaded from a Spark-provided data source such as Parquet or MySQL, or from user-defined data sources).
How do I refresh the table?
Suppose I have some table loaded by
spark.read.format("").load().createTempView("my_table")
and it is also cached by
spark.sql("cache table my_table")
is the following enough to refresh the table, and will it automatically be cached again the next time it is loaded:
spark.sql("refresh table my_table")
or do I have to do that manually with
spark.table("my_table").unpersist
spark.read.format("").load().createOrReplaceTempView("my_table")
spark.sql("cache table my_table")
Is it safe to refresh the table concurrently?
By concurrently I mean using a ScheduledThreadPoolExecutor to do the refresh work apart from the main thread.
What will happen if Spark is using the cached table when I call refresh on it?
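For clarity, here is a rough sketch of the manual variant I have in mind, with a ScheduledThreadPoolExecutor driving the refresh off the main thread (the format, path, and interval are placeholders):
import java.util.concurrent.{ScheduledThreadPoolExecutor, TimeUnit}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("periodic-refresh").getOrCreate()

// Initial load and cache (placeholder format and path).
spark.read.format("parquet").load("/data/my_table").createOrReplaceTempView("my_table")
spark.sql("cache table my_table")

// Refresh task: drop the cached copy, re-register the view, cache again.
val refresher = new ScheduledThreadPoolExecutor(1)
val task = new Runnable {
  override def run(): Unit = {
    spark.table("my_table").unpersist()
    spark.read.format("parquet").load("/data/my_table").createOrReplaceTempView("my_table")
    spark.sql("cache table my_table")
  }
}
// Run every 10 minutes, apart from the main thread.
refresher.scheduleAtFixedRate(task, 10, 10, TimeUnit.MINUTES)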

Spark 2.2.0 introduced a feature for refreshing the metadata of a table when it is updated by Hive or an external tool.
You can do this through the catalog API:
spark.catalog.refreshTable("my_table")
This call updates the metadata for the table to keep it consistent.
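As a rough illustration (the table name is a placeholder; after a refresh, a previously cached table is typically re-cached lazily, so touching it forces the re-materialisation):
// Invalidate the cached metadata/data for the table after the underlying files change.
spark.catalog.refreshTable("my_table")

// Optional sanity check: confirm the table is still registered as cached,
// then touch it so the refreshed data is actually re-materialised.
if (spark.catalog.isCached("my_table")) {
  spark.table("my_table").count()
}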

I had a problem reading a table from Hive using a SparkSession, specifically the table method, i.e. spark.table(table_name). Every time I wrote the table and then tried to read it,
I got this error:
java.io.FileNotFoundException ... The underlying files may have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
I tried to refresh the table using spark.catalog.refreshTable(table_name) and also via sqlContext; neither worked.
My solution was to write the table and afterwards read it using:
val usersDF = spark.read.load(s"/path/table_name")
That works fine.
Is this a problem? Maybe the data on HDFS is not updated yet?

Related

Cache error when querying table stored in Glue Data Catalog through Zeppelin

I have a problem with the way Zeppelin caches tables. We update the data in the Glue Data Catalog in real time, so when we want to query a partition that was updated using Spark, we sometimes get the following error:
org.apache.spark.sql.execution.datasources.FileDownloadException: Failed to download file path: s3://bucket/prefix/partition.snappy.parquet, range: 0-16165503, partition values: [empty row], isDataPresent: false, eTag: 53ea26b5ecc9a194efe5163f3c297800-1
This can be solved by issuing the command refresh table <table_name> or by restarting the Spark interpreter from the Zeppelin UI, but it might as well be the retry that resolves the issue rather than the cache being dropped.
One solution may be to run a scheduled query that refreshes all tables at a given time (sketched below), but this would be highly inefficient.
Thanks!
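For reference, a minimal sketch of the scheduled "refresh all tables" idea mentioned above, using the catalog API (the database name and interval are assumptions, and as noted this may be inefficient for many tables):
import java.util.concurrent.{Executors, TimeUnit}

val scheduler = Executors.newSingleThreadScheduledExecutor()
val refreshAll = new Runnable {
  override def run(): Unit = {
    // Refresh every table registered in the assumed database.
    spark.catalog.listTables("my_database").collect().foreach { t =>
      val qualified = Option(t.database).map(db => s"$db.${t.name}").getOrElse(t.name)
      spark.catalog.refreshTable(qualified)
    }
  }
}
scheduler.scheduleAtFixedRate(refreshAll, 0, 15, TimeUnit.MINUTES)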
Try spark.sql("refresh TABLE {db}.{table}")
When to execute REFRESH TABLE my_table in Spark?

Hive Table or view not found although the Table exists

I am trying to run a Spark job written in Java on the Spark cluster to load records as a DataFrame into a Hive table I created.
df.write().mode("overwrite").insertInto("dbname.tablename");
Although the table and database exist in Hive, it throws the error below:
org.apache.spark.sql.AnalysisException: Table or view not found: dbname.tablename, the database dbname doesn't exist.;
I also tried reading from an existing Hive table different from the one above, thinking there might have been an issue with my table creation.
I also checked whether my user has permission on the HDFS folder where Hive stores the data.
It all looks fine, and I am not sure what the issue could be.
Please suggest.
Thanks
I think it is searching for that table in Spark's catalog instead of Hive's.
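If that's the case, one thing to check (an assumption on my part, not confirmed above) is whether the SparkSession is built with Hive support; without it, Spark resolves dbname.tablename against its own built-in catalog rather than the Hive metastore. A minimal sketch:
import org.apache.spark.sql.SparkSession

// Assumes hive-site.xml is on the classpath so Spark can reach the Hive metastore.
val spark = SparkSession.builder()
  .appName("load-into-hive")
  .enableHiveSupport()
  .getOrCreate()

// Stand-in for the DataFrame built in the question (path is a placeholder).
val df = spark.read.format("parquet").load("/path/to/records")
df.write.mode("overwrite").insertInto("dbname.tablename")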

External Hive Table Refresh table vs MSCK Repair

I have an external Hive table stored as Parquet, partitioned on a column as_of_dt, and data gets inserted via Spark Streaming.
Every day a new partition gets added. I am running msck repair table so that the Hive metastore gets the newly added partition info. Is this the only way, or is there a better one? I am concerned that if downstream users are querying the table, msck repair might cause issues with unavailable or stale data. I was going through the HiveContext API and saw the refreshTable option. Does it make sense to use refreshTable instead?
To directly answer your question: msck repair table checks whether the partitions for a table are active, meaning that if you deleted a handful of partitions and don't want them to show up in the show partitions command for the table, msck repair table should drop them. Msck repair can take more time than an invalidate or refresh statement; however, Invalidate Metadata is an Impala statement that only reloads metadata from the Hive Metastore, while Refresh runs only in Spark SQL and updates Spark's metadata store.
The Hive metastore should be fine if you are completing the add-partition step somewhere in the processing; however, if you ever want to access the Hive table through Spark SQL, you will need to update the metadata through Spark (or Impala or another process that updates the Spark metadata).
Anytime you update or change the contents of a Hive table, the Spark metastore can fall out of sync, leaving you unable to query the data through the spark.sql command set; if you want to query that data, you need to keep the Spark metastore in sync.
If you have a Spark version that allows for it, refresh and add partitions to Hive tables from within Spark, so that all metastores stay in sync. Below is how I do it:
//Non-Partitioned Table: overwrite the files, then refresh Spark's cached metadata.
outputDF.write.format("parquet").mode("overwrite").save(fileLocation)
spark.sql("refresh table " + tableName)

//Partitioned Table: write the partition, register it in the Hive metastore, then refresh.
outputDF.write.format("parquet").mode("overwrite").save(fileLocation + "/" + partition)
val addPartitionsStatement = "alter table " + tableName + " add if not exists partition(partitionKey='" + partition + "') location '" + fileLocation + "/" + partition + "'"
spark.sql(addPartitionsStatement)
spark.sql("refresh table " + tableName)
It looks like refreshTable does refresh the cached metadata, without affecting the Hive metadata.
The doc says:
Invalidate and refresh all the cached the metadata of the given table.
For performance reasons, Spark SQL or the external data source library
it uses might cache certain metadata about a table, such as the
location of blocks. When those change outside of Spark SQL, users
should call this function to invalidate the cache.
The method does not update the Hive metadata, so the repair step is still necessary.
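Putting the two together, a sketch of the sequence after a new partition lands (the table name is a placeholder): the repair updates the Hive metastore, and the refresh invalidates Spark's cached metadata.
// Register partitions added outside of Hive/Spark DDL with the Hive metastore.
spark.sql("msck repair table my_db.my_table")

// Invalidate Spark's cached file listings and metadata for the table.
spark.catalog.refreshTable("my_db.my_table")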

Live sync of Spark RDD from Cassandra table

I'm looking for a way to keep a Spark RDD in sync with a Cassandra table. I know it is possible to load a full Cassandra table into an RDD as a one-shot operation, but I would like to keep the RDD synchronized with updates happening to the Cassandra table.
This would let me avoid reloading the full table into Spark every time I need fresh data (which can take a long time if the table is big).
Any hints?

Caching tables in Spark SQL

I'm using Spark SQL and would like to cache a table that was originally created in Hive. This works fine if the table is in Hive's default database, e.g.
CACHE TABLE test1;
However, if it is in a different database, e.g. myDB, then I cannot do
CACHE TABLE myDB.test1;
since Spark complains with failure: ``as'' expected but `.' found.
I can however access and query the table, for instance by running
SELECT * FROM myDB.test1;
Is there a way round this?
Found an answer:
USE myDB;
CACHE TABLE test1;
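The same workaround expressed through the session API (a sketch; on newer Spark versions spark.catalog.cacheTable also accepts the qualified name directly, but that depends on the version in use):
spark.sql("USE myDB")
spark.sql("CACHE TABLE test1")

// On newer versions this works directly with the qualified name (version-dependent):
// spark.catalog.cacheTable("myDB.test1")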
