Caching tables in Spark SQL - apache-spark

I'm using Spark SQL and would like to cache a table that was originally created in Hive. This works fine if the table is in Hive's default database, e.g.
CACHE TABLE test1;
However, if it is in a different database, e.g. myDB, then I cannot do
CACHE TABLE myDB.test1;
since Spark complains with the error: failure: ``as'' expected but `.' found.
I can however access and query the table, for instance by running
SELECT * FROM myDB.test1;
Is there a way round this?

Found an answer:
USE myDB;
CACHE TABLE test1;
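As a sketch, the same workaround can also be applied programmatically, assuming an existing SparkSession named spark (in later Spark releases a qualified name via spark.catalog.cacheTable may work as well):

// Switch the current database, then cache the unqualified table name.
spark.sql("USE myDB")
spark.sql("CACHE TABLE test1")
// Possible alternative in newer Spark versions:
// spark.catalog.cacheTable("myDB.test1")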

Related

Cache error when querying table stored in Glue Data Catalog through Zeppelin

I have an error with the way Zeppelin caches tables. We update the data in the Glue Data Catalog in real time, so when we want to query a partition that was updated using Spark, we sometimes get the following error:
org.apache.spark.sql.execution.datasources.FileDownloadException: Failed to download file path: s3://bucket/prefix/partition.snappy.parquet, range: 0-16165503, partition values: [empty row], isDataPresent: false, eTag: 53ea26b5ecc9a194efe5163f3c297800-1
This can be solved by issuing the command REFRESH TABLE <table_name> or by restarting the Spark interpreter from the Zeppelin UI, although it might just be the retry that solves the issue rather than the cache being cleared.
One solution may be to run a scheduled query that refreshes all tables at a given time, but this would be highly inefficient.
Thanks!
Try spark.sql("REFRESH TABLE {db}.{table}") before querying.
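As a minimal sketch of that approach, assuming an existing SparkSession named spark and a hypothetical Glue-backed table my_db.events partitioned by dt:

// Invalidate Spark's cached metadata and file listing for the table,
// then query the partition that was updated in the Glue Data Catalog.
spark.sql("REFRESH TABLE my_db.events")
val updated = spark.sql("SELECT * FROM my_db.events WHERE dt = '2021-01-01'")
updated.show()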
When to execute REFRESH TABLE my_table in spark?

How to access HIVE ACID tables using HANA SDA Virtual table?

We are currently using HANA 1 SPS 12 and the Spark Controller to create virtual tables and access Hive data in HANA. The problem is that we have some SC2 tables we want to archive in HANA, for which we need full CRUD operations. We have converted a few of the Hive tables to ACID (transactional = true). Now we are not able to fetch records; it returns 0 records.
We tried to use Drill, which has native support for Hive ACID tables, but when we query the Hive tables using the Drill ODBC driver and DSN, it fails. After checking the queries that hit Drill, we found out that HANA is wrapping the schema name in double quotes, e.g. SELECT * FROM "hive.schemaname".tablename.
We tried to change the default quoting character from the backtick to ", but ended up losing remote schema refresh, as that query wraps the schema name with a backtick `.
There is an incompatibility between Spark 2 and Hive ACID transactional tables. External tables can be used as a workaround. If you have an S-user, SAP has documented this in Note 2901291 (SAP HANA Spark Controller - Incompatibility of Spark 2 and Hive ACID transactional tables).
ACID table access is not possible using SDA, so we created a non-ACID table from the ACID table in Hive and accessed that back in HANA. We use the ACID table to perform all the necessary CRUD operations and then INSERT OVERWRITE into a partitioned non-ACID table; this way only the partitions where an update/change happened are recreated, rather than the whole table.

Hive Table or view not found although the Table exists

I am trying to run a Spark job written in Java on the Spark cluster to load records as a DataFrame into a Hive table I created.
df.write().mode("overwrite").insertInto("dbname.tablename");
Although the table and database exists in Hive, it throws below error:
org.apache.spark.sql.AnalysisException: Table or view not found: dbname.tablename, the database dbname doesn't exist.;
I also tried reading from an existing Hive table different from the one above, thinking there might have been an issue with my table creation.
I also checked whether my user has permission to the HDFS folder where Hive stores the data.
It all looks fine; I'm not sure what the issue could be.
Please suggest.
Thanks
I think it is searching for that table in Spark's own catalog instead of in Hive.
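One common cause, an assumption here rather than something stated in the thread, is that the SparkSession was not built with Hive support, so table names resolve against Spark's in-memory catalog instead of the Hive metastore. A minimal sketch in Scala:

import org.apache.spark.sql.SparkSession

// Build the session with Hive support so dbname.tablename resolves
// against the Hive metastore rather than Spark's default catalog.
val spark = SparkSession.builder()
  .appName("hive-insert-example")  // hypothetical application name
  .enableHiveSupport()
  .getOrCreate()

// The Scala equivalent of the insert from the question should then find the table:
// df.write.mode("overwrite").insertInto("dbname.tablename")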

How do I find if a Hive table is defined as an external table through PySpark?

For context: the data lives on S3, written as Hive tables. I'm running some Jupyter notebooks on my local machine that are supposed to point to the S3 data as Hive tables, with the metadata being stored in a relational DB on a Spark cluster.
When I run some local scripts/Jupyter notebooks on my local machine to create and load some tables, it says that I've created external tables even though I didn't create them as external tables.
When I run spark.sql("show tables in target_db").show(20, False) I see nothing. Then I create the table without the external option and run the show command again, which outputs:
+---------+---------+-----------+
|database |tableName|isTemporary|
+---------+---------+-----------+
|target_db|mytable  |false      |
+---------+---------+-----------+
and run my script, which errors out, saying: org.apache.spark.sql.AnalysisException: Operation not allowed: TRUNCATE TABLE on external tables: `target_db`.`mytable`;
I dropped the table on the cluster itself, so I think there's no issue with that. Why does Spark think that my table is an external table? Do I need to change how I'm creating the table?
You should access the data from S3 by creating an external table. Say this table is called T1.
If the table T1 definition uses partitions, then you need to repair the table to load the partitions.
You cannot truncate the external table T1. You should only read from it.
Tables created with a CREATE EXTERNAL TABLE ... statement are external.
Tables created with CREATE TABLE are not.
You can check which one you have with SHOW CREATE TABLE table_name,
or with DESCRIBE FORMATTED table_name; the field called Type will be either MANAGED or EXTERNAL.
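As a sketch of checking this programmatically with the catalog API, assuming an existing SparkSession named spark and using the target_db.mytable names from the question (the same calls are available from PySpark):

// Look the table up in the session catalog; tableType is
// "MANAGED", "EXTERNAL", or "VIEW".
val tbl = spark.catalog.getTable("target_db", "mytable")
println(s"${tbl.database}.${tbl.name} is ${tbl.tableType}")

// Equivalent SQL route: DESCRIBE FORMATTED and read the Type row.
spark.sql("DESCRIBE FORMATTED target_db.mytable")
  .filter("col_name = 'Type'")
  .show(false)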

How to refresh a table and do it concurrently?

I'm using Spark Streaming 2.1. I'd like to periodically refresh some cached tables (loaded by a Spark-provided DataSource like Parquet, MySQL, or a user-defined data source).
How do I refresh the table?
Suppose I have some table loaded by
spark.read.format("").load().createTempView("my_table")
and it is also cached by
spark.sql("cache table my_table")
Is it enough to refresh the table with the following code, so that the next time the table is loaded it will automatically be cached again?
spark.sql("refresh table my_table")
Or do I have to do that manually with:
spark.table("my_table").unpersist
spark.read.format("").load().createOrReplaceTempView("my_table")
spark.sql("cache table my_table")
Is it safe to refresh the table concurrently?
By concurrent I mean using a ScheduledThreadPoolExecutor to do the refresh work apart from the main thread.
What will happen if Spark is using the cached table when I call refresh on the table?
In Spark 2.2.0 they introduced a feature for refreshing the metadata of a table if it was updated by Hive or some external tool.
You can achieve it by using the API:
spark.catalog.refreshTable("my_table")
This API will update the metadata for that table to keep it consistent.
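To address the concurrency part of the question, here is a rough sketch of running the manual unpersist/reload/cache cycle from a scheduled executor off the main thread, assuming an existing SparkSession named spark; the path, format, and 30-minute interval are assumptions, and readers that already hold the old cached plan are not coordinated here:

import java.util.concurrent.{Executors, TimeUnit}

// One background thread dedicated to refreshing the cached view.
val scheduler = Executors.newSingleThreadScheduledExecutor()

val refreshTask = new Runnable {
  override def run(): Unit = {
    // Drop the old cached data, re-register the view, and cache it again.
    spark.catalog.uncacheTable("my_table")
    spark.read.format("parquet").load("/data/my_table")  // hypothetical path
      .createOrReplaceTempView("my_table")
    spark.sql("CACHE TABLE my_table")
  }
}

// Refresh every 30 minutes, starting 30 minutes from now.
scheduler.scheduleAtFixedRate(refreshTask, 30, 30, TimeUnit.MINUTES)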
I had a problem reading a table from Hive using a SparkSession, specifically the table method, i.e. spark.table(table_name). Every time after writing the table and then trying to read it,
I got this error:
java.io.FileNotFoundException ... The underlying files may have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
I tried to refresh the table using spark.catalog.refreshTable(table_name) and also via sqlContext; neither worked.
My solution was to write the table and afterwards read it directly from the path:
val usersDF = spark.read.load(s"/path/table_name")
It works fine.
Is this a problem? Maybe the data on HDFS is not updated yet?
