How to find out whether a Spark table is parquet or delta? - apache-spark

I have a database with some tables in parquet format and others in delta format. If I want to append data to a table, I need to specify the format explicitly when the table is delta (the default is parquet).
How can I determine a table's format?
I tried show tblproperties <tbl_name> but this gives an empty result.

According to the Delta Lake API docs you can check
DeltaTable.isDeltaTable(spark, "path")
Please see the note in the documentation:
This uses the active SparkSession in the current thread to read the table data. Hence, this throws an error if the active SparkSession has not been set, that is, if SparkSession.getActiveSession() is empty.
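Below is a minimal sketch of that check, assuming an active SparkSession named spark and a hypothetical path and table name; for a table registered in the metastore, DESCRIBE EXTENDED typically lists a Provider row that reports the format (delta vs. parquet):

from delta.tables import DeltaTable

# True if the path holds a Delta table, False otherwise (hypothetical path)
is_delta = DeltaTable.isDeltaTable(spark, "/mnt/data/my_table")

# For a metastore table (hypothetical name my_db.my_table), read the Provider row
provider = (
    spark.sql("DESCRIBE EXTENDED my_db.my_table")
         .filter("col_name = 'Provider'")
         .select("data_type")
         .first()
)
print(is_delta, provider)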

Related

External Table in Databricks is showing only future date data

I had a Delta table in Databricks and the data is available in ADLS. The data is partitioned by a date column. From 01-06-2022 onwards the data is available in parquet format in ADLS, but when I query the table in Databricks I can only see data from the latest date onwards each day; the older data is not displayed. Every day the data is overwritten to the table path, partitioned by the date column.
df.write.format('delta').mode('overwrite').save('{}/{}'.format(DELTALAKE_PATH, table))
Using overwrite mode deletes the past data and writes only the new data. This is the reason for your issue.
df.write.format('delta').mode('append').save('{}/{}'.format(DELTALAKE_PATH, table))
Using append mode adds the new data to the existing data. This keeps your existing data, and when you execute a query it will return the past records as well.
You need to use append mode in place of overwrite mode.
Append Mode - Only the new rows appended in the Result Table since the last trigger will be written to the external storage. This is applicable only to queries where existing rows in the Result Table are not expected to change.
Reference - https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#basic-concepts
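A minimal sketch of the daily write in append mode, assuming DELTALAKE_PATH, table, and a DataFrame df partitioned by the date column as described in the question:

# Append keeps the existing partitions and only adds the new day's data.
(
    df.write
      .format('delta')
      .mode('append')
      .partitionBy('date')   # partition column name from the question
      .save('{}/{}'.format(DELTALAKE_PATH, table))
)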

Query Hive table in spark 2.2.0

I have a Hive table (say table1) in Avro file format with 1900 columns. When I query the table in Hive I am able to fetch data, but when I query the same table in Spark SQL I get "metastore client lost connection. Attempting to reconnect".
I have also queried another Hive table (say table2) in Avro file format with 130 columns; it fetches data in both Hive and Spark.
What I observed is that I can see data in the HDFS location of table2, but I can't see any data in table1's HDFS location (yet it fetches data when I query it in Hive).
A split tells you the number of mappers in the MR job.
It doesn't show you the exact location from which the data was picked up.
The following will help you check where the data for table1 is stored in HDFS.
For table1: you can check the location of the data in HDFS by running a SELECT query with WHERE conditions in Hive, with MapReduce as the execution engine. Once the job is complete, check the map task's log in the YARN application (specifically for the text "Processing file") to find where the input data files were read from.
Also, try checking the location of the data for both tables in the Hive metastore by running "SHOW CREATE TABLE <table_name>;" in Hive for each of them. In the result, check the "LOCATION" details.
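As a minimal sketch, the same metastore check can also be run from Spark, assuming a Hive-enabled SparkSession named spark and the table names table1 and table2 from the question:

# Print each table's DDL; the LOCATION clause shows where its data lives in HDFS.
for tbl in ("table1", "table2"):
    ddl = spark.sql("SHOW CREATE TABLE {}".format(tbl)).first()[0]
    print(ddl)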

Hive PartitionFilter are not Applied

I am facing this issue with Hive.
When I query a table which is partitioned on a date column, e.g.
SELECT count(*) FROM table_name WHERE date='2018-06-01'
the query reads the entire table data and keeps running for hours.
Using EXPLAIN I found that Hive is not applying the PartitionFilter to the query.
I have double-checked that the table is partitioned on the date column with desc table_name.
The execution engine is Spark, and the data is stored in Azure Data Lake in Parquet format.
However, I have another table in the database for which the PartitionFilter is applied and which executes as expected.
Can there be some issue with the Hive metadata, or is it something else?
Found the cause of this issue:
Hive wasn't applying the partition filters on some tables because those tables were cached.
So when I restarted the Thrift server the cache was cleared and the partition filters were applied.
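As a minimal sketch, the same check and a cache reset could also be done from Spark without restarting the Thrift server, assuming a SparkSession named spark attached to the same metastore and the table_name from the question:

# Inspect the plan; look for a partition filter on date in the scan node.
spark.sql("SELECT count(*) FROM table_name WHERE date = '2018-06-01'").explain(True)

# Drop cached data and re-read the table's metadata and partition information.
spark.catalog.clearCache()
spark.catalog.refreshTable("table_name")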

After creating a dataframe from a hive table, if the data in the table is altered, will the dataframe contain new data or old data?

Data gets loaded into a dataframe only when an action is performed on it.
But if the data in the table is modified after creating the dataframe from a Hive table and before performing any action, will the changes be reflected in the dataframe?
The dataframe will not contain the old data, because the dataframe does not contain any data at all. A dataframe is nothing more than a "query plan", not materialized data.
In your case, I would say that you get the new data, or alternatively a FileNotFoundException if Spark has already cached the Hive table's metadata and file names and these changed with the new data.
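A minimal sketch illustrating that lazy evaluation, assuming a Hive-enabled SparkSession named spark and a hypothetical table my_db.my_table:

df = spark.table("my_db.my_table")   # builds only a query plan; nothing is read yet

# ... the underlying table is modified by another process here ...

# The action triggers the actual read, so it sees the table as it is now.
# Refreshing first avoids a FileNotFoundException from stale cached file listings.
spark.catalog.refreshTable("my_db.my_table")
df.show()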

Unable to append data into existing hive table with HiveContext

We are reading data from a Hive table with hiveContext into a Spark dataframe. After doing some aggregations on the data, we store it into another table (which already has data). But the new data is not being appended to the existing table... and no error is shown either...
Note: before storing into Hive I am able to print the dataframe.
df.write.insertInto("target_db.target_table", overwrite=False)
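A minimal sketch of the append, assuming a SparkSession (or the hiveContext above) named spark and that the target table already exists; note that insertInto matches columns by position rather than by name, so the dataframe's column order must match the target table's schema:

# Reorder the dataframe columns to the target table's schema before appending.
target_cols = spark.table("target_db.target_table").columns
df.select(*target_cols).write.insertInto("target_db.target_table", overwrite=False)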
