Overwrite underlying parquet data seamlessly for an Impala table - apache-spark

I have an Impala table backed by parquet files which is used by another team.
Every day I run a batch Spark job that overwrites the existing parquet files (a new data set is created; the existing files are deleted and new files are written in their place).
Our Spark code looks like this:
dataset.write.format("parquet").mode("overwrite").save(path)
During this update (overwriting the parquet data files and then running REFRESH on the Impala table), anyone who queries the table ends up with an error saying the underlying data files are not there.
Is there any solution or workaround for this issue? I do not want other teams to see the error at any point in time when they access the table.
Maybe I can write the new data files to a different location and then make the Impala table point to that location?

The behaviour you are seeing is a consequence of how Impala is designed to work. Impala fetches the metadata of the table, such as the table structure, partition details and HDFS file paths, from the HMS, and the block details of the corresponding HDFS file paths from the NameNode. All of these details are fetched by the Catalog and distributed across the Impala daemons for their execution.
When the table's underlying files are removed and new files are written outside Impala, it is necessary to perform a REFRESH so that the new file details (the files and their corresponding block details) are fetched and distributed across the daemons. This way Impala becomes aware of the newly written files.
Since you're overwriting the files, Impala queries fail to find the files Impala is aware of, because they have already been removed while the new files are still being written. This is expected behaviour.
As a solution, you can do one of the following:
Append the new files to the table's existing HDFS path instead of overwriting. This way, Impala queries against the table still return results, although only the older data (because Impala is not yet aware of the new files); the error you mentioned is avoided while the overwrite is in progress. Once the new files are in the table's directories, you can perform an HDFS operation to remove the old files, followed by an Impala REFRESH statement for the table (a rough sketch of this follows).
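A rough sketch of this append-then-clean-up approach in PySpark, assuming a hypothetical table my_db.my_table, a placeholder HDFS path, and impala-shell as the client used to issue the REFRESH:

import subprocess

table_path = "/user/hive/warehouse/my_db.db/my_table"  # assumed table location

# Remember which data files existed before this run, so only they get removed later.
hadoop = spark._jvm.org.apache.hadoop
fs = hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())
old_files = [f.getPath() for f in fs.listStatus(hadoop.fs.Path(table_path))
             if f.getPath().getName().startswith("part-")]

# Append the new snapshot next to the old files; Impala queries keep returning
# the old data until the REFRESH below is issued.
dataset.write.format("parquet").mode("append").save(table_path)

# Remove the old files and make Impala pick up the new ones.
for old in old_files:
    fs.delete(old, False)
subprocess.run(["impala-shell", "-q", "REFRESH my_db.my_table"], check=True)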
OR
As you said, you can write the new parquet files to a different HDFS path and, once the write is complete, either [remove the old files, move the new files into the table's actual HDFS path, and run a REFRESH] or [issue an ALTER TABLE statement against the table to point its data location at the new directory]. If it's a daily process, you might implement this as a script that runs after a successful Spark write, passing the old and new directories as arguments. A sketch of the ALTER-the-location variant follows.
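This sketch uses placeholder database, table and directory names; the Impala statements are submitted through impala-shell here, but any Impala client (JDBC/ODBC) would work, and the path may need to be a fully qualified hdfs:// URI depending on your setup:

from datetime import date
import subprocess

new_path = f"/data/my_table/snapshot_{date.today():%Y%m%d}"  # fresh directory per run

# 1. Write the new snapshot into a directory that no reader is using yet.
dataset.write.format("parquet").mode("overwrite").save(new_path)

# 2. Repoint the table at the new directory, then refresh its metadata.
for stmt in (f"ALTER TABLE my_db.my_table SET LOCATION '{new_path}'",
             "REFRESH my_db.my_table"):
    subprocess.run(["impala-shell", "-q", stmt], check=True)

# 3. The previous day's directory can be removed later, once no queries still read it.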
Hope this helps!

Related

how do you remove underlying files from s3 when using pyspark overwrite mode

To write data to an external Hive table, where the data is being stored in an S3 bucket, we can use df.write.parquet("s3://...", mode="overwrite", partitionBy="id").
In s3 I would expect to see id=1/, id=2/, id=3/, id=n.../.
My understanding is that with the overwrite mode, if the same id exists in your final dataframe that you are performing the write operation on, those partitions will be updated.
But how do we get rid of other partitions in the external location that aren't part of the new dataset?
For example, say the new dataset contained id=5,6,7 and we no longer need id=1,2,3, is it true that the overwrite mode is not going to delete these folders? If so, what is the best way to remove them?

Is it ok to use a delta table tracker based on parquet file name in Azure Databricks?

Today at work I saw a delta lake tracker based on file name. By delta tracker, I mean a function that defines whether a parquet file has already been ingested or not.
The code would check which files (from the delta table) have not already been ingested, and the parquet files in the delta table would then be read using this: spark.createDataFrame(path, StringType())
Having worked with Delta tables, it does not seem ok to me to use a delta tracker that way.
In case a record is deleted, what are the chances that the delta log would point to a new file, and that this deleted record would be read as a new one?
In case a record is updated, what would be the chance that the delta log would not point to a new file, and that this updated record would not be considered?
In case some maintenance is happening on the delta table, what are the chances that some new files are written out of nowhere, which may cause a record to be re-ingested?
Any observation or suggestion on whether it is ok to work that way would be great. Thank you.
In Delta Lake everything works at the file level, so there are no 'in place' updates or deletes. Say a single record gets deleted (or updated); then roughly the following happens:
Read in the parquet file with the relevant record (+ the other records which happen to be in the file)
Write out all records except for the deleted record into a new parquet file
Update the transaction log with a new version, marking the old parquet file as removed and the new parquet file as added. Note the old parquet file doesn't get physically deleted until you run the VACUUM command.
The process for an update is basically the same.
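A small illustration of that file-level behaviour, as a sketch using the delta-spark DeltaTable API; the table path and the delete predicate are made up for the example:

from delta.tables import DeltaTable

dt = DeltaTable.forPath(spark, "/mnt/delta/events")

# Deleting a single record rewrites the parquet file that contained it: the
# surviving records land in a brand-new file, and the transaction log marks the
# old file as removed and the new file as added.
dt.delete("event_id = 42")

# The old file is only physically removed once VACUUM runs (after the retention period).
dt.vacuum()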
To answer your questions more specifically:
In case record is deleted, what are the chances that the delta log would point to a new file, and that this deleted record would be read as a new one?
The delta log will point to a new file, but the deleted record will not be in there. There will be all the other records which happened to be in the original file.
In case record is updated, what would be the chance that delta log would not point to a new file, and that this updated record would not be considered?
Files are not updated in place, so this doesn't happen. A new file is written containing the updated record (+ any other records from the original file). The transaction log is updated to 'point' to this new file.
In case some maintenance is happening on the delta table, what are the chances that some new files are written out of nowhere? Which may cause a record to be re-ingested.
This is possible, although not 'out of nowhere'. For example, if you run OPTIMIZE, existing parquet files get reshuffled/combined to improve performance. Basically this means a number of new parquet files will be written and a new version in the transaction log will point to these parquet files. If you pick up all new files after this, you will re-ingest data.
Some considerations: if your delta table is append-only, you could use Structured Streaming to read from it instead. If not, Databricks offers Change Data Feed, giving you record-level details of inserts, updates and deletes.
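Hedged sketches of both alternatives; the table paths/names and the starting version are placeholders:

# Append-only delta table: consume it incrementally with Structured Streaming.
new_rows = (spark.readStream
            .format("delta")
            .load("/mnt/delta/source_table"))

# Table with updates/deletes: read the Change Data Feed (the table must have
# delta.enableChangeDataFeed = true) to get record-level inserts, updates and deletes.
changes = (spark.read
           .format("delta")
           .option("readChangeFeed", "true")
           .option("startingVersion", 12)
           .table("my_db.source_table"))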

Spark overwrite does not delete files in target path

My goal is to build a daily process that will overwrite all partitions under a specific path in S3 with new data from a data frame.
I do -
df.write.format(source).mode("overwrite").save(path)
(Also tried the dynamic overwrite option).
However, in some runs old data is not deleted, meaning I see files from an old date together with new files under the same partition.
I suspect it has something to do with runs that broke in the middle due to memory issues and left corrupted files that the next run did not delete, but I couldn't reproduce it yet.
Setting spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic") keeps your existing partitions and only overwrites the partitions present in the new data. If you want to overwrite all existing partitions under the path, unset this configuration (i.e. leave it at the default "static"). (I tested this in Spark version 2.4.4.)
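A minimal sketch of the two behaviours, assuming a dataframe df partitioned by a hypothetical column dt and a placeholder path:

# Dynamic: only the partitions present in df are replaced; other partitions are kept.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
df.write.format("parquet").mode("overwrite").partitionBy("dt").save(path)

# Static (the default): everything under the target path is deleted before the new data is written.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "static")
df.write.format("parquet").mode("overwrite").partitionBy("dt").save(path)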

Databricks Delta cache contains a stale footer and stale page entries Error

I have been getting intermittent notebook failures relating to querying a TEMPORARY VIEW that selects from a parquet file located on an ADLS Gen2 mount.
Delta cache contains a stale footer and stale page entries for the file dbfs:/mnt/container/folder/parquet.file, these will be removed (4 stale page cache entries). Fetched file stats (modificationTime: 1616064053000, fromCachedFile: false) do not match file stats of cached footer and entries (modificationTime: 1616063556000, fromCachedFile: true).
at com.databricks.sql.io.parquet.CachingParquetFileReader.checkForStaleness(CachingParquetFileReader.java:700)
at com.databricks.sql.io.parquet.CachingParquetFileReader.close(CachingParquetFileReader.java:511)
at org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.close(SpecificParquetRecordReaderBase.java:327)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.close(VectorizedParquetRecordReader.java:164)
at com.databricks.sql.io.parquet.DatabricksVectorizedParquetRecordReader.close(DatabricksVectorizedParquetRecordReader.java:484)
at org.apache.spark.sql.execution.datasources.RecordReaderIterator.close(RecordReaderIterator.scala:70)
at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:45)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.getNext(FileScanRDD.scala:291)
A Data Factory Copy Data activity is performed with Source (an MSSQL table) and Sink (a Parquet file) using snappy compression before the notebook command is executed. No other activities or pipelines write to this file. However, multiple notebooks perform selects against this same parquet file.
From what I can tell from the error message, the delta cache is older than the parquet file itself. Is there a way to turn off caching for this particular file (it is a very small dataset) or to invalidate the cache prior to the Copy Data activity? I am aware of the CLEAR CACHE command, but that applies to all tables and not specifically to temp views.
We have a similar process and we have been having the exact same problem.
If you need to invalidate the cache for a specific file/folder you can use something like the following Spark-SQL command:
REFRESH {file_path}
Where file_path is either the DBFS path or the path through your mount.
Worth noting: if you specify a folder instead of a file, all files within that folder (recursively) will be refreshed.
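For example, from a notebook this could be issued through spark.sql (the mount path below is a placeholder):

# Invalidate any cached footer/page entries for this path before querying it again.
spark.sql('REFRESH "dbfs:/mnt/container/folder"')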
This also may very well not solve your problem. It seems to have helped us, but that is more of a gut feeling as we have not been actively looking at the frequency of these failures.
The documentation.
Our Specs:
Azure
Databricks Runtime 7.4
Driver: Standard_L8s_v2
Workers: 24 Standard_L8s_v2

How can I drop database in hive without deleting database directory?

When I run the drop database command, Spark deletes the database directory and all of its subdirectories on HDFS. How can I avoid this?
Short answer:
Unless you set up your database so that it contains only external tables that exist outside of the database HDFS directory, there is no way to achieve this without copying all of your data to another location in HDFS.
Long answer:
From the following website:
https://www.oreilly.com/library/view/programming-hive/9781449326944/ch04.html
By default, Hive won't permit you to drop a database if it contains tables. You can either drop the tables first or append the CASCADE keyword to the command, which will cause Hive to drop the tables in the database first.
Using the RESTRICT keyword instead of CASCADE is equivalent to the default behaviour, where existing tables must be dropped before dropping the database.
When a database is dropped, its directory is also deleted.
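For example, from Spark (the database name is a placeholder):

# Drops the tables first (CASCADE) and then the database, removing its HDFS directory as well.
spark.sql("DROP DATABASE my_db CASCADE")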
You can copy the data to another location before dropping the database. I know it's a pain - but that's how Hive operates.
If you were trying to just drop a table without deleting the HDFS directory of the table, there's a solution for this described here: Can I change a table from internal to external in hive?
Dropping an external table preserves the HDFS location for the data.
Cascading the database drop to the tables after converting them to external will not fix this, because the database drop impacts the whole HDFS directory the database resides in. You would still need to copy the data to another location.
If you create a database from scratch, each table inside of which is external and references a location outside of the database HDFS directory, dropping this database would preserve the data. But if you have it set up so that the data is currently inside of the database HDFS directory, you will not have this functionality; it's something you would have to set up from scratch.
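A minimal sketch of that from-scratch setup, assuming Hive support is enabled and using made-up database, table and path names:

# The database directory holds no data; the table is EXTERNAL and its LOCATION
# points outside of the database directory.
spark.sql("CREATE DATABASE IF NOT EXISTS my_db LOCATION '/warehouse/my_db.db'")
spark.sql("""
    CREATE EXTERNAL TABLE my_db.events (id BIGINT, payload STRING)
    STORED AS PARQUET
    LOCATION '/data/external/events'
""")

# Dropping the database now removes only /warehouse/my_db.db; the data under
# /data/external/events is preserved.
spark.sql("DROP DATABASE my_db CASCADE")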
