My goal is to build a daily process that overwrites all partitions under a specific path in S3 with new data from a DataFrame.
I do -
df.write.format(source).mode("overwrite").save(path)
(Also tried the dynamic overwrite option).
However, in some runs the old data is not deleted: I see files from an old date sitting next to the new files under the same partition.
I suspect it has something to do with runs that failed partway through due to memory issues and left behind orphaned files that the next run did not delete, but I have not been able to reproduce it yet.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic") - with this option Spark keeps your existing partitions and overwrites only the partitions present in the incoming DataFrame. If you want to overwrite all existing partitions, unset this configuration (the default is "static"). (I tested this on Spark 2.4.4.)
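For illustration, a minimal PySpark sketch of the difference between the two modes; the bucket paths and the "dt" partition column are placeholders, not anything from the original job:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical daily batch of new data, partitioned by a "dt" column.
df = spark.read.parquet("s3://my-bucket/staging/today")

# Dynamic mode: only the partitions present in df are replaced;
# other partitions already under the target path are left untouched.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
df.write.partitionBy("dt").mode("overwrite").parquet("s3://my-bucket/target/")

# Static mode (the default): the whole target path is wiped and rewritten.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "static")
df.write.partitionBy("dt").mode("overwrite").parquet("s3://my-bucket/target/")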
When we run the VACUUM command, does it go through each parquet file and remove older versions of each record, or does it retain a parquet file even if only one record in it has the latest version? What about compaction? Is that any different?
Vacuum and Compaction go through the _delta_log/ folder in your Delta Lake Table and identify the files that are still being referenced.
Vacuum deletes all unreferenced files.
Compaction reads in the referenced files and writes your new partitions back to the table, unreferencing the existing files.
Think of a single version of a Delta Lake table as a set of parquet data files. Every version adds an entry (about files added and removed) to the transaction log (under _delta_log directory).
VACUUM
VACUUM lets you specify how many hours of history to retain (using the RETAIN number HOURS clause). That tells Delta Lake which versions are eligible for deletion (everything older than the retention window). Those versions are "translated" into a set of parquet files (remember that a single parquet file can stay referenced across several versions until it is removed).
This translation gives the files to be deleted.
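For example (a minimal sketch; /delta/events is a placeholder path, and 168 hours is just the default 7-day retention spelled out):

spark.sql("VACUUM delta.`/delta/events` RETAIN 168 HOURS")

# Or through the Python API, assuming the delta-spark package is available:
from delta.tables import DeltaTable
DeltaTable.forPath(spark, "/delta/events").vacuum(168)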
Compaction
Compaction is basically an optimization (usually triggered by the OPTIMIZE command, or done by hand with a combination of repartition, dataChange disabled, and overwrite).
It is nothing more than another version of the Delta table (but this time the data itself is not changed, so other concurrent transactions can happily commit).
The explanation about VACUUM above applies here.
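A rough sketch of the manual compaction pattern mentioned above (repartition + dataChange disabled + overwrite); the path and target file count are placeholders:

path = "/delta/events"   # placeholder Delta table path
num_files = 16           # target number of files after compaction

(spark.read
      .format("delta")
      .load(path)
      .repartition(num_files)
      .write
      .option("dataChange", "false")   # data is unchanged, so concurrent readers/writers are not broken
      .format("delta")
      .mode("overwrite")
      .save(path))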
I have been getting intermittent notebook failures related to querying a TEMPORARY VIEW that selects from a parquet file located on an ADLS Gen2 mount.
Delta cache contains a stale footer and stale page entries for the file dbfs:/mnt/container/folder/parquet.file, these will be removed (4 stale page cache entries). Fetched file stats (modificationTime: 1616064053000, fromCachedFile: false) do not match file stats of cached footer and entries (modificationTime: 1616063556000, fromCachedFile: true).
at com.databricks.sql.io.parquet.CachingParquetFileReader.checkForStaleness(CachingParquetFileReader.java:700)
at com.databricks.sql.io.parquet.CachingParquetFileReader.close(CachingParquetFileReader.java:511)
at org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.close(SpecificParquetRecordReaderBase.java:327)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.close(VectorizedParquetRecordReader.java:164)
at com.databricks.sql.io.parquet.DatabricksVectorizedParquetRecordReader.close(DatabricksVectorizedParquetRecordReader.java:484)
at org.apache.spark.sql.execution.datasources.RecordReaderIterator.close(RecordReaderIterator.scala:70)
at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:45)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.getNext(FileScanRDD.scala:291)
A Data Factory Copy Data activity runs before the notebook command executes, with an MSSQL table as Source and a snappy-compressed Parquet file as Sink. No other activities or pipelines write to this file; however, multiple notebooks run selects against this same parquet file.
From what I can tell from the error message, the delta cache is older than the parquet file itself. Is there a way to turn off caching for this particular file (it is a very small dataset), or to invalidate the cache prior to the Copy Data activity? I am aware of the CLEAR CACHE command, but that clears the cache for all tables rather than a specific temp view.
We have a similar process and we have been having the exact same problem.
If you need to invalidate the cache for a specific file/folder you can use something like the following Spark-SQL command:
REFRESH {file_path}
where file_path is either a dbfs:/ path or the path through your mount.
Worth noting: if you specify a folder instead of a file, all files within that folder are refreshed recursively.
This may very well not solve your problem. It seems to have helped us, but that is more of a gut feeling, as we have not been actively tracking the frequency of these failures.
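For reference, a small sketch of how this can be called from a notebook before reading; the mount path is taken from the error message above and is otherwise a placeholder:

mount_path = "dbfs:/mnt/container/folder"

# Spark SQL REFRESH on a path-based resource:
spark.sql(f'REFRESH "{mount_path}"')

# Equivalent catalog API call:
spark.catalog.refreshByPath(mount_path)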
See the documentation on the REFRESH command for details.
Our Specs:
Azure
Databricks Runtime 7.4
Driver: Standard_L8s_v2
Workers: 24 Standard_L8s_v2
I have an Impala table backed by parquet files which is used by another team.
Every day I run a batch Spark job that overwrites the existing parquet files (it creates a new data set: the existing files are deleted and new files are created).
Our Spark code look like this
dataset.write.format("parquet").mode("overwrite").save(path)
During this update (overwriting the parquet data files and then running REFRESH on the Impala table), anyone who accesses the table gets an error saying the underlying data files are not there.
Is there any solution or workaround for this issue? I do not want other teams to see that error at any point when they access the table.
Maybe I can write the new data files to a different location and then make the Impala table point to that location?
The behaviour you are seeing is due to the way Impala is designed to work. Impala fetches the metadata of the table (table structure, partition details, HDFS file paths) from HMS, and the block details of the corresponding HDFS files from the NameNode. All of these details are fetched by the Catalog and distributed across the Impala daemons for their execution.
When the table's underlying files are removed and new files are written outside Impala, it is necessary to perform a REFRESH so that the new file details (files and corresponding block details) are fetched and distributed across the daemons. This way Impala becomes aware of the newly written files.
Since you are overwriting the files, Impala queries fail to find the files they are aware of: those files have already been removed while the new files are still being written. This is expected.
As a solution, you can perform one of the below,
Append the new files to the table's HDFS path instead of overwriting it. Impala queries against the table will still return results, although only the older data (because Impala is not yet aware of the new files), and the error you mentioned is avoided while the write is in progress. Once the new files have been created in the table's directories, you can perform an HDFS operation to remove the old files, followed by an Impala REFRESH statement for this table.
OR
As you said, you can write the new parquet files to a different HDFS path and, once the write is complete, either [remove the old files, move the new files into the table's actual HDFS path, and run a REFRESH] OR [issue an ALTER statement against the table to point its location at the new directory]; a rough sketch of the second variant follows below. If it is a daily process, you would probably implement this as a script that runs after Spark's write succeeds, taking the new and old directories as arguments.
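A rough sketch of the second option, reusing the dataset variable from your job; the paths, database and table names are placeholders, and the Impala statements would go through impala-shell or whichever client you use:

from datetime import date

new_path = f"hdfs:///data/my_table/{date.today():%Y%m%d}"   # fresh dated directory

# 1. Write the new snapshot to a fresh directory; the table keeps serving the old one.
dataset.write.format("parquet").mode("overwrite").save(new_path)

# 2. Point the table at the new directory and refresh its metadata, e.g. in impala-shell:
#      ALTER TABLE my_db.my_table SET LOCATION 'hdfs:///data/my_table/<today>';
#      REFRESH my_db.my_table;
# 3. Remove yesterday's directory once no running queries reference it.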
Hope this helps!
As the data in case of Cassandra is physically removed during compaction, is it possible to access the recently deleted data in any way? I'm looking for something similar to Oracle Flashback feature (AS OF TIMESTAMP).
Also, I can see the pieces of deleted data in the relevant commit log file, however it's obviously unreadable. Is it possible to convert this file to a more readable format?
You will want to execute a restore from your commitlog.
The safest approach is to copy the commitlog to a new cluster (with the same schema) and restore following the instructions (comments) in the commitlog_archiving.properties file. In your case, you will want to set restore_point_in_time to a time between your insert and your delete.
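For reference, the relevant entries in commitlog_archiving.properties look roughly like this (the restore directory is a placeholder; the point-in-time value uses the yyyy:MM:dd HH:mm:ss format described in the file's comments):

restore_command=cp -f %from %to
restore_directories=/path/to/archived_commitlogs
restore_point_in_time=2015:07:01 12:00:00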
Looking in my keyspace directory I see several versions of most of my tables. I am assuming this is because I dropped them at some point and recreated them as I was refining the schema.
table1-b3441432142142sdf02328914104803190
table1-ba234143018dssd810412asdfsf2498041
These generated table directory names are very cumbersome to work with. Try changing into one of these directories without copy-pasting the name from the terminal window... painful, and so easy to mistype something.
That side note aside, how do I tell which directory is the most current version of the table? Can I automatically delete the old versions? It is not clear to me whether these are considered snapshots, since each directory can also contain snapshots. I read in another post that you can turn off auto snapshot, but I'm not sure I want that. I'd rather just automatically delete any table directories not currently in use (i.e. those that are not the latest version).
I stumbled across this while trying to do a backup. I realized I am forced to go into every table directory and copy out the snapshot files (there are about 50 directories, not counting all the old table versions), which seems like a terrible design (maybe I'm missing something?).
I assumed I could do a snapshot of the whole keyspace and get one file back or at least output all the files to a single directory that represents the snapshot of the entire keyspace. At the very least it would be nice knowing what the current versions are so I can grab the correct files and offload them to storage somewhere.
DataStax Enterprise has a backup feature but it only supports AWS and I am using Azure.
So to clarify:
How do I automatically delete old table versions and know which one is the current version?
How can I backup the most recent versions of the tables and output the files to a single directory that I can offload somewhere? I only have two nodes, so simply relying on the repair is not a good option for me if a node goes down.
You can see the active version of a table by looking in the system keyspace and checking the cf_id field. For example, to see the version for a table in the 'test' keyspace with table name 'temp', you could do this:
cqlsh> SELECT cf_id FROM system.schema_columnfamilies WHERE keyspace_name='test' AND columnfamily_name='temp' allow filtering;
cf_id
--------------------------------------
d8ea9830-20e9-11e5-afc0-c381f961c62a
As far as I know, it is safe to delete (rm -r) outdated table version directories that are no longer active. I imagine they don't delete them automatically so that you can recover the data if you dropped them by mistake. I don't know of a way to have them removed automatically even if auto snapshot is disabled.
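If you want to script this, here is a dry-run sketch (it only prints, it does not delete) assuming a Cassandra 2.x cluster where the schema lives in system.schema_columnfamilies and the DataStax cassandra-driver is installed; the keyspace, table and data directory are placeholders:

import os
from cassandra.cluster import Cluster

keyspace, table = "test", "temp"
data_dir = "/var/lib/cassandra/data"

session = Cluster(["127.0.0.1"]).connect()
row = session.execute(
    "SELECT cf_id FROM system.schema_columnfamilies "
    "WHERE keyspace_name=%s AND columnfamily_name=%s ALLOW FILTERING",
    (keyspace, table),
).one()

# Table directories are named <table>-<cf_id with the dashes stripped>.
active_suffix = str(row.cf_id).replace("-", "")
for d in os.listdir(os.path.join(data_dir, keyspace)):
    if d.startswith(table + "-"):
        status = "ACTIVE" if d.endswith(active_suffix) else "stale, candidate for rm -r"
        print(d, "->", status)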
I don't think there is a command to write all the snapshot files to a single directory. According to the documentation on snapshot, "After the snapshot is complete, you can move the backup files to another location if needed, or you can leave them in place." So it's left up to the application developer how they want to handle archiving the snapshot files.