Spark Delta table restore to version - apache-spark

I am trying to restore a Delta table to a previous version via Spark with Java, running from a local IDE. The code is below:
import io.delta.tables.*;
DeltaTable deltaTable = DeltaTable.forPath(spark, <path-to-table>);
// or: DeltaTable deltaTable = DeltaTable.forName(spark, <table-name>);
deltaTable.restoreToVersion(0);              // restore table to oldest version
deltaTable.restoreToTimestamp("2019-02-14"); // restore to a specific timestamp
As per the Databricks documentation, the methods shown here are not available in delta-core version 0.8.0. They are also not in the API docs.
Are they only available in the Databricks runtime?
Currently I have to load the previous version and rewrite the DataFrame with Delta. Is there a better way to do it?

Delta Lake 0.8 does not have the restoreToVersion and restoreToTimestamp methods. There is no trace of such methods in open-source Delta Lake 0.8, as you can check in the delta-lake repository.
So currently, and as far as I know, there is no option other than rewriting from a previous version, as explained in the answers to this question.
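For the record, a minimal Scala sketch of that rewrite workaround on 0.8 (the path is a placeholder): read the snapshot you want back with versionAsOf, then overwrite the table with it.
val tablePath = "/tmp/delta-table"   // placeholder path

// Pin the snapshot of the table as of the version you want to roll back to
val oldSnapshot = spark.read.format("delta")
  .option("versionAsOf", 0)
  .load(tablePath)

// Overwriting creates a new commit whose content matches version 0;
// earlier versions stay in the transaction log until vacuumed
oldSnapshot.write.format("delta")
  .mode("overwrite")
  .save(tablePath)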
EDIT
As commented by boyangeor, restoreToVersion and restoreToTimestamp are now available in Delta Lake from version 1.2.
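On Delta Lake 1.2+ (with a compatible Spark runtime), the restore can then be done directly; a minimal Scala sketch, with a placeholder path:
import io.delta.tables.DeltaTable

val deltaTable = DeltaTable.forPath(spark, "/tmp/delta-table")   // placeholder path

deltaTable.restoreToVersion(0)                 // roll the table back to version 0
// deltaTable.restoreToTimestamp("2019-02-14") // or roll back to a timestamp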

Rolling back by reading a previous version ("versionAsOf") works very much like this:
In Python:
delta_table_path = "/tmp/delta-table"
df = spark.read.format("delta").option("versionAsOf", 0).load(delta_table_path)
df.show()
In Java:
String delta_table_path = "/tmp/delta-table";
Dataset<Row> df = spark.read().format("delta").option("versionAsOf", 0).load(delta_table_path);
df.show();
In Scala:
val delta_table_path = "/tmp/delta-table"
val df = spark.read.format("delta").option("versionAsOf", 0).load(delta_table_path)
df.show()

Related

What changes are required when moving simple synapsesql implementation from Spark 2.4.8 to Spark 3.1.2?

I have a simple implementation of the .write.synapsesql() method (code shown below) that works in Spark 2.4.8 but not in Spark 3.1.2 (documentation/example here). The data in use is a simple notebook-created foobar-type table. Searching online for key phrases from and about the error did not turn up any new information for me.
What is the cause of the error in 3.1.2?
Spark 2.4.8 version (behaves as desired):
val df = spark.sql("SELECT * FROM TEST_TABLE")
df.write.synapsesql("my_local_db_name.schema_name.test_table", Constants.INTERNAL, None)
Spark 3.1.2 version (the extra callback argument is the same as in the documentation; it can also be left out with a similar result):
val df = spark.sql("SELECT * FROM TEST_TABLE")
df.write.synapsesql("my_local_db_name.schema_name.test_table", Constants.INTERNAL, None,
  Some(callBackFunctionToReceivePostWriteMetrics))
The resulting error (only in 3.1.2) is:
WriteFailureCause -> java.lang.IllegalArgumentException: Failed to derive `https` scheme based staging location URL for SQL COPY-INTO}
As the documentation linked in the question states, ensure that you are setting the options correctly with:
val writeOptionsWithAADAuth: Map[String, String] = Map(
  Constants.SERVER -> "<dedicated-pool-sql-server-name>.sql.azuresynapse.net",
  Constants.TEMP_FOLDER -> "abfss://<storage_container_name>@<storage_account_name>.dfs.core.windows.net/<some_temp_folder>")
and including the options in your .write statement like so:
df.write.options(writeOptionsWithAADAuth).synapsesql(...)
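Putting the two snippets together, a rough end-to-end sketch could look like the following. The import paths are taken from the connector documentation linked in the question and are an assumption here, and the server, container, and account names are placeholders to adapt to your environment.
// Imports as shown in the Synapse dedicated SQL pool connector docs (verify against your runtime)
import org.apache.spark.sql.SqlAnalyticsConnector._
import com.microsoft.spark.sqlanalytics.utils.Constants

val df = spark.sql("SELECT * FROM TEST_TABLE")

// Placeholders: dedicated pool server, ADLS Gen2 container and storage account
val writeOptionsWithAADAuth: Map[String, String] = Map(
  Constants.SERVER      -> "<dedicated-pool-sql-server-name>.sql.azuresynapse.net",
  Constants.TEMP_FOLDER -> "abfss://<storage_container_name>@<storage_account_name>.dfs.core.windows.net/<some_temp_folder>")

// Passing the options lets the connector derive the https staging location it complained about
df.write
  .options(writeOptionsWithAADAuth)
  .synapsesql("my_local_db_name.schema_name.test_table", Constants.INTERNAL, None)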

Spark "modifiedBefore" option while reading data from files

I am using Spark 2.4 to read files from Hadoop.
The requirement is to read the files whose modification time is before some provided value.
I came across the Spark documentation that mentions the modifiedBefore option (see the Spark doc on Modification Time Path Filters), but I am not sure whether it is available in Spark 2.4. If not, how can I achieve this?
The options modifiedBefore and modifiedAfter are available only since Spark 3.0, and only for batch, not streaming. For Spark 2.4, you can use the Hadoop FileSystem globStatus method and filter files using getModificationTime.
Here is an example of a function that takes a path and a threshold, and returns the list of file paths filtered using the threshold:
import org.apache.hadoop.fs.Path

def getFilesModifiedBefore(path: Path, modifiedBefore: String) = {
  val format = new java.text.SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss")
  val thresholdTime = format.parse(modifiedBefore).getTime()
  // `sc` is the SparkContext (available as-is in spark-shell; otherwise use spark.sparkContext)
  val files = path.getFileSystem(sc.hadoopConfiguration).globStatus(path)
  files.filter(_.getModificationTime < thresholdTime).map(_.getPath.toString)
}
Then use it with spark.read.csv:
val df = spark.read.csv(getFilesModifiedBefore(new Path("/mypath"), "2021-03-17T10:46:12"):_*)

Spark SQL on ORC files doesn't return correct Schema (Column names)

I have a directory containing ORC files. I am creating a DataFrame using the code below:
var data = sqlContext.sql("SELECT * FROM orc.`/directory/containing/orc/files`");
It returns a data frame with this schema:
[_col0: int, _col1: bigint]
whereas the expected schema is:
[scan_nbr: int, visit_nbr: bigint]
When I query files in Parquet format I get the correct schema.
Am I missing any configuration(s)?
Adding more details
This is Hortonworks Distribution HDP 2.4.2 (Spark 1.6.1, Hadoop 2.7.1, Hive 1.2.1)
We haven't changed the default configurations of HDP, but this is definitely not the same as the plain vanilla version of Hadoop.
Data is written by upstream Hive jobs, a simple CTAS (CREATE TABLE sample STORED AS ORC as SELECT ...).
I tested this on files generated by CTAS with the latest Hive 2.0.0, and it preserves the column names in the ORC files.
The problem is the Hive version: 1.2.1 has the bug HIVE-4243, which was fixed in 2.0.0.
Setting
sqlContext.setConf('spark.sql.hive.convertMetastoreOrc', 'false')
fixes this.
If you have the Parquet version of the table as well, you can just copy the column names over, which is what I did (also, the date column was the partition key for ORC, so I had to move it to the end):
import functools

tx = sqlContext.table("tx_parquet")
df = sqlContext.table("tx_orc")

tx_cols = tx.schema.names
tx_cols.remove('started_at_date')
tx_cols.append('started_at_date')  # move it to the end

# fix column names for orc
oldColumns = df.schema.names
newColumns = tx_cols
df = functools.reduce(
    lambda df, idx: df.withColumnRenamed(oldColumns[idx], newColumns[idx]),
    range(len(oldColumns)),
    df)
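If you prefer to stay in Scala, a minimal sketch of the same idea (reusing the hypothetical tx_parquet / tx_orc table names from above) is to reapply the Parquet column names with toDF; this assumes both tables expose the same number of columns in the same order:
// Hypothetical table names, mirroring the Python snippet above
val tx = sqlContext.table("tx_parquet")
val txOrc = sqlContext.table("tx_orc")

// Reapply the Parquet column names to the ORC DataFrame in one call
// (column counts and positions must line up, including any partition column)
val fixed = txOrc.toDF(tx.schema.names: _*)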
We can use:
val df = hiveContext.read.table("tableName")
df.schema or df.columns will then give the actual column names.
If a version upgrade is not an option, a quick fix could be to rewrite the ORC files using Pig. That seems to work just fine.

Cannot create Spark Phoenix DataFrames

I am trying to load data from Apache Phoenix into a Spark DataFrame.
I have been able to successfully create an RDD with the following code:
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.phoenix.spark._   // provides the phoenixTableAsRDD implicit

val sc = new SparkContext("local", "phoenix-test")
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

val foo: RDD[Map[String, AnyRef]] = sc.phoenixTableAsRDD(
  table = "FOO",
  columns = Seq("ID", "MESSAGE_EPOCH", "MESSAGE_VALUE"),
  zkUrl = Some("<zk-ip-address>:2181:/hbase-unsecure"))

foo.collect().foreach(x => println(x))
However I have not been so lucky trying to create a DataFrame. My current attempt is:
val sc = new SparkContext("local", "phoenix-test")
val sqlContext = new SQLContext(sc)

val df = sqlContext.phoenixTableAsDataFrame(
  table = "FOO",
  columns = Seq("ID", "MESSAGE_EPOCH", "MESSAGE_VALUE"),
  zkUrl = Some("<zk-ip-address>:2181:/hbase-unsecure"))

df.select(df("ID")).show
Unfortunately the above code results in a ClassCastException:
java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericMutableRow cannot be cast to org.apache.spark.sql.Row
I am still very new to Spark. If anyone can help, it would be very much appreciated!
Although you haven't mentioned your Spark version or the details of the exception...
Please see PHOENIX-2287, which is fixed and says:
Environment: HBase 1.1.1 running in standalone mode on OS X, Spark 1.5.0, Phoenix 4.5.2
Josh Mahonin added a comment - 23/Sep/15 17:56: Updated patch adds support for Spark 1.5.0, and is backwards compatible back down to 1.3.0 (manually tested, Spark version profiles may be worth looking at in the future). In 1.5.0, they've gone and explicitly hidden the GenericMutableRow data structure. Fortunately, we are able to use the external-facing 'Row' data type, which is backwards compatible, and should remain compatible in future releases as well. As part of the update, Spark SQL deprecated a constructor on their 'DecimalType'. In updating this, I exposed a new issue, which is that we don't carry forward the precision and scale of the underlying Decimal type through to Spark. For now I've set it to use the Spark defaults, but I'll create another issue for that specifically. I've included an ignored integration test in this patch as well.
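As a side note, once you are on a Phoenix release carrying that fix, the phoenix-spark DataSource API is another way to build the DataFrame. A minimal sketch reusing the placeholders from the question (it will hit the same incompatibility on older Phoenix/Spark combinations):
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

val sc = new SparkContext("local", "phoenix-test")
val sqlContext = new SQLContext(sc)

// Load the Phoenix table through the DataSource API instead of the implicit helpers
val df = sqlContext.read
  .format("org.apache.phoenix.spark")
  .option("table", "FOO")
  .option("zkUrl", "<zk-ip-address>:2181:/hbase-unsecure")
  .load()

df.select("ID").show()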
