I have a use case where the table columns change (columns are added or deleted) at each refresh; the refresh is currently weekly. The table is stored in Delta format. Is there any way to track the versions of these column additions and deletions, like a kind of metastore?
Is there anywhere I can find such information in the Delta table or the Delta file format?
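One place to look is the table's transaction log: each commit file under _delta_log is a sequence of JSON lines, and a commit that sets or changes the schema carries a metaData action whose schemaString field holds the schema as JSON. The following is only a minimal sketch of diffing those schemas across versions. It assumes the commit files have already been read into (version, lines) pairs; the function names are hypothetical, and checkpoint files are not handled.

```python
import json


def schema_columns(commit_lines):
    """Return the column names from a commit's metaData action, or None if it has none."""
    for line in commit_lines:
        action = json.loads(line)
        if "metaData" in action:
            # schemaString is itself a JSON document describing a struct type.
            schema = json.loads(action["metaData"]["schemaString"])
            return [field["name"] for field in schema["fields"]]
    return None


def schema_changes(commits):
    """commits: iterable of (version, lines). Return [(version, added, removed), ...]."""
    changes = []
    previous = None
    for version, lines in sorted(commits, key=lambda c: c[0]):
        columns = schema_columns(lines)
        if columns is None:
            continue  # this commit did not touch the table metadata
        if previous is not None:
            added = sorted(set(columns) - set(previous))
            removed = sorted(set(previous) - set(columns))
            if added or removed:
                changes.append((version, added, removed))
        previous = columns
    return changes
```

Each returned tuple names the table version at which columns appeared or disappeared, which is the kind of column history asked about here.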
I'm trying to ingest historical data into a data catalog using Apache Hudi upsert. As the data is years and months old, I wanted to iterate each month, adding the historical date as a column to be queryable.
The problem is that incremental queries in Hudi take _hoodie_commit_time as the reference, and that commit time does not reflect the historical dates.
Is there a way to query Hudi tables using this custom date column as "instant" reference for incremental queries, maybe by adding this column data to the table metadata?
I have a data flow that reads from Parquet files, does some filtering, and then loads into a Delta Lake. The data flow will run multiple times, and I don't want duplicate data in my Delta Lake. To safeguard against this, I thought I would implement a delete-insert mechanism: find the minimum and maximum date of the incoming data and delete all the data in the destination (Delta) that falls within this range. Once that is deleted, all the filtered incoming data is inserted into the Delta Lake.
From the documentation, I saw that I need to add row-level policies in an Alter Row Tx to mark a particular row for deletion. I added this Delete-If condition: between(toDate(date, 'MM/dd/yyyy'), toDate("2021-12-22T01:49:57", 'MM/dd/yyyy'), toDate("2021-12-23T01:49:57", 'MM/dd/yyyy')) where date is a column in the incoming data.
However, in data preview of Alter Row Tx, all the rows are being marked for insertion and 0 for deletion when there definitely are records that belong to that range.
I suspect that the Delete-If condition does not work the way I want it to. In that case, how do I implement deletion over a date range in a data flow with Delta as the destination?
You need to tell ADF what to do with the other portions of the timestamp (it's not a date type yet). Try this:
toString(toTimestamp('2021-12-22T01:49:57', "yyyy-MM-dd'T'HH:mm:ss"), 'MM/dd/yyyy')
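The underlying issue, a format pattern that does not match the whole input string, can be illustrated outside ADF. The following is a Python analogue only, not ADF expression syntax: the function names and the row layout are made up for illustration. It parses the range bounds with a pattern that covers the full timestamp, keeps the date part, and then marks the rows that fall in the range.

```python
from datetime import datetime


def parse_bound(timestamp_text):
    """Parse a full ISO-style timestamp, then keep only its date part."""
    return datetime.strptime(timestamp_text, "%Y-%m-%dT%H:%M:%S").date()


def parse_row_date(date_text):
    """Parse a row's date column, assumed here to be stored as MM/dd/yyyy text."""
    return datetime.strptime(date_text, "%m/%d/%Y").date()


def mark_for_delete(rows, start_timestamp, end_timestamp):
    """Return one boolean per row: True if the row's date lies in the inclusive range."""
    low = parse_bound(start_timestamp)
    high = parse_bound(end_timestamp)
    return [low <= parse_row_date(row["date"]) <= high for row in rows]


def parses_with(pattern, text):
    """Report whether text can be parsed with the given strptime pattern."""
    try:
        datetime.strptime(text, pattern)
        return True
    except ValueError:
        return False
```

Here parses_with("%m/%d/%Y", "2021-12-22T01:49:57") is False, which mirrors the original condition: the bounds never become valid dates, so no row can match the range.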
The log of a Delta table stores metadata about the transactions and statistics about the data (data type, min, max, number of columns, etc.). However, I can only see the data types when looking into the JSON file of this log. Does anyone know how to obtain the min, max, and number of columns of this Delta table without computing anything (since the Delta table should already have this information when reading the file)?
This depends on whether you are using the open source version or the Databricks version. The former doesn't have this functionality; it exists only in the Databricks version.
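Where the writer does record statistics, they appear in the log as a stats field on each add action: a JSON string with entries such as numRecords, minValues, and maxValues. The sketch below only shows how such entries could be read without scanning the data. It assumes the commit file's lines are already in memory, the function names are hypothetical, and it ignores remove actions and checkpoints, so it is not a faithful reconstruction of the current table state.

```python
import json


def file_stats(commit_lines):
    """Collect the parsed stats of every add action that carries a stats string."""
    collected = []
    for line in commit_lines:
        add = json.loads(line).get("add")
        if add and add.get("stats"):
            # stats is stored as a JSON string nested inside the action.
            collected.append(json.loads(add["stats"]))
    return collected


def column_min_max(stats_list, column):
    """Combine per-file min/max values for one column; (None, None) if absent."""
    minimums = [s["minValues"][column] for s in stats_list if column in s.get("minValues", {})]
    maximums = [s["maxValues"][column] for s in stats_list if column in s.get("maxValues", {})]
    return (min(minimums) if minimums else None, max(maximums) if maximums else None)


def total_records(stats_list):
    """Sum numRecords over the files that report it."""
    return sum(s.get("numRecords", 0) for s in stats_list)
```

If the add actions in your log have no stats field, these functions return empty results, which matches the situation described in the answer above.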
I have a source, say a SQL database or an Oracle database, and I want to pull the table data into an Azure SQL database. The problem is that the table has neither a date column recording when rows are inserted nor a primary key column. Is there any other way to perform this operation?
One way of doing it semi-incrementally is to partition the table by a fairly stable column in the source table; you can then use a mapping data flow to compare the partitions (this can be done with row counts, aggregations, hashbytes, etc.). On each load, you store the comparison output in the partition metadata somewhere so that you can compare against it the next time you load. That way you reload only the partitions that have changed since your last load.
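The comparison step can be sketched in a few lines. This is a plain Python illustration of the idea rather than a mapping data flow: rows are dictionaries, each partition gets a fingerprint made of its row count and a SHA-256 digest over its sorted rows, and the fingerprints from the previous load are compared with the current ones. The function names and the choice of SHA-256 are assumptions made for the example.

```python
import hashlib


def partition_fingerprints(rows, partition_key):
    """Map each partition value to (row_count, digest of its rows)."""
    groups = {}
    for row in rows:
        groups.setdefault(row[partition_key], []).append(row)

    fingerprints = {}
    for partition, partition_rows in groups.items():
        digest = hashlib.sha256()
        # Sort the rows so the digest does not depend on read order.
        for text in sorted(repr(sorted(r.items())) for r in partition_rows):
            digest.update(text.encode("utf-8"))
        fingerprints[partition] = (len(partition_rows), digest.hexdigest())
    return fingerprints


def changed_partitions(previous, current):
    """Partitions that are new, missing, or whose fingerprint differs."""
    every_partition = set(previous) | set(current)
    return sorted(p for p in every_partition if previous.get(p) != current.get(p))
```

Storing the output of partition_fingerprints after each load and passing it as previous on the next run yields the list of partitions to reload.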
I have a requirement I want to meet. I need to sqoop over data from a DB to Hive. I am sqooping on a daily basis since this data is updated daily.
This data will be used as lookup data from a spark consumer for enrichment. We want to keep a history of all the data we have received but we don't need all the data for lookup only the latest data (same day). I was thinking of creating a hive view from the historical table and only showing records that were inserted that day. Is there a way to automate the view on a daily basis so that the view query will always have the latest data?
Q: Is there a way to automate the view on a daily basis so that the
view query will always have the latest data?
There is no need to update or automate anything if you use a table partitioned by date.
Q: We want to keep a history of all the data we have received but we
don't need all the data for lookup only the latest data (same day).
NOTE: Whether you use a Hive view or a Hive table, you should always avoid a full table scan when fetching the latest partition's data.
Option 1: hive approach to query data
If you want to adopt the Hive approach, you need a partitioned table in Hive with a partition column, for example partition_date:
select * from yourpartitionedTable where partition_date in
(select max(partition_date) from yourpartitionedTable)
or
select * from (select *,dense_rank() over (order by partition_date desc) dt_rnk from db.yourpartitionedTable ) myview
where myview.dt_rnk=1
Either query always returns the data of the latest partition. If a partition for today's date exists, you get today's data; otherwise you get the data for the maximum partition_date.
Option 2: Plain spark approach to query data
Use Spark's show partitions command, i.e. spark.sql(s"show Partitions $yourpartitionedtablename"), collect the result into an array, and sort it to get the latest partition date. With that value you can query only the latest partition as lookup data from your Spark component.
see my answer as an idea for getting latest partition date.
I prefer option 2: it needs no Hive query and no full table scan, because the show partitions command only reads partition metadata, so it avoids that performance bottleneck and is fast.
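The partition-picking step of option 2 can be sketched as follows. This is a minimal Python illustration, assuming the partition specs arrive as strings of the form partition_date=2019-08-06 (the layout show partitions prints) and that the dates are ISO formatted, so that string order equals date order. The function name is hypothetical.

```python
def latest_partition(partition_specs, column="partition_date"):
    """Return the largest value of `column` found in the partition specs, or None."""
    values = []
    for spec in partition_specs:
        # A spec may hold several levels, e.g. "partition_date=2019-08-06/country=IN".
        for part in spec.split("/"):
            name, _, value = part.partition("=")
            if name == column:
                values.append(value)
    return max(values) if values else None


# With PySpark, the specs could be collected like this (not run here):
#   specs = [row[0] for row in spark.sql("SHOW PARTITIONS db.yourpartitionedTable").collect()]
#   latest = latest_partition(specs)
```

The returned value can then go into a filter such as partition_date = '<latest>' so that only that partition is read.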
One more different idea is querying with HiveMetastoreClient or with option2... see this and my answer and the other
I am assuming that you load daily transaction records into your history table with some last-modified date. Every time you insert or update a record in the history table, its last_modified_date column is updated. It could be a date or a timestamp.
You can create a view in Hive that fetches the latest data using an analytic function.
Here's some sample data:
CREATE TABLE IF NOT EXISTS db.test_data
(
user_id int
,country string
,last_modified_date date
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS orc
;
I insert a few sample records. Note that the same user_id has multiple records for different dates.
INSERT INTO TABLE db.test_data VALUES
(1,'India','2019-08-06'),
(2,'Ukraine','2019-08-06'),
(1,'India','2019-08-05'),
(2,'Ukraine','2019-08-05'),
(1,'India','2019-08-04'),
(2,'Ukraine','2019-08-04');
Creating the view in Hive:
CREATE VIEW db.test_view AS
select user_id, country, last_modified_date
from ( select user_id, country, last_modified_date,
max(last_modified_date) over (partition by user_id) as max_modified
from db.test_data ) as sub
where last_modified_date = max_modified
;
hive> select * from db.test_view;
1 India 2019-08-06
2 Ukraine 2019-08-06
Time taken: 5.297 seconds, Fetched: 2 row(s)
The view returns only the row with the latest date for each user.
If you then insert another record with a newer last-modified date:
hive> INSERT INTO TABLE db.test_data VALUES
> (1,'India','2019-08-07');
hive> select * from db.test_view;
1 India 2019-08-07
2 Ukraine 2019-08-06
For reference: Hive View manual