Recently I created a Delta table with a partition column defined as a generated column.
Below are my two observations. I need your guidance.
1st Approach:
run_date DATE GENERATED ALWAYS AS (CAST(date_trunc('month', run_timestamp) AS DATE))
2nd Approach:
run_date DATE GENERATED ALWAYS AS (CAST(run_timestamp AS DATE))
When I create the partition column using the first approach, partition skipping does not work, but it does work with the second approach.
The first approach meets our requirement. According to the Databricks documentation, the value of a generated column can be computed by a user-specified function over other columns in the Delta table.
But I do not understand why partition skipping is not happening in our case. Am I missing some configuration needed to get the expected outcome?
Can you please guide me?
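For reference, here is a rough sketch of how such a table is created and queried from a Databricks notebook (the table name, the extra payload column, and the literal date in the filter are illustrative only; the generated-column expression shown is the second approach from above):

```python
# Sketch only: "events" and "payload" are illustrative names; run_timestamp
# and run_date match the question. The generated expression is the 2nd approach.
spark.sql("""
    CREATE TABLE IF NOT EXISTS events (
        run_timestamp TIMESTAMP,
        payload STRING,
        run_date DATE GENERATED ALWAYS AS (CAST(run_timestamp AS DATE))
    )
    USING DELTA
    PARTITIONED BY (run_date)
""")

# Filtering on the base column: with the 2nd approach Delta derives a
# partition filter on run_date and skips partitions; with the 1st approach
# (date_trunc + cast) it does not in our tests.
spark.sql(
    "SELECT * FROM events WHERE run_timestamp >= '2023-01-01'"
).explain(True)
```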
I am replicating my data from Azure SQL DB to Azure SQL DB. Some tables have date columns, and some have only an ID column that serves as the primary key. While performing an incremental load in ADF, I can select the date as the watermark column for the tables that have a date column, and the ID as the watermark column for the tables that only have an ID column. The issue is that my ID holds GUID values, so can I take that as my watermark column? If yes, the copy activity gives me the following error in ADF (please see the attached image for reference).
How can I overcome this issue? Help is appreciated.
Thank you
Gp
I have tried the dynamic mapping approach from https://martinschoombee.com/2022/03/22/dynamic-column-mapping-in-azure-data-factory/ but it does not work; it still gives me the same error.
Regarding your question about the watermark:
A watermark is a column that holds the last-updated timestamp or an ever-incrementing key.
So a GUID column would not be a good fit.
Try to find a date column, or an ever-incrementing integer identity, to use as the watermark.
Since your source is SQL Server, you can also use change data capture.
Links:
Incremental loading in ADF
Change data capture
Regards,
Chen
The watermark logic takes advantage of the fact that only the new records inserted after the last saved watermark need to be considered for copying from source A to B; basically we are using the ">=" operator to our advantage here.
In the case of a GUID you cannot use that logic: a GUID is certainly unique, but ">=" or "<=" comparisons will not work.
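To illustrate the pattern (a sketch only; the table name, column name, and literal watermark value are assumptions, not taken from the question):

```python
# Sketch of the high-watermark pattern used for incremental copies.
last_watermark = "2024-01-01T00:00:00"  # in practice, read from a watermark table

# Only rows changed after the last saved watermark are copied. This relies on
# the watermark column being totally ordered (a timestamp or identity is,
# a GUID is not), because the filter uses ">=".
incremental_query = (
    "SELECT * FROM dbo.Orders "
    f"WHERE LastModifiedDate >= '{last_watermark}'"
)
print(incremental_query)
```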
I have created a data flow in Data Factory.
Step 1. Read the parquet file.
Step 2. Aggregate the file to get the Max(DateField)
Step 3. Use a derived column to write in a Value.
Step 4. Alter row task with Value and the DateField.
Step 5. Sink select the Watermark table to update.
The flow updates the value, but it isn't putting in the max value. The date value is incorrect. Any ideas?
The max() aggregate function doesn't work on date/string types; you must pass a column that contains numerical values. A date is not a valid input to which you can apply the max function; there is no notion of a maximum date here.
Instead, you can filter on the timestamp and get the latest or oldest date using ADF.
Refer to this answer by @Leon to see how to implement this.
How does Spark Structured Streaming let the sink know that a new row is an update of an existing row when run in update mode? Does it look at all the values of all columns of the new row and an existing row for an equality match, or does it compute some sort of hash?
Reading the documentation, we see some interesting information about update mode (bold formatting added by me):
Update Mode - Only the rows that were updated in the Result Table since the last trigger will be written to the external storage (available since Spark 2.1.1). Note that this is different from the Complete Mode in that this mode only outputs the rows that have changed since the last trigger. If the query doesn’t contain aggregations, it will be equivalent to Append mode.
So, to use update mode there needs to be some kind of aggregation; otherwise all data will simply be added to the end of the result table. In turn, to use aggregation the data needs to use one or more columns as a key. Since a key is needed, it is easy to know whether a row has been updated or not: simply compare the values with the previous iteration of the table (the key tells you which row to compare with). In aggregations that contain a groupBy, the columns being grouped on are the keys.
Simple aggregations that return a single value will not require a key. However, since only a single value is returned, that row is updated whenever the value changes. An example here could be taking the sum of a column (without a groupBy).
The documentation contains a picture that gives a good understanding of this, see the "Model of the Quick Example" from the link above.
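As a minimal sketch of this behaviour (essentially the documentation's quick example, with a socket source and console sink as illustrative choices), an aggregation keyed by a groupBy column and written in update mode only re-emits the groups whose values changed in the last trigger:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("update-mode-sketch").getOrCreate()

# A socket source is an illustrative choice; any streaming source works.
lines = (
    spark.readStream.format("socket")
    .option("host", "localhost")
    .option("port", 9999)
    .load()
)

# The groupBy column ("word") acts as the key of the result table: it is what
# tells the engine which existing row an incoming record updates.
word_counts = (
    lines.select(explode(split(lines.value, " ")).alias("word"))
    .groupBy("word")
    .count()
)

# In update mode, only the words whose counts changed since the last trigger
# are written to the sink; unchanged rows are not re-emitted.
query = (
    word_counts.writeStream
    .outputMode("update")
    .format("console")
    .start()
)
query.awaitTermination()
```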
I am wondering how column slicing in a CQL WHERE clause affects read performance. Does Cassandra have some optimization that can fetch only the specific columns with the value, or does it have to retrieve all the columns of a row and check them one after another? For example, I have a primary key (key1, key2), where key2 is the clustering key, and I only want to find the columns that match a certain key2, say value2.
Cassandra stores data as cells: each value for a key+column is a cell. If you save several values for the key at once, they will be placed together in the same file. Also, since Cassandra writes to SSTables, you can have several values saved for the same key-column/cell in different files; Cassandra will read all of them and return the last written one, until compaction or repair occurs and the obsolete values are removed.
Good article about deletes/reads/tombstones:
http://thelastpickle.com/blog/2016/07/27/about-deletes-and-tombstones.html
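A minimal sketch of such a lookup using the Python cassandra-driver (keyspace, table, and values are illustrative): restricting the query on the partition key plus the clustering key lets Cassandra use the clustering order to locate the matching cells inside the partition rather than checking every column one by one.

```python
from cassandra.cluster import Cluster

# Illustrative connection and names; the schema mirrors the question:
# key1 is the partition key, key2 the clustering key.
cluster = Cluster(["127.0.0.1"])
session = cluster.connect("demo_ks")

session.execute("""
    CREATE TABLE IF NOT EXISTS events (
        key1 text,
        key2 text,
        payload text,
        PRIMARY KEY (key1, key2)
    )
""")

# Restricting on the partition key and the clustering key seeks to the
# matching cells inside the sorted partition instead of scanning the row.
rows = session.execute(
    "SELECT payload FROM events WHERE key1 = %s AND key2 = %s",
    ("partition-a", "value2"),
)
for row in rows:
    print(row.payload)
```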
I am new to Cassandra.
I have a column family where the columns are sorted by "LexicalUUIDType".
How can I access the timestamp of each column in such a column family?
I need the timestamp because I have to read the oldest entry.
I cannot use "TimeUUIDType" for sorting the columns.
Thanks,
It depends on the library you are using, but if you are using the raw Thrift API it's something like this (unreleased 0.7/trunk):
column.column.clock.timestamp
(To get all data you will have to use get_range_slices, start with "", and after each call use the last key as the start key in the next call)
You would have to get back all of the columns using get_slice http://wiki.apache.org/cassandra/API06#get_slice and then look at the timestamp field in each one.
Or you can make another column family sorted by TimeUUID which has the corresponding column in the first CF as the value. Query CF #2 with the time you want, and use the result to get from CF #1.
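For completeness, a sketch of the "fetch the columns and inspect their timestamps" approach using the pycassa client (the client choice, keyspace, column family, and row key are assumptions, not part of the original answers):

```python
import pycassa

# Illustrative connection and names; adjust keyspace/CF/row key to your setup.
pool = pycassa.ConnectionPool("MyKeyspace", server_list=["localhost:9160"])
cf = pycassa.ColumnFamily(pool, "MyColumnFamily")

# include_timestamp=True returns each column as (value, timestamp), where the
# timestamp is the write time in microseconds.
columns = cf.get("some_row_key", include_timestamp=True)

# Find the oldest column by its write timestamp.
oldest_name, (oldest_value, oldest_ts) = min(
    columns.items(), key=lambda item: item[1][1]
)
print(oldest_name, oldest_value, oldest_ts)
```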