Keep only latest records by unique column in table - apache-spark

I have a question about Delta tables / Delta Change Data Feed / upsert.
I have a table (main) that stores the full history of states. I run ZORDER by the uniq column on it, and I retrieve results within ~2 s using a WHERE clause on (uniq, date).
During the day I receive many messages from the stream and append them to the Delta table (main); each night I recompute ZORDER for the main table.
—
I would like to create another table that keeps only the latest records per uniq value. I need this table to reduce the latency of retrieving data.
Possible solutions:
Solution with DENSE_RANK by date (or QUALIFY): take from (main) only the freshest (uniq, date) rows.
We would have to recreate the table each time; that costs a lot of CPU time and loses the ZORDER on the table, so this new small table ends up working worse than the bigger (main) table with ZORDER.
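For reference, a minimal sketch of this rebuild, assuming PySpark on Databricks (an active spark session, and the table/column names from the question):
from pyspark.sql import functions as F, Window

main = spark.read.table("main")
# Keep, per uniq, only the rows carrying that uniq's latest date
w = Window.partitionBy("uniq").orderBy(F.col("date").desc())
latest = (main
          .withColumn("rnk", F.dense_rank().over(w))
          .where("rnk = 1")
          .drop("rnk"))
# Full rewrite every time, which is the costly part described above
latest.write.format("delta").mode("overwrite").saveAsTable("latest")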
Upsert (merge)
I tried it, but it doesn't seem to work the way I want.
As far as I can tell, it only appends or deletes data coming from the batch, not the rest of the (main) table, so records that haven't been matched are not deleted.
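For context, this is roughly the merge I mean, sketched with the delta-spark API and a hypothetical batch DataFrame batch_df; whenMatched/whenNotMatched clauses only act on rows present in the batch, which is why old (uniq, key) rows the batch doesn't mention stay behind:
from delta.tables import DeltaTable

latest = DeltaTable.forName(spark, "latest")
(latest.alias("t")
 .merge(batch_df.alias("s"), "t.uniq = s.uniq AND t.key = s.key")
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())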
Delta Change Data Feed
For this approach I would need to store the full state in each batch, but currently a batch is just a new portion of the data.
I'm starting to think that the latest data needs to be stored somewhere else (an in-memory KV store, e.g. Redis).
Partitioning by uniq works worse than ZORDER by the uniq column. So maybe I need to drop each uniq partition and rewrite it with the new uniq data from the batch?
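If the (latest) table were partitioned by uniq, that idea could be expressed with Delta's replaceWhere, which rewrites only the uniq values present in the batch; a sketch, assuming PySpark and a hypothetical batch DataFrame batch_df with integer uniq values:
uniq_values = [str(r["uniq"]) for r in batch_df.select("uniq").distinct().collect()]
(batch_df.write
 .format("delta")
 .mode("overwrite")
 .option("replaceWhere", "uniq IN ({})".format(",".join(uniq_values)))
 .saveAsTable("latest"))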
Structure of the (main):
date | uniq | key | value
2022-12-13 | 1 | "key1" | 1
2022-12-13 | 1 | "key55" | 5
2022-12-13 | 2 | "key105" | 3
Mini-batches like this one:
date | uniq | key | value
2022-12-14 | 1 | "key3" | 0
2022-12-14 | 1 | "key4" | 1
2022-12-14 | 1 | "key5" | 0
2022-12-14 | 1 | "key6" | 1
Right now, after inserting the batch into (main), it will be:
date | uniq | key | value
2022-12-13 | 1 | "key1" | 1
2022-12-13 | 1 | "key55" | 5
2022-12-13 | 2 | "key105" | 3
2022-12-14 | 1 | "key3" | 0
2022-12-14 | 1 | "key4" | 1
2022-12-14 | 1 | "key5" | 0
2022-12-14 | 1 | "key6" | 1
I want to end up with a table like this (latest):
date | uniq | key | value
2022-12-13 | 2 | "key105" | 3
2022-12-14 | 1 | "key3" | 0
2022-12-14 | 1 | "key4" | 1
2022-12-14 | 1 | "key5" | 0
2022-12-14 | 1 | "key6" | 1
ETL:
raw -> AutoLoader -> silver (append every batch, nightly OPTIMIZE ZORDER)
I want to create another table (gold) that keeps only the latest records per uniq.
When we add a new batch (a new state), I expect to see only these new values per uniq in the gold table, so with every new batch the gold table should be triggered and recalculated somehow.
Or should I forget about this idea and instead save all the data in an in-memory KV store (Redis)?
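For the "triggered and recalculated with every new batch" idea, here is a minimal sketch of maintaining such a gold table with foreachBatch, assuming PySpark plus delta-spark and a streaming DataFrame silver_stream read from the silver table (names and the checkpoint path are illustrative); each uniq present in a batch is replaced wholesale, since one uniq maps to several keys:
from pyspark.sql import functions as F
from delta.tables import DeltaTable

def refresh_gold(batch_df, batch_id):
    gold = DeltaTable.forName(spark, "gold")
    uniqs = [r["uniq"] for r in batch_df.select("uniq").distinct().collect()]
    # Drop the stale rows for every uniq seen in this batch, then append its new rows
    gold.delete(F.col("uniq").isin(uniqs))
    batch_df.write.format("delta").mode("append").saveAsTable("gold")

(silver_stream.writeStream
 .foreachBatch(refresh_gold)
 .option("checkpointLocation", "/tmp/checkpoints/gold")
 .start())
Note that the delete and the append are two separate Delta commits, so a reader could briefly see a uniq with no rows; a MERGE keyed on (uniq, key) would be atomic but, as noted above, would not remove stale keys.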

Related

Spark replicating rows with values of a column from different dataset

I am trying to replicate rows inside a dataset multiple times with different values for a column in Apache Spark. Let's say I have a dataset as follows:
Dataset A
| num | group |
| 1 | 2 |
| 3 | 5 |
Another dataset has different columns:
Dataset B
| id |
| 1 |
| 4 |
I would like to replicate the rows from Dataset A with the column values of Dataset B; essentially a join without any join condition. The resulting dataset should look like this:
| id | num | group |
| 1 | 1 | 2 |
| 1 | 3 | 5 |
| 4 | 1 | 2 |
| 4 | 3 | 5 |
Can anyone suggest how the above can be achieved? As far as I understand, a join requires a condition and columns to be matched between the two datasets.
What you want is called a Cartesian product, and df1.crossJoin(df2) will achieve it. But be careful with it, because it is a very heavy operation.
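A minimal sketch in PySpark, recreating the sample data from the question:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df_a = spark.createDataFrame([(1, 2), (3, 5)], ["num", "group"])
df_b = spark.createDataFrame([(1,), (4,)], ["id"])
# Cartesian product: every row of df_b paired with every row of df_a
df_b.crossJoin(df_a).show()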

How to add a new column with some constant value while appending two datasets using Groovy?

I have multiple monthly datasets with 50 variables each. I need to append these datasets to create one single dataset. However, while appending I also want to add the month identifier to the corresponding records, so that the final dataset has a new column that identifies which month each record belongs to.
Example:
Data 1: Monthly_file_201807
ID | customerCategory | Amount |
1 | home | 654.00 |
2 | corporate | 9684.65 |
Data 2: Monthly_file_201808
ID | customerCategory | Amount |
84 | SME | 985.29 |
25 | Govt | 844.88 |
On Appending, I want something like this:
ID | customerCategory | Amount | Month |
1 | home | 654.00 | 201807 |
2 | corporate | 9684.65 | 201807 |
84 | SME | 985.29 | 201808 |
25 | Govt | 844.88 | 201808 |
Currently, I'm appending using the following code:
List dsList = [
    Data1Path,
    Data2Path
].collect() { app.data.open(source: it) }
// Concatenate all records into a single larger dataset
Dataset ds = app.data.create()
dsList.each() {
    ds.prepareToAdd(it)
    ds.addAll(it)
}
ds.save()
app.data.copy(in: ds, out: FinalAppendedDataPath)
I have used the standard append code, but I am unable to add that additional column with a fixed month value. I don't want to loop through the data to create an additional "Month" column, as my data is very large and I have multiple files.
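The Groovy ETL API above is tool-specific, so purely as an illustration of the pattern (tag each file's rows with a constant month column before appending), here is what it could look like in PySpark, with hypothetical file paths:
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
monthly_files = {"201807": "/data/Monthly_file_201807.csv",
                 "201808": "/data/Monthly_file_201808.csv"}
parts = []
for month, path in monthly_files.items():
    df = spark.read.option("header", True).csv(path)
    # lit() attaches the constant month value without looping over rows
    parts.append(df.withColumn("Month", F.lit(month)))
combined = parts[0]
for part in parts[1:]:
    combined = combined.unionByName(part)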

Node.js : Check entries regularly based on their timestamps

I have a Node.js backend and MSSQL as the database. I do some data processing and store logs in a database table that shows which entity is at what stage in the process, e.g. (there are three stages each entity has to pass):
+----+-----------+-------+------------------------+
| id | entity_id | stage | timestamp |
+----+-----------+-------+------------------------+
| 1 | 1 | 1 | 2019-01-01 12:12:01 |
| 2 | 1 | 2 | 2019-01-01 12:12:10 |
| 3 | 1 | 3 | 2019-01-01 12:12:15 |
| 4 | 2 | 1 | 2019-01-01 12:14:01 |
| 5 | 2 | 2 | 2019-01-01 12:14:10 <--|
| 6 | 3 | 1 | 2019-01-01 12:24:01 |
+----+-----------+-------+------------------------+
As you can see in the line with the arrow, entity no. 2 did not reach stage 3. After a certain amount of time (maybe 120 seconds), such entities should be considered faulty and reported somehow.
What would be a good approach in Node.js to check the table for those timed-out entities? Some kind of cron job that checks all rows every x seconds? That sounds rather clumsy to me.
I am looking forward to your ideas!
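Whatever does the scheduling (an interval timer in Node.js, a cron job, or a database job), the check itself boils down to "latest row per entity, not yet at the final stage, last update older than the timeout". A minimal sketch of that check, written in Python purely for illustration, with the rows copied from the table above:
from datetime import datetime, timedelta

rows = [
    {"entity_id": 1, "stage": 1, "timestamp": datetime(2019, 1, 1, 12, 12, 1)},
    {"entity_id": 1, "stage": 2, "timestamp": datetime(2019, 1, 1, 12, 12, 10)},
    {"entity_id": 1, "stage": 3, "timestamp": datetime(2019, 1, 1, 12, 12, 15)},
    {"entity_id": 2, "stage": 1, "timestamp": datetime(2019, 1, 1, 12, 14, 1)},
    {"entity_id": 2, "stage": 2, "timestamp": datetime(2019, 1, 1, 12, 14, 10)},
    {"entity_id": 3, "stage": 1, "timestamp": datetime(2019, 1, 1, 12, 24, 1)},
]

def find_stalled(rows, now, timeout=timedelta(seconds=120), final_stage=3):
    # Keep only the most recent row per entity
    latest = {}
    for r in rows:
        cur = latest.get(r["entity_id"])
        if cur is None or r["timestamp"] > cur["timestamp"]:
            latest[r["entity_id"]] = r
    # Stalled: not at the final stage and last update older than the timeout
    return [r for r in latest.values()
            if r["stage"] < final_stage and now - r["timestamp"] > timeout]

# With this "now", only entity 2 is reported, matching the arrow in the table
print(find_stalled(rows, now=datetime(2019, 1, 1, 12, 25, 0)))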

Lookup between two tables on a unique ID AND where the date is between two other dates

I have two tables in Excel. One has Key and Date - call this Table A. The other has Key, Begin Date, End Date, and Value - call this Table B.
I am trying to pull into Table A the Value from Table B for the matching Key, where the Date from Table A falls between the Begin Date and End Date in Table B. The value should be 0.4 using the example tables below.
NOTE: There will never be overlapping date ranges, and there shouldn't be multiple rows for the same date range.
Table A -
| Key | Date |
|-----|------------|
| 2 | 10/29/2018 |
Table B -
| Key | Begin Date | End Date | Value |
|-----|------------|------------|-------|
| 1 | 07/01/2018 | 12/31/2999 | 0.1 |
| 1 | 01/01/1995 | 06/30/2018 | 1 |
| 1 | 01/01/1900 | 12/31/1994 | 0.5 |
| 2 | 10/31/2018 | 12/31/2999 | 3.6 |
| 2 | 01/01/1995 | 10/30/2018 | 0.4 |
| 2 | 01/01/1900 | 12/31/1994 | 10 |
| 3 | 01/01/1900 | 12/31/2999 | 100 |
Thanks!
Assuming there'll only be one match, use SUMIFS.
=SUMIFS($I$1:$I$8,$F$1:$F$8,A2,$G$1:$G$8,"<="&B2,$H$1:$H$8,">="&B2)
Note - changed two instances of 12/31/1995 in Table B to 12/31/1994, assuming that it's a typo and date ranges shouldn't overlap between rows.
EDIT:
You can use INDEX and AGGREGATE if you need to return text.
=INDEX(I2:I8,AGGREGATE(15,6,ROW($A$1:$A$7)/(($F$2:$F$8=A2)*($G$2:$G$8<=B2)*($H$2:$H$8>=B2)),1))

Spotfire - Identify each time a value changes in a particular pattern for a particular type

Apologies for the bad title; I'm struggling to describe exactly what I'm trying to do (which has also made it difficult to search for an existing answer).
I have a series of data with the columns "Asset", "Time", and "State", where "State" is a calculated column that changes based on several other values. The data is sourced from a constant, non-regular stream (though in the table below I have created a sample of timestamps that are regular).
The table below shows the source data ("Asset", "Time" and "State"), as well as an intended calculated column "Event" that marks each time an event starts. I do not simply want to count the number of times an "Asset" has a Bad "State"; I want to identify each time an "Asset" changes from a "State" of Good to a "State" of Bad (in the real data there are many different states, but the pattern I'm trying to identify is consistent).
+-------+----------+-------+-------+
| Asset | Time | State | Event |
+-------+----------+-------+-------+
| 1 | 12:00:00 | Good | 0 |
| 2 | 12:00:00 | Good | 0 |
| 1 | 12:00:01 | Good | 0 |
| 2 | 12:00:01 | Good | 0 |
| 1 | 12:00:02 | Bad | 1 |
| 2 | 12:00:02 | Good | 0 |
| 2 | 12:00:03 | Good | 0 |
| 2 | 12:00:03 | Good | 0 |
| 1 | 12:00:04 | Bad | 0 |
| 1 | 12:00:04 | Good | 0 |
| 2 | 12:00:05 | Good | 0 |
| 2 | 12:00:05 | Bad | 1 |
| 2 | 12:00:06 | Bad | 0 |
| 1 | 12:00:06 | Good | 0 |
| 2 | 12:00:07 | Bad | 0 |
| 2 | 12:00:07 | Good | 0 |
| 2 | 12:00:08 | Good | 0 |
| 1 | 12:00:08 | Bad | 1 |
| 2 | 12:00:09 | Good | 0 |
| 1 | 12:00:09 | Bad | 0 |
| 2 | 12:00:10 | Good | 0 |
| 1 | 12:00:10 | Good | 0 |
+-------+----------+-------+-------+
I intend to create a chart of this data showing how many times an event occurs for a particular asset per day. I figured the easiest way to do that is to have a column that records a 1 when an event starts, and then sum this column over the dates to create the visualisation.
My idea was a calculated column that picks up when the "State" is Bad, then checks whether the previous state for that "Asset" was Good, and is set to 1 if so. So far I have been unable to find a way to get the value of the previous row for a particular "Asset".
Note that there are roughly 100 (or more) individual "Asset" values, so creating an individual calculated column to track each one would not be feasible. I am also using Spotfire 7.1.
Thanks in advance, and sorry again if the way I've written this is confusing.
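I can't speak to Spotfire 7.1 expression syntax, but the underlying "compare with the previous row of the same Asset" logic can be expressed with a window function; a minimal PySpark sketch with a few rows from the table above, just to illustrate the calculation being asked for:
from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.getOrCreate()
states = spark.createDataFrame(
    [(1, "12:00:01", "Good"), (1, "12:00:02", "Bad"), (1, "12:00:04", "Bad"),
     (2, "12:00:01", "Good"), (2, "12:00:05", "Bad"), (2, "12:00:06", "Bad")],
    ["Asset", "Time", "State"])
# Event = 1 only on the row where this Asset's State flips from Good to Bad
w = Window.partitionBy("Asset").orderBy("Time")
events = states.withColumn(
    "Event",
    F.when((F.col("State") == "Bad") & (F.lag("State").over(w) == "Good"), 1)
     .otherwise(0))
events.show()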
