LAST (not NULL) gives incorrect answers on a handful of records (Apache Spark SparkSQL) - apache-spark

I'm having a pretty troubling problem with the LAST aggregate in SparkSQL in Spark 2.3.1. It seems to give me around 4 bad results -- that is, values that are not LAST by the specified partitioning and order -- in 500,000 (logical SQL, not Spark) partitions, something like 50MM records. Smaller batches are worse -- the number of errors per batch seems pretty consistent, although I don't think I tried anything smaller than 100,000 logical SQL partitions.
I have roughly 66 FIRST or LAST aggregates, a compound (date, integer) logical sql partition key and a compound (string, string) sort key. I tried converting the four-character numeric values into integers, then I combined them into a single integer. Neither of those moves resolved the problem. Even with a single integer sort key, I was getting a few bad values.
Typically, there are fewer than a hundred records in each partition, and a handful of non-NULL values for any field. It never seems to get the second to last value; it's always at least third to last.
I did try to replace the simple aggregate with a windowed aggregate with ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING. The one run I did of that gave me six bad records -- the compound integer key had given me only two, but I didn't do enough runs to really compare the approaches and of course I need zero.
Why do I not seem to be able to rely on LAST()? Here's a test which just illustrates the unwindowed version of the LAST function, although my partitioning and sorting fields are each two fields.
import org.apache.spark.sql.functions.{expr}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.scalamock.scalatest.MockFactory
import org.scalatest.{BeforeAndAfterAll, FlatSpec, Matchers}
import collection.JavaConverters._
class LastTest extends FlatSpec with Matchers with MockFactory with BeforeAndAfterAll {
implicit val spark: SparkSession = SparkSession.builder().appName("Last Test").master("local[2]").getOrCreate()
import spark.implicits._
// TRN_DATE, TRN_NUMBER, TRN_TIMESTAMP, DETAILS, DATE_TIME, QUE_LINE_ID, OPR_INITIALS, ENTRY_TYPE, HIST_NO, SUB_HIST_NO, MSG_INFO
"LAST" must "work with GROUP BY" in {
val lastSchema = StructType(Seq(
StructField("Pfield", IntegerType) // partition field
, StructField("Ofield", IntegerType) // order field
, StructField("Vfield", StringType) // value field
))
val last:DataFrame = spark.createDataFrame(List[Row](
Row(0, 1, "Pencil")
, Row(5, 1, "Aardvark")
, Row(10, 1, "Monastery")
, Row(10, 2, "Remediation")
, Row(15, 1, "Parcifal")
, Row(20, 1, "Montenegro")
, Row(20, 2, "Susquehana")
, Row(20, 3, "Perfidy")
, Row(20, 4, "Prosody")
).asJava
, lastSchema
).repartition(expr("MOD(Pfield, 4)"))
last.createOrReplaceTempView("last_group_test")
// apply the unwindowed last
val unwindowed:DataFrame = spark.sql("SELECT Pfield, LAST(Vfield) AS Vlast FROM (SELECT * FROM last_group_test ORDER BY Pfield, Ofield) GROUP BY Pfield ORDER BY Pfield")
unwindowed.show(5)
// apply a windowed last
val windowed:DataFrame = spark.sql("SELECT DISTINCT Pfield, LAST(Vfield) OVER (PARTITION BY Pfield ORDER BY Ofield ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS Vlast FROM last_group_test ORDER BY Pfield")
windowed.show(5)
// include the partitioning function in the window
val excessivelyWindowed:DataFrame = spark.sql("SELECT DISTINCT Pfield, LAST(Vfield) OVER (PARTITION BY MOD(Pfield, 4), Pfield ORDER BY Ofield ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS Vlast FROM last_group_test ORDER BY Pfield")
excessivelyWindowed.show(5)
assert(unwindowed.collect() === windowed.collect() && windowed.collect() === excessivelyWindowed.collect())
assert(windowed.count() == 5)
assert(windowed.filter("Pfield=20").select($"Vlast").collect()(0)(0)==="Prosody")
}
}
So, all three datasets are the same, which is nice. But, if I apply this logic to my actual needs -- which has sixty-odd columns, almost all of which are LAST values -- I'll get an error, it looks like about 4 times in a batch of 500,000 groups. If I run the dataset 30 times, I'll get 30 different sets of bad records.
Am I doing something wrong, or is this a defect? Is it a known defect? Is it fixed in 2.4? I didn't see it, but "aggregates simply don't work sometimes" can't be something they released with, right?

I was able to resolve the issue by applying the windowed aggregate to a dataset with the same sorting, sorted in a subquery.
SELECT LAST(VAL) FROM (SELECT * FROM TBL ORDER BY SRT) SRC GROUP BY PRT
was not sufficient, nor was
SELECT LAST(VAL) OVER (PARTITION BY PRT ORDER BY SRT ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) FROM TBL
I had to do both:
SELECT DISTINCT LAST(VAL) OVER (PARTITION BY PRT ORDER BY SRT ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) FROM (SELECT * FROM TBL ORDER BY SRT) SRC
These datasets had been extracted from an Oracle 12.2 instance over JDBC. I also added SRT to the order by clause there, which had just had ORDER BY PRT.
Further -- and I think this may have been most significant -- I used the cacheTable API on the spark catalog object after extracting the data. I had been doing
.repartition
.cache
.count
in order to load all the records with a relatively small number of data connections, but I suspect it was not enough to get all the data sparkside before the aggregations took place.
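Spark documents FIRST and LAST as non-deterministic, because their result depends on the order in which rows reach the aggregate, and that order may change after a shuffle. One way to avoid relying on row order is to select the value that carries the greatest sort key. The following is a minimal sketch of that idea using plain Scala collections rather than Spark; `Rec`, `LastByKey` and the field names are hypothetical and only mirror the test above.

```scala
// Plain-Scala model of "last non-null value per group, by sort key".
// Rec and its field names are hypothetical, chosen to mirror the test above.
final case class Rec(pfield: Int, ofield: Int, vfield: Option[String])

object LastByKey {
  // For each group: drop NULLs, then pick the value with the greatest sort key.
  // The result depends only on the data, never on the order rows arrive in.
  def lastNonNull(rows: Seq[Rec]): Map[Int, String] =
    rows
      .groupBy(_.pfield)
      .toSeq
      .flatMap { case (p, group) =>
        val nonNull = group.collect { case Rec(_, o, Some(v)) => (o, v) }
        if (nonNull.isEmpty) None else Some(p -> nonNull.maxBy(_._1)._2)
      }
      .toMap

  // Deliberately unsorted sample, including a trailing NULL for group 20.
  val sample: Seq[Rec] = Seq(
    Rec(20, 3, Some("Perfidy")), Rec(10, 2, Some("Remediation")),
    Rec(20, 1, Some("Montenegro")), Rec(20, 4, Some("Prosody")),
    Rec(10, 1, Some("Monastery")), Rec(20, 5, None)
  )
}
```

In Spark SQL the same idea is often written as a MAX over a STRUCT of (sort key, value), restricted to rows where the value is not NULL. Treat that as a suggestion to test against your own data, not as a confirmed fix for the problem described here.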

Related

spark: How does salting work in dealing with skewed data

I have skewed data in a table, which is then compared with another table that is small.
I understand that salting works for joins: a random number from a fixed range is appended to the keys in the big, skewed table, and the rows in the small, unskewed table are duplicated once for each number in the same range. The match still happens because one of the duplicated rows will hit any particular salted key of the skewed table.
I also read that salting is helpful when performing a group by. My question is: when random numbers are appended to the key, doesn't it break the group? If it does, then the meaning of the group by operation has changed.
My question is when random numbers are appended to the key doesn't it break the group?
Well, it does, to mitigate this you could run group by operation twice.
Firstly with salted key, then remove salting and group again.
The second grouping will take partially aggregated data, thus significantly reduce skew impact.
E.g.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.IntegerType
df.withColumn("salt", (rand() * n).cast(IntegerType)) // n = number of salt buckets
.groupBy("salt", groupByFields)
.agg(aggFields)
.groupBy(groupByFields)
.agg(aggFields)
"My question is when random numbers are appended to the key doesn't it break the group? If it does then the meaning of group by operation has changed."
Yes, adding a salt to an existing key will break the group. However, as @Gelerion has mentioned in his answer, you can group by the salted and original key and afterwards group by the original key. This works well for aggregations such as
count
min
max
sum
where it is possible to combine results from sub-groups. The illustration below shows an example of calculating the maximum value of a skewed Dataframe.
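As a minimal sketch of that two-stage idea, the snippet below computes a per-key maximum with plain Scala collections, with no Spark involved. `SaltedMax`, `buckets` and `rng` are hypothetical names introduced only for illustration.

```scala
import scala.util.Random

object SaltedMax {
  // Stage 1: group by (key, salt) and take the max of each sub-group.
  // Stage 2: drop the salt, group by key, and take the max of the partial results.
  def saltedMax(rows: Seq[(String, Int)], buckets: Int, rng: Random): Map[String, Int] = {
    val partial: Map[(String, Int), Int] =
      rows
        .map { case (k, v) => ((k, rng.nextInt(buckets)), v) } // attach a random salt
        .groupBy(_._1)
        .map { case (saltedKey, vs) => saltedKey -> vs.map(_._2).max }
    partial.toSeq
      .groupBy { case ((k, _), _) => k } // group again on the original key only
      .map { case (k, parts) => k -> parts.map(_._2).max }
  }

  // Reference result without salting, for comparison.
  def plainMax(rows: Seq[(String, Int)]): Map[String, Int] =
    rows.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).max }
}
```

Because the maximum of the sub-group maxima equals the overall maximum, both functions return the same map whatever salt values are drawn. The same reasoning holds for count (sum the partial counts), min and sum, but not directly for aggregates such as a median.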
var df1 = Seq((1,"a"),(2,"b"),(1,"c"),(1,"x"),(1,"y"),(1,"g"),(1,"k"),(1,"u"),(1,"n")).toDF("ID","NAME")
df1.createOrReplaceTempView("fact")
var df2 = Seq((1,10),(2,30),(3,40)).toDF("ID","SALARY")
df2.createOrReplaceTempView("dim")
val salted_df1 = spark.sql("""select concat(ID, '_', FLOOR(RAND(123456)*19)) as salted_key, NAME from fact """)
salted_df1.createOrReplaceTempView("salted_fact")
val exploded_dim_df = spark.sql(""" select ID, SALARY, explode(array(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19)) as salted_key from dim""")
//val exploded_dim_df = spark.sql(""" select ID, SALARY, explode(array(0 to 19)) as salted_key from dim""")
exploded_dim_df.createOrReplaceTempView("salted_dim")
val result_df = spark.sql("""select split(fact.salted_key, '_')[0] as ID, dim.SALARY
from salted_fact fact
LEFT JOIN salted_dim dim
ON fact.salted_key = concat(dim.ID, '_', dim.salted_key) """)
display(result_df)

How to implement Slowly Changing Dimensions (SCD2) Type 2 in Spark

We want to implement SCD2 in Spark using a SQL join. I found a reference on GitHub
https://gist.github.com/rampage644/cc4659edd11d9a288c1b
but it's not very clear.
Can anybody provide an example or reference for implementing SCD2 in Spark?
Regards,
Manish
A little outdated in terms of newer Spark SQL, but here is an example
I trialed a la Ralph Kimball using Spark SQL, that worked and is thus
reliable. You can run it and it works - but file logic and such needs
to be added - this is the body of the ETL SCD2 logic based on 1.6
syntax but run in 2.x - it is not that hard but you will need to trace
through and generate test data and trace through each step:
Some pre-processing required before script initiates, save a copy of existing and copy existing to the DIM_CUSTOMER_EXISTING.
Write new output to DIM_CUSTOMER_NEW and then copy this to target, DIM_CUSTOMER_1 or DIM_CUSTOMER_2.
The feed can also be re-created or LOAD OVERWRITE.
^^^ NEED SOME BETTER SCRIPTING AROUND THIS. ^^^ The Type 2 dimension is simply only Type 2 values, not a mixed Type 1 & Type 2.
DUMPs that are accumulative can be in fact pre-processed to get the delta.
Use case assumes we can have N input for a person with a date validity / extract supplied.
SPARK 1.6 SQL based originally, not updated yet to SPARK 2.x SQL with nested correlated subquery support.
CUST_CODE cannot change unless there is a stable Primary Key.
This approach handles no input, delta input, same input, all input, and can catch up and need not be run-date based.
^^^ Works best with deltas, as if pass all data and there is no change then still have make a dummy entry with all the same values else it will have gaps in key range
which means will not be able to link facts to dimensions in all cases. I.e. the discard logic works only in terms of a pure delta feed. All data can be passed but only
the current delta. Problem becomes difficult to solve in that we must then look for changes over different rows and expand date range, a little too complicated imho.
The dummy entries in the dimensions are not a huge issue. The problem is a little more difficult in such a less mutable environment, in KUDU it easier to solve.
Ideally there would be some sort of preprocessor that checks which fields have changed and only then passed on, but that may be a bridge too far.
HENCE IT IS A COMPROMISE ALGORITHM necessarily. ^^^
No Deletions processed.
Multi-step processing for SQL required in some cases. Gaps in key ranges difficult to avoid with set processing.
No out of order processing, that would mean re-processing all. Works on a whole date/day basis, if run more than once per day in batch then would need timestamp instead.
0.1 Any difference analysis on existing dumps is only possible if the dumps are accumulative. If they are transactional deltas only, then this is not required.
Care to be taken here.
0.2 If we want only the last update for a given date, then do that here by method of Partitioning and Ranking and filtering out.
These are all pre-processing steps as are the getting of the dimension data from which table.
0.3 Issue is that of small files, but that is not an issue here at xxx. RAW usage only as written to KUDU in a final step.
Actual coding:
import org.apache.spark.sql.SparkSession
val sparkSession = SparkSession
.builder
.master("local") // Not a good idea
.appName("Type 2 dimension update")
.config("spark.sql.crossJoin.enabled", "true") // Needed to add this
.getOrCreate()
spark.sql("drop table if exists DIM_CUSTOMER_EXISTING")
spark.sql("drop table if exists DIM_CUSTOMER_NEW")
spark.sql("drop table if exists FEED_CUSTOMER")
spark.sql("drop table if exists DIM_CUSTOMER_TEMP")
spark.sql("drop table if exists DIM_CUSTOMER_WORK")
spark.sql("drop table if exists DIM_CUSTOMER_WORK_2")
spark.sql("drop table if exists DIM_CUSTOMER_WORK_3")
spark.sql("drop table if exists DIM_CUSTOMER_WORK_4")
spark.sql("create table DIM_CUSTOMER_EXISTING (DWH_KEY int, CUST_CODE String, CUST_NAME String, ADDRESS_CITY String, SALARY int, VALID_FROM_DT String, VALID_TO_DT String) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE LOCATION '/FileStore/tables/alhwkf661500326287094' ")
spark.sql("create table DIM_CUSTOMER_NEW (DWH_KEY int, CUST_CODE String, CUST_NAME String, ADDRESS_CITY String, SALARY int, VALID_FROM_DT String, VALID_TO_DT String) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE LOCATION '/FileStore/tables/DIM_CUSTOMER_NEW_3' ")
spark.sql("CREATE TABLE FEED_CUSTOMER (CUST_CODE String, CUST_NAME String, ADDRESS_CITY String, SALARY int, VALID_DT String) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE LOCATION '/FileStore/tables/mhiscfsv1500226290781' ")
// 1. Get maximum value in dimension, this differs to other RDD approach, issues in parallel? May be other way to be done! Check, get a DF here and this is the interchangability
val max_val = spark.sql("select max(dwh_key) from DIM_CUSTOMER_EXISTING")
//max_val.show()
val null_count = max_val.filter("max(DWH_KEY) is null").count()
var max_Dim_Key = 0;
if ( null_count == 1 ) {
max_Dim_Key = 0
} else {
max_Dim_Key = max_val.head().getInt(0)
}
//2. Cannot do simple difference processing. The values of certain fields could be flip-flopping over time. A too simple MINUS will not work well. Need to process relative to
// youngest existing record etc. and roll the transactions forward. Hence we will not do any sort of difference analysis between new dimension data and existing dimension
// data in any way.
// DO NOTHING.
//3. Capture new stuff to be inserted.
// Some records for a given business key can be linea recta inserted as there have been no mutations to consider at all as there is nothing in current Staging. Does not mean
// delete.
// Also, the older mutations need not be re-processed, only the youngest! The younger one may need closing off or not, need to decide if it is now
// copied across or subject to updating in this cycle, depends on the requirements.
// Older mutations copied across immediately.
// DELTA not always strictly speaking needed, but common definitions. Some ranking required.
spark.sql("""insert into DIM_CUSTOMER_NEW select *
from DIM_CUSTOMER_EXISTING
where CUST_CODE not in (select distinct CUST_CODE FROM FEED_CUSTOMER) """) // This does not need RANKing, DWH Key retained.
spark.sql("""create table DIM_CUSTOMER_TEMP as select *, dense_rank() over (partition by CUST_CODE order by VALID_FROM_DT desc) as RANK
from DIM_CUSTOMER_EXISTING """)
spark.sql("""insert into DIM_CUSTOMER_NEW select DWH_KEY, CUST_CODE, CUST_NAME, ADDRESS_CITY, SALARY, VALID_FROM_DT, VALID_TO_DT
from DIM_CUSTOMER_TEMP
where CUST_CODE in (select distinct CUST_CODE from FEED_CUSTOMER)
and RANK <> 1 """)
// For updating of youngest record in terms of SLCD, we use AND RANK <> 1 to filter these out here as we want to close off the period in this record, but other younger
// records can be passed through immediately with their retained DWH Key.
//4. Combine Staging and those existing facts required. The result of this eventually will be stored in DIM_CUSTOMER_NEW which can be used for updating a final target.
// Issue here is that DWH Key not yet set and different columns. DWH key can be set last.
//4.1 Get records to process, they will have the status NEW.
spark.sql("""create table DIM_CUSTOMER_WORK (DWH_KEY int, CUST_CODE String, CUST_NAME String, ADDRESS_CITY String, SALARY int, VALID_FROM_DT String, VALID_TO_DT String, RECSTAT String) """)
spark.sql("""insert into DIM_CUSTOMER_WORK select 0, CUST_CODE, CUST_NAME, ADDRESS_CITY, SALARY, VALID_DT, '2099-12-31', "NEW"
from FEED_CUSTOMER """)
//4.2 Get youngest already existing dimension record to process in conjunction with newer values.
spark.sql("""insert into DIM_CUSTOMER_WORK select DWH_KEY, CUST_CODE, CUST_NAME, ADDRESS_CITY, SALARY, VALID_FROM_DT, VALID_TO_DT, "OLD"
from DIM_CUSTOMER_TEMP
where CUST_CODE in (select distinct CUST_CODE from FEED_CUSTOMER)
and RANK = 1 """)
// 5. ISSUE with first record in a set. It is not a delta or is used for making a delta, need to know what to do or bypass, depends on case.
// Here we are doing deltas, so first rec is a complete delta
// RECSTAT to be filtered out at end
// NEW, 1 = INSERT --> checked, is correct way, can do in others. No delta computation required
// OLD, 1 = DO NOTHING
// else do delta and INSERT
//5.1 RANK and JOIN to get before and after images in CDC format so that we can decide what needs to be closed off.
// Get the new DWH key values + offset, there may exist gaps eventually.
spark.sql(""" create table DIM_CUSTOMER_WORK_2 as select *, rank() over (partition by CUST_CODE order by VALID_FROM_DT asc) as rank FROM DIM_CUSTOMER_WORK """)
//DWH_KEY, CUST_CODE, CUST_NAME, BIRTH_CITY, SALARY,VALID_FROM_DT, VALID_TO_DT, "OLD"
spark.sql(""" create table DIM_CUSTOMER_WORK_3 as
select T1.DWH_KEY as T1_DWH_KEY, T1.CUST_CODE as T1_CUST_CODE, T1.rank as CURR_RANK, T2.rank as NEXT_RANK,
T1.VALID_FROM_DT as CURR_VALID_FROM_DT, T2.VALID_FROM_DT as NEXT_VALID_FROM_DT,
T1.VALID_TO_DT as CURR_VALID_TO_DT, T2.VALID_TO_DT as NEXT_VALID_TO_DT,
T1.CUST_NAME as CURR_CUST_NAME, T2.CUST_NAME as NEXT_CUST_NAME,
T1.SALARY as CURR_SALARY, T2.SALARY as NEXT_SALARY,
T1.ADDRESS_CITY as CURR_ADDRESS_CITY, T2.ADDRESS_CITY as NEXT_ADDRESS_CITY,
T1.RECSTAT as CURR_RECSTAT, T2.RECSTAT as NEXT_RECSTAT
from DIM_CUSTOMER_WORK_2 T1 LEFT OUTER JOIN DIM_CUSTOMER_WORK_2 T2
on T1.CUST_CODE = T2.CUST_CODE AND T2.rank = T1.rank + 1 """)
//5.2 Get the data for computing new Dimension Surrogate DWH Keys, must execute new query or could use DFs, DSs or RDDs, but chose SPARK SQL as it is easier to follow
spark.sql(s""" create table DIM_CUSTOMER_WORK_4 as
select *, row_number() OVER( ORDER BY T1_CUST_CODE) as ROW_NUMBER, '$max_Dim_Key' as DIM_OFFSET
from DIM_CUSTOMER_WORK_3 """)
//spark.sql("""SELECT * FROM DIM_CUSTOMER_WORK_4 """).show()
//Execute the above to see results, could not format here.
//5.3 Process accordingly and check if no change at all, if no change can get holes in the sequence numbers, that is not an issue. NB: NOT DOING THIS DUE TO COMPLICATIONS !!!
// See sample data above for decision-making on what to do. NOTE THE FACT THAT WE WOULD NEED A PRE_PROCCESOR TO CHECK IF FIELD OF INTEREST ACTUALLY CHANGED
// to get the best result.
// We could elaborate and record via an extra step if there were only two records per business key and if all the current and only next record fields were all the same,
// we could disregard the first and the second record. Will attempt that later as an extra optimization. As soon as there are more than two here, then this scheme packs up
// Some effort still needed.
//5.3.1 Records that just need to be closed off. The previous version gets an appropriate DATE - 1. Dates must not overlap.
// No check on whether data changed or not due to issues above.
spark.sql("""insert into DIM_CUSTOMER_NEW select T1_DWH_KEY, T1_CUST_CODE, CURR_CUST_NAME, CURR_ADDRESS_CITY, CURR_SALARY,
CURR_VALID_FROM_DT, cast(date_sub(cast(NEXT_VALID_FROM_DT as DATE), 1) as STRING)
from DIM_CUSTOMER_WORK_4
where CURR_RECSTAT = 'OLD' """)
//5.3.2 Records that are the last in the sequence must have high end 2099-12-31 set, which has already been done.
// No check on whether data changed or not due to issues above.
spark.sql("""insert into DIM_CUSTOMER_NEW select ROW_NUMBER + DIM_OFFSET, T1_CUST_CODE, CURR_CUST_NAME, CURR_ADDRESS_CITY, CURR_SALARY,
CURR_VALID_FROM_DT, CURR_VALID_TO_DT
from DIM_CUSTOMER_WORK_4
where NEXT_RANK is null """)
//5.3.3
spark.sql("""insert into DIM_CUSTOMER_NEW select ROW_NUMBER + DIM_OFFSET, T1_CUST_CODE, CURR_CUST_NAME, CURR_ADDRESS_CITY, CURR_SALARY,
CURR_VALID_FROM_DT, cast(date_sub(cast(NEXT_VALID_FROM_DT as DATE), 1) as STRING)
from DIM_CUSTOMER_WORK_4
where CURR_RECSTAT = 'NEW'
and NEXT_RANK is not null""")
spark.sql("""SELECT * FROM DIM_CUSTOMER_NEW """).show()
// So, the question is if we could have done without JOINing and just sorted due to gap processing. This was derived off the delta processing but it turned out a little
// different.
// Well we did need the JOIN for next date at least, so if we add some optimization it still holds.
// My logic applied here per different steps, may well be less steps, left as is.
//6. The copy / insert to get a new big target table version and re-compile views. Outside of this actual processing. Logic performed elsewhere.
// NOTE that 2.x now supports nested correlated sub-queries, so would need to re-visit this at a later point, but can leave as is.
// KUDU means no more restating.
Sample data so you know what to generate for the examples:
+-------+---------+----------------+------------+------+-------------+-----------+
|DWH_KEY|CUST_CODE| CUST_NAME|ADDRESS_CITY|SALARY|VALID_FROM_DT|VALID_TO_DT|
+-------+---------+----------------+------------+------+-------------+-----------+
| 230| E222222| Pete Saunders| Leeds| 75000| 2013-03-09| 2099-12-31|
| 400| A048901| John Alexander| Calgary| 22000| 2015-03-24| 2017-10-22|
| 402| A048901| John Alexander| Wellington| 47000| 2017-10-23| 2099-12-31|
| 403| B787555| Mark de Wit|Johannesburg| 49500| 2017-10-02| 2099-12-31|
| 406| C999666| Daya Dumar| Mumbai| 50000| 2016-12-16| 2099-12-31|
| 404| C999666| Daya Dumar| Mumbai| 49000| 2016-11-11| 2016-12-14|
| 405| C999666| Daya Dumar| Mumbai| 50000| 2016-12-15| 2016-12-15|
| 300| A048901| John Alexander| Calgary| 15000| 2014-03-24| 2015-03-23|
+-------+---------+----------------+------------+------+-------------+-----------+
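The closing-off rule from step 5.3.1 (each version ends the day before the next one starts, and the youngest version keeps the high date 2099-12-31) can be sketched in plain Scala as follows. This is only an illustration of the date arithmetic, not a replacement for the SQL above; `CloseOff` and `validityRanges` are hypothetical names.

```scala
import java.time.LocalDate

object CloseOff {
  // High date used above for the still-open (youngest) version.
  val HighDate: LocalDate = LocalDate.parse("2099-12-31")

  // Given the valid-from dates of one business key, return (validFrom, validTo)
  // pairs where each version ends the day before the next one starts.
  def validityRanges(validFrom: Seq[LocalDate]): Seq[(LocalDate, LocalDate)] = {
    val sorted  = validFrom.sortBy(_.toEpochDay)
    val validTo = sorted.drop(1).map(_.minusDays(1)) :+ HighDate
    sorted.zip(validTo)
  }
}
```

Applied to the three valid-from dates of customer A048901 in the sample data, this reproduces the ranges shown in the table: 2014-03-24 to 2015-03-23, 2015-03-24 to 2017-10-22, and 2017-10-23 to 2099-12-31.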
Here's the detailed implementation of slowly changing dimension type 2 in Spark (Data frame and SQL) using exclusive join approach.
Assuming that the source is sending a complete data file i.e. old, updated and new records.
Steps:
Load the recent file data to STG table
Select all the expired records from HIST table
1. select * from HIST_TAB where exp_dt != '2099-12-31'
Select all the records which are not changed from STG and HIST using inner join and filter on HIST.column = STG.column as below
2. select hist.* from HIST_TAB hist inner join STG_TAB stg on hist.key = stg.key where hist.column = stg.column
Select all the new and updated records which are changed from STG_TAB using exclusive left join with HIST_TAB and set expiry and effective date as below
3. select stg.*, eff_dt (yyyy-MM-dd), exp_dt (2099-12-31) from STG_TAB stg left join (select * from HIST_TAB where exp_dt = '2099-12-31') hist
on hist.key = stg.key where hist.key is null or hist.column != stg.column
Select all updated old records from the HIST table using exclusive left join with STG table and set their expiry date as shown below:
4. select hist.*, exp_dt(yyyy-MM-dd) from (select * from HIST_TAB where exp_dt = '2099-12-31') hist left join STG_TAB stg
on hist.key= stg.key where hist.key is null or hist.column!= stg.column
Union all the results of queries 1-4 and insert overwrite the result into the HIST table.
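The four result sets and their union can be modelled with plain Scala collections, as in the rough sketch below. It makes several assumptions that are not in the answer: a single compared column called `value`, dates kept as strings, one `loadDt` used both as the new effective date and as the expiry date of closed records, and step 2 restricted to open records so that expired rows are not emitted twice. `Hist`, `Stg` and `Scd2Sketch` are hypothetical names.

```scala
// Hypothetical row types: one business key and one compared column.
final case class Hist(key: String, value: String, effDt: String, expDt: String)
final case class Stg(key: String, value: String)

object Scd2Sketch {
  val Open = "2099-12-31"

  def merge(hist: Seq[Hist], stg: Seq[Stg], loadDt: String): Seq[Hist] = {
    val stgByKey = stg.map(s => s.key -> s.value).toMap
    // 1. Already expired history is carried over untouched.
    val (open, expired) = hist.partition(_.expDt == Open)
    val openByKey = open.map(h => h.key -> h.value).toMap
    // 2. Open records whose value is unchanged in staging.
    val unchanged = open.filter(h => stgByKey.get(h.key).contains(h.value))
    // 3. New or changed staging records become the new open versions.
    val inserted = stg
      .filter(s => !openByKey.get(s.key).contains(s.value))
      .map(s => Hist(s.key, s.value, loadDt, Open))
    // 4. Open records whose value changed in staging are closed off.
    val closed = open
      .filter(h => stgByKey.get(h.key).exists(_ != h.value))
      .map(_.copy(expDt = loadDt))
    expired ++ unchanged ++ inserted ++ closed
  }
}
```

As in the steps above, the sketch assumes the source sends a complete file: an open history record whose key is missing from staging appears in none of the four sets and is therefore dropped.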
More detailed implementation of SCD type 2 in Scala and Pyspark can be found here-
https://github.com/sahilbhange/spark-slowly-changing-dimension
Hope this helps!
scala spark: https://georgheiler.com/2020/11/19/sparkling-scd2/
NOTICE: this is not a full SCD2 - it assumes one table of events and it determines/ deduplicates valid_from/valid_to from them i.e. no merge/upsert is implemented
val df = Seq(("k1","foo", "2020-01-01"), ("k1","foo", "2020-02-01"), ("k1","baz", "2020-02-01"),
("k2","foo", "2019-01-01"), ("k2","foo", "2019-02-01"), ("k2","baz", "2019-02-01")).toDF("key", "value_1", "date").withColumn("date", to_date(col("date")))
df.show
+---+-------+----------+
|key|value_1| date|
+---+-------+----------+
| k1| foo|2020-01-01|
| k1| foo|2020-02-01|
| k1| baz|2020-02-01|
| k2| foo|2019-01-01|
| k2| foo|2019-02-01|
| k2| baz|2019-02-01|
+---+-------+----------+
df.printSchema
root
|-- key: string (nullable = true)
|-- value_1: string (nullable = true)
|-- date: date (nullable = true)
df.transform(deduplicateScd2(Seq("key"), Seq("date"), "date", Seq())).show
+---+-------+----------+----------+
|key|value_1|valid_from| valid_to|
+---+-------+----------+----------+
| k1| foo|2020-01-01|2020-02-01|
| k1| baz|2020-02-01|2020-11-18|
| k2| foo|2019-01-01|2019-02-01|
| k2| baz|2019-02-01|2020-11-18|
+---+-------+----------+----------+
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.functions.lag
import org.apache.spark.sql.functions.lead
import org.apache.spark.sql.functions.when
import org.apache.spark.sql.functions.current_date
def deduplicateScd2(
key: Seq[String],
sortChangingIgnored: Seq[String],
timeColumn: String,
columnsToIgnore: Seq[String]
)(df: DataFrame): DataFrame = {
val windowPrimaryKey = Window
.partitionBy(key.map(col): _*)
.orderBy(sortChangingIgnored.map(col): _*)
val columnsToCompare =
df.drop(key ++ sortChangingIgnored: _*).drop(columnsToIgnore: _*).columns
val nextDataChange = lead(timeColumn, 1).over(windowPrimaryKey)
val deduplicated = df
.withColumn(
"data_changes_start",
columnsToCompare
.map(e => {
val previous = lag(col(e), 1).over(windowPrimaryKey)
val self = col(e)
// 3 cases: 1.: start (previous is NULL), 2: in between, try to collapse 3: end (= next is null)
// first, filter to only start & end events (= updates/invalidations of records)
//self =!= previous or self =!= next or previous.isNull or next.isNull
self =!= previous or previous.isNull
})
.reduce(_ or _)
)
.withColumn(
"data_changes_end",
columnsToCompare
.map(e => {
val next = lead(col(e), 1).over(windowPrimaryKey)
val self = col(e)
// 3 cases: 1.: start (previous is NULL), 2: in between, try to collapse 3: end (= next is null)
// first, filter to only start & end events (= updates/invalidations of records)
self =!= next or next.isNull
})
.reduce(_ or _)
)
.filter(col("data_changes_start") or col("data_changes_end"))
.drop("data_changes")
deduplicated //.withColumn("valid_to", nextDataChange)
.withColumn(
"valid_to",
when(col("data_changes_end") === true, col(timeColumn))
.otherwise(nextDataChange)
)
.filter(col("data_changes_start") === true)
.withColumn(
"valid_to",
when(nextDataChange.isNull, current_date()).otherwise(col("valid_to"))
)
.withColumnRenamed(timeColumn, "valid_from")
.drop("data_changes_end", "data_changes_start")
}
Here is an updated answer with MERGE.
Note it will not work with Spark Structured Streaming, but can be used with Spark Kafka Batch Integration.
// 0. Standard, start of program.
// Handles multiple business keys in a single run. DELTA tables.
// Schema evolution also handled.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
val sparkSession = SparkSession.builder
.master("local") // Not realistic
.appName("REF Zone History stuff and processing")
.enableHiveSupport() // Standard in Databricks.
.getOrCreate()
// 1. Read newer data to process in some way. Create tempView.
// In general we should have few rows to process, i.e. not at scale.
val dfA = spark.read.option("multiLine",false).json("/FileStore/tables/new_customers_json_multiple_alt3.txt") // New feed.
dfA.createOrReplaceTempView("newFeed")
// 2. First create the target for data at rest if it does not exist. Add an ASC col_key. Should only occur once.
val save_path = "/some_loc_fix/ref/atRest/data" // Make dynamic.
val table_name = "CUSTOMERS_AT_REST"
spark.sql("CREATE TABLE IF NOT EXISTS " + table_name + " LOCATION '" + save_path + "'" + " AS SELECT * from newFeed WHERE 1 = 0 " ) // Can also use limit 0 instead of WHERE 1 = 0.
// Add an ASC col_key column if it does not exist.
// I have in input valid_from_dt, but it could be different so we would need to add in reality as well. Mark to decide.
try {
spark.sql("ALTER TABLE " + table_name + " ADD COLUMNS (col_key BIGINT FIRST, valid_to_dt STRING) ")
} catch {
case unknown: Exception => {
None
}
}
// 3. Get maximum value for target. This is a necessity.
val max_val = spark.sql("select max(col_key) from " + table_name)
//max_val.show()
val null_count = max_val.filter("max(col_key) is null").count()
var max_Col_Key: BigInt = 0;
if ( null_count == 1 ) {
max_Col_Key = 0
} else {
max_Col_Key = max_val.head().getLong(0) // Long and BIGINT interoperable.
}
// 4.1 Create a temporary table for getting the youngest records from the existing data. table_name as variable, newFeed tempView as string. Then apply processing.
val dfB = spark.sql(" select O.* from (select A.cust_code, max(A.col_key) as max_col_key from " + table_name + " A where A.cust_code in (select B.cust_code from newFeed B ) group by A.cust_code ) Z, " + table_name + " O where O.col_key = Z.max_col_key ") // Most recent records.
// No tempView required.
// 4.2 Get the set of data to actually process. New feed + youngest records in feed.
val dfC =dfA.unionByName(dfB, true)
dfC.createOrReplaceTempView("cusToProcess")
// 4.3 RANK
val df1 = spark.sql("""select *, dense_rank() over (partition by CUST_CODE order by VALID_FROM_DT desc) as RANK from CusToProcess """)
df1.createOrReplaceTempView("CusToProcess2")
// 4.4 JOIN adjacent records & process closing off dates etc.
val df2 = spark.sql("""select A.*, B.rank as B_rank, cast(date_sub(cast(B.valid_from_dt as DATE), 1) as STRING) as untilMinus1
from CusToProcess2 A LEFT OUTER JOIN CusToProcess2 B
on A.cust_code = B.cust_code and A.RANK = B.RANK + 1 """)
val df3 = df2.drop("valid_to_dt").withColumn("valid_to_dt", $"untilMinus1").drop("untilMinus1").drop("B_rank")
val df4 = df3.withColumn("valid_to_dt", when($"valid_to_dt".isNull, lit("2099-12-31")).otherwise($"valid_to_dt")).drop("RANK")
df4.createOrReplaceTempView("CusToProcess3")
val df5 = spark.sql(s""" select *, row_number() OVER( ORDER BY cust_code ASC, valid_from_dt ASC) as ROW_NUMBER, '$max_Col_Key' as col_OFFSET
from CusToProcess3 """)
// Add new ASC col_key, gaps can result, not an issue must always be ascending.
val df6 = df5.withColumn("col_key", when($"col_key".isNull, ($"ROW_NUMBER" + $"col_OFFSET")).otherwise($"col_key"))
val df7 = df6.withColumn("col_key", col("col_key").cast(LongType)).drop("ROW_NUMBER").drop("col_OFFSET")
// 5. ACTUAL MERGE, is very simple.
// More than one Merge key possible? Need then to have a col_key if only one such possible.
df7.createOrReplaceTempView("CUST_DELTA")
spark.sql("SET spark.databricks.delta.schema.autoMerge.enabled = true")
spark.sql(""" MERGE INTO CUSTOMERS_AT_REST
USING CUST_DELTA
ON CUSTOMERS_AT_REST.col_key = CUST_DELTA.col_key
WHEN MATCHED THEN
UPDATE SET *
WHEN NOT MATCHED THEN
INSERT *
""")

Is .show() a Spark action? [duplicate]

I have the following code:
val df_in = sqlcontext.read.json(jsonFile) // the file resides in hdfs
//some operations in here to create df as df_in with two more columns "terms1" and "terms2"
val intersectUDF = udf( (seq1:Seq[String], seq2:Seq[String] ) => { seq1 intersect seq2 } ) //intersects two sequences
val symmDiffUDF = udf( (seq1:Seq[String], seq2:Seq[String] ) => { (seq1 diff seq2) ++ (seq2 diff seq1) } ) //compute the difference of two sequences
val df1 = (df.withColumn("termsInt", intersectUDF(df("terms1"), df("terms2") ) )
.withColumn("termsDiff", symmDiffUDF(df("terms1"), df("terms2") ) )
.where( size(col("termsInt")) >0 && size(col("termsDiff")) > 0 && size(col("termsDiff")) <= 2 )
.cache()
) // add the intersection and difference columns and filter the resulting DF
df1.show()
df1.count()
The app is working properly and fast until the show() but in the count() step, it creates 40000 tasks.
My understanding is that df1.show() should be triggering the full df1 creation and df1.count() should be very fast. What am I missing here? why is count() that slow?
Thank you very much in advance,
Roxana
show is indeed an action, but it is smart enough to know when it doesn't have to run everything. All your operations are map-like operations, so Spark only evaluates as many partitions as it needs to print the rows, and there's no need to calculate the whole final table. Because cache() is lazy and only stores partitions that were actually computed, most of df1 is still uncached after show(). count, however, has to physically go through the whole table in order to count it, and that's why it takes so long. You could test this by adding an orderBy to df1's definition: the sort needs the full table, so show() should then take long too.
EDIT: Also, the 40k tasks are likely due to the number of partitions your DF is split into. Try df1.repartition(<a sensible number here, depending on cluster and DF size>) and run count again.
show() by default shows only 20 rows. If the first partition returns more than 20 rows, the remaining partitions will not be executed.
Note that show has several variants. show(n) displays the first n rows, so a large n means more partitions are executed and it may take more time. show(false) only disables truncation of long cell values and still displays 20 rows. So, show() equals show(20), which is a partial action.
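The difference between the two actions can be tried out on a toy DataFrame. This is only a minimal sketch, not the asker's job: it assumes a local SparkSession (Spark 2.x or later), and the DataFrame demo, its 8 partitions, and its row counts are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Assumption: a local session for illustration; on a cluster, reuse the existing one.
val spark = SparkSession.builder().master("local[2]").appName("show-vs-count").getOrCreate()

// Hypothetical data: 1000 rows in 8 partitions, narrow transformations only.
val demo = spark.range(0, 1000, 1, 8)
  .withColumn("doubled", col("id") * 2)
  .where(col("doubled") % 4 === 0)
  .cache() // lazy: nothing is computed or stored yet

// Partial action: partitions are evaluated only until there are enough rows to print.
demo.show()

// Still at most 20 rows; false only disables truncation of long cell values.
demo.show(false)

// Full action: every partition is evaluated.
val total = demo.count()

// A repeated count can reuse whatever the cache now holds.
val again = demo.count()
```

Comparing the task counts of the show() and count() jobs in the Spark UI should make the gap visible; the exact numbers depend on the partitioning.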

Mimic 'group by' and window function logic in Spark

I have a large CSV file with the columns id, time, location. I made it an RDD, and want to compute some aggregated metrics of the trips, where a trip is defined as a time-contiguous set of records of the same id, separated by at least 1 hour on either side. I am new to Spark. (related)
To do that, I plan to create an RDD with elements of the form (trip_id,(time, location)) and use reduceByKey to calculate all the needed metrics.
To calculate the trip_id, I am trying to implement the SQL approach of the linked question: make an indicator field of whether the record is the start of a trip, and take a cumulative sum of this indicator field. This does not sound like a distributed approach: is there a better one?
Furthermore, how can I add this indicator field? It should be 1 if the time difference to the previous record of the same id is above an hour, and 0 otherwise. I first thought of doing a groupBy on id and then sorting within each group's values, but they will be inside an Array and thus not amenable to sortByKey, and there is no lag function as in SQL to get the previous value.
Example of the suggested aforementioned approach: for the RDD
(1,9:30,50)
(1,9:37,70)
(1,9:50,80)
(2,19:30,10)
(2,20:50,20)
We want to turn it first into the RDD with the time differences,
(1,9:30,50,inf)
(1,9:37,70,00:07:00)
(1,9:50,80,00:13:00)
(2,19:30,10,inf)
(2,20:50,20,01:20:00)
(The value of the earliest record is, say, scala's PositiveInfinity constant)
and turn this last field into an indicator field of whether it is above 1 hour, which indicates whether we start a trip,
(1,9:30,50,1)
(1,9:37,70,0)
(1,9:50,80,0)
(2,19:30,10,1)
(2,20:50,20,1)
and then turn it into a trip_id
(1,9:30,50,1)
(1,9:37,70,1)
(1,9:50,80,1)
(2,19:30,10,2)
(2,20:50,20,3)
and then use this trip_id as the key to aggregations.
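The three steps above (time difference, start-of-trip indicator, cumulative sum) can be sketched on a plain local Scala collection. This is only an illustration of the logic, not a distributed solution; the case class Rec and the choice to store time as minutes since midnight are assumptions made for the example.

```scala
// Hypothetical record type: time is stored as minutes since midnight for simplicity.
case class Rec(id: Int, minutes: Int, location: Int)

val records = Seq(
  Rec(1, 9 * 60 + 30, 50),
  Rec(1, 9 * 60 + 37, 70),
  Rec(1, 9 * 60 + 50, 80),
  Rec(2, 19 * 60 + 30, 10),
  Rec(2, 20 * 60 + 50, 20)
)

// Sort by (id, time) so that each record's predecessor is the previous record of the same id, if any.
val sorted = records.sortBy(r => (r.id, r.minutes)).toVector

// Pair every record with its predecessor (None for the very first record).
val previous: Vector[Option[Rec]] = None +: sorted.init.map(Some(_))

// Indicator: 0 if the predecessor has the same id and is at most 60 minutes earlier, else 1.
val indicators = sorted.zip(previous).map {
  case (cur, Some(prev)) if prev.id == cur.id && cur.minutes - prev.minutes <= 60 => 0
  case _ => 1
}

// Cumulative sum of the indicator gives the trip_id.
val tripIds = indicators.scanLeft(0)(_ + _).tail

// (record, trip_id) pairs, ready to be keyed by trip_id for aggregation.
val withTripId = sorted.zip(tripIds)
```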
The preprocessing was simply to load the file and delete the header,
val rawdata=sc.textFile("some_path")
def isHeader(line:String)=line.contains("id")
val data=rawdata.filter(!isHeader(_))
Edit
While trying to implement this with Spark SQL, I ran into an error regarding the time difference:
val lags=sqlContext.sql("""
select time - lag(time) over (partition by id order by time) as diff_time from data
""");
since Spark doesn't know how to take the difference between two timestamps. I'm trying to check whether this difference is above 1 hour.
It also doesn't recognize the function getTime, which I found online as an answer. The following returns an error too (Couldn't find window function time.getTime):
val lags=sqlContext.sql("""
select time.getTime() - (lag(time)).getTime() over (partition by id order by time)
from data
""");
Even though making a similar lag difference for a numeric attribute works:
val lag_numeric=sqlContext.sql("""
select longitude - lag(longitude) over (partition by id order by time)
from data"""); //works
Spark didn't recognize the function Hours.hoursBetween either. I'm using Spark 1.4.0.
I also tried to define an appropriate user-defined function, but UDFs are oddly not recognized inside queries:
val timestamp_diff: ((Timestamp,Timestamp) => Double) =
(d1: Timestamp,d2: Timestamp) => d1.getTime()-d2.getTime()
val lags=sqlContext.sql("""select timestamp_diff(time,lag(time))
over (partition by id order by time) from data""");
So, how can Spark test whether the difference between timestamps is above an hour?
Full code:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions._
import org.apache.spark.sql.hive.HiveContext // For window functions
import java.util.Date
import java.sql.Timestamp
// sqlContext must be defined before its members are imported
val sqlContext = new HiveContext(sc)
import sqlContext._
import sqlContext.implicits._
case class Record(id: Int, time:Timestamp, longitude: Double, latitude: Double)
val raw_data=sc.textFile("file:///home/sygale/merged_table.csv")
val data_records=
raw_data.map(line=>
Record( line.split(',')(0).toInt,
Timestamp.valueOf(line.split(',')(1)),
line.split(',')(2).toDouble,
line.split(',')(3).toDouble
))
val data=data_records.toDF()
data.registerTempTable("data")
val lags=sqlContext.sql("""
select time - lag(time) over (partition by id order by time) as diff_time from data
""");
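One possible way to compare timestamps is to cast them to seconds since the epoch before subtracting. The following is a minimal sketch under stated assumptions: it uses the DataFrame API of a newer Spark release (SparkSession, 2.x or later) rather than 1.4.0, the sample rows and their date are made up, and the column names diff_seconds, trip_start and trip_id are hypothetical.

```scala
import java.sql.Timestamp
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lag, sum, when}

// Assumption: a local session on Spark 2.x or later, not the 1.4.0 setup from the question.
val spark = SparkSession.builder().master("local[2]").appName("trip-ids").getOrCreate()
import spark.implicits._

// Hypothetical sample rows mirroring the example above (the date is made up).
val data = Seq(
  (1, Timestamp.valueOf("2015-01-01 09:30:00")),
  (1, Timestamp.valueOf("2015-01-01 09:37:00")),
  (1, Timestamp.valueOf("2015-01-01 09:50:00")),
  (2, Timestamp.valueOf("2015-01-01 19:30:00")),
  (2, Timestamp.valueOf("2015-01-01 20:50:00"))
).toDF("id", "time")

val byId = Window.partitionBy("id").orderBy("time")

// Casting a timestamp to long yields seconds since the epoch, which can be subtracted.
val seconds = col("time").cast("long")
val withDiff = data.withColumn("diff_seconds", seconds - lag(seconds, 1).over(byId))

// A trip starts at the first record of an id (null difference) or after a gap above one hour.
val withStart = withDiff.withColumn(
  "trip_start",
  when(col("diff_seconds").isNull || col("diff_seconds") > 3600, 1).otherwise(0))

// Running sum of the indicator over a global ordering gives the trip_id.
val withTripId = withStart.withColumn(
  "trip_id",
  sum("trip_start").over(Window.orderBy("id", "time")))
```

Note that a window with orderBy but no partitionBy moves all rows into a single partition, so the running sum in this sketch only illustrates the logic and would not scale to a large file as written.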
