How to check Spark DataFrame difference? - apache-spark

I need to check my solution for idempotency and see how much it differs from the previous solution.
I tried the following:
spark.sql('''
select * from t1
except
select * from t2
''').count()
This gives me the number of rows that differ between the two tables (t1 is my solution, t2 is the original data). When many rows differ, I want to find out where they differ.
So I tried this:
diff = {}
columns = t1.columns
for col in columns:
    cntr = spark.sql(f'''
        select {col} from t1
        except
        select {col} from t2
    ''').count()
    diff[col] = cntr
print(diff)
This is not workable for me, because it runs for about 1-2 hours (both tables have 30 columns and 30 million rows of data).
Does anyone have an idea how to calculate this more quickly?

Except is essentially a join on all columns at the same time. Does your data have a primary key? It could even be a composite key comprising multiple columns, but that is still much better than taking all 30 columns into account.
Once you figure out the primary key you can do a FULL OUTER JOIN and:
check NULLs on the left (rows that exist only in t2)
check NULLs on the right (rows that exist only in t1)
check the other columns of matching rows (it's much cheaper to compare the values after the join), as in the sketch below
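A minimal PySpark sketch of this approach, assuming a single-column key named id (the key name and the column col_x are placeholders, not from the original post):

from pyspark.sql import functions as F

# Full outer join on the assumed primary key.
joined = t1.alias("a").join(t2.alias("b"), F.col("a.id") == F.col("b.id"), "full_outer")

# NULLs on one side mean the row exists only in the other table.
only_in_t1 = joined.filter(F.col("b.id").isNull()).count()
only_in_t2 = joined.filter(F.col("a.id").isNull()).count()

# For matched rows, comparing values column by column is much cheaper
# than an except over all 30 columns.
mismatches = joined.filter(
    F.col("a.id").isNotNull() & F.col("b.id").isNotNull()
    & (F.col("a.col_x") != F.col("b.col_x"))  # repeat per column of interest
).count()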

Given that your resources remain unchanged, I think there are three ways you can optimize:
Join the two DataFrames once instead of looping over except: I assume your dataset has a key / index; otherwise there is no ordering in either DataFrame and you can't meaningfully check the difference. Unless you have very limited resources, just join once to combine the two DataFrames instead of running multiple except queries.
Check your data partitioning: Even if you use point 1 or the method you're currently using, make sure the data is evenly distributed across an optimal number of partitions. Most of the time, data skew is one of the critical factors that lowers performance. If your key is a string, use repartition; if you're using a sequence number, use repartitionByRange.
Use a when-otherwise pair to check the difference: once you join the two DataFrames, you can use a when-otherwise condition to compare each column, for example: df.select(func.sum(func.when(func.col('df1.col_a') != func.col('df2.col_a'), func.lit(1)).otherwise(func.lit(0))).alias('diff_in_col_a_count')). This way you can calculate all the differences within one action instead of many; see the sketch below.
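A sketch combining the three points, assuming a key column named id (the key name and the aliases are illustrative, not from the original post):

from pyspark.sql import functions as func

# Points 1 + 2: repartition by the key to fight skew, then join once.
joined = (t1.repartition('id').alias('df1')
          .join(t2.repartition('id').alias('df2'), on='id', how='inner'))

# Point 3: one aggregation yields every per-column difference count in a
# single action. Note that != treats NULLs as unknown, so NULL-vs-value
# pairs are not counted here.
diff_counts = joined.select([
    func.sum(
        func.when(func.col(f'df1.{c}') != func.col(f'df2.{c}'), func.lit(1))
            .otherwise(func.lit(0))
    ).alias(f'diff_in_{c}_count')
    for c in t1.columns if c != 'id'
])
diff_counts.show()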

Related

How to identify all columns that have different values in a Spark self-join

I have a Databricks Delta table of financial transactions that is essentially a running log of all changes that ever took place on each record. Each record is uniquely identified by 3 keys, so each record can have multiple instances in this table, each representing a historical entry of a change (across one or more columns of that record). Now, if I wanted to find out cases where a specific column value changed, I can easily achieve that by doing something like this:
SELECT t1.Key1, t1.Key2, t1.Key3, t1.Col12 as Before, t2.Col12 as After
from table1 t1 inner join table1 t2 on t1.Key1 = t2.Key1 and t1.Key2 = t2.Key2
and t1.Key3 = t2.Key3 where t1.Col12 != t2.Col12
However, these tables have a large number of columns. What I'm trying to achieve is a way to identify any columns that changed in a self-join like this; essentially a list of all columns that changed. I don't care about the actual values that changed, just a list of column names that changed across all records. It doesn't even have to be per row. But the 3 keys will always be excluded, since they uniquely define a record.
Essentially I'm trying to find any columns that are susceptible to change. So that I can focus on them dedicatedly for some other purpose.
Any suggestions would be really appreciated.
Databricks has change data feed (CDF / CDC) functionality that can simplify these types of use cases: https://docs.databricks.com/delta/delta-change-data-feed.html
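For example, a hedged sketch of reading the change feed (this assumes CDF has already been enabled on the table; the table name transactions is a placeholder):

# Read the row-level change feed recorded by Delta.
changes = (spark.read.format("delta")
           .option("readChangeFeed", "true")
           .option("startingVersion", 0)
           .table("transactions"))

# Updates arrive as pre/post image pairs; comparing the two images
# column by column reveals which fields actually changed.
updates = changes.filter(
    changes._change_type.isin("update_preimage", "update_postimage"))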

PySpark - A more efficient method to count common elements

I have two dataframes, say dfA and dfB.
I want to take their intersection and then count the number of unique user_ids in that intersection.
I've tried the following which is very slow and it crashes a lot:
dfA.join(broadcast(dfB), ['user_id'], how='inner').select('user_id').dropDuplicates().count()
I need to run many such lines, in order to get a plot.
How can I perform such a query in an efficient way?
As described in the question, the only relevant part of the DataFrames is the column user_id (you describe that you join on user_id and afterwards use only the user_id field).
The source of the performance problem is joining two big dataframes when you need only the distinct values of one column in each dataframe.
In order to improve the performance I'd do the following:
Create two small DataFrames which hold only the user_id column of each DataFrame.
This will dramatically reduce the size of each DataFrame, as it will hold only one column (the only relevant column).
dfAuserid = dfA.select("user_id")
dfBuserid = dfB.select("user_id")
Get the distinct values of each DataFrame (note: distinct() is equivalent to dropDuplicates()).
This will dramatically reduce the size of each DataFrame, as each new DataFrame will hold only the distinct values of the user_id column.
dfAuseridDist = dfA.select("user_id").distinct()
dfBuseridDist = dfB.select("user_id").distinct()
Perform the join on the two minimal DataFrames above in order to get the unique values in the intersection.
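Assuming the variable names from the previous steps, the final step might look like:

# Inner join of the two de-duplicated single-column DataFrames,
# then count the user_ids common to both.
dfAuseridDist.join(dfBuseridDist, "user_id", "inner").count()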
I think you can select the necessary columns first and perform the join afterwards. It should also be beneficial to move the dropDuplicates before the join, since then you get rid of user_ids that appear multiple times in one of the DataFrames.
The resulting query could look like:
from pyspark.sql.functions import broadcast

dfA.select("user_id").join(broadcast(dfB.select("user_id")), ['user_id'], how='inner')\
    .select('user_id').dropDuplicates().count()
OR:
dfA.select("user_id").dropDuplicates(["user_id",]).join(broadcast(dfB.select("user_id")\
.dropDuplicates(["user_id",])), ['user_id'], how='inner').select('user_id').count()
OR the version with distinct should work as well.
dfA.select("user_id").distinct().join(broadcast(dfB.select("user_id").distinct()),\
['user_id'], how='inner').select('user_id').count()

Does it help to filter down a dataframe before a left outer join?

I've only seen sources say that this helps for RDDs, so I was wondering if it helped for DataFrames since the Spark core and spark SQL engines optimize differently.
Let's say table 1 has 6mil records and we're joining to table 2 which has 600mil records. We are joining these two tables on table 2's primary key, 'key2'.
If we plan to do:
table3 = table1.join(table2, 'key2', 'left_outer')
Is it worth it to filter down table2's 600mil records with a WHERE table2.key2 IN table1.key2 before the join? And if so, what's the best way to do it? I know the DataFrame LEFT SEMI JOIN method is similar to a WHERE IN filter, but I'd like to know if there are better ways to filter it down.
TL;DR It is not possible to answer without data, but probably not.
Pre-filtering may provide a performance boost if it significantly reduces the number of records to be shuffled. For that to pay off:
The filter has to be highly selective.
The size of the key column should be << the size of all columns.
The first point is obvious: if there is no reduction, you do the extra pass for nothing.
The second is subtle - WHERE ... IN (SELECT ... FROM ...) requires a shuffle, the same as a join, so the keys are actually shuffled twice.
Using bloom filters might scale more gracefully (no need to shuffle).
If you have a 100-fold difference in the number of records, it might be better to consider a broadcast join.
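For reference, a hedged sketch of both ideas in PySpark (table1 and table2 are the question's names; whether this actually helps depends on the data, as noted above):

from pyspark.sql.functions import broadcast

# A left semi join is the DataFrame analogue of WHERE key2 IN (SELECT ...).
# Broadcasting table1's distinct keys (if they fit in executor memory)
# lets Spark pre-filter table2 without shuffling its 600M rows.
keys = table1.select("key2").distinct()
table2_filtered = table2.join(broadcast(keys), "key2", "left_semi")

table3 = table1.join(table2_filtered, "key2", "left_outer")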

Spark SQL window function causes skew in data distribution

The performance of this Spark SQL query is bad due to skew data distribution:
select c.*, coalesce(
sum(revenue)
OVER (PARTITION BY cid, pid, code
ORDER BY (cTime div (1000*3600))
RANGE BETWEEN 336 PRECEDING and 1 PRECEDING), 0L) as totalRevenue
from records c
I can see in the Spark UI that a single task gets stuck, and the cluster fails if I increase the scanned range.
I am using YARN on AWS EMR, with Spark 2.2.0.
How can I overcome this issue?
Thanks
I can only recommend several approaches for you to investigate that may alleviate the situation. I would actually first try two approaches that don't treat the skew at all:
Try increasing the executor memory per the error message. On YARN you may additionally need to increase the maximum container memory as well. The default on Spark IIRC is 2gb, and it's not uncommon to need to increase it.
Try switching to the MEMORY_AND_DISK or DISK_ONLY persistence levels. I believe this should work for your query, although it can be hard to eyeball the full query plan.
The reason for this is that, at least to my eye, your data is fundamentally skewed. You're setting yourself up for maintenance difficulties if you start reshaping the data to address the skew in ways specific to its current shape, because the shape of the data may change over time. In my opinion, you want to preserve the most straightforward implementation of your query for as long as you can, and only optimize for skew programmatically if you hit problems with SLA violations, etc.
If those don't work, you can try to address the skew directly. A simple approach is to create a third column populated by a random number for the column values that are known to be problematic. Do one pass of your summing operation with this column in place, using it as part of the key, then a second pass with the extra random column removed. Alternatively, you can do two queries and concatenate them: one with the random number for the skewed data (which must still be handled in two passes) and another unaltered query for the non-problematic data.
Edit - compute partial sums through two frames
The fundamentally useful observation here is that addition is commutative and associative. My original proposal based on random numbers won't work but this will. Basically, you want to compute the partial sum of the frame you want in several parts. The easiest way to do this is probably as a set of ranges (two used here for simplicity):
create temporary table partial_revenue_1 as select c.*, coalesce(
sum(revenue)
OVER (PARTITION BY cid, pid, code
ORDER BY (cTime div (1000*3600))
RANGE BETWEEN 336 PRECEDING and 118 PRECEDING), 0L) as partialTotalRevenue
from records c
create temporary table partial_revenue_2 as select c.*, coalesce(
sum(revenue)
OVER (PARTITION BY cid, pid, code
ORDER BY (cTime div (1000*3600))
RANGE BETWEEN 117 PRECEDING and 1 PRECEDING), 0L) as partialTotalRevenue
from records c
create temporary table combined_partials as select * from
partial_revenue_1 union all select * from partial_revenue_2
select sum(partialTotalRevenue), first(c.some_col) ... from
combined_partials c group by cid, pid, code
Notice you need to use the first aggregate function to cull the duplicate fields that you will have from the earlier select * operations on the records table. Don't worry, this will be fine since both values came from the same table.

Spark window function without orderBy

I have a DataFrame with columns a, b for which I want to partition the data by a using a window function, and then give unique indices for b
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number
val window_filter = Window.partitionBy($"a").orderBy($"b".desc)
val result = df.withColumn("uid", row_number().over(window_filter))
But for this use-case, ordering by b is unneeded and may be time consuming. How can I achieve this without ordering?
row_number() without an ORDER BY, or with an ORDER BY on a constant, has non-deterministic behavior and may produce different results for the same rows from run to run due to parallel processing. The same can happen even if the ORDER BY column does not change: the physical order of rows may differ from run to run, and you will get different results.
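To make the trade-off concrete, here is a sketch of the constant-order variant in PySpark (a hedged illustration of exactly what the answer warns about, not a recommendation; df, a and b are the question's names):

from pyspark.sql import Window
from pyspark.sql.functions import row_number, lit

# Ordering by a constant skips a meaningful sort, but which row
# receives which uid is non-deterministic and may change between runs.
w = Window.partitionBy("a").orderBy(lit(1))
df_with_uid = df.withColumn("uid", row_number().over(w))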
