drop_duplicates after unionByName - apache-spark

I am trying to stack two dataframes (with unionByName()) and then drop duplicate entries (with drop_duplicates()).
Can I trust that unionByName() will preserve the order of the rows, i.e., that df1.unionByName(df2) will always produce a dataframe whose first N rows are df1's? Because, if so, when applying drop_duplicates(), df1's rows would always be the ones preserved, which is the behaviour I want.

unionByName() does not guarantee that df1's records will come before df2's in the result. These are distributed, parallel operations, so you definitely cannot rely on row order.
A solution is to add a technical priority column to each DataFrame, unionByName() them, use the row_number() analytic function to rank rows by priority within each ID, and then keep the row with the highest priority (in the example below, 1 ranks higher than 2).
Take a look at the Scala code below:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lit, row_number}

val df1WithPriority = df1.withColumn("priority", lit(1))
val df2WithPriority = df2.withColumn("priority", lit(2))

// keep only the highest-priority row per ID, then drop the helper columns
df1WithPriority
  .unionByName(df2WithPriority)
  .withColumn(
    "row_num",
    row_number().over(Window.partitionBy("ID").orderBy(col("priority").asc))
  )
  .where(col("row_num") === lit(1))
  .drop("priority", "row_num")

Related

Pyspark filter or split based on row value

I'm dealing with some data in PySpark. We have the issue that metadata and actual data are mixed: we have strings which are interrupted by a "STOP" string. The number of strings between the "STOP" markers is variable, and we would like to filter out short runs.
An example dataframe, where we have ints instead of strings and 0 is the stop signal, is below:
df = spark.createDataFrame([(1,),(0,),(3,),(0,),(4,),(4,),(5,),(0,)])
My goal would now be to have a filter function where I can say how many elements need to lie between two stop signals for the data to be kept. E.g. if min_length were two, we would end up with the dataframe:
df = spark.createDataFrame([(4,),(4,),(5,),(0,)])
My idea was to create a separate column and build a group counter in there:
df.select("_1", F.when(df["_1"] == 0, 0).otherwise(get_counter())).show()
The get_counter function should count how many times we've already seen "STOP" (or 0 in the example df). Due to the distributed nature of Spark, that does not work though.
Is it somehow possible to achieve this easily with a filter? Or is it maybe possible to split the dataframe every time "STOP" occurs? I could then drop the too-short dataframes and merge them again.
Preferably this would be solved in PySpark or Spark SQL, but if someone knows how to do this with the spark-shell, I'd also be curious :)
Spark SQL implementation:
with t2 as (
  select
    monotonically_increasing_id() as id,
    _1,
    case when _1 = 0 then 1 else 0 end as stop
  from t1
),
t3 as (
  select
    *,
    row_number() over (partition by stop order by id) as stop_seq
  from t2
)
select * from t3 where stop_seq > 2
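That query reproduces the example output, but the fixed offset in stop_seq > 2 does not generalize to an arbitrary min_length. A more general sketch in PySpark: derive a group id from a running sum over the stop flag, count the rows in each group, and keep only groups that are long enough. This assumes monotonically_increasing_id reflects the original row order (true for a freshly read, unshuffled dataframe) and that each run keeps its trailing STOP row, as in the expected output; the single-partition window is fine for moderate data sizes.

from pyspark.sql import functions as F, Window

min_length = 2  # minimum number of rows required between two STOP markers

flagged = (df
           .withColumn("id", F.monotonically_increasing_id())
           .withColumn("stop", F.when(F.col("_1") == 0, 1).otherwise(0)))

# group id = number of STOP rows strictly before the current row,
# so each run of values plus its trailing STOP shares one group id
w_order = Window.orderBy("id")
grouped = flagged.withColumn("grp", F.sum("stop").over(w_order) - F.col("stop"))

# count the non-STOP rows per group and keep only long-enough groups
w_grp = Window.partitionBy("grp")
result = (grouped
          .withColumn("grp_size", F.sum(1 - F.col("stop")).over(w_grp))
          .where(F.col("grp_size") >= min_length)
          .orderBy("id")
          .select("_1"))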

Spark How to Join Only Within Partitions

I have 2 large dataframes. Each row has lat/lon data. My goal is to join the 2 dataframes and find all pairs of points which are within a given distance, e.g. 100 m.
df1: (id, lat, lon, geohash7)
df2: (id, lat, lon, geohash7)
I want to partition df1 and df2 on geohash7, and then only join within the partitions. I want to avoid joining between partitions to reduce computation.
df1 = df1.repartition(200, "geohash7")
df2 = df2.repartition(200, "geohash7")
df_merged = df1.join(
    df2,
    (df1["geohash7"] == df2["geohash7"]) &
    (dist(df1["lat"], df1["lon"], df2["lat"], df2["lon"]) < 100)
)
So basically join on geohash7 and then make sure the distance between points is less than 100.
The problem is that Spark actually cross-joins all the data. How can I make it do only intra-partition joins, not inter-partition joins?
After much playing with the data, it seems that Spark is smart enough to first make sure the join happens on the equality condition ("geohash7"), so if there's no match there it won't evaluate the "dist" function.
It also appears that with the equality condition present it no longer does a cross join, so I didn't have to do anything else; the join above works fine.
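You can confirm this from the physical plan: with an equality predicate in the condition, Spark picks a hash or sort-merge join and evaluates the distance expression only on rows whose geohash7 values already match.

# with the geohash7 equality present you should see SortMergeJoin or BroadcastHashJoin
# in the plan; a purely non-equi condition would instead show BroadcastNestedLoopJoin
# or CartesianProduct
df_merged.explain()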

Perform multiple aggregations on a spark dataframe in one pass instead of multiple slow joins

At the moment I have 9 functions which each do a specific calculation on a data frame: average balance per month, rolling P&L, period start balances, ratio calculations, and so on.
Each of those functions produces a data frame of the same shape: the first columns are the group-by columns which the function accepts, and the final column is the calculated statistic.
I.e. each function produces a Spark data frame with the same group-by variables (the same first columns: 1 column if there is one group-by variable, 2 columns if there are two, etc.) and 1 column whose values are the specific calculation, examples of which I listed at the beginning.
Because each of those functions does a different calculation, I need to produce a data frame for each one and then join them all to produce a report.
I join them on the group-by variables because they are common to all of them (each individual statistic report).
But doing 7-8 or even more joins is very slow.
Is there a way to add those columns together without using joins?
Thank you.
I can think of multiple approaches, but this looks like a good use case for the newer pandas UDF Spark API.
You can define one grouped-map UDF. The UDF receives each aggregation group as a pandas dataframe; you apply your 9 calculations to the group and return a pandas dataframe with the aggregated columns. Spark then combines the returned pandas dataframes into one large Spark dataframe.
e.g.:
from pyspark.sql.functions import pandas_udf, PandasUDFType

# given you want to aggregate average and ratio
@pandas_udf("month long, avg double, ratio double", PandasUDFType.GROUPED_MAP)
def compute(pdf):
    # pdf is a pandas.DataFrame holding one "month" group
    pdf['avg'] = compute_avg(pdf)
    pdf['ratio'] = compute_ratio(pdf)
    # return only the columns declared in the schema above
    return pdf[['month', 'avg', 'ratio']]

df.groupby("month").apply(compute).show()
See Pandas-UDF#Grouped-Map
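On Spark 3.x the same grouped-map pattern is written with applyInPandas, where compute is a plain (undecorated) Python function and the output schema is passed explicitly:

# Spark 3.x style: compute takes and returns a pandas.DataFrame
df.groupby("month").applyInPandas(
    compute,
    schema="month long, avg double, ratio double"
).show()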
If your cluster is on a lower Spark version you have 2 options:
Stick to the DataFrame API and write custom aggregate functions (UDAFs). See this answer. They have a horrible API, but usage would look like this:
df.groupBy(df.month).agg(
    my_avg_func(df.amount).alias('avg'),
    my_ratio_func(df.amount).alias('ratio'),
)
Fall back to the good ol' RDD map/reduce API:
# pseudocode
def _create_tuple_key(record):
    return (record.month, record)

def _compute_stats(acc, record):
    acc['count'] += 1
    acc['avg'] = _accumulate_avg(acc['count'], record)
    acc['ratio'] = _accumulate_ratio(acc['count'], record)
    return acc

# note: reduceByKey expects both arguments to be of the same type,
# so in practice this shape calls for aggregateByKey (see the sketch below)
df.rdd.map(_create_tuple_key).reduceByKey(_compute_stats)
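A rough aggregateByKey sketch of that accumulator pattern, reusing the month and amount columns from the examples above (the helper logic here is illustrative only):

# zero value: an empty per-key accumulator
zero = {'count': 0, 'sum': 0.0}

def seq_op(acc, record):
    # fold one row into the accumulator for its key
    return {'count': acc['count'] + 1, 'sum': acc['sum'] + record.amount}

def comb_op(a, b):
    # merge two partial accumulators built on different partitions
    return {'count': a['count'] + b['count'], 'sum': a['sum'] + b['sum']}

monthly_avg = (df.rdd
               .map(lambda r: (r.month, r))
               .aggregateByKey(zero, seq_op, comb_op)
               .mapValues(lambda acc: acc['sum'] / acc['count']))  # e.g. the average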

Spark condition on partition column from another table (Performance)

I have a huge parquet table partitioned on the registration_ts column, named stored.
I'd like to filter this table based on data obtained from a small table, stream.
In the SQL world the query would look like:
spark.sql("select * from stored where exists (select 1 from stream where stream.registration_ts = stored.registration_ts)")
In the DataFrame world:
stored.join(broadcast(stream), Seq("registration_ts"), "leftsemi")
This all works, but the performance suffers because partition pruning is not applied: Spark full-scans the stored table, which is too expensive.
For example, this runs for 2 minutes:
stream.count
res45: Long = 3
//takes 2 minutes
stored.join(broadcast(stream), Seq("registration_ts"), "leftsemi").collect
[Stage 181:> (0 + 1) / 373]
This runs in 3 seconds:
val stream = stream.where("registration_ts in (20190516204l, 20190515143l,20190510125l, 20190503151l)")
stream.count
res44: Long = 42
//takes 3 seconds
stored.join(broadcast(stream), Seq("registration_ts"), "leftsemi").collect
The reason is that in the second example the partition filter on the stream table is propagated through the join to the scan of the stored table.
I'd like to achieve partition filtering for a dynamic set of partitions.
The only solution I was able to come up with:
val partitions = stream.select('registration_ts).distinct.collect.map(_.getLong(0))
stored.where('registration_ts.isin(partitions:_*))
This collects the partitions to the driver and issues a second query. It works fine only for a small number of partitions; when I tried this solution with 500k distinct partitions, the delay was significant.
But there must be a better way ...
Here's one way you can do it in PySpark; I've verified in Zeppelin that it uses the set of values to prune the partitions.
from pyspark.sql.functions import col, collect_set

# collect_set returns a distinct list of values and collect() returns a list of rows;
# [0][0] takes the first column of the first row, i.e. the list of distinct values
filter_list = (spark.read.orc(HDFS_PATH)
               .agg(collect_set(COLUMN_WITH_FILTER_VALUES))
               .collect()[0][0])

# use filter_list with isin() to prune the partitions
df = (spark.read.orc(HDFS_PATH)
      .filter(col(PARTITION_COLUMN).isin(filter_list)))
df.show(5)

# you may want to do some checks on filter_list (e.g. that it is non-empty) before
# doing the second spark.read and relying on it to prune your partitions
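To verify that pruning actually happened, check the physical plan of the second read: the file scan node should list your values under PartitionFilters instead of leaving it empty.

# the scan node should show your partition column under PartitionFilters, e.g.
#   PartitionFilters: [partition_col#... IN (...)]
# an empty PartitionFilters list means the whole table would still be scanned
df.explain(True)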

How to filter out duplicate rows based on some columns in spark dataframe?

Suppose I have a Dataframe like below:
Here, you can see that transaction numbers 1, 2 and 3 have the same values for columns A, B, C but different values for columns D and E. Column E has date entries.
For the same A, B and C combination (A=1, B=1, C=1), we have 3 rows. I want to keep only one row, based on the most recent transaction date in column E. But for the most recent date there are 2 transactions; I want to take just one of them if two or more rows are found for the same A, B, C combination and the same most recent date in column E.
So my expected output for this combination will be row number 3 or 4(any one will do).
For the same A, B and C combination (A=2, B=2, C=2), we have 2 rows. Based on column E, the most recent date is that of row number 5, so we will just take this row for this combination of A, B and C.
So my expected output for this combination will be row number 5.
So the final output will be (3 and 5) or (4 and 5).
Now, how should I approach this?
I read this:
Both reduceByKey and groupByKey can be used for the same purposes, but reduceByKey works much better on a large dataset. That's because Spark knows it can combine output with a common key on each partition before shuffling the data.
I tried groupBy on columns A, B, C and max on column E. But it can't give me back the full row when multiple rows are present for the same most recent date.
What is the most optimized approach to solve this? Thanks in advance.
EDIT: I also need to get back my filtered transactions. How can I do that?
I used Spark window functions to get my solution:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

val window = Window
  .partitionBy(dataframe("A"), dataframe("B"), dataframe("C"))
  .orderBy(dataframe("E").desc)
val dfWithRowNumber = dataframe.withColumn("row_number", row_number() over window)
val filteredDf = dfWithRowNumber.filter(dfWithRowNumber("row_number") === 1)
Linking is possible in several steps. Aggregated DataFrame:
val aggregatedDF = initialDF.select("A","B","C","E").groupBy("A","B","C").agg(max("E").as("E_max"))
Join the initial and aggregated DataFrames:
initialDF.join(aggregatedDF, List("A","B","C"))
If the initial DataFrame comes from Hive, all of this can be simplified.
val initialDF = Seq((1,1,1,1,"2/28/2017 0:00"),(1,1,1,2,"3/1/2017 0:00"),
  (1,1,1,3,"3/1/2017 0:00"),(2,2,2,1,"2/28/2017 0:00"),(2,2,2,2,"2/25/2017 0:00"))
This will miss out on the corresponding col(D):
initialDF
  .toDS.groupBy("_1","_2","_3")
  .agg(max(col("_5"))).show
In case you want the corresponding col D for the max col E:
initialDF.toDS.map(x => (x._1, x._2, x._3, (x._5, x._4)))
  .groupBy("_1","_2","_3").agg(max(col("_4")).as("_4"))
  .select(col("_1"), col("_2"), col("_3"), col("_4._2"), col("_4._1")).show
For reduceByKey you can convert the Dataset to a pair RDD and then work off it. That should be faster in case Catalyst is not able to optimize the groupByKey in the first approach. Refer to Rolling your own reduceByKey in Spark Dataset.
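For example, a PySpark sketch of that pair-RDD route, keying each row by (A, B, C) and keeping the row with the larger E per key (assuming a DataFrame df with columns A, B, C, D, E as in the question, and that E is already a comparable type such as a timestamp rather than a raw date string):

# key each row by (A, B, C) and keep the row with the most recent E per key
latest = (df.rdd
          .map(lambda r: ((r["A"], r["B"], r["C"]), r))
          .reduceByKey(lambda a, b: a if a["E"] >= b["E"] else b)
          .values())

result = spark.createDataFrame(latest, df.schema)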
