Getting unexpected results from Spark SQL Window Functions - apache-spark

It seems like the Spark SQL Window function is not working properly.
I am running a Spark job on a Hadoop cluster where the HDFS block size is 128 MB, with
Spark version 1.5 on CDH 5.5.
My requirement:
If there are multiple records with the same data_rfe_id, take the single record with the maximum seq_id and maximum service_id.
I see that in the raw data there are some records with the same data_rfe_id and the same seq_id, so I applied row_number using a Window function so that I can filter the records with row_num === 1.
But it seems this is not working on huge datasets: I see the same row number applied to every record.
Why is this happening?
Do I need to reshuffle before I apply the window function on the dataframe?
I am expecting a unique row number for each record within a given data_rfe_id, as in the expected result below.
I want to use a Window function only to achieve this.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.rowNumber
.....
scala> df.printSchema
root
|-- transitional_key: string (nullable = true)
|-- seq_id: string (nullable = true)
|-- data_rfe_id: string (nullable = true)
|-- service_id: string (nullable = true)
|-- event_start_date_time: string (nullable = true)
|-- event_id: string (nullable = true)
val windowFunction = Window.partitionBy(df("data_rfe_id")).orderBy(df("seq_id").desc,df("service_id").desc)
val rankDF =df.withColumn("row_num",rowNumber.over(windowFunction))
rankDF.select("data_rfe_id","seq_id","service_id","row_num").show(200,false)
Expected result :
+------------------------------------+-----------------+-----------+-------+
|data_rfe_id |seq_id |service_id|row_num|
+------------------------------------+-----------------+-----------+-------+
|9ih67fshs-de11-4f80-a66d-b52a12c14b0e|1695826 |4039 |1 |
|9ih67fshs-de11-4f80-a66d-b52a12c14b0e|1695821 |3356 |2 |
|9ih67fshs-de11-4f80-a66d-b52a12c14b0e|1695802 |1857 |3 |
|23sds222-9669-429e-a95b-bc984ccf0fb0 |1695541 |2156 |1 |
|23sds222-9669-429e-a95b-bc984ccf0fb0 |1695541 |2103 |2 |
|23sds222-9669-429e-a95b-bc984ccf0fb0 |1695541 |2083 |3 |
|23sds222-9669-429e-a95b-bc984ccf0fb0 |1695541 |2082 |4 |
|23sds222-9669-429e-a95b-bc984ccf0fb0 |1695541 |2076 |5 |
Actual result I got with the above code:
+------------------------------------+-----------------+-----------+-------+
|data_rfe_id |seq_id |service_id|row_num|
+------------------------------------+-----------------+-----------+-------+
|9ih67fshs-de11-4f80-a66d-b52a12c14b0e|1695826 |4039 |1 |
|9ih67fshs-de11-4f80-a66d-b52a12c14b0e|1695821 |3356 |1 |
|9ih67fshs-de11-4f80-a66d-b52a12c14b0e|1695802 |1857 |1 |
|23sds222-9669-429e-a95b-bc984ccf0fb0 |1695541 |2156 |1 |
|23sds222-9669-429e-a95b-bc984ccf0fb0 |1695541 |2103 |1 |
|23sds222-9669-429e-a95b-bc984ccf0fb0 |1695541 |2083 |1 |
|23sds222-9669-429e-a95b-bc984ccf0fb0 |1695541 |2082 |1 |
|23sds222-9669-429e-a95b-bc984ccf0fb0 |1695541 |2076 |1 |
Could someone explain why I am getting these unexpected results, and how do I resolve it?

Basically you want to rank, with seq_id and service_id in descending order. Add rangeBetween with the range you need; rank may work for you. The following is a snippet of code:
val windowFunction = Window.partitionBy(df("data_rfe_id")).orderBy(df("seq_id").desc, df("service_id").desc).rangeBetween(-MAXNUMBER, MAXNUMBER)
val rankDF = df.withColumn("rank", rank().over(windowFunction))
As you are using an older version of Spark, I don't know whether it will work or not. There is an issue with WindowSpec; here is a reference.
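For completeness, a minimal sketch of the de-duplication the question describes, written against a newer Spark where the function is called row_number (rowNumber was the Spark 1.5 name); only the column names come from the question, everything else is illustrative:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// seq_id and service_id are strings in the posted schema, so this descending
// order is lexicographic; cast them (e.g. col("seq_id").cast("long")) if
// numeric ordering is intended.
val w = Window.partitionBy(col("data_rfe_id"))
  .orderBy(col("seq_id").desc, col("service_id").desc)

val deduped = df
  .withColumn("row_num", row_number().over(w))
  .filter(col("row_num") === 1)   // keep only the top record per data_rfe_id
  .drop("row_num")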

Related

Filter rows with minimum and maximum count

This is what the dataframe looks like:
+---+-----------------------------------------+-----+
|eco|eco_name |count|
+---+-----------------------------------------+-----+
|B63|Sicilian, Richter-Rauzer Attack |5 |
|D86|Grunfeld, Exchange |3 |
|C99|Ruy Lopez, Closed, Chigorin, 12...cd |5 |
|A44|Old Benoni Defense |3 |
|C46|Three Knights |1 |
|C08|French, Tarrasch, Open, 4.ed ed |13 |
|E59|Nimzo-Indian, 4.e3, Main line |2 |
|A20|English |2 |
|B20|Sicilian |4 |
|B37|Sicilian, Accelerated Fianchetto |2 |
|A33|English, Symmetrical |8 |
|C77|Ruy Lopez |8 |
|B43|Sicilian, Kan, 5.Nc3 |10 |
|A04|Reti Opening |6 |
|A59|Benko Gambit |1 |
|A54|Old Indian, Ukrainian Variation, 4.Nf3 |3 |
|D30|Queen's Gambit Declined |19 |
|C01|French, Exchange |3 |
|D75|Neo-Grunfeld, 6.cd Nxd5, 7.O-O c5, 8.dxc5|1 |
|E74|King's Indian, Averbakh, 6...c5 |2 |
+---+-----------------------------------------+-----+
Schema:
root
|-- eco: string (nullable = true)
|-- eco_name: string (nullable = true)
|-- count: long (nullable = false)
I want to filter it so that only two rows with minimum and maximum counts remain.
The output dataframe should look something like:
+---+-----------------------------------------+--------------------+
|eco|eco_name |number_of_occurences|
+---+-----------------------------------------+--------------------+
|D30|Queen's Gambit Declined |19 |
|C46|Three Knights |1 |
+---+-----------------------------------------+--------------------+
I'm a beginner, I'm really sorry if this is a stupid question.
No need to apologize, since this is the place to learn! One of the solutions is to use a Window and rank to find the min/max rows:
df = spark.createDataFrame(
[('a', 1), ('b', 1), ('c', 2), ('d', 3)],
schema=['col1', 'col2']
)
df.show(10, False)
+----+----+
|col1|col2|
+----+----+
|a |1 |
|b |1 |
|c |2 |
|d |3 |
+----+----+
Just use filtering to find the min/max count row after the ranking:
from pyspark.sql import functions as func
from pyspark.sql.window import Window

df\
.withColumn('min_row', func.rank().over(Window.orderBy(func.asc('col2'))))\
.withColumn('max_row', func.rank().over(Window.orderBy(func.desc('col2'))))\
.filter((func.col('min_row') == 1) | (func.col('max_row') == 1))\
.show(100, False)
+----+----+-------+-------+
|col1|col2|min_row|max_row|
+----+----+-------+-------+
|d |3 |4 |1 |
|a |1 |1 |3 |
|b |1 |1 |3 |
+----+----+-------+-------+
Please note that if the min/max row counts are the same, they will both be filtered out.
You can use the row_number function twice to order records by count, ascending and descending.
SELECT eco, eco_name, count
FROM (SELECT *,
row_number() over (order by count asc) as rna,
row_number() over (order by count desc) as rnd
FROM df)
WHERE rna = 1 or rnd = 1;
Note there's a tie for count = 1. If you care about it, add a secondary sort to control which record is selected, or use rank instead to select all tied rows.
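For illustration, a hedged sketch of the rank variant in the Scala DataFrame API (the snippets above use PySpark and SQL; the column names follow the question, everything else is illustrative). rank() gives tied rows the same value, so every row tied for the minimum or maximum count is kept:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, rank}

// Unpartitioned windows are fine here because the result set is tiny, but
// Spark will warn that all data moves to a single partition.
val ascW  = Window.orderBy(col("count").asc)
val descW = Window.orderBy(col("count").desc)

val extremes = df
  .withColumn("rna", rank().over(ascW))
  .withColumn("rnd", rank().over(descW))
  .filter(col("rna") === 1 || col("rnd") === 1)
  .drop("rna", "rnd")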

Skip records in dataframe's map transformation

I have a Spark dataframe on which I am doing certain operations as follows. I wanted to know how to skip certain records so that they do not go through all the operations:
finalDf = df.map(mapFunc).reduceGroups(reduceFunc).map(mapFunc2).write().format().option().mode().save();
In the mapFunc, I want to write logic such that, if a certain condition is true, nothing is returned and the record is essentially dropped from the further operations .reduceGroups(reduceFunc).map(mapFunc2).write().format().option().mode().save()
I tried returning Optional.empty() from the map, but the code fails in reduceGroups with the error below.
Exception in thread "main" java.util.NoSuchElementException: head of empty list
at scala.collection.immutable.Nil$.head(List.scala:420)
at scala.collection.immutable.Nil$.head(List.scala:417)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$$anonfun$5.apply(ExpressionEncoder.scala:121)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$$anonfun$5.apply(ExpressionEncoder.scala:120)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.immutable.List.map(List.scala:285)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.tuple(ExpressionEncoder.scala:120)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.tuple(ExpressionEncoder.scala:187)
at org.apache.spark.sql.expressions.ReduceAggregator.bufferEncoder(ReduceAggregator.scala:38)
at org.apache.spark.sql.expressions.Aggregator.toColumn(Aggregator.scala:100)
at org.apache.spark.sql.KeyValueGroupedDataset.reduceGroups(KeyValueGroupedDataset.scala:436)
at org.apache.spark.sql.KeyValueGroupedDataset.reduceGroups(KeyValueGroupedDataset.scala:448)
Schema for dataframe :
root
|-- id: string (nullable = true)
|-- mid: integer (nullable = true)
|-- responses: string (nullable = true)
|-- version: string (nullable = true)
In the mapFunc, I am processing the responses string by taking a substring and fetching a value from it. In some cases the responses string can be empty, and in such cases I don't want that record to appear in finalDf at all.
Input
id mid responses version
A 1 "hello123" 1
B 1 "hello456" 2
A 2 "hello789" 5
A 1 "hello143" 4
B 3 "hello153" 6
C 3 "" 1
Output (Grouping by id, mid column)
id mid responses version
A 1 "143" 4
B 1 "456" 2
A 2 "789" 5
B 3 "153" 6
If responses for the (id, mid) combination (C, 3) were not empty, then it would be in the output. So I want to remove the (C, 3) record in the mapFunc.
If you have a map function like the one below, you can just return the transformed row and filter afterwards with filter:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.trim
import spark.implicits._   // tuple encoder, toDF and $"..."

// Strip everything that is not a digit from the responses column.
val myFunc = (row: Row) => {
  val number = row.getString(2).replaceAll("\\D+", "")
  // version is a string in the posted schema, hence getString(3)
  (row.getString(0), row.getInt(1), number, row.getString(3))
}

df.map(myFunc)
  .toDF(df.columns: _*)
  .filter(trim($"responses") =!= "")   // drop records whose responses were empty
  .show(false)
Output:
+---+---+---------+-------+
|id |mid|responses|version|
+---+---+---------+-------+
|A |1 |123 |1 |
|B |1 |456 |2 |
|A |2 |789 |5 |
|A |1 |143 |4 |
|B |3 |153 |6 |
+---+---+---------+-------+
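If you would rather drop the record inside the map step itself, as the question asks, another option is flatMap, where a record can map to zero or one outputs. A minimal sketch, assuming a hypothetical case class Response that mirrors the posted schema:
case class Response(id: String, mid: Int, responses: String, version: String)

import spark.implicits._

val cleaned = df.as[Response].flatMap { r =>
  val number = r.responses.replaceAll("\\D+", "")
  // An empty Seq drops the record; a one-element Seq keeps the cleaned row.
  if (number.isEmpty) Seq.empty[Response]
  else Seq(r.copy(responses = number))
}
The subsequent .reduceGroups(...) and later steps then never see the empty-responses records.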

How to find out the number of unique elements for a column in a group in PySpark?

I have a PySpark dataframe-
df1 = spark.createDataFrame([
("u1", 1),
("u1", 2),
("u2", 1),
("u2", 1),
("u2", 1),
("u3", 3),
],
['user_id', 'var1'])
print(df1.printSchema())
df1.show(truncate=False)
Output-
root
|-- user_id: string (nullable = true)
|-- var1: long (nullable = true)
None
+-------+----+
|user_id|var1|
+-------+----+
|u1 |1 |
|u1 |2 |
|u2 |1 |
|u2 |1 |
|u2 |1 |
|u3 |3 |
+-------+----+
Now I want to group all the unique users and show the number of unique var1 values for each of them in a new column. The desired output would look like-
+-------+---------------+
|user_id|num_unique_var1|
+-------+---------------+
|u1 |2 |
|u2 |1 |
|u3 |1 |
+-------+---------------+
I can use collect_set and make a udf to find the set's length. But I think there must be a better way to do it.
How do I achieve this in one line of code?
df1.groupBy('user_id').agg(F.countDistinct('var1').alias('num')).show()
countDistinct is exactly what I needed.
Output-
+-------+---+
|user_id|num|
+-------+---+
| u3| 1|
| u2| 1|
| u1| 2|
+-------+---+
countDistinct is surely the best way to do it, but for the sake of completeness, what you said in your question is also possible without using a UDF. You can use size to get the length of the collect_set:
df1.groupBy('user_id').agg(F.size(F.collect_set('var1')).alias('num'))
This is helpful if you want to use it in a window function, because countDistinct is not supported in a window function.
e.g.
df1.withColumn('num', F.countDistinct('var1').over(Window.partitionBy('user_id')))
would fail, but
df1.withColumn('num', F.size(F.collect_set('var1')).over(Window.partitionBy('user_id')))
would work.

Is there a "key-wise map with state" in Spark?

I have an RDD (or DataFrame) of measurement data which is ordered by timestamp, and I need to do a pairwise operation on two subsequent records for the same key (e.g., doing a trapezium integration of accelerometer data to get velocities).
Is there a function in Spark that "remembers" the last record for each key and has it available when the next record for the same key arrives?
I currently thought of this approach:
Get all the keys of the RDD
Use a custom Partitioner to partition the RDD by the found keys so I know there is one partition for each key
Use mapPartitions to do the calculation
However, this approach has a flaw:
Getting the keys can be a very lengthy task because the input data can be several GiB or even TiB in size. I could write a custom InputFormat that just extracts the keys, which would be significantly faster (as I use Hadoop's API and sc.newAPIHadoopFile to get the data in the first place), but that would be one more thing to consider and an additional source of bugs.
So my question is: Is there anything like reduceByKey that doesn't aggregate the data but just gives me the current record and the last one for that key and lets me output one or more records based on that information?
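For reference, a rough sketch of the mapPartitions route outlined above, using repartitionAndSortWithinPartitions so that each partition holds whole keys in timestamp order; the ((key, timestamp), value) record shape and all names are hypothetical:
import org.apache.spark.Partitioner
import org.apache.spark.rdd.RDD

// Partition on the key component only, so every record of a key lands in the
// same partition; the sort then orders records by (key, timestamp) inside it.
class KeyOnlyPartitioner(n: Int) extends Partitioner {
  def numPartitions: Int = n
  def getPartition(key: Any): Int = {
    val k = key.asInstanceOf[(String, Long)]._1
    ((k.hashCode % n) + n) % n
  }
}

def pairwise(rdd: RDD[((String, Long), Double)], n: Int): RDD[(String, ((Long, Double), (Long, Double)))] =
  rdd.repartitionAndSortWithinPartitions(new KeyOnlyPartitioner(n))
    .mapPartitions { it =>
      // Slide a window of two over the sorted records and keep only pairs
      // that belong to the same key.
      it.sliding(2).collect {
        case Seq(((k1, t1), v1), ((k2, t2), v2)) if k1 == k2 =>
          (k1, ((t1, v1), (t2, v2)))
      }
    }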
Here is what you can do with a DataFrame:
import java.sql.Timestamp
import org.apache.spark.sql.types.{TimestampType, IntegerType}
import org.apache.spark.sql.functions._
Create a window for the lag function:
val w = org.apache.spark.sql.expressions.Window.partitionBy("key").orderBy("timestamp")
val df = spark.sparkContext.parallelize(List((1, 23, Timestamp.valueOf("2017-12-02 03:04:00")),
(1, 24, Timestamp.valueOf("2017-12-02 01:45:20")),
(1, 26, Timestamp.valueOf("2017-12-02 01:45:20")),
(1, 27, Timestamp.valueOf("2017-12-02 01:45:20")),
(2, 30, Timestamp.valueOf("2017-12-02 01:45:20")),
(2, 33, Timestamp.valueOf("2017-12-02 01:45:20")),
(2, 39, Timestamp.valueOf("2017-12-02 01:45:20")))).toDF("key","value","timestamp")
scala> df.printSchema
root
|-- key: integer (nullable = false)
|-- value: integer (nullable = false)
|-- timestamp: timestamp (nullable = true)
scala> val lagDF = df.withColumn("lag_value",lag("value", 1, 0).over(w))
lagDF: org.apache.spark.sql.DataFrame = [key: int, value: int ... 2 more fields]
The previous record and the current record are now in the same row:
scala> lagDF.show(10, false)
+---+-----+-------------------+---------+
|key|value|timestamp |lag_value|
+---+-----+-------------------+---------+
|1 |24 |2017-12-02 01:45:20|0 |
|1 |26 |2017-12-02 01:45:20|24 |
|1 |27 |2017-12-02 01:45:20|26 |
|1 |23 |2017-12-02 03:04:00|27 |
|2 |30 |2017-12-02 01:45:20|0 |
|2 |33 |2017-12-02 01:45:20|30 |
|2 |39 |2017-12-02 01:45:20|33 |
+---+-----+-------------------+---------+
Put your distance calculation logic here; I'm using a dummy operation for the demo:
scala> val result = lagDF.withColumn("dummy_operation_for_dist_calc", lagDF("value") - lagDF("lag_value"))
result: org.apache.spark.sql.DataFrame = [key: int, value: int ... 3 more fields]
scala> result.show(10, false)
+---+-----+-------------------+---------+-----------------------------+
|key|value|timestamp |lag_value|dummy_operation_for_dist_calc|
+---+-----+-------------------+---------+-----------------------------+
|1 |24 |2017-12-02 01:45:20|0 |24 |
|1 |26 |2017-12-02 01:45:20|24 |2 |
|1 |27 |2017-12-02 01:45:20|26 |1 |
|1 |23 |2017-12-02 03:04:00|27 |-4 |
|2 |30 |2017-12-02 01:45:20|0 |30 |
|2 |33 |2017-12-02 01:45:20|30 |3 |
|2 |39 |2017-12-02 01:45:20|33 |6 |
+---+-----+-------------------+---------+-----------------------------+
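Building on that, a hedged sketch of the trapezium step the question mentions: lag the timestamp as well and integrate each adjacent pair (delta_v and the other new column names are illustrative):
import org.apache.spark.sql.functions.{col, lag, unix_timestamp}

val velocityStep = lagDF
  .withColumn("prev_ts", lag("timestamp", 1).over(w))
  .withColumn("dt", unix_timestamp(col("timestamp")) - unix_timestamp(col("prev_ts")))
  // Trapezium rule: average of the two samples times the time gap. The first
  // row of each key has a null dt here, which you can drop or treat as zero.
  .withColumn("delta_v", (col("value") + col("lag_value")) / 2 * col("dt"))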

Spark merging two single value datasets

I have a Dataset with the following schema
|-- Name: string (nullable = true)
|-- Values: long (nullable = true)
|-- Count: integer (nullable = true)
Input Dataset
+------------+-----------------------+--------------+
|Name |Values |Count |
+------------+-----------------------+--------------+
|A |1000 |1 |
|B |1150 |0 |
|C |500 |3 |
+------------+-----------------------+--------------+
My result dataset needs to be of the format
+------------+-----------------------+--------------+
|Sum(count>0)| sum(all) | Percentage |
+------------+-----------------------+--------------+
| 1500 | 2650 | 56.60 |
+------------+-----------------------+--------------+
I am currently able to get the sum(count>0) and sum(all) in individual datasets by running
val non_zero = df.filter(col(COUNT).>(0)).select(sum(VALUES).as(NON_ZERO_SUM))
val total = df.select(sum(col(VALUES)).as(TOTAL_SUM))
I'm at a loss on what to do to merge the two independent datasets into a single dataset, with which I would calculate the percentage.
Also could this same problem be solved in a better way?
Thanks,
I'd use a single aggregation:
import org.apache.spark.sql.functions._
import spark.implicits._   // for the $"..." column syntax

df.select(
  sum(when($"count" > 0, $"values")).alias("NON_ZERO_SUM"),
  sum($"values").alias("TOTAL_SUM")
)
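To also get the Percentage column from the question, one hedged extension of the same aggregation (the two-decimal rounding matches the 56.60 in the expected output; the name result is illustrative):
import org.apache.spark.sql.functions.{round, sum, when}

val result = df
  .select(
    sum(when($"count" > 0, $"values")).alias("NON_ZERO_SUM"),
    sum($"values").alias("TOTAL_SUM")
  )
  .withColumn("Percentage", round($"NON_ZERO_SUM" / $"TOTAL_SUM" * 100, 2))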
