I have an RDD (or DataFrame) of measurement data, ordered by timestamp, and I need to perform a pairwise operation on two subsequent records for the same key (e.g., doing a trapezium integration of accelerometer data to get velocities).
Is there a function in Spark that "remembers" the last record for each key and has it available when the next record for the same key arrives?
I currently thought of this approach (rough sketch below):
1. Get all the keys of the RDD
2. Use a custom Partitioner to partition the RDD by the found keys, so I know there is one partition for each key
3. Use mapPartitions to do the calculation
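Roughly, what I had in mind is the untested sketch below (KeyPartitioner here is just a stand-in for the custom Partitioner, and collecting the keys up front is exactly the lengthy part I describe below):
import org.apache.spark.Partitioner
import org.apache.spark.rdd.RDD

// Stand-in for the custom Partitioner: one partition per distinct key
class KeyPartitioner(keys: Array[String]) extends Partitioner {
  private val index = keys.zipWithIndex.toMap
  def numPartitions: Int = keys.length
  def getPartition(key: Any): Int = index(key.asInstanceOf[String])
}

val keyed: RDD[(String, (Long, Double))] =            // (key, (timestamp, value))
  sc.parallelize(Seq(("a", (0L, 0.0)), ("a", (10L, 1.0)), ("a", (20L, 3.0)),
                     ("b", (0L, 2.0)), ("b", (10L, 2.5))))

val keys = keyed.keys.distinct().collect()            // step 1 -- the expensive part on large input

val pairwise = keyed
  .partitionBy(new KeyPartitioner(keys))              // step 2: one partition per key
  .mapPartitions { it =>                              // step 3: pairwise calculation
    it.toSeq.sortBy { case (_, (t, _)) => t }         // restore timestamp order after the shuffle
      .sliding(2)
      .collect { case Seq((k, (t1, v1)), (_, (t2, v2))) =>
        (k, (t2 - t1) * (v1 + v2) / 2.0)              // e.g. trapezium rule on two subsequent samples
      }
  }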
However, this approach has a flaw: getting the keys can be a very lengthy task, because the input data can be several GiB or even TiB in size. I could write a custom InputFormat to extract just the keys, which would be significantly faster (as I use Hadoop's API and sc.newAPIHadoopFile to get the data in the first place), but that would be an additional thing to consider and an additional source of bugs.
So my question is: Is there anything like reduceByKey that doesn't aggregate the data but just gives me the current record and the last one for that key and lets me output one or more records based on that information?
Here is what you can do with a DataFrame:
import java.sql.Timestamp
import org.apache.spark.sql.types.{TimestampType, IntegerType}
import org.apache.spark.sql.functions._
**Create a window for the lag function**
val w = org.apache.spark.sql.expressions.Window.partitionBy("key").orderBy("timestamp")
val df = spark.sparkContext.parallelize(List(
  (1, 23, Timestamp.valueOf("2017-12-02 03:04:00")),
  (1, 24, Timestamp.valueOf("2017-12-02 01:45:20")),
  (1, 26, Timestamp.valueOf("2017-12-02 01:45:20")),
  (1, 27, Timestamp.valueOf("2017-12-02 01:45:20")),
  (2, 30, Timestamp.valueOf("2017-12-02 01:45:20")),
  (2, 33, Timestamp.valueOf("2017-12-02 01:45:20")),
  (2, 39, Timestamp.valueOf("2017-12-02 01:45:20")))).toDF("key", "value", "timestamp")
scala> df.printSchema
root
|-- key: integer (nullable = false)
|-- value: integer (nullable = false)
|-- timestamp: timestamp (nullable = true)
scala> val lagDF = df.withColumn("lag_value",lag("value", 1, 0).over(w))
lagDF: org.apache.spark.sql.DataFrame = [key: int, value: int ... 2 more fields]
**The previous record and the current record are now in the same row**
scala> lagDF.show(10, false)
+---+-----+-------------------+---------+
|key|value|timestamp |lag_value|
+---+-----+-------------------+---------+
|1 |24 |2017-12-02 01:45:20|0 |
|1 |26 |2017-12-02 01:45:20|24 |
|1 |27 |2017-12-02 01:45:20|26 |
|1 |23 |2017-12-02 03:04:00|27 |
|2 |30 |2017-12-02 01:45:20|0 |
|2 |33 |2017-12-02 01:45:20|30 |
|2 |39 |2017-12-02 01:45:20|33 |
+---+-----+-------------------+---------+
**Put your distance calculation logic here. I'm using a dummy function for the demo**
scala> val result = lagDF.withColumn("dummy_operation_for_dist_calc", lagDF("value") - lagDF("lag_value"))
result: org.apache.spark.sql.DataFrame = [key: int, value: int ... 3 more fields]
scala> result.show(10, false)
+---+-----+-------------------+---------+-----------------------------+
|key|value|timestamp |lag_value|dummy_operation_for_dist_calc|
+---+-----+-------------------+---------+-----------------------------+
|1 |24 |2017-12-02 01:45:20|0 |24 |
|1 |26 |2017-12-02 01:45:20|24 |2 |
|1 |27 |2017-12-02 01:45:20|26 |1 |
|1 |23 |2017-12-02 03:04:00|27 |-4 |
|2 |30 |2017-12-02 01:45:20|0 |30 |
|2 |33 |2017-12-02 01:45:20|30 |3 |
|2 |39 |2017-12-02 01:45:20|33 |6 |
+---+-----+-------------------+---------+-----------------------------+
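To tie this back to the trapezium-integration use case from the question, the same window w can lag both the value and the timestamp. A rough sketch (treating value as the acceleration sample and using unix_timestamp, i.e. one-second resolution, for the time difference):
// Sketch only: lag both value and timestamp over the same window w
val lagged = df
  .withColumn("prev_value", lag("value", 1).over(w))
  .withColumn("prev_ts", lag("timestamp", 1).over(w))

// Trapezium rule: (v1 + v2) / 2 * (t2 - t1); the first row per key gets 0
val integrated = lagged.withColumn("trapezium",
  when(col("prev_value").isNotNull,
    (col("value") + col("prev_value")) / 2.0 *
      (unix_timestamp(col("timestamp")) - unix_timestamp(col("prev_ts"))))
    .otherwise(lit(0.0)))

integrated.show(10, false)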
This is what the dataframe looks like:
+---+-----------------------------------------+-----+
|eco|eco_name |count|
+---+-----------------------------------------+-----+
|B63|Sicilian, Richter-Rauzer Attack |5 |
|D86|Grunfeld, Exchange |3 |
|C99|Ruy Lopez, Closed, Chigorin, 12...cd |5 |
|A44|Old Benoni Defense |3 |
|C46|Three Knights |1 |
|C08|French, Tarrasch, Open, 4.ed ed |13 |
|E59|Nimzo-Indian, 4.e3, Main line |2 |
|A20|English |2 |
|B20|Sicilian |4 |
|B37|Sicilian, Accelerated Fianchetto |2 |
|A33|English, Symmetrical |8 |
|C77|Ruy Lopez |8 |
|B43|Sicilian, Kan, 5.Nc3 |10 |
|A04|Reti Opening |6 |
|A59|Benko Gambit |1 |
|A54|Old Indian, Ukrainian Variation, 4.Nf3 |3 |
|D30|Queen's Gambit Declined |19 |
|C01|French, Exchange |3 |
|D75|Neo-Grunfeld, 6.cd Nxd5, 7.O-O c5, 8.dxc5|1 |
|E74|King's Indian, Averbakh, 6...c5 |2 |
+---+-----------------------------------------+-----+
Schema:
root
|-- eco: string (nullable = true)
|-- eco_name: string (nullable = true)
|-- count: long (nullable = false)
I want to filter it so that only two rows with minimum and maximum counts remain.
The output dataframe should look something like:
+---+-----------------------------------------+--------------------+
|eco|eco_name |number_of_occurences|
+---+-----------------------------------------+--------------------+
|D30|Queen's Gambit Declined |19 |
|C46|Three Knights |1 |
+---+-----------------------------------------+--------------------+
I'm a beginner, I'm really sorry if this is a stupid question.
No need to apologize since this is the place to learn! One of the solutions is to use a Window with rank to find the min/max rows:
from pyspark.sql import functions as func
from pyspark.sql.window import Window

df = spark.createDataFrame(
    [('a', 1), ('b', 1), ('c', 2), ('d', 3)],
    schema=['col1', 'col2']
)
df.show(10, False)
+----+----+
|col1|col2|
+----+----+
|a |1 |
|b |1 |
|c |2 |
|d |3 |
+----+----+
Just use filtering to find the min/max count row after the ranking:
df\
    .withColumn('min_row', func.rank().over(Window.orderBy(func.asc('col2'))))\
    .withColumn('max_row', func.rank().over(Window.orderBy(func.desc('col2'))))\
    .filter((func.col('min_row') == 1) | (func.col('max_row') == 1))\
    .show(100, False)
+----+----+-------+-------+
|col1|col2|min_row|max_row|
+----+----+-------+-------+
|d |3 |4 |1 |
|a |1 |1 |3 |
|b |1 |1 |3 |
+----+----+-------+-------+
Please note that if the minimum and maximum counts happen to be equal, every row ties for both ranks, so they will all be returned by the filter.
You can use the row_number function twice to rank records by count, ascending and descending.
SELECT eco, eco_name, count
FROM (SELECT *,
row_number() over (order by count asc) as rna,
row_number() over (order by count desc) as rnd
FROM df)
WHERE rna = 1 or rnd = 1;
Note there's a tie for count = 1. If you care about it, add a secondary sort to control which record is selected (see the sketch below), or use rank instead to select all tied rows.
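For completeness, here is a Spark (Scala) DataFrame sketch of the same idea with eco as a secondary sort key to break the tie deterministically (column names taken from the question; this is only one possible tie-breaker):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// A secondary sort on eco makes the selected row deterministic when counts tie
val byCountAsc  = Window.orderBy(col("count").asc, col("eco").asc)
val byCountDesc = Window.orderBy(col("count").desc, col("eco").asc)

df.withColumn("rna", row_number().over(byCountAsc))
  .withColumn("rnd", row_number().over(byCountDesc))
  .filter(col("rna") === 1 || col("rnd") === 1)
  .drop("rna", "rnd")
  .show(false)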
I have a dataframe consisting of person, transaction_id & is_successful. The dataframe contains duplicate values of person with different transaction_ids, and is_successful is True/False for each transaction.
I would like to derive a new dataframe with one record per person, containing that person's latest transaction_id, and with is_successful set to True if any of their transactions were successful.
val input_df = sc.parallelize(Seq((1,1, "True"), (1,2, "False"), (2,1, "False"), (2,2, "False"), (2,3, "True"), (3,1, "False"), (3,2, "False"), (3,3, "False"))).toDF("person","transaction_id", "is_successful")
input_df: org.apache.spark.sql.DataFrame = [person: int, transaction_id: int ... 1 more field]
input_df.show(false)
+------+--------------+-------------+
|person|transaction_id|is_successful|
+------+--------------+-------------+
|1 |1 |True |
|1 |2 |False |
|2 |1 |False |
|2 |2 |False |
|2 |3 |True |
|3 |1 |False |
|3 |2 |False |
|3 |3 |False |
+------+--------------+-------------+
Expected Df:
+------+--------------+-------------+
|person|transaction_id|is_successful|
+------+--------------+-------------+
|1 |2 |True |
|2 |3 |True |
|3 |3 |False |
+------+--------------+-------------+
How can we derive the dataframe like above?
What you can do is the following in Spark SQL:
select person,max(transaction_id) as transaction_id,max(is_successful) as is_successful from <table_name> group by person
Leave the complex work to the max operator. Under the max operation, 'True' sorts above 'False', so if one of your persons has three False and one True, the max of those is True.
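If you prefer the DataFrame API over SQL, a minimal sketch of the same trick (it relies on the string ordering 'True' > 'False', as in the example data):
import org.apache.spark.sql.functions.max

// "True" > "False" lexicographically, so max is True if any transaction succeeded
input_df.groupBy("person")
  .agg(max("transaction_id").as("transaction_id"),
       max("is_successful").as("is_successful"))
  .show(false)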
You may achieve this by grouping your dataframe on person and finding the max transaction_id and max is_successful.
I've included an example below of how this may be achieved using Spark SQL.
First, I created a temporary view of your dataframe so it can be accessed using Spark SQL, then I ran the following SQL statement:
input_df.createOrReplaceTempView("input_df");
val result_df = sparkSession.sql("<insert sql below here>");
The sql statement groups the data for each person before using max to determine the last transaction id and a combination of max (sum could be used with the same logic also) and case expressions to derive the is_successful value. The case expression is nested as I've converted True to a numeric value of 1 and False to 0 to leverage a numeric comparison. This is within an outer case expression which checks if the max value is > 0 (i.e. any value was successful) before printing True/False.
SELECT
    person,
    MAX(transaction_id) as transaction_id,
    CASE
        WHEN MAX(
            CASE
                WHEN is_successful = 'True' THEN 1
                ELSE 0
            END
        ) > 0 THEN 'True'
        ELSE 'False'
    END as is_successful
FROM
    input_df
GROUP BY
    person
Here is @ggordon's SQL answer in its DataFrame version.
import org.apache.spark.sql.functions._
import spark.implicits._

input_df.groupBy("person")
  .agg(max("transaction_id").as("transaction_id"),
       when(max(when('is_successful === "True", 1).otherwise(0)) > 0, "True")
         .otherwise("False").as("is_successful"))
I have a PySpark dataframe-
df1 = spark.createDataFrame([
        ("u1", 1),
        ("u1", 2),
        ("u2", 1),
        ("u2", 1),
        ("u2", 1),
        ("u3", 3),
    ],
    ['user_id', 'var1'])
print(df1.printSchema())
df1.show(truncate=False)
Output-
root
|-- user_id: string (nullable = true)
|-- var1: long (nullable = true)
None
+-------+----+
|user_id|var1|
+-------+----+
|u1 |1 |
|u1 |2 |
|u2 |1 |
|u2 |1 |
|u2 |1 |
|u3 |3 |
+-------+----+
Now I want to group all the unique users and show the number of unique var1 values for each of them in a new column. The desired output would look like-
+-------+---------------+
|user_id|num_unique_var1|
+-------+---------------+
|u1 |2 |
|u2 |1 |
|u3 |1 |
+-------+---------------+
I can use collect_set and make a udf to find the set's length. But I think there must be a better way to do it.
How do I achieve this in one line of code?
from pyspark.sql import functions as F
df1.groupBy('user_id').agg(F.countDistinct('var1').alias('num')).show()
countDistinct is exactly what I needed.
Output-
+-------+---+
|user_id|num|
+-------+---+
| u3| 1|
| u2| 1|
| u1| 2|
+-------+---+
countDistinct is surely the best way to do it, but for the sake of completeness, what you said in your question is also possible without using a UDF. You can use size to get the length of the collect_set:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

df1.groupBy('user_id').agg(F.size(F.collect_set('var1')).alias('num'))
This is helpful if you want to use it in a window function, because countDistinct is not supported over a window.
e.g.
df1.withColumn('num', F.countDistinct('var1').over(Window.partitionBy('user_id')))
would fail, but
df1.withColumn('num', F.size(F.collect_set('var1')).over(Window.partitionBy('user_id')))
would work.
It seems like the Spark SQL window function is not working properly.
I am running a Spark job on a Hadoop cluster where the HDFS block size is 128 MB, with Spark version 1.5 on CDH 5.5.
My requirement:
If there are multiple records with the same data_rfe_id, then take the single record with the maximum seq_id and maximum service_id.
I see that in the raw data there are some records with the same data_rfe_id and the same seq_id, hence I applied row_number using a Window function so that I can filter the records with row_num === 1.
But it does not seem to work on huge datasets; I see that the same row number is applied.
Why is it happening like this?
Do I need to reshuffle the data before I apply the window function on the dataframe?
I am expecting a unique rank number for each data_rfe_id.
I want to use a Window function only to achieve this.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.rowNumber
.....
scala> df.printSchema
root
|-- transitional_key: string (nullable = true)
|-- seq_id: string (nullable = true)
|-- data_rfe_id: string (nullable = true)
|-- service_id: string (nullable = true)
|-- event_start_date_time: string (nullable = true)
|-- event_id: string (nullable = true)
val windowFunction = Window.partitionBy(df("data_rfe_id")).orderBy(df("seq_id").desc,df("service_id").desc)
val rankDF =df.withColumn("row_num",rowNumber.over(windowFunction))
rankDF.select("data_rfe_id","seq_id","service_id","row_num").show(200,false)
Expected result:
+------------------------------------+-----------------+-----------+-------+
|data_rfe_id |seq_id |service_id|row_num|
+------------------------------------+-----------------+-----------+-------+
|9ih67fshs-de11-4f80-a66d-b52a12c14b0e|1695826 |4039 |1 |
|9ih67fshs-de11-4f80-a66d-b52a12c14b0e|1695821 |3356 |2 |
|9ih67fshs-de11-4f80-a66d-b52a12c14b0e|1695802 |1857 |3 |
|23sds222-9669-429e-a95b-bc984ccf0fb0 |1695541 |2156 |1 |
|23sds222-9669-429e-a95b-bc984ccf0fb0 |1695541 |2103 |2 |
|23sds222-9669-429e-a95b-bc984ccf0fb0 |1695541 |2083 |3 |
|23sds222-9669-429e-a95b-bc984ccf0fb0 |1695541 |2082 |4 |
|23sds222-9669-429e-a95b-bc984ccf0fb0 |1695541 |2076 |5 |
Actual result I got with the above code:
+------------------------------------+-----------------+-----------+-------+
|data_rfe_id |seq_id |service_id|row_num|
+------------------------------------+-----------------+-----------+-------+
|9ih67fshs-de11-4f80-a66d-b52a12c14b0e|1695826 |4039 |1 |
|9ih67fshs-de11-4f80-a66d-b52a12c14b0e|1695821 |3356 |1 |
|9ih67fshs-de11-4f80-a66d-b52a12c14b0e|1695802 |1857 |1 |
|23sds222-9669-429e-a95b-bc984ccf0fb0 |1695541 |2156 |1 |
|23sds222-9669-429e-a95b-bc984ccf0fb0 |1695541 |2103 |1 |
|23sds222-9669-429e-a95b-bc984ccf0fb0 |1695541 |2083 |1 |
|23sds222-9669-429e-a95b-bc984ccf0fb0 |1695541 |2082 |1 |
|23sds222-9669-429e-a95b-bc984ccf0fb0 |1695541 |2076 |1 |
Could someone explain to me why I am getting these unexpected results, and how do I resolve it?
Basically you want to rank with seq_id and service_id in descending order. Add rangeBetween with the range you need; rank may work for you. The following is a snippet of code:
import org.apache.spark.sql.functions.rank
val windowFunction = Window.partitionBy(df("data_rfe_id"))
  .orderBy(df("seq_id").desc, df("service_id").desc).rangeBetween(-MAXNUMBER, MAXNUMBER)
val rankDF = df.withColumn("rank", rank().over(windowFunction))
As you are using an older version of Spark, I don't know whether it will work or not. There is an issue with WindowSpec; here is a reference.
I have an input dataframe in the following format:
+----+------+-----+----------+
|name|values|score|row_number|
+----+------+-----+----------+
|A   |1000  |0    |1         |
|B   |947   |0    |2         |
|C   |923   |1    |3         |
|D   |900   |2    |4         |
|E   |850   |3    |5         |
|F   |800   |1    |6         |
+----+------+-----+----------+
I need to get sum(values) when score > 0 and row_number < K, i.e., the SUM of all values when score > 0 for the top K values in the dataframe.
I am able to achieve this by running the following query for the top 100 values:
val top_100_data = df.select(
  count(when(col("score") > 0 and col("row_number") <= 100, col("values"))).alias("count_100"),
  sum(when(col("score") > 0 and col("row_number") <= 100, col("values"))).alias("sum_filtered_100"),
  sum(when(col("row_number") <= 100, col("values"))).alias("total_sum_100")
)
However, I need to fetch data for the top 100, 200, 300, ..., 2500, meaning I would need to run this query 25 times and finally union 25 dataframes.
I'm new to spark and still figuring lots of things out. What would be the best approach to solve this problem?
Thanks!!
You can create an Array of limits as
val topFilters = Array(100, 200, 300) // you can add more
Then you can loop through the topFilters array and create the dataframe you require. I suggest you use join rather than union, as join will give you separate columns while union will give you separate rows. You can do the following.
Given your dataframe as
+----+------+-----+----------+
|name|values|score|row_number|
+----+------+-----+----------+
|A |1000 |0 |1 |
|B |947 |0 |2 |
|C |923 |1 |3 |
|D |900 |2 |200 |
|E |850 |3 |150 |
|F |800 |1 |250 |
+----+------+-----+----------+
You can do it by using the topFilters array defined above, as follows:
import sqlContext.implicits._
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

var finalDF: DataFrame = Seq("1").toDF("rowNum")

for (k <- topFilters) {
  val top_100_data = df.select(lit("1").as("rowNum"),
    sum(when(col("score") > 0 && col("row_number") < k, col("values"))).alias(s"total_sum_$k"))
  finalDF = finalDF.join(top_100_data, Seq("rowNum"))
}
finalDF.show(false)
This should give you the final dataframe as:
+------+-------------+-------------+-------------+
|rowNum|total_sum_100|total_sum_200|total_sum_300|
+------+-------------+-------------+-------------+
|1 |923 |1773 |3473 |
+------+-------------+-------------+-------------+
You can do the same for your 25 limits that you have.
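For the 25 limits mentioned in the question (the top 100 up to 2500 in steps of 100), the array can simply be generated rather than written out:
// 100, 200, ..., 2500
val topFilters = (1 to 25).map(_ * 100).toArray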
If you intend to use union, then the idea is similar to above.
I hope the answer is helpful
Updated
If you require union, then you can apply the following logic with the same limits array defined above:
var finalDF: DataFrame = Seq((0, 0, 0, 0)).toDF("limit", "count", "sum_filtered", "total_sum")

for (k <- topFilters) {
  val top_100_data = df.select(lit(k).as("limit"),
    count(when(col("score") > 0 and col("row_number") <= k, col("values"))).alias("count"),
    sum(when(col("score") > 0 and col("row_number") <= k, col("values"))).alias("sum_filtered"),
    sum(when(col("row_number") <= k, col("values"))).alias("total_sum"))
  finalDF = finalDF.union(top_100_data)
}
finalDF.filter(col("limit") =!= 0).show(false)
which should give you
+-----+-----+------------+---------+
|limit|count|sum_filtered|total_sum|
+-----+-----+------------+---------+
|100 |1 |923 |2870 |
|200 |3 |2673 |4620 |
|300 |4 |3473 |5420 |
+-----+-----+------------+---------+