I have a DataFrame with columns of start_time and end_time. I want to set windows, with each observation's window being the two rows before it by end time, restricted to data with an end_time before that observation's start_time.
Example data:
data = [('a', 10, 12, 5),('b', 20, 25, 10),('c', 30, 60, 15),('d', 40, 45, 20),('e', 50, 70, 25)]
df = sqlContext.createDataFrame(data, ['name', 'start_time', 'end_time', 'resource'])
+----+----------+--------+--------+
|name|start_time|end_time|resource|
+----+----------+--------+--------+
| a| 10| 12| 5|
| b| 20| 25| 10|
| c| 30| 60| 15|
| d| 40| 45| 20|
| e| 50| 70| 25|
+----+----------+--------+--------+
So the window for 'e' should include 'b' and 'd', but not 'c'
Without the restriction of end time < start time, I was able to use
from pyspark.sql import Window
from pyspark.sql import functions as func
window = Window.orderBy("name").rowsBetween(-2, -1)
df.select('*', func.avg("resource").over(window).alias("avg")).show()
I looked into rangeBetween() but I can't figure out a way to reference the start_time of the current row, or that I want to restrict it by the end_time of the other rows. There's Window.currentRow, but in this example it would only reference the value for resource
Is this possible to do using Window? Should I be trying something else entirely?
Edit: Using Spark 2.1.1 and Python 2.7+ if it matters.
You can actually use the groupBy function to aggregate within the different partitions and then use an inner join between the output dataframes on the common key. partitionBy and window functions take much more time in Spark, so it is better to use groupBy instead if you can.
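One possible reading of this suggestion, as a rough PySpark sketch (untested, reusing the df from the question): self-join each row to every row whose end_time precedes its start_time, keep the two most recent matches per name, and join the averages back. The ranking window here is partitioned by name, so it avoids a single global sort.
from pyspark.sql import Window, functions as F

cur, prev = df.alias("cur"), df.alias("prev")

# pair every row with the earlier rows that ended before it started
matches = (cur.join(prev, F.col("prev.end_time") < F.col("cur.start_time"))
              .select(F.col("cur.name").alias("name"),
                      F.col("prev.end_time").alias("prev_end_time"),
                      F.col("prev.resource").alias("prev_resource")))

# rank the matches per name by recency and keep the two latest
w = Window.partitionBy("name").orderBy(F.col("prev_end_time").desc())
avgs = (matches.withColumn("rn", F.row_number().over(w))
               .filter("rn <= 2")
               .groupBy("name")
               .agg(F.avg("prev_resource").alias("avg")))

df.join(avgs, "name", "left").show()  # names with no qualifying prior rows get null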
I don't think this is possible purely using windows. From a given row, you need to be able to work in reverse sort order back through prior rows until you have two hits which satisfy your condition.
You could use a window function to create a list of all previous values encountered for each row, and then a UDF with some pure scala/python to determine the sum, accounting for your exclusions.
In Scala:
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{collect_list, struct, udf}

// collect every earlier row (ordered by end_time) into a list for each row
val window = Window.partitionBy(???).orderBy("end_time").rowsBetween(Long.MinValue, -1)
// the UDF applies the start_time restriction and picks the last two rows
val udfWithSelectionLogic = udf { values: Seq[Row] => INSERT_LOGIC_HERE_TO_CALCULATE_AGGREGATE }
val dataPlus = data.withColumn("combined", struct($"start_time", $"end_time", $"resource"))
  .withColumn("collected", collect_list($"combined") over window)
  .withColumn("result", udfWithSelectionLogic($"collected"))
This isn't ideal, but might be helpful.
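For reference, a rough PySpark translation of the same idea (an untested sketch): collect the prior rows by end_time into a list, then let a plain Python UDF apply the start_time restriction and average the two most recent qualifying rows.
from pyspark.sql import Window, functions as F
from pyspark.sql.types import DoubleType

w = Window.orderBy("end_time").rowsBetween(Window.unboundedPreceding, -1)

def avg_resource(rows, start_time):
    rows = rows or []
    # keep prior rows that ended before this row started, take the two latest
    valid = sorted([r for r in rows if r.end_time < start_time],
                   key=lambda r: r.end_time, reverse=True)[:2]
    return float(sum(r.resource for r in valid)) / len(valid) if valid else None

avg_resource_udf = F.udf(avg_resource, DoubleType())

(df.withColumn("collected", F.collect_list(F.struct("end_time", "resource")).over(w))
   .withColumn("avg", avg_resource_udf("collected", "start_time"))
   .show())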
Related
I have a Spark dataframe in Python, and it is in a specific order where the rows can be sectioned into the right groups according to a column "start_of_section", which has values 1 or 0. For each collection of rows that needs to be grouped together, every column other than "value" and "start_of_section" is equal. I want to collapse each such collection into one row that has the same values for every other column and a column "list_values" which holds an array of all the values in those rows.
So some rows might look like:
Row(category=fruit, object=apple, value=60, start_of_section=1)
Row(category=fruit, object=apple, value=160, start_of_section=0)
Row(category=fruit, object=apple, value=30, start_of_section=0)
and in the new dataframe this would be
Row(category=fruit, object=apple, list_values=[60, 160, 30])
(Edit: note that the column "start_of_section" should not have been included in the final dataframe.)
The issue I've had in trying to research the answer is that I've only found ways of grouping by column value without regard for ordering, so that approach would wrongly produce just two rows: one grouping all rows with "start_of_section"=1 and one grouping all rows with "start_of_section"=0.
What code can achieve this?
Assuming your order column is order_col
df.show()
+--------+------+---------+----------------+-----+
|category|object|order_col|start_of_section|value|
+--------+------+---------+----------------+-----+
| fruit| apple| 1| 1| 60|
| fruit| apple| 2| 0| 160|
| fruit| apple| 3| 0| 30|
| fruit| apple| 4| 1| 50|
+--------+------+---------+----------------+-----+
You need to generate an id to group the lines of the same section together, then group by this id and the dimensions you want. Here is how you do it:
from pyspark.sql import functions as F, Window as W
df.withColumn(
    "id",
    F.sum("start_of_section").over(
        W.partitionBy("category", "object").orderBy("order_col")
    ),
).groupBy("category", "object", "id").agg(
    F.collect_list("value").alias("values")
).drop("id").show()
+--------+------+-------------+
|category|object| values|
+--------+------+-------------+
| fruit| apple|[60, 160, 30]|
| fruit| apple| [50]|
+--------+------+-------------+
EDIT: If you do not have any order_col, this is an impossible task. Think of the lines in a dataframe as marbles in a bag: they do not have any order. You can order them as you pull them out of the bag according to some criterion, but otherwise you cannot assume any order. show is just you pulling 10 marbles (lines) out of the bag. The order may be the same each time you do it, but it can suddenly change, and you have no control over it.
Well, now I got it. You can do a group by on a column that holds the running sum of start_of_section. To be sure of the result, you should include an ordering column.
from pyspark.sql.types import Row
from pyspark.sql.functions import *
from pyspark.sql import Window
data = [Row(category='fruit', object='apple', value=60, start_of_section=1),
Row(category='fruit', object='apple', value=160, start_of_section=0),
Row(category='fruit', object='apple', value=30, start_of_section=0),
Row(category='fruit', object='apple', value=50, start_of_section=1),
Row(category='fruit', object='apple', value=30, start_of_section=0),
Row(category='fruit', object='apple', value=60, start_of_section=1),
Row(category='fruit', object='apple', value=110, start_of_section=0)]
df = spark.createDataFrame(data)
w = Window.partitionBy('category', 'object').rowsBetween(Window.unboundedPreceding, Window.currentRow)
df.withColumn('group', sum('start_of_section').over(w)) \
.groupBy('category', 'object', 'group').agg(collect_list('value').alias('list_value')) \
.drop('group').show()
+--------+------+-------------+
|category|object| list_value|
+--------+------+-------------+
| fruit| apple|[60, 160, 30]|
| fruit| apple| [50, 30]|
| fruit| apple| [60, 110]|
+--------+------+-------------+
FAILS: monotonically_increasing_id fails when you have many partitions.
df.repartition(7) \
.withColumn('id', monotonically_increasing_id()) \
.withColumn('group', sum('start_of_section').over(w)) \
.groupBy('category', 'object', 'group').agg(collect_list('value').alias('list_value')) \
.drop('group').show()
+--------+------+--------------------+
|category|object| list_value|
+--------+------+--------------------+
| fruit| apple| [60]|
| fruit| apple|[60, 160, 30, 30,...|
| fruit| apple| [50]|
+--------+------+--------------------+
This is not what we want at all.
I have the following scenario with my data set. I need to sum the values of one column without changing the other columns. For example,
Here is my data set
data_set,vol,channel
Dak,10,ABC
Fak,20,CNN
Mok,10,BBC
my expected output is
data_set,vol,channel,sum(vol)
Dak,10,ABC,40
Fak,20,CNN,40
Mok,10,BBC,40
Is there any way to achieve this without a join? I need an optimised result.
You can do this in the following way:
import org.apache.spark.sql.functions.lit
import spark.implicits._
val df = Seq(("Dak",10,"ABC"),
("Fak",20,"CNN"),
("Mok",10,"BBC")).toDF("data_set","vol","channel")
val sum_df = df.withColumn("vol_sum", lit(df.groupBy().sum("vol").collect()(0).getLong(0)))
sum_df.show()
+--------+---+-------+-------+
|data_set|vol|channel|vol_sum|
+--------+---+-------+-------+
| Dak| 10| ABC| 40|
| Fak| 20| CNN| 40|
| Mok| 10| BBC| 40|
+--------+---+-------+-------+
Hopefully it'll help you.
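If you are working in PySpark, an alternative sketch (untested) is to compute the total with an aggregate over a window spanning the whole DataFrame, which avoids collecting to the driver. Note that an empty partitionBy() pulls all rows into a single partition, so it only suits data that fits comfortably on one executor.
from pyspark.sql import Window, functions as F

# hypothetical df with columns data_set, vol, channel
w = Window.partitionBy()
df.withColumn("vol_sum", F.sum("vol").over(w)).show()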
I have data that looks like this:
userid,eventtime,location_point
4e191908,2017-06-04 03:00:00,18685891
4e191908,2017-06-04 03:04:00,18685891
3136afcb,2017-06-04 03:03:00,18382821
661212dd,2017-06-04 03:06:00,80831484
40e8a7c3,2017-06-04 03:12:00,18825769
I would like to add a new boolean column that marks true if there are 2 or more userid within a 5-minute window at the same location_point. I had the idea of using the lag function to look up values over a window partitioned by userid, with the range between the current timestamp and the next 5 minutes:
from pyspark.sql import functions as F
from pyspark.sql import Window as W
from pyspark.sql.functions import col
days = lambda i: i * 60*5
windowSpec = W.partitionBy(col("userid")).orderBy(col("eventtime").cast("timestamp").cast("long")).rangeBetween(0, days(5))
lastURN = F.lag(col("location_point"), 1).over(windowSpec)
visitCheck = (last_location_point == output.location_pont)
output.withColumn("visit_check", visitCheck).select("userid","eventtime", "location_pont", "visit_check")
This code is giving me an analysis exception when I use the RangeBetween function:
AnalysisException: u'Window Frame RANGE BETWEEN CURRENT ROW AND 1500
FOLLOWING must match the required frame ROWS BETWEEN 1 PRECEDING AND 1
PRECEDING;
Do you know any way to tackle this problem?
Given your data:
Let's add a column with a timestamp in seconds:
df = df.withColumn('timestamp', df.eventtime.cast('timestamp').cast('long'))
df.show()
+--------+-------------------+--------------+----------+
| userid| eventtime|location_point| timestamp|
+--------+-------------------+--------------+----------+
|4e191908|2017-06-04 03:00:00| 18685891|1496545200|
|4e191908|2017-06-04 03:04:00| 18685891|1496545440|
|3136afcb|2017-06-04 03:03:00| 18382821|1496545380|
|661212dd|2017-06-04 03:06:00| 80831484|1496545560|
|40e8a7c3|2017-06-04 03:12:00| 18825769|1496545920|
|4e191908|2017-06-04 03:11:30| 18685891|1496545890|
+--------+-------------------+--------------+----------+
Now, let's define a window function with a partition by location_point, an order by timestamp, and a range between -300 s and the current time. We can count the number of elements in this window and put that count in a column named 'occurrences_in_5_min':
from pyspark.sql import Window, functions as F
w = Window.partitionBy('location_point').orderBy('timestamp').rangeBetween(-60*5, 0)
df = df.withColumn('occurrences_in_5_min', F.count('timestamp').over(w))
df.show()
+--------+-------------------+--------------+----------+--------------------+
| userid| eventtime|location_point| timestamp|occurrences_in_5_min|
+--------+-------------------+--------------+----------+--------------------+
|40e8a7c3|2017-06-04 03:12:00| 18825769|1496545920| 1|
|3136afcb|2017-06-04 03:03:00| 18382821|1496545380| 1|
|661212dd|2017-06-04 03:06:00| 80831484|1496545560| 1|
|4e191908|2017-06-04 03:00:00| 18685891|1496545200| 1|
|4e191908|2017-06-04 03:04:00| 18685891|1496545440| 2|
|4e191908|2017-06-04 03:11:30| 18685891|1496545890| 1|
+--------+-------------------+--------------+----------+--------------------+
Now you can add the desired column, set to True when the number of occurrences is strictly greater than 1 in the last 5 minutes at a particular location:
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType
add_bool = udf(lambda c: c > 1, BooleanType())
df = df.withColumn('already_occured', add_bool('occurrences_in_5_min'))
df.show()
+--------+-------------------+--------------+----------+--------------------+---------------+
| userid| eventtime|location_point| timestamp|occurrences_in_5_min|already_occured|
+--------+-------------------+--------------+----------+--------------------+---------------+
|40e8a7c3|2017-06-04 03:12:00| 18825769|1496545920| 1| false|
|3136afcb|2017-06-04 03:03:00| 18382821|1496545380| 1| false|
|661212dd|2017-06-04 03:06:00| 80831484|1496545560| 1| false|
|4e191908|2017-06-04 03:00:00| 18685891|1496545200| 1| false|
|4e191908|2017-06-04 03:04:00| 18685891|1496545440| 2| true|
|4e191908|2017-06-04 03:11:30| 18685891|1496545890| 1| false|
+--------+-------------------+--------------+----------+--------------------+---------------+
rangeBetween just doesn't make sense for a non-aggregate function like lag. lag always takes a specific row, denoted by the offset argument, so specifying a frame is pointless.
To get a window over time series you can use window grouping with standard aggregates:
from pyspark.sql.functions import window, countDistinct
(df
.groupBy("location_point", window("eventtime", "5 minutes"))
.agg( countDistinct("userid")))
You can add more arguments to modify slide duration.
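For example, a 5-minute window that slides every minute could look like this (a small sketch using the optional slideDuration argument):
(df
    .groupBy("location_point", window("eventtime", "5 minutes", "1 minute"))
    .agg(countDistinct("userid")))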
You can try something similar with window functions if you partition by location_point (note that distinct aggregates such as countDistinct are not supported over a window, so size(collect_set(...)) stands in for it):
windowSpec = (W.partitionBy(col("location_point"))
    .orderBy(col("eventtime").cast("timestamp").cast("long"))
    .rangeBetween(0, 5 * 60))
df.withColumn("id_count", F.size(F.collect_set("userid")).over(windowSpec))
I am working on a dataset which represents a stream of events (like tracking events fired from a website). All the events have a timestamp. One use case we often have is trying to find the first non-null value for a given field. So, for example, something like this gets us most of the way there:
val eventsDf = spark.read.json(jsonEventsPath)
case class ProjectedFields(visitId: String, userId: Int, timestamp: Long ... )
val projectedEventsDs = eventsDf.select(
eventsDf("message.visit.id").alias("visitId"),
eventsDf("message.property.user_id").alias("userId"),
eventsDf("message.property.timestamp"),
...
).as[ProjectedFields]
projectedEventsDs.groupBy($"visitId").agg(first($"userId", true))
The problem with the above code is that the order of the data being fed into the first aggregation function is not guaranteed. I would like it to be sorted by timestamp, to ensure that it is the first non-null userId by timestamp rather than any random non-null userId.
Is there a way to define the sorting within a grouping?
Using Spark 2.1.0
BTW, the way suggested for Spark 2.1.0 in "SPARK DataFrame: select the first row of each group" is to do the ordering before the grouping -- that doesn't work. For example, the following code:
case class OrderedKeyValue(key: String, value: String, ordering: Int)
val ds = Seq(
OrderedKeyValue("a", null, 1),
OrderedKeyValue("a", null, 2),
OrderedKeyValue("a", "x", 3),
OrderedKeyValue("a", "y", 4),
OrderedKeyValue("a", null, 5)
).toDS()
ds.orderBy("ordering").groupBy("key").agg(first("value", true)).collect()
Will sometimes return Array([a,y]) and sometimes Array([a,x])
Use my beloved window functions (...and experience how much simpler your life becomes!)
import org.apache.spark.sql.expressions.Window
val byKeyOrderByOrdering = Window
.partitionBy("key")
.orderBy("ordering")
.rangeBetween(Window.unboundedPreceding, Window.unboundedFollowing)
import org.apache.spark.sql.functions.first
val firsts = ds.withColumn("first",
first("value", ignoreNulls = true) over byKeyOrderByOrdering)
scala> firsts.show
+---+-----+--------+-----+
|key|value|ordering|first|
+---+-----+--------+-----+
| a| null| 1| x|
| a| null| 2| x|
| a| x| 3| x|
| a| y| 4| x|
| a| null| 5| x|
+---+-----+--------+-----+
NOTE: Somehow, Spark 2.2.0-SNAPSHOT (built today) could not give me the correct answer without the rangeBetween clause, which I thought should have defaulted to the unbounded range.
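For completeness, the same approach in PySpark might look like this (a sketch, assuming a DataFrame ds with columns key, value and ordering); dropDuplicates then collapses the result to one row per key.
from pyspark.sql import Window, functions as F

w = (Window.partitionBy("key")
    .orderBy("ordering")
    .rangeBetween(Window.unboundedPreceding, Window.unboundedFollowing))

(ds.withColumn("first", F.first("value", ignorenulls=True).over(w))
    .select("key", "first")
    .dropDuplicates()
    .show())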
Let's say I have a rather large dataset in the following form:
data = sc.parallelize([('Foo',41,'US',3),
('Foo',39,'UK',1),
('Bar',57,'CA',2),
('Bar',72,'CA',2),
('Baz',22,'US',6),
('Baz',36,'US',6)])
What I would like to do is remove duplicate rows based on the values of the first, third and fourth columns only.
Removing entirely duplicate rows is straightforward:
data = data.distinct()
and either row 5 or row 6 will be removed
But how do I remove duplicate rows based on columns 1, 3 and 4 only? i.e. remove either one of these:
('Baz',22,'US',6)
('Baz',36,'US',6)
In pandas, this could be done by specifying columns with .drop_duplicates(). How can I achieve the same in Spark/PySpark?
Pyspark does include a dropDuplicates() method, which was introduced in 1.4. https://spark.apache.org/docs/3.1.2/api/python/reference/api/pyspark.sql.DataFrame.dropDuplicates.html
>>> from pyspark.sql import Row
>>> df = sc.parallelize([ \
... Row(name='Alice', age=5, height=80), \
... Row(name='Alice', age=5, height=80), \
... Row(name='Alice', age=10, height=80)]).toDF()
>>> df.dropDuplicates().show()
+---+------+-----+
|age|height| name|
+---+------+-----+
| 5| 80|Alice|
| 10| 80|Alice|
+---+------+-----+
>>> df.dropDuplicates(['name', 'height']).show()
+---+------+-----+
|age|height| name|
+---+------+-----+
| 5| 80|Alice|
+---+------+-----+
From your question, it is unclear as to which columns you want to use to determine duplicates. The general idea behind the solution is to create a key based on the values of the columns that identify duplicates. Then you can use the reduceByKey or reduce operations to eliminate duplicates.
Here is some code to get you started:
def get_key(x):
    # join the identifying columns with a separator to avoid accidental collisions
    return "{0}|{1}|{2}".format(x[0], x[2], x[3])

m = data.map(lambda x: (get_key(x), x))
Now, you have a key-value RDD that is keyed by columns 1,3 and 4.
The next step would be either a reduceByKey or groupByKey and filter.
This would eliminate duplicates.
r = m.reduceByKey(lambda x,y: (x))
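To get back to plain rows afterwards, something like this should work (a small sketch):
# drop the synthetic key again, leaving one row per key
deduped = r.values()
deduped.collect()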
I know you already accepted the other answer, but if you want to do this as a
DataFrame, just use groupBy and agg. Assuming you had a DF already created (with columns named "col1", "col2", etc) you could do:
myDF.groupBy($"col1", $"col3", $"col4").agg($"col1", max($"col2"), $"col3", $"col4")
Note that in this case, I chose the Max of col2, but you could do avg, min, etc.
Agree with David. To add on, it may not be the case that we want to group by all columns other than the column(s) in the aggregate function, i.e., if we want to remove duplicates purely based on a subset of columns and retain all columns in the original dataframe. So the better way to do this could be using the dropDuplicates DataFrame API, available since Spark 1.4.0.
For reference, see: https://spark.apache.org/docs/1.4.0/api/scala/index.html#org.apache.spark.sql.DataFrame
I used the built-in function dropDuplicates(). The Scala code is given below:
val data = sc.parallelize(List(("Foo",41,"US",3),
("Foo",39,"UK",1),
("Bar",57,"CA",2),
("Bar",72,"CA",2),
("Baz",22,"US",6),
("Baz",36,"US",6))).toDF("x","y","z","count")
data.dropDuplicates(Array("x","count")).show()
Output :
+---+---+---+-----+
| x| y| z|count|
+---+---+---+-----+
|Baz| 22| US| 6|
|Foo| 39| UK| 1|
|Foo| 41| US| 3|
|Bar| 57| CA| 2|
+---+---+---+-----+
The program below will help you drop duplicates on the whole DataFrame, or, if you want to drop duplicates based on certain columns, you can even do that:
import org.apache.spark.sql.SparkSession

object DropDuplicates {
  def main(args: Array[String]) {
    val spark =
      SparkSession.builder()
        .appName("DataFrame-DropDuplicates")
        .master("local[4]")
        .getOrCreate()

    import spark.implicits._

    // create an RDD of tuples with some data
    val custs = Seq(
      (1, "Widget Co", 120000.00, 0.00, "AZ"),
      (2, "Acme Widgets", 410500.00, 500.00, "CA"),
      (3, "Widgetry", 410500.00, 200.00, "CA"),
      (4, "Widgets R Us", 410500.00, 0.0, "CA"),
      (3, "Widgetry", 410500.00, 200.00, "CA"),
      (5, "Ye Olde Widgete", 500.00, 0.0, "MA"),
      (6, "Widget Co", 12000.00, 10.00, "AZ")
    )
    val customerRows = spark.sparkContext.parallelize(custs, 4)

    // convert RDD of tuples to DataFrame by supplying column names
    val customerDF = customerRows.toDF("id", "name", "sales", "discount", "state")

    println("*** Here's the whole DataFrame with duplicates")
    customerDF.printSchema()
    customerDF.show()

    // drop fully identical rows
    val withoutDuplicates = customerDF.dropDuplicates()
    println("*** Now without duplicates")
    withoutDuplicates.show()

    val withoutPartials = customerDF.dropDuplicates(Seq("name", "state"))
    println("*** Now without partial duplicates too")
    withoutPartials.show()
  }
}
This is my DataFrame; the value 4 is repeated twice, so the repeated value will be removed here.
scala> df.show
+-----+
|value|
+-----+
| 1|
| 4|
| 3|
| 5|
| 4|
| 18|
+-----+
scala> val newdf=df.dropDuplicates
scala> newdf.show
+-----+
|value|
+-----+
| 1|
| 3|
| 5|
| 4|
| 18|
+-----+