Check value from Spark DataFrame column and do transformations

I have a dataframe consisting of person, transaction_id and is_successful columns. The dataframe contains duplicate person values with different transaction_ids, and is_successful is True/False for each transaction.
I would like to derive a new dataframe with one record per person, holding that person's latest transaction_id and is_successful set to True if any of their transactions were successful.
val input_df = sc.parallelize(Seq((1,1, "True"), (1,2, "False"), (2,1, "False"), (2,2, "False"), (2,3, "True"), (3,1, "False"), (3,2, "False"), (3,3, "False"))).toDF("person","transaction_id", "is_successful")
input_df: org.apache.spark.sql.DataFrame = [person: int, transaction_id: int ... 1 more field]
input_df.show(false)
+------+--------------+-------------+
|person|transaction_id|is_successful|
+------+--------------+-------------+
|1 |1 |True |
|1 |2 |False |
|2 |1 |False |
|2 |2 |False |
|2 |3 |True |
|3 |1 |False |
|3 |2 |False |
|3 |3 |False |
+------+--------------+-------------+
Expected Df:
+------+--------------+-------------+
|person|transaction_id|is_successful|
+------+--------------+-------------+
|1 |2 |True |
|2 |3 |True |
|3 |3 |False |
+------+--------------+-------------+
How can we derive the dataframe like above?

What you can do in Spark SQL is the following:
select person, max(transaction_id) as transaction_id, max(is_successful) as is_successful from <table_name> group by person
Leave the heavy lifting to the max operator: in string comparison 'True' sorts after 'False', so if a person has three False transactions and one True, the max is True.
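For reference, here is a minimal DataFrame sketch of the same idea (variable names are illustrative); string max works because "True" sorts after "False":
import org.apache.spark.sql.functions.max

val result_df = input_df.groupBy("person")
  .agg(
    max("transaction_id").as("transaction_id"),
    max("is_successful").as("is_successful")
  )
result_df.show(false)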

You may achieve this by grouping your dataframe on person and finding the max transaction_id and max is_successful.
Below is an example of how this may be achieved using Spark SQL.
First, create a temporary view of your dataframe so it can be queried with Spark SQL, then run the SQL statement that follows.
input_df.createOrReplaceTempView("input_df");
val result_df = sparkSession.sql("<insert sql below here>");
The SQL statement groups the data per person, then uses max to find the last transaction_id, and a combination of max and case expressions (sum would work with the same logic) to derive is_successful. The case expression is nested: the inner one converts True to 1 and False to 0 so a numeric comparison is possible; the outer one checks whether the max is > 0 (i.e. any transaction was successful) and prints True/False accordingly.
SELECT
    person,
    MAX(transaction_id) as transaction_id,
    CASE
        WHEN MAX(
            CASE
                WHEN is_successful = 'True' THEN 1
                ELSE 0
            END
        ) > 0 THEN 'True'
        ELSE 'False'
    END as is_successful
FROM
    input_df
GROUP BY
    person

Here is @ggordon's SQL answer in DataFrame form.
import org.apache.spark.sql.functions.{max, when}
import spark.implicits._   // for the 'is_successful symbol-to-column syntax

input_df.groupBy("person")
  .agg(max("transaction_id").as("transaction_id"),
       when(max(when('is_successful === "True", 1)
           .otherwise(0)) > 0, "True")
         .otherwise("False").as("is_successful"))


Spark, return multiple rows on group?

So, I have a Kafka topic containing the following data, and I'm working on a proof of concept to see whether we can achieve what we're trying to do. I was previously trying to solve it within Kafka, but it seems that Kafka wasn't the right tool, so I'm looking at Spark now :)
The data in its basic form looks like this:
+--+------------+-------+---------+
|id|serialNumber|source |company |
+--+------------+-------+---------+
|1 |123ABC |system1|Acme |
|2 |3285624 |system1|Ajax |
|3 |CDE567 |system1|Emca |
|4 |XX |system2|Ajax |
|5 |3285624 |system2|Ajax&Sons|
|6 |0147852 |system2|Ajax |
|7 |123ABC |system2|Acme |
|8 |CDE567 |system2|Xaja |
+--+------------+-------+---------+
The main grouping column is serialNumber, and the result should be that ids 1 and 7 match because the company is a full match. Ids 2 and 5 should match because the company in id 2 is a partial (prefix) match of the company in id 5. Ids 3 and 8 should not match because the companies don't match.
I expect the end result to be something like this. Note that sources are not fixed to just one or two and in the future it will contain more sources.
+------+-----+------------+-----------------+---------------+
|uuid |id |serialNumber|source |company |
+------+-----+------------+-----------------+---------------+
|<uuid>|[1,7]|123ABC |[system1,system2]|[Acme] |
|<uuid>|[2,5]|3285624     |[system1,system2]|[Ajax,Ajax&Sons]|
|<uuid>|[3] |CDE567 |[system1] |[Emca] |
|<uuid>|[4] |XX |[system2] |[Ajax] |
|<uuid>|[6] |0147852 |[system2] |[Ajax] |
|<uuid>|[8] |CDE567 |[system2] |[Xaja] |
+------+-----+------------+-----------------+---------------+
I was looking at groupByKey().mapGroups() but having problems finding examples. Can mapGroups() return more than one row?
You can simply groupBy on the serialNumber column and collect_list all of the other columns.
Code:
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.functions._
import spark.implicits._   // for toDF

val ds = Seq((1, "123ABC", "system1", "Acme"),
             (7, "123ABC", "system2", "Acme"))
  .toDF("id", "serialNumber", "source", "company")

ds.groupBy("serialNumber")
  .agg(
    collect_list("id").alias("id"),
    collect_list("source").alias("source"),
    collect_list("company").alias("company")
  )
  .show(false)
Output:
+------------+------+------------------+------------+
|serialNumber|id |source |company |
+------------+------+------------------+------------+
|123ABC |[1, 7]|[system1, system2]|[Acme, Acme]|
+------------+------+------------------+------------+
If you don't want duplicate values, use collect_set:
ds.groupBy("serialNumber")
.agg(
collect_list("id").alias("id"),
collect_list("source").alias("source"),
collect_set("company").alias("company")
)
.show(false)
Output with collect_set on company column:
+------------+------+------------------+-------+
|serialNumber|id |source |company|
+------------+------+------------------+-------+
|123ABC |[1, 7]|[system1, system2]|[Acme] |
+------------+------+------------------+-------+
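The expected output also includes a uuid column. One hedged option, assuming Spark 2.3+ where the built-in uuid() SQL function is available, is to add it after the aggregation (it generates one value per row):
import org.apache.spark.sql.functions.{collect_list, collect_set, expr}

ds.groupBy("serialNumber")
  .agg(
    collect_list("id").alias("id"),
    collect_list("source").alias("source"),
    collect_set("company").alias("company")
  )
  .withColumn("uuid", expr("uuid()"))  // built-in uuid() generates a random UUID per row
  .show(false)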

finding non-overlapping windows in a pyspark dataframe

Suppose I have a pyspark dataframe with an id column and a time column (t) in seconds. For each id I'd like to group the rows so that each group has all entries that are within 5 seconds after the start time for that group. So for instance, if the table is:
+---+--+
|id |t |
+---+--+
|1 |0 |
|1 |1 |
|1 |3 |
|1 |8 |
|1 |14|
|1 |18|
|2 |0 |
|2 |20|
|2 |21|
|2 |50|
+---+--+
Then the result should be:
+---+--+---------+-------------+-------+
|id |t |subgroup |window_start |offset |
+---+--+---------+-------------+-------+
|1 |0 |1 |0 |0 |
|1 |1 |1 |0 |1 |
|1 |3 |1 |0 |3 |
|1 |8 |2 |8 |0 |
|1 |14|3 |14 |0 |
|1 |18|3 |14 |4 |
|2 |0 |1 |0 |0 |
|2 |20|2 |20 |0 |
|2 |21|2 |20 |1 |
|2 |50|3 |50 |0 |
+---+--+---------+-------------+-------+
I don't need the subgroup numbers to be consecutive. I'm ok with solutions using custom UDAF in Scala as long as it is efficient.
Computing (cumsum(t)-(cumsum(t)%5))/5 within each group can be used to identify the first window, but not the ones beyond that. Essentially the problem is that after the first window is found, the cumulative sum needs to reset to 0. I could operate recursively using this cumulative sum approach, but that is too inefficient on a large dataset.
The following works and is more efficient than recursively calling cumsum, but it is still so slow as to be useless on large dataframes.
import numpy
import pyspark.sql.functions
import pyspark.sql.types

# assumes an active SparkSession named `spark`, as in the pyspark shell
d = [[int(x[0]), float(x[1])] for x in [[1,0],[1,1],[1,4],[1,7],[1,14],[1,18],[2,5],[2,20],[2,21],[3,0],[3,1],[3,1.5],[3,2],[3,3.5],[3,4],[3,6],[3,6.5],[3,7],[3,11],[3,14],[3,18],[3,20],[3,24],[4,0],[4,1],[4,2],[4,6],[4,7]]]
schema = pyspark.sql.types.StructType(
    [
        pyspark.sql.types.StructField('id', pyspark.sql.types.LongType(), False),
        pyspark.sql.types.StructField('t', pyspark.sql.types.DoubleType(), False)
    ]
)
df = spark.createDataFrame(
    [pyspark.sql.Row(*x) for x in d],
    schema
)

def getSubgroup(ts):
    # Assign a subgroup number to each sorted timestamp, starting a new
    # subgroup once the accumulated time difference reaches 5 seconds.
    result = []
    total = 0
    ts = sorted(ts)
    tdiffs = numpy.array(ts)
    tdiffs = tdiffs[1:] - tdiffs[:-1]
    tdiffs = numpy.concatenate([[0], tdiffs])
    subgroup = 0
    for k in range(len(tdiffs)):
        t = ts[k]
        tdiff = tdiffs[k]
        total = total + tdiff
        if total >= 5:
            total = 0
            subgroup += 1
        result.append([t, float(subgroup)])
    return result

getSubgroupUDF = pyspark.sql.functions.udf(
    getSubgroup,
    pyspark.sql.types.ArrayType(pyspark.sql.types.ArrayType(pyspark.sql.types.DoubleType()))
)

subgroups = df.select('id', 't').distinct().groupBy(
    'id'
).agg(
    pyspark.sql.functions.collect_list('t').alias('ts')
).withColumn(
    't_and_subgroup',
    pyspark.sql.functions.explode(getSubgroupUDF('ts'))
).withColumn(
    't',
    pyspark.sql.functions.col('t_and_subgroup').getItem(0)
).withColumn(
    'subgroup',
    pyspark.sql.functions.col('t_and_subgroup').getItem(1).cast(pyspark.sql.types.IntegerType())
).drop(
    't_and_subgroup', 'ts'
)
df = df.join(
    subgroups,
    on=['id', 't'],
    how='inner'
)
df.orderBy(
    pyspark.sql.functions.asc('id'), pyspark.sql.functions.asc('t')
).show()
The subgroup column is equivalent to partitioning by id, window_start, so maybe you don't need to create it.
To create window_start, I think this does the job:
.withColumn("window_start", min("t").over(Window.partitionBy("id").orderBy(asc("t")).rangeBetween(0, 5)))
I'm not 100% sure about the behavior of rangeBetween.
To create offset it's just:
.withColumn("offset", col("t") - col("window_start"))
Let me know how it goes.
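Putting the two pieces together, here is a sketch of the full chain in Scala (which the question says is acceptable), assuming a DataFrame df with the id and t columns from the question; the caveat above about rangeBetween still applies:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{asc, col, min}

// rangeBetween(0, 5) looks at rows whose t lies in [t, t + 5] within each id
val w = Window.partitionBy("id").orderBy(asc("t")).rangeBetween(0, 5)

val result = df
  .withColumn("window_start", min("t").over(w))
  .withColumn("offset", col("t") - col("window_start"))

result.orderBy(asc("id"), asc("t")).show(false)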

Is there a "key-wise map with state" in Spark?

I have an RDD (or DataFrame) of measuring data which is ordered by the timestamp, and I need to do a pairwise operation on two subsequent records for the same key (e.g., doing a trapezium integration of accelerometer data to get velocities).
Is there a function in Spark that "remembers" the last record for each key and has it available when the next record for the same key arrives?
I currently thought of this approach:
Get all the keys of the RDD
Use a custom Partitioner to partition the RDD by the found keys so I know there is one partition for each key
Use mapPartitions to do the calculation
However, this approach has a flaw: getting the keys can be a very lengthy task because the input data can be several GiB or even TiB large. I could write a custom InputFormat to extract just the keys, which would be significantly faster (as I use Hadoop's API and sc.newAPIHadoopFile to get the data in the first place), but that would be an additional thing to consider and an additional source of bugs.
So my question is: Is there anything like reduceByKey that doesn't aggregate the data but just gives me the current record and the last one for that key and lets me output one or more records based on that information?
Here is what you can do with a DataFrame.
import java.sql.Timestamp
import org.apache.spark.sql.types.{TimestampType, IntegerType}
import org.apache.spark.sql.functions._
Create a window for the lag function:
val w = org.apache.spark.sql.expressions.Window.partitionBy("key").orderBy("timestamp")

val df = spark.sparkContext.parallelize(List(
    (1, 23, Timestamp.valueOf("2017-12-02 03:04:00")),
    (1, 24, Timestamp.valueOf("2017-12-02 01:45:20")),
    (1, 26, Timestamp.valueOf("2017-12-02 01:45:20")),
    (1, 27, Timestamp.valueOf("2017-12-02 01:45:20")),
    (2, 30, Timestamp.valueOf("2017-12-02 01:45:20")),
    (2, 33, Timestamp.valueOf("2017-12-02 01:45:20")),
    (2, 39, Timestamp.valueOf("2017-12-02 01:45:20")))).toDF("key", "value", "timestamp")
scala> df.printSchema
root
|-- key: integer (nullable = false)
|-- value: integer (nullable = false)
|-- timestamp: timestamp (nullable = true)
scala> val lagDF = df.withColumn("lag_value",lag("value", 1, 0).over(w))
lagDF: org.apache.spark.sql.DataFrame = [key: int, value: int ... 2 more fields]
The previous record and the current record are now in the same row:
scala> lagDF.show(10, false)
+---+-----+-------------------+---------+
|key|value|timestamp |lag_value|
+---+-----+-------------------+---------+
|1 |24 |2017-12-02 01:45:20|0 |
|1 |26 |2017-12-02 01:45:20|24 |
|1 |27 |2017-12-02 01:45:20|26 |
|1 |23 |2017-12-02 03:04:00|27 |
|2 |30 |2017-12-02 01:45:20|0 |
|2 |33 |2017-12-02 01:45:20|30 |
|2 |39 |2017-12-02 01:45:20|33 |
+---+-----+-------------------+---------+
Put your distance calculation logic here. I'm using a dummy operation for the demo:
scala> val result = lagDF.withColumn("dummy_operation_for_dist_calc", lagDF("value") - lagDF("lag_value"))
result: org.apache.spark.sql.DataFrame = [key: int, value: int ... 3 more fields]
scala> result.show(10, false)
+---+-----+-------------------+---------+-----------------------------+
|key|value|timestamp |lag_value|dummy_operation_for_dist_calc|
+---+-----+-------------------+---------+-----------------------------+
|1 |24 |2017-12-02 01:45:20|0 |24 |
|1 |26 |2017-12-02 01:45:20|24 |2 |
|1 |27 |2017-12-02 01:45:20|26 |1 |
|1 |23 |2017-12-02 03:04:00|27 |-4 |
|2 |30 |2017-12-02 01:45:20|0 |30 |
|2 |33 |2017-12-02 01:45:20|30 |3 |
|2 |39 |2017-12-02 01:45:20|33 |6 |
+---+-----+-------------------+---------+-----------------------------+
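As a hedged sketch of the trapezium integration the question mentions (the lag_ts and velocity_delta column names are made up for illustration), you could also lag the timestamp and average consecutive values over the elapsed seconds:
import org.apache.spark.sql.functions.{col, lag, unix_timestamp}

val lagged = df
  .withColumn("lag_value", lag("value", 1).over(w))
  .withColumn("lag_ts", lag("timestamp", 1).over(w))

// Trapezium rule: average of two consecutive samples times the time delta in seconds.
// The first row per key has null lag columns, so its delta is null.
val velocity = lagged.withColumn(
  "velocity_delta",
  (col("value") + col("lag_value")) / 2 *
    (unix_timestamp(col("timestamp")) - unix_timestamp(col("lag_ts")))
)
velocity.show(false)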

How to add column and records on a dataset given a condition

I'm working on a program that flags data as OutOfRange based on the values present in certain columns.
I have three columns: Age, Height, and Weight. I want to create a fourth column called OutOfRange and assign it a value of 0 (false) or 1 (true) if the values in those three columns exceed a specific threshold.
If age is lower than 18 or higher than 60, that row will be assigned a value of 1 (0 otherwise). If height is lower than 5, that row will be assigned a value of 1 (0 otherwise), and so on.
Is it possible to create a column and then add/overwrite values to that column? It would be awesome if I could do that with Spark. I know SQL, so if there is anything I can do with the dataset.SQL() function, please let me know.
Given a dataframe as
+---+------+------+
|Age|Height|Weight|
+---+------+------+
|20 |3 |70 |
|17 |6 |80 |
|30 |5 |60 |
|61 |7 |90 |
+---+------+------+
You can use the when function to apply the logic explained in the question:
import org.apache.spark.sql.functions._
df.withColumn("OutOfRange", when(col("Age") <18 || col("Age") > 60 || col("Height") < 5, 1).otherwise(0))
which would result following dataframe
+---+------+------+----------+
|Age|Height|Weight|OutOfRange|
+---+------+------+----------+
|20 |3 |70 |1 |
|17 |6 |80 |1 |
|30 |5 |60 |0 |
|61 |7 |90 |1 |
+---+------+------+----------+
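Since you mention knowing SQL, here is a hedged equivalent using a temp view (the view name people is just an example):
df.createOrReplaceTempView("people")

val result = spark.sql("""
  SELECT Age, Height, Weight,
         CASE WHEN Age < 18 OR Age > 60 OR Height < 5 THEN 1 ELSE 0 END AS OutOfRange
  FROM people
""")
result.show(false)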
Is it possible to create a column and then add/overwrite values to that column? It would be awesome if I could do that with Spark. I know SQL, so if there is anything I can do with the dataset.SQL() function, please let me know.
This is not possible without recreating the Dataset altogether, since Datasets are inherently immutable.
However you can save the Dataset as a Hive table, which will allow you to do what you want to do. Saving the Dataset as a Hive table will write the contents of your Dataset to disk under the default spark-warehouse directory.
df.write.mode("overwrite").saveAsTable("my_table")
// Add a row
spark.sql("insert into my_table (Age, Height, Weight, OutofRange) values (20, 30, 70, 1)
// Update a row
spark.sql("update my_table set OutOfRange = 1 where Age > 30")
....
Hive support must be enabled for Spark at the time of instantiation in order to do this.
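For reference, a minimal sketch of instantiating a Hive-enabled session (the app name is arbitrary):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hive-enabled-example")
  .enableHiveSupport()  // enables the Hive metastore integration used above
  .getOrCreate()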

Calculating sum,count of multiple top K values spark

I have an input dataframe of the format
+----+------+-----+----------+
|name|values|score|row_number|
+----+------+-----+----------+
|A   |1000  |0    |1         |
|B   |947   |0    |2         |
|C   |923   |1    |3         |
|D   |900   |2    |4         |
|E   |850   |3    |5         |
|F   |800   |1    |6         |
+----+------+-----+----------+
I need to get sum(values) when score > 0 and row_number < K, i.e. the SUM of all values when score > 0 for the top K values in the dataframe.
I am able to achieve this by running the following query for the top 100 values:
val top_100_data = df.select(
  count(when(col("score") > 0 and col("row_number") <= 100, col("values"))).alias("count_100"),
  sum(when(col("score") > 0 and col("row_number") <= 100, col("values"))).alias("sum_filtered_100"),
  sum(when(col("row_number") <= 100, col("values"))).alias("total_sum_100")
)
However, I need to fetch data for the top 100, 200, 300, ..., 2500, meaning I would need to run this query 25 times and finally union 25 dataframes.
I'm new to spark and still figuring lots of things out. What would be the best approach to solve this problem?
Thanks!!
You can create an Array of limits as
val topFilters = Array(100, 200, 300) // you can add more
Then you can loop through the topFilters array and create the dataframe you require. I suggest you use join rather than union, as join will give you separate columns while union will give you separate rows. You can do the following.
Given your dataframe as
+----+------+-----+----------+
|name|values|score|row_number|
+----+------+-----+----------+
|A |1000 |0 |1 |
|B |947 |0 |2 |
|C |923 |1 |3 |
|D |900 |2 |200 |
|E |850 |3 |150 |
|F |800 |1 |250 |
+----+------+-----+----------+
You can do it by using the topFilters array defined above as follows:
import sqlContext.implicits._
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

var finalDF: DataFrame = Seq("1").toDF("rowNum")

for (k <- topFilters) {
  val top_100_data = df.select(
    lit("1").as("rowNum"),
    sum(when(col("score") > 0 && col("row_number") < k, col("values"))).alias(s"total_sum_$k"))
  finalDF = finalDF.join(top_100_data, Seq("rowNum"))
}

finalDF.show(false)
which should give you the final dataframe as
+------+-------------+-------------+-------------+
|rowNum|total_sum_100|total_sum_200|total_sum_300|
+------+-------------+-------------+-------------+
|1 |923 |1773 |3473 |
+------+-------------+-------------+-------------+
You can do the same for the 25 limits that you have.
If you intend to use union, then the idea is similar to above.
I hope the answer is helpful
Updated
If you require union, then you can apply the following logic with the same limit array defined above:
var finalDF: DataFrame = Seq((0, 0, 0, 0)).toDF("limit", "count", "sum_filtered", "total_sum")

for (k <- topFilters) {
  val top_100_data = df.select(
    lit(k).as("limit"),
    count(when(col("score") > 0 and col("row_number") <= k, col("values"))).alias("count"),
    sum(when(col("score") > 0 and col("row_number") <= k, col("values"))).alias("sum_filtered"),
    sum(when(col("row_number") <= k, col("values"))).alias("total_sum"))
  finalDF = finalDF.union(top_100_data)
}

finalDF.filter(col("limit") =!= 0).show(false)
which should give you
+-----+-----+------------+---------+
|limit|count|sum_filtered|total_sum|
+-----+-----+------------+---------+
|100 |1 |923 |2870 |
|200 |3 |2673 |4620 |
|300 |4 |3473 |5420 |
+-----+-----+------------+---------+
