I want to "duplicate" each row as many times as the difference (in days) between two dates in the df. I have this dataframe:
So I need to explode the number of rows of the df to get this:
Get all dates between D1 and D2 using sequence and then explode the dates:
from pyspark.sql import functions as F

df = ...
df.withColumn("D1", F.explode(F.expr("sequence(D1, D2)"))) \
    .drop("D2").show(truncate=False)
Output:
+---+---+---+----------+
|A |B |C |D1 |
+---+---+---+----------+
|1 |2 |3 |2019-01-01|
|1 |2 |3 |2019-01-02|
|1 |2 |3 |2019-01-03|
+---+---+---+----------+
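For completeness, here is a self-contained sketch of the same approach; the single input row and the date types for D1/D2 are assumptions reconstructed from the output above, and sequence over dates requires Spark 2.4+:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical input row, inferred from the output shown above
df = spark.createDataFrame(
    [(1, 2, 3, "2019-01-01", "2019-01-03")],
    ["A", "B", "C", "D1", "D2"]
).withColumn("D1", F.to_date("D1")).withColumn("D2", F.to_date("D2"))

# sequence(D1, D2) builds the array of all dates from D1 to D2 (inclusive),
# and explode emits one row per date in that array
df.withColumn("D1", F.explode(F.expr("sequence(D1, D2)"))) \
    .drop("D2") \
    .show(truncate=False)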
I have a dataframe df with a few columns. I want to group by one (or more) column and, for every group, get the count of values of another column (or columns).
Here's the df:
col1 col2 col3 col4
1 1 a 2
1 1 b 1
1 2 c 1
2 1 a 3
2 1 b 4
I want to group by 'col1' and 'col2' and then, for every group, get the count of each unique value in a column as well as the sum/mean/min/max of another column. I also want to keep the grouped columns. The result should be:
col1 col2 count_a count_b count_c col4_sum
1 1 1 1 0 3
1 2 0 0 1 1
2 1 1 1 0 7
how do I achieve this?
You have two solutions.
First, you can use pivot on col3 to get your counts of unique values, and then join your pivoted dataframe with an aggregated dataframe that computes the sum/mean/min/max of the other column.
Your code would be as follows:
from pyspark.sql import functions as F
result = df \
    .groupBy('col1', 'col2') \
    .pivot('col3') \
    .agg(F.count('col3')) \
    .fillna(0) \
    .join(
        df.groupby('col1', 'col2').agg(F.sum('col4').alias('col4_sum')),
        ['col1', 'col2']
    )
And with your input dataframe, you will get:
+----+----+---+---+---+--------+
|col1|col2|a |b |c |col4_sum|
+----+----+---+---+---+--------+
|1 |1 |1 |1 |0 |3 |
|1 |2 |0 |0 |1 |1 |
|2 |1 |1 |1 |0 |7 |
+----+----+---+---+---+--------+
However, you can't choose the names of the columns extracted from pivot; each one will simply be the value it was built from.
If you really want to choose the name of the columns, you can retrieve all distinct values first and then build your aggregation column from each of them, as follows:
from pyspark.sql import functions as F

# Collect the distinct values of col3; each one becomes a count_* column
values = [x.col3 for x in df.select("col3").distinct().collect()]
# One conditional count per distinct value, with a column name you control
count_of_distinct_values = [F.sum((F.col('col3') == i).cast('integer')).alias('count_' + i) for i in values]
other_column_aggregations = [F.sum('col4').alias('col4_sum')]
aggregated = count_of_distinct_values + other_column_aggregations
result = df.groupBy('col1', 'col2').agg(*aggregated)
You then get the following dataframe:
+----+----+-------+-------+-------+--------+
|col1|col2|count_a|count_b|count_c|col4_sum|
+----+----+-------+-------+-------+--------+
|1 |1 |1 |1 |0 |3 |
|1 |2 |0 |0 |1 |1 |
|2 |1 |1 |1 |0 |7 |
+----+----+-------+-------+-------+--------+
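Since the question also asks for sum/mean/min/max of the other column, the same aggregation-list pattern extends naturally; a small sketch reusing the count_of_distinct_values list built above (the extra output column names are my own choice):
from pyspark.sql import functions as F

other_column_aggregations = [
    F.sum('col4').alias('col4_sum'),
    F.mean('col4').alias('col4_mean'),
    F.min('col4').alias('col4_min'),
    F.max('col4').alias('col4_max'),
]

# Append the extra aggregations to the conditional counts defined earlier
result = df.groupBy('col1', 'col2').agg(*(count_of_distinct_values + other_column_aggregations))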
I am joining two dataframes site_bs and site_wrk_int1 and creating site_wrk using a dynamic join condition.
My code is like below:
join_cond=[ col(v_col) == col('wrk_'+v_col) for v_col in primaryKeyCols] #result would be
site_wrk=site_bs.join(site_wrk_int1,join_cond,'inner').select(*site_bs.columns)
join_cond will be dynamic and the value will be something like [ col(id) == col(wrk_id), col(id) == col(wrk_parentId)]
In the above join condition, the join only happens when both conditions are satisfied, i.e. the join condition is
id = wrk_id and id = wrk_parentId
But I want an OR condition to be applied instead, like below:
id = wrk_id or id = wrk_parentId
How can I achieve this in PySpark?
Since logical operations on pyspark columns return column objects, you can chain these conditions in the join statement such as:
from pyspark.sql import SparkSession
import pyspark.sql.functions as f
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([
    (1, "A", "A"),
    (2, "C", "C"),
    (3, "E", "D"),
], ['id', 'col1', 'col2'])
df.show()
+---+----+----+
| id|col1|col2|
+---+----+----+
| 1| A| A|
| 2| C| C|
| 3| E| D|
+---+----+----+
df.alias("t1").join(
    df.alias("t2"),
    (f.col("t1.col1") == f.col("t2.col2")) | (f.col("t1.col1") == f.lit("E")),
    "left_outer"
).show(truncate=False)
+---+----+----+---+----+----+
|id |col1|col2|id |col1|col2|
+---+----+----+---+----+----+
|1 |A |A |1 |A |A |
|2 |C |C |2 |C |C |
|3 |E |D |1 |A |A |
|3 |E |D |2 |C |C |
|3 |E |D |3 |E |D |
+---+----+----+---+----+----+
As you can see, the rows with IDs 1 and 2 match because col1 == col2, while the row with ID 3 matches every right-hand row because col1 == E is True for it. In terms of syntax, it's important that the conditions combined with the Python bitwise operators (|, &, ...) are each wrapped in parentheses as in the example above, otherwise you might get confusing py4j errors.
Alternatively, if you wish to keep a notation similar to the one in your question, you can use functools.reduce and operator.or_ to apply this logic to your list, as follows:
In this example, I have an AND condition between my column conditions and get NULL only, as expected:
df.alias("t1").join(
    df.alias("t2"),
    [f.col("t1.col1") == f.col("t2.col2"), f.col("t1.col1") == f.lit("E")],
    "left_outer"
).show(truncate=False)
+---+----+----+----+----+----+
|id |col1|col2|id |col1|col2|
+---+----+----+----+----+----+
|3 |E |D |null|null|null|
|1 |A |A |null|null|null|
|2 |C |C |null|null|null|
+---+----+----+----+----+----+
In this example, I leverage functools and operator to get the same result as above:
import functools
import operator

df.alias("t1").join(
    df.alias("t2"),
    functools.reduce(
        operator.or_,
        [f.col("t1.col1") == f.col("t2.col2"), f.col("t1.col1") == f.lit("E")]
    ),
    "left_outer"
).show(truncate=False)
+---+----+----+---+----+----+
|id |col1|col2|id |col1|col2|
+---+----+----+---+----+----+
|1 |A |A |1 |A |A |
|2 |C |C |2 |C |C |
|3 |E |D |1 |A |A |
|3 |E |D |2 |C |C |
|3 |E |D |3 |E |D |
+---+----+----+---+----+----+
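Applied back to the dynamic condition in the question (primaryKeyCols, site_bs, site_wrk_int1 and the wrk_ prefix are the question's own names), a rough sketch of the same reduce/or_ pattern would be:
import functools
import operator

from pyspark.sql import functions as f

# Build the list of equality conditions exactly as in the question,
# then fold them into a single OR-ed join condition
join_cond = [f.col(v_col) == f.col('wrk_' + v_col) for v_col in primaryKeyCols]
or_cond = functools.reduce(operator.or_, join_cond)

site_wrk = site_bs.join(site_wrk_int1, or_cond, 'inner').select(*site_bs.columns)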
I am quite new to Spark SQL.
Please let me know if this can be a solution.
site_wrk = site_bs.join(site_wrk_int1, [(site_bs.id == site_wrk_int1.wrk_id) | (site_bs.id == site_wrk_int1.wrk_parentId)], how="inner")
Suppose I have a pyspark dataframe with an id column and a time column (t) in seconds. For each id I'd like to group the rows so that each group has all entries that are within 5 seconds after the start time for that group. So for instance, if the table is:
+---+--+
|id |t |
+---+--+
|1 |0 |
|1 |1 |
|1 |3 |
|1 |8 |
|1 |14|
|1 |18|
|2 |0 |
|2 |20|
|2 |21|
|2 |50|
+---+--+
Then the result should be:
+---+--+---------+-------------+-------+
|id |t |subgroup |window_start |offset |
+---+--+---------+-------------+-------+
|1 |0 |1 |0 |0 |
|1 |1 |1 |0 |1 |
|1 |3 |1 |0 |3 |
|1 |8 |2 |8 |0 |
|1 |14|3 |14 |0 |
|1 |18|3 |14 |4 |
|2 |0 |1 |0 |0 |
|2 |20|2 |20 |0 |
|2 |21|2 |20 |1 |
|2 |50|3 |50 |0 |
+---+--+---------+-------------+-------+
I don't need the subgroup numbers to be consecutive. I'm ok with solutions using custom UDAF in Scala as long as it is efficient.
Computing (cumsum(t)-(cumsum(t)%5))/5 within each group can be used to identify the first window, but not the ones beyond that. Essentially the problem is that after the first window is found, the cumulative sum needs to reset to 0. I could operate recursively using this cumulative sum approach, but that is too inefficient on a large dataset.
The following works and is more efficient than recursively calling cumsum, but it is still so slow as to be useless on large dataframes.
import numpy
import pyspark.sql.functions
import pyspark.sql.types

d = [[int(x[0]), float(x[1])] for x in [[1,0],[1,1],[1,4],[1,7],[1,14],[1,18],[2,5],[2,20],[2,21],[3,0],[3,1],[3,1.5],[3,2],[3,3.5],[3,4],[3,6],[3,6.5],[3,7],[3,11],[3,14],[3,18],[3,20],[3,24],[4,0],[4,1],[4,2],[4,6],[4,7]]]
schema = pyspark.sql.types.StructType(
    [
        pyspark.sql.types.StructField('id', pyspark.sql.types.LongType(), False),
        pyspark.sql.types.StructField('t', pyspark.sql.types.DoubleType(), False)
    ]
)
df = spark.createDataFrame(
    [pyspark.sql.Row(*x) for x in d],
    schema
)
def getSubgroup(ts):
    result = []
    total = 0
    ts = sorted(ts)
    tdiffs = numpy.array(ts)
    tdiffs = tdiffs[1:] - tdiffs[:-1]
    tdiffs = numpy.concatenate([[0], tdiffs])
    subgroup = 0
    for k in range(len(tdiffs)):
        t = ts[k]
        tdiff = tdiffs[k]
        total = total + tdiff
        if total >= 5:
            total = 0
            subgroup += 1
        result.append([t, float(subgroup)])
    return result
getSubgroupUDF = pyspark.sql.functions.udf(getSubgroup,pyspark.sql.types.ArrayType(pyspark.sql.types.ArrayType(pyspark.sql.types.DoubleType())))
subgroups = df.select('id', 't').distinct().groupBy(
    'id'
).agg(
    pyspark.sql.functions.collect_list('t').alias('ts')
).withColumn(
    't_and_subgroup',
    pyspark.sql.functions.explode(getSubgroupUDF('ts'))
).withColumn(
    't',
    pyspark.sql.functions.col('t_and_subgroup').getItem(0)
).withColumn(
    'subgroup',
    pyspark.sql.functions.col('t_and_subgroup').getItem(1).cast(pyspark.sql.types.IntegerType())
).drop(
    't_and_subgroup', 'ts'
)
df = df.join(
    subgroups,
    on=['id', 't'],
    how='inner'
)
df.orderBy(
    pyspark.sql.functions.asc('id'), pyspark.sql.functions.asc('t')
).show()
The subgroup column is equivalent to partitioning by (id, window_start), so maybe you don't need to create it.
To create window_start, I think this does the job:
.withColumn("window_start", min("t").over(Window.partitionBy("id").orderBy(asc("t")).rangeBetween(0, 5)))
I'm not 100% sure about the behavior of rangeBetween.
To create offset, it's just .withColumn("offset", col("t") - col("window_start"))
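Assembled into one runnable snippet against the df defined in the question (imports included; as noted, the rangeBetween behaviour should be verified against the expected output rather than taken as a confirmed solution):
from pyspark.sql import Window
from pyspark.sql import functions as F

# Frame suggested above: rows of the same id whose t lies within [t, t + 5]
w = Window.partitionBy("id").orderBy(F.asc("t")).rangeBetween(0, 5)

result = (
    df
    .withColumn("window_start", F.min("t").over(w))
    .withColumn("offset", F.col("t") - F.col("window_start"))
)
result.show()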
Let me know how it goes
I have the following PySpark DataFrame :
+----+----------+----------+----------+
| id| a| b| c|
+----+----------+----------+----------+
|2346|2017-05-26| null|2016-12-18|
|5678|2013-05-07|2018-05-12| null|
+----+----------+----------+----------+
My ideal output is :
+----+---+---+---+
|id |a |b |c |
+----+---+---+---+
|2346|2 |0 |1 |
|5678|1 |2 |0 |
+----+---+---+---+
I.e. the more recent the date within the row, the higher the score.
I have looked at similar posts suggesting to use a window function. The problem is that I need to order my values within the row, not within a column.
You can put the values in each row into an array and use pyspark.sql.functions.sort_array() to sort it.
import pyspark.sql.functions as f
cols = ["a", "b", "c"]
df = df.select("*", f.sort_array(f.array([f.col(c) for c in cols])).alias("sorted"))
df.show(truncate=False)
#+----+----------+----------+----------+------------------------------+
#|id |a |b |c |sorted |
#+----+----------+----------+----------+------------------------------+
#|2346|2017-05-26|null |2016-12-18|[null, 2016-12-18, 2017-05-26]|
#|5678|2013-05-07|2018-05-12|null |[null, 2013-05-07, 2018-05-12]|
#+----+----------+----------+----------+------------------------------+
Now you can use a combination of pyspark.sql.functions.coalesce() and pyspark.sql.functions.when() to loop over each of the columns in cols and find the corresponding index in the sorted array.
df = df.select(
    "id",
    *[
        f.coalesce(
            *[
                f.when(
                    f.col("sorted").getItem(i) == f.col(c),
                    f.lit(i)
                )
                for i in range(len(cols))
            ]
        ).alias(c)
        for c in cols
    ]
)
df.show(truncate=False)
#+----+---+----+----+
#|id |a |b |c |
#+----+---+----+----+
#|2346|2 |null|1 |
#|5678|1 |2 |null|
#+----+---+----+----+
Finally fill the null values with 0:
df = df.na.fill(0)
df.show(truncate=False)
#+----+---+---+---+
#|id |a |b |c |
#+----+---+---+---+
#|2346|2 |0 |1 |
#|5678|1 |2 |0 |
#+----+---+---+---+
I have an input dataframe of the format
+----+------+-----+----------+
|name|values|score|row_number|
+----+------+-----+----------+
|A   |1000  |0    |1         |
|B   |947   |0    |2         |
|C   |923   |1    |3         |
|D   |900   |2    |4         |
|E   |850   |3    |5         |
|F   |800   |1    |6         |
+----+------+-----+----------+
I need to get sum(values) when score > 0 and row_number < K, i.e. the SUM of all values when score > 0 for the top K rows in the dataframe.
I am able to achieve this by running the following query for the top 100 values:
val top_100_data = df.select(
  count(when(col("score") > 0 and col("row_number") <= 100, col("values"))).alias("count_100"),
  sum(when(col("score") > 0 and col("row_number") <= 100, col("values"))).alias("sum_filtered_100"),
  sum(when(col("row_number") <= 100, col("values"))).alias("total_sum_100")
)
However, I need to fetch data for the top 100, 200, 300, ..., 2500, meaning I would need to run this query 25 times and finally union 25 dataframes.
I'm new to spark and still figuring lots of things out. What would be the best approach to solve this problem?
Thanks!!
You can create an Array of limits as
val topFilters = Array(100, 200, 300) // you can add more
Then you can loop through the topFilters array and create the dataframe you require. I suggest you use join rather than union, as join will give you separate columns while union will give you separate rows. You can do the following.
Given your dataframe as
+----+------+-----+----------+
|name|values|score|row_number|
+----+------+-----+----------+
|A |1000 |0 |1 |
|B |947 |0 |2 |
|C |923 |1 |3 |
|D |900 |2 |200 |
|E |850 |3 |150 |
|F |800 |1 |250 |
+----+------+-----+----------+
You can do the following, using the topFilters array defined above:
import sqlContext.implicits._
import org.apache.spark.sql.functions._
var finalDF : DataFrame = Seq("1").toDF("rowNum")
for (k <- topFilters) {
  val top_100_data = df.select(
    lit("1").as("rowNum"),
    sum(when(col("score") > 0 && col("row_number") < k, col("values"))).alias(s"total_sum_$k")
  )
  finalDF = finalDF.join(top_100_data, Seq("rowNum"))
}
finalDF.show(false)
which should give you the final dataframe as:
+------+-------------+-------------+-------------+
|rowNum|total_sum_100|total_sum_200|total_sum_300|
+------+-------------+-------------+-------------+
|1 |923 |1773 |3473 |
+------+-------------+-------------+-------------+
You can do the same for the 25 limits that you have.
If you intend to use union, then the idea is similar to above.
I hope the answer is helpful
Updated
If you require union, then you can apply the following logic with the same limit array defined above:
var finalDF : DataFrame = Seq((0, 0, 0, 0)).toDF("limit", "count", "sum_filtered", "total_sum")

for (k <- topFilters) {
  val top_100_data = df.select(
    lit(k).as("limit"),
    count(when(col("score") > 0 and col("row_number") <= k, col("values"))).alias("count"),
    sum(when(col("score") > 0 and col("row_number") <= k, col("values"))).alias("sum_filtered"),
    sum(when(col("row_number") <= k, col("values"))).alias("total_sum")
  )
  finalDF = finalDF.union(top_100_data)
}

finalDF.filter(col("limit") =!= 0).show(false)
finalDF.filter(col("limit") =!= 0).show(false)
which should give you
+-----+-----+------------+---------+
|limit|count|sum_filtered|total_sum|
+-----+-----+------------+---------+
|100 |1 |923 |2870 |
|200 |3 |2673 |4620 |
|300 |4 |3473 |5420 |
+-----+-----+------------+---------+