pyspark get value counts within a groupby - apache-spark

I have a dataframe df with a few columns. I want to group by one (or more) columns and, for every group, count the values of another column (or columns).
Here's the df:
col1 col2 col3 col4
1 1 a 2
1 1 b 1
1 2 c 1
2 1 a 3
2 1 b 4
I want to group by 'col1' and 'col2' and, for every group, get the count of each value of 'col3' plus the sum/mean/min/max of another column. I also want to keep the grouping columns. The result should be:
col1 col2 count_a count_b count_c col4_sum
1 1 1 1 0 3
1 2 0 0 1 1
2 1 1 1 0 7
How do I achieve this?

You have two solutions.
First, you can use pivot on col3 to get your counts per value, and then join the pivoted dataframe with an aggregated dataframe that computes the sum/mean/min/max of the other column.
Your code would be as follows:
from pyspark.sql import functions as F

result = df \
    .groupBy('col1', 'col2') \
    .pivot('col3') \
    .agg(F.count('col3')) \
    .fillna(0) \
    .join(
        df.groupby('col1', 'col2').agg(F.sum('col4').alias('col4_sum')),
        ['col1', 'col2']
    )
And with your input dataframe, you will get:
+----+----+---+---+---+--------+
|col1|col2|a |b |c |col4_sum|
+----+----+---+---+---+--------+
|1 |1 |1 |1 |0 |3 |
|1 |2 |0 |0 |1 |1 |
|2 |1 |1 |1 |0 |7 |
+----+----+---+---+---+--------+
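As a side note (not part of the original answer), if the distinct values of col3 are already known, they can be passed to pivot explicitly; this saves Spark the extra job it otherwise runs to compute them and fixes the column order. A small sketch, assuming 'a', 'b' and 'c' are the only values of col3 as in the sample data:
# optional variant: pass the known pivot values explicitly
pivoted = df.groupBy('col1', 'col2') \
    .pivot('col3', ['a', 'b', 'c']) \
    .agg(F.count('col3')) \
    .fillna(0)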
However, you can't choose the names of the columns produced by pivot; each one is simply named after the pivoted value.
If you really want to choose the column names, you can retrieve all distinct values first and then build one aggregation column for each of them, as follows:
from pyspark.sql import functions as F

# the distinct values of col3 become the count_* columns
values = [row.col3 for row in df.select('col3').distinct().collect()]
count_of_distinct_values = [F.sum((F.col('col3') == v).cast('integer')).alias('count_' + v) for v in values]
other_column_aggregations = [F.sum('col4').alias('col4_sum')]
aggregated = count_of_distinct_values + other_column_aggregations
result = df.groupBy('col1', 'col2').agg(*aggregated)
You then get the following dataframe:
+----+----+-------+-------+-------+--------+
|col1|col2|count_a|count_b|count_c|col4_sum|
+----+----+-------+-------+-------+--------+
|1 |1 |1 |1 |0 |3 |
|1 |2 |0 |0 |1 |1 |
|2 |1 |1 |1 |0 |7 |
+----+----+-------+-------+-------+--------+
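For reference, both snippets above can be tried end to end by building the question's dataframe first (a small setup sketch, assuming an active SparkSession named spark):
df = spark.createDataFrame(
    [(1, 1, 'a', 2),
     (1, 1, 'b', 1),
     (1, 2, 'c', 1),
     (2, 1, 'a', 3),
     (2, 1, 'b', 4)],
    ['col1', 'col2', 'col3', 'col4']
)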

Related

Get Column which contains a specific value

Using pyspark, I have a data frame like this:
+-----+-----+----+
|col1 |col2 |col3|
+-----+-----+----+
|1    |[3,7]|5   |
|hello|4    |666 |
|4    |world|4   |
+-----+-----+----+
Now I want to get the column name where the number 666 is included.
So the result should be "col3".
Thanks
Edits
Added values other than int; the great answers focused only on int values. Sorry.
deleted: while we are at it, I guess the index can also be retrieved easily.
from pyspark.sql import functions as F
from pyspark.sql.functions import col, expr

(df.withColumn('result', F.array(*[F.array(F.lit(x).alias('y'), col(x).alias('y')) for x in df.columns]))  # create an array of [column name, value] pairs
   .withColumn('result', expr("transform(filter(result, (c, i) -> (c[1] == 666)), (c, i) -> c[0])"))  # keep the pairs whose value is 666 and extract the column name
   .show(truncate=False))
+----+----+----+------+
|col1|col2|col3|result|
+----+----+----+------+
|1   |3   |5   |[]    |
|2   |4   |666 |[col3]|
|4   |6   |4   |[]    |
+----+----+----+------+
Here's an approach that creates an array of structs from the columns and then filters it.
from pyspark.sql import functions as func

# data_sdf is the answer's sample input dataframe (columns c1, c2, c3, as in the output below)
data_sdf. \
    withColumn('allcols',
               func.array(*[func.struct(func.lit(c).alias('name'), func.col(c).cast('string').alias('value'))
                            for c in data_sdf.columns])
               ). \
    withColumn('cols_w_666_arr',
               func.expr('transform(filter(allcols, x -> x.value = "666"), c -> c.name)')
               ). \
    drop('allcols'). \
    show(truncate=False)
# +---+---+---+--------------+
# |c1 |c2 |c3 |cols_w_666_arr|
# +---+---+---+--------------+
# |1 |3 |5 |[] |
# |2 |4 |666|[c3] |
# |4 |666|4 |[c2] |
# |666|666|4 |[c1, c2] |
# +---+---+---+--------------+
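Since the question asks for a single column name rather than an array, one possible follow-up (not part of the answers above; result_sdf is a hypothetical variable holding the chained result before the final .show()) is to take the first element of the array, which is null for rows with no match under default (non-ANSI) settings:
# hypothetical follow-up: pull the first matching column name out of the array
result_sdf.withColumn('col_w_666', func.element_at('cols_w_666_arr', 1)). \
    show(truncate=False)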

Add rows to a PySpark df based on a condition

I want to "duplicate" the rows the same number of times that the difference between two dates in the df. I have this dataframe:
So I need to explode the number of rows of the df to get this:
Get all dates between D1 and D2 using sequence and then explode the dates:
from pyspark.sql import functions as F

df = ...
df.withColumn("D1", F.explode(F.expr("sequence(D1, D2)"))) \
    .drop("D2").show(truncate=False)
Output:
+---+---+---+----------+
|A |B |C |D1 |
+---+---+---+----------+
|1 |2 |3 |2019-01-01|
|1 |2 |3 |2019-01-02|
|1 |2 |3 |2019-01-03|
+---+---+---+----------+
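The question's input dataframe is not reproduced above; a minimal input consistent with the output shown (an assumption, for illustration only) would be a single row with D1 = 2019-01-01 and D2 = 2019-01-03:
# hypothetical input, inferred from the output above
df = spark.createDataFrame(
    [(1, 2, 3, "2019-01-01", "2019-01-03")],
    ["A", "B", "C", "D1", "D2"]
).withColumn("D1", F.col("D1").cast("date")) \
 .withColumn("D2", F.col("D2").cast("date"))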

spark dataframe filter operation

I have a Spark dataframe and a filter string to apply. The filter selects only some rows, but I would like to know the reason why the remaining rows were excluded.
Example:
DataFrame columns: customer_id|col_a|col_b|col_c|col_d
Filter string: col_a > 0 & col_b > 4 & col_c < 0 & col_d=0
etc...
reason_for_exclusion can be any string or letter, as long as it says why a particular row was excluded.
I could split the filter string and apply each filter separately, but my filter string is huge and that would be inefficient, so I'm just checking whether there is a better way to do this operation?
Thanks
You'll have to check each condition within the filter expression, which can be expensive compared to a simple filter.
I would suggest displaying the same reason (the full filter expression) for all excluded rows, since each of them fails at least one condition in that expression. It's not pretty, but I'd prefer this as it's efficient, especially when you have to handle very large DataFrames.
from pyspark.sql.functions import expr, lit, when

data = [(1, 1, 5, -3, 0), (2, 0, 10, -1, 0), (3, 0, 10, -4, 1)]
df = spark.createDataFrame(data, ["customer_id", "col_a", "col_b", "col_c", "col_d"])

filter_expr = "col_a > 0 AND col_b > 4 AND col_c < 0 AND col_d=0"

filtered_df = df.withColumn("reason_for_exclusion",
                            when(~expr(filter_expr), lit(filter_expr))
                            .otherwise(lit(None))
                            )
filtered_df.show(truncate=False)
Output:
+-----------+-----+-----+-----+-----+-------------------------------------------------+
|customer_id|col_a|col_b|col_c|col_d|reason_for_exclusion |
+-----------+-----+-----+-----+-----+-------------------------------------------------+
|1 |1 |5 |-3 |0 |null |
|2 |0 |10 |-1 |0 |col_a > 0 AND col_b > 4 AND col_c < 0 AND col_d=0|
|3 |0 |10 |-4 |1 |col_a > 0 AND col_b > 4 AND col_c < 0 AND col_d=0|
+-----------+-----+-----+-----+-----+-------------------------------------------------+
EDIT:
Now, if you really want to display only the conditions that failed, you can turn each condition into a separate column and use a DataFrame select to do the calculation. Then you'll have to check which columns evaluated to False to know which condition failed.
You could name these columns <PREFIX>_<condition> so that you can identify them easily later. Here is a complete example:
filter_expr = "col_a > 0 AND col_b > 4 AND col_c < 0 AND col_d=0"
COLUMN_FILTER_PREFIX = "filter_validation_"
original_columns = [col(c) for c in df.columns]
# create column for each condition in filter expression
condition_columns = [expr(f).alias(COLUMN_FILTER_PREFIX + f) for f in filter_expr.split("AND")]
# evaluate condition to True/False and persist the DF with calculated columns
filtered_df = df.select(original_columns + condition_columns)
filtered_df = filtered_df.persist(StorageLevel.MEMORY_AND_DISK)
# get back columns we calculated for filter
filter_col_names = [c for c in filtered_df.columns if COLUMN_FILTER_PREFIX in c]
filter_columns = list()
for c in filter_col_names:
filter_columns.append(
when(~col(f"`{c}`"),
lit(f"{c.replace(COLUMN_FILTER_PREFIX, '')}")
)
)
array_reason_filter = array_except(array(*filter_columns), array(lit(None)))
df_with_filter_reason = filtered_df.withColumn("reason_for_exclusion", array_reason_filter)
df_with_filter_reason.select(*original_columns, col("reason_for_exclusion")).show(truncate=False)
# output
+-----------+-----+-----+-----+-----+----------------------+
|customer_id|col_a|col_b|col_c|col_d|reason_for_exclusion |
+-----------+-----+-----+-----+-----+----------------------+
|1 |1 |5 |-3 |0 |[] |
|2 |0 |10 |-1 |0 |[col_a > 0 ] |
|3 |0 |10 |-4 |1 |[col_a > 0 , col_d=0]|
+-----------+-----+-----+-----+-----+----------------------+
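Since the example persists the intermediate DataFrame, it may be worth releasing it once the exclusion reasons have been computed (a minor housekeeping step, not part of the original answer):
# free the cached intermediate DataFrame when it is no longer needed
filtered_df.unpersist()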

finding non-overlapping windows in a pyspark dataframe

Suppose I have a pyspark dataframe with an id column and a time column (t) in seconds. For each id I'd like to group the rows so that each group has all entries that are within 5 seconds after the start time for that group. So for instance, if the table is:
+---+--+
|id |t |
+---+--+
|1 |0 |
|1 |1 |
|1 |3 |
|1 |8 |
|1 |14|
|1 |18|
|2 |0 |
|2 |20|
|2 |21|
|2 |50|
+---+--+
Then the result should be:
+---+--+---------+-------------+-------+
|id |t |subgroup |window_start |offset |
+---+--+---------+-------------+-------+
|1 |0 |1 |0 |0 |
|1 |1 |1 |0 |1 |
|1 |3 |1 |0 |3 |
|1 |8 |2 |8 |0 |
|1 |14|3 |14 |0 |
|1 |18|3 |14 |4 |
|2 |0 |1 |0 |0 |
|2 |20|2 |20 |0 |
|2 |21|2 |20 |1 |
|2 |50|3 |50 |0 |
+---+--+---------+-------------+-------+
I don't need the subgroup numbers to be consecutive. I'm OK with solutions using a custom UDAF in Scala as long as they are efficient.
Computing (cumsum(t)-(cumsum(t)%5))/5 within each group can be used to identify the first window, but not the ones beyond that. Essentially the problem is that after the first window is found, the cumulative sum needs to reset to 0. I could operate recursively using this cumulative sum approach, but that is too inefficient on a large dataset.
The following works and is more efficient than recursively calling cumsum, but it is still so slow as to be useless on large dataframes.
import numpy
import pyspark.sql.functions
import pyspark.sql.types

d = [[int(x[0]), float(x[1])] for x in [[1,0],[1,1],[1,4],[1,7],[1,14],[1,18],[2,5],[2,20],[2,21],[3,0],[3,1],[3,1.5],[3,2],[3,3.5],[3,4],[3,6],[3,6.5],[3,7],[3,11],[3,14],[3,18],[3,20],[3,24],[4,0],[4,1],[4,2],[4,6],[4,7]]]
schema = pyspark.sql.types.StructType(
    [
        pyspark.sql.types.StructField('id', pyspark.sql.types.LongType(), False),
        pyspark.sql.types.StructField('t', pyspark.sql.types.DoubleType(), False)
    ]
)
df = spark.createDataFrame(
    [pyspark.sql.Row(*x) for x in d],
    schema
)

def getSubgroup(ts):
    result = []
    total = 0
    ts = sorted(ts)
    tdiffs = numpy.array(ts)
    tdiffs = tdiffs[1:] - tdiffs[:-1]
    tdiffs = numpy.concatenate([[0], tdiffs])
    subgroup = 0
    for k in range(len(tdiffs)):
        t = ts[k]
        tdiff = tdiffs[k]
        total = total + tdiff
        if total >= 5:
            total = 0
            subgroup += 1
        result.append([t, float(subgroup)])
    return result

getSubgroupUDF = pyspark.sql.functions.udf(getSubgroup, pyspark.sql.types.ArrayType(pyspark.sql.types.ArrayType(pyspark.sql.types.DoubleType())))

subgroups = df.select('id', 't').distinct().groupBy(
    'id'
).agg(
    pyspark.sql.functions.collect_list('t').alias('ts')
).withColumn(
    't_and_subgroup',
    pyspark.sql.functions.explode(getSubgroupUDF('ts'))
).withColumn(
    't',
    pyspark.sql.functions.col('t_and_subgroup').getItem(0)
).withColumn(
    'subgroup',
    pyspark.sql.functions.col('t_and_subgroup').getItem(1).cast(pyspark.sql.types.IntegerType())
).drop(
    't_and_subgroup', 'ts'
)
df = df.join(
    subgroups,
    on=['id', 't'],
    how='inner'
)
df.orderBy(
    pyspark.sql.functions.asc('id'), pyspark.sql.functions.asc('t')
).show()
The subgroup column is equivalent to partitioning by id, window_start, so maybe you don't need to create it.
To create window_start, I think this does the job:
.withColumn("window_start", min("t").over(Window.partitionBy("id").orderBy(asc("t")).rangeBetween(0, 5)))
I'm not 100% sure about the behavior of rangeBetween.
To create offset, it's just .withColumn("offset", col("t") - col("window_start"))
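Putting the two pieces together (a sketch only, with the imports they need; the caveat above about rangeBetween still applies):
from pyspark.sql import functions as F, Window

w = Window.partitionBy("id").orderBy(F.asc("t")).rangeBetween(0, 5)

result = df \
    .withColumn("window_start", F.min("t").over(w)) \
    .withColumn("offset", F.col("t") - F.col("window_start"))
result.show()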
Let me know how it goes

Calculating sum, count of multiple top K values in Spark

I have an input dataframe of the format
+----+------+-----+----------+
|name|values|score|row_number|
+----+------+-----+----------+
|A   |1000  |0    |1         |
|B   |947   |0    |2         |
|C   |923   |1    |3         |
|D   |900   |2    |4         |
|E   |850   |3    |5         |
|F   |800   |1    |6         |
+----+------+-----+----------+
I need to get sum(values) when score > 0 and row_number < K, i.e. the SUM of all values with score > 0 among the top K rows of the dataframe.
I am able to achieve this by running the following query for top 100 values
val top_100_data = df.select(
  count(when(col("score") > 0 and col("row_number") <= 100, col("values"))).alias("count_100"),
  sum(when(col("score") > 0 and col("row_number") <= 100, col("values"))).alias("sum_filtered_100"),
  sum(when(col("row_number") <= 100, col("values"))).alias("total_sum_100")
)
However, I need to fetch data for the top 100, 200, 300, ..., 2500, meaning I would need to run this query 25 times and finally union 25 dataframes.
I'm new to Spark and still figuring lots of things out. What would be the best approach to solve this problem?
Thanks!!
You can create an Array of limits as
val topFilters = Array(100, 200, 300) // you can add more
Then you can loop through the topFilters array and create the dataframe you require. I suggest you use join rather than union, as join will give you separate columns and union will give you separate rows. You can do the following.
Given your dataframe as
+----+------+-----+----------+
|name|values|score|row_number|
+----+------+-----+----------+
|A |1000 |0 |1 |
|B |947 |0 |2 |
|C |923 |1 |3 |
|D |900 |2 |200 |
|E |850 |3 |150 |
|F |800 |1 |250 |
+----+------+-----+----------+
You can do it by using the topFilters array defined above:
import sqlContext.implicits._
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

var finalDF: DataFrame = Seq("1").toDF("rowNum")
for (k <- topFilters) {
  val top_100_data = df.select(lit("1").as("rowNum"), sum(when(col("score") > 0 && col("row_number") < k, col("values"))).alias(s"total_sum_$k"))
  finalDF = finalDF.join(top_100_data, Seq("rowNum"))
}
finalDF.show(false)
finalDF.show(false)
Which should give you the final dataframe as
+------+-------------+-------------+-------------+
|rowNum|total_sum_100|total_sum_200|total_sum_300|
+------+-------------+-------------+-------------+
|1 |923 |1773 |3473 |
+------+-------------+-------------+-------------+
You can do the same for your 25 limits that you have.
If you intend to use union, then the idea is similar to above.
I hope the answer is helpful
Updated
If you require union, then you can apply the following logic with the same limit array defined above:
var finalDF: DataFrame = Seq((0, 0, 0, 0)).toDF("limit", "count", "sum_filtered", "total_sum")
for (k <- topFilters) {
  val top_100_data = df.select(lit(k).as("limit"),
    count(when(col("score") > 0 and col("row_number") <= k, col("values"))).alias("count"),
    sum(when(col("score") > 0 and col("row_number") <= k, col("values"))).alias("sum_filtered"),
    sum(when(col("row_number") <= k, col("values"))).alias("total_sum"))
  finalDF = finalDF.union(top_100_data)
}
finalDF.filter(col("limit") =!= 0).show(false)
which should give you
+-----+-----+------------+---------+
|limit|count|sum_filtered|total_sum|
+-----+-----+------------+---------+
|100 |1 |923 |2870 |
|200 |3 |2673 |4620 |
|300 |4 |3473 |5420 |
+-----+-----+------------+---------+
