How to aggregate based on condition of a column in pandas? - python-3.x

I have a dataframe which looks like this:
df:
id|flag|fee
1 |0 |5
1 |0 |5
1 |1 |5
1 |1 |5
DESIRED df_aggregated:
id|flag|fee
1 |2 |10
The aggregation should count the flags per id, and the fee should be summed per id only when the flag is set to 1:
df1 = df.groupby(['id'])["flag"].apply(lambda x: x.astype(int).count()).reset_index()
df2 = df.groupby(['id'])["fee"].apply(lambda x: x.astype(int).sum()).reset_index()
df_aggregated = pd.merge(df1, df2, on='id', how='inner')
ACTUAL df_aggregated:
id|flag|fee
1 |2 |20
My fee aggregation is NOT correct/complete because it does not account for the condition of only summing the fee IF THE FLAG=1. Instead it sums up all fees regardless of the flag. How do I change my code to account for this condition? It should look like the DESIRED df_aggregated table.
Thanks!

You need to check for the condition flag==1. To do so, you can multiply fee by df.flag.eq(1):
(df.assign(fee=df.fee*df.flag.eq(1))
   .groupby('id', as_index=False)
   .agg({'flag':'nunique', 'fee':'sum'})
)
Output:
id flag fee
0 1 2 10
If you want both to count/sum only where flag==1, you can do a query first:
(df.query('flag==1')
   .groupby('id', as_index=False)
   .agg({'flag':'count', 'fee':'sum'})
)
which incidentally gives the same output as above.
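For reference, here is a minimal sketch of the same idea as a single groupby (the helper column flagged_fee is made up for the example, and named aggregation needs pandas 0.25+): zero out fee wherever the flag is not 1, then sum:
import pandas as pd

df = pd.DataFrame({"id": [1, 1, 1, 1], "flag": [0, 0, 1, 1], "fee": [5, 5, 5, 5]})

out = (
    df.assign(flagged_fee=df["fee"].where(df["flag"].eq(1), 0))  # keep fee only where flag == 1
      .groupby("id", as_index=False)
      .agg(flag=("flag", "sum"), fee=("flagged_fee", "sum"))     # sum of 1s counts the flagged rows
)
print(out)
#    id  flag  fee
# 0   1     2   10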

Related

pyspark get value counts within a groupby

I have a dataframe df with a few columns. I want to group by one (or more) columns, and for every group I want the count of values of another column (or columns).
Here's the df:
col1 col2 col3 col4
1 1 a 2
1 1 b 1
1 2 c 1
2 1 a 3
2 1 b 4
I want to group by 'col1' and 'col2', and then for every group get the count of each unique value in a column and the sum/mean/min/max of another column. I also want to keep the grouped columns. The result should be:
col1 col2 count_a count_b count_c col4_sum
1 1 1 1 0 3
1 2 0 0 1 1
2 1 1 1 0 7
How do I achieve this?
You have two solutions.
First, you can use pivot on col3 to get your count of unique values, and then join your pivoted dataframe with an aggregated dataframe that computes the sum/mean/min/max of the other column.
Your code would be as follows:
from pyspark.sql import functions as F

result = df \
    .groupBy('col1', 'col2') \
    .pivot('col3') \
    .agg(F.count('col3')) \
    .fillna(0) \
    .join(
        df.groupby('col1', 'col2').agg(F.sum('col4').alias('col4_sum')),
        ['col1', 'col2']
    )
And with your input dataframe, you will get:
+----+----+---+---+---+--------+
|col1|col2|a |b |c |col4_sum|
+----+----+---+---+---+--------+
|1 |1 |1 |1 |0 |3 |
|1 |2 |0 |0 |1 |1 |
|2 |1 |1 |1 |0 |7 |
+----+----+---+---+---+--------+
However, you can't choose the names of the columns extracted from pivot; each one is named after its value.
If you really want to choose the names of the columns, you can retrieve all distinct values first and then build your aggregation columns from each of them, as follows:
from pyspark.sql import functions as F

values = [row.col3 for row in df.select("col3").distinct().collect()]
count_of_distinct_values = [
    F.sum((F.col('col3') == i).cast('integer')).alias('count_' + i) for i in values
]
other_column_aggregations = [F.sum('col4').alias('col4_sum')]
aggregated = count_of_distinct_values + other_column_aggregations
result = df.groupBy('col1', 'col2').agg(*aggregated)
You then get the following dataframe:
+----+----+-------+-------+-------+--------+
|col1|col2|count_a|count_b|count_c|col4_sum|
+----+----+-------+-------+-------+--------+
|1 |1 |1 |1 |0 |3 |
|1 |2 |0 |0 |1 |1 |
|2 |1 |1 |1 |0 |7 |
+----+----+-------+-------+-------+--------+
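For reference, a minimal sketch of how the sample df used above could be built (column names taken from the question, Spark session creation assumed):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, 1, "a", 2), (1, 1, "b", 1), (1, 2, "c", 1), (2, 1, "a", 3), (2, 1, "b", 4)],
    ["col1", "col2", "col3", "col4"],
)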

Spark, return multiple rows on group?

So, I have a Kafka topic containing the following data, and I'm working on a proof of concept to see whether we can achieve what we're trying to do. I was previously trying to solve it within Kafka, but it seems that Kafka wasn't the right tool, so I'm looking at Spark now :)
The data in its basic form looks like this:
+--+------------+-------+---------+
|id|serialNumber|source |company |
+--+------------+-------+---------+
|1 |123ABC |system1|Acme |
|2 |3285624 |system1|Ajax |
|3 |CDE567 |system1|Emca |
|4 |XX |system2|Ajax |
|5 |3285624 |system2|Ajax&Sons|
|6 |0147852 |system2|Ajax |
|7 |123ABC |system2|Acme |
|8 |CDE567 |system2|Xaja |
+--+------------+-------+---------+
The main grouping column is serialNumber, and the result should be that ids 1 and 7 match because there is a full match on the company. Ids 2 and 5 should match because the company in id 2 is a partial match of the company in id 5. Ids 3 and 8 should not match because the companies don't match.
I expect the end result to be something like this. Note that the sources are not fixed to just one or two; in the future there will be more sources.
+------+-----+------------+-----------------+---------------+
|uuid |id |serialNumber|source |company |
+------+-----+------------+-----------------+---------------+
|<uuid>|[1,7]|123ABC |[system1,system2]|[Acme] |
|<uuid>|[2,5]|3285624 |[system1,system2]|[Ajax,Ajax&Sons]|
|<uuid>|[3] |CDE567 |[system1] |[Emca] |
|<uuid>|[4] |XX |[system2] |[Ajax] |
|<uuid>|[6] |0147852 |[system2] |[Ajax] |
|<uuid>|[8] |CDE567 |[system2] |[Xaja] |
+------+-----+------------+-----------------+---------------+
I was looking at groupByKey().mapGroups() but having problems finding examples. Can mapGroups() return more than one row?
You can simply groupBy on the serialNumber column and collect_list all the other columns.
Code:
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.functions._

val ds = Seq((1, "123ABC", "system1", "Acme"),
             (7, "123ABC", "system2", "Acme"))
  .toDF("id", "serialNumber", "source", "company")

ds.groupBy("serialNumber")
  .agg(
    collect_list("id").alias("id"),
    collect_list("source").alias("source"),
    collect_list("company").alias("company")
  )
  .show(false)
Output:
+------------+------+------------------+------------+
|serialNumber|id |source |company |
+------------+------+------------------+------------+
|123ABC |[1, 7]|[system1, system2]|[Acme, Acme]|
+------------+------+------------------+------------+
If you don't want duplicate values, use collect_set:
ds.groupBy("serialNumber")
  .agg(
    collect_list("id").alias("id"),
    collect_list("source").alias("source"),
    collect_set("company").alias("company")
  )
  .show(false)
Output with collect_set on company column:
+------------+------+------------------+-------+
|serialNumber|id |source |company|
+------------+------+------------------+-------+
|123ABC |[1, 7]|[system1, system2]|[Acme] |
+------------+------+------------------+-------+
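If you are working in PySpark rather than Scala, a rough sketch of the same aggregation looks like this, where df is the full input dataframe from the question; the uuid column from your expected output is added with the built-in uuid() SQL function (assuming Spark 2.3+, where it is available):
from pyspark.sql import functions as F

result = (
    df.groupBy("serialNumber")
      .agg(
          F.collect_list("id").alias("id"),
          F.collect_list("source").alias("source"),
          F.collect_set("company").alias("company"),
      )
      .withColumn("uuid", F.expr("uuid()"))  # one random UUID per grouped row
)
result.show(truncate=False)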

Check value from Spark DataFrame column and do transformations

I have a dataframe consisting of person, transaction_id & is_successful. The dataframe contains duplicate values of person with different transaction_ids, and is_successful is True/False for each transaction.
I would like to derive a new dataframe with one record per person, containing that person's latest transaction_id and True only if any of their transactions were successful.
val input_df = sc.parallelize(Seq(
  (1, 1, "True"), (1, 2, "False"),
  (2, 1, "False"), (2, 2, "False"), (2, 3, "True"),
  (3, 1, "False"), (3, 2, "False"), (3, 3, "False")
)).toDF("person", "transaction_id", "is_successful")

input_df: org.apache.spark.sql.DataFrame = [person: int, transaction_id: int ... 1 more field]
input_df.show(false)
+------+--------------+-------------+
|person|transaction_id|is_successful|
+------+--------------+-------------+
|1 |1 |True |
|1 |2 |False |
|2 |1 |False |
|2 |2 |False |
|2 |3 |True |
|3 |1 |False |
|3 |2 |False |
|3 |3 |False |
+------+--------------+-------------+
Expected Df:
+------+--------------+-------------+
|person|transaction_id|is_successful|
+------+--------------+-------------+
|1 |2 |True |
|2 |3 |True |
|3 |3 |False |
+------+--------------+-------------+
How can we derive the dataframe like above?
What you can do is the below in Spark SQL:
select person,max(transaction_id) as transaction_id,max(is_successful) as is_successful from <table_name> group by person
Leave the complex work to the max operator. Under the max operation, 'True' comes over 'False' (string comparison). So if one of your persons has three False and one True, the max of those would be True.
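The reason this works is plain string comparison: 'True' sorts after 'False', so max keeps 'True' whenever at least one row has it. A quick sanity check (assuming is_successful really holds the literal strings 'True'/'False'):
print(max(["False", "False", "True"]))  # 'True'
print(max(["False", "False"]))          # 'False'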
You may achieve this by grouping your dataframe on person and finding the max transaction_id and max is_successful.
I've included an example below of how this may be achieved using Spark SQL.
First, create a temporary view of your dataframe so it can be accessed with Spark SQL, then run the following SQL statement:
input_df.createOrReplaceTempView("input_df");
val result_df = sparkSession.sql("<insert sql below here>");
The SQL statement groups the data for each person, then uses max to determine the last transaction id, and a combination of max (sum could be used with the same logic) and case expressions to derive the is_successful value. The case expression is nested: I've converted True to a numeric value of 1 and False to 0 to leverage a numeric comparison, and this sits within an outer case expression which checks whether the max value is > 0 (i.e. any transaction was successful) before printing True/False.
SELECT
person,
MAX(transaction_id) as transaction_id,
CASE
WHEN MAX(
CASE
WHEN is_successful = 'True' THEN 1
ELSE 0
END
) > 0 THEN 'True'
ELSE 'False'
END as is_successful
FROM
input_df
GROUP BY
person
Here is the DataFrame version of @ggordon's SQL answer.
input_df.groupBy("person")
  .agg(
    max("transaction_id").as("transaction_id"),
    when(max(when('is_successful === "True", 1).otherwise(0)) > 0, "True")
      .otherwise("False").as("is_successful")
  )

spark dataframe filter operation

I have a Spark dataframe and a filter string to apply. The filter only selects some rows, but I would like to know the reason why the other rows were not selected.
Example:
DataFrame columns: customer_id|col_a|col_b|col_c|col_d
Filter string: col_a > 0 & col_b > 4 & col_c < 0 & col_d=0
etc...
reason_for_exclusion can be any string or letter, as long as it says why a particular row was excluded.
I could split the filter string and apply each filter separately, but I have a huge filter string and it would be inefficient, so I'm just checking whether there is a better way to do this operation?
Thanks
You'll have to check each condition within the filter expression, which can be expensive compared with the simple operation of filtering.
I would suggest displaying the same reason for all filtered-out rows, since each of them fails at least one condition in that expression. It's not pretty, but I'd prefer this as it's efficient, especially when you have to handle very large DataFrames.
from pyspark.sql.functions import expr, lit, when

data = [(1, 1, 5, -3, 0), (2, 0, 10, -1, 0), (3, 0, 10, -4, 1)]
df = spark.createDataFrame(data, ["customer_id", "col_a", "col_b", "col_c", "col_d"])

filter_expr = "col_a > 0 AND col_b > 4 AND col_c < 0 AND col_d=0"

filtered_df = df.withColumn(
    "reason_for_exclusion",
    when(~expr(filter_expr), lit(filter_expr)).otherwise(lit(None))
)
filtered_df.show(truncate=False)
Output:
+-----------+-----+-----+-----+-----+-------------------------------------------------+
|customer_id|col_a|col_b|col_c|col_d|reason_for_exclusion |
+-----------+-----+-----+-----+-----+-------------------------------------------------+
|1 |1 |5 |-3 |0 |null |
|2 |0 |10 |-1 |0 |col_a > 0 AND col_b > 4 AND col_c < 0 AND col_d=0|
|3 |0 |10 |-4 |1 |col_a > 0 AND col_b > 4 AND col_c < 0 AND col_d=0|
+-----------+-----+-----+-----+-----+-------------------------------------------------+
EDIT:
Now, if you really want to display only the conditions which failed, you can turn each condition into a separate column and use a DataFrame select to do the calculation. Then you'll have to check which of those columns evaluated to False to know which condition failed.
You could name these columns <PREFIX>_<condition> so that you can identify them easily later. Here is a complete example:
from pyspark import StorageLevel
from pyspark.sql.functions import array, array_except, col, expr, lit, when

filter_expr = "col_a > 0 AND col_b > 4 AND col_c < 0 AND col_d=0"
COLUMN_FILTER_PREFIX = "filter_validation_"
original_columns = [col(c) for c in df.columns]

# create a column for each condition in the filter expression
condition_columns = [expr(f).alias(COLUMN_FILTER_PREFIX + f) for f in filter_expr.split("AND")]

# evaluate each condition to True/False and persist the DF with the calculated columns
filtered_df = df.select(original_columns + condition_columns)
filtered_df = filtered_df.persist(StorageLevel.MEMORY_AND_DISK)

# get back the columns we calculated for the filter
filter_col_names = [c for c in filtered_df.columns if COLUMN_FILTER_PREFIX in c]
filter_columns = list()
for c in filter_col_names:
    filter_columns.append(
        when(~col(f"`{c}`"),
             lit(f"{c.replace(COLUMN_FILTER_PREFIX, '')}"))
    )

array_reason_filter = array_except(array(*filter_columns), array(lit(None)))
df_with_filter_reason = filtered_df.withColumn("reason_for_exclusion", array_reason_filter)

df_with_filter_reason.select(*original_columns, col("reason_for_exclusion")).show(truncate=False)
# output
+-----------+-----+-----+-----+-----+----------------------+
|customer_id|col_a|col_b|col_c|col_d|reason_for_exclusion |
+-----------+-----+-----+-----+-----+----------------------+
|1 |1 |5 |-3 |0 |[] |
|2 |0 |10 |-1 |0 |[col_a > 0 ] |
|3 |0 |10 |-4 |1 |[col_a > 0 , col_d=0]|
+-----------+-----+-----+-----+-----+----------------------+
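One optional follow-up, reusing the names from the code above: once the result has been materialised you can keep just the original columns plus the reason and release the cached intermediate:
final_df = df_with_filter_reason.select(*original_columns, col("reason_for_exclusion"))
# release the DataFrame persisted earlier once it is no longer needed
filtered_df.unpersist()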

How to add column and records on a dataset given a condition

I'm working on a program that marks data as OutOfRange based on the values present in certain columns.
I have three columns: Age, Height, and Weight. I want to create a fourth column called OutOfRange and assign it a value of 0 (false) or 1 (true) if the values in those three columns exceed a specific threshold.
If age is lower than 18 or higher than 60, that row will be assigned a value of 1 (0 otherwise). If height is lower than 5, that row will be assigned a value of 1 (0 otherwise), and so on.
Is it possible to create a column and then add/overwrite values to that column? It would be awesome if I could do that with Spark. I know SQL, so if there is anything I can do with the dataset.SQL() function please let me know.
Given a dataframe as
+---+------+------+
|Age|Height|Weight|
+---+------+------+
|20 |3 |70 |
|17 |6 |80 |
|30 |5 |60 |
|61 |7 |90 |
+---+------+------+
You can use the when function to apply the logic explained in the question:
import org.apache.spark.sql.functions._

df.withColumn("OutOfRange",
  when(col("Age") < 18 || col("Age") > 60 || col("Height") < 5, 1).otherwise(0))
which would result in the following dataframe:
+---+------+------+----------+
|Age|Height|Weight|OutOfRange|
+---+------+------+----------+
|20 |3 |70 |1 |
|17 |6 |80 |1 |
|30 |5 |60 |0 |
|61 |7 |90 |1 |
+---+------+------+----------+
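If you happen to be in PySpark rather than Scala, the equivalent is roughly the following sketch (same column names; extend the condition with the Weight rule as needed):
from pyspark.sql import functions as F

df = df.withColumn(
    "OutOfRange",
    F.when((F.col("Age") < 18) | (F.col("Age") > 60) | (F.col("Height") < 5), 1).otherwise(0),
)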
Is it possible to create a column and then add/overwrite values to that column? It would be awesome if I could do that with Spark. I know SQL, so if there is anything I can do with the dataset.SQL() function please let me know.
This is not possible without recreating the Dataset altogether, since Datasets are inherently immutable.
However, you can save the Dataset as a Hive table, which will allow you to do what you want to do. Saving the Dataset as a Hive table will write the contents of your Dataset to disk under the default spark-warehouse directory.
df.write.mode("overwrite").saveAsTable("my_table")

// Add a row
spark.sql("insert into my_table (Age, Height, Weight, OutOfRange) values (20, 30, 70, 1)")

// Update a row
spark.sql("update my_table set OutOfRange = 1 where Age > 30")
....
Hive support must be enabled for Spark at the time of instantiation in order to do this.
