spark dataframe filter operation - apache-spark

I have a Spark DataFrame and a filter string to apply. The filter only selects some rows, but I would like to know the reason the other rows were not selected.
Example:
DataFrame columns: customer_id|col_a|col_b|col_c|col_d
Filter string: col_a > 0 & col_b > 4 & col_c < 0 & col_d=0
etc...
reason_for_exclusion can be any string or letter, as long as it says why a particular row was excluded.
I could split the filter string and apply each filter separately, but my filter string is huge and that would be inefficient, so I'm just checking: is there a better way to do this operation?
Thanks

You'll have to check each condition within the filter expression, which can be expensive compared to a simple filtering operation.
I would suggest displaying the same reason for all filtered-out rows, since each such row fails at least one condition in that expression. It's not pretty, but I'd prefer this, as it's efficient, especially when you have to handle very large DataFrames.
from pyspark.sql.functions import expr, lit, when

data = [(1, 1, 5, -3, 0), (2, 0, 10, -1, 0), (3, 0, 10, -4, 1)]
df = spark.createDataFrame(data, ["customer_id", "col_a", "col_b", "col_c", "col_d"])

filter_expr = "col_a > 0 AND col_b > 4 AND col_c < 0 AND col_d=0"

# rows that fail the filter get the whole expression as the reason
filtered_df = df.withColumn(
    "reason_for_exclusion",
    when(~expr(filter_expr), lit(filter_expr)).otherwise(lit(None))
)
filtered_df.show(truncate=False)
Output:
+-----------+-----+-----+-----+-----+-------------------------------------------------+
|customer_id|col_a|col_b|col_c|col_d|reason_for_exclusion |
+-----------+-----+-----+-----+-----+-------------------------------------------------+
|1 |1 |5 |-3 |0 |null |
|2 |0 |10 |-1 |0 |col_a > 0 AND col_b > 4 AND col_c < 0 AND col_d=0|
|3 |0 |10 |-4 |1 |col_a > 0 AND col_b > 4 AND col_c < 0 AND col_d=0|
+-----------+-----+-----+-----+-----+-------------------------------------------------+
EDIT:
Now, if you really want to display only the conditions that failed, you can turn each condition into a separate column and use a DataFrame select to do the calculation. Then you'll have to check which columns evaluated to False to know which condition failed.
You could name these columns <PREFIX>_<condition> so that you can identify them easily later. Here is a complete example:
filter_expr = "col_a > 0 AND col_b > 4 AND col_c < 0 AND col_d=0"
COLUMN_FILTER_PREFIX = "filter_validation_"
original_columns = [col(c) for c in df.columns]
# create column for each condition in filter expression
condition_columns = [expr(f).alias(COLUMN_FILTER_PREFIX + f) for f in filter_expr.split("AND")]
# evaluate condition to True/False and persist the DF with calculated columns
filtered_df = df.select(original_columns + condition_columns)
filtered_df = filtered_df.persist(StorageLevel.MEMORY_AND_DISK)
# get back columns we calculated for filter
filter_col_names = [c for c in filtered_df.columns if COLUMN_FILTER_PREFIX in c]
filter_columns = list()
for c in filter_col_names:
filter_columns.append(
when(~col(f"`{c}`"),
lit(f"{c.replace(COLUMN_FILTER_PREFIX, '')}")
)
)
array_reason_filter = array_except(array(*filter_columns), array(lit(None)))
df_with_filter_reason = filtered_df.withColumn("reason_for_exclusion", array_reason_filter)
df_with_filter_reason.select(*original_columns, col("reason_for_exclusion")).show(truncate=False)
# output
+-----------+-----+-----+-----+-----+----------------------+
|customer_id|col_a|col_b|col_c|col_d|reason_for_exclusion |
+-----------+-----+-----+-----+-----+----------------------+
|1 |1 |5 |-3 |0 |[] |
|2 |0 |10 |-1 |0 |[col_a > 0 ] |
|3 |0 |10 |-4 |1 |[col_a > 0 , col_d=0]|
+-----------+-----+-----+-----+-----+----------------------+
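If you control how the filter is built, a variation on the same idea (my sketch, not part of the original answer) is to keep each condition as a separate string in a Python list instead of splitting one big string, which avoids mis-splitting on values that happen to contain "AND":
from pyspark.sql.functions import array, array_except, expr, lit, when

# assumption: the conditions are maintained as a list rather than one string
conditions = ["col_a > 0", "col_b > 4", "col_c < 0", "col_d = 0"]

# one entry per failed condition; passing conditions yield null and are dropped
reason_cols = [when(~expr(c), lit(c)) for c in conditions]
df_with_reasons = df.withColumn(
    "reason_for_exclusion",
    array_except(array(*reason_cols), array(lit(None)))
)
df_with_reasons.show(truncate=False)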

Related

pyspark get value counts within a groupby

I have a dataframe df with a few columns. I want to group by one (or more) columns, and for every group I want the count of values of another column(s).
Here's the df:
col1 col2 col3 col4
1 1 a 2
1 1 b 1
1 2 c 1
2 1 a 3
2 1 b 4
I want to group by 'col1' and 'col2', and then for every group get the count of each unique value in a column as well as the sum/mean/min/max of another column. I also want to keep the grouped columns. The result should be:
col1 col2 count_a count_b count_c col4_sum
1 1 1 1 0 3
1 2 0 0 1 1
2 1 1 1 0 7
How do I achieve this?
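For reference, a minimal sketch that builds this sample DataFrame (assuming an active SparkSession named spark, as in the other snippets on this page):
data = [(1, 1, "a", 2), (1, 1, "b", 1), (1, 2, "c", 1), (2, 1, "a", 3), (2, 1, "b", 4)]
df = spark.createDataFrame(data, ["col1", "col2", "col3", "col4"])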
You have two solutions.
First, you can use pivot on col3 to get your count of unique values, and then join your pivoted dataframe with an aggregated dataframe that computes the sum/mean/min/max of the other column.
Your code would be as follows:
from pyspark.sql import functions as F
result = df \
    .groupBy('col1', 'col2') \
    .pivot('col3') \
    .agg(F.count('col3')) \
    .fillna(0) \
    .join(
        df.groupby('col1', 'col2').agg(F.sum('col4').alias('col4_sum')),
        ['col1', 'col2']
    )
And with your input dataframe, you will get:
+----+----+---+---+---+--------+
|col1|col2|a |b |c |col4_sum|
+----+----+---+---+---+--------+
|1 |1 |1 |1 |0 |3 |
|1 |2 |0 |0 |1 |1 |
|2 |1 |1 |1 |0 |7 |
+----+----+---+---+---+--------+
However, you can't choose the names of the columns extracted from pivot; each one is named after the value itself.
If you really want to choose the names of the columns, you can retrieve all distinct values first and then build your aggregation columns from each of them, as follows:
from pyspark.sql import functions as F

# collect the distinct values of col3, then build one count column per value
values = [row.col3 for row in df.select("col3").distinct().collect()]
count_of_distinct_values = [F.sum((F.col('col3') == v).cast('integer')).alias('count_' + v) for v in values]
other_column_aggregations = [F.sum('col4').alias('col4_sum')]

aggregated = count_of_distinct_values + other_column_aggregations
result = df.groupBy('col1', 'col2').agg(*aggregated)
You then get the following dataframe:
+----+----+-------+-------+-------+--------+
|col1|col2|count_a|count_b|count_c|col4_sum|
+----+----+-------+-------+-------+--------+
|1 |1 |1 |1 |0 |3 |
|1 |2 |0 |0 |1 |1 |
|2 |1 |1 |1 |0 |7 |
+----+----+-------+-------+-------+--------+
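The question also asks for the sum/mean/min/max of the other column; with this second approach you can simply extend the aggregation list (a sketch, the aliases are mine):
other_column_aggregations = [
    F.sum('col4').alias('col4_sum'),
    F.mean('col4').alias('col4_mean'),
    F.min('col4').alias('col4_min'),
    F.max('col4').alias('col4_max'),
]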

Pyspark Pivot based on column values combinations

I have a dataframe with 3 columns as shown below
I would like to pivot and fill the columns on the ID so that each row contains a column for each ID + column combination, holding the value for that ID, as shown below.
Note: zero or null is shown if the ID does not match. For instance, ID2_colA and ID2_colB get 0 in the first two rows, and ID1_colA and ID1_colB get 0 in row 3.
There are more distinct values in the ID column; I shortened it for ease of illustration.
How can I achieve this in pyspark?
Here is the code for the first dataframe:
data = [("ID1", 3, 5), ("ID1", 4, 12), ("ID2", 8, 3)]
df = spark.createDataFrame(data, ["ID", "colA", "colB"])
You can create a map column where the values are the columns colA and colB, and the keys are the concatenation of the literal column names colA and colB with the ID column. Then, explode the map and pivot the resulting key column like this:
from itertools import chain
import pyspark.sql.functions as F
df.select(
    "ID", "colA", "colB",
    F.explode(
        F.create_map(
            *list(chain(*[[F.concat_ws("_", F.lit(c), F.col("ID")), F.col(c)] for c in ["colA", "colB"]]))
        )
    )
).groupBy("ID", "colA", "colB") \
    .pivot("key").agg(F.first("value")) \
    .fillna(0) \
    .show()
#+---+----+----+--------+--------+--------+--------+
#|ID |colA|colB|colA_ID1|colA_ID2|colB_ID1|colB_ID2|
#+---+----+----+--------+--------+--------+--------+
#|ID2|8 |3 |0 |8 |0 |3 |
#|ID1|3 |5 |3 |0 |5 |0 |
#|ID1|4 |12 |4 |0 |12 |0 |
#+---+----+----+--------+--------+--------+--------+

How to aggregate based on condition of a column in pandas?

I have a dataframe which looks like this:
df:
id|flag|fee
1 |0 |5
1 |0 |5
1 |1 |5
1 |1 |5
DESIRED df_aggregated:
id|flag|fee
1 |2 |10
The aggregate should count the number of flags per id and the fee should sum per id when the flag is set to 1:
df1 = df.groupby(['id'])["flag"].apply(lambda x: x.astype(int).count()).reset_index()
df2 = df.groupby(['id'])["fee"].apply(lambda x: x.astype(int).sum()).reset_index()
df_aggregated = pd.merge(df1, df2, on='id', how='inner')
ACTUAL df_aggregated:
id|flag|fee
1 |2 |20
My fee aggregation is NOT correct/complete because it does not account for the condition of only summing the fee IF THE FLAG=1. Instead it sums up all fees regardless of the flag. How do I change my code to account for this condition? It should look like the DESIRED df_aggregated table.
Thanks!
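For reference, a minimal sketch that builds this sample frame (assuming pandas is imported as pd, as in the question's code):
import pandas as pd

df = pd.DataFrame({
    "id":   [1, 1, 1, 1],
    "flag": [0, 0, 1, 1],
    "fee":  [5, 5, 5, 5],
})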
You need to check for the condition flag==1. To do so, you can multiply fee by df.flag.eq(1):
(df.assign(fee=df.fee * df.flag.eq(1))
   .groupby('id', as_index=False)
   .agg({'flag': 'nunique', 'fee': 'sum'})
)
Output:
id flag fee
0 1 2 10
If you want both to count/sum only where flag==1, you can do a query first:
(df.query('flag==1')
   .groupby('id', as_index=False)
   .agg({'flag': 'count', 'fee': 'sum'})
)
which incidentally gives the same output as above.

how to match 2 columns with each other in Apache Spark - Pyspark

I have a dataframe; assume my data is in tabular format.
|ID | Serial | Updated
-------------------------------------------------------
|10 |pers1 | |
|20 | | |
|30 |entity_1, entity_2, entity_3|entity_1, entity_3|
Now, using withColumn("Serial", explode(split("Serial", ","))), I have achieved breaking the column into multiple rows as below. This was the 1st part of the requirement.
|ID | Serial | Updated
-------------------------------------------------------
|10 |pers1 | |
|20 | | |
|30 |entity_1 |entity_1, entity_3|
|30 |entity_2 |entity_1, entity_3|
|30 |entity_3 |entity_1, entity_3|
Now, for the columns where there are no values, it should be 0.
Each value present in the 'Serial' column should be searched for in the 'Updated' column. If the value is present in the 'Updated' column, it should display '1', else '2'.
So in this case, for entity_1 and entity_3 --> 1 must be displayed, and for entity_2 --> 2 should be displayed.
How to achieve this?
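For reference, a minimal sketch that builds this sample DataFrame (an assumption based on the tables above; empty cells are modeled as empty strings and the lists use commas without spaces, matching the checks and output in the answer below):
import pyspark.sql.functions as f

data = [
    (10, "pers1", ""),
    (20, "", ""),
    (30, "entity_1,entity_2,entity_3", "entity_1,entity_3"),
]
df = spark.createDataFrame(data, ["ID", "Serial", "Updated"])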
AFAIK, there is no way to check if one column is contained within or is a substring of another column directly without using a udf.
However, if you want to avoid using a udf, one way is to explode the "Updated" column. Then you can check for equality between the "Serial" column and the exploded "Updated" column and apply your conditions (1 if match, 2 otherwise); call this "contains".
Finally, you can then groupBy("ID", "Serial", "Updated") and select the minimum of the "contains" column.
For example, after the two calls to explode() and checking your condition, you will have a DataFrame like this:
df.withColumn("Serial", f.explode(f.split("Serial", ",")))\
.withColumn("updatedExploded", f.explode(f.split("Updated", ",")))\
.withColumn(
"contains",
f.when(
f.isnull("Serial") |
f.isnull("Updated") |
(f.col("Serial") == "") |
(f.col("Updated") == ""),
0
).when(
f.col("Serial") == f.col("updatedExploded"),
1
).otherwise(2)
)\
.show(truncate=False)
#+---+--------+-----------------+---------------+--------+
#|ID |Serial |Updated |updatedExploded|contains|
#+---+--------+-----------------+---------------+--------+
#|10 |pers1 | | |0 |
#|20 | | | |0 |
#|30 |entity_1|entity_1,entity_3|entity_1 |1 |
#|30 |entity_1|entity_1,entity_3|entity_3 |2 |
#|30 |entity_2|entity_1,entity_3|entity_1 |2 |
#|30 |entity_2|entity_1,entity_3|entity_3 |2 |
#|30 |entity_3|entity_1,entity_3|entity_1 |2 |
#|30 |entity_3|entity_1,entity_3|entity_3 |1 |
#+---+--------+-----------------+---------------+--------+
The "trick" of grouping by ("ID", "Serial", "Updated") and taking the minimum of "contains" works because:
If either "Serial" or "Updated" is null (or equal to empty string in this case), the value will be 0.
If at least one of the values in "Updated" matches with "Serial", one of the columns will have a 1.
If there are no matches, you will have only 2's
The final output:
df.withColumn("Serial", f.explode(f.split("Serial", ",")))\
.withColumn("updatedExploded", f.explode(f.split("Updated", ",")))\
.withColumn(
"contains",
f.when(
f.isnull("Serial") |
f.isnull("Updated") |
(f.col("Serial") == "") |
(f.col("Updated") == ""),
0
).when(
f.col("Serial") == f.col("updatedExploded"),
1
).otherwise(2)
)\
.groupBy("ID", "Serial", "Updated")\
.agg(f.min("contains").alias("contains"))\
.sort("ID")\
.show(truncate=False)
#+---+--------+-----------------+--------+
#|ID |Serial |Updated |contains|
#+---+--------+-----------------+--------+
#|10 |pers1 | |0 |
#|20 | | |0 |
#|30 |entity_3|entity_1,entity_3|1 |
#|30 |entity_2|entity_1,entity_3|2 |
#|30 |entity_1|entity_1,entity_3|1 |
#+---+--------+-----------------+--------+
I'm chaining calls to pyspark.sql.functions.when() to check the conditions. The first part checks to see if either column is null or equal to the empty string. I believe that you probably only need to check for null in your actual data, but I put in the check for empty string based on how you displayed your example DataFrame.

Calculating sum, count of multiple top-K values in Spark

I have an input dataframe of the format
+----+------+-----+----------+
|name|values|score|row_number|
+----+------+-----+----------+
|A   |1000  |0    |1         |
|B   |947   |0    |2         |
|C   |923   |1    |3         |
|D   |900   |2    |4         |
|E   |850   |3    |5         |
|F   |800   |1    |6         |
+----+------+-----+----------+
I need to get sum(values) when score > 0 and row_number < K, i.e. the SUM of all values with score > 0 among the top K rows of the dataframe.
I am able to achieve this by running the following query for the top 100 values:
val top_100_data = df.select(
  count(when(col("score") > 0 and col("row_number") <= 100, col("values"))).alias("count_100"),
  sum(when(col("score") > 0 and col("row_number") <= 100, col("values"))).alias("sum_filtered_100"),
  sum(when(col("row_number") <= 100, col("values"))).alias("total_sum_100")
)
However, I need to fetch data for the top 100, 200, 300, ..., 2500, meaning I would need to run this query 25 times and finally union 25 dataframes.
I'm new to spark and still figuring lots of things out. What would be the best approach to solve this problem?
Thanks!!
You can create an Array of limits:
val topFilters = Array(100, 200, 300) // you can add more
Then you can loop through the topFilters array and create the dataframes you require. I suggest you use join rather than union, as join will give you separate columns while union will give you separate rows. You can do the following.
Given your dataframe as
+----+------+-----+----------+
|name|values|score|row_number|
+----+------+-----+----------+
|A |1000 |0 |1 |
|B |947 |0 |2 |
|C |923 |1 |3 |
|D |900 |2 |200 |
|E |850 |3 |150 |
|F |800 |1 |250 |
+----+------+-----+----------+
You can do this by using the topFilters array defined above:
import sqlContext.implicits._
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

var finalDF : DataFrame = Seq("1").toDF("rowNum")
for (k <- topFilters) {
  val top_100_data = df.select(
    lit("1").as("rowNum"),
    sum(when(col("score") > 0 && col("row_number") < k, col("values"))).alias(s"total_sum_$k")
  )
  finalDF = finalDF.join(top_100_data, Seq("rowNum"))
}
finalDF.show(false)
This should give you the final dataframe:
+------+-------------+-------------+-------------+
|rowNum|total_sum_100|total_sum_200|total_sum_300|
+------+-------------+-------------+-------------+
|1 |923 |1773 |3473 |
+------+-------------+-------------+-------------+
You can do the same for the 25 limits that you have.
If you intend to use union, the idea is similar to the above.
I hope the answer is helpful.
Updated
If you require union, then you can apply the following logic with the same limit array defined above:
var finalDF : DataFrame = Seq((0, 0, 0, 0)).toDF("limit", "count", "sum_filtered", "total_sum")
for (k <- topFilters) {
  val top_100_data = df.select(
    lit(k).as("limit"),
    count(when(col("score") > 0 and col("row_number") <= k, col("values"))).alias("count"),
    sum(when(col("score") > 0 and col("row_number") <= k, col("values"))).alias("sum_filtered"),
    sum(when(col("row_number") <= k, col("values"))).alias("total_sum")
  )
  finalDF = finalDF.union(top_100_data)
}
finalDF.filter(col("limit") =!= 0).show(false)
which should give you
+-----+-----+------------+---------+
|limit|count|sum_filtered|total_sum|
+-----+-----+------------+---------+
|100 |1 |923 |2870 |
|200 |3 |2673 |4620 |
|300 |4 |3473 |5420 |
+-----+-----+------------+---------+
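Since the rest of this page is PySpark, here is a rough Python translation of the join-based loop above; a sketch under the same assumptions (a DataFrame df with the columns shown), not part of the original answer:
from functools import reduce
import pyspark.sql.functions as F

top_filters = [100, 200, 300]  # extend up to 2500 as needed

# one single-row aggregate per limit, then join them all on the dummy key
per_limit_dfs = [
    df.select(
        F.lit("1").alias("rowNum"),
        F.sum(F.when((F.col("score") > 0) & (F.col("row_number") < k), F.col("values"))).alias(f"total_sum_{k}")
    )
    for k in top_filters
]
final_df = reduce(lambda a, b: a.join(b, ["rowNum"]), per_limit_dfs)
final_df.show(truncate=False)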
