I have a dataframe with 3 columns, as shown below.
I would like to pivot and fill the columns on the ID so that each row contains a
column for each ID + column combination, holding the value for that ID, as shown below.
Note: zero or null is shown if the ID does not match. For instance, ID2_colA and ID2_colB get 0 in the first two rows, and ID1_colA and ID1_colB get 0 in row 3.
There are more distinct values in the ID column; I shortened it here for ease of illustration.
How can I achieve this in pyspark?
Here is the code for the first dataframe:
data = [("ID1", 3, 5), ("ID1", 4, 12), ("ID2", 8, 3)]
df = spark.createDataFrame(data, ["ID", "colA", "colB"])
You can create a map column where the values are the columns colA and colB, and the keys are the literal column names colA/colB concatenated with the ID column. Then, explode the map and pivot on the resulting key column, taking the first value per group, like this:
from itertools import chain
import pyspark.sql.functions as F
df.select(
    "ID", "colA", "colB",
    F.explode(
        F.create_map(
            *list(chain(*[[F.concat_ws("_", F.lit(c), F.col("ID")), F.col(c)] for c in ["colA", "colB"]]))
        )
    )
).groupBy("ID", "colA", "colB") \
  .pivot("key").agg(F.first("value")) \
  .fillna(0) \
  .show()
#+---+----+----+--------+--------+--------+--------+
#|ID |colA|colB|colA_ID1|colA_ID2|colB_ID1|colB_ID2|
#+---+----+----+--------+--------+--------+--------+
#|ID2|8 |3 |0 |8 |0 |3 |
#|ID1|3 |5 |3 |0 |5 |0 |
#|ID1|4 |12 |4 |0 |12 |0 |
#+---+----+----+--------+--------+--------+--------+
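If you would rather have the ID-first column names described in the question (ID1_colA, ID2_colB, ...), swapping the arguments of concat_ws when building the map keys should be enough. A minimal variation of the snippet above (a sketch, not tested against your real data):

from itertools import chain
import pyspark.sql.functions as F

df.select(
    "ID", "colA", "colB",
    F.explode(
        F.create_map(
            # ID first, then the column name, so keys become e.g. "ID1_colA"
            *list(chain(*[[F.concat_ws("_", F.col("ID"), F.lit(c)), F.col(c)] for c in ["colA", "colB"]]))
        )
    )
).groupBy("ID", "colA", "colB") \
  .pivot("key").agg(F.first("value")) \
  .fillna(0) \
  .show()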
I have a dataframe df with a few columns. I want to group by one (or more) column, and for every group I want the count of values of another column (or columns).
Here's the df:
col1 col2 col3 col4
1    1    a    2
1    1    b    1
1    2    c    1
2    1    a    3
2    1    b    4
I want to group by 'col1' and 'col2' and then, for every group, get the count of each distinct value of one column as well as the sum/mean/min/max of another column. I also want to keep the grouped columns. The result should be:
col1 col2 count_a count_b count_c col4_sum
1    1    1       1       0       3
1    2    0       0       1       1
2    1    1       1       0       7
how do I achieve this?
You have two solutions.
First, you can use pivot on col3 to get your counts of distinct values, and then join your pivoted dataframe with an aggregated dataframe that computes the sum/mean/min/max of the other column.
Your code would be as follows:
from pyspark.sql import functions as F
result = df \
    .groupBy('col1', 'col2') \
    .pivot('col3') \
    .agg(F.count('col3')) \
    .fillna(0) \
    .join(
        df.groupby('col1', 'col2').agg(F.sum('col4').alias('col4_sum')),
        ['col1', 'col2']
    )
And with your input dataframe, you will get:
+----+----+---+---+---+--------+
|col1|col2|a |b |c |col4_sum|
+----+----+---+---+---+--------+
|1 |1 |1 |1 |0 |3 |
|1 |2 |0 |0 |1 |1 |
|2 |1 |1 |1 |0 |7 |
+----+----+---+---+---+--------+
However, you can't choose the names of the columns created by pivot; each one is simply named after the pivoted value.
If you really want to choose the names of the columns, you can retrieve all the distinct values first and then build your aggregation columns from each of them, as follows:
from pyspark.sql import functions as F
values = [row.col3 for row in df.select("col3").distinct().collect()]
count_of_distinct_values = [F.sum((F.col('col3') == i).cast('integer')).alias('count_' + i) for i in values]
other_column_aggregations = [F.sum('col4').alias('col4_sum')]
aggregated = count_of_distinct_values + other_column_aggregations
result = df.groupBy('col1', 'col2').agg(*aggregated)
You then get the following dataframe:
+----+----+-------+-------+-------+--------+
|col1|col2|count_a|count_b|count_c|col4_sum|
+----+----+-------+-------+-------+--------+
|1 |1 |1 |1 |0 |3 |
|1 |2 |0 |0 |1 |1 |
|2 |1 |1 |1 |0 |7 |
+----+----+-------+-------+-------+--------+
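A middle ground (my suggestion, not part of the original answers) is to keep the pivot from the first solution and simply rename the pivoted columns afterwards, since the pivoted values become the column names:

from pyspark.sql import functions as F

pivoted = df.groupBy('col1', 'col2') \
    .pivot('col3') \
    .agg(F.count('col3')) \
    .fillna(0) \
    .join(
        df.groupby('col1', 'col2').agg(F.sum('col4').alias('col4_sum')),
        ['col1', 'col2']
    )

# prefix every pivoted value column with "count_", keeping the other columns as-is
keep = ['col1', 'col2', 'col4_sum']
renamed = pivoted.select(
    *keep,
    *[F.col(c).alias('count_' + c) for c in pivoted.columns if c not in keep]
)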
Consider that I have two Dataframes DF1 and DF2 with the same schema.
What I want to do is:
For each row in DF1,
if DF1.uniqueId exists in DF2 and type is new, then add to DF2 with a repeat count.
if DF1.uniqueId exists in DF2 and type is old, change DF2 type to that of DF1 type (old).
if DF1.uniqueId does not exist in DF2 and type is new, add a new row to DF2.
if DF1.uniqueId does not exist in DF2 and type is old, move that row to a new table, DF3.
i.e., if the tables are as shown below, the updated DF2 should look like the resultDF2 table below.
DF1
+----------+--------------------------+
|UniqueID |type_ |
+----------+--------------------------+
|1 |new |
|1 |new |
|1 |new |
|2 |old |
|1 |new |
+----------+--------------------------+
DF2
+----------+--------------------------+
|UniqueID |type_ |
+----------+--------------------------+
| | |
+----------+--------------------------+
resultDF2
+----------+------+-------------+
|UniqueID  |type_ | repeatCount |
+----------+------+-------------+
|1         |new   | 3           |
+----------+------+-------------+
resultDF3
+----------+------+-------------+
|UniqueID  |type_ | repeatCount |
+----------+------+-------------+
|2         |old   | 0           |
+----------+------+-------------+
** If there is only one entry, repeatCount is zero.
I am trying to achieve this using pyspark.
Can anyone suggest pointers on how to achieve this, considering that I have both tables in memory?
The desired output can be obtained by:
Group df1 on UniqueID and compute repeatCount; during this operation, drop UniqueIDs that have both old and new type_ (they cancel out).
Apply a full join between the dataframe from step 1 and df2.
From the joined result, remove rows where the UniqueID is absent from df2 and df1.type_ is old.
Finally, select the UniqueID, type_ and repeatCount.
from pyspark.sql import functions as F
data = [(1, "new",),  # Not exists and new
        (1, "new",),
        (1, "new",),
        (2, "old",),  # Not exists and old
        (1, "new",),
        (3, "old",),  # cancel out
        (3, "new",),  # cancel out
        (4, "new",),  # one entry count zero example
        (5, "new",),  # Exists and new
        (6, "old",), ]  # Exists and old
df1 = spark.createDataFrame(data, ("UniqueID", "type_", ))
df2 = spark.createDataFrame([(5, "new", ), (6, "new", ), ], ("UniqueID", "type_", ))
df1_grouped = (df1.groupBy("UniqueID")
               .agg(F.collect_set("type_").alias("types_"),
                    (F.count("type_") - F.lit(1)).alias("repeatCount"))
               .filter(F.size(F.col("types_")) == 1)  # when more than one type of `type_` is present they cancel out
               .withColumn("type_", F.col("types_")[0])
               .drop("types_")
               )
id_not_exists_old = (df2["UniqueID"].isNull() & (df1_grouped["type_"] == F.lit("old")))
(df1_grouped.join(df2, df1_grouped["UniqueID"] == df2["UniqueID"], "full")
 .filter(~id_not_exists_old)
 .select(df1_grouped["UniqueID"], df1_grouped["type_"], "repeatCount")
 ).show()
"""
+--------+-----+-----------+
|UniqueID|type_|repeatCount|
+--------+-----+-----------+
| 1| new| 3|
| 4| new| 0|
| 5| new| 0|
| 6| old| 0|
+--------+-----+-----------+
"""
Let's say I have the following Spark frame:
+--------+----------+-----------+-------------------+-------------------+
|UserName|date |NoLogPerDay|NoLogPer-1st-12-hrs|NoLogPer-2nd-12-hrs|
+--------+----------+-----------+-------------------+-------------------+
|B |2021-08-11|2 |2 |0 |
|A |2021-08-11|3 |2 |1 |
|B |2021-08-13|1 |1 |0 |
+--------+----------+-----------+-------------------+-------------------+
Now I want not only to fill in the missing dates in the date column, so that the dataframe keeps its continuous, equally spaced time-series nature, but also to fill the other columns with null or 0 (preferably while doing the groupBy).
My code is below:
import time
import datetime as dt
from pyspark.sql import SQLContext, functions as F
from pyspark.sql.functions import *
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, TimestampType, DateType
dict2 = [("2021-08-11 04:05:06", "A"),
         ("2021-08-11 04:15:06", "B"),
         ("2021-08-11 09:15:26", "A"),
         ("2021-08-11 11:04:06", "B"),
         ("2021-08-11 14:55:16", "A"),
         ("2021-08-13 04:12:11", "B"),
         ]
schema = StructType([
    StructField("timestamp", StringType(), True),
    StructField("UserName", StringType(), True),
])
#create a Spark dataframe
sqlCtx = SQLContext(sc)
sdf = sqlCtx.createDataFrame(data=dict2,schema=schema)
#sdf.printSchema()
#sdf.show(truncate=False)
#+-------------------+--------+
#|timestamp |UserName|
#+-------------------+--------+
#|2021-08-11 04:05:06|A |
#|2021-08-11 04:15:06|B |
#|2021-08-11 09:15:26|A |
#|2021-08-11 11:04:06|B |
#|2021-08-11 14:55:16|A |
#|2021-08-13 04:12:11|B |
#+-------------------+--------+
#Generate date and timestamp
sdf1 = sdf.withColumn('timestamp', F.to_timestamp("timestamp", "yyyy-MM-dd HH:mm:ss").cast(TimestampType())) \
          .withColumn('date', F.to_date("timestamp", "yyyy-MM-dd").cast(DateType())) \
          .select('timestamp', 'date', 'UserName')
#sdf1.show(truncate = False)
#+-------------------+----------+--------+
#|timestamp |date |UserName|
#+-------------------+----------+--------+
#|2021-08-11 04:05:06|2021-08-11|A |
#|2021-08-11 04:15:06|2021-08-11|B |
#|2021-08-11 09:15:26|2021-08-11|A |
#|2021-08-11 11:04:06|2021-08-11|B |
#|2021-08-11 14:55:16|2021-08-11|A |
#|2021-08-13 04:12:11|2021-08-13|B |
#+-------------------+----------+--------+
#Aggregate record counts per feature (UserName) for certain time resolutions: PerDay (24 hrs), HalfDay (2 x 12 hrs)
df = sdf1.groupBy("UserName", "date").agg(
    F.sum(F.hour("timestamp").between(0, 24).cast("int")).alias("NoLogPerDay"),
    F.sum(F.hour("timestamp").between(0, 11).cast("int")).alias("NoLogPer-1st-12-hrs"),
    F.sum(F.hour("timestamp").between(12, 23).cast("int")).alias("NoLogPer-2nd-12-hrs"),
).sort('date')
df.show(truncate = False)
The problem is that when I groupBy on date and UserName, I miss the dates on which user B had activity but user A did not, or vice versa. So I'm interested in reflecting these inactive days in the Spark dataframe by filling in those dates (no need for timestamps) and allocating 0 to those columns. I'm not sure whether I can do this while grouping, or before, or after!
I already checked some related posts, as well as the window functions PySpark offers, and was inspired by this answer, so until now I've tried this:
# compute the list of all dates from available dates
max_date = sdf1.select(F.max('date')).first()['max(date)']
min_date = sdf1.select(F.min('date')).first()['min(date)']
print(min_date) #2021-08-11
print(max_date) #2021-08-13
#compute list of available dates based on min_date & max_date from available data
dates_list = [max_date - dt.timedelta(days=x) for x in range((max_date - min_date).days +1)]
print(dates_list)
#create a temporary Spark dataframe for the date column, including missing dates with an interval of 1 day
sqlCtx = SQLContext(sc)
df2 = sqlCtx.createDataFrame(data=[(d,) for d in dates_list], schema=["date"])
#Apply leftouter join on date column
dff = df2.join(sdf1, ["date"], "leftouter")
#dff.sort('date').show(truncate = False)
#possible to use .withColumn().otherwise()
#.withColumn('date',when(col('date').isNull(),to_date(lit('01.01.1900'),'dd.MM.yyyy')).otherwise(col('date')))
#Replace null with 0 for all integer columns
dfff = dff.na.fill(value=0).sort('date')
dfff.select('date', 'UserName', 'NoLogPerDay', 'NoLogPer-1st-12-hrs', 'NoLogPer-2nd-12-hrs').sort('date').show(truncate=False)
Please note that I'm not interested in using a UDF or hacking it via toPandas().
So the expected result after groupBy should look like this:
+--------+----------+-----------+-------------------+-------------------+
|UserName|date |NoLogPerDay|NoLogPer-1st-12-hrs|NoLogPer-2nd-12-hrs|
+--------+----------+-----------+-------------------+-------------------+
|B |2021-08-11|2 |2 |0 |
|A |2021-08-11|3 |2 |1 |
|B |2021-08-12|0 |0 |0 | <--
|A |2021-08-12|0 |0 |0 | <--
|B |2021-08-13|1 |1 |0 |
|A |2021-08-13|0 |0 |0 | <--
+--------+----------+-----------+-------------------+-------------------+
Here is one way of doing it:
First, generate a new dataframe all_dates_df that contains the sequence of dates from the min to the max date in your grouped df. For this you can use the sequence function:
import pyspark.sql.functions as F
all_dates_df = df.selectExpr(
    "sequence(min(date), max(date), interval 1 day) as date"
).select(F.explode("date").alias("date"))
all_dates_df.show()
#+----------+
#| date|
#+----------+
#|2021-08-11|
#|2021-08-12|
#|2021-08-13|
#+----------+
Now you need to duplicate each date for all the users, using a cross join with the distinct UserName dataframe, and finally join with the grouped df to get the desired output:
result_df = all_dates_df.crossJoin(
    df.select("UserName").distinct()
).join(
    df,
    ["UserName", "date"],
    "left"
).fillna(0)
result_df.show()
#+--------+----------+-----------+-------------------+-------------------+
#|UserName| date|NoLogPerDay|NoLogPer-1st-12-hrs|NoLogPer-2nd-12-hrs|
#+--------+----------+-----------+-------------------+-------------------+
#| A|2021-08-11| 3| 2| 1|
#| B|2021-08-11| 2| 2| 0|
#| A|2021-08-12| 0| 0| 0|
#| B|2021-08-12| 0| 0| 0|
#| B|2021-08-13| 1| 1| 0|
#| A|2021-08-13| 0| 0| 0|
#+--------+----------+-----------+-------------------+-------------------+
Essentially, you may generate all the possible combinations and left join on them to fill in your missing dates.
The sequence SQL function may be helpful here to generate all your possible dates. You may pass it your min and max dates along with the interval you would like it to increment by. The following examples continue with the code from your Google Colab.
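For instance, a minimal standalone illustration of sequence on literal dates (assuming an active SparkSession named spark), which should print something like the commented output:

spark.sql(
    "SELECT sequence(to_date('2021-08-11'), to_date('2021-08-13'), interval 1 day) AS dates"
).show(truncate=False)
# +------------------------------------+
# |dates                               |
# +------------------------------------+
# |[2021-08-11, 2021-08-12, 2021-08-13]|
# +------------------------------------+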
Using the functions min, max, collect_set and the table-generating function explode, you may achieve the following:
possible_user_dates = (
    # Step 1 - Get all possible UserNames and desired dates
    df.select(
        F.collect_set("UserName").alias("UserName"),
        F.expr("sequence(min(date),max(date), interval 1 day)").alias("date")
    )
    # Step 2 - Use explode to split the collected arrays into rows (output immediately below)
    .withColumn("UserName", F.explode("UserName"))
    .withColumn("date", F.explode("date"))
    .distinct()
)
possible_user_dates.show(truncate=False)
+--------+----------+
|UserName|date |
+--------+----------+
|B |2021-08-11|
|A |2021-08-11|
|B |2021-08-12|
|A |2021-08-12|
|B |2021-08-13|
|A |2021-08-13|
+--------+----------+
Performing your left join
final_df = (
    possible_user_dates.join(
        df,
        ["UserName", "date"],
        "left"
    )
    # Since the left join will place NULLs where values are missing,
    # e.g. where a user was not active on a particular date,
    # we use `fill` to replace the null values with `0`
    .na.fill(0)
)
final_df.show(truncate=False)
+--------+----------+-----------+-------------------+-------------------+
|UserName|date |NoLogPerDay|NoLogPer-1st-12-hrs|NoLogPer-2nd-12-hrs|
+--------+----------+-----------+-------------------+-------------------+
|B |2021-08-11|2 |2 |0 |
|A |2021-08-11|3 |2 |1 |
|B |2021-08-12|0 |0 |0 |
|A |2021-08-12|0 |0 |0 |
|B |2021-08-13|1 |1 |0 |
|A |2021-08-13|0 |0 |0 |
+--------+----------+-----------+-------------------+-------------------+
For debugging purposes, I've included the output of a few intermediate steps.
Step 1 Output:
df.select(
    F.collect_set("UserName").alias("UserName"),
    F.expr("sequence(min(date),max(date), interval 1 day)").alias("date")
).show(truncate=False)
+--------+------------------------------------+
|UserName|date |
+--------+------------------------------------+
|[B, A] |[2021-08-11, 2021-08-12, 2021-08-13]|
+--------+------------------------------------+
I want to "duplicate" each row as many times as the difference between two dates in the df. I have this dataframe:
So I need to explode the rows of the df to get this:
Get all dates between D1 and D2 using sequence and then explode the dates:
df = ...
df.withColumn("D1", F.explode(F.expr("sequence(D1,D2)"))) \
  .drop("D2").show(truncate=False)
Output:
+---+---+---+----------+
|A |B |C |D1 |
+---+---+---+----------+
|1 |2 |3 |2019-01-01|
|1 |2 |3 |2019-01-02|
|1 |2 |3 |2019-01-03|
+---+---+---+----------+
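For a self-contained run, here is a sketch with an assumed input dataframe reconstructed from the output above (the question's original table is not reproduced here, so the sample values are an assumption):

from pyspark.sql import functions as F

# assumed sample data: one row whose date range spans three days
df = spark.createDataFrame(
    [(1, 2, 3, "2019-01-01", "2019-01-03")],
    ["A", "B", "C", "D1", "D2"],
).withColumn("D1", F.to_date("D1")).withColumn("D2", F.to_date("D2"))

df.withColumn("D1", F.explode(F.expr("sequence(D1, D2, interval 1 day)"))) \
  .drop("D2").show(truncate=False)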
I have a Spark dataframe and a filter string to apply to it. The filter only selects some rows, but I would like to know the reason why the remaining rows were not selected.
Example:
DataFrame columns: customer_id|col_a|col_b|col_c|col_d
Filter string: col_a > 0 & col_b > 4 & col_c < 0 & col_d=0
etc...
reason_for_exclusion can be any string or letter, as long as it says why a particular row was excluded.
I could split the filter string and apply each filter separately, but my filter string is huge and that would be inefficient, so I'm just checking: is there a better way to do this operation?
Thanks
You'll have to check each condition within the filter expression, which can be expensive compared to the simple operation of filtering.
I would suggest displaying the same reason for all excluded rows, since each of them fails at least one condition in that expression. It's not pretty, but I'd prefer this as it's efficient, especially when you have to handle very large DataFrames.
from pyspark.sql.functions import expr, lit, when

data = [(1, 1, 5, -3, 0), (2, 0, 10, -1, 0), (3, 0, 10, -4, 1), ]
df = spark.createDataFrame(data, ["customer_id", "col_a", "col_b", "col_c", "col_d"])

filter_expr = "col_a > 0 AND col_b > 4 AND col_c < 0 AND col_d=0"

filtered_df = df.withColumn("reason_for_exclusion",
                            when(~expr(filter_expr), lit(filter_expr)
                                 ).otherwise(lit(None))
                            )
filtered_df.show(truncate=False)
Output:
+-----------+-----+-----+-----+-----+-------------------------------------------------+
|customer_id|col_a|col_b|col_c|col_d|reason_for_exclusion |
+-----------+-----+-----+-----+-----+-------------------------------------------------+
|1 |1 |5 |-3 |0 |null |
|2 |0 |10 |-1 |0 |col_a > 0 AND col_b > 4 AND col_c < 0 AND col_d=0|
|3 |0 |10 |-4 |1 |col_a > 0 AND col_b > 4 AND col_c < 0 AND col_d=0|
+-----------+-----+-----+-----+-----+-------------------------------------------------+
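If you then want to split the frame into the rows that pass and the rows that fail, a small follow-up sketch (not part of the original answer):

from pyspark.sql.functions import col

kept_df = filtered_df.filter(col("reason_for_exclusion").isNull()).drop("reason_for_exclusion")
excluded_df = filtered_df.filter(col("reason_for_exclusion").isNotNull())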
EDIT:
Now, if you really want to display only the conditions that failed, you can turn each condition into a separate column and use a DataFrame select to do the calculation. Then you'll have to check which columns evaluate to False to know which condition failed.
You could name these columns <PREFIX>_<condition> so that you can identify them easily later. Here is a complete example:
from pyspark import StorageLevel
from pyspark.sql.functions import array, array_except, col, expr, lit, when

filter_expr = "col_a > 0 AND col_b > 4 AND col_c < 0 AND col_d=0"

COLUMN_FILTER_PREFIX = "filter_validation_"
original_columns = [col(c) for c in df.columns]

# create a column for each condition in the filter expression
condition_columns = [expr(f).alias(COLUMN_FILTER_PREFIX + f) for f in filter_expr.split("AND")]

# evaluate each condition to True/False and persist the DF with the calculated columns
filtered_df = df.select(original_columns + condition_columns)
filtered_df = filtered_df.persist(StorageLevel.MEMORY_AND_DISK)

# get back the columns we calculated for the filter
filter_col_names = [c for c in filtered_df.columns if COLUMN_FILTER_PREFIX in c]
filter_columns = list()
for c in filter_col_names:
    filter_columns.append(
        when(~col(f"`{c}`"),
             lit(f"{c.replace(COLUMN_FILTER_PREFIX, '')}")
             )
    )

array_reason_filter = array_except(array(*filter_columns), array(lit(None)))
df_with_filter_reason = filtered_df.withColumn("reason_for_exclusion", array_reason_filter)

df_with_filter_reason.select(*original_columns, col("reason_for_exclusion")).show(truncate=False)
# output
+-----------+-----+-----+-----+-----+----------------------+
|customer_id|col_a|col_b|col_c|col_d|reason_for_exclusion |
+-----------+-----+-----+-----+-----+----------------------+
|1 |1 |5 |-3 |0 |[] |
|2 |0 |10 |-1 |0 |[col_a > 0 ] |
|3 |0 |10 |-4 |1 |[col_a > 0 , col_d=0]|
+-----------+-----+-----+-----+-----+----------------------+
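Note that the reasons keep the whitespace left over from split("AND") (e.g. "col_a > 0 "). If that bothers you, stripping each condition when the columns are built is a small tweak (my suggestion, reusing expr, COLUMN_FILTER_PREFIX and filter_expr from the example above):

# drop-in replacement for the condition_columns line above:
# strip each condition so the generated column names and reasons have no stray spaces
condition_columns = [
    expr(f.strip()).alias(COLUMN_FILTER_PREFIX + f.strip())
    for f in filter_expr.split("AND")
]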