I have the following PySpark DataFrame:
+----+----------+----------+----------+
| id| a| b| c|
+----+----------+----------+----------+
|2346|2017-05-26| null|2016-12-18|
|5678|2013-05-07|2018-05-12| null|
+----+----------+----------+----------+
My ideal output is:
+----+---+---+---+
|id |a |b |c |
+----+---+---+---+
|2346|2 |0 |1 |
|5678|1 |2 |0 |
+----+---+---+---+
I.e., the more recent the date within the row, the higher the score.
I have looked at similar posts suggesting the use of window functions. The problem is that I need to order the values within each row, not within a column.
You can put the values in each row into an array and use pyspark.sql.functions.sort_array() to sort it.
import pyspark.sql.functions as f
cols = ["a", "b", "c"]
df = df.select("*", f.sort_array(f.array([f.col(c) for c in cols])).alias("sorted"))
df.show(truncate=False)
#+----+----------+----------+----------+------------------------------+
#|id |a |b |c |sorted |
#+----+----------+----------+----------+------------------------------+
#|2346|2017-05-26|null |2016-12-18|[null, 2016-12-18, 2017-05-26]|
#|5678|2013-05-07|2018-05-12|null |[null, 2013-05-07, 2018-05-12]|
#+----+----------+----------+----------+------------------------------+
Now you can use a combination of pyspark.sql.functions.coalesce() and pyspark.sql.functions.when() to loop over each of the columns in cols and find the corresponding index in the sorted array.
df = df.select(
    "id",
    *[
        f.coalesce(
            *[
                f.when(
                    f.col("sorted").getItem(i) == f.col(c),
                    f.lit(i)
                )
                for i in range(len(cols))
            ]
        ).alias(c)
        for c in cols
    ]
)
df.show(truncate=False)
#+----+---+----+----+
#|id |a |b |c |
#+----+---+----+----+
#|2346|2 |null|1 |
#|5678|1 |2 |null|
#+----+---+----+----+
Finally fill the null values with 0:
df = df.na.fill(0)
df.show(truncate=False)
#+----+---+---+---+
#|id |a |b |c |
#+----+---+---+---+
#|2346|2 |0 |1 |
#|5678|1 |2 |0 |
#+----+---+---+---+
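If you are on Spark 2.4 or later, a more compact variant of the same idea (a sketch, not part of the original answer) is to look up each date's 1-based position in the sorted array with array_position, shift it to 0-based, and fill the nulls:
import pyspark.sql.functions as f

cols = ["a", "b", "c"]

# array_position (Spark 2.4+) returns the 1-based index of a value in an array,
# and NULL when the value itself is NULL, so "position - 1" plus na.fill(0)
# reproduces the desired scores in one pass.
ranked = (
    df.withColumn("sorted_dates", f.sort_array(f.array(*[f.col(c) for c in cols])))
    .select(
        "id",
        *[f.expr("array_position(sorted_dates, " + c + ") - 1").alias(c) for c in cols]
    )
    .na.fill(0)
)
ranked.show()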
I have a dataset like:
Data
a
a
a
a
a
b
b
b
a
a
b
I would like to include a column like the one below. The values take the form a1,1: the first part (a1) is the event number, i.e. which consecutive run of "a" this is within the field, and the second part (,1) is the frequency within that event, i.e. how many times "a" has repeated so far before another element (b) appears. Can we carry this out with PySpark?
Data Frequency
a a1,1
a a1,2
a a1,3
a a1,4
a a1,5
b b1,1
b b1,2
b b1,3
a a2,1
a a2,2
b b2,1
You can achieve your desired result by doing this:
from pyspark.sql import Window
import pyspark.sql.functions as F
df = spark.createDataFrame(['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'a', 'a', 'b'], 'string').toDF("Data")
print("Original Data:")
df.show()
print("Result:")
df.withColumn("ID", F.monotonically_increasing_id()) \
.withColumn("group",
F.row_number().over(Window.orderBy("ID"))
- F.row_number().over(Window.partitionBy("Data").orderBy("ID"))
) \
.withColumn("element_freq", F.when(F.col('Data') != 'abcd', F.row_number().over(Window.partitionBy("group").orderBy("ID"))).otherwise(F.lit(0)))\
.withColumn("event_freq", F.when(F.col('Data') != 'abcd', F.dense_rank().over(Window.partitionBy("Data").orderBy("group"))).otherwise(F.lit(0)))\
.withColumn("Frequency", F.concat_ws(',', F.concat(F.col("Data"), F.col("event_freq")), F.col("element_freq"))) \
.orderBy("ID")\
.drop("ID", "group", "event_freq", "element_freq")\
.show()
Original Data:
+----+
|Data|
+----+
| a|
| a|
| a|
| a|
| a|
| b|
| b|
| b|
| a|
| a|
| b|
+----+
Result:
+----+---------+
|Data|Frequency|
+----+---------+
| a| a1,1|
| a| a1,2|
| a| a1,3|
| a| a1,4|
| a| a1,5|
| b| b1,1|
| b| b1,2|
| b| b1,3|
| a| a2,1|
| a| a2,2|
| b| b2,1|
+----+---------+
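The grouping step in the code above is a gaps-and-islands trick. Here is a minimal sketch isolating just that step on toy data (assuming an active SparkSession named spark), so you can see how the constant difference labels each consecutive run:
from pyspark.sql import SparkSession, Window
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
toy = spark.createDataFrame([("a",), ("a",), ("b",), ("b",), ("a",)], ["Data"])

# The difference between a global row_number and a per-value row_number is
# constant within each consecutive run of identical Data values; together with
# the Data value it identifies each run ("island").
(toy.withColumn("ID", F.monotonically_increasing_id())
    .withColumn(
        "group",
        F.row_number().over(Window.orderBy("ID"))
        - F.row_number().over(Window.partitionBy("Data").orderBy("ID"))
    )
    .orderBy("ID")
    .show())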
Use window functions. I give you two options, just in case.
Option 1, separating groups and Frequency
from pyspark.sql import Window
from pyspark.sql.functions import (monotonically_increasing_id, lag, when, col,
                                   concat, sum, rank, array, array_join)

#Window over the whole frame, ordered by the row index
k = Window.partitionBy().orderBy('index')
(
    #Create an index of df to order by
    df1.withColumn('index', monotonically_increasing_id())
    #Put the previous Data value next to the current row
    .withColumn('group', lag('Data').over(k))
    #Where current and previous don't match, conditionally assign a 1, else 0
    .withColumn('group', when(col('Data') != col('group'), 1).otherwise(0))
    #Concat Data with the running sum of the flags above, per Data value and ordered by index
    .withColumn('group', concat('Data', sum('group').over(Window.partitionBy('Data').orderBy('index')) + 1))
    #Rank rows within each group in the order in which they appeared in the initial df
    .withColumn('Frequency', rank().over(Window.partitionBy('group').orderBy('index')))
).sort('index').drop('index').show(truncate=False)
+----+-----+---------+
|Data|group|Frequency|
+----+-----+---------+
|a |a1 |1 |
|a |a1 |2 |
|a |a1 |3 |
|a |a1 |4 |
|a |a1 |5 |
|b |b2 |1 |
|b |b2 |2 |
|b |b2 |3 |
|a |a2 |1 |
|a |a2 |2 |
|b |b3 |1 |
+----+-----+---------+
Option 2 combines both into the Frequency format you asked for (note that the b groups come out as b2/b3 here rather than b1/b2, as you can see in the output below):
#Window over the whole frame, ordered by the row index (same imports as above)
k = Window.partitionBy().orderBy('index')
(
    #Create an index of df to order by
    df1.withColumn('index', monotonically_increasing_id())
    #Put the previous Data value next to the current row
    .withColumn('Frequency', lag('Data').over(k))
    #Where current and previous don't match, conditionally assign a 1, else 0
    .withColumn('Frequency', when(col('Data') != col('Frequency'), 1).otherwise(0))
    #Concat Data with the running sum of the flags above, per Data value and ordered by index
    .withColumn('Frequency', concat('Data', sum('Frequency').over(Window.partitionBy('Data').orderBy('index')) + 1))
    #Join the group label and its rank (order of appearance) into "group,rank"
    .withColumn('Frequency', array_join(array('Frequency', rank().over(Window.partitionBy('Frequency').orderBy('index'))), ','))
).sort('index').drop('index').show(truncate=False)
+----+---------+
|Data|Frequency|
+----+---------+
|a |a1,1 |
|a |a1,2 |
|a |a1,3 |
|a |a1,4 |
|a |a1,5 |
|b |b2,1 |
|b |b2,2 |
|b |b2,3 |
|a |a2,1 |
|a |a2,2 |
|b |b3,1 |
+----+---------+
Let's say I have the following Spark dataframe:
+--------+----------+-----------+-------------------+-------------------+
|UserName|date |NoLogPerDay|NoLogPer-1st-12-hrs|NoLogPer-2nd-12-hrs|
+--------+----------+-----------+-------------------+-------------------+
|B |2021-08-11|2 |2 |0 |
|A |2021-08-11|3 |2 |1 |
|B |2021-08-13|1 |1 |0 |
+--------+----------+-----------+-------------------+-------------------+
Now I want not only to fill in the missing dates in the date column with the right dates, so that the dataframe keeps its continuous, equally spaced time-series nature, but also to impute the other columns with null or 0 (preferably while doing the groupBy).
My code is below:
import time
import datetime as dt
from pyspark.sql import functions as F
from pyspark.sql.functions import *
from pyspark.sql.types import StructType,StructField, StringType, IntegerType, TimestampType, DateType
dict2 = [("2021-08-11 04:05:06", "A"),
         ("2021-08-11 04:15:06", "B"),
         ("2021-08-11 09:15:26", "A"),
         ("2021-08-11 11:04:06", "B"),
         ("2021-08-11 14:55:16", "A"),
         ("2021-08-13 04:12:11", "B"),
         ]
schema = StructType([
    StructField("timestamp", StringType(), True),
    StructField("UserName", StringType(), True),
])
#create a Spark dataframe (assumes an active SparkSession named spark)
sdf = spark.createDataFrame(data=dict2, schema=schema)
#sdf.printSchema()
#sdf.show(truncate=False)
#+-------------------+--------+
#|timestamp |UserName|
#+-------------------+--------+
#|2021-08-11 04:05:06|A |
#|2021-08-11 04:15:06|B |
#|2021-08-11 09:15:26|A |
#|2021-08-11 11:04:06|B |
#|2021-08-11 14:55:16|A |
#|2021-08-13 04:12:11|B |
#+-------------------+--------+
#Generate date and timestamp
sdf1 = sdf.withColumn('timestamp', F.to_timestamp("timestamp", "yyyy-MM-dd HH:mm:ss").cast(TimestampType())) \
          .withColumn('date', F.to_date("timestamp", "yyyy-MM-dd").cast(DateType())) \
          .select('timestamp', 'date', 'UserName')
#sdf1.show(truncate = False)
#+-------------------+----------+--------+
#|timestamp |date |UserName|
#+-------------------+----------+--------+
#|2021-08-11 04:05:06|2021-08-11|A |
#|2021-08-11 04:15:06|2021-08-11|B |
#|2021-08-11 09:15:26|2021-08-11|A |
#|2021-08-11 11:04:06|2021-08-11|B |
#|2021-08-11 14:55:16|2021-08-11|A |
#|2021-08-13 04:12:11|2021-08-13|B |
#+-------------------+----------+--------+
#Aggregate record counts for a given feature (UserName) at specific time resolutions: PerDay (24 hrs) and HalfDay (2 x 12 hrs)
df = sdf1.groupBy("UserName", "date").agg(
    F.sum(F.hour("timestamp").between(0, 24).cast("int")).alias("NoLogPerDay"),
    F.sum(F.hour("timestamp").between(0, 11).cast("int")).alias("NoLogPer-1st-12-hrs"),
    F.sum(F.hour("timestamp").between(12, 23).cast("int")).alias("NoLogPer-2nd-12-hrs"),
).sort('date')
df.show(truncate = False)
The problem is that when I group by date and UserName, dates on which user B had activity but user A did not (or vice versa) are missing. I would like to reflect these inactive days in the Spark dataframe by filling in those dates (no timestamps needed) and setting those columns to 0. I'm not sure whether I can do this while grouping, or before, or after!
I have already checked some related posts and the window functions PySpark offers, and, inspired by this answer, I have tried the following so far:
# compute the list of all dates from available dates
max_date = sdf1.select(F.max('date')).first()['max(date)']
min_date = sdf1.select(F.min('date')).first()['min(date)']
print(min_date) #2021-08-11
print(max_date) #2021-08-13
#compute list of available dates based on min_date & max_date from available data
dates_list = [max_date - dt.timedelta(days=x) for x in range((max_date - min_date).days +1)]
print(dates_list)
#create a temporary Spark dataframe for the date column, including the missing dates at a 1-day interval
df2 = spark.createDataFrame(dates_list, DateType()).toDF("date")
#Apply leftouter join on date column
dff = df2.join(sdf1, ["date"], "leftouter")
#dff.sort('date').show(truncate = False)
#possible to use .withColumn().otherwise()
#.withColumn('date',when(col('date').isNull(),to_date(lit('01.01.1900'),'dd.MM.yyyy')).otherwise(col('date')))
#Replace 0 for null for all integer columns
dfff = dff.na.fill(value=0).sort('date')
dfff.select('date', 'UserName', 'NoLogPerDay', 'NoLogPer-1st-12-hrs', 'NoLogPer-2nd-12-hrs').sort('date').show(truncate = False)
Please note that I'm not interested in using a UDF or hacking it via toPandas().
The expected result after the groupBy should look like this:
+--------+----------+-----------+-------------------+-------------------+
|UserName|date |NoLogPerDay|NoLogPer-1st-12-hrs|NoLogPer-2nd-12-hrs|
+--------+----------+-----------+-------------------+-------------------+
|B |2021-08-11|2 |2 |0 |
|A |2021-08-11|3 |2 |1 |
|B |2021-08-12|0 |0 |0 | <--
|A |2021-08-12|0 |0 |0 | <--
|B |2021-08-13|1 |1 |0 |
|A |2021-08-13|0 |0 |0 | <--
+--------+----------+-----------+-------------------+-------------------+
Here is one way of doing it:
First, generate a new dataframe all_dates_df that contains the sequence of dates from the min to the max date in your grouped df. For this you can use the sequence function:
import pyspark.sql.functions as F
all_dates_df = df.selectExpr(
    "sequence(min(date), max(date), interval 1 day) as date"
).select(F.explode("date").alias("date"))
all_dates_df.show()
#+----------+
#| date|
#+----------+
#|2021-08-11|
#|2021-08-12|
#|2021-08-13|
#+----------+
Now, duplicate each date for all the users using a cross join with the distinct UserName dataframe, and finally join with the grouped df to get the desired output:
result_df = all_dates_df.crossJoin(
    df.select("UserName").distinct()
).join(
    df,
    ["UserName", "date"],
    "left"
).fillna(0)
result_df.show()
#+--------+----------+-----------+-------------------+-------------------+
#|UserName| date|NoLogPerDay|NoLogPer-1st-12-hrs|NoLogPer-2nd-12-hrs|
#+--------+----------+-----------+-------------------+-------------------+
#| A|2021-08-11| 3| 2| 1|
#| B|2021-08-11| 2| 2| 0|
#| A|2021-08-12| 0| 0| 0|
#| B|2021-08-12| 0| 0| 0|
#| B|2021-08-13| 1| 1| 0|
#| A|2021-08-13| 0| 0| 0|
#+--------+----------+-----------+-------------------+-------------------+
Essentially, you may generate all the possible (UserName, date) combinations and left join onto them to fill in your missing dates.
The sequence SQL function may be helpful here for generating all your possible dates. You may pass it your min and max date along with the interval you would like it to increment by. The following examples continue from the code in your Google Colab.
Using the functions min, max, collect_set and the table-generating function explode, you may achieve the following:
possible_user_dates = (
    # Step 1 - Get all possible UserNames and desired dates
    df.select(
        F.collect_set("UserName").alias("UserName"),
        F.expr("sequence(min(date), max(date), interval 1 day)").alias("date")
    )
    # Step 2 - Use explode to split the collected arrays into rows (output immediately below)
    .withColumn("UserName", F.explode("UserName"))
    .withColumn("date", F.explode("date"))
    .distinct()
)
possible_user_dates.show(truncate=False)
+--------+----------+
|UserName|date |
+--------+----------+
|B |2021-08-11|
|A |2021-08-11|
|B |2021-08-12|
|A |2021-08-12|
|B |2021-08-13|
|A |2021-08-13|
+--------+----------+
Performing your left join
final_df = (
    possible_user_dates.join(
        df,
        ["UserName", "date"],
        "left"
    )
    # The left join will place NULLs where values are missing,
    # e.g. where a user was not active on a particular date.
    # We use `fill` to replace those null values with `0`.
    .na.fill(0)
)
final_df.show(truncate=False)
+--------+----------+-----------+-------------------+-------------------+
|UserName|date |NoLogPerDay|NoLogPer-1st-12-hrs|NoLogPer-2nd-12-hrs|
+--------+----------+-----------+-------------------+-------------------+
|B |2021-08-11|2 |2 |0 |
|A |2021-08-11|3 |2 |1 |
|B |2021-08-12|0 |0 |0 |
|A |2021-08-12|0 |0 |0 |
|B |2021-08-13|1 |1 |0 |
|A |2021-08-13|0 |0 |0 |
+--------+----------+-----------+-------------------+-------------------+
For debugging purposes, I've included the output of a few intermediary steps
Step 1 Output:
df.select(
    F.collect_set("UserName").alias("UserName"),
    F.expr("sequence(min(date), max(date), interval 1 day)").alias("date")
).show(truncate=False)
+--------+------------------------------------+
|UserName|date |
+--------+------------------------------------+
|[B, A] |[2021-08-11, 2021-08-12, 2021-08-13]|
+--------+------------------------------------+
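As a quick sanity check (a sketch, assuming the final_df produced above), you can verify that every user now has consecutive daily rows, i.e. a gap of exactly one day between neighbouring dates:
from pyspark.sql import Window
import pyspark.sql.functions as F

w = Window.partitionBy("UserName").orderBy("date")
(final_df
    .withColumn("gap_days", F.datediff("date", F.lag("date").over(w)))
    .groupBy("UserName")
    .agg(F.max("gap_days").alias("max_gap"))
    .show())
# max_gap should be 1 for every user once the missing dates have been filled in.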
I am joining two dataframes site_bs and site_wrk_int1 and creating site_wrk using a dynamic join condition.
My code is like below:
join_cond = [col(v_col) == col('wrk_' + v_col) for v_col in primaryKeyCols]  #result would be
site_wrk = site_bs.join(site_wrk_int1, join_cond, 'inner').select(*site_bs.columns)
join_cond will be dynamic, and its value will be something like [col('id') == col('wrk_id'), col('id') == col('wrk_parentId')]
With the join condition above, the join only happens when both conditions are satisfied, i.e. the join condition will be
id = wrk_id and id = wrk_parentId
But I want an OR condition to be applied instead, like below:
id = wrk_id or id = wrk_parentId
How can I achieve this in PySpark?
Since logical operations on pyspark columns return column objects, you can chain these conditions in the join statement such as:
from pyspark.sql import SparkSession
import pyspark.sql.functions as f
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([
    (1, "A", "A"),
    (2, "C", "C"),
    (3, "E", "D"),
], ['id', 'col1', 'col2'])
df.show()
+---+----+----+
| id|col1|col2|
+---+----+----+
| 1| A| A|
| 2| C| C|
| 3| E| D|
+---+----+----+
df.alias("t1").join(
df.alias("t2"),
(f.col("t1.col1") == f.col("t2.col2")) | (f.col("t1.col1") == f.lit("E")),
"left_outer"
).show(truncate=False)
+---+----+----+---+----+----+
|id |col1|col2|id |col1|col2|
+---+----+----+---+----+----+
|1 |A |A |1 |A |A |
|2 |C |C |2 |C |C |
|3 |E |D |1 |A |A |
|3 |E |D |2 |C |C |
|3 |E |D |3 |E |D |
+---+----+----+---+----+----+
As you can see, the rows with IDs 1 and 2 match via col1 == col2, while the row with ID 3 matches every row on the right via col1 == 'E', so the OR condition is satisfied for all three rows of my DataFrame. In terms of syntax, it's important that the comparisons combined with the Python operators (|, &, ...) are wrapped in parentheses as in the example above, otherwise you might get confusing py4j errors.
Alternatively, if you wish to keep notation similar to what you stated in your question, you can use functools.reduce and operator.or_ to apply this logic to your list of conditions.
For comparison, if I pass the list of conditions directly, there is an AND between them and I get only NULLs, as expected:
df.alias("t1").join(
df.alias("t2"),
[f.col("t1.col1") == f.col("t2.col2"), f.col("t1.col1") == f.lit("E")],
"left_outer"
).show(truncate=False)
+---+----+----+----+----+----+
|id |col1|col2|id |col1|col2|
+---+----+----+----+----+----+
|3 |E |D |null|null|null|
|1 |A |A |null|null|null|
|2 |C |C |null|null|null|
+---+----+----+----+----+----+
In this example, I leverage functools and operator to get the same result as the first (OR) join above:
df.alias("t1").join(
df.alias("t2"),
functools.reduce(
operator.or_,
[f.col("t1.col1") == f.col("t2.col2"), f.col("t1.col1") == f.lit("E")]),
"left_outer"
).show(truncate=False)
+---+----+----+---+----+----+
|id |col1|col2|id |col1|col2|
+---+----+----+---+----+----+
|1 |A |A |1 |A |A |
|2 |C |C |2 |C |C |
|3 |E |D |1 |A |A |
|3 |E |D |2 |C |C |
|3 |E |D |3 |E |D |
+---+----+----+---+----+----+
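Applied back to the question's dynamically built list, the same reduce pattern would look roughly like this (a sketch, assuming the site_bs / site_wrk_int1 dataframes and primaryKeyCols list from the question):
import functools
import operator
from pyspark.sql.functions import col

# Build one equality per primary-key column, then OR them together into a single join condition.
or_cond = functools.reduce(
    operator.or_,
    [col(v_col) == col('wrk_' + v_col) for v_col in primaryKeyCols]
)
site_wrk = site_bs.join(site_wrk_int1, or_cond, 'inner').select(*site_bs.columns)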
I am quite new to Spark SQL.
Please let me know if this can be a solution.
site_wrk = site_bs.join(site_work_int1, [(site_bs.id == site_work_int1.wrk_id) | (site_bs.id == site_work_int1.wrk_parentId)], how = "inner")
I have data as below and I need to split it based on ",".
Input file: 1,2,4,371003\,5371022\,87200000\,U
The desired result should be:
a b c d e f
1 2 3 4 371003,5371022,87000000 U
val df = spark.read.option("inferSchema", "true").option("escape", "\\").option("delimiter", ",").csv("/user/txt.csv")
try this:
// outside spark-shell you will also need: import org.apache.spark.sql.functions._ and import spark.implicits._
val df = spark.read.csv("/user/txt.csv")
df.show()
+---+---+---+-------+--------+---------+---+
|_c0|_c1|_c2| _c3| _c4| _c5|_c6|
+---+---+---+-------+--------+---------+---+
| 1| 2| 4|371003\|5371022\|87200000\| U|
+---+---+---+-------+--------+---------+---+
df.select(
    '_c0, '_c1, '_c2,
    regexp_replace(concat_ws(",", '_c3, '_c4, '_c5), "\\\\", ""),
    '_c6
).toDF("a", "b", "c", "e", "f").show(false)
+---+---+---+-----------------------+---+
|a |b |c |e |f |
+---+---+---+-----------------------+---+
|1 |2 |4 |371003,5371022,87200000|U |
+---+---+---+-----------------------+---+
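If you are working in PySpark rather than Scala, a roughly equivalent sketch (assuming the same /user/txt.csv path and an active SparkSession) would be:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, concat_ws, regexp_replace

spark = SparkSession.builder.getOrCreate()

# Read without an escape option so the backslashes survive the split, then glue
# the over-split columns back together and strip the trailing backslashes.
df = spark.read.csv("/user/txt.csv")
df.select(
    col("_c0"), col("_c1"), col("_c2"),
    regexp_replace(concat_ws(",", "_c3", "_c4", "_c5"), r"\\", "").alias("merged"),
    col("_c6"),
).toDF("a", "b", "c", "e", "f").show(truncate=False)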
I have the following dataframe in Spark (it has only one row):
df.show
+---+---+---+---+---+---+
| A| B| C| D| E| F|
+---+---+---+---+---+---+
| 1|4.4| 2| 3| 7|2.6|
+---+---+---+---+---+---+
I want to get the columns whose values are greater than 2.8 (just as an example). The outcome should be:
List(B, D, E)
Here is my own solution:
val cols = df.columns
val threshold = 2.8
val values = df.rdd.collect.toList
val res = values
  .flatMap(x => x.toSeq)
  .map(x => x.toString.toDouble)
  .zip(cols)
  .filter(x => x._1 > threshold)
  .map(x => x._2)
A simple udf function should give you the correct result:
val columns = df.columns
def getColumns = udf((cols: Seq[Double]) => cols.zip(columns).filter(_._1 > 2.8).map(_._2))
df.withColumn("columns > 2.8", getColumns(array(columns.map(col(_)): _*))).show(false)
So even if you have multiple rows, as below:
+---+---+---+---+---+---+
|A |B |C |D |E |F |
+---+---+---+---+---+---+
|1 |4.4|2 |3 |7 |2.6|
|4 |2.7|2 |3 |1 |2.9|
+---+---+---+---+---+---+
you will get a result for each row:
+---+---+---+---+---+---+-------------+
|A |B |C |D |E |F |columns > 2.8|
+---+---+---+---+---+---+-------------+
|1 |4.4|2 |3 |7 |2.6|[B, D, E] |
|4 |2.7|2 |3 |1 |2.9|[A, D, F] |
+---+---+---+---+---+---+-------------+
I hope the answer is helpful
you could use explode and array functions:
df.select(
    explode(
      array(
        df.columns.map(c => struct(lit(c).alias("key"), col(c).alias("val"))): _*
      )
    ).as("kv")
  )
  .where($"kv.val" > 2.8)
  .select($"kv.key")
  .show()
+---+
|key|
+---+
| B|
| D|
| E|
+---+
You could then collect this result. But I don't see any issue with collecting the dataframe first, as it has only 1 row:
df.columns.zip(df.first().toSeq.map(_.asInstanceOf[Double]))
  .collect { case (c, v) if v > 2.8 => c } // Array(B, D, E)
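For reference, here is a PySpark sketch of the same explode/array idea (my own translation, assuming a single-row dataframe with numeric columns):
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1.0, 4.4, 2.0, 3.0, 7.0, 2.6)], ["A", "B", "C", "D", "E", "F"])
threshold = 2.8

# Turn every column into a (key, val) struct, explode to one row per column,
# keep the values above the threshold, and collect the surviving column names.
kv = df.select(
    F.explode(
        F.array(*[F.struct(F.lit(c).alias("key"), F.col(c).alias("val")) for c in df.columns])
    ).alias("kv")
)
matching = [row["key"] for row in kv.where(F.col("kv.val") > threshold).select("kv.key").collect()]
print(matching)  # ['B', 'D', 'E']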
val c = df.columns.foldLeft(df){(a,b) => a.withColumn(b, when(col(b) > 2.8, b))}
c.collect
You can then remove the nulls from the collected rows to keep only the matching column names.