Update a column in PySpark while doing multiple inner joins? - apache-spark

I have a SQL query that I am trying to convert to PySpark. The query joins three tables and updates a column wherever there is a match. It looks like this:
UPDATE [DEPARTMENT_DATA]
INNER JOIN ([COLLEGE_DATA]
INNER JOIN [STUDENT_TABLE]
ON COLLEGE_DATA.UNIQUEID = STUDENT_TABLE.PROFESSIONALID)
ON DEPARTMENT_DATA.PUBLICID = COLLEGE_DATA.COLLEGEID
SET STUDENT_TABLE.PRIVACY = "PRIVATE"
The logic I have tried:
df_STUDENT_TABLE = (
    df_STUDENT_TABLE.alias('a')
    .join(
        df_COLLEGE_DATA.alias('b'),
        on=F.col('a.PROFESSIONALID') == F.col('b.UNIQUEID'),
        how='left',
    )
    .join(
        df_DEPARTMENT_DATA.alias('c'),
        on=F.col('b.COLLEGEID') == F.col('c.PUBLICID'),
        how='left',
    )
    .select(
        *[F.col(f'a.{c}') for c in df_STUDENT_TABLE.columns],
        F.when(
            F.col('b.UNIQUEID').isNotNull() & F.col('c.PUBLICID').isNotNull(),
            F.lit('PRIVATE')
        ).alias('PRIVACY')
    )
)
This code is adding a new column "PRIVACY", but giving null values after running.

I took some sample data and applied the join with these conditions. The query below returns the record whose privacy needs to be set to PRIVATE:
%sql
select student.*,college.*,department.* from department INNER JOIN college INNER JOIN student
ON college.unique_id = student.professional_id and department.public_id = college.college_id
When I used your code (same logic), I got the same output, i.e. an additional column with the required values is added to the dataframe, while the actual privacy column still has nulls.
from pyspark.sql.functions import col,when,lit
df_s = (
    df_s.alias('a')
    .join(df_c.alias('b'), col('a.professional_id') == col('b.unique_id'), 'left')
    .join(df_d.alias('c'), col('b.college_id') == col('c.public_id'), 'left')
    .select(
        *[col(f'a.{c}') for c in df_s.columns],
        when(col('b.unique_id').isNotNull() & col('c.public_id').isNotNull(), 'PRIVATE')
        .otherwise(col('a.privacy'))
        .alias('req_value'),
    )
)
df_s.show()
Since req_value now holds the required values and they need to end up in privacy, you can copy them over and drop the helper column:
final = df_s.withColumn('privacy',col('req_value')).select([column for column in df_s.columns if column!='req_value'])
final.show()
UPDATE:
You can also use the following code where I have updated the column using withColumn instead of select.
df_s = (
    df_s.alias('a')
    .join(df_c.alias('b'), col('a.professional_id') == col('b.unique_id'), 'left')
    .join(df_d.alias('c'), col('b.college_id') == col('c.public_id'), 'left')
    .withColumn(
        'privacy',
        when(col('b.unique_id').isNotNull() & col('c.public_id').isNotNull(), 'PRIVATE')
        .otherwise(col('privacy')),
    )
    .select(*df_s.columns)
)
#or you can use this as well, without using alias.
#df_s = df_s.join(df_c, df_s['professional_id'] == df_c['unique_id'],'left').join(df_d, df_c['college_id'] == df_d['public_id'],'left').withColumn('privacy',when(df_c['unique_id'].isNotNull() & df_d['public_id'].isNotNull(), 'PRIVATE').otherwise(df_s['privacy'])).select(*df_s.columns)
df_s.show()

After the joins, you can use nvl2. It checks whether the join with the last dataframe (df_dept) found a match: if it did, it returns "PRIVATE"; otherwise it returns the value from df_stud.PRIVACY.
Inputs:
from pyspark.sql import functions as F
df_stud = spark.createDataFrame([(1, 'x'), (2, 'STAY')], ['PROFESSIONALID', 'PRIVACY'])
df_college = spark.createDataFrame([(1, 1)], ['COLLEGEID', 'UNIQUEID'])
df_dept = spark.createDataFrame([(1,)], ['PUBLICID'])
df_stud.show()
# +--------------+-------+
# |PROFESSIONALID|PRIVACY|
# +--------------+-------+
# |             1|      x|
# |             2|   STAY|
# +--------------+-------+
Script:
df = (df_stud.alias('s')
    .join(df_college.alias('c'), F.col('s.PROFESSIONALID') == F.col('c.UNIQUEID'), 'left')
    .join(df_dept.alias('d'), F.col('c.COLLEGEID') == F.col('d.PUBLICID'), 'left')
    .select(
        *[f's.`{c}`' for c in df_stud.columns if c != 'PRIVACY'],
        F.expr("nvl2(d.PUBLICID, 'PRIVATE', s.PRIVACY) PRIVACY")
    )
)
df.show()
# +--------------+-------+
# |PROFESSIONALID|PRIVACY|
# +--------------+-------+
# |             1|PRIVATE|
# |             2|   STAY|
# +--------------+-------+
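To make the nvl2 step easier to follow, here is a minimal plain-Python sketch of the same decision, with no Spark involved. The `nvl2` function and the `joined_rows` list are illustrative stand-ins (not part of the answer): they model what the rows look like after both left joins, where PUBLICID is None whenever the join chain found no match.

```python
def nvl2(value, if_not_null, if_null):
    """Return if_not_null when value is not None, otherwise if_null."""
    return if_not_null if value is not None else if_null


# Hypothetical rows after both left joins: (PROFESSIONALID, PRIVACY, PUBLICID).
# PUBLICID is None where no college/department match was found.
joined_rows = [
    (1, "x", 1),
    (2, "STAY", None),
]

# Keep PROFESSIONALID and replace PRIVACY only where the joins matched.
updated = [
    (prof_id, nvl2(public_id, "PRIVATE", privacy))
    for prof_id, privacy, public_id in joined_rows
]
print(updated)  # [(1, 'PRIVATE'), (2, 'STAY')]
```

Note that the check is "is not null", not "is truthy", so a matched key of 0 would still count as a match.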

Related

How to add hours as variable to timestamp in Pyspark

Dataframe schema is like this:
["id", "t_create", "hours"]
string, timestamp, int
Sample data is like:
["abc", "2022-07-01 12:23:21.343998", 5]
I want to add hours to the t_create and get a new column t_update: "2022-07-01 17:23:21.343998"
Here is my code:
df_cols = ["id", "t_create", "hours"]
df = spark.read.format("delta").load("blablah path")
df = df.withColumn("t_update", df.t_create + expr(f"INTERVAL 5 HOURS"))
It works without problems. However, the number of hours should come from the hours column rather than a constant. I could not figure out how to get the column into expr, the f-string, and the INTERVAL syntax. I tried things like:
df = df.withColumn("t_update", df.t_create + expr(f"INTERVAL {df.hours} HOURS"))
df = df.withColumn("t_update", df.t_create + expr(f"INTERVAL {col(df.hours)} HOURS"))
etc... They don't work. Need help here.
Another idea was to write a UDF that returns the whole expr string:
#udf
def udf_interval(hours):
    return "INTERVAL " + str(hours) + " HOURS"
Then:
df = df.withColumn("t_update", df.t_create + expr(udf_interval(df.hours)))
Now I get TypeError: Column is not iterable.
Stuck. Need help in either the udf or non-udf way. Thanks!
You can do this without the fiddly unix_timestamp by using make_interval in Spark SQL.
SparkSQL - TO_TIMESTAMP & MAKE_INTERVAL
sql.sql("""
    WITH INP AS (
        SELECT
            "abc" as id,
            TO_TIMESTAMP("2022-07-01 12:23:21.343998","yyyy-MM-dd HH:mm:ss.SSSSSS") as t_create,
            5 as t_hour
    )
    SELECT
        id,
        t_create,
        t_hour,
        t_create + MAKE_INTERVAL(0,0,0,0,t_hour,0,0) as t_update
    FROM INP
""").show(truncate=False)
+---+--------------------------+------+--------------------------+
|id |t_create                  |t_hour|t_update                  |
+---+--------------------------+------+--------------------------+
|abc|2022-07-01 12:23:21.343998|5     |2022-07-01 17:23:21.343998|
+---+--------------------------+------+--------------------------+
Pyspark API
from io import StringIO
import pandas as pd
from pyspark.sql import functions as F

s = StringIO("""
id,t_create,t_hour
abc,2022-07-01 12:23:21.343998,5
"""
)
df = pd.read_csv(s,delimiter=',')
# Build the dataframe first, then show it, so sparkDF is not overwritten with None
sparkDF = sql.createDataFrame(df)\
    .withColumn('t_create'
        ,F.to_timestamp(F.col('t_create')
            ,'yyyy-MM-dd HH:mm:ss.SSSSSS'
        )
    ).withColumn('t_update'
        ,F.expr('t_create + MAKE_INTERVAL(0,0,0,0,t_hour,0,0)')
    )
sparkDF.show(truncate=False)
+---+--------------------------+------+--------------------------+
|id |t_create                  |t_hour|t_update                  |
+---+--------------------------+------+--------------------------+
|abc|2022-07-01 12:23:21.343998|5     |2022-07-01 17:23:21.343998|
+---+--------------------------+------+--------------------------+
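The f-string attempts in the question fail because an f-string is evaluated by Python before Spark ever sees it, so `{df.hours}` inserts the text form of a Column object instead of a column reference. Both snippets above avoid this by putting the column name itself into the expression string. A minimal sketch of that idea follows; `add_hours_expr` is a hypothetical helper name, and it assumes a Spark version where make_interval is available (3.0 or later, as far as I know).

```python
def add_hours_expr(ts_col: str, hours_col: str) -> str:
    """Build a Spark SQL expression string that adds hours_col hours to ts_col.

    Both arguments are column NAMES (plain strings), not Column objects.
    """
    return f"{ts_col} + make_interval(0, 0, 0, 0, {hours_col}, 0, 0)"


expression = add_hours_expr("t_create", "hours")
print(expression)  # t_create + make_interval(0, 0, 0, 0, hours, 0, 0)

# Usage sketch with the question's dataframe (needs pyspark.sql.functions.expr):
# df = df.withColumn("t_update", expr(expression))
```

Because the string only contains names, Spark resolves `hours` per row when the expression runs, which is what the constant `INTERVAL 5 HOURS` could not do.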
A simple way would be to cast the timestamp to bigint (or decimal if you are dealing with fractions of a second) and add the number of seconds to it. Here's an example where I've created a column for every calculation so each step is visible; you can merge all the calculations into a single column.
from pyspark.sql import functions as func

spark.sparkContext.parallelize([("2022-07-01 12:23:21.343998",)]).toDF(['ts_str']). \
    withColumn('ts', func.col('ts_str').cast('timestamp')). \
    withColumn('hours_to_add', func.lit(5)). \
    withColumn('ts_as_decimal', func.col('ts').cast('decimal(20, 10)')). \
    withColumn('seconds_to_add_as_decimal',
        func.col('hours_to_add').cast('decimal(20, 10)') * 3600
    ). \
    withColumn('new_ts_as_decimal',
        func.col('ts_as_decimal') + func.col('seconds_to_add_as_decimal')
    ). \
    withColumn('new_ts', func.col('new_ts_as_decimal').cast('timestamp')). \
    show(truncate=False)
# +--------------------------+--------------------------+------------+---------------------+-------------------------+---------------------+--------------------------+
# |ts_str                    |ts                        |hours_to_add|ts_as_decimal        |seconds_to_add_as_decimal|new_ts_as_decimal    |new_ts                    |
# +--------------------------+--------------------------+------------+---------------------+-------------------------+---------------------+--------------------------+
# |2022-07-01 12:23:21.343998|2022-07-01 12:23:21.343998|5           |1656678201.3439980000|18000.0000000000         |1656696201.3439980000|2022-07-01 17:23:21.343998|
# +--------------------------+--------------------------+------------+---------------------+-------------------------+---------------------+--------------------------+
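The arithmetic behind the decimal columns can be checked without Spark. The sketch below simply replays the numbers from the table above with Python's `decimal` module; the variable names mirror the column names and are otherwise arbitrary.

```python
from decimal import Decimal

# Epoch seconds of the original timestamp, taken from the ts_as_decimal column.
ts_as_decimal = Decimal("1656678201.343998")
hours_to_add = 5

# One hour is 3600 seconds, so 5 hours is 18000 seconds.
seconds_to_add = Decimal(hours_to_add) * 3600
new_ts_as_decimal = ts_as_decimal + seconds_to_add

print(seconds_to_add)     # 18000
print(new_ts_as_decimal)  # 1656696201.343998
```

Using a decimal rather than a bigint is what keeps the `.343998` fraction intact through the addition.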

I need to create a pyspark UDF that outputs a table from a query with a comparison

I am working with the IBM attrition data set on Kaggle. What I am trying to do is count occurrences of categorical variables to Attrition == 'Yes', and Attrition == 'No', and take the simple ratio to see which level of the categorical variable is more likely to attrite. Now I can do this in Pandas, like this:
def cal_ratio(x):
    n_1 = sum(x['Attrition'].values == 'Yes')
    n_0 = sum(x['Attrition'].values == 'No')
    return n_1/n_0
Or I could easily enough write a spark.sql query that does it, and re-write it for each categorical variable I want to compare. A function like this one for Pandas would make my life easier, but I can't find any real guidance on how to create this sort of UDF nor how to register it.
EDIT: may be helpful if I ask also how would this work in pyspark with the UDF?
b = data.groupby('BusinessTravel').apply(cal_ratio)
Not sure it is the best solution but you can try this :
# My sample dataframe
df.show()
+---------+
|Attrition|
+---------+
|      Yes|
|      Yes|
|      Yes|
|      Yes|
|      Yes|
|       No|
|       No|
+---------+
from pyspark.sql import functions as F
result = (
    df.agg(
        F.sum(F.when(F.col("Attrition") == "Yes", 1)).alias("Yes"),
        F.sum(F.when(F.col("Attrition") == "No", 1)).alias("No"),
    )
    .select((F.col("Yes") / F.col("No")).alias("ratio"))
    .first()
)
print(result.ratio)
> 2.5
You can, of course, wrap this in a function and replace the hard-coded values with parameters.
def cal_ratio(df):
    result = (
        df.agg(
            F.sum(F.when(F.col("Attrition") == "Yes", 1)).alias("Yes"),
            F.sum(F.when(F.col("Attrition") == "No", 1)).alias("No"),
        )
        .select((F.col("Yes") / F.col("No")).alias("ratio"))
        .first()
    )
    return result.ratio
EDIT: If you need to group by a column, replace first with collect, and keep the grouping column in the select so each ratio can be matched to its group:
def cal_ratio(df):
    result = (
        df.groupBy("BusinessTravel")
        .agg(
            F.sum(F.when(F.col("Attrition") == "Yes", 1)).alias("Yes"),
            F.sum(F.when(F.col("Attrition") == "No", 1)).alias("No"),
        )
        .select("BusinessTravel", (F.col("Yes") / F.col("No")).alias("ratio"))
        .collect()
    )
    return result

Pyspark DataFrame: find difference between two DataFrames (values and column names)

My dataframes have 100+ columns in total.
I am trying to compare two dataframes and find the unmatched records, along with the name of the column that differs.
The code below gives the expected output, but when I run it for 100+ columns the job is aborted.
I am doing this to find errors in an SCD Type 2 delta process.
from pyspark.sql.types import *
from pyspark.sql.functions import *
d2 = sc.parallelize([("A1", 500,1005) ,("A2", 700,10007)])
dataFrame1 = sqlContext.createDataFrame(d2, ["ID", "VALUE1", "VALUE2"])
d2 = sc.parallelize([("A1", 600,1005),("A2", 700,10007)])
dataFrame2 = sqlContext.createDataFrame(d2, ["ID", "VALUE1", "VALUE2"])
key_id_col_name="ID"
key_id_value="A1"
dataFrame1.select("ID","VALUE1").subtract(dataFrame2.select("ID",col("VALUE1").alias("value"))).show()
def unequalColumnValuesTwoDF(dataFrame1,dataFrame2,key_id_col_name,key_id_value):
    chk_fst=True
    dataFrame1 = dataFrame1.where(dataFrame1[key_id_col_name] == key_id_value)
    dataFrame2 = dataFrame2.where(dataFrame2[key_id_col_name] == key_id_value)
    col_names = list(set(dataFrame1.columns).intersection(dataFrame2.columns))
    col_names.remove(key_id_col_name)
    for col_name in col_names:
        if chk_fst == True:
            df_tmp = dataFrame1.select(col(key_id_col_name).alias("KEY_ID"),col(col_name).alias("VALUE")).subtract(dataFrame2.select(col(key_id_col_name).alias("KEY_ID"),col(col_name).alias("VALUE"))).withColumn("COL_NAME",lit(col_name))
            chk_fst = False
        else:
            df_tmp = df_tmp.unionAll(dataFrame1.select(col(key_id_col_name).alias("KEY_ID"),col(col_name).alias("VALUE")).subtract(dataFrame2.select(col(key_id_col_name).alias("KEY_ID"),col(col_name).alias("VALUE"))).withColumn("COL_NAME",lit(col_name)))
    return df_tmp
res_df = unequalColumnValuesTwoDF(dataFrame1,dataFrame2,key_id_col_name,key_id_value)
res_df.show()
>>> dataFrame1.show()
+---+------+------+
| ID|VALUE1|VALUE2|
+---+------+------+
| A1|   500|  1005|
| A2|   700| 10007|
+---+------+------+
>>> dataFrame2.show()
+---+------+------+
| ID|VALUE1|VALUE2|
+---+------+------+
| A1|   600|  1005|
| A2|   700| 10007|
+---+------+------+
>>> res_df.show()
+------+-----+--------+
|KEY_ID|VALUE|COL_NAME|
+------+-----+--------+
|    A1|  500|  VALUE1|
+------+-----+--------+
Please suggest any other way.
Here is another approach:
Join the two DataFrames using the ID column.
Then for each row, create a new column which contains the columns for which there is a difference.
Create this new column as a key-value pair map using pyspark.sql.functions.create_map() [1].
The key for the map will be the column name.
Using pyspark.sql.functions.when(), set the value to the corresponding value in dataFrame1 (as that seems to be what you want from your example) if there is a difference between the two DataFrames. Otherwise, set the value to None.
Use pyspark.sql.functions.explode() on the map column, and filter out any rows where the difference is null using pyspark.sql.functions.isnull().
Select the columns you want and rename using alias().
Example:
from functools import reduce  # reduce is not a builtin in Python 3
import pyspark.sql.functions as f
columns = [c for c in dataFrame1.columns if c != 'ID']
dataFrame1.alias('r').join(dataFrame2.alias('l'), on='ID')\
    .withColumn(
        'diffs',
        f.create_map(
            *reduce(
                list.__add__,
                [
                    [
                        f.lit(c),
                        f.when(
                            f.col('r.'+c) != f.col('l.'+c),
                            f.col('r.'+c)
                        ).otherwise(None)
                    ]
                    for c in columns
                ]
            )
        )
    )\
    .select([f.col('ID'), f.explode('diffs')])\
    .where(~f.isnull(f.col('value')))\
    .select(
        f.col('ID').alias('KEY_ID'),
        f.col('value').alias('VALUE'),
        f.col('key').alias('COL_NAME')
    )\
    .show(truncate=False)
#+------+-----+--------+
#|KEY_ID|VALUE|COL_NAME|
#+------+-----+--------+
#|A1    |500  |VALUE1  |
#+------+-----+--------+
Notes
[1] The syntax *reduce(list.__add__, [[f.lit(c), ...] for c in columns]) as the argument to create_map() is some python-fu that helps create the map dynamically.
create_map() expects an even number of arguments- it assumes that the first argument in every pair is the key and the second is the value. In order to put the arguments in that order, the list comprehension yields a list for each iteration. We reduce this list of lists into a flat list using list.__add__.
Finally the * operator is used to unpack the list.
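The flattening step in this note can be tried in isolation. The sketch below uses plain strings in place of the Spark Column objects (the `diff(...)` labels are placeholders, not real expressions) and also shows `itertools.chain.from_iterable` as an alternative that gives the same flat list. In Python 3, `reduce` has to be imported from `functools`.

```python
from functools import reduce
from itertools import chain

columns = ["VALUE1", "VALUE2"]

# One [key, value] pair per column, mirroring [f.lit(c), f.when(...)] above.
pairs = [[c, f"diff({c})"] for c in columns]

# list.__add__ concatenates two lists, so reducing over the pairs flattens them.
flat = reduce(list.__add__, pairs)
print(flat)  # ['VALUE1', 'diff(VALUE1)', 'VALUE2', 'diff(VALUE2)']

# Alternative with the same result, without building intermediate lists.
flat_chain = list(chain.from_iterable(pairs))
print(flat_chain == flat)  # True
```

The flat list alternates key, value, key, value, which is the argument order create_map() expects once it is unpacked with `*`.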
Here is the intermediate output, which may make the logic clearer:
dataFrame1.alias('r').join(dataFrame2.alias('l'), on='ID')\
    .withColumn(
        'diffs',
        f.create_map(
            *reduce(
                list.__add__,
                [
                    [
                        f.lit(c),
                        f.when(
                            f.col('r.'+c) != f.col('l.'+c),
                            f.col('r.'+c)
                        ).otherwise(None)
                    ]
                    for c in columns
                ]
            )
        )
    )\
    .select('ID', 'diffs').show(truncate=False)
#+---+-----------------------------------+
#|ID |diffs                              |
#+---+-----------------------------------+
#|A2 |Map(VALUE1 -> null, VALUE2 -> null)|
#|A1 |Map(VALUE1 -> 500, VALUE2 -> null) |
#+---+-----------------------------------+

Pyspark dataframe join elements as variables

I am having trouble passing the join keys as variables to the PySpark DataFrame join function. I read the primary key fields from a file, and when I pass them to the join as a variable it throws "cannot resolve the column name", because the condition is passed as a string. Please help.
pr_str = ""
for i in range(len(pr_list)):
    if i != len(pr_list)-1:
        pr_str += " (df_a." + pr_list[i] + " == df_b." +pr_list[i] +") & "
    else:
        pr_str += "(df_a." + pr_list[i] + " == df_b." +pr_list[i] +")"
print (pr_str)
df1_with_db2 = df_a.join(df_b, pr_str ,'inner').select('df_a.*')
You get this error because you are passing the join condition as a string. The join condition accepts a single column name, a list of column names, or a Column expression, so only a minor change to the code is needed:
df1_with_db2 = df_a.alias("df_a").join(df_b, eval(pr_str) ,'inner').select('df_a.*')
Judging by the error, either pr_list contains columns that are not present in both dataframes, or you did not alias your dataframes before joining, like this:
df1_with_db2 = df_a.alias("df_a").join(df_b.alias("df_b"), pr_str ,'inner').select('df_a.*')
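As an alternative to building a string and calling eval, the condition can be assembled directly from Column expressions. This is a minimal sketch, not part of either answer: `build_join_condition` is a hypothetical helper, and it assumes every name in the key list exists in both dataframes. It only relies on `left[k] == right[k]` and `&`, which PySpark DataFrames and Columns support.

```python
from functools import reduce
from operator import and_


def build_join_condition(left, right, keys):
    """Combine left[k] == right[k] for every key with & (AND on Columns)."""
    if not keys:
        raise ValueError("keys must contain at least one column name")
    return reduce(and_, [left[k] == right[k] for k in keys])


# Usage sketch with the question's names (df_a, df_b, pr_list):
# cond = build_join_condition(df_a, df_b, pr_list)
# df1_with_db2 = df_a.join(df_b, cond, "inner").select([df_a[c] for c in df_a.columns])
```

This keeps the key names as data rather than code, so nothing read from the file is ever evaluated as Python.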
Here is how I would solve this problem:
In your code, both dataframes have the same column names, and those names are in the list pr_list.
So you can just pass this list as the join condition, as below (the join is inner by default):
df1_with_db2 = df_a.join(
    df_b,
    pr_list
)
Each common column appears only once in the result, so there is no need for a select.
Here is an example:
df1 = sqlContext.createDataFrame([
    [1,2],
    [3,4],
    [9,8]
], ['a', 'b'])
df2 = sqlContext.createDataFrame([
    [1,2],
    [3,4],
    [18,19]
], ['a', 'b'])
jlist = ['a','b']
df1.join(df2, jlist).show()
+---+---+
|  a|  b|
+---+---+
|  1|  2|
|  3|  4|
+---+---+

Find all nulls with SQL query over pyspark dataframe

I have a dataframe of StructField with a mixed schema (DoubleType, StringType, LongType, etc.).
I want to 'iterate' over all columns to return summary statistics. For instance:
set_min = df.select([
    fn.min(self.df[c]).alias(c) for c in self.df.columns
]).collect()
Is what I'm using to find the minimum value in each column. That works fine. But when I try something designed similar to find Nulls:
set_null = df.filter(
    (lambda x: self.df[x]).isNull().count()
).collect()
I get the TypeError: condition should be string or Column which makes sense, I'm passing a function.
or with list comprehension:
set_null = self.df[c].alias(c).isNull() for c in self.df.columns
Then I try pass it a SQL query as a string:
set_null = df.filter('SELECT fields FROM table WHERE column = NUL').collect()
I get:
ParseException: "\nmismatched input 'FROM' expecting <EOF>(line 1, pos 14)\n\n== SQL ==\nSELECT fields FROM table WHERE column = NULL\n--------------^^^\n"
How can I pass my function as a 'string or Column' so I can use filter or where? Alternatively, why won't the pure SQL statement work?
There are things wrong in several parts of your attempts:
You are missing square brackets in your list comprehension example.
You missed an L in NUL.
Your pure SQL doesn't work because filter/where expects a where clause, not a full SQL statement. The two are aliases of each other; I prefer where because it makes clear that only such a clause is needed.
In the end you don't need to use where, like karlson also shows. But subtracting from the total count means you have to evaluate the dataframe twice (which can be alleviated by caching, but still not ideal). There is a more direct way:
>>> df.select([fn.sum(fn.isnull(c).cast('int')).alias(c) for c in df.columns]).show()
+---+---+
|  A|  B|
+---+---+
|  2|  3|
+---+---+
This works because casting a boolean value to integer gives 1 for True and 0 for False. If you prefer SQL, the equivalent is:
df.select([fn.expr('SUM(CAST(({c} IS NULL) AS INT)) AS {c}'.format(c=c)) for c in df.columns]).show()
or nicer, without a cast:
df.select([fn.expr('SUM(IF({c} IS NULL, 1, 0)) AS {c}'.format(c=c)) for c in df.columns]).show()
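The counting trick itself is independent of Spark. As a minimal plain-Python sketch, with two lists standing in for columns A and B (the values are an assumption chosen so that the counts match the output shown above), summing a boolean cast to int counts the nulls in one pass:

```python
# Illustrative column values; None plays the role of SQL NULL.
column_a = [1, 1, None, 1, None, 1]
column_b = [None, 1, None, 1, 1, None]

# int(True) == 1 and int(False) == 0, so the sum is the number of nulls.
null_count_a = sum(int(v is None) for v in column_a)
null_count_b = sum(int(v is None) for v in column_b)
print(null_count_a, null_count_b)  # 2 3

# Same result by subtracting the non-null count from the total.
total = len(column_a)
non_null_a = sum(1 for v in column_a if v is not None)
print(total - non_null_a)  # 2
```

In Spark the subtraction variant needs the total row count as a separate job, which is the extra evaluation the answer above mentions.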
If you want a count of NULL values per column you could count the non-null values and subtract from the total.
For example:
from pyspark.sql import SparkSession
from pyspark.sql import functions as fn
spark = SparkSession.builder.master("local").getOrCreate()
df = spark.createDataFrame(
    data=[
        (1, None),
        (1, 1),
        (None, None),
        (1, 1),
        (None, 1),
        (1, None),
    ],
    schema=("A", "B")
)
total = df.count()
missing_counts = df.select(
    *[(total - fn.count(col)).alias("missing(%s)" % col) for col in df.columns]
)
missing_counts.show()
>>> +----------+----------+
... |missing(A)|missing(B)|
... +----------+----------+
... |         2|         3|
... +----------+----------+
