How to compare 2 dataframes in pyspark based on dynamic columns - apache-spark

I have 2 dataframes from different sources that I am processing in pyspark. These dataframes have a few columns in common. Here is what I need to do:
compare the 2 dataframes on columns that are generated dynamically
for matched rows, do further processing based on column values from both left and right
I also need the unmatched records from both dataframes
Here is what I mean:
df1 = [Row(name='Bob', sub_id=1, id=1, age=5, status='active', version=0),
       Row(name='Rob', sub_id=1, id=1, age=5, status='active', version=1),
       Row(name='Tom', sub_id=2, id=3, age=50, status='active', version=0)]
df2 = [Row(name='Bobbie', sub_id=1, age=5),
       Row(name='Tom', sub_id=2, age=51),
       Row(name='Carter', sub_id=3, age=70)]
"""
my keys are based on sub_id and say they are per below,
sub_id = 1, keys = [sub_id]
sub_id = 2, keys = [sub_id, age]
sub_id = 3, keys = [sub_id]
"""
#matched records - expected results
#note that only sub_id=1 has a match based on the keys. After the match and further processing, the version 0 record in df1 is copied as version 2 and updated per df2
df_matched = [
Row(name='Bobbie', sub_id=1, id=1, age=5, status='active', version=0), #updated per df2
Row(name='Rob', sub_id=1, id=1, age=5, status='active', version=1),
Row(name='Bob', sub_id=1, id=1, age=5, status='active', version=2) #new insert
]
#unmatched from df1
df_unmatched_left = [Row(name='Rob', sub_id=1, id=1, age=5, status='active', version=1)]
#unmatched from df2
df_unmatched_right = [Row(name='Tom', sub_id=2, age=51),
Row(name='Carter', sub_id=3, age=70)]
#Here is what I have tried so far
#created temp views of df1 and df2, say df1_table, df2_table
#but how do I make the joins dynamic, and how do I do the further processing, for which I need both df1 and df2?
df_matched = spark.sql("sql to join df1_table and df2_table").select("select from both dfs") #use some function with map for further processing?
#to build the join condition, would using map and adding a temp column with the join condition to use in the query be a good approach?
I am using pyspark 2.3.x and python 3.5.x
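For reference, here is a minimal sketch of how the join condition could be built dynamically from a list of key columns. The dict name keys_by_sub_id, the helper function, and the functools.reduce approach are my own illustration, not part of the question; since the key set differs per sub_id, you would still filter each dataframe on a sub_id before calling the helper and union the per-group results afterwards.

from functools import reduce

# hypothetical mapping of sub_id -> join key columns, as described in the question
keys_by_sub_id = {1: ["sub_id"], 2: ["sub_id", "age"], 3: ["sub_id"]}

def split_by_keys(left, right, keys):
    # AND together one equality predicate per dynamic key column
    cond = reduce(lambda a, b: a & b, [left[k] == right[k] for k in keys])
    matched = left.join(right, cond, "inner")
    unmatched_left = left.join(right, cond, "left_anti")
    unmatched_right = right.join(left, cond, "left_anti")
    return matched, unmatched_left, unmatched_right

# hypothetical usage for sub_id = 2
m, ul, ur = split_by_keys(df1.filter("sub_id = 2"), df2.filter("sub_id = 2"), keys_by_sub_id[2])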

Related

Extracting Data From Pandas DataFrame

I have two pandas dataframes named df1 and df2. I want to extract the same-named files from both dataframes and put the extracted names into two columns of a new dataframe. I want to take the file names from df1 and match them against df2 (df2 has more files than df1). There is only one column in each dataframe (df1 and df2). The bold part starting with the letter s**** is the common matching alphanumeric string; we have to match both dataframes on that.
df1["Text_File_Location"] =
0 /home/mzkhan/2.0.0/files/p15/p15546261/s537061.txt
1 /home/mzkhan/2.0.0/files/p15/p15098455/s586955.txt
df2["Image_File_Location"]=
0 /media/mzkhan/external_dr/radiology_image/2.0.0/files/p10/p10000032/s537061/02aa804e- bde0afdd-112c0b34-7bc16630-4e384014.jpg
1 /media/mzkhan/external_dr/radiology_image/2.0.0/files/p10/p10000032/s586955/174413ec-4ec4c1f7-34ea26b7-c5f994f8-79ef1962.jpg
In Python 3.4+, you can use pathlib to handily work with filepaths. You can extract the filename without extension ("stem") from df1 and then you can extract the parent folder name from df2. Then, you can do an inner merge on those names.
import pandas as pd
from pathlib import Path
df1 = pd.DataFrame(
    {
        "Text_File_Location": [
            "/home/mzkhan/2.0.0/files/p15/p15546261/s537061.txt",
            "/home/mzkhan/2.0.0/files/p15/p15098455/s586955.txt",
        ]
    }
)
df2 = pd.DataFrame(
    {
        "Image_File_Location": [
            "/media/mzkhan/external_dr/radiology_image/2.0.0/files/p10/p10000032/s537061/02aa804e- bde0afdd-112c0b34-7bc16630-4e384014.jpg",
            "/media/mzkhan/external_dr/radiology_image/2.0.0/files/p10/p10000032/s586955/174413ec-4ec4c1f7-34ea26b7-c5f994f8-79ef1962.jpg",
            "/media/mzkhan/external_dr/radiology_image/2.0.0/files/p10/p10000032/foo/bar.jpg",
        ]
    }
)
df1["name"] = df1["Text_File_Location"].apply(lambda x: Path(str(x)).stem)
df2["name"] = df2["Image_File_Location"].apply(lambda x: Path(str(x)).parent.name)
df3 = pd.merge(df1, df2, on="name", how="inner")
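Because the merge is inner, files that appear in only one dataframe (like the foo/bar.jpg row above) are dropped. If you then want just the two matched location columns side by side, a small follow-up (assuming df3 from above) is:

matched = df3[["Text_File_Location", "Image_File_Location"]]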

Upsert/Merge two dataframe in pyspark

I need help with the below requirement. This is just sample data; in the real use case I have more than 200 columns in each dataframe. I need to compare two dataframes and flag the differences.
df1
id,name,city
1,abc,pune
2,xyz,noida
df2
id,name,city
1,abc,pune
2,xyz,bangalore
3,kk,mumbai
expected dataframe
id,name,city,flag
1,abc,pune,same
2,xyz,bangalore,update
3,kk,mumbai,new
Can someone please help me build the logic in pyspark?
Thanks in advance.
PySpark's hash function can help identify the records that are different.
https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.hash.html
from pyspark.sql.functions import col, hash

df1 = df1.withColumn('hash_value', hash('id', 'name', 'city'))
df2 = df2.withColumn('hash_value', hash('id', 'name', 'city'))

df_updates = (df1.alias('a')
              .join(df2.alias('b'),
                    (col('a.id') == col('b.id')) &
                    (col('a.hash_value') != col('b.hash_value')),
                    how='inner'))
df_updates = df_updates.select('b.*')
Once you have identified the records that are different, you can set up a function that loops through each column in the dataframe and compares that column's values.
Something like this should work:
from pyspark.sql.functions import col, when

def add_change_flags(df1, df2):
    df_joined = df1.alias('df1').join(df2.alias('df2'), 'id', how='inner')
    for column in df1.columns:
        if column == 'id':
            continue  # the join key is already equal on both sides
        df_joined = df_joined.withColumn(
            column + "_change_flag",
            when(col(f"df1.{column}") == col(f"df2.{column}"), True).otherwise(False))
    return df_joined
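To produce the exact same/update/new flags from the expected output, one possible sketch (my own illustration building on the hash idea above, using the column names from the sample data) is:

from pyspark.sql import functions as F

cols = ['id', 'name', 'city']
a = df1.withColumn('hash_value', F.hash(*cols)).alias('a')
b = df2.withColumn('hash_value', F.hash(*cols)).alias('b')

flagged = (b.join(a, F.col('a.id') == F.col('b.id'), 'left')
           .select([F.col('b.' + c) for c in cols] +
                   [F.when(F.col('a.id').isNull(), 'new')
                     .when(F.col('a.hash_value') == F.col('b.hash_value'), 'same')
                     .otherwise('update').alias('flag')]))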

Iteratively append new data into pandas dataframe column and join with another dataframe

I have been extracting data from many APIs. I would like to add a common id column to the data from each API.
I have tried the below:
df = pd.DataFrame()
for i in range(1,200):
    url = '{id}/values'.format(id=i)
    res = request.get(url,headers=headers)
    if res.status_code==200:
        data =json.loads(res.content.decode('utf-8'))
        if data['success']:
            df['id'] = i
            test = pd.json_normalize(data[parent][child])
            df = df.append(test,index=False)
But in the dataframe's id column I am only getting the last iterated id, and when an API returns many rows I am getting invalid data.
For performance reasons it is better to first store the data in a dictionary and then create the dataframe from that dictionary:
import pandas as pd
from collections import defaultdict

d = defaultdict(list)
for i in range(1,200):
    # simulate dataframe retrieved from pd.json_normalize() call
    row = pd.DataFrame({'id': [i], 'field1': [f'f1-{i}'], 'field2': [f'f2-{i}'], 'field3': [f'f3-{i}']})
    for k, v in row.to_dict().items():
        d[k].append(v[0])
df = pd.DataFrame(d)
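An alternative sketch (my own, not part of the answer above): tag each normalized frame with its id and concatenate once at the end. This also fixes the "only the last id" symptom in the question, which comes from assigning a scalar id to the accumulating dataframe inside the loop.

import pandas as pd

frames = []
for i in range(1, 200):
    # simulate the frame returned by pd.json_normalize(data[parent][child])
    test = pd.DataFrame({'field1': [f'f1-{i}'], 'field2': [f'f2-{i}']})
    test['id'] = i  # one id value per retrieved block of rows
    frames.append(test)
df = pd.concat(frames, ignore_index=True)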

How to check NULL values while comparing 2 text files using spark data frames

The below code fails to capture records with null values. In df1 below, the record with NO 5 has a null value in the NAME field.
As per my required OutputDF below, the NO 5 record should appear as shown, but after running the code that record does not make it into the final output. Records with null values are not coming through to the output; apart from that, everything else works fine.
df1
NO DEPT NAME SAL
1 IT RAM 1000
2 IT SRI 600
3 HR GOPI 1500
5 HW 700
df2
NO DEPT NAME SAL
1 IT RAM 1000
2 IT SRI 900
4 MT SUMP 1200
5 HW MAHI 700
OutputDF
NO DEPT NAME SAL FLAG
1 IT RAM 1000 SAME
2 IT SRI 900 UPDATE
4 MT SUMP 1200 INSERT
3 HR GOPI 1500 DELETE
5 HW MAHI 700 UPDATE
from pyspark.shell import spark
from pyspark.sql import DataFrame
import pyspark.sql.functions as F
sc = spark.sparkContext
filedf1 = spark.read.option("header","true").option("delimiter", ",").csv("C:\\files\\file1.csv")
filedf2 = spark.read.option("header","true").option("delimiter", ",").csv("C:\\files\\file2.csv")
filedf1.createOrReplaceTempView("table1")
filedf2.createOrReplaceTempView("table2")
df1 = spark.sql( "select * from table1" )
df2 = spark.sql( "select * from table2" )
#DELETE
df_d = df1.join(df2, df1.NO == df2.NO, 'left').filter(F.isnull(df2.NO)).select(df1.NO,df1.DEPT,df1.NAME,df1.SAL, F.lit('DELETE').alias('FLAG'))
print("df_d left:",df_d.show())
#INSERT
df_i = df1.join(df2, df1.NO == df2.NO, 'right').filter(F.isnull(df1.NO)).select(df2.NO,df2.DEPT,df2.NAME,df2.SAL, F.lit('INSERT').alias('FLAG'))
print("df_i right:",df_i.show())
#SAME
df_s = df1.join(df2, df1.NO == df2.NO, 'inner').filter(F.concat(df2.NO,df2.DEPT,df2.NAME,df2.SAL) == F.concat(df1.NO,df1.DEPT,df1.NAME,df1.SAL)).select(df1.NO,df1.DEPT,df1.NAME,df1.SAL, F.lit('SAME').alias('FLAG'))
print("df_s inner:",df_s.show())
#UPDATE
df_u = df1.join(df2, df1.NO == df2.NO, 'inner').filter(F.concat(df2.NO,df2.DEPT,df2.NAME,df2.SAL) != F.concat(df1.NO,df1.DEPT,df1.NAME,df1.SAL)).select(df2.NO,df2.DEPT,df2.NAME,df2.SAL, F.lit('UPDATE').alias('FLAG'))
print("df_u inner:",df_u.show())
df = df_d.union(df_i).union(df_s).union(df_u)
df.show()
Here I am comparing df1 and df2: if a record is new in df2 I flag it INSERT, if a record is the same in both dfs I flag it SAME, if a record is in df1 but not in df2 I flag it DELETE, and if a record exists in both dfs but with different values I take the df2 values and flag it UPDATE.
There are two issues with the code:
F.concat returns null if any of its inputs is null, so this part of the code filters out row NO 5:
.filter(F.concat(df2.NO, df2.NAME, df2.SAL) != F.concat(df1.NO, df1.NAME, df1.SAL))
You are only selecting columns from df2. That is fine in the example above, but if df2 has a null then the resulting dataframe will contain that null.
You can try concatenating with a UDF as below:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def concat_cols(row):
    concat_row = ''.join([str(col) for col in row if col is not None])
    return concat_row

udf_concat_cols = udf(concat_cols, StringType())
The function concat_cols can be broken down into two parts:
"".join(mylist) is a string method. It joins everything in the list with the given delimiter, in this case an empty string.
[str(col) for col in row if col is not None] is a list comprehension; it does exactly what it reads: for each column in the row, if the column is not None, append str(col) to the list.
A list comprehension is just a more Pythonic way of writing this:
mylist = []
for col in row:
    if col is not None:
        mylist.append(str(col))
You can replace your update code as:
df_u = (df1
        .join(df2, df1.NO == df2.NO, 'inner')
        .filter(udf_concat_cols(F.struct(df1.NO, df1.NAME, df1.SAL)) != udf_concat_cols(F.struct(df2.NO, df2.NAME, df2.SAL)))
        .select(F.coalesce(df1.NO, df2.NO).alias('NO'),
                F.coalesce(df1.NAME, df2.NAME).alias('NAME'),
                F.coalesce(df1.SAL, df2.SAL).alias('SAL'),
                F.lit('UPDATE').alias('FLAG')))
You should do something similar for your #SAME flag and break the line for readability.
Update:
If df2 always has the correct (updated) values, there is no need to coalesce.
The code for this instance would be:
df_u = (df1
        .join(df2, df1.NO == df2.NO, 'inner')
        .filter(udf_concat_cols(F.struct(df1.NO, df1.NAME, df1.SAL)) != udf_concat_cols(F.struct(df2.NO, df2.NAME, df2.SAL)))
        .select(df2.NO,
                df2.NAME,
                df2.SAL,
                F.lit('UPDATE').alias('FLAG')))
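A UDF-free alternative (my own note, not part of the original answer): F.concat_ws skips null columns, so it can stand in for the UDF here. Like the UDF, it cannot distinguish a null column from an empty string.

df_u = (df1
        .join(df2, df1.NO == df2.NO, 'inner')
        .filter(F.concat_ws('|', df1.NO, df1.NAME, df1.SAL) !=
                F.concat_ws('|', df2.NO, df2.NAME, df2.SAL))
        .select(df2.NO, df2.NAME, df2.SAL, F.lit('UPDATE').alias('FLAG')))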

Calling a data frame via a string

I have a list of countries such as:
country = ["Brazil", "Chile", "Colombia", "Mexico", "Panama", "Peru", "Venezuela"]
I created data frames using the names from the country list:
for c in country:
    c = pd.read_excel(str(c + ".xls"), skiprows = 1)
    c = pd.to_datetime(c.Date, infer_datetime_format=True)
    c = c[["Date", "spreads"]]
Now I want to be able to merge all the countries data frames using the columns date as the key. The idea is to create a loop like the following:
df = Brazil #this is the first dataframe, which also corresponds to the first element of the list country.
for i in range(len(country)-1):
    df = df.merge(country[i+1], on = "Date", how = "inner")
df.set_index("Date", inplace=True)
I got the error ValueError: can not merge DataFrame with instance of type <class 'str'>. It seems Python is not resolving the dataframes whose names are in the country list. How can I refer to those dataframes starting from the country list?
Thanks masters!
Your loop doesn't modify the contents of the country list, so country is still a list of strings.
Consider building a new list of dataframes and looping over that:
country_dfs = []
for c in country:
    df = pd.read_excel(c + ".xls", skiprows=1)
    df["Date"] = pd.to_datetime(df["Date"], infer_datetime_format=True)
    df = df[["Date", "spreads"]]
    # add the new dataframe to our list of dataframes
    country_dfs.append(df)
Then, to merge:
merged_df = country_dfs[0]
for df in country_dfs[1:]:
    merged_df = merged_df.merge(df, on='Date', how='inner')
merged_df.set_index('Date', inplace=True)
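If you really do want to refer to each dataframe by its country string, as the title asks, a dict keyed by country name is one option; this sketch assumes the same read and processing steps as the list-based answer above:

country_dfs = {}
for c in country:
    df = pd.read_excel(c + ".xls", skiprows=1)
    df["Date"] = pd.to_datetime(df["Date"], infer_datetime_format=True)
    df = df[["Date", "spreads"]]
    country_dfs[c] = df

# now e.g. country_dfs["Brazil"] is the Brazil dataframe
merged_df = country_dfs[country[0]]
for c in country[1:]:
    merged_df = merged_df.merge(country_dfs[c], on="Date", how="inner")
merged_df.set_index("Date", inplace=True)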
