I need help with the following requirement. The data shown here is only a sample; in the real use case each data frame has more than 200 columns. I need to compare two data frames and flag the differences.
Expected dataframe
Can someone please help me build the logic in PySpark?
Thanks in advance.

PySpark's hash function can help identify the records that differ.
from pyspark.sql.functions import col, hash
df1 = df1.withColumn('hash_value', hash('id', 'name', 'city'))
df2 = df2.withColumn('hash_value', hash('id', 'name', 'city'))
# Keep pairs that share an id but whose hashes differ
df_updates = df1.alias('a').join(df2.alias('b'),
    (col('a.id') == col('b.id')) &
    (col('a.hash_value') != col('b.hash_value')),
    how='inner')
df_updates = df_updates.select('b.*')  # keep the df2 side of each changed pair
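Once you have identified the records that differ, you can set up a function that loops through each column in the dataframe and compares that column's values.
Something like this should work:
from pyspark.sql.functions import col, when
def add_change_flags(df1, df2):
    # Alias both sides so same-named columns can be told apart after the join
    df_joined = df1.alias('df1').join(df2.alias('df2'), 'id', how='inner')
    for column in df1.columns:
        if column == 'id':
            continue  # the join key is equal by construction
        # Flag is False when the two values match, True otherwise
        df_joined = df_joined.withColumn(column + "_change_flag",
            when(col(f"df1.{column}") == col(f"df2.{column}"), False).otherwise(True))
    return df_joined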


Pyspark Dataframe: Transform many columns

I have a pyspark dataframe with 10 columns as read from a parquet file
df = spark.read.parquet(...)
I want to apply several pre-processing steps to a subset of this dataframe's columns: col_list.
The following works fine, but apart from being a bit ugly, I also have the feeling it is not optimal.
import pyspark.sql.functions as F
for col in col_list:
    df = df.withColumn(col, F.regexp_replace(col, ".", " "))
    df = df.withColumn(col, F.regexp_replace(col, "_[A-Z]_", ""))
and the list goes on with other similar text processing steps.
So the question is whether the above is as optimal and elegant as it gets and also if/how I can use transform to achieve a sequential execution of the above steps.
Thanks a lot.
You can select all the required columns in one go:
import pyspark.sql.functions as F
df2 = df.select(
    *[c for c in df.columns if c not in col_list],
    *[F.regexp_replace(F.regexp_replace(c, ".", " "), "_[A-Z]_", "").alias(c) for c in df.columns if c in col_list]
)
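One thing worth checking in both snippets is the pattern ".": in a regular expression an unescaped dot matches any character, not just a literal period. The sketch below uses Python's re module as a stand-in for Spark's regexp_replace (an assumption for illustration; Spark uses Java regular expressions, where the dot has the same meaning). The `clean` helper and the sample string are hypothetical.

```python
import re


def clean(text):
    """Apply the two replacements from the question, with the dot escaped."""
    text = re.sub(r"\.", " ", text)      # literal dot -> space
    text = re.sub(r"_[A-Z]_", "", text)  # drop markers such as _C_
    return text


sample = "a.b_C_d"
unescaped = re.sub(".", " ", sample)  # every character is replaced by a space
escaped = clean(sample)
print(repr(unescaped))
print(repr(escaped))
```

If literal periods are what should be replaced, the Spark pattern would presumably need the same escaping, for example "\\." in the regexp_replace call.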

Different outcome from seemingly equivalent implementation of PySpark transformations

I have a set of Spark dataframe transforms that gives an out-of-memory error and has a messed-up SQL query plan, while a different implementation runs successfully.
import pandas as pd
diction = {
'key': [1,2,3,4,5,6],
'f1' : [1,0,1,0,1,0],
'f2' : [0,1,0,1,0,1],
'f3' : [1,0,1,0,1,0],
'f4' : [0,1,0,1,0,1]}
bil = pd.DataFrame(diction)
# successful logic
df = spark.createDataFrame(bil)
df = df.cache()
zdf = df
for i in [1,2,3]:
    tempdf = zdf.select(['key'])
    df = df.join(tempdf,on=['key'],how='left')
# failed logic
df = spark.createDataFrame(bil)
df = df.cache()
for i in [1,2,3]:
    tempdf = df.select(['key'])
    df = df.join(tempdf,on=['key'],how='left')
Logically, there should not be such a computational difference (more than double the time and memory used).
Can anyone help me understand this?
DAG of successful logic:
DAG of failure logic:
I'm not sure what your use case is for this code, however the two pieces of code are not logically the same. In the second version you are joining the result of the previous iteration to itself three times. In the first version you are joining a 'copy' of the original df three times. If your key column is not unique, the second piece of code will 'explode' your dataframe more than the first.
To make this clearer, here is a simple example with a non-unique key value. Taking your second example:
df = spark.createDataFrame([(1,'a'), (1,'b'), (3,'c')], ['key','val'])
for i in [1,2,3]:
    tempdf = df.select(['key'])
    df = df.join(tempdf,on=['key'],how='left')
df.count()
>>> 257
And your first piece of code:
df = spark.createDataFrame([(1,'a'), (1,'b'), (3,'c')], ['key','val'])
zdf = df
for i in [1,2,3]:
    tempdf = zdf.select(['key'])
    df = df.join(tempdf,on=['key'],how='left')
df.count()
>>> 17
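The two row counts can be reproduced without a cluster. The sketch below mirrors both loops in pandas, on the assumption that a pandas left merge duplicates rows on a non-unique key the same way a Spark left join does. The function names are made up for this illustration.

```python
import pandas as pd


def join_to_self(df, rounds=3):
    """Second version: each round joins the growing result to its own keys."""
    for _ in range(rounds):
        df = df.merge(df[["key"]], on="key", how="left")
    return df


def join_to_original(df, rounds=3):
    """First version: each round joins the result to the keys of the original frame."""
    original_keys = df[["key"]]
    for _ in range(rounds):
        df = df.merge(original_keys, on="key", how="left")
    return df


base = pd.DataFrame({"key": [1, 1, 3], "val": ["a", "b", "c"]})
rows_self = len(join_to_self(base))          # key 1 grows 2 -> 4 -> 16 -> 256, plus 1 row for key 3
rows_original = len(join_to_original(base))  # key 1 grows 2 -> 4 -> 8 -> 16, plus 1 row for key 3
print(rows_self, rows_original)
```

Joining to itself squares the number of rows per duplicated key on every round, while joining to the original only doubles it, which is consistent with the much larger plan and memory use of the failing version.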

Filter dataframe based on groupby sum()

I want to filter my dataframe based on a groupby sum(). I am looking for the rows where the amounts for a specific date sum to zero.
I have solved this with a for loop, but I suspect it will perform poorly if the dataframe is large.
It also seems clunky.
newdf = pd.DataFrame()
newdf['name'] = ('leon','eurika','monica','wian')
newdf['surname'] = ('swart','swart','swart','swart')
newdf['birthdate'] = ('14051981','198001','20081012','20100621')
newdf['tdate'] = ('13/05/2015','14/05/2015','15/05/2015', '13/05/2015')
newdf['tamount'] = (100.10, 111.11, 123.45, -100.10)
df = newdf.groupby(['tdate'])[['tamount']].sum().reset_index()
df2 = df.loc[df["tamount"] == 0, "tdate"]
df3 = pd.DataFrame()
for i in df2:
    df3 = pd.concat([df3, newdf.loc[newdf["tdate"] == i]])
print(df3)
The code above outputs the two rows whose tamount values sum to zero:
   name surname   birthdate       tdate  tamount
0  leon   swart  1981-05-14  13/05/2015    100.1
3  wian   swart  2010-06-21  13/05/2015   -100.1
Just use basic numpy :)
import numpy as np
df = newdf.groupby(['tdate'])[['tamount']].sum().reset_index()
dates = df['tdate'][np.where(df['tamount'] == 0)[0]]
newdf[np.isin(newdf['tdate'], dates)]
Hope this helps; let me know if you have any questions.
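Another option that avoids both the loop and the intermediate date list is groupby with transform, which returns the group total aligned to every original row. This is a minimal sketch rather than the answer's method; it uses a trimmed copy of the sample data and np.isclose as a guard against floating-point sums that are only approximately zero.

```python
import numpy as np
import pandas as pd

newdf = pd.DataFrame({
    "name": ["leon", "eurika", "monica", "wian"],
    "tdate": ["13/05/2015", "14/05/2015", "15/05/2015", "13/05/2015"],
    "tamount": [100.10, 111.11, 123.45, -100.10],
})

# Sum of tamount over all rows sharing the same tdate, repeated on each row.
group_total = newdf.groupby("tdate")["tamount"].transform("sum")

# Keep the rows whose date total is (numerically) zero.
zero_sum_rows = newdf[np.isclose(group_total, 0.0)]
print(zero_sum_rows)
```

The original index is preserved, so the result keeps rows 0 and 3 as in the expected output.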

HiveQL to PySpark - issue with aggregated column in SELECT statement

I have the following HQL script, which needs to be ported to PySpark (Spark 1.6):
insert into table db.temp_avg
select a, avg(b), c
from db.temp WHERE flag is not null GROUP BY a, c;
I created a few versions of the Spark code, but I'm struggling to get this averaged column into the select.
I also found out that grouped data cannot be written this way:
df3 = df2.groupBy...
Part of the PySpark code:
temp_table = sqlContext.table("db.temp")
df = temp_table.select('a', 'avg(b)', 'c', 'flag').toDF('a', 'avg(b)', 'c', 'flag')
df = df.where(df['flag'] != 'null')
# this of course does not work along with the avg(b)
df2 = df.groupBy('a', 'c')
Thanks for your help.
Correct solution:
import pyspark.sql.functions as F
df = sqlContext.sql("SELECT * FROM db.temp WHERE flag IS NOT NULL")
df = df.select('a', 'b', 'c')\
    .groupby('a', 'c')\
    .agg(F.avg('b'))
Then you can save the table by

Spark: Merge 2 dataframes by adding row index/number on both dataframes

Q: Is there any way to merge two dataframes, or to copy a column of one dataframe to another, in PySpark?
For example, I have two Dataframes:
C1 C2
23397414 20875.7353
5213970 20497.5582
41323308 20935.7956
123276113 18884.0477
76456078 18389.9269
The second dataframe:
C3 C4
2008-02-04 262.00
2008-02-05 257.25
2008-02-06 262.75
2008-02-07 237.00
2008-02-08 231.00
Then I want to add C3 of DF2 to DF1 like this:
New DF
C1 C2 C3
23397414 20875.7353 2008-02-04
5213970 20497.5582 2008-02-05
41323308 20935.7956 2008-02-06
123276113 18884.0477 2008-02-07
76456078 18389.9269 2008-02-08
I hope this example was clear.
Either rownum + a window function (solution 1) or zipWithIndex (solution 2) should help in this case.
Solution 1: You can use window functions for this.
I would suggest adding a row number as an additional column to each dataframe, say df1:
C1 C2 columnindex
23397414 20875.7353 1
5213970 20497.5582 2
41323308 20935.7956 3
123276113 18884.0477 4
76456078 18389.9269 5
The second dataframe:
C3 C4 columnindex
2008-02-04 262.00 1
2008-02-05 257.25 2
2008-02-06 262.75 3
2008-02-07 237.00 4
2008-02-08 231.00 5
Now do an inner join of df1 and df2, that's all. You will get the output below.
Something like this:
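from pyspark.sql.window import Window
from pyspark.sql.functions import lit, row_number
# rowNumber() was replaced by row_number() (Spark 1.6+).
# Ordering by a constant satisfies the window's ordering requirement,
# but it does not by itself guarantee a stable row order.
w = Window.orderBy(lit(1))
df1 = ...  # as shown above, df1
df2 = ...  # as shown above, df2
df11 = df1.withColumn("columnindex", row_number().over(w))
df22 = df2.withColumn("columnindex", row_number().over(w))
newDF = df11.join(df22, df11.columnindex == df22.columnindex, 'inner').drop(df22.columnindex)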
New DF
C1 C2 C3
23397414 20875.7353 2008-02-04
5213970 20497.5582 2008-02-05
41323308 20935.7956 2008-02-06
123276113 18884.0477 2008-02-07
76456078 18389.9269 2008-02-08
Solution 2: Another good way (probably the best one :)) in Scala, which you can translate to PySpark:
/**
 * Add Column Index to dataframe
 */
def addColumnIndex(df: DataFrame) = sqlContext.createDataFrame(
  // Add Column index
  df.rdd.zipWithIndex.map{case (row, columnindex) => Row.fromSeq(row.toSeq :+ columnindex)},
  // Create schema
  StructType(df.schema.fields :+ StructField("columnindex", LongType, false))
)
// Add index now...
val df1WithIndex = addColumnIndex(df1)
val df2WithIndex = addColumnIndex(df2)
// Now time to join ...
val newone = df1WithIndex
  .join(df2WithIndex , Seq("columnindex"))
I thought I would share the Python (PySpark) translation of answer #2 above from @Ram Ghadiyaram:
# Python 2 syntax (tuple-unpacking lambda, xrange, built-in reduce)
from pyspark.sql.functions import col
def addColumnIndex(df):
    # Create new column names
    oldColumns = df.schema.names
    newColumns = oldColumns + ["columnindex"]
    # Add Column index
    df_indexed = df.rdd.zipWithIndex().map(lambda (row, columnindex): \
        row + (columnindex,)).toDF()
    # Rename all the columns
    new_df = reduce(lambda data, idx: data.withColumnRenamed(oldColumns[idx],
        newColumns[idx]), xrange(len(oldColumns)), df_indexed)
    return new_df
# Add index now...
df1WithIndex = addColumnIndex(df1)
df2WithIndex = addColumnIndex(df2)
# Now time to join ...
newone = df1WithIndex.join(df2WithIndex, col("columnindex"),
                           'inner')
For a Python 3 version:
from pyspark.sql.types import StructType, StructField, LongType
def with_column_index(sdf):
    new_schema = StructType(sdf.schema.fields + [StructField("ColumnIndex", LongType(), False),])
    return sdf.rdd.zipWithIndex().map(lambda row: row[0] + (row[1],)).toDF(schema=new_schema)
df1_ci = with_column_index(df1)
df2_ci = with_column_index(df2)
join_on_index = df1_ci.join(df2_ci, df1_ci.ColumnIndex == df2_ci.ColumnIndex, 'inner').drop("ColumnIndex")
I referred to @Jed's answer:
# Python 2 syntax (tuple-unpacking lambda, xrange, built-in reduce)
from pyspark.sql.functions import col
def addColumnIndex(df):
    # Get old column names and add a column "columnindex"
    oldColumns = df.columns
    newColumns = oldColumns + ["columnindex"]
    # Add Column index
    df_indexed = df.rdd.zipWithIndex().map(lambda (row, columnindex): \
        row + (columnindex,)).toDF()
    # Rename all the columns, starting from the names toDF() generated
    oldColumns = df_indexed.columns
    new_df = reduce(lambda data, idx: data.withColumnRenamed(oldColumns[idx],
        newColumns[idx]), xrange(len(oldColumns)), df_indexed)
    return new_df
# Add index now...
df1WithIndex = addColumnIndex(df1)
df2WithIndex = addColumnIndex(df2)
# Now time to join ...
newone = df1WithIndex.join(df2WithIndex, col("columnindex"),
                           'inner')
This answer solved it for me:
import pyspark.sql.functions as sparkf
# This will return a new DF with all the columns + id
res = df.withColumn('id', sparkf.monotonically_increasing_id())
Credit to Arkadi T
Here is a simple example that can help, even if you have already solved the issue.
//create First Dataframe
val df1 = spark.sparkContext.parallelize(Seq(1,2,1)).toDF("lavel1")
//create second Dataframe
val df2 = spark.sparkContext.parallelize(Seq((1.0, 12.1), (12.1, 1.3), (1.1, 0.3))).toDF("f1", "f2")
//Combine both dataframe
val combinedRow = df1.rdd.zip(df2.rdd).map({
  //convert both dataframe to Seq and join them and return as a row
  case (df1Data, df2Data) => Row.fromSeq(df1Data.toSeq ++ df2Data.toSeq)
})
// create new Schema from both the dataframe's schema
val combinedschema = StructType(df1.schema.fields ++ df2.schema.fields)
// Create a new dataframe from new row and new schema
val finalDF = spark.sqlContext.createDataFrame(combinedRow, combinedschema)
Expanding on Jed's answer, in response to Ajinkya's comment:
To get the same old column names, you need to replace "old_cols" with a column list of the newly named indexed columns. See my modified version of the function below
def add_column_index(df):
    new_cols = df.schema.names + ['ix']
    ix_df = df.rdd.zipWithIndex().map(lambda (row, ix): row + (ix,)).toDF()
    tmp_cols = ix_df.schema.names
    return reduce(lambda data, idx: data.withColumnRenamed(tmp_cols[idx], new_cols[idx]), xrange(len(tmp_cols)), ix_df)
Not the best way performance-wise.
To merge columns from two different dataframes, you first have to create an index column and then join the two dataframes on it. Two dataframes are similar to two SQL tables: to connect them you have to join them.
If you don't care about the final order of the rows, you can generate the index column with monotonically_increasing_id().
Using the following code you can check that monotonically_increasing_id generates the same index column in both dataframes (at least up to a billion rows, and as long as both dataframes are partitioned the same way), so you won't have any errors in the merged dataframe.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
spark = SparkSession.builder.getOrCreate()
sample_size = int(1E9)
sdf1 = spark.range(1, sample_size).select(F.col("id").alias("id1"))
sdf2 = spark.range(1, sample_size).select(F.col("id").alias("id2"))
sdf1 = sdf1.withColumn("idx", F.monotonically_increasing_id())
sdf2 = sdf2.withColumn("idx", F.monotonically_increasing_id())
sdf3 = sdf1.join(sdf2, 'idx', 'inner')
sdf3 = sdf3.withColumn("diff", F.col("id1")-F.col("id2")).select("diff")
sdf3.filter(F.col("diff") != 0 ).show()
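Whether the two index columns line up depends on how each dataframe is partitioned. The Spark documentation describes the current implementation of monotonically_increasing_id as putting the partition ID in the upper 31 bits and the record number within the partition in the lower 33 bits. The sketch below is a pure-Python simulation of that layout, not Spark code; the `monotonic_ids` helper and the partition sizes are hypothetical.

```python
def monotonic_ids(partition_sizes):
    """Simulate the IDs assigned to rows, given the row count of each partition."""
    ids = []
    for partition_id, size in enumerate(partition_sizes):
        for record_number in range(size):
            # Upper bits: partition ID; lower 33 bits: record number in the partition.
            ids.append((partition_id << 33) + record_number)
    return ids


same_layout_a = monotonic_ids([3, 2])  # five rows split 3 + 2
same_layout_b = monotonic_ids([3, 2])  # same split: identical IDs
other_layout = monotonic_ids([2, 3])   # five rows split 2 + 3: different IDs
matching = set(same_layout_a) & set(other_layout)
print(same_layout_a)
print(other_layout)
print(len(matching))
```

With identical partitioning the IDs agree row for row. With a different split, an inner join on the IDs would lose a row and pair some of the remaining rows incorrectly, which is why the check above only holds when both dataframes are built the same way.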
You can use a combination of monotonically_increasing_id (guaranteed to always be increasing) and row_number (guaranteed to always give the same sequence). You cannot use row_number alone because it needs to be ordered by something, so here we order by monotonically_increasing_id. I am using Spark 2.3.1 and Python 2.7.13.
from pandas import DataFrame
from pyspark.sql.functions import (
    monotonically_increasing_id,
    row_number)
from pyspark.sql import Window
DF1 = spark.createDataFrame(DataFrame({
    'C1': [23397414, 5213970, 41323308, 123276113, 76456078],
    'C2': [20875.7353, 20497.5582, 20935.7956, 18884.0477, 18389.9269]}))
DF2 = spark.createDataFrame(DataFrame({
    'C3':['2008-02-04', '2008-02-05', '2008-02-06', '2008-02-07', '2008-02-08']}))
DF1_idx = (
    DF1
    .withColumn('id', monotonically_increasing_id())
    .withColumn('columnindex', row_number().over(Window.orderBy('id')))
    .select('columnindex', 'C1', 'C2'))
DF2_idx = (
    DF2
    .withColumn('id', monotonically_increasing_id())
    .withColumn('columnindex', row_number().over(Window.orderBy('id')))
    .select('columnindex', 'C3'))
DF_complete = (
    DF1_idx
    .join(DF2_idx, 'columnindex', 'inner')
    .select('C1', 'C2', 'C3'))
DF_complete.show()
+---------+----------+----------+
|       C1|        C2|        C3|
+---------+----------+----------+
| 23397414|20875.7353|2008-02-04|
|  5213970|20497.5582|2008-02-05|
| 41323308|20935.7956|2008-02-06|
|123276113|18884.0477|2008-02-07|
| 76456078|18389.9269|2008-02-08|
+---------+----------+----------+
