Pyspark - Union two data frames with same columns based on same id - apache-spark

DF1
Id  Name   Desc   etc
A   Name1  desc1  etc1
B   name2  desc2  etc2

DF2
Id  Name   Desc   etc
A   Name2  desc2  etc2
C   name2  desc2  etc2
I want to union records from DF2 into DF1 where the ID is equal and include all records from DF1.
Result DF
Id  Name   Desc   etc
A   Name1  desc1  etc1
B   name2  desc2  etc2
A   name2  desc2  etc2
What is the best way to do it? Any help is appreciated.

You can do so with a semi join, keeping only the rows of df2 whose Id exists in df1, followed by a union with df1.
d1 = [
    ('A', 'Name1', 'desc1', 'etc1'),
    ('B', 'name2', 'desc2', 'etc2'),
]
d2 = [
    ('A', 'Name2', 'desc2', 'etc2'),
    ('C', 'name2', 'desc2', 'etc2'),
]
df1 = spark.createDataFrame(d1, ['Id', 'Name', 'Desc', 'etc'])
df2 = spark.createDataFrame(d2, ['Id', 'Name', 'Desc', 'etc'])

df2.join(df1, on='Id', how='semi').union(df1).show()
+---+-----+-----+----+
| Id| Name| Desc| etc|
+---+-----+-----+----+
|  A|Name2|desc2|etc2|
|  A|Name1|desc1|etc1|
|  B|name2|desc2|etc2|
+---+-----+-----+----+
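If you would rather not use the semi join, an equivalent filter can be written as an inner join against the distinct Ids of df1 (a minimal alternative sketch, assuming the same df1/df2 as above):

# Alternative sketch: keep only the df2 rows whose Id also appears in df1,
# then stack them under df1; distinct() keeps this equivalent to the semi join
# even if df1 were to contain duplicate Ids.
matched = df2.join(df1.select('Id').distinct(), on='Id', how='inner')
matched.union(df1).show()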

Related

Join Two Dataframes while keeping the same format

Hello, I have two Dataframes:
df1 with columns: first_name, last_name, id, location, phone_number.
df2 with columns: last_name, id, location, employer.
I am trying to create a new dataset that keeps only the columns of df1 and returns only the rows whose last_name and id are present in df2. So I decided on an inner join of the two tables. The issue is that the join appends the columns from df2 to the end of df1, so my resulting df is much larger than I need. I only care about the columns in df1.
My join was: df1.join(df2, df1.col("last_name").equalTo(df2.col("last_name")).and(df1.col("id").equalTo(df2.col("id"))), "inner");
The problem with this is I got a new table of: first_name, last_name, id, location, phone_number, employer, where id and last_name were ambiguous.
Is there any way to keep the same table format of df1 after the join? (Without dropping individual columns, because I am working with a large table of about 30 columns.)
You can use the join method of a Dataframe with the following function signature (from the API docs):
def join(right: Dataset[_], usingColumns: Seq[String]): DataFrame
This will only keep 1 column of the joining columns, removing your ambiguity problem.
After that, you can just select the columns of df1 dynamically, by using df1.columns. In total, it would look something like this:
import spark.implicits._

val df1 = Seq(
  ("joe", "shmoe", 1, "x", 123),
  ("jack", "johnson", 2, "y", 456)
).toDF("first_name", "last_name", "id", "location", "phone_number")

df1.show
+----------+---------+---+--------+------------+
|first_name|last_name| id|location|phone_number|
+----------+---------+---+--------+------------+
|       joe|    shmoe|  1|       x|         123|
|      jack|  johnson|  2|       y|         456|
+----------+---------+---+--------+------------+

val df2 = Seq(
  ("shmoe", 1, "x", "someCoolGuy"),
  ("otherName", 3, "z", "employer2")
).toDF("last_name", "id", "location", "employer")

df2.show
+---------+---+--------+-----------+
|last_name| id|location|   employer|
+---------+---+--------+-----------+
|    shmoe|  1|       x|someCoolGuy|
|otherName|  3|       z|  employer2|
+---------+---+--------+-----------+

val output = df1
  .join(df2.select("last_name", "id"), Seq("last_name", "id")) // only selecting interesting columns of df2 for the join
  .select(df1.columns.head, df1.columns.tail: _*)

output.show
+----------+---------+---+--------+------------+
|first_name|last_name| id|location|phone_number|
+----------+---------+---+--------+------------+
|       joe|    shmoe|  1|       x|         123|
+----------+---------+---+--------+------------+
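If you are working in PySpark rather than Scala, the same trick translates directly: join on a list of column names so the join columns are not duplicated, then select df1.columns. A rough sketch with the same toy data (assuming spark is an existing SparkSession):

df1 = spark.createDataFrame(
    [('joe', 'shmoe', 1, 'x', 123), ('jack', 'johnson', 2, 'y', 456)],
    ['first_name', 'last_name', 'id', 'location', 'phone_number'])
df2 = spark.createDataFrame(
    [('shmoe', 1, 'x', 'someCoolGuy'), ('otherName', 3, 'z', 'employer2')],
    ['last_name', 'id', 'location', 'employer'])

# Joining on a list of names keeps a single copy of last_name and id,
# and selecting df1.columns restores df1's original column set and order.
output = df1.join(df2.select('last_name', 'id'), on=['last_name', 'id'], how='inner')
output.select(df1.columns).show()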
Hope this helps!

Keep rows with one particular value in one column using some condition in another column in same dataframe

I have the below dataframe:
col1     col2  col3
Device1  A     true
Device1  A     false
Device1  C     false
Device1  B     false
I want to keep the first two rows (where col2 = A); A is identified because col3 has a 'true' in row 1. In other words, for Device1, I want to keep all rows whose col2 group has at least one 'true' in col3.
I expect the below result after the filter:
col1     col2  col3
Device1  A     true
Device1  A     false
You can do it with window functions: partitioning by col1 and col2, ordering by col3 descending. Use the first function over the window.
from pyspark.sql import functions as F, Window as W

df = spark.createDataFrame(
    [('Device1', 'A', True),
     ('Device1', 'A', False),
     ('Device1', 'C', None),
     ('Device1', 'B', False)],
    ['col1', 'col2', 'col3'])

w = W.partitionBy('col1', 'col2').orderBy(F.desc('col3'))
df = df.withColumn('keep', F.first('col3').over(w))
df = df.filter('keep').drop('keep')
df.show()
# +-------+----+-----+
# |   col1|col2| col3|
# +-------+----+-----+
# |Device1|   A| true|
# |Device1|   A|false|
# +-------+----+-----+
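A window-free alternative is to aggregate per (col1, col2) group, keep the groups that contain at least one true, and semi join back. A minimal sketch, assuming the same imports and the original df created above (before it is overwritten by the filter):

# max over a boolean column is true iff at least one row in the group is true.
groups_with_true = (df.groupBy('col1', 'col2')
                      .agg(F.max('col3').alias('has_true'))
                      .filter('has_true'))

# Keep only the original rows belonging to one of those groups.
df.join(groups_with_true.select('col1', 'col2'), on=['col1', 'col2'], how='semi').show()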

Combine two dataframes with separate keys for one dataframe so I can select two columns based on keys

I want a new column DATE1 equal to the column START in dataframe1 (DF1) on KEY1, combined with dataframe2 (DF2) based on KEY2 in DF2, so it shows DATE1 only when the key matches in the join. I can show the START column, but it shows all rows.
I want DATE2 equal to the column START in dataframe1 (DF1) on KEY1, but combined with DF2 based on a different key called KEY3 in DF2, so it shows DATE2 only when the key matches in the join. I can show the START column, but I am not sure how to only show it when combining on the two keys.
Example input for DF1 would be:
+---------+--------+------+------+
|START    |KEY1    |Color |OTHER |
+---------+--------+------+------+
| 10/05/21|       1| White|  3000|
| 10/06/21|       2|  Blue|  4100|
| 10/07/21|       3| Green|  6200|
+---------+--------+------+------+
DF2 input would be:
+---------+--------+------+
|KEY2     |KEY3    |NUMBER|
+---------+--------+------+
|        1|       2|  3000|
|        2|       3|  4100|
|        3|       1|  6200|
+---------+--------+------+
Output would be something like below:
+---------+--------+
|DATE1    |DATE2   |
+---------+--------+
| 10/05/21|10/06/21|
| 10/06/21|10/07/21|
| 10/07/21|10/05/21|
+---------+--------+
Below is my code:
def transform_df_data(df: DataFrame):
    return df \
        .withColumn("DATE1", col("START")) \
        .withColumn("DATE2", col("START")) \
        .withColumn("KEY1", col("KEY1")) \
        .select("KEY1", "DATE1", "DATE2")

def build_final_df(df: DataFrame, otherdf: DataFrame):
    df_transform = transform_df_data(d_period)
    return final_transform.join(df_transform, final_transform.KEY1 == df_transform(KEY2, 'inner').withColumn("DATE1", col("START")).select("DATE1", "DATE2")
Not sure I correctly understand the question, but I think you want to join df1 and df2 on KEY1 = KEY2, then join the result again with df1 on KEY1 = KEY3:
import pyspark.sql.functions as F

data1 = [("10/05/21", 1, "White", 3000), ("10/06/21", 2, "Blue", 4100), ("10/07/21", 3, "Green", 6200)]
df1 = spark.createDataFrame(data1, ["START", "KEY1", "Color", "OTHER"])

data2 = [(1, 2, 3000), (2, 3, 4100), (3, 1, 6200)]
df2 = spark.createDataFrame(data2, ["KEY2", "KEY3", "NUMBER"])

df_result = df1.withColumnRenamed("START", "DATE1").join(
    df2,
    F.col("KEY1") == F.col("KEY2")
).select("DATE1", "KEY3").join(
    df1.withColumnRenamed("START", "DATE2"),
    F.col("KEY1") == F.col("KEY3")
).select("DATE1", "DATE2")

df_result.show()
#+--------+--------+
#|   DATE1|   DATE2|
#+--------+--------+
#|10/07/21|10/05/21|
#|10/05/21|10/06/21|
#|10/06/21|10/07/21|
#+--------+--------+
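Another way to write the same pair of joins, if the renames feel awkward, is to alias the DataFrames so every reference to START and KEY1 is unambiguous even though df1 appears twice. A rough sketch, assuming the same df1/df2 and import as above:

# df1 is used twice: once as the source of DATE1 (matched on KEY2),
# once as the source of DATE2 (matched on KEY3).
a = df1.alias('a')
b = df2.alias('b')
c = df1.alias('c')

df_result = (a.join(b, F.col('a.KEY1') == F.col('b.KEY2'))
              .join(c, F.col('c.KEY1') == F.col('b.KEY3'))
              .select(F.col('a.START').alias('DATE1'), F.col('c.START').alias('DATE2')))
df_result.show()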

Spark SQL - data comparison

What would be the best approach to compare two CSV files (millions of rows) with the same schema and a primary key column, and print out the differences? For example:
CSV1
Id  name    zip
1   name1   07112
2   name2   07234
3   name3   10290

CSV2
Id  name    zip
1   name1   07112
2   name21  07234
4   name4   10290

Comparing the modified file CSV2 with the original data CSV1, the output should be:
Id  name    zip
2   name21  07234  Modified
3   name3   10290  Deleted
4   name4   10290  Added
I am new to Spark SQL, and I am thinking of importing the data into Hive tables and then running Spark SQL to identify the changes.
1) Is there any row-modified method available to identify whether a row has been modified, instead of comparing the values in each column?
2) Is there any better approach available, either using Spark or other HDFS tools?
Appreciate the feedback
Many approaches exist; here is one that can do the work in parallel:
import org.apache.spark.sql.functions._
import sqlContext.implicits._

val origDF = sc.parallelize(Seq(
  ("1", "a", "b"),
  ("2", "c", "d"),
  ("3", "e", "f")
)).toDF("k", "v1", "v2")

val newDF = sc.parallelize(Seq(
  ("1", "a", "b"),
  ("2", "c2", "d"),
  ("4", "g", "h")
)).toDF("k", "v1", "v2")

val df1 = origDF.except(newDF) // if its k does not exist in df2, the row was deleted
//df1.show(false)
val df2 = newDF.except(origDF) // if its k does not exist in df1, the row was added
//df2.show(false)

// if a row occurs in neither df1 nor df2, it is unchanged
// if k exists in both df1 and df2, the row in df2 is the modified version
df1.createOrReplaceTempView("df1")
df2.createOrReplaceTempView("df2")

val df3 = spark.sql("""SELECT df1.k, df1.v1, df1.v2, "deleted" as operation
                       FROM df1
                       WHERE NOT EXISTS (SELECT df2.k
                                         FROM df2
                                         WHERE df2.k = df1.k)
                       UNION
                       SELECT df2.k, df2.v1, df2.v2, "added" as operation
                       FROM df2
                       WHERE NOT EXISTS (SELECT df1.k
                                         FROM df1
                                         WHERE df1.k = df2.k)
                       UNION
                       SELECT df2.k, df2.v1, df2.v2, "modified" as operation
                       FROM df2
                       WHERE EXISTS (SELECT df1.k
                                     FROM df1
                                     WHERE df1.k = df2.k)
                    """)
df3.show(false)
returns:
+---+---+---+---------+
|k  |v1 |v2 |operation|
+---+---+---+---------+
|4  |g  |h  |added    |
|2  |c2 |d  |modified |
|3  |e  |f  |deleted  |
+---+---+---+---------+
Not so hard, no standard utility.
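If you would rather stay in the DataFrame API (and in PySpark), the same classification can be sketched with except plus anti/semi joins; this is an assumed translation of the logic above, shown on the same toy data rather than the real CSV files:

from pyspark.sql import functions as F

orig_df = spark.createDataFrame(
    [('1', 'a', 'b'), ('2', 'c', 'd'), ('3', 'e', 'f')], ['k', 'v1', 'v2'])
new_df = spark.createDataFrame(
    [('1', 'a', 'b'), ('2', 'c2', 'd'), ('4', 'g', 'h')], ['k', 'v1', 'v2'])

changed_old = orig_df.exceptAll(new_df)  # original rows with no exact match in the new file
changed_new = new_df.exceptAll(orig_df)  # new rows with no exact match in the original

# A changed original row whose key is absent from the changed new rows was deleted;
# a changed new row whose key is absent from the changed original rows was added;
# a changed new row whose key is present on both sides is the modified version.
deleted  = changed_old.join(changed_new, 'k', 'left_anti').withColumn('operation', F.lit('deleted'))
added    = changed_new.join(changed_old, 'k', 'left_anti').withColumn('operation', F.lit('added'))
modified = changed_new.join(changed_old, 'k', 'left_semi').withColumn('operation', F.lit('modified'))

deleted.union(added).union(modified).show()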

Iterate cols PySpark

I have a SQL table containing 40 columns (ID, Product, Product_ID, Date, etc.) and would like to iterate over all columns to get the distinct values.
Customer table (sample):
ID Product
1 gadget
2 VR
2 AR
3 hi-fi
I have tried using dropDuplicates within a function that loops over all columns, but the resulting output only spits out one distinct value per column instead of all possible distinct values.
Expected Result:
Column Value
ID 1
ID 2
ID 3
Product gadget
Product VR
Product AR
Product hi-fi
Actual Result:
Column Value
ID 1
Product gadget
The idea is to use collect_set() to fetch the distinct elements of each column and then explode the dataframe.
#All columns which need to be aggregated should be added here in col_list.
col_list = ['ID','Product']
exprs = [collect_set(x) for x in col_list]
Let's start aggregating.
from pyspark.sql.functions import lit, collect_set, explode, array, struct, col, substring, length, expr

df = spark.createDataFrame([(1, 'gadget'), (2, 'VR'), (2, 'AR'), (3, 'hi-fi')], schema=['ID', 'Product'])
df = df.withColumn('Dummy', lit('Dummy'))

# While exploding later, the datatypes must be the same, so we have to cast ID as a string.
df = df.withColumn('ID', col('ID').cast('string'))

# Creating the list of distinct values.
df = df.groupby("Dummy").agg(*exprs)
df.show(truncate=False)
+-----+---------------+-----------------------+
|Dummy|collect_set(ID)|collect_set(Product)   |
+-----+---------------+-----------------------+
|Dummy|[3, 1, 2]      |[AR, VR, hi-fi, gadget]|
+-----+---------------+-----------------------+
def to_transpose(df, by):
    # Filter dtypes and split into column names and type description
    cols, dtypes = zip(*((c, t) for (c, t) in df.dtypes if c not in by))
    # Spark SQL supports only homogeneous columns
    assert len(set(dtypes)) == 1, "All columns have to be of the same type"
    # Create and explode an array of (column_name, column_value) structs
    kvs = explode(array([
        struct(lit(c).alias("key"), col(c).alias("val")) for c in cols
    ])).alias("kvs")
    return df.select(by + [kvs]).select(by + ["kvs.key", "kvs.val"])
df = to_transpose(df, ['Dummy']).drop('Dummy')
df.show()
+--------------------+--------------------+
|                 key|                 val|
+--------------------+--------------------+
|     collect_set(ID)|           [3, 1, 2]|
|collect_set(Product)|[AR, VR, hi-fi, g...|
+--------------------+--------------------+
df = df.withColumn('val', explode(col('val')))
df = df.withColumnRenamed('key', 'Column').withColumnRenamed('val', 'Value')
df = df.withColumn('Column', expr("substring(Column,13,length(Column)-13)"))
df.show()
+-------+------+
| Column| Value|
+-------+------+
|     ID|     3|
|     ID|     1|
|     ID|     2|
|Product|    AR|
|Product|    VR|
|Product| hi-fi|
|Product|gadget|
+-------+------+
Note: all columns which are not strings should be cast to string, e.g. df = df.withColumn('ID', col('ID').cast('string')). Otherwise, you will get an error.
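A simpler, if less parallel, alternative is to take the distinct values of each column separately and union the per-column results. A minimal sketch, recreating the small df used above (since df has been overwritten by this point):

from functools import reduce
from pyspark.sql.functions import col, lit

df = spark.createDataFrame([(1, 'gadget'), (2, 'VR'), (2, 'AR'), (3, 'hi-fi')], ['ID', 'Product'])

# One small (Column, Value) DataFrame of distinct values per column,
# casting every value to string so the union is type-compatible.
per_column = [
    df.select(lit(c).alias('Column'), col(c).cast('string').alias('Value')).distinct()
    for c in df.columns
]
result = reduce(lambda left, right: left.union(right), per_column)
result.show()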
