I have two dataframes:
df1 = spark.createDataFrame(
    data=[["a,b,c", "john"], ["d,e", "mark"], ["f", "aby"], ["g,i,j", "mary"]],
    schema=["keys", "name"])
+-----+----+
| keys|name|
+-----+----+
|a,b,c|john|
| d,e|mark|
| f| aby|
|g,i,j|mary|
+-----+----+
df2 = spark.createDataFrame(
    data=[["b", 18], ["c", 25], ["d", 55], ["i", 90], ["j", 88]],
    schema=["key", "age"])
+---+---+
|key|age|
+---+---+
| b| 18|
| c| 25|
| d| 55|
| i| 90|
| j| 88|
+---+---+
I would like to join them on the individual keys in df1.keys and df2.key, taking only the first hit, so that df3 would look like this:
+-----+----+----+
| keys|name| age|
+-----+----+----+
|a,b,c|john| 18|
| d,e|mark| 55|
| f| aby|NULL|
|g,i,j|mary| 90|
+-----+----+----+
I tried to create a new array column containing the keys and match on that, but it joins all matching rows instead of just the one with the first key occurrence.
from pyspark.sql import functions as f

df1 = df1.withColumn('keys_array', f.split('keys', ','))
joined_df = df1.join(df2, f.expr("array_contains(keys_array, key)"), 'left_outer')
joined_df.show()
+-----+----+----------+----+----+
| keys|name|keys_array| key| age|
+-----+----+----------+----+----+
|a,b,c|john| [a, b, c]| b| 18|
|a,b,c|john| [a, b, c]| c| 25|
| d,e|mark| [d, e]| d| 55|
| f| aby| [f]|null|null|
|g,i,j|mary| [g, i, j]| i| 90|
|g,i,j|mary| [g, i, j]| j| 88|
+-----+----+----------+----+----+
How do I make sure only the first occurrence of a key in keys_array is matched?
From where you stopped, I'd simply add a row_number for each keys value, ordered by the position of the key in the array:
from pyspark.sql import functions as F, Window as W

last_df = (
    joined_df
    # position of the matched key within keys_array
    .withColumn("rnk", F.expr("array_position(keys_array, key)"))
    # rank the matches per keys value by that position, then keep only the first
    .withColumn("rnk", F.row_number().over(W.partitionBy("keys").orderBy("rnk")))
    .where(F.col("rnk") == 1)
)
last_df.show()
+-----+----+----------+----+----+---+
| keys|name|keys_array| key| age|rnk|
+-----+----+----------+----+----+---+
|a,b,c|john| [a, b, c]| b| 18| 1|
| d,e|mark| [d, e]| d| 55| 1|
| f| aby| [f]|null|null| 1|
|g,i,j|mary| [g, i, j]| i| 90| 1|
+-----+----+----------+----+----+---+
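If you only want the columns from the desired df3, a final select dropping the helper columns gets you there:
df3 = last_df.select("keys", "name", "age")
df3.show()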
Related question:
I have a dataset with 3 columns (T, S, and A). I need to filter the records in such a way that the T and S columns have a one-to-one match.
For example, if T1 is matched with S1, then the T2 row with the S1 value should be filtered out.
I can achieve this using two window functions, but the second window function causes a lot of shuffling in the cluster (the first window's shuffling I can control with df.sort/repartition).
from pyspark.sql import functions as f, Window as w

l = [('T1', 'S1', 10), ('T2', 'S1', 10), ('T1', 'S2', 10), ('T2', 'S2', 10)]
df = spark.createDataFrame(l).toDF('T', 'S', 'A')
df.show()
+---+---+---+
| T| S| A|
+---+---+---+
| T1| S1| 10|
| T2| S1| 10|
| T1| S2| 10|
| T2| S2| 10|
+---+---+---+
w1 = w.partitionBy('T').orderBy('A')
w2 = w.partitionBy('S').orderBy('A','T')
df.withColumn('r1', f.row_number().over(w1)).withColumn('r2',f.row_number().over(w2)).show()
It gives the result below, so I can filter records where r1 == r2 and get the expected output.
+---+---+---+---+---+
| T| S| A| r1| r2|
+---+---+---+---+---+
| T1| S2| 10| 2| 1|
| T2| S2| 10| 2| 2|
| T1| S1| 10| 1| 1|
| T2| S1| 10| 1| 2|
+---+---+---+---+---+
Expected result:
+---+---+---+---+---+
| T| S| A| r1| r2|
+---+---+---+---+---+
| T2| S2| 10| 2| 2|
| T1| S1| 10| 1| 1|
+---+---+---+---+---+
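For completeness, the r1 == r2 filter described above would look like this (reusing w1 and w2 from the code); it produces the expected result, though it still requires both windows:
result = (df
          .withColumn('r1', f.row_number().over(w1))
          .withColumn('r2', f.row_number().over(w2))
          # keep only rows where the two rankings agree, i.e. one-to-one T/S matches
          .where(f.col('r1') == f.col('r2')))
result.show()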
I have student marks like below, and I want to pivot the SubjectName column and also get the total marks after the pivot.
Source table:
+---------+-----------+-----+
|StudentId|SubjectName|Marks|
+---------+-----------+-----+
| 1| A| 10|
| 1| B| 20|
| 1| C| 30|
| 2| A| 20|
| 2| B| 25|
| 2| C| 30|
| 3| A| 10|
| 3| B| 20|
| 3| C| 20|
+---------+-----------+-----+
Destination:
+---------+---+---+---+-----+
|StudentId| A| B| C|Total|
+---------+---+---+---+-----+
| 1| 10| 20| 30| 60|
| 3| 10| 20| 20| 50|
| 2| 20| 25| 30| 75|
+---------+---+---+---+-----+
Here is the source code:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("test").master("local[*]").getOrCreate()
import spark.implicits._

val list = List((1, "A", 10), (1, "B", 20), (1, "C", 30), (2, "A", 20), (2, "B", 25), (2, "C", 30),
  (3, "A", 10), (3, "B", 20), (3, "C", 20))
val df = list.toDF("StudentId", "SubjectName", "Marks")
df.show() // source table as per above
val df1 = df.groupBy("StudentId").pivot("SubjectName", Seq("A", "B", "C")).agg(sum("Marks"))
df1.show()
val df2 = df1.withColumn("Total", col("A") + col("B") + col("C"))
df2.show() // required destination
val df3 = df.groupBy("StudentId").agg(sum("Marks").as("Total"))
df3.show()
df1 is not displaying the sum/Total column; it displays the following:
+---------+---+---+---+
|StudentId| A| B| C|
+---------+---+---+---+
| 1| 10| 20| 30|
| 3| 10| 20| 20|
| 2| 20| 25| 30|
+---------+---+---+---+
df3 is able to create the new Total column, so why is df1 not able to create one?
Can anybody help me understand what I am missing, or what is wrong with my understanding of how pivot works?
This is expected behaviour of the Spark pivot function: the .agg function is applied to each pivoted column, which is why you do not see the sum of marks as a separate new column.
Refer to the official documentation about pivot.
Example:
scala> df.groupBy("StudentId").pivot("SubjectName").agg(sum("Marks") + 2).show()
+---------+---+---+---+
|StudentId| A| B| C|
+---------+---+---+---+
| 1| 12| 22| 32|
| 3| 12| 22| 22|
| 2| 22| 27| 32|
+---------+---+---+---+
In the above example we have added 2 to all the pivoted columns.
Example 2:
To get counts using pivot and agg:
scala> df.groupBy("StudentId").pivot("SubjectName").agg(count("*")).show()
+---------+---+---+---+
|StudentId| A| B| C|
+---------+---+---+---+
| 1| 1| 1| 1|
| 3| 1| 1| 1|
| 2| 1| 1| 1|
+---------+---+---+---+
The .agg that follows pivot applies only to the pivoted data. To get the total, you should add a new column that sums the pivoted columns, as below.
val cols = Seq("A", "B", "C")
val result = df.groupBy("StudentId")
.pivot("SubjectName")
.agg(sum("Marks"))
.withColumn("Total", cols.map(col _).reduce(_ + _))
result.show(false)
Output:
+---------+---+---+---+-----+
|StudentId|A |B |C |Total|
+---------+---+---+---+-----+
|1 |10 |20 |30 |60 |
|3 |10 |20 |20 |50 |
|2 |20 |25 |30 |75 |
+---------+---+---+---+-----+
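For PySpark users, the same approach would look roughly like this (a sketch, assuming an equivalent source DataFrame df with StudentId, SubjectName, and Marks columns):
from functools import reduce
from pyspark.sql import functions as F

cols = ["A", "B", "C"]
result = (df.groupBy("StudentId")
            .pivot("SubjectName", cols)
            .agg(F.sum("Marks"))
            # sum the pivoted columns into a Total column
            .withColumn("Total", reduce(lambda a, b: a + b, [F.col(c) for c in cols])))
result.show()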
I have a large dataset with a considerably large number of columns (150). I want to apply a function (UDF) to all the columns except the first one, which holds the id field. I was able to apply the function dynamically, but now I need the id field back in the final DataFrame. The Spark job will be running in cluster mode. Here is what I tried:
val df = sc.parallelize(
Seq(("id1", "B", "c","d"), ("id2", "e", "d","k"),("id3", "e", "m","n"))).toDF("id", "dat1", "dat2","dat3")
df.show
+---+----+----+----+
| id|dat1|dat2|dat3|
+---+----+----+----+
|id1| B| c| d|
|id2| e| d| k|
|id3| e| m| n|
+---+----+----+----+
df.select(df.columns.slice(1,df.columns.size).map(c => upper(col(c)).alias(c)): _*).show
+----+----+----+
|dat1|dat2|dat3|
+----+----+----+
| B| C| D|
| E| D| K|
| E| M| N|
+----+----+----+
Expected output:
+---+----+----+----+
| id|dat1|dat2|dat3|
+---+----+----+----+
|id1|   B|   C|   D|
|id2|   E|   D|   K|
|id3|   E|   M|   N|
+---+----+----+----+
Simply prepend the id column to the other (transformed) columns:
df.select(
  // prepend the untouched id column to the transformed columns
  col("id") +: df.columns.tail.map(c => upper(col(c)).alias(c)): _*
).show
+---+----+----+----+
| id|dat1|dat2|dat3|
+---+----+----+----+
|id1| B| C| D|
|id2| E| D| K|
|id3| E| M| N|
+---+----+----+----+
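If the transformation is a real UDF rather than the built-in upper, the same select pattern applies. A minimal PySpark sketch, where my_udf is a hypothetical placeholder for your function and the DataFrame is assumed to have the same layout as above:
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# Hypothetical UDF standing in for your actual column transformation
my_udf = F.udf(lambda v: v.upper() if v is not None else None, StringType())

df.select(
    F.col("id"),
    *[my_udf(F.col(c)).alias(c) for c in df.columns[1:]]  # skip the id column
).show()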
I have two Spark DataFrames loaded from CSV, of the form:
mapping_fields (the df with mapped names):
new_name old_name
A aa
B bb
C cc
and
aa bb cc dd
1 2 3 43
12 21 4 37
to be transformed into:
A B C D
1 2 3
12 21 4
As dd didn't have any mapping in the original table, the D column should contain all null values.
How can I do this without converting mapping_fields into a dictionary and checking each column name individually? (That would mean I have to collect mapping_fields and check it on the driver, which somewhat contradicts my use case of handling all the datasets distributedly.)
Thanks!
With melt borrowed from here (a sketch of a typical implementation is included at the end of this answer) you could:
from pyspark.sql import functions as f

mapping_fields = spark.createDataFrame(
    [("A", "aa"), ("B", "bb"), ("C", "cc")],
    ("new_name", "old_name"))

df = spark.createDataFrame(
    [(1, 2, 3, 43), (12, 21, 4, 37)],
    ("aa", "bb", "cc", "dd"))

(melt(df.withColumn("id", f.monotonically_increasing_id()),
      id_vars=["id"], value_vars=df.columns, var_name="old_name")
    .join(mapping_fields, ["old_name"], "left_outer")
    # null out values for columns that have no mapping
    .withColumn("value", f.when(f.col("new_name").isNotNull(), f.col("value")))
    # unmapped columns keep their (upper-cased) old name
    .withColumn("new_name", f.coalesce("new_name", f.upper(f.col("old_name"))))
    .groupBy("id")
    .pivot("new_name")
    .agg(f.first("value"))
    .drop("id")
    .show())
+---+---+---+----+
| A| B| C| DD|
+---+---+---+----+
| 1| 2| 3|null|
| 12| 21| 4|null|
+---+---+---+----+
but in your description nothing justifies this. Because the number of columns is fairly limited, I'd rather:
mapping = dict(
    mapping_fields
    .filter(f.col("old_name").isin(df.columns))
    .select("old_name", "new_name")
    .collect())

df.select([
    (f.lit(None).cast(t) if c not in mapping else f.col(c)).alias(mapping.get(c, c.upper()))
    for (c, t) in df.dtypes
]).show()
+---+---+---+----+
| A| B| C| DD|
+---+---+---+----+
| 1| 2| 3|null|
| 12| 21| 4|null|
+---+---+---+----+
At the end of the day you should use distributed processing when it provides performance or scalability improvements. Here it would do the opposite and make your code overly complicated.
To ignore non-matches:
(melt(df.withColumn("id", f.monotonically_increasing_id()),
      id_vars=["id"], value_vars=df.columns, var_name="old_name")
    .join(mapping_fields, ["old_name"])
    .groupBy("id")
    .pivot("new_name")
    .agg(f.first("value"))
    .drop("id")
    .show())
or
df.select([
    f.col(c).alias(mapping.get(c))
    for (c, t) in df.dtypes if c in mapping])
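For reference, melt is not part of the DataFrame API before Spark 3.4; the helper used above is commonly written along these lines (a sketch of the usual explode-based approach):
from pyspark.sql.functions import array, col, explode, lit, struct

def melt(df, id_vars, value_vars, var_name="variable", value_name="value"):
    # one struct per melted column: (column name, column value)
    vars_and_vals = array(*(
        struct(lit(c).alias(var_name), col(c).alias(value_name))
        for c in value_vars))
    tmp = df.withColumn("vars_and_vals", explode(vars_and_vals))
    cols = id_vars + [
        col("vars_and_vals")[x].alias(x) for x in [var_name, value_name]]
    return tmp.select(*cols)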
I tried with a simple for loop; hope this helps too.
from pyspark.sql import functions as F
l1 = [('A','aa'),('B','bb'),('C','cc')]
l2 = [(1,2,3,43),(12,21,4,37)]
df1 = spark.createDataFrame(l1,['new_name','old_name'])
df2 = spark.createDataFrame(l2,['aa','bb','cc','dd'])
>>> df1.show()
+--------+--------+
|new_name|old_name|
+--------+--------+
| A| aa|
| B| bb|
| C| cc|
+--------+--------+
>>> df2.show()
+---+---+---+---+
| aa| bb| cc| dd|
+---+---+---+---+
| 1| 2| 3| 43|
| 12| 21| 4| 37|
+---+---+---+---+
When you need the unmapped columns kept, with null values:
>>> cols = df2.columns
>>> for i in cols:
...     val = df1.where(df1['old_name'] == i).first()
...     if val is not None:
...         df2 = df2.withColumnRenamed(i, val['new_name'])
...     else:
...         df2 = df2.withColumn(i, F.lit(None))
>>> df2.show()
+---+---+---+----+
| A| B| C| dd|
+---+---+---+----+
| 1| 2| 3|null|
| 12| 21| 4|null|
+---+---+---+----+
When we need only the mapped columns, change the else part:
...     else:
...         df2 = df2.drop(i)
>>> df2.show()
+---+---+---+
| A| B| C|
+---+---+---+
| 1| 2| 3|
| 12| 21| 4|
+---+---+---+
Note that this overwrites the original df2 DataFrame, though.
I have a problem statement wherein I want to unpivot a table in Spark SQL / PySpark. I have gone through the documentation and can see there is support only for pivot, but no support for unpivot so far.
Is there a way I can achieve this?
Suppose my initial table has columns A, B, and C. When I pivot it in PySpark:
df.groupBy("A").pivot("B").sum("C")
I get a pivoted table with one column per distinct value of B.
Now I want to unpivot the pivoted table. In general, this operation may or may not yield the original table, depending on how the original table was pivoted.
Spark SQL as of now doesn't provide out-of-the-box support for unpivot. Is there a way I can achieve this?
You can use the built-in stack function, for example in Scala:
scala> val df = Seq(("G",Some(4),2,None),("H",None,4,Some(5))).toDF("A","X","Y", "Z")
df: org.apache.spark.sql.DataFrame = [A: string, X: int ... 2 more fields]
scala> df.show
+---+----+---+----+
| A| X| Y| Z|
+---+----+---+----+
| G| 4| 2|null|
| H|null| 4| 5|
+---+----+---+----+
scala> df.select($"A", expr("stack(3, 'X', X, 'Y', Y, 'Z', Z) as (B, C)")).where("C is not null").show
+---+---+---+
| A| B| C|
+---+---+---+
| G| X| 4|
| G| Y| 2|
| H| Y| 4|
| H| Z| 5|
+---+---+---+
Or in PySpark:
In [1]: df = spark.createDataFrame([("G",4,2,None),("H",None,4,5)],list("AXYZ"))
In [2]: df.show()
+---+----+---+----+
| A| X| Y| Z|
+---+----+---+----+
| G| 4| 2|null|
| H|null| 4| 5|
+---+----+---+----+
In [3]: df.selectExpr("A", "stack(3, 'X', X, 'Y', Y, 'Z', Z) as (B, C)").where("C is not null").show()
+---+---+---+
| A| B| C|
+---+---+---+
| G| X| 4|
| G| Y| 2|
| H| Y| 4|
| H| Z| 5|
+---+---+---+
Spark 3.4+
df = df.melt(['A'], ['X', 'Y', 'Z'], 'B', 'C')
# OR
df = df.unpivot(['A'], ['X', 'Y', 'Z'], 'B', 'C')
+---+---+----+
| A| B| C|
+---+---+----+
| G| Y| 2|
| G| Z|null|
| G| X| 4|
| H| Y| 4|
| H| Z| 5|
| H| X|null|
+---+---+----+
To filter out nulls: df = df.filter("C is not null")
Spark 3.3 and below
to_melt = {'X', 'Y', 'Z'}
new_names = ['B', 'C']
melt_str = ','.join([f"'{c}', `{c}`" for c in to_melt])
df = df.select(
    *(set(df.columns) - to_melt),
    F.expr(f"stack({len(to_melt)}, {melt_str}) ({','.join(new_names)})")
).filter(f"!{new_names[1]} is null")
Full test:
from pyspark.sql import functions as F
df = spark.createDataFrame([("G", 4, 2, None), ("H", None, 4, 5)], list("AXYZ"))
to_melt = {'X', 'Y', 'Z'}
new_names = ['B', 'C']
melt_str = ','.join([f"'{c}', `{c}`" for c in to_melt])
df = df.select(
    *(set(df.columns) - to_melt),
    F.expr(f"stack({len(to_melt)}, {melt_str}) ({','.join(new_names)})")
).filter(f"!{new_names[1]} is null")
df.show()
# +---+---+---+
# | A| B| C|
# +---+---+---+
# | G| Y| 2|
# | G| X| 4|
# | H| Y| 4|
# | H| Z| 5|
# +---+---+---+