Join two dataframes without repeating values in spark - apache-spark

I am creating a follow-up question to this previous question I asked. For context, the solution proposed by blackbishop works, but I am realising now that it has an unexpected side effect that I don't want (see the link to reproduce the problem and the solution).
Problem statement: given several dataframes, how can I join them while avoiding the cross-join repetition of the columns?
Proposed solution: join on id and row number.
When trying to reproduce the solution, though, I realised that I do not get the same results as proposed.
This is the table that I get:
+---+-----------------+---------------------+-------------+
|id |name |dob |country |
+---+-----------------+---------------------+-------------+
|1 |{value -> Robert}|{value -> 21-04-1988}|{value -> IT}|
|1 |{value -> bob} |null |{value -> DE}|
|2 |null |null |{value -> ES}|
|2 |{value -> Mary} |{value -> null} |{value -> FR}|
+---+-----------------+---------------------+-------------+
The problem there is that, for id = 2 for example, I get the null values first for name and dob. What I would like is a way to systematically have the first row of every id display non-null values for each column wherever possible (as shown by blackbishop):
#+---+-----------------+---------------------+-------------+
#|id |name |dob |country |
#+---+-----------------+---------------------+-------------+
#|1 |{value -> bob} |{value -> 21-04-1988}|{value -> IT}|
#|1 |{value -> Robert}|null |{value -> DE}|
#|2 |{value -> Mary} |{value -> null} |{value -> FR}|
#|2 |null |null |{value -> ES}|
#+---+-----------------+---------------------+-------------+
I tried changing the windowing to the following, but it doesn't change the result:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("id")  # change this if you have some column to use for ordering

name = name.withColumn("id2", F.row_number().over(w.orderBy("name.value")))
dob = dob.withColumn("id2", F.row_number().over(w.orderBy("dob.value")))
country = country.withColumn("id2", F.row_number().over(w.orderBy("country.value")))

result = (name.join(dob, ["id", "id2"], "full")
          .join(country, ["id", "id2"], "full")
          .drop("id2"))

Silly me! It was much simpler than I thought. I just needed to add an orderBy before dropping id2.
Here's the solution:
name = name.withColumn("id2", F.row_number().over(w.orderBy("name.value")))
dob = dob.withColumn("id2", F.row_number().over(w.orderBy("dob.value")))
country = country.withColumn("id2", F.row_number().over(w.orderBy("country.value")))
result = (name.join(dob, ["id", "id2"], "full")
          .join(country, ["id", "id2"], "full")
          .orderBy("id", "id2")
          .drop("id2"))
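Side note: the desired output above also has the nulls pushed to the last row numbers within each id. If that matters, one option (only a sketch against the same name/dob/country dataframes, not tested on the original data) is to make the window ordering explicit about null placement with asc_nulls_last:

name = name.withColumn(
    "id2", F.row_number().over(w.orderBy(F.col("name.value").asc_nulls_last()))
)
dob = dob.withColumn(
    "id2", F.row_number().over(w.orderBy(F.col("dob.value").asc_nulls_last()))
)
country = country.withColumn(
    "id2", F.row_number().over(w.orderBy(F.col("country.value").asc_nulls_last()))
)

Spark's default ascending sort places nulls first, so row_number 1 would otherwise land on the null rows.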

Related

Spark sql replace collect_list empty lists with null

I have the below data in a dataframe:
+----------+--------------+-------------------+---------------+
|id |mid |ppp |qq |
+----------+--------------+-------------------+---------------+
|A |4 |[{P}] |null |
|B |4 |[{P}] |null |
|A |4 |null |[{P}] |
|A |4 |null |[{Q}] |
|C |4 |null |[{Q}] |
|D |4 |null |[{Q}] |
|A |4 |null |[{R}] |
+----------+--------------+-------------------+---------------+
I have the below code:
String[] array = {"id", "mid", "ppp", "qq"};
List<String> columnNames = Arrays.asList(array);

Column[] columns = columnNames
        .stream()
        .filter(field -> !field.equals("id") && !field.equals("mid"))
        .map(column -> flatten(when(size(collect_list(column)).equalTo(0), null)
                .otherwise(collect_list(column)))
                .as(column))
        .collect(Collectors.toList())
        .toArray(new Column[0]);

Dataset<Row> output = df
        .groupBy(functions.col("id"), functions.col("mid"))
        .agg(columns[0], Arrays.copyOfRange(columns, 1, columns.length));
The above code groups by id and mid, and then collect_list collects the elements of ppp and qq into arrays in both columns.
Output:
+---+---+-----+---------------+
|id |mid|ppp  |qq             |
+---+---+-----+---------------+
|A  |4  |[[P]]|[[R], [P], [Q]]|
|B  |4  |null |[[Q]]          |
|C  |4  |[[P]]|null           |
|D  |4  |null |[[Q]]          |
+---+---+-----+---------------+
The code works exactly as required: if collect_list creates an empty list, I replace it with null.
Is there a way to avoid calling collect_list twice (once in when and once in otherwise) and still achieve the same result, i.e. replace empty lists produced by collect_list with null?
Of course you can do that: just call size on the collected array and set it to null if it is 0, something like:
df
  .groupBy()
  .agg(
    collect_list($"mycol").as("arr_mycol")
  )
  // set empty arrays to null
  .withColumn("arr_mycol", when(size($"arr_mycol") > 0, $"arr_mycol"))
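For anyone doing the same from PySpark, here is a rough equivalent sketch of the idea above (mycol is just the placeholder column name from the snippet):

from pyspark.sql import functions as F

result = (
    df.groupBy("id", "mid")
      .agg(F.collect_list("mycol").alias("arr_mycol"))
      # when() without otherwise() yields null when the collected array is empty
      .withColumn("arr_mycol", F.when(F.size("arr_mycol") > 0, F.col("arr_mycol")))
)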

PySpark: Check if value in col is like a key in a dict

I would like to take my dictionary, which contains keywords, and check a column in a PySpark df to see if any keyword exists in it; if so, return the corresponding value from the dictionary in a new column.
The problem looks like this:
myDict = {
    'price': 'Pricing Issue',
    'support': 'Support Issue',
    'android': 'Left for Competitor'
}

df = sc.parallelize([
    ('1', 'Needed better Support'),
    ('2', 'Better value from android'),
    ('3', 'Price was to expensive')
]).toDF(['id', 'reason'])
+-----+-------------------------+
| id |reason |
+-----+-------------------------+
|1 |Needed better support |
|2 |Better value from android|
|3 | Price was to expensive |
|4 | Support problems |
+-----+-------------------------+
The end result that I am looking for is this:
+-----+-------------------------+---------------------+
| id |reason |new_reason |
+-----+-------------------------+---------------------+
|1 |Needed better support | Support Issue |
|2 |Better value from android| Left for Competitor |
|3 |Price was to expensive | Pricing Issue |
|4 |Support issue | Support Issue |
+-----+-------------------------+---------------------+
What's the best way to build an efficient function to do this in pyspark?
You can use when expressions to check whether the column reason matches the dict keys. You can dynamically generate the when expressions using Python's functools.reduce function, passing it myDict.keys():
from functools import reduce
from pyspark.sql import functions as F
df2 = df.withColumn(
    "new_reason",
    reduce(
        lambda c, k: c.when(F.lower(F.col("reason")).rlike(rf"\b{k.lower()}\b"), myDict[k]),
        myDict.keys(),
        F
    )
)
df2.show(truncate=False)
#+---+-------------------------+-------------------+
#|id |reason |new_reason |
#+---+-------------------------+-------------------+
#|1 |Needed better Support |Support Issue |
#|2 |Better value from android|Left for Competitor|
#|3 |Price was to expensive |Pricing Issue |
#|4 |Support problems |Support Issue |
#+---+-------------------------+-------------------+
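Note that rows matching none of the keys end up with null in new_reason under this approach. If a default label is wanted instead, you can chain .otherwise onto the reduced expression; a small sketch against the same myDict and df, where the 'Other' label is only a hypothetical placeholder:

from functools import reduce
from pyspark.sql import functions as F

new_reason_col = reduce(
    lambda c, k: c.when(F.lower(F.col("reason")).rlike(rf"\b{k.lower()}\b"), myDict[k]),
    myDict.keys(),
    F
).otherwise("Other")  # hypothetical fallback label for unmatched rows

df2 = df.withColumn("new_reason", new_reason_col)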
You can create a keywords dataframe, and join to the original dataframe using an rlike condition. I added \\\\b before and after the keywords so that only words between word boundaries will be matched, and there won't be partial word matches (e.g. "pineapple" matching "apple").
import pyspark.sql.functions as F
keywords = spark.createDataFrame([[k,v] for (k,v) in myDict.items()]).toDF('key', 'new_reason')
result = df.join(
    keywords,
    F.expr("lower(reason) rlike '\\\\b' || lower(key) || '\\\\b'"),
    'left'
).drop('key')
result.show(truncate=False)
+---+-------------------------+-------------------+
|id |reason |new_reason |
+---+-------------------------+-------------------+
|1 |Needed better Support |Support Issue |
|2 |Better value from android|Left for Competitor|
|3 |Price was to expensive |Pricing Issue |
|4 |Support problems |Support Issue |
+---+-------------------------+-------------------+
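If the keywords dataframe is tiny compared to df (which is usually the case for a lookup table like this), an optional refinement is to broadcast it so the non-equi join runs as a broadcast nested-loop join instead of shuffling the large side; same condition as above:

import pyspark.sql.functions as F

result = df.join(
    F.broadcast(keywords),  # ship the small keyword table to every executor
    F.expr("lower(reason) rlike '\\\\b' || lower(key) || '\\\\b'"),
    'left'
).drop('key')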

Spark SQL - 1 task running for long time due to null values is join key

I am performing a left join between two tables with 1.3 billion records each. However, the join key is null in table1 for approximately 600 million records, and because of this all null records get allocated to a single task. The resulting data skew makes that one task run for hours.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("report").enableHiveSupport().getOrCreate()

tbl1 = spark.sql("""select a.col1, b.col2, a.col3
                    from table1 a
                    left join table2 b on a.col1 = b.col2""")
tbl1.write.mode("overwrite").saveAsTable("db.tbl3")
There are no other join conditions, and this is the only join key available. Is there any way I can make Spark distribute these NULL records across different tasks instead of one, or is there any other approach?
There is an excellent answer by @Mikhail Dubkov that resolves just that.
I just modified it a little bit to solve the following exception:
org.apache.spark.sql.AnalysisException: Reference 'id' is ambiguous, could be: id, id.;
Here is an example
Create tables:
case class Country(country_id: String, country_name: String)
case class Location(location_id: Int, street_address: String, city: String, country_id: String)
val countries: DataFrame = List(
  Country("CN", "China"),
  Country("UK", "United Kingdom"),
  Country("US", "United States of America"),
  Country(null, "Unknown 1"),
  Country(null, "Unknown 2"),
  Country(null, "Unknown 3"),
  Country(null, "Unknown 4"),
  Country(null, "Unknown 5"),
  Country(null, "Unknown 6")
).toDF()

val locations = List(
  Location(1400, "2014 Jabberwocky Rd", "Southlake", "US"),
  Location(1500, "2011 Interiors Blvd", "San Francisco", "US"),
  Location(1700, "2004 Charade Rd", "Seattle", "US"),
  Location(2400, "8204 Arthur St", "London", "UK"),
  Location(2500, "Magdalen Centre, The Oxford Science Park", "Oxford", "UK"),
  Location(0, "Null Street", "Null City", null)
).toDF()
Join:
import SkewedDataFrameExt
val skewedSafeJoin = countries
.nullSkewLeftJoin(locations, "country_id")
+----------+------------------------+------------------------+-----------+----------------------------------------+-------------+----------+
|country_id|country_name |country_id_skewed_column|location_id|street_address |city |country_id|
+----------+------------------------+------------------------+-----------+----------------------------------------+-------------+----------+
|CN |China |CN |null |null |null |null |
|UK |United Kingdom |UK |2500 |Magdalen Centre, The Oxford Science Park|Oxford |UK |
|UK |United Kingdom |UK |2400 |8204 Arthur St |London |UK |
|US |United States of America|US |1700 |2004 Charade Rd |Seattle |US |
|US |United States of America|US |1500 |2011 Interiors Blvd |San Francisco|US |
|US |United States of America|US |1400 |2014 Jabberwocky Rd |Southlake |US |
|null |Unknown 1 |-9702 |null |null |null |null |
|null |Unknown 2 |-9689 |null |null |null |null |
|null |Unknown 3 |-815 |null |null |null |null |
|null |Unknown 4 |-7726 |null |null |null |null |
|null |Unknown 5 |-7826 |null |null |null |null |
|null |Unknown 6 |-8878 |null |null |null |null |
+----------+------------------------+------------------------+-----------+----------------------------------------+-------------+----------+
The other way I see to implement it is by applying a custom hint and adding a custom rule. I don't know whether it is worth the effort, though.
Tell me if this helps.
Modified nullSkewLeftJoin
def nullSkewLeftJoin(right: DataFrame,
                     usingColumn: String,
                     skewedColumnPostFix: String = "skewed_column",
                     nullNumBuckets: Int = 10000): DataFrame = {
  val left = underlying
  val leftColumn = left.col(usingColumn)
  val rightColumn = right.col(usingColumn)
  nullSkewLeftJoin(right, leftColumn, rightColumn, skewedColumnPostFix, nullNumBuckets)
}

def nullSkewLeftJoin(right: DataFrame,
                     joinLeftCol: Column,
                     joinRightCol: Column,
                     skewedColumnPostFix: String,
                     nullNumBuckets: Int): DataFrame = {
  val skewedTempColumn = s"${joinLeftCol.toString()}_$skewedColumnPostFix"

  if (underlying.columns.exists(_ equalsIgnoreCase skewedTempColumn)) {
    underlying.join(right.where(joinRightCol.isNotNull), col(skewedTempColumn) === joinRightCol, "left")
  } else {
    underlying
      .withColumn(skewedTempColumn,
        when(joinLeftCol.isNotNull, joinLeftCol).otherwise(negativeRandomWithin(nullNumBuckets)))
      .join(right.where(joinRightCol.isNotNull), col(skewedTempColumn) === joinRightCol, "left")
  }
}
} // closes the enclosing helper class that provides underlying (its definition is not shown in this snippet)
And again, all thanks to @Mikhail Dubkov.
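For a PySpark-only variant of the same idea (no Scala extension), here is a rough sketch. It assumes table1 and table2 are already loaded as DataFrames t1 and t2, and that the join key is numeric and never negative in real data, so the salted values can never collide with a genuine key:

from pyspark.sql import functions as F

num_buckets = 10000  # how many buckets to spread the null keys over

t1_salted = t1.withColumn(
    "col1_salted",
    F.when(F.col("col1").isNotNull(), F.col("col1"))
     # random value in [-num_buckets, -1]; can never match a real col2 value
     .otherwise(-(F.floor(F.rand() * num_buckets) + 1))
)

result = (
    t1_salted
    .join(t2.where(F.col("col2").isNotNull()),
          t1_salted["col1_salted"] == t2["col2"],
          "left")
    .drop("col1_salted")
)

The rows with a null col1 get distinct salted keys, so they hash to different partitions; since those keys never match anything in t2, they still come out with nulls on the right side, exactly as a left join on a null key would.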

PySpark - Using lists inside LIKE operator

I would like to use a list inside the LIKE operator in PySpark in order to create a column.
I have the following input df:
input_df:
+------+--------------------+-------+
| ID| customers|country|
+------+--------------------+-------+
|161 |xyz Limited |U.K. |
|262 |ABC Limited |U.K. |
|165 |Sons & Sons |U.K. |
|361 |TÜV GmbH |Germany|
|462 |Mueller GmbH |Germany|
|369 |Schneider AG |Germany|
|467 |Sahm UG |Austria|
+------+--------------------+-------+
I would like to add a column CAT_ID. CAT_ID takes value 1 if "ID" contains "16" or "26". CAT_ID takes value 2 if "ID" contains "36" or "46".
So, I want my output df to look like this -
The desired output_df :
+------+--------------------+-------+-------+
| ID| customers|country|Cat_ID |
+------+--------------------+-------+-------+
|161 |xyz Limited |U.K. |1 |
|262 |ABC Limited |U.K. |1 |
|165 |Sons & Sons |U.K. |1 |
|361 |TÜV GmbH |Germany|2 |
|462 |Mueller GmbH |Germany|2 |
|369 |Schneider AG |Germany|2 |
|467 |Sahm UG |Austria|2 |
+------+--------------------+-------+-------+
I am interested in learning how this can be done using LIKE statement and lists.
I know how to implement it without list, which works perfectly:
from pyspark.sql import functions as F
def add_CAT_ID(df):
    return df.withColumn(
        'CAT_ID',
        F.when((F.col('ID').like('16%')) | (F.col('ID').like('26%')), "1")
         .when((F.col('ID').like('36%')) | (F.col('ID').like('46%')), "2")
         .otherwise('999')
    )

output_df = add_CAT_ID(input_df)
However, I would love to use list and have something like:
list1 = ['16', '26']
list2 = ['36', '46']

def add_CAT_ID(df):
    return df.withColumn(
        'CAT_ID',
        F.when(F.col('ID').like(list1 %), "1")
         .when(F.col('ID').like('list2 %'), "2")
         .otherwise('999')
    )

output_df = add_CAT_ID(input_df)
Thanks a lot in advance,
SQL wildcards do not support "or" clauses. There are several ways you can handle it though.
1. Regular expressions
You can use rlike with a regular expression:
import pyspark.sql.functions as psf
list1 =['16', '26']
list2 =['36', '46']
df.withColumn(
    'CAT_ID',
    psf.when(psf.col('ID').rlike('({})\d'.format('|'.join(list1))), '1')
       .when(psf.col('ID').rlike('({})\d'.format('|'.join(list2))), '2')
       .otherwise('999')
).show()
+---+------------+-------+------+
| ID| customers|country|CAT_ID|
+---+------------+-------+------+
|161| xyz Limited| U.K.| 1|
|262|ABC Limited| U.K.| 1|
|165| Sons & Sons| U.K.| 1|
|361| TÜV GmbH|Germany| 2|
|462|Mueller GmbH|Germany| 2|
|369|Schneider AG|Germany| 2|
|467| Sahm UG|Austria| 2|
+---+------------+-------+------+
Here, for list1 we get the regular expression (16|26)\d, which matches 16 or 26 followed by a digit (\d is equivalent to [0-9]).
2. Dynamically build an SQL clause
If you want to keep the SQL LIKE, you can use selectExpr and chain the values with ' OR ':
df.selectExpr(
    '*',
    "CASE WHEN ({}) THEN '1' WHEN ({}) THEN '2' ELSE '999' END AS CAT_ID"
    .format(*[' OR '.join(["ID LIKE '{}%'".format(x) for x in l]) for l in [list1, list2]])
)
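For reference, this is roughly what the generated expression looks like with list1 and list2 as defined above (a quick sanity check using the same format string):

list1 = ['16', '26']
list2 = ['36', '46']

clause = "CASE WHEN ({}) THEN '1' WHEN ({}) THEN '2' ELSE '999' END AS CAT_ID".format(
    *[' OR '.join(["ID LIKE '{}%'".format(x) for x in l]) for l in [list1, list2]]
)
print(clause)
# prints (on one line):
# CASE WHEN (ID LIKE '16%' OR ID LIKE '26%') THEN '1' WHEN (ID LIKE '36%' OR ID LIKE '46%') THEN '2' ELSE '999' END AS CAT_ID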
3. Dynamically build a Python expression
You can also use eval if you don't want to write SQL:
df.withColumn(
    'CAT_ID',
    psf.when(eval(" | ".join(["psf.col('ID').like('{}%')".format(x) for x in list1])), '1')
       .when(eval(" | ".join(["psf.col('ID').like('{}%')".format(x) for x in list2])), '2')
       .otherwise('999')
)
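If you would rather not use eval, a hedged alternative with the same effect is to combine the Column conditions with functools.reduce and the | operator (same list1, list2 and df as above):

from functools import reduce
from operator import or_
import pyspark.sql.functions as psf

cond1 = reduce(or_, [psf.col('ID').like('{}%'.format(x)) for x in list1])
cond2 = reduce(or_, [psf.col('ID').like('{}%'.format(x)) for x in list2])

df.withColumn(
    'CAT_ID',
    psf.when(cond1, '1').when(cond2, '2').otherwise('999')
).show()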
From Spark 2.4 onwards, you can use higher-order functions in Spark SQL.
Try the one below; the SQL solution is the same for both Scala and Python.
val df = Seq(
  ("161", "xyz Limited", "U.K."),
  ("262", "ABC Limited", "U.K."),
  ("165", "Sons & Sons", "U.K."),
  ("361", "TÜV GmbH", "Germany"),
  ("462", "Mueller GmbH", "Germany"),
  ("369", "Schneider AG", "Germany"),
  ("467", "Sahm UG", "Germany")
).toDF("ID", "customers", "country")

df.show(false)
df.createOrReplaceTempView("secil")
spark.sql(
""" with t1 ( select id, customers, country, array('16','26') as a1, array('36','46') as a2 from secil),
t2 (select id, customers, country, filter(a1, x -> id like x||'%') a1f, filter(a2, x -> id like x||'%') a2f from t1),
t3 (select id, customers, country, a1f, a2f,
case when size(a1f) > 0 then 1 else 0 end a1r,
case when size(a2f) > 0 then 2 else 0 end a2r
from t2)
select id, customers, country, a1f, a2f, a1r, a2r, a1r+a2r as Cat_ID from t3
""").show(false)
Results:
+---+------------+-------+
|ID |customers |country|
+---+------------+-------+
|161|xyz Limited |U.K. |
|262|ABC Limited|U.K. |
|165|Sons & Sons |U.K. |
|361|TÜV GmbH |Germany|
|462|Mueller GmbH|Germany|
|369|Schneider AG|Germany|
|467|Sahm UG |Germany|
+---+------------+-------+
+---+------------+-------+----+----+---+---+------+
|id |customers |country|a1f |a2f |a1r|a2r|Cat_ID|
+---+------------+-------+----+----+---+---+------+
|161|xyz Limited |U.K. |[16]|[] |1 |0 |1 |
|262|ABC Limited|U.K. |[26]|[] |1 |0 |1 |
|165|Sons & Sons |U.K. |[16]|[] |1 |0 |1 |
|361|TÜV GmbH |Germany|[] |[36]|0 |2 |2 |
|462|Mueller GmbH|Germany|[] |[46]|0 |2 |2 |
|369|Schneider AG|Germany|[] |[36]|0 |2 |2 |
|467|Sahm UG |Germany|[] |[46]|0 |2 |2 |
+---+------------+-------+----+----+---+---+------+
Create the List, then Dataframe
If the list is structured a little differently, we can do a simple join using the like function and an expression after turning the list into a pyspark Dataframe. This demonstrates that while 'or' is not supported in the like clause, we can get around that by restructuring the inputs and using a join.
list1 = [('16', '1'),
         ('26', '1'),
         ('36', '2'),
         ('46', '2')]
listCols = ['listDat', 'Cat_ID']

df = spark.createDataFrame(data=list1, schema=listCols)
Create input_df
data = [(161, 'xyz Limited', 'U.K.'),
        (262, 'ABC Limited', 'U.K.'),
        (165, 'Sons & Sons', 'U.K.'),
        (361, 'TÜV GmbH', 'Germany'),
        (462, 'Mueller GmbH', 'Germany'),
        (369, 'Schneider AG', 'Germany'),
        (467, 'Sahm UG', 'Austria')]
dataCols = ['ID', 'Customers', 'Country']

input_df = spark.createDataFrame(data=data, schema=dataCols)
Given the dataframe, the join can now take place with wildcards. Note that while in SSMS we can use the plus sign to concatenate, in PySpark we need to use concat().
Before displaying, the extra column is dropped since it isn't needed in the result set.
output_df = input_df.alias('i').join(df.alias('d'),
                                     F.expr("i.ID like concat('%', d.listDat, '%')"),
                                     how='left').drop('listDat')
output_df.show()
Display output_df
+---+------------+-------+------+
| ID| Customers|Country|Cat_ID|
+---+------------+-------+------+
|161| xyz Limited| U.K.| 1|
|262| ABC Limited| U.K.| 1|
|165| Sons & Sons| U.K.| 1|
|361| TÜV GmbH|Germany| 2|
|462|Mueller GmbH|Germany| 2|
|369|Schneider AG|Germany| 2|
|467| Sahm UG|Austria| 2|
+---+------------+-------+------+
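One caveat, stated as an assumption about the data: if an ID could match more than one listDat value (say an ID containing both 16 and 46), the left join would return that ID twice. A defensive sketch is to deduplicate on ID afterwards:

import pyspark.sql.functions as F

output_df = (
    input_df.alias('i')
    .join(df.alias('d'),
          F.expr("i.ID like concat('%', d.listDat, '%')"),
          how='left')
    .drop('listDat')
    .dropDuplicates(['ID'])  # keeps one arbitrary Cat_ID per ID if several keywords match
)
output_df.show()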

Spark Querying on all keys of a map

I have a parquet file with the following schema
|-- Name: string (nullable = true)
|-- Attendance: long (nullable = true)
|-- Efficiency: map (nullable = true)
| |-- key: string
| |-- value: double (valueContainsNull = true)
The efficiency value ranges from -1 to +1, and the keys are various categories such as Sports, Academics, etc. I have up to 20 different keys.
I am trying to fetch the top 100 names, ordered by Attendance descending, where Efficiency[key] is less than 0.
I am able to do this for one key, but I'm not able to figure out how to implement this for all my keys simultaneously.
Code snippet for one key:
spark.sql("select Name,Attendance,Efficiency['Sports'] from data where Efficiency['Sports'] < 0 order by Attendance desc limit 100")
On doing some analysis, I found that we would need to explode the map. But whenever I explode it, the number of rows in my table goes up and I am unable to fetch the top 100 names.
Sample data for one key (the actual table has a map instead of the single column shown here):
+----+----------+------------------+
|Name|Attendance|Efficiency[Sports]|
+----+----------+------------------+
|A   |1000      |0.002             |
|B   |365       |0.0               |
|C   |1080      |0.193             |
|D   |245       |-0.002            |
|E   |1080      |-0.515            |
|F   |905       |0.0               |
|G   |900       |-0.001            |
+----+----------+------------------+
Expected output: a list of 100 names for each key
+------+---------+
|Sports|Academics|
+------+---------+
|A     |A        |
|B     |C        |
|C     |D        |
|D     |E        |
+------+---------+
Any help on solving this would be really helpful.
Thanks
I hope this is what you are looking for
import org.apache.spark.sql.functions._

// dummy data
val d = Seq(
  ("a", 10, Map("Sports" -> -0.2, "Academics" -> 0.1)),
  ("b", 20, Map("Sports" -> -0.1, "Academics" -> -0.1)),
  ("c", 5, Map("Sports" -> -0.2, "Academics" -> 0.5)),
  ("d", 15, Map("Sports" -> -0.2, "Academics" -> 0.0))
).toDF("Name", "Attendence", "Efficiency")

// explode the map and get key and value columns
val result = d.select($"Name", $"Attendence", explode($"Efficiency"))

// select values less than 0 and show up to 100 rows
result.select("*").where($"value".lt(0))
  .sort($"Attendence".desc)
  .show(100)
Output:
+----+----------+---------+-----+
|Name|Attendence|key |value|
+----+----------+---------+-----+
|b |20 |Sports |-0.1 |
|b |20 |Academics|-0.1 |
|d |15 |Sports |-0.2 |
|a |10 |Sports |-0.2 |
|c |5 |Sports |-0.2 |
+----+----------+---------+-----+
Hope this helps!
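To get the top 100 names per key (rather than one global limit after the explode), a possible extension is a row_number window partitioned by the map key. A PySpark sketch, assuming the same dummy data is loaded into a DataFrame d:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

exploded = d.select("Name", "Attendence", F.explode("Efficiency"))

w = Window.partitionBy("key").orderBy(F.col("Attendence").desc())

top_per_key = (
    exploded.where(F.col("value") < 0)
            .withColumn("rn", F.row_number().over(w))
            .where(F.col("rn") <= 100)  # top 100 per key by attendance
            .drop("rn")
)
top_per_key.show(truncate=False)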
Given the input dataframe as
+----+----------+-----------------------------------------+
|Name|Attendance|Efficiency |
+----+----------+-----------------------------------------+
|A |1000 |Map(Sports -> 0.002, Academics -> 0.002) |
|B |365 |Map(Sports -> 0.0, Academics -> 0.0) |
|C |1080 |Map(Sports -> 0.193, Academics -> 0.193) |
|D |245 |Map(Sports -> -0.002, Academics -> -0.46)|
|E |1080 |Map(Sports -> -0.515, Academics -> -0.5) |
|F |905 |Map(Sports -> 0.0, Academics -> 0.0) |
|G |900 |Map(Sports -> -0.001, Academics -> -0.0) |
+----+----------+-----------------------------------------+
Use a udf function to iterate over the Map and check for values less than zero. This can be done as below:
import org.apache.spark.sql.functions._
val isLessThan0 = udf((maps: Map[String, Double]) => maps.map(x => x._2 < 0).toSeq.contains(true))
df.withColumn("lessThan0", isLessThan0('Efficiency))
  .filter($"lessThan0" === true)
  .orderBy($"Attendance".desc)
  .drop("lessThan0")
  .show(100, false)
You will have the output as:
+----+----------+-----------------------------------------+
|Name|Attendance|Efficiency |
+----+----------+-----------------------------------------+
|E |1080 |Map(Sports -> -0.515, Academics -> -0.5) |
|G |900 |Map(Sports -> -0.001, Academics -> -0.0) |
|D |245 |Map(Sports -> -0.002, Academics -> -0.46)|
+----+----------+-----------------------------------------+
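On Spark 2.4+ the same check can also be written without a UDF by applying the exists higher-order function to the map values; a PySpark sketch of the equivalent filter, assuming the same df:

import pyspark.sql.functions as F

# keep rows where any value in the Efficiency map is below zero
(df.where(F.expr("exists(map_values(Efficiency), x -> x < 0)"))
   .orderBy(F.col("Attendance").desc())
   .show(100, truncate=False))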
