PySpark equivalent of a simple SQL join - apache-spark

This is probably far to simple of a question.
But I'm not getting very far on my own.
I'm trying to use PySpark in Databricks to do the SQL equivalent of a lookup:
select
a.*
, b.MASTER_ID as PLAYER_ID
from vGame a
join PLAYER_XREF b
on a.PLAYER_NAME = b.PLAYER
Note that the two attributes on both sides of the on are NOT named the same.
Can you show me the pyspark version of the same?
Seems to me the numerous tangential posts here for this are over the top complex compared to than this.
I found this and this is really close but the returned dataframe is all columns of ta & tb.
inner_join = ta.join(tb, ta.name == tb.name)

I can list out all the ta columns individually & alias the one tb column with:
from pyspark.sql.functions import *
inner_join = ta.join(tb, ta.PLAYER_NAME == tb.PLAYER).select('<taCol1>', '<taCol2>', ... col('MASTER_ID').alias('PLAYER_ID'))
display(inner_join)

logic:
1.) We first rename player_name in ta dataframe to player so that we can join
2.) Once the columnNames are same we can use a join using square brackets []
3.) also we dynamically select columns from data frame ta
code:
ta = ta.withColumn("player_name","player")
inner_join = ta.join(tb,["player"]).select(col(x) for x in ta.columns])

Related

Pyspark - Looking to create a normalized version of a Double column

As the title states, I'd like to create a normalized version of an existing Double column.
As I'm quite new to pyspark, this was my attempt at solving this:
df2 = df.groupBy('id').count().toDF(*['id','count_trans'])
df2 = df2.withColumn('count_trans_norm', F.col('count_trans) / (F.max(F.col('count_trans'))))
When I do this, I get the following error:
"grouping expressions sequence is empty, and '`movie_id`' is not an aggregate function.
Any help would be much appreciated.
You need to specify an empty window if you want to get the maximum of count_trans in df2:
df2 = df.groupBy('id').count().toDF(*['id','count_trans'])
df3 = df2.selectExpr('*', 'count_trans / max(count_trans) over () as count_trans_norm')
Or if you prefer pyspark syntax:
from pyspark.sql import functions as F, Window
df3 = df2.withColumn('count_trans_norm', F.col('count_trans') / F.max(F.col('count_trans')).over(Window.orderBy()))

How do I calculate percentages over groups in spark?

I have data in the form:
FUND|BROKER|QTY
F1|B1|10
F1|B1|50
F1|B2|20
F1|B3|20
When I group it by FUND, and BROKER, I would like to calculate QTY as a percentage of the total at the group level. Like so,
FUND|BROKER|QTY %|QTY EXPLANATION
F1|B1|60%|(10+50)/(10+50+20+20)
F1|B2|20%|(20)/(10+50+20+20)
F1|B2|20%|(20)/(10+50+20+20)
Or when I group by just FUND, like so
FUND|BROKER|QTY %|QTY EXPLANATION
F1|B1|16.66|(10)/(10 + 50)
F1|B1|83.33|(50)/(10 + 50)
F1|B2|100|(20)/(20)
F1|B3|100|(20)/(20)
I would like to achieve this using spark-sql if possible or through dataframe functions.
I think I have to use Windowing functions, so I can get access to the total of the grouped dataset, but I've not had much luck using them the right way.
Dataset<Row> result = sparkSession.sql("SELECT fund_short_name, broker_short_name,first(quantity)/ sum(quantity) as new_col FROM margin_summary group by fund_short_name, broker_short_name" );
PySpark SQL solution.
This can be done using sum as a window function defining 2 windows - one with a grouping on broker, fund and the other only on fund.
from pyspark.sql import Window
from pyspark.sql.functions import sum
w1 = Window.partitionBy(df.fund,df.broker)
w2 = Window.partitionBy(df.fund)
res = df.withColumn('qty_pct',sum(df.qty).over(w1)/sum(df.qty).over(w2))
res.select(res.fund,res.broker,res.qty_pct).distinct().show()
Edit: Result 2 is simpler.
res2 = df.withColumn('qty_pct',df.qty/sum(df.qty).over(w1))
res2.show()
SQL solution would be
select distinct fund,broker,100*sum(qty) over(partition by fund,broker)/sum(qty) over(partition by fund)
from tbl
Yes. You are right when you say that you need to use windowed analytical functions.
Please find below the solutions to your queries.
Hope it helps!
spark.read.option("header","true").option("delimiter","|").csv("****").withColumn("fundTotal",sum("QTY").over(Window.partitionBy("FUND"))).withColumn("QTY%",sum("QTY").over(Window.partitionBy("BROKER"))).select('FUND,'BROKER,(($"QTY%"*100)/'fundTotal).as("QTY%")).distinct.show
And the second!
spark.read.option("header","true").option("delimiter","|").csv("/vihit/data.csv").withColumn("QTY%",sum("QTY").over(Window.partitionBy("BROKER"))).select('FUND,'BROKER,(('QTY*100)/$"QTY%").as("QTY%")).distinct.show

How to use monotonically_increasing_id to join two pyspark dataframes having no common column?

I have two pyspark dataframes with same number of rows but they don't have any common column. So I am adding new column to both of them using monotonically_increasing_id() as
from pyspark.sql.functions import monotonically_increasing_id as mi
id=mi()
df1 = df1.withColumn("match_id", id)
cont_data = cont_data.withColumn("match_id", id)
cont_data = cont_data.join(df1,df1.match_id==cont_data.match_id, 'inner').drop(df1.match_id)
But after join the resulting data frame has less number of rows.
What am I missing here. Thanks
You just don't. This not an applicable use case for monotonically_increasing_id, which is by definition non-deterministic. Instead:
convert to RDD
zipWithIndex
convert back to DataFrame.
join
You can generate the id's with monotonically_increasing_id, save the file to disc, and then read it back in THEN do whatever joining process. Would only suggest this approach if you just need to generate the id's once. At that point they can be used for joining, but for the reasons mentioned above, this is hacky and not a good solution for anything that runs regularly.
If you want to get an incremental number on both dataframes and then join, you can generate a consecutive number with monotonically and windowing with the following code:
df1 = df1.withColumn("monotonically_increasing_id",monotonically_increasing_id())
window = Window.orderBy(scol('monotonically_increasing_id'))
df1 = df1.withColumn("match_id", row_number().over(window))
df1 = df1.drop("monotonically_increasing_id")
cont_data = cont_data.withColumn("monotonically_increasing_id",monotonically_increasing_id())
window = Window.orderBy(scol('monotonically_increasing_id'))
cont_data = cont_data.withColumn("match_id", row_number().over(window))
cont_data = cont_data.drop("monotonically_increasing_id")
cont_data = cont_data.join(df1,df1.match_id==cont_data.match_id, 'inner').drop(df1.match_id)
Warning It may move the data to a single partition! So maybe is better to separate the match_id to a different dataframe with the monotonically_increasing_id, generate the consecutive incremental number and then join with the data.

How to remove rows in DataFrame on column based on another DataFrame?

I'm trying to use SQLContext.subtract() in Spark 1.6.1 to remove rows from a dataframe based on a column from another dataframe. Let's use an example:
from pyspark.sql import Row
df1 = sqlContext.createDataFrame([
Row(name='Alice', age=2),
Row(name='Bob', age=1),
]).alias('df1')
df2 = sqlContext.createDataFrame([
Row(name='Bob'),
])
df1_with_df2 = df1.join(df2, 'name').select('df1.*')
df1_without_df2 = df1.subtract(df1_with_df2)
Since I want all rows from df1 which don't include name='Bob' I expect Row(age=2, name='Alice'). But I also retrieve Bob:
print(df1_without_df2.collect())
# [Row(age='1', name='Bob'), Row(age='2', name='Alice')]
After various experiments to get down to this MCVE, I found out that the issue is with the age key. If I omit it:
df1_noage = sqlContext.createDataFrame([
Row(name='Alice'),
Row(name='Bob'),
]).alias('df1_noage')
df1_noage_with_df2 = df1_noage.join(df2, 'name').select('df1_noage.*')
df1_noage_without_df2 = df1_noage.subtract(df1_noage_with_df2)
print(df1_noage_without_df2.collect())
# [Row(name='Alice')]
Then I only get Alice as expected. The weirdest observation I made is that it's possible to add keys, as long as they're after (in the lexicographical order sense) the key I use in the join:
df1_zage = sqlContext.createDataFrame([
Row(zage=2, name='Alice'),
Row(zage=1, name='Bob'),
]).alias('df1_zage')
df1_zage_with_df2 = df1_zage.join(df2, 'name').select('df1_zage.*')
df1_zage_without_df2 = df1_zage.subtract(df1_zage_with_df2)
print(df1_zage_without_df2.collect())
# [Row(name='Alice', zage=2)]
I correctly get Alice (with her zage)! In my real examples, I'm interested in all columns, not only the ones that are after name.
Well there are some bugs here (the first issue looks like related to to the same problem as SPARK-6231) and JIRA looks like a good idea, but SUBTRACT / EXCEPT is no the right choice for partial matches.
Instead, as of Spark 2.0, you can use anti-join:
df1.join(df1_with_df2, ["name"], "leftanti").show()
In 1.6 you can do pretty much the same thing with standard outer join:
import pyspark.sql.functions as F
ref = df1_with_df2.select("name").alias("ref")
(df1
.join(ref, ref.name == df1.name, "leftouter")
.filter(F.isnull("ref.name"))
.drop(F.col("ref.name")))

Find maximum row per group in Spark DataFrame

I'm trying to use Spark dataframes instead of RDDs since they appear to be more high-level than RDDs and tend to produce more readable code.
In a 14-nodes Google Dataproc cluster, I have about 6 millions names that are translated to ids by two different systems: sa and sb. Each Row contains name, id_sa and id_sb. My goal is to produce a mapping from id_sa to id_sb such that for each id_sa, the corresponding id_sb is the most frequent id among all names attached to id_sa.
Let's try to clarify with an example. If I have the following rows:
[Row(name='n1', id_sa='a1', id_sb='b1'),
Row(name='n2', id_sa='a1', id_sb='b2'),
Row(name='n3', id_sa='a1', id_sb='b2'),
Row(name='n4', id_sa='a2', id_sb='b2')]
My goal is to produce a mapping from a1 to b2. Indeed, the names associated to a1 are n1, n2 and n3, which map respectively to b1, b2 and b2, so b2 is the most frequent mapping in the names associated to a1. In the same way, a2 will be mapped to b2. It's OK to assume that there will always be a winner: no need to break ties.
I was hoping that I could use groupBy(df.id_sa) on my dataframe, but I don't know what to do next. I was hoping for an aggregation that could produce, in the end, the following rows:
[Row(id_sa=a1, max_id_sb=b2),
Row(id_sa=a2, max_id_sb=b2)]
But maybe I'm trying to use the wrong tool and I should just go back to using RDDs.
Using join (it will result in more than one row in group in case of ties):
import pyspark.sql.functions as F
from pyspark.sql.functions import count, col
cnts = df.groupBy("id_sa", "id_sb").agg(count("*").alias("cnt")).alias("cnts")
maxs = cnts.groupBy("id_sa").agg(F.max("cnt").alias("mx")).alias("maxs")
cnts.join(maxs,
(col("cnt") == col("mx")) & (col("cnts.id_sa") == col("maxs.id_sa"))
).select(col("cnts.id_sa"), col("cnts.id_sb"))
Using window functions (will drop ties):
from pyspark.sql.functions import row_number
from pyspark.sql.window import Window
w = Window().partitionBy("id_sa").orderBy(col("cnt").desc())
(cnts
.withColumn("rn", row_number().over(w))
.where(col("rn") == 1)
.select("id_sa", "id_sb"))
Using struct ordering:
from pyspark.sql.functions import struct
(cnts
.groupBy("id_sa")
.agg(F.max(struct(col("cnt"), col("id_sb"))).alias("max"))
.select(col("id_sa"), col("max.id_sb")))
See also How to select the first row of each group?
I think what you might be looking for are window functions:
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=window#pyspark.sql.Window
https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html
Here is an example in Scala (I don't have a Spark Shell with Hive available right now, so I was not able to test the code, but I think it should work):
case class MyRow(name: String, id_sa: String, id_sb: String)
val myDF = sc.parallelize(Array(
MyRow("n1", "a1", "b1"),
MyRow("n2", "a1", "b2"),
MyRow("n3", "a1", "b2"),
MyRow("n1", "a2", "b2")
)).toDF("name", "id_sa", "id_sb")
import org.apache.spark.sql.expressions.Window
val windowSpec = Window.partitionBy(myDF("id_sa")).orderBy(myDF("id_sb").desc)
myDF.withColumn("max_id_b", first(myDF("id_sb")).over(windowSpec).as("max_id_sb")).filter("id_sb = max_id_sb")
There are probably more efficient ways to achieve the same results with Window functions, but I hope this points you in the right direction.

Resources