Need to join multiple tables in PySpark

query using:
df= (df1.alias('a')
.join(df2, a.id == df2.id, how='inner')
.select('a.*').alias('b')
.join(df3, b.id == df3.id, how='inner'))
error: name 'b' is not defined.

.alias('b') does not create a Python identifier named b; it only sets an internal name on the returned dataframe. For the same reason, a.id does not refer to the alias 'a': it only resolves if a Python variable named a was defined earlier, and that is probably not the object you expect.
I can't remember a nice way to access the newly created dataframe by name right in the expression, so I'd go with an intermediate identifier:
# keep only df1's columns so that df_joined.id is unambiguous in the second join
df_joined = df1.join(df2, df1.id == df2.id, how='inner').select(df1['*'])
result_df = df_joined.join(df3, df_joined.id == df3.id, how='inner')

Related

Update a column in PySpark while doing multiple inner joins?

I have a SQL query which I am trying to convert into PySpark. In SQL query, we are joining three tables and updating a column where there's a match. The SQL query looks like this:
UPDATE [DEPARTMENT_DATA]
INNER JOIN ([COLLEGE_DATA]
INNER JOIN [STUDENT_TABLE]
ON COLLEGE_DATA.UNIQUEID = STUDENT_TABLE.PROFESSIONALID)
ON DEPARTMENT_DATA.PUBLICID = COLLEGE_DATA.COLLEGEID
SET STUDENT_TABLE.PRIVACY = "PRIVATE"
The logic I have tried:
df_STUDENT_TABLE = (
df_STUDENT_TABLE.alias('a')
.join(
df_COLLEGE_DATA.alias('b'),
on=F.col('a.PROFESSIONALID') == F.col('b.UNIQUEID'),
how='left',
)
.join(
df_DEPARTMENT_DATA.alias('c'),
on=F.col('b.COLLEGEID') == F.col('c.PUBLICID'),
how='left',
)
.select(
*[F.col(f'a.{c}') for c in df_STUDENT_TABLE.columns],
F.when(
F.col('b.UNIQUEID').isNotNull() & F.col('c.PUBLICID').isNotNull(),
F.lit('PRIVATE')
).alias('PRIVACY')
)
)
This code is adding a new column "PRIVACY", but giving null values after running.
I took some sample data and applied the join conditions. The query below returns the record whose privacy needs to be set to PRIVATE:
%sql
select student.*,college.*,department.* from department INNER JOIN college INNER JOIN student
ON college.unique_id = student.professional_id and department.public_id = college.college_id
When I used your code (same logic), I got the same output i.e., an additional column being added to the dataframe with required values and the actual privacy column has nulls.
from pyspark.sql.functions import col,when,lit
df_s = (
    df_s.alias('a')
    .join(df_c.alias('b'), col('a.professional_id') == col('b.unique_id'), 'left')
    .join(df_d.alias('c'), col('b.college_id') == col('c.public_id'), 'left')
    .select(
        *[col(f'a.{c}') for c in df_s.columns],
        when(col('b.unique_id').isNotNull() & col('c.public_id').isNotNull(), 'PRIVATE')
        .otherwise(col('a.privacy'))
        .alias('req_value'),
    )
)
df_s.show()
Since, from the above, req_value is the column with required values and these values need to be reflected in privacy, you can use the following code directly.
final = df_s.withColumn('privacy',col('req_value')).select([column for column in df_s.columns if column!='req_value'])
final.show()
UPDATE:
You can also use the following code where I have updated the column using withColumn instead of select.
df_s = (
    df_s.alias('a')
    .join(df_c.alias('b'), col('a.professional_id') == col('b.unique_id'), 'left')
    .join(df_d.alias('c'), col('b.college_id') == col('c.public_id'), 'left')
    .withColumn(
        'privacy',
        when(col('b.unique_id').isNotNull() & col('c.public_id').isNotNull(), 'PRIVATE')
        .otherwise(col('privacy')),
    )
    .select(*df_s.columns)
)
#or you can use this as well, without using alias.
#df_s = df_s.join(df_c, df_s['professional_id'] == df_c['unique_id'],'left').join(df_d, df_c['college_id'] == df_d['public_id'],'left').withColumn('privacy',when(df_c['unique_id'].isNotNull() & df_d['public_id'].isNotNull(), 'PRIVATE').otherwise(df_s['privacy'])).select(*df_s.columns)
df_s.show()
After the joins, you can use nvl2. It checks whether the join with the last dataframe (df_dept) found a match: if d.PUBLICID is not null, it returns "PRIVATE"; otherwise it returns the value from df_stud.PRIVACY.
Inputs:
from pyspark.sql import functions as F
df_stud = spark.createDataFrame([(1, 'x'), (2, 'STAY')], ['PROFESSIONALID', 'PRIVACY'])
df_college = spark.createDataFrame([(1, 1)], ['COLLEGEID', 'UNIQUEID'])
df_dept = spark.createDataFrame([(1,)], ['PUBLICID'])
df_stud.show()
# +--------------+-------+
# |PROFESSIONALID|PRIVACY|
# +--------------+-------+
# |             1|      x|
# |             2|   STAY|
# +--------------+-------+
Script:
df = (df_stud.alias('s')
.join(df_college.alias('c'), F.col('s.PROFESSIONALID') == F.col('c.UNIQUEID'), 'left')
.join(df_dept.alias('d'), F.col('c.COLLEGEID') == F.col('d.PUBLICID'), 'left')
.select(
*[f's.`{c}`' for c in df_stud.columns if c != 'PRIVACY'],
F.expr("nvl2(d.PUBLICID, 'PRIVATE', s.PRIVACY) PRIVACY")
)
)
df.show()
# +--------------+-------+
# |PROFESSIONALID|PRIVACY|
# +--------------+-------+
# |             1|PRIVATE|
# |             2|   STAY|
# +--------------+-------+

Using AVG in Spark with window function

I have the following SQL Query:
Select st.Value,
st.Id,
ntile(2) OVER (PARTITION BY St.Id, St.VarId ORDER By St.Sls),
AVG(St.Value) OVER (PARTITION BY St.Id, St.VarId ORDER By St.Sls, St.Date)
FROM table tb
INNER JOIN staging st on St.Id = tb.Id
I've tried to adapt this to Spark/PySpark using window function, my code is below:
windowSpec_1 = Window.partitionBy("staging.Id", "staging.VarId").orderBy("staging.Sls")
windowSpec_2 = Window.partitionBy("staging.Id", "staging.VarId").orderBy("staging.Sls", "staging.Date")
df= table.join(
staging,
on=f.col("staging.Id") == f.col("table.Id"),
how='inner'
).select(
f.col("staging.Value"),
f.ntile(2).over(windowSpec_1),
f.avg("staging.Value").over(windowSpec_2)
)
Although I'm getting the following error:
pyspark.sql.utils.AnalysisException: Can't extract value from Value#42928: need struct type but got decimal(16,6)
How Can I solve this problem? Is it necessary to group data?
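Maybe you forgot to assign an alias to staging? Without one, Spark cannot resolve the qualifier in "staging.Value" (the same goes for "table.Id" if table has no alias yet):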
df= table.join(
staging.alias("staging"),
on=f.col("staging.Id") == f.col("table.Id"),
how='inner'
).select(
f.col("staging.Value"),
f.ntile(2).over(windowSpec_1),
f.avg("staging.Value").over(windowSpec_2)
)

pyspark.sql.utils.AnalysisException: Column ambiguous but no duplicate column names

I'm getting an ambiguous column exception when joining on the id column of a dataframe, but there are no duplicate columns in the dataframe. What could be causing this error to be thrown?
Join operation, where a and input have been processed by other functions:
b = (
input
.where(F.col('st').like('%VALUE%'))
.select('id', 'sii')
)
a.join(b, b['id'] == a['item'])
Dataframes:
(Pdb) a.explain()
== Physical Plan ==
*(1) Scan ExistingRDD[item#25280L,sii#24665L]
(Pdb) b.explain()
== Physical Plan ==
*(1) Project [id#23711L, sii#24665L]
+- *(1) Filter (isnotnull(st#25022) AND st#25022 LIKE %VALUE%)
+- *(1) Scan ExistingRDD[id#23711L,st#25022,sii#24665L]
Exception:
pyspark.sql.utils.AnalysisException: Column id#23711L are ambiguous. It's probably because you joined several Datasets together, and some of these Datasets are the same. This column points to one of the Datasets but Spark is unable to figure out which one. Please alias the Datasets with different names via Dataset.as before joining them, and specify the column using qualified name, e.g. df.as("a").join(df.as("b"), $"a.id" > $"b.id"). You can also set spark.sql.analyzer.failAmbiguousSelfJoin to false to disable this check.;
If I recreate the dataframe using the same schema, I do not get any errors:
b_clean = spark_session.createDataFrame([], b.schema)
a.join(b_clean, b_clean['id'] == a['item'])
What can I look at to troubleshoot what happened with the original dataframes that would cause the ambiguous column error?
This error, together with the fact that the sii column has the same ID in both dataframes (sii#24665L), tells you that a and b were built from the same source. In essence, this makes your join a self join, which is exactly what the error message says. In such cases it is recommended to alias the dataframes. Try this:
a.alias('a').join(b.alias('b'), F.col('b.id') == F.col('a.item'))
Some systems may also refuse to save the result, because the resulting dataframe has two sii columns. I recommend explicitly selecting only the columns you need. If you need both duplicate columns, renaming them with alias helps. E.g.:
df = (
a.alias('a').join(b.alias('b'), F.col('b.id') == F.col('a.item'))
.select('item',
'id',
F.col('a.sii').alias('a_sii')
)
)

Create column and use as join key inside of JOIN in PySpark

I'd like to create a column to use as the join key inside of the join like:
df1.join(df2
.withColumn('NewDF2Column', SOME_OPERATION),
df1['key'] == df2['NewDF2Column'], how='left')
PySpark can never find the NewDF2Column to use as the join key. It works if I create it first in another dataframe, but not dynamically like this. Is it possible? Thank you!
Dataframes are immutable: withColumn returns a new dataframe and leaves df2 unchanged. In your code, the new column exists only on the dataframe passed as the first argument of join, while df2['NewDF2Column'] in the join condition still refers to the original df2, which has no such column. There are two ways to solve it.
First option
# Creating before joining
df2 = df2.withColumn('NewDF2Column', SOME_OPERATION)
output_df = df1.join(df2, df1['key'] == df2['NewDF2Column'], how='left')
Second option
# Creating a column to join with the same name as df1
output_df = df1.join(df2.withColumn('key', SOME_OPERATION), on='key', how='left')

Error: Resolved attributes missing in join

I'm using pyspark to perform a join of two tables with a relatively complex join condition (using greater than/smaller than in the join conditions). This works fine, but breaks down as soon as I add a fillna command before the join.
The code looks something like this:
join_cond = [
df_a.col1 == df_b.colx,
df_a.col2 == df_b.coly,
df_a.col3 >= df_b.colz
]
df = (
df_a
.fillna('NA', subset=['col1'])
.join(df_b, join_cond, 'left')
)
This results in an error like this:
org.apache.spark.sql.AnalysisException: Resolved attribute(s) col1#4765 missing from col1#6488,col2#4766,col3#4768,colx#4823,coly#4830,colz#4764 in operator !Join LeftOuter, (((col1#4765 = colx#4823) && (col2#4766 = coly#4830)) && (col3#4768 >= colz#4764)). Attribute(s) with the same name appear in the operation: col1. Please check if the right attribute(s) are used.
It looks like spark no longer recognizes col1 after performing the fillna. (The error does not come up if I comment that out.) The problem is that I do need that statement. (And in general I've simplified this example a lot.)
I've looked at this question, but these answers do not work for me. Specifically, using .alias('a') after the fillna doesn't work because then spark does not recognize the a in the join condition.
Could someone:
Explain exactly why this is happening and how I can avoid it in the future?
Advise me on a way to solve it?
Thanks in advance for your help.
What is happening?
In order to "replace" empty values, a new dataframe is created that contains new columns. These new columns have the same names as the old ones but are effectively completely new Spark objects. In the Scala code you can see that the "changed" columns are newly created ones while the original columns are dropped.
A way to see this effect is to call explain on the dataframe before and after replacing the empty values:
df_a.explain()
prints
== Physical Plan ==
*(1) Project [_1#0L AS col1#6L, _2#1L AS col2#7L, _3#2L AS col3#8L]
+- *(1) Scan ExistingRDD[_1#0L,_2#1L,_3#2L]
while
df_a.fillna(42, subset=['col1']).explain()
prints
== Physical Plan ==
*(1) Project [coalesce(_1#0L, 42) AS col1#27L, _2#1L AS col2#7L, _3#2L AS col3#8L]
+- *(1) Scan ExistingRDD[_1#0L,_2#1L,_3#2L]
Both plans contain a column called col1, but in the first case the internal representation is called col1#6L while the second one is called col1#27L.
The join condition df_a.col1 == df_b.colx is bound to the column col1#6L, so the join fails when the left table only contains col1#27L.
How can the problem be solved?
The obvious way would be to move the `fillna` operation before the definition of the join condition:
df_a = df_a.fillna('NA', subset=['col1'])
join_cond = [
df_a.col1 == df_b.colx,
[...]
If this is not possible or wanted you can change the join condition. Instead of using a column from the dataframe (df_a.col1) you can use a column that is not associated with any dataframe by using the col function. This column works only based on its name and therefore ignores when the column is replaced in the dataframe:
from pyspark.sql import functions as F
join_cond = [
F.col("col1") == df_b.colx,
df_a.col2 == df_b.coly,
df_a.col3 >= df_b.colz
]
The downside of this second approach is that the name must be unambiguous: col1 may exist in only one of the two tables, because F.col("col1") is resolved purely by name.

Resources