"resolved attribute(s) missing" when performing join on pySpark - apache-spark

I have the following two PySpark dataframes:
> df_lag_pre.columns
['date','sku','name','country','ccy_code','quantity','usd_price','usd_lag','lag_quantity']
> df_unmatched.columns
['alt_sku', 'alt_lag_quantity', 'country', 'ccy_code', 'name', 'usd_price']
Now I want to join them on common columns, so I try the following:
> df_lag_pre.join(df_unmatched, on=['name','country','ccy_code','usd_price'])
And I get the following error message:
AnalysisException: u'resolved attribute(s) price#3424 missing from country#3443,month#801,price#808,category#803,subcategory#804,page#805,date#280,link#809,name#806,quantity#807,ccy_code#3439,sku#3004,day#802 in operator !EvaluatePython PythonUDF#<lambda>(ccy_code#3439,price#3424), pythonUDF#811: string;'
Some of the columns that show up in this error, such as price, were part of another dataframe from which df_lag_pre was built. I can't find any information on how to interpret this message, so any help would be greatly appreciated.

You can perform the join this way in PySpark; please see if this is useful for you:
df1 = df_lag_pre.alias("df1")
df2 = df_unmatched.alias("df2")
join_both = df1.join(
    df2,
    (col("df1.name") == col("df2.name"))
    & (col("df1.country") == col("df2.country"))
    & (col("df1.ccy_code") == col("df2.ccy_code"))
    & (col("df1.usd_price") == col("df2.usd_price")),
    'inner'
)
Update: if you are getting a NameError saying col is not defined, add the import below:
from pyspark.sql.functions import col
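If the aliased join still raises the same "resolved attribute(s) missing" error, a common workaround (an assumption about the cause here, not something confirmed in the question) is to re-project both dataframes so Spark assigns fresh attribute IDs before the join:
# Re-project to break the old lineage and get fresh attribute IDs
df_lag_clean = df_lag_pre.toDF(*df_lag_pre.columns)
df_unmatched_clean = df_unmatched.toDF(*df_unmatched.columns)
joined = df_lag_clean.join(df_unmatched_clean, on=['name', 'country', 'ccy_code', 'usd_price'])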

Related

why am I getting column object not callable error in pyspark?

I am reading two Parquet files and running a query to find the unmatched rows from the left table. Please see the code snippet below.
argTestData = '<path to parquet file>'
tst_DF = spark.read.option('header', True).parquet(argTestData)
argrefData = '<path to parquet file>'
refDF = spark.read.option('header', True).parquet(argrefData)
cond = ["col1", "col2", "col3"]
fi = tst_DF.join(refDF, cond , "left_anti")
So far this works. However, as a requirement, I need the list of elements if the join above returns anything, i.e. if fi.count() > 0, then I need the element names. So I tried the code below, but it throws an error.
if fi.filter(col("col1").count() > 0).collect():
    fi.show()
error
TypeError: 'Column' object is not callable
Note:
I have three columns as the join condition; they are in a list assigned to the variable cond, and I need the unmatched records for those three columns, so the if condition has to accommodate them. Of course there are many other columns because of the join.
Please suggest where I am making mistakes.
Thank you
If I understand correctly, that's simply:
fi.select(cond).collect()
The left_anti join already returns the records that do not match (rows that exist in tst_DF but not in refDF).
You can add a distinct before the collect to remove duplicates.
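A minimal sketch of the full check, using the variable names from the question (fi and cond):
# Only collect the unmatched join keys if the anti-join returned any rows
if fi.count() > 0:
    unmatched = fi.select(cond).distinct().collect()
    print(unmatched)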
Did you import the col function?
from pyspark.sql import functions as F
...
if fi.filter(F.col("col1").count() > 0).collect():
    fi.show()

EOF in multi-line string error in PySpark

I was running the following query in PySpark; the SQL query runs fine on Hive.
spark.sql(f"""
create table DEFAULT.TMP_TABLE as
select b.customer_id, prob
from
bi_prod.TMP_score_1st_day a,
(select customer_id, prob from BI_prod.Tbl_brand1_scoring where insert_date = 20230101
union all
select customer_id, prob from BI_prod.Tbl_brand2_scoring where insert_date = 20230101)  b
where a.customer_id = b.customer_id
""")
This produces the following error
ERROR:root:An unexpected error occurred while tokenizing input
The following traceback may be corrupted or invalid
The error message is: ('EOF in multi-line string', (1, 0))
I need to fix this error but can't find out why it is occurring.
I recommend rewriting the code in a more Pythonic way.
from pyspark.sql.functions import col

# a: the scoring base table (no insert_date filter in the original SQL)
df1 = (
    spark.table('bi_prod.TMP_score_1st_day')
    .select('customer_id')
)

# b: union of the two brand scoring tables, each filtered on insert_date
brand1 = (
    spark.table('BI_prod.Tbl_brand1_scoring')
    .filter(col('insert_date') == '20230101')
    .select('customer_id', 'prob')
)
brand2 = (
    spark.table('BI_prod.Tbl_brand2_scoring')
    .filter(col('insert_date') == '20230101')
    .select('customer_id', 'prob')
)
df2 = brand1.unionAll(brand2)

df = df1.join(df2, 'customer_id')
df.show(1, vertical=True)
Let me know how this works for you and if you still get the same error.
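If the result also needs to be materialized as a table, as the original CREATE TABLE ... AS statement did, a minimal sketch (assuming you have write permissions on the DEFAULT database) is:
# Persist the joined result, mirroring the original CREATE TABLE AS
df.write.mode('overwrite').saveAsTable('DEFAULT.TMP_TABLE')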

Using AVG in Spark with window function

I have the following SQL Query:
Select st.Value,
st.Id,
ntile(2) OVER (PARTITION BY St.Id, St.VarId ORDER By St.Sls),
AVG(St.Value) OVER (PARTITION BY St.Id, St.VarId ORDER By St.Sls, St.Date)
FROM table tb
INNER JOIN staging st on St.Id = tb.Id
I've tried to adapt this to Spark/PySpark using window functions; my code is below:
windowSpec_1 = Window.partitionBy("staging.Id", "staging.VarId").orderBy("staging.Sls")
windowSpec_2 = Window.partitionBy("staging.Id", "staging.VarId").orderBy("staging.Sls", "staging.Date")
df = table.join(
    staging,
    on=f.col("staging.Id") == f.col("table.Id"),
    how='inner'
).select(
    f.col("staging.Value"),
    f.ntile(2).over(windowSpec_1),
    f.avg("staging.Value").over(windowSpec_2)
)
However, I'm getting the following error:
pyspark.sql.utils.AnalysisException: Can't extract value from Value#42928: need struct type but got decimal(16,6)
How can I solve this problem? Is it necessary to group the data?
Maybe you forgot to assign an alias to staging (and to table)?:
df = table.alias("table").join(
    staging.alias("staging"),
    on=f.col("staging.Id") == f.col("table.Id"),
    how='inner'
).select(
    f.col("staging.Value"),
    f.ntile(2).over(windowSpec_1),
    f.avg("staging.Value").over(windowSpec_2)
)
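A small follow-up sketch (the output column names here are assumptions, not taken from the question): naming the window expressions with .alias makes the resulting columns easier to reference downstream, since Spark otherwise generates long auto-names for them:
joined = table.alias("table").join(staging.alias("staging"), on=f.col("staging.Id") == f.col("table.Id"), how='inner')
df = joined.select(
    f.col("staging.Value"),
    f.ntile(2).over(windowSpec_1).alias("sls_ntile"),          # hypothetical output name
    f.avg("staging.Value").over(windowSpec_2).alias("avg_value")  # hypothetical output name
)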

Spark dataframe throws error while doing logic for date_add function

I am getting the error below.
I am trying to add days to the current date; basically I want current_date + dayofweek(current_date).
scala> df.withColumn( "week_day", date_add(current_date(), dayofweek(current_date()).cast(IntegerType) )).show(10,false)
<console>:51: error: type mismatch;
found : org.apache.spark.sql.Column
required: Int
df.withColumn( "week_day", date_add(current_date(), dayofweek(current_date()).cast(IntegerType) )).show(10,false)
I have also added the import statements. Please help me understand why it is throwing this error.
df.withColumn("week_day",expr(s"date_add(${current_date()},${dayofweek(current_date()).cast(IntegerType)})"))
should give you the desired output.
You are passing a Column as the second argument (dayofweek(current_date()).cast(IntegerType)) to date_add; it expects a plain integer instead.
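A version-agnostic alternative (an assumption, not part of the original answer) is to push the whole expression into Spark SQL via expr, so the Column-vs-Int mismatch never arises; sketched here in PySpark, though the same expr string works from the Scala API:
from pyspark.sql import functions as F

# Assumes df is a PySpark DataFrame; the whole date arithmetic is evaluated by Spark SQL
df = df.withColumn("week_day", F.expr("date_add(current_date(), dayofweek(current_date()))"))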

Pyspark SQL coalesce data type mismatch with date cast

I am joining two dataframes using a left join.
Rows in the left table may not have a match, so I am trying to set a default value using the coalesce function.
import pyspark.sql.functions as F
joined = t1\
    .join(t2, on="id", how='left')\
    .select(t1["*"],
            F.coalesce(t2.date, F.to_date('2019-01-01')))
I am getting the following error
pyspark.sql.utils.AnalysisException: 'cannot resolve \'CAST(t1.`2019-01-01` AS DATE)\' due to data type mismatch: cannot cast decimal(38,3) to date;;\n\...
I have confirmed that t2.date is in fact a date type. The other t1 columns have the decimal data type seen in the error, so it seems to me that it is trying to cast one of those columns to the date type.
Any help would be greatly appreciated
The date string was interpreted as a column name of t1. You should specify it as a literal column instead.
import pyspark.sql.functions as F
joined = t1\
    .join(t2, on="id", how='left')\
    .select(t1["*"],
            F.coalesce(t2.date, F.to_date(F.lit('2019-01-01')))
    )
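An equivalent spelling (a matter of preference, not something the original answer requires) is to cast the literal to a date instead of using to_date:
import pyspark.sql.functions as F

# Same default value, expressed with an explicit cast on the literal
joined = t1\
    .join(t2, on="id", how='left')\
    .select(t1["*"],
            F.coalesce(t2.date, F.lit('2019-01-01').cast('date')))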
