I am reading a simple parquet file and running a query to find the unmatched rows from the left table. Please see the code snippet below.
argTestData = '<path to parquet file>'
tst_DF = spark.read.option('header', True).parquet(argTestData)
argrefData = '<path to parquet file>'
refDF = spark.read.option('header', True).parquet(argrefData)
cond = ["col1", "col2", "col3"]
fi = tst_DF.join(refDF, cond , "left_anti")
So far things are working. However, as a requirement, I need to get the list of elements if the above gives a count > 0, i.e. if fi.count() > 0, then I need the element names. So I tried the code below, but it throws an error.
if fi.filter(col("col1").count() > 0).collect():
    fi.show()
Error:
TypeError: 'Column' object is not callable
Note:
I have 3 columns as the join condition, which are in a list assigned to the variable cond, and I need to get the unmatched records for those 3 columns, so the if condition has to accommodate them. Of course, there are many other columns because of the join.
Please suggest where I am making mistakes.
Thank you
If I understand correctly, that's simply:
fi.select(cond).collect()
The left_anti join already gets the records which do not match (they exist in tst_DF but not in refDF).
You can add a distinct before the collect to remove duplicates.
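For example, a minimal sketch reusing your tst_DF, refDF and cond (unmatched_keys is just an illustrative variable name; the count() check is only there because you want to act when unmatched rows exist):

# Rows of tst_DF that have no match in refDF on the three join columns
fi = tst_DF.join(refDF, cond, "left_anti")

if fi.count() > 0:
    # Keep only the join columns, drop duplicates, and bring them to the driver
    unmatched_keys = fi.select(cond).distinct().collect()
    print(unmatched_keys)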
Did you import the col function?
from pyspark.sql import functions as F
...
if fi.filter(F.col("col1").count() > 0).collect():
    fi.show()
I have a requirement where I have to filter records from a df if they are present in an array. So I have an array of distinct values from another df's column, like below.
dist_eventCodes = Event_code.select('Value').distinct().collect()
Now I am passing this dist_eventCodes to a filter, like below.
ADT_df_select = ADT_df.filter(ADT_df.eventTypeCode.isin(dist_eventCodes))
When I run this code I get the below error message:
"AttributeError: 'DataFrame' object has no attribute '_get_object_id'"
Can somebody please help me understand what I am doing wrong?
Thanks in advance
If I understood correctly, you want to retain only those rows where eventTypeCode is present in the Value column of the Event_code dataframe.
Let me know if this is not the case.
This can be achieved with a simple left-semi join in Spark. This way you don't need to collect the dataframe, which makes it the right approach in a distributed environment.
ADT_df.alias("df1").join(
    Event_code.select("value").distinct().alias("df2"),
    [F.col("df1.eventTypeCode") == F.col("df2.value")],
    "leftsemi"
)
Or if there is a specific need to use isin, this would work (collect_set will take care of distinct):
dist_eventCodes = (Event_code.select("value")
                   .groupBy(F.lit("dummy"))
                   .agg(F.collect_set("value").alias("value"))
                   .first()
                   .asDict())
ADT_df_select = ADT_df.filter(ADT_df["eventTypeCode"].isin(dist_eventCodes["value"]))
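As a side note (this is just my own sketch, not needed for the approaches above): the original error comes from passing a list of Row objects straight to isin; extracting the plain values from the collected rows first also works.

# Pull the bare values out of the Row objects returned by collect()
dist_eventCodes = [row["Value"] for row in Event_code.select("Value").distinct().collect()]

# isin() accepts a plain Python list of values
ADT_df_select = ADT_df.filter(ADT_df["eventTypeCode"].isin(dist_eventCodes))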
I have a big nested JSON object from which I need to make a DataFrame. One of the inner JSON elements sometimes comes as empty and sometimes comes with some values in it.
I am giving a simple example here:
When it is filled:
{"student_address": {"Door Number":"1234",
"Place":"xxxx",
"Zip code":"12345"}}
When it is empty:
{"student_address":""}
So, in the final DataFrame I should have all three columns: Door Number, Place and Zip code. When the address is empty, I should put null values in the respective columns, and fill them when there is data.
The code I tried:
test = test.withColumn("place", when(col("student_address") == "", lit(None)).otherwise(col("student_address.place"))) \
           .withColumn("door_num", when(col("student_address") == "", lit(None)).otherwise(col("student_address.door_num"))) \
           .withColumn("zip_code", when(col("student_address") == "", lit(None)).otherwise(col("student_address.zip_code")))
So, I am trying to check whether the value is empty or not.
This is the error I am getting:
AnalysisException: Can't extract value from student_address#34: need struct type but got string
I am not able to understand why PySpark evaluates the expression in otherwise even when the condition in when is met. (I tried giving simple values in otherwise instead of the JSON path and it worked.)
I am struggling to understand what is happening here and would like to know if there is any simple way to do this.
val addressSchema = StructType(StructField("Place", StringType, false) :: Nil) // Add more fields
val schema = StructType(StructField("student_address", addressSchema, true) :: Nil) // the point is that student_address is nullable
val df = spark.read.schema(schema).json("example.json")
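Since the question is in PySpark, a rough Python equivalent of the same idea would be the sketch below (field names are taken from the JSON example; "example.json" is a placeholder path):

from pyspark.sql.types import StructType, StructField, StringType

address_schema = StructType([
    StructField("Door Number", StringType(), True),
    StructField("Place", StringType(), True),
    StructField("Zip code", StringType(), True),
])
schema = StructType([StructField("student_address", address_schema, True)])

# With a nullable struct in the schema, an empty student_address cannot be
# parsed as a struct, so the field comes back as null (default permissive mode)
df = spark.read.schema(schema).json("example.json")

# Accessing a field of a null struct yields null, so the empty case is handled
df = (df
      .withColumn("door_num", df["student_address"].getField("Door Number"))
      .withColumn("place", df["student_address"].getField("Place"))
      .withColumn("zip_code", df["student_address"].getField("Zip code")))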
Or use get_json_object on the student_address column (read as a plain string), pointing the JSON path at the field you want, for example:
df.withColumn("place", when(df.student_address == "", lit(None)).otherwise(get_json_object(col("student_address"), "$.Place")))
I have checked similar questions on SO about the SettingWithCopyWarning raised when using .loc, but I still don't understand why I get the warning in the following example.
It appears at line 3; I managed to make it disappear with .copy(), but I would like to understand why .loc didn't work specifically here.
Does making a conditional slice create a view even if it's .loc?
df = pd.DataFrame( data=[0,1,2,3,4,5], columns=['A'])
df.loc[:,'B'] = df.loc[:,'A'].values
dfa = df.loc[df.loc[:,'A'] < 4,:] # here .copy() removes the warning
dfa.loc[:,'C'] = [3,2,1,0]
Edit : pandas version is 1.2.4
dfa = df.loc[df.loc[:,'A'] < 4,:]
dfa is a slice of the df dataframe, still referencing the dataframe: a view. .copy() creates a separate copy, not just a view of the first dataframe.
dfa.loc[:,'C'] = [3,2,1,0]
When it's a view and not a copy, you get the warning: A value is trying to be set on a copy of a slice from a DataFrame.
.loc is locating the rows matching the condition you give it, but the result is still a view that you are setting values on if you don't make a copy of the dataframe.
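Concretely, a small sketch of the fix using the same frame as in the question:

import pandas as pd

df = pd.DataFrame(data=[0, 1, 2, 3, 4, 5], columns=['A'])
df.loc[:, 'B'] = df.loc[:, 'A'].values

# .copy() gives dfa its own data, so adding a column no longer writes
# through a view of df and the SettingWithCopyWarning goes away
dfa = df.loc[df.loc[:, 'A'] < 4, :].copy()
dfa.loc[:, 'C'] = [3, 2, 1, 0]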
I have read a csv file and need to find the correlation between two columns.
I am using df.stat.corr('Age', 'Exp') and the result is 0.7924058156930612.
But I want to have this result stored in another dataframe with the header "correlation".
correlation
0.7924058156930612
Following up on what @gupta_hemant commented.
You can create a new column as
df.withColumn("correlation", df.stat.corr("Age", "Exp").collect()[0].correlation)
(I am guessing the exact syntax here, but it should be something like this)
After reviewing the code, the syntax should be
import pyspark.sql.functions as F
df.withColumn("correlation", F.lit(df.stat.corr("Age", "Exp")))
Try this and let me know.
corrValue = df.stat.corr("Age", "Exp")
newDF = spark.createDataFrame(
    [
        (corrValue,)  # note the trailing comma: a one-element tuple, not a bare float
    ],
    ["corr"]
)
Say I have a dataframe named "orderitems" with the below schema:
DataFrame[order_item_id: int, order_item_order_id: int, order_item_product_id: int, order_item_quantity: int, order_item_subtotal: float, order_item_product_price: float]
So, as part of checking the data quality, I need to ensure all rows satisfy the formula: order_item_subtotal = (order_item_quantity * order_item_product_price).
For this I need to add a separate column named "valid", which should have 'Y' as the value for all rows that satisfy the above formula and 'N' for all other rows.
I have decided to use when() and otherwise() along with the withColumn() method, as below.
orderitems.withColumn("valid",when(orderitems.order_item_subtotal != (orderitems.order_item_product_price * orderitems.order_item_quantity),'N').otherwise("Y"))
But it returns the below error:
TypeError: 'Column' object is not callable
I know this happened because I have tried to multiply two column objects. But I am not sure how to resolve this, since I am still in the learning process with Spark.
I would like to know how to fix this. I am using Spark 2.3.0 with Python.
Try something like this:
from pyspark.sql.functions import col, when

orderitems.withColumn("valid",
    when(col("order_item_subtotal") != (col("order_item_product_price") * col("order_item_quantity")), "N")
    .otherwise("Y")).show()
This can also be implemented through a Spark UDF, which performs the comparison row by row (note that built-in column expressions like the one above are generally faster than Python UDFs).
Before running this code, make sure the columns you are comparing have the same datatype.
from pyspark.sql.functions import udf

def check(subtotal, item_quantity, item_product_price):
    if subtotal == (item_quantity * item_product_price):
        return "Y"
    else:
        return "N"

validate = udf(check)
orderitems = orderitems.withColumn("valid", validate("order_item_subtotal", "order_item_quantity", "order_item_product_price"))
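Not strictly needed here (StringType is the default return type of a Python UDF), but if you prefer to declare the return type explicitly, a small variation of the registration above would be:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Same check function as above, with the return type declared explicitly
validate = udf(check, StringType())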