Say I have a DataFrame named "orderitems" with the below schema:
DataFrame[order_item_id: int, order_item_order_id: int, order_item_product_id: int, order_item_quantity: int, order_item_subtotal: float, order_item_product_price: float]
As part of checking the data quality, I need to ensure that all rows satisfy the formula: order_item_subtotal = (order_item_quantity * order_item_product_price).
For this I need to add a separate column named "valid", which should have the value 'Y' for all rows that satisfy the above formula and 'N' for all other rows.
I have decided to use when() and otherwise() along with the withColumn() method, as below.
orderitems.withColumn("valid",when(orderitems.order_item_subtotal != (orderitems.order_item_product_price * orderitems.order_item_quantity),'N').otherwise("Y"))
But it returns the error below:
TypeError: 'Column' object is not callable
I know this happened because I tried to multiply two Column objects, but I am not sure how to resolve it since I am still learning Spark.
I would like to know how to fix this. I am using Spark 2.3.0 with Python.
Try something like this:
from pyspark.sql.functions import col, when

orderitems.withColumn("valid",
    when(col("order_item_subtotal") != (col("order_item_product_price") * col("order_item_quantity")), "N")
    .otherwise("Y")).show()
This can also be implemented with a Spark UDF, which applies a plain Python function to each row (note that Python UDFs are generally slower than built-in column expressions, since each row has to be passed through Python).
Before running this code, make sure the values you compare have the same data type.
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def check(subtotal, item_quantity, item_product_price):
    if subtotal == (item_quantity * item_product_price):
        return "Y"
    else:
        return "N"

validate = udf(check, StringType())
orderitems = orderitems.withColumn("valid", validate("order_item_subtotal", "order_item_quantity", "order_item_product_price"))
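A caveat worth noting here: order_item_subtotal and order_item_product_price are floats, so an exact equality check can flag rows as invalid purely because of floating-point rounding. A minimal sketch of a tolerance-based comparison (the 0.001 threshold is an arbitrary choice, not from the original question):
from pyspark.sql.functions import abs as abs_, col, when

tolerance = 0.001  # arbitrary threshold; adjust to your data
orderitems.withColumn(
    "valid",
    when(abs_(col("order_item_subtotal") - col("order_item_product_price") * col("order_item_quantity")) > tolerance, "N")
    .otherwise("Y"))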
I have a PySpark UDF which returns a list of weeks. yw_list contains a list of weeks like 202001, 202002, ..., 202048, etc.
def Week_generator(week, no_of_weeks):
    end_index = yw_list.index(week)
    start_index = end_index - no_of_weeks + 1
    return yw_list[start_index:end_index + 1]
spark.udf.register("Week_generator", Week_generator)
When I call this UDF on my Spark SQL DataFrame, the result is stored as a string instead of a list. Because of this I am not able to iterate over the values in the list.
spark.sql(""" select Week_generator('some week column', 4) as col1 from xyz""")
Output Schema: col1:String
Any idea or suggestion on how to resolve this?
As pointed out by Suresh, I missed specifying the return type, so the UDF defaulted to returning a string.
from pyspark.sql.types import ArrayType, StringType

spark.udf.register("Week_generator", Week_generator, ArrayType(StringType()))
This solved my issue.
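For illustration, once the return type is declared as an array, the resulting column behaves as a list and can, for example, be exploded into one row per week (the table name xyz is from the question; the column name week_col is a placeholder):
from pyspark.sql.functions import explode

weeks = spark.sql("select Week_generator(week_col, 4) as col1 from xyz")
weeks.select(explode("col1").alias("week")).show()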
I'm new to Spark, and I'm trying to figure out how I can add a column to a DataFrame whose value is fetched from a HashMap, where the key is the value of another column on the same row.
For example, I have a map defined as follows:
var myMap: Map<Integer,Integer> = generateMap();
I want to add a new column to my DataFrame where its value is fetched from this map, with the key a current column value. A solution might look like this:
val newDataFrame = dataFrame.withColumn("NEW_COLUMN", lit(myMap.get(col("EXISTING_COLUMN"))))
My issue with this code is that using the col function doesn't return a type of Int, like the keys in my HashMap.
Any suggestions?
I would create a dataframe from the map. Then do a join operation. It should be faster and can be reused.
A UDF (user-defined function) can also be used but they are black boxes to Catalyst, so I would be prudent in using them. Depending on where the content of the map is, it may also be complicated to pass it to a UDF.
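For illustration, a minimal sketch of that join-based approach, written in PySpark syntax for brevity (the same pattern applies with the Scala API); myMap and the column names are taken from the question:
# Build a two-column lookup DataFrame from the map, then left-join it on the key column.
lookup = spark.createDataFrame(list(myMap.items()), ["EXISTING_COLUMN", "NEW_COLUMN"])
newDataFrame = dataFrame.join(lookup, on="EXISTING_COLUMN", how="left")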
As of the next version of the Kotlin API for Apache Spark, you will be able to simply create a UDF, which will be usable in almost this way:
val mapUDF by udf { input: Int -> myMap[input] }
dataFrame.withColumn("NEW_COLUMN", mapUDF(col("EXISTING_COLUMN")))
You need to use a UDF:
val mapUDF = udf((i: Int) => myMap.getOrElse(i, 0))
val newDataFrame = dataFrame.withColumn("NEW_COLUMN", mapUDF(col("EXISTING_COLUMN")))
I have a PySpark DataFrame consisting of three columns, whose structure is as below.
In[1]: df.take(1)
Out[1]:
[Row(angle_est=-0.006815859163590619, rwsep_est=0.00019571401752467945, cost_est=34.33651951754235)]
What I want to do is to retrieve each value of the first column (angle_est), and pass it as parameter xMisallignment to a defined function to set a particular property of a class object. The defined function is:
def setMisAllignment(self, xMisallignment):
    if np.abs(xMisallignment) > 0.8:
        warnings.warn('You might set misallignment angle too large.')
    self.MisAllignment = xMisallignment
I am trying to select the first column, convert it to an RDD, and apply the above function inside map(), but it does not seem to work; MisAllignment does not change.
df.select(df.angle_est).rdd.map(lambda row: model0.setMisAllignment(row))
In[2]: model0.MisAllignment
Out[2]: 0.00111511718224
Does anyone have ideas to help me make that function work? Thanks in advance!
You can register your function as a Spark UDF, similar to the following:
spark.udf.register("misallign", setMisAllignment)
You can find many examples of creating and registering UDFs in this test suite:
https://github.com/apache/spark/blob/master/sql/core/src/test/java/test/org/apache/spark/sql/JavaUDFSuite.java
Hope this answers your question.
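A note on why the original map() call appears to do nothing: RDD transformations are lazy and, once triggered, run on the executors, so they mutate copies of model0 on the workers rather than the object on the driver. A minimal sketch of one way around this, collecting the values back to the driver and calling the setter there (model0 and the column name come from the question):
angles = [row["angle_est"] for row in df.select("angle_est").collect()]
for angle in angles:
    model0.setMisAllignment(angle)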
I'm trying to calculate the average for each column in a dataframe and subtract from each element in the column. I've created a function that attempts to do that, but when I try to implement it using a UDF, I get an error: 'float' object has no attribute 'map'. Any ideas on how I can create such a function? Thanks!
def normalize(data):
    average = data.map(lambda x: x[0]).sum() / data.count()
    out = data.map(lambda x: (x - average))
    return out

mapSTD = udf(normalize, IntegerType())
dats = data.withColumn('Normalized', mapSTD('Fare'))
In your example the problem is that a UDF cannot be applied to the whole DataFrame: a UDF only ever sees a single row, so it cannot compute a column average. Spark does, however, also support UDAFs (User Defined Aggregate Functions), which work across all rows.
To solve your problem you can use the function below:
from pyspark.sql.functions import mean

def normalize(df, column):
    average = df.agg(mean(df[column]).alias("mean")).collect()[0]["mean"]
    return df.select(df[column] - average)
Use it like this:
normalize(df, "Fare")
Please note that the above only works on a single column, but it is possible to implement something more generic:
def normalize(df, columns):
    selectExpr = []
    for column in columns:
        average = df.agg(mean(df[column]).alias("mean")).collect()[0]["mean"]
        selectExpr.append(df[column] - average)
    return df.select(selectExpr)
use it like:
normalize(df, ["col1", "col2"])
This works, but you need to run an aggregation for each column, so with many columns performance could be an issue. It is possible, though, to generate only one aggregate expression:
def normalize(df, columns):
    aggExpr = []
    for column in columns:
        aggExpr.append(mean(df[column]).alias(column))
    averages = df.agg(*aggExpr).collect()[0]
    selectExpr = []
    for column in columns:
        selectExpr.append(df[column] - averages[column])
    return df.select(selectExpr)
Adding onto Piotr's answer: if you need to keep the existing DataFrame and add the normalized columns with aliases, the function can be modified as follows:
def normalize(df, columns):
    aggExpr = []
    for column in columns:
        aggExpr.append(mean(df[column]).alias(column))
    averages = df.agg(*aggExpr).collect()[0]
    selectExpr = ['*']
    for column in columns:
        selectExpr.append((df[column] - averages[column]).alias('normalized_' + column))
    return df.select(selectExpr)
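For illustration, a quick usage sketch on the 'Fare' column from the original question ('Age' is a hypothetical second column):
normalized_df = normalize(data, ["Fare", "Age"])
normalized_df.show()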
I need to use a UDF in Spark that takes a timestamp, an Integer and another dataframe and returns a tuple of 3 values.
I keep hitting error after error and I'm not sure I'm trying to fix it right anymore.
Here is the function:
def determine_price(view_date: org.apache.spark.sql.types.TimestampType, product_id: Int, price_df: org.apache.spark.sql.DataFrame): (Double, java.sql.Timestamp, Double) = {
  var price_df_filtered = price_df.filter($"mkt_product_id" === product_id && $"created" <= view_date)
  var price_df_joined = price_df_filtered.groupBy("mkt_product_id").agg("view_price" -> "min", "created" -> "max").withColumn("last_view_price_change", lit(1))
  var price_df_final = price_df_joined.join(price_df_filtered, price_df_joined("max(created)") === price_df_filtered("created")).filter($"last_view_price_change" === 1)
  var result = (price_df_final.select("view_price").head().getDouble(0), price_df_final.select("created").head().getTimestamp(0), price_df_final.select("min(view_price)").head().getDouble(0))
  return result
}
val det_price_udf = udf(determine_price)
The error it gives me is:
error: missing argument list for method determine_price
Unapplied methods are only converted to functions when a function type is expected.
You can make this conversion explicit by writing `determine_price _` or `determine_price(_,_,_)` instead of `determine_price`.
If I start adding the arguments I keep running into other errors, such as "Int expected Int.type found" or "object DataFrame is not a member of package org.apache.spark.sql".
To give some context:
The idea is that I have a dataframe of prices, a product id and a date of creation and another dataframe containing product IDs and view dates.
I need to determine the price based on which was the last created price entry that is older than the view date.
Since each product ID has multiple view dates in the second dataframe, I thought a UDF would be faster than a cross join. If anyone has a different idea, I'd be grateful.
You cannot pass a DataFrame into a UDF, because the UDF runs on the workers, on a particular partition. And just as you cannot use an RDD on a worker (see "Is it possible to create nested RDDs in Apache Spark?"), you cannot use a DataFrame on a worker either.
You need to work around this!
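For illustration, one common workaround is to express the per-row price lookup as a join plus aggregation instead of a UDF. A rough sketch, written in PySpark syntax for brevity (the same chain exists in the Scala DataFrame API); view_df is a hypothetical name for the DataFrame of product IDs and view dates, while the other column names come from the question:
from pyspark.sql import functions as F

# Keep only price rows created before each view, then per (product, view_date)
# pick the latest such price and the minimum price seen up to that point.
candidates = (view_df.join(price_df, "mkt_product_id")
              .where(F.col("created") <= F.col("view_date")))
latest = (candidates.groupBy("mkt_product_id", "view_date")
          .agg(F.max("created").alias("created"),
               F.min("view_price").alias("min_view_price")))
result = latest.join(price_df, ["mkt_product_id", "created"])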