In C# (.NET for Spark), how do I use the When() method as a condition to add a new column to a DataFrame?

I have some experience in pyspark. While our team is migrating a Spark project from Python to C# (.NET for Spark), I'm encountering problems:
Suppose we have a Spark DataFrame df with an existing column col1.
In pyspark, I could do something like:
from pyspark.sql.functions import when, lit

df = df.withColumn('new_col_name', when((df.col1 <= 5), lit('Group A')) \
    .when((df.col1 > 5) & (df.col1 <= 8), lit('Group B')) \
    .when((df.col1 > 8), lit('Group C')))
The question is how to do the equivalent in C#?
I've tried many things but still get exceptions when using the When() method.
For example, the following code would generate the exception:
df = df.WithColumn("new_col_name", df.Col("col1").When(df.Col("col1").EqualTo(3), Functions.Lit("Group A")));
Exception:
[MD2V4P4C] [Error] [JvmBridge] java.lang.IllegalArgumentException: when() can only be applied on a Column previously generated by when() function
I searched around and didn't find many examples for .NET for Spark. Any help would be much appreciated.

I think the problem is that the first call needs to be the When that is a static function (not a member function on the Column object), and then every call after that (in the same column chain) uses the version of When that is a member function, so:
using Microsoft.Spark.Sql;
using static Microsoft.Spark.Sql.Functions;   // brings When, Col and Lit into scope

var spark = SparkSession.Builder().GetOrCreate();
var df = spark.Range(100);
df.WithColumn("new_col_name",
    When(Col("Id").Lt(5), Lit("Group A"))
        .When(Col("Id").Between(5, 8), Lit("Group B"))
        .When(Col("Id").Gt(8), Lit("Group C"))
).Show();
You can also use the overloaded operators (>, >=, <, etc.) like this, but personally I prefer the more explicit version above:
df.WithColumn("new_col_name",
When(Col("Id") < 5, Lit("Group A"))
.When(Col("Id") >= 5 & Col("Id") <=8 , Lit("Group B"))
.When(Col("Id") > 8, Lit("Group C"))
).Show();

Related

Palantir Foundry spark.sql query

When I attempt to query my input table as a view, I get the error com.palantir.foundry.spark.api.errors.DatasetPathNotFoundException. My code is as follows:
def Median_Product_Revenue_Temp2(Merchant_Segments):
    Merchant_Segments.createOrReplaceTempView('Merchant_Segments_View')
    df = spark.sql('select * from Merchant_Segments_View limit 5')
    return df
I need to dynamically query this table, since I am trying to calculate the median using percentile_approx across numerous fields, and I'm not sure how to do this without using spark.sql.
If I try to avoid using spark.sql to calculate median across numerous fields using something like the below code, it results in the error Missing Transform Attribute: A module object does not have an attribute percentile_approx. Please check the spelling and/or the datatype of the object.
import pyspark.sql.functions as F
exprs = {x: F.percentile_approx(x, 0.5) for x in df.columns if x not in exclusion_list}
df = df.groupBy(['BANK_NAME','BUS_SEGMENT']).agg(exprs)
Try createGlobalTempView. It worked for me.
e.g.:
df.createGlobalTempView("people")
(I don't know the root cause of why the local temp view does not work.)
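One thing worth adding: a global temp view is registered in the system-preserved global_temp database, so it has to be referenced with that prefix when you query it. A minimal sketch, Scala shown here for illustration (the pyspark calls are the same, and df, spark and the people name are just the example above):
// Register the DataFrame as a global temp view, then query it through
// the system-preserved global_temp database.
df.createGlobalTempView("people")
val firstFive = spark.sql("SELECT * FROM global_temp.people LIMIT 5")
firstFive.show()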
I managed to avoid using dynamic sql for calculating median across columns using the following code:
df_result = df.groupBy(group_list).agg(
    *[F.expr('percentile_approx(nullif(' + col + ', 0), 0.5)').alias(col)
      for col in df.columns if col not in exclusion_list]
)
Embedding percentile_approx in an F.expr bypassed the issue I was encountering in the second half of my post.

I am getting results with a SQL query but an error with spark.sql

accountBal.createOrReplaceTempView("accntBal")
var finalDf = spark.sql("""
  SELECT CTC_ID, ACCNT_BAL, PAID_THRU_DT,
         DAYS(CURRENT_DATE) - DAYS(PAID_THRU_DT) AS DEL_DAYS
  FROM accntBal
  WHERE ACCNT_BAL > 0
    AND PAID_THRU_DT <= CURRENT_DATE
    AND PAID_THRU_DT > '01/01/2000'
    AND PAID_THRU_DT IS NOT NULL
""")
org.apache.spark.sql.AnalysisException: Undefined function: 'DAYS'. This function is neither a registered temporary function nor a permanent function registered in the database.
You should be using DATEDIFF to get the difference in days between two dates:
SELECT
    CTC_ID,
    ACCNT_BAL,
    PAID_THRU_DT,
    DATEDIFF(CURRENT_DATE, PAID_THRU_DT) AS DEL_DAYS
FROM accntBal
WHERE
    ACCNT_BAL > 0 AND
    PAID_THRU_DT > '2000-01-01' AND PAID_THRU_DT <= CURRENT_DATE;
Note: The NULL check on PAID_THRU_DT is probably not necessary, since a NULL value would fail the range check already.
In Spark, a UDF has to be registered before it can be used in your queries.
Register a function as a UDF, for example:
val squared = (s: Long) => {
  s * s
}
spark.udf.register("square", squared)
Since you have not registered days, it was throwing this error. I assume you have written a custom UDF to get the number of days between two dates.
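For illustration only, a registration along those lines might look like the sketch below; the days name and the use of epoch-day numbers are assumptions about what the original UDF was meant to do, not something taken from the question.
// Hypothetical stand-in for a DB2-style DAYS() function: the epoch-day number
// of a date, so that DAYS(a) - DAYS(b) yields the difference in days.
// Nulls are not handled in this sketch.
val days = (d: java.sql.Date) => d.toLocalDate.toEpochDay
spark.udf.register("days", days)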
How to debug: to check whether your UDF is among the functions registered with Spark, you can query the available standard and user-defined functions using the Catalog interface (available through the SparkSession.catalog attribute).
val spark: SparkSession = ...
scala> spark.catalog.listFunctions.show(false)
This will display all the functions defined within the Spark session.
Further reading : UDFs — User-Defined Functions
If not, you can use datediff, which is already present in Spark's functions.scala:
static Column datediff(Column end, Column start): Returns the number of days from start to end.
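For completeness, the same built-in is available directly in the DataFrame API, so no UDF registration is needed at all. A minimal sketch using the column names from the question:
import org.apache.spark.sql.functions.{col, current_date, datediff}

// datediff(end, start) returns the number of days from start to end.
val delinquency = accountBal
  .where(col("ACCNT_BAL") > 0 && col("PAID_THRU_DT") <= current_date())
  .withColumn("DEL_DAYS", datediff(current_date(), col("PAID_THRU_DT")))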

Calling UDF on Dataframe with Serialization Issue

I was looking at some blog examples of UDFs that appear to work, but when I run them they give the infamous Task not serializable error.
I find it strange that this was published with no mention of the issue. I'm running Spark 2.4.
The code is pretty straightforward; has something changed in Spark?
import org.apache.spark.sql.functions.{col, udf}

def lowerRemoveAllWhitespace(s: String): String = {
  s.toLowerCase().replaceAll("\\s", "")
}
val lowerRemoveAllWhitespaceUDF = udf[String, String](lowerRemoveAllWhitespace)

val df = sc.parallelize(Seq(
  ("r1 ", 1, 1, 3, -2),
  ("r 2", 6, 4, -2, -2),
  ("r 3", 4, 1, 1, 0),
  ("r4", 1, 2, 4, 5)
)).toDF("ID", "a", "b", "c", "d")

df.select(lowerRemoveAllWhitespaceUDF(col("ID"))).show(false)
returns:
org.apache.spark.SparkException: Task not serializable
From this blog, which I find good: https://medium.com/@mrpowers/spark-user-defined-functions-udfs-6c849e39443b
Something must have changed?
I looked at the top-voted item here with an Object and extends Serializable, but no joy either. Puzzled.
EDIT
Things seem to have changed; this format is needed:
val squared = udf((s: Long) => s * s)
I'm still interested in why the Object approach failed.
I couldn't reproduce the error (tried on spark 1.6, 2.3, and 2.4), but I do remember facing this kind of error (long time ago). I'll put in my best guess.
The problem happens due to the difference between a method and a function in Scala, as described in detail here.
The short version is that when you write def, it is equivalent to a method in Java, i.e. part of a class, and can be invoked using an instance of that class.
When you write udf((s: Long) => s * s), it creates an instance of the trait Function1. For this to happen, an anonymous class implementing Function1 is generated whose apply method is something like def apply(s: Long): Long = { s * s }, and an instance of this class is passed as the parameter to udf.
However, when you write udf[String, String](lowerRemoveAllWhitespace), the method lowerRemoveAllWhitespace needs to be converted to a Function1 instance and passed to udf. This is where the serialization fails, since the apply method on this instance will try to invoke lowerRemoveAllWhitespace on an instance of another object (which could not be serialized and sent to the worker JVM process), causing the exception.
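A minimal sketch of that distinction, reusing the names from the question; the explicit underscore makes the method-to-Function1 conversion (eta-expansion) visible:
import org.apache.spark.sql.functions.udf

// A method: it belongs to the enclosing class/object, so the Function1 wrapper
// generated for it holds a reference to that instance, which then has to be
// serializable for the UDF to be shipped to the executors.
def lowerRemoveAllWhitespace(s: String): String = s.toLowerCase().replaceAll("\\s", "")
val fromMethod = udf(lowerRemoveAllWhitespace _)   // explicit eta-expansion

// A function value: already a standalone Function1 instance, nothing extra captured.
val lowerRemoveAllWhitespaceFn = (s: String) => s.toLowerCase().replaceAll("\\s", "")
val fromFunction = udf(lowerRemoveAllWhitespaceFn)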
The example that was posted was from a reputable source, but I cannot get it to run without a serialization error in Spark 2.4; trying Objects etc. did not help either.
I solved the issue using the udf((...)) approach, which seems to allow only a single expression, and indeed with that there was no serialization error. A slightly different example, though, using primitives:
val sumContributionsPlus = udf((n1: Int, n2: Int, n3: Int, n4: Int) => Seq(n1, n2, n3, n4).foldLeft(0)((acc, a) => if (a > 0) acc + a else acc))
On a final note, the whole discussion of UDFs versus Spark's native column functions is confusing when things no longer appear to work.

Check Spark Dataframe row has ANY column meeting a condition and stop when first such column found

The following code can be used to filter rows that contain a value of 1. Imagine there are a lot of columns.
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.functions.{col, when}

val df = sc.parallelize(Seq(
  ("r1", 1, 1),
  ("r2", 6, 4),
  ("r3", 4, 1),
  ("r4", 1, 2)
)).toDF("ID", "a", "b")

val ones = df.schema.map(c => c.name).drop(1)
  .map(x => when(col(x) === 1, 1).otherwise(0))
  .reduce(_ + _)
df.withColumn("ones", ones).where($"ones" === 0).show
The downside here is that it should ideally stop when the first such condition is met, i.e. at the first column found. OK, we all know that.
But I cannot find an elegant method to achieve this without, presumably, using a UDF or very specific logic. The map will process all columns.
Can a fold(Left) therefore be used that terminates when the first occurrence is found? Or some other approach? Maybe I'm overlooking something.
My first idea was to use logical expressions and hope for short-circuiting, but it seems Spark does not do this:
df
  .withColumn("ones", df.columns.tail
    .map(x => when(col(x) === 1, true).otherwise(false))
    .reduceLeft(_ or _))
  .where(!$"ones")
  .show()
But I'm not sure whether Spark supports short-circuiting here; I think it does not (https://issues.apache.org/jira/browse/SPARK-18712).
So alternatively you can apply a custom function to your rows using the lazy exists on Scala's Seq:
df
  .map { r => (r.getString(0), r.toSeq.tail.exists(c => c.asInstanceOf[Int] == 1)) }
  .toDF("ID", "ones")
  .show()
This approach is similar to a UDF, so I'm not sure if that's what you will accept.

pyspark udf regex_extract - "'NoneType' object has no attribute '_jvm'"

I am experiencing a very odd problem in pyspark. I am using a regex to extract a unix timestamp from a lengthy datetime string (the stored string is unsuitable for direct conversion). This works fine inside a withColumn call:
r= "([0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2})\.([0-9]*)"
latencyhops.select('time') \
.withColumn('TimeSec',f.unix_timestamp(f.regexp_extract('time', r, 1))) \
.show(5,False)
+---------------------------+----------+
|time                       |TimeSec   |
+---------------------------+----------+
|2018-01-22 14:39:00.0743640|1516631940|
|2018-01-23 05:47:34.2797780|1516686454|
|2018-01-23 05:47:34.2797780|1516686454|
|2018-01-23 05:47:34.2797780|1516686454|
|2018-01-24 08:06:29.2989410|1516781189|
+---------------------------+----------+
However when running via a UDF it fails:
from pyspark.sql.functions import udf
from pyspark.sql.types import *

def timeConversion(time):
    return f.unix_timestamp(f.regexp_extract(time, "([0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2})\.([0-9]*)", 1))

to_nano = udf(timeConversion, IntegerType())

latencyhops.select('time') \
    .withColumn('TimeSec', to_nano('time', r, 1)) \
    .show(5, False)
With:
..../pyspark/sql/functions.py", line 1521, in regexp_extract
jc = sc._jvm.functions.regexp_extract(_to_java_column(str), pattern, idx)
AttributeError: 'NoneType' object has no attribute '_jvm'
As far as I can tell these should be doing exactly the same thing. I have tried multiple variants of defining the UDF (lambda expressions etc) but always hit the same error. Does anyone have advice?
Thanks
As @mkaran pointed out, many pyspark functions operate at the column level, so they are unsuitable for use inside a UDF. I went with a .withColumn solution, which worked.
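For reference, the same column-level approach expressed against org.apache.spark.sql.functions looks like the sketch below (Scala here for illustration; the working pyspark version is the first snippet in the question, and latencyhops is assumed from there):
import org.apache.spark.sql.functions.{col, regexp_extract, unix_timestamp}

// regexp_extract and unix_timestamp are column expressions evaluated inside the
// JVM, so they are used directly in withColumn rather than inside a Python UDF.
val pattern = "([0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2})\\.([0-9]*)"
val withSeconds = latencyhops
  .select(col("time"))
  .withColumn("TimeSec", unix_timestamp(regexp_extract(col("time"), pattern, 1)))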
