PySpark UDF with multiple arguments returns null - apache-spark

I have a PySpark Dataframe with two columns (A, B, whose type is double) whose values are either 0.0 or 1.0.
I am trying to add a new column, which is the sum of those two.
I followed examples in Pyspark: Pass multiple columns in UDF
import pyspark.sql.functions as F
from pyspark.sql.types import IntegerType, StringType
sum_cols = F.udf(lambda x: x[0]+x[1], IntegerType())
df_with_sum = df.withColumn('SUM_COL',sum_cols(F.array('A','B')))
df_with_sum.select(['SUM_COL']).toPandas()
This shows a Series of NULLs instead of the results I expect.
I tried any of the following to see if there's an issue with data types
sum_cols = F.udf(lambda x: x[0], IntegerType())
sum_cols = F.udf(lambda x: int(x[0]), IntegerType())
still getting Nulls.
I tried removing the array:
sum_cols = F.udf(lambda x: x, IntegerType())
df_with_sum = df.withColumn('SUM_COL',sum_cols(df.A))
This works fine and shows 0/1
I tried removing the UDF, but leaving the array:
df_with_sum = df.withColumn('SUM_COL', F.array('A','B'))
This works fine and shows a series of arrays of [0.0/1.0, 0.0/1.0]
So, array works fine, UDF works fine, it is just when I try to pass an array to UDF that things break down. What am I doing wrong?

The problem is that you are trying to return a double in a function that is supposed to output an integer, which does not fit, and pyspark by default silently resorts to NULL when a casting fails:
df_with_doubles = spark.createDataFrame([(1.0,1.0), (2.0,2.0)], ['A', 'B'])
sum_cols = F.udf(lambda x: x[0]+x[1], IntegerType())
df_with_sum = df_with_double.withColumn('SUM_COL',sum_cols(F.array('A','B')))
df_with_sum.select(['SUM_COL']).toPandas()
You get:
SUM_COL
0 None
1 None
However, if you do:
df_with_integers = spark.createDataFrame([(1,1), (2,2)], ['A', 'B'])
sum_cols = F.udf(lambda x: x[0]+x[1], IntegerType())
df_with_sum = df_with_integers.withColumn('SUM_COL',sum_cols(F.array('A','B')))
df_with_sum.select(['SUM_COL']).toPandas()
You get:
SUM_COL
0 2
1 4
So, either cast your columns to IntegerType beforehand (or cast them in the UDF), or change the return type of the UDF to DoubleType.

Related

Different outcome from seemingly equivalent implementation of PySpark transformations

I have a set of spark dataframe transforms which gives an out of memory error and has a messed up sql query plan while a different implemetation runs successfully.
%python
import pandas as pd
diction = {
'key': [1,2,3,4,5,6],
'f1' : [1,0,1,0,1,0],
'f2' : [0,1,0,1,0,1],
'f3' : [1,0,1,0,1,0],
'f4' : [0,1,0,1,0,1]}
bil = pd.DataFrame(diction)
# successfull logic
df = spark.createDataFrame(bil)
df = df.cache()
zdf = df
for i in [1,2,3]:
tempdf = zdf.select(['key'])
df = df.join(tempdf,on=['key'],how='left')
df.show()
# failed logic
df = spark.createDataFrame(bil)
df = df.cache()
for i in [1,2,3]:
tempdf = df.select(['key'])
df = df.join(tempdf,on=['key'],how='left')
df.show()
Logically thinking there must not be such a computational difference (more than double the time and memory used).
Can anyone help me understand this ?
DAG of successful logic:
DAG of failure logic:
I'm not sure what your use case is for this code, however the two pieces of code are not logically the same. In the second version you are joining the result of the previous iteration to itself three times. In the first version you are joining a 'copy' of the original df three times. If your key column is not unique, the second piece of code will 'explode' your dataframe more than the first.
To make this more clear we can make a simple example below where we have a non-unique key value. Taking your second example:
df = spark.createDataFrame([(1,'a'), (1,'b'), (3,'c')], ['key','val'])
for i in [1,2,3]:
tempdf = df.select(['key'])
df = df.join(tempdf,on=['key'],how='left')
df.count()
>>> 257
And your first piece of code:
df = spark.createDataFrame([(1,'a'), (1,'b'), (3,'c')], ['key','val'])
zdf = df
for i in [1,2,3]:
tempdf = zdf.select(['key'])
df = df.join(tempdf,on=['key'],how='left')
df.count()
>>> 17

Data type conversion in spark

I have an column id which had type int but later changed to bigint.
It has both types of values.
from pyspark.sql.functions import *
from pyspark.sql.types import *
df = spark.read.parquet('hdfs path')
df = df.select("id", "code")
df=df.withColumn("id1", df["id"].cast(LongType()))
res1=df.select("id1", "code")
res1.show(1, False)
It shows me the data frame but when i try to perform some operations on them
example:
res1.groupBy('code').agg(countDistinct("id1")).show(1, False)
I get Column: [id], Expected: int, Found: INT64
I tried mergeSchema did not work either.
from pyspark.sql.functions import *
from pyspark.sql.types import *
df1 = spark.read.parquet('hdfs path')
df2 = df1.select("id", "code")
df3 = df2.withColumn("id1", df2["id"].cast(LongType()))
res1=df3.select("id1", "code")
res1.show(1, False)
res1.groupBy("code").agg(countDistinct("id1")).show(1, False)
This should work. In spark Dataframes are immutable so you should not assign the value of transformation operation to a same df variable, you should use a different variable name. In scala it would give you compile time error but in python its allowed so you don't notice it.
if you want you could also chain all of your transformation and get a single df variable and perform groupby operation on it as below :
df = spark.read.parquet('hdfs path').select("id", "code").withColumn("id1", col("id").cast(LongType())).select("id1", "code")
df.groupBy("code").agg(countDistinct("id1")).show(1, False)

PySpark: Search For substrings in text and subset dataframe

I am brand new to pyspark and want to translate my existing pandas / python code to PySpark.
I want to subset my dataframe so that only rows that contain specific key words I'm looking for in 'original_problem' field is returned.
Below is the Python code I tried in PySpark:
def pilot_discrep(input_file):
df = input_file
searchfor = ['cat', 'dog', 'frog', 'fleece']
df = df[df['original_problem'].str.contains('|'.join(searchfor))]
return df
When I try to run the above, I get the following error:
AnalysisException: u"Can't extract value from original_problem#207:
need struct type but got string;"
In pyspark, try this:
df = df[df['original_problem'].rlike('|'.join(searchfor))]
Or equivalently:
import pyspark.sql.functions as F
df.where(F.col('original_problem').rlike('|'.join(searchfor)))
Alternatively, you could go for udf:
import pyspark.sql.functions as F
searchfor = ['cat', 'dog', 'frog', 'fleece']
check_udf = F.udf(lambda x: x if x in searchfor else 'Not_present')
df = df.withColumn('check_presence', check_udf(F.col('original_problem')))
df = df.filter(df.check_presence != 'Not_present').drop('check_presence')
But the DataFrame methods are preferred because they will be faster.

convert rdd to dataframe without schema in pyspark

I'm trying to convert an rdd to dataframe with out any schema.
I tried below code. It's working fine, but the dataframe columns are getting shuffled.
def f(x):
d = {}
for i in range(len(x)):
d[str(i)] = x[i]
return d
rdd = sc.textFile("test")
df = rdd.map(lambda x:x.split(",")).map(lambda x :Row(**f(x))).toDF()
df.show()
If you don't want to specify a schema, do not convert use Row in the RDD. If you simply have a normal RDD (not an RDD[Row]) you can use toDF() directly.
df = rdd.map(lambda x: x.split(",")).toDF()
You can give names to the columns using toDF() as well,
df = rdd.map(lambda x: x.split(",")).toDF("col1_name", ..., "colN_name")
If what you have is an RDD[Row] you need to actually know the type of each column. This can be done by specifying a schema or as follows
val df = rdd.map({
case Row(val1: String, ..., valN: Long) => (val1, ..., valN)
}).toDF("col1_name", ..., "colN_name")

Filtering rows in Spark Dataframe based on multiple values in a list [duplicate]

I want to filter a Pyspark DataFrame with a SQL-like IN clause, as in
sc = SparkContext()
sqlc = SQLContext(sc)
df = sqlc.sql('SELECT * from my_df WHERE field1 IN a')
where a is the tuple (1, 2, 3). I am getting this error:
java.lang.RuntimeException: [1.67] failure: ``('' expected but identifier a found
which is basically saying it was expecting something like '(1, 2, 3)' instead of a.
The problem is I can't manually write the values in a as it's extracted from another job.
How would I filter in this case?
String you pass to SQLContext it evaluated in the scope of the SQL environment. It doesn't capture the closure. If you want to pass a variable you'll have to do it explicitly using string formatting:
df = sc.parallelize([(1, "foo"), (2, "x"), (3, "bar")]).toDF(("k", "v"))
df.registerTempTable("df")
sqlContext.sql("SELECT * FROM df WHERE v IN {0}".format(("foo", "bar"))).count()
## 2
Obviously this is not something you would use in a "real" SQL environment due to security considerations but it shouldn't matter here.
In practice DataFrame DSL is a much better choice when you want to create dynamic queries:
from pyspark.sql.functions import col
df.where(col("v").isin({"foo", "bar"})).count()
## 2
It is easy to build and compose and handles all details of HiveQL / Spark SQL for you.
reiterating what #zero323 has mentioned above : we can do the same thing using a list as well (not only set) like below
from pyspark.sql.functions import col
df.where(col("v").isin(["foo", "bar"])).count()
Just a little addition/update:
choice_list = ["foo", "bar", "jack", "joan"]
If you want to filter your dataframe "df", such that you want to keep rows based upon a column "v" taking only the values from choice_list, then
from pyspark.sql.functions import col
df_filtered = df.where( ( col("v").isin (choice_list) ) )
You can also do this for integer columns:
df_filtered = df.filter("field1 in (1,2,3)")
or this for string columns:
df_filtered = df.filter("field1 in ('a','b','c')")
A slightly different approach that worked for me is to filter with a custom filter function.
def filter_func(a):
"""wrapper function to pass a in udf"""
def filter_func_(col):
"""filtering function"""
if col in a.value:
return True
return False
return udf(filter_func_, BooleanType())
# Broadcasting allows to pass large variables efficiently
a = sc.broadcast((1, 2, 3))
df = my_df.filter(filter_func(a)(col('field1'))) \
from pyspark.sql import SparkSession
import pandas as pd
spark=SparkSession.builder.appName('Practise').getOrCreate()
df_pyspark=spark.read.csv('datasets/myData.csv',header=True,inferSchema=True)
df_spark.createOrReplaceTempView("df") # we need to create a Temp table first
spark.sql("SELECT * FROM df where Departments in ('IOT','Big Data') order by Departments").show()

Resources