How do I replace a string value with a NULL in PySpark?

I want to do something like this:
df.replace('empty-value', None, 'NAME')
Basically, I want to replace some value with NULL, but this function does not accept None. How can I do this?

You can combine the when clause with a NULL literal and type casting as follows:
from pyspark.sql.functions import when, lit, col

df = sc.parallelize([(1, "foo"), (2, "bar")]).toDF(["x", "y"])

def replace(column, value):
    # Keep the original value when it differs from `value`, otherwise return NULL
    return when(column != value, column).otherwise(lit(None))

df.withColumn("y", replace(col("y"), "bar")).show()
## +---+----+
## | x| y|
## +---+----+
## | 1| foo|
## | 2|null|
## +---+----+
It doesn't introduce a BatchPythonEvaluation step and, because of that, should be significantly more efficient than using a UDF.
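Applied to the original question, a minimal sketch (assuming your DataFrame has a NAME column, as in the question) would use the same helper:
df.withColumn("NAME", replace(col("NAME"), "empty-value"))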

This will replace empty-value with None in your name column:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
df = sc.parallelize([(1, "empty-value"), (2, "something else")]).toDF(["key", "name"])
new_column_udf = udf(lambda name: None if name == "empty-value" else name, StringType())
new_df = df.withColumn("name", new_column_udf(df.name))
new_df.collect()
Output:
[Row(key=1, name=None), Row(key=2, name=u'something else')]
Because the first parameter of withColumn is the existing column name, the old name column is replaced by the new one generated from the UDF output.
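If you would rather keep the original column, pass a new column name instead (name_cleaned here is just an illustration):
new_df = df.withColumn("name_cleaned", new_column_udf(df.name))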

You could also simply use a dict for the first argument of replace. I tried it and this seems to accept None as an argument.
df = df.replace({'empty-value':None}, subset=['NAME'])
Note that your 'empty-value' needs to be hashable.
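The dict form also lets you map several sentinel strings to null in one call; the extra 'n/a' key below is only an illustration:
df = df.replace({'empty-value': None, 'n/a': None}, subset=['NAME'])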

The best alternative is to use when combined with a NULL. Example:
from pyspark.sql.functions import when, lit, col
df = df.withColumn('foo', when(col('foo') != 'empty-value', col('foo')))
If you want to replace several values with null you can either use | inside the when condition or the powerful create_map function, as sketched below.
It is important to note that the worst way to solve this is with a UDF: UDFs give your code great versatility, but they come with a heavy performance penalty.
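A minimal sketch of the multi-value case (the second sentinel string 'n/a' is hypothetical), using | inside the when condition so that any matching value becomes null:
from pyspark.sql.functions import when, col, lit
df = df.withColumn(
    'foo',
    when((col('foo') == 'empty-value') | (col('foo') == 'n/a'), lit(None)).otherwise(col('foo'))
)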

Related

pyspark getting the field names of a struct datatype inside a udf

I am trying to pass multiple columns to a udf as a StructType (using pyspark.sql.functions.struct()).
Inside this udf I want to get the fields of the struct column that I passed as a list, so that I can iterate over the passed columns for every row.
Basically I am looking for a pyspark version of the scala code provided in this answer - Spark - pass full row to a udf and then get column name inside udf
You can use the same method as in the post you linked, i.e. by using a pyspark.sql.Row. But instead of .schema.fieldNames, you can use .asDict() to convert the Row into a dictionary.
For example, here is a way to iterate over the column names and values simultaneously:
from pyspark.sql.functions import col, struct, udf
df = spark.createDataFrame([(1, 2, 3)], ["a", "b", "c"])
f = udf(lambda row: "; ".join(["=".join(map(str, [k,v])) for k, v in row.asDict().items()]))
df.select(f(struct(*df.columns)).alias("myUdfOutput")).show()
#+-------------+
#| myUdfOutput|
#+-------------+
#|a=1; c=3; b=2|
#+-------------+
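If you just need the field names as a list (which is what the question literally asks for), the same trick applies; names_udf below is only an illustrative helper:
from pyspark.sql.functions import struct, udf
from pyspark.sql.types import ArrayType, StringType

# Return only the field names of the struct that was passed in
names_udf = udf(lambda row: list(row.asDict().keys()), ArrayType(StringType()))
df.select(names_udf(struct(*df.columns)).alias("fieldNames")).show(truncate=False)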
An alternative would be to build a MapType() of column name to value, and pass this to your udf.
from itertools import chain
from pyspark.sql.functions import create_map, lit
f2 = udf(lambda row: "; ".join(["=".join(map(str, [k,v])) for k, v in row.items()]))
df.select(
    f2(
        create_map(
            *chain.from_iterable([(lit(c), col(c)) for c in df.columns])
        )
    ).alias("myNewUdfOutput")
).show()
#+--------------+
#|myNewUdfOutput|
#+--------------+
#| a=1; c=3; b=2|
#+--------------+
This second method is arguably unnecessarily complicated, so the first option is the recommended approach.

How to reverse and combine string columns in a spark dataframe?

I am using pyspark version 2.4 and I am trying to write a udf which takes the values of column id1 and column id2 together and returns the reversed string of their concatenation.
For example, my data looks like:
+---+---+
|id1|id2|
+---+---+
| a|one|
| b|two|
+---+---+
the corresponding code is:
df = spark.createDataFrame([['a', 'one'], ['b', 'two']], ['id1', 'id2'])
The returned value should be
+---+---+----+
|id1|id2| val|
+---+---+----+
| a|one|enoa|
| b|two|owtb|
+---+---+----+
My code is:
@udf(string)
def reverse_value(value):
    return value[::-1]
df.withColumn('val', reverse_value(lit('id1' + 'id2')))
My errors are:
TypeError: Invalid argument, not a string or column: <function
reverse_value at 0x0000010E6D860B70> of type <class 'function'>. For
column literals, use 'lit', 'array', 'struct' or 'create_map'
function.
Should be:
from pyspark.sql.functions import col, concat
df.withColumn('val', reverse_value(concat(col('id1'), col('id2'))))
Explanation:
lit creates a literal value, while you want to refer to individual columns (col).
Columns have to be concatenated using the concat function (Concatenate columns in Apache Spark DataFrame).
Additionally, it is not clear whether the argument of udf is correct. It should be either:
from pyspark.sql.functions import udf
@udf
def reverse_value(value):
    ...
or
@udf("string")
def reverse_value(value):
    ...
or
from pyspark.sql.types import StringType

@udf(StringType())
def reverse_value(value):
    ...
Additionally, the stack trace suggests that you have some other problems in your code, not reproducible with the snippet you've shared - reverse_value seems to return a function.
The answer by @user11669673 explains what's wrong with your code and how to fix the udf. However, you don't need a udf for this.
You will achieve much better performance by using pyspark.sql.functions.reverse:
from pyspark.sql.functions import col, concat, reverse
df.withColumn("val", concat(reverse(col("id2")), col("id1"))).show()
#+---+---+----+
#|id1|id2| val|
#+---+---+----+
#| a|one|enoa|
#| b|two|owtb|
#+---+---+----+

Assigning columns to another columns in a Spark Dataframe using Scala

I was looking at this excellent question and its answer, so as to improve my Scala skills: Extract a column value and assign it to another column as an array in spark dataframe
I created my modified code as follows, which works, but I am left with a few questions:
import spark.implicits._
import org.apache.spark.sql.functions._
val df = sc.parallelize(Seq(
  ("r1", 1, 1),
  ("r2", 6, 4),
  ("r3", 4, 1),
  ("r4", 1, 2)
)).toDF("ID", "a", "b")
val uniqueVal = df.select("b").distinct().map(x => x.getAs[Int](0)).collect.toList
def myfun: Int => List[Int] = _ => uniqueVal
def myfun_udf = udf(myfun)
df.withColumn("X", myfun_udf( col("b") )).show
+---+---+---+---------+
| ID| a| b| X|
+---+---+---+---------+
| r1| 1| 1|[1, 4, 2]|
| r2| 6| 4|[1, 4, 2]|
| r3| 4| 1|[1, 4, 2]|
| r4| 1| 2|[1, 4, 2]|
+---+---+---+---------+
It works, but:
I note b column is put in twice.
I can also put column a into the second statement and I get the same result. E.g., what is the point of that then?
df.withColumn("X", myfun_udf( col("a") )).show
If I put in col ID then it gets null.
So, I am wondering why the second column is passed in at all?
And how this could be made to work generically for all columns?
So, this was code that I looked at elsewhere, but I am missing something.
The code you've shown doesn't make much sense:
It is not scalable - in the worst-case scenario the size of each row is proportional to the size of the whole dataset.
As you've already figured out, it doesn't need an argument at all.
It doesn't need (and, importantly, didn't need) a udf at the time it was written (on 2016-12-23, Spark 1.6 and 2.0 were already released).
If you still wanted to use a udf, the nullary variant would suffice.
Overall it is just another convoluted and misleading answer that served the OP at the time. I'd ignore it (or vote accordingly) and move on.
So how could this be done:
If you have a local list and you really want to use a udf, then for a single sequence use udf with a nullary function:
val uniqueBVal: Seq[Int] = ???
val addUniqueBValCol = udf(() => uniqueBVal)
df.withColumn("X", addUniqueBValCol())
Generalize to:
import scala.reflect.runtime.universe.TypeTag
def addLiteral[T : TypeTag](xs: Seq[T]) = udf(() => xs)
val x = addLiteral[Int](uniqueBVal)
df.withColumn("X", x())
Better, don't use a udf at all:
import org.apache.spark.sql.functions._
df.withColumn("x", array(uniqueBVal map lit: _*))
As for:
And how this could be made to work generically for all columns?
as mentioned at the beginning, the whole concept is hard to defend. You could use either window functions (completely not scalable):
import org.apache.spark.sql.expressions.Window
val w = Window.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
df.select($"*" +: df.columns.map(c => collect_set(c).over(w).alias(s"${c}_unique")): _*)
or a cross join with an aggregate (most of the time not scalable):
val uniqueValues = df.select(
  df.columns map (c => collect_set(col(c)).alias(s"${c}_unique")): _*
)
df.crossJoin(uniqueValues)
In general though, you'll have to rethink your approach if this comes anywhere near actual applications, unless you know for sure that the cardinalities of the columns are small and have strict upper bounds.
The takeaway message is: don't trust random code that random people post on the Internet. This one included.

pyspark dataframe filtering doesn't really remove rows?

My dataframe undergoes two consecutive filtering passes, each using a boolean-valued UDF. The first filter removes all rows whose column values are not present as keys in a broadcast dictionary. The second filter imposes thresholds on the values that this dictionary associates with the present keys.
If I display the result after just the first filtering, the row with 'c' is not in it, as expected. However, attempts to display the result of the second filtering lead to a KeyError exception for u'c'
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql import functions as func
from pyspark.sql.types import BooleanType, StringType

sc = SparkContext()
ss = SparkSession(sc)

mydict = {"a": 4, "b": 6}
mydict_bc = sc.broadcast(mydict)

udf_indict = func.udf(lambda x: x in mydict_bc.value, BooleanType())
udf_bigenough = func.udf(lambda x: mydict_bc.value[x] > 5, BooleanType())

df = ss.createDataFrame(["a", "b", "c"], StringType()).toDF("name")
df1 = df.where(udf_indict('name'))
df1.show()
+----+
|name|
+----+
| a|
| b|
+----+
df1.where( udf_bigenough('name') ).show()
KeyError: u'c'
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
at org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$doExecute$1.apply(BatchEvalPythonExec.scala:144)
at org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$doExecute$1.apply(BatchEvalPythonExec.scala:87)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:797)
...
I guess this has something to do with delayed execution and internal optimization, but is this really an expected behavior?
Thanks
This
My dataframe undergoes two consecutive filtering passes
is an incorrect assumption. Unlike the RDD API, where all transformations are WYSIWYG, the SQL API is purely declarative. It describes what has to be done, but not how. The optimizer can rearrange all elements as it sees fit.
Using the nondeterministic variant will disable these optimizations:
df1 = df.where( udf_indict.asNondeterministic()('name'))
df1.where( udf_bigenough.asNondeterministic()('name') ).show()
but you should really handle exceptions
@udf(BooleanType())
def udf_bigenough(x):
    try:
        # .get returns None for missing keys instead of raising KeyError
        return mydict_bc.value.get(x) > 5
    except TypeError:
        pass
or, better, not use a udf at all, as sketched below.
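A minimal sketch of the UDF-free route, assuming the same mydict and df as above (the lookup DataFrame name is just an illustration): convert the dictionary into a small DataFrame, then use a join plus an ordinary column filter.
lookup = ss.createDataFrame(list(mydict.items()), ["name", "value"])

# The join keeps only names present in the dictionary; the filter applies the threshold
df.join(lookup, "name").where("value > 5").select("name").show()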

Change the timestamp to UTC format in Pyspark

I have an input dataframe (ip_df); the data in this dataframe looks as below:
id timestamp_value
1 2017-08-01T14:30:00+05:30
2 2017-08-01T14:30:00+06:30
3 2017-08-01T14:30:00+07:30
I need to create a new dataframe (op_df), wherein I need to convert the timestamp value to UTC format. So the final output dataframe will look as below:
id timestamp_value
1 2017-08-01T09:00:00+00:00
2 2017-08-01T08:00:00+00:00
3 2017-08-01T07:00:00+00:00
I want to achieve it using PySpark. Can someone please help me with it? Any help will be appreciated.
If you absolutely need the timestamp to be formatted exactly as indicated, namely, with the timezone represented as "+00:00", I think using a UDF as already suggested is your best option.
However, if you can tolerate a slightly different representation of the timezone, e.g. either "+0000" (no colon separator) or "Z", it's possible to do this without a UDF, which may perform significantly better for you depending on the size of your dataset.
Given the following representation of data
+---+-------------------------+
|id |timestamp_value |
+---+-------------------------+
|1 |2017-08-01T14:30:00+05:30|
|2 |2017-08-01T14:30:00+06:30|
|3 |2017-08-01T14:30:00+07:30|
+---+-------------------------+
as given by:
l = [(1, '2017-08-01T14:30:00+05:30'), (2, '2017-08-01T14:30:00+06:30'), (3, '2017-08-01T14:30:00+07:30')]
ip_df = spark.createDataFrame(l, ['id', 'timestamp_value'])
where timestamp_value is a String, you could do the following (this uses to_timestamp and session local timezone support which were introduced in Spark 2.2):
from pyspark.sql.functions import to_timestamp, date_format
spark.conf.set('spark.sql.session.timeZone', 'UTC')
op_df = ip_df.select(
    date_format(
        to_timestamp(ip_df.timestamp_value, "yyyy-MM-dd'T'HH:mm:ssXXX"),
        "yyyy-MM-dd'T'HH:mm:ssZ"
    ).alias('timestamp_value')
)
which yields:
+------------------------+
|timestamp_value |
+------------------------+
|2017-08-01T09:00:00+0000|
|2017-08-01T08:00:00+0000|
|2017-08-01T07:00:00+0000|
+------------------------+
or, slightly differently:
op_df = ip_df.select(
    date_format(
        to_timestamp(ip_df.timestamp_value, "yyyy-MM-dd'T'HH:mm:ssXXX"),
        "yyyy-MM-dd'T'HH:mm:ssXXX"
    ).alias('timestamp_value')
)
which yields:
+--------------------+
|timestamp_value |
+--------------------+
|2017-08-01T09:00:00Z|
|2017-08-01T08:00:00Z|
|2017-08-01T07:00:00Z|
+--------------------+
You can use parser and tz from the dateutil library.
I assume you have Strings and you want a String column:
from dateutil import parser, tz
from pyspark.sql.types import StringType
from pyspark.sql.functions import col, udf
# Create UTC timezone
utc_zone = tz.gettz('UTC')
# Create UDF function that apply on the column
# It takes the String, parse it to a timestamp, convert to UTC, then convert to String again
func = udf(lambda x: parser.parse(x).astimezone(utc_zone).isoformat(), StringType())
# Create new column in your dataset
df = df.withColumn("new_timestamp",func(col("timestamp_value")))
It gives this result:
+---+-------------------------+-------------------------+
|id |timestamp_value |new_timestamp |
+---+-------------------------+-------------------------+
|1 |2017-08-01T14:30:00+05:30|2017-08-01T09:00:00+00:00|
|2 |2017-08-01T14:30:00+06:30|2017-08-01T08:00:00+00:00|
|3 |2017-08-01T14:30:00+07:30|2017-08-01T07:00:00+00:00|
+---+-------------------------+-------------------------+
Finally you can drop and rename:
df = df.drop("timestamp_value").withColumnRenamed("new_timestamp","timestamp_value")
