I have a table with four columns: "ID", "FLAG_A", "FLAG_B" and "FLAG_C".
This is the SQL query I want to translate into PySpark; there are two conditions that must be satisfied before updating both the "FLAG_A" and "FLAG_B" columns. How can I do this in PySpark?
UPDATE STATUS_TABLE SET STATUS_TABLE.[FLAG_A] = "JAVA",
STATUS_TABLE.FLAG_B = "PYTHON"
WHERE (((STATUS_TABLE.[FLAG_A])="PROFESSIONAL_CODERS") AND
((STATUS_TABLE.FLAG_C) Is Null));
Is it possible to do this in a single PySpark statement, applying the two conditions and updating both the "FLAG_A" and "FLAG_B" columns?
I can't think of a way to rewrite this into a single statement myself. I tried running the UPDATE query through Spark, but it seems UPDATE is not supported:
: java.lang.UnsupportedOperationException: UPDATE TABLE is not supported temporarily.
The following does exactly the same as your UPDATE query:
Input:
from pyspark.sql import functions as F
df = spark.createDataFrame(
    [(1, 'PROFESSIONAL_CODERS', 'X', None),
     (2, 'KEEP', 'KEEP', 'KEEP')],
    ['ID', 'FLAG_A', 'FLAG_B', 'FLAG_C'])
Script:
cond = (F.col('FLAG_A') == 'PROFESSIONAL_CODERS') & F.isnull('FLAG_C')
df = df.withColumn('FLAG_B', F.when(cond, 'PYTHON').otherwise(F.col('FLAG_B')))
df = df.withColumn('FLAG_A', F.when(cond, 'JAVA').otherwise(F.col('FLAG_A')))
df.show()
# +---+------+------+------+
# | ID|FLAG_A|FLAG_B|FLAG_C|
# +---+------+------+------+
# |  1|  JAVA|PYTHON|  null|
# |  2|  KEEP|  KEEP|  KEEP|
# +---+------+------+------+
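If you prefer literally one statement, the two withColumn calls above can be folded into a single select (or, on Spark 3.3+, a single withColumns call). A minimal sketch, reusing cond from the script above:
df = df.select(
    'ID',
    F.when(cond, 'JAVA').otherwise(F.col('FLAG_A')).alias('FLAG_A'),
    F.when(cond, 'PYTHON').otherwise(F.col('FLAG_B')).alias('FLAG_B'),
    'FLAG_C')
The condition is evaluated once per row and both flag columns are rewritten together, which matches the semantics of the original UPDATE.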
I have a spark dataframe that has a list of timestamps (partitioned by uid, ordered by timestamp). Now, I'd like to query the dataframe to get either previous or next record.
df = myrdd.toDF().repartition("uid").sort(desc("timestamp"))
df.show()
+------------+-------------------+
|         uid|          timestamp|
+------------+-------------------+
|Peter_Parker|2020-09-19 02:14:40|
|Peter_Parker|2020-09-19 01:07:38|
|Peter_Parker|2020-09-19 00:04:39|
|Peter_Parker|2020-09-18 23:02:36|
|Peter_Parker|2020-09-18 21:58:40|
+------------+-------------------+
So for example if I were to query:
ts=datetime.datetime(2020, 9, 19, 0, 4, 39)
I want to get only the previous record (2020-09-18 23:02:36).
How can I get the previous one?
It's possible to do it using withColumn() and a diff, but is there a smarter, more efficient way of doing that? I really don't need to calculate a diff for ALL events, since the data is already ordered. I just want the prev/next record.
You can use a filter and order by, and then limit the results to 1 row:
df2 = (df.filter("uid = 'Peter_Parker' and timestamp < timestamp('2020-09-19 00:04:39')")
.orderBy('timestamp', ascending=False)
.limit(1)
)
df2.show()
+------------+-------------------+
|         uid|          timestamp|
+------------+-------------------+
|Peter_Parker|2020-09-18 23:02:36|
+------------+-------------------+
Or by using row_number after filtering:
from pyspark.sql import Window
from pyspark.sql import functions as F
df1 = df.filter("timestamp < '2020-09-19 00:04:39'") \
.withColumn("rn", F.row_number().over(Window.orderBy(F.desc("timestamp")))) \
.filter("rn = 1").drop("rn")
df1.show()
#+------------+-------------------+
#|         uid|          timestamp|
#+------------+-------------------+
#|Peter_Parker|2020-09-18 23:02:36|
#+------------+-------------------+
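The same filter-and-limit pattern also gives the next record: flip the comparison to > and sort ascending. A sketch with the same hard-coded uid and timestamp:
df_next = (df.filter("uid = 'Peter_Parker' and timestamp > timestamp('2020-09-19 00:04:39')")
    .orderBy('timestamp', ascending=True)
    .limit(1)
)
df_next.show()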
I'm trying to incorporate a Try().getOrElse() statement in my select statement for a Spark DataFrame. The project I'm working on is going to be applied to multiple environments; however, each environment differs slightly in the naming of the raw data for ONLY one field. I do not want to write several different functions to handle each different field. Is there an elegant way to handle exceptions like the one below in a DataFrame select statement?
val dfFilter = dfRaw
.select(
Try($"some.field.nameOption1).getOrElse($"some.field.nameOption2"),
$"some.field.abc",
$"some.field.def"
)
dfFilter.show(33, false)
However, I keep getting the following error, which makes sense because the field does not exist in this environment's raw data, but I'd expect the getOrElse statement to catch that exception.
org.apache.spark.sql.AnalysisException: No such struct field nameOption1 in...
Is there a good way to handle exceptions in Scala Spark for select statements? Or will I need to code up different functions for each case?
val selectedColumns = if (dfRaw.columns.contains("some.field.nameOption1")) $"some.field.nameOption1" else $"some.field.nameOption2"
val dfFilter = dfRaw
.select(selectedColumns, ...)
So I'm revisiting this question after a year. I believe this solution is much more elegant to implement. Please let me know what others think:
import scala.util.Try
import spark.implicits._

// Generate a fake DataFrame
val df = Seq(
("1234", "A", "AAA"),
("1134", "B", "BBB"),
("2353", "C", "CCC")
).toDF("id", "name", "nameAlt")
// Extract the column names
val columns = df.columns
// Add a "new" column name that is NOT present in the above DataFrame
val columnsAdd = columns ++ Array("someNewColumn")
// Let's then "try" to select all of the columns
df.select(columnsAdd.flatMap(c => Try(df(c)).toOption): _*).show(false)
// Let's reduce the DF again...should yield the same results
val dfNew = df.select("id", "name")
dfNew.select(columnsAdd.flatMap(c => Try(dfNew(c)).toOption): _*).show(false)
// Results
columns: Array[String] = Array(id, name, nameAlt)
columnsAdd: Array[String] = Array(id, name, nameAlt, someNewColumn)
+----+----+-------+
|id  |name|nameAlt|
+----+----+-------+
|1234|A   |AAA    |
|1134|B   |BBB    |
|2353|C   |CCC    |
+----+----+-------+
dfNew: org.apache.spark.sql.DataFrame = [id: string, name: string]
+----+----+
|id  |name|
+----+----+
|1234|A   |
|1134|B   |
|2353|C   |
+----+----+
Suppose I have a spark dataframe df with some columns (id,...) and a string sqlFilter with a SQL filter, e.g. "id is not null".
I want to filter the dataframe df based on sqlFilter, i.e.
val filtered = df.filter(sqlFilter)
Now, I want to have a list of 10 ids from df that were removed by the filter.
Currently, I'm using a "leftanti" join to achieve this, i.e.
val examples = df.select("id").join(filtered.select("id"), Seq("id"), "leftanti")
.take(10)
.map(row => Option(row.get(0)) match { case None => "null" case Some(x) => x.toString})
However, this is really slow.
My guess is that this can be implemented faster, because Spark only has to keep a list for every partition and add an id to the list whenever the filter removes a row and the list contains fewer than 10 elements. Once the action after the filter finishes, Spark has to collect the lists from the partitions until it has 10 ids.
I wanted to use accumulators as described here,
but I failed because I could not find out how to parse and use sqlFilter.
Does anybody have an idea how I can improve the performance?
Update
Ramesh Maharjan suggested in the comments to invert the SQL query, i.e.
df.filter(s"NOT ($filterString)")
.select(key)
.take(10)
.map(row => Option(row.get(0)) match { case None => "null" case Some(x) => x.toString})
This indeed improves the performance, but it is not 100% equivalent.
If there are multiple rows with the same id, the id will end up in the examples if one of those rows is removed by the filter. With the leftanti join it does not end up in the examples, because the id is still present in filtered.
However, that is fine with me.
I'm still interested if it is possible to create the list "on the fly" with accumulators or something similar.
Update 2
Another issue with inverting the filter is the logical value UNKNOWN in SQL: NOT UNKNOWN = UNKNOWN, i.e. NOT(null <> 1) <=> UNKNOWN, and hence such a row shows up neither in the filtered dataframe nor in the inverted dataframe.
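To make this concrete, here is a quick illustration of the three-valued logic (sketched in PySpark for brevity, although the snippets in this question are Scala); the row with a NULL id survives neither the filter nor its negation:
df = spark.createDataFrame([(None,), (1,), (2,)], ['id'])
df.filter('id <> 1').show()        # keeps only id = 2; the NULL row evaluates to UNKNOWN and is dropped
df.filter('NOT (id <> 1)').show()  # keeps only id = 1; NOT UNKNOWN is still UNKNOWN, so NULL is dropped again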
You can use a custom accumulator (because longAccumulator won't help you, as all the ids will be null), and you must formulate your filter statement as a function:
Suppose you have a dataframe:
+----+--------+
|  id|    name|
+----+--------+
|   1|record 1|
|null|record 2|
|   3|record 3|
+----+--------+
Then you could do:
import org.apache.spark.sql.Row
import org.apache.spark.util.AccumulatorV2
class RowAccumulator(var value: Seq[Row]) extends AccumulatorV2[Row, Seq[Row]] {
def this() = this(Seq.empty[Row])
override def isZero: Boolean = value.isEmpty
override def copy(): AccumulatorV2[Row, Seq[Row]] = new RowAccumulator(value)
override def reset(): Unit = value = Seq.empty[Row]
override def add(v: Row): Unit = value = value :+ v
override def merge(other: AccumulatorV2[Row, Seq[Row]]): Unit = value = value ++ other.value
}
val filteredAccum = new RowAccumulator()
ss.sparkContext.register(filteredAccum, "Filter Accum")
val filterIdIsNotNull = (r:Row) => {
if(r.isNullAt(r.fieldIndex("id"))) {
filteredAccum.add(r)
false
} else {
true
}}
df
.filter(filterIdIsNotNull)
.show()
println(filteredAccum.value)
gives
+---+--------+
| id|    name|
+---+--------+
|  1|record 1|
|  3|record 3|
+---+--------+
List([null,record 2])
But personally I would not do this; I would rather do something like you've already suggested:
val dfWithFilter = df
.withColumn("keep",expr("id is not null"))
.cache() // check whether caching is feasible
// show 10 records which we do not keep
dfWithFilter.filter(!$"keep").drop($"keep").show(10) // or use take(10)
+----+--------+
|  id|    name|
+----+--------+
|null|record 2|
+----+--------+
// rows that we keep
val filteredDf = dfWithFilter.filter($"keep").drop($"keep")
My dataframe undergoes two consecutive filtering passes, each using a boolean-valued UDF. The first filter removes all rows whose column value is not present as a key in some broadcast dictionary. The second filter imposes thresholds on the values that this dictionary associates with the present keys.
If I display the result after just the first filtering, the row with 'c' is not in it, as expected. However, attempts to display the result of the second filtering lead to a KeyError exception for u'c'.
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql import functions as func
from pyspark.sql.types import BooleanType, StringType

sc = SparkContext()
ss = SparkSession(sc)
mydict={ "a" : 4, "b" : 6 }
mydict_bc = sc.broadcast(mydict)
udf_indict=func.udf( lambda x: x in mydict_bc.value, BooleanType() )
udf_bigenough=func.udf( lambda x: mydict_bc.value[x] > 5, BooleanType() )
df=ss.createDataFrame([ "a", "b", "c" ], StringType() ).toDF("name")
df1 = df.where( udf_indict('name') )
df1.show()
+----+
|name|
+----+
|   a|
|   b|
+----+
df1.where( udf_bigenough('name') ).show()
KeyError: u'c'
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
at org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$doExecute$1.apply(BatchEvalPythonExec.scala:144)
at org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$doExecute$1.apply(BatchEvalPythonExec.scala:87)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:797)
...
I guess this has something to do with delayed execution and internal optimization, but is this really expected behavior?
Thanks
This
My dataframe undergoes two consecutive filtering passes
is an incorrect assumption. Unlike RDDs, where all transformations are WYSIWYG, the SQL API is purely declarative. It explains what has to be done, but not how. The optimizer can rearrange all elements as it sees fit.
Using the nondeterministic variant will disable these optimizations:
df1 = df.where( udf_indict.asNondeterministic()('name'))
df1.where( udf_bigenough.asNondeterministic()('name') ).show()
but you should really handle exceptions
@func.udf(BooleanType())
def udf_bigenough(x):
    try:
        return mydict_bc.value.get(x) > 5
    except TypeError:
        pass
or better, not use a UDF at all.
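A UDF-free sketch (reusing mydict, df and ss from the question above): turn the dictionary into a small DataFrame and join, so the membership check and the threshold are applied together and a missing key can never raise:
dict_df = ss.createDataFrame(list(mydict.items()), ['name', 'value'])
df.join(dict_df, on='name', how='inner') \
  .where(func.col('value') > 5) \
  .select('name') \
  .show()
A broadcast join hint (or a literal mapping built with func.create_map from mydict) achieves the same result without shipping any Python function to the executors.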
I want to do something like this:
df.replace('empty-value', None, 'NAME')
Basically, I want to replace some value with NULL, but it does not accept None in this function. How can I do this?
You can combine a when clause with a NULL literal and type casting as follows:
from pyspark.sql.functions import when, lit, col
df = sc.parallelize([(1, "foo"), (2, "bar")]).toDF(["x", "y"])
def replace(column, value):
return when(column != value, column).otherwise(lit(None))
df.withColumn("y", replace(col("y"), "bar")).show()
## +---+----+
## |  x|   y|
## +---+----+
## |  1| foo|
## |  2|null|
## +---+----+
It doesn't introduce BatchPythonEvaluation and, because of that, should be significantly more efficient than using a UDF.
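You can see this in the physical plan: the when-based version compiles to a plain CASE WHEN projection with no Python evaluation operator (called BatchEvalPython in more recent Spark versions), while a UDF-based version adds one. A quick check:
df.withColumn("y", replace(col("y"), "bar")).explain()
# the plan contains only a Project with CASE WHEN ... END and no Python evaluation node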
This will replace empty-value with None in your name column:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
df = sc.parallelize([(1, "empty-value"), (2, "something else")]).toDF(["key", "name"])
new_column_udf = udf(lambda name: None if name == "empty-value" else name, StringType())
new_df = df.withColumn("name", new_column_udf(df.name))
new_df.collect()
Output:
[Row(key=1, name=None), Row(key=2, name=u'something else')]
By using the old name as the first parameter in withColumn, it actually replaces the old name column with the new one generated by the UDF output.
You could also simply use a dict for the first argument of replace. I tried it and this seems to accept None as an argument.
df = df.replace({'empty-value':None}, subset=['NAME'])
Note that your 'empty-value' needs to be hashable.
The best alternative is to use when combined with a NULL literal. Example:
from pyspark.sql.functions import when, lit, col
df = df.withColumn('foo', when(col('foo') != 'empty-value', col('foo')))
If you want to replace several values with null, you can either use | inside the when condition or the powerful create_map function.
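For instance, to null out two placeholder strings at once (a sketch; 'empty-value' and 'N/A' are just example sentinels):
from pyspark.sql.functions import when, lit, col
df = df.withColumn('foo',
    when((col('foo') == 'empty-value') | (col('foo') == 'N/A'), lit(None))
    .otherwise(col('foo')))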
It is important to note that the worst way to solve it is with a UDF. UDFs give your code great versatility, but they come with a huge performance penalty.