Is it possible to perform a set of operations on a dataframe (adding new columns, replacing some existing values, etc.) without failing fast on the first bad row, but instead performing the full transformation and separately returning the rows that were processed with errors?
Example:
It's more like pseudocode, but the idea should be clear:
df.withColumn('PRICE_AS_NUM', to_num(df["PRICE_AS_STR"]))
to_num is my custom function that transforms a string into a number.
Assuming I have some records where the price can't be cast to a number, I want to get those records in a separate dataframe.
I see one approach, but it would make the code a little ugly (and not very efficient):
do a filter with a try/catch - if an exception happens, filter those records into a separate df.
What if I have many such transformations... Is there a better way?
I think one approach would be to wrap your transformation in a try/except function that returns a boolean, then use when() and otherwise() to filter on that boolean. For example:
from pyspark.sql.functions import udf, when
from pyspark.sql.types import BooleanType

# Register the wrapper as a UDF so it can be used inside when();
# to_num is assumed to be your string-to-number conversion.
@udf(BooleanType())
def to_num_wrapper(value):
    try:
        to_num(value)
        return True
    except:
        return False

df.withColumn('PRICE_AS_NUM',
              when(to_num_wrapper(df["PRICE_AS_STR"]),
                   to_num(df["PRICE_AS_STR"]))
              .otherwise('FAILED'))
Then you can filter on the columns where the value is 'FAILED'.
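For instance, a minimal sketch of that final split, assuming the result of the withColumn call above has been assigned back to a variable (here called df2, which is not in the original snippet):

df2 = df.withColumn('PRICE_AS_NUM', ...)  # as built above
failed_df = df2.filter(df2["PRICE_AS_NUM"] == "FAILED")  # rows that could not be converted
clean_df = df2.filter(df2["PRICE_AS_NUM"] != "FAILED")   # rows that converted successfully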
Preferred option
Always prefer built-in SQL functions over UDFs. They are safe to execute and much faster than a Python UDF. As a bonus, they follow SQL semantics: if there is a problem with a value, the output is NULL (undefined).
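For example, a plain cast already behaves this way; this is only a sketch using the column names from the question:

from pyspark.sql.functions import col

df = spark.createDataFrame([("1.123",), ("foo",)], ["PRICE_AS_STR"])

# The built-in cast follows SQL semantics: values that cannot be parsed
# become NULL instead of failing the whole job.
df.withColumn("PRICE_AS_NUM", col("PRICE_AS_STR").cast("double")).show()
# +------------+------------+
# |PRICE_AS_STR|PRICE_AS_NUM|
# +------------+------------+
# |       1.123|       1.123|
# |         foo|        null|
# +------------+------------+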
If you go with UDF
Follow the same approach as built-in functions.
from pyspark.sql.functions import udf

def safe_udf(f, dtype):
    def _(*args):
        try:
            return f(*args)
        except:
            pass
    return udf(_, dtype)
to_num_wrapper = safe_udf(lambda x: float(x), "float")
df = spark.createDataFrame([("1.123", ), ("foo", )], ["str"])
df.withColumn("num", to_num_wrapper("str")).show()
# +-----+-----+
# | str| num|
# +-----+-----+
# |1.123|1.123|
# | foo| null|
# +-----+-----+
While swallowing exceptions might be counter-intuitive, it is just a matter of following SQL conventions.
No matter which one you choose:
Once you go with one of the above, handling malformed data is just a matter of applying DataFrameNaFunctions (.na.drop, .na.replace).
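For instance, a minimal sketch with the safe_udf output from above, where malformed rows end up with a NULL in num:

converted = df.withColumn("num", to_num_wrapper("str"))

good = converted.na.drop(subset=["num"])           # rows where the conversion succeeded
bad = converted.filter(converted["num"].isNull())  # rows where it failed - the separate dataframe asked for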
Related
I am trying to understand the order of predicate evaluation in Spark SQL in order to increase performance of a query.
Let's say I have the following query
"select * from tbl where pred1 and pred2"
and let's say that none of the predicates qualify as pushdown filters (for simplification).
Also let's assume that pred1 is computationally much more complex than pred2 (say, regex pattern matching vs negation).
Is there any way to verify that Spark will evaluate pred2 before pred1?
Is this deterministic?
Is this controllable?
Is there any way to see the final execution plan?
General
Good question.
This answer is inferred by testing a scenario and making deductions, as I could not find suitable docs. It is a second attempt, since all sorts of statements on the web could not be backed up.
This question, I think, is not about AQE Spark 3.x aspects, but about, say, a dataframe as part of Stage N of a Spark App that has passed the stage of acquiring data from sources at rest and is subject to filtering with multiple predicates being applied.
Then the central point is: does it matter how the predicates are ordered, or does Spark (Catalyst) re-order the predicates to minimize the work to be done?
The premise here is that filtering the maximum amount of data out first makes more sense than evaluating a predicate that filters very
little out.
This is a well-known RDBMS point referring to sargable predicates (subject to evolution of definition over time).
A lot of the discussion focused on indexes, Spark, Hive do not have this, but DF's are columnar.
Point 1
You can try, in %sql:
EXPLAIN EXTENDED select k, sum(v) from values (1, 2), (1, 3) t(k, v) group by k;
From this you can see what's going on if there is re-arranging of
predicates, but I saw no such aspects in the Physical Plan in non-AQE
mode on Databricks. Refer to
https://docs.databricks.com/sql/language-manual/sql-ref-syntax-qry-explain.html.
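The same plans can also be printed from the DataFrame API; a sketch (pred1/pred2 stand in for the actual predicate expressions):

# Parsed, analyzed, optimized and physical plans in one go
spark.sql("select * from tbl where pred1 and pred2").explain(extended=True)

# Equivalent on a DataFrame
df.filter("pred1 and pred2").explain(extended=True)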
I read here and there that Catalyst can re-arrange filtering. To what extent is a matter of much research; I was not able to confirm this.
Also an interesting read:
https://www.waitingforcode.com/apache-spark-sql/catalyst-optimizer-in-spark-sql/read
Point 2
I ran the following pathetic, contrived examples with the same functional query but with the predicates reversed, using a column that has high cardinality, testing for a value that does not in fact exist, and then compared the count of an accumulator that is incremented inside a UDF whenever it is called.
Scenario 1
import org.apache.spark.sql.functions._
def randomInt1to1000000000 = scala.util.Random.nextInt(1000000000)+1
def randomInt1to10 = scala.util.Random.nextInt(10)+1
def randomInt1to1000000 = scala.util.Random.nextInt(1000000)+1
val df = sc.parallelize(Seq.fill(1000000){(randomInt1to1000000,randomInt1to1000000000,randomInt1to10)}).toDF("nuid","hc", "lc").withColumn("text", lpad($"nuid", 3, "0")).withColumn("literal",lit(1))
val accumulator = sc.longAccumulator("udf_call_count")
spark.udf.register("myUdf", (x: String) => {
  accumulator.add(1)
  x.length
})
accumulator.reset()
df.where("myUdf(text) = 3 and hc = -4").select(max($"text")).show(false)
println(s"Number of UDF calls ${accumulator.value}")
returns:
+---------+
|max(text)|
+---------+
|null |
+---------+
Number of UDF calls 1000000
Scenario 2
import org.apache.spark.sql.functions._
def randomInt1to1000000000 = scala.util.Random.nextInt(1000000000)+1
def randomInt1to10 = scala.util.Random.nextInt(10)+1
def randomInt1to1000000 = scala.util.Random.nextInt(1000000)+1
val dfA = sc.parallelize(Seq.fill(1000000){(randomInt1to1000000,randomInt1to1000000000,randomInt1to10)}).toDF("nuid","hc", "lc").withColumn("text", lpad($"nuid", 3, "0")).withColumn("literal",lit(1))
val accumulator = sc.longAccumulator("udf_call_count")
spark.udf.register("myUdf", (x: String) => {
  accumulator.add(1)
  x.length
})
accumulator.reset()
dfA.where("hc = -4 and myUdf(text) = 3").select(max($"text")).show(false)
println(s"Number of UDF calls ${accumulator.value}")
returns:
+---------+
|max(text)|
+---------+
|null |
+---------+
Number of UDF calls 0
My conclusion here is that:
There is left-to-right evaluation - in this case - as there are 0 calls to the UDF (the accumulator value is 0) for Scenario 2, as opposed to Scenario 1 with 1M calls registered.
So the kind of predicate re-ordering that, say, ORACLE and DB2 may do for Stage 1 predicates does not apply here.
Point 3
However, I note the following from the manual
https://docs.databricks.com/spark/latest/spark-sql/udf-scala.html:
Evaluation order and null checking
Spark SQL (including SQL and the DataFrame and Dataset APIs) does not
guarantee the order of evaluation of subexpressions. In particular,
the inputs of an operator or function are not necessarily evaluated
left-to-right or in any other fixed order. For example, logical AND
and OR expressions do not have left-to-right “short-circuiting”
semantics.
Therefore, it is dangerous to rely on the side effects or order of
evaluation of Boolean expressions, and the order of WHERE and HAVING
clauses, since such expressions and clauses can be reordered during
query optimization and planning. Specifically, if a UDF relies on
short-circuiting semantics in SQL for null checking, there’s no
guarantee that the null check will happen before invoking the UDF. For
example,
spark.udf.register("strlen", (s: String) => s.length)
spark.sql("select s from test1 where s is not null and strlen(s) > 1") // no guarantee
This WHERE clause does not guarantee the strlen UDF to be invoked
after filtering out nulls.
To perform proper null checking, we recommend that you do either of
the following:
- Make the UDF itself null-aware and do null checking inside the UDF itself.
- Use IF or CASE WHEN expressions to do the null check and invoke the UDF in a conditional branch.
spark.udf.register("strlen_nullsafe", (s: String) => if (s != null) s.length else -1)
spark.sql("select s from test1 where s is not null and strlen_nullsafe(s) > 1") // ok
spark.sql("select s from test1 where if(s is not null, strlen(s), null) > 1") // ok
Slight contradiction.
I want to perform some conditional branching to avoid computing unnecessary nodes, but I am noticing that if the source column in the condition is a UDF then the otherwise branch is resolved regardless:
#pandas_udf("double", PandasUDFType.SCALAR)
def udf_that_throws_exception(*cols):
raise Exception('Error')
#pandas_udf("int", PandasUDFType.SCALAR)
def simple_mul_udf(*cols):
result = cols[0]
for c in cols[1:]:
result *= c
return result
df = spark.range(0,5)
df = df.withColumn('A', lit(1))
df = df.withColumn('B', lit(2))
df = df.withColumn('udf', simple_mul('A','B'))
df = df.withColumn('sql', expr('A*B'))
df = df.withColumn('res', when(df.sql < 100, lit(1)).otherwise(udf_that_throws(lit(0))))
The above code works as expected: the condition in this case is always true, so my UDF that throws an exception is never called.
However, if I change the condition to use df.udf instead, then all of a sudden the otherwise UDF is called and I get the exception, even though the condition result has not changed.
I thought I might be able to obfuscate it by removing the UDF from the condition, but the same result occurs regardless:
df = df.withColumn('cond', when(df.udf < 100, lit(1)).otherwise(lit(0)))
df = df.withColumn('res', when(df.cond == lit(1), lit(1)).otherwise(udf_that_throws_exception(lit(0))))
I imagine this has something to do with the way Spark optimizes, which is fine, but I am looking for any way to do this without incurring the cost. Any ideas?
Edit
Per request for more information. We are writing a processing engine that can accept an arbitrary model and the code generates the graph. Along the way there are junctures where we make decisions based on the state of values at runtime. We make heavy use of pandas UDF. So imagine a situation where we have multiple paths in the graph and, depending on some condition at runtime, we wish to follow one of those paths, leaving all others untouched.
I would like to encode this logic into the graph so there is no point where I have to collect and branch in the code.
The sample code I have provided is for demonstration purposes only. The issue I am facing is that if the column used in the IF statement is a UDF or, it seems, if it is derived from a UDF, then the OTHERWISE branch is always executed even if it is never actually used. If the IF/ELSE branches are cheap operations such as literals I wouldn't mind, but what if the column UDF (perhaps on both sides) results in a large aggregation or some other lengthy process which is actually just thrown away?
In PySpark the UDFs are computed beforehand, and therefore you are getting this sub-optimal behaviour. You can see it also from the query plan:
== Physical Plan ==
*(2) Project [id#753L, 1 AS A#755, 2 AS B#758, pythonUDF1#776 AS udf#763, CASE WHEN (pythonUDF1#776 < 100) THEN 1.0 ELSE pythonUDF2#777 END AS res#769]
+- ArrowEvalPython [simple_mul_udf(1, 2), simple_mul_udf(1, 2), udf_that_throws_exception(0)], [id#753L, pythonUDF0#775, pythonUDF1#776, pythonUDF2#777]
+- *(1) Range (0, 5, step=1, splits=8)
The ArrowEvalPython operator is responsible for computing the UDFs and after that the condition will be evaluated in the Project operator.
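The plan above can be reproduced directly; a sketch:

# Prints the physical plan, where the ArrowEvalPython node appears
# below the Project that evaluates the CASE WHEN.
df.explain()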
The reason why you get different behaviour when you use df.sql in your condition (the optimal behaviour) is that this is a special case in which the value in this column is constant (both columns A and B are constant), so the Spark optimizer can evaluate it beforehand (in the driver during query planning, before the actual job runs on the cluster) and thus knows that the otherwise branch of the condition will never have to be evaluated. If the value in the sql column were dynamic (for example, like the id column), the behaviour would be suboptimal again, because Spark does not know in advance that the otherwise part should never take place.
If you want to avoid this suboptimal behaviour (calling the udf in otherwise even though it is not needed), one possible solution is to evaluate the condition inside your udf, for example as follows:
#pandas_udf("int", PandasUDFType.SCALAR)
def udf_with_cond(*cols):
result = cols[0]
for c in cols[1:]:
result *= c
if((result < 100).any()):
return result
else:
raise Exception('Error')
df = df.withColumn('res', udf_with_cond('A', 'B'))
The question is pretty much in the title: is there an efficient way to count the distinct values in every column of a DataFrame?
The describe method provides only the count but not the distinct count, and I wonder if there is a way to get the distinct count for all (or some selected) columns.
In PySpark you could do something like this, using countDistinct():
from pyspark.sql.functions import col, countDistinct
df.agg(*(countDistinct(col(c)).alias(c) for c in df.columns))
Similarly in Scala:
import org.apache.spark.sql.functions.countDistinct
import org.apache.spark.sql.functions.col
df.select(df.columns.map(c => countDistinct(col(c)).alias(c)): _*)
If you want to speed things up at the potential loss of accuracy, you could also use approxCountDistinct().
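For example, a sketch using the current approx_count_distinct name, where rsd is the allowed relative standard deviation:

from pyspark.sql.functions import approx_count_distinct, col

# Roughly 5% relative error in exchange for a much cheaper aggregation
df.agg(*(approx_count_distinct(col(c), rsd=0.05).alias(c) for c in df.columns)).show()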
Multiple aggregations would be quite expensive to compute. I suggest that you use approximation methods instead. In this case, approximating the distinct count:
val df = Seq((1,3,4),(1,2,3),(2,3,4),(2,3,5)).toDF("col1","col2","col3")
val exprs = df.columns.map((_ -> "approx_count_distinct")).toMap
df.agg(exprs).show()
// +---------------------------+---------------------------+---------------------------+
// |approx_count_distinct(col1)|approx_count_distinct(col2)|approx_count_distinct(col3)|
// +---------------------------+---------------------------+---------------------------+
// | 2| 2| 3|
// +---------------------------+---------------------------+---------------------------+
The approx_count_distinct method relies on HyperLogLog under the hood.
The HyperLogLog algorithm and its variant HyperLogLog++ (implemented in Spark) relies on the following clever observation.
If the numbers are spread uniformly across a range, then the count of distinct elements can be approximated from the largest number of leading zeros in the binary representation of the numbers.
For example, if we observe a number whose digits in binary form are of the form 0…(k times)…01…1, then we can estimate that there are in the order of 2^k elements in the set. This is a very crude estimate but it can be refined to great precision with a sketching algorithm.
A thorough explanation of the mechanics behind this algorithm can be found in the original paper.
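As an illustration of that observation only (this is not Spark's implementation, which uses many registers plus bias correction), a crude single-register estimate could look like this:

import hashlib

def crude_distinct_estimate(values):
    """Estimate the distinct count as roughly 2^k, where k is the maximum
    number of leading zero bits seen among 32-bit hashes of the values."""
    max_leading_zeros = 0
    for v in values:
        h = int(hashlib.md5(str(v).encode()).hexdigest(), 16) & 0xFFFFFFFF
        leading_zeros = 32 - h.bit_length()  # 32 if h == 0
        max_leading_zeros = max(max_leading_zeros, leading_zeros)
    return 2 ** max_leading_zeros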
Note: Starting with Spark 1.6, when Spark calls SELECT SOME_AGG(DISTINCT foo), SOME_AGG(DISTINCT bar) FROM df, each clause triggers a separate aggregation, whereas SELECT SOME_AGG(foo), SOME_AGG(bar) FROM df aggregates only once. Thus the performance won't be comparable between count(distinct(_)) and approxCountDistinct (or approx_count_distinct).
It's one of the changes of behavior since Spark 1.6:
With the improved query planner for queries having distinct aggregations (SPARK-9241), the plan of a query having a single distinct aggregation has been changed to a more robust version. To switch back to the plan generated by Spark 1.5’s planner, please set spark.sql.specializeSingleDistinctAggPlanning to true. (SPARK-12077)
Reference : Approximate Algorithms in Apache Spark: HyperLogLog and Quantiles.
If you just want the count for a particular column, the following could help. Although it's a late answer, it might help someone. (Tested on PySpark 2.2.0.)
from pyspark.sql.functions import col, countDistinct
df.agg(countDistinct(col("colName")).alias("count")).show()
Adding to desaiankitb's answer, this would provide you a more intuitive answer:
from pyspark.sql.functions import count
df.groupBy(colname).count().show()
You can use the count(column name) function of SQL
Alternatively, if you are doing data analysis and want a rough estimate rather than the exact count of each and every column, you can use the approx_count_distinct function:
approx_count_distinct(expr[, relativeSD])
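For example, in SQL (a sketch; my_table and col1 are placeholder names):

# The optional second argument is relativeSD, the maximum allowed estimation error
spark.sql("SELECT approx_count_distinct(col1, 0.01) AS col1_distinct FROM my_table").show()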
This is one way to create a dataframe with the distinct count of every column:
df = df.to_pandas_on_spark()
collect_df = []
for i in df.columns:
    collect_df.append({"field_name": i, "unique_count": df[i].nunique()})
uniquedf = spark.createDataFrame(collect_df)
Output would look like below. I used this with another dataframe to compare values where the column names were the same; the other dataframe was created the same way and then joined.
df_prod_merged = uniquedf1.join(uniquedf2, on='field_name', how="left")
This is an easy way to do it. It might be expensive on very huge data (like 1 TB to process), but it is still quite efficient when using to_pandas_on_spark().
I have to write a complex UDF, in which I have to do a join with a different table and return the number of matches. The actual use case is much more complex, but I've simplified it here to a minimal reproducible example. Here is the UDF code.
def predict_id(date, zip):
    filtered_ids = contest_savm.where((F.col('postal_code') == zip) & (F.col('start_date') >= date))
    return filtered_ids.count()
When I define the UDF using the below code, I get a long list of console errors:
predict_id_udf = F.udf(predict_id,types.IntegerType())
The final line of the error is:
py4j.Py4JException: Method __getnewargs__([]) does not exist
I want to know what is the best way to go about it. I also tried map like this:
result_rdd = df.select("party_id").rdd \
    .map(lambda x: predict_id(x[0], x[1])) \
    .distinct()
It also resulted in a similar final error. I want to know if there is any way I can do a join within a UDF or map function for each row of the original dataframe.
I have to write a complex UDF, in which I have to do a join with a different table, and return the number of matches.
It is not possible by design. If you want to achieve an effect like this, you have to use high-level DF / RDD operators:
df.join(
    contest_savm,
    (F.col('postal_code') == df["zip"]) & (F.col('start_date') >= df["date"])
).groupBy(*df.columns).count()
E.g.:
sqlContext = SQLContext(sc)
sample=sqlContext.sql("select Name ,age ,city from user")
sample.show()
The above statement prints the entire table on the terminal. But I want to access each row in that table using for or while to perform further calculations.
You simply cannot. DataFrames, like other distributed data structures, are not iterable and can be accessed only through dedicated higher-order functions and/or SQL methods.
You can of course collect
for row in df.rdd.collect():
    do_something(row)
or convert toLocalIterator
for row in df.rdd.toLocalIterator():
    do_something(row)
and iterate locally as shown above, but it defeats the whole purpose of using Spark.
To "loop" and take advantage of Spark's parallel computation framework, you could define a custom function and use map.
def customFunction(row):
    return (row.name, row.age, row.city)

sample2 = sample.rdd.map(customFunction)
or
sample2 = sample.rdd.map(lambda x: (x.name, x.age, x.city))
The custom function would then be applied to every row of the dataframe. Note that sample2 will be an RDD, not a dataframe.
Map may be needed if you are going to perform more complex computations. If you just need to add a simple derived column, you can use withColumn, which returns a dataframe.
sample3 = sample.withColumn('age2', sample.age + 2)
Using list comprehensions in Python, you can collect an entire column of values into a list using just two lines:
df = sqlContext.sql("show tables in default")
tableList = [x["tableName"] for x in df.rdd.collect()]
In the above example, we return a list of tables in database 'default', but the same can be adapted by replacing the query used in sql().
Or more abbreviated:
tableList = [x["tableName"] for x in sqlContext.sql("show tables in default").rdd.collect()]
And for your example of three columns, we can create a list of dictionaries, and then iterate through them in a for loop.
sql_text = "select name, age, city from user"
tupleList = [{name:x["name"], age:x["age"], city:x["city"]}
             for x in sqlContext.sql(sql_text).rdd.collect()]
for row in tupleList:
    print("{} is a {} year old from {}".format(
        row["name"],
        row["age"],
        row["city"]))
It might not be the best practice, but you can simply target a specific column using collect(), export it as a list of Rows, and loop through the list.
Assume this is your df:
+----------+----------+-------------------+-----------+-----------+------------------+
| Date| New_Date| New_Timestamp|date_sub_10|date_add_10|time_diff_from_now|
+----------+----------+-------------------+-----------+-----------+------------------+
|2020-09-23|2020-09-23|2020-09-23 00:00:00| 2020-09-13| 2020-10-03| 51148 |
|2020-09-24|2020-09-24|2020-09-24 00:00:00| 2020-09-14| 2020-10-04| -35252 |
|2020-01-25|2020-01-25|2020-01-25 00:00:00| 2020-01-15| 2020-02-04| 20963548 |
|2020-01-11|2020-01-11|2020-01-11 00:00:00| 2020-01-01| 2020-01-21| 22173148 |
+----------+----------+-------------------+-----------+-----------+------------------+
to loop through rows in Date column:
rows = df3.select('Date').collect()
final_list = []
for i in rows:
    final_list.append(i[0])
print(final_list)
Give it a try like this:
result = spark.createDataFrame([('SpeciesId','int'), ('SpeciesName','string')],["col_name", "data_type"]);
for f in result.collect():
    print(f.col_name)
If you want to do something to each row in a DataFrame object, use map. This will allow you to perform further calculations on each row. It's the equivalent of looping across the entire dataset from 0 to len(dataset)-1.
Note that this will return a PipelinedRDD, not a DataFrame.
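A minimal sketch of that pattern, reusing the sample DataFrame from earlier in the thread and converting the result back to a DataFrame:

# Per-row computation with map; toDF turns the resulting RDD back into a DataFrame.
mapped = sample.rdd.map(lambda row: (row.name, row.age, row.city, row.age + 2))
mapped_df = mapped.toDF(["name", "age", "city", "age2"])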
In the example above,
tupleList = [{name:x["name"], age:x["age"], city:x["city"]}
should be
tupleList = [{'name':x["name"], 'age':x["age"], 'city':x["city"]}
because name, age, and city are not variables but simply keys of the dictionary.