Java Spark Dataset retrieve cell value - apache-spark

My Dataset looks like the one below, and I want to fetch the value in the 1st row, 1st column (A1 in this case).
+-------+---+--------------+----------+
|account|ccy|count(account)|sum_amount|
+-------+---+--------------+----------+
|     A1|USD|             2|    500.24|
|     A2|SGD|             1|    200.24|
|     A2|USD|             1|    300.36|
+-------+---+--------------+----------+
I can do this as below:
Dataset finalDS = dataset.groupBy("account", "ccy")
        .agg(count("account"), sum("amount").alias("sum_amount"))
        .orderBy("account", "ccy");

Object[] items = (Object[]) finalDS
        .filter(functions.col("sum_amount").equalTo(300.36))
        .collect();
String accountNo = (String) ((GenericRowWithSchema) items[0]).get(0);
2 questions:
Is there another/more efficient way to do this? I am aware of DataFrame/JavaRDD queries.
Without the explicit cast to Object[], there is a compile-time failure; however, I would have thought this cast would be implicit. Why? I suspect it has something to do with Scala compilation.

Is there another/more efficient way to do this? I am aware of DataFrame/JavaRDD queries.
You'd be better off using the Dataset.head function (javadocs) so that you don't pass all the data to the driver process. This limits what is loaded into driver RAM to the first row instead of the entire dataset. You can also consider the take function to obtain the first N rows.
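For example, a minimal sketch of the idea in Scala (the Java API is analogous); finalDS and the sum_amount = 300.36 filter are taken from the question, and note that head() throws if no row matches, so take(1) is the defensive alternative:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.col

// Only the first matching row is shipped to the driver, not the whole Dataset.
val firstRow: Row = finalDS
  .filter(col("sum_amount") === 300.36)
  .head()

val accountNo: String = firstRow.getString(0)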
Without the explicit cast to Object[], there is a compile-time failure; however, I would have thought this cast would be implicit. Why? I suspect it has something to do with Scala compilation.
It depends on how your Dataset is typed. In the case of a DataFrame (which is a Dataset[Row], proof), you'll get an Array[Row] from the call to collect. It's worth mentioning the signature of the collect function:
def collect(): Array[T] = withAction("collect", queryExecution)(collectFromPlan)
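For contrast, here is a small Scala sketch showing how the element type of collect follows the Dataset's type parameter once the Dataset is typed (the case class is made up for illustration; the raw/erased signature seen from Java is presumably what forces the explicit cast there):
import spark.implicits._   // assumes a SparkSession named spark

case class AccountSummary(account: String, ccy: String, sum_amount: Double)

// collect() on a typed Dataset returns Array[AccountSummary] - no Object[] cast needed.
val typed = finalDS
  .select($"account", $"ccy", $"sum_amount")
  .as[AccountSummary]

val accountNo: String = typed.collect()(0).account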

Related

Order of evaluation of predicates in Spark SQL where clause

I am trying to understand the order of predicate evaluation in Spark SQL in order to increase performance of a query.
Let's say I have the following query
"select * from tbl where pred1 and pred2"
and let's say that none of the predicates qualify as pushdown filters (for simplification).
Also let's assume that pred1 is computationally much more complex than pred2 (say, regex pattern matching vs negation).
Is there any way to verify that Spark will evaluate pred2 before pred1?
Is this deterministic?
Is this controllable?
Is there any way to see the final execution plan?
General
Good question.
This answer is inferred from testing a scenario and making deductions, as I could not find suitable docs. It is a second attempt, since all sorts of statements on the web could not be backed up.
I think this question is not about AQE / Spark 3.x aspects, but about, say, a dataframe in Stage N of a Spark App that has already passed the stage of acquiring data from sources at rest and is now subject to filtering with multiple predicates applied.
The central point then is: does it matter how the predicates are ordered, or does Spark (Catalyst) re-order the predicates to minimize the work to be done?
The premise here is that filtering the maximum amount of data out first makes more sense than evaluating a predicate that filters very
little out.
This is a well-known RDBMS point, referring to sargable predicates (a definition that has evolved over time).
A lot of that discussion focuses on indexes; Spark and Hive do not have these, but DataFrames are columnar.
Point 1
You can try the following in %sql:
EXPLAIN EXTENDED select k, sum(v) from values (1, 2), (1, 3) t(k, v) group by k;
From this you can see whether there is any re-arranging of predicates going on, but I saw no such aspects in the Physical Plan in non-AQE mode on Databricks. Refer to
https://docs.databricks.com/sql/language-manual/sql-ref-syntax-qry-explain.html.
I read here and there that Catalyst can re-arrange filtering. To what extent is a matter of research; I was not able to confirm this.
Also an interesting read:
https://www.waitingforcode.com/apache-spark-sql/catalyst-optimizer-in-spark-sql/read
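The same plans are also visible from the DataFrame side: calling explain with the extended flag prints the parsed, analyzed, optimized and physical plans, so any re-ordering of a Filter by Catalyst would show up there. A minimal sketch, using the df and predicates from Scenario 1 below:
// Prints parsed/analyzed/optimized logical plans plus the physical plan.
df.where("myUdf(text) = 3 and hc = -4").explain(true)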
Point 2
I ran the following pathetic, contrived examples with the same functional query but with the predicates reversed, using a column that has high cardinality, testing for a value that does not in fact exist, and then compared the count of an accumulator that is incremented inside a UDF each time the UDF is called.
Scenario 1
import org.apache.spark.sql.functions._

def randomInt1to1000000000 = scala.util.Random.nextInt(1000000000) + 1
def randomInt1to10 = scala.util.Random.nextInt(10) + 1
def randomInt1to1000000 = scala.util.Random.nextInt(1000000) + 1

// 1M rows: "hc" is a high-cardinality column, "lc" a low-cardinality one.
val df = sc.parallelize(Seq.fill(1000000){ (randomInt1to1000000, randomInt1to1000000000, randomInt1to10) })
  .toDF("nuid", "hc", "lc")
  .withColumn("text", lpad($"nuid", 3, "0"))
  .withColumn("literal", lit(1))

val accumulator = sc.longAccumulator("udf_call_count")
spark.udf.register("myUdf", (x: String) => {
  accumulator.add(1)   // count every invocation of the UDF
  x.length
})

accumulator.reset()
// UDF predicate first, then the (always-false) high-cardinality predicate.
df.where("myUdf(text) = 3 and hc = -4").select(max($"text")).show(false)
println(s"Number of UDF calls ${accumulator.value}")
returns:
+---------+
|max(text)|
+---------+
|null     |
+---------+
Number of UDF calls 1000000
Scenario 2
import org.apache.spark.sql.functions._

def randomInt1to1000000000 = scala.util.Random.nextInt(1000000000) + 1
def randomInt1to10 = scala.util.Random.nextInt(10) + 1
def randomInt1to1000000 = scala.util.Random.nextInt(1000000) + 1

val dfA = sc.parallelize(Seq.fill(1000000){ (randomInt1to1000000, randomInt1to1000000000, randomInt1to10) })
  .toDF("nuid", "hc", "lc")
  .withColumn("text", lpad($"nuid", 3, "0"))
  .withColumn("literal", lit(1))

val accumulator = sc.longAccumulator("udf_call_count")
spark.udf.register("myUdf", (x: String) => {
  accumulator.add(1)   // count every invocation of the UDF
  x.length
})

accumulator.reset()
// High-cardinality predicate first, UDF predicate second.
dfA.where("hc = -4 and myUdf(text) = 3").select(max($"text")).show(false)
println(s"Number of UDF calls ${accumulator.value}")
returns:
+---------+
|max(text)|
+---------+
|null     |
+---------+
Number of UDF calls 0
My conclusion here is that:
There is left-to-right evaluation - in this case - since there are 0 calls to the UDF (the accumulator value is 0) for Scenario 2, as opposed to Scenario 1 with 1M calls registered.
So the re-ordering of predicate processing that, say, ORACLE and DB2 may do for Stage 1 predicates does not apply here.
Point 3
I note, however, the following from the manual
https://docs.databricks.com/spark/latest/spark-sql/udf-scala.html:
Evaluation order and null checking
Spark SQL (including SQL and the DataFrame and Dataset APIs) does not
guarantee the order of evaluation of subexpressions. In particular,
the inputs of an operator or function are not necessarily evaluated
left-to-right or in any other fixed order. For example, logical AND
and OR expressions do not have left-to-right “short-circuiting”
semantics.
Therefore, it is dangerous to rely on the side effects or order of
evaluation of Boolean expressions, and the order of WHERE and HAVING
clauses, since such expressions and clauses can be reordered during
query optimization and planning. Specifically, if a UDF relies on
short-circuiting semantics in SQL for null checking, there’s no
guarantee that the null check will happen before invoking the UDF. For
example,
spark.udf.register("strlen", (s: String) => s.length)
spark.sql("select s from test1 where s is not null and strlen(s) > 1") // no guarantee
This WHERE clause does not guarantee the strlen UDF to be invoked
after filtering out nulls.
To perform proper null checking, we recommend that you do either of
the following:
Make the UDF itself null-aware and do null checking inside the UDF itself.
Use IF or CASE WHEN expressions to do the null check and invoke the UDF in a conditional branch.
spark.udf.register("strlen_nullsafe", (s: String) => if (s != null) s.length else -1)
spark.sql("select s from test1 where s is not null and strlen_nullsafe(s) > 1") // ok
spark.sql("select s from test1 where if(s is not null, strlen(s), null) > 1") // ok
Slight contradiction.

Spark function aliases - performant udfs

Context
In many of the SQL queries I write, I find myself combining Spark predefined functions in exactly the same way, which often results in verbose and duplicated code, and my developer instinct is to want to refactor it.
So, my question is this: is there some way to define some kind of alias for function combinations without resorting to udfs (which are to be avoided for performance reasons) - the goal being to make the code clearer and cleaner? Essentially, what I want is something like udfs but without the performance penalty. Also, these functions MUST be callable from within a Spark SQL query, i.e. usable in spark.sql calls.
Example
For example, let's say my business logic is to reverse some string and hash it like this (please note that the function combination here is irrelevant; what is important is that it is some combination of existing pre-defined spark functions - possibly many of them):
SELECT
sha1(reverse(person.name)),
sha1(reverse(person.some_information)),
sha1(reverse(person.some_other_information))
...
FROM person
Is there a way of declaring a business function without paying the performance price of using a udf, allowing the code just above to be rewritten as:
SELECT
business(person.name),
business(person.some_information),
business(person.some_other_information)
...
FROM person
I have searched around quite a bit in the Spark documentation and on this website and have not found a way of achieving this, which is pretty weird to me because it looks like a pretty natural need, and I don't understand why you should necessarily pay the black-box price of defining and calling a udf.
Is there a way of declaring a business function without paying the performance price of using a udf
You don't have to use a udf; you could extend the Expression class or, for the simplest operations, UnaryExpression. Then you only have to implement a handful of methods and there you go. It is natively integrated into Spark and, in addition, lets you take advantage of features such as code generation.
In your case, adding the business function is pretty straightforward:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{reverse, sha1}

def business(column: Column): Column = {
  sha1(reverse(column))
}
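A minimal usage sketch, assuming the person DataFrame from the question and import spark.implicits._ for the $ syntax:
// No UDF involved: business just composes built-in Column expressions,
// so Catalyst still sees (and can optimize) sha1(reverse(...)).
person.select(
  business($"name"),
  business($"some_information"),
  business($"some_other_information")
)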
MUST be callable from within a Spark SQL query, i.e. usable in spark.sql calls
This is more tricky but achievable.
You need to create a custom function registrar:
import scala.collection.mutable

import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.catalyst.FunctionIdentifier
import org.apache.spark.sql.catalyst.expressions.Expression

object FunctionAliasRegistrar {

  val funcs: mutable.Map[String, Seq[Column] => Column] = mutable.Map.empty

  def add(name: String, builder: Seq[Column] => Column): this.type = {
    funcs += name -> builder
    this
  }

  def registerAll(spark: SparkSession): Unit = {
    funcs.foreach { case (alias, builder) =>
      // Adapt the Column-based alias to the Seq[Expression] => Expression
      // builder expected by the session's function registry.
      def b(children: Seq[Expression]) = builder.apply(children.map(expr => new Column(expr))).expr
      spark.sessionState.functionRegistry.registerFunction(FunctionIdentifier(alias), b)
    }
  }
}
Then you can use it as follows:
FunctionAliasRegistrar
.add("business1", child => lower(reverse(child.head)))
.add("business2", child => upper(reverse(child.head)))
.registerAll(spark)
dataset.createTempView("data")
spark.sql(
"""
| SELECT business1(name), business2(name) FROM data
|""".stripMargin)
.show(false)
Output:
+--------------------+--------------------+
|lower(reverse(name))|upper(reverse(name))|
+--------------------+--------------------+
|sined               |SINED               |
|taram               |TARAM               |
|1taram              |1TARAM              |
|2taram              |2TARAM              |
+--------------------+--------------------+
Hope this helps.

Spark: Use aggregation function on all columns [duplicate]

The question is pretty much in the title: Is there an efficient way to count the distinct values in every column in a DataFrame?
The describe method provides only the count, not the distinct count, and I wonder if there is a way to get the distinct count for all (or some selected) columns.
In pySpark you could do something like this, using countDistinct():
from pyspark.sql.functions import col, countDistinct
df.agg(*(countDistinct(col(c)).alias(c) for c in df.columns))
Similarly in Scala :
import org.apache.spark.sql.functions.countDistinct
import org.apache.spark.sql.functions.col
df.select(df.columns.map(c => countDistinct(col(c)).alias(c)): _*)
If you want to speed things up at the potential loss of accuracy, you could also use approxCountDistinct().
Multiple aggregations would be quite expensive to compute. I suggest that you use approximation methods instead. In this case, approximating the distinct count:
val df = Seq((1,3,4),(1,2,3),(2,3,4),(2,3,5)).toDF("col1","col2","col3")
val exprs = df.columns.map((_ -> "approx_count_distinct")).toMap
df.agg(exprs).show()
// +---------------------------+---------------------------+---------------------------+
// |approx_count_distinct(col1)|approx_count_distinct(col2)|approx_count_distinct(col3)|
// +---------------------------+---------------------------+---------------------------+
// |                          2|                          2|                          3|
// +---------------------------+---------------------------+---------------------------+
The approx_count_distinct method relies on HyperLogLog under the hood.
The HyperLogLog algorithm and its variant HyperLogLog++ (implemented in Spark) rely on the following clever observation.
If the numbers are spread uniformly across a range, then the count of distinct elements can be approximated from the largest number of leading zeros in the binary representation of the numbers.
For example, if we observe a number whose digits in binary form are of the form 0…(k times)…01…1, then we can estimate that there are in the order of 2^k elements in the set. This is a very crude estimate but it can be refined to great precision with a sketching algorithm.
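A toy sketch of that observation (this is not Spark's actual implementation; real HyperLogLog/HyperLogLog++ uses many registers plus bias corrections to refine the estimate):
// Very crude single-register estimate: guess the number of distinct values from the
// maximum number of leading zeros observed among their (roughly uniform) hash values.
def estimateDistinct(values: Seq[String]): Double = {
  val maxLeadingZeros = values.map(v => Integer.numberOfLeadingZeros(v.hashCode)).max
  math.pow(2, maxLeadingZeros)
}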
A thorough explanation of the mechanics behind this algorithm can be found in the original paper.
Note: Starting with Spark 1.6, when Spark calls SELECT SOME_AGG(DISTINCT foo), SOME_AGG(DISTINCT bar) FROM df, each clause triggers a separate aggregation. This is different from SELECT SOME_AGG(foo), SOME_AGG(bar) FROM df, where we aggregate once. Thus the performance won't be comparable between count(distinct _) and approxCountDistinct (or approx_count_distinct).
It's one of the changes of behavior since Spark 1.6 :
With the improved query planner for queries having distinct aggregations (SPARK-9241), the plan of a query having a single distinct aggregation has been changed to a more robust version. To switch back to the plan generated by Spark 1.5’s planner, please set spark.sql.specializeSingleDistinctAggPlanning to true. (SPARK-12077)
Reference : Approximate Algorithms in Apache Spark: HyperLogLog and Quantiles.
If you just want to count a particular column, then the following could help. Although it's a late answer, it might help someone. (Tested on pyspark 2.2.0.)
from pyspark.sql.functions import col, countDistinct
df.agg(countDistinct(col("colName")).alias("count")).show()
Adding to desaiankitb's answer, this would provide you a more intuitive answer:
from pyspark.sql.functions import count
df.groupBy(colname).count().show()
You can use SQL's count(column name) function.
Alternatively, if you are doing data analysis and want a rough estimate rather than an exact count of each and every column, you can use the approx_count_distinct function:
approx_count_distinct(expr[, relativeSD])
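For example, a small sketch in Scala (the table name my_table is a placeholder; relativeSD is the maximum relative standard deviation you are willing to accept):
import org.apache.spark.sql.functions.{approx_count_distinct, col}

// SQL form, with an explicit relativeSD of 5%
spark.sql("SELECT approx_count_distinct(col1, 0.05) FROM my_table").show()

// DataFrame equivalent on the df defined above
df.agg(approx_count_distinct(col("col1"), 0.05)).show()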
This is one way to create a dataframe with a count for every column:
df = df.to_pandas_on_spark()
collect_df = []
for i in df.columns:
    collect_df.append({"field_name": i, "unique_count": df[i].nunique()})
uniquedf = spark.createDataFrame(collect_df)
The output would look like below. I used this with another dataframe to compare values where the column names are the same; the other dataframe was created the same way and then joined:
df_prod_merged = uniquedf1.join(uniquedf2, on='field_name', how="left")
This is an easy way to do it. It might be expensive to process on very huge data (say 1 TB), but it is still quite efficient when using to_pandas_on_spark().

Pyspark dataframe - select rows that fails

Is it possible to perform a set of operations on a dataframe (adding new columns, replacing some existing values, etc.) and not fail fast on the first failed rows, but instead perform the full transformation and separately return the rows that were processed with errors?
Example:
It's more like pseudocode, but the idea should be clear:
df.withColumn('PRICE_AS_NUM', to_num(df["PRICE_AS_STR"]))
to_num is my custom function for transforming a string to a number.
Assuming I have some records where the price can't be cast to a number, I want to get those records in a separate dataframe.
I see one approach, but it will make the code a little ugly (and not very productive):
do a filter with try/catch - if an exception happens, filter those records into a separate df.
What if I have many such transformations... Is there any better way?
I think one approach would be to wrap your transformation with a try/except function that returns a boolean. Then use when() and otherwise() to filter on the boolean. For example:
def to_num_wrapper(inputs):
    try:
        to_num(inputs)
        return True
    except:
        return False

from pyspark.sql.functions import when

# Note: like to_num, to_num_wrapper would itself have to be a UDF (returning a
# boolean) for when() to be able to evaluate it row by row.
df.withColumn('PRICE_AS_NUM',
    when(
        to_num_wrapper(df["PRICE_AS_STR"]),
        to_num(df["PRICE_AS_STR"])
    ).otherwise('FAILED')
)
Then you can filter on the columns where the value is 'FAILED'.
Preferred option
Always prefer built-in SQL functions over UDFs. They're safe to execute and much faster than a Python UDF. As a bonus, they follow SQL semantics: if there is a problem on a line, the output is NULL (undefined).
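For example, a minimal sketch of the built-in route (shown in Scala for consistency with the other snippets on this page; PySpark's cast call is spelled the same way, and the column names come from the question):
import org.apache.spark.sql.functions.col

// Built-in cast: malformed strings become NULL instead of throwing,
// so the failed rows can be isolated afterwards with a simple filter.
val withNum = df.withColumn("PRICE_AS_NUM", col("PRICE_AS_STR").cast("double"))

val converted = withNum.filter(col("PRICE_AS_NUM").isNotNull)
val failed    = withNum.filter(col("PRICE_AS_NUM").isNull)   // the rows that could not be converted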
If you go with a UDF
Follow the same approach as the built-in functions:
from pyspark.sql.functions import udf

def safe_udf(f, dtype):
    def _(*args):
        try:
            return f(*args)
        except:
            pass   # swallow the error: the UDF then returns None, i.e. SQL NULL
    return udf(_, dtype)
to_num_wrapper = safe_udf(lambda x: float(x), "float")
df = spark.createDataFrame([("1.123", ), ("foo", )], ["str"])
df.withColumn("num", to_num_wrapper("str")).show()
# +-----+-----+
# | str| num|
# +-----+-----+
# |1.123|1.123|
# | foo| null|
# +-----+-----+
While swallowing the exception might be counter-intuitive, it is just a matter of following SQL conventions.
No matter which one you choose:
Once you adjust your code with one of the above, handling malformed data is just a matter of applying DataFrameNaFunctions (.na.drop, .na.replace).
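For instance, continuing the sketch above (na.replace works similarly when you want to swap specific values rather than nulls):
// Assuming a DataFrame withNum whose PRICE_AS_NUM column is null for malformed rows:
val cleaned  = withNum.na.drop(Seq("PRICE_AS_NUM"))           // drop the rows that failed conversion
val replaced = withNum.na.fill(Map("PRICE_AS_NUM" -> 0.0))    // or substitute a default value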

Resources