The Apache Spark Dataset API has two methods, head(n: Int) and take(n: Int).
The Dataset.scala source contains
def take(n: Int): Array[T] = head(n)
I couldn't find any difference in the execution code between these two functions. Why does the API have two different methods that yield the same result?
The reason, in my view, is that the Apache Spark Dataset API is trying to mimic the Pandas DataFrame API, which contains head: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.head.html.
I have experimented and found that head(n) and take(n) give exactly the same output. Both produce output in the form of Row objects only.
DF.head(2)
[Row(Transaction_date=u'1/2/2009 6:17', Product=u'Product1', Price=u'1200', Payment_Type=u'Mastercard', Name=u'carolina', City=u'Basildon', State=u'England', Country=u'United Kingdom'), Row(Transaction_date=u'1/2/2009 4:53', Product=u'Product2', Price=u'1200', Payment_Type=u'Visa', Name=u'Betina', City=u'Parkville', State=u'MO', Country=u'United States')]
DF.take(2)
[Row(Transaction_date=u'1/2/2009 6:17', Product=u'Product1', Price=u'1200', Payment_Type=u'Mastercard', Name=u'carolina', City=u'Basildon', State=u'England', Country=u'United Kingdom'), Row(Transaction_date=u'1/2/2009 4:53', Product=u'Product2', Price=u'1200', Payment_Type=u'Visa', Name=u'Betina', City=u'Parkville', State=u'MO', Country=u'United States')]
package org.apache.spark.sql
/* ... */
def take(n: Int): Array[T] = head(n)
I think this is because the Spark developers tend to give it a rich API; there are also the two methods where and filter, which do exactly the same thing.
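Indeed, the where/filter pair is aliased the same way in the Scala source; roughly (a paraphrase of Dataset.scala, not the exact file contents):
def where(condition: Column): Dataset[T] = filter(condition)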
Context
In many of the SQL queries I write, I find myself combining Spark's predefined functions in exactly the same way, which often results in verbose and duplicated code, and my developer instinct is to want to refactor it.
So, my question is this: is there some way to define some kind of alias for function combinations without resorting to UDFs (which are to be avoided for performance reasons)? The goal is to make the code clearer and cleaner. Essentially, what I want is something like UDFs but without the performance penalty. Also, these functions MUST be callable from within a Spark SQL query, i.e. usable in spark.sql calls.
Example
For example, let's say my business logic is to reverse some string and hash it like this (please note that the function combination here is irrelevant; what is important is that it is some combination of existing predefined Spark functions - possibly many of them):
SELECT
sha1(reverse(person.name)),
sha1(reverse(person.some_information)),
sha1(reverse(person.some_other_information))
...
FROM person
Is there a way of declaring a business function without paying the performance price of using a UDF, allowing the code just above to be rewritten as:
SELECT
business(person.name),
business(person.some_information),
business(person.some_other_information)
...
FROM person
I have searched around quite a bit in the Spark documentation and on this website and have not found a way of achieving this, which is pretty weird to me because it looks like a pretty natural need, and I don't understand why you should necessarily pay the black-box price of defining and calling a UDF.
Is there a way of declaring a business function without paying the performance price of using a udf
You don't have to use a UDF; you can extend the Expression class, or for the simplest operations, UnaryExpression. Then you only have to implement a handful of methods. It is natively integrated into Spark and lets you benefit from features such as code generation.
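As an illustration, here is a minimal sketch of a native expression that reverses a string by extending UnaryExpression. It is not part of the original answer, targets the Spark 2.x Catalyst API, and uses CodegenFallback to avoid writing doGenCode; a production version would implement code generation and add the sha1 step as well:
import org.apache.spark.sql.catalyst.expressions.{Expression, UnaryExpression}
import org.apache.spark.sql.catalyst.expressions.codegen.CodegenFallback
import org.apache.spark.sql.types.{DataType, StringType}
import org.apache.spark.unsafe.types.UTF8String

// Hypothetical native expression: reverses a string input.
case class ReverseString(child: Expression)
  extends UnaryExpression with CodegenFallback {

  override def dataType: DataType = StringType

  // Called only for non-null inputs; UnaryExpression handles nulls for us.
  override protected def nullSafeEval(input: Any): Any =
    UTF8String.fromString(input.asInstanceOf[UTF8String].toString.reverse)
}
You would then wrap it in a Column (for example new Column(ReverseString(col("name").expr))) to use it from the DataFrame API.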
In your case, adding a business function is pretty straightforward:
def business(column: Column): Column = {
sha1(reverse(column))
}
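From the DataFrame/Dataset API this alias can then be used like any built-in function; for example (assuming a df with a name column, which is not part of the original example):
import org.apache.spark.sql.functions.col

df.select(business(col("name")).alias("hashed_name"))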
MUST be callable from within a spark-sql query usable in spark.sql calls
This is more tricky but achievable.
You need to create a custom function registrar:
import scala.collection.mutable

import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.catalyst.FunctionIdentifier
import org.apache.spark.sql.catalyst.expressions.Expression

object FunctionAliasRegistrar {

  // Aliases registered so far: name -> builder from child columns to a column.
  val funcs: mutable.Map[String, Seq[Column] => Column] = mutable.Map.empty

  def add(name: String, builder: Seq[Column] => Column): this.type = {
    funcs += name -> builder
    this
  }

  def registerAll(spark: SparkSession): Unit = {
    funcs.foreach { case (alias, builder) =>
      def b(children: Seq[Expression]) = builder.apply(children.map(expr => new Column(expr))).expr
      spark.sessionState.functionRegistry.registerFunction(FunctionIdentifier(alias), b)
    }
  }
}
Then you can use it as follows:
FunctionAliasRegistrar
  .add("business1", child => lower(reverse(child.head)))
  .add("business2", child => upper(reverse(child.head)))
  .registerAll(spark)

dataset.createTempView("data")

spark.sql(
  """
    | SELECT business1(name), business2(name) FROM data
    |""".stripMargin)
  .show(false)
Output:
+--------------------+--------------------+
|lower(reverse(name))|upper(reverse(name))|
+--------------------+--------------------+
|sined |SINED |
|taram |TARAM |
|1taram |1TARAM |
|2taram |2TARAM |
+--------------------+--------------------+
Hope this helps.
The question is pretty much in the title: Is there an efficient way to count the distinct values in every column in a DataFrame?
The describe method provides only the count but not the distinct count, and I wonder if there is a way to get the distinct count for all (or some selected) columns.
In PySpark you could do something like this, using countDistinct():
from pyspark.sql.functions import col, countDistinct
df.agg(*(countDistinct(col(c)).alias(c) for c in df.columns))
Similarly, in Scala:
import org.apache.spark.sql.functions.countDistinct
import org.apache.spark.sql.functions.col
df.select(df.columns.map(c => countDistinct(col(c)).alias(c)): _*)
If you want to speed things up at the potential loss of accuracy, you could also use approxCountDistinct().
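For example, the newer approx_count_distinct function also accepts a maximum relative standard deviation, which lets you trade accuracy for speed. A small Scala sketch (the 0.05 tolerance is just an illustrative value):
import org.apache.spark.sql.functions.{approx_count_distinct, col}

// Allow up to 5% relative standard deviation in each per-column estimate.
df.select(df.columns.map(c => approx_count_distinct(col(c), rsd = 0.05).alias(c)): _*)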
Multiple aggregations would be quite expensive to compute. I suggest that you use approximation methods instead. In this case, approximating the distinct count:
val df = Seq((1,3,4),(1,2,3),(2,3,4),(2,3,5)).toDF("col1","col2","col3")
val exprs = df.columns.map((_ -> "approx_count_distinct")).toMap
df.agg(exprs).show()
// +---------------------------+---------------------------+---------------------------+
// |approx_count_distinct(col1)|approx_count_distinct(col2)|approx_count_distinct(col3)|
// +---------------------------+---------------------------+---------------------------+
// | 2| 2| 3|
// +---------------------------+---------------------------+---------------------------+
The approx_count_distinct method relies on HyperLogLog under the hood.
The HyperLogLog algorithm and its variant HyperLogLog++ (implemented in Spark) rely on the following clever observation.
If the numbers are spread uniformly across a range, then the count of distinct elements can be approximated from the largest number of leading zeros in the binary representation of the numbers.
For example, if we observe a number whose digits in binary form are of the form 0…(k times)…01…1, then we can estimate that there are in the order of 2^k elements in the set. This is a very crude estimate but it can be refined to great precision with a sketching algorithm.
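To make the intuition concrete, here is a toy Scala sketch of the single-estimator idea; it is only an illustration, not Spark's actual HyperLogLog++ implementation, which splits hashes into many registers and combines them with a harmonic mean:
// Hash each value and track the longest run of leading zeros across all 32-bit hashes.
val values = (1 to 1000).map(i => s"user_$i")
val maxLeadingZeros = values
  .map(v => Integer.numberOfLeadingZeros(v.hashCode))
  .max

// Crude estimate: roughly 2^k distinct elements.
val estimate = math.pow(2, maxLeadingZeros)
println(s"true = ${values.distinct.size}, estimate = $estimate")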
A thorough explanation of the mechanics behind this algorithm can be found in the original paper.
Note: since Spark 1.6, when Spark executes SELECT SOME_AGG(DISTINCT foo), SOME_AGG(DISTINCT bar) FROM df, each clause triggers a separate aggregation. This is different from SELECT SOME_AGG(foo), SOME_AGG(bar) FROM df, where we aggregate once. Thus the performance won't be comparable between count(distinct(_)) and approxCountDistinct (or approx_count_distinct).
It's one of the behavior changes since Spark 1.6:
With the improved query planner for queries having distinct aggregations (SPARK-9241), the plan of a query having a single distinct aggregation has been changed to a more robust version. To switch back to the plan generated by Spark 1.5’s planner, please set spark.sql.specializeSingleDistinctAggPlanning to true. (SPARK-12077)
Reference: Approximate Algorithms in Apache Spark: HyperLogLog and Quantiles.
If you just want the count for a particular column, then the following could help. Although it's a late answer, it might help someone. (Tested with PySpark 2.2.0.)
from pyspark.sql.functions import col, countDistinct
df.agg(countDistinct(col("colName")).alias("count")).show()
Adding to desaiankitb's answer, this would provide you a more intuitive answer:
from pyspark.sql.functions import count
df.groupBy(colname).count().show()
You can use the count(column name) function of SQL.
Alternatively, if you are doing data analysis and want a rough estimate rather than the exact count of each and every column, you can use the approx_count_distinct function:
approx_count_distinct(expr[, relativeSD])
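For instance, in a Spark SQL query (my_table and col1 are hypothetical names, and 0.05 is the allowed relative standard deviation):
spark.sql("SELECT approx_count_distinct(col1, 0.05) AS approx_distinct_col1 FROM my_table").show()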
This is one way to create a DataFrame containing every column's distinct count:
df = df.to_pandas_on_spark()
collect_df = []
for i in df.columns:
    collect_df.append({"field_name": i, "unique_count": df[i].nunique()})
uniquedf = spark.createDataFrame(collect_df)
The output would look like the one below. I used this with another DataFrame to compare values when the column names are the same; the other DataFrame was created the same way and then joined.
df_prod_merged = uniquedf1.join(uniquedf2, on='field_name', how="left")
This is an easy way to do it. It might be expensive on very huge data, like 1 TB, but it is still quite efficient when using to_pandas_on_spark().
I have to write a complex UDF in which I have to do a join with a different table and return the number of matches. The actual use case is much more complex, but I've simplified it here to a minimal reproducible example. Here is the UDF code.
def predict_id(date,zip):
filtered_ids = contest_savm.where((F.col('postal_code')==zip) & (F.col('start_date')>=date))
return filtered_ids.count()
When I define the UDF using the below code, I get a long list of console errors:
predict_id_udf = F.udf(predict_id,types.IntegerType())
The final line of the error is:
py4j.Py4JException: Method __getnewargs__([]) does not exist
I want to know the best way to go about this. I also tried map like this:
result_rdd = df.select("party_id").rdd\
.map(lambda x: predict_id(x[0],x[1]))\
.distinct()
It also resulted in a similar final error. I want to know if there is any way I can do a join within a UDF or map function, for each row of the original dataframe.
I have to write a complex UDF, in which I have to do a join with a different table, and return the number of matches.
It is not possible by design. If you want to achieve an effect like this, you have to use high-level DataFrame / RDD operators:
df.join(contest_savm,
    (F.col('postal_code') == df["zip"]) & (F.col('start_date') >= df["date"])
).groupBy(*df.columns).count()
I want to use Catalyst rules to transform a star-schema (https://en.wikipedia.org/wiki/Star_schema) SQL query into a SQL query against a denormalized star schema, where some fields from the dimension tables are represented in the fact table.
I tried to find some extension points for adding my own rules to perform the transformation described above, but I didn't find any. So I have the following questions:
How can I add my own rules to the Catalyst optimizer?
Is there another solution to implement the functionality described above?
Following #Ambling's advice, you can use sparkSession.experimental.extraStrategies to add your functionality to the SparkPlanner.
An example strategy that simply prints "Hello world!" to the console:
import org.apache.spark.sql.Strategy
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.execution.SparkPlan

object MyStrategy extends Strategy {
  def apply(plan: LogicalPlan): Seq[SparkPlan] = {
    println("Hello world!")
    Nil
  }
}
with an example run:
val spark = SparkSession.builder().master("local").getOrCreate()
spark.experimental.extraStrategies = Seq(MyStrategy)
val q = spark.catalog.listTables.filter(t => t.name == "five")
q.explain(true)
spark.stop()
You can find a working example project on friend's GitHub: https://github.com/bartekkalinka/spark-custom-rule-executor
As a pointer: as of Spark 2.0, you can plug in both extraStrategies and extraOptimizations through SparkSession.experimental.
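For custom optimizer rules specifically (as opposed to planner strategies), the same hook accepts Rule[LogicalPlan] instances. A minimal sketch; MyOptimizationRule is a hypothetical no-op rule that a real implementation would replace with the star-schema rewrite:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Hypothetical rule: logs the plan's root node and returns the plan unchanged.
// A real rule would pattern-match on joins against dimension tables and rewrite them.
object MyOptimizationRule extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = {
    logInfo(s"MyOptimizationRule visited: ${plan.nodeName}")
    plan
  }
}

val spark = SparkSession.builder().master("local").getOrCreate()
spark.experimental.extraOptimizations = Seq(MyOptimizationRule)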