I am currently working on Spark 1.6 using Scala. I want to get the quantiles of an integer column. Unfortunately, Spark doesn't have any quantile function in 1.6. However, I found that Hive has percentile_approx(). Is there any significant difference between the two, or can I just use percentile_approx instead of quantiles?
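For context, calling it from Spark 1.6 would look something like this. A minimal sketch, assuming Spark is built with Hive support; `sc`, `my_table`, and `my_int_col` are illustrative names, not from the question:

import org.apache.spark.sql.hive.HiveContext

// percentile_approx is a Hive UDAF, so a HiveContext is required in 1.6.
val hiveContext = new HiveContext(sc)
val quartiles = hiveContext.sql(
  "SELECT percentile_approx(my_int_col, array(0.25, 0.5, 0.75)) FROM my_table")
quartiles.show()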
Here, it is stated:
"…you can create Datasets within a Scala or Python…"
while here, the following is stated:
"Python does not have the support for the Dataset API"
Are Datasets available in Python?
Perhaps the question is about Typed Spark Datasets.
If so, then the answer is no.
The Spark Datasets mentioned there are only available in Scala and Java.
In the Python implementation of Spark (PySpark) you have to choose between DataFrames, the preferred option, and RDDs.
Reference:
RDD vs. DataFrame vs. Dataset
Update 2022-09-26: clarification regarding typed Spark Datasets
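For illustration, a minimal sketch of what the typed API looks like on the Scala side; the case class, sample data, and `spark` session are assumptions, not from the linked docs:

import org.apache.spark.sql.{DataFrame, Dataset}
import spark.implicits._

// Illustrative case class; its encoder is derived via spark.implicits.
case class Person(name: String, age: Int)

val ds: Dataset[Person] = Seq(Person("Alice", 29)).toDS() // statically typed
val df: DataFrame = ds.toDF()                             // untyped Dataset[Row]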
This question already has answers here:
How to find median and quantiles using Spark
(8 answers)
Closed 4 years ago.
I have a large grouped dataset in Spark for which I need to return the percentiles from 0.01 to 0.99.
I have been using online resources to determine different methods of doing this, from operations on RDDs:
How to compute percentiles in Apache Spark
To SQLContext functionality:
Calculate quantile on grouped data in spark Dataframe
My question is: does anyone have an opinion on the most efficient approach?
Also, as a bonus, SQLContext has functions for both percentile_approx and percentile. There isn't much documentation available online for percentile; is it just a non-approximated percentile_approx?
DataFrames will be more efficient in general. Read this for details on the reasons: https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html.
There are a few benchmarks out there as well. For example this one claims that "the new DataFrame API is faster than the RDD API for simple grouping and aggregations".
You can look up Hive's documentation to figure out the difference between percentile and percentile_approx.
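For instance, a rough sketch of the grouped computation through Spark SQL, assuming `sqlContext` is a HiveContext (percentile_approx is a Hive UDAF in this era of Spark); `df`, `events`, `group_col`, and `value_col` are illustrative names:

// Register the DataFrame and ask percentile_approx for all 99 cut points at once.
df.registerTempTable("events")
val percentiles = (1 to 99).map(_ / 100.0).mkString(", ")
val result = sqlContext.sql(s"""
  SELECT group_col,
         percentile_approx(value_col, array($percentiles)) AS pcts
  FROM events
  GROUP BY group_col
""")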
We currently use typed Datasets in our work, and we are exploring GraphFrames.
However, GraphFrames seem to be based on DataFrames, i.e. Dataset[Row]. Would GraphFrames be compatible with a typed Dataset, e.g. Dataset[Person]?
GraphFrames supports only DataFrames. To use a statically typed Dataset you have to convert it to a DataFrame, apply the graph operations, and convert the result back to the statically typed structure.
You can follow this issue: https://github.com/graphframes/graphframes/issues/133
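A minimal sketch of that round trip; the case classes, sample data, and `spark` session are illustrative assumptions:

import org.apache.spark.sql.Dataset
import org.graphframes.GraphFrame
import spark.implicits._

// GraphFrames expects an `id` column on vertices and `src`/`dst` on edges.
case class Person(id: String, name: String)
case class Knows(src: String, dst: String)

val people: Dataset[Person] = Seq(Person("a", "Ann"), Person("b", "Bo")).toDS()
val knows: Dataset[Knows] = Seq(Knows("a", "b")).toDS()

val g = GraphFrame(people.toDF(), knows.toDF())         // graph ops need DataFrames
val typedAgain: Dataset[Person] = g.vertices.as[Person] // back to the typed view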
Lately, I've been learning about Spark SQL, and I want to know: is there any way to use MLlib in Spark SQL, like:
select mllib_methodname(some column) from tablename;
where the "mllib_methodname" method is an MLlib method.
Is there some example shows how to use mllib methods in spark sql?
Thanks in advance.
The new pipeline API is based on DataFrames, which are backed by SQL. See
http://spark.apache.org/docs/latest/ml-guide.html
Or you can simply register the predict method from MLlib models as UDFs and use them in your SQL statement. See
http://spark.apache.org/docs/latest/sql-programming-guide.html#udf-registration-moved-to-sqlcontextudf-java--scala
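For example, a rough sketch of the UDF route with the RDD-based MLlib API; the toy training data, the `features` column, and the `tablename` table are illustrative assumptions:

import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.regression.LabeledPoint

// Train a toy model just so the sketch is self-contained.
val training = sc.parallelize(Seq(
  LabeledPoint(0.0, Vectors.dense(0.0)),
  LabeledPoint(1.0, Vectors.dense(1.0))))
val model = LogisticRegressionWithSGD.train(training, numIterations = 10)

// Register the model's predict method as a SQL UDF, then call it from SQL.
// Assumes the table's `features` column holds MLlib vectors.
sqlContext.udf.register("predict", (features: Vector) => model.predict(features))
sqlContext.sql("SELECT predict(features) AS prediction FROM tablename")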
This question already has answers here:
Stratified sampling in Spark
(2 answers)
Closed 4 years ago.
I'm in Spark 1.3.0 and my data is in DataFrames.
I need operations like sampleByKey(), sampleByKeyExact().
I saw the JIRA "Add approximate stratified sampling to DataFrame" (https://issues.apache.org/jira/browse/SPARK-7157).
That's targeted for Spark 1.5; until that comes through, what's the easiest way to accomplish the equivalent of sampleByKey() and sampleByKeyExact() on DataFrames?
Thanks & Regards
MK
Spark 1.1 added the stratified sampling routines sampleByKey and sampleByKeyExact to Spark Core, so since then they have been available without MLlib dependencies.
These two functions live in PairRDDFunctions and operate on key-value RDD[(K, V)], and DataFrames do not have keys. You'd have to use the underlying RDD, something like below:
val df = ...                       // your DataFrame
// the exact fraction desired from each key; keys come from the first column
val fractions: Map[Any, Double] = ...
val sample = df.rdd.keyBy(row => row(0)).sampleByKey(withReplacement = false, fractions)
Note that sample is now an RDD, not a DataFrame, but you can easily convert it back to a DataFrame since you already have the schema defined for df.
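For completeness, a minimal sketch of that conversion, reusing `df` and `sample` from the snippet above:

// Drop the keys added by keyBy and rebuild a DataFrame with df's schema.
val sampledDf = sqlContext.createDataFrame(sample.values, df.schema)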