This question already has answers here:
Stratified sampling in Spark
(2 answers)
Closed 4 years ago.
I'm in Spark 1.3.0 and my data is in DataFrames.
I need operations like sampleByKey(), sampleByKeyExact().
I saw the JIRA "Add approximate stratified sampling to DataFrame" (https://issues.apache.org/jira/browse/SPARK-7157).
That's targeted for Spark 1.5. Until that lands, what's the easiest way to accomplish the equivalent of sampleByKey() and sampleByKeyExact() on DataFrames?
Thanks & Regards
MK
Spark 1.1 added the stratified sampling routines sampleByKey and sampleByKeyExact to Spark Core, so since then they are available without MLlib dependencies.
These two functions are part of PairRDDFunctions and operate on key-value RDD[(K, V)]; DataFrames do not have keys. You'd have to use the underlying RDD, something like below:
val df = ... // your DataFrame
val fractions: Map[K, Double] = ... // the exact fraction desired from each key
val sample = df.rdd.keyBy(x => x(0)).sampleByKey(false, fractions)
Note that sample is now an RDD, not a DataFrame, but you can easily convert it back to a DataFrame since you already have the schema defined for df.
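For completeness, here is a minimal runnable sketch of the same idea. The "label" key column, the example data, and the fractions are made up for illustration, and it uses SparkSession for brevity; on Spark 1.3 the same approach works with SQLContext and its createDataFrame method.

import org.apache.spark.sql.SparkSession

// A minimal sketch: sample a DataFrame by key via its underlying RDD and
// rebuild a DataFrame from the sampled rows using the original schema.
// (Column names and fractions are hypothetical.)
object SampleByKeyOnDataFrame {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("sampleByKey").getOrCreate()
    import spark.implicits._

    val df = Seq(("a", 1), ("a", 2), ("a", 3), ("b", 4), ("b", 5)).toDF("label", "value")

    // Desired sampling fraction per key.
    val fractions = Map("a" -> 0.5, "b" -> 1.0)

    // Key the rows, sample by key, then drop the key again.
    val sampledRows = df.rdd
      .keyBy(row => row.getString(0))
      .sampleByKey(withReplacement = false, fractions)
      .values

    // Rebuild a DataFrame from the sampled rows and the original schema.
    val sampledDf = spark.createDataFrame(sampledRows, df.schema)
    sampledDf.show()

    spark.stop()
  }
}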
Here, it is stated:
..you can create Datasets within a Scala or Python..
while here, the following is stated:
Python does not have the support for the Dataset API
Are Datasets available in Python?
Perhaps the question is about typed Spark Datasets.
If so, then the answer is no.
The typed Dataset API is only available in Scala and Java.
In the Python implementation of Spark (PySpark) you have to choose between DataFrames (the preferred choice) and RDDs.
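For context, here is a small Scala sketch of the typed Dataset API that PySpark does not expose (the case class and data are made up for illustration); in Python you would work with the equivalent DataFrame instead.

import org.apache.spark.sql.{Dataset, SparkSession}

// A tiny illustration (hypothetical case class) of the typed Dataset API,
// which exists only in Scala and Java; PySpark works with DataFrames/RDDs instead.
case class Click(userId: Long, url: String)

object TypedDatasetExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("typed-ds").getOrCreate()
    import spark.implicits._

    val ds: Dataset[Click] = Seq(Click(1L, "/home"), Click(2L, "/docs")).toDS()

    // Typed, compile-time-checked transformation: c.url is a field access,
    // not a string column name.
    val docsOnly = ds.filter(c => c.url.startsWith("/docs"))
    docsOnly.show()

    spark.stop()
  }
}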
Reference:
RDD vs. DataFrame vs. Dataset
Update 2022-09-26: Clarification regarding typed Spark Datasets
I am trying to understand whether there is a relationship between RDDs and DataFrames/Datasets from a technical point of view. RDDs are often described as the fundamental data abstraction in Spark. In my understanding this would mean that DataFrames/Datasets should also be based on them. In the original Spark SQL paper, figures 1 and 3 point to this connection. However, I haven't found any documentation on what this connection looks like (if it exists at all).
So my question: are DataFrames/Datasets based on RDDs, or are these two concepts independent?
DataFrames and Datasets are based on RDDs, although this is somewhat hidden. DataFrames and Datasets live mostly in the spark-sql project, whereas RDDs live in spark-core.
Here is the technical view of how a DataFrame (which is Dataset[Row]) and an RDD are linked: a DataFrame has a QueryExecution, which controls how the whole SQL execution behaves. When the plan is executed by the engine, the output is an internal RDD of rows, lazy val toRdd: RDD[InternalRow] = executedPlan.execute(). Given that RDD and a schema, Spark forms a DataFrame.
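To make that link concrete, here is a small sketch (the example data is made up) showing both the public RDD view of a DataFrame and the internal RDD[InternalRow] mentioned above:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.{Row, SparkSession}

// A small sketch of the RDDs underneath a DataFrame.
object DataFrameToRdd {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("df-rdd").getOrCreate()
    import spark.implicits._

    val df = Seq((1, "a"), (2, "b")).toDF("id", "name")

    // Public API: the DataFrame deserialized into an RDD of Row objects.
    val rows: RDD[Row] = df.rdd

    // Internal view: the lazily built RDD[InternalRow] produced by the executed plan
    // (the toRdd value quoted above lives on the DataFrame's QueryExecution).
    val internal: RDD[InternalRow] = df.queryExecution.toRdd

    println(s"rows = ${rows.count()}, internal rows = ${internal.count()}")
    spark.stop()
  }
}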
This question already has answers here:
Difference between DataFrame, Dataset, and RDD in Spark
(14 answers)
Closed 4 years ago.
In Spark, there are often operations like this:
hiveContext.sql("select * from demoTable").show()
When I look up the show() method in the official Spark API and then change the keyword to 'Dataset', I find that the method used on the DataFrame actually belongs to Dataset. How does that happen? Is there any implication?
According to the documentation:
A Dataset is a distributed collection of data.
And:
A DataFrame is a Dataset organized into named columns.
So, technically:
DataFrame is equivalent to Dataset<Row>
And one last quote:
In the Scala API, DataFrame is simply a type alias of Dataset[Row]. While, in Java API, users need to use Dataset to represent a DataFrame.
In short, the concrete type is Dataset.
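As a small illustration (the case class and data here are made up), the Scala alias means a DataFrame can be assigned to a Dataset[Row] directly, and show() is inherited from Dataset:

import org.apache.spark.sql.{Dataset, Row, SparkSession}

// Hypothetical case class used only for the typed view below.
case class Person(name: String, age: Int)

object DataFrameIsDataset {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("df-is-ds").getOrCreate()
    import spark.implicits._

    val df = Seq(Person("Ann", 30), Person("Bob", 25)).toDF()  // DataFrame
    val untyped: Dataset[Row] = df                              // compiles: same type in Scala
    val typed: Dataset[Person] = df.as[Person]                  // typed view of the same data

    df.show()      // show() is defined on Dataset, so it is available on a DataFrame too
    typed.show()

    spark.stop()
  }
}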
This question already has answers here:
How to find median and quantiles using Spark
(8 answers)
Closed 4 years ago.
I have a large grouped dataset in Spark for which I need to return the percentiles from 0.01 to 0.99.
I have been using online resources to determine different methods of doing this, from operations on RDD:
How to compute percentiles in Apache Spark
To SQLContext functionality:
Calculate quantile on grouped data in spark Dataframe
My question is: does anyone have an opinion on which is the most efficient approach?
Also, as a bonus, SQLContext has functions for both percentile_approx and percentile. There isn't much documentation available online for percentile; is it just a non-approximated version of percentile_approx?
DataFrames will generally be more efficient. Read this for details on the reasons: https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html.
There are a few benchmarks out there as well. For example, this one claims that "the new DataFrame API is faster than the RDD API for simple grouping and aggregations".
You can look up Hive's documentation to figure out the difference between percentile and percentile_approx.
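As a sketch of the DataFrame/SQL route for grouped percentiles (the column names "grp" and "value" are made up, and this assumes a Spark version where percentile_approx is available as a SQL function, natively in newer releases or via Hive support in older ones):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

// A sketch: approximate percentiles 0.01..0.99 per group via percentile_approx.
object GroupedPercentiles {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("percentiles").getOrCreate()
    import spark.implicits._

    val df = Seq(("a", 1.0), ("a", 2.0), ("a", 3.0), ("b", 10.0), ("b", 20.0))
      .toDF("grp", "value")

    // Build the SQL array literal array(0.01, 0.02, ..., 0.99).
    val percentiles = (1 to 99).map(_ / 100.0).mkString("array(", ", ", ")")

    val result = df.groupBy("grp")
      .agg(expr(s"percentile_approx(value, $percentiles)").as("pcts"))

    result.show(truncate = false)
    spark.stop()
  }
}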
This question already has an answer here:
DataFrame / Dataset groupBy behaviour/optimization
(1 answer)
Closed 5 years ago.
I know that with RDDs we were discouraged from using groupByKey and encouraged to use alternatives such as reduceByKey() and aggregateByKey(), since these methods reduce on each partition first and thus reduce the amount of data being shuffled.
Now, my question is whether this still applies to Dataset/DataFrame. I was thinking that since the Catalyst engine does a lot of optimization, it will automatically know that it should reduce on each partition and then perform the groupBy. Am I correct, or do we still need to take steps to ensure that reduction on each partition is performed before the groupBy?
groupBy is the method to use with DataFrames and Datasets. Your thinking is right: the Catalyst optimizer will build the plan and optimize the groupBy and whatever other aggregations you want to do, including a partial aggregation on each partition before the shuffle.
There is a good example, based on Spark 1.4, at this link showing a comparison of reduceByKey on an RDD with groupBy on a DataFrame.
You can see that it is much faster than the RDD version, because groupBy lets Catalyst optimize the whole execution. For more details you can see the official Databricks post introducing DataFrames.
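For a concrete side-by-side (the data and column names are made up), here is a sketch contrasting the RDD and DataFrame styles of aggregating by key; the DataFrame version goes through Catalyst, which plans a partial (map-side) aggregation on each partition before the shuffle:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

// A sketch comparing RDD reduceByKey with DataFrame groupBy + agg.
object GroupByComparison {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("groupBy-demo").getOrCreate()
    import spark.implicits._

    val pairs = Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4))

    // RDD style: reduceByKey combines values on each partition before shuffling.
    val rddResult = spark.sparkContext.parallelize(pairs).reduceByKey(_ + _)
    rddResult.collect().foreach(println)

    // DataFrame style: groupBy + agg; Catalyst inserts a partial aggregate before the shuffle.
    val dfResult = pairs.toDF("key", "value").groupBy("key").agg(sum("value").as("total"))
    dfResult.show()

    spark.stop()
  }
}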