This question already has answers here:
Difference between DataFrame, Dataset, and RDD in Spark
(14 answers)
Closed 4 years ago.
In Spark, there are often operations like this:
hiveContext.sql("select * from demoTable").show()
When I look up the show() method in the official Spark API, the result is like this:
(screenshot of the API documentation omitted)
And when I change the keyword to 'Dataset', I find that the method used on the DataFrame actually belongs to Dataset. How does this happen? Is there any implication?
According to the documentation:
A Dataset is a distributed collection of data.
And:
A DataFrame is a Dataset organized into named columns.
So, technically:
DataFrame is equivalent to Dataset<Row>
And one last quote:
In the Scala API, DataFrame is simply a type alias of Dataset[Row]. While, in Java API, users need to use Dataset to represent a DataFrame.
In short, the concrete type is Dataset.
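To make this concrete, here is a minimal Scala sketch (assuming a Hive-enabled SparkSession and the demoTable from the question) showing that a DataFrame and a Dataset[Row] are interchangeable:
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

val spark = SparkSession.builder()
  .appName("dataframe-is-dataset")
  .enableHiveSupport()
  .getOrCreate()

// In the Scala API, DataFrame is declared as `type DataFrame = Dataset[Row]`,
// so the assignment below compiles without any conversion.
val df: DataFrame = spark.sql("select * from demoTable")
val ds: Dataset[Row] = df

// show() is documented on Dataset, which is why the API search lands there,
// yet it is callable on a DataFrame as well.
ds.show()
df.show()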
This question already has answers here:
How to understand the format type of libsvm of Spark MLlib?
(1 answer)
How can I read LIBSVM models (saved using LIBSVM) into PySpark?
(1 answer)
Closed 4 years ago.
I am reading about binary classification with the data used in Spark ML. I have read the Java code of Spark and I am aware of binary classification, but I am not able to understand how these data are generated. For example: https://github.com/apache/spark/blob/master/data/mllib/sample_binary_classification_data.txt
This link is a sample for binary classification. If I want to generate this type of data, how do I do that?
Usually, the first column is the class label (in this case 0/1) and the other columns are the feature values.
To generate the data yourself you can, for instance, use a random generator, but it depends on the problem you are working on.
If you need to download datasets to apply classification algorithms, you can use repositories such as the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/index.php
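If you just want to produce synthetic data in the same LIBSVM text format (a label followed by 1-based index:value pairs), here is a rough Scala sketch; the file name, row count, feature count, and random values are arbitrary choices for illustration:
import java.io.PrintWriter
import scala.util.Random

val rnd = new Random(42)
val numRows = 100
val numFeatures = 10

val out = new PrintWriter("synthetic_binary_classification_data.txt")
for (_ <- 0 until numRows) {
  // Each line: "<label> <index1>:<value1> <index2>:<value2> ..." with 1-based indices.
  val label = rnd.nextInt(2)  // binary class label: 0 or 1
  val features = (1 to numFeatures).map(i => f"$i:${rnd.nextDouble()}%.4f")
  out.println((label.toString +: features).mkString(" "))
}
out.close()
Alternatively, if you already have an RDD[LabeledPoint], MLUtils.saveAsLibSVMFile from org.apache.spark.mllib.util writes it out in this same format.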
This question already has answers here:
Spark RDD to DataFrame python
(3 answers)
Closed 4 years ago.
This is how my pipelined RDD looks:
[([3.0, 12.0, 8.0, 49.0, 27.0], 7968.0),
([165.0, 140.0, 348.0, 615.0, 311.0], 165.0)]
I want to convert this to a DataFrame. I have tried converting the first element (in square brackets) to an RDD and the second one to an RDD and then converting them individually to DataFrames. I have also tried setting a schema and converting it, but it has not worked. Can anybody help?
Thanks!
You need to flatten your RDD before converting to a DataFrame:
# flatten each (features, label) pair into a single row
df = rdd.map(lambda row: row[0] + [row[1]]).toDF()
You can specify the schema argument of toDF() to get meaningful column names and/or types.
This question already has answers here:
How to find median and quantiles using Spark
(8 answers)
Closed 4 years ago.
I have a large grouped dataset in Spark for which I need to return the percentiles from 0.01 to 0.99.
I have been using online resources to determine different methods of doing this, from operations on RDDs:
How to compute percentiles in Apache Spark
To SQLContext functionality:
Calculate quantile on grouped data in spark Dataframe
My question is: does anyone have an opinion on which is the most efficient approach?
Also, as a bonus, SQLContext has functions for both percentile_approx and percentile. There isn't much documentation available online for 'percentile'; is it just a non-approximated version of 'percentile_approx'?
DataFrames will be more efficient in general. Read this for details on the reasons: https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html
There are a few benchmarks out there as well. For example, this one claims that "the new DataFrame API is faster than the RDD API for simple grouping and aggregations".
You can look up Hive's documentation to figure out the difference between percentile and percentile_approx.
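For illustration, here is a rough Scala sketch of the SQL route, assuming hiveContext is a HiveContext (or a Spark 2.x session where these functions are built in) and a hypothetical table my_table with a grouping column grp and a numeric column metric:
// percentile_approx accepts an array of percentiles and returns an array of results,
// so all 99 values can be computed in one aggregation per group.
val percentiles = (1 to 99).map(_ / 100.0).mkString(", ")

val result = hiveContext.sql(
  s"""SELECT grp,
     |       percentile_approx(metric, array($percentiles)) AS pctls
     |FROM my_table
     |GROUP BY grp
   """.stripMargin)

result.show()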
This question already has an answer here:
DataFrame / Dataset groupBy behaviour/optimization
(1 answer)
Closed 5 years ago.
I know that with RDDs we were discouraged from using groupByKey and encouraged to use alternatives such as reduceByKey() and aggregateByKey(), since these methods reduce on each partition first and only then combine the partial results, which reduces the amount of data being shuffled.
Now, my question is whether this still applies to Dataset/DataFrame. I was thinking that since the Catalyst engine does a lot of optimization, it will automatically know that it should reduce on each partition and then perform the groupBy. Am I correct, or do we still need to take steps to ensure that reduction on each partition is performed before groupBy?
groupBy is fine to use on DataFrames and Datasets. Your thinking is completely right: the Catalyst optimizer will build the plan and optimize all the inputs to groupBy and any other aggregations you want to do.
There is a good example from Spark 1.4 at this link that shows a comparison of reduceByKey on an RDD with groupBy on a DataFrame.
And you can see that it is much faster than the RDD version, because groupBy optimizes the whole execution. For more details, see the official Databricks post introducing DataFrames.
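As a small illustration, here is a Scala sketch with toy data (Spark 2.x style) where explain() shows that Catalyst inserts a partial aggregation before the shuffle, which is exactly the map-side reduction you would otherwise write by hand with reduceByKey:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

val spark = SparkSession.builder().appName("groupBy-plan").master("local[*]").getOrCreate()
import spark.implicits._

// Toy dataset of (key, value) pairs; the columns are named _1 and _2 by default.
val ds = Seq(("a", 1), ("a", 2), ("b", 3), ("b", 4)).toDS()

val agg = ds.groupBy($"_1").agg(sum($"_2"))

// The physical plan shows HashAggregate (partial) -> Exchange -> HashAggregate (final):
// the partial aggregate runs per partition before the shuffle, just like reduceByKey.
agg.explain()
agg.show()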
This question already has answers here:
Stratified sampling in Spark
(2 answers)
Closed 4 years ago.
I'm in Spark 1.3.0 and my data is in DataFrames.
I need operations like sampleByKey(), sampleByKeyExact().
I saw the JIRA "Add approximate stratified sampling to DataFrame" (https://issues.apache.org/jira/browse/SPARK-7157).
That's targeted for Spark 1.5; until that comes through, what's the easiest way to accomplish the equivalent of sampleByKey() and sampleByKeyExact() on DataFrames?
Thanks & Regards
MK
Spark 1.1 added the stratified sampling routines sampleByKey and sampleByKeyExact to Spark Core, so since then they have been available without MLlib dependencies.
These two functions are PairRDDFunctions and belong to key-value RDD[(K, T)], and DataFrames do not have keys. You'd have to use the underlying RDD, something like below:
val df = ... // your DataFrame
val fractions: Map[Any, Double] = ... // the exact fraction desired from each key (keyBy below yields keys of type Any)
val sample = df.rdd.keyBy(x => x(0)).sampleByKey(false, fractions)
Note that sample is now an RDD, not a DataFrame, but you can easily convert it back to a DataFrame since you already have the schema defined for df.
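For example, a minimal sketch of that conversion, dropping the temporary key and reusing df's schema with the same SQLContext that created it:
// Strip the key added by keyBy and rebuild a DataFrame with the original schema.
val sampledDf = df.sqlContext.createDataFrame(sample.values, df.schema)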