I'm writing some custom additions to Spark SQL, and in some places I need a reference to a DataFrame.
I've already checked questions like this one, and I know that a logical plan can be converted back into a DataFrame with Dataset.ofRows(sparkSession, plan).
My question is more about the implications of this: will the conversion trigger a job or a data scan? I'm not performing any explicit action on this DataFrame, just chaining it through some more transformations.
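For context, the kind of extension I'm writing looks roughly like this (names are illustrative; Dataset.ofRows is private[sql], so this sketch assumes code living under the org.apache.spark.sql package):

package org.apache.spark.sql

import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

object MyExtension {
  def chainMore(spark: SparkSession, plan: LogicalPlan): DataFrame = {
    // Wrap the existing logical plan back into a DataFrame...
    val df = Dataset.ofRows(spark, plan)
    // ...and only chain further (lazy) transformations; no action is called here.
    df.filter("value IS NOT NULL").select("value")
  }
}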
Please help me here, and let me know if any more details are required.
What is the purpose of Apache Arrow? It converts from one binary format to another, but why do I need that? If I have a Spark program, then Spark can read Parquet, so why do I need to convert it into another format midway through my processing?
Is it to pass that data in memory to another language like python or java without having to write it to a text/json format?
Disclaimer: This question is broad and I am somewhat involved with the Apache Arrow project, so my answer may or may not be biased.
This question is broad in the sense that a "When should I use NoSQL?" type of question is broad. It depends. This answer is based on the assumption that you already have a Spark pipeline. This answer is not an attempt at Spark vs. Arrow (which is even broader, to the point that I wouldn't touch it).
Many Apache Spark pipelines would never need to use Arrow. Spark, unlike Arrow-based pipelines, has its own in-memory dataframe format (https://spark.apache.org/docs/1.6.1/api/java/org/apache/spark/sql/DataFrame.html) which, to my knowledge, cannot be zero-copied to Arrow. So converting from one format to the other is likely to introduce a performance hit of some kind and any benefit you achieve is going to have to be weighed against that.
You brought up one great example, which is switching to other languages / libraries. For example, Spark currently uses Arrow to apply a Pandas UDF (https://spark.apache.org/docs/latest/api/python/user_guide/arrow_pandas.html). In this case, whenever you are going to a library that doesn't use Spark's in-memory format (which means any non-Java library and some Java libraries) you are going to have to do a translation between in-memory formats and so you are going to pay the performance hit anyways and you might as well switch to Arrow.
There are some things that are faster with Arrow's format than Spark's format. I'm not going to try and list those here because, for the most part, the benefit isn't going to outweigh the cost of going Spark -> Arrow in the first place and I don't know that I have enough information to do so in any sort of comprehensive way. Instead, I'll provide one concrete example:
A common case for Arrow is when you need to transfer a table between processes that are on the same machine (or have a very fast I/O channel in between). In that case the cost of serializing to Parquet and then deserializing back (Spark must do this to go Spark Dataframe -> Parquet -> Wire -> Parquet -> Spark Dataframe) is often greater than the I/O saved (Parquet is more compact than the Spark Dataframe format, so you will save some in transmission). If you have a lot of this type of communication, it may be beneficial to leave Spark, do these transmissions in Arrow, and then return to Spark.
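As a very rough sketch of that kind of handoff, here is what writing a single record batch to an Arrow IPC stream looks like with the Arrow Java API (used from Scala; the output path and column values are made up, and class names may differ slightly across Arrow releases):

import java.io.FileOutputStream
import org.apache.arrow.memory.RootAllocator
import org.apache.arrow.vector.{IntVector, VectorSchemaRoot}
import org.apache.arrow.vector.ipc.ArrowStreamWriter

val allocator = new RootAllocator(Long.MaxValue)

// Build one column of data in Arrow's columnar memory layout.
val ids = new IntVector("id", allocator)
ids.allocateNew(3)
(0 until 3).foreach(i => ids.setSafe(i, i * 10))
ids.setValueCount(3)

// Ship it as a record batch over any byte channel (socket, pipe, file, ...).
val root = VectorSchemaRoot.of(ids)
val writer = new ArrowStreamWriter(root, null, new FileOutputStream("/tmp/batch.arrows"))
writer.start()
writer.writeBatch()
writer.end()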
I know the advantages of Datasets (type safety etc.), but I can't find any documentation on the limitations of Spark Datasets.
Are there any specific scenarios where a Spark Dataset is not recommended and it is better to use a DataFrame?
Currently all our data engineering flows use Spark (Scala) DataFrames.
We would like to make use of Datasets for all our new flows, so knowing all the limitations/disadvantages of Datasets would help us.
EDIT: This is not similar to Spark 2.0 Dataset vs DataFrame, which explains some operations on DataFrames/Datasets, or to other questions, most of which explain the differences between RDDs, DataFrames and Datasets and how they evolved. This is targeted at knowing when NOT to use Datasets.
There are a few scenarios where I find that a Dataframe (or Dataset[Row]) is more useful than a typed dataset.
For example, when I'm consuming data without a fixed schema, like JSON files containing records of different types with different fields. Using a Dataframe I can easily "select" out the fields I need without needing to know the whole schema, or even use a runtime configuration to specify the fields I'll access.
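A rough sketch of what I mean (the file path and field names are made up):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("loose-json").getOrCreate()

// The schema is inferred from the files; we never have to spell it out.
val df = spark.read.json("/data/mixed_records/*.json")

// The list of fields to keep could just as well come from runtime configuration.
val wanted = Seq("id", "eventType", "payload.userId")
df.select(wanted.map(col): _*).show()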
Another consideration is that Spark can better optimize the built-in Spark SQL operations and aggregations than UDAFs and custom lambdas. So if you want to get the square root of a value in a column, that's a built-in function (df.withColumn("rootX", sqrt("X"))) in Spark SQL but doing it in a lambda (ds.map(X => Math.sqrt(X))) would be less efficient since Spark can't optimize your lambda function as effectively.
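A minimal sketch of that comparison, assuming a DataFrame df with a numeric column X:

import org.apache.spark.sql.functions.sqrt
import spark.implicits._

// Built-in expression: Catalyst can optimize and code-generate it.
val viaExpression = df.withColumn("rootX", sqrt($"X"))

// Lambda: opaque to the optimizer, and rows are deserialized into JVM values first.
val viaLambda = df.select($"X").as[Double].map(math.sqrt)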
There are also many untyped Dataframe functions (like statistical functions) that are implemented for Dataframes but not typed Datasets, and you'll often find that even if you start out with a Dataset, by the time you've finished your aggregations you're left with a Dataframe because the functions work by creating new columns, modifying the schema of your dataset.
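A tiny illustration of that last point (assuming the spark implicits are in scope):

case class Point(x: Double, y: Double)
val ds = Seq(Point(1.0, 2.0), Point(3.0, 4.0)).toDS()

// Both of these untyped operations hand back a DataFrame (Dataset[Row]),
// so the Point type information is gone afterwards.
val summary = ds.describe("x", "y")
val widened = ds.withColumn("sum", $"x" + $"y")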
In general I don't think you should migrate from working Dataframe code to typed Datasets unless you have a good reason to. Many of the Dataset features are still flagged as "experimental" as of Spark 2.4.0, and as mentioned above not all Dataframe features have Dataset equivalents.
Limitations of Spark Datasets:
Datasets used to be less performant (not sure if that's been fixed yet)
You need to define a new case class whenever you change the Dataset schema, which is cumbersome
Datasets don't offer as much type safety as you might expect. We can pass the reverse function a date object and it'll return a garbage response rather than erroring out.
import java.sql.Date
import org.apache.spark.sql.functions.reverse
import spark.implicits._  // assumes a SparkSession named spark is in scope

case class Birth(hospitalName: String, birthDate: Date)

val birthsDS = Seq(
  Birth("westchester", Date.valueOf("2014-01-15"))
).toDS()

// reverse() silently treats the date as a string and reverses its characters
birthsDS.withColumn("meaningless", reverse($"birthDate")).show()
+------------+----------+-----------+
|hospitalName| birthDate|meaningless|
+------------+----------+-----------+
| westchester|2014-01-15| 51-10-4102|
+------------+----------+-----------+
I am trying to process data from Kafka using Spark Structured Streaming. The code for ingesting the data is as follows:
import org.apache.spark.sql.functions.from_json
import spark.implicits._  // for the $ column syntax

val enriched = df.select($"value" cast "string" as "json")
  .select(from_json($"json", schema) as "data")
  .select("data.*")
df is a DataFrame with the data consumed from Kafka.
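For context, it is created roughly like this (broker address and topic name are placeholders):

val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "mytopic")
  .load()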
The problem comes when I try to read it as JSON in order to do faster queries. The from_json() function from org.apache.spark.sql.functions requires a schema. What if the messages have some different fields?
As #zero323 and the answer he or she referenced suggest, you are asking a contradictory question: essentially, how does one impose a schema when one doesn't know the schema? One can't, of course. I think the idea of using open-ended collection types is your best option.
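For example (a sketch, reusing the streaming df from the question and assuming the spark implicits are in scope), parsing into an open-ended map instead of a fixed struct looks like this:

import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{MapType, StringType}

// Every message becomes a Map[String, String]; fields can be pulled out
// by key at query time, at the cost of losing typed, named columns.
val loose = df
  .select($"value".cast("string").as("json"))
  .select(from_json($"json", MapType(StringType, StringType)).as("data"))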
Ultimately though, it is almost certainly true that you can represent your data with a case class, even if it means using a lot of Options, strings you need to parse, and maps you need to interrogate. Invest in the effort to define that case class. Otherwise, your Spark jobs will essentially be a lot of ad hoc, time-consuming busywork.
I have a MySQL database with a single table containing about 100 million records (~25GB, ~5 columns). Using Apache Spark, I extract this data via a JDBC connector and store it in a DataFrame.
From here, I do some pre-processing of the data (e.g. replacing the NULL values), so I absolutely need to go through each record.
Then I would like to perform dimensionality reduction and feature selection (e.g. using PCA), perform clustering (e.g. K-Means) and later on do the testing of the model on new data.
I have implemented this in Spark's Java API, but it is too slow (for my purposes) since I do a lot of copying of the data from a DataFrame to a java.util.Vector and java.util.List (to be able to iterate over all records and do the pre-processing), and later back to a DataFrame (since PCA in Spark expects a DataFrame as input).
I have tried extracting information from the database into a org.apache.spark.sql.Column but cannot find a way to iterate over it.
I also tried avoiding the use of Java data structures (such as List and Vector) by using the org.apache.spark.mllib.linalg.{DenseVector, SparseVector}, but cannot get that to work either.
Finally, I also considered using JavaRDD (by creating it from a DataFrame and a custom schema), but couldn't work it out entirely.
After a lengthy description, my question is: is there a way to do all steps mentioned in the first paragraph, without copying all the data into a Java data structure?
Maybe one of the options I tried could actually work, but I just can't seem to find out how, as the docs and literature on Spark are a bit scarce.
From the wording of your question, it seems there is some confusion about the stages of Spark processing.
First, we tell Spark what to do by specifying inputs and transformations. At this point, the only things that are known are (a) the number of partitions at various stages of processing and (b) the schema of the data. org.apache.spark.sql.Column is used at this stage to identify the metadata associated with a column. However, it doesn't contain any of the data. In fact, there is no data at all at this stage.
Second, we tell Spark to execute an action on a dataframe/dataset. This is what kicks off processing. The input is read and flows through the various transformations and into the final action operation, be it collect or save or something else.
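A minimal sketch of those two stages (connection details are placeholders):

// Stage 1: only a plan and a schema exist; nothing is read from MySQL yet.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://host:3306/db")
  .option("dbtable", "records")
  .option("user", "reader")
  .option("password", "secret")
  .load()
val cleaned = df.na.fill(0)   // still just a transformation in the plan

// Stage 2: an action triggers the actual read and computation.
val total = cleaned.count()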
So, that explains why you cannot "extract information from the database into" a Column.
As for the core of your question, it's hard to comment without seeing your code and knowing exactly what it is you are trying to accomplish, but it is safe to say that this much migrating between types is a bad idea.
Here are a couple of questions that might help guide you to a better outcome:
Why can't you perform the data transformations you need by operating directly on the Row instances?
Would it be convenient to wrap some of your transformation code into a UDF or UDAF?
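For instance, the NULL-replacement and feature-preparation steps described in the question can usually stay inside the DataFrame API, so nothing needs to be copied into Java collections (column names are made up):

import org.apache.spark.ml.feature.VectorAssembler

// Replace NULLs with defaults without leaving the DataFrame.
val cleaned = df.na.fill(Map("age" -> 0, "income" -> 0.0))

// Build the vector column that PCA / KMeans expect, still as a DataFrame.
val assembler = new VectorAssembler()
  .setInputCols(Array("age", "income"))
  .setOutputCol("features")
val features = assembler.transform(cleaned)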
Hope this helps.
The Spark SQL DataFrame/Dataset execution engine has several extremely efficient time and space optimizations (e.g. InternalRow and expression codegen). According to much of the documentation, it seems to be a better option than RDD for most distributed algorithms.
However, I did some source code research and am still not convinced. I have no doubt that InternalRow is much more compact and can save a large amount of memory. But execution of algorithms may not be any faster, save for the predefined expressions. Namely, the source code of org.apache.spark.sql.catalyst.expressions.ScalaUDF indicates that every user-defined function does 3 things:
1. convert the Catalyst type (used in InternalRow) to the Scala type (used in GenericRow),
2. apply the function,
3. convert the result back from the Scala type to the Catalyst type.
Apparently this is even slower than just applying the function directly on an RDD without any conversion. Can anyone confirm or deny my speculation with some real-case profiling and code analysis?
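For concreteness, the kind of comparison I have in mind (the column name is made up):

import org.apache.spark.sql.functions.udf
import spark.implicits._

// Goes through ScalaUDF: Catalyst -> Scala conversion, apply, Scala -> Catalyst.
val plusOne = udf((x: Long) => x + 1)
val viaUdf  = df.withColumn("y", plusOne($"x"))

// Pure Catalyst expression: no conversion, eligible for whole-stage codegen.
val viaExpr = df.withColumn("y", $"x" + 1)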
Thank you so much for any suggestion or insight.
From the Databricks blog article A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets:
When to use RDDs?
Consider these scenarios or common use cases for using RDDs when:
- you want low-level transformation and actions and control on your dataset;
- your data is unstructured, such as media streams or streams of text;
- you want to manipulate your data with functional programming constructs than domain specific expressions;
- you don’t care about imposing a schema, such as columnar format, while processing or accessing data attributes by name or column;
- and you can forgo some optimization and performance benefits available with DataFrames and Datasets for structured and semi-structured data.
In High Performance Spark's Chapter 3, "DataFrames, Datasets, and Spark SQL", you can see some of the performance gains you can get with the Dataframe/Dataset API compared to RDDs.
And in the Databricks article mentioned above, you can also find that the Dataframe API optimizes space usage compared to RDDs.
I think a Dataset is essentially a schema-aware RDD: when you create a Dataset, you give it a StructType.
In fact, after the logical plan and the physical plan are produced, a Dataset is ultimately executed as RDD operators under the hood. Maybe this is why a plain RDD can sometimes perform better than a Dataset.