Is there an easy way to convert an RDD to a Dataset (or DataFrame) in Mobius? Basically something similar to the functionality provided by Scala's
import sqlContext.implicits._
I know there's sqlContext.CreateDataFrame() but as far as I can tell that requires me to define my own StructType in order to do the conversion.
No. For now, sqlContext.CreateDataFrame is the only option. Feel free to create an issue in the Mobius repo to get the discussion started if you think ToDF() is required on RDDs.
I am very new to PySpark and I am learning about UDFs myself. I realize UDFs can sometimes slow down your code. I want to know about your experience: what UDFs have you applied that could not be achieved with built-in PySpark code alone? Are there any useful UDFs that help with cleaning data? Apart from the PySpark documentation, is there any other resource that can help me learn about UDFs?
You can find most of the functionality you need among Spark's standard library functions.
import pyspark.sql.functions - Check the docs here https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions
Now, sometimes you do have to create custom UDFs, but be aware that they slow things down, since Spark has to evaluate them for every DataFrame row.
Try to avoid this as much as you can.
When you don't have any other option, use it, but try to minimize the complexity and the number of external libraries you use.
Another approach is to use an RDD: convert your DataFrame to an RDD (MYDF.rdd) and then call mapPartitions or map, which accept a function that manipulates your data.
mapPartitions basically sends the data in chunks, each one a list of Spark Row objects.
Read more about map vs. mapPartitions here: https://sparkbyexamples.com/spark/spark-map-vs-mappartitions-transformation/
Is it possible to use a dask array as input for pyspark?
I have a dask array that I would like to feed to pyspark.mllib.clustering.StreamingKMeans.
There was once a proof-of-concept for using Dask as a preprocessing layer for handing off work to Spark, where the Dask and Spark workers were co-located. I don't believe the effort was ever pushed far or used in any kind of production, so the short answer is "no", there's no way to directly pass a dask array to Spark. As things stand, you would need to compute the whole thing on the client, or write to a storage system that both frameworks can see.
I know the advantages of Datasets (type safety etc.), but I can't find any documentation on Spark Dataset limitations.
Are there any specific scenarios where Spark Datasets are not recommended and it is better to use DataFrames?
Currently all our data engineering flows use Spark (Scala) DataFrames.
We would like to make use of Datasets for all our new flows, so knowing all the limitations/disadvantages of Datasets would help us.
EDIT: This is not similar to Spark 2.0 Dataset vs DataFrame, which explains some operations on Dataframe/Dataset, or to other questions, most of which explain the differences between RDDs, Dataframes and Datasets and how they evolved. This is targeted at knowing when NOT to use Datasets.
There are a few scenarios where I find that a Dataframe (or Dataset[Row]) is more useful than a typed dataset.
For example, when I'm consuming data without a fixed schema, like JSON files containing records of different types with different fields. Using a Dataframe I can easily "select" out the fields I need without needing to know the whole schema, or even use a runtime configuration to specify the fields I'll access.
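As a rough illustration of that point, here is a minimal Scala sketch, assuming a hypothetical events.json file and made-up field names; nothing about the full schema has to be declared up front:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder.getOrCreate()

// The schema is inferred at read time; no case class or StructType is needed.
val events = spark.read.json("/path/to/events.json")

// Select only the fields this job cares about; the names could just as
// well come from runtime configuration.
val wanted = Seq("eventType", "payload.userId")
val slim = events.select(wanted.map(col): _*)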
Another consideration is that Spark can better optimize the built-in Spark SQL operations and aggregations than UDAFs and custom lambdas. So if you want to get the square root of a value in a column, that's a built-in function (df.withColumn("rootX", sqrt("X"))) in Spark SQL but doing it in a lambda (ds.map(X => Math.sqrt(X))) would be less efficient since Spark can't optimize your lambda function as effectively.
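A minimal sketch of that contrast, using a toy Dataset[Double] (illustrative only, not a benchmark):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sqrt

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

val ds = Seq(1.0, 4.0, 9.0).toDS()  // Dataset[Double] with a single column named "value"

// The built-in column expression is visible to the Catalyst optimizer:
val viaBuiltin = ds.withColumn("rootX", sqrt($"value"))

// The lambda is opaque to the optimizer; each row is deserialized into a
// JVM Double just to call math.sqrt:
val viaLambda = ds.map(x => math.sqrt(x))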
There are also many untyped Dataframe functions (like the statistical functions) that are implemented for Dataframes but not for typed Datasets, and you'll often find that even if you start out with a Dataset, by the time you've finished your aggregations you're left with a Dataframe, because those functions work by creating new columns, modifying the schema of your dataset.
In general I don't think you should migrate from working Dataframe code to typed Datasets unless you have a good reason to. Many of the Dataset features are still flagged as "experimental" as of Spark 2.4.0, and as mentioned above not all Dataframe features have Dataset equivalents.
Limitations of Spark Datasets:
Datasets used to be less performant (not sure if that's been fixed yet)
You need to define a new case class whenever you change the Dataset schema, which is cumbersome
Datasets don't offer as much type safety as you might expect. For example, we can pass the reverse function a date column and it will return a garbage result rather than erroring out:
import java.sql.Date
import org.apache.spark.sql.functions.reverse
import spark.implicits._  // assumes a SparkSession named spark is in scope

case class Birth(hospitalName: String, birthDate: Date)

val birthsDS = Seq(
  Birth("westchester", Date.valueOf("2014-01-15"))
).toDS()

birthsDS.withColumn("meaningless", reverse($"birthDate")).show()
+------------+----------+-----------+
|hospitalName| birthDate|meaningless|
+------------+----------+-----------+
| westchester|2014-01-15| 51-10-4102|
+------------+----------+-----------+
I want to compare the data types of two Datasets in Spark using Java.
There is a method called dtypes(). See the docs here: Dataset JavaDoc
I think this method can do your job.
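For example, a minimal Scala sketch (the Java Dataset API exposes the same dtypes() method, returning an array of Tuple2<String, String>) that treats two Datasets as equivalent when their column name/type pairs match, regardless of column order:
import org.apache.spark.sql.Dataset

// Hypothetical helper: true when both Datasets map the same column names
// to the same data type strings.
def sameDtypes(a: Dataset[_], b: Dataset[_]): Boolean =
  a.dtypes.toMap == b.dtypes.toMap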
I am trying to process data from Kafka using Spark Structured Streaming. The code for ingesting the data is as follows:
val enriched = df.select($"value" cast "string" as "json")
.select(from_json($"json", schema) as "data")
.select("data.*")
df is a DataFrame with the data consumed from Kafka.
The problem comes when I try to read it as JSON in order to do faster queries. The from_json() function from org.apache.spark.sql.functions requires a schema. What if the messages have different fields?
As @zero323 and the answer he or she referenced suggest, you are asking a contradictory question: essentially, how does one impose a schema when one doesn't know the schema? One can't, of course. I think the idea of using open-ended collection types is your best option.
Ultimately though, it is almost certainly true that you can represent your data with a case class even if it means using a lot of Options, strings you need to parse, and maps you need to interrogate. Invest in the effort to define that case class. Otherwise, your Spark jobs will essentially be a lot of ad hoc, time-consuming busywork.
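To make both suggestions concrete, here is a hedged Scala sketch against a Kafka source like the one in the question. The broker address, topic name, and the Event fields are all hypothetical. from_json accepts a MapType when you want an open-ended schema (Spark 2.2+), and the typed variant uses a case class with Options for fields that may be absent:
import org.apache.spark.sql.{Encoders, SparkSession}
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{MapType, StringType}

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:9092")  // hypothetical broker
  .option("subscribe", "events")                   // hypothetical topic
  .load()

// Open-ended schema: every top-level JSON field lands in a string-to-string
// map, so messages with different fields all fit.
val loose = df
  .select($"value".cast("string").as("json"))
  .select(from_json($"json", MapType(StringType, StringType)).as("data"))

// Typed alternative: a case class where possibly-absent fields are Options.
case class Event(id: String, user: Option[String], amount: Option[Double])

val typed = df
  .select($"value".cast("string").as("json"))
  .select(from_json($"json", Encoders.product[Event].schema).as("data"))
  .select("data.*")
  .as[Event]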