I want to test whether changing the data type in a schema is a compatible change or not.
E.g.:
df1 = spark.createDataFrame([], StringType())
df2 = spark.createDataFrame([], IntegerType())
I know that if the two DataFrames are written as parquet, I can use the mergeSchema option to test whether the two DataFrames read together are compatible or not.
How can I find out at the DataFrame/schema level whether the schemas allow implicit conversion? Is it even possible?
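A minimal sketch of one way to probe this at the schema level without writing parquet, assuming a union-based check is an acceptable proxy (written in Scala here, but the same works from PySpark; note that Spark SQL's union type coercion is looser than parquet's mergeSchema rules, and the canUnion helper below is hypothetical):

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._
import scala.util.Try

val spark = SparkSession.builder().master("local[*]").appName("schema-compat").getOrCreate()

// Hypothetical helper: build empty DataFrames for the two schemas and let the analyzer
// decide whether a union of them resolves. Analysis happens eagerly when the Dataset is
// created, so an incompatible pair throws an AnalysisException without touching any data.
def canUnion(a: StructType, b: StructType): Boolean = {
  val emptyA = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], a)
  val emptyB = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], b)
  Try(emptyA.union(emptyB)).isSuccess
}

val s1 = StructType(Seq(StructField("value", StringType)))
val s2 = StructType(Seq(StructField("value", IntegerType)))
println(canUnion(s1, s2)) // true in Spark 2.x: int and string coerce to string under union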
Is it possible to use Hive/Beeline/Spark's DDL parsing capabilities within our own programs, preferably in Java or Scala? I have already looked at the project https://github.com/xnuinside/simple-ddl-parser and it does exactly what I want. My concern with this project is that it does not use Hive's or Spark's own internal classes for the parsing; the authors have come up with their own regex patterns to parse the given DDL statements.
I know that beeline or spark-shell accepts a CREATE TABLE statement and creates the table, so there must be internal classes that parse the statement before the table is created. If those are public classes or methods, can we not use them instead of reinventing the wheel? I do not know which internal classes or methods parse the DDL statements; please let me know if you know more about them. For my use case, I need to extract TableName, ColumnNames, DataTypes, PartitionKeys, SerDe, InputFormat and OutputFormat from a given CREATE TABLE statement.
One of my friends suggested using the Apache Hive library itself, specifically the class org.apache.hadoop.hive.ql.parse.HiveParser. Example programs can be found in link1 or link2.
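For the Spark side, a minimal sketch of what this could look like against Spark 2.x internals (spark.sessionState, its sqlParser and the CreateTable plan node are internal, unstable classes, so treat this as an illustration rather than a guaranteed contract):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.datasources.CreateTable

val spark = SparkSession.builder().master("local[*]").appName("ddl-parse").getOrCreate()

val ddl =
  """CREATE TABLE sales (id INT, amount DOUBLE)
    |PARTITIONED BY (dt STRING)
    |STORED AS PARQUET""".stripMargin

// The same parser that spark-shell uses for CREATE TABLE; it returns a logical plan.
spark.sessionState.sqlParser.parsePlan(ddl) match {
  case CreateTable(tableDesc, _, _) =>
    println(s"table        : ${tableDesc.identifier.table}")
    println(s"columns      : ${tableDesc.schema.fields.map(f => s"${f.name} ${f.dataType.simpleString}").mkString(", ")}")
    println(s"partitionKeys: ${tableDesc.partitionColumnNames.mkString(", ")}")
    println(s"serde        : ${tableDesc.storage.serde}")
    println(s"inputFormat  : ${tableDesc.storage.inputFormat}")
    println(s"outputFormat : ${tableDesc.storage.outputFormat}")
  case other =>
    println(s"Not a CREATE TABLE statement: ${other.getClass.getSimpleName}")
}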
I know the advantages of Datasets (type safety etc.), but I can't find any documentation on the limitations of Spark Datasets.
Are there any specific scenarios where a Spark Dataset is not recommended and it is better to use a DataFrame?
Currently all our data engineering flows use Spark (Scala) DataFrames.
We would like to make use of Datasets for all our new flows, so knowing all the limitations/disadvantages of Datasets would help us.
EDIT: This is not similar to Spark 2.0 Dataset vs DataFrame, which explains some operations on DataFrames/Datasets, or to other questions, most of which explain the differences between RDDs, DataFrames and Datasets and how they evolved. This question is targeted at knowing when NOT to use Datasets.
There are a few scenarios where I find that a Dataframe (or Dataset[Row]) is more useful than a typed dataset.
For example, when I'm consuming data without a fixed schema, like JSON files containing records of different types with different fields. Using a Dataframe I can easily "select" out the fields I need without needing to know the whole schema, or even use a runtime configuration to specify the fields I'll access.
Another consideration is that Spark can better optimize the built-in Spark SQL operations and aggregations than UDAFs and custom lambdas. So if you want to get the square root of a value in a column, that's a built-in function (df.withColumn("rootX", sqrt("X"))) in Spark SQL but doing it in a lambda (ds.map(X => Math.sqrt(X))) would be less efficient since Spark can't optimize your lambda function as effectively.
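To illustrate that point (the column name and values below are made up):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sqrt

val spark = SparkSession.builder().master("local[*]").appName("sqrt-compare").getOrCreate()
import spark.implicits._

val ds = Seq(1.0, 4.0, 9.0).toDS() // single column named "value"

// Built-in Spark SQL function: Catalyst understands sqrt and can optimize / codegen it.
val viaBuiltIn = ds.withColumn("rootX", sqrt($"value"))

// Opaque lambda: each value is deserialized, passed to the closure and re-serialized,
// and Catalyst cannot look inside Math.sqrt to optimize the expression.
val viaLambda = ds.map(x => Math.sqrt(x))

viaBuiltIn.show()
viaLambda.show()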
There are also many untyped Dataframe functions (like the statistical functions) that are implemented for Dataframes but not for typed Datasets. You'll often find that even if you start out with a Dataset, by the time you've finished your aggregations you're left with a Dataframe, because those functions work by creating new columns and modifying the schema of your dataset.
In general I don't think you should migrate from working Dataframe code to typed Datasets unless you have a good reason to. Many of the Dataset features are still flagged as "experimental" as of Spark 2.4.0, and as mentioned above not all Dataframe features have Dataset equivalents.
Limitations of Spark Datasets:
Datasets used to be less performant (not sure if that's been fixed yet)
You need to define a new case class whenever you change the Dataset schema, which is cumbersome
Datasets don't offer as much type safety as you might expect. We can pass the reverse function a date object and it'll return a garbage response rather than erroring out.
import java.sql.Date
import org.apache.spark.sql.functions.reverse // untyped SQL string function
import spark.implicits._                      // for .toDS() and the $"..." syntax
case class Birth(hospitalName: String, birthDate: Date)
val birthsDS = Seq(
  Birth("westchester", Date.valueOf("2014-01-15"))
).toDS()
birthsDS.withColumn("meaningless", reverse($"birthDate")).show()
+------------+----------+-----------+
|hospitalName| birthDate|meaningless|
+------------+----------+-----------+
| westchester|2014-01-15| 51-10-4102|
+------------+----------+-----------+
I work in a place where we use jOOQ for SQL query generation in some parts of the backend code, and lots of code has been written to work with it. On my side of things, I would like to map these features into Spark, and especially to generate queries in Spark SQL over DataFrames loaded from a bunch of parquet files.
Is there any tooling to generate the DSL classes from a parquet (or Spark) schema? I could not find any. Have other approaches been successful in this area?
Ideally, I would like to generate tables and fields dynamically from a possibly evolving schema.
I know this is a broad question and I will close it if it is deemed out of scope for SO.
jOOQ doesn't officially support Spark, but you have a variety of options to reverse engineer any schema metadata that you have in your Spark database:
Using the JDBCDatabase
Like any other jooq-meta Database implementation, you can use the JDBCDatabase, which reverse engineers anything it can find through the JDBC DatabaseMetaData API, provided that your JDBC driver supports that.
Using files as a meta data source
As of jOOQ version 3.10, there are three different types of "offline" meta data sources that you can use to generate code:
The XMLDatabase will generate code from an XML file.
The JPADatabase will generate code from JPA-annotated entities.
The DDLDatabase will parse DDL file(s) and reverse engineer their contents (this probably won't work well for Spark, as its syntax is not officially supported).
Not using the code generator
Of course, you don't have to generate any code. You can get meta data information directly from your JDBC driver (again through the DatabaseMetaData API), which is abstracted through DSLContext.meta(), or you can supply the schema to jOOQ dynamically as XML content through DSLContext.meta(InformationSchema).
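A rough sketch of that DSLContext.meta() route over a live connection, for illustration (the jdbc:hive2 URL below is hypothetical, and how much meta data the Hive/Spark Thrift JDBC driver actually exposes through DatabaseMetaData varies by driver and version):

import java.sql.DriverManager
import org.jooq.impl.DSL
import scala.collection.JavaConverters._

// Hypothetical endpoint: Spark's Thrift server speaks the HiveServer2 JDBC protocol.
val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default")
val ctx = DSL.using(conn)

// DSLContext.meta() wraps the driver's DatabaseMetaData; table and column coverage
// depends entirely on what the driver reports.
for (table <- ctx.meta().getTables.asScala) {
  val cols = table.fields().map(f => s"${f.getName} ${f.getDataType.getTypeName}").mkString(", ")
  println(s"${table.getName}: $cols")
}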
I need to process quite a big JSON file using Spark. I don't need all the fields in the JSON and would actually like to read only some of them (rather than read all fields and project).
I was wondering if I could use the JSON connector and give it a partial read schema with only the fields I'm interested in loading.
It depends on whether your JSON is multi-line. Currently Spark only supports single-line JSON as a DataFrame; the next release, Spark 2.3, will support multi-line JSON.
But for your question: I don't think you can use a partial schema to read in JSON. You can first provide the full schema to read it in as a DataFrame, then select the specific columns you need to construct your partial schema as a separate DataFrame. Since Spark uses lazy evaluation and the SQL engine is able to push the column pruning down, the performance won't be bad.
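A rough sketch of that approach (the schema, field names and path here are made up):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().master("local[*]").appName("json-read").getOrCreate()

// Hypothetical full schema of the JSON file; only id and name are needed downstream.
val fullSchema = StructType(Seq(
  StructField("id", LongType),
  StructField("name", StringType),
  StructField("payload", StringType),
  StructField("ts", TimestampType)
))

val df = spark.read
  .schema(fullSchema)         // providing a schema also skips the expensive schema inference pass
  .json("/path/to/logs.json") // hypothetical path

// Thanks to lazy evaluation, only the selected columns flow through the rest of the plan.
val partial = df.select("id", "name")
partial.show()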
I am new to data science tools and have a use case to transform JSON logs into flattened columnar data that can be treated as normal CSV. I was looking into a lot of alternative tools to approach this problem and found that I can easily solve it using Spark SQL. The problem is that my JSON logs can be complex data structures with hierarchical arrays, i.e. I would have to explode the dataset multiple times to transform them.
The problem is that I don't want to hard-code the logic for the data transformation, as I wish to reuse the same chunk of code with different transformation logic; or, to put it a better way, I want my transformation to be driven by configuration rather than code.
For the same reason I was looking into Apache Avro, which gives me the liberty to define my own schema for the input, but the problem here is that I am unaware whether I can also define the output schema. If not, then it will be the same as reading and filtering the (generated) Avro data structure in my code logic.
One probable solution I can think of is to define my schema along with the array fields and some flags notifying my parser to explode on them, possibly recursively, until I have transformed the input schema into the output schema, i.e. generating the transformation logic based on my input and output schemas.
Is there a better approach that I am unaware of or have not been able to think of?
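For what it's worth, a rough, schema-driven sketch of the explode-and-expand idea described above (the recursion and the column naming convention are illustrative only; there is no configuration layer here yet, and it is not a production-ready parser):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, explode_outer}
import org.apache.spark.sql.types.{ArrayType, StructType}

// Walk the DataFrame's schema: explode every array column and expand every struct column,
// repeating until nothing nested is left. A configuration layer could instead whitelist
// which arrays to explode rather than exploding all of them.
def flatten(df: DataFrame): DataFrame = {
  val arrayField  = df.schema.fields.find(_.dataType.isInstanceOf[ArrayType])
  val structField = df.schema.fields.find(_.dataType.isInstanceOf[StructType])

  (arrayField, structField) match {
    case (Some(f), _) =>
      // explode one array column in place, then keep flattening
      flatten(df.withColumn(f.name, explode_outer(col(f.name))))
    case (None, Some(f)) =>
      // expand one struct column into prefixed top-level columns, then keep flattening
      val expanded = f.dataType.asInstanceOf[StructType].fields.map { sub =>
        col(s"${f.name}.${sub.name}").as(s"${f.name}_${sub.name}")
      }
      val others = df.columns.filterNot(_ == f.name).map(col)
      flatten(df.select(others ++ expanded: _*))
    case _ => df // fully flat
  }
}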