I'm aware that Spark supports a concept of Schema Converters that allow translation of an AVRO schema to a Spark schema. However, is there a reliable/supported converter for JSONSchema?
How are users using JSONSchema with Spark without duplicating the same schema in Spark Schema format?
Regards
P.S. I've seen Zalando's converter, but it doesn't seem to have been supported or updated in 6 years. The JSONSchema drafts have evolved since then.
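For illustration, this is the kind of minimal hand-rolled mapping I'd otherwise have to maintain myself. It's only a sketch: it covers a handful of draft keywords (object, array, string, integer, number, boolean), falls back to strings for anything else, and ignores $ref, oneOf, formats and the rest.

from pyspark.sql.types import (
    ArrayType, BooleanType, DataType, DoubleType, LongType,
    StringType, StructField, StructType,
)

def json_schema_to_spark(js: dict) -> DataType:
    # Hand-rolled sketch, not a supported library: map a small subset of
    # JSON Schema keywords to Spark types; unknown types fall back to string.
    t = js.get("type")
    if t == "object":
        required = set(js.get("required", []))
        return StructType([
            StructField(name, json_schema_to_spark(prop), nullable=name not in required)
            for name, prop in js.get("properties", {}).items()
        ])
    if t == "array":
        return ArrayType(json_schema_to_spark(js.get("items", {})))
    return {
        "string": StringType(),
        "integer": LongType(),
        "number": DoubleType(),
        "boolean": BooleanType(),
    }.get(t, StringType())

example = {
    "type": "object",
    "required": ["id"],
    "properties": {
        "id": {"type": "integer"},
        "name": {"type": "string"},
        "scores": {"type": "array", "items": {"type": "number"}},
    },
}
print(json_schema_to_spark(example))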
I want to test whether changing the datatype in a schema is compatible or not.
E.g.:
df1 = spark.createDataFrame([], StringType())
df2 = spark.createDataFrame([], IntegerType())
I know that if the two DataFrames are written as Parquet, I can use the mergeSchema functionality to test whether the two DataFrames read together are compatible or not.
How can I find out, at the DataFrame/schema level, whether the schemas allow implicit conversion? Is it even possible?
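The closest proxy I've come up with so far (just a sketch, not an official compatibility API) is to let the analyzer decide by attempting a union, since union applies Spark's type-coercion rules. Note that union coercion is looser than Parquet's mergeSchema, so this only gives a hint:

from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType, StringType
from pyspark.sql.utils import AnalysisException

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame([], StringType())
df2 = spark.createDataFrame([], IntegerType())

def union_compatible(a, b):
    # True if the analyzer can coerce the two schemas for a union; this only
    # shows an implicit conversion path exists under UNION type-coercion rules,
    # which is not the same check Parquet's mergeSchema performs.
    try:
        a.union(b).schema  # analysis happens eagerly; no data is scanned
        return True
    except AnalysisException:
        return False

print(union_compatible(df1, df2))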
My understanding is that one of the big changes between Spark 1.x and 2.x was the migration away from DataFrames toward the adoption of the newer/improved Dataset objects.
However, in all the Spark 2.x docs I see DataFrames being used, not Datasets.
So I ask: in Spark 2.x are we still using DataFrames, or have the Spark folks just not updated their 2.x docs to use the newer and recommended Datasets?
DataFrames ARE Datasets, just a special type of Dataset, namely Dataset[Row], meaning untyped Datasets.
But it's true that even with Spark 2.x, many Spark users still use DataFrames, especially for fast prototyping (I'm one of them), because it's a very convenient API and many operations are (in my view) easier to do with DataFrames than with Datasets.
Apparently you can use both, but no one over at Spark has bothered updating the docs to show how to use Datasets, so I'm guessing they really want us to just use DataFrames like we did in 1.x.
I work in a place where we use jOOQ for SQL query generation in some parts of the backend code. Lots of code has been written to work with it. On my side of things, I would like to map these features onto Spark and, in particular, generate Spark SQL queries over DataFrames loaded from a bunch of Parquet files.
Is there any tooling to generate DSL classes from a Parquet (or Spark) schema? I could not find any. Have other approaches been successful in this area?
Ideally, I would like to generate tables and fields dynamically from a possibly evolving schema.
I know this is a broad question and I will close it if it is deemed out of scope for SO.
jOOQ doesn't officially support Spark, but you have a variety of options to reverse engineer any schema metadata that you have in your Spark database:
Using the JDBCDatabase
Like any other jooq-meta Database implementation, you can use the JDBCDatabase that reverse engineers anything it can find through the JDBC DatabaseMetaData API, if your JDBC driver supports that.
Using files as a metadata source
As of jOOQ version 3.10, there are three different types of "offline" metadata sources that you can use to generate code:
The XMLDatabase will generate code from an XML file.
The JPADatabase will generate code from JPA-annotated entities.
The DDLDatabase will parse DDL file(s) and reverse engineer the resulting schema (this probably won't work well for Spark, as its DDL syntax is not officially supported)
Not using the code generator
Of course, you don't have to generate any code. You can get metadata information directly from your JDBC driver (again, through the DatabaseMetaData API), which is abstracted through DSLContext.meta(), or you can supply the schema dynamically to jOOQ using XML content through DSLContext.meta(InformationSchema).
I need to process quite a big JSON file using Spark. I don't need all the fields in the JSON and would actually like to read only some of them (rather than reading all the fields and projecting).
I was wondering if I could use the JSON connector and give it a partial read schema with only the fields I'm interested in loading.
It depends on whether your JSON is multi-line. Currently, Spark only supports single-line JSON records as a DataFrame; the next release, Spark 2.3, will support multi-line JSON.
As for your question: I don't think you can use a partial schema to read in JSON. You can first provide the full schema to read the file in as a DataFrame, then select the specific columns you need to construct your partial schema as a separate DataFrame. Since Spark uses lazy evaluation and the SQL engine is able to push the column pruning down, the performance won't be bad.
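A quick sketch of that approach, assuming a hypothetical people.json whose full schema contains more fields than are actually needed:

from pyspark.sql import SparkSession
from pyspark.sql.types import LongType, StringType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

# Hypothetical full schema of the file; only id and name are needed downstream.
full_schema = StructType([
    StructField("id", LongType()),
    StructField("name", StringType()),
    StructField("address", StringType()),
    StructField("bio", StringType()),
])

# For multi-line JSON on Spark 2.3+ you would also add .option("multiLine", True)
df = spark.read.schema(full_schema).json("people.json")  # hypothetical path
partial_df = df.select("id", "name")  # lazy; only these columns survive the projection
partial_df.printSchema()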
Is it possible to export models as PMML using PySpark? I know this is possible using Spark, but I did not find any reference in the PySpark docs. So does this mean that if I want to do this, I need to write custom code using some third-party Python PMML library?
It is possible to export Apache Spark pipelines to PMML using the JPMML-SparkML library. Furthermore, this library is made available for end users in the form of a "Spark Package" by the JPMML-SparkML-Package project.
Example PySpark code:
from jpmml_sparkml import toPMMLBytes
# sc is the active SparkContext, df the training DataFrame, and pipelineModel a fitted PipelineModel
pmmlBytes = toPMMLBytes(sc, df, pipelineModel)
print(pmmlBytes)