I work in a place where we use jOOQ for SQL query generation in some parts of the backend code. A lot of code has been written to work with it. On my side of things, I would like to map these features onto Spark, and in particular generate queries in Spark SQL over DataFrames loaded from a bunch of Parquet files.
Is there any tooling to generate DSL classes from a Parquet (or Spark) schema? I could not find any. Have other approaches been successful on this matter?
Ideally, I would like to generate tables and fields dynamically from a possibly evolving schema.
I know this is a broad question and I will close it if it is deemed out of scope for SO.
jOOQ doesn't officially support Spark, but you have a variety of options to reverse engineer any schema metadata that you have in your Spark database:
Using the JDBCDatabase
Like any other jooq-meta Database implementation, you can use the JDBCDatabase, which reverse engineers anything it can find through the JDBC DatabaseMetaData API, provided your JDBC driver supports that API.
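As a rough sketch of that route, assuming a JDBC connection to the Spark Thrift Server through the Hive JDBC driver (the URL below is a placeholder, and how complete the driver's DatabaseMetaData implementation is depends on your setup), this is the kind of metadata the JDBCDatabase would see:

```java
import java.sql.Connection;
import java.sql.DatabaseMetaData;
import java.sql.DriverManager;
import java.sql.ResultSet;

public class SparkJdbcMetadata {
    public static void main(String[] args) throws Exception {
        // Placeholder URL for a Spark Thrift Server exposed over the Hive JDBC driver.
        String url = "jdbc:hive2://localhost:10000/default";

        try (Connection conn = DriverManager.getConnection(url)) {
            DatabaseMetaData md = conn.getMetaData();

            // Same API the JDBCDatabase uses: list every column of every table the driver exposes.
            try (ResultSet rs = md.getColumns(null, null, "%", "%")) {
                while (rs.next()) {
                    System.out.printf("%s.%s : %s%n",
                        rs.getString("TABLE_NAME"),
                        rs.getString("COLUMN_NAME"),
                        rs.getString("TYPE_NAME"));
                }
            }
        }
    }
}
```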
Using files as a meta data source
As of jOOQ version 3.10, there are three different types of "offline" meta data sources that you can use to generate code:
The XMLDatabase will generate code from an XML file.
The JPADatabase will generate code from JPA-annotated entities.
The DDLDatabase will parse DDL file(s) and reverse engineer the schema they describe (this probably won't work well for Spark, as its syntax is not officially supported)
Not using the code generator
Of course, you don't have to generate any code. You can get meta data information directly from your JDBC driver (again through the DatabaseMetaData API), which is abstracted through DSLContext.meta(), or you can supply the schema dynamically to jOOQ as XML content through DSLContext.meta(InformationSchema).
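For the dynamic, possibly evolving schemas mentioned in the question, the runtime approach might be the better fit. A minimal sketch, assuming a working JDBC connection (the URL is again just a placeholder):

```java
import java.sql.Connection;
import java.sql.DriverManager;

import org.jooq.DSLContext;
import org.jooq.Field;
import org.jooq.Meta;
import org.jooq.Table;
import org.jooq.impl.DSL;

public class RuntimeMetaExample {
    public static void main(String[] args) throws Exception {
        // Placeholder URL; any JDBC-accessible schema works the same way.
        try (Connection conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default")) {
            DSLContext ctx = DSL.using(conn);

            // jOOQ reads this from DatabaseMetaData behind the scenes.
            Meta meta = ctx.meta();
            for (Table<?> table : meta.getTables()) {
                System.out.println(table.getName());

                for (Field<?> field : table.fields())
                    System.out.println("  " + field.getName() + " : " + field.getDataType());
            }
        }
    }
}
```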
Related
Is it possible to use Hive/Beeline/Spark's DDL parsing capabilities within our custom programs, preferably in Java or Scala? I have already looked at the project https://github.com/xnuinside/simple-ddl-parser and it does exactly what I want. The concern I have with this project is that it does not use Hive's or Spark's own internal classes for the parsing; they have come up with their own regex patterns to parse the given DDL statements.
I know beeline or spark-shell accepts CREATE TABLE statements and creates the table, so I am thinking there must be internal classes which do the parsing before the table is created. If they are public classes or methods, can we not use those instead of reinventing the wheel? I do not know which internal classes or methods parse the DDL statements; please let me know if you know more about it. For my use case, I need to extract TableName, ColumnNames, DataTypes, PartitionKeys, SerDe, InputFormat, OutputFormat from the given CREATE TABLE statement.
One of my friends suggested that I use the Apache Hive library itself, specifically the class org.apache.hadoop.hive.ql.parse.HiveParser. Example programs can be found in link1 or link2.
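If you only need the table details rather than the parse tree itself, one hedged alternative to calling HiveParser directly is to let Spark execute the DDL and read the result back from its catalog. A sketch (the table, columns, and statement below are made up for illustration, and enableHiveSupport() requires the spark-hive module on the classpath):

```java
import org.apache.spark.sql.SparkSession;

public class DdlIntrospection {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("ddl-introspection")
            .master("local[*]")
            .enableHiveSupport()   // needed for SerDe / InputFormat / OutputFormat details
            .getOrCreate();

        // Let Spark's own parser handle the statement instead of a hand-rolled regex.
        spark.sql("CREATE TABLE IF NOT EXISTS demo_table (id BIGINT, name STRING) "
                + "PARTITIONED BY (dt STRING) STORED AS PARQUET");

        // Column names, data types and partition flags:
        spark.catalog().listColumns("demo_table").show(false);

        // SerDe library, InputFormat, OutputFormat, location, etc.:
        spark.sql("DESCRIBE FORMATTED demo_table").show(100, false);

        spark.stop();
    }
}
```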
I am working with jOOQ queries now... I feel that plain SQL queries look more readable and maintainable, so why do we need to use jOOQ instead of native SQL queries?
Can someone explain a few reasons for using it?
Thanks.
Here are the top value propositions that you will never get with native (string based) SQL:
Dynamic SQL is what jOOQ is really really good at. You can compose the most complex queries dynamically based on user input, configuration, etc. and still be sure that the query will run correctly.
An often underestimated effect of dynamic SQL is the fact that you will be able to think of SQL as an algebra, because instead of writing difficult-to-compose native SQL syntax (with all the keywords, weird parenthesis rules, etc.), you can think in terms of expression trees, because you're effectively building an expression tree for your queries. Not only will this allow you to implement more sophisticated features, such as SQL transformation for multi tenancy or row level security, but it also makes everyday things trivial, like transforming a set of values into a SQL set operation.
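To make that concrete, here is a minimal sketch of such a composed query. AUTHOR and its columns stand for hypothetical generated code, and noCondition() assumes a reasonably recent jOOQ version (on older versions, trueCondition() plays the same role):

```java
import static org.jooq.impl.DSL.noCondition;

// Hypothetical generated code (what the jOOQ code generator would produce for your schema):
import static com.example.generated.Tables.AUTHOR;

import org.jooq.Condition;
import org.jooq.DSLContext;
import org.jooq.Result;

public class DynamicQueryExample {

    static Result<?> findAuthors(DSLContext ctx, String lastName, Integer bornAfter) {
        Condition condition = noCondition();

        // Each predicate is only added when the corresponding input is present.
        if (lastName != null)
            condition = condition.and(AUTHOR.LAST_NAME.eq(lastName));
        if (bornAfter != null)
            condition = condition.and(AUTHOR.YEAR_OF_BIRTH.gt(bornAfter));

        return ctx.selectFrom(AUTHOR)
                  .where(condition)
                  .fetch();
    }
}
```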
Vendor agnosticity. As soon as you have to support more than one SQL dialect, writing SQL manually is close to impossible because of the many subtle differences in dialects. The jOOQ documentation illustrates this e.g. with the LIMIT clause. Once this is a problem you have, you have to use either JPA (a much more restricted query language: JPQL) or jOOQ (almost no limitations with respect to SQL usage).
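As a quick illustration (exact emulation output depends on your jOOQ version and on the dialects available in your edition), the same expression tree can be rendered for different dialects:

```java
import static org.jooq.impl.DSL.field;
import static org.jooq.impl.DSL.table;

import org.jooq.Query;
import org.jooq.SQLDialect;
import org.jooq.impl.DSL;

public class DialectRenderingExample {
    public static void main(String[] args) {
        // One expression tree, built without any connection at all.
        Query query = DSL.select(field("first_name"))
                         .from(table("author"))
                         .orderBy(field("first_name"))
                         .limit(10)
                         .offset(20);

        // Rendered for two different (open source) dialects:
        System.out.println(DSL.using(SQLDialect.MYSQL).render(query));
        System.out.println(DSL.using(SQLDialect.DERBY).render(query));
    }
}
```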
Type safety. Now, you will get type safety when you write views and stored procedures as well, but very often, you want to run ad-hoc queries from Java, and there is no guarantee about table names, column names, column data types, or syntax correctness when you do SQL in a string based fashion, e.g. using JDBC or JdbcTemplate, etc. By the way: jOOQ encourages you to use as many views and stored procedures as you want. They fit perfectly in the jOOQ paradigm.
Code generation. Which leads to more type safety. Your database schema becomes part of your client code. Your client code no longer compiles when your queries are incorrect. Imagine someone renaming a column and forgetting to refactor the 20 queries that use it. IDEs only provide some degree of safety when writing the query for the first time, they don't help you when you refactor your schema. With jOOQ, your build fails and you can fix the problem long before you go into production.
Documentation. The generated code also acts as documentation for your schema. Comments on your tables and columns turn into Javadoc, which you can introspect in your client language, without the need to look them up on the server.
Data type bindings are very easy with jOOQ. Imagine using a library of hundreds of stored procedures. Not only will you be able to access them type safely (through code generation), as if they were actual Java code, but you also don't have to worry about the tedious and useless activity of binding each single IN and OUT parameter to a type and value.
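For illustration only: jOOQ generates a Routines class with static methods per schema, so a call to a hypothetical stored function might look like this (the routine name and signature below are made up):

```java
import java.math.BigDecimal;

import org.jooq.Configuration;

// Hypothetical generated class: jOOQ emits one Routines class per schema.
import com.example.generated.Routines;

public class RoutineCallExample {

    static BigDecimal authorRevenue(Configuration configuration, Integer authorId) {
        // The generated method already carries the parameter and return types;
        // no manual registerOutParameter / setObject binding is needed.
        return Routines.calculateRevenue(configuration, authorId);
    }
}
```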
There are a ton of more advanced features derived from the above, such as:
The availability of a parser and by consequence the possibility of translating SQL.
Schema management tools, such as diffing two schema versions
Basic ActiveRecord support, including some nice things like optimistic locking.
Synthetic SQL features like type safe implicit JOIN
Query By Example.
A nice integration in Java streams or reactive streams.
Some more advanced SQL transformations (this is work in progress).
Export and import functionality
Simple JDBC mocking functionality, including a file based database mock.
Diagnostics
And, if you occasionally think something is much simpler to do with plain native SQL, then just:
Use plain native SQL, also in jOOQ
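For example, through the plain SQL API (the table and column names below are made up), which still gives you safe bind values and the usual fetch methods:

```java
import org.jooq.DSLContext;
import org.jooq.Record;
import org.jooq.Result;

public class PlainSqlExample {

    static Result<Record> recentOrders(DSLContext ctx, int days) {
        // Plain SQL with a bind value; the result is still a regular jOOQ Result.
        return ctx.fetch(
            "select id, customer_id, total from orders where order_date > current_date - ?",
            days);
    }
}
```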
Disclaimer: As I work for the vendor, I'm obviously biased.
Is there any way I can maintain table relationships in Hive, Spark SQL, or Azure U-SQL? Do any of these support creating relationships like in Oracle or SQL Server? In Oracle, I can get the relationships using the user_constraints table. I am looking for something similar to that for processing in ML.
I don't believe any of the big data platforms support integrity constraints. They are designed to load data as quickly as possible, and having constraints would only slow down the import. Hope this makes sense.
None of the big data tools maintain constraints. If we just consider Hive, it doesn't even check, while you are writing data to a table, whether the data matches the table schema, because Hive follows a schema-on-read approach.
While reading the data, if you want to establish any relation, you have to work with joins.
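A small Spark sketch of that idea, with hypothetical paths and column names, where the "relationship" only exists at query time:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ReadTimeJoin {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("read-time-join")
            .master("local[*]")
            .getOrCreate();

        // Hypothetical datasets; nothing enforced the "foreign key" when they were written.
        Dataset<Row> orders = spark.read().parquet("/data/orders");
        Dataset<Row> customers = spark.read().parquet("/data/customers");

        // The relationship is only established here, at query time.
        Dataset<Row> joined = orders.join(
            customers,
            orders.col("customer_id").equalTo(customers.col("id")),
            "inner");

        joined.show();
        spark.stop();
    }
}
```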
Apache Spark™ provides a pluggable mechanism to integrate with external data sources using the DataSource APIs. These APIs allow Spark to read data from external data sources and also for data that is analyzed in Spark to be written back out to the external data sources. The DataSource APIs also support filter pushdowns and column pruning that can significantly improve the performance of queries.
In addition to this, I want to know whether Apache Spark also provides the ability (or an interface) for data sources that can execute functions (native or user-defined) natively.
We have a proprietary data source, and it can compute the results of functions like max(), min(), size(), etc. itself.
tl;dr No, that's not possible.
Spark SQL uses functions as a more developer-friendly interface for creating Catalyst expressions, which know what to generate when given InternalRows (zero, one, or more rows, depending on what's available and on whether the expression is a user-defined function or a user-defined aggregate function, respectively).
DataSource does not interact with Column (or Catalyst expression in particular) or vice versa in any way. They are separate.
To get very low-level, you could review the Max Catalyst expression yourself and learn what is generated, and when, at execution time.
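To see the boundary in practice, here is a hedged Java sketch (the data source format name is made up): the filter may be pushed down to the source, but max() becomes a Catalyst aggregate expression that Spark itself evaluates, as the physical plan will show:

```java
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.max;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class PushdownBoundary {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("pushdown-boundary")
            .master("local[*]")
            .getOrCreate();

        // Made-up format name standing in for a proprietary DataSource implementation.
        Dataset<Row> df = spark.read()
            .format("com.example.proprietary")
            .load();

        // The filter may be pushed down to the source; the aggregation is not:
        // max() is turned into a Catalyst aggregate expression evaluated by Spark.
        df.filter(col("value").gt(0))
          .agg(max(col("value")))
          .explain(true);   // the physical plan shows where each step runs

        spark.stop();
    }
}
```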
I need to process quite a big JSON file using Spark. I don't need all the fields in the JSON and would actually like to read only some of them (not read all the fields and then project).
I was wondering if I could use the JSON connector and give it a partial read schema with only the fields I'm interested in loading.
It depends on whether your JSON is multi-line. Currently, Spark only supports single-line JSON records as a DataFrame. The next release, Spark 2.3, will support multi-line JSON.
But as for your question: I don't think you can use a partial schema to read the JSON. You can first provide the full schema to read it in as a DataFrame, then select the specific columns you need to construct your partial schema as a separate DataFrame. Since Spark uses lazy evaluation and the SQL engine is able to push down the filter, the performance won't be bad.
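A small Java sketch of that approach (the schema fields and path are hypothetical): provide an explicit schema up front, then select only the columns you need:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class ExplicitSchemaJsonRead {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("explicit-schema-json-read")
            .master("local[*]")
            .getOrCreate();

        // Hypothetical schema describing the file's fields.
        StructType schema = new StructType()
            .add("id", DataTypes.LongType)
            .add("name", DataTypes.StringType)
            .add("created_at", DataTypes.TimestampType);

        Dataset<Row> df = spark.read()
            .schema(schema)              // also skips the schema inference pass over the file
            .json("/data/events.json");  // hypothetical path

        // Only the needed columns are carried forward; the rest is pruned lazily.
        df.select("id", "name").show();

        spark.stop();
    }
}
```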