How to figure out if a Metastore object is a view or a table? - databricks

I am trying to figure out if there is a way to tell whether a metastore object is a table or a view.
For example, we can use this SQL query:
DESCRIBE DETAIL mydb.mytable
which returns metadata about the table.
The format column in the output, "delta", tells us that this object is a (Delta) table.
Is there an equivalent query that I can use to check if an object is a View?
Note: there seems to be no equivalent of the above DESCRIBE DETAIL query for a view. I tried the following:
DESCRIBE DETAIL mydb.myview
But I am getting this error:
com.databricks.backend.common.rpc.DatabricksExceptions$SQLExecutionException: org.apache.spark.sql.AnalysisException: Table or view not found: information_schema.tables; line 1 pos 14;
'Project [*]
+- 'UnresolvedRelation [information_schema, tables], [], false
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$2(CheckAnalysis.scala:138)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$2$adapted(CheckAnalysis.scala:105)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:358)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1(TreeNode.scala:357)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1$adapted(TreeNode.scala:357)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
...
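For what it's worth, here is a minimal sketch of how the object type could be checked programmatically. It assumes the PySpark catalog API, where listTables entries expose a tableType field; spark is the SparkSession predefined in a Databricks notebook and mydb is the database from the question.
# Print each object in the database together with its type
# (tableType is typically 'MANAGED', 'EXTERNAL', or 'VIEW')
for obj in spark.catalog.listTables("mydb"):
    print(obj.name, obj.tableType)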

Related

Spark partition filter is skipped when table is used in where condition, why?

Maybe someone has observed this behavior and knows why Spark takes this route.
I wanted to read only a few partitions from a partitioned table.
SELECT *
FROM my_table
WHERE snapshot_date IN('2023-01-06', '2023-01-07')
results in (part of) the physical plan:
-- Location: PreparedDeltaFileIndex [dbfs:/...]
-- PartitionFilters: [cast(snapshot_date#282634 as string) IN (2023-01-06,2023-01-07)]
It is very fast, ~1 s; in the execution plan I can see that the provided dates are used as partition filters.
If I instead provide the filter predicate in the form of a one-column table, it does a full table scan and takes about 100x longer.
SELECT *
FROM my_table
WHERE snapshot_date IN (
SELECT snapshot_date
FROM (VALUES('2023-01-06'), ('2023-01-07')) T(snapshot_date)
)
-- plan
Location: PreparedDeltaFileIndex [dbfs:/...]
PartitionFilters: []
ReadSchema: ...
I was unable to find any query hints that would force Spark to push down this predicate.
One can easily write a for loop in Python that reads the table once per desired date and unions the results. But I'm not sure that is possible in SQL.
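A rough sketch of that Python workaround (my own assumption of what it could look like, using PySpark and a functools.reduce union; my_table and the dates are taken from the query above):
from functools import reduce

# Read one DataFrame per date so every read carries a literal partition filter,
# then union the per-date DataFrames back into a single result.
dates = ["2023-01-06", "2023-01-07"]
dfs = [spark.table("my_table").where(f"snapshot_date = '{d}'") for d in dates]
df = reduce(lambda a, b: a.unionByName(b), dfs)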
Is there any option/switch I have missed?
I don't think pushing down this kind of predicate is something Spark's Hive metastore client supports today.
So in the first case, the HiveShim.convertFilters(...) method will transform:
WHERE snapshot_date IN ('2023-01-06', '2023-01-07')
into a filtering predicate understood by HMS as
snapshot_date="2023-01-06" or snapshot_date="2023-01-07"
but in the second (sub-select) case, the condition will be skipped altogether.
/**
* Converts catalyst expression to the format that Hive's getPartitionsByFilter() expects, i.e.
* a string that represents partition predicates like "str_key=\"value\" and int_key=1 ...".
*
* Unsupported predicates are skipped.
*/
def convertFilters(table: Table, filters: Seq[Expression]): String = {
lazy val dateFormatter = DateFormatter()
  // ...
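Given that only literal predicates are converted, one possible workaround (my own suggestion, not part of the original answer) is to build the IN list as plain literals in Python before submitting the SQL, so that convertFilters still sees a simple IN predicate:
# Interpolate the dates as literals instead of using a sub-select
dates = ["2023-01-06", "2023-01-07"]
in_list = ", ".join(f"'{d}'" for d in dates)  # "'2023-01-06', '2023-01-07'"
df = spark.sql(f"SELECT * FROM my_table WHERE snapshot_date IN ({in_list})")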

NullPointerException while using JSON datatype with Jooq

I am using jooq 3.15.12 and I am getting the following error while generating code:
Error while generating table ArmourDb.REGEX
java.lang.RuntimeException: java.lang.RuntimeException: java.lang.NullPointerException
at org.jooq.meta.AbstractDatabase.onError(AbstractDatabase.java:3104)
at org.jooq.meta.AbstractDatabase.getRelations(AbstractDatabase.java:2317)
at org.jooq.meta.DefaultColumnDefinition.getPrimaryKey(DefaultColumnDefinition.java:100)
at org.jooq.meta.AbstractTableDefinition.getPrimaryKey(AbstractTableDefinition.java:91)
at org.jooq.codegen.JavaGenerator.generateTable(JavaGenerator.java:5019)
at org.jooq.codegen.JavaGenerator.generateTables(JavaGenerator.java:5000)
at org.jooq.codegen.JavaGenerator.generate(JavaGenerator.java:582)
at org.jooq.codegen.JavaGenerator.generate(JavaGenerator.java:537)
at org.jooq.codegen.JavaGenerator.generate(JavaGenerator.java:436)
at org.jooq.codegen.GenerationTool.run0(GenerationTool.java:879)
at org.jooq.codegen.GenerationTool.run(GenerationTool.java:233)
at org.jooq.codegen.GenerationTool.generate(GenerationTool.java:228)
at com.linkedin.gradle.mysql.tasks.MySQLJooqCodegenTask.taskAction(MySQLJooqCodegenTask.java:32)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
The table looks like this:
CREATE TABLE REGEX (
ID BIGINT NOT NULL PRIMARY KEY AUTO_INCREMENT,
WORD VARCHAR(4000) NOT NULL,
DESCRIPTION VARCHAR(1000),
TRAITS JSON,
MODIFIED_BY VARCHAR(255) NOT NULL,
MODIFIED_AT TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
CREATED_AT TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
) ENGINE = InnoDB;
I have observed that the JSON field is causing this issue; if I change the datatype from JSON to some other type, the Java files are generated without any issue.
As per the stack trace, the NullPointerException occurs while fetching the relations of the table.

Could not create BLOOMFILTER Index in databricks

I am trying to create a BLOOMFILTER index by referring to this document:
https://docs.databricks.com/spark/2.x/spark-sql/language-manual/create-bloomfilter-index.html
I created the Delta table with:
spark.sql("DROP TABLE IF EXISTS testdb.fact_lists")
spark.sql("CREATE TABLE testdb.fact_lists USING DELTA LOCATION '/delta/fact-lists'")
I enabled the bloom filter with:
%sql
SET spark.databricks.io.skipping.bloomFilter.enabled = true;
SET delta.bloomFilter.enabled = true;
When I try to run the CREATE statement below for the BLOOMFILTER index, I get a "no viable alternative at input" error:
%sql
CREATE BLOOMFILTER INDEX
ON TABLE testdb.fact_lists
FOR COLUMNS(event_id OPTION(fpp=0.1, numItems=100))
Error:
com.databricks.backend.common.rpc.DatabricksExceptions$SQLExecutionException: org.apache.spark.sql.catalyst.parser.ParseException:
no viable alternative at input 'CREATE BLOOMFILTER'(line 1, pos 7)
== SQL ==
CREATE BLOOMFILTER INDEX
-------^^^
ON TABLE testdb.fact_lists
FOR COLUMNS(event_id OPTION(fpp=0.1, numItems=100))
at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:298)
at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:159)
at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:88)
at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:106)
at com.databricks.sql.parser.DatabricksSqlParser.$anonfun$parsePlan$1(DatabricksSqlParser.scala:77)
at com.databricks.sql.parser.DatabricksSqlParser.parse(DatabricksSqlParser.scala:97)
at com.databricks.sql.parser.DatabricksSqlParser.parsePlan(DatabricksSqlParser.scala:74)
at org.apache.spark.sql.SparkSession.$anonfun$sql$2(SparkSession.scala:801)
at com.databricks.spark.util.FrameProfiler$.record(FrameProfiler.scala:80)
at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:151)
at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:801)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:968)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:798)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:695)
at com.databricks.backend.daemon.driver.SQLDriverLocal.$anonfun$executeSql$1(SQLDriverLocal.scala:91)
at scala.collection.immutable.List.map(List.scala:293)
at com.databricks.backend.daemon.driver.SQLDriverLocal.executeSql(SQLDriverLocal.scala:37)
at com.databricks.backend.daemon.driver.SQLDriverLocal.repl(SQLDriverLocal.scala:145)
at com.databricks.backend.daemon.driver.DriverLocal.$anonfun$execute$11(DriverLocal.scala:605)
at com.databricks.logging.Log4jUsageLoggingShim$.$anonfun$withAttributionContext$1(Log4jUsageLoggingShim.scala:33)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
at com.databricks.logging.AttributionContext$.withValue(AttributionContext.scala:94)
at com.databricks.logging.Log4jUsageLoggingShim$.withAttributionContext(Log4jUsageLoggingShim.scala:31)
at com.databricks.logging.UsageLogging.withAttributionContext(UsageLogging.scala:205)
at com.databricks.logging.UsageLogging.withAttributionContext$(UsageLogging.scala:204)
at com.databricks.backend.daemon.driver.DriverLocal.withAttributionContext(DriverLocal.scala:60)
at com.databricks.logging.UsageLogging.withAttributionTags(UsageLogging.scala:240)
at com.databricks.logging.UsageLogging.withAttributionTags$(UsageLogging.scala:225)
at com.databricks.backend.daemon.driver.DriverLocal.withAttributionTags(DriverLocal.scala:60)
at com.databricks.backend.daemon.driver.DriverLocal.execute(DriverLocal.scala:582)
at com.databricks.backend.daemon.driver.DriverWrapper.$anonfun$tryExecutingCommand$1(DriverWrapper.scala:615)
at scala.util.Try$.apply(Try.scala:213)
at com.databricks.backend.daemon.driver.DriverWrapper.tryExecutingCommand(DriverWrapper.scala:607)
at com.databricks.backend.daemon.driver.DriverWrapper.executeCommandAndGetError(DriverWrapper.scala:526)
at com.databricks.backend.daemon.driver.DriverWrapper.executeCommand(DriverWrapper.scala:561)
at com.databricks.backend.daemon.driver.DriverWrapper.runInnerLoop(DriverWrapper.scala:431)
at com.databricks.backend.daemon.driver.DriverWrapper.runInner(DriverWrapper.scala:374)
at com.databricks.backend.daemon.driver.DriverWrapper.run(DriverWrapper.scala:225)
at java.lang.Thread.run(Thread.java:748)
at com.databricks.backend.daemon.driver.SQLDriverLocal.executeSql(SQLDriverLocal.scala:130)
at com.databricks.backend.daemon.driver.SQLDriverLocal.repl(SQLDriverLocal.scala:145)
at com.databricks.backend.daemon.driver.DriverLocal.$anonfun$execute$11(DriverLocal.scala:605)
at com.databricks.logging.Log4jUsageLoggingShim$.$anonfun$withAttributionContext$1(Log4jUsageLoggingShim.scala:33)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
at com.databricks.logging.AttributionContext$.withValue(AttributionContext.scala:94)
at com.databricks.logging.Log4jUsageLoggingShim$.withAttributionContext(Log4jUsageLoggingShim.scala:31)
at com.databricks.logging.UsageLogging.withAttributionContext(UsageLogging.scala:205)
at com.databricks.logging.UsageLogging.withAttributionContext$(UsageLogging.scala:204)
at com.databricks.backend.daemon.driver.DriverLocal.withAttributionContext(DriverLocal.scala:60)
at com.databricks.logging.UsageLogging.withAttributionTags(UsageLogging.scala:240)
at com.databricks.logging.UsageLogging.withAttributionTags$(UsageLogging.scala:225)
at com.databricks.backend.daemon.driver.DriverLocal.withAttributionTags(DriverLocal.scala:60)
at com.databricks.backend.daemon.driver.DriverLocal.execute(DriverLocal.scala:582)
at com.databricks.backend.daemon.driver.DriverWrapper.$anonfun$tryExecutingCommand$1(DriverWrapper.scala:615)
at scala.util.Try$.apply(Try.scala:213)
at com.databricks.backend.daemon.driver.DriverWrapper.tryExecutingCommand(DriverWrapper.scala:607)
at com.databricks.backend.daemon.driver.DriverWrapper.executeCommandAndGetError(DriverWrapper.scala:526)
at com.databricks.backend.daemon.driver.DriverWrapper.executeCommand(DriverWrapper.scala:561)
at com.databricks.backend.daemon.driver.DriverWrapper.runInnerLoop(DriverWrapper.scala:431)
at com.databricks.backend.daemon.driver.DriverWrapper.runInner(DriverWrapper.scala:374)
at com.databricks.backend.daemon.driver.DriverWrapper.run(DriverWrapper.scala:225)
at java.lang.Thread.run(Thread.java:748)
Kindly assist. Thanks in advance!
I got the same error when I used the same query to create a Bloomfilter index on my sample table in Databricks 10.4 LTS.
CREATE BLOOMFILTER INDEX
ON TABLE factlists
FOR COLUMNS(id OPTION(fpp=0.1, numItems=100))
#error message
ParseException:
no viable alternative at input 'CREATE bloomfilter'(line 1, pos 7)
== SQL ==
CREATE bloomfilter INDEX
-------^^^
ON TABLE factlists
FOR COLUMNS(id OPTION(fpp=0.1, numItems=100))
The error was caused by incorrect syntax. When I used the following modified query (OPTIONS instead of OPTION), the Bloomfilter index was created successfully.
CREATE bloomfilter INDEX
ON TABLE factlists
FOR COLUMNS(id OPTIONS(fpp=0.1, numItems=100))
In your query, change OPTION to OPTIONS (the cause of the error) to resolve it.
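Applied to the table from the original question, a sketch of the corrected statement (same column and options as in the question, simply run through spark.sql from a notebook cell) would be:
# OPTIONS instead of OPTION is the only change
spark.sql("""
CREATE BLOOMFILTER INDEX
ON TABLE testdb.fact_lists
FOR COLUMNS(event_id OPTIONS(fpp=0.1, numItems=100))
""")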

Spark jdbc on SQL server failed with filter and disable pushDownPredicate not work

val url = "jdbc:sqlserver://XXXXXX"
val properties = new Properties
// properties.setProperty(JDBCOptions.JDBC_PUSHDOWN_PREDICATE, "false")  // added this, but it still does not work
val df = spark.read.jdbc(url, "movies", properties)
df.filter("rated == true").show()
The code is quite simple, and it fails with the following error:
Job aborted due to stage failure.
Caused by: SQLServerException: Invalid column name 'true'.
The table is in SQL Server, and its schema is:
CREATE TABLE [dbo].[movies](
[movieId] [int] NULL,
[title] [nvarchar](max) NULL,
[releaseDate] [date] NULL,
[rated] [bit] NULL,
[screenedOn] [datetime2](7) NULL,
[ticketPrice] [decimal](10, 2) NULL
)
Using the code in JDBCUtils and JDBCDialect, the 'bit' type is translated to BooleanType, which is good. However, the filter logic is pushed down to the JDBCRDD, and due to a defect in the MsSqlServerDialect compileValue() method, the WHERE clause does not convert the boolean value to '1'/'0' to match T-SQL on SQL Server, which causes this error. Even if I write a new dialect it still fails, because AggregatedDialect does not override compileValue() to loop over all contained dialects. That is something I think needs to be fixed in the current Spark code.
My question is about the 'pushDownPredicate' option described in the docs, which controls whether the filter logic is pushed down into the WHERE clause:
pushDownPredicate: The option to enable or disable predicate push-down into the JDBC data source. The default value is true, in which case Spark will push down filters to the JDBC data source as much as possible. Otherwise, if set to false, no filter will be pushed down to the JDBC data source and thus all filters will be handled by Spark. Predicate push-down is usually turned off when the predicate filtering is performed faster by Spark than by the JDBC data source.
But even if I set properties.setProperty(JDBCOptions.JDBC_PUSHDOWN_PREDICATE, "false"), it still fails with the same error. I wonder how pushDownPredicate works in spark.read.jdbc, and whether I understand correctly that 'pushDownPredicate' should prevent the filter from being pushed down to the JDBCRDD.
Here is the related logical plan:
Project
+- Filter (isnotnull(rated#91) && (cast(rated#91 as string) = true))
   +- Relation[movieId#88,title#89,releaseDate#90,rated#91,screenedOn#92,ticketPrice#93] JDBCRelation(dbo.[movies]) [numPartitions=1]
We had this issue when moving from Databricks 9 to 10.
We were using PySpark, and adding {"pushDownPredicate": "false"} to the connection properties solved it for us.
properties = {
    "user": jdbcUsername,
    "password": jdbcPassword,
    "driver": driver,
    "pushDownPredicate": "false"
}
spark.read.jdbc(url=url, table=controltable, properties=properties)
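Equivalently, here is a sketch using the DataFrameReader option API (an assumption on my part; the pushDownPredicate option name comes from the documentation quoted in the question, and the url/credentials/driver variables are the same placeholders used above):
# Disable predicate push-down so Spark evaluates the boolean filter itself
df = (spark.read.format("jdbc")
      .option("url", url)
      .option("dbtable", "movies")
      .option("user", jdbcUsername)
      .option("password", jdbcPassword)
      .option("driver", driver)
      .option("pushDownPredicate", "false")
      .load())
df.filter("rated == true").show()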

Subqueries in select clause with JPA

I need to execute a subquery in a SELECT clause with Apache OpenJPA 2.
Does JPA support subqueries in the SELECT clause?
My query is something like this:
SELECT t.date, t.value,
(SELECT COUNT(DISTINCT t2.value2) FROM table t2 WHERE t2.date = t.date)
FROM table t
WHERE ...
When I execute my query, I get a class cast exception:
Exception in thread "main" <openjpa-2.1.1-SNAPSHOT-r422266:1141200 nonfatal user error> org.apache.openjpa.persistence.ArgumentException:
at org.apache.openjpa.kernel.QueryImpl.execute(QueryImpl.java:872)
at org.apache.openjpa.kernel.QueryImpl.execute(QueryImpl.java:794)
at org.apache.openjpa.kernel.DelegatingQuery.execute(DelegatingQuery.java:542)
at org.apache.openjpa.persistence.QueryImpl.execute(QueryImpl.java:315)
at org.apache.openjpa.persistence.QueryImpl.getResultList(QueryImpl.java:331)
Caused by: java.lang.ClassCastException: org.apache.openjpa.jdbc.sql.LogicalUnion$UnionSelect incompatible with org.apache.openjpa.jdbc.sql.SelectImpl
at org.apache.openjpa.jdbc.sql.SelectImpl.setParent(SelectImpl.java:579)
at org.apache.openjpa.jdbc.kernel.exps.SelectConstructor.newSelect(SelectConstructor.java:147)
at org.apache.openjpa.jdbc.kernel.exps.SelectConstructor.evaluate(SelectConstructor.java:87)
at org.apache.openjpa.jdbc.kernel.exps.SubQ.appendTo(SubQ.java:209)
at org.apache.openjpa.jdbc.kernel.exps.SubQ.appendTo(SubQ.java:203)
at org.apache.openjpa.jdbc.kernel.exps.SubQ.newSQLBuffer(SubQ.java:167)
at org.apache.openjpa.jdbc.kernel.exps.SubQ.selectColumns(SubQ.java:153)
at org.apache.openjpa.jdbc.kernel.exps.SubQ.select(SubQ.java:148)
at org.apache.openjpa.jdbc.kernel.exps.SelectConstructor.select(SelectConstructor.java:372)
at org.apache.openjpa.jdbc.kernel.JDBCStoreQuery.populateSelect(JDBCStoreQuery.java:295)
at org.apache.openjpa.jdbc.kernel.JDBCStoreQuery.access$100(JDBCStoreQuery.java:86)
at org.apache.openjpa.jdbc.kernel.JDBCStoreQuery$1.select(JDBCStoreQuery.java:267)
at org.apache.openjpa.jdbc.sql.LogicalUnion.select(LogicalUnion.java:297)
at org.apache.openjpa.jdbc.kernel.JDBCStoreQuery.populateUnion(JDBCStoreQuery.java:265)
at org.apache.openjpa.jdbc.kernel.JDBCStoreQuery.executeQuery(JDBCStoreQuery.java:211)
at org.apache.openjpa.kernel.ExpressionStoreQuery$DataStoreExecutor.executeQuery(ExpressionStoreQuery.java:782)
at org.apache.openjpa.datacache.QueryCacheStoreQuery$QueryCacheExecutor.executeQuery(QueryCacheStoreQuery.java:346)
at org.apache.openjpa.kernel.QueryImpl.execute(QueryImpl.java:1005)
at org.apache.openjpa.kernel.QueryImpl.execute(QueryImpl.java:863)
... 6 more
Is this possible, or do I have to use a NativeQuery / separate queries?
No, using subqueries in the SELECT clause is not supported. The JPA 2.0 specification states this in the following words:
Subqueries may be used in the WHERE or HAVING clauses.
