Environment
presto 0.215
presto-cli 0.215
presto-jdbc 0.215
Hive Table created by Presto
CREATE TABLE hive.origin.test_part (
id int,
date_key int
)
WITH (
format = 'ORC',
partitioned_by = ARRAY['date_key'],
external_location = '/user/hive/warehouse/origin.db/test_part/'
)
Inserts succeed from both the Presto JDBC driver and the CLI.
Partition '20190122' did not exist beforehand, and the insert succeeded, which means the rename of the temporary directory to /user/hive/warehouse/origin.db/test_part/date_key=20190122 worked.
The partition directory /user/hive/warehouse/origin.db/test_part/date_key=20190122/ exists in HDFS.
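For reference, a minimal sketch of the kind of insert that succeeds, here via the presto-python-client (an assumption on my part; host, port, user, and the inserted values are illustrative placeholders, not taken from the original setup):

import prestodb  # pip install presto-python-client

# Connection details are placeholders, not from the original cluster.
conn = prestodb.dbapi.connect(
    host='datacenter1',
    port=8080,
    user='hive',
    catalog='hive',
    schema='origin',
)
cur = conn.cursor()
# Inserting into a not-yet-existing partition creates it: the write goes
# to a tmp directory under /tmp/presto-hive/ and is renamed into place
# on commit.
cur.execute("INSERT INTO test_part VALUES (1, 20190122)")
cur.fetchall()  # drain the result to make sure the insert has completed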
But calling system.create_empty_partition() from the Presto CLI failed:
CALL system.create_empty_partition( schema_name => 'origin', table_name => 'test_part', partition_columns => ARRAY['date_key'], partition_values => ARRAY['20190121'])
Full error message
com.facebook.presto.spi.PrestoException: Failed to rename hdfs://datacenter1:8020/tmp/presto-hive/b87162e5-9e48-4d43-a0e7-ecf0994fe625/date_key=20190121 to hdfs://datacenter1:8020/user/hive/warehouse/origin.db/test_part/date_key=20190121: rename returned false
at com.facebook.presto.hive.metastore.SemiTransactionalHiveMetastore.renameDirectory(SemiTransactionalHiveMetastore.java:1787)
at com.facebook.presto.hive.metastore.SemiTransactionalHiveMetastore.access$2700(SemiTransactionalHiveMetastore.java:87)
at com.facebook.presto.hive.metastore.SemiTransactionalHiveMetastore$Committer.prepareAddPartition(SemiTransactionalHiveMetastore.java:1177)
at com.facebook.presto.hive.metastore.SemiTransactionalHiveMetastore$Committer.access$700(SemiTransactionalHiveMetastore.java:957)
at com.facebook.presto.hive.metastore.SemiTransactionalHiveMetastore.commitShared(SemiTransactionalHiveMetastore.java:885)
at com.facebook.presto.hive.metastore.SemiTransactionalHiveMetastore.commit(SemiTransactionalHiveMetastore.java:807)
at com.facebook.presto.hive.HiveMetadata.commit(HiveMetadata.java:1949)
at com.facebook.presto.hive.CreateEmptyPartitionProcedure.createEmptyPartition(CreateEmptyPartitionProcedure.java:126)
at java.lang.invoke.MethodHandle.invokeWithArguments(MethodHandle.java:627)
at java.lang.invoke.MethodHandle.invokeWithArguments(MethodHandle.java:649)
at com.facebook.presto.execution.CallTask.execute(CallTask.java:160)
at com.facebook.presto.execution.CallTask.execute(CallTask.java:60)
at com.facebook.presto.execution.DataDefinitionExecution.start(DataDefinitionExecution.java:168)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
The temporary directory /tmp/presto-hive/ exists in HDFS as well.
So does CALL system.create_empty_partition() use a different user to manipulate HDFS?
This is failing due to a bug that prevents create_empty_partition() from working with non-bucketed tables. It is fixed in the 301 release.
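Until an upgrade to a fixed release is possible, one workaround (my suggestion, not verified against this setup) is to register the empty partition through Hive itself rather than Presto, e.g. with PyHive; the HiveServer2 host below is a placeholder:

from pyhive import hive  # pip install pyhive

# HiveServer2 host is an illustrative placeholder.
conn = hive.connect(host='datacenter1')
cur = conn.cursor()
# ADD PARTITION creates the directory and registers the partition in the
# metastore directly, bypassing Presto's tmp-directory rename entirely.
cur.execute(
    "ALTER TABLE origin.test_part "
    "ADD IF NOT EXISTS PARTITION (date_key=20190121)"
)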
Related
I am trying to create a BLOOMFILTER index by referring to this document:
https://docs.databricks.com/spark/2.x/spark-sql/language-manual/create-bloomfilter-index.html
I created the DELTA table with:
spark.sql("DROP TABLE IF EXISTS testdb.fact_lists")
spark.sql("CREATE TABLE testdb.fact_lists USING DELTA LOCATION '/delta/fact-lists'")
I enabled the bloom filter with:
%sql
SET spark.databricks.io.skipping.bloomFilter.enabled = true;
SET delta.bloomFilter.enabled = true;
When I try to run the CREATE statement below for the BLOOMFILTER index, I get a "no viable alternative at input" error:
%sql
CREATE BLOOMFILTER INDEX
ON TABLE testdb.fact_lists
FOR COLUMNS(event_id OPTION(fpp=0.1, numItems=100))
Error:
com.databricks.backend.common.rpc.DatabricksExceptions$SQLExecutionException: org.apache.spark.sql.catalyst.parser.ParseException:
no viable alternative at input 'CREATE BLOOMFILTER'(line 1, pos 7)
== SQL ==
CREATE BLOOMFILTER INDEX
-------^^^
ON TABLE testdb.fact_lists
FOR COLUMNS(event_id OPTION(fpp=0.1, numItems=100))
at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:298)
at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:159)
at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:88)
at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:106)
at com.databricks.sql.parser.DatabricksSqlParser.$anonfun$parsePlan$1(DatabricksSqlParser.scala:77)
at com.databricks.sql.parser.DatabricksSqlParser.parse(DatabricksSqlParser.scala:97)
at com.databricks.sql.parser.DatabricksSqlParser.parsePlan(DatabricksSqlParser.scala:74)
at org.apache.spark.sql.SparkSession.$anonfun$sql$2(SparkSession.scala:801)
at com.databricks.spark.util.FrameProfiler$.record(FrameProfiler.scala:80)
at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:151)
at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:801)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:968)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:798)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:695)
at com.databricks.backend.daemon.driver.SQLDriverLocal.$anonfun$executeSql$1(SQLDriverLocal.scala:91)
at scala.collection.immutable.List.map(List.scala:293)
at com.databricks.backend.daemon.driver.SQLDriverLocal.executeSql(SQLDriverLocal.scala:37)
at com.databricks.backend.daemon.driver.SQLDriverLocal.repl(SQLDriverLocal.scala:145)
at com.databricks.backend.daemon.driver.DriverLocal.$anonfun$execute$11(DriverLocal.scala:605)
at com.databricks.logging.Log4jUsageLoggingShim$.$anonfun$withAttributionContext$1(Log4jUsageLoggingShim.scala:33)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
at com.databricks.logging.AttributionContext$.withValue(AttributionContext.scala:94)
at com.databricks.logging.Log4jUsageLoggingShim$.withAttributionContext(Log4jUsageLoggingShim.scala:31)
at com.databricks.logging.UsageLogging.withAttributionContext(UsageLogging.scala:205)
at com.databricks.logging.UsageLogging.withAttributionContext$(UsageLogging.scala:204)
at com.databricks.backend.daemon.driver.DriverLocal.withAttributionContext(DriverLocal.scala:60)
at com.databricks.logging.UsageLogging.withAttributionTags(UsageLogging.scala:240)
at com.databricks.logging.UsageLogging.withAttributionTags$(UsageLogging.scala:225)
at com.databricks.backend.daemon.driver.DriverLocal.withAttributionTags(DriverLocal.scala:60)
at com.databricks.backend.daemon.driver.DriverLocal.execute(DriverLocal.scala:582)
at com.databricks.backend.daemon.driver.DriverWrapper.$anonfun$tryExecutingCommand$1(DriverWrapper.scala:615)
at scala.util.Try$.apply(Try.scala:213)
at com.databricks.backend.daemon.driver.DriverWrapper.tryExecutingCommand(DriverWrapper.scala:607)
at com.databricks.backend.daemon.driver.DriverWrapper.executeCommandAndGetError(DriverWrapper.scala:526)
at com.databricks.backend.daemon.driver.DriverWrapper.executeCommand(DriverWrapper.scala:561)
at com.databricks.backend.daemon.driver.DriverWrapper.runInnerLoop(DriverWrapper.scala:431)
at com.databricks.backend.daemon.driver.DriverWrapper.runInner(DriverWrapper.scala:374)
at com.databricks.backend.daemon.driver.DriverWrapper.run(DriverWrapper.scala:225)
at java.lang.Thread.run(Thread.java:748)
at com.databricks.backend.daemon.driver.SQLDriverLocal.executeSql(SQLDriverLocal.scala:130)
at com.databricks.backend.daemon.driver.SQLDriverLocal.repl(SQLDriverLocal.scala:145)
at com.databricks.backend.daemon.driver.DriverLocal.$anonfun$execute$11(DriverLocal.scala:605)
at com.databricks.logging.Log4jUsageLoggingShim$.$anonfun$withAttributionContext$1(Log4jUsageLoggingShim.scala:33)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
at com.databricks.logging.AttributionContext$.withValue(AttributionContext.scala:94)
at com.databricks.logging.Log4jUsageLoggingShim$.withAttributionContext(Log4jUsageLoggingShim.scala:31)
at com.databricks.logging.UsageLogging.withAttributionContext(UsageLogging.scala:205)
at com.databricks.logging.UsageLogging.withAttributionContext$(UsageLogging.scala:204)
at com.databricks.backend.daemon.driver.DriverLocal.withAttributionContext(DriverLocal.scala:60)
at com.databricks.logging.UsageLogging.withAttributionTags(UsageLogging.scala:240)
at com.databricks.logging.UsageLogging.withAttributionTags$(UsageLogging.scala:225)
at com.databricks.backend.daemon.driver.DriverLocal.withAttributionTags(DriverLocal.scala:60)
at com.databricks.backend.daemon.driver.DriverLocal.execute(DriverLocal.scala:582)
at com.databricks.backend.daemon.driver.DriverWrapper.$anonfun$tryExecutingCommand$1(DriverWrapper.scala:615)
at scala.util.Try$.apply(Try.scala:213)
at com.databricks.backend.daemon.driver.DriverWrapper.tryExecutingCommand(DriverWrapper.scala:607)
at com.databricks.backend.daemon.driver.DriverWrapper.executeCommandAndGetError(DriverWrapper.scala:526)
at com.databricks.backend.daemon.driver.DriverWrapper.executeCommand(DriverWrapper.scala:561)
at com.databricks.backend.daemon.driver.DriverWrapper.runInnerLoop(DriverWrapper.scala:431)
at com.databricks.backend.daemon.driver.DriverWrapper.runInner(DriverWrapper.scala:374)
at com.databricks.backend.daemon.driver.DriverWrapper.run(DriverWrapper.scala:225)
at java.lang.Thread.run(Thread.java:748)
Kindly assist. Thanks in advance!
I got the same error when I used the same query to create a Bloomfilter index in Databricks 10.4 LTS on my sample table:
CREATE BLOOMFILTER INDEX
ON TABLE factlists
FOR COLUMNS(id OPTION(fpp=0.1, numItems=100))
Error message:
ParseException:
no viable alternative at input 'CREATE bloomfilter'(line 1, pos 7)
== SQL ==
CREATE bloomfilter INDEX
-------^^^
ON TABLE factlists
FOR COLUMNS(id OPTION(fpp=0.1, numItems=100))
The error was caused by incorrect syntax. The Bloomfilter index was created successfully with the following modified query (OPTIONS instead of OPTION):
CREATE bloomfilter INDEX
ON TABLE factlists
FOR COLUMNS(id OPTIONS(fpp=0.1, numItems=100))
In your query, change OPTION to OPTIONS (the cause of the error) and the statement should succeed.
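For completeness, here is how the corrected sequence might look from a Python cell via spark.sql() (a sketch assuming the notebook's spark session; table and column names are taken from the question):

# Enable data-skipping bloom filters for the session.
spark.sql("SET spark.databricks.io.skipping.bloomFilter.enabled = true")

# OPTIONS (plural) is the keyword the parser accepts.
spark.sql("""
    CREATE BLOOMFILTER INDEX
    ON TABLE testdb.fact_lists
    FOR COLUMNS(event_id OPTIONS(fpp=0.1, numItems=100))
""")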
I'm trying to run a PySpark application like this:
from pyspark.sql import SparkSession

with SparkSession.builder.appName("Spark App").getOrCreate() as spark:
    dataframe_mysql = spark.read.format('jdbc').options(
        url="jdbc:mysql://.../...",
        driver='com.mysql.cj.jdbc.Driver',
        dbtable='my_table',
        user=...,
        password=...,
        partitionColumn='id',
        lowerBound=0,
        upperBound=10000000,
        numPartitions=11,
        fetchsize=1000000,
        isolationLevel='NONE'
    ).load()
    dataframe_mysql = dataframe_mysql.filter("date > '2022-01-01'")
    dataframe_mysql.write.parquet('...')
I found that Spark doesn't load the data from MySQL until the write executes, and that it lets the database take care of filtering the data, so the SQL the database receives may look like:
select * from my_table where id > ... and id < ... and date > '2022-01-01'
My table is too big and there is no index on the date column, so the database can't handle the filtering. How can I load the data into Spark's memory before filtering? I want the query sent to the database to be:
select * from my_table where id > ... and id < ...
As per samkart's comment, setting pushDownPredicate to False solves this problem.
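A minimal sketch of that change applied to the reader from the question (everything else stays the same; the option value is passed as a string):

    dataframe_mysql = spark.read.format('jdbc').options(
        url="jdbc:mysql://.../...",
        driver='com.mysql.cj.jdbc.Driver',
        dbtable='my_table',
        user=...,
        password=...,
        partitionColumn='id',
        lowerBound=0,
        upperBound=10000000,
        numPartitions=11,
        fetchsize=1000000,
        isolationLevel='NONE',
        # Disable filter pushdown: the id-range predicates from the
        # partitioning options are still sent, but the date filter is
        # applied by Spark after the rows are fetched.
        pushDownPredicate='false',
    ).load()

With this, the database only sees the id-range queries, and the date filter runs in Spark's memory.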
I use Spark SQL v2.4 with the SQL API. I have a SQL query that fails when I run the job in Spark, with the following error:
WARN SharedInMemoryCache: Evicting cached table partition metadata from memory due to size constraints
(spark.sql.hive.filesourcePartitionFileCacheSize = 262144000 bytes).
This may impact query planning performance.
ERROR TransportClient: Failed to send RPC RPC 8371705265602543276 to xx.xxx.xxx.xx:52790:java.nio.channels.ClosedChannelException
The issue occurs when I trigger the write command to save the query output to a Parquet file on S3.
The query is:
create temp view last_run_dt
as
select dt,
to_date(last_day(add_months(to_date('[${last_run_date}]','yyyy-MM-dd'), -1)), 'yyyy-MM-dd') as dt_lst_day_prv_mth
from date_dim
where dt = add_months(to_date('[${last_run_date}]','yyyy-MM-dd'), -1);
create temp view get_plcy
as
select plcy_no, cust_id
from (select
plcy_no,
cust_id,
eff_date,
row_number() over (partition by plcy_no order by eff_date desc) AS row_num
from plcy_mstr pm
cross join last_run_dt lrd
on pm.curr_pur_dt <= lrd.dt_lst_day_prv_mth
and pm.fund_type NOT IN (27, 36, 52)
and pm.fifo_time <= '2022-02-12 01:25:00'
and pm.plcy_no is not null
)
where row_num = 1;
I am writing the output as:
df.coalesce(10).write.parquet('s3:/some/dir/data', mode="overwrite", compression="snappy")
The "plcy_mstr" table in the above query is a big table, about 500 GB in size, partitioned on the eff_dt column, one partition per date.
I have tried to increase the executor memory by applying the following configurations, but the job still fails.
set spark.driver.memory=20g;
set spark.executor.memory=20g;
set spark.executor.cores=3;
set spark.executor.instances=30;
set spark.memory.fraction=0.75;
set spark.driver.maxResultSize=0;
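Aside from the memory settings: the eviction warning itself points at the partition-metadata cache being too small for a table with one partition per day. Raising it is a planning-performance tweak rather than a guaranteed fix for the RPC failure, but a sketch follows (the value is illustrative, not a tuned recommendation; the cache is shared per JVM, so it has to be set when the session is created or via --conf):

from pyspark.sql import SparkSession

# Illustrative value (~1 GB); the default shown in the warning is
# 262144000 bytes.
spark = (
    SparkSession.builder
    .config("spark.sql.hive.filesourcePartitionFileCacheSize", 1048576000)
    .getOrCreate()
)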
The cluster contains 20 nodes with 8 cores each and 64GB of memory.
Can anyone please help me identify the issue and fix the job? Any help is appreciated.
Happy to provide more information if required.
Thanks
I have a 3-node Cassandra cluster. I created a system key using dsetool createsystemkey 'AES/ECB/PKCS5Padding' 128 system_key, with proper permissions and ownership on /etc/ and /etc/dse/conf/.
But when I create a table with encryption, I get the following error:
ConfigurationException: ErrorMessage code=2300 [Query invalid because
of configuration issue] message="Encryptor.create() threw an error:
java.lang.RuntimeException Failed to initialize Encryptor:
com.datastax.bdp.cassandra.crypto.KeyGenerationException:
java.io.IOException: Couldn't encrypt input"
Table schema:
CREATE TABLE test (
    id text PRIMARY KEY,
    data text
) WITH compression = {
    'sstable_compression': 'Encryptor',
    'cipher_algorithm': 'AES/ECB/PKCS5Padding',
    'secret_key_strength': 128,
    'chunk_length_kb': 1
};
My DSE version : 4.8.4
I am trying out Cassandra for the first time, running it locally as a simple session-management DB. [Cassandra-2.0.4, CQL3, DataStax driver 2.0.0-rc2]
The following count query works fine when there is no data in the table:
select count(*) from session_data where app_name=? and account=? and last_access > ?
But after even a single row is inserted into the table, the query fails with the following error:
java.lang.AssertionError
at org.apache.cassandra.db.filter.ExtendedFilter$WithClauses.getExtraFilter(ExtendedFilter.java:258)
at org.apache.cassandra.db.ColumnFamilyStore.filter(ColumnFamilyStore.java:1719)
at org.apache.cassandra.db.ColumnFamilyStore.getRangeSlice(ColumnFamilyStore.java:1674)
at org.apache.cassandra.db.PagedRangeCommand.executeLocally(PagedRangeCommand.java:111)
at org.apache.cassandra.service.StorageProxy$LocalRangeSliceRunnable.runMayThrow(StorageProxy.java:1418)
at org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:1931)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
Here is the schema I am using:
CREATE KEYSPACE session WITH replication= {'class': 'SimpleStrategy', 'replication_factor': 1};
CREATE TABLE session_data (
username text,
session_id text,
app_name text,
account text,
last_access timestamp,
created_on timestamp,
PRIMARY KEY (username, session_id, app_name, account)
);
create index sessionIndex ON session_data (session_id);
create index sessionAppName ON session_data (app_name);
create index lastAccessIndex ON session_data (last_access);
I am wondering if there is something wrong in the table definition/indexes or the query itself. Any help/insight would be greatly appreciated.
It looks like you're tripping over a bug in Cassandra. Here is the assertion and related comments in the Cassandra sources:
/*
* This method assumes the IndexExpression names are valid column names, which is not the
* case with composites. This is ok for now however since:
* 1) CompositeSearcher doesn't use it.
* 2) We don't yet allow non-indexed range slice with filters in CQL3 (i.e. this will never be
* called by CFS.filter() for composites).
*/
assert !(cfs.getComparator() instanceof CompositeType);
This code was modified between cassandra-2.0.4 and trunk as part of ticket CASSANDRA-5417, but it's not clear to me that the author was aware of this issue. The assertion was removed, but the comment was not. I would recommend submitting a bug report to the Cassandra project.
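As a practical aside (my suggestion, separate from the bug report): since the query always pins app_name and account and ranges over last_access, remodeling the table so those columns form the key avoids the secondary-index range scan that trips the assertion. A sketch with the DataStax Python driver (the table name session_data_by_app and the query values are hypothetical):

from datetime import datetime, timedelta

from cassandra.cluster import Cluster

# Connect to the local node; 'session' is the keyspace from the question.
session = Cluster(['127.0.0.1']).connect('session')

# Partition by (app_name, account) and cluster by last_access, so the
# count below is an ordinary slice query with no secondary index involved.
session.execute("""
    CREATE TABLE IF NOT EXISTS session_data_by_app (
        app_name text,
        account text,
        last_access timestamp,
        session_id text,
        username text,
        created_on timestamp,
        PRIMARY KEY ((app_name, account), last_access, session_id)
    )
""")

cutoff = datetime.utcnow() - timedelta(hours=1)
row = session.execute(
    "SELECT count(*) FROM session_data_by_app "
    "WHERE app_name = %s AND account = %s AND last_access > %s",
    ('my_app', 'my_account', cutoff)
).one()

The trade-off is an extra write per session update, which is the usual query-first modeling approach in Cassandra.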