How to create external Hive table without location? - apache-spark

I have a Spark SQL 2.1.1 job running on a YARN cluster in cluster mode, in which I want to create an empty external Hive table (partitions with locations will be added in a later step).
CREATE EXTERNAL TABLE IF NOT EXISTS new_table (id BIGINT, StartTime TIMESTAMP, EndTime TIMESTAMP) PARTITIONED BY (year INT, month INT, day INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
When I run the job I get the error:
CREATE EXTERNAL TABLE must be accompanied by LOCATION
But when I run the same query in the Hive Editor in Hue, it runs just fine. I tried to find an answer in the Spark SQL 2.1.1 documentation but came up empty.
Does anyone know why Spark SQL is stricter about this query?

TL;DR EXTERNAL with no LOCATION is not allowed.
The definitive answer is in Spark SQL's grammar definition file SqlBase.g4.
You can find the definition of CREATE EXTERNAL TABLE as createTableHeader:
CREATE TEMPORARY? EXTERNAL? TABLE (IF NOT EXISTS)? tableIdentifier
This definition is used in the supported SQL statements.
Unless I'm mistaken, locationSpec is optional according to the ANTLR grammar. The code may decide otherwise, and it seems it does.
scala> spark.version
res4: String = 2.3.0-SNAPSHOT
val q = "CREATE EXTERNAL TABLE IF NOT EXISTS new_table (id BIGINT, StartTime TIMESTAMP, EndTime TIMESTAMP) PARTITIONED BY (year INT, month INT, day INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'"
scala> sql(q)
org.apache.spark.sql.catalyst.parser.ParseException:
Operation not allowed: CREATE EXTERNAL TABLE must be accompanied by LOCATION(line 1, pos 0)
== SQL ==
CREATE EXTERNAL TABLE IF NOT EXISTS new_table (id BIGINT, StartTime TIMESTAMP, EndTime TIMESTAMP) PARTITIONED BY (year INT, month INT, day INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
^^^
at org.apache.spark.sql.catalyst.parser.ParserUtils$.operationNotAllowed(ParserUtils.scala:39)
at org.apache.spark.sql.execution.SparkSqlAstBuilder$$anonfun$visitCreateHiveTable$1.apply(SparkSqlParser.scala:1096)
at org.apache.spark.sql.execution.SparkSqlAstBuilder$$anonfun$visitCreateHiveTable$1.apply(SparkSqlParser.scala:1064)
at org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:99)
at org.apache.spark.sql.execution.SparkSqlAstBuilder.visitCreateHiveTable(SparkSqlParser.scala:1064)
at org.apache.spark.sql.execution.SparkSqlAstBuilder.visitCreateHiveTable(SparkSqlParser.scala:55)
at org.apache.spark.sql.catalyst.parser.SqlBaseParser$CreateHiveTableContext.accept(SqlBaseParser.java:1124)
at org.antlr.v4.runtime.tree.AbstractParseTreeVisitor.visit(AbstractParseTreeVisitor.java:42)
at org.apache.spark.sql.catalyst.parser.AstBuilder$$anonfun$visitSingleStatement$1.apply(AstBuilder.scala:71)
at org.apache.spark.sql.catalyst.parser.AstBuilder$$anonfun$visitSingleStatement$1.apply(AstBuilder.scala:71)
at org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:99)
at org.apache.spark.sql.catalyst.parser.AstBuilder.visitSingleStatement(AstBuilder.scala:70)
at org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:69)
at org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:68)
at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:97)
at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:48)
at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:68)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:623)
... 48 elided
The default SparkSqlParser (with astBuilder as SparkSqlAstBuilder) has the following assertion that leads to the exception:
if (external && location.isEmpty) {
  operationNotAllowed("CREATE EXTERNAL TABLE must be accompanied by LOCATION", ctx)
}
I'd consider reporting an issue in Spark's JIRA if you think the case should be allowed. See SPARK-2825 for a strong argument for supporting it:
CREATE EXTERNAL TABLE already works as far as I know and should have the same semantics as Hive.
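One possible workaround (a sketch, not something the answer above prescribes): since Spark only rejects EXTERNAL without a LOCATION, you could supply an explicit base location up front and still attach per-partition locations later, which matches the plan described in the question. The paths below are hypothetical placeholders.
CREATE EXTERNAL TABLE IF NOT EXISTS new_table (id BIGINT, StartTime TIMESTAMP, EndTime TIMESTAMP)
PARTITIONED BY (year INT, month INT, day INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/some/base/path';  -- placeholder base path, just to satisfy the parser

-- partitions with their own locations can still be added afterwards:
ALTER TABLE new_table ADD PARTITION (year=2017, month=1, day=1) LOCATION '/some/base/path/2017/01/01';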

Related

Hive, how to partition by a column with null values, putting all nulls in one partition

I am using Hive, and the IDE is Hue. I am trying different combinations of columns to choose as my partition key(s).
The definition of my original table is as follows:
CREATE EXTERNAL TABLE `my_hive_db`.`my_table`(
  `col_id` bigint,
  `result_section__col2` string,
  `result_section_col3` string,
  `result_section_col4` string,
  `result_section_col5` string,
  `result_section_col6__label` string,
  `result_section_col7__label_id` bigint,
  `result_section_text` string,
  `result_section_unit` string,
  `result_section_col` string,
  `result_section_title` string,
  `result_section_title_id` bigint,
  `col13` string,
  `timestamp` bigint,
  `date_day` string
)
PARTITIONED BY (
  `date_year` string,
  `date_month` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3a://some/where/in/amazon/s3';
The above code works properly. But when I create a new table with date_day as the partition key, the table is empty and I need to run MSCK REPAIR TABLE. When I do, however, I get the following error:
Error while compiling statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.ddl.DDLTask
When the partition keys were date_year and date_month, MSCK worked properly.
The definition of the table I am getting the error for is as follows:
CREATE EXTERNAL TABLE `my_hive_db`.`my_table`(
  `col_id` bigint,
  `result_section__col2` string,
  `result_section_col3` string,
  `result_section_col4` string,
  `result_section_col5` string,
  `result_section_col6__label` string,
  `result_section_col7__label_id` bigint,
  `result_section_text` string,
  `result_section_unit` string,
  `result_section_col` string,
  `result_section_title` string,
  `result_section_title_id` bigint,
  `col13` string,
  `timestamp` bigint,
  `date_year` string,
  `date_month` string
)
PARTITIONED BY (
  `date_day` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3a://some/where/in/amazon/s3';
After this, the following query returns nothing:
Select * From `my_hive_db`.`my_table` Limit 10;
I therefore ran the following command:
MSCK REPAIR TABLE `my_hive_db`.`my_table`;
And I get the error: Error while compiling statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.ddl.DDLTask
I checked this link, as it describes exactly the error I am getting, but when I use the solution provided:
set hive.msck.path.validation=ignore;
MSCK REPAIR TABLE table_name;
I get a different error:
Error while processing statement: Cannot modify hive.msck.path.validation at runtime. It is not in list of params that are allowed to be modified at runtime.
I think the reason I am getting these errors is that there are more than 200 million records with a null value for date_day.
There are 31 distinct non-null date_day values. I would like to partition my table into 32 partitions, one for each distinct value of the date_day field, with all the null values going into a separate partition. Is there a way to do so (partitioning by a column with null values)?
If this can be achieved with Spark, I am also open to using it.
This is part of a bigger problem of changing partition keys by recreating a table as mentioned in this link in answer to my other question.
Thank you for your help.
You seem to misunderstand how Hive's partitioning works.
Hive stores data as files on HDFS (or S3, or some other distributed storage).
If you create a non-partitioned Parquet table called my_schema.my_table, you will see its files stored in your distributed storage in a folder like:
hive/warehouse/my_schema.db/my_table/part_00001.parquet
hive/warehouse/my_schema.db/my_table/part_00002.parquet
...
If you create a table partitioned by a column p_col, the files will look like
hive/warehouse/my_schema.db/my_table/p_col=value1/part_00001.parquet
hive/warehouse/my_schema.db/my_table/p_col=value1/part_00002.parquet
...
hive/warehouse/my_schema.db/my_table/p_col=value2/part_00001.parquet
hive/warehouse/my_schema.db/my_table/p_col=value2/part_00002.parquet
...
The MSCK REPAIR TABLE command lets you automatically load the partitions when you create an external table over existing partitioned data.
Let's say you have folders on s3 that look like this:
hive/warehouse/my_schema.db/my_table/p_col=value1/part_00001.parquet
hive/warehouse/my_schema.db/my_table/p_col=value2/part_00001.parquet
hive/warehouse/my_schema.db/my_table/p_col=value3/part_00001.parquet
You create an external table with
CREATE External Table my_schema.my_table(
... some columns ...
)
PARTITIONED BY (p_col STRING)
The table will be created but empty, because Hive has not detected the partitions yet. You then run MSCK REPAIR TABLE my_schema.my_table, and Hive recognizes that your partition column p_col matches the partitioning scheme on S3 (/p_col=value1/), as in the sketch below.
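A minimal sketch of that step (the SHOW PARTITIONS is only there to verify what MSCK discovered; it is not required):
MSCK REPAIR TABLE my_schema.my_table;
-- verify which partitions Hive has registered
SHOW PARTITIONS my_schema.my_table;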
From what I could understand from your other question, you are trying to change the partitioning scheme of the table by doing
CREATE External Table my_schema.my_table(
... some columns ...
)
PARTITIONED BY (p_another_col STRING)
and you are getting an error message because p_another_col doesn't match the column used on S3, which was p_col.
And this error is perfectly normal, since what you are doing doesn't make sense.
As stated in the other question's answer, you need to create a copy of the first table, with a different partitioning scheme.
You should instead try something like this:
CREATE EXTERNAL TABLE my_hive_db.my_table_2(
  `col_id` bigint,
  `result_section__col2` string,
  `result_section_col3` string,
  `result_section_col4` string,
  `result_section_col5` string,
  `result_section_col6__label` string,
  `result_section_col7__label_id` bigint,
  `result_section_text` string,
  `result_section_unit` string,
  `result_section_col` string,
  `result_section_title` string,
  `result_section_title_id` bigint,
  `col13` string,
  `timestamp` bigint,
  `date_year` string,
  `date_month` string
)
PARTITIONED BY (`date_day` string)
and then populate your new table with dynamic partitioning (see the note on dynamic-partition settings after the query below):
INSERT OVERWRITE TABLE my_hive_db.my_table_2 PARTITION(date_day)
SELECT
col_id,
result_section__col2,
result_section_col3,
result_section_col4,
result_section_col5,
result_section_col6__label,
result_section_col7__label_id,
result_section_text,
result_section_unit,
result_section_col,
result_section_title,
result_section_title_id,
col13,
timestamp,
date_year,
date_month,
date_day
FROM my_hive_db.my_table_1
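One hedged addition to the answer above: for the dynamic-partition INSERT to run, Hive typically needs the dynamic-partitioning settings enabled first (the same SET commands that appear in other questions on this page), and rows whose date_day is NULL land in Hive's default partition, which effectively gives you the separate partition for null values asked about in the question:
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- after the INSERT, rows with a NULL date_day appear under Hive's default partition name,
-- e.g. date_day=__HIVE_DEFAULT_PARTITION__
SHOW PARTITIONS my_hive_db.my_table_2;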

Can I use regular expression in PARTITION BY?

(
ResponseRgBasketId STRING,
RawStandardisedLoadDateTime TIMESTAMP,
InfoMartLoadDateTime TIMESTAMP,
Operaame STRING,
RequestTimestamp TIMESTAMP,
RequestSiteId STRING,
RequestSalePointId STRING,
RequestdTypeId STRING,
RequeetValue DECIMAL(10,2),
ResponsegTimestamp TIMESTAMP,
RequessageId STRING,
RequestBasketId STRING,
ResponsesageId STRING,
RequestTransmitAttempt INT,
ResponseCode STRING,
RequestasketItems INT,
ResponseFinancialTimestamp TIMESTAMP,
RequeketJsonString STRING,
LoyaltyId STRING
)
USING DELTA
PARTITIONED BY (RequestTimestamp)
TBLPROPERTIES
(
delta.deletedFileRetentionDuration = "interval 1 seconds",
delta.autoOptimize.optimizeWrite = true
)
The table is partitioned by RequestTimestamp, whose values look like 2020-12-12T07:39:35.000+0000. Could I change the partition column to a different format, something like a date-only value (2020-12-12), in PARTITIONED BY?
Short answer: No regexp or other transformation is possible in PARTITIONED BY.
The only solution is to apply substr(timestamp, 1, 10) during or before the load, as sketched below.
See also this answer: https://stackoverflow.com/a/64171676/2700344
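For illustration, a minimal sketch of applying that transformation at load time, assuming a Delta CTAS; the names my_table_by_date, my_source_table, and RequestDate are hypothetical placeholders:
-- derive a date-only partition column while loading the data
CREATE TABLE my_table_by_date
USING DELTA
PARTITIONED BY (RequestDate)
AS
SELECT
  *,
  substr(RequestTimestamp, 1, 10) AS RequestDate  -- keeps only the date part, e.g. '2020-12-12'
FROM my_source_table;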
Long answer:
No regexp is possible in PARTITIONED BY. No functions are allowed in table DDL; only the type can be specified. The type in a column specification works as a constraint and at the same time can cause implicit type conversion. For example, if you are loading strings into a date column, they will be cast implicitly where possible and loaded into the null default partition where the cast is not possible. Likewise, if you are loading BIGINT values into an INT column, they will be silently truncated, and as a result you will see corrupted data and duplicates.
Does the same implicit cast work with PARTITIONED BY? Let's see:
DROP TABLE IF EXISTS test_partition;
CREATE TABLE IF NOT EXISTS test_partition (Id int)
partitioned by (dt date) --Hope timestamp will be truncated to DATE
;
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table test_partition partition(dt)
select 1 as id, current_timestamp as dt;
show partitions test_partition;
Result (we expected the timestamp to be truncated to a DATE...):
dt=2021-03-24 10%3A19%3A19.985
No, it does not work. I tested the same with a varchar(10) partition column and strings like yours.
See short answer.

Hive - insert into table partition throwing error

I am trying to create a partitioned table in Hive on Spark and load it with data available in another Hive table.
I am getting the following error while loading the data:
Error: org.apache.spark.sql.AnalysisException:
org.apache.hadoop.hive.ql.metadata.Table.ValidationFailureSemanticException:
Partition spec {cardsuit=, cardcolor=, cardSuit=SPA, cardColor=BLA}
contains non-partition columns;
The following are the commands I used to execute the task:
create table if not exists hive_tutorial.hive_table(color string, suit string,value string) comment 'first hive table' row format delimited fields terminated by '|' stored as TEXTFILE;
LOAD DATA LOCAL INPATH 'file:///E:/Kapil/software-study/Big_Data_NoSql/hive/deckofcards.txt' OVERWRITE INTO TABLE hive_table; --data is correctly populated(52 rows)
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
create table if not exists hive_tutorial.hive_table_partitioned(color string, suit string,value int) comment 'first hive table' partitioned by (cardSuit string,cardColor string) row format delimited fields terminated by '|' stored as TEXTFILE;
INSERT INTO TABLE hive_table_partitioned PARTITION (cardSuit,cardColor) select color,suit,value,substr(suit, 1, 3) as cardSuit,substr(color, 1, 3) as cardColor from hive_table;
-- alternatively I tried
INSERT OVERWRITE TABLE hive_table_partitioned PARTITION (cardSuit,cardColor) select color,suit,value,substr(suit, 1, 3) as cardSuit,substr(color, 1, 3) as cardColor from hive_table;
Sample of data:
BLACK|SPADE|2
BLACK|SPADE|3
BLACK|SPADE|4
BLACK|SPADE|5
BLACK|SPADE|6
BLACK|SPADE|7
BLACK|SPADE|8
BLACK|SPADE|9
I am using Spark 2.2.0 and Java 1.8.0_31.
I have checked and tried the answers given in a similar thread but could not solve my problem:
SemanticException Partition spec {col=null} contains non-partition columns
Am I missing something here?
After carefully re-reading the scripts above, I noticed that the value column is of type int in the partitioned table whereas it is a string in the original table. The error message was misleading! The corrected definition is sketched below.
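A hedged sketch of the fix implied by that observation: declare value with the same type (string) as in the source table, leaving everything else unchanged.
CREATE TABLE IF NOT EXISTS hive_tutorial.hive_table_partitioned(
  color string, suit string, value string)  -- value is now string, matching hive_table
COMMENT 'first hive table'
PARTITIONED BY (cardSuit string, cardColor string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE;

INSERT INTO TABLE hive_table_partitioned PARTITION (cardSuit, cardColor)
SELECT color, suit, value, substr(suit, 1, 3) AS cardSuit, substr(color, 1, 3) AS cardColor
FROM hive_table;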

Having trouble querying by dates using the Java Cassandra Spark SQL Connector

I'm trying to use Spark SQL to query a table by a date range. For example, I'm trying to run a SQL statement like: SELECT * FROM trip WHERE utc_startdate >= '2015-01-01' AND utc_startdate <= '2015-12-31' AND deployment_id = 1 AND device_id = 1. When I run the query, no error is thrown, but I'm not receiving any results back when I would expect some. When I run the query without the date range, I do get results back.
SparkConf sparkConf = new SparkConf().setMaster("local").setAppName("SparkTest")
.set("spark.executor.memory", "1g")
.set("spark.cassandra.connection.host", "localhost")
.set("spark.cassandra.connection.native.port", "9042")
.set("spark.cassandra.connection.rpc.port", "9160");
JavaSparkContext context = new JavaSparkContext(sparkConf);
JavaCassandraSQLContext sqlContext = new JavaCassandraSQLContext(context);
sqlContext.sqlContext().setKeyspace("mykeyspace");
String sql = "SELECT * FROM trip WHERE utc_startdate >= '2015-01-01' AND utc_startdate < '2015-12-31' AND deployment_id = 1 AND device_id = 1";
JavaSchemaRDD rdd = sqlContext.sql(sql);
List<Row> rows = rdd.collect(); // rows.size() is zero when I would expect it to contain numerous rows.
Schema:
CREATE TABLE trip (
device_id bigint,
deployment_id bigint,
utc_startdate timestamp,
other columns....
PRIMARY KEY ((device_id, deployment_id), utc_startdate)
) WITH CLUSTERING ORDER BY (utc_startdate ASC);
Any help would be greatly appreciated.
What does your table schema (in particular, your PRIMARY KEY definition) look like? Even without seeing it, I am fairly certain that you are seeing this behavior because you are not qualifying your query with a partition key. Using the ALLOW FILTERING directive will filter the rows by date (assuming that is your clustering key), but that is not a good solution for a big cluster or large dataset.
Let's say that you are querying users in a certain geographic region. If you used region as a partition key, you could run this query, and it would work:
SELECT * FROM users
WHERE region='California'
AND date >= '2015-01-01' AND date <= '2015-12-31';
Give Patrick McFadin's article on Getting Started with Timeseries Data a read. That has some good examples that should help you.
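For comparison, a hedged sketch against the trip schema shown in the question (run directly in cqlsh to check that the data and predicates line up): both partition key columns, device_id and deployment_id, are fully qualified with equality, and the range is applied to the clustering column utc_startdate, so no ALLOW FILTERING is needed.
SELECT * FROM trip
WHERE device_id = 1
  AND deployment_id = 1
  AND utc_startdate >= '2015-01-01'
  AND utc_startdate < '2016-01-01';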

DSE/Cassandra CQL now() does not work for timestamp type

I am having trouble using the now() function with the timestamp type.
Please take a look at the following code:
Table creation:
CREATE TABLE "Test" (
video_id UUID,
upload_timestamp TIMESTAMP,
title VARCHAR,
views INT,
PRIMARY KEY (video_id, upload_timestamp)
) WITH CLUSTERING ORDER BY (upload_timestamp DESC);
The problematic INSERT query:
INSERT INTO "Test" (video_id, upload_timestamp, title, views)
VALUES (uuid(), now(), 'Test', 0);
The INSERT query looks fine to me. However, when I execute it, I see the following error:
Unable to execute CQL script on 'XXX': cannot assign result of function now (type timeuuid) to upload_timestamp (type timestamp)
What am I doing wrong here?
I am using DataStax Enterprise 4.5.2.
now() returns a timeuuid, not a timestamp. You could try dateOf(now()). Have a read of this from the docs:
dateOf and unixTimestampOf
The dateOf and unixTimestampOf functions take a timeuuid argument and extract the embedded timestamp. However, while the dateOf function returns it with the timestamp type (which most clients, including cqlsh, interpret as a date), the unixTimestampOf function returns it as a raw bigint value.
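Applying that suggestion to the INSERT from the question (a sketch of the same statement, just wrapping now() in dateOf()):
INSERT INTO "Test" (video_id, upload_timestamp, title, views)
VALUES (uuid(), dateOf(now()), 'Test', 0);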
