Hive - alter struct column with nested struct field

I have a table with the following structure:
CREATE TABLE test.news
(
`event` string,
`description` string,
`info` struct<data:struct<url:string,adress:string,errors:string>, ip:string>,
`id` string)
PARTITIONED BY (
`date` string)
And I need to expand my 'info' column with the following elements:
enabled:string,no:string
To get the following table structure:
CREATE TABLE test.news
(
`event` string,
`description` string,
`info` struct<enabled:string,no:string,data:struct<url:string,adress:string,errors:string>, ip:string>,
`id` string)
PARTITIONED BY (
`date` string)
But after I alter the column, SELECT on the table fails with errors. This is the ALTER expression I used:
ALTER TABLE test.news
CHANGE COLUMN info info struct<enabled:string,no:string,data:struct<url:string,adress:string,errors:string>, ip:string>
Is it even possible to do this with a nested struct inside another struct?
This worked fine for me when I just needed to expand a struct with additional elements (without another struct inside the struct)...
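A possible direction, not from the original post, so treat it as an assumption to verify: on a partitioned table, CHANGE COLUMN only updates the table-level schema by default, while existing partitions keep the old struct definition, which commonly breaks subsequent SELECTs. Hive 1.1.0+ accepts a CASCADE keyword that pushes the change down to the partition metadata as well:
ALTER TABLE test.news
CHANGE COLUMN info info struct<enabled:string,no:string,data:struct<url:string,adress:string,errors:string>, ip:string> CASCADE;
Even then, old data files do not contain the new fields, so whether reads of existing partitions succeed depends on the storage format's schema-evolution behavior.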

Related

delete Apache hudi duplicate record key

I ran into some trouble in Hudi when deleting rows with the same record key via spark-sql.
For example, I created a table and set recordKey=empno:
CREATE TABLE emp_duplicate_pk (
empno int,
ename string,
job string,
mgr int,
hiredate string,
sal int,
comm int,
deptno int,
tx_date string
)
using hudi
options(
type='cow'
,primaryKey='empno'
,payloadclass='org.apache.hudi.common.model.OverwriteNonDefaultWithLatestAvroPayLoad'
,preCombineField='tx_date'
,hoodie.cleaner.commits.retained='10'
,hoodie.keep.min.commits='20'
,hoodie.keep.max.commits='30'
,hoodie.index.type='SIMPLE'
,hoodie.sql.insert.mode='non-strict'
,hoodie.combine.before.insert='false'
,hoodie.combine.before.upsert='false'
,hoodie.merge.allow.duplicate.on.inserts='true'
);
Then I insert some test records.
Hudi seems to allow duplicate primary keys with these attributes:
hoodie.sql.insert.mode='non-strict'
hoodie.combine.before.insert='false'
hoodie.combine.before.upsert='false'
hoodie.merge.allow.duplicate.on.inserts='true'
insert into emp_duplicate_pk values
(7369,'SMITH','CLERK',7902,'1980-12-17',800,100,20,'2022-11-17'),
(7499,'ALLEN','SALESMAN',7698,'1981-02-20',1600,300,30,'2022-11-17'),
(5233,'PTER','DEVELOPER',9192,'1996-05-30',5000,3000,10,'2022-11-13'),
(5233,'PTER','DEVELOPER',9192,'1996-05-30',5000,3000,10,'2022-11-14');
insert into emp_duplicate_pk values
(5233,'PTER','DEVELOPER',9192,'1996-05-30',5000,3000,10,'2022-11-15'),
(5233,'PTER','DEVELOPER',9192,'1996-05-30',5000,3000,10,'2022-11-16'),
(5233,'PTER','DEVELOPER',9192,'1996-05-30',5000,3000,10,'2022-11-17');
All the data can be queried (screenshot omitted).
Then I delete a record:
delete from emp_duplicate_pk where tx_date='2022-11-16';
It seems to delete all empno=5233 rows; only two rows are left (7369 and 7499) (screenshot omitted).
How can I delete exactly the tx_date='2022-11-16' row and keep the other rows? Can anyone help?
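An aside that is not part of the original thread, offered as a hedged explanation: Hudi matches DELETE (and upsert) operations on the record key plus partition path, so once several rows share empno=5233, a delete that hits that key removes the whole group. One possible workaround, sketched under the assumption that rewriting the table is acceptable and that the non-strict insert options above keep applying, is to stage the rows you want to keep and overwrite the table with them:
-- hypothetical staging table; the column list is taken from the DDL above
CREATE TABLE emp_keep AS
SELECT empno, ename, job, mgr, hiredate, sal, comm, deptno, tx_date
FROM emp_duplicate_pk
WHERE tx_date <> '2022-11-16';

INSERT OVERWRITE TABLE emp_duplicate_pk
SELECT empno, ename, job, mgr, hiredate, sal, comm, deptno, tx_date
FROM emp_keep;

DROP TABLE emp_keep;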

Hive, how to partition by a column with null values, putting all nulls in one partition

I am using Hive, and the IDE is Hue. I am trying different key combinations to choose for my partition key(s).
The definition of my original table is as follows:
CREATE External Table `my_hive_db`.`my_table`(
`col_id` bigint,
`result_section__col2` string,
`result_section_col3` string ,
`result_section_col4` string,
`result_section_col5` string,
`result_section_col6__label` string,
`result_section_col7__label_id` bigint ,
`result_section_text` string ,
`result_section_unit` string,
`result_section_col` string ,
`result_section_title` string,
`result_section_title_id` bigint,
`col13` string,
`timestamp` bigint,
`date_day` string
)
PARTITIONED BY (
`date_year` string,
`date_month` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3a://some/where/in/amazon/s3';
The above code works properly. But when I create a new table with date_day as the partition key, the table is empty and I need to run MSCK REPAIR TABLE. However, I get the following error:
Error while compiling statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.ddl.DDLTask
When the partition keys were date_year and date_month, MSCK worked properly.
The definition of the table I am getting the error for is as follows:
CREATE External Table `my_hive_db`.`my_table`(
`col_id` bigint,
`result_section__col2` string,
`result_section_col3` string ,
`result_section_col4` string,
`result_section_col5` string,
`result_section_col6__label` string,
`result_section_col7__label_id` bigint ,
`result_section_text` string ,
`result_section_unit` string,
`result_section_col` string ,
`result_section_title` string,
`result_section_title_id` bigint,
`col13` string,
`timestamp` bigint,
`date_year` string,
`date_month` string
)
PARTITIONED BY (
`date_day` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3a://some/where/in/amazon/s3';
After this, the following query returns nothing:
Select * From `my_hive_db`.`my_table` Limit 10;
I therefore ran the following command:
MSCK REPAIR TABLE `my_hive_db`.`my_table`;
And I get the error: Error while compiling statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.ddl.DDLTask
I checked this link, as it is exactly the error I am getting, but when using the solution provided:
set hive.msck.path.validation=ignore;
MSCK REPAIR TABLE table_name;
I get a different error:
Error while processing statement: Cannot modify hive.msck.path.validation at runtime. It is not in list of params that are allowed to be modified at runtime.
I think the reason I am getting these errors is that there are more than 200 million records with a null date_day value.
There are 31 distinct non-null date_day values. I would like to partition my table into 32 partitions, one for each distinct value of the date_day field, with all the null values going into a separate partition. Is there a way to do so (partitioning by a column with null values)?
If this can be achieved by spark, I am also open to use it.
This is part of a bigger problem of changing partition keys by recreating a table, as mentioned in this link in the answer to my other question.
Thank you for your help.
You seem to misunderstand how Hive's partitioning works.
Hive stores data into files on HDFS (or S3, or some other distributed folders).
If you create a non-partitioned parquet table called my_schema.my_table, you will see in your distributed storage files stored in a folder
hive/warehouse/my_schema.db/my_table/part_00001.parquet
hive/warehouse/my_schema.db/my_table/part_00002.parquet
...
If you create a table partitioned by a column p_col, the files will look like
hive/warehouse/my_schema.db/my_table/p_col=value1/part_00001.parquet
hive/warehouse/my_schema.db/my_table/p_col=value1/part_00002.parquet
...
hive/warehouse/my_schema.db/my_table/p_col=value2/part_00001.parquet
hive/warehouse/my_schema.db/my_table/p_col=value2/part_00002.parquet
...
The MSCK REPAIR TABLE command allows you to automatically reload the partitions when you create an external table.
Let's say you have folders on s3 that look like this:
hive/warehouse/my_schema.db/my_table/p_col=value1/part_00001.parquet
hive/warehouse/my_schema.db/my_table/p_col=value2/part_00001.parquet
hive/warehouse/my_schema.db/my_table/p_col=value3/part_00001.parquet
You create an external table with
CREATE External Table my_schema.my_table(
... some columns ...
)
PARTITIONED BY (p_col STRING)
The table will be created but empty, because Hive hasn't detected the partitions yet. You then run MSCK REPAIR TABLE my_schema.my_table, and Hive will recognize that your partition column p_col matches the partitioning scheme on S3 (/p_col=value1/).
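If you want to check what Hive registered, a small verification sketch (using the example table name from above):
MSCK REPAIR TABLE my_schema.my_table;
SHOW PARTITIONS my_schema.my_table;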
From what I could understand from your other question, you are trying to change the partitioning scheme of the table by doing
CREATE External Table my_schema.my_table(
... some columns ...
)
PARTITIONED BY (p_another_col STRING)
and you are getting an error message because p_another_col doesn't match with the column used in s3, which was p_col.
And this error is perfectly normal, since what you are doing doesn't make sense.
As stated in the other question's answer, you need to create a copy of the first table, with a different partitioning scheme.
You should instead try something like this:
CREATE External Table my_hive_db.my_table_2(
`col_id` bigint,
`result_section__col2` string,
`result_section_col3` string ,
`result_section_col4` string,
`result_section_col5` string,
`result_section_col6__label` string,
`result_section_col7__label_id` bigint ,
`result_section_text` string ,
`result_section_unit` string,
`result_section_col` string ,
`result_section_title` string,
`result_section_title_id` bigint,
`col13` string,
`timestamp` bigint,
`date_year` string,
`date_month` string
)
PARTITIONED BY (`date_day` string)
and then populate your new table with dynamic partitioning
INSERT OVERWRITE TABLE my_hive_db.my_table_2 PARTITION(date_day)
SELECT
col_id,
result_section__col2,
result_section_col3,
result_section_col4,
result_section_col5,
result_section_col6__label,
result_section_col7__label_id,
result_section_text,
result_section_unit,
result_section_col,
result_section_title,
result_section_title_id,
col13,
`timestamp`,
date_year,
date_month,
date_day
FROM my_hive_db.my_table_1
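One hedged addition to the answer above, which also covers the NULL question: with dynamic partitioning, Hive routes rows whose partition column is NULL into the default partition (controlled by hive.exec.default.partition.name, usually __HIVE_DEFAULT_PARTITION__), so the roughly 200 million rows with a null date_day end up together in one extra partition. Depending on your settings you may also need to enable dynamic partitioning before the INSERT:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
-- after the INSERT, rows with date_day IS NULL land under
-- .../date_day=__HIVE_DEFAULT_PARTITION__/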

Can I use a regular expression in PARTITIONED BY?

(
ResponseRgBasketId STRING,
RawStandardisedLoadDateTime TIMESTAMP,
InfoMartLoadDateTime TIMESTAMP,
Operaame STRING,
RequestTimestamp TIMESTAMP,
RequestSiteId STRING,
RequestSalePointId STRING,
RequestdTypeId STRING,
RequeetValue DECIMAL(10,2),
ResponsegTimestamp TIMESTAMP,
RequessageId STRING,
RequestBasketId STRING,
ResponsesageId STRING,
RequestTransmitAttempt INT,
ResponseCode STRING,
RequestasketItems INT,
ResponseFinancialTimestamp TIMESTAMP,
RequeketJsonString STRING,
LoyaltyId STRING
)
USING DELTA
PARTITIONED BY (RequestTimestamp)
TBLPROPERTIES
(
delta.deletedFileRetentionDuration = "interval 1 seconds",
delta.autoOptimize.optimizeWrite = true
)
The table is partitioned by RequestTimestamp, which has the full timestamp format shown above (e.g. 2020-12-12T07:39:35.000+0000). Could I change it to a different format, something like a plain date (2020-12-12), in the partitioning?
Short answer: No regexp or other transformation is possible in PARTITIONED BY.
The only solution is to apply substr(timestamp, 1, 10) during/before load.
See also this answer: https://stackoverflow.com/a/64171676/2700344
Long answer:
No regexp is possible in PARTITIONED BY. No functions are allowed in table DDL; only a type can be specified. The type in the column specification works as a constraint and at the same time can cause implicit type conversion. For example, if you are loading strings into dates, they will be cast implicitly if possible and loaded into the null default partition if the cast is not possible. Also, if you are loading BIGINT values into an INT column, they will be silently truncated; as a result you will see corrupted data and duplicates.
Does the same implicit cast work with PARTITIONED BY? Let's see:
DROP TABLE IF EXISTS test_partition;
CREATE TABLE IF NOT EXISTS test_partition (Id int)
partitioned by (dt date) --Hope timestamp will be truncated to DATE
;
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table test_partition partition(dt)
select 1 as id, current_timestamp as dt;
show partitions test_partition;
Result (we expect the timestamp to be truncated to DATE...):
dt=2021-03-24 10%3A19%3A19.985
No, it does not work. I tested the same with a varchar(10) column and strings like yours.
See short answer.
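To make the "during/before load" suggestion concrete, here is a hedged sketch; the names requests_by_day, requests_raw, and RequestDate are placeholders, not from the original post. The idea is to derive a date-formatted column at insert time and partition on that instead of the raw timestamp:
-- requests_by_day: a copy of the Delta table above with an extra RequestDate STRING
-- column declared last and PARTITIONED BY (RequestDate); requests_raw: the unpartitioned source
INSERT INTO requests_by_day
SELECT *, substr(CAST(RequestTimestamp AS STRING), 1, 10) AS RequestDate
FROM requests_raw;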

Why does AWS Athena return the "string" datatype for all of a table's fields on the "show create table" command or describe tables?

Why does AWS Athena return the "string" datatype for all of a table's fields on the "show create table" command or on describe tables?
For example, table t_mus_albums:
albumid (bigint)
title (string)
artistid (bigint)
When running
show create table t_mus_albums;
I get:
CREATE EXTERNAL TABLE `t_mus_albums`(
`albumid` string COMMENT 'from deserializer',
`title` string COMMENT 'from deserializer',
`artistid` string COMMENT 'from deserializer')
I think you might be doing something wrong, or, when generating the table automatically, you may not have correctly formatted data.
Here are the systematic steps to solve your problem.
Assume that your data is in the format below.
ID,Code,City,State
41,5,"Youngstown", OH
42,52,"Yankton", SD
46,35,"Yakima", WA
42,16,"Worcester", MA
43,37,"Wisconsin Dells", WI
36,5,"Winston-Salem", NC
Then your CREATE TABLE will look something like this:
CREATE EXTERNAL TABLE IF NOT EXISTS example.tbl_datatype (
`id` int,
`code` int,
`city` string,
`state` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = ',',
'field.delim' = ','
) LOCATION 's3://example-bucket/location/a/'
TBLPROPERTIES ('has_encrypted_data'='false');
Then, run the query to describe the table:
SHOW CREATE TABLE tbl_datatype;
It will give you output something like this:
CREATE EXTERNAL TABLE `tbl_datatype`(
`id` int,
`code` int,
`city` string,
`state` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://example-bucket/location/a/';
Hope it helps!
This is because you use the CSV SerDe and not, for example, the text SerDe.
The CSV SerDe supports only the string data type, so all columns are of this type.
From https://docs.aws.amazon.com/athena/latest/ug/csv.html
The OpenCSV SerDe [...] Converts all column type values to STRING.
The documentation outlines some conditions under which the table schema could be different than all strings ("For example, it parses the values into BOOLEAN, BIGINT, INT, and DOUBLE data types when it can discern them"), but apparently this was not effective in your case.
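One workaround, offered here as a hedged sketch rather than something from the original answers: keep the OpenCSV table as all strings and cast in the query (or in a view) where numeric types are needed:
-- the casts assume clean integer values; use TRY_CAST if some rows may not parse
SELECT CAST(albumid AS BIGINT) AS albumid,
       title,
       CAST(artistid AS BIGINT) AS artistid
FROM t_mus_albums;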

Spark not able to read hive table because of _1 and _2 sub folders in S3

I have the 3 Hive tables below, all with the same structure.
drop table default.test1;
CREATE EXTERNAL TABLE `default.test1`(
`c1` string,
`c2` string,
`c3` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://s3_bucket/dev/dev/testspark/test1/';
drop table default.test2;
CREATE EXTERNAL TABLE `default.test2`(
`c1` string,
`c2` string,
`c3` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://s3_bucket/dev/dev/testspark/test2/';
drop table default.test3;
CREATE EXTERNAL TABLE `default.test3`(
`c1` string,
`c2` string,
`c3` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://s3_bucket/dev/dev/testspark/test3/';
hive>insert into default.test1 values("a","b","c");
hive>insert into default.test2 values("d","e","f");
hive>insert overwrite table default.test3 select * from default.test1 UNION ALL select * from default.test2;
After I loaded data using the UNION ALL of test1 and test2, the test3 table's S3 path has the data in subfolders like below:
PRE 1/
PRE 2/
When I query the test3 table from Hive, it returns the data that was inserted.
But when I query the same table in Spark, I get a zero count.
pyspark shell:
>>>sqlContext.sql("select * from default.test3").count()
>>>0
How can I fix this issue?
There is one more property that needs to be set, along with the recursive input settings from the other answer, to make this work:
spark.conf.set("mapred.input.dir.recursive","true")
spark.conf.set("mapreduce.input.fileinputformat.input.dir.recursive","true")
spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false")
Try setting the properties below before running sqlContext.sql:
sqlContext.setConf("mapred.input.dir.recursive","true");
sqlContext.setConf("mapreduce.input.fileinputformat.input.dir.recursive","true");
