How can I set a LOCATION value with a custom DataSource? - apache-spark

I am building a custom Spark data source based on the ParquetDataSourceV2 class in Spark v3.2.2. I want to leave the LOCATION field empty in the DDL and then set the table and partition paths in code. How can I do that?
This is my DDL:
CREATE EXTERNAL TABLE `DEFAULT`.`USER_PARQUET_READ` (
`id` BIGINT COMMENT '',
`name` STRING COMMENT '')
USING com.kyligence.spark.datasources.DefaultSource
PARTITIONED BY (`asOfDate` DATE COMMENT 'PARTITIONED KEY')
LOCATION ' '  -- this value is an empty string; I need to set a real value from Scala code. How can I do that?
TBLPROPERTIES (
'transient_lastDdlTime' = '1641864235',
'skip.header.line.count' = '1')
I would appreciate some example Scala code. Thank you so much!
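One possible approach, sketched below in Scala: since the table is registered in the catalog, you can point it (and individual partitions) at real paths with ALTER TABLE after creation, or bypass the catalog location and pass the path as a read option. This is a minimal sketch, not the definitive way to wire a custom DataSourceV2; the s3a paths are placeholders, and whether the "path" option is honored depends on how the custom DefaultSource resolves its paths (sources derived from ParquetDataSourceV2 typically take them from the reader options).

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("set-location-from-code")
  .enableHiveSupport()
  .getOrCreate()

// Placeholder root path; substitute the real table location at runtime.
val tableRoot = "s3a://my-bucket/warehouse/user_parquet_read"

// Re-point the table that was created with an empty LOCATION.
spark.sql(s"ALTER TABLE default.user_parquet_read SET LOCATION '$tableRoot'")

// Register a partition at an explicit path.
spark.sql(
  s"""ALTER TABLE default.user_parquet_read
     |ADD IF NOT EXISTS PARTITION (asOfDate = '2022-01-10')
     |LOCATION '$tableRoot/asOfDate=2022-01-10'""".stripMargin)

// Alternatively, skip the catalog location entirely and pass the path
// directly to the custom source as a read option (assumption: the custom
// DefaultSource, like the built-in Parquet source, reads its paths from
// the reader options).
val df = spark.read
  .format("com.kyligence.spark.datasources.DefaultSource")
  .option("path", tableRoot)
  .load()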

Related

Hive, how to partition by a column with null values, putting all nulls in one partition

I am using Hive, and the IDE is Hue. I am trying different key combinations to choose for my partition key(s).
The definition of my original table is as follows:
CREATE External Table `my_hive_db`.`my_table`(
`col_id` bigint,
`result_section__col2` string,
`result_section_col3` string ,
`result_section_col4` string,
`result_section_col5` string,
`result_section_col6__label` string,
`result_section_col7__label_id` bigint ,
`result_section_text` string ,
`result_section_unit` string,
`result_section_col` string ,
`result_section_title` string,
`result_section_title_id` bigint,
`col13` string,
`timestamp` bigint,
`date_day` string
)
PARTITIONED BY (
`date_year` string,
`date_month` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3a://some/where/in/amazon/s3';
The above code works properly. But when I create a new table with date_day as the partition key, the table is empty and I need to run MSCK REPAIR TABLE. However, I get the following error:
Error while compiling statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.ddl.DDLTask
When the partition keys were date_year and date_month, MSCK worked properly.
Table definition of the table I am getting the error for is as follows:
CREATE External Table `my_hive_db`.`my_table`(
`col_id` bigint,
`result_section__col2` string,
`result_section_col3` string ,
`result_section_col4` string,
`result_section_col5` string,
`result_section_col6__label` string,
`result_section_col7__label_id` bigint ,
`result_section_text` string ,
`result_section_unit` string,
`result_section_col` string ,
`result_section_title` string,
`result_section_title_id` bigint,
`col13` string,
`timestamp` bigint,
`date_year` string,
`date_month` string
)
PARTITIONED BY (
`date_day` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3a://some/where/in/amazon/s3';
After this the following query is empty:
Select * From `my_hive_db`.`my_table` Limit 10;
I therefore ran the following command:
MSCK REPAIR TABLE `my_hive_db`.`my_table`;
And I get the error: Error while compiling statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.ddl.DDLTask
I checked this link, as it is exactly the error I am getting, but when I use the solution provided there:
set hive.msck.path.validation=ignore;
MSCK REPAIR TABLE table_name;
I get a different error:
Error while processing statement: Cannot modify hive.msck.path.validation at runtime. It is not in list of params that are allowed to be modified at runtime.
I think the reason I am getting these errors is that there are more than 200 million records with a null date_day value.
There are 31 distinct non-null date_day values. I would like to partition my table into 32 partitions, one for each distinct value of the date_day field, with all the null values going into a separate partition. Is there a way to do so (partitioning by a column with null values)?
If this can be achieved with Spark, I am also open to using it.
This is part of a bigger problem of changing partition keys by recreating a table, as mentioned in this link in the answer to my other question.
Thank you for your help.
You seem to misunderstand how Hive's partitioning works.
Hive stores data as files on HDFS (or S3, or some other distributed storage).
If you create a non-partitioned Parquet table called my_schema.my_table, you will see files in your distributed storage laid out like
hive/warehouse/my_schema.db/my_table/part_00001.parquet
hive/warehouse/my_schema.db/my_table/part_00002.parquet
...
If you create a table partitioned by a column p_col, the files will look like
hive/warehouse/my_schema.db/my_table/p_col=value1/part_00001.parquet
hive/warehouse/my_schema.db/my_table/p_col=value1/part_00002.parquet
...
hive/warehouse/my_schema.db/my_table/p_col=value2/part_00001.parquet
hive/warehouse/my_schema.db/my_table/p_col=value2/part_00002.parquet
...
The command MSCK REPAIR TABLE lets you automatically reload the partitions when you create an external table.
Let's say you have folders on s3 that look like this:
hive/warehouse/my_schema.db/my_table/p_col=value1/part_00001.parquet
hive/warehouse/my_schema.db/my_table/p_col=value2/part_00001.parquet
hive/warehouse/my_schema.db/my_table/p_col=value3/part_00001.parquet
You create an external table with
CREATE External Table my_schema.my_table(
... some columns ...
)
PARTITIONED BY (p_col STRING)
the table will be created but empty, because Hive hasn't detected the partitions yet. You run MSCK REPAIR TABLE my_schema.my_table, and Hive will recognize that your partition p_col matches the partitioning scheme on s3 (/p_col=value1/).
From what I could understand from your other question, you are trying to change the partitioning scheme of the table by doing
CREATE External Table my_schema.my_table(
... some columns ...
)
PARTITIONED BY (p_another_col STRING)
and you are getting an error message because p_another_col doesn't match the partition column used on S3, which was p_col.
This error is perfectly normal, since what you are doing doesn't make sense.
As stated in the other question's answer, you need to create a copy of the first table, with a different partitioning scheme.
You should instead try something like this:
CREATE External Table my_hive_db.my_table_2(
`col_id` bigint,
`result_section__col2` string,
`result_section_col3` string ,
`result_section_col4` string,
`result_section_col5` string,
`result_section_col6__label` string,
`result_section_col7__label_id` bigint ,
`result_section_text` string ,
`result_section_unit` string,
`result_section_col` string ,
`result_section_title` string,
`result_section_title_id` bigint,
`col13` string,
`timestamp` bigint,
`date_year` string,
`date_month` string
)
PARTITIONED BY (`date_day` string)
and then populate the new table with dynamic partitioning (a Spark version of this load is sketched after the query):
INSERT OVERWRITE TABLE my_hive_db.my_table_2 PARTITION(date_day)
SELECT
col_id,
result_section__col2,
result_section_col3,
result_section_col4,
result_section_col5,
result_section_col6__label,
result_section_col7__label_id,
result_section_text,
result_section_unit,
result_section_col,
result_section_title,
result_section_title_id,
col13,
timestamp,
date_year,
date_month,
date_day
FROM my_hive_db.my_table_1
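A hedged sketch of the same load driven from Spark (the question mentions being open to Spark). Hive-style dynamic partitioning normally requires nonstrict mode, and rows whose date_day is NULL land in Hive's default partition (__HIVE_DEFAULT_PARTITION__), which gives you the "all nulls in one partition" behaviour asked about. Table and column names are taken from the answer above; the session settings are the usual Hive ones.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("repartition-by-date-day")
  .enableHiveSupport()
  .getOrCreate()

// Dynamic partitioning on the last column of the SELECT needs nonstrict mode.
spark.sql("SET hive.exec.dynamic.partition = true")
spark.sql("SET hive.exec.dynamic.partition.mode = nonstrict")

spark.sql("""
  INSERT OVERWRITE TABLE my_hive_db.my_table_2 PARTITION (date_day)
  SELECT
    col_id, result_section__col2, result_section_col3, result_section_col4,
    result_section_col5, result_section_col6__label, result_section_col7__label_id,
    result_section_text, result_section_unit, result_section_col,
    result_section_title, result_section_title_id, col13, `timestamp`,
    date_year, date_month,
    date_day
  FROM my_hive_db.my_table_1
""")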

Table in PySpark shows header from CSV file

I have a CSV file with the contents below, which has a header in the first line.
id,name
1234,Rodney
8984,catherine
I was able to create a table in Hive that skips the header and reads the data appropriately.
Table in Hive
CREATE EXTERNAL TABLE table_id(
`tmp_id` string,
`tmp_name` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'field.delim'=',',
'serialization.format'=',')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://some-testing/test/data/'
tblproperties ("skip.header.line.count"="1");
Results in Hive
select * from table_id;
OK
1234 Rodney
8984 catherine
Time taken: 1.219 seconds, Fetched: 2 row(s)
But when I query the same table in PySpark (running the same query), the header row from the file also shows up in the results, as below.
>>> spark.sql("select * from table_id").show(10,False)
+------+---------+
|tmp_id|tmp_name |
+------+---------+
|id |name |
|1234 |Rodney |
|8984 |catherine|
+------+---------+
Now, how can I keep the header row from showing up in the PySpark results?
I'm aware that we can read the CSV file and add .option("header", True) to achieve this, but I want to know if there's a way to do something similar in PySpark while querying tables.
Can someone suggest a way? Thanks 🙏 in advance!
You can use the two kinds of properties below, SerDe properties and table properties; with them you will be able to access the table from both Hive and Spark while skipping the header in both environments.
CREATE EXTERNAL TABLE `student_test_score_1`(
student string,
age string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'delimiter'=',',
'field.delim'=',',
'header'='true',
'skip.header.line.count'='1',
'path'='hdfs:<path>')
LOCATION
'hdfs:<path>'
TBLPROPERTIES (
'spark.sql.sources.provider'='CSV')
This is a known issue, SPARK-11374, and it was closed as Won't Fix.
In your query you can add a WHERE clause that selects all records except the header values 'id' and 'name'.
spark.sql("select * from table_id where tmp_id <> 'id' and tmp_name <> 'name'").show(10,False)
#or
spark.sql("select * from table_id where tmp_id != 'id' and tmp_name != 'name'").show(10,False)
Another way would be to read the files from HDFS directly with .option("header","true").
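For completeness, a minimal sketch of that file-level approach, written here in Scala (PySpark exposes the same reader options); the path is the one from the Hive DDL in the question:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("read-csv-skip-header")
  .getOrCreate()

val df = spark.read
  .option("header", "true")       // treat the first line of each file as column names
  .option("inferSchema", "true")  // optional: infer column types instead of all strings
  .csv("s3://some-testing/test/data/")

df.show(10, truncate = false)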

Create table in Athena using all objects from multiple folders in S3 Bucket via Boto3

My S3 Bucket has multiple sub-directories that store data for multiple websites based on the day.
example:
bucket/2020-01-03/website1, and this is where the CSVs are stored.
I am able to create tables based on each of the objects but I want to create one consolidated table for all sub-directories/objects/data stored within the prefix bucket/2020-01-03 for all websites as well as all other dates.
I used the code below to create one table for a single sub-directory (website1):
# Athena configuration
import boto3

athena = boto3.client('athena',
                      aws_access_key_id=ACCESS_KEY,
                      aws_secret_access_key=SECRET_KEY,
                      region_name='us-west-2')

s3_input = 's3://bucket/2020-01-03/website1'
database = 'database1'
table = 'consolidated_table'

# Athena database and table definition
create_table = \
"""CREATE EXTERNAL TABLE IF NOT EXISTS `%s.%s` (
`website_id` string COMMENT 'from deserializer',
`user` string COMMENT 'from deserializer',
`action` string COMMENT 'from deserializer',
`date` string COMMENT 'from deserializer'
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'escapeChar'='\\"', 'separatorChar'=','
) LOCATION '%s'
TBLPROPERTIES (
'skip.header.line.count'='1',
'transient_lastDdlTime'='1576774420');""" % ( database, table, s3_input )
athena.start_query_execution(QueryString=create_table,
WorkGroup = 'user_group',
QueryExecutionContext={'Database': 'database1'},
ResultConfiguration={'OutputLocation': 's3://aws-athena-query-results-5000-us-west-2'})
I also want to overwrite this table with new data from S3 every time I run it.
You can have a consolidated table for files from different "directories" on S3 only if all of them adhere to the same data schema. As I can see from your CREATE EXTERNAL TABLE, each file contains four columns: website_id, user, action and date. So you can simply change LOCATION to point to the root of your S3 "directory structure":
CREATE EXTERNAL TABLE IF NOT EXISTS `database1`.`consolidated_table` (
`website_id` string COMMENT 'from deserializer',
`user` string COMMENT 'from deserializer',
`action` string COMMENT 'from deserializer',
`date` string COMMENT 'from deserializer'
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'escapeChar'='\\"', 'separatorChar'=','
)
LOCATION 's3://bucket' -- instead of restricting it to s3://bucket/2020-01-03/website1
TBLPROPERTIES (
'skip.header.line.count'='1'
);
In this case, each Athena query would scan all files under the s3://bucket location, and you can use website_id and date in the WHERE clause to filter results. However, if you have a lot of data you should consider partitioning. It will save you not only query execution time but also money (see this post).
I also want to overwrite this table with new data from S3 every time I run it.
I assume you mean that every time you run an Athena query, it should scan the files on S3 even if they were added after you executed CREATE EXTERNAL TABLE. Note that CREATE EXTERNAL TABLE simply defines metadata about your data, i.e. where it is located on S3, its columns, etc. Thus, a query against a table with LOCATION 's3://bucket' (without partitioning) will always include all your S3 files.

Why does AWS Athena return the "string" datatype for all of a table's fields with "show create table" or describe table?

Why does AWS Athena return the "string" datatype for all of a table's fields with the "show create table" command or when describing tables?
For example, for the table t_mus_albums:
albumid (bigint)
title (string)
artistid (bigint)
When running
show create table t_mus_albums;
I get:
CREATE EXTERNAL TABLE `t_mus_albums`(
`albumid` string COMMENT 'from deserializer',
`title` string COMMENT 'from deserializer',
`artistid` string COMMENT 'from deserializer')
I think you might be doing something wrong, or, when generating the table automatically, you may not have correctly formatted data.
Here are the systematic steps to solve your problem.
Assume that your data is in the format below:
ID,Code,City,State
41,5,"Youngstown", OH
42,52,"Yankton", SD
46,35,"Yakima", WA
42,16,"Worcester", MA
43,37,"Wisconsin Dells", WI
36,5,"Winston-Salem", NC
Then your CREATE TABLE will look something like this:
CREATE EXTERNAL TABLE IF NOT EXISTS example.tbl_datatype (
`id` int,
`code` int,
`city` string,
`state` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = ',',
'field.delim' = ','
) LOCATION 's3://example-bucket/location/a/'
TBLPROPERTIES ('has_encrypted_data'='false');
Then, run this query to show the table definition:
SHOW CREATE TABLE tbl_datatype;
It will give you output something like this:
CREATE EXTERNAL TABLE `tbl_datatype`(
`id` int,
`code` int,
`city` string,
`state` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://example-bucket/location/a/';
Hope it helps!
This is because you use the CSV SerDe and not, e.g., the text SerDe.
The CSV SerDe supports only the string data type, so all columns are of that type.
From https://docs.aws.amazon.com/athena/latest/ug/csv.html
The OpenCSV SerDe [...] Converts all column type values to STRING.
The documentation outlines some conditions under which the table schema could be different from all strings ("For example, it parses the values into BOOLEAN, BIGINT, INT, and DOUBLE data types when it can discern them"), but apparently this was not effective in your case.

Turning a Comma Separated string into individual rows in Teradata

I read the post:
Turning a Comma Separated string into individual rows
And really like the solution:
SELECT A.OtherID,
Split.a.value('.', 'VARCHAR(100)') AS Data
FROM
( SELECT OtherID,
CAST ('<M>' + REPLACE(Data, ',', '</M><M>') + '</M>' AS XML) AS Data
FROM Table1
) AS A CROSS APPLY Data.nodes ('/M') AS Split(a);
But it did not work when I tried to apply the method in Teradata for a similar question. Here is the summarized error code:
select failed 3707: expected something between '.' and the 'value' keyword.
So is the code only valid in SQL Server? Would anyone help me make it work in Teradata or SAS SQL? Your help will be really appreciated!
This is SQL Server syntax.
In Teradata there's a table UDF named STRTOK_SPLIT_TO_TABLE,
e.g.
SELECT * FROM dbc.DatabasesV AS db
JOIN
(
SELECT token AS DatabaseName, tokennum
FROM TABLE (STRTOK_SPLIT_TO_TABLE(1, 'dbc,systemfe', ',')
RETURNS (outkey INTEGER,
tokennum INTEGER,
token VARCHAR(128) CHARACTER SET UNICODE)
) AS d
) AS dt
ON db.DatabaseName = dt.DatabaseName
ORDER BY tokennum;
Or see my answer to this similar question
