How to add a timestamp column to an existing table in Athena?

According to the Athena docs, I cannot add a date column to an existing table, so I am trying to use the workaround they propose with the timestamp data type.
But when I run the ALTER TABLE my_table ADD COLUMNS (date_column TIMESTAMP) query, I still get the following error:
Parquet does not support date. See HIVE-6384
Is there any option to add date or timestamp columns to an existing table?
Thanks
UPD: I found out that I can still add timestamp columns with the Glue console UI/API.
UPD 2: The issue occurs only with one specific table; it works for other tables.

You can use the following query to add a timestamp column to an existing table:
ALTER TABLE my_table ADD COLUMNS (date_column TIMESTAMP);
This should work for both Parquet and ORC tables. If one specific table still fails with the Parquet does not support date error, adding the column through the Glue console or API (as you found in your update) modifies the table schema directly and avoids the Athena DDL path that raises the error.
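If you need date semantics at query time, one option is to cast the timestamp column when reading it. A minimal sketch, reusing the my_table and date_column names from the question:
-- Presto/Athena: cast the timestamp column to a date at read time
SELECT CAST(date_column AS DATE) AS event_date
FROM my_table
LIMIT 10;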

Related

Approach to handling [update date] columns in ADF pipelines

New to ADF and trying to build a solution to handle the below simple scenario:
Source and sink are both Azure SQL DB
New rows, based on a unique id, are inserted with a null [update date] column
Any changes to existing rows have the [update date] column updated to the current date
Would the Copy data activity be enough to cater for the above?
Thanks
You can use Upsert in the Copy activity. Upsert means update the row if it exists and insert it if it does not, based on a key column.
As a repro for reference: my source and target tables are both in SQL. In the sink settings, select Upsert, then give your unique id column under Key columns; the Upsert action will be done based on this column.
In the target table after execution, you can see that the existing rows are updated and the new rows are inserted.
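Conceptually, the Upsert option behaves like a SQL MERGE on the key column. A rough T-SQL sketch of the scenario in the question (table and column names here are hypothetical), where changed rows get the current date and new rows keep a null [update date]:
-- hypothetical source/target tables keyed on UniqueId
MERGE dbo.TargetTable AS t
USING dbo.SourceTable AS s
    ON t.UniqueId = s.UniqueId
WHEN MATCHED THEN
    UPDATE SET t.SomeValue = s.SomeValue,
               t.[update date] = GETDATE()   -- existing row: stamp current date
WHEN NOT MATCHED THEN
    INSERT (UniqueId, SomeValue, [update date])
    VALUES (s.UniqueId, s.SomeValue, NULL);  -- new row: null [update date]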

How to add new column in to partition by clause in Hive External Table

I have an external Hive table which is filled by a Spark job and partitioned by (event_date date). I have now modified the Spark code and added one extra column, 'country'. In earlier written data the country column will have null values, as it is newly added. Now I want to alter the PARTITIONED BY clause to partitioned by (event_date date, country string). How can I achieve this? Thank you!!
Please try to alter the partition using the command below:
ALTER TABLE table_name PARTITION part_spec SET LOCATION path
part_spec:
: (part_col_name1=val1, part_col_name2=val2, ...)
See the Databricks Spark SQL language manual for the ALTER TABLE command.
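For example, a hypothetical filled-in invocation (table name, partition values, and path are illustrative only), assuming the table is already defined with both partition columns:
-- point one (event_date, country) partition at its storage path
ALTER TABLE my_events PARTITION (event_date='2021-01-01', country='US')
SET LOCATION 'hdfs:///data/my_events/event_date=2021-01-01/country=US';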

Automatically Updating a Hive View Daily

I have a requirement to meet. I need to Sqoop data over from a DB to Hive, and I am sqooping on a daily basis since this data is updated daily.
This data will be used as lookup data from a Spark consumer for enrichment. We want to keep a history of all the data we have received, but we don't need all of it for lookup, only the latest data (same day). I was thinking of creating a Hive view over the historical table that only shows records inserted that day. Is there a way to automate the view on a daily basis so that the view query will always have the latest data?
Q: Is there a way to automate the view on a daily basis so that the view query will always have the latest data?
There is no need to update or automate the process if you use a table partitioned by date.
Q: We want to keep a history of all the data we have received but we don't need all the data for lookup, only the latest data (same day).
NOTE: Whether you use a Hive view or a Hive table, you should always avoid scanning the full table data (a full table scan) to get the latest partition's data.
Option 1: Hive approach to query the data
If you want to adopt the Hive approach, you have to use a partition column (for example partition_date) and a partitioned table in Hive:
select * from db.yourpartitionedTable
where partition_date in (select max(partition_date) from db.yourpartitionedTable)
or
select * from (
  select *, dense_rank() over (order by partition_date desc) as dt_rnk
  from db.yourpartitionedTable
) myview
where myview.dt_rnk = 1
Either query will always give the latest partition and its data (if today's date is present in the partition data, it returns today's partition; otherwise it returns the data for the max partition_date).
Option 2: Plain Spark approach to query the data
With the Spark SHOW PARTITIONS command, i.e. spark.sql(s"show Partitions $yourpartitionedtablename"), get the result into an array and sort it to find the latest partition date. Using that, you can query only the latest partition's data as the lookup data in your Spark component.
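As a sketch, SHOW PARTITIONS returns one row per partition, which you can collect and sort to pick the latest (the table name is the placeholder used above):
-- run via spark.sql("SHOW PARTITIONS db.yourpartitionedTable")
SHOW PARTITIONS db.yourpartitionedTable;
-- sample output, one row per partition; take the max value:
-- partition_date=2019-08-04
-- partition_date=2019-08-05
-- partition_date=2019-08-06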
See my answer for an idea of how to get the latest partition date.
I prefer option 2 since no Hive query and no full-table scan are needed: the SHOW PARTITIONS command reads only partition metadata, so there are no performance bottlenecks and it is fast.
One more idea is querying with the HiveMetastoreClient, or combining it with option 2; see my linked answers for the details.
I am assuming that you are loading daily transaction records into your history table with some last-modified date. Every time you insert or update a record in your history table, its last_modified_date column is updated; it could be a date or a timestamp.
You can create a view in Hive to fetch the latest data using an analytic function.
Here's some sample data:
CREATE TABLE IF NOT EXISTS db.test_data
(
user_id int
,country string
,last_modified_date date
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS orc
;
I am inserting a few sample records. You can see that the same id has multiple records for different dates.
INSERT INTO TABLE db.test_data VALUES
(1,'India','2019-08-06'),
(2,'Ukraine','2019-08-06'),
(1,'India','2019-08-05'),
(2,'Ukraine','2019-08-05'),
(1,'India','2019-08-04'),
(2,'Ukraine','2019-08-04');
Creating a view in Hive:
CREATE VIEW db.test_view AS
select user_id, country, last_modified_date
from (
  select user_id, country, last_modified_date,
         max(last_modified_date) over (partition by user_id) as max_modified
  from db.test_data
) as sub
where last_modified_date = max_modified;
hive> select * from db.test_view;
1 India 2019-08-06
2 Ukraine 2019-08-06
Time taken: 5.297 seconds, Fetched: 2 row(s)
It shows the result with the max date only.
If you further insert another record with a newer last modified date:
hive> INSERT INTO TABLE db.test_data VALUES
> (1,'India','2019-08-07');
hive> select * from db.test_view;
1 India 2019-08-07
2 Ukraine 2019-08-06
For reference: Hive View manual

How to alter cassandra table columns

I need to add additional columns to a table in Cassandra, but the existing table is not empty. Is there any way to update it in a simple way? Otherwise, what is the best approach to add additional columns to a non-empty table? Thanks in advance.
There's a good example of adding columns to an existing table in the CQL documentation on ALTER. The following statement will add the column gravesite (with type varchar) to the table addamsFamily:
ALTER TABLE addamsFamily ADD gravesite varchar;
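Note that existing rows will simply read back null for the newly added column until a value is written, for example:
-- existing rows return null for the new column until it is written
SELECT gravesite FROM addamsFamily LIMIT 5;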

SubSonic timestamp issue on insert record

My problem is about the timestamp column in one of my database tables. This field is required because SQL Server automatically generates it. Since this column is required, SubSonic included it in the insert query.
I'm using SubSonic 3.0.
info.Save();
error message:
Cannot insert an explicit value into a timestamp column. Use INSERT with a column list to exclude the timestamp column, or insert a DEFAULT into the timestamp column.
Have a look at the following answer for a workaround for this:
Subsonic: How to exclude a table column so that its not included in the generated SQL query
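The error message itself points at the SQL-level fix: insert with an explicit column list that leaves the timestamp column out, so SQL Server can generate its value. A minimal sketch with hypothetical table and column names:
-- the rowversion/timestamp column is omitted from the column list,
-- so SQL Server generates it automatically
INSERT INTO dbo.Info (Name, CreatedOn)
VALUES ('example', GETDATE());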
