How to add a new column to the PARTITIONED BY clause in a Hive external table - apache-spark

I have an external Hive table which is filled by a Spark job and partitioned by (event_date date). I have now modified the Spark code and added one extra column, 'country'. In the previously written data the country column will have null values as it is newly added. Now I want to alter the 'partitioned by' clause to partitioned by (event_date date, country string). How can I achieve this? Thank you!

Please try to alter the partition using the command below:
ALTER TABLE table_name PARTITION part_spec SET LOCATION path
part_spec:
: (part_col_name1=val1, part_col_name2=val2, ...)
See the Databricks Spark SQL language manual for the ALTER TABLE command.
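For example (a hypothetical table name and path, purely to illustrate the syntax above):
ALTER TABLE my_events PARTITION (event_date='2021-01-01') SET LOCATION 's3://my-bucket/events/event_date=2021-01-01';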

Related

create table in hive

I am trying to create a Hive table with this syntax:
create table table_name as orc as select * from table1 partitioned by (Acc_date date).
I am getting an error. My requirement is to create the table using a select statement and append to the table when the next load happens.
I am trying to replicate this Spark command:
df1.distinct().repartition("acc_date").write.mode("append").partitionBy("acc_date").format("parquet").saveAsTable("schema.table_name")
Make it a two-step process:
1. Create the partitioned table as you want.
2. Insert data into it.
Details
1. The SQL may be like this:
create table if not exists table_name
(col1 int, col2 ...)
partitioned by (acc_date date)
stored as orc;
2. The insert will be like below. Make sure the partition column is the last column in the select clause.
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert into table_name partition (acc_date)
select col1, col2, ..., acc_date from table1;

How to add a timestamp column to an existing table in Athena?

According to the Athena docs, I cannot add a date column to an existing table, so I am trying to use the workaround they propose with the timestamp datatype.
But when I run the ALTER TABLE my_table ADD COLUMNS (date_column TIMESTAMP) query, I still get the following error:
Parquet does not support date. See HIVE-6384
Is there any option to add date or timestamp columns to an existing table?
Thanks
UPDATE: I found out that I can still add timestamp columns via the Glue UI/API.
UPDATE 2: The issue occurs only with one specific table; it works for others.
You can use the following query to add a timestamp column to an existing table:
ALTER TABLE my_table ADD COLUMNS (date_column TIMESTAMP);
This should work for both Parquet and ORC tables.

Databricks - How to change a partition of an existing Delta table?

I have a table in Databricks delta which is partitioned by transaction_date. I want to change the partition column to view_date. I tried to drop the table and then create it with a new partition column using PARTITIONED BY (view_date).
However, my attempt failed since the actual files reside in S3 and even if I drop a Hive table the partitions remain the same.
Is there any way to change the partition of an existing Delta table? Or is the only solution to drop the actual data and reload it with the newly indicated partition column?
There's actually no need to drop tables or remove files. All you need to do is read the current table, overwrite the contents AND the schema, and change the partition column:
val input = spark.read.table("mytable")
input.write.format("delta")
.mode("overwrite")
.option("overwriteSchema", "true")
.partitionBy("colB") // different column
.saveAsTable("mytable")
UPDATE: There previously was a bug with time travel and changes in partitioning that has now been fixed.
As Silvio pointed out, there is no need to drop the table. In fact, the approach strongly recommended by Databricks is to replace the table.
https://docs.databricks.com/sql/language-manual/sql-ref-syntax-ddl-create-table-using.html#parameters
In Spark SQL, this can be done easily with:
REPLACE TABLE <tablename>
USING DELTA
PARTITIONED BY (view_date)
AS
SELECT * FROM <tablename>
Modified example from:
https://docs.databricks.com/delta/best-practices.html#replace-the-content-or-schema-of-a-table
Python solution:
If you need more than one column in the partition, use partitionBy(column, column_2, ...).
def change_partition_of(table_name, column):
    df = spark.read.table(table_name)
    df.write.format("delta").mode("overwrite").option("overwriteSchema", "true").partitionBy(column).saveAsTable(table_name)

change_partition_of("i.love_python", "column_a")

Invalid Column name Error in DSE Analytics Spark

I have one table whose structure is roughly as follows:
CREATE TABLE keyspace_name.table_name (
id text PRIMARY KEY,
type text,
bool_yn boolean,
created_ts timestamp,
modified_ts timestamp
)
Recently I added a new column to the table:
alter table keyspace_name.table_name add first_name text;
When I query the given column from the table in cqlsh, it gives me the result. For example:
select first_name from keyspace_name.table_name limit 10;
But if I try to perform the same query in dse spark-sql, it gives me the following error:
Error in query: cannot resolve 'first_name' given input columns: [id, type, bool_yn, created_ts, modified_ts];
I don't know what's wrong in spark-sql. I've tried nodetool repair but the problem still persists.
Any help would be appreciated. Thanks
If the table schema changes, the Spark metastore doesn't automatically refresh the schema, so manually remove the old table definition from spark-sql with a DROP TABLE command, then run SHOW TABLES. The new table with the latest schema will be created automatically. This will not change the data in Cassandra.
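For example (a minimal sketch in dse spark-sql, reusing the table from the question; the DROP only removes the stale metastore definition, not the Cassandra data):
DROP TABLE keyspace_name.table_name;
SHOW TABLES;
SELECT first_name FROM keyspace_name.table_name LIMIT 10;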

Brisk cassandra TimeUUIDType

I used Brisk. The Cassandra column family automatically maps to Hive tables.
However, if the data type is timeuuid in the column family, it is unreadable in Hive tables.
For example, I used the following command to create an external table in Hive to map the column family:
Hive > create external table A (rowkey string, column_name string, value string)
> STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
> WITH SERDEPROPERTIES (
> "cassandra.columns.mapping" = ":key,:column,:value");
If a column name is a TimeUUIDType in Cassandra, it becomes unreadable in the Hive table.
For example, a row in cassandra column family looks like:
RowKey: 2d36a254bb04272b120aaf79d70a3578
=> (column=29139210-b6dc-11df-8c64-f315e3a329d6, value={"event_id":101},timestamp=1283464254261)
where the column name is a TimeUUIDType.
In the Hive table, it looks like the following row:
2d36a254bb04272b120aaf79d70a3578 t��ߒ4��!�� {"event_id":101}
So the column name is unreadable in the Hive table.
This is a known issue with the automatic table mapping. For best results with a timeUUIDType, turn the auto-mapping feature off in $brisk_home/resources/hive/hive-site.xml:
"cassandra.autoCreateHiveSchema"
and create the table in hive manually.
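For example (a hedged sketch of the hive-site.xml entry; verify the property block and default value in your Brisk version):
<property>
  <name>cassandra.autoCreateHiveSchema</name>
  <value>false</value>
</property>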
