I am selecting an entire table and loading it into a new table. The load works, but the values are appended instead of overwritten.
Spark version 1.6
Below is the code snippet
DataFrame df = hiveContext.createDataFrame(JavaRDD<Row>, StructType);
df.registerTempTable("tempregtable");
String query="insert into employee select * from tempregtable";
hiveContext.sql(query);
I am dropping and recreating the table (employee) and then executing the above code, but the old rows still get appended alongside the new ones. For example, if I insert four rows, drop the table, and insert four rows again, 8 rows in total end up in the table. Kindly help me: how do I overwrite the data instead of appending?
Regards
Prakash
Try:
String query="insert overwrite table employee select * from tempregtable";
INSERT OVERWRITE will overwrite any existing data in the table or partition
INSERT INTO will append to the table or partition
Reference: Hive Language Manual
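For reference, here is the asker's flow with the fix applied, sketched in PySpark (the original is Java on Spark 1.6, hence registerTempTable; row_rdd and schema stand in for the question's JavaRDD<Row> and StructType):
# A minimal sketch, assuming row_rdd and schema are built as in the question.
df = hiveContext.createDataFrame(row_rdd, schema)
df.registerTempTable("tempregtable")
# OVERWRITE replaces the table's existing rows; INTO would append to them.
hiveContext.sql("insert overwrite table employee select * from tempregtable")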
According to the Athena docs, I cannot add a date column to an existing table, so I am trying to use the workaround they propose with the timestamp datatype.
But when I run the ALTER TABLE my_table ADD COLUMNS (date_column TIMESTAMP) query, I still get the following error:
Parquet does not support date. See HIVE-6384
Is there any option to add date or timestamp columns to an existing table?
Thanks
UPD: I found out that I can still add timestamp columns through the Glue UI/API.
UPD 2: The issue occurs only with one specific table; it works for other tables.
You can use the following query to add a timestamp column to an existing table:
ALTER TABLE my_table ADD COLUMNS (date_column TIMESTAMP);
This should work for both Parquet and ORC tables.
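Since the asker's update mentions the Glue API as a workaround, here is a hedged boto3 sketch of adding the column through Glue (the database, table, and column names are illustrative, and the whitelist of writable fields is an assumption about what update_table accepts):
import boto3

glue = boto3.client("glue")

# Fetch the current table definition and append the new column (illustrative names).
table = glue.get_table(DatabaseName="my_db", Name="my_table")["Table"]
table["StorageDescriptor"]["Columns"].append({"Name": "date_column", "Type": "timestamp"})

# update_table rejects the read-only fields that get_table returns,
# so pass through only the writable ones.
writable = ("Name", "Description", "Owner", "Retention", "StorageDescriptor",
            "PartitionKeys", "TableType", "Parameters")
glue.update_table(DatabaseName="my_db",
                  TableInput={k: v for k, v in table.items() if k in writable})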
I have an external Hive table which is filled by a Spark job and partitioned by (event_date date). I have now modified the Spark code and added one extra column, 'country'. In the earlier written data the country column will have null values, as it is newly added. Now I want to alter the PARTITIONED BY clause to PARTITIONED BY (event_date date, country string). How can I achieve this? Thank you!
Please try to alter the partition using the command below:
ALTER TABLE table_name PARTITION part_spec SET LOCATION path
part_spec:
: (part_col_name1=val1, part_col_name2=val2, ...)
See the Databricks Spark SQL language manual for the ALTER TABLE command.
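For illustration, here is that command applied to the asker's columns via spark.sql (the table name, partition values, and path are hypothetical):
# Hypothetical names; points one (event_date, country) partition at an explicit location.
spark.sql("""
    ALTER TABLE events PARTITION (event_date='2021-01-01', country='US')
    SET LOCATION 'hdfs:///warehouse/events/event_date=2021-01-01/country=US'
""")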
I have a table in Databricks Delta which is partitioned by transaction_date. I want to change the partition column to view_date. I tried to drop the table and then create it with a new partition column using PARTITIONED BY (view_date).
However, my attempt failed because the actual files reside in S3, and even if I drop a Hive table the partitions remain the same.
Is there any way to change the partitioning of an existing Delta table? Or is the only solution to drop the actual data and reload it with a newly specified partition column?
There's actually no need to drop tables or remove files. All you need to do is read the current table, overwrite the contents AND the schema, and change the partition column:
val input = spark.read.table("mytable")
input.write.format("delta")
.mode("overwrite")
.option("overwriteSchema", "true")
.partitionBy("colB") // different column
.saveAsTable("mytable")
UPDATE: There previously was a bug with time travel and changes in partitioning that has now been fixed.
As Silvio pointed out, there is no need to drop the table. In fact, the approach strongly recommended by Databricks is to replace the table.
https://docs.databricks.com/sql/language-manual/sql-ref-syntax-ddl-create-table-using.html#parameters
In Spark SQL, this can be done easily with:
REPLACE TABLE <tablename>
USING DELTA
PARTITIONED BY (view_date)
AS
SELECT * FROM <tablename>
Adapted example from:
https://docs.databricks.com/delta/best-practices.html#replace-the-content-or-schema-of-a-table
Python solution:
If you need more than one column in the partition, use partitionBy(column, column_2, ...).
def change_partition_of(table_name, column):
    df = spark.read.table(table_name)
    df.write.format("delta").mode("overwrite").option("overwriteSchema", "true").partitionBy(column).saveAsTable(table_name)
change_partition_of("i.love_python", "column_a")
I have 30 columns in a table, table_old.
I want to use 29 columns in that table, excluding one. That column is dynamic.
I am using string interpolation.
Below is the Spark SQL query I am using:
val drop_column = "now_current_column"
var table_new = spark.sql(s"""alter table table_old drop $drop_column""")
but it's throwing the error:
mismatched input expecting 'partition'
I don't want to drop the column using a DataFrame. My requirement is to drop the column in a table using Spark SQL only.
As mentioned in the previous answer, DROP COLUMN is not supported by Spark yet.
But there is a workaround to achieve the same result without much overhead. This trick works for both EXTERNAL and InMemory tables. The code snippet below works for an EXTERNAL table; you can easily modify it and use it for InMemory tables as well.
import org.apache.spark.sql.types.StructType

val dropColumnName = "column_name"
val tableIdentifier = "table_name"
val tablePath = "table_path"

// Rebuild the schema without the dropped column, then recreate the table
// definition over the same files.
val newSchema = StructType(spark.read.table(tableIdentifier).schema.filter(col => col.name != dropColumnName))
spark.sql(s"drop table ${tableIdentifier}")
spark.catalog.createTable(tableIdentifier, "orc", newSchema, Map("path" -> tablePath))
orc is the file format; replace it with whatever format you need. For InMemory tables, remove the tablePath and you are good to go. Hope this helps.
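For anyone on PySpark, here is an equivalent sketch of the same workaround (the placeholder names mirror the snippet above):
from pyspark.sql.types import StructType

drop_column_name = "column_name"
table_identifier = "table_name"
table_path = "table_path"

# Rebuild the schema without the dropped column, then recreate the
# table definition over the same files.
new_schema = StructType([f for f in spark.read.table(table_identifier).schema.fields
                         if f.name != drop_column_name])
spark.sql(f"drop table {table_identifier}")
spark.catalog.createTable(table_identifier, source="orc", schema=new_schema, path=table_path)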
DROP COLUMN (and, in general, the majority of ALTER TABLE commands) is not supported in Spark SQL.
If you want to drop a column, you should create a new table:
CREATE TABLE tmp_table AS
SELECT ...  -- all columns except the one to drop
FROM table_old
and then drop the old table or view, and reclaim the name.
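Sketched end-to-end through spark.sql, including reclaiming the name (the column list is illustrative):
# Keep every column except the one being dropped, then swap the tables.
spark.sql("CREATE TABLE tmp_table AS SELECT col_a, col_b FROM table_old")
spark.sql("DROP TABLE table_old")
spark.sql("ALTER TABLE tmp_table RENAME TO table_old")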
Dropping columns is now supported by Spark if you're using v2 tables. You can check this link:
https://spark.apache.org/docs/latest/sql-ref-syntax-ddl-alter-table.html
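With a v2 table (for example Iceberg, or Delta with column mapping enabled), the asker's original statement works almost as written (a sketch, assuming such a table):
# Only works on v2 tables; plain Hive/Parquet tables still reject this.
drop_column = "now_current_column"
spark.sql(f"ALTER TABLE table_old DROP COLUMN {drop_column}")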
We are trying to write into a Hive table from Spark using the saveAsTable function. I want to know whether saveAsTable drops and recreates the Hive table every time or not. If it does, is there any other Spark function that will just truncate and load the table, instead of dropping and recreating it?
It depends on which .mode value you specify:
overwrite --> Spark drops the table first, then recreates it
append --> inserts new data into the table
1. Drop if exists / create if not exists the default.spark1 table in parquet format:
>>> df.write.mode("overwrite").saveAsTable("default.spark1")
2. Drop if exists / create if not exists the default.spark1 table in orc format:
>>> df.write.format("orc").mode("overwrite").saveAsTable("default.spark1")
3. Append the new data to the existing data in the table (doesn't drop/recreate the table):
>>> df.write.format("orc").mode("append").saveAsTable("default.spark1")
Achieve Truncate and Load using Spark:
Method 1:
You can register your DataFrame as a temp table, then execute an insert overwrite statement to overwrite the target table:
>>> df.registerTempTable("temp")  # register df as a temp table
>>> spark.sql("insert overwrite table default.spark1 select * from temp")  # overwrite the target table
This method works for both internal and external tables.
Method 2:
In the case of internal tables, we can truncate the table first and then append the data to it; this way we are not recreating the table, just appending new data to it.
>>> spark.sql("truncate table default.spark1")
>>> df.write.format("orc").mode("append").saveAsTable("default.spark1")
This method will only work for Internal tables.
Even in the case of external tables, we can do some workaround to truncate the table by changing the table properties.
Let's assume default.spark1 is an external table:
# change the external table to an internal table
>>> spark.sql("alter table default.spark1 set tblproperties('EXTERNAL'='FALSE')")
# once the table is internal, we can run a truncate table statement
>>> spark.sql("truncate table default.spark1")
# change the table back to an external table
>>> spark.sql("alter table default.spark1 set tblproperties('EXTERNAL'='TRUE')")
# then append data to the table
>>> df.write.format("orc").mode("append").saveAsTable("default.spark1")
You can also use insertInto("table"), which doesn't recreate the table.
The main difference from saveAsTable is that insertInto expects the table to already exist, and it matches columns by position rather than by name.
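A small sketch of that positional behavior (using the table from the examples above):
# insertInto matches columns by position, so align the DataFrame's column
# order with the table's schema before inserting.
cols = spark.table("default.spark1").columns
df.select(*cols).write.insertInto("default.spark1")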