I have a Hive table that I wrote from Spark using saveAsTable(tableName). Now I want to add a column to this table. My first approach was to add the column via Hive, but apparently Spark does not pick up the new schema even though the column was added to the table. When I check the table details in Hue, spark.sql.sources.schema.part.0 is not updated. So I thought I would add the column via a Spark job that executes the same ALTER query against the table. Same result, and this time the column is not even added to the table. Is there a way around this issue? I also thought of renaming the table, creating a new one with the right schema, and then doing an insert select * ... into the new table, but that didn't work because the tables are partitioned.
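For reference, a minimal sketch of the two attempts described above, assuming a hypothetical table name my_table and column new_col (both placeholders, not from the original setup):

# Attempt 1: column added on the Hive side, then ask Spark to re-read its cached metadata
spark.catalog.refreshTable("my_table")
spark.table("my_table").printSchema()   # check whether the new column shows up

# Attempt 2: issue the ALTER from Spark itself (ALTER TABLE ... ADD COLUMNS is supported since Spark 2.2)
spark.sql("ALTER TABLE my_table ADD COLUMNS (new_col STRING)")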
Need some help as we are baffled. Using Impala SQL, we did ALTER TABLE to add 3 columns to a parquet table. The table is used by both Spark (v2) and Impala jobs.
After the columns were added, Impala correctly reports the new columns using describe; however, Spark does not report the freshly added columns when spark.sql("describe tablename") is executed.
We double checked Hive and it correctly reports the added columns.
We ran refresh table tablename in Spark, but it still doesn't see the new columns. We believe we must be overlooking something simple. What step did we miss?
Update: Impala sees the table with the new columns, but Spark does not acknowledge them. Reading more about Spark, it appears the Spark engine reads the schema from the Parquet files rather than from the Hive metastore. The suggested workaround did not work, and the only recourse we could find was to drop the table and rebuild it.
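For anyone comparing notes, a rough sketch of the checks involved (nothing here is guaranteed to resolve the issue; spark.sql.hive.convertMetastoreParquet is a standard Spark setting, but whether it helps depends on the setup):

# Invalidate Spark's cached metadata and look at the schema Spark currently sees
spark.sql("REFRESH TABLE tablename")
spark.sql("DESCRIBE tablename").show(truncate=False)

# Ask Spark to use the Hive metastore schema instead of reading it from the Parquet footers
# (this disables Spark's built-in Parquet conversion for Hive tables and may cost performance)
spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false")
spark.table("tablename").printSchema()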
I needed to rename one column in Hive, so I did that with the query below. After altering, I could see the result in Hive for "select columnnm2 from tablename".
alter table tablename change columnnm1 columnnm2 string;
But when I execute select columnnm2 through spark.sql I get NULL values, whereas I can see valid values in Hive.
It's a managed table. I tried a Spark metadata refresh, but still no luck. As of now, dropping the old table and creating a new Hive table with the correct column name works, but how should this ALTER scenario be handled?
spark.catalog.refreshTable("<tablename>")
Thank you.
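A likely explanation, plus a sketch of a workaround: Parquet files store the column names themselves, so after a metastore-only rename the files still carry columnnm1, and Spark returns NULLs when it looks for columnnm2 in them. The path below is a hypothetical location of the table's data.

# Read the underlying Parquet files directly and rename the column in the DataFrame
df = spark.read.parquet("/path/to/table")   # placeholder path
df = df.withColumnRenamed("columnnm1", "columnnm2")
df.select("columnnm2").show(5)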
I am building an ETL using PySpark in Databricks.
I have a source SQL table with roughly 10 million rows of data which I want to load into a SQL staging table.
I have two basic requirements:
When a row is added to the source table, it must be inserted into the staging table.
When a row is updated in the source table, it must be updated in the staging table.
Source data
Thankfully the source table has two timestamp columns for created and updated time. I can query the new and updated data using these two columns and put it into a dataframe called source_df.
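As a rough sketch of that incremental filter (the column names created_time / updated_time and the watermark value are assumptions, and source_table_df stands for the already-loaded source table):

from pyspark.sql import functions as F

# keep only rows created or updated since the last successful run
last_run_ts = "2024-01-01 00:00:00"   # placeholder watermark from the previous run
source_df = source_table_df.where(
    (F.col("created_time") > last_run_ts) | (F.col("updated_time") > last_run_ts))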
Target data
I load all the keys (IDs) from the staging table into a dataframe called target_df.
Working out changes
I join the two dataframes together based on the key to work out which rows already exist (which form updates) and which rows don't exist (which form inserts). This gives me two new dataframes, inserts_df and updates_df.
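As a rough illustration of that join (the key column name id is an assumption, not something given above):

# rows whose key is missing from staging become inserts; rows whose key already exists become updates
inserts_df = source_df.join(target_df.select("id"), on="id", how="left_anti")
updates_df = source_df.join(target_df.select("id"), on="id", how="left_semi")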
Inserting new rows
This is easy because I can just use inserts_df.write to directly write into the staging table. Job done!
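For example, a hedged sketch of that append, assuming the staging table is reached over JDBC (the URL, table name and credentials are placeholders):

(inserts_df.write
    .format("jdbc")
    .option("url", "jdbc:sqlserver://<server>:1433;databaseName=<db>")   # placeholder URL
    .option("dbtable", "dbo.staging_table")                              # placeholder table
    .option("user", "<user>")
    .option("password", "<password>")
    .mode("append")
    .save())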
Updating existing rows
This is what I can't figure out, as there is little in the way of existing examples. I am led to believe that you can't do updates using PySpark. I could use the "overwrite" mode to replace the SQL table, but it doesn't make a lot of sense to replace 10 million rows when I only want to update half a dozen.
How can I effectively get the rows from updates_df into SQL without overwriting the whole table?
I have a PySpark dataframe from which I initially created a Delta table using the code below:
df.write.format("delta").saveAsTable("events")
Now, since the above dataframe populates data on a daily basis in my requirement, for appending new records into the Delta table I used the syntax below:
df.write.format("delta").mode("append").saveAsTable("events")
Now, I did this whole thing in Databricks on my cluster. I want to know how I can write generic PySpark code in Python that will create the Delta table if it does not exist and append records if it does exist. I want to do this because if I give my Python package to someone, they will not have the same Delta table in their environment, so it should get created dynamically from the code.
If you don't have the Delta table yet, it will be created when you use the append mode. So you don't need to write any special code to handle the case where the table doesn't exist yet versus where it already exists.
P.S. You only need such code if you're performing a merge into the table, not an append. In that case the code will look like this:
if table_exists:                                  # e.g. spark.catalog.tableExists(name) on Spark 3.3+
    do_merge(df)                                  # MERGE the incoming records into the existing table
else:
    df.write.format("delta").saveAsTable(name)    # first run: create the table from the DataFrame
P.S. here is a generic implementation of that pattern
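Not the linked implementation, but a rough sketch of such a generic upsert helper for illustration (the table name events and key column id are assumptions):

from delta.tables import DeltaTable

def create_or_merge(spark, df, table_name="events", key="id"):
    """Create the Delta table on the first run, otherwise MERGE the new records into it."""
    if not spark.catalog.tableExists(table_name):      # available in Spark 3.3+ / Databricks
        df.write.format("delta").saveAsTable(table_name)
        return
    (DeltaTable.forName(spark, table_name).alias("t")
        .merge(df.alias("s"), f"t.{key} = s.{key}")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())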
There are essentially two operations available with Spark:
saveAsTable: creates (or, depending on the mode, replaces) the table from the current DataFrame, whether or not it already exists.
insertInto: succeeds only if the table is already present in the database, and performs the operation based on the mode ('overwrite' or 'append').
The .saveAsTable("events") call (with 'overwrite' mode) basically rewrites the table every time you call it, which means that whether or not a table was present earlier, it is replaced with the current DataFrame's contents. Instead, you can perform the operation below to be on the safer side:
Step 1: Create the table only if it is not already present; the where 1=2 predicate copies just the schema and no rows. Step 2: Append the new DataFrame records into it.
df.createOrReplaceTempView('df_table')
spark.sql("create table IF NOT EXISTS table_name using delta select * from df_table where 1=2")
df.write.format("delta").mode("append").insertInto("events")
So every time it runs, it first checks whether the table is available; if not, it creates the table and moves on to the next step, and if the table is already available, it simply appends the data into it.
I am trying to drop an internal (managed) table that was created with Spark SQL; somehow the table gets dropped, but the table's location still exists. Can someone let me know how to handle this?
I tried both Beeline and Spark SQL.
create table something(hello string)
PARTITIONED BY(date_d string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY "^"
LOCATION "hdfs://path"
Drop table something;
No rows affected (0.945 seconds)
Thanks
Spark internally uses the Hive metastore to create tables. If the table is created as an external Hive table from Spark, i.e. the data is present in HDFS and Hive provides a table view on it, the drop table command will only delete the metastore information and will not delete the data from HDFS.
So there are a couple of alternative strategies you could take:
Manually delete the data from HDFS using the hadoop fs -rm -r command.
Run ALTER TABLE on the table you want to delete to change it from an external table to an internal (managed) table, then drop the table:
ALTER TABLE <table-name> SET TBLPROPERTIES('external'='false');
drop table <table-name>;
The first statement converts the external table to an internal table, and the second statement deletes the table along with its data.