Create Spark SQL tables from multiple parquet paths - apache-spark

I use Databricks. I am trying to create a table as below:
target_table_name = 'test_table_1'
spark.sql("""
drop table if exists %s
""" % target_table_name)
spark.sql("""
create table if not exists {0}
USING org.apache.spark.sql.parquet
OPTIONS (
path ("/mnt/sparktables/ds=*/name=xyz/")
)
""".format(target_table_name))
Even though using "*" gives me flexibility in loading different files (pattern matching) and eventually creating a table, I wish to create a table based on two completely different paths (no pattern matching):
path1 = /mnt/sparktables/ds=*/name=xyz/
path2 = /mnt/sparktables/new_path/name=123fo/

Spark uses the Hive metastore to create these permanent tables. These tables are essentially external tables in Hive.
Generally, what you are trying to do is not possible, because a Hive external table's location needs to be unique at the time of creation.
However, you can still build a Hive table over different locations if you incorporate a partitioning strategy in your Hive metastore.
In the Hive metastore, partitions can point to different locations.
There is no off-the-shelf way to achieve this, though. First, you would need to specify a partition key for your dataset and create a table from the first location, where all of that data belongs to one partition. Then alter the table to add a new partition.
Sample:
create external table tableName(<schema>) partitioned by (name string) location '/mnt/sparktables/ds=*/name=xyz/'
Then you can add partitions
alter table tableName add partition(name='123fo') location '/mnt/sparktables/new_path/name=123fo/'
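A minimal sketch of that approach run through spark.sql on Databricks; the schema, table name and concrete paths below are placeholders rather than values from the question:
# Placeholder schema and paths; replace with your own.
spark.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS tableName (id BIGINT, value STRING)
  PARTITIONED BY (name STRING)
  STORED AS PARQUET
  LOCATION '/mnt/sparktables/base/name=xyz/'
""")

# Register each real data location as a partition of the table.
spark.sql("""
  ALTER TABLE tableName ADD IF NOT EXISTS PARTITION (name='xyz')
  LOCATION '/mnt/sparktables/base/name=xyz/'
""")
spark.sql("""
  ALTER TABLE tableName ADD IF NOT EXISTS PARTITION (name='123fo')
  LOCATION '/mnt/sparktables/new_path/name=123fo/'
""")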
The alternative to this process is to create 2 dataframes out of the 2 locations, combine them, and then saveAsTable.

I would do something like this:
create or replace view mytable as
select * from parquet.`path1`
union all
select * from parquet.`path2`
The view knows how to query both locations. I assume you will not append to or overwrite the table, as that would lead to more ambiguity.
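A rough PySpark sketch of the same idea; the two paths are placeholders for the concrete directories:
# Two unrelated parquet locations (placeholders).
path1 = "/mnt/sparktables/base/name=xyz"
path2 = "/mnt/sparktables/new_path/name=123fo"

# A view over both locations; nothing is copied, both paths are scanned at query time.
spark.sql(f"""
  CREATE OR REPLACE VIEW mytable AS
  SELECT * FROM parquet.`{path1}`
  UNION ALL
  SELECT * FROM parquet.`{path2}`
""")

spark.table("mytable").show(5)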

You can create data frames separately for two or more parquet files and then union them (assuming they have identical schemas)
df1.union(df2)
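For example (a sketch; the paths are placeholders and the two schemas are assumed to be compatible):
# Read the two locations separately, union them by column name,
# and persist the result as a table.
df1 = spark.read.parquet("/mnt/sparktables/base/name=xyz/")
df2 = spark.read.parquet("/mnt/sparktables/new_path/name=123fo/")

combined = df1.unionByName(df2)   # or df1.union(df2) if the column order is identical
combined.write.mode("overwrite").saveAsTable("test_table_1")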

Related

Is there a simple way to update location to all partitions in hive external table?

I create a dataframe with Spark daily and save it to an HDFS location.
Before saving, I partition the data by some fields, so the path to the data looks like this:
/warehouse/tablespace/external/hive/table_name/...
table_name directory has partitions like:
table_name/field=value1
table_name/field=value2
I create an external table to work with the data in Hive and set its location to the data path.
Each day I want to change the location to the new data path. But if I use
ALTER TABLE table
SET LOCATION 'new location'
querying still returns old data because the partitions' locations don't change.
Is there any way to tell Hive to look for partitions in the new location, without changing them one by one?
If you don't want to keep the old partitions, you can drop them using the command below, with a clause that matches every existing partition:
ALTER TABLE table_name DROP IF EXISTS PARTITION (field != 'non_exist_value');
Then you can check the remaining partitions using the command below:
SHOW PARTITIONS table_name;
After that, you can change the table to the new location and repair the Hive table to create the partitions under the new location (the partition column names in the new location must be the same as the old ones):
ALTER TABLE table_name SET LOCATION '/new_location';
MSCK REPAIR TABLE table_name;

Insert or Update a delta table from a dataframe in Pyspark

I have a PySpark dataframe from which I initially created a Delta table using the code below:
df.write.format("delta").saveAsTable("events")
Since in my requirement the above dataframe is populated on a daily basis, for appending new records into the Delta table I used the syntax below:
df.write.format("delta").mode("append").saveAsTable("events")
I did all of this in Databricks on my cluster. I want to know how I can write generic PySpark code in Python that will create the Delta table if it does not exist and append records if it does exist. I want to do this because if I give my Python package to someone, they will not have the same Delta table in their environment, so it should get created dynamically from the code.
If you don't have the Delta table yet, then it will be created when you use the append mode. So you don't need to write any special code to handle the case when the table doesn't exist yet versus when it exists.
P.S. You'll need such code only if you're performing a merge into the table, not an append. In that case the code will look like this:
if table_exists:
    do_merge
else:
    df.write....
P.S. here is a generic implementation of that pattern
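As a rough sketch of that pattern (not the linked implementation), using the Delta Lake Python API; spark.catalog.tableExists requires Spark 3.3+, and the merge key id is a placeholder:
from delta.tables import DeltaTable

table_name = "events"

if spark.catalog.tableExists(table_name):
    # Table already exists: upsert the incoming records on a hypothetical key.
    target = DeltaTable.forName(spark, table_name)
    (target.alias("t")
           .merge(df.alias("s"), "t.id = s.id")   # 'id' is a placeholder merge key
           .whenMatchedUpdateAll()
           .whenNotMatchedInsertAll()
           .execute())
else:
    # First run: create the Delta table from the dataframe.
    df.write.format("delta").saveAsTable(table_name)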
There are essentially two operations available with Spark:
saveAsTable: creates or replaces the table, whether it is already present or not, with the current DataFrame.
insertInto: performs the operation based on the mode ('overwrite' or 'append'), but only succeeds if the table is present; it requires the table to already exist in the database.
.saveAsTable("events") basically rewrites the table every time you call it, which means that whether the table was present earlier or not, it will be replaced with the current DataFrame's contents. Instead, you can perform the operation below to be on the safer side.
Step 1: Create the table if it is not already present, then append the new dataframe records; if it already exists, just append the data.
df.createOrReplaceTempView('df_table')
spark.sql("create table IF NOT EXISTS table_name using delta select * from df_table where 1=2")
df.write.format("delta").mode("append").insertInto("events")
So, every time it will check whether the table is available; if not, it will create the table and move to the next step, and if the table is already available, it will just append the data into it.

Add New Partition to Hive External Table via databricks

I have a folder which previously had subfolders based on ingestiontime, which is also the original partition column used in its Hive table.
So the folder looks like:
s3://MyDevBucket/dev/myStreamingData/ingestiontime=20200712230000/....
s3://MyDevBucket/dev/myStreamingData/ingestiontime=20200711230000/....
s3://MyDevBucket/dev/myStreamingData/ingestiontime=20200710230000/....
s3://MyDevBucket/dev/myStreamingData/ingestiontime=20200709230000/....
........
Inside each ingestiontime folder, data is present in PARQUET format.
Now, in the same myStreamingData folder, I am adding another folder named businessname that holds similar data.
So my folder structure now looks like:
s3://MyDevBucket/dev/myStreamingData/businessname=007/ingestiontime=20200712230000/....
s3://MyDevBucket/dev/myStreamingData/businessname=007/ingestiontime=20200711230000/....
s3://MyDevBucket/dev/myStreamingData/businessname=007/ingestiontime=20200710230000/....
s3://MyDevBucket/dev/myStreamingData/ingestiontime=20200712230000/....
s3://MyDevBucket/dev/myStreamingData/ingestiontime=20200711230000/....
s3://MyDevBucket/dev/myStreamingData/ingestiontime=20200710230000/....
s3://MyDevBucket/dev/myStreamingData/ingestiontime=20200709230000/....
........
So I need to add the data in the businessname partitions to my current Hive table too.
To achieve this, I was running the ALTER query (on Databricks):
%sql
alter table gp_hive_table add partition (businessname=007,ingestiontime=20200712230000) location "s3://MyDevBucket/dev/myStreamingData/businessname=007/ingestiontime=20200712230000"
But I am getting this error -
Error in SQL statement: AnalysisException: businessname is not a valid partition column in table `default`.`gp_hive_table`.;
What am I doing incorrectly here?
Thanks in advance.
Since you're already using Databricks and this is a streaming use case, you should definitely take a serious look at using Delta Lake tables.
You won't have to mess with explicit ... ADD PARTITION and MSCK statements.
Delta Lake's ACID properties will ensure your data is committed properly; if your job fails, you won't end up with partial results. As soon as the data is committed, it is available to users (again, without the MSCK and ADD PARTITION statements).
Just change 'USING PARQUET' to 'USING DELTA' in your DDL.
You can also CONVERT your existing parquet table to a Delta Lake table and then start using INSERT, UPDATE, DELETE, MERGE INTO and COPY INTO from Spark batch and Structured Streaming jobs. OPTIMIZE will clean up the small-file problem.
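For example, an in-place conversion could look roughly like this (a sketch; the partition column types are assumptions and must match the actual directory layout):
# Convert the existing parquet directory to Delta in place.
# PARTITIONED BY must describe the partition columns encoded in the paths.
spark.sql("""
  CONVERT TO DELTA parquet.`s3://MyDevBucket/dev/myStreamingData`
  PARTITIONED BY (businessname STRING, ingestiontime BIGINT)
""")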
alter table gp_hive_table add partition adds a partition (a data location, not a new column) to a table with an already defined partitioning scheme. It does not change the current partitioning scheme; it just adds partition metadata stating that in some location there is a partition corresponding to some partitioning column value.
If you want to change the partition columns, you need to recreate the table:
Drop the table (check that it is EXTERNAL first): DROP TABLE gp_hive_table;
Create the table with the new partitioning columns. Partitions WILL NOT be created automatically.
Now you can add partitions using ALTER TABLE ADD PARTITION or use MSCK REPAIR TABLE to create them automatically based on the directory structure. The directory structure should already match the partitioning scheme before you execute these commands (see the sketch below).
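A sketch of those steps issued through spark.sql; the non-partition columns (col1, col2) are placeholders since the question does not show the table schema:
# 1. Drop the old definition. It is an external table, so the files in S3 stay.
spark.sql("DROP TABLE IF EXISTS gp_hive_table")

# 2. Recreate it with businessname added to the partitioning scheme.
#    col1/col2 are placeholder data columns.
spark.sql("""
  CREATE EXTERNAL TABLE gp_hive_table (col1 STRING, col2 BIGINT)
  PARTITIONED BY (businessname STRING, ingestiontime BIGINT)
  STORED AS PARQUET
  LOCATION 's3://MyDevBucket/dev/myStreamingData'
""")

# 3. Pick up the partitions whose directories already follow the
#    businessname=.../ingestiontime=... layout.
spark.sql("MSCK REPAIR TABLE gp_hive_table")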
So,
building upon the suggestion from @leftjoin,
instead of having a Hive table without businessname as one of the partition columns,
what I did is:
Step 1 -> Create the Hive table with PARTITIONED BY (businessname long, ingestiontime long).
Step 2 -> Execute the query MSCK REPAIR TABLE <Hive_Table_name> to auto-add partitions.
Step 3 ->
Now, there are ingestiontime folders which are not inside a businessname folder, i.e.
folders like:
s3://MyDevBucket/dev/myStreamingData/ingestiontime=20200712230000/....
s3://MyDevBucket/dev/myStreamingData/ingestiontime=20200711230000/....
s3://MyDevBucket/dev/myStreamingData/ingestiontime=20200710230000/....
s3://MyDevBucket/dev/myStreamingData/ingestiontime=20200709230000/....
I wrote a small piece of code to fetch all such partitions and then ran the following query for all of them:
ALTER TABLE <hive_table_name> ADD PARTITION (businessname=<some_value>,ingestiontime=<ingestion_time_partition_name>) LOCATION "<s3_location_of_all_partitions_not_belonging_to_a_specific_businesskey>"
This solved my issue.
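A sketch of what such a piece of code can look like on Databricks using dbutils.fs.ls; the businessname value assigned to the old folders (0) is a placeholder:
base = "s3://MyDevBucket/dev/myStreamingData/"

# Old-style folders sit directly under the base path as ingestiontime=... directories.
old_partitions = [f.name.rstrip("/") for f in dbutils.fs.ls(base)
                  if f.name.startswith("ingestiontime=")]

for part in old_partitions:
    ingestion_time = part.split("=", 1)[1]
    spark.sql(f"""
      ALTER TABLE gp_hive_table
      ADD IF NOT EXISTS PARTITION (businessname=0, ingestiontime={ingestion_time})
      LOCATION '{base}{part}/'
    """)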

drop table command is not deleting path of hive table which was created by spark-sql

I am trying to drop an internal table that was created by Spark-SQL. The table is getting dropped, but the location of the table still exists. Can someone let me know how to do this?
I tried both Beeline and Spark-SQL:
create table something(hello string)
PARTITIONED BY(date_d string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY "^"
LOCATION "hdfs://path"
Drop table something;
No rows affected (0.945 seconds)
Thanks
Spark internally uses the Hive metastore to create tables. If the table is created as an external Hive table from Spark, i.e. the data is present in HDFS and Hive provides a table view on it, the drop table command will only delete the metastore information and will not delete the data from HDFS.
So there are a couple of alternative strategies you could take:
Manually delete the data from HDFS using the hadoop fs -rm -r -f command
Do an alter table on the table you want to delete, change the external table to an internal table, and then drop the table.
ALTER TABLE <table-name> SET TBLPROPERTIES('EXTERNAL'='FALSE');
drop table <table-name>;
The first statement will convert the external table to an internal table, and the second statement will delete the table along with the data.

save dataframe as external hive table

I have used one way to save a dataframe as an external table using the parquet file format, but is there some other way to save dataframes directly as an external table in Hive, like we have saveAsTable for managed tables?
You can do it this way:
df.write.format("ORC").options(Map("path"-> "yourpath")) saveAsTable "anubhav"
In PySpark, an external table can be created as below:
df.write.option('path','<External Table Path>').saveAsTable('<Table Name>')
For an external table, don't use saveAsTable. Instead, save the data at the location of the external table specified by path. Then add the partition so that it is registered with the Hive metadata. This will allow you to run Hive queries by partition later.
// hc is HiveContext, df is DataFrame.
df.write.mode(SaveMode.Overwrite).parquet(path)
val sql =
s"""
|alter table $targetTable
|add if not exists partition
|(year=$year,month=$month)
|location "$path"
""".stripMargin
hc.sql(sql)
You can also save the dataframe with a manual create table:
dataframe.registerTempTable("temp_table");
hiveSqlContext.sql("create external table if not exists table_name as select * from temp_table");
The link below has a good explanation of create table: https://docs.databricks.com/spark/latest/spark-sql/language-manual/create-table.html
