Databricks - Ingest new data from parquet files into a Delta Lake table

I posted this question on the Databricks forum and I'll copy it below, but basically I need to ingest new data from parquet files into a Delta table. I think I have to figure out how to use a MERGE statement effectively and/or use an ingestion tool.
I'm mounting some parquet files and then I create a table like this:
sqlContext.sql("CREATE TABLE myTableName USING parquet LOCATION 'myMountPointLocation'");
And then I create a delta table with a subset of columns and also a subset of the records. If I do both these things, my queries are super fast.
sqlContext.sql("CREATE TABLE $myDeltaTableName USING DELTA AS SELECT A, B, C FROM myTableName WHERE Created > '2021-01-01'");
What happens if I now run:
sqlContext.sql("REFRESH TABLE myTableName");
Does my table now update with any additional data that may be present in my parquet source files? Or do I have to re-mount those parquet files to get additional data?
Does my delta table also update with new records? I doubt it but one can hope...
Is this a case for AutoLoader? Or maybe I do some combination of mounting, re-creating / refreshing my source table, and then maybe MERGE new records / updated records into my delta table?
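One possible pattern (a minimal sketch only, not a confirmed answer; the checkpoint path, schema-tracking path and trigger choice below are assumptions) is to let Auto Loader incrementally pick up new parquet files from the mount and append the selected subset into the Delta table:
# Sketch: Auto Loader discovers only files it has not seen before and streams them
# into the Delta table; rerunning the job ingests just the new parquet files.
df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "parquet")
      .option("cloudFiles.schemaLocation", "/mnt/checkpoints/myTableName/schema")  # placeholder path
      .load("myMountPointLocation"))
(df.where("Created > '2021-01-01'")
   .select("A", "B", "C")
   .writeStream
   .format("delta")
   .option("checkpointLocation", "/mnt/checkpoints/myTableName")  # placeholder path
   .trigger(once=True)
   .toTable("myDeltaTableName"))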

Related

Spark SQL Merge query

I am trying to MERGE two tables using Spark SQL and I am getting an error with the statement.
The tables are created as external tables pointing to Azure ADLS storage. The SQL is executed using Databricks.
Table 1:
Name,Age,Sex
abc,24,M
bca,25,F
Table 2:
Name,Age,Sex
abc,25,M
acb,25,F
Table 1 is the target table and Table 2 is the source table.
Table 2 has one insert and one update record, which need to be merged into the target Table 1.
Query:
MERGE INTO table1 using table2 ON (table1.name=table2.name)
WHEN MATCHED AND table1.age <> table2.age AND table1.sex<>table2.sex
THEN UPDATE SET table1.age=table2.age, table1.sex=table2.sex
WHEN NOT MATCHED
THEN INSERT (name,age,sex) VALUES (table2.name,table2.age,table2.sex)
Does Spark SQL support MERGE, or is there another way of achieving this?
Thanks,
Sat
To use MERGE you need the Delta Lake option (and the associated jars); then you can use MERGE.
Otherwise, SQL MERGE is not supported by Spark, and you need the DataFrame writer APIs with your own logic. There are a few different ways to do this. Even with ORC ACID, Spark will not work this way.
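As a rough sketch of that Delta route (assuming table1 is a parquet table registered in the metastore; conversion syntax can vary by runtime version, and the matched condition below is relaxed to OR so a row is updated when either column differs):
# Sketch: only the MERGE target has to be a Delta table; the source can stay as-is.
spark.sql("CONVERT TO DELTA table1")
spark.sql("""
MERGE INTO table1 USING table2
ON table1.name = table2.name
WHEN MATCHED AND (table1.age <> table2.age OR table1.sex <> table2.sex) THEN
  UPDATE SET table1.age = table2.age, table1.sex = table2.sex
WHEN NOT MATCHED THEN
  INSERT (name, age, sex) VALUES (table2.name, table2.age, table2.sex)
""")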

Add New Partition to Hive External Table via databricks

I have a folder which previously had subfolders based on ingestiontime, which is also the original partition column used in its Hive table.
So the folder looks like this:
s3://MyDevBucket/dev/myStreamingData/ingestiontime=20200712230000/....
s3://MyDevBucket/dev/myStreamingData/ingestiontime=20200711230000/....
s3://MyDevBucket/dev/myStreamingData/ingestiontime=20200710230000/....
s3://MyDevBucket/dev/myStreamingData/ingestiontime=20200709230000/....
........
Inside each ingestiontime folder, data is present in PARQUET format.
Now, in the same myStreamingData folder, I am adding another folder named businessname that holds similar data.
So my folder structure now looks like this:
s3://MyDevBucket/dev/myStreamingData/businessname=007/ingestiontime=20200712230000/....
s3://MyDevBucket/dev/myStreamingData/businessname=007/ingestiontime=20200711230000/....
s3://MyDevBucket/dev/myStreamingData/businessname=007/ingestiontime=20200710230000/....
s3://MyDevBucket/dev/myStreamingData/ingestiontime=20200712230000/....
s3://MyDevBucket/dev/myStreamingData/ingestiontime=20200711230000/....
s3://MyDevBucket/dev/myStreamingData/ingestiontime=20200710230000/....
s3://MyDevBucket/dev/myStreamingData/ingestiontime=20200709230000/....
........
So I need to add the data under the businessname partition to my current Hive table too.
To achieve this, I was running the following ALTER query (on Databricks):
%sql
alter table gp_hive_table add partition (businessname=007,ingestiontime=20200712230000) location "s3://MyDevBucket/dev/myStreamingData/businessname=007/ingestiontime=20200712230000"
But I am getting this error -
Error in SQL statement: AnalysisException: businessname is not a valid partition column in table `default`.`gp_hive_table`.;
What am I doing incorrectly here?
Thanks in Advance.
Since you're already using Databricks and this is a streaming use case, you should definitely take a serious look at using Delta Lake tables.
You won't have to mess with explicit ... ADD PARTITION and MSCK statements.
Delta Lake, with its ACID properties, will ensure your data is committed properly; if your job fails you won't end up with partial results. As soon as the data is committed, it is available to users (again, without the MSCK and ADD PARTITION statements).
Just change 'USING PARQUET' to 'USING DELTA' in your DDL.
You can also CONVERT your existing parquet table to a Delta Lake table and then start using INSERT, UPDATE, DELETE, MERGE INTO, COPY INTO from Spark batch and Structured Streaming jobs. OPTIMIZE will clean up the small file problem.
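A minimal sketch of that DDL change (the new table name and location are placeholders, and it assumes the source table exposes both partition columns):
# Sketch: create the streaming target as a partitioned Delta table instead of parquet;
# partition handling, ACID commits and file compaction are then managed by Delta Lake.
spark.sql("""
CREATE TABLE gp_delta_table
USING DELTA
PARTITIONED BY (businessname, ingestiontime)
LOCATION 's3://MyDevBucket/dev/myDeltaData'
AS SELECT * FROM gp_hive_table
""")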
ALTER TABLE gp_hive_table ADD PARTITION adds a partition (a data location, not a new column) to a table with an already defined partitioning scheme. It does not change the current partitioning scheme; it just adds partition metadata stating that some location holds the partition for some partitioning column value.
If you want to change the partition columns, you need to recreate the table (a sketch of these steps follows the list):
Drop the table (check that it is EXTERNAL): DROP TABLE gp_hive_table;
Create the table with the new partitioning columns. Partitions WILL NOT be created automatically.
Now you can add partitions using ALTER TABLE ADD PARTITION, or use MSCK REPAIR TABLE to create them automatically based on the directory structure. The directory structure should already match the partitioning scheme before you execute these commands.
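A minimal sketch of those steps, assuming the table is external and the new partition columns are businessname and ingestiontime (the column types and the placeholder data column are assumptions):
# Sketch: recreate the external table with the new partitioning scheme, then let
# MSCK REPAIR TABLE register every partition found in the directory layout.
spark.sql("DROP TABLE IF EXISTS gp_hive_table")
spark.sql("""
CREATE EXTERNAL TABLE gp_hive_table (
  some_col STRING  -- placeholder for the original non-partition columns
)
PARTITIONED BY (businessname BIGINT, ingestiontime BIGINT)
STORED AS PARQUET
LOCATION 's3://MyDevBucket/dev/myStreamingData'
""")
spark.sql("MSCK REPAIR TABLE gp_hive_table")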
So, building upon the suggestion from @leftjoin, instead of having a Hive table without businessname as one of the partition columns, what I did is:
Step 1 -> Create the Hive table with PARTITIONED BY (businessname long, ingestiontime long)
Step 2 -> Execute MSCK REPAIR TABLE <Hive_Table_name> to auto-add partitions.
Step 3 ->
Now, there are ingestiontime folders which are not inside a businessname folder, i.e. folders like:
s3://MyDevBucket/dev/myStreamingData/ingestiontime=20200712230000/....
s3://MyDevBucket/dev/myStreamingData/ingestiontime=20200711230000/....
s3://MyDevBucket/dev/myStreamingData/ingestiontime=20200710230000/....
s3://MyDevBucket/dev/myStreamingData/ingestiontime=20200709230000/....
I wrote a small piece of code to fetch all such partitions and then ran the following query for all of them -
ALTER TABLE <hive_table_name> ADD PARTITION (businessname=<some_value>,ingestiontime=<ingestion_time_partition_name>) LOCATION "<s3_location_of_all_partitions_not_belonging_to_a_specific_businesskey>"
This solved my issue.
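That small piece of code is not shown above; a rough sketch of the idea (the table name, the default businessname value of 0 and the use of dbutils are assumptions) could look like this:
# Sketch: list the top-level ingestiontime=... folders that sit outside any
# businessname=... folder and register each one under a default businessname value.
base = "s3://MyDevBucket/dev/myStreamingData"
for f in dbutils.fs.ls(base):
    if f.name.startswith("ingestiontime="):
        ingestiontime = f.name.rstrip("/").split("=")[1]
        spark.sql(f"""
            ALTER TABLE gp_hive_table ADD IF NOT EXISTS PARTITION
            (businessname=0, ingestiontime={ingestiontime})
            LOCATION '{f.path}'
        """)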

Create index for tables within Delta Lake

I'm new to Delta Lake, but I want to create some indexes for fast retrieval on some tables in Delta Lake. Based on the docs, the closest thing appears to be a data skipping index, created like this:
CREATE DATASKIPPING INDEX ON [TABLE] [db_name.]tableName
Can't seem to find other methods of creating indexes other than Data Skipping
How do I create indexes just like any tables in RDBMS, within Delta Lake?
Thanks!
Indexing happens automatically on Databricks Delta and OSS Delta Lake as of v1.2.0. As you write data, the columns in the files you write are indexed and added to the internal table metadata. As you query the data and filter, data skipping is applied.
In addition, you can use Z-ordering on Databricks Delta to optimize the files based on specific columns; indexing will still be used for the other columns as well.
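For example (a sketch; the table and column names are placeholders), Z-ordering is applied through OPTIMIZE on Databricks:
# Sketch: compact the table's files and co-locate data by the columns most often filtered on.
spark.sql("OPTIMIZE my_delta_table ZORDER BY (event_date, customer_id)")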

Create Spark SQL tables from multiple parquet paths

I use Databricks. I am trying to create a table as below:
target_table_name = 'test_table_1'
spark.sql("""
drop table if exists %s
""" % target_table_name)
spark.sql("""
create table if not exists {0}
USING org.apache.spark.sql.parquet
OPTIONS (
path ("/mnt/sparktables/ds=*/name=xyz/")
)
""".format(target_table_name))
Even though using "*" gives me flexibility on loading different files (pattern matching) and eventually creating a table, I wish to create a table based on two completely different paths (no pattern matching).
path1 = /mnt/sparktables/ds=*/name=xyz/
path2 = /mnt/sparktables/new_path/name=123fo/
Spark uses the Hive metastore to create these permanent tables; they are essentially external tables in Hive.
Generally, what you are trying is not possible, because a Hive external table's location needs to be unique at the time of creation.
However, you can still achieve a Hive table over different locations if you incorporate a partitioning strategy in your Hive metastore, since partitions can point to different locations.
There is no off-the-shelf way to do this. First you would need to specify a partition key for your dataset and create a table from the first location, where the entire data belongs to one partition. Then alter the table to add a new partition.
Sample:
create external table tableName(<schema>) partitioned by (name string) location '/mnt/sparktables/ds=*/name=xyz/'
Then you can add partitions
alter table tableName add partition(name='123fo') location '/mnt/sparktables/new_path/name=123fo/'
The alternative to this process is to create two DataFrames from the two locations, combine them, and then use saveAsTable.
I would do something like this:
create or replace view mytable as
select * from parquet.`path1`
union all
select * from parquet.`path2`
The view understands how to query from both locations. I assume you will not append/overwrite the table as it would lead to more ambiguity.
You can create data frames separately for two or more parquet files and then union them (assuming they have identical schemas)
df1.union(df2)
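A quick sketch of that union-then-save approach (the paths and target table name reuse the ones from the question; overwrite mode is an assumption):
# Sketch: read each location separately, union the DataFrames by column name
# (schemas must match), and persist the result as a single metastore table.
df1 = spark.read.parquet("/mnt/sparktables/ds=*/name=xyz/")
df2 = spark.read.parquet("/mnt/sparktables/new_path/name=123fo/")
df1.unionByName(df2).write.mode("overwrite").saveAsTable("test_table_1")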

Create hive external table from partitioned parquet files in Azure HDInsights

I have data saved as parquet files in Azure blob storage. Data is partitioned by year, month, day and hour like:
cont/data/year=2017/month=02/day=01/
I want to create an external table in Hive using the following create statement, which I wrote using this reference.
CREATE EXTERNAL TABLE table_name (uid string, title string, value string)
PARTITIONED BY (year int, month int, day int) STORED AS PARQUET
LOCATION 'wasb://cont@storage_name.blob.core.windows.net/data';
This creates the table, but it has no rows when querying. I tried the same create statement without the PARTITIONED BY clause and that seems to work, so it looks like the issue is with partitioning.
What am I missing?
After you create the partitioned table, run the following in order to add the directories as partitions
MSCK REPAIR TABLE table_name;
If you have a large number of partitions you might need to set hive.msck.repair.batch.size
When there is a large number of untracked partitions, there is a provision to run MSCK REPAIR TABLE batch-wise to avoid OOME (Out of Memory Error). By giving the configured batch size for the property hive.msck.repair.batch.size it can run in batches internally. The default value of the property is zero, which means it will execute all the partitions at once.
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-RecoverPartitions(MSCKREPAIRTABLE)
Written by the OP:
This will probably fix your issue; however, if the data is very large, it won't work. See the relevant issue here.
As a workaround, there is another way to add partitions to the Hive metastore one by one, like:
alter table table_name add partition(year=2016, month=10, day=11, hour=11)
We wrote a simple script to automate this ALTER statement and it seems to work for now.
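The script itself is not shown; a hypothetical sketch (assuming the statements are issued through a Spark session, with an arbitrary one-day range for illustration) might look like this:
# Sketch: generate one ALTER TABLE ... ADD PARTITION statement per hour for a date
# range, instead of relying on MSCK REPAIR TABLE for a very large partition count.
from datetime import datetime, timedelta

start, end = datetime(2017, 2, 1), datetime(2017, 2, 2)
t = start
while t < end:
    spark.sql(
        f"ALTER TABLE table_name ADD IF NOT EXISTS PARTITION "
        f"(year={t.year}, month={t.month}, day={t.day}, hour={t.hour})"
    )
    t += timedelta(hours=1)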
