How to add data to an existing Delta table in Databricks?

I have data in Parquet format in ADLS Gen2, and I want to implement delta layers in my project.
So I kept all the data from on-prem in ADLS Gen2 via ADF, in a separate container called the landing zone.
Now I have created a separate container called Bronze where I want to keep the Delta tables.
For this, I did the following.
I created a database in Databricks, and I created a Delta table in Databricks using the SQL code below.
create table if not exists externaltables.actv_snap_view(
id String,
mbr_id String,
typ_id String,
strt_dttm String,
otcome_typ_id String,
cdc String
)
using delta
location '/mnt/Storage/Bronze/actv_snap_view'
Right now my table does not have any data.
How can I add the data that is in the data lake landing zone into the Delta table I created?
My database is in Databricks; after data is added to the table, where will the underlying data be stored?

You can follow the steps below to create a table using the data from the landingzone container (the source of the parquet files), where the table belongs to a database stored in the bronze container.
Assuming your ADLS containers are mounted, you can create a database and specify its location as your bronze container mount point, as suggested by @Ganesh Chandrasekaran.
create database demo location "/mnt/bronzeoutput/"
Now use the following SQL syntax to create a table from a parquet file present in the mount point of the landingzone container.
create table demo.<table_name> (<columns>) using parquet location '/mnt/landingzoneinput/<parquet_file_name>';
Using the steps above, you have created a database in your bronze container where you can store your tables. To populate a table created inside this database, you use the files present in your landingzone container.
Update:
The create table statement above creates a table with data from the parquet file, but the table is not reflected in the data lake.
You can instead use the queries given below. The first creates an empty table in the database (present in the bronze container); you can then insert the values from your parquet file present in the landingzone container.
create table demo.<table_name> (<columns>);
-- demo database is inside bronze container
insert into demo.<table_name> select * from <data_source>.`/mnt/landingzoneinput/source_file`
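For example, applied to the actv_snap_view table from the question, the two statements might look like the sketch below (the mount path /mnt/landingzoneinput/actv_snap_view and the demo database location are assumptions; adjust them to your actual landing-zone layout):
create table demo.actv_snap_view (
  id string,
  mbr_id string,
  typ_id string,
  strt_dttm string,
  otcome_typ_id string,
  cdc string
);
-- read the parquet files directly from the landing zone and load them into the new table
insert into demo.actv_snap_view
select * from parquet.`/mnt/landingzoneinput/actv_snap_view`;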

Related

How to copy the records which are not in the target datastore using Azure Data Factory

I have a table in SQL and it is copied to ADLS. After the copy, new rows were inserted into the SQL table, and I want to pick up only those new rows.
I tried to use a join transformation, but I couldn't get the output. What is the way to achieve this?
Refer to this link. Using it, you can get newly added rows from SQL into data lake storage. I reproduced the issue on my side and was able to get the newly added records from the pipeline.
I created two tables in SQL with the names data_source_table and watermarktable.
data_source_table is the one that actually holds the data, and watermarktable is used for tracking new records based on date.
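For reference, a minimal sketch of what the two tables could look like (the columns other than LastModifytime and WatermarkValue are assumptions based on the linked incremental-copy tutorial):
-- source table: the data to copy, with a modification timestamp
create table data_source_table (
    PersonID int,
    Name varchar(255),
    LastModifytime datetime
);
-- watermark table: remembers the highest LastModifytime already copied
create table watermarktable (
    TableName varchar(255),
    WatermarkValue datetime
);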
I created the pipeline as shown below.
In Lookup1, select the watermarktable (it holds the last watermark value that was copied).
In Lookup2, use the following query against data_source_table:
select MAX(LastModifytime) as NewWatermarkvalue from data_source_table;
Then, in the copy activity, the source and sink were configured as shown in the images below.
SOURCE:
Query in Source:
select * from data_source_table where LastModifytime > '@{activity('Lookup1').output.firstRow.WatermarkValue}' and LastModifytime <= '@{activity('Lookup2').output.firstRow.NewWatermarkvalue}'
SINK:
The pipeline ran successfully and the data in the SQL table was loaded into a data lake storage file.
I then inserted new rows into data_source_table and was able to get those records from the Lookup activity.
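Note that the linked tutorial also finishes the pipeline by writing the new watermark back, so the next run only picks up rows added after the latest copy; a minimal sketch of that update, using the table and column names assumed above:
-- store the value returned by Lookup2 as the new watermark
update watermarktable
set WatermarkValue = '<NewWatermarkvalue from Lookup2>'
where TableName = 'data_source_table';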

Column names are incorrectly mapped

I was trying to pull/load data from an on-prem data lake to Azure Data Lake using Azure Data Factory.
I was just giving a query to pull all the columns. My sink is Azure Data Lake Gen2.
But my column names are coming out wrong between source and sink.
My column names in the on-prem data lake are like user_id, lst_nm, etc., but in Azure they are like user_tbl.user_id, user_tbl.lst_nm, etc. Here user_tbl is my table name.
I don't want the table name getting added to the columns.
Azure won't add the table name to the columns by itself. Check the output of the select query that you are sending to the source using preview data in ADF; that will show you the actual column names ADF is getting from the source. If the names don't have the table name prefixed, check whether your ADLS Gen2 destination folder already has a file in it; if it does, remove the file and try running the pipeline again.
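If the prefix really is coming from the source query itself, a simple workaround is to alias each column explicitly in the query you send to the source, for example (table and column names taken from the question):
select
    user_tbl.user_id as user_id,
    user_tbl.lst_nm as lst_nm
from user_tbl;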
Instead of using the Copy activity, use a Data Flow transformation, which allows you to change the column names at the destination dynamically.
Or you can also use the Move and transform activity, which also allows you to change column names. Refer to the official tutorial: Dynamically set column names in data flows.
Also check ADF Mapping Data Flows: Create rules to modify column names.

Azure Data Lake Store as EXTERNAL TABLE in Databricks with multiple PATH?

I am trying to create an external table as shown below.
The path for the table is dynamic; can an external table accept multiple paths?
CREATE TABLE tablename
(BusinessDate string,
StoreNumber string)
USING csv
OPTIONS ('DELIMITER' '~',
PATH "/mnt/raw/2021/08/19/store01.txt,/mnt/raw/2021/08/17/store09.txt")
You may try the steps below to create a table using multiple paths from an ADLS Gen2 account.
Step 1: For demo purposes, I created two sample CSV files containing employee data; data1.csv contains three rows and data2.csv contains two rows.
Step 2: Upload both files to a container named data in the ADLS Gen2 account.
Step 3: Create a table using multiple paths from the mounted ADLS Gen2 account, as shown below.
CREATE TABLE default.employee
(id INT, name STRING, age INT)
USING CSV
LOCATION '/mnt/sampledata/data/*.csv'
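As a quick check, querying the table should return the rows from both files:
-- expected: 5 rows in total (3 from data1.csv and 2 from data2.csv)
SELECT * FROM default.employee;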

Copy binary data from Azure SQL to Azure Blobs using Data Factory

I'm trying to use Azure Data Factory to move the contents of an Azure SQL table that holds photo (JPEG) data into JPEG files held in Azure Blob storage. There does not seem to be a way to create binary files in Blob storage using ADF without the binary file being in a specific format like Avro or Parquet. I need to create 'raw' binary blobs.
I've been able to create Parquet files for each row in the SQL table, where the Parquet file contains columns for the Id, ImageType and Data (the varbinary that came from the SQL row). I cannot work out how to get the Data column directly into a binary file called "{id}.jpeg".
So far I have a Lookup activity that queries the SQL Photos table to get the Ids of the rows I want, which feeds a ForEach that executes a pipeline for every id. That pipeline uses the Id to query the Id, ImageType and Data from SQL and writes a Parquet file containing those 3 columns into a blob dataset.

Dropping internal Hive table is not deleting the contents of the underlying Azure Blob Storage

I wish to delete the contents of an Azure Blob container by creating an internal Hive table over the contents of the container and then dropping the table, as shown below. The container contains a bunch of text files. However, dropping the Hive table doesn't appear to delete the contents of the container.
Am I correct in assuming that dropping an internal table will not remove the contents of the container because HDInsight uses Azure Blob Storage as its storage rather than HDFS? Any insight would be greatly appreciated. Thanks.
Cheers
Ryan
--Create internal table
CREATE TABLE temp_logs(
student_id INT,
subject_id INT,
marks INT,
insert_date STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 'wasb://logs@myaccount.blob.core.windows.net/';
--Drop internal table and its underlying files in Azure Blob
DROP TABLE temp_logs;
Dropping an internal Hive table will remove the data; this is the same behavior as in other Hadoop systems.
When you drop the table, the container itself will remain. Even after you delete the cluster, the container will stay.
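One way to confirm whether the table was really created as managed (internal) or external, and where its data lives, is to inspect its metadata before dropping it:
-- check the "Table Type" (MANAGED_TABLE vs EXTERNAL_TABLE) and "Location" fields in the output
DESCRIBE FORMATTED temp_logs;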
