ADF: Data flow sink activity file format for ADLS (Azure)

I want to copy data from multiple tables in an Azure SQL database to ADLS Gen2. I created a pipeline that takes table names as dynamic input values, and then used a Data Flow activity that copies the data to ADLS, with the sink type set to Delta. Some of my tables are copied to ADLS properly in snappy.parquet format, but a few fail with an error saying the column names are invalid for the Delta format.
How can I deal with this error and get the data copied from all tables?
Also, for my knowledge, I wanted to know: are the files generated in the destination folder in ADLS Parquet by default, or is there an option to change that?

Delta format is Parquet underneath. Column names cannot contain characters like " ,;{}()\n\t=", so you have to replace them with _ or another character.
Data flows have easy ways to rename columns in Derived Column or Select transforms, for example with the rule-based mapping sketched below.
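As a minimal sketch (assuming you simply want every disallowed character replaced with an underscore), add a Select transform before the Delta sink with a rule-based mapping that matches every column and rewrites its name; the regex character class is just the set of characters listed above:

Matching condition: true()
Name as: regexReplace($$, '[ ,;{}()\n\t=]', '_')

Here $$ refers to the incoming column name, so every column passes through with the invalid characters swapped for underscores before it reaches the Delta sink.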

Related

How can I transition from Azure Data Lake, with data partitioned by date folders, into Delta Lake

I own an Azure Data Lake Gen2 with data partitioned into nested datetime folders.
I want to provide Delta Lake format to my team, but I am not sure if I should create a new storage account and copy the data into Delta format, or if it would be best practice to transform the current Azure Data Lake into Delta Lake format.
Could anyone provide any tips on this matter?
AFAIK, Delta format is supported only as an inline dataset, and inline datasets are available only in Data flows.
So, my suggestion is to use Data flows for this.
As you have the data in nested datetime folders, I reproduced this with sample date folders and uploaded a sample CSV file into each of the folders 10 and 9.
Create a data flow in ADF and, in the source, select an inline dataset so that you can give the wildcard path you want. Select your data format (Delimited text in my case) and give the linked service as well.
Assuming that your nested folder structure is the same for all files, give the wildcard path according to your path levels.
Now, create a Delta format sink and give the linked service as well.
In the sink settings, give the folder for your Delta files and the update method.
After execution, you can see the Delta format files created in that folder path.

Column names are incorrectly mapped

I was trying to pull/load data from an on-prem data lake to Azure Data Lake using Azure Data Factory.
I was just giving a query to pull all the columns. My sink is Azure Data Lake Gen2.
But my column names are coming out wrong between source and sink.
My column names in the on-prem data lake are like user_id, lst_nm, etc., but in Azure they come out as user_tbl.user_id, user_tbl.lst_nm, etc., where user_tbl is my table name.
I don't want the table name added to the column names.
Azure won't add the table name to the columns by itself. Check the output of the SELECT query you are sending to the source using Preview data in ADF; that will show you the actual column names ADF is getting from the source. If they don't have the table name prefixed, then check whether your ADLS Gen2 destination folder already has a file; if it does, remove the file and try running the pipeline again.
Instead of using the Copy activity, use a Data flow transformation, which allows you to change the column names at the destination dynamically.
Or you can also use the Move and transform activity, which likewise allows you to change column names. Refer to the official tutorial: Dynamically set column names in data flows.
Also check ADF Mapping Data Flows: Create rules to modify column names.
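As a small sketch of that rule-based mapping approach (assuming the unwanted prefix is always the literal user_tbl. from the question), a Select transform can strip it from every matched column name:

Matching condition: true()
Name as: replace($$, 'user_tbl.', '')

Here $$ is the incoming column name, so user_tbl.user_id becomes user_id, user_tbl.lst_nm becomes lst_nm, and so on.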

Ingesting a CSV file through Polybase without knowing the sequence of columns

I am trying to ingest a few CSV files from Azure Data Lake into Azure Synapse using Polybase.
There is a fixed set of columns in each CSV file and the column names are given on the first line. However, the columns can arrive in a different order.
In Polybase, I need to declare an external table, for which I need to know the exact sequence of columns at design time, and hence I cannot create the external table. Are there other ways to ingest the CSV files?
I don't believe you can do this directly with Polybase because, as you noted, the CREATE EXTERNAL TABLE statement requires the column declarations. At runtime, the CSV data is then mapped to those column names.
You could accomplish this easily with Azure Data Factory and a Data Flow (which uses Polybase under the covers to move the data to Synapse) by allowing the Data Flow to generate the table. This works because the table is generated after the data has been read, rather than before as with EXTERNAL tables.
For the sink Data Set, create it with a parameterized table name (and optionally schema).
In the Sink activity, specify "Recreate table".
Pass the desired table name to the sink Data Set from the pipeline.
Be aware that all string-based columns will be defined as VARCHAR(MAX).
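Since all string columns in the auto-generated table come out as VARCHAR(MAX), one option is to tighten the types afterwards with a CTAS and swap the tables. This is only a sketch with made-up table, column, and distribution names:

-- Hypothetical re-typing step after the Data Flow has created dbo.Events_AutoGenerated.
CREATE TABLE dbo.Events_Typed
WITH (DISTRIBUTION = ROUND_ROBIN, CLUSTERED COLUMNSTORE INDEX)
AS
SELECT
    CAST(EventId   AS INT)          AS EventId,
    CAST(EventName AS VARCHAR(100)) AS EventName,
    CAST(EventDate AS DATE)         AS EventDate
FROM dbo.Events_AutoGenerated;

-- Then drop the auto-generated table and rename the typed one into its place.
DROP TABLE dbo.Events_AutoGenerated;
RENAME OBJECT dbo.Events_Typed TO Events_AutoGenerated;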

Getting "Error converting data type VARCHAR to DATETIM"E while copying data from Azure blob to Azure DW through Polybase

I am new to the Azure environment and I am using Data Factory to copy data from a CSV file in Azure Blob Storage, which has three columns (id, age, birth date), to a table in Azure SQL Data Warehouse. The birth date is in the format "MM/dd/yyyy" and I am using Polybase to copy the data from the blob to my table in Azure DW. The columns of the table are defined as (int, int, datetime).
I can copy my data if I use the "Bulk Insert" option in Data Factory, but it gives me an error when I choose the Polybase copy. Changing the date format in the pipeline does not do any good either.
Polybase copies successfully if I change the date format in my file to "yyyy/MM/dd".
Is there a way I can copy data from my blob to my table without having to change the date format in the source file to "yyyy/MM/dd"?
I assume you have created an external file format which you reference in your external table?
The CREATE EXTERNAL FILE FORMAT statement has an option to define how a date is represented: DATE_FORMAT. Set that to how your source data represents datetime values.
So something like this:
CREATE EXTERNAL FILE FORMAT [your-format]
WITH
(
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (
        FIELD_TERMINATOR = '|',
        DATE_FORMAT = 'MM/dd/yyyy' )
);
You can find more about this at: https://learn.microsoft.com/en-us/sql/t-sql/statements/create-external-file-format-transact-sql?view=sql-server-ver15
It seems this error is resolved now. I was giving the date format as 'MM/dd/yyyy', whereas Data Factory expected it to be just MM/dd/yyyy without any quotes.
As per my understanding, I will summarize what I learned while copying data from Azure Blob to Azure SQL Data Warehouse with an MM/dd/yyyy date format in a few points here:
1) If you are using the Azure portal to copy data from blob to Azure SQL Data Warehouse with the Data Factory copy option:
Create a copy data pipeline using Data Factory.
Specify your input data source and your destination data store.
Under field mappings, choose datetime for the column that contains the date, click the little icon on its right to bring up the custom date format field, and enter your date format without quotes, e.g. MM/dd/yyyy as in my case.
Run your pipeline and it should complete successfully.
2) You can use Polybase directly by:
Creating an external data source that specifies the location of your input file, e.g. a CSV file on blob storage in my case.
Creating an external file format that specifies the delimiter and the custom date format, e.g. MM/dd/yyyy, used in your input file.
Creating an external table that defines all the columns present in your source file and uses the external data source and file format defined above.
You can then create your custom tables as select from the external table (CTAS), which is what Niels stated in his answer above. I used Microsoft SQL Server Management Studio for this process. (A rough T-SQL sketch of these steps follows below.)
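The following is only a sketch of those Polybase steps with placeholder names, locations, and credentials; adjust them to your own storage account, container, and file (a database master key must already exist for the scoped credential):

-- Credential and data source pointing at the blob container (placeholder names).
CREATE DATABASE SCOPED CREDENTIAL BlobStorageCredential
WITH IDENTITY = 'user', SECRET = '<storage-account-key>';

CREATE EXTERNAL DATA SOURCE BlobInput
WITH
(
    TYPE = HADOOP,
    LOCATION = 'wasbs://yourcontainer@yourstorageaccount.blob.core.windows.net',
    CREDENTIAL = BlobStorageCredential
);

-- File format with the custom date format from the question.
CREATE EXTERNAL FILE FORMAT CsvWithUsDates
WITH
(
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = ',', DATE_FORMAT = 'MM/dd/yyyy')
);

-- External table over the CSV file, with the columns described in the question.
CREATE EXTERNAL TABLE dbo.PersonExternal
(
    id INT,
    age INT,
    birth_date DATETIME
)
WITH
(
    LOCATION = '/input/person.csv',
    DATA_SOURCE = BlobInput,
    FILE_FORMAT = CsvWithUsDates
);

-- Land the data in a regular table with CTAS.
CREATE TABLE dbo.Person
WITH (DISTRIBUTION = ROUND_ROBIN, CLUSTERED COLUMNSTORE INDEX)
AS
SELECT id, age, birth_date
FROM dbo.PersonExternal;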

What is the difference between using a COPY DATA activity to a SQL table vs using CREATE EXTERNAL TABLE?

I have a bunch of U-SQL activities that manipulate and transform data in an Azure Data Lake. Out of this, I get a CSV file that contains all my events.
Next, I would just use a Copy Data activity to copy the CSV file from the Data Lake directly into an Azure SQL Data Warehouse table.
I extract the information from a bunch of JSON files stored in the Data Lake and create a staging .csv file;
I grab the staging .csv file and a production .csv file, inject the latest changes (avoiding duplicates), and save the production .csv file;
I copy the production .csv file directly to the warehouse table.
I realized that my table contains duplicated rows and, after having tested the U-SQL scripts, I assume that the Copy Data activity somehow merges the content of the CSV file into the table.
Question
I am not convinced I am doing the right thing here. Should I define my warehouse table as an EXTERNAL table that would get its data from the production .csv file, or should I change my U-SQL to only include the latest changes?
Whether you want to use external tables depends on your use case. If you want the data to be stored inside SQL DW for better performance, you have to copy it at some point, e.g. via a stored procedure. You could then just call the stored procedure from ADF, for instance.
Or, if you don't want to or cannot filter out data beforehand, you could also implement an "Upsert" stored procedure in your SQL DW and call that to insert your data instead of the copy activity.
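A minimal sketch of such an upsert procedure, assuming the copy activity lands the CSV into a staging table dbo.Events_Staging and the target is dbo.Events keyed on EventId (all names hypothetical):

CREATE PROCEDURE dbo.UpsertEvents
AS
BEGIN
    -- Update rows that already exist in the target
    -- (implicit join, since older SQL DW versions reject ANSI joins in UPDATE).
    UPDATE t
    SET    t.EventName = s.EventName,
           t.EventDate = s.EventDate
    FROM   dbo.Events t, dbo.Events_Staging s
    WHERE  t.EventId = s.EventId;

    -- Insert rows that are not in the target yet, which avoids the duplicates.
    INSERT INTO dbo.Events (EventId, EventName, EventDate)
    SELECT s.EventId, s.EventName, s.EventDate
    FROM   dbo.Events_Staging s
    WHERE  NOT EXISTS (SELECT 1 FROM dbo.Events t WHERE t.EventId = s.EventId);
END;

The ADF pipeline would then call this procedure (for example with a Stored Procedure activity) after loading the staging table, instead of copying straight into dbo.Events.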
