Transforming CSV column names in Azure Synapse during implicit copy

I am building a data pipeline in Azure Synapse. I want to copy a 500 GB CSV file from a Blob container and convert it into an Azure Data Lake Storage Gen2 table. Before I copy it into the table, I want to make some changes to the data using a Data Flow block, to rename some columns and apply other transformations.
Is it possible to copy the data and apply the transformations implicitly, without a staging Parquet store?
If yes, how do I apply the transformations implicitly? For example: remove dashes ("-") from all column names.

You can use rule-based mapping in the select transformation to remove the hyphen from all the column names.
Matching condition: true()
Output column name expression: replace($$, '-','')
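If it helps to see the effect of the rule outside the data flow, here is a minimal Python sketch of what it does to the column names; pandas and the file names are illustrative assumptions, not part of the pipeline:

import pandas as pd

# Illustrative only: mimic the select transformation's rule-based mapping
# (match true(), output name replace($$, '-', '')) on a local CSV.
df = pd.read_csv("input.csv")  # hypothetical sample file

# Strip every hyphen from every column name; the data itself is untouched.
df.columns = [c.replace("-", "") for c in df.columns]

df.to_csv("input_renamed.csv", index=False)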

Related

I'm unable to map drifted columns in Azure Data Factory (ADF)

I can't map my drifted columns in an ADF data flow. I'm able to do it manually, but that isn't feasible as I have 1020 columns. The file is a .csv.
I see a message: 'This drifted column is not in the source schema and therefore can only be referenced with pattern matching expressions'
I was hoping to map the drifted columns from my source data in the data flow.
With > 1k columns, you should consider NOT mapping those columns. Just use column patterns inside your transformation expressions to access columns. Otherwise, ADF will have to materialize the entire 1k+ columns as a physical projection.
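As a rough analogy outside ADF, the idea of a pattern-based rule instead of a per-column mapping looks like the following Python sketch; the file name, dtype check, and trim rule are placeholders:

import pandas as pd

# Illustrative analogy only: rather than enumerating 1020 columns, apply a
# rule to every column that matches a pattern, the way an ADF column pattern
# expression would, so no physical projection of all columns is spelled out.
df = pd.read_csv("wide_file.csv")  # hypothetical 1020-column file

for col in df.columns:
    if df[col].dtype == object:        # "all string-like columns" pattern
        df[col] = df[col].str.strip()  # placeholder transformation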

Azure Data Factory Merge to files before inserting in to DB

We have two files, a ^-delimited file and a comma-separated .txt file, which are stored in Blob Storage like below.
File1's fields are:
ItemId^Name^c1^type^count^code^Avail^status^Ready
File2's fields are:
ItemId,Num,c2
The first column in both files is the key, and based on it I need to insert them into one table in the Azure DB using Azure Data Factory. Can anyone suggest how this can be done in ADF? Should we merge the two files into one file before inserting into the database?
The Azure DB columns are:
ItemId Name c1 type count code Avail status Ready Num c2
So the result should look like:
Item1 ABC(S) 1234 Toy 10 N N/A POOL N/A 19 EM
Item2 DEF(S) 5678 toy 7 X N/A POOL N/A 6 MP
I was referring to Merging two or more files from a storage account based on a column using Azure Data Factory, but couldn't understand whether we can merge the two files before inserting into the DB.
You can use the two files to create two datasets, join them with a join transformation, and simply sink to the SQL table in a data flow.
Here an inner join is used; you can adapt it to use whichever join type you prefer.
In the data preview you can see the join successfully merged the two files/data sources.
Adjust the field mapping in the sink if needed.
Here is the arrow-separated.csv I used:
ItemId^Name^c1^type^count^code^Avail^status^Ready
Item1^ABC(S)^1234^Toy^10^N^N/A^POOL^N/A
Item2^DEF(S)^5678^toy^7^X^N/A^POOL^N/A
Here is the comma-separated.csv I used:
ItemId,Num,c2
Item1,19,EM
Item2,6,MP
Result in DB:
ItemId Name c1 type count code Avail status Ready Num c2
Item1 ABC(S) 1234 Toy 10 N N/A POOL N/A 19 EM
Item2 DEF(S) 5678 toy 7 X N/A POOL N/A 6 MP
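Outside of ADF, the same join can be sketched in a few lines of Python against the two sample files above; pandas here only illustrates the join result and is not part of the pipeline:

import pandas as pd

# Read the two sample files shown above.
file1 = pd.read_csv("arrow-separated.csv", sep="^")  # caret-delimited file
file2 = pd.read_csv("comma-separated.csv")           # comma-delimited file

# Inner join on the shared key, mirroring the join transformation in the
# data flow; change how="inner" if you prefer another join type.
merged = file1.merge(file2, on="ItemId", how="inner")

print(merged)
# Columns: ItemId Name c1 type count code Avail status Ready Num c2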

Azure Data Factory - Exists transformation in Data Flow with generic dataset

I'm having issues using the Exists Transformation within a Data Flow with a generic dataset.
I have two sources (one from a staging table "sourceStg", one from a DWH table "sourceDwh") and want to check whether the UniqueIdentifier column in the staging table exists in the UniqueIdentifier column of the DWH table. For that I have a generic dataset which I query with a SQL statement containing parameters.
When I open the "Exists settings" I cannot choose any column from the source in the conditions, since the source is generic and has no projection until I run the data flow. However, I have a parameter from the parent pipeline which provides the name of the column containing the UniqueIdentifier (the column names in staging and DWH are the same).
I tried adding the statement "byName($UniqueIdentifier)" in the left and right column fields, but the engine resolves both to the sourceStg column, since the source-transformation prefix is missing and it defaults to the first source. What I'm basically trying to achieve is a statement like the following, which specifies the correct source transformation and, via a parameter, the column containing the unique identifier.
exists(sourceStg#$UniqueIdentifier == sourceDwh#$UniqueIdentifier)
But either the expression cannot be parsed, or the result does not retrieve the actual UniqueIdentifier value from the column and instead writes the literal statement (e.g. sourceStg#$UniqueIdentifier) as the column value.
The only workaround I have found so far is two derived columns which add a suffix to the UniqueIdentifier column in one source, and a new parameter $UniqueIdentiferDwh which is populated with the parameter $UniqueIdentifier and the same suffix used in the derived column.
Any Azure Data Factory experts out there to help?
Thanks in advance!
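For clarity, the comparison I'm after is roughly the following, with the key column name passed in as a parameter; this is just a plain Python sketch with placeholder names, not a data flow expression:

import pandas as pd

def exists_check(source_stg: pd.DataFrame,
                 source_dwh: pd.DataFrame,
                 unique_identifier: str) -> pd.DataFrame:
    """Keep staging rows whose key value already exists in the DWH table,
    with the key column name supplied as a parameter (like $UniqueIdentifier)."""
    return source_stg[source_stg[unique_identifier]
                      .isin(source_dwh[unique_identifier])]

# Hypothetical usage, with the column name coming from a pipeline parameter:
# matched = exists_check(stg_df, dwh_df, unique_identifier="CustomerKey")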

Azure Data Factory Copy Activity is dropping columns on the floor

first time, long time.
I'm running an import of a CSV file that has 734 columns in an Azure Data Factory Copy Activity. Data Factory is not reading the last 9 columns and is populating them with NULL. Even in the preview I can see that the columns have no values, but the schema for those columns is detected. Is there a 725-column limit in Copy?
As Joel said, there is no restriction at 725 or so columns. I suggest:
Go to the mapping tab and pick only the 726th column (if you have a header it will be easy; otherwise ADF will most probably generate a header like Prop_726), then copy the data to a blob as the sink. If the blob has the field, that means you have a data type issue on the table.
Let me know how it goes. If you are still facing the issue, please share some dummy data for the 726th column.
Here is what happened. I had the file in zip folders, and I thought I had to unzip the files first to process them. It turns out that when unzipping through ADF, it stripped the quotation marks from my columns, and then one of the columns had an escape character in it. That escape character shifted everything over, and resulted in me losing nine columns.
But I did learn a bunch of things NOT to do, so it wasn't a total waste of time. Thanks for the answers!
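For anyone hitting something similar, here is a quick Python illustration of the failure mode with made-up sample rows: once the quotes are gone, a field containing the delimiter shifts every later value over, which is how trailing columns end up NULL.

import csv
import io

# Made-up sample rows: the same record with and without its quotation marks.
quoted   = 'id,comment,col_a,col_b\n1,"hello, world",x,y\n'
stripped = 'id,comment,col_a,col_b\n1,hello, world,x,y\n'

print(list(csv.reader(io.StringIO(quoted))))
# [['id', 'comment', 'col_a', 'col_b'], ['1', 'hello, world', 'x', 'y']]

print(list(csv.reader(io.StringIO(stripped))))
# [['id', 'comment', 'col_a', 'col_b'], ['1', 'hello', ' world', 'x', 'y']]
# The values no longer line up with the header, so downstream mapping
# sees shifted or missing columns.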

ADFv2 trouble with column mapping (reposting)

I have a source .csv with 21 columns and a destination table with 25 columns.
Not ALL columns within the source have a home in the destination table and not all columns in the destination table come from the source.
I cannot get my CopyData task to let me pick and choose how I want the mapping to be. The only way I can get it to work so far is to load the source data to a "holding" table that has a 1:1 mapping and then execute a stored procedure to insert data from that table into the final destination.
I've tried altering the schemas on both the source and destination to match but it still errors out because the ACTUAL source has more columns than the destination or vice versa.
This can't possibly be the most efficient way to accomplish this but I'm at a loss as to how to make it work.
Yes I have tried the user interface, yes I have tried the column schemas, no I can't modify the source file and shouldn't need to.
The error code that is returned is some variation on:
"errorCode": "2200",
"message": "ErrorCode=UserErrorInvalidColumnMappingColumnCountMismatch,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Invalid column mapping provided to copy activity: '{LONG LIST OF COLUMN MAPPING HERE}', Detailed message: Different column count between target structure and column mapping. Target column count:25, Column mapping count:16. Check column mapping in table definition.,Source=Microsoft.DataTransfer.Common,'",
"failureType": "UserError",
"target": "LoadPrimaryOwner"
Tim F., please view the statements in Schema mapping in copy activity:
Column mapping supports mapping all or a subset of columns in the source dataset "structure" to all columns in the sink dataset "structure".
The following are error conditions that result in an exception:
1. Source data store query result does not have a column name that is specified in the input dataset "structure" section.
2. Sink data store (if with pre-defined schema) does not have a column name that is specified in the output dataset "structure" section.
3. Either fewer columns or more columns in the "structure" of the sink dataset than specified in the mapping.
4. Duplicate mapping.
So you can see that all the columns in the sink dataset need to be mapped. Since you can't change the destination, maybe you shouldn't struggle with an unsupported feature.
Of course, you could use the stored procedure mentioned in your description. That's a perfect workaround and not very troublesome. For the usage details, you could refer to my previous cases:
1. Azure Data Factory activity copy: Evaluate column in sink table with #pipeline().TriggerTime
2. Azure Data factory copy activity failed mapping strings (from csv) to Azure SQL table sink uniqueidentifier field
In addition, if you really want to avoid the above solution, you could submit feedback to the ADF team about your desired feature.
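To make those conditions concrete, here is a small hypothetical Python sketch of the same checks; the function and names are made up for illustration and this is not ADF's actual validation code:

def validate_column_mapping(source_columns, sink_columns, mapping):
    """Hypothetical restatement of the documented error conditions.
    mapping is a dict of {source column: sink column}."""
    errors = []
    # 1. Every mapped source column must exist in the source "structure".
    for src in mapping:
        if src not in source_columns:
            errors.append(f"source column '{src}' missing from source structure")
    # 2. Every mapped sink column must exist in the sink "structure".
    for dst in mapping.values():
        if dst not in sink_columns:
            errors.append(f"sink column '{dst}' missing from sink structure")
    # 3. The mapping must cover the sink structure exactly; a mismatch is what
    #    yields "Target column count:25, Column mapping count:16".
    if len(mapping) != len(sink_columns):
        errors.append(f"Target column count:{len(sink_columns)}, "
                      f"Column mapping count:{len(mapping)}")
    # 4. No two source columns may map to the same sink column.
    if len(set(mapping.values())) != len(mapping):
        errors.append("Duplicate mapping")
    return errors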
