Azure Data Factory removing spaces from column names of csv file - azure

I'm a bit new to azure data factory so apologies if I'm missing anything obvious. I've done several searches and I can't find anything that quite fits.
So the situation is that we have an existing pipeline that will take the path to a csv file and pass this in as a delimited data set. As a sink it is using a parquet data set. This is a generic process that we can pass any delimited file into and it will output it as parquet.
This has been working well but now we have started receiving files with spaces and special characters in the header which causes the output to parquet to fail. Unfortunately we don't have control over the format of the files we receive so I can't handle this at source.
What I would like to do is on ingestion of the file replace any spaces and other special characters in the header with an underscore. If I were doing this on premise I could quickly create a powershell script to do it. I had thought about creating a custom task in AFD to call a powershell script to do this in the blob storage but that seems more complicated than it should be. Is there something else I can do to get this process working while keeping it generic?

As #Joel Cochran mentioned, you can use the below expression in Select transformation to replace space and special characters in the header.
regexReplace($$,'[^a-zA-Z]','_')
Source:
In Select transformation, remove the auto mappings and add new rule base mapping to use this expression.
preview:

You can change the output filename not directly in the Copy activity, assuming you are using this activity.
The workaround is to use a parameter for the filename output that you can cleanup.
You can use the Get Metadata activity to get all filenames from the source csv files.
Then loop over these files with a foreach activity.
Within the foreach activity you can set the output filename with the new name with the cleaned value.
The function could look like this:
#replace(item().name, ' ', '_')
More information on the replace function

Related

Copy CSV File with Multiline Attribute with Azure Synapse Pipeline

I have a CSV File in the Following format which want to copy from an external share to my datalake:
Test; Text
"1"; "This is a text
which goes on on a second line
and on on a third line"
"2"; "Another Test"
I do now want to load it with a Copy Data Task in an Azure Synapse Pipeline. The result is the following:
Test; Text
"1";" \"This is a text"
"which goes on on a second line";
"and on on a third line\"";
"2";" \"Another Test\""
So, yo see, it is not handling the Multi-Line Text correct. I also do not see an option to handle multiline text within a Copy Data Task. Unfortunately i'm not able to use a DataFlow Task, because it is not allowing to run with an external Azure Runtime, which i'm forced to use, due to security reasons.
In fact, i'm of course not speaking about this single test file, instead i do have x thousands of files.
My settings for the CSV File look like follows:
CSV Connector Settings
Can someone tell me how to handle this kind of multiline data correctly?
Do I have any other options within Synapse (apart from the Dataflows)?
Thanks a lot for your help
Well turns out this is not possible with a CSV File.
The pragmatic solution is to use "binary" files instead, to transfer the CSV Files and only load and transform them later on with a Python Notebook in Synapse.
You can achieve this in azure data factory by iterating through all lines and check for delimiter in each line. And then, use string manipulation functions with set variable activities to convert multi-line data to a single line.
Look at the following example. I have a set variable activity with empty value (taken from parameter) for req variable.
In lookup, create a dataset with following configuration to the multiline csv:
In foreach, where I iterate each row by giving items value as #range(0,sub(activity('Lookup1').output.count,1)). Inside for each, I have an if activity with following condition:
#contains(activity('Lookup1').output.value[item()]['Prop_0'],';')
If this is true, then I concat the current result to req variable using 2 set variable activities.
temp: #if(contains(activity('Lookup1').output.value[add(item(),1)]['Prop_0'],';'),concat(variables('req'),activity('Lookup1').output.value[item()]['Prop_0'],decodeUriComponent('%0D%0A')),concat(variables('req'),activity('Lookup1').output.value[item()]['Prop_0'],' '))
actual (req variable): #variables('val')
For false, I have handled the concatenation in the following way:
temp1: #concat(variables('req'),activity('Lookup1').output.value[item()]['Prop_0'],' ')
actual1 (req variable): #variables('val2')
Now, I have used a final variable to handle last line of the file. I have used the following dynamic content for that:
#if(contains(last(activity('Lookup1').output.value)['Prop_0'],';'),concat(variables('req'),decodeUriComponent('%0D%0A'),last(activity('Lookup1').output.value)['Prop_0']),concat(variables('req'),last(activity('Lookup1').output.value)['Prop_0']))
Finally, I have taken copy data activity with a sample source file with 1 column and 1 row (using this to copy our actual data).
Now, take source file configuration as shown below:
Create an additional column with value as final variable value:
Create a sink with following configuration and select mapping for only above created column:
When I run the pipeline, I get the data as required. The following is an output image for reference.

run ibm data stage job with different file in same job

I created a job to input excel data into database. I need the job to be reusable for different excel version. The columns of the excel will be the same but only the values will change, it's like inserting newest excel values version to the database.
Example, the file of sales_report_january.xlsx , sales_report_february.xlsx both have same columns and only the row values is different. I need the job to be able to process both files without changing anything else except the file path. Because recreating different job with the same everything(except for the filepath) for the same task seems inefficient.
Is it available to do this in ibm data stage or do i need to remap everything despite it doesn't need any change? i already tried it by changing the file path manually but it raised error.
In a word: Parameter
Construct your job using a job parameter for the pathname of the Excel workbook.
Whichever stage you are using to read the worksheet will have the workbook name set up as reference(s) to that parameter.
Tip: Use two parameters; one for the dirname part of the pathname and one for the actual name of the workbook. This is a more flexible design in the long run.
I can think of at least four ways to do this. Usually, if the files are all in the same directory, we use looping in the sequence job to process a list of the file names obtained through an appropriate command (such as ls -m pattern for UNIX/Linux). Capture the output, convert the newlines to a delimiter such as comma if necessary, and use that list in the StartLoop activity.

Unable to copy file from SFTP in Azure Data Factory when using wildcard(*) in the filename

I am unable to copy csv files from an SFTP connection to blob storage when using the wildcard(*) in the filename.
More specifically, I receive csv files in the SFTP on a daily basis, and they are of the format: "ddMMyyyyxxxxxx.csv", where "xxxxxx" is the timestamp. More concretely, my csv file for the 13th of March is: "13032019083647.csv", while for the 14th of March: "14032019083556.csv". Obviously, the timestamp is different for every day, thus I want to copy the file independently of whatever strings exists between the date and the the file extenstion.
In the "File" subfield of the "File path" of the "Connection" tab of my subset, I give as input: "13032019*.csv", as instructed by the help icon next to the field:
When I do so, my Debug run fails with:
{"errorCode": "2200", "message":
"ErrorCode=UserErrorInvalidCopyBehaviorBlobNameNotAllowedWithPreserveOrFlattenHierarchy,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Cannot
adopt copy behavior PreserveHierarchy when copying from folder to a
single file.,Source=Microsoft.DataTransfer.ClientLibrary}
I receive a similar error no matter which type of copy behaviour I choose. I have also tried experimenting with the fileFilter parameter (even though ADF warns that the same behaviour can be achieved with the fileName option), but I still end up getting the same error.
For further clarification, I am attaching the Code segment that ADF produces for this configuration:
I should also mention, that when using the full fileName in the corresponding field, namely the value: "13032019083647.csv", copying works normally.
Any help would be greatly appreciated!
My guess it might get two files with wildcard operation.
In such cases we need to use metadata activity, filter activity and for-each activity to copy these files.
1.Metadata activity : Use data-set in these activity to point the particular location of the files and pass the child Items as the parameter.
2.Filter activity : Use filter to filter the files based on your needs.
3.For-each activity : In the For-each activity get Items from the previous activity and add copy activity inside the for-each.
In copy activity the source data set should be #item().name.
I hope this will solve your issue.
What worked for me was the following: I kept the same regex for the input file, but I defined as "Copy behaviour: Merge Files". Since as mentioned, there is only 1 file that satisfies the regex condition, only 1 file was created as output. I am aware that this is a sort of "dirty" solution, but it did the trick for me.

Skip element in BizTalk flat file assembly?

I've been tasked to map an input xml (actually an SAP idoc xml), and to generate a number of flat files. Each input xml may yield multiple output files (one output file per lot number), so I will be using xsl:key and the key() function in my mapping, based on the lot number
The thing is, the lot number itself will not be in the file itself, but the output file name needs to contain that lot number value.
So the question really is: can I map the lot number to the xml and have the flat file assembler skip it when it produces the file? Or is there another way the lot number can be applied as file name by the assembly without having it inside the file itself?
In your orchestration you can set a context property for each output message:
msgOutput(FILE.ReceivedFileName) = "DynamicStuff";
msgOutput then goes to the send shape.
In your send port you set the output file like this:
FixedStuff_%SourceFileName%.xml
The result:
FixedStuff_DynamicStuff.xml
If the value is not required in the message content, don't map it. That's it.
To insert at value in the file name, lot number in this case, you will need to promote that value to the FILE.ReceivedFileName Context Property. Then, you can use the %SourceFileName% Macro as part of the name setting in the Send Port. You can set FILE.ReceivedFileName by either Property Promotion or xpath() in an Orchestration.
Bonus: Sorting and Grouping in xslt is rather unwieldy, which is why I don't do that anymore. Instead, you can use SQL: BizTalk: Sorting and Grouping Flat File Data In SQL Instead of XSL

Azure Data Factory V2 Copy Activity file filters

I am using Data Factory v2 and I currently have a simple copy activity which copies files from an FTP server to blob storage. The file names on this server are of the following form :
File_{Year}{Month}{Day}.zip
In order to download the most recent file I add this filter to my input dataset json file :
"fileName": {
"value": "#concat('File_',formatDateTime(utcnow(), 'yyyyMMdd'), '.zip')",
"type": "Expression"
}
I now want to be able to download yesterday's file which is possible using adddays().
However I would like to be able to do this in the same copy activity and it seems that Data Factory v2 does not allow me to use the following kind of regular expression logic :
#concat('File_',formatDateTime(utcnow(), 'yyyyMMdd'), '.zip') || #concat('File_', formatDateTime(adddays(utcnow(), -1), 'yyyyMMdd'), '.zip')
Is this possible or do I need a separate activity ?
It would seem strange to need a second activity since a Copy Activity can only take a single input but if the regex is simple enough, then multiple files are treated as a single input and if not then multiple files are treated as multiple inputs.
The '||' won't work since it will be evaluate as a single string.
But I can provided two solutions for this.
using a tumbling window daily trigger and set the start time as yesterday. So it will trigger two pipeline run.
using Foreach activity + copy activity. The foreach activity iterate an array to pass the yesterday and today to the copy activity.
Btw, you could just use string interpolation expression instead of concat. They are the same.
File_#{formatDateTime(utcnow(), 'yyyyMMdd')}.zip
I would suggest you to read about get metadata activity. This can be helpful in your scenario I think.
https://learn.microsoft.com/en-us/azure/data-factory/control-flow-get-metadata-activity
You have itemName property, lastModified property, check it out.

Resources