Renaming files in a nested directory with Azure Data Factory

I have a daily export set up for several subscriptions - the files export like so,
with 7 different directories within daily. I'm simply trying to rename the files to get rid of the underscore for data flows.
My parent pipeline looks like so:
Get Metadata gets the folder names, and ForEach invokes the child pipeline, like so.
Here are the screen grabs of the child pipeline.
Copy data within the ForEach1 - the source.
And now the sink - this is where I want to rename the file. The first time I debugged, it simply copied the files to the correct place with a .txt extension; the next time it got the extension right, but it is not renaming the file.
I replaced #replace(item().name, '_', '-') with #replace(activity('FileInfo').output.itemName, '_','-') and got the following error:
The expression '#replace(activity('FileInfo').output.itemName, '_','-')' cannot be evaluated because property 'itemName' doesn't exist, available properties are 'childItems, effectiveIntegrationRuntime, executionDuration, durationInQueue, billingReference'.
So then I replaced that with
#replace(activity('FileInfo').output.childItems, '_', '-')
but that gives the following error:
Cannot fit childItems return type into the function parameter string
I'm not sure where to go from here.
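For what it's worth, childItems is an array of { name, type } objects, so it can't be handed to replace() as if it were a string; it has to be iterated by a ForEach, with the rename applied to item().name inside the loop. Below is a minimal, abridged sketch of the relevant pipeline JSON under that assumption, reusing the dataset and parameter names that appear later in this question (dsRenamesink, renamedFile); note that in the ADF expression editor these values are typed with the @ prefix.
{
    "name": "ForEach1",
    "type": "ForEach",
    "dependsOn": [
        {
            "activity": "FileInfo",
            "dependencyConditions": [ "Succeeded" ]
        }
    ],
    "typeProperties": {
        "items": {
            "value": "@activity('FileInfo').output.childItems",
            "type": "Expression"
        },
        "activities": [
            {
                "name": "Copy data1",
                "type": "Copy",
                "outputs": [
                    {
                        "referenceName": "dsRenamesink",
                        "type": "DatasetReference",
                        "parameters": {
                            "renamedFile": {
                                "value": "@replace(item().name, '_', '-')",
                                "type": "Expression"
                            }
                        }
                    }
                ]
            }
        ]
    }
}
The Copy activity's source, sink and inputs are omitted here; the point is only that the string function is applied to one child item at a time, never to the childItems array itself.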
Edit 7/14
Making the change from the answer below:
Here is my linked service for the sink dataset with the parameter renamedFile.
Here is the sink on the Copy data1 for the child_Rename pipeline; it grayed out the file extension, as mentioned in the answer.
Now here is the sink container after running the pipeline.
This is the directory structure of the source data - it's dynamically created from the scheduled daily Azure exports.
Here is the output of Get Metadata - FileInfo from the child pipeline:
{
    "childItems": [
        {
            "name": "daily",
            "type": "Folder"
        }
    ],
    "effectiveIntegrationRuntime": "integrationRuntime1 (Central US)",
    "executionDuration": 0,
    "durationInQueue": {
        "integrationRuntimeQueue": 0
    },
    "billingReference": {
        "activityType": "PipelineActivity",
        "billableDuration": [
            {
                "meterType": "AzureIR",
                "duration": 0.016666666666666666,
                "unit": "Hours"
            }
        ]
    }
}
allsubs - source container
daily - directory created by the scheduled export
sub1 - subN - the different subs with scheduled exports
previous-month -> this-month - monthly folders are created automatically
this_fileXX.csv -- files are automatically generated with the underscore in the name - it is my understanding that data flows cannot handle these characters in the file name
allsubs/
└── daily/
    ├── sub1/
    │   ├── previous-month/
    │   │   ├── this_file.csv
    │   │   └── this_file1.csv
    │   ├── previous-month/
    │   │   ├── this_file11.csv
    │   │   └── this_file12.csv
    │   └── this-month/
    └── subN/
        ├── previous-month/
        ├── previous-month/
        └── this-month/
            └── this_fileXX.csv
Edit 2 - July 20
I think I'm getting closer, but there are still some small errors I do not see.
The pipeline now moves all the files from the container allsubs to the container renamed-files, but it is not renaming the files - it looks like so.
Get Metadata - from the dataset allContainers it retrieves the folders with the Child Items argument.
Dataset allContainers shown (preview works, linked service works, no parameters in this dataset).
Next, the ForEach activity takes the output of Get Metadata
for the items: #activity('Get Metadata1').output.childItems
Next shown is the copy data within the ForEach.
The source is the allContainers dataset with the wildcard file path selected and recursive selected; due to the following error, max concurrent connections was set to 1 - but this did not resolve the error.
error message:
Failure happened on 'Sink' side.
ErrorCode=AzureStorageOperationFailedConcurrentWrite,
'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,
Message=Error occurred when trying to upload a file. It's possible because you have multiple concurrent copy activities runs writing to the same file 'renamed-files/rlcosts51122/20220601-20220630/rlcosts51122_082dd29b-95b2-4da5-802a-935d762e89d8.csv'. Check your ADF configuration.,
Source=Microsoft.DataTransfer.ClientLibrary,
''Type=Microsoft.WindowsAzure.Storage.StorageException,
Message=The remote server returned an error: (400) Bad Request.,
Source=Microsoft.WindowsAzure.Storage,
StorageExtendedMessage=The specified block list is invalid.
RequestId:b519219f-601e-000d-6c4c-9c9c5e000000
Time:2022-07-20T15:23:51.4342693Z,,
''Type=System.Net.WebException,
Message=The remote server returned an error: (400) Bad Request.,
Source=Microsoft.WindowsAzure.Storage,'
Copy data source:
Copy data sink - the dataset is dsRenamesink; it's simply another container in a different storage account. The linked service is set up correctly, and it has the parameter renamedFile, but I suspect this is the source of my error. Still testing that.
Sink dataset dsRenamesink:
Parameter page:
Here's the sink in the copy data where the renamed file is passed the iterator from ForEach1 like so:
#replace(item().name,'_','renameworked')
so the underscore would be replaced with 'renameworked' - easy enough to test.
Debugging the pipeline:
The errors look to be consistent for the 7 failures, which were shown above as 'failure happened on the sink side'.
However, going into the storage account sink I can see that all of the files from the source were copied over to the sink, but the files were not renamed, like so.
Pipeline output:
Error messages:
{
    "dataRead": 28901858,
    "dataWritten": 10006989,
    "filesRead": 4,
    "filesWritten": 0,
    "sourcePeakConnections": 1,
    "sinkPeakConnections": 1,
    "copyDuration": 7,
    "throughput": 4032.067,
    "errors": [
        {
            "Code": 24107,
            "Message": "Failure happened on 'Sink' side. ErrorCode=AzureStorageOperationFailedConcurrentWrite,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Error occurred when trying to upload a file. It's possible because you have multiple concurrent copy activities runs writing to the same file 'renamed-files/rlcosts51122/20220601-20220630/rlcosts51122_082dd29b-95b2-4da5-802a-935d762e89d8.csv'. Check your ADF configuration.,Source=Microsoft.DataTransfer.ClientLibrary,''Type=Microsoft.WindowsAzure.Storage.StorageException,Message=The remote server returned an error: (400) Bad Request.,Source=Microsoft.WindowsAzure.Storage,StorageExtendedMessage=The specified block list is invalid.\nRequestId:b519219f-601e-000d-6c4c-9c9c5e000000\nTime:2022-07-20T15:23:51.4342693Z,,''Type=System.Net.WebException,Message=The remote server returned an error: (400) Bad Request.,Source=Microsoft.WindowsAzure.Storage,'",
            "EventType": 0,
            "Category": 5,
            "Data": {
                "FailureInitiator": "Sink"
            },
            "MsgId": null,
            "ExceptionType": null,
            "Source": null,
            "StackTrace": null,
            "InnerEventInfos": []
        }
    ],
    "effectiveIntegrationRuntime": "AutoResolveIntegrationRuntime (Central US)",
    "usedDataIntegrationUnits": 4,
    "billingReference": {
        "activityType": "DataMovement",
        "billableDuration": [
            {
                "meterType": "AzureIR",
                "duration": 0.06666666666666667,
                "unit": "DIUHours"
            }
        ]
    },
    "usedParallelCopies": 1,
    "executionDetails": [
        {
            "source": {
                "type": "AzureBlobFS",
                "region": "Central US"
            },
            "sink": {
                "type": "AzureBlobStorage"
            },
            "status": "Failed",
            "start": "Jul 20, 2022, 10:23:44 am",
            "duration": 7,
            "usedDataIntegrationUnits": 4,
            "usedParallelCopies": 1,
            "profile": {
                "queue": {
                    "status": "Completed",
                    "duration": 3
                },
                "transfer": {
                    "status": "Completed",
                    "duration": 2,
                    "details": {
                        "listingSource": {
                            "type": "AzureBlobFS",
                            "workingDuration": 0
                        },
                        "readingFromSource": {
                            "type": "AzureBlobFS",
                            "workingDuration": 0
                        },
                        "writingToSink": {
                            "type": "AzureBlobStorage",
                            "workingDuration": 0
                        }
                    }
                }
            },
            "detailedDurations": {
                "queuingDuration": 3,
                "transferDuration": 2
            }
        }
    ],
    "dataConsistencyVerification": {
        "VerificationResult": "NotVerified"
    },
    "durationInQueue": {
        "integrationRuntimeQueue": 0
    }
}
All I wanted to do was remove the underscore from the file name to work with data flows... I'm not sure what else to try next.
Next attempt - July 20
It appears that now I have been able to copy and rename some of the files by
changing the sink dataset as follows
#concat(replace(dataset().renamedFile,'_','-'),'',formatDateTime(utcnow(),'yyyyMMddHHmmss'),'.csv')
and removing this parameter from the sink in the copy activity.
Upon debugging this pipeline I get 1 file in the sink and it is named correctly, but there is still something wrong.
Third attempt - 7/20
Further updating to be closer to the original answer:
sink dataset
copy data activity in the sink - concat works
Now after debugging I'm left with 1 file for each of the subs - so there is something still not quite correct.

I reproduced the same thing in my environment.
Go to the sink dataset and open it. First create a parameter, then add dynamic content for the file name; I used this expression: #dataset().sinkfilename
In the copy activity sink, under dataset properties, pass the filename value using the expression #replace(item().name,'_','-') to replace _ with -.
When you create a dataset parameter to pass the filename, the File extension property is automatically disabled.
When the pipeline runs, you can see the file name has been renamed accordingly.
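For reference, a rough sketch of what the parameterised sink dataset JSON ends up looking like with this approach - a DelimitedText dataset on Blob storage is assumed, the dataset and linked service names are placeholders, the container comes from the question above, and the parameter name sinkfilename matches the expression in the answer (ADF serialises the dynamic content with an @ prefix):
{
    "name": "SinkDataset",
    "properties": {
        "linkedServiceName": {
            "referenceName": "AzureBlobStorage1",
            "type": "LinkedServiceReference"
        },
        "parameters": {
            "sinkfilename": {
                "type": "string"
            }
        },
        "type": "DelimitedText",
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "fileName": {
                    "value": "@dataset().sinkfilename",
                    "type": "Expression"
                },
                "container": "renamed-files"
            },
            "columnDelimiter": ",",
            "firstRowAsHeader": true
        }
    }
}
The copy activity inside the ForEach then fills sinkfilename with @replace(item().name,'_','-'), so each file is written out under its hyphenated name.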

Related

Azure Databricks CLI: update workflow/job definition

I have created a pipeline in Azure DevOps to perform the following three steps:
1. Retrieve the job definition from one Databricks workspace and save it as JSON (Databricks CLI config is omitted):
databricks jobs get --job-id $(job_id) > workflow.json
2. Use this JSON to update the workflow in a second (separate) Databricks workspace (the Databricks CLI is first reconfigured to point to the new workspace):
databricks jobs reset --job-id $(job_id) --json-file workflow.json
3. Run the updated job in the second Databricks workspace:
databricks jobs run-now --job-id $(job_id)
However, my pipeline fails at step 2 with the following error, even though the existing_cluster_id is already defined inside the workflow.json. Any idea?
Error: b'{"error_code":"INVALID_PARAMETER_VALUE","message":"One of job_cluster_key, new_cluster, or existing_cluster_id must be specified."}' 
Here is what my workflow.json looks like (hiding some of the details):
{
    "job_id": 123,
    "creator_user_name": "user1",
    "run_as_user_name": "user1",
    "run_as_owner": true,
    "settings": {
        "name": "my-workflow",
        "existing_cluster_id": "abc-def-123-xyz",
        "email_notifications": {
            "no_alert_for_skipped_runs": false
        },
        "webhook_notifications": {},
        "timeout_seconds": 0,
        "notebook_task": {
            "notebook_path": "notebooks/my-notebook",
            "base_parameters": {
                "environment": "production"
            },
            "source": "GIT"
        },
        "max_concurrent_runs": 1,
        "git_source": {
            "git_url": "https://my-org#dev.azure.com/my-project/_git/my-repo",
            "git_provider": "azureDevOpsServices",
            "git_branch": "master"
        },
        "format": "SINGLE_TASK"
    },
    "created_time": 1676477563075
}
I figured out that you don't need to retrieve the entire workflow definition JSON file, as shown in step 1, but only the "settings" part; i.e., modifying step 1 to this solved my issue:
databricks jobs get --job-id $(job_id) | jq .settings > workflow.json
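For reference, jq .settings strips the wrapper fields (job_id, creator_user_name, run_as_user_name, run_as_owner, created_time) and leaves only the settings object, which is presumably why the nested version failed: existing_cluster_id was buried one level deeper than jobs reset expects. The redacted example above would be reduced to something like:
{
    "name": "my-workflow",
    "existing_cluster_id": "abc-def-123-xyz",
    "email_notifications": {
        "no_alert_for_skipped_runs": false
    },
    "webhook_notifications": {},
    "timeout_seconds": 0,
    "notebook_task": {
        "notebook_path": "notebooks/my-notebook",
        "base_parameters": {
            "environment": "production"
        },
        "source": "GIT"
    },
    "max_concurrent_runs": 1,
    "git_source": {
        "git_url": "https://my-org#dev.azure.com/my-project/_git/my-repo",
        "git_provider": "azureDevOpsServices",
        "git_branch": "master"
    },
    "format": "SINGLE_TASK"
}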

Azure Data Factory Get Metadata activity returning "(404) not found" error when getting column count

I am trying to implement a Get Metadata activity to return the column count of files I have in a single blob storage container.
Get Metadata activity is returning this error:
Error
I'm fairly new to Azure Data Factory and cannot solve this. Here's what I have:
Dataset: Source dataset
Name - ten_eighty_split_CSV
Connection - Blob storage
Schema - imported from blob storage file
Parameters - "FileName"; string; "#pipeline().parameters.SourceFile"
Pipeline:
Name: ten eighty split
Parameters: "SourceFile"; string; "#pipeline().parameters.SourceFile"
Settings: Concurrency: 1
Get Metadata activity: Get Metadata
Only argument is "Column count"
It throws the error upon debugging. I am not sure what to do; "(404) not found" is so broad that I could not ascertain a specific solution. Thanks!
The error occurs because you have given an incorrect file name (or the name of a file that does not exist).
Since you are trying to use a blob created event trigger to find the column count, you can use the procedure below:
After configuring the Get Metadata activity, create a storage event trigger. Go to Add trigger -> Choose trigger -> Create new.
Click on continue. You will get a Trigger Run Parameters tab. In this, give the value as #triggerBody().fileName.
Complete the trigger creation and publish the pipeline. Now whenever a file is uploaded into your container (on top of which you created the storage event trigger), it will trigger the pipeline automatically (no need to debug). If the container is empty and you try to debug by giving some value for the sourceFile parameter, it will give the same error.
Upload a sample file to your container. It will trigger the pipeline and give the desired result.
The following is the trigger JSON that I created for my container:
{
    "name": "trigger1",
    "properties": {
        "annotations": [],
        "runtimeState": "Started",
        "pipelines": [
            {
                "pipelineReference": {
                    "referenceName": "pipeline1",
                    "type": "PipelineReference"
                },
                "parameters": {
                    "sourceFile": "#triggerBody().fileName"
                }
            }
        ],
        "type": "BlobEventsTrigger",
        "typeProperties": {
            "blobPathBeginsWith": "/data/blobs/",
            "blobPathEndsWith": ".csv",
            "ignoreEmptyBlobs": true,
            "scope": "/subscriptions/b83c1ed3-c5b6-44fb-b5ba-2b83a074c23f/resourceGroups/<user>/providers/Microsoft.Storage/storageAccounts/blb1402",
            "events": [
                "Microsoft.Storage.BlobCreated"
            ]
        }
    }
}
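For completeness, here is a sketch of how the dataset's FileName parameter would typically be wired into the blob path so the trigger value actually reaches the file location - a DelimitedText dataset is assumed, the linked service name is a placeholder, and the container and folder follow the blobPathBeginsWith value above:
{
    "name": "ten_eighty_split_CSV",
    "properties": {
        "linkedServiceName": {
            "referenceName": "AzureBlobStorage1",
            "type": "LinkedServiceReference"
        },
        "parameters": {
            "FileName": {
                "type": "string"
            }
        },
        "type": "DelimitedText",
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "fileName": {
                    "value": "@dataset().FileName",
                    "type": "Expression"
                },
                "folderPath": "blobs",
                "container": "data"
            },
            "firstRowAsHeader": true
        }
    }
}
The Get Metadata activity passes @pipeline().parameters.SourceFile into FileName, and the trigger fills sourceFile with @triggerBody().fileName, so the chain only resolves to an existing blob when a real file lands in the container - which is why debugging with an arbitrary value returns 404.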

Azure Data Factory Cannot retrieve output

With ADF I created a web activity with the POST method and got my output. I need the output, which is in JSON, to be stored in my blob storage. I tried the copy data method, but it gives me the error "Failure happened on 'Source' side.
ErrorCode=UserErrorHttpStatusCodeIndicatingFailure,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=The HttpStatusCode 404 indicates failure. { "statusCode": 404, "message": "Resource not found" },Source=Microsoft.DataTransfer.ClientLibrary,'"
I have added a dynamic value as an input in the copy activity, but it gives me an error in debug. I am able to see the desired output in the input (debug mode), but the output is not the same. Here is what it shows as output:
{
    "effectiveIntegrationRuntime": "DefaultIntegrationRuntime (West US)",
    "executionDuration": 0,
    "durationInQueue": {
        "integrationRuntimeQueue": 1
    },
    "billingReference": {
        "activityType": "ExternalActivity",
        "billableDuration": [
            {
                "meterType": "AzureIR",
                "duration": 0.016666666666666666,
                "unit": "Hours"
            }
        ]
    }
}
The goal here is to get the output stored in a blob as CSV so I can import that file into a SQL table. How do I get the output stored in blob storage?
I was able to get this to work by simply using a copy activity. The reason it was failing at first was the relative URL. On the REST dataset, after adding the connection, I separated the base and relative URLs. I added the REST dataset as a source, and for the sink I just added a SQL table. I was able to store the response directly to the table.
Yes, the point is to store only the domain name in the HTTP linked service, and the whole relative URL (e.g. /api/v1/products) in the input dataset.
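A sketch of that split, shown here with the REST connector (a REST linked service plus a REST dataset); the URLs and names are placeholders, anonymous authentication is assumed, and only the relative path from the example above is reused.
Linked service - holds only the base URL:
{
    "name": "RestService1",
    "properties": {
        "type": "RestService",
        "typeProperties": {
            "url": "https://api.example.com",
            "enableServerCertificateValidation": true,
            "authenticationType": "Anonymous"
        }
    }
}
Dataset - carries the relative URL:
{
    "name": "RestResource1",
    "properties": {
        "linkedServiceName": {
            "referenceName": "RestService1",
            "type": "LinkedServiceReference"
        },
        "type": "RestResource",
        "typeProperties": {
            "relativeUrl": "/api/v1/products"
        }
    }
}
The copy activity then uses the REST dataset as its source and a SQL table (or a blob CSV dataset) as its sink, which matches the fix described above.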

Terraform data source remote state doesn't work

I have the following Terraform:
data "terraform_remote_state" "stack" {
backend = "local"
config {
path = "terraform.tfstate"
}
}
output "diditwork" {
value = "${data.terraform_remote_state.stack.aws_autoscaling_group.main.id}"
}
and I have a terraform.tfstate file in the same folder:
{
  "version": 3,
  "terraform_version": "0.9.3",
  "serial": 14,
  "lineage": "dc16a61f-72dd-435b-ba3f-5e36e14aace2",
  "modules": [
    {
      "path": [
        "root"
      ],
      "outputs": {},
      "resources": {
        "aws_autoscaling_group.main": {
          "type": "aws_autoscaling_group",
          "depends_on": [
            "aws_launch_configuration.lc"
          ],
          "primary": {
            "id": "djin-sample-asg-stag",
            "attributes": {
              "arn": "arn:aws:autoscaling:us-east-1:174120285419:autoScalingGroup:04c470fa-45f8-4711-aa31-b3ede40d6…
but for some reason, when I do a terraform apply, my output doesn't print anything for the autoscaling group id. The apply is successful and it doesn't even throw any error. What am I missing?
This is wrong.
value = "${data.terraform_remote_state.stack.aws_autoscaling_group.main.id}"
You can only get root-level outputs via the remote state data source.
https://www.terraform.io/docs/providers/terraform/d/remote_state.html#root-outputs-only
Only the root level outputs from the remote state are accessible. Outputs from modules within the state cannot be accessed. If you want a module output to be accessible via a remote state, you must thread the output through to a root output.
So, you will first need to output your autoscaling id, something like:
output "asg_id" {
value = "${aws_autoscaling_group.main.id}"
}
And then, where you read from the remote state data source, you will do:
output "diditwork" {
value = "${data.terraform_remote_state.stack.asg_id}"
}
Also, pointing a remote state data source at the same location is generally a bad idea for backends that support locking (or for scenarios where you really want to use locking). I am assuming you are doing the remote data source from the same location only as an experiment, so no harm done, but in real usage you should just use the value directly, as in the output above.

How to use Azure Data Factory for importing data incrementally from MySQL to Azure Data Warehouse?

I'm using Azure Data Factory to periodically import data from MySQL to Azure SQL Data Warehouse.
The data goes through a staging blob storage on an Azure storage account, but when I run the pipeline it fails because it can't separate the blob text back to columns. Each row that the pipeline tries to insert into the destination becomes a long string which contains all the column values delimited by a "⯑" character.
I have used Data Factory before, without the incremental mechanism, and it worked fine. I don't see a reason it would cause such behavior, but I'm probably missing something.
I'm attaching the JSON that describes the pipeline with some minor naming changes, please let me know if you see anything that can explain this.
Thanks!
EDIT: Adding exception message:
Failed execution Database operation failed. Error message from database execution:
ErrorCode=FailedDbOperation,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,
Message=Error happened when loading data into SQL Data Warehouse.,
Source=Microsoft.DataTransfer.ClientLibrary,
''Type=System.Data.SqlClient.SqlException,
Message=Query aborted-- the maximum reject threshold (0 rows) was reached while reading from an external source: 1 rows rejected out of total 1 rows processed.
(/f4ae80d1-4560-4af9-9e74-05de941725ac/Data.8665812f-fba1-407a-9e04-2ee5f3ca5a7e.txt)
Column ordinal: 27, Expected data type: VARCHAR(45) collate SQL_Latin1_General_CP1_CI_AS, Offending value: *ROW OF VALUES* (Tokenization failed), Error: Not enough columns in this line.,},],'.
{
    "name": "CopyPipeline-move_incremental_test",
    "properties": {
        "activities": [
            {
                "type": "Copy",
                "typeProperties": {
                    "source": {
                        "type": "RelationalSource",
                        "query": "$$Text.Format('select * from [table] where InsertTime >= \\'{0:yyyy-MM-dd HH:mm}\\' AND InsertTime < \\'{1:yyyy-MM-dd HH:mm}\\'', WindowStart, WindowEnd)"
                    },
                    "sink": {
                        "type": "SqlDWSink",
                        "sqlWriterCleanupScript": "$$Text.Format('delete [schema].[table] where [InsertTime] >= \\'{0:yyyy-MM-dd HH:mm}\\' AND [InsertTime] <\\'{1:yyyy-MM-dd HH:mm}\\'', WindowStart, WindowEnd)",
                        "allowPolyBase": true,
                        "polyBaseSettings": {
                            "rejectType": "Value",
                            "rejectValue": 0,
                            "useTypeDefault": true
                        },
                        "writeBatchSize": 0,
                        "writeBatchTimeout": "00:00:00"
                    },
                    "translator": {
                        "type": "TabularTranslator",
                        "columnMappings": "column1:column1,column2:column2,column3:column3"
                    },
                    "enableStaging": true,
                    "stagingSettings": {
                        "linkedServiceName": "StagingStorage-somename",
                        "path": "somepath"
                    }
                },
                "inputs": [
                    {
                        "name": "InputDataset-input"
                    }
                ],
                "outputs": [
                    {
                        "name": "OutputDataset-output"
                    }
                ],
                "policy": {
                    "timeout": "1.00:00:00",
                    "concurrency": 10,
                    "style": "StartOfInterval",
                    "retry": 3,
                    "longRetry": 0,
                    "longRetryInterval": "00:00:00"
                },
                "scheduler": {
                    "frequency": "Hour",
                    "interval": 1
                },
                "name": "Activity-0-_Custom query_->[schema]_[table]"
            }
        ],
        "start": "2017-06-01T05:29:12.567Z",
        "end": "2099-12-30T22:00:00Z",
        "isPaused": false,
        "hubName": "datafactory_hub",
        "pipelineMode": "Scheduled"
    }
}
It sounds like what you're doing is right, but the data is poorly formed (a common problem, non-UTF-8 encoding), so ADF can't parse the structure as you require. When I encounter this I often have to add a custom activity to the pipeline that cleans and prepares the data so it can then be used in a structured way by downstream activities. Unfortunately this is a big overhead in the development of the solution and will require you to write a C# class to deal with the data transformation.
Also remember ADF has no compute of its own; it only invokes other services, so you'll also need an Azure Batch service to execute the compiled code.
Sadly there is no magic fix here. Azure is great at extracting and loading your perfectly structured data, but in the real world we need other services to do the transform or cleaning, meaning we need a pipeline that can ETL, or as I prefer, ECTL.
Here's a link on creating ADF custom activities to get you started: https://www.purplefrogsystems.com/paul/2016/11/creating-azure-data-factory-custom-activities/
Hope this helps.
I've been struggling with the same message, sort of, when importing from an Azure SQL DB to Azure DWH using Data Factory v2 with staging (which implies PolyBase). I've learned that PolyBase will fail with error messages related to incorrect data types etc. The message I received is very similar to the one mentioned here, even though I'm not using PolyBase directly from SQL, but via Data Factory.
Anyway, the solution for me was to avoid NULL values for columns of decimal or numeric type, e.g. ISNULL(mynumericCol, 0) as mynumericCol.
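If NULL numerics are indeed what the load is choking on, the simplest place to apply that fix in the pipeline above is the copy activity's source query - for example (the column names are placeholders taken from the columnMappings, and COALESCE stands in for ISNULL because the source here is MySQL, where ISNULL only takes a single argument):
"source": {
    "type": "RelationalSource",
    "query": "$$Text.Format('select column1, column2, COALESCE(column3, 0) as column3 from [table] where InsertTime >= \\'{0:yyyy-MM-dd HH:mm}\\' AND InsertTime < \\'{1:yyyy-MM-dd HH:mm}\\'', WindowStart, WindowEnd)"
}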
