Dynamic selection of storage table in Azure Data Factory

I've got an existing set of Azure Storage tables, one per client, that hold events in a multi-tenant cloud system.
E.g., there might be three tables to hold sign-in information:
ClientASignins
ClientBSignins
ClientCSignins
Is there a way to dynamically loop through these as part of either a copy operation or something like a Pig script?
Or is there another way to achieve this result?
Many thanks!

If you keep track of these tables in another location, like Azure Storage, you could use PowerShell to loop through each of them and create a Hive table over each. For example:
foreach ($t in $tableList) {
    # Build the Hive DDL that maps an external Hive table onto the Azure storage table.
    # The column list (IntValue int) is just an example schema.
    $hiveQuery = "CREATE EXTERNAL TABLE $($t.tableName)(IntValue int)
STORED BY 'com.microsoft.hadoop.azure.hive.AzureTableHiveStorageHandler'
TBLPROPERTIES(
""azure.table.name""=""$($t.tableName)"",
""azure.table.account.uri""=""http://$storageAccount.table.core.windows.net"",
""azure.table.storage.key""=""$((Get-AzureStorageKey $storageAccount).Primary)"");"

    # Write the query to a local file, upload it to the cluster's container,
    # then submit it as an HDInsight Hive job and wait for it to finish.
    Out-File -FilePath .\HiveCreateTable.q -InputObject $hiveQuery -Encoding ascii
    $hiveQueryBlob = Set-AzureStorageBlobContent -File .\HiveCreateTable.q -Blob "queries/HiveCreateTable.q" `
        -Container $clusterContainer.Name -Force
    $createTableJobDefinition = New-AzureHDInsightHiveJobDefinition -QueryFile /queries/HiveCreateTable.q
    $job = Start-AzureHDInsightJob -JobDefinition $createTableJobDefinition -Cluster $cluster.Name
    Wait-AzureHDInsightJob -Job $job

    # INSERT YOUR OPERATIONS FOR EACH TABLE HERE
}
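The loop above assumes $tableList, $storageAccount, $clusterContainer and $cluster already exist. As a hedged sketch using the same classic Azure/Azure.Storage cmdlets as the rest of the snippet, $tableList could be built roughly like this (the account name and the "*Signins" name filter are placeholders, not values from the question):
# Hedged sketch: enumerate the per-client tables so the loop above has something to iterate.
$storageAccount = "mystorageaccount"
$storageKey     = (Get-AzureStorageKey $storageAccount).Primary
$storageContext = New-AzureStorageContext -StorageAccountName $storageAccount -StorageAccountKey $storageKey

$tableList = Get-AzureStorageTable -Context $storageContext |
    Where-Object { $_.Name -like "*Signins" } |
    Select-Object @{ Name = "tableName"; Expression = { $_.Name } }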
Research:
http://blogs.msdn.com/b/mostlytrue/archive/2014/04/04/analyzing-azure-table-storage-data-with-hdinsight.aspx
How can manage Azure Table with Powershell?

In the end I opted for a couple of Azure Data Factory custom activities written in C#, and now my workflow is:
Custom activity: aggregate the data for the current slice into a single blob file for analysis in Pig.
HDInsight: Analyse with Pig
Custom activity: disperse the data to the array of target tables from blob storage to table storage.
I did this to keep the pipelines as simple as possible and remove the need for any duplication of pipelines/scripts.
References:
Use Custom Activities In Azure Data Factory pipeline
HttpDataDownloader Sample

Handling partitioned data in Azure?

I have some containers in ADLS (Gen2), each containing multiple folders. I would like a mechanism that scans those folders, infers their schema, detects partitions, and updates them in a data catalog. How do I achieve this functionality in Azure?
Sample:
- container1
---table1-folder
-----10-12-1970
-------files1.parquet
-------files2.parquet
-------files3.parquet
-----10-13-1970
-------files1.parquet
-------files2.parquet
-------files3.parquet
-----10-14-1970
-------files1.parquet
-------files2.parquet
----table2-folder
-----zipcode1
-------files1.parquet
-------files2.parquet
-------files3.parquet
-----zipcode2
-------files1.parquet
-------files2.parquet
...
So, what I expect is that the catalog will create two tables (table1 & table2), where table1 will have date-based partitions (3 dates in this case) and the underlying data within that table. The same goes for table2, which will have two partitions (zipcodes) and their underlying data.
In the AWS world, I can run a Glue crawler that crawls these files, infers schemas and partitions, and populates the Glue data catalog; later I can query the data through Athena. What's the Azure equivalent approach to achieve something similar?
I would recommend looking at Azure Synapse Analytics Serverless SQL. You can create a view which consumes the folders and does partition elimination if you follow this approach:
-- If you do not have a Master Key on your DW you will need to create one
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<password>' ;
GO

CREATE DATABASE SCOPED CREDENTIAL msi_cred
WITH IDENTITY = 'Managed Service Identity' ;
GO

CREATE EXTERNAL DATA SOURCE ds_container1
WITH
    ( TYPE = HADOOP ,
      LOCATION = 'abfss://container1@mystorageaccount.dfs.core.windows.net' ,
      CREDENTIAL = msi_cred
    ) ;
GO

CREATE VIEW Table2
AS SELECT *, f.filepath(1) AS [zipcode]
FROM
    OPENROWSET(
        BULK 'table2-folder/*/*.parquet',
        DATA_SOURCE = 'ds_container1',
        FORMAT = 'PARQUET'
    ) AS f
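Once the view exists, any query that filters on the derived zipcode column only reads the matching folders. As a hedged sketch of running such a query from PowerShell, assuming the SqlServer module (with access-token support) and a placeholder workspace/database name not taken from the question:
# Hedged sketch: query the serverless SQL view with a partition-eliminating predicate.
# The workspace and database names are placeholders.
$token = (Get-AzAccessToken -ResourceUrl "https://database.windows.net").Token
Invoke-Sqlcmd -ServerInstance "myworkspace-ondemand.sql.azuresynapse.net" `
    -Database "mydb" `
    -AccessToken $token `
    -Query "SELECT TOP 10 * FROM dbo.Table2 WHERE zipcode = 'zipcode1';"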
Then set up Azure Purview as your data catalog and have it index your Synapse serverless SQL pool.

Creating Topic Filter rule via CorrelationFilter with Azure Functions App

I want to create a filter rule via CorrelationFilter for the subscriptions associated with a topic, as it is faster than a SqlFilter.
The rule: any message that contains a header equal to one string will go to one subscription, and another string will go to a different subscription. For example:
Topic: order
Subscription1: header_orderType: orderPlaced
Subscription2: header_orderType: orderPaid
Similar to the one highlighted in blue below via Service Bus Explorer.
Below are other ways that can achieve that.
SQLFilter in code
https://dzone.com/articles/everything-you-need-know-about-5
SQLFilter
https://github.com/Azure/azure-service-bus/tree/master/samples/DotNet/Microsoft.Azure.ServiceBus/TopicFilters
PS
https://learn.microsoft.com/en-us/powershell/module/azurerm.servicebus/New-AzureRmServiceBusRule?view=azurermps-6.13.0
The TopicFilters sample covers correlation filters too, which are set up using an ARM template. The same should be possible in C# and PowerShell as well.
C#
You will have to first create a Microsoft.Azure.ServiceBus.CorrelationFilter object
var orderPlacedFilter = new CorrelationFilter();
orderPlacedFilter.Properties["header_orderType"] = "orderPlaced";
And then add it to your subscription client object by calling Microsoft.Azure.ServiceBus.SubscriptionClient.AddRuleAsync()
await subsClient.AddRuleAsync("orderPlacedFilter", orderPlacedFilter);
Similarly, for the other subscription and its filter.
PowerShell
I guess the documentation isn't really great on this one, but I believe this should work:
# Create the rule with a placeholder SQL filter first...
$rule = New-AzServiceBusRule -ResourceGroupName prvalav-common -Namespace prvalav-common `
    -Topic test -Subscription test -Name SBRule -SqlExpression "test = 0"

# ...then switch it to a correlation filter and set the custom property to match on.
$rule.FilterType = 1
$rule.SqlFilter = $null
$rule.CorrelationFilter.Properties["header_orderType"] = "orderPlaced"

Set-AzServiceBusRule -ResourceGroupName prvalav-common -Namespace prvalav-common `
    -Topic test -Subscription test -Name SBRule -InputObject $rule
If you were wondering about the FilterType = 1, check the FilterType enum.
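If you want to sanity-check the result, the rule can be read back with Get-AzServiceBusRule. This is a hedged sketch using the same resource names and the same older Az.ServiceBus parameter set as the commands above:
# Hedged check: read the rule back and confirm it now reports a correlation filter.
$rule = Get-AzServiceBusRule -ResourceGroupName prvalav-common -Namespace prvalav-common `
    -Topic test -Subscription test -Name SBRule
$rule.FilterType
$rule.CorrelationFilter.Properties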
After setting this up, in your function app, you would just use the Service Bus Trigger with the topic/subscription details.

Accessing pipeline activity status from an Azure Function

I have an Azure Function which triggers a Pipeline and I'm able to poll the pipeline status to check when it completes using: Pipeline.Properties.RuntimeInfo.PipelineState
My pipeline uses several parallel Copy activities, and I'd like to be able to access the status of these activities in case they fail. The Azure documentation describes how to access the pipeline activities, but you can only get at static properties like name and description, not dynamic properties like status (as you can for the pipeline via its RuntimeInfo property).
For completeness, I've accessed the activity list using:
IList<Microsoft.Azure.Management.DataFactories.Models.Activity> activityList = plHandle.Pipeline.Properties.Activities;
Is it possible to check individual activity statuses programmatically?
It's certainly possible.
I use the ADF PowerShell cmdlets in the Azure module to monitor our data factories.
Maybe do something like the below for what you need with the Get-AzureRmDataFactoryActivityWindow command.
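The example below assumes $ADFName, $ResourceGroup and $Now are already in scope; a hedged setup with placeholder names might look like this:
# Hedged setup for the variables the example below assumes; names are placeholders.
$ResourceGroup = "MyResourceGroup"
$ADFName       = Get-AzureRmDataFactory -ResourceGroupName $ResourceGroup -Name "MyDataFactory"
$Now           = (Get-Date).ToUniversalTime().Date   # only look at windows starting today (UTC)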
Eg:
$ActivityWindows = Get-AzureRmDataFactoryActivityWindow `
-DataFactoryName $ADFName.DataFactoryName `
-ResourceGroupName $ResourceGroup `
| ? {$_.WindowStart -ge $Now} `
| SELECT ActivityName, ActivityType, WindowState, RunStart, InputDatasets, OutputDatasets `
| Sort-Object ActivityName
This gives you the activity-level details, including the status, which will be one of:
Ready
In Progress
Waiting
Failed
... I list them because they differ slightly from what you see in the portal blades.
The datasets are also arrays if you have multiple inputs and outputs for particular activities.
More ADF cmdlets available here: https://learn.microsoft.com/en-gb/powershell/module/azurerm.datafactories/?view=azurermps-3.8.0
Hope this helps
I've managed to resolve this by accessing the DataSliceRuns (i.e. activities) for the pipeline as follows:
var datasets = client.Datasets.ListAsync(<resourceGroupName>, <DataFactoryName>).Result;
foreach (var dataset in datasets.Datasets)
{
    // Check the activity statuses for the pipeline's activities.
    var datasliceRunlistResponse = client.DataSliceRuns.List(
        <resourceGroupName>,
        <dataFactoryName>,
        <DataSetName>,
        new DataSliceRunListParameters()
        {
            DataSliceStartTime = PipelineStartTime.ConvertToISO8601DateTimeString()
        });

    foreach (DataSliceRun run in datasliceRunlistResponse.DataSliceRuns)
    {
        // Do stuff...
    }
}

List directories in a container

How can I get a list of directories in my container?
I can use Get-AzureStorageBlob to get all the blobs and filter by distinct name prefix, but that might be slow with millions of blobs.
Is there a proper way of achieving this in PowerShell?
There's no concept of directories, only containers and blobs. A blob name may contain delimiters which look like directories, and can be filtered on.
If you choose to store millions of blobs in a container, then you'll be searching through millions of blob names, even with delimiter filtering, whether using PowerShell, the SDK, or direct REST calls.
As far as a "proper" way: there is no proper way. Only you can decide how you organize your containers and blobs, and where (or if) you choose to store metadata for more efficient searching (such as a database).
The other answer is correct that there is nothing out of the box, as there is no real concept of a folder, only blob names that contain a folder-like path.
Using a regex in PowerShell you can find the top-level folders. As mentioned, this may be slow if there are millions of items in your account, but for a small number it may work for you.
$context = New-AzureStorageContext -ConnectionString '[XXXXX]'
$containerName = '[XXXXX]'

# List every blob in the container (this is the potentially slow part).
$blobs = Get-AzureStorageBlob -Container $containerName -Context $context

$folders = New-Object System.Collections.Generic.List[System.Object]

foreach ($blob in $blobs)
{
    # Match blob names that sit exactly one level deep, e.g. "folder/file.txt".
    if ($blob.Name -match '^[^\/]*\/[^\/]*$')
    {
        # Take everything before the first "/" as the top-level folder name.
        $folder = $blob.Name.Substring(0, $blob.Name.IndexOf("/"))
        if (!$folders.Contains($folder))
        {
            $folders.Add($folder)
        }
    }
}

foreach ($folder in $folders)
{
    Write-Host $folder
}

What is the azure table storage query equivalent of T-sql's LIKE command?

I'm querying Azure table storage using the Azure Storage Explorer. I want to find all messages that contain the given text, like this in T-SQL:
message like '%SysFn%'
Executing the T-SQL gives "An error occurred while processing this request"
What is the equivalent of this query in Azure?
There's no direct equivalent, as there is no wildcard searching. All supported operations are listed here. You'll see eq, gt, ge, lt, le, etc. You could make use of these, perhaps, to look for specific ranges.
Depending on your partitioning scheme, you may be able to select a subset of entities based on specific partition key, and then scan through each entity, examining message to find the specific ones you need (basically a partial partition scan).
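A hedged sketch of that partial partition scan using the AzTable PowerShell module, where $CloudTable is a Microsoft.Azure.Cosmos.Table.CloudTable object and the partition key and property name ("ClientA", "message") are assumptions for illustration:
# Hedged sketch: pull one partition server-side, then filter client-side for the substring,
# since the Table service has no "contains"/LIKE operator.
$rows = Get-AzTableRow -Table $CloudTable -PartitionKey "ClientA"
$hits = $rows | Where-Object { $_.message -like "*SysFn*" }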
While an advanced wildcard search isn't strictly possible in Azure Table Storage, you can use a combination of the "ge" and "lt" operators to achieve a "prefix" search. This process is explained in a blog post by Scott Helme here.
Essentially this method uses ASCII incrementing to query Azure Table Storage for any rows whose property begins with a certain string of text. I've written a small PowerShell function that generates the custom filter needed to do a prefix search.
Function Get-AzTableWildcardFilter {
    param (
        [Parameter(Mandatory=$true)]
        [string]$FilterProperty,

        [Parameter(Mandatory=$true)]
        [string]$FilterText
    )
    Begin {}
    Process {
        # Increment the last character of the search text to get the exclusive upper bound,
        # e.g. 'foo' -> 'fop'.
        $SearchArray = ([char[]]$FilterText)
        $SearchArray[-1] = [char](([int]$SearchArray[-1]) + 1)
        $SearchString = ($SearchArray -join '')
    }
    End {
        Write-Output "($($FilterProperty) ge '$($FilterText)') and ($($FilterProperty) lt '$($SearchString)')"
    }
}
You could then use this function with Get-AzTableRow like this (where $CloudTable is your Microsoft.Azure.Cosmos.Table.CloudTable object):
Get-AzTableRow -Table $CloudTable -CustomFilter (Get-AzTableWildcardFilter -FilterProperty 'RowKey' -FilterText 'foo')
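With -FilterText 'foo', for example, the function emits (RowKey ge 'foo') and (RowKey lt 'fop'), which matches every RowKey that starts with "foo".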
Another option would be to export the logs from Azure Table Storage to CSV. Once you have the CSV you can open it in Excel or any other app and search for the text.
You can export table storage data using TableXplorer (http://clumsyleaf.com/products/tablexplorer). In it there is an option to export the filtered data to CSV.
