Azure Data Factory copy activity very slow - azure

Summarize the problem
I've been seeing particularly slow performance out of Azure Data Factory. Searching for similar questions on Stack Overflow turns up nothing except the advice to contact support.
I'm rolling the dice here to see if anyone has seen something similar and knows how to fix it.
In short, every operation I try in ADF results in excruciatingly slow performance.
This includes:
Extracting a zip in blob storage to blob storage
Copying a number of small compressed files into Azure Data Explorer
Copying a number of small uncompressed json files into Azure Data Explorer
(Screenshots: extracting ZIP, copying to ADX.)
In both cases the performance is in the kilobytes per second range.
In both cases the copy/import will eventually work but it can take hours.
Describe what you've tried
I've tried:
using different regions
creating and using my own Integration Runtime
playing with different parameters that could potentially affect performance such as parallel connections etc.
Contacting Microsoft support (who sent me here)
Show some code
Not really any code to share. To reproduce, just try extracting a zip to and from blob storage. I get ~400 KB/s.
In summary, any advice would be gratefully received. If I can't get this bit working I'll have to implement the ingestion pipeline manually, which on reflection sounds like more fun than I've been having with ADF.

These 'deep' folders will affect copy speed. You should minimize the folder depth and increase the parallelism of the copy activity. You can reference this document to troubleshoot copy activity performance, or you can send feedback to Microsoft Azure.

Related

Copy millions of files from the root of an Azure Storage blob container to subfolders

I've got multiple Azure Storage blob containers, each with over 1M JSON files in the root. Impossible to work with (no shocker), so I'm trying to use Data Factory to move them into multiple folders, using a timestamp inside each file to create a YYYY-MM-DD/HH folder setup as a partition scheme. But every approach I've tried fails with timeouts or too-many-items limits. I need to open each file, get the timestamp, and use it to move the file to a dynamic path built from that timestamp. Ideas?
UPDATE: I was able to get around this, but I wouldn't call it an "answer", so I'll just update the question. To create smaller collections, I parameterized the pipeline to accept a file-name wildcard. I then created another pipeline that iterates over an array of 0-9, a-z and passes each value as a parameter to the dataset. It's a brute-force workaround... I assume there's got to be a better solution, but this works for now.
Read doc: Move data to and from Azure Blob storage
The following articles describe how to move data to and from Azure Blob storage using different technologies.
Azure Storage Explorer
AzCopy
SDKs (.NET, Java, Node.js, Python, Go, PHP, Ruby)
SSIS
In your case, I would suggest using the SDK, which is available for .NET, Java, Node.js, Python, Go, PHP, and Ruby.
Believe me, if you want to migrate your data out of Azure Blob storage, Data Factory is not a good way to do it; it makes the problem more complicated.
(This is my suggestion after migrating over 100 million JSON files, more than 2 TB, out of Azure Blob storage.)
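For illustration, a minimal sketch of the SDK approach in Python with the azure-storage-blob package (the connection string, container name, and the "timestamp" field inside each JSON file are assumptions for this example):

```python
import json
import time
from azure.storage.blob import BlobServiceClient

# Assumed connection details -- substitute your own.
CONN_STR = "<storage-connection-string>"
CONTAINER = "<container-name>"

service = BlobServiceClient.from_connection_string(CONN_STR)
container = service.get_container_client(CONTAINER)

# Walk the blobs sitting in the container root and move each one into a
# YYYY-MM-DD/HH "folder" derived from a timestamp field inside the file.
for blob in container.list_blobs():
    if "/" in blob.name:
        continue  # already moved into a partition folder

    source = container.get_blob_client(blob.name)
    doc = json.loads(source.download_blob().readall())

    # Assumes each JSON document carries an ISO-8601 "timestamp" field.
    ts = doc["timestamp"]                      # e.g. "2020-05-01T13:45:00Z"
    target = container.get_blob_client(f"{ts[:10]}/{ts[11:13]}/{blob.name}")

    # Server-side copy, then delete the original (i.e. a "move").
    target.start_copy_from_url(source.url)
    # Within the same account the copy is normally immediate, but poll to be safe.
    while target.get_blob_properties().copy.status == "pending":
        time.sleep(1)
    source.delete_blob()
```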
If you have time... I would do the following:
Create an Azure Function that reads the file, gets your timestamp, and does your move operation; scope the function to a single file. Then use Event Grid events on the storage account to trigger the function whenever a blob is created, so you know any new file will be moved to the right spot. (Remember the consumption plan includes a million executions before Functions starts billing, so this is a low-cost option.)
For the current files, create another function (or, if you want more control, use a Logic App, though the cost will be a bit higher) that runs a simple for-each loop with limits and calls your first function; set the parallelism on the function or Logic App to a low value so you can keep an eye on your executions. This will slowly move your files out of that container, eventually getting you to a reasonable item count to work with in tools like ADF. It may well solve your problem for the long run: any new files will be categorized accordingly, and your backlog is slowly moved as required. If you need to update a DB with a pointer to where each file lives, you could put that piece of code in your function or Logic App as well. Just my two cents :)
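For concreteness, a minimal sketch of what the Event Grid-triggered function might look like in Python (the connection string, container name, and the JSON "timestamp" field are assumptions; error handling and retries are omitted):

```python
import json
import azure.functions as func
from azure.storage.blob import BlobServiceClient

# Assumed connection details -- substitute your own.
CONN_STR = "<storage-connection-string>"
CONTAINER = "<container-name>"


def main(event: func.EventGridEvent):
    # The Blob Created event carries the blob URL; take the part after the container.
    blob_url = event.get_json()["url"]
    blob_name = blob_url.split(f"/{CONTAINER}/", 1)[1]
    if "/" in blob_name:
        return  # already in a partition folder (e.g. created by our own copy)

    container = BlobServiceClient.from_connection_string(CONN_STR) \
        .get_container_client(CONTAINER)
    source = container.get_blob_client(blob_name)

    # Assumes each JSON document carries an ISO-8601 "timestamp" field.
    ts = json.loads(source.download_blob().readall())["timestamp"]
    target = container.get_blob_client(f"{ts[:10]}/{ts[11:13]}/{blob_name}")

    target.start_copy_from_url(source.url)  # server-side copy within the account
    source.delete_blob()
```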
It is not clear whether you are using the hierarchical folder structure provided by Azure Data Lake Storage Gen2; Gen1 only simulates a folder structure, and it is not optimal.
There are several advantages of ADLS Gen2 that should help in your case, mainly related to move operations.
To migrate from ADLS Gen1 to ADLS Gen2, have a look here.
Additionally, you may explore optimizations for your specific case in the paper linked here.

SAP BW to Power BI data loading taking a huge amount of time?

We have a cube that contains 1.6 years of data, and it is taking a long time to load. Previously we got a memory error, but we have since increased the SAP memory size. Can anyone suggest ways to troubleshoot this, or any best practices we can follow?
We are currently pulling 30-35 combinations of dimensions and characteristics and it is still taking a lot of time, and we don't have that amount of time to wait for the error and then act on it.
It is an internal MDX limitation that you have to live with. To mitigate it, you will have to use filters or variables to restrict the returned volume. If you don't mind moving the data out of SAP onto Azure storage first, you will get a much better user experience by pointing Power BI at Azure DW, Azure SQL Database, or even Blob storage. Otherwise, you will be stuck with the SAP bottleneck.
Because Power BI and ADF share the same underlying engine for accessing SAP BW, you can check out our blog for a comparison and further explanation in the context of ADF and BW integration:

Archive tables in azure

I have a table storage in Azure wherein one of the tables is growing rapidly, and I need to archive any data older than 90 days. The only solution I could find online is the eventually consistent transactions pattern: https://azure.microsoft.com/en-us/documentation/articles/storage-table-design-guide/. Although that document uses an employee table as its example and could help me achieve my objective, the intention of this question is to find out whether there is a better solution.
Please note I am very new to Azure, so I might be missing a very easy way to achieve this.
Regards, Tarun
The guide you are referencing is a good source. Note that if you are using Tables for logging data (archiving older data is a common requirement there), then you might want to look at blob storage instead; see the log data pattern in the guide you reference. By the way, AzCopy can also be used to export the data either to a blob or to the local file system. See here for more information: https://azure.microsoft.com/en-us/documentation/articles/storage-use-azcopy/#copy-entities-in-an-azure-table-with-azcopy-preview-version-only.
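If you do end up writing your own archiving job, a minimal sketch in Python with the azure-data-tables and azure-storage-blob packages might look like this (the table name, container name, and connection string are assumptions; a real job would batch the deletes and handle failures):

```python
import json
from datetime import datetime, timedelta, timezone

from azure.data.tables import TableServiceClient
from azure.storage.blob import BlobServiceClient

# Assumed connection details -- substitute your own.
CONN_STR = "<storage-connection-string>"
TABLE = "<table-name>"
ARCHIVE_CONTAINER = "<archive-container>"

cutoff = datetime.now(timezone.utc) - timedelta(days=90)

table = TableServiceClient.from_connection_string(CONN_STR).get_table_client(TABLE)
blobs = BlobServiceClient.from_connection_string(CONN_STR) \
    .get_container_client(ARCHIVE_CONTAINER)

# Pull everything older than 90 days using the automatic Timestamp property.
old = table.query_entities(
    f"Timestamp lt datetime'{cutoff.strftime('%Y-%m-%dT%H:%M:%SZ')}'"
)
archived = [dict(e) for e in old]

# Dump the old rows to a blob, then remove them from the table.
blobs.upload_blob(
    name=f"archive-{cutoff.date()}.json",
    data=json.dumps(archived, default=str),
    overwrite=True,
)
for entity in archived:
    table.delete_entity(
        partition_key=entity["PartitionKey"], row_key=entity["RowKey"]
    )
```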

Azure: SQL Compact possible?

I have a RESTful service running on azure. Currently, it has zero persistence. (It is just a REST gateway to another api.) I run it in a single, minimal Azure instance, and expect this will handle all the load this will ever get.
I now need to add some very lightweight persistence to it. A simple table, of 40-200 rows, eight data columns. The data is very static.
Doing the whole SQL Azure thing seems big overkill for my needs.
My thoughts have been to use:
An XML file, loaded into memory as the DB; the XML file is deployed with the code.
Some better way to deploy the XML file, so it can be rolled out/updated more easily.
SQL Compact (can I do this on Azure?)
___ ?
What is the right path here?
Thank you!
SQL Server Compact would need to store its data somewhere persistent, so you would need to sync it regularly to persistent storage. That is a lot of extra work, and I have no idea how to do it reliably, so it's likely not a very good idea.
For your simple table, Azure Table Storage might be just enough. If that's not enough, then SQL Azure is the next choice.
You can use the XML file as your store; there is no harm in it, and it is a very easy and cost-efficient solution, but there is a catch. As you mentioned, you are currently using only one Azure instance, in which case you can store the XML file in App_Data. But if in the future you want to shift to two Azure instances, you will have to replicate the App_Data folder; in other words, you will need to keep the App_Data folders in sync.
Suggestion
Instead of storing the file in App_Data, store it in a BLOB; you can retrieve it using WebClient and then store it in memory.
Pros: The advantage of BLOB is that you don't have to sync it.
Cons: There is a cost associated with the number of transactions you make; this will depend on how many times you update the file.
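As a minimal sketch of that load-from-blob-into-memory idea (the answer mentions WebClient in .NET; this version assumes Python with the azure-storage-blob package, and the container and blob names are placeholders):

```python
import xml.etree.ElementTree as ET
from azure.storage.blob import BlobServiceClient

# Assumed connection details -- substitute your own.
CONN_STR = "<storage-connection-string>"
CONTAINER = "<config-container>"
BLOB_NAME = "lookup-table.xml"  # hypothetical name for the small static table

# Download the XML once at startup and keep the parsed tree in memory.
blob = BlobServiceClient.from_connection_string(CONN_STR) \
    .get_blob_client(container=CONTAINER, blob=BLOB_NAME)
root = ET.fromstring(blob.download_blob().readall())

# Example: read <row .../> elements into a list of dicts for in-memory lookups.
rows = [dict(row.attrib) for row in root.findall("row")]
```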
Summary
If you are going to work with only one Azure instance, use App_Data.
For more than one Azure instance, use BLOB with no syncing, or use App_Data with sync.
Do not use Azure Table, as BLOB is the store designed for exactly this purpose.
EDIT
From an MSDN post:
As far as I know, Windows Azure does not support SQL Compact Edition. SQL Compact Edition stores data in the file system, which will not be synchronized across multiple instances (a web role may be deployed to more than one instance; an instance is similar to a virtual machine). Files stored in the file system will also be lost when the instance is restarted or reimaged.
Hope this helps you.

Azure table storage with large entity sizes

A couple of questions that I can't find any answers to. Hope someone can help:
I will be using entity sizes of almost 1 MB. I can't find any information on read latency for these large entity sizes. Does anyone out there have any information on this?
Is there any way to determine how much space is used by a row in Azure Table storage? Is there any API for this?
Thanks
If most of your entities are large, then it somewhat defeats the purpose of Table Storage in the first place. Indeed, you will only be able to retrieve or update them four at a time at most, as entity group transactions are limited to 4 MB.
You can find many useful measurements concerning Table Storage, as well as the other Azure storage services, in the AzureScope project.
Then, if you want to accurately check the weight of your rows, just use Fiddler to intercept your web requests and look directly at the XML being produced.
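If you would rather estimate the size without intercepting traffic, Microsoft has published a rough formula for the billed size of an entity; a small Python helper based on that formula might look like this (treat the result as an approximation, not an exact accounting):

```python
from datetime import datetime


def estimated_entity_size(partition_key: str, row_key: str, properties: dict) -> int:
    """Rough size in bytes of a table entity, based on the published formula:
    4 + len(PartitionKey + RowKey) * 2 + sum over properties of
    (8 + len(name) * 2 + size of the value)."""
    size = 4 + (len(partition_key) + len(row_key)) * 2
    for name, value in properties.items():
        size += 8 + len(name) * 2
        if isinstance(value, str):
            size += len(value) * 2 + 4
        elif isinstance(value, bytes):
            size += len(value) + 4
        elif isinstance(value, bool):   # check bool before int (bool is an int in Python)
            size += 1
        elif isinstance(value, (float, datetime)):
            size += 8
        elif isinstance(value, int):
            size += 8                   # assume Int64; Int32 would be 4
        else:
            raise TypeError(f"unsupported property type: {type(value)}")
    return size


# Example: an entity approaching 1 MB with a single large string property.
print(estimated_entity_size("pk", "rk", {"Payload": "x" * 500_000}))
```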