high latency when reading/writing using azure-kusto-spark - apache-spark

I'm writing data to Kusto using azure-kusto-spark. The write involves very little data, and the spark-kusto connector uses batched streaming, yet the write has high latency, taking around 8 minutes.
From the logs, the latency is in the staging ingestion part: a temporary staging table is created and data is ingested into it via multiple jobs. Only when the staging ingestion completes is the data finally merged and written to the Kusto table.
Can anyone give pointers on why such a small write has so much latency?
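For context, the write looks roughly like the following minimal PySpark sketch, assuming `df` is the small DataFrame being written. The cluster, database, table, and AAD application values are placeholders, and the option names follow the connector's KustoSink.md documentation (they may differ slightly between connector versions):

```python
# Minimal sketch of a small batch write with the azure-kusto-spark connector.
# All <...> values are placeholders; option names follow the connector's
# KustoSink.md docs and may vary slightly between connector versions.
df.write \
    .format("com.microsoft.kusto.spark.datasource") \
    .option("kustoCluster", "<cluster-name>.<region>") \
    .option("kustoDatabase", "<database>") \
    .option("kustoTable", "<table>") \
    .option("kustoAadAppId", "<aad-app-id>") \
    .option("kustoAadAppSecret", "<aad-app-secret>") \
    .option("kustoAadAuthorityID", "<aad-tenant-id>") \
    .mode("Append") \
    .save()
```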

It might be because of your Kusto table's ingestion batching policy, as described here: https://github.com/Azure/azure-kusto-spark/blob/master/docs/KustoSink.md#performance-considerations
You might want to modify the default ingestion batching policy and set the batching time to less than the default value, which is 5 minutes. See this doc for reference: https://learn.microsoft.com/en-us/azure/data-explorer/kusto/management/batchingpolicy#defaults-and-limits
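For example, a shorter batching window can be applied to the target table with an `.alter table ... policy ingestionbatching` management command. A minimal sketch using the azure-kusto-data Python client (the connection details and table name are placeholders, and the 30-second window plus item/size limits are illustrative values, not recommendations):

```python
from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

# Placeholder connection details for illustration only.
kcsb = KustoConnectionStringBuilder.with_aad_application_key_authentication(
    "https://<cluster>.<region>.kusto.windows.net",
    "<aad-app-id>", "<aad-app-secret>", "<aad-tenant-id>",
)
client = KustoClient(kcsb)

# Lower the maximum batching time from the default 5 minutes to 30 seconds.
# The item-count and size limits below are illustrative, not recommendations.
command = (
    ".alter table <table> policy ingestionbatching "
    "'{\"MaximumBatchingTimeSpan\": \"00:00:30\", "
    "\"MaximumNumberOfItems\": 500, \"MaximumRawDataSizeMB\": 1024}'"
)
client.execute_mgmt("<database>", command)
```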

Related

Azure Data factory and Data flow taking too much time to process data from staging to Database

I have a data factory which runs every day. It selects around 80M records from an on-premises Oracle database and moves them to a parquet file, which takes around 2 hours. I want to speed up this process, and also the data flow process which inserts and updates data in the database.
[Screenshot: parquet file settings]
The next step, from the parquet file, calls a data flow which moves the data as an upsert to the database, but this also takes too much time.
[Screenshot: data flow settings]
Let me know which compute type I should use for the data flow:
Memory Optimized
Compute Optimized
General Purpose
After the round robin update: [Screenshot: sink time]
Can you open the monitoring detailed execution plan for the data flow? Click on each stage in your data flow and look to see where the bulk of the time is being spent. You should see at the top of the view how much time was spent setting up the compute environment, how much time was taken to read your source, and also check the total write time on your sinks.
I have some examples of how to view and optimize this here.
Well, I would surmise that 45 min to stuff 85M rows into a SQL DB is not horrible. You can break the task down into chunks and see what's taking the longest to complete. Do you have access to Databricks? I do a lot of pre-processing with Databricks, and I have found Spark to be super-super-fast!! If you can pre-process in Databricks and push everything into your SQL world, you may have an optimal solution there.
As per the documentation (https://learn.microsoft.com/en-us/azure/data-factory/concepts-data-flow-performance#partitioning-on-sink), can you try modifying your partition settings under the Optimize tab of your Sink?
I faced a similar issue with the default partitioning setting, where the data load was taking close to 30+ minutes for just 1M records; after changing the partition strategy to round robin and setting the number of partitions to 5 (in my case), the load happens in less than a minute.
Try experimenting with both the Source partition settings (https://learn.microsoft.com/en-us/azure/data-factory/concepts-data-flow-performance#partitioning-on-source) and the Sink partition settings to come up with the optimum strategy. That should improve the data load time.

Data Factory V2 copy Data Activities and Data flow ETL

I have 5 questions about the Data Factory V2 copy-data activity.
Question 1
Should I use a parquet file or SQL Server with 500 DTU? I want to transfer the data quickly to a staging table or staging parquet file.
Question 2
For the copy data activity's data integration units, should I use Auto or 32?
Question 3
What is the benefit of the degree of copy parallelism? Should I use Auto or 32? Again, I want to transfer everything as quickly as possible; I have around 50 million rows every day.
Question 4
For the Data Flow integration runtime, should I use General Purpose, Compute Optimized, or Memory Optimized? As mentioned, we have 50 million rows every day, so we want to process the data as quickly as possible, and as cheaply as we can, in Data Flow.
Question 5
Is a bulk insert better in the Data Factory and Data Flow sink?
I think you have too many questions about too many topics, the answers to which will depend entirely on your desired end result. Even so, I will do my best to briefly address your situation.
If you are dealing with large volume and/or frequency, Data Flow (ADFDF) would probably be better than the Copy activity. ADFDF runs on Spark via Databricks and is built from the ground up to run parallel workloads. Parquet is also built to support parallel workloads. If your SQL is an Azure Synapse (SQLDW) instance, then ADFDF will use Polybase to manage the upload, which is very fast because it is also built for parallel workloads. I'm not sure how this differs for Azure SQL, and there is no way to tell you what DTU level will work best for your task.
If having Parquet as your end result is acceptable, then that would probably be the easiest and least expensive to configure, since it is just blob storage. ADFDF works just fine with Parquet, as either Source or Sink. For ETL workloads, Compute Optimized is the most likely IR configuration. The good news is it is the least expensive of the three; the bad news is I have no way to know what the core count should be, so you'll just have to find out through trial and error. 50 million rows may sound like a lot, but it really depends on the row size (byte count and column count) and frequency. If the process runs many times a day, you can include a "Time to live" value in the IR configuration. This keeps the cluster warm while it waits for another job, potentially reducing startup time (but incurring more run time cost).

DWU for Azure SQL DWH

I'm trying to provision an Azure SQL DWH, mostly Gen 2 type, but I'm not sure about the DWU that I need to set.
After analyzing the source systems, on average the DWH can expect nearly 1.5 million records per day, which will be inserted/updated into different sets of tables.
Given that number of records, is it possible to ascertain the DWU that needs to be set at the DWH level?
Please advise.
While the volume of inserts is a useful number, it is more important to know the volume of data that will be frequently queried. We call this the active data set.
Let's say that you have 10 TB of data. If most of your queries address that whole 10 TB, then your active data set is 10 TB. However, if most of your queries only deal with 10% of your data, your active data set is 1 TB.
Some general guideline examples for DWUc by active data set:
1 TB: 500c
3 TB: 1000c
10 TB: 3000c
That said, in my experience the 1 TB/500c recommendation is a little small. That is because you're still working on a single node at less than 1000c, and your number of concurrent queries is limited to 20, with 20 concurrency slots. I like to see customers start at 1000c, and only use a lesser DWU for dev/test, or during very quiet periods when the DW can't otherwise be paused.

How to speed up copy from Azure Data Lake to Cosmos DB

I'm using Azure Data Factory to copy data from Azure Data Lake Store to a collection in Cosmos DB. We will have a few thousand JSON files in the data lake, and each JSON file is approx. 3 GB. I'm using the data factory's copy activity, and in the initial run one file took 3.5 hours to load, with the collection set to 10,000 RU/s and data factory using default settings. Now I've scaled it up to 50,000 RU/s, set cloudDataMovementUnits to 32 and writeBatchSize to 10 to see if it improved the speed, and the same file now takes 2.5 hours to load. At this rate, loading thousands of files will still take far too long.
Is there some way to do this in a better way?
You say you are inserting "millions" of JSON documents per 3 GB batch file. Such lack of precision is not helpful when asking this type of question.
Let's run the numbers for 10 million docs per file.
This indicates 300 bytes per json doc which implies quite a lot of fields per doc to index on each CosmosDb insert.
If each insert costs 10 RUs then at your budgeted 10,000 RU per sec the doc insert rate would be 1000 x 3600 (seconds per hour) = 3.6 million doc inserts per hour.
So your observation of 3.5 hours to insert 3 GB of data, representing an assumed 10 million docs, is highly consistent with your purchased CosmosDb throughput.
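Spelling out that arithmetic (under the same assumptions of 10 million docs per 3 GB file and roughly 10 RU per insert):

```python
# Back-of-the-envelope check of the RU arithmetic above, using the assumed figures.
file_size_bytes = 3 * 1024**3              # ~3 GB per JSON file
docs_per_file = 10_000_000                 # assumed 10 million docs per file
bytes_per_doc = file_size_bytes / docs_per_file        # ~322 bytes, i.e. roughly 300 per doc

ru_per_insert = 10                         # assumed cost of one indexed insert
provisioned_ru_per_sec = 10_000            # the collection's original throughput
inserts_per_sec = provisioned_ru_per_sec / ru_per_insert    # 1,000 docs/sec
inserts_per_hour = inserts_per_sec * 3600                   # 3.6 million docs/hour

hours_per_file = docs_per_file / inserts_per_hour
print(f"~{hours_per_file:.1f} hours per file")              # ~2.8 hours, close to the observed 3.5
```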
This document https://learn.microsoft.com/en-us/azure/data-factory/data-factory-copy-activity-performance illustrates that the DataLake to CosmosDb Cloud Sink performs poorly compared to other options. I guess the poor performance can be attributed to the default index-everything policy of CosmosDb.
Does your application need everything indexed? Does the CosmosDb Cloud Sink utilise less strict eventual consistency when performing bulk inserts?
You ask, is there a better way? The performance table in the linked MS document shows that Data Lake to Polybase Azure Data Warehouse is 20,000 times more performant.
One final thought. Does the increased concurrency of your second test trigger CosmosDb throttling? The MS performance doc warns about monitoring for these events.
The bottom line is that trying to copy millions of JSON files will take time. If the data were organized into GB-sized batches you could get away with shorter batch transfers, but not with millions of separate files.
I don't know if you plan on transferring this type of file from the Data Lake often, but a good strategy could be to write an application dedicated to doing that. Using the Microsoft.Azure.DocumentDB client library you can easily create a C# web app that manages your transfers.
This way you can automate those transfers, throttle them, schedule them, etc. You can also host the app on a VM or an App Service and never really have to think about it.
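The answer suggests C# with the Microsoft.Azure.DocumentDB client library; an analogous sketch in Python using the azure-cosmos SDK (the account endpoint, key, database, container, and line-delimited file layout are all assumptions for illustration) might look like:

```python
import json
from pathlib import Path

from azure.cosmos import CosmosClient

# Placeholder connection details for illustration only.
client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
container = client.get_database_client("<database>").get_container_client("<collection>")

# Upsert each document from a locally staged batch file
# (one JSON document per line is an assumption about the file layout).
with Path("batch.json").open() as f:
    for line in f:
        container.upsert_item(json.loads(line))
```

Wrapping this kind of loop in a scheduled app gives you control over throttling and retries that the copy activity's defaults don't expose.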

What would be the proper way to tune Apache Spark for responsive web applications?

I have previously used Apache Spark for streaming applications where it does a wonderful job for ETL pipelines and predictions using Machine Learning.
However, Spark for EDA may not be as fast as one may want. For example, if you do basic mathematical operations on data coming from Postgres or ElasticSearch using data frames in Spark, the time it takes to fetch the data from the host system and do the analysis is much higher than the time the equivalent SQL query takes to run on Postgres.
Even simple aggregations such as sum, average, and count can be done much faster with SQL than on top of Spark-SQL.
From what I understand, this is not because of latency in fetching the data from the host system. If you call the show method on a data frame, you can quickly get the top rows of the data set. However, if you limit the response in SQL and then call collect, the time taken is huge.
This means that the data is there, but the processing done while calling collect is what takes the time.
Regardless of the data source (CSV file, JSON file, ElasticSearch, Parquet, etc.), the behavior remains the same.
What is the reason for this latency on collect and is there any way to reduce it to the point where it can work with responsive applications to make real-time or near real-time queries?
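A small PySpark sketch of the difference being described (the parquet path and column names are hypothetical):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.read.parquet("/tmp/events.parquet")   # hypothetical dataset

# show() only has to produce the first 20 rows, so Spark reads just enough
# partitions to satisfy it and returns quickly.
df.show()

# collect() on an aggregation forces a full scan of every partition, a shuffle
# to compute the groups, and a transfer of the whole result to the driver,
# which is where most of the perceived latency comes from.
totals = df.groupBy("country").agg(F.sum("amount").alias("total")).collect()
```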
